{ "cells": [ { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "# Classifying PDF Documents with AutoMM\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/document_prediction/pdf_classification.ipynb)\n", "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/document_prediction/pdf_classification.ipynb)\n", "\n", "PDF comes short from Portable Document Format and is one of the most popular document formats.\n", "We can find PDFs everywhere, from personal resumes to business contracts, and from commercial brochures to government documents. \n", "The list can be endless. \n", "PDF is highly praised for its portability. \n", "There's no worry about the receiver being unable to view the document or see an imperfect version regardless of their operating system and device models.\n", "\n", "Using AutoMM, you can handle and build machine learning models on PDF documents just like working on other modalities such as text and images, without bothering about PDFs processing. \n", "In this tutorial, we will introduce how to classify PDF documents automatically with AutoMM using document foundation models. Let’s get started!\n", "\n", "For document processing, AutoGluon requires poppler to be installed. Check https://poppler.freedesktop.org for source \n", "\n", "https://github.com/oschwartz10612/poppler-windows for Windows release (make sure to add the bin/ folder to PATH after installing) \n", "\n", "`brew install poppler` for Mac" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Get the PDF document dataset\n", "We have created a simple PDFs dataset via manual crawling for demonstration purpose. \n", "It consists of two categories, resume and historical documents (downloaded from [milestone documents](https://www.archives.gov/milestone-documents/list)). \n", "We picked 20 PDF documents for each of the category. \n", "\n", "Now, let's download the dataset and split it into training and test sets." ] }, { "cell_type": "code", "execution_count": null, "id": "aa00faab-252f-44c9-b8f7-57131aa8251c", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "!pip install autogluon.multimodal\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import warnings\n", "warnings.filterwarnings('ignore')\n", "import os\n", "import pandas as pd\n", "from autogluon.core.utils.loaders import load_zip\n", "\n", "download_dir = './ag_automm_tutorial_pdf_classifier'\n", "zip_file = \"https://automl-mm-bench.s3.amazonaws.com/doc_classification/pdf_docs_small.zip\"\n", "load_zip.unzip(zip_file, unzip_dir=download_dir)\n", "\n", "dataset_path = os.path.join(download_dir, \"pdf_docs_small\")\n", "pdf_docs = pd.read_csv(f\"{dataset_path}/data.csv\")\n", "train_data = pdf_docs.sample(frac=0.8, random_state=200)\n", "test_data = pdf_docs.drop(train_data.index)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "Now, let's visualize one of the PDF documents. Here, we use the S3 URL of the PDF document and `IFrame` to show it in the tutorial." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from IPython.display import IFrame\n", "IFrame(\"https://automl-mm-bench.s3.amazonaws.com/doc_classification/historical_1.pdf\", width=400, height=500)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "As you can see, this document is an America's historical document in PDF format. \n", "To make sure the MultiModalPredictor can locate the documents correctly, we need to overwrite the document paths." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal.utils.misc import path_expander\n", "\n", "DOC_PATH_COL = \"doc_path\"\n", "\n", "train_data[DOC_PATH_COL] = train_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))\n", "test_data[DOC_PATH_COL] = test_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))\n", "print(test_data.head())" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Create a PDF Document Classifier\n", "\n", "You can create a PDFs classifier easily with `MultiModalPredictor`. \n", "All you need to do is to create a predictor and fit it with the above training dataset. \n", "AutoMM will handle all the details, like (1) detecting if it is PDF format datasets; (2) processing PDFs like converting it into a format that our model can recognize; (3) detecting and recognizing the text in PDF documents; etc., without your notice. \n", "\n", "Here, label is the name of the column that contains the target variable to predict, e.g., it is “label” in our example. \n", "We set the training time limit to 120 seconds for demonstration purposes." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal import MultiModalPredictor\n", "\n", "predictor = MultiModalPredictor(label=\"label\")\n", "predictor.fit(\n", " train_data=train_data,\n", " hyperparameters={\"model.document_transformer.checkpoint_name\":\"microsoft/layoutlm-base-uncased\",\n", " \"optimization.top_k_average_method\":\"best\",\n", " },\n", " time_limit=120,\n", ")" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate on Test Dataset\n", "\n", "You can evaluate the classifier on the test dataset to see how it performs:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "scores = predictor.evaluate(test_data, metrics=[\"accuracy\"])\n", "print('The test acc: %.3f' % scores[\"accuracy\"])" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Predict on a New PDF Document\n", "\n", "Given an example PDF document, we can easily use the final model to predict the label:\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "predictions = predictor.predict({DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]})\n", "print(f\"Ground-truth label: {test_data.iloc[0]['label']}, Prediction: {predictions}\")\n" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "If probabilities of all categories are needed, you can call predict_proba:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "proba = predictor.predict_proba({DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]})\n", "print(proba)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Extract Embeddings\n", "\n", "Extracting representation from the whole document learned by a model is also very useful. \n", "We provide extract_embedding function to allow predictor to return the N-dimensional document feature where N depends on the model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "feature = predictor.extract_embedding({DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]})\n", "print(feature[0].shape)" ] }, { "attachments": {}, "cell_type": "markdown", "metadata": {}, "source": [ "## Other Examples\n", "\n", "You may go to [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) to explore other examples about AutoMM.\n", "\n", "## Customization\n", "To learn how to customize AutoMM, please refer to [Customize AutoMM](../advanced_topics/customization.ipynb).\n" ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }