Categorizing Scanned Documents with AutoMM ========================================== Paper documents in an organization are a crucial source of information, regardless of industry. Dealing with paper documents is a headache because they can occupy a significant amount of space, can easily wear or fade with time, and are difficult to keep track of. As such, there is a growing trend to digitizing paper documents via scanners, cameras, etc. However, digitization does not necessarily bring automation, and identifying, categorizing, and analyzing digital documents can still be a labor-intensive process. For example, classifying digital books into different genres, and categorizing scanned receipts into *utilities*, *transportation*, *insurance*, *rent*, *supplies*, etc. are time-consuming and tiresome if done manually. With newer AI technologies, automating digital document processing becomes easier and more effective. It’s fair to say that AI has been the bedrock of modern digital document processing systems. In this tutorial, we show how you can build a scanned document classifier with Autogluon Multimodal using a few lines of code. Let’s get started! Get a Document Dataset ---------------------- Now let’s download a scanned document dataset. This dataset is a sample of `RVL-CDIP `__ which originally consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. Here, we sampled around 100 documents and three categories of document including budget (labelled as 0), email (labelled as 1), and form (labelled as 2). .. code:: python import warnings warnings.filterwarnings('ignore') import os import pandas as pd from autogluon.core.utils.loaders import load_zip download_dir = './ag_automm_tutorial_doc_classifier' zip_file = "https://automl-mm-bench.s3.amazonaws.com/doc_classification/rvl_cdip_sample.zip" load_zip.unzip(zip_file, unzip_dir=download_dir) .. parsed-literal:: :class: output Downloading ./ag_automm_tutorial_doc_classifier/file.zip from https://automl-mm-bench.s3.amazonaws.com/doc_classification/rvl_cdip_sample.zip... .. parsed-literal:: :class: output 100%|██████████| 7.95M/7.95M [00:00<00:00, 85.3MiB/s] We load the training and test data below. .. code:: python dataset_path = os.path.join(download_dir, "rvl_cdip_sample") rvl_cdip_data = pd.read_csv(f"{dataset_path}/rvl_cdip_train_data.csv") train_data = rvl_cdip_data.sample(frac=0.8, random_state=200) test_data = rvl_cdip_data.drop(train_data.index) We need to expand the document paths to load them in training. .. code:: python from autogluon.multimodal.utils.misc import path_expander DOC_PATH_COL = "doc_path" train_data[DOC_PATH_COL] = train_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir)) test_data[DOC_PATH_COL] = test_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir)) print(test_data.head()) .. parsed-literal:: :class: output The cache for model files in Transformers v4.22.0 has been updated. Migrating your old cache. This is a one-time only operation. You can interrupt this and resume the migration later on by calling `transformers.utils.move_cache()`. .. parsed-literal:: :class: output Moving 0 files to the new cache system .. parsed-literal:: :class: output 0it [00:00, ?it/s] .. parsed-literal:: :class: output doc_path label 1 /home/ci/autogluon/docs/_build/eval/tutorials/... 0 6 /home/ci/autogluon/docs/_build/eval/tutorials/... 0 7 /home/ci/autogluon/docs/_build/eval/tutorials/... 0 11 /home/ci/autogluon/docs/_build/eval/tutorials/... 0 14 /home/ci/autogluon/docs/_build/eval/tutorials/... 0 Let’s display one of the document. As you can see, this is a budget document consisting of account number, account name, budgeted fund, expenditures, and etc. .. code:: python from IPython.display import Image, display example_image = train_data.iloc[0][DOC_PATH_COL] pil_img = Image(filename=example_image, width=500) display(pil_img) .. figure:: output_document_classification_45a0fb_7_0.jpg :width: 500px Build a Scanned Document Classifier with AutoMM ----------------------------------------------- You can build a scanned document classifier with our MultiModalPredictor. All you need to do is to create a predictor and fit it with the above training dataset. Under the hood, AutoMM will automatically recognize handwritten or typed text, and make use of the recognized text, layout information, as well as the visual features for document classification. Model customization is also quite simple, you can specify the underline foundation model using the ``model.document_transformer.checkpoint_name`` hyperparameter and AutoMM support document foundation models such as `layoutlmv3 `__, `layoutlmv2 `__, `layoutlm-base `__, `layoutxlm `__, etc., as well as pure text models like `bert `__, `deberta `__, just to name a few. Here, ``label`` is the name of the column that contains the target variable to predict, e.g., it is “label” in our example. We set the training time limit to 120 seconds for demonstration purposes. .. code:: python from autogluon.multimodal import MultiModalPredictor predictor = MultiModalPredictor(label="label", verbosity=5) predictor.fit( train_data=train_data, hyperparameters={"model.document_transformer.checkpoint_name":"microsoft/layoutlm-base-uncased", "optimization.top_k_average_method":"best", }, time_limit=120, ) .. parsed-literal:: :class: output No path specified. Models will be saved in: "AutogluonModels/ag-20230222_232302/" save path: /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/document/AutogluonModels/ag-20230222_232302 Average length of words of this dataset is 1980.5. column_types: OrderedDict([('doc_path', 'document_image'), ('label', 'categorical')]) image columns: [] AutoMM starts to create your model. ✨ - Model will be saved to "/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/document/AutogluonModels/ag-20230222_232302". - Validation metric is "accuracy". - To track the learning progress, you can open a terminal and launch Tensorboard: ```shell # Assume you have installed tensorboard tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/document/AutogluonModels/ag-20230222_232302 ``` Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai overrides: {'model.document_transformer.checkpoint_name': 'microsoft/layoutlm-base-uncased', 'optimization.top_k_average_method': 'best'} Process col "doc_path" with type "document_image" Process col "label" with type label selected models: ['document_transformer'] model dtypes: ['document'] output_shape: 3 initializing microsoft/layoutlm-base-uncased outer layers are treated as head: ['head.weight', 'head.bias'] initializing microsoft/layoutlm-base-uncased Checkpoint: microsoft/layoutlm-base-uncased uses the text data only for classification. data_processors_count: {'label': 1, 'document': 1} validation_metric_name: val_accuracy minmax_mode: max applying layerwise learning rate decay... trainer.max_steps: -1 len(trainer.datamodule.train_dataloader()): 8 trainer.max_epochs: 10 trainer.accumulate_grad_batches: 16 max steps: 5 warmup steps: 0 lr_schedule: cosine_decay done configuring optimizer and scheduler Removed checkpoint: /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/document/AutogluonModels/ag-20230222_232302/epoch=0-step=1.ckpt Removed checkpoint: /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/document/AutogluonModels/ag-20230222_232302/epoch=1-step=1.ckpt AutoMM has created your model 🎉🎉🎉 - To load the model, use the code below: ```python from autogluon.multimodal import MultiModalPredictor predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/document/AutogluonModels/ag-20230222_232302") ``` - You can open a terminal and launch Tensorboard to visualize the training log: ```shell # Assume you have installed tensorboard tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/document/AutogluonModels/ag-20230222_232302 ``` - If you are not satisfied with the model, try to increase the training time, adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html), or post issues on GitHub: https://github.com/autogluon/autogluon .. parsed-literal:: :class: output Evaluate on Test Dataset ------------------------ You can evaluate the classifier on the test dataset to see how it performs: .. code:: python scores = predictor.evaluate(test_data, metrics=["accuracy"]) print('The test acc: %.3f' % scores["accuracy"]) .. parsed-literal:: :class: output The test acc: 0.750 Predict on a New Document ------------------------- Given an example document, let’s visualize it first, .. code:: python doc_path = test_data.iloc[1][DOC_PATH_COL] from IPython.display import Image, display pil_img = Image(filename=doc_path, width=500) display(pil_img) .. figure:: output_document_classification_45a0fb_13_0.jpg :width: 500px We can easily use the final model to predict the label, .. code:: python predictions = predictor.predict({DOC_PATH_COL: [doc_path]}) print(predictions) .. parsed-literal:: :class: output [0] The above output shows that the trained model correctly classifies the given document into the *budget* category. If probabilities of all categories are needed, you can call predict_proba: .. code:: python proba = predictor.predict_proba({DOC_PATH_COL: [doc_path]}) print(proba) .. parsed-literal:: :class: output [[0.66863143 0.3211263 0.01024226]] Extract Embeddings ------------------ Extracting representation from the whole document learned by a model is also very useful. We provide extract_embedding function to allow predictor to return the N-dimensional document feature where N depends on the model. .. code:: python feature = predictor.extract_embedding({DOC_PATH_COL: [doc_path]}) print(feature[0].shape) .. parsed-literal:: :class: output (768,) Other Examples -------------- You may go to `AutoMM Examples `__ to explore other examples about AutoMM. Customization ------------- To learn how to customize AutoMM, please refer to :ref:`sec_automm_customization`.