AutoMM for Scanned Document Classification

Paper documents are a crucial source of information in organizations across industries. They are also hard to manage: they occupy a significant amount of space, wear or fade over time, and are difficult to keep track of. As a result, there is a growing trend toward digitizing paper documents with scanners, cameras, and similar devices. However, digitization does not necessarily bring automation, and identifying, categorizing, and analyzing digital documents can still be labor-intensive. For example, classifying digital books into genres, or sorting scanned receipts into utilities, transportation, insurance, rent, supplies, and so on, is time-consuming and tedious when done manually. Newer AI technologies make automating digital document processing easier and more effective, and AI has become a core component of modern digital document processing systems.

In this tutorial, we show how to build a scanned document classifier with AutoGluon Multimodal (AutoMM) in a few lines of code. Let’s get started!

Get a Document Dataset

Now let’s download a scanned document dataset. It is a sample of RVL-CDIP, which originally consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. Here, we sampled around 100 documents from three categories: budget (labeled 0), email (labeled 1), and form (labeled 2).

import warnings
warnings.filterwarnings('ignore')

import os
import pandas as pd
from autogluon.core.utils.loaders import load_zip

download_dir = './ag_automm_tutorial_doc_classifier'
zip_file = "https://automl-mm-bench.s3.amazonaws.com/doc_classification/rvl_cdip_sample.zip"
load_zip.unzip(zip_file, unzip_dir=download_dir)

Downloading ./ag_automm_tutorial_doc_classifier/file.zip from https://automl-mm-bench.s3.amazonaws.com/doc_classification/rvl_cdip_sample.zip...

  0%|          | 0.00/7.95M [00:00<?, ?iB/s]
100%|██████████| 7.95M/7.95M [00:00<00:00, 104MiB/s]

We load the training and test data below.

dataset_path = os.path.join(download_dir, "rvl_cdip_sample")
rvl_cdip_data = pd.read_csv(f"{dataset_path}/rvl_cdip_train_data.csv")
train_data = rvl_cdip_data.sample(frac=0.8, random_state=200)
test_data = rvl_cdip_data.drop(train_data.index)
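
Before training, it can help to check that the 80/20 split is disjoint and to see how the classes are distributed on each side. The following is a minimal sketch on a small synthetic frame: `toy_data` and the `LABEL_NAMES` mapping are illustrative stand-ins (the names follow the label description above) and are not part of the dataset files.

```python
import pandas as pd

# Hypothetical mapping, following the label description in this tutorial.
LABEL_NAMES = {0: "budget", 1: "email", 2: "form"}

# Synthetic stand-in for rvl_cdip_data: 10 documents per class.
toy_data = pd.DataFrame({
    "doc_path": [f"doc_{i}.png" for i in range(30)],
    "label": [i % 3 for i in range(30)],
})

# Same split recipe as above: sample 80% for training, keep the rest for testing.
toy_train = toy_data.sample(frac=0.8, random_state=200)
toy_test = toy_data.drop(toy_train.index)

# The two splits should not share any rows.
overlap = toy_train.index.intersection(toy_test.index)
print("overlap:", len(overlap), "train:", len(toy_train), "test:", len(toy_test))

# Class counts per split, with readable names.
print(toy_train["label"].map(LABEL_NAMES).value_counts())
print(toy_test["label"].map(LABEL_NAMES).value_counts())
```

The same checks can be run on `train_data` and `test_data` directly once they are loaded.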

We need to expand the document paths into full paths so that the files can be loaded during training.

from autogluon.multimodal.utils.misc import path_expander

DOC_PATH_COL = "doc_path"

train_data[DOC_PATH_COL] = train_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))
test_data[DOC_PATH_COL] = test_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))
print(test_data.head())

                                         doc_path  label
/home/ci/autogluon/docs/tutorials/multimodal/d...      0
/home/ci/autogluon/docs/tutorials/multimodal/d...      0
/home/ci/autogluon/docs/tutorials/multimodal/d...      0
/home/ci/autogluon/docs/tutorials/multimodal/d...      0
/home/ci/autogluon/docs/tutorials/multimodal/d...      0
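
Conceptually, path expansion joins each stored path with the download folder and converts the result to an absolute path. The helper below, `expand_path`, is a hypothetical stand-in written for illustration, not AutoGluon's `path_expander` implementation, and the sample file name is made up.

```python
import os

def expand_path(relative_path: str, base_folder: str) -> str:
    """Join a relative path with a base folder and return an absolute path."""
    return os.path.abspath(os.path.join(base_folder, relative_path))

base = "./ag_automm_tutorial_doc_classifier"
# Made-up file name, used only to show the transformation.
example = expand_path("rvl_cdip_sample/budget/example.png", base)
print(example)
print(os.path.isabs(example))
```

Absolute paths keep the data frame usable even if the working directory changes between loading the data and training.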

Let’s display one of the documents. As you can see, this is a budget document consisting of account number, account name, budgeted fund, expenditures, etc.

from IPython.display import Image, display

example_image = train_data.iloc[0][DOC_PATH_COL]
pil_img = Image(filename=example_image, width=500)
display(pil_img)

../../../_images/4df76529344dc2caeb016ed125ba438ca7272e190e1ab8706e94e23f24c5e1d2.jpg

Build a Scanned Document Classifier with AutoMM

You can build a scanned document classifier with our MultiModalPredictor. All you need to do is create a predictor and fit it on the training dataset above. Under the hood, AutoMM automatically recognizes handwritten or typed text and uses the recognized text, the layout information, and the visual features for document classification. Model customization is also simple: you can specify the underlying foundation model with the model.document_transformer.checkpoint_name hyperparameter. AutoMM supports document foundation models such as layoutlmv3, layoutlmv2, layoutlm-base, and layoutxlm, as well as pure text models such as bert and deberta, to name a few.

Here, label is the name of the column that contains the target variable to predict; in our example, it is “label”. We set the training time limit to 120 seconds for demonstration purposes.

from autogluon.multimodal import MultiModalPredictor

predictor = MultiModalPredictor(label="label")
predictor.fit(
    train_data=train_data,
    hyperparameters={
        "model.document_transformer.checkpoint_name": "microsoft/layoutlm-base-uncased",
        "optim.top_k_average_method": "best",
    },
    time_limit=120,
)

No path specified. Models will be saved in: "AutogluonModels/ag-20251024_140418"
=================== System Info ===================
AutoGluon Version:  1.4.1b20251024
Python Version:     3.12.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Pytorch Version:    2.7.1+cu126
CUDA Version:       12.6
GPU Memory:         GPU 0: 14.57/14.57 GB
Total GPU Memory:   Free: 14.57 GB, Allocated: 0.00 GB, Total: 14.57 GB
GPU Count:          1
Memory Avail:       28.40 GB / 30.95 GB (91.8%)
Disk Space Avail:   180.62 GB / 255.99 GB (70.6%)
===================================================
AutoGluon infers your prediction problem is: 'multiclass' (because dtype of label-column == int, but few unique label-values observed).
	3 unique label values:  [np.int64(0), np.int64(1), np.int64(2)]
	If 'multiclass' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20251024_140418
    ```
INFO: Seed set to 0
The model does not support using an image size that is different from the default size. Provided image size=224. Default size=None. Detailed model configuration=LayoutLMConfig {
  "_name_or_path": "microsoft/layoutlm-base-uncased",
  "attention_probs_dropout_prob": 0.1,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_2d_position_embeddings": 1024,
  "max_position_embeddings": 512,
  "model_type": "layoutlm",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "output_past": true,
  "pad_token_id": 0,
  "position_embedding_type": "absolute",
  "transformers_version": "4.49.0",
  "type_vocab_size": 2,
  "use_cache": true,
  "vocab_size": 30522
}
. We have ignored the provided image size.
GPU Count: 1
GPU Count to be Used: 1
INFO: Using 16bit Automatic Mixed Precision (AMP)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO: 
  | Name              | Type                | Params | Mode 
------------------------------------------------------------------
0 | model             | DocumentTransformer | 112 M  | train
1 | validation_metric | MulticlassAccuracy  | 0      | train
2 | loss_func         | CrossEntropyLoss    | 0      | train
------------------------------------------------------------------
112 M     Trainable params
0         Non-trainable params
112 M     Total params
450.521   Total estimated model params size (MB)
236       Modules in train mode
0         Modules in eval mode
INFO: Epoch 0, global step 1: 'val_accuracy' reached 0.43750 (best 0.43750), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20251024_140418/epoch=0-step=1.ckpt' as top 3
INFO: Epoch 1, global step 2: 'val_accuracy' reached 0.68750 (best 0.68750), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20251024_140418/epoch=1-step=2.ckpt' as top 3
INFO: Epoch 2, global step 3: 'val_accuracy' reached 0.87500 (best 0.87500), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20251024_140418/epoch=2-step=3.ckpt' as top 3
INFO: Epoch 3, global step 4: 'val_accuracy' reached 1.00000 (best 1.00000), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20251024_140418/epoch=3-step=4.ckpt' as top 3
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20251024_140418")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).

<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f12ffd6c9b0>
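
To try a different backbone, only the checkpoint name passed in `hyperparameters` needs to change. The sketch below builds that dictionary with a hypothetical helper, `make_hyperparameters`. The checkpoint `microsoft/layoutlmv3-base` is an assumed Hugging Face model name chosen for illustration, and the `fit` call is left commented out because it needs AutoGluon, the prepared `train_data`, and ideally a GPU.

```python
def make_hyperparameters(checkpoint_name: str) -> dict:
    """Return fit() hyperparameters for a given document model checkpoint."""
    return {
        "model.document_transformer.checkpoint_name": checkpoint_name,
        "optim.top_k_average_method": "best",
    }

# Assumed checkpoint name; replace it with the model you want to try.
hyperparameters = make_hyperparameters("microsoft/layoutlmv3-base")
print(hyperparameters)

# With AutoGluon installed and train_data prepared as above:
# from autogluon.multimodal import MultiModalPredictor
# predictor = MultiModalPredictor(label="label")
# predictor.fit(train_data=train_data, hyperparameters=hyperparameters, time_limit=120)
```

Larger checkpoints may need more than the 120-second budget used in this tutorial to reach a useful accuracy.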

Evaluate on Test Dataset

You can evaluate the classifier on the test dataset to see how it performs:

scores = predictor.evaluate(test_data, metrics=["accuracy"])
print('The test acc: %.3f' % scores["accuracy"])

The test acc: 0.950

INFO: 💡 Tip: For seamless cloud uploads and versioning, try installing [litmodels](https://pypi.org/project/litmodels/) to enable LitModelCheckpoint, which syncs automatically with the Lightning model registry.
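
Overall accuracy can hide per-class errors, and a confusion table is one way to surface them. The sketch below uses made-up `y_true` and `y_pred` values for illustration; with the real model, `y_pred` would come from `predictor.predict(test_data)` and `y_true` from `test_data["label"]`.

```python
import pandas as pd

# Made-up labels for illustration only (0 = budget, 1 = email, 2 = form).
y_true = pd.Series([0, 0, 1, 1, 2, 2, 2, 0], name="actual")
y_pred = pd.Series([0, 0, 1, 2, 2, 2, 2, 0], name="predicted")

# Rows are true classes, columns are predicted classes.
confusion = pd.crosstab(y_true, y_pred)
print(confusion)

# Overall accuracy is the fraction of matching entries.
accuracy = (y_true == y_pred).mean()
print(f"accuracy: {accuracy:.3f}")
```

Off-diagonal cells show which categories get confused with each other, which a single accuracy number cannot reveal.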

Predict on a New Document

Given an example document, let’s visualize it first:

from IPython.display import Image, display

doc_path = test_data.iloc[1][DOC_PATH_COL]
pil_img = Image(filename=doc_path, width=500)
display(pil_img)

../../../_images/f0682d3a484741830e472c13869c479fe9556ac9679f9437a52f88c18033c278.jpg

We can easily use the final model to predict the label:

predictions = predictor.predict({DOC_PATH_COL: [doc_path]})
print(predictions)

[0]

The above output shows that the trained model correctly classifies the given document into the budget category.

If probabilities of all categories are needed, you can call predict_proba:

proba = predictor.predict_proba({DOC_PATH_COL: [doc_path]})
print(proba)

[[0.9646742  0.02438714 0.0109386 ]]
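
The predicted class is the index of the largest probability in each row. The sketch below applies that to the probabilities printed above. The `LABEL_NAMES` mapping is an assumption based on the label description earlier in this tutorial.

```python
import numpy as np

# Assumed mapping from label id to category name.
LABEL_NAMES = {0: "budget", 1: "email", 2: "form"}

# Probabilities copied from the output above.
proba = np.array([[0.9646742, 0.02438714, 0.0109386]])

# argmax over the class axis gives the most likely label id per document.
predicted_ids = proba.argmax(axis=1)
for label_id, row in zip(predicted_ids, proba):
    print(f"{LABEL_NAMES[int(label_id)]}: {row[label_id]:.3f}")
```

Keeping the probability next to the label makes it possible to flag low-confidence predictions for manual review.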

Extract Embeddings

Extracting a representation of the whole document learned by the model is also very useful. The extract_embedding function lets the predictor return an N-dimensional document feature, where N depends on the model.

feature = predictor.extract_embedding({DOC_PATH_COL: [doc_path]})
print(feature[0].shape)

(768,)
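
One common use of such embeddings is comparing documents by cosine similarity. The sketch below uses random 768-dimensional vectors as stand-ins; with the real model, the vectors would be rows returned by `predictor.extract_embedding`.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
emb_a = rng.normal(size=768)  # stand-in for one document embedding
emb_b = rng.normal(size=768)  # stand-in for another document embedding

# A vector compared with itself has similarity 1 (up to rounding).
print(cosine_similarity(emb_a, emb_a))
# Two different vectors give a value between -1 and 1.
print(cosine_similarity(emb_a, emb_b))
```

Values close to 1 indicate that two documents are represented similarly by the model, which can support retrieval or clustering.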

Other Examples

See AutoMM Examples to explore other examples of AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.