Classifying PDF Documents with AutoMM
PDF is short for Portable Document Format, one of the most popular document formats. PDFs are everywhere, from personal resumes to business contracts, and from commercial brochures to government documents. PDF is valued for its portability: the recipient can view the document as intended, regardless of operating system or device.
With AutoMM, you can build machine learning models on PDF documents just as you would on other modalities such as text and images, without handling the PDF processing yourself. In this tutorial, we show how to classify PDF documents automatically with AutoMM using document foundation models. Let’s get started!
Install AutoGluon MultiModal with the extra dependency PyMuPDF:
!pip install autogluon.multimodal[PyMuPDF]
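Once the install command has finished, a quick check can confirm that both packages are importable before moving on. The snippet below is a minimal sketch and not part of the original tutorial: the helper name `is_installed` is hypothetical, and it assumes PyMuPDF is exposed under its traditional import name `fitz`.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if `module_name` can be located by the import system.

    The module itself is not executed, but parent packages of a dotted
    name (e.g. `autogluon` for `autogluon.multimodal`) are imported.
    """
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # find_spec raises ModuleNotFoundError (a subclass of ImportError)
        # when the parent package of a dotted name is missing.
        return False


# Assumption: "fitz" is the import name used by PyMuPDF.
REQUIRED_MODULES = ("autogluon.multimodal", "fitz")

if __name__ == "__main__":
    for name in REQUIRED_MODULES:
        status = "found" if is_installed(name) else "MISSING"
        print(f"{name}: {status}")
```

If either module is reported as missing, rerunning the install command in the same environment as the notebook kernel is usually the first thing to try.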
Collecting PyMuPDF<=1.21.1 (from autogluon.multimodal[PyMuPDF])
Downloading PyMuPDF-1.21.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (14.0 MB)
Requirement already satisfied: oss2~=2.17.0 in /home/ci/opt/venv/lib/python3.10/site-packages (from openxlab->opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal[PyMuPDF]) (2.17.0)
Requirement already satisfied: PySocks!=1.5.7,>=1.5.6 in /home/ci/opt/venv/lib/python3.10/site-packages (from requests[socks]->gdown>=4.0.0->nlpaug<1.2.0,>=1.1.10->autogluon.multimodal[PyMuPDF]) (1.7.1)
Requirement already satisfied: googleapis-common-protos<2.0.dev0,>=1.56.2 in /home/ci/opt/venv/lib/python3.10/site-packages (from google-api-core<3.0.0,>=1.0.0->opencensus->ray[default,tune]<2.7,>=2.6.3; extra == "raytune"->autogluon.core[raytune]==1.0.0b20231228->autogluon.multimodal[PyMuPDF]) (1.62.0)
Requirement already satisfied: crcmod>=1.7 in /home/ci/opt/venv/lib/python3.10/site-packages (from oss2~=2.17.0->openxlab->opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal[PyMuPDF]) (1.7)
Requirement already satisfied: aliyun-python-sdk-kms>=2.4.1 in /home/ci/opt/venv/lib/python3.10/site-packages (from oss2~=2.17.0->openxlab->opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal[PyMuPDF]) (2.16.2)
Requirement already satisfied: aliyun-python-sdk-core>=2.13.12 in /home/ci/opt/venv/lib/python3.10/site-packages (from oss2~=2.17.0->openxlab->opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal[PyMuPDF]) (2.14.0)
Requirement already satisfied: cryptography>=2.6.0 in /home/ci/opt/venv/lib/python3.10/site-packages (from aliyun-python-sdk-core>=2.13.12->oss2~=2.17.0->openxlab->opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal[PyMuPDF]) (41.0.7)
Requirement already satisfied: cffi>=1.12 in /home/ci/opt/venv/lib/python3.10/site-packages (from cryptography>=2.6.0->aliyun-python-sdk-core>=2.13.12->oss2~=2.17.0->openxlab->opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal[PyMuPDF]) (1.16.0)
Requirement already satisfied: pycparser in /home/ci/opt/venv/lib/python3.10/site-packages (from cffi>=1.12->cryptography>=2.6.0->aliyun-python-sdk-core>=2.13.12->oss2~=2.17.0->openxlab->opendatalab->openmim<0.4.0,>=0.3.7->autogluon.multimodal[PyMuPDF]) (2.21)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.21.1
Get the PDF document dataset#
We have created a small PDF dataset via manual crawling for demonstration purposes. It consists of two categories: resumes and historical documents (downloaded from milestone documents). We picked 20 PDF documents for each category.
Now, let’s download the dataset and split it into training and test sets.
import warnings
warnings.filterwarnings('ignore')
import os
import pandas as pd
from autogluon.core.utils.loaders import load_zip
download_dir = './ag_automm_tutorial_pdf_classifier'
zip_file = "https://automl-mm-bench.s3.amazonaws.com/doc_classification/pdf_docs_small.zip"
load_zip.unzip(zip_file, unzip_dir=download_dir)
dataset_path = os.path.join(download_dir, "pdf_docs_small")
pdf_docs = pd.read_csv(f"{dataset_path}/data.csv")
train_data = pdf_docs.sample(frac=0.8, random_state=200)
test_data = pdf_docs.drop(train_data.index)
Downloading ./ag_automm_tutorial_pdf_classifier/file.zip from https://automl-mm-bench.s3.amazonaws.com/doc_classification/pdf_docs_small.zip...
100%|██████████| 12.7M/12.7M [00:00<00:00, 119MiB/s]
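Note that `sample(frac=0.8)` draws rows without looking at the label, so on a dataset this small the test set can end up dominated by one category. One possible alternative is a stratified split that samples 80% within each label. The sketch below uses a toy DataFrame with hypothetical file names rather than the real data.csv, and assumes the same column names (`doc_path`, `label`) as the tutorial.

```python
import pandas as pd

# Toy stand-in for data.csv: 20 documents per category (file names are hypothetical).
toy_docs = pd.DataFrame(
    {
        "doc_path": [f"resume_{i}.pdf" for i in range(20)]
        + [f"historical_{i}.pdf" for i in range(20)],
        "label": ["resume"] * 20 + ["historical"] * 20,
    }
)

# Sample 80% of the rows inside each label group, then use the rest as the test set.
strat_train = toy_docs.groupby("label").sample(frac=0.8, random_state=200)
strat_test = toy_docs.drop(strat_train.index)

print(strat_train["label"].value_counts().to_dict())
print(strat_test["label"].value_counts().to_dict())
```

With 20 documents per category, each label contributes 16 training and 4 test documents, so both splits keep the original class balance.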
Now, let’s visualize one of the PDF documents. Here, we use the S3 URL of the PDF document and IFrame to show it in the tutorial.
from IPython.display import IFrame
IFrame("https://automl-mm-bench.s3.amazonaws.com/doc_classification/historical_1.pdf", width=400, height=500)
As you can see, this is a historical American document in PDF format. To make sure MultiModalPredictor can locate the documents correctly, we need to expand the document paths into absolute paths.
from autogluon.multimodal.utils.misc import path_expander
DOC_PATH_COL = "doc_path"
train_data[DOC_PATH_COL] = train_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))
test_data[DOC_PATH_COL] = test_data[DOC_PATH_COL].apply(lambda ele: path_expander(ele, base_folder=download_dir))
print(test_data.head())
doc_path label
4 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
12 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
14 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
15 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
16 /home/ci/autogluon/docs/tutorials/multimodal/d... resume
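The path rewrite can be pictured as joining each relative path onto the download folder and resolving the result to an absolute path. The helper `expand_path` below is a hypothetical stand-in written for illustration; it reflects an assumption about the behavior seen in the output above, not AutoGluon's actual `path_expander` implementation, and the file name is made up.

```python
import os


def expand_path(relative_path, base_folder):
    """Join a relative path onto base_folder and return an absolute path."""
    return os.path.abspath(os.path.join(base_folder, relative_path))


base_folder = "./ag_automm_tutorial_pdf_classifier"
# Hypothetical relative path, in the style a data.csv entry might use.
relative_path = os.path.join("pdf_docs_small", "resume_1.pdf")

expanded = expand_path(relative_path, base_folder)
print(os.path.isabs(expanded))  # True
```

Absolute paths keep the predictor independent of the directory the notebook happens to run from.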
Create a PDF Document Classifier#
You can easily create a PDF classifier with MultiModalPredictor.
All you need to do is create a predictor and fit it on the training dataset above.
AutoMM handles the details automatically, including (1) detecting that the dataset consists of PDF documents; (2) converting the PDFs into a format the model can consume; and (3) detecting and recognizing the text in the PDF documents.
Here, label is the name of the column that contains the target variable to predict; in our example it is “label”. We set the training time limit to 120 seconds for demonstration purposes.
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor(label="label")
predictor.fit(
    train_data=train_data,
    hyperparameters={
        "model.document_transformer.checkpoint_name": "microsoft/layoutlm-base-uncased",
        "optimization.top_k_average_method": "best",
    },
    time_limit=120,
)
No path specified. Models will be saved in: "AutogluonModels/ag-20231228_194038"
=================== System Info ===================
AutoGluon Version: 1.0.0b20231228
Python Version: 3.10.8
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Tue Nov 30 00:17:50 UTC 2021
CPU Count: 8
Pytorch Version: 2.0.1+cu117
CUDA Version: 11.7
Memory Avail: 28.72 GB / 30.96 GB (92.8%)
Disk Space Avail: 199.60 GB / 255.99 GB (78.0%)
===================================================
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: ['historical', 'resume']
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
AutoMM starts to create your model. ✨✨✨
To track the learning progress, you can open a terminal and launch Tensorboard:
```shell
# Assume you have installed tensorboard
tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20231228_194038
```
INFO: Global seed set to 0
GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: Tesla T4
GPU 0 Memory: 0.0GB/14.76GB (Used/Total)
INFO: Using 16bit Automatic Mixed Precision (AMP)
INFO: GPU available: True (cuda), used: True
INFO: TPU available: False, using: 0 TPU cores
INFO: IPU available: False, using: 0 IPUs
INFO: HPU available: False, using: 0 HPUs
INFO: LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
INFO:
| Name | Type | Params
----------------------------------------------------------
0 | model | DocumentTransformer | 112 M
1 | validation_metric | BinaryAUROC | 0
2 | loss_func | CrossEntropyLoss | 0
----------------------------------------------------------
112 M Trainable params
0 Non-trainable params
112 M Total params
450.518 Total estimated model params size (MB)
AutoMM has created your model. 🎉🎉🎉
To load the model, use the code below:
```python
from autogluon.multimodal import MultiModalPredictor
predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/document_prediction/AutogluonModels/ag-20231228_194038")
```
If you are not satisfied with the model, try to increase the training time,
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f22666670d0>
Evaluate on Test Dataset#
You can evaluate the classifier on the test dataset to see how it performs:
scores = predictor.evaluate(test_data, metrics=["accuracy"])
print('The test acc: %.3f' % scores["accuracy"])
The test acc: 0.625
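To put this number in context: assuming data.csv lists all 40 documents, the 80/20 split leaves only 8 test documents, so an accuracy of 0.625 corresponds to 5 correct predictions out of 8, and a single document changes the score by 0.125. A minimal sketch of the accuracy computation follows; the labels are hypothetical and are not the tutorial's actual predictions.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty list")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for an 8-document test set.
y_true = ["resume"] * 5 + ["historical"] * 3
y_pred = ["resume"] * 8  # a model that always answers "resume"

print(accuracy(y_true, y_pred))  # 0.625
```

With so few test documents the score is noisy, so a longer training time or a larger test set is needed before drawing conclusions about model quality.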
Predict on a New PDF Document#
Given an example PDF document, we can easily use the final model to predict the label:
predictions = predictor.predict({DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]})
print(f"Ground-truth label: {test_data.iloc[0]['label']}, Prediction: {predictions}")
Ground-truth label: resume, Prediction: ['resume']
If probabilities of all categories are needed, you can call predict_proba:
proba = predictor.predict_proba({DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]})
print(proba)
[[0.3201368 0.6798632]]
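Each row of the output holds one probability per class. The sketch below maps a row back to a label by taking the largest entry. It assumes the columns follow the label order printed in the training log (['historical', 'resume']); that ordering is an assumption, so verify it against your own predictor before relying on it.

```python
# Assumed column order, taken from the label values printed in the training log.
classes = ["historical", "resume"]

# The probabilities printed above for the first test document.
proba_rows = [[0.3201368, 0.6798632]]


def top_label(row, class_names):
    """Return the (label, probability) pair with the highest probability."""
    if len(row) != len(class_names):
        raise ValueError("row and class_names must have the same length")
    best = max(range(len(row)), key=lambda i: row[i])
    return class_names[best], row[best]


for row in proba_rows:
    label, p = top_label(row, classes)
    print(f"{label} ({p:.3f})")  # resume (0.680)
```

Under that assumption, the second column is the probability of "resume", which agrees with the prediction shown earlier.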
Extract Embeddings#
It is also often useful to extract the representation of a whole document learned by the model. The extract_embedding function returns an N-dimensional feature vector for each document, where N depends on the model (768 for the LayoutLM base checkpoint used here).
feature = predictor.extract_embedding({DOC_PATH_COL: [test_data.iloc[0][DOC_PATH_COL]]})
print(feature[0].shape)
(768,)
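One common use of such embeddings is comparing documents by cosine similarity. The sketch below is a plain-Python illustration on short made-up vectors rather than real 768-dimensional features; `cosine_similarity` is a helper defined here, not part of AutoMM.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional "embeddings" for illustration only.
doc_a = [1.0, 0.0, 2.0]
doc_b = [2.0, 0.0, 4.0]  # same direction as doc_a
doc_c = [0.0, 3.0, 0.0]  # orthogonal to doc_a

print(round(cosine_similarity(doc_a, doc_b), 6))  # 1.0
print(cosine_similarity(doc_a, doc_c))  # 0.0
```

The same function accepts the arrays returned by extract_embedding, since it only iterates over the values.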
Other Examples#
You may go to AutoMM Examples to explore other examples about AutoMM.
Customization#
To learn how to customize AutoMM, please refer to Customize AutoMM.