Multimodal Prediction
For problems on multimodal data tables that contain image, text, and tabular data, AutoGluon provides MultiModalPredictor (abbreviated as AutoMM), which automatically selects, fuses, and fine-tunes foundation models from popular packages such as timm, huggingface/transformers, CLIP, and MMDetection.
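As a quick illustration, here is a minimal sketch of the core fit/predict workflow. The file names and the "label" column are placeholders for your own data; AutoMM infers the column types and the problem type from the table itself.

```python
import pandas as pd
from autogluon.multimodal import MultiModalPredictor

# Placeholder files: any table whose columns mix text, image paths,
# and tabular features. AutoMM inspects the column types and picks
# suitable foundation models automatically.
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

predictor = MultiModalPredictor(label="label")  # name of the target column
predictor.fit(train_data)

predictions = predictor.predict(test_data)
scores = predictor.evaluate(test_data)
```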
You can use AutoMM not only to solve standard NLP/vision tasks such as sentiment classification, intent detection, paraphrase detection, and image classification, but also to tackle multimodal problems that involve images, text, tabular features, object bounding boxes, named entities, etc. Moreover, AutoMM can serve as a base model in the multi-layer stack ensemble of AutoGluon Tabular, and it powers the FT-Transformer in TabularPredictor.
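To make the TabularPredictor integration concrete, here is a minimal sketch. It assumes the "FT_TRANSFORMER" hyperparameter key available in recent AutoGluon releases; the file path and label column are placeholders.

```python
from autogluon.tabular import TabularDataset, TabularPredictor

train_data = TabularDataset("train.csv")  # placeholder path

# "FT_TRANSFORMER" adds the AutoMM-powered FT-Transformer alongside a
# gradient-boosted tree model; bagging plus one stack level lets both
# participate in a multi-layer stack ensemble.
predictor = TabularPredictor(label="label").fit(
    train_data,
    hyperparameters={
        "GBM": {},
        "FT_TRANSFORMER": {},
    },
    num_bag_folds=5,
    num_stack_levels=1,
)
```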
Here are some example use-cases of AutoMM (a short code sketch follows the list):
Multilingual text classification: Tutorial
Predicting pets’ popularity based on their description, photo, and other metadata: Tutorial, Example
Predicting the price of a book: Tutorial
Scoring students' essays: Example
Image classification: Tutorial
Extracting named entities: Tutorial
Searching for relevant text/images via text queries: Tutorial
Document classification (experimental): Tutorial
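Many of these use cases only require pointing MultiModalPredictor at a different problem_type. As a sketch for the named-entity extraction case above: the file and column names are placeholders, and the entity annotations must follow the format described in the NER tutorial.

```python
import pandas as pd
from autogluon.multimodal import MultiModalPredictor

# Placeholder dataset: a text column plus a column of entity
# annotations in the format the NER tutorial describes.
train_data = pd.read_csv("ner_train.csv")

predictor = MultiModalPredictor(
    problem_type="ner",
    label="entity_annotations",  # placeholder annotation column
)
predictor.fit(train_data)
```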
In the following, we decompose the functionalities of AutoMM and provide a step-by-step guide for each functionality.