Multimodal Prediction
For prediction problems on multimodal data tables that contain image, text, and tabular features, AutoGluon provides MultiModalPredictor (abbreviated as AutoMM), which automatically selects and fuses deep learning backbones from popular packages such as timm, huggingface/transformers, and CLIP. You can use it to build models for multimodal problems, e.g., predicting the price of a product from its description, photo, and other metadata, or matching images with text descriptions.
In addition, a predictor that handles multimodal problems well also works well on each individual modality. Thus, you can use AutoMM to solve standard NLP and vision tasks such as sentiment classification, intent detection, paraphrase detection, and image classification. Moreover, AutoMM can serve as a basic model in the multi-layer stack ensemble of TabularPredictor.
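As a quick illustration of the API, here is a minimal sketch of fitting a predictor on a multimodal data table. The CSV file names, column contents, and the `label` column are hypothetical placeholders, not part of any shipped dataset.

```python
import pandas as pd
from autogluon.multimodal import MultiModalPredictor

# Hypothetical CSV files: each row may mix text, image-path, and numeric
# columns, plus a "label" column to predict. Replace with your own data.
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

# AutoMM infers the problem type and the modality of each column, then
# selects and fuses suitable backbones (e.g., from timm and transformers).
predictor = MultiModalPredictor(label="label")
predictor.fit(train_data)

# Evaluate on held-out data and generate predictions.
scores = predictor.evaluate(test_data)
predictions = predictor.predict(test_data.drop(columns=["label"]))
```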
The following tutorials will help you learn how to use AutoMM to solve problems that involve image, text, and tabular data:
- How to train high-quality text prediction models with MultiModalPredictor in under 5 minutes.
- How to train image classification models with MultiModalPredictor.
- How to use MultiModalPredictor to build models on datasets with languages other than English.
- How MultiModalPredictor can be applied to multimodal data tables with a mix of text, numerical, and categorical columns, e.g., training a model to predict the price of books.
- How to use MultiModalPredictor to train a model that predicts the adoption speed of pets.
- How to use CLIP for zero-shot image classification (see the sketch after this list).
- How to use CLIP to extract embeddings for a retrieval problem.
- How to customize AutoMM configurations.
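As an example of the zero-shot CLIP workflow mentioned above, the sketch below scores an image against free-form candidate labels without any training. It assumes the `zero_shot_image_classification` problem type exposed by AutoMM; the image path and candidate texts are placeholders.

```python
from autogluon.multimodal import MultiModalPredictor

# Zero-shot image classification with CLIP: no training data is needed.
predictor = MultiModalPredictor(problem_type="zero_shot_image_classification")

# Score a local image (placeholder path) against candidate text labels.
probabilities = predictor.predict_proba(
    {"image": ["dog.jpg"]},
    {"text": ["a photo of a dog", "a photo of a cat", "a photo of a bird"]},
)
print(probabilities)
```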