{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "id": "zcGYEaIQKiql"
   },
   "source": [
    "# AutoGluon Multimodal - Quick Start\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/multimodal_prediction/multimodal-quick-start.ipynb)\n",
    "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/multimodal_prediction/multimodal-quick-start.ipynb)\n",
    "\n",
    "AutoGluon's `MultiModalPredictor` is a deep learning model zoo of model zoos that can automatically build state-of-the-art deep learning models for inputs including images, text, and tabular data. Convert your data into AutoGluon's multimodal dataframe format, and `MultiModalPredictor` can predict the values of one column based on the other features.\n",
    "\n",
    "Begin by making sure AutoGluon is installed, and then import the required modules."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "!python -m pip install --upgrade pip\n",
    "!python -m pip install autogluon"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "import warnings\n",
    "\n",
    "import numpy as np\n",
    "\n",
    "warnings.filterwarnings('ignore')\n",
    "np.random.seed(123)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example Data\n",
    "\n",
    "For this tutorial we use a simplified and subsampled version of the [PetFinder dataset](https://www.kaggle.com/c/petfinder-adoption-prediction). The goal is to predict pet adoption rates based on their adoption profiles. In this simplified version, the adoption speed is grouped into two categories: 0 (slow) and 1 (fast). We begin by downloading a zip file containing the petfinder datasets and unzipping them in the current working directory."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.core.utils.loaders import load_zip\n",
    "\n",
    "download_dir = './ag_multimodal_tutorial'\n",
    "zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_for_tutorial.zip'\n",
    "\n",
    "load_zip.unzip(zip_file, unzip_dir=download_dir)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we use pandas to read the dataset's CSV files into `DataFrames`, noting that the column we are interested in learning to predict is \"AdoptionSpeed\"."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "dataset_path = f'{download_dir}/petfinder_for_tutorial'\n",
    "\n",
    "train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)\n",
    "test_data = pd.read_csv(f'{dataset_path}/test.csv', index_col=0)\n",
    "\n",
    "label_col = 'AdoptionSpeed'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The PetFinder dataset comes with a directory of images, and some records in the data have multiple images associated with them. AutoGluon's multimodal dataframe format requires that image columns contain a string whose value is a path to a single image file. For this example, we will limit the image feature column to only the first image and will need to do some path manipulations to get everything setup correctly for the current directory structure."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "image_col = 'Images'\n",
    "\n",
    "train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])\n",
    "test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])\n",
    "\n",
    "def path_expander(path, base_folder):\n",
    "    path_l = path.split(';')\n",
    "    return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])\n",
    "\n",
    "train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))\n",
    "test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Each animal's adoption profile includes pictures, a text description, and various tabular features such as age, breed, name, color, and more. Let's look at a picture and description for an example row of data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "example_row = train_data.iloc[0]\n",
    "example_image = example_row[image_col]\n",
    "\n",
    "from IPython.display import Image, display\n",
    "pil_img = Image(filename=example_image)\n",
    "display(pil_img)\n",
    "\n",
    "example_row['Description']"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Training\n",
    "\n",
    "Now that the data is in a suitable format, we can fit `MultiModalPredictor` on the training data. Here we set a tight training time budget for this quick demo. More training time will lead to better prediction performance, but we can get surprisingly good performance in a short amount of time."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "tags": [
     "hide-output"
    ]
   },
   "outputs": [],
   "source": [
    "from autogluon.multimodal import MultiModalPredictor\n",
    "\n",
    "predictor = MultiModalPredictor(label=label_col).fit(\n",
    "    train_data=train_data,\n",
    "    time_limit=120\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Under the hood `MultiModalPredictor` automatically infers the problem type (classification or regression), detects feature modalities, selects models from the multimodal model pools, and trains the selected models. If multiple backbones are used, MultiModalPredictor appends a late-fusion model (MLP or transformer) on top of them."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Prediction\n",
    "\n",
    "After fitting the model, we want to use it to predict the labels in the witheld test dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = predictor.predict(test_data.drop(columns=label_col))\n",
    "predictions[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For classification tasks, we can just as easily get the prediction probabilities for each output class."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "probs = predictor.predict_proba(test_data.drop(columns=label_col))\n",
    "probs[:5]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluation\n",
    "\n",
    "Finally, we can evaluate the predictor on the witheld test dataset on other performance metrics, in this case [roc_auc](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scores = predictor.evaluate(test_data, metrics=[\"roc_auc\"])\n",
    "scores"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "In this quickstart tutorial we saw the basic fit and predict functionality of AutoGluon's `MultiModalPredictor`, but we just scratched the surface on its functionality. Check out the in-depth tutorials to learn about other features of AutoGluon's `MultiModalPredictor` like embedding extraction, distillation, model fine-tuning, text or image prediction, and semantic matching."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}