{ "cells": [ { "attachments": {}, "cell_type": "markdown", "id": "b143eafd", "metadata": {}, "source": [ "# Multimodal Data Tables: Tabular, Text, and Image\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/tabular-multimodal.ipynb)\n", "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/tabular-multimodal.ipynb)\n", "\n", "\n", "**Tip**: Prior to reading this tutorial, it is recommended to have a basic understanding of the TabularPredictor API covered in [Predicting Columns in a Table - Quick Start](tabular-quick-start.ipynb).\n", "\n", "In this tutorial, we will train a multi-modal ensemble using data that contains image, text, and tabular features.\n", "\n", "Note: A GPU is required for this tutorial in order to train the image and text models. Additionally, GPU installations are required for Torch with appropriate CUDA versions.\n", "\n", "## The PetFinder Dataset\n", "\n", "We will be using the [PetFinder dataset](https://www.kaggle.com/c/petfinder-adoption-prediction). The PetFinder dataset provides information about shelter animals that appear on their adoption profile with the goal to predict the adoption rate of the animal. The end goal is for rescue shelters to use the predicted adoption rate to identify animals whose profiles could be improved so that they can find a home.\n", "\n", "Each animal's adoption profile contains a variety of information, such as pictures of the animal, a text description of the animal, and various tabular features such as age, breed, name, color, and more.\n", "\n", "To get started, we first need to download the dataset. Datasets that contain images require more than a CSV file, so the dataset is packaged in a zip file in S3. We will first download it and unzip the contents:" ] }, { "cell_type": "code", "execution_count": null, "id": "aa00faab-252f-44c9-b8f7-57131aa8251c", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "!pip install autogluon\n" ] }, { "cell_type": "code", "execution_count": null, "id": "5d412c3a", "metadata": {}, "outputs": [], "source": [ "download_dir = './ag_petfinder_tutorial'\n", "zip_file = 'https://automl-mm-bench.s3.amazonaws.com/petfinder_kaggle.zip'" ] }, { "cell_type": "code", "execution_count": null, "id": "7cd143ad", "metadata": {}, "outputs": [], "source": [ "from autogluon.core.utils.loaders import load_zip\n", "load_zip.unzip(zip_file, unzip_dir=download_dir)" ] }, { "cell_type": "markdown", "id": "9d0d4d92", "metadata": {}, "source": [ "Now that the data is download and unzipped, let's take a look at the contents:" ] }, { "cell_type": "code", "execution_count": null, "id": "3d7cba19", "metadata": {}, "outputs": [], "source": [ "import os\n", "os.listdir(download_dir)" ] }, { "cell_type": "markdown", "id": "3c13cf90", "metadata": {}, "source": [ "'file.zip' is the original zip file we downloaded, and 'petfinder_processed' is a directory containing the dataset files." 
] }, { "cell_type": "code", "execution_count": null, "id": "9c47ef33", "metadata": {}, "outputs": [], "source": [ "dataset_path = download_dir + '/petfinder_processed'\n", "os.listdir(dataset_path)" ] }, { "cell_type": "markdown", "id": "f1916121", "metadata": {}, "source": [ "Here we can see the train, test, and dev CSV files, as well as two directories: 'test_images' and 'train_images' which contain the image JPG files.\n", "\n", "Note: We will be using the dev data as testing data as dev contains the ground truth labels for showing scores via `predictor.leaderboard`.\n", "\n", "Let's take a peek at the first 10 files inside of the 'train_images' directory:" ] }, { "cell_type": "code", "execution_count": null, "id": "8b04ec7d", "metadata": {}, "outputs": [], "source": [ "os.listdir(dataset_path + '/train_images')[:10]" ] }, { "cell_type": "markdown", "id": "af448fe2", "metadata": {}, "source": [ "As expected, these are the images we will be training with alongside the other features.\n", "\n", "Next, we will load the train and dev CSV files:" ] }, { "cell_type": "code", "execution_count": null, "id": "18f9225d", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "\n", "train_data = pd.read_csv(f'{dataset_path}/train.csv', index_col=0)\n", "test_data = pd.read_csv(f'{dataset_path}/dev.csv', index_col=0)" ] }, { "cell_type": "code", "execution_count": null, "id": "4ee125f2", "metadata": {}, "outputs": [], "source": [ "train_data.head(3)" ] }, { "cell_type": "markdown", "id": "e214208d", "metadata": {}, "source": [ "Looking at the first 3 examples, we can tell that there is a variety of tabular features, a text description ('Description'), and an image path ('Images').\n", "\n", "For the PetFinder dataset, we will try to predict the speed of adoption for the animal ('AdoptionSpeed'), grouped into 5 categories. This means that we are dealing with a multi-class classification problem." ] }, { "cell_type": "code", "execution_count": null, "id": "2094a737", "metadata": {}, "outputs": [], "source": [ "label = 'AdoptionSpeed'\n", "image_col = 'Images'" ] }, { "cell_type": "markdown", "id": "2eca95e5", "metadata": {}, "source": [ "## Preparing the image column\n", "\n", "Let's take a look at what a value in the image column looks like:" ] }, { "cell_type": "code", "execution_count": null, "id": "312535cc", "metadata": {}, "outputs": [], "source": [ "train_data[image_col].iloc[0]" ] }, { "cell_type": "markdown", "id": "206ec739", "metadata": {}, "source": [ "Currently, AutoGluon only supports one image per row. Since the PetFinder dataset contains one or more images per row, we first need to preprocess the image column to only contain the first image of each row." 
] }, { "cell_type": "code", "execution_count": null, "id": "a44acef9", "metadata": {}, "outputs": [], "source": [ "train_data[image_col] = train_data[image_col].apply(lambda ele: ele.split(';')[0])\n", "test_data[image_col] = test_data[image_col].apply(lambda ele: ele.split(';')[0])\n", "\n", "train_data[image_col].iloc[0]" ] }, { "cell_type": "markdown", "id": "2e7a260d", "metadata": {}, "source": [ "AutoGluon loads images based on the file path provided by the image column.\n", "\n", "Here we update the path to point to the correct location on disk:" ] }, { "cell_type": "code", "execution_count": null, "id": "2f09f039", "metadata": {}, "outputs": [], "source": [ "def path_expander(path, base_folder):\n", " path_l = path.split(';')\n", " return ';'.join([os.path.abspath(os.path.join(base_folder, path)) for path in path_l])\n", "\n", "train_data[image_col] = train_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))\n", "test_data[image_col] = test_data[image_col].apply(lambda ele: path_expander(ele, base_folder=dataset_path))\n", "\n", "train_data[image_col].iloc[0]" ] }, { "cell_type": "code", "execution_count": null, "id": "988ec5fb", "metadata": {}, "outputs": [], "source": [ "train_data.head(3)" ] }, { "cell_type": "markdown", "id": "78735d1a", "metadata": {}, "source": [ "## Analyzing an example row\n", "\n", "Now that we have preprocessed the image column, let's take a look at an example row of data and display the text description and the picture." ] }, { "cell_type": "code", "execution_count": null, "id": "4cdf696c", "metadata": {}, "outputs": [], "source": [ "example_row = train_data.iloc[1]\n", "\n", "example_row" ] }, { "cell_type": "code", "execution_count": null, "id": "3d40c53c", "metadata": {}, "outputs": [], "source": [ "example_row['Description']" ] }, { "cell_type": "code", "execution_count": null, "id": "a2851fb9", "metadata": {}, "outputs": [], "source": [ "example_image = example_row['Images']\n", "\n", "from IPython.display import Image, display\n", "pil_img = Image(filename=example_image)\n", "display(pil_img)" ] }, { "cell_type": "markdown", "id": "37f38298", "metadata": {}, "source": [ "The PetFinder dataset is fairly large. For the purposes of the tutorial, we will sample 500 rows for training.\n", "\n", "Training on large multi-modal datasets can be very computationally intensive, especially if using the `best_quality` preset in AutoGluon. When prototyping, it is recommended to sample your data to get an idea of which models are worth training, then gradually train with larger amounts of data and longer time limits as you would with any other machine learning algorithm." 
] }, { "cell_type": "code", "execution_count": null, "id": "82d0c06b", "metadata": {}, "outputs": [], "source": [ "train_data = train_data.sample(500, random_state=0)" ] }, { "cell_type": "markdown", "id": "c22846d1", "metadata": {}, "source": [ "## Constructing the FeatureMetadata\n", "\n", "Next, let's see what AutoGluon infers the feature types to be by constructing a FeatureMetadata object from the training data:" ] }, { "cell_type": "code", "execution_count": null, "id": "b5dc0f25", "metadata": {}, "outputs": [], "source": [ "from autogluon.tabular import FeatureMetadata\n", "feature_metadata = FeatureMetadata.from_df(train_data)\n", "\n", "print(feature_metadata)" ] }, { "cell_type": "markdown", "id": "a07d1c37", "metadata": {}, "source": [ "Notice that FeatureMetadata automatically identified the column 'Description' as text, so we don't need to manually specify that it is text.\n", "\n", "In order to leverage images, we need to tell AutoGluon which column contains the image path. We can do this by specifying a FeatureMetadata object and adding the 'image_path' special type to the image column. We later pass this custom FeatureMetadata to TabularPredictor.fit." ] }, { "cell_type": "code", "execution_count": null, "id": "886c7ebc", "metadata": {}, "outputs": [], "source": [ "feature_metadata = feature_metadata.add_special_types({image_col: ['image_path']})\n", "\n", "print(feature_metadata)" ] }, { "cell_type": "markdown", "id": "679d42dd", "metadata": {}, "source": [ "## Specifying the hyperparameters\n", "\n", "Next, we need to specify the models we want to train with. This is done via the `hyperparameters` argument to TabularPredictor.fit.\n", "\n", "AutoGluon has a predefined config that works well for multimodal datasets called 'multimodal'. We can access it via:" ] }, { "cell_type": "code", "execution_count": null, "id": "3a7dba36", "metadata": {}, "outputs": [], "source": [ "from autogluon.tabular.configs.hyperparameter_configs import get_hyperparameter_config\n", "hyperparameters = get_hyperparameter_config('multimodal')\n", "\n", "hyperparameters" ] }, { "cell_type": "markdown", "id": "ea698424", "metadata": {}, "source": [ "This hyperparameter config will train a variety of Tabular models as well as finetune an Electra BERT text model, and a ResNet image model.\n", "\n", "## Fitting with TabularPredictor\n", "\n", "Now we will train a TabularPredictor on the dataset, using the feature metadata and hyperparameters we defined prior. This TabularPredictor will leverage tabular, text, and image features all at once." 
] }, { "cell_type": "code", "execution_count": null, "id": "20db89df", "metadata": {}, "outputs": [], "source": [ "from autogluon.tabular import TabularPredictor\n", "predictor = TabularPredictor(label=label).fit(\n", " train_data=train_data,\n", " hyperparameters=hyperparameters,\n", " feature_metadata=feature_metadata,\n", " time_limit=900,\n", ")" ] }, { "cell_type": "markdown", "id": "986f7e96", "metadata": {}, "source": [ "After the predictor is fit, we can take a look at the leaderboard and see the performance of the various models:" ] }, { "cell_type": "code", "execution_count": null, "id": "ad480e57", "metadata": {}, "outputs": [], "source": [ "leaderboard = predictor.leaderboard(test_data)" ] }, { "cell_type": "markdown", "id": "dce1bd79", "metadata": {}, "source": [ "That's all it takes to train with image, text, and tabular data (at the same time) using AutoGluon!\n", "\n", "For more tutorials, refer to [Predicting Columns in a Table - Quick Start](tabular-quick-start.ipynb) and [Predicting Columns in a Table - In Depth](tabular-indepth.ipynb)." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }