{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "1c2ce292",
   "metadata": {},
   "source": [
    "# AutoMM for Text + Tabular - Quick Start\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/multimodal_prediction/multimodal_text_tabular.ipynb)\n",
    "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/multimodal_prediction/multimodal_text_tabular.ipynb)\n",
    "\n",
    "\n",
    "\n",
    "In many applications, text appears alongside numeric and categorical data. \n",
    "AutoGluon's `MultiModalPredictor` can train a single neural network that jointly operates on multiple feature types, \n",
    "including text, categorical, and numerical columns. The general idea is to embed the text, categorical, and numeric fields \n",
    "separately and then fuse the resulting features across modalities. This tutorial walks through such an application."
   ]
  },
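  {
   "cell_type": "markdown",
   "id": "sketch-fusion-01",
   "metadata": {},
   "source": [
    "To make the embed-then-fuse idea concrete, here is a minimal NumPy sketch of late fusion. It is an illustration only and not AutoMM's actual architecture: the hashed bag-of-words text encoder, the random embedding table, the random numeric projection, and the names `embed_text`, `embed_category`, `embed_numeric`, and `fuse` are all hypothetical stand-ins for learned components.\n",
    "\n",
    "```python\n",
    "import zlib\n",
    "import numpy as np\n",
    "\n",
    "rng = np.random.default_rng(0)\n",
    "DIM = 8  # shared embedding width (arbitrary choice for this sketch)\n",
    "\n",
    "def embed_text(text):\n",
    "    # Hypothetical stand-in for a text encoder: hashed bag-of-words counts.\n",
    "    vec = np.zeros(DIM)\n",
    "    for token in text.lower().split():\n",
    "        vec[zlib.crc32(token.encode('utf-8')) % DIM] += 1.0\n",
    "    return vec\n",
    "\n",
    "# Hypothetical embedding table for a categorical column with 3 categories.\n",
    "category_table = rng.normal(size=(3, DIM))\n",
    "\n",
    "def embed_category(index):\n",
    "    return category_table[index]\n",
    "\n",
    "# Hypothetical linear projection of one numeric value to the shared width.\n",
    "numeric_weight = rng.normal(size=DIM)\n",
    "\n",
    "def embed_numeric(value):\n",
    "    return value * numeric_weight\n",
    "\n",
    "def fuse(text, category_index, value):\n",
    "    # Fuse by concatenating the per-modality embeddings into one vector.\n",
    "    parts = [embed_text(text), embed_category(category_index), embed_numeric(value)]\n",
    "    return np.concatenate(parts)\n",
    "\n",
    "fused = fuse('a short book synopsis', 1, 4.5)\n",
    "print(fused.shape)\n",
    "```\n",
    "\n",
    "In a real model, the fused vector would feed a prediction head, and every component would be trained jointly."
   ]
  },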
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa00faab-252f-44c9-b8f7-57131aa8251c",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "!pip install autogluon.multimodal\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a3eb3624",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "import warnings\n",
    "import os\n",
    "\n",
    "warnings.filterwarnings('ignore')\n",
    "np.random.seed(123)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6608c094",
   "metadata": {},
   "outputs": [],
   "source": [
    "!python3 -m pip install openpyxl"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ade016b9",
   "metadata": {},
   "source": [
    "## Book Price Prediction Data\n",
    "\n",
    "For demonstration, we use the book price prediction dataset from the [MachineHack Book Price Prediction Hackathon](https://machinehack.com/hackathons/predict_the_price_of_books/overview). Our goal is to predict a book's price from features such as its author, abstract, and rating."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b44e0b3",
   "metadata": {},
   "outputs": [],
   "source": [
    "!mkdir -p price_of_books\n",
    "!wget https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_price_of_books/Data.zip -O price_of_books/Data.zip\n",
    "!cd price_of_books && unzip -o Data.zip\n",
    "!ls price_of_books/Participants_Data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c964975b",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df = pd.read_excel(os.path.join('price_of_books', 'Participants_Data', 'Data_Train.xlsx'), engine='openpyxl')\n",
    "train_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "625e5274",
   "metadata": {},
   "source": [
    "We do some basic preprocessing to convert the `Reviews` and `Ratings` columns to numeric values, and we transform `Price` to a log scale via `log(1 + Price)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3977ed8",
   "metadata": {},
   "outputs": [],
   "source": [
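    "def preprocess(df):\n",
    "    df = df.copy(deep=True)\n",
    "    # Strip the trailing ' out of 5 stars' and parse the rest as a number.\n",
    "    df['Reviews'] = pd.to_numeric(df['Reviews'].apply(lambda ele: ele[:-len(' out of 5 stars')]))\n",
    "    # Drop thousands separators, strip the trailing ' customer reviews', and parse as a number.\n",
    "    df['Ratings'] = pd.to_numeric(df['Ratings'].apply(lambda ele: ele.replace(',', '')[:-len(' customer reviews')]))\n",
    "    # Log-transform the target; np.log1p(x) computes log(1 + x).\n",
    "    df['Price'] = np.log1p(df['Price'])\n",
    "    return df"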
   ]
  },
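  {
   "cell_type": "markdown",
   "id": "sketch-log1p-01",
   "metadata": {},
   "source": [
    "Because the preprocessing adds 1 before taking the log, predictions on this scale must later be mapped back by exponentiating and subtracting 1. The short sketch below checks that round trip with NumPy's `log1p` and `expm1`. The price values are made up for illustration and do not come from the dataset.\n",
    "\n",
    "```python\n",
    "import numpy as np\n",
    "\n",
    "prices = np.array([0.0, 99.0, 1250.5])  # made-up example prices\n",
    "log_prices = np.log1p(prices)           # same as np.log(prices + 1)\n",
    "restored = np.expm1(log_prices)         # same as np.exp(log_prices) - 1\n",
    "print(np.allclose(restored, prices))\n",
    "```\n",
    "\n",
    "Adding 1 keeps a price of zero finite under the log, which is one common reason to prefer this transform over a plain logarithm."
   ]
  },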
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ec39777",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_subsample_size = 1500  # subsample for a faster demo; try larger values in practice\n",
    "test_subsample_size = 5\n",
    "train_df = preprocess(train_df)\n",
    "train_data = train_df.iloc[100:].sample(train_subsample_size, random_state=123)\n",
    "test_data = train_df.iloc[:100].sample(test_subsample_size, random_state=245)\n",
    "train_data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0aaf5168",
   "metadata": {},
   "source": [
    "## Training\n",
    "\n",
    "We can simply create a `MultiModalPredictor` and call `predictor.fit()` to train a model that operates across all feature types. \n",
    "Internally, the neural network is generated automatically based on the inferred data type of each feature column. \n",
    "To save time, we train on the subsampled data for only three minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b6286fcd",
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.multimodal import MultiModalPredictor\n",
    "import uuid\n",
    "\n",
    "time_limit = 3 * 60  # set to larger value in your applications\n",
    "model_path = f\"./tmp/{uuid.uuid4().hex}-automm_text_book_price_prediction\"\n",
    "predictor = MultiModalPredictor(label='Price', path=model_path)\n",
    "predictor.fit(train_data, time_limit=time_limit)"
   ]
  },
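  {
   "cell_type": "markdown",
   "id": "sketch-coltype-01",
   "metadata": {},
   "source": [
    "The column-type inference that drives model construction can be pictured with a rough rule of thumb. The sketch below is a hypothetical heuristic written for this tutorial, not AutoMM's real inference logic: the function `guess_column_type`, the `max_categories` threshold, and the toy table are all assumptions.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "def guess_column_type(series, max_categories=10):\n",
    "    # Hypothetical rule of thumb; AutoMM's real inference logic may differ.\n",
    "    if pd.api.types.is_numeric_dtype(series):\n",
    "        return 'numerical'\n",
    "    if series.nunique() <= max_categories:\n",
    "        return 'categorical'\n",
    "    return 'text'\n",
    "\n",
    "# Toy table with made-up values, used only to exercise the heuristic.\n",
    "toy = pd.DataFrame({\n",
    "    'Synopsis': [f'A made-up synopsis number {i}' for i in range(20)],\n",
    "    'Genre': ['Fiction', 'Science'] * 10,\n",
    "    'Reviews': [4.0 + 0.01 * i for i in range(20)],\n",
    "})\n",
    "column_types = {name: guess_column_type(toy[name]) for name in toy.columns}\n",
    "print(column_types)\n",
    "```\n",
    "\n",
    "A heuristic like this also shows why the numeric conversion in `preprocess` matters: a column left as strings would not be recognized as numerical."
   ]
  },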
  {
   "cell_type": "markdown",
   "id": "19752cb3",
   "metadata": {},
   "source": [
    "## Prediction\n",
    "\n",
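    "We can easily obtain predictions and extract data embeddings using the `MultiModalPredictor`. Because the model was trained on log-transformed prices, we invert the transform before printing the predictions."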
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "217e5c1d",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = predictor.predict(test_data)\n",
    "# Predictions are on the log(1 + Price) scale; np.expm1 maps them back to prices.\n",
    "print('Predictions:')\n",
    "print('------------')\n",
    "print(np.expm1(predictions))\n",
    "print()\n",
    "print('True Value:')\n",
    "print('------------')\n",
    "print(np.expm1(test_data['Price']))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "37ad1e64",
   "metadata": {},
   "outputs": [],
   "source": [
    "performance = predictor.evaluate(test_data)\n",
    "print(performance)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "72c67048",
   "metadata": {},
   "outputs": [],
   "source": [
    "embeddings = predictor.extract_embedding(test_data)\n",
    "embeddings.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ae530b1b",
   "metadata": {},
   "source": [
    "\n",
    "## Other Examples\n",
    "\n",
    "See [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) for more examples of using AutoMM.\n",
    "\n",
    "## Customization\n",
    "To learn how to customize AutoMM, see [Customize AutoMM](../advanced_topics/customization.ipynb)."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python",
   "pygments_lexer": "ipython"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}