{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "33e2a40a",
   "metadata": {},
   "source": [
    "# AutoMM for Named Entity Recognition in Chinese - Quick Start\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/text_prediction/chinese_ner.ipynb)\n",
    "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/text_prediction/chinese_ner.ipynb)\n",
    "\n",
    "In this tutorial, we will demonstrate how to use AutoMM for Chinese Named Entity Recognition using an e-commerce dataset extracted from one of the most popular online marketplaces, [TaoBao.com](https://taobao.com). \n",
    "The dataset is collected and labelled by [Jie et al.](https://aclanthology.org/N19-1079.pdf) and the text column mainly consists of product descriptions. \n",
    "The following figure shows an example of Taobao product description.\n",
    "\n",
    "![Taobao product description. A rabbit toy for lunar new year decoration.](https://automl-mm-bench.s3.amazonaws.com/ner/images_for_tutorial/chinese_ner.png)\n",
    "\n",
    "\n",
    "## Load the Data \n",
    "We have preprocessed the dataset to make it ready-to-use with AutoMM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa00faab-252f-44c9-b8f7-57131aa8251c",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "!pip install autogluon.multimodal\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83684244",
   "metadata": {},
   "outputs": [],
   "source": [
    "import autogluon.multimodal\n",
    "from autogluon.core.utils.loaders import load_pd\n",
    "from autogluon.multimodal.utils import visualize_ner\n",
    "train_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/taobao-ner/chinese_ner_train.csv')\n",
    "dev_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/taobao-ner/chinese_ner_dev.csv')\n",
    "train_data.head(5)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "61ece6f8",
   "metadata": {},
   "source": [
    "HPPX, HCCX, XH, and MISC stand for brand, product, pattern, and Miscellaneous information (e.g., product Specification), respectively. \n",
    "Let's visualize one of the examples, which is about *online games top up services*."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "64f15637",
   "metadata": {},
   "outputs": [],
   "source": [
    "visualize_ner(train_data[\"text_snippet\"].iloc[0], train_data[\"entity_annotations\"].iloc[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "19e97989",
   "metadata": {},
   "source": [
    "## Training\n",
    "With AutoMM, the process of Chinese entity recognition is the same as English entity recognition. \n",
    "All you need to do is to select a suitable foundation model checkpoint that are pretrained on Chinese or multilingual documents. \n",
    "Here we use the `'hfl/chinese-lert-small'` backbone for demonstration purpose.\n",
    "\n",
    "Now, let’s create a predictor for named entity recognition by setting the problem_type to ner and specifying the label column. \n",
    "Afterwards, we call predictor.fit() to train the model for a few minutes."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c7cabf56",
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.multimodal import MultiModalPredictor\n",
    "import uuid\n",
    "\n",
    "label_col = \"entity_annotations\"\n",
    "model_path = f\"./tmp/{uuid.uuid4().hex}-automm_ner\"  # You can rename it to the model path you like\n",
    "predictor = MultiModalPredictor(problem_type=\"ner\", label=label_col, path=model_path)\n",
    "predictor.fit(\n",
    "    train_data=train_data,\n",
    "    hyperparameters={'model.ner_text.checkpoint_name':'hfl/chinese-lert-small'},\n",
    "    time_limit=300, #second\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c4977807",
   "metadata": {},
   "source": [
    "## Evaluation \n",
    "To check the model performance on the test dataset, all you need to do is to call `predictor.evaluate(...)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0539aa3c",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictor.evaluate(dev_data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b2f4499e",
   "metadata": {},
   "source": [
    "## Prediction and Visualization\n",
    "You can easily obtain the predictions given an input sentence by by calling `predictor.predict(...)`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "83b8259e",
   "metadata": {},
   "outputs": [],
   "source": [
    "output = predictor.predict(dev_data)\n",
    "visualize_ner(dev_data[\"text_snippet\"].iloc[0], output[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c963ab5e",
   "metadata": {},
   "source": [
    "Now, let's make predictions on the rabbit toy example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "82f0089e",
   "metadata": {},
   "outputs": [],
   "source": [
    "sentence = \"2023年兔年挂件新年装饰品小挂饰乔迁之喜门挂小兔子\"\n",
    "predictions = predictor.predict({'text_snippet': [sentence]})\n",
    "visualize_ner(sentence, predictions[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c5cf274",
   "metadata": {},
   "source": [
    "## Other Examples\n",
    "\n",
    "You may go to [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) to explore other examples about AutoMM.\n",
    "\n",
    "## Customization\n",
    "To learn how to customize AutoMM, please refer to [Customize AutoMM](../advanced_topics/customization.ipynb)."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}