{ "cells": [ { "cell_type": "markdown", "id": "33e2a40a", "metadata": {}, "source": [ "# AutoMM for Named Entity Recognition in Chinese - Quick Start\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/text_prediction/chinese_ner.ipynb)\n", "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/multimodal/text_prediction/chinese_ner.ipynb)\n", "\n", "In this tutorial, we will demonstrate how to use AutoMM for Chinese Named Entity Recognition with an e-commerce dataset extracted from one of the most popular online marketplaces, [TaoBao.com](https://taobao.com). \n", "The dataset was collected and labelled by [Jie et al.](https://aclanthology.org/N19-1079.pdf), and the text column mainly consists of product descriptions. \n", "The following figure shows an example of a Taobao product description.\n", "\n", "![Taobao product description. A rabbit toy for lunar new year decoration.](https://automl-mm-bench.s3.amazonaws.com/ner/images_for_tutorial/chinese_ner.png)\n", "\n", "\n", "## Load the Data \n", "We have preprocessed the dataset to make it ready to use with AutoMM." 
] }, { "cell_type": "code", "execution_count": null, "id": "aa00faab-252f-44c9-b8f7-57131aa8251c", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "!pip install autogluon.multimodal\n" ] }, { "cell_type": "code", "execution_count": null, "id": "83684244", "metadata": {}, "outputs": [], "source": [ "import autogluon.multimodal\n", "from autogluon.core.utils.loaders import load_pd\n", "from autogluon.multimodal.utils import visualize_ner\n", "train_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/taobao-ner/chinese_ner_train.csv')\n", "dev_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/taobao-ner/chinese_ner_dev.csv')\n", "train_data.head(5)" ] }, { "cell_type": "markdown", "id": "61ece6f8", "metadata": {}, "source": [ "HPPX, HCCX, XH, and MISC stand for brand, product, pattern, and miscellaneous information (e.g., product specification), respectively. \n", "Let's visualize one of the examples, which is about *online game top-up services*." ] }, { "cell_type": "code", "execution_count": null, "id": "64f15637", "metadata": {}, "outputs": [], "source": [ "visualize_ner(train_data[\"text_snippet\"].iloc[0], train_data[\"entity_annotations\"].iloc[0])" ] }, { "cell_type": "markdown", "id": "19e97989", "metadata": {}, "source": [ "## Training\n", "With AutoMM, the process of Chinese entity recognition is the same as English entity recognition. \n", "All you need to do is select a suitable foundation model checkpoint that is pretrained on Chinese or multilingual documents. \n", "Here we use the `'hfl/chinese-lert-small'` backbone for demonstration purposes.\n", "\n", "Now, let’s create a predictor for named entity recognition by setting the `problem_type` to `ner` and specifying the label column. \n", "Afterwards, we call `predictor.fit()` to train the model for a few minutes." 
] }, { "cell_type": "code", "execution_count": null, "id": "c7cabf56", "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal import MultiModalPredictor\n", "import uuid\n", "\n", "label_col = \"entity_annotations\"\n", "model_path = f\"./tmp/{uuid.uuid4().hex}-automm_ner\" # You can change this to any model path you like\n", "predictor = MultiModalPredictor(problem_type=\"ner\", label=label_col, path=model_path)\n", "predictor.fit(\n", " train_data=train_data,\n", " hyperparameters={'model.ner_text.checkpoint_name':'hfl/chinese-lert-small'},\n", " time_limit=300, # in seconds\n", ")" ] }, { "cell_type": "markdown", "id": "c4977807", "metadata": {}, "source": [ "## Evaluation \n", "To check the model performance on the development dataset, all you need to do is call `predictor.evaluate(...)`." ] }, { "cell_type": "code", "execution_count": null, "id": "0539aa3c", "metadata": {}, "outputs": [], "source": [ "predictor.evaluate(dev_data)" ] }, { "cell_type": "markdown", "id": "b2f4499e", "metadata": {}, "source": [ "## Prediction and Visualization\n", "You can easily obtain the predictions for input sentences by calling `predictor.predict(...)`." ] }, { "cell_type": "code", "execution_count": null, "id": "83b8259e", "metadata": {}, "outputs": [], "source": [ "output = predictor.predict(dev_data)\n", "visualize_ner(dev_data[\"text_snippet\"].iloc[0], output[0])" ] }, { "cell_type": "markdown", "id": "c963ab5e", "metadata": {}, "source": [ "Now, let's make predictions on the rabbit toy example." 
] }, { "cell_type": "code", "execution_count": null, "id": "82f0089e", "metadata": {}, "outputs": [], "source": [ "sentence = \"2023年兔年挂件新年装饰品小挂饰乔迁之喜门挂小兔子\"\n", "predictions = predictor.predict({'text_snippet': [sentence]})\n", "visualize_ner(sentence, predictions[0])" ] }, { "cell_type": "markdown", "id": "0c5cf274", "metadata": {}, "source": [ "## Other Examples\n", "\n", "You may go to [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) to explore more AutoMM examples.\n", "\n", "## Customization\n", "To learn how to customize AutoMM, please refer to [Customize AutoMM](../advanced_topics/customization.ipynb)." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }