AutoMM for Chinese Named Entity Recognition
===========================================

In this tutorial, we demonstrate how to use AutoMM for Chinese Named Entity Recognition using an e-commerce dataset extracted from one of the most popular online marketplaces, TaoBao.com. The dataset was collected and labelled by Jie et al., and the text column mainly consists of product descriptions. The following figure shows an example of a Taobao product description.

.. figure:: https://automl-mm-bench.s3.amazonaws.com/ner/images_for_tutorial/chinese_ner.png
   :width: 200px

   Taobao product description. A rabbit toy for lunar new year decoration.

Load the Data
-------------

We have preprocessed the dataset to make it ready to use with AutoMM.

.. code:: python

   import autogluon.multimodal
   from autogluon.core.utils.loaders import load_pd
   from autogluon.multimodal.utils import visualize_ner

   train_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/taobao-ner/chinese_ner_train.csv')
   dev_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/taobao-ner/chinese_ner_dev.csv')
   train_data.head(5)

.. list-table::
   :header-rows: 1

   * -
     - text_snippet
     - entity_annotations
   * - 0
     - 雄争霸点卡/七雄争霸元宝/七雄争霸100元1000元宝直充,自动充值
     - ``[{"entity_group": "HCCX", "start": 3, "end": 5...``
   * - 1
     - 简约韩版粗跟艾熙百思图亲子鞋冬季百搭街头母女圆头翻边绒面厚底
     - ``[{"entity_group": "HPPX", "start": 6, "end": 8...``
   * - 2
     - 羚跑商务背包双肩包男士防盗多功能出差韩版休闲15.6寸电脑包皮潮
     - ``[{"entity_group": "HPPX", "start": 0, "end": 2...``
   * - 3
     - 热水袋防爆充电暖宝卡通毛绒萌萌可爱注水暖宫暖手宝暖水袋
     - ``[{"entity_group": "HCCX", "start": 0, "end": 3...``
   * - 4
     - 童装11周岁13儿童夏装男童套装2017新款10中大童15男孩12秋季5潮7
     - ``[{"entity_group": "HCCX", "start": 0, "end": 2...``
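Each ``entity_annotations`` value in the preview above is a JSON string listing spans with an ``entity_group`` label and the character offsets ``start`` and ``end``. The following minimal sketch shows one way to recover the labelled substrings from such a row. It assumes that ``end`` is exclusive, which is consistent with the previewed rows (``start=0, end=3`` covers 热水袋) but is not stated by the dataset itself. The helper name ``extract_entities`` is hypothetical and not part of AutoMM. The sample annotation contains only the first span visible in the preview, because the rest of that row is truncated.

```python
import json


def extract_entities(text, annotations):
    """Return (label, surface_text) pairs for every annotated span.

    `annotations` may be a JSON string (as stored in the CSV) or an
    already-parsed list of dicts with "entity_group", "start", "end".
    Assumption: "end" is an exclusive character offset.
    """
    if isinstance(annotations, str):
        annotations = json.loads(annotations)
    return [
        (span["entity_group"], text[span["start"]:span["end"]])
        for span in annotations
    ]


# Shortened sample: only the first span shown in the preview of row 3.
sample_text = "热水袋防爆充电暖宝卡通毛绒萌萌可爱注水暖宫暖手宝暖水袋"
sample_annotations = '[{"entity_group": "HCCX", "start": 0, "end": 3}]'

entities = extract_entities(sample_text, sample_annotations)
print(entities)
```

Inspecting a few rows this way is a quick check that the offsets line up with the text before training.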

HPPX, HCCX, XH, and MISC stand for brand, product, pattern, and miscellaneous information (e.g., product specification), respectively. Let’s visualize one of the examples, which is about *online game top-up services*.

.. code:: python

   visualize_ner(train_data["text_snippet"].iloc[0], train_data["entity_annotations"].iloc[0])

.. parsed-literal::
   :class: output

   雄争霸点卡 HCCX /七雄争霸 MISC 元宝 HCCX /七雄争霸 MISC 100元 MISC 1000 MISC 元宝 HCCX 直充,自动充值

Training
--------

With AutoMM, the workflow for Chinese entity recognition is the same as for English entity recognition. All you need to do is select a suitable foundation model checkpoint that is pretrained on Chinese or multilingual documents. Here we use the ``'hfl/chinese-lert-small'`` backbone for demonstration purposes.

Now, let’s create a predictor for named entity recognition by setting ``problem_type`` to ``ner`` and specifying the label column. Afterwards, we call ``predictor.fit()`` to train the model for a few minutes.

.. code:: python

   from autogluon.multimodal import MultiModalPredictor
   import uuid

   label_col = "entity_annotations"
   model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner"  # You can rename it to the model path you like
   predictor = MultiModalPredictor(problem_type="ner", label=label_col, path=model_path)
   predictor.fit(
       train_data=train_data,
       hyperparameters={'model.ner_text.checkpoint_name': 'hfl/chinese-lert-small'},
       time_limit=300,  # seconds
   )

.. parsed-literal::
   :class: output

   Global seed set to 123
   AutoMM starts to create your model. ✨
   - Model will be saved to "/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner".
   - Validation metric is "ner_token_f1".
   - To track the learning progress, you can open a terminal and launch Tensorboard:
       ```shell
       # Assume you have installed tensorboard
       tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner
       ```

   Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

   Using 16bit None Automatic Mixed Precision (AMP)
   GPU available: True (cuda), used: True
   TPU available: False, using: 0 TPU cores
   IPU available: False, using: 0 IPUs
   HPU available: False, using: 0 HPUs
   LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

     | Name              | Type              | Params
   --------------------------------------------------------
   0 | model             | HFAutoModelForNER | 15.1 M
   1 | validation_metric | F1Score           | 0
   2 | loss_func         | CrossEntropyLoss  | 0
   --------------------------------------------------------
   15.1 M    Trainable params
   0         Non-trainable params
   15.1 M    Total params
   30.173    Total estimated model params size (MB)
   Epoch 0, global step 21: 'val_ner_token_f1' reached 0.22717 (best 0.22717), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=0-step=21.ckpt' as top 3
   Epoch 0, global step 42: 'val_ner_token_f1' reached 0.64928 (best 0.64928), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=0-step=42.ckpt' as top 3
   Epoch 1, global step 64: 'val_ner_token_f1' reached 0.73101 (best 0.73101), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=1-step=64.ckpt' as top 3
   Epoch 1, global step 85: 'val_ner_token_f1' reached 0.75396 (best 0.75396), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=1-step=85.ckpt' as top 3
   Epoch 2, global step 107: 'val_ner_token_f1' reached 0.77042 (best 0.77042), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=2-step=107.ckpt' as top 3
   Epoch 2, global step 128: 'val_ner_token_f1' reached 0.79051 (best 0.79051), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=2-step=128.ckpt' as top 3
   Epoch 3, global step 150: 'val_ner_token_f1' reached 0.80004 (best 0.80004), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=3-step=150.ckpt' as top 3
   Epoch 3, global step 171: 'val_ner_token_f1' reached 0.80981 (best 0.80981), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=3-step=171.ckpt' as top 3
   Epoch 4, global step 193: 'val_ner_token_f1' reached 0.81573 (best 0.81573), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=4-step=193.ckpt' as top 3
   Time limit reached. Elapsed time is 0:05:00. Signaling Trainer to stop.
   Epoch 4, global step 209: 'val_ner_token_f1' reached 0.82137 (best 0.82137), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/ae278224ee8b4ed78c54dd113d77baf9-automm_ner/epoch=4-step=209.ckpt' as top 3
   Start to fuse 3 checkpoints via the greedy soup algorithm.

Evaluation
----------

To check the model's performance on held-out data (here, the development set), all you need to do is call ``predictor.evaluate(...)``.

.. code:: python

   predictor.evaluate(dev_data)

.. parsed-literal::
   :class: output

   {'hccx': {'precision': 0.7821270310192023,
     'recall': 0.8302626421011368,
     'f1': 0.8054763262977752,
     'number': 2551},
    'hppx': {'precision': 0.5245398773006135,
     'recall': 0.6151079136690647,
     'f1': 0.5662251655629138,
     'number': 278},
    'misc': {'precision': 0.6073298429319371,
     'recall': 0.6904761904761905,
     'f1': 0.6462395543175488,
     'number': 504},
    'xh': {'precision': 0.6325757575757576,
     'recall': 0.7016806722689075,
     'f1': 0.6653386454183267,
     'number': 238},
    'overall_precision': 0.7243606303280806,
    'overall_recall': 0.7852142257070849,
    'overall_f1': 0.7535608707336737,
    'overall_accuracy': 0.8673729105042306}

Prediction and Visualization
----------------------------

You can obtain predictions for input text by calling ``predictor.predict(...)``.

.. code:: python

   output = predictor.predict(dev_data)
   visualize_ner(dev_data["text_snippet"].iloc[0], output[0])

.. parsed-literal::
   :class: output

   家用防尘厨房厨师帽子 HCCX 车间工厂鸭 HCCX 舌工作帽 HCCX 男女食堂餐厅食品 HCCX 卫生帽 HCCX

Now, let’s make predictions on the rabbit toy example.

.. code:: python

   sentence = "2023年兔年挂件新年装饰品小挂饰乔迁之喜门挂小兔子"
   predictions = predictor.predict({'text_snippet': [sentence]})
   visualize_ner(sentence, predictions[0])

.. parsed-literal::
   :class: output

   2023年兔年挂件 HCCX 新年装饰品 HCCX 小挂饰 HCCX 乔 HPPX 迁 MISC 之 HPPX 喜 HCCX 门挂 HCCX 小兔子 HCCX

Other Examples
--------------

You may go to AutoMM Examples to explore other AutoMM examples.

Customization
-------------

To learn how to customize AutoMM, please refer to :ref:`sec_automm_customization`.