Named Entity Recognition with AutoMM - Quick Start
==================================================

Named entity recognition (NER) refers to identifying and categorizing key information (entities) in unstructured text. An entity can be a word or a series of words that corresponds to a category such as cities, time expressions, monetary values, facilities, persons, organizations, etc. An NER model usually takes an unannotated block of text as input and outputs an annotated block of text that highlights the named entities with predefined categories. For example, given the following sentence,

-  Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists.

the model will tell you that “Albert Einstein” is a PERSON and “Germany” is a LOCATION.

In the following, we introduce how to use AutoMM for the NER task, including how to prepare your data, how to train your model, and what you can expect from the model outputs.

Prepare Your Data
-----------------

Like other tasks in AutoMM, all you need to do is prepare your data as data tables (i.e., dataframes) that contain a text column and an annotation column. The text column stores the raw text containing the entities you want to identify. Correspondingly, the annotation column stores the label information (e.g., the *category* and the character-level *start/end* offsets) for the entities. AutoMM requires the *annotation column* to have the following JSON format (note: do not forget to call ``json.dumps()`` to convert Python objects into a JSON string before creating your dataframe).

.. code:: python

    import json
    json.dumps([
        {"entity_group": "PERSON", "start": 0, "end": 15},
        {"entity_group": "LOCATION", "start": 28, "end": 35}
    ])


.. parsed-literal::
    :class: output

    '[{"entity_group": "PERSON", "start": 0, "end": 15}, {"entity_group": "LOCATION", "start": 28, "end": 35}]'


where **entity_group** is the category of the entity, **start** is the character-level position where the entity begins, and **end** is the character-level position where the entity ends. In the example above, **end** is exclusive: characters 0 through 14 spell “Albert Einstein”. To make sure that AutoMM can recognize your JSON annotations, you must use exactly the same keys (entity_group, start, end) shown above when constructing your data. You can annotate “Albert Einstein” as a single entity group, or you can assign each word its own label.

The following example visualizes the annotations with the ``visualize_ner`` utility.

.. code:: python

    from autogluon.multimodal.utils import visualize_ner

    sentence = "Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists."
    annotation = [{"entity_group": "PERSON", "start": 0, "end": 15},
                  {"entity_group": "LOCATION", "start": 28, "end": 35}]

    visualize_ner(sentence, annotation)


.. raw:: html

    Albert Einstein PERSON was born in Germany LOCATION and is widely acknowledged to be one of the greatest physicists.


If you are already familiar with the NER task, you have probably heard about the BIO (Beginning-Inside-Outside) format. You can adopt this format (it is not compulsory) by adding a *B-* or an *I-* prefix to each tag to indicate whether the tag marks the beginning of an annotated chunk or lies inside the chunk. For example, you can annotate “Albert” as “B-PERSON” because it is the beginning of the name and “Einstein” as “I-PERSON” because it is inside the PERSON chunk. You do not need to worry about *O* tags (an *O* tag indicates that a word belongs to no chunk), as AutoMM takes care of them automatically.

Now, let’s look at an example dataset.
This dataset is converted from the MIT movies corpus, which provides annotations on entity groups such as actor, character, director, genre, song, title, trailer, year, etc.

.. code:: python

    from autogluon.core.utils.loaders import load_pd
    train_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/train_v2.csv')
    test_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/test_v2.csv')
    train_data.head(5)


.. raw:: html

    <table>
      <thead>
        <tr><th></th><th>text_snippet</th><th>entity_annotations</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>what movies star bruce willis</td><td>[{"entity_group": "ACTOR", "start": 17, "end":...</td></tr>
        <tr><th>1</th><td>show me films with drew barrymore from the 1980s</td><td>[{"entity_group": "ACTOR", "start": 19, "end":...</td></tr>
        <tr><th>2</th><td>what movies starred both al pacino and robert ...</td><td>[{"entity_group": "ACTOR", "start": 25, "end":...</td></tr>
        <tr><th>3</th><td>find me all of the movies that starred harold ...</td><td>[{"entity_group": "ACTOR", "start": 39, "end":...</td></tr>
        <tr><th>4</th><td>find me a movie with a quote about baseball in it</td><td>[]</td></tr>
      </tbody>
    </table>

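If your own data comes as sentences plus entity strings rather than character offsets, the annotation format described earlier can be produced with a few lines of plain Python. The following is a minimal sketch and not part of AutoMM: the helper ``make_annotation`` is hypothetical, the column names are borrowed from the dataset above, and the sketch assumes that each entity string should be matched at its first occurrence in the sentence.

.. code:: python

    import json


    def make_annotation(text, entities):
        """Return a JSON string of character-level annotations.

        ``entities`` is a list of (surface_string, entity_group) pairs.
        Only the first occurrence of each surface string is annotated.
        """
        spans = []
        for surface, group in entities:
            start = text.find(surface)
            if start == -1:
                raise ValueError(f"{surface!r} not found in text")
            # "end" is exclusive, so text[start:end] == surface.
            spans.append({"entity_group": group, "start": start, "end": start + len(surface)})
        return json.dumps(spans)


    example_sentence = "Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists."
    example_annotation = make_annotation(
        example_sentence,
        [("Albert Einstein", "PERSON"), ("Germany", "LOCATION")],
    )

    # One record per sentence; the keys mirror the columns of the dataset above.
    rows = [{"text_snippet": example_sentence, "entity_annotations": example_annotation}]
    print(rows[0]["entity_annotations"])

The resulting list of records can be turned into a dataframe with ``pandas.DataFrame(rows)``. Sentences without entities simply get an empty list, which serializes to ``[]`` as in the last row of the table above.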
Let’s print a row.

.. code:: python

    print(f"text_snippet: {train_data['text_snippet'][1]}")
    print(f"entity_annotations: {train_data['entity_annotations'][1]}")
    visualize_ner(train_data['text_snippet'][1], train_data['entity_annotations'][1])


.. parsed-literal::
    :class: output

    text_snippet: show me films with drew barrymore from the 1980s
    entity_annotations: [{"entity_group": "ACTOR", "start": 19, "end": 33}, {"entity_group": "YEAR", "start": 43, "end": 48}]


.. raw:: html

    show me films with drew barrymore ACTOR from the 1980s YEAR


Training
--------

Now, let’s create a predictor for named entity recognition by setting the *problem_type* to **ner** and specifying the label column. Afterwards, we call ``predictor.fit()`` to train the model for five minutes. To achieve reasonable performance in your applications, we recommend setting a sufficiently long ``time_limit`` (e.g., 30 or 60 minutes). You can also specify your backbone model and other hyperparameters using the ``hyperparameters`` argument. Here, we save the model to a uniquely named directory whose name ends in ``automm_ner``. For demonstration purposes, we use the lightweight ``'google/electra-small-discriminator'`` backbone.

.. code:: python

    from autogluon.multimodal import MultiModalPredictor
    import uuid

    label_col = "entity_annotations"
    model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner"  # You can rename it to the model path you like
    predictor = MultiModalPredictor(problem_type="ner", label=label_col, path=model_path)
    predictor.fit(
        train_data=train_data,
        hyperparameters={'model.ner_text.checkpoint_name': 'google/electra-small-discriminator'},
        time_limit=300,  # in seconds
    )


.. parsed-literal::
    :class: output

    Global seed set to 123
    AutoMM starts to create your model. ✨

    - Model will be saved to "/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner".

    - Validation metric is "ner_token_f1".

    - To track the learning progress, you can open a terminal and launch Tensorboard:
        ```shell
        # Assume you have installed tensorboard
        tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner
        ```

    Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

    Using 16bit None Automatic Mixed Precision (AMP)
    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name              | Type              | Params
    --------------------------------------------------------
    0 | model             | HFAutoModelForNER | 13.5 M
    1 | validation_metric | F1Score           | 0
    2 | loss_func         | CrossEntropyLoss  | 0
    --------------------------------------------------------
    13.5 M    Trainable params
    0         Non-trainable params
    13.5 M    Total params
    26.979    Total estimated model params size (MB)
    Epoch 0, global step 34: 'val_ner_token_f1' reached 0.10718 (best 0.10718), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner/epoch=0-step=34.ckpt' as top 3
    Epoch 0, global step 69: 'val_ner_token_f1' reached 0.68034 (best 0.68034), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner/epoch=0-step=69.ckpt' as top 3
    Epoch 1, global step 103: 'val_ner_token_f1' reached 0.81298 (best 0.81298), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner/epoch=1-step=103.ckpt' as top 3
    Epoch 1, global step 138: 'val_ner_token_f1' reached 0.83926 (best 0.83926), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner/epoch=1-step=138.ckpt' as top 3
    Epoch 2, global step 172: 'val_ner_token_f1' reached 0.85641 (best 0.85641), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner/epoch=2-step=172.ckpt' as top 3
    Epoch 2, global step 207: 'val_ner_token_f1' reached 0.86685 (best 0.86685), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner/epoch=2-step=207.ckpt' as top 3
    Time limit reached. Elapsed time is 0:05:00. Signaling Trainer to stop.
    Epoch 3, global step 208: 'val_ner_token_f1' reached 0.86614 (best 0.86685), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner/epoch=3-step=208.ckpt' as top 3
    Start to fuse 3 checkpoints via the greedy soup algorithm.
    AutoMM has created your model 🎉🎉🎉

    - To load the model, use the code below:
        ```python
        from autogluon.multimodal import MultiModalPredictor
        predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner")
        ```

    - You can open a terminal and launch Tensorboard to visualize the training log:
        ```shell
        # Assume you have installed tensorboard
        tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner
        ```

    - If you are not satisfied with the model, try to increase the training time,
    adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
    or post issues on GitHub: https://github.com/autogluon/autogluon


Evaluation
----------

Evaluation is also straightforward. We use seqeval for NER evaluation, and the supported metrics are *overall_recall*, *overall_precision*, *overall_f1*, and *overall_accuracy*.
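To make the overall metrics concrete, the sketch below computes entity-level precision, recall, and F1 by exact span matching, which is, to our understanding, the convention seqeval follows for its default micro-averaged scores. It is an illustrative re-implementation in plain Python, not the code that AutoMM or seqeval runs; the function name ``span_prf`` and the toy predictions are made up for this example.

.. code:: python

    def span_prf(gold, pred):
        """Micro-averaged precision/recall/F1 over exactly matching entity spans.

        ``gold`` and ``pred`` are lists (one item per sentence) of annotation
        lists in the entity_group/start/end format used in this tutorial.
        """
        def as_set(sentences):
            # Tag every span with its sentence index so spans from
            # different sentences never match each other.
            return {
                (i, ent["entity_group"], ent["start"], ent["end"])
                for i, ents in enumerate(sentences)
                for ent in ents
            }

        gold_set, pred_set = as_set(gold), as_set(pred)
        correct = len(gold_set & pred_set)
        precision = correct / len(pred_set) if pred_set else 0.0
        recall = correct / len(gold_set) if gold_set else 0.0
        f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
        return {"precision": precision, "recall": recall, "f1": f1}


    # Gold annotation of one training sentence and a hypothetical prediction
    # whose YEAR span starts too early, so only the ACTOR span counts as correct.
    gold = [[{"entity_group": "ACTOR", "start": 19, "end": 33},
             {"entity_group": "YEAR", "start": 43, "end": 48}]]
    pred = [[{"entity_group": "ACTOR", "start": 19, "end": 33},
             {"entity_group": "YEAR", "start": 39, "end": 48}]]
    scores = span_prf(gold, pred)
    print(scores)

Under exact matching, a prediction with the right category but a shifted boundary counts as both a false positive and a false negative, which is why boundary errors are costly in these metrics.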
If you are interested in the performance on a specific entity group, you can use the entity group name as an evaluation metric to obtain the precision, recall, and F1 score for that group:

.. code:: python

    predictor.evaluate(test_data, metrics=['overall_recall', "overall_precision", "overall_f1", "actor"])


.. parsed-literal::
    :class: output

    {'overall_recall': 0.8462258849971905,
     'overall_precision': 0.8184782608695652,
     'overall_f1': 0.8321208214384381,
     'actor': {'precision': 0.8195819581958196,
      'recall': 0.9174876847290641,
      'f1': 0.8657757117954678,
      'number': 812}}


Prediction + Visualization
--------------------------

You can easily obtain the predictions for an input sentence by calling ``predictor.predict()``. If you are running the code in a Jupyter notebook, you can also visualize the predictions using the ``visualize_ner`` function, which highlights the named entities and their labels in the text.

.. code:: python

    from autogluon.multimodal.utils import visualize_ner

    sentence = "Game of Thrones is an American fantasy drama television series created by David Benioff"
    predictions = predictor.predict({'text_snippet': [sentence]})
    print('Predicted entities:', predictions[0])

    # Visualize
    visualize_ner(sentence, predictions[0])


.. parsed-literal::
    :class: output

    Predicted entities: [{'entity_group': 'TITLE', 'start': 0, 'end': 15}, {'entity_group': 'GENRE', 'start': 22, 'end': 30}, {'entity_group': 'GENRE', 'start': 31, 'end': 44}, {'entity_group': 'DIRECTOR', 'start': 74, 'end': 87}]


.. raw:: html

    Game of Thrones TITLE is an American GENRE fantasy drama GENRE television series created by David Benioff DIRECTOR


Prediction Probabilities
------------------------

You can also output the probabilities for a deep dive into the predictions.

.. code:: python

    predictions = predictor.predict_proba({'text_snippet': [sentence]})
    print(predictions[0][0]['probability'])


.. parsed-literal::
    :class: output

    {'O': 0.1278, 'B-ACTOR': 0.00119, 'B-CHARACTER': 0.004654, 'B-SONG': 0.01459, 'I-YEAR': 0.0004828, 'B-REVIEW': 0.001122, 'B-RATINGS_AVERAGE': 0.0005517, 'I-GENRE': 0.001364, 'I-ACTOR': 0.0003242, 'B-GENRE': 0.04572, 'I-RATINGS_AVERAGE': 0.001418, 'B-PLOT': 0.1918, 'I-RATING': 0.0002159, 'I-PLOT': 0.001692, 'B-TITLE': 0.5815, 'I-REVIEW': 0.0005555, 'B-DIRECTOR': 0.00111, 'I-DIRECTOR': 0.001166, 'I-TITLE': 0.00989, 'B-TRAILER': 0.00419, 'I-SONG': 0.001418, 'I-CHARACTER': 0.005314, 'I-TRAILER': 0.0002644, 'B-RATING': 0.000916, 'B-YEAR': 0.0004828}


Reloading and Continuous Training
---------------------------------

The trained predictor is automatically saved, and you can easily reload it using the path. If you are not satisfied with the current model performance, you can continue training the loaded model with new data.

.. code:: python

    new_predictor = MultiModalPredictor.load(model_path)
    new_model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner_continue_train"
    new_predictor.fit(train_data, time_limit=60, save_path=new_model_path)
    test_score = new_predictor.evaluate(test_data, metrics=['overall_f1', 'ACTOR'])
    print(test_score)


.. parsed-literal::
    :class: output

    Load pretrained checkpoint: /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/d493502bfd424df69392cc6ee3751572-automm_ner/model.ckpt
    Global seed set to 123
    AutoMM starts to create your model. ✨

    - Model will be saved to "/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/24f600fb7de24ff0ae67a0c60a04e5bf-automm_ner_continue_train".

    - Validation metric is "ner_token_f1".

    - To track the learning progress, you can open a terminal and launch Tensorboard:
        ```shell
        # Assume you have installed tensorboard
        tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/24f600fb7de24ff0ae67a0c60a04e5bf-automm_ner_continue_train
        ```

    Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

    Using 16bit None Automatic Mixed Precision (AMP)
    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name              | Type              | Params
    --------------------------------------------------------
    0 | model             | HFAutoModelForNER | 13.5 M
    1 | validation_metric | F1Score           | 0
    2 | loss_func         | CrossEntropyLoss  | 0
    --------------------------------------------------------
    13.5 M    Trainable params
    0         Non-trainable params
    13.5 M    Total params
    26.979    Total estimated model params size (MB)
    Epoch 0, global step 34: 'val_ner_token_f1' reached 0.87022 (best 0.87022), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/24f600fb7de24ff0ae67a0c60a04e5bf-automm_ner_continue_train/epoch=0-step=34.ckpt' as top 3
    Time limit reached. Elapsed time is 0:01:00. Signaling Trainer to stop.
    Epoch 0, global step 42: 'val_ner_token_f1' reached 0.87246 (best 0.87246), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/24f600fb7de24ff0ae67a0c60a04e5bf-automm_ner_continue_train/epoch=0-step=42.ckpt' as top 3
    Start to fuse 2 checkpoints via the greedy soup algorithm.
    AutoMM has created your model 🎉🎉🎉

    - To load the model, use the code below:
        ```python
        from autogluon.multimodal import MultiModalPredictor
        predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/24f600fb7de24ff0ae67a0c60a04e5bf-automm_ner_continue_train")
        ```

    - You can open a terminal and launch Tensorboard to visualize the training log:
        ```shell
        # Assume you have installed tensorboard
        tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/24f600fb7de24ff0ae67a0c60a04e5bf-automm_ner_continue_train
        ```

    - If you are not satisfied with the model, try to increase the training time,
    adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
    or post issues on GitHub: https://github.com/autogluon/autogluon


.. parsed-literal::
    :class: output

    {'overall_f1': 0.834435869165748, 'ACTOR': {'precision': 0.814773980154355, 'recall': 0.9100985221674877, 'f1': 0.8598022105875508, 'number': 812}}


Other Examples
--------------

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization
-------------

To learn how to customize AutoMM, please refer to :ref:`sec_automm_customization`.