Named Entity Recognition with AutoMM - Quick Start

Named entity recognition (NER) refers to identifying and categorizing key information (entities) in unstructured text. An entity can be a word or a series of words that corresponds to a category such as a city, time expression, monetary value, facility, person, or organization. An NER model usually takes an unannotated block of text as input and outputs an annotated block of text that highlights the named entities with predefined categories. For example, given the following sentence,

  • Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists.

The model will tell you that “Albert Einstein” is a PERSON and “Germany” is a LOCATION. In the following, we will introduce how to use AutoMM for the NER task, including how to prepare your data, how to train your model, and what you can expect from the model outputs.

Prepare Your Data

As with other tasks in AutoMM, all you need to do is prepare your data as data tables (i.e., dataframes) containing a text column and an annotation column. The text column stores the raw text that contains the entities you want to identify. Correspondingly, the annotation column stores the label information (e.g., the category and the character-level start/end offsets) for those entities. AutoMM requires the annotation column to use the following JSON format (note: do not forget to call json.dumps() to convert Python objects into a JSON string before creating your dataframe).

  • [{"entity_group": "PERSON", "start": 0, "end": 15}, {"entity_group": "LOCATION", "start": 28, "end": 35}]

where entity_group is the category of the entity, start is the character position at which the entity begins, and end is the character position at which it ends. To make sure that AutoMM can recognize your JSON annotations, you must use exactly the keys/properties (entity_group, start, end) specified above when constructing your data. You can annotate “Albert Einstein” as a single entity group, or you can assign each word its own label.
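
For concreteness, here is a minimal sketch of how such a dataframe can be constructed for the example sentence above. The column names text_snippet and entity_annotations are assumptions chosen to match the example dataset used later; any names work as long as you point the predictor's label argument at the annotation column.

import json
import pandas as pd

sentence = "Albert Einstein was born in Germany and is widely acknowledged to be one of the greatest physicists."
annotation = [
    {"entity_group": "PERSON", "start": 0, "end": 15},     # "Albert Einstein"
    {"entity_group": "LOCATION", "start": 28, "end": 35},  # "Germany"
]
# json.dumps() converts the Python list of dicts into the JSON string AutoMM expects.
train_df = pd.DataFrame({
    "text_snippet": [sentence],
    "entity_annotations": [json.dumps(annotation)],
})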

If you are already familiar with the NER task, you have probably heard of the BIO (Beginning-Inside-Outside) format. You can adopt this format (it is not compulsory) by adding a B- prefix or an I- prefix to each tag to indicate whether the tag marks the beginning of an annotated chunk or lies inside it. For example, you can annotate “Albert” as “B-PERSON” because it is the beginning of the name, and “Einstein” as “I-PERSON” because it is inside the PERSON chunk. You do not need to worry about the O tag, which indicates that a word belongs to no chunk; AutoMM takes care of that automatically. A word-level BIO annotation of the example sentence is sketched below.
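
For illustration, a word-level BIO annotation of the same sentence could look like the following sketch (the character offsets are computed the same way as above):

bio_annotation = [
    {"entity_group": "B-PERSON", "start": 0, "end": 6},      # "Albert" begins the PERSON chunk
    {"entity_group": "I-PERSON", "start": 7, "end": 15},     # "Einstein" is inside the PERSON chunk
    {"entity_group": "B-LOCATION", "start": 28, "end": 35},  # "Germany" is a single-word chunk
]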

Now, let’s look at an example dataset. This dataset is converted from the MIT movies corpus, which provides annotations on entity groups such as actor, character, director, genre, song, title, trailer, year, etc.

from autogluon.core.utils.loaders import load_pd
train_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/train.csv')
test_data = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/ner/mit-movies/test.csv')
train_data.head(5)
text_snippet entity_annotations
0 what movies star bruce willis [{"entity_group": "B-ACTOR", "start": 17, "end...
1 show me films with drew barrymore from the 1980s [{"entity_group": "B-ACTOR", "start": 19, "end...
2 what movies starred both al pacino and robert ... [{"entity_group": "B-ACTOR", "start": 25, "end...
3 find me all of the movies that starred harold ... [{"entity_group": "B-ACTOR", "start": 39, "end...
4 find me a movie with a quote about baseball in it []

Let’s print the first row.

print(f"text_snippet: {train_data['text_snippet'][0]}")
print(f"entity_annotations: {train_data['entity_annotations'][0]}")
text_snippet: what movies star bruce willis
entity_annotations: [{"entity_group": "B-ACTOR", "start": 17, "end": 22}, {"entity_group": "I-ACTOR", "start": 23, "end": 29}]
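
Because the annotation column stores JSON strings, you can decode a row with json.loads and use the character-level offsets to recover each entity's text. A minimal sketch for the first row:

import json

text = train_data['text_snippet'][0]
for ann in json.loads(train_data['entity_annotations'][0]):
    # Slice the raw text with the start/end offsets to recover the entity words.
    print(ann['entity_group'], '->', text[ann['start']:ann['end']])
# B-ACTOR -> bruce
# I-ACTOR -> willis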

Training

Now, let’s create a predictor for named entity recognition by setting the problem_type to ner and specifying the label column. Then we call predictor.fit() to train the model for five minutes. To achieve reasonable performance in your applications, we recommend setting a longer time_limit (e.g., 30-60 minutes). You can also specify your backbone model and other hyperparameters with the hyperparameters argument (see the sketch after the training output below). Here, we save the model under a path ending in automm_ner.

from autogluon.multimodal import MultiModalPredictor
import uuid

label_col = "entity_annotations"
model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner"
predictor = MultiModalPredictor(problem_type="ner", label=label_col, path=model_path)
predictor.fit(
    train_data=train_data,
    time_limit=300,  # seconds
)
Global seed set to 123
Auto select gpus: [0]
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type              | Params
--------------------------------------------------------
0 | model             | HFAutoModelForNER | 108 M
1 | validation_metric | Accuracy          | 0
2 | loss_func         | CrossEntropyLoss  | 0
--------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
216.661   Total estimated model params size (MB)
Epoch 0, global step 34: 'val_overall_accuracy' reached 0.76942 (best 0.76942), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/3bece11056714b4d801784a0d672ceca-automm_ner/epoch=0-step=34.ckpt' as top 3
Time limit reached. Elapsed time is 0:05:00. Signaling Trainer to stop.
Epoch 0, global step 62: 'val_overall_accuracy' reached 0.91413 (best 0.91413), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/3bece11056714b4d801784a0d672ceca-automm_ner/epoch=0-step=62.ckpt' as top 3
Downloading builder script:   0%|          | 0.00/6.34k [00:00<?, ?B/s]
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7fe06c5d5730>
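
As mentioned above, you can also customize the backbone and other settings through the hyperparameters argument of fit(). The snippet below is a minimal sketch assuming the model.ner_text.checkpoint_name key; refer to Customize AutoMM for the full list of supported options.

predictor_custom = MultiModalPredictor(problem_type="ner", label=label_col)
predictor_custom.fit(
    train_data=train_data,
    # Assumed key for selecting the Hugging Face checkpoint used by the NER model.
    hyperparameters={"model.ner_text.checkpoint_name": "bert-base-cased"},
    time_limit=300,  # seconds
)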

Evaluation

Evaluation is also straightforward. We use seqeval for NER evaluation, and the supported metrics are overall_recall, overall_precision, overall_f1, and overall_accuracy. If you are interested in the performance on a specific entity group, you can pass the entity group name as an evaluation metric to obtain its precision, recall, and f1:

predictor.evaluate(test_data,  metrics=['overall_recall', "overall_precision", "overall_f1", "actor"])
{'overall_recall': 0.8126990073047388,
 'overall_precision': 0.7802553497572379,
 'overall_f1': 0.7961467889908257,
 'actor': {'precision': 0.7293762575452716,
  'recall': 0.8928571428571429,
  'f1': 0.8028792912513844,
  'number': 812}}

Prediction

You can easily obtain predictions for an input sentence by calling predictor.predict().

sentence = "Game of Thrones is an American fantasy drama television series created by David Benioff"
predictions = predictor.predict({'text_snippet': [sentence]})
print('Predicted entities:', predictions[0])

for entity in predictions[0]:
    print(f"Word '{sentence[entity['start']:entity['end']]}' belongs to group: {entity['entity_group']}")
Predicted entities: [{'entity_group': 'B-TITLE', 'start': 0, 'end': 4}, {'entity_group': 'I-TITLE', 'start': 5, 'end': 7}, {'entity_group': 'I-TITLE', 'start': 8, 'end': 15}, {'entity_group': 'B-GENRE', 'start': 31, 'end': 38}, {'entity_group': 'I-GENRE', 'start': 39, 'end': 44}, {'entity_group': 'I-DIRECTOR', 'start': 80, 'end': 87}]
Word 'Game' belongs to group: B-TITLE
Word 'of' belongs to group: I-TITLE
Word 'Thrones' belongs to group: I-TITLE
Word 'fantasy' belongs to group: B-GENRE
Word 'drama' belongs to group: I-GENRE
Word 'Benioff' belongs to group: I-DIRECTOR

Reloading and Continuous Training

The trained predictor is automatically saved, and you can easily reload it from its path. If you are not satisfied with the current model performance, you can continue training the loaded model with new data.

new_predictor = MultiModalPredictor.load(model_path)
new_model_path = f"./tmp/{uuid.uuid4().hex}-automm_ner_continue_train"
new_predictor.fit(train_data, time_limit=60, save_path=new_model_path)
test_score = new_predictor.evaluate(test_data, metrics=['overall_f1', 'ACTOR'])
print(test_score)
Global seed set to 123
Auto select gpus: [0]
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type              | Params
--------------------------------------------------------
0 | model             | HFAutoModelForNER | 108 M
1 | validation_metric | Accuracy          | 0
2 | loss_func         | CrossEntropyLoss  | 0
--------------------------------------------------------
108 M     Trainable params
0         Non-trainable params
108 M     Total params
216.661   Total estimated model params size (MB)
Time limit reached. Elapsed time is 0:01:00. Signaling Trainer to stop.
Epoch 0, global step 13: 'val_overall_accuracy' reached 0.91504 (best 0.91504), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/tmp/f96b1d7bba114f82a8db5222544e1779-automm_ner_continue_train/epoch=0-step=13.ckpt' as top 3
{'overall_f1': 0.7912087912087913, 'ACTOR': {'precision': 0.7528916929547844, 'recall': 0.8817733990147784, 'f1': 0.8122518434486671, 'number': 812}}

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.