.. _sec_textprediction_beginner:

Text Prediction - Quick Start
=============================

Here we introduce the ``TextPrediction`` task, which helps you
automatically train and deploy models for various Natural Language
Processing (NLP) problems. This tutorial presents two examples
demonstrating how ``TextPrediction`` can be used for different NLP
tasks:

-  `Sentiment Analysis`_
-  `Sentence Similarity`_

The general usage is similar to AutoGluon's ``TabularPrediction``
module. We treat NLP datasets as tables where certain columns contain
text fields and a special column contains the labels to predict. Here,
the labels can be discrete categories (classification) or numerical
values (regression). ``TextPrediction`` fits neural networks to your
data via transfer learning from pretrained NLP models such as
`BERT <https://arxiv.org/abs/1810.04805>`__,
`ALBERT <https://arxiv.org/abs/1909.11942>`__, and
`ELECTRA <https://arxiv.org/abs/2003.10555>`__. ``TextPrediction`` also
trains multiple models with different hyperparameters and returns the
best model, a process called Hyperparameter Optimization (HPO).

.. code:: python

    import numpy as np
    import warnings
    warnings.filterwarnings('ignore')
    np.random.seed(123)

Sentiment Analysis
------------------

First, we consider the Stanford Sentiment Treebank (SST) dataset, which
consists of movie reviews and their associated sentiment. Given a new
movie review, the goal is to predict the sentiment reflected in the
text (in this case a **binary classification** problem, where reviews
are labeled 1 if they convey a positive opinion and 0 otherwise). Let's
first load the data and view some examples, noting that the labels are
stored in a column called **label**.

.. code:: python

    from autogluon.utils.tabular.utils.loaders.load_pd import load

    train_data = load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet')
    dev_data = load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet')
    rand_idx = np.random.permutation(np.arange(len(train_data)))[:2000]
    train_data = train_data.iloc[rand_idx]
    train_data.head(10)

.. parsed-literal::
    :class: output

    Loaded data from: https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/train.parquet | Columns = 2 / 2 | Rows = 67349 -> 67349
    Loaded data from: https://autogluon-text.s3-accelerate.amazonaws.com/glue/sst/dev.parquet | Columns = 2 / 2 | Rows = 872 -> 872

.. parsed-literal::
    :class: output
                                                    sentence  label
    2434                                     goes by quickly      1
    27796                  reading lines from a teleprompter      0
    249    degraded , handheld blair witch video-cam foot...      0
    12115  reminds us how realistically nuanced a robert ...      1
    50834    indulges in the worst elements of all of them .      0
    43622  are nowhere near as vivid as the 19th-century ...      0
    3955   throughout a film that is both gripping and co...      1
    51011                         to see over and over again      1
    31232  that fails to match the freshness of the actre...      0
    32153  this is an undeniably intriguing film from an ...      1
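The displayed rows illustrate the tabular layout described above: a text
column (**sentence**) plus a label column holding discrete categories.
Since the loaded data is an ordinary pandas DataFrame, you can inspect
it with standard pandas operations before training. A minimal sketch
(the exact numbers depend on the random 2000-example subsample drawn
above):

.. code:: python

    # train_data is a pandas DataFrame, so the usual pandas tools apply.
    print(train_data['label'].value_counts())           # counts of positive (1) vs. negative (0) reviews
    print(train_data['sentence'].str.len().describe())  # character-length statistics of the text column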
Above, the data happen to be stored in the
`Parquet <https://parquet.apache.org/>`__ table format, but you can also
directly ``load()`` data from a CSV file instead. While here we load
files from `AWS S3 cloud storage <https://aws.amazon.com/s3/>`__, these
could instead be local files on your machine. After loading,
``train_data`` is simply a pandas DataFrame, where each row represents a
different training example (for machine learning to be appropriate, the
rows should be independent and identically distributed).

To ensure this tutorial runs quickly, we simply call ``fit()`` with a
subset of 2000 training examples and limit its runtime to approximately
1 minute. To achieve reasonable performance in your applications, you
should set a much longer ``time_limits`` (e.g., 1 hour), or not specify
``time_limits`` at all.

.. code:: python

    from autogluon import TextPrediction as task

    predictor = task.fit(train_data, label='label',
                         time_limits=60,
                         ngpus_per_trial=1,
                         seed=123,
                         output_directory='./ag_sst')

.. parsed-literal::
    :class: output

    2020-12-08 20:17:00,613 - root - INFO - All Logs will be saved to ./ag_sst/ag_text_prediction.log
    2020-12-08 20:17:00,623 - root - INFO - Train Dataset:
    2020-12-08 20:17:00,624 - root - INFO - Columns:
    - Text( name="sentence" #total/missing=1600/0 length, min/avg/max=4/51.93/259 )
    - Categorical( name="label" #total/missing=1600/0 num_class (total/non_special)=2/2 categories=[0, 1] freq=[714, 886] )
    2020-12-08 20:17:00,624 - root - INFO - Tuning Dataset:
    2020-12-08 20:17:00,625 - root - INFO - Columns:
    - Text( name="sentence" #total/missing=400/0 length, min/avg/max=4/55.12/218 )
    - Categorical( name="label" #total/missing=400/0 num_class (total/non_special)=2/2 categories=[0, 1] freq=[167, 233] )
    2020-12-08 20:17:00,625 - root - INFO - Label columns=['label'], Feature columns=['sentence'], Problem types=['classification'], Label shapes=[2]
    2020-12-08 20:17:00,626 - root - INFO - Eval Metric=acc, Stop Metric=acc, Log Metrics=['f1', 'mcc', 'auc', 'acc', 'nll']
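At this point the trained sentiment model can be checked against the
held-out ``dev_data`` and applied to new text. The snippet below is a
minimal sketch, assuming the returned predictor exposes ``evaluate()``,
``predict()``, and ``predict_proba()`` as AutoGluon predictors generally
do; the exact arguments and metric names may differ across AutoGluon
versions, and the example reviews are made up for illustration.

.. code:: python

    import pandas as pd

    # Accuracy on the held-out development set (assumes `evaluate` accepts a
    # metrics argument, as in other AutoGluon tasks).
    dev_acc = predictor.evaluate(dev_data, metrics='acc')
    print('Accuracy on dev data:', dev_acc)

    # Predicted labels and class probabilities for two made-up reviews; the
    # column name must match the text column used during training.
    new_reviews = pd.DataFrame({'sentence': [
        'a gripping and beautifully acted film .',
        'the plot goes nowhere and the acting is flat .',
    ]})
    print(predictor.predict(new_reviews))
    print(predictor.predict_proba(new_reviews))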
Sentence Similarity
-------------------

Next, let's use ``TextPrediction`` to train a model that predicts how
semantically similar two sentences are. We use the Semantic Textual
Similarity Benchmark (STS) dataset for illustration.

.. code:: python

    train_data = load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/train.parquet')[['sentence1', 'sentence2', 'score']]
    dev_data = load('https://autogluon-text.s3-accelerate.amazonaws.com/glue/sts/dev.parquet')[['sentence1', 'sentence2', 'score']]
    train_data.head(10)

.. parsed-literal::
    :class: output

       sentence1                                      sentence2                                           score
    0  A plane is taking off.                         An air plane is taking off.                          5.00
    1  A man is playing a large flute.                A man is playing a flute.                            3.80
    2  A man is spreading shreded cheese on a pizza.  A man is spreading shredded cheese on an uncoo...    3.80
    3  Three men are playing chess.                   Two men are playing chess.                           2.60
    4  A man is playing the cello.                    A man seated is playing the cello.                   4.25
    5  Some men are fighting.                         Two men are fighting.                                4.25
    6  A man is smoking.                              A man is skating.                                    0.50
    7  The man is playing the piano.                  The man is playing the guitar.                       1.60
    8  A man is playing on a guitar and singing.      A woman is playing an acoustic guitar and sing...    2.20
    9  A person is throwing a cat on to the ceiling.  A person throws a cat on the ceiling.                5.00
In this data, the **score** column contains numerical values (which
we'd like to predict) that are human-annotated similarity scores for
each given pair of sentences.

.. code:: python

    print('Min score=', min(train_data['score']), ', Max score=', max(train_data['score']))

.. parsed-literal::
    :class: output

    Min score= 0.0 , Max score= 5.0

Let's train a regression model to predict these scores with
``task.fit()``. Note that we only need to specify the label column;
AutoGluon automatically determines the type of prediction problem and
an appropriate loss function. Once again, you should increase the short
``time_limits`` below to obtain reasonable performance in your own
applications.

.. code:: python

    predictor_sts = task.fit(train_data, label='score',
                             time_limits='1min',
                             ngpus_per_trial=1,
                             seed=123,
                             output_directory='./ag_sts')

.. parsed-literal::
    :class: output

    2020-12-08 20:19:01,791 - root - INFO - All Logs will be saved to ./ag_sts/ag_text_prediction.log
    2020-12-08 20:19:01,810 - root - INFO - Train Dataset:
    2020-12-08 20:19:01,810 - root - INFO - Columns:
    - Text( name="sentence1" #total/missing=4599/0 length, min/avg/max=16/57.68/367 )
    - Text( name="sentence2" #total/missing=4599/0 length, min/avg/max=15/57.47/265 )
    - Numerical( name="score" #total/missing=4599/0 shape=() )
    2020-12-08 20:19:01,811 - root - INFO - Tuning Dataset:
    2020-12-08 20:19:01,811 - root - INFO - Columns:
    - Text( name="sentence1" #total/missing=1150/0 length, min/avg/max=17/57.81/315 )
    - Text( name="sentence2" #total/missing=1150/0 length, min/avg/max=16/57.79/311 )
    - Numerical( name="score" #total/missing=1150/0 shape=() )
    2020-12-08 20:19:01,811 - root - INFO - Label columns=['score'], Feature columns=['sentence1', 'sentence2'], Problem types=['regression'], Label shapes=[()]
    2020-12-08 20:19:01,812 - root - INFO - Eval Metric=mse, Stop Metric=mse, Log Metrics=['mse', 'rmse', 'mae']

``TextPrediction`` relies on the
`GluonNLP <https://github.com/dmlc/gluon-nlp>`__ package. Due to an
ongoing upgrade of GluonNLP, we are currently using a custom version of
the package, ``autogluon-contrib-nlp``. In a future release, AutoGluon
will switch to using the official GluonNLP, but the APIs demonstrated
here will remain the same.
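Finally, the regression predictor can be evaluated and applied to new
sentence pairs in the same way as the sentiment model. The following is
a minimal sketch under the same assumptions about the
``evaluate()``/``predict()`` interface; the sentence pairs are made up
for illustration, and ``'rmse'`` is one of the metrics listed in the
training log above.

.. code:: python

    import pandas as pd

    # Root-mean-squared error on the held-out development set
    # (assumed interface, mirroring the classification example).
    dev_rmse = predictor_sts.evaluate(dev_data, metrics='rmse')
    print('RMSE on dev data:', dev_rmse)

    # Predict similarity scores for new sentence pairs; the column names must
    # match the feature columns ('sentence1', 'sentence2') used during training.
    new_pairs = pd.DataFrame({
        'sentence1': ['A man is playing a guitar.', 'The cat sits on the mat.'],
        'sentence2': ['A man is playing an instrument.', 'A dog runs through a field.'],
    })
    print(predictor_sts.predict(new_pairs))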