.. _sec_textprediction_multimodal:

Text Prediction - Multimodal Table with Text
============================================

In many applications, text data is mixed with numeric and categorical data.
AutoGluon's ``TextPredictor`` can train a single neural network that jointly
operates on multiple feature types, including text, categorical, and numerical
columns. The general idea is to embed the text, categorical, and numeric fields
separately and then fuse these features across modalities. This tutorial
demonstrates such an application.

.. code:: python

    import numpy as np
    import pandas as pd
    import warnings
    import os
    warnings.filterwarnings('ignore')
    np.random.seed(123)

.. code:: python

    !python3 -m pip install openpyxl

.. parsed-literal::
    :class: output

    Collecting openpyxl
      Using cached openpyxl-3.0.10-py2.py3-none-any.whl (242 kB)
    Collecting et-xmlfile
      Using cached et_xmlfile-1.1.0-py3-none-any.whl (4.7 kB)
    Installing collected packages: et-xmlfile, openpyxl
    Successfully installed et-xmlfile-1.1.0 openpyxl-3.0.10

Book Price Prediction Data
--------------------------

For demonstration, we use the book price prediction dataset from the
MachineHack Predict the Price of Books hackathon. Our goal is to predict a
book's price given various features such as its author, its abstract, and its
rating.

.. code:: python

    !mkdir -p price_of_books
    !wget https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_price_of_books/Data.zip -O price_of_books/Data.zip
    !cd price_of_books && unzip -o Data.zip
    !ls price_of_books/Participants_Data

.. parsed-literal::
    :class: output

    --2022-06-23 06:14:20--  https://automl-mm-bench.s3.amazonaws.com/machine_hack_competitions/predict_the_price_of_books/Data.zip
    Resolving automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)... 54.231.225.113
    Connecting to automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)|54.231.225.113|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 3521673 (3.4M) [application/zip]
    Saving to: ‘price_of_books/Data.zip’

    price_of_books/Data 100%[===================>]   3.36M  5.75MB/s    in 0.6s

    2022-06-23 06:14:21 (5.75 MB/s) - ‘price_of_books/Data.zip’ saved [3521673/3521673]

    Archive:  Data.zip
      inflating: Participants_Data/Data_Test.xlsx
      inflating: Participants_Data/Data_Train.xlsx
      inflating: Participants_Data/Sample_Submission.xlsx
    Data_Test.xlsx  Data_Train.xlsx  Sample_Submission.xlsx

.. code:: python

    train_df = pd.read_excel(os.path.join('price_of_books', 'Participants_Data', 'Data_Train.xlsx'), engine='openpyxl')
    train_df.head()

.. raw:: html
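
    <table>
      <thead>
        <tr><th></th><th>Title</th><th>Author</th><th>Edition</th><th>Reviews</th><th>Ratings</th><th>Synopsis</th><th>Genre</th><th>BookCategory</th><th>Price</th></tr>
      </thead>
      <tbody>
        <tr><th>0</th><td>The Prisoner's Gold (The Hunters 3)</td><td>Chris Kuzneski</td><td>Paperback,– 10 Mar 2016</td><td>4.0 out of 5 stars</td><td>8 customer reviews</td><td>THE HUNTERS return in their third brilliant no...</td><td>Action &amp; Adventure (Books)</td><td>Action &amp; Adventure</td><td>220.00</td></tr>
        <tr><th>1</th><td>Guru Dutt: A Tragedy in Three Acts</td><td>Arun Khopkar</td><td>Paperback,– 7 Nov 2012</td><td>3.9 out of 5 stars</td><td>14 customer reviews</td><td>A layered portrait of a troubled genius for wh...</td><td>Cinema &amp; Broadcast (Books)</td><td>Biographies, Diaries &amp; True Accounts</td><td>202.93</td></tr>
        <tr><th>2</th><td>Leviathan (Penguin Classics)</td><td>Thomas Hobbes</td><td>Paperback,– 25 Feb 1982</td><td>4.8 out of 5 stars</td><td>6 customer reviews</td><td>"During the time men live without a common Pow...</td><td>International Relations</td><td>Humour</td><td>299.00</td></tr>
        <tr><th>3</th><td>A Pocket Full of Rye (Miss Marple)</td><td>Agatha Christie</td><td>Paperback,– 5 Oct 2017</td><td>4.1 out of 5 stars</td><td>13 customer reviews</td><td>A handful of grain is found in the pocket of a...</td><td>Contemporary Fiction (Books)</td><td>Crime, Thriller &amp; Mystery</td><td>180.00</td></tr>
        <tr><th>4</th><td>LIFE 70 Years of Extraordinary Photography</td><td>Editors of Life</td><td>Hardcover,– 10 Oct 2006</td><td>5.0 out of 5 stars</td><td>1 customer review</td><td>For seven decades, "Life" has been thrilling t...</td><td>Photography Textbooks</td><td>Arts, Film &amp; Photography</td><td>965.62</td></tr>
      </tbody>
    </table>
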
We do some basic preprocessing to convert ``Reviews`` and ``Ratings`` in the
data table to numeric values, and we transform prices to a log scale,
``log(price + 1)``.

.. code:: python

    def preprocess(df):
        df = df.copy(deep=True)
        # 'Reviews' holds strings such as '4.0 out of 5 stars'; strip the suffix and keep the number
        df.loc[:, 'Reviews'] = pd.to_numeric(df['Reviews'].apply(lambda ele: ele[:-len(' out of 5 stars')]))
        # 'Ratings' holds strings such as '8 customer reviews'; remove commas, then strip the suffix
        df.loc[:, 'Ratings'] = pd.to_numeric(df['Ratings'].apply(lambda ele: ele.replace(',', '')[:-len(' customer reviews')]))
        # Use log(price + 1) as the regression target
        df.loc[:, 'Price'] = np.log(df['Price'] + 1)
        return df

.. code:: python

    train_subsample_size = 1500  # subsample for faster demo, you can try setting to larger values
    test_subsample_size = 5
    train_df = preprocess(train_df)
    # Rows 100 onward are used for training; the first 100 rows are held out for testing
    train_data = train_df.iloc[100:].sample(train_subsample_size, random_state=123)
    test_data = train_df.iloc[:100].sample(test_subsample_size, random_state=245)
    train_data.head()

.. raw:: html
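
    <table>
      <thead>
        <tr><th></th><th>Title</th><th>Author</th><th>Edition</th><th>Reviews</th><th>Ratings</th><th>Synopsis</th><th>Genre</th><th>BookCategory</th><th>Price</th></tr>
      </thead>
      <tbody>
        <tr><th>949</th><td>Furious Hours</td><td>Casey Cep</td><td>Paperback,– 1 Jun 2019</td><td>4.0</td><td>NaN</td><td>‘It’s been a long time since I picked up a boo...</td><td>True Accounts (Books)</td><td>Biographies, Diaries &amp; True Accounts</td><td>5.743003</td></tr>
        <tr><th>5504</th><td>REST API Design Rulebook</td><td>Mark Masse</td><td>Paperback,– 7 Nov 2011</td><td>5.0</td><td>NaN</td><td>In todays market, where rival web services com...</td><td>Computing, Internet &amp; Digital Media (Books)</td><td>Computing, Internet &amp; Digital Media</td><td>5.786897</td></tr>
        <tr><th>5856</th><td>The Atlantropa Articles: A Novel</td><td>Cody Franklin</td><td>Paperback,– Import, 1 Nov 2018</td><td>4.5</td><td>2.0</td><td>#1 Amazon Best Seller! Dystopian Alternate His...</td><td>Action &amp; Adventure (Books)</td><td>Romance</td><td>6.893656</td></tr>
        <tr><th>4137</th><td>Hickory Dickory Dock (Poirot)</td><td>Agatha Christie</td><td>Paperback,– 5 Oct 2017</td><td>4.3</td><td>21.0</td><td>There’s more than petty theft going on in a Lo...</td><td>Action &amp; Adventure (Books)</td><td>Crime, Thriller &amp; Mystery</td><td>5.192957</td></tr>
        <tr><th>3205</th><td>The Stanley Kubrick Archives (Bibliotheca Univ...</td><td>Alison Castle</td><td>Hardcover,– 21 Aug 2016</td><td>4.6</td><td>3.0</td><td>In 1968, when Stanley Kubrick was asked to com...</td><td>Cinema &amp; Broadcast (Books)</td><td>Humour</td><td>6.889591</td></tr>
      </tbody>
    </table>
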
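The ``NaN`` entries in ``Ratings`` above are consistent with rows whose raw
value uses the singular form ``1 customer review``: stripping the fixed-length
suffix `` customer reviews`` from that string leaves an empty string, which
then has no numeric value. The following is a minimal sketch of a more
tolerant parser. The helper ``parse_leading_number`` is hypothetical (it is
not part of AutoGluon or of the preprocessing used in this tutorial), and the
string ``1,234 customer reviews`` is an illustrative input rather than a row
from the dataset.

.. code:: python

    import re

    # Hypothetical helper, not part of AutoGluon or of this tutorial's preprocessing.
    # Matches an optional run of digits/commas, an optional decimal point, then digits.
    _LEADING_NUMBER = re.compile(r'\s*([\d,]*\.?\d+)')

    def parse_leading_number(text):
        """Return the number at the start of ``text`` as a float, or None if absent."""
        match = _LEADING_NUMBER.match(str(text))
        if match is None:
            return None
        return float(match.group(1).replace(',', ''))

    # Fixed-length slicing turns the singular form into an empty string.
    print(repr('1 customer review'[:-len(' customer reviews')]))

    # The regex-based helper handles singular, plural, and comma-separated counts.
    for raw in ['4.0 out of 5 stars', '8 customer reviews',
                '1 customer review', '1,234 customer reviews']:
        print(raw, '->', parse_leading_number(raw))

With pandas, such a helper could be applied column-wise, for example
``df['Ratings'].apply(parse_leading_number)``. Doing so would change the
training data, so the results shown in the rest of this tutorial would no
longer be reproduced exactly.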
Training
--------

We can simply create a ``TextPredictor`` and call ``predictor.fit()`` to train
a model that operates across all types of features. Internally, the neural
network is generated automatically based on the inferred data type of each
feature column. To save time, we subsample the data and train for only three
minutes.

.. code:: python

    from autogluon.text import TextPredictor
    time_limit = 3 * 60  # set to larger value in your applications
    predictor = TextPredictor(label='Price', path='ag_text_book_price_prediction')
    predictor.fit(train_data, time_limit=time_limit)

.. parsed-literal::
    :class: output

    Global seed set to 123
    Auto select gpus: [0]
    Using 16bit native Automatic Mixed Precision (AMP)
    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name              | Type                | Params
    ----------------------------------------------------------
    0 | model             | MultimodalFusionMLP | 109 M
    1 | validation_metric | MeanSquaredError    | 0
    2 | loss_func         | MSELoss             | 0
    ----------------------------------------------------------
    109 M     Trainable params
    0         Non-trainable params
    109 M     Total params
    219.565   Total estimated model params size (MB)
    Epoch 0, global step 4: 'val_rmse' reached 2.11614 (best 2.11614), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-text-v3/docs/_build/eval/tutorials/text_prediction/ag_text_book_price_prediction/epoch=0-step=4.ckpt' as top 3
    Epoch 0, global step 10: 'val_rmse' reached 0.98453 (best 0.98453), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-text-v3/docs/_build/eval/tutorials/text_prediction/ag_text_book_price_prediction/epoch=0-step=10.ckpt' as top 3
    Epoch 1, global step 14: 'val_rmse' reached 0.93582 (best 0.93582), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-text-v3/docs/_build/eval/tutorials/text_prediction/ag_text_book_price_prediction/epoch=1-step=14.ckpt' as top 3
    Epoch 1, global step 20: 'val_rmse' reached 0.94639 (best 0.93582), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-text-v3/docs/_build/eval/tutorials/text_prediction/ag_text_book_price_prediction/epoch=1-step=20.ckpt' as top 3
    Epoch 2, global step 24: 'val_rmse' reached 0.86243 (best 0.86243), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-text-v3/docs/_build/eval/tutorials/text_prediction/ag_text_book_price_prediction/epoch=2-step=24.ckpt' as top 3
    Epoch 2, global step 30: 'val_rmse' reached 0.84884 (best 0.84884), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-text-v3/docs/_build/eval/tutorials/text_prediction/ag_text_book_price_prediction/epoch=2-step=30.ckpt' as top 3
    Time limit reached. Elapsed time is 0:03:02. Signaling Trainer to stop.
    Auto select gpus: [0]
    HPU available: False, using: 0 HPUs
    Auto select gpus: [0]
    HPU available: False, using: 0 HPUs
    Auto select gpus: [0]
    HPU available: False, using: 0 HPUs

Prediction
----------

We can easily obtain predictions and extract data embeddings using the
``TextPredictor``.

.. code:: python

    predictions = predictor.predict(test_data)
    # The model predicts log(price + 1); apply exp(x) - 1 to get back to prices
    print('Predictions:')
    print('------------')
    print(np.exp(predictions) - 1)
    print()
    print('True Value:')
    print('------------')
    print(np.exp(test_data['Price']) - 1)

.. parsed-literal::
    :class: output

    Auto select gpus: [0]
    HPU available: False, using: 0 HPUs

.. parsed-literal::
    :class: output

    Predictions:
    ------------
    1     423.562317
    31    427.339661
    19    929.927307
    45    507.602417
    82    526.195251
    Name: Price, dtype: float32

    True Value:
    ------------
    1     202.93
    31    799.00
    19    352.00
    45    395.10
    82    409.00
    Name: Price, dtype: float64

.. code:: python

    # The label in test_data is log(price + 1), so this RMSE is on the log scale
    performance = predictor.evaluate(test_data)
    print(performance)

.. parsed-literal::
    :class: output

    Auto select gpus: [0]
    HPU available: False, using: 0 HPUs

.. parsed-literal::
    :class: output

    {'rmse': 0.6315059662240039}

.. code:: python

    embeddings = predictor.extract_embedding(test_data)
    embeddings.shape

.. parsed-literal::
    :class: output

    Auto select gpus: [0]
    HPU available: False, using: 0 HPUs

.. parsed-literal::
    :class: output

    (5, 128)

.. _sec_textprediction_architecture:

What's happening inside?
------------------------

Internally, we use different networks to encode the text columns, categorical
columns, and numerical columns. The features generated by the individual
networks are aggregated by a late-fusion aggregator. The aggregator can output
either logits or score predictions. The architecture can be illustrated as
follows:

.. figure:: https://autogluon-text-data.s3.amazonaws.com/figures/fuse-late.png
   :width: 600px

   Multimodal Network with Late Fusion

Here, we use the pretrained NLP backbone to extract the text features and then
use two other towers to extract features from the categorical columns and the
numerical columns. In addition, to deal with multiple text fields, we separate
these fields with the ``[SEP]`` token and alternate 0s and 1s as the segment
IDs, as shown in the following figure:

.. figure:: https://autogluon-text-data.s3.amazonaws.com/figures/preprocess.png
   :width: 600px

   Preprocessing

How does this compare with TabularPredictor?
--------------------------------------------

Note that ``TabularPredictor`` can also handle data tables with text, numeric,
and categorical columns, but it uses an ensemble of many types of models and
may featurize text. ``TextPredictor`` instead fits individual Transformer
neural network models directly to the raw text (these models are also capable
of handling additional numeric/categorical columns).
We generally recommend ``TabularPredictor`` if your table contains mainly
numeric/categorical columns and ``TextPredictor`` if your table contains mainly
text columns, but it is easy to try both, and we encourage doing so. In fact,
``TabularPredictor.fit(..., hyperparameters='multimodal')`` will train a
TextPredictor along with many tabular models and ensemble them together. Refer
to the tutorial ":ref:`sec_tabularprediction_text_multimodal`" for more
details.

Other Examples
--------------

You may go to
https://github.com/awslabs/autogluon/tree/master/examples/text\_prediction to
explore other TextPredictor examples, including scripts to train a
TextPredictor on the complete book price prediction dataset.
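When comparing results from such scripts with the numbers in this tutorial, it
helps to keep in mind that the label here is ``log(price + 1)``, so the
``rmse`` reported in the Prediction section is measured on the log scale, not
in price units. As a small worked check (a sketch that uses only the five
predicted and true prices printed in the Prediction section, with hypothetical
helper names), recomputing the error in log space gives a value close to the
reported 0.63, while the error in price units is on a very different scale:

.. code:: python

    import math

    # Values copied from the output printed in the Prediction section.
    predicted_prices = [423.562317, 427.339661, 929.927307, 507.602417, 526.195251]
    true_prices = [202.93, 799.00, 352.00, 395.10, 409.00]

    def rmse(predicted, actual):
        """Root mean squared error between two equally long sequences."""
        squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
        return math.sqrt(sum(squared_errors) / len(squared_errors))

    # Re-apply the tutorial's transform, log(price + 1), before comparing.
    log_space_rmse = rmse([math.log1p(p) for p in predicted_prices],
                          [math.log1p(a) for a in true_prices])

    # For contrast, the same error measured directly in price units.
    price_space_rmse = rmse(predicted_prices, true_prices)

    print('RMSE on log(price + 1):', round(log_space_rmse, 4))
    print('RMSE on raw prices:', round(price_space_rmse, 2))

Because the two numbers are not comparable, any comparison across experiments
should state which scale the metric was computed on.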