Text Prediction - Solving Multilingual Problems

People around the world speak many languages. According to SIL International’s Ethnologue: Languages of the World, there are more than 7,100 spoken and signed languages. In fact, web data nowadays is highly multilingual, and many real-world problems involve text written in languages other than English.

In this tutorial, we show how AutoGluon Text can help you build multilingual models. For demonstration purposes, we use the Cross-Lingual Amazon Product Review Sentiment dataset, which comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese. We will build sentiment classification models on the German fold of this dataset in two ways:

  • Finetune the German BERT

  • Cross-lingual transfer from English to German

Load Dataset

The Cross-Lingual Amazon Product Review Sentiment dataset contains Amazon product reviews in four languages. Here, we load the English and German folds of the dataset. In the label column, 0 means negative sentiment and 1 means positive sentiment.

!wget https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
!unzip -o amazon_review_sentiment_cross_lingual.zip -d .
--2022-07-28 21:14:34--  https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip
Resolving automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)... 52.217.231.73
Connecting to automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)|52.217.231.73|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 250619863 (239M) [application/zip]
Saving to: ‘amazon_review_sentiment_cross_lingual.zip’

amazon_review_senti 100%[===================>] 239.01M  41.8MB/s    in 6.4s

2022-07-28 21:14:40 (37.6 MB/s) - ‘amazon_review_sentiment_cross_lingual.zip’ saved [250619863/250619863]

Archive:  amazon_review_sentiment_cross_lingual.zip
   creating: ./amazon_review_sentiment_cross_lingual/
  inflating: ./amazon_review_sentiment_cross_lingual/fr_train.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/fr_unlabled.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/jp_train.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/de_unlabled.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/jp_unlabled.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/en_train.1000.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/en_train.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/jp_test.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/de_test.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/fr_test.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/de_train.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/en_test.tsv
  inflating: ./amazon_review_sentiment_cross_lingual/en_unlabled.tsv
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                          sep='\t', header=None, names=['label', 'text']) \
                .sample(1000, random_state=123)
train_de_df.reset_index(inplace=True, drop=True)

test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',
                          sep='\t', header=None, names=['label', 'text']) \
               .sample(200, random_state=123)
test_de_df.reset_index(inplace=True, drop=True)
print(train_de_df)
     label                                               text
0        0  Dieser Film, nur so triefend von Kitsch, ist h...
1        0  Wie so oft: Das Buch begeistert, der Film entt...
2        1  Schon immer versuchten Männer ihre Gefühle geg...
3        1  Wenn man sich durch 10 Minuten Disney-Trailer ...
4        1  Eine echt geile nummer zum Abtanzen und feiern...
..     ...                                                ...
995      0  Ich dachte dies wäre ein richtig spannendes Bu...
996      0  Wer sich den Schrott wirklich noch ansehen möc...
997      0  Sicher, der Film greift ein aktuelles und hoch...
998      1  Dieser Bildband lässt das Herz von Sarah Kay-F...
999      1  ...so das war nun mein drittes Buch von Jenny-...

[1000 rows x 2 columns]
train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',
                          sep='\t',
                          header=None,
                          names=['label', 'text']) \
                .sample(1000, random_state=123)
train_en_df.reset_index(inplace=True, drop=True)

test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',
                          sep='\t',
                          header=None,
                          names=['label', 'text']) \
               .sample(200, random_state=123)
test_en_df.reset_index(inplace=True, drop=True)
print(train_en_df)
     label                                               text
0        0  This is a film that literally sees little wron...
1        0  This music is pretty intelligent, but not very...
2        0  One of the best pieces of rock ever recorded, ...
3        0  Reading the posted reviews here, is like revis...
4        1  I've just finished page 341, the last page. It...
..     ...                                                ...
995      1  This album deserves to be (at least) as popula...
996      1  This book, one of the few that takes a more ac...
997      1  I loved it because it really did show Sagan th...
998      1  Stuart Gordons "DAGON" is a unique horror gem ...
999      0  I've heard Al Lee speak before and thought tha...

[1000 rows x 2 columns]
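
Before finetuning, it is worth a quick sanity check that the 1,000-row samples are reasonably balanced between negative (0) and positive (1) reviews:

# Count the labels in each sampled training fold
print(train_de_df['label'].value_counts())
print(train_en_df['label'].value_counts())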

Finetune the German BERT

Our first approach is to finetune the German BERT model pretrained by deepset. Since AutoGluon Text integrates with Huggingface/Transformers (as explained in Text Prediction - Customization), we can directly load the German BERT model from Huggingface/Transformers using the key bert-base-german-cased. To keep the experiment simple, we finetune for only 4 epochs.

from autogluon.text import TextPredictor

predictor = TextPredictor(label='label')
predictor.fit(train_de_df,
              hyperparameters={
                  'model.hf_text.checkpoint_name': 'bert-base-german-cased',
                  'optimization.max_epochs': 4
              })
Global seed set to 123
Downloading:   0%|          | 0.00/29.0 [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/433 [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/249k [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/474k [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/419M [00:00<?, ?B/s]
Auto select gpus: [0]
Using 16bit native Automatic Mixed Precision (AMP)
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 109 M
1 | validation_metric | AUROC                        | 0
2 | loss_func         | CrossEntropyLoss             | 0
-------------------------------------------------------------------
109 M     Trainable params
0         Non-trainable params
109 M     Total params
218.166   Total estimated model params size (MB)
Epoch 0, global step 3: 'val_roc_auc' reached 0.66895 (best 0.66895), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=0-step=3.ckpt' as top 3
Epoch 0, global step 7: 'val_roc_auc' reached 0.78300 (best 0.78300), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=0-step=7.ckpt' as top 3
Epoch 1, global step 10: 'val_roc_auc' reached 0.83890 (best 0.83890), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=1-step=10.ckpt' as top 3
Epoch 1, global step 14: 'val_roc_auc' reached 0.86888 (best 0.86888), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=1-step=14.ckpt' as top 3
Epoch 2, global step 17: 'val_roc_auc' reached 0.88600 (best 0.88600), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=2-step=17.ckpt' as top 3
Epoch 2, global step 21: 'val_roc_auc' reached 0.89516 (best 0.89516), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=2-step=21.ckpt' as top 3
Epoch 3, global step 24: 'val_roc_auc' reached 0.89791 (best 0.89791), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=3-step=24.ckpt' as top 3
Epoch 3, global step 28: 'val_roc_auc' reached 0.89876 (best 0.89876), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=3-step=28.ckpt' as top 3
<autogluon.text.text_prediction.predictor.TextPredictor at 0x7fd750781460>
score = predictor.evaluate(test_de_df)
print('Score on the German Testset:')
print(score)
Score on the German Testset:
{'roc_auc': 0.9416065705128206}
score = predictor.evaluate(test_en_df)
print('Score on the English Testset:')
print(score)
Score on the English Testset:
{'roc_auc': 0.5831574716108934}
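
Besides evaluate, the predictor also exposes predict and predict_proba for scoring new sentences. A minimal sketch (the German review below is made up for illustration; it roughly means “The book is great, I read it in one day”):

# Score a single made-up German review
example_df = pd.DataFrame({'text': ['Das Buch ist großartig, ich habe es an einem Tag gelesen.']})
print(predictor.predict(example_df))        # hard label: 0 (negative) or 1 (positive)
print(predictor.predict_proba(example_df))  # per-class probabilities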

We find that the model achieves good performance on the German test set but performs poorly on the English test set. Next, we will show how to enable cross-lingual transfer so you can get a single model that magically works for both German and English.

Cross-lingual Transfer

In real-world scenarios, it is quite common to have trained a model for English and then want to extend it to support other languages, such as German. This setting is also known as cross-lingual transfer. One way to solve the problem is to apply a machine translation model to translate sentences from the other language (e.g., German) into English and then apply the English model. However, as shown in “Unsupervised Cross-lingual Representation Learning at Scale”, there is a better and more cost-effective way to do cross-lingual transfer, enabled by large-scale multilingual pretraining. The authors showed that, via large-scale pretraining, the backbone (called XLM-R) is able to perform zero-shot cross-lingual transfer, meaning that you can directly apply a model trained on an English dataset to datasets in other languages. This approach also outperforms the “TRANSLATE-TEST” baseline, in which data from other languages is translated into English and the English model is applied.

In AutoGluon, you can simply set presets="multilingual" to load a backbone that is suitable for zero-shot transfer. Internally, AutoGluon will automatically use state-of-the-art multilingual models like DeBERTa-V3.

from autogluon.text import TextPredictor

predictor = TextPredictor(label='label')
predictor.fit(train_en_df,
              presets='multilingual',
              hyperparameters={
                  'optimization.max_epochs': 4
              })
Global seed set to 123
Downloading:   0%|          | 0.00/52.0 [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/579 [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/4.11M [00:00<?, ?B/s]
Downloading:   0%|          | 0.00/534M [00:00<?, ?B/s]
Auto select gpus: [0]
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name              | Type                         | Params
-------------------------------------------------------------------
0 | model             | HFAutoModelForTextPrediction | 278 M
1 | validation_metric | AUROC                        | 0
2 | loss_func         | CrossEntropyLoss             | 0
-------------------------------------------------------------------
278 M     Trainable params
0         Non-trainable params
278 M     Total params
1,112.881 Total estimated model params size (MB)
Epoch 0, global step 3: 'val_roc_auc' reached 0.52815 (best 0.52815), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=0-step=3.ckpt' as top 1
Epoch 0, global step 7: 'val_roc_auc' reached 0.78608 (best 0.78608), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=0-step=7.ckpt' as top 1
Epoch 1, global step 10: 'val_roc_auc' reached 0.81878 (best 0.81878), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=1-step=10.ckpt' as top 1
Epoch 1, global step 14: 'val_roc_auc' reached 0.90109 (best 0.90109), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=1-step=14.ckpt' as top 1
Epoch 2, global step 17: 'val_roc_auc' reached 0.94079 (best 0.94079), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=2-step=17.ckpt' as top 1
Epoch 2, global step 21: 'val_roc_auc' reached 0.95140 (best 0.95140), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=2-step=21.ckpt' as top 1
Epoch 3, global step 24: 'val_roc_auc' reached 0.95260 (best 0.95260), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=3-step=24.ckpt' as top 1
Epoch 3, global step 28: 'val_roc_auc' reached 0.95370 (best 0.95370), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=3-step=28.ckpt' as top 1
<autogluon.text.text_prediction.predictor.TextPredictor at 0x7fd73a5c7880>
score_in_en = predictor.evaluate(test_en_df)
print('Score on the English Testset:')
print(score_in_en)
Score on the English Testset:
{'roc_auc': 0.931263189629183}
score_in_de = predictor.evaluate(test_de_df)
print('Score on the German Testset:')
print(score_in_de)
Score on the German Testset:
{'roc_auc': 0.9345953525641025}

We can see that the model works for both German and English!
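
As a quick sanity check of the zero-shot behavior, we can feed the same predictor a mixed batch of English and German sentences (both examples are made up for illustration):

mixed_df = pd.DataFrame({'text': [
    'This movie was a complete waste of time.',   # English, presumably negative
    'Dieses Album höre ich immer wieder gerne.',  # German ("I always enjoy listening to this album"), presumably positive
]})
print(predictor.predict(mixed_df))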

Let’s also inspect the model’s performance on Japanese:

test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',
                          sep='\t', header=None, names=['label', 'text']) \
               .sample(200, random_state=123)
test_jp_df.reset_index(inplace=True, drop=True)
print(test_jp_df)
     label                                               text
0        1  原作はビクトル・ユーゴの長編小説だが、私が子供の頃読んだのは短縮版の「ああ無情」。それでもこ...
1        1  ほかの作品のレビューにみんな書いているのに、何故この作品について書いている人が一人しかいない...
2        0  一番の問題点は青島が出ていない事でしょう。  TV番組では『芸人が出ていればバラエティだから...
3        0  昔、 りんたろう監督によるアニメ「カムイの剣」があった。  「カムイの剣」…を観た人なら本作...
4        1  以前のアルバムを聴いていないのでなんとも言えないが、クラシックなメタルを聞いてきた耳には、と...
..     ...                                                ...
195      0  原作が面白く、このDVDも期待して観ただけに非常にがっかりしました。  脚本としては単に格闘...
196      0                              フェードインやフェードアウトが多すぎます。
197      0  流通形態云々については特に革命と言う気はしない。  これからもCDは普通に発売されるだろうし...
198      1  もうTVとか、最近の映画とか、観なくていいよ。  脳に楽なエンターテイメントだから。  脳を...
199      0  みんなさんは、手塚治虫先生の「1985への出発」という漫画を読んだことがありますでしょうか?...

[200 rows x 2 columns]
print('Negative label ratio of the Japanese Testset=', test_jp_df['label'].value_counts()[0] / len(test_jp_df))
score_in_jp = predictor.evaluate(test_jp_df)
print('Score on the Japanese Testset:')
print(score_in_jp)
Negative label ratio of the Japanese Testset= 0.575
Score on the Japanese Testset:
{'roc_auc': 0.8901278772378517}

Amazingly, the model also works for Japanese!
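
Finally, to reuse the multilingual predictor later without retraining, you can save it to disk and load it back. A minimal sketch, assuming the hypothetical path 'my_multilingual_predictor':

# Persist the trained predictor and restore it in a later session
predictor.save('my_multilingual_predictor')
loaded_predictor = TextPredictor.load('my_multilingual_predictor')
print(loaded_predictor.evaluate(test_de_df))  # should reproduce the German score above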