.. _sec_textprediction_multilingual:

Text Prediction - Solving Multilingual Problems
===============================================


People around the world speak many languages. According to `SIL
International <https://en.wikipedia.org/wiki/SIL_International>`__\ ’s
`Ethnologue: Languages of the
World <https://en.wikipedia.org/wiki/Ethnologue>`__, there are more than
**7,100** spoken and signed languages. Web data today is
highly multilingual, and many real-world problems involve text written
in languages other than English.

In this tutorial, we show how AutoGluon Text can help you build
multilingual models. For the purpose of demonstration, we use the
`Cross-Lingual Amazon Product Review
Sentiment <https://webis.de/data/webis-cls-10.html>`__ dataset, which
comprises about 800,000 Amazon product reviews in four languages:
English, German, French, and Japanese. We will demonstrate how to use
AutoGluon Text to build sentiment classification models on the German
fold of this dataset in two ways:

-  Finetune the German BERT
-  Cross-lingual transfer from English to German

Load Dataset
------------

The `Cross-Lingual Amazon Product Review
Sentiment <https://webis.de/data/webis-cls-10.html>`__ dataset contains
Amazon product reviews in four languages. Here, we load the English and
German folds of the dataset. In the label column, ``0`` means negative
sentiment and ``1`` means positive sentiment.

.. code:: python

    !wget https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
    !unzip -o amazon_review_sentiment_cross_lingual.zip -d .


.. parsed-literal::
    :class: output

    --2022-07-28 21:14:34--  https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip
    Resolving automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)... 52.217.231.73
    Connecting to automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)|52.217.231.73|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 250619863 (239M) [application/zip]
    Saving to: ‘amazon_review_sentiment_cross_lingual.zip’
    
    amazon_review_senti 100%[===================>] 239.01M  41.8MB/s    in 6.4s    
    
    2022-07-28 21:14:40 (37.6 MB/s) - ‘amazon_review_sentiment_cross_lingual.zip’ saved [250619863/250619863]
    
    Archive:  amazon_review_sentiment_cross_lingual.zip
       creating: ./amazon_review_sentiment_cross_lingual/
      inflating: ./amazon_review_sentiment_cross_lingual/fr_train.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/fr_unlabled.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/jp_train.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/de_unlabled.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/jp_unlabled.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/en_train.1000.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/en_train.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/jp_test.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/de_test.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/fr_test.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/de_train.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/en_test.tsv  
      inflating: ./amazon_review_sentiment_cross_lingual/en_unlabled.tsv  


.. code:: python

    import pandas as pd
    import warnings
    warnings.filterwarnings('ignore')
    
    train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_de_df.reset_index(inplace=True, drop=True)
    
    test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_de_df.reset_index(inplace=True, drop=True)
    print(train_de_df)


.. parsed-literal::
    :class: output

         label                                               text
    0        0  Dieser Film, nur so triefend von Kitsch, ist h...
    1        0  Wie so oft: Das Buch begeistert, der Film entt...
    2        1  Schon immer versuchten Männer ihre Gefühle geg...
    3        1  Wenn man sich durch 10 Minuten Disney-Trailer ...
    4        1  Eine echt geile nummer zum Abtanzen und feiern...
    ..     ...                                                ...
    995      0  Ich dachte dies wäre ein richtig spannendes Bu...
    996      0  Wer sich den Schrott wirklich noch ansehen möc...
    997      0  Sicher, der Film greift ein aktuelles und hoch...
    998      1  Dieser Bildband lässt das Herz von Sarah Kay-F...
    999      1  ...so das war nun mein drittes Buch von Jenny-...
    
    [1000 rows x 2 columns]


.. code:: python

    train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',
                              sep='\t',
                              header=None,
                              names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_en_df.reset_index(inplace=True, drop=True)
    
    test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',
                              sep='\t',
                              header=None,
                              names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_en_df.reset_index(inplace=True, drop=True)
    print(train_en_df)


.. parsed-literal::
    :class: output

         label                                               text
    0        0  This is a film that literally sees little wron...
    1        0  This music is pretty intelligent, but not very...
    2        0  One of the best pieces of rock ever recorded, ...
    3        0  Reading the posted reviews here, is like revis...
    4        1  I've just finished page 341, the last page. It...
    ..     ...                                                ...
    995      1  This album deserves to be (at least) as popula...
    996      1  This book, one of the few that takes a more ac...
    997      1  I loved it because it really did show Sagan th...
    998      1  Stuart Gordons "DAGON" is a unique horror gem ...
    999      0  I've heard Al Lee speak before and thought tha...
    
    [1000 rows x 2 columns]
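
Because both folds are subsampled, it can be worth confirming that the
two classes are roughly balanced before training. The sketch below is a
minimal, standard-library illustration on a few made-up rows in the same
headerless ``label<TAB>text`` layout; ``label_ratio`` and ``SAMPLE_TSV``
are hypothetical names introduced here, not part of AutoGluon or the
dataset.

.. code:: python

    import csv
    import io
    from collections import Counter

    # Made-up rows in the tutorial's layout: no header, label then text, tab-separated.
    SAMPLE_TSV = (
        "0\tDisappointing plot.\n"
        "1\tA wonderful album.\n"
        "1\tGreat read.\n"
        "0\tNot worth it.\n"
        "1\tLoved it.\n"
    )

    def label_ratio(tsv_file):
        """Return {label: fraction of rows} for a headerless label<TAB>text file."""
        reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
        counts = Counter(int(row[0]) for row in reader if row)
        total = sum(counts.values())
        return {label: n / total for label, n in sorted(counts.items())}

    ratios = label_ratio(io.StringIO(SAMPLE_TSV))
    print(ratios)

With the DataFrames loaded above, the equivalent pandas check would be
``train_de_df['label'].value_counts(normalize=True)``.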


Finetune the German BERT
------------------------

Our first approach is to finetune the `German BERT
model <https://www.deepset.ai/german-bert>`__ pretrained by deepset.
Since AutoGluon Text integrates with
`Huggingface/Transformers <https://huggingface.co/docs/transformers/index>`__
(as explained in :ref:`sec_textprediction_customization`), we can load
the German BERT model directly by its checkpoint name,
`bert-base-german-cased <https://huggingface.co/bert-base-german-cased>`__.
To keep the experiment short, we finetune for only 4 epochs.

.. code:: python

    from autogluon.text import TextPredictor
    
    predictor = TextPredictor(label='label')
    predictor.fit(train_de_df,
                  hyperparameters={
                      'model.hf_text.checkpoint_name': 'bert-base-german-cased',
                      'optimization.max_epochs': 4
                  })


.. parsed-literal::
    :class: output

    Global seed set to 123


.. parsed-literal::
    :class: output

    Auto select gpus: [0]
    Using 16bit native Automatic Mixed Precision (AMP)
    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    
      | Name              | Type                         | Params
    -------------------------------------------------------------------
    0 | model             | HFAutoModelForTextPrediction | 109 M 
    1 | validation_metric | AUROC                        | 0     
    2 | loss_func         | CrossEntropyLoss             | 0     
    -------------------------------------------------------------------
    109 M     Trainable params
    0         Non-trainable params
    109 M     Total params
    218.166   Total estimated model params size (MB)
    Epoch 0, global step 3: 'val_roc_auc' reached 0.66895 (best 0.66895), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=0-step=3.ckpt' as top 3
    Epoch 0, global step 7: 'val_roc_auc' reached 0.78300 (best 0.78300), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=0-step=7.ckpt' as top 3
    Epoch 1, global step 10: 'val_roc_auc' reached 0.83890 (best 0.83890), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=1-step=10.ckpt' as top 3
    Epoch 1, global step 14: 'val_roc_auc' reached 0.86888 (best 0.86888), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=1-step=14.ckpt' as top 3
    Epoch 2, global step 17: 'val_roc_auc' reached 0.88600 (best 0.88600), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=2-step=17.ckpt' as top 3
    Epoch 2, global step 21: 'val_roc_auc' reached 0.89516 (best 0.89516), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=2-step=21.ckpt' as top 3
    Epoch 3, global step 24: 'val_roc_auc' reached 0.89791 (best 0.89791), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=3-step=24.ckpt' as top 3
    Epoch 3, global step 28: 'val_roc_auc' reached 0.89876 (best 0.89876), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211450/epoch=3-step=28.ckpt' as top 3


.. parsed-literal::
    :class: output

    <autogluon.text.text_prediction.predictor.TextPredictor at 0x7fd750781460>


.. code:: python

    score = predictor.evaluate(test_de_df)
    print('Score on the German Testset:')
    print(score)


.. parsed-literal::
    :class: output

    Score on the German Testset:
    {'roc_auc': 0.9416065705128206}


.. code:: python

    score = predictor.evaluate(test_en_df)
    print('Score on the English Testset:')
    print(score)


.. parsed-literal::
    :class: output

    Score on the English Testset:
    {'roc_auc': 0.5831574716108934}


We can see that the model achieves good performance on the German
test set but performs poorly on the English test set. Next, we show
how to enable cross-lingual transfer so that a single model works for
**both German and English**.
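
The scores reported above are ROC AUC values. As a reminder of how to
read them, ROC AUC can be interpreted as the fraction of (positive,
negative) pairs that the model ranks in the right order, so ``0.5``
corresponds to uninformative scores and ``1.0`` to a perfect ranking.
The following is a small illustrative sketch of that definition on toy
numbers; ``roc_auc`` here is a hypothetical helper written for this
explanation, not the implementation AutoGluon uses.

.. code:: python

    def roc_auc(labels, scores):
        """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
        pos = [s for y, s in zip(labels, scores) if y == 1]
        neg = [s for y, s in zip(labels, scores) if y == 0]
        if not pos or not neg:
            raise ValueError("ROC AUC needs at least one example of each class")
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        return wins / (len(pos) * len(neg))

    labels = [0, 0, 1, 1]
    informative = [0.1, 0.4, 0.35, 0.8]  # one positive is ranked below one negative
    flat = [0.5, 0.5, 0.5, 0.5]          # every pair is a tie

    auc_informative = roc_auc(labels, informative)
    auc_flat = roc_auc(labels, flat)
    print(auc_informative, auc_flat)

Read this way, the English score of roughly ``0.58`` is only slightly
better than chance, while the German score of roughly ``0.94`` reflects
a strong ranking.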

Cross-lingual Transfer
----------------------

In real-world scenarios, it is common to have trained a
model for English and to want to extend it to support other
languages such as German. This setting is known as cross-lingual
transfer. One way to solve the problem is to use a machine translation
model to translate the sentences from the other language (e.g., German)
into English and then apply the English model. However, as shown in
`“Unsupervised Cross-lingual Representation Learning at
Scale” <https://arxiv.org/pdf/1911.02116.pdf>`__, there is a better and
more cost-effective approach to cross-lingual transfer, enabled by
large-scale multilingual pretraining. The authors showed that, with
large-scale pretraining, the backbone (called XLM-R) is able to perform
*zero-shot* cross-lingual transfer, meaning that you can directly apply
a model trained on the English dataset to datasets in other languages.
It also outperforms the “TRANSLATE-TEST” baseline, which translates the
data from other languages into English and applies the English model.
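
To make the difference between the two pipelines concrete, here is a
minimal sketch. All functions below are toy stand-ins invented for
illustration (a dictionary lookup instead of a translation system,
keyword matching instead of a finetuned classifier); only the shape of
the two pipelines is the point.

.. code:: python

    def translate_test(texts, translate, english_model):
        """TRANSLATE-TEST: translate inputs into English, then apply the English model."""
        return english_model(translate(texts))

    def zero_shot(texts, multilingual_model):
        """Zero-shot transfer: apply the multilingual model to the untranslated text."""
        return multilingual_model(texts)

    # Toy stand-ins so the sketch runs end to end (hypothetical, not real models).
    TOY_DICTIONARY = {"gut": "good", "schlecht": "bad"}

    def toy_translate(texts):
        return [" ".join(TOY_DICTIONARY.get(w, w) for w in t.lower().split())
                for t in texts]

    def toy_english_model(texts):
        return [int("good" in t.split()) for t in texts]

    def toy_multilingual_model(texts):
        return [int(any(w in ("good", "gut") for w in t.lower().split()))
                for t in texts]

    german = ["Sehr gut", "Sehr schlecht"]
    via_translation = translate_test(german, toy_translate, toy_english_model)
    via_zero_shot = zero_shot(german, toy_multilingual_model)
    print(via_translation, via_zero_shot)

Note that the translate-test pipeline needs a translation system at
inference time for every supported language, whereas the zero-shot
pipeline needs only the one multilingual model.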

In AutoGluon, you can simply set ``presets="multilingual"`` to load a
backbone that is suitable for zero-shot transfer. Internally, AutoGluon
automatically selects a state-of-the-art model such as
`DeBERTa-V3 <https://arxiv.org/abs/2111.09543>`__.

.. code:: python

    from autogluon.text import TextPredictor
    
    predictor = TextPredictor(label='label')
    predictor.fit(train_en_df,
                  presets='multilingual',
                  hyperparameters={
                      'optimization.max_epochs': 4
                  })


.. parsed-literal::
    :class: output

    Global seed set to 123


.. parsed-literal::
    :class: output

    Auto select gpus: [0]
    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
    
      | Name              | Type                         | Params
    -------------------------------------------------------------------
    0 | model             | HFAutoModelForTextPrediction | 278 M 
    1 | validation_metric | AUROC                        | 0     
    2 | loss_func         | CrossEntropyLoss             | 0     
    -------------------------------------------------------------------
    278 M     Trainable params
    0         Non-trainable params
    278 M     Total params
    1,112.881 Total estimated model params size (MB)
    Epoch 0, global step 3: 'val_roc_auc' reached 0.52815 (best 0.52815), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=0-step=3.ckpt' as top 1
    Epoch 0, global step 7: 'val_roc_auc' reached 0.78608 (best 0.78608), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=0-step=7.ckpt' as top 1
    Epoch 1, global step 10: 'val_roc_auc' reached 0.81878 (best 0.81878), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=1-step=10.ckpt' as top 1
    Epoch 1, global step 14: 'val_roc_auc' reached 0.90109 (best 0.90109), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=1-step=14.ckpt' as top 1
    Epoch 2, global step 17: 'val_roc_auc' reached 0.94079 (best 0.94079), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=2-step=17.ckpt' as top 1
    Epoch 2, global step 21: 'val_roc_auc' reached 0.95140 (best 0.95140), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=2-step=21.ckpt' as top 1
    Epoch 3, global step 24: 'val_roc_auc' reached 0.95260 (best 0.95260), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=3-step=24.ckpt' as top 1
    Epoch 3, global step 28: 'val_roc_auc' reached 0.95370 (best 0.95370), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/text_prediction/AutogluonModels/ag-20220728_211808/epoch=3-step=28.ckpt' as top 1


.. parsed-literal::
    :class: output

    <autogluon.text.text_prediction.predictor.TextPredictor at 0x7fd73a5c7880>


.. code:: python

    score_in_en = predictor.evaluate(test_en_df)
    print('Score in the English Testset:')
    print(score_in_en)


.. parsed-literal::
    :class: output

    Score in the English Testset:
    {'roc_auc': 0.931263189629183}


.. code:: python

    score_in_de = predictor.evaluate(test_de_df)
    print('Score in the German Testset:')
    print(score_in_de)


.. parsed-literal::
    :class: output

    Score in the German Testset:
    {'roc_auc': 0.9345953525641025}


We can see that the model works for both German and English!
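
The per-language evaluations above repeat the same call, so a small
helper can collect them in one place. This is only a sketch:
``evaluate_on_folds`` is a hypothetical helper, and it assumes nothing
beyond ``predictor.evaluate(df)`` returning a dictionary of scores, as
in the outputs shown above. A stub predictor with a made-up score is
used so that the example runs without a trained model.

.. code:: python

    def evaluate_on_folds(predictor, test_sets):
        """Call predictor.evaluate on each named test set and collect the scores."""
        return {name: predictor.evaluate(data) for name, data in test_sets.items()}

    class StubPredictor:
        """Stand-in for a trained predictor so the sketch runs without a GPU."""

        def evaluate(self, data):
            # Made-up score: fraction of rows labelled 1 (not a real metric).
            return {'toy_score': sum(label for label, _ in data) / len(data)}

    toy_sets = {
        'en': [(1, 'good'), (0, 'bad')],
        'de': [(1, 'gut'), (1, 'toll')],
    }
    scores = evaluate_on_folds(StubPredictor(), toy_sets)
    print(scores)

With the objects from this tutorial, the call would look like
``evaluate_on_folds(predictor, {'en': test_en_df, 'de': test_de_df})``.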

Let’s also inspect the model’s performance on Japanese:

.. code:: python

    test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_jp_df.reset_index(inplace=True, drop=True)
    print(test_jp_df)


.. parsed-literal::
    :class: output

         label                                               text
    0        1  原作はビクトル・ユーゴの長編小説だが、私が子供の頃読んだのは短縮版の「ああ無情」。それでもこ...
    1        1  ほかの作品のレビューにみんな書いているのに、何故この作品について書いている人が一人しかいない...
    2        0  一番の問題点は青島が出ていない事でしょう。  ＴＶ番組では『芸人が出ていればバラエティだから...
    3        0  昔、 りんたろう監督によるアニメ「カムイの剣」があった。  「カムイの剣」…を観た人なら本作...
    4        1  以前のアルバムを聴いていないのでなんとも言えないが、クラシックなメタルを聞いてきた耳には、と...
    ..     ...                                                ...
    195      0  原作が面白く、このDVDも期待して観ただけに非常にがっかりしました。  脚本としては単に格闘...
    196      0                              フェードインやフェードアウトが多すぎます。
    197      0  流通形態云々については特に革命と言う気はしない。  これからもＣＤは普通に発売されるだろうし...
    198      1  もうＴＶとか、最近の映画とか、観なくていいよ。  脳に楽なエンターテイメントだから。  脳を...
    199      0  みんなさんは、手塚治虫先生の「1985への出発」という漫画を読んだことがありますでしょうか？...
    
    [200 rows x 2 columns]


.. code:: python

    print('Negative label ratio of the Japanese Testset=', test_jp_df['label'].value_counts()[0] / len(test_jp_df))
    score_in_jp = predictor.evaluate(test_jp_df)
    print('Score in the Japanese Testset:')
    print(score_in_jp)


.. parsed-literal::
    :class: output

    Negative label ratio of the Japanese Testset= 0.575
    Score in the Japanese Testset:
    {'roc_auc': 0.8901278772378517}


Amazingly, the model also works for Japanese, even though it was
finetuned only on English reviews!
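
The multilingual preset is one way to choose a backbone. If you would
like to experiment with the XLM-R backbone from the paper cited earlier,
one option is to name a checkpoint explicitly through the same
``model.hf_text.checkpoint_name`` key used for German BERT. The sketch
below makes two assumptions: that the ``xlm-roberta-base`` checkpoint
from the Hugging Face Hub loads through this key in the same way, and
that the helper names (``backbone_hyperparameters``,
``fit_with_backbone``) are hypothetical. It has not been run as part of
this tutorial, so no scores are claimed for it.

.. code:: python

    def backbone_hyperparameters(checkpoint='xlm-roberta-base', max_epochs=4):
        """Build fit() hyperparameters that select a specific Hugging Face backbone."""
        return {
            'model.hf_text.checkpoint_name': checkpoint,  # assumed to load like German BERT
            'optimization.max_epochs': max_epochs,
        }

    def fit_with_backbone(train_df, hyperparameters, label='label'):
        """Train a TextPredictor; imported lazily so the sketch loads without AutoGluon."""
        from autogluon.text import TextPredictor
        predictor = TextPredictor(label=label)
        predictor.fit(train_df, hyperparameters=hyperparameters)
        return predictor

    hp = backbone_hyperparameters()
    print(hp)

Training would then be ``fit_with_backbone(train_en_df, hp)``, followed
by the same ``evaluate`` calls as above.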