.. _sec_automm_textprediction_multilingual:

AutoMM for Text - Multilingual Problems
=======================================

People around the world speak many languages. According to `SIL International <https://www.sil.org/>`__'s `Ethnologue: Languages of the World <https://www.ethnologue.com/>`__, there are more than **7,100** spoken and signed languages. In fact, web data nowadays is highly multilingual, and many real-world problems involve text written in languages other than English.

In this tutorial, we introduce how ``MultiModalPredictor`` can help you build multilingual models. For the purpose of demonstration, we use the `Cross-Lingual Amazon Product Review Sentiment <https://webis.de/data/webis-cls-10.html>`__ dataset, which comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese. We will demonstrate how to use ``MultiModalPredictor`` to build sentiment classification models on the German fold of this dataset in two ways:

- Finetune the German BERT
- Cross-lingual transfer from English to German

Load Dataset
------------

The `Cross-Lingual Amazon Product Review Sentiment <https://webis.de/data/webis-cls-10.html>`__ dataset contains Amazon product reviews in four languages. Here, we load the English and German folds of the dataset. In the label column, ``0`` means negative sentiment and ``1`` means positive sentiment.

.. code:: python

    !wget https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
    !unzip -o amazon_review_sentiment_cross_lingual.zip -d .

.. parsed-literal::
    :class: output

    --2022-07-19 00:00:17--  https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip
    Resolving automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)... 52.216.105.27
    Connecting to automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)|52.216.105.27|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 250619863 (239M) [application/zip]
    Saving to: ‘amazon_review_sentiment_cross_lingual.zip’

    amazon_review_senti 100%[===================>] 239.01M  22.4MB/s    in 11s

    2022-07-19 00:00:29 (21.5 MB/s) - ‘amazon_review_sentiment_cross_lingual.zip’ saved [250619863/250619863]

    Archive:  amazon_review_sentiment_cross_lingual.zip
       creating: ./amazon_review_sentiment_cross_lingual/
      inflating: ./amazon_review_sentiment_cross_lingual/fr_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/fr_unlabled.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/jp_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/de_unlabled.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/jp_unlabled.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_train.1000.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/jp_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/de_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/fr_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/de_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_unlabled.tsv
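Before subsampling the folds below, we can optionally sanity-check one of the raw files. The short snippet below is our own addition: it assumes only the extracted file layout shown above, and verifies the two-column, tab-separated ``(label, text)`` format and the label balance of the full German training fold.

.. code:: python

    import pandas as pd

    # Optional sanity check (our own snippet, not required for the tutorial):
    # confirm the tab-separated (label, text) layout and inspect the label
    # balance of the raw German training fold.
    de_raw = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                         sep='\t', header=None, names=['label', 'text'])
    print(de_raw.shape)
    print(de_raw['label'].value_counts())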
.. code:: python

    import pandas as pd
    import warnings
    warnings.filterwarnings('ignore')

    train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_de_df.reset_index(inplace=True, drop=True)

    test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_de_df.reset_index(inplace=True, drop=True)
    print(train_de_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        0  Dieser Film, nur so triefend von Kitsch, ist h...
    1        0  Wie so oft: Das Buch begeistert, der Film entt...
    2        1  Schon immer versuchten Männer ihre Gefühle geg...
    3        1  Wenn man sich durch 10 Minuten Disney-Trailer ...
    4        1  Eine echt geile nummer zum Abtanzen und feiern...
    ..     ...                                                ...
    995      0  Ich dachte dies wäre ein richtig spannendes Bu...
    996      0  Wer sich den Schrott wirklich noch ansehen möc...
    997      0  Sicher, der Film greift ein aktuelles und hoch...
    998      1  Dieser Bildband lässt das Herz von Sarah Kay-F...
    999      1  ...so das war nun mein drittes Buch von Jenny-...

    [1000 rows x 2 columns]

.. code:: python

    train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_en_df.reset_index(inplace=True, drop=True)

    test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_en_df.reset_index(inplace=True, drop=True)
    print(train_en_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        0  This is a film that literally sees little wron...
    1        0  This music is pretty intelligent, but not very...
    2        0  One of the best pieces of rock ever recorded, ...
    3        0  Reading the posted reviews here, is like revis...
    4        1  I've just finished page 341, the last page. It...
    ..     ...                                                ...
    995      1  This album deserves to be (at least) as popula...
    996      1  This book, one of the few that takes a more ac...
    997      1  I loved it because it really did show Sagan th...
    998      1  Stuart Gordons "DAGON" is a unique horror gem ...
    999      0  I've heard Al Lee speak before and thought tha...

    [1000 rows x 2 columns]

Finetune the German BERT
------------------------

Our first approach is to finetune the `German BERT model <https://www.deepset.ai/german-bert>`__ pretrained by deepset. Since ``MultiModalPredictor`` integrates with `Huggingface/Transformers <https://github.com/huggingface/transformers>`__ (as explained in :ref:`sec_textprediction_customization`), we can directly load the German BERT model from Huggingface/Transformers using the key `bert-base-german-cased <https://huggingface.co/bert-base-german-cased>`__. To simplify the experiment, we finetune for just 4 epochs.

.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label='label')
    predictor.fit(train_de_df,
                  hyperparameters={
                      'model.hf_text.checkpoint_name': 'bert-base-german-cased',
                      'optimization.max_epochs': 4
                  })

.. parsed-literal::
    :class: output

    Global seed set to 123
    [model download and training progress output omitted]

.. code:: python

    score = predictor.evaluate(test_de_df)
    print('Score on the German Testset:')
    print(score)

.. parsed-literal::
    :class: output

    Score on the German Testset:
    {'roc_auc': 0.9416065705128206}
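As a quick check of the finetuned model, we can also run inference on a couple of hand-written German reviews. This is a minimal sketch: the two sentences below are made-up examples, not drawn from the dataset.

.. code:: python

    import pandas as pd

    # Made-up German reviews (hypothetical examples) for a quick smoke test.
    example_df = pd.DataFrame({
        'text': [
            'Das Buch hat mir sehr gut gefallen, absolut empfehlenswert!',  # positive tone
            'Leider eine totale Enttäuschung, reine Zeitverschwendung.',    # negative tone
        ]
    })

    # predict() returns hard labels; predict_proba() returns class probabilities.
    print(predictor.predict(example_df))
    print(predictor.predict_proba(example_df))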
Next, let's check how this German-only model performs on the English test set:

.. code:: python

    score = predictor.evaluate(test_en_df)
    print('Score on the English Testset:')
    print(score)

.. parsed-literal::
    :class: output

    Score on the English Testset:
    {'roc_auc': 0.5831574716108934}

We can see that the model achieves good performance on the German dataset but performs poorly on the English dataset. Next, we will show how to enable cross-lingual transfer so you can get a model that magically works for **both German and English**.

Cross-lingual Transfer
----------------------

In real-world scenarios, it is pretty common that you have trained a model for English and would like to extend it to support other languages like German. This setting is also known as cross-lingual transfer. One way to solve the problem is to apply a machine translation model to translate the sentences from the other language (e.g., German) to English and then apply the English model. However, as shown in `"Unsupervised Cross-lingual Representation Learning at Scale" <https://arxiv.org/abs/1911.02116>`__, there is a better and cheaper way for cross-lingual transfer, enabled via large-scale multilingual pretraining. The authors showed that, via large-scale pretraining, the backbone (called XLM-R) is able to conduct *zero-shot* cross-lingual transfer, meaning that you can directly apply a model trained on the English dataset to datasets in other languages. It also outperforms the baseline "TRANSLATE-TEST", i.e., translating the data from other languages into English and applying the English model.
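For intuition, here is a rough sketch of what the TRANSLATE-TEST baseline would look like. It assumes the ``Helsinki-NLP/opus-mt-de-en`` translation model from Huggingface/Transformers and a hypothetical helper named ``translate_test``; neither is part of this tutorial's actual workflow.

.. code:: python

    import pandas as pd
    from transformers import pipeline

    # Hypothetical TRANSLATE-TEST baseline: translate German reviews into
    # English, then score them with an English-only predictor. Illustration only.
    translator = pipeline('translation', model='Helsinki-NLP/opus-mt-de-en')

    def translate_test(english_predictor, de_df):
        outputs = translator(de_df['text'].tolist(), truncation=True)
        translated = pd.DataFrame({
            'label': de_df['label'].values,
            'text': [out['translation_text'] for out in outputs],
        })
        # Evaluate the English model on the machine-translated reviews.
        return english_predictor.evaluate(translated)

Note that this route pays the cost (and accumulates the errors) of a translation model at inference time, which is exactly what multilingual pretraining lets us avoid.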
In AutoGluon, you can just turn on ``presets="multilingual"`` in ``MultiModalPredictor`` to load a backbone that is suitable for zero-shot transfer. Internally, we will automatically use state-of-the-art models like `DeBERTa-V3 <https://arxiv.org/abs/2111.09543>`__.

.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label='label')
    predictor.fit(train_en_df,
                  presets='multilingual',
                  hyperparameters={
                      'optimization.max_epochs': 4
                  })

.. parsed-literal::
    :class: output

    Global seed set to 123
    Auto select gpus: [0]
    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name              | Type                         | Params
    -------------------------------------------------------------------
    0 | model             | HFAutoModelForTextPrediction | 278 M
    1 | validation_metric | AUROC                        | 0
    2 | loss_func         | CrossEntropyLoss             | 0
    -------------------------------------------------------------------
    278 M     Trainable params
    0         Non-trainable params
    278 M     Total params
    1,112.881 Total estimated model params size (MB)
    Epoch 0, global step 3: 'val_roc_auc' reached 0.52815 (best 0.52815), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=0-step=3.ckpt' as top 1
    Epoch 0, global step 7: 'val_roc_auc' reached 0.78618 (best 0.78618), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=0-step=7.ckpt' as top 1
    Epoch 1, global step 10: 'val_roc_auc' reached 0.81798 (best 0.81798), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=1-step=10.ckpt' as top 1
    Epoch 1, global step 14: 'val_roc_auc' reached 0.89959 (best 0.89959), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=1-step=14.ckpt' as top 1
    Epoch 2, global step 17: 'val_roc_auc' reached 0.94079 (best 0.94079), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=2-step=17.ckpt' as top 1
    Epoch 2, global step 21: 'val_roc_auc' reached 0.95080 (best 0.95080), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=2-step=21.ckpt' as top 1
    Epoch 3, global step 24: 'val_roc_auc' reached 0.95200 (best 0.95200), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=3-step=24.ckpt' as top 1
    Epoch 3, global step 28: 'val_roc_auc' reached 0.95300 (best 0.95300), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=3-step=28.ckpt' as top 1

.. code:: python

    score_in_en = predictor.evaluate(test_en_df)
    print('Score in the English Testset:')
    print(score_in_en)

.. parsed-literal::
    :class: output

    Score in the English Testset:
    {'roc_auc': 0.9305597427394232}

.. code:: python

    score_in_de = predictor.evaluate(test_de_df)
    print('Score in the German Testset:')
    print(score_in_de)

.. parsed-literal::
    :class: output

    Score in the German Testset:
    {'roc_auc': 0.9344951923076923}

We can see that the model works for both German and English!
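Since the multilingual backbone takes a while to finetune, it is worth persisting the predictor so it can be reused without retraining. A minimal sketch, using an arbitrary local directory name of our own choosing:

.. code:: python

    # Save the fitted predictor to disk; the directory name is arbitrary
    # (our own choice, not prescribed by AutoGluon).
    save_path = './multilingual_sentiment_predictor'
    predictor.save(save_path)

    # Load it back later and use it exactly like the original predictor.
    loaded_predictor = MultiModalPredictor.load(save_path)
    print(loaded_predictor.evaluate(test_de_df))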
Let's also inspect the model's performance on Japanese:

.. code:: python

    test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_jp_df.reset_index(inplace=True, drop=True)
    print(test_jp_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        1  原作はビクトル・ユーゴの長編小説だが、私が子供の頃読んだのは短縮版の「ああ無情」。それでもこ...
    1        1  ほかの作品のレビューにみんな書いているのに、何故この作品について書いている人が一人しかいない...
    2        0  一番の問題点は青島が出ていない事でしょう。 TV番組では『芸人が出ていればバラエティだから...
    3        0  昔、 りんたろう監督によるアニメ「カムイの剣」があった。 「カムイの剣」…を観た人なら本作...
    4        1  以前のアルバムを聴いていないのでなんとも言えないが、クラシックなメタルを聞いてきた耳には、と...
    ..     ...                                                ...
    195      0  原作が面白く、このDVDも期待して観ただけに非常にがっかりしました。 脚本としては単に格闘...
    196      0  フェードインやフェードアウトが多すぎます。
    197      0  流通形態云々については特に革命と言う気はしない。 これからもCDは普通に発売されるだろうし...
    198      1  もうTVとか、最近の映画とか、観なくていいよ。 脳に楽なエンターテイメントだから。 脳を...
    199      0  みんなさんは、手塚治虫先生の「1985への出発」という漫画を読んだことがありますでしょうか?...

    [200 rows x 2 columns]

.. code:: python

    print('Negative label ratio of the Japanese Testset =',
          test_jp_df['label'].value_counts()[0] / len(test_jp_df))
    score_in_jp = predictor.evaluate(test_jp_df)
    print('Score in the Japanese Testset:')
    print(score_in_jp)

.. parsed-literal::
    :class: output

    Negative label ratio of the Japanese Testset = 0.575
    Score in the Japanese Testset:
    {'roc_auc': 0.890230179028133}

Amazingly, the model also works on Japanese, even though it was only finetuned on English data!

Other Examples
--------------

You may go to `AutoMM Examples <https://github.com/awslabs/autogluon/tree/master/examples/automm>`__ to explore other examples about AutoMM.

Customization
-------------

To learn how to customize AutoMM, please refer to :ref:`sec_automm_customization`.