.. _sec_automm_textprediction_multilingual:

AutoMM for Text - Multilingual Problems
=======================================

People around the world speak many languages. According to `SIL International <https://www.sil.org/>`__'s `Ethnologue: Languages of the World <https://www.ethnologue.com/>`__, there are more than **7,100** spoken and signed languages. In fact, web data nowadays is highly multilingual, and many real-world problems involve text written in languages other than English.

In this tutorial, we introduce how ``MultiModalPredictor`` can help you build multilingual models. For the purpose of demonstration, we use the `Cross-Lingual Amazon Product Review Sentiment <https://webis.de/data/webis-cls-10.html>`__ dataset, which comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese. We will demonstrate how to use ``MultiModalPredictor`` to build sentiment classification models on the German fold of this dataset in two ways:

- Finetune the German BERT
- Cross-lingual transfer from English to German

Load Dataset
------------

The `Cross-Lingual Amazon Product Review Sentiment <https://webis.de/data/webis-cls-10.html>`__ dataset contains Amazon product reviews in four languages. Here, we load the English and German folds of the dataset. In the label column, ``0`` means negative sentiment and ``1`` means positive sentiment.

.. code:: python

    !wget https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
    !unzip -o amazon_review_sentiment_cross_lingual.zip -d .

.. parsed-literal::
    :class: output

    --2022-07-19 00:00:17--  https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip
    Resolving automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)... 52.216.105.27
    Connecting to automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)|52.216.105.27|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 250619863 (239M) [application/zip]
    Saving to: ‘amazon_review_sentiment_cross_lingual.zip’

    amazon_review_senti 100%[===================>] 239.01M  22.4MB/s    in 11s

    2022-07-19 00:00:29 (21.5 MB/s) - ‘amazon_review_sentiment_cross_lingual.zip’ saved [250619863/250619863]

    Archive:  amazon_review_sentiment_cross_lingual.zip
       creating: ./amazon_review_sentiment_cross_lingual/
      inflating: ./amazon_review_sentiment_cross_lingual/fr_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/fr_unlabled.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/jp_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/de_unlabled.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/jp_unlabled.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_train.1000.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/jp_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/de_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/fr_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/de_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_unlabled.tsv
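Before subsampling the folds below, we can optionally sanity-check one of the raw files. The short snippet below is our own addition: it assumes only the extracted file layout shown above, and verifies the two-column, tab-separated ``(label, text)`` format and the label balance of the full German training fold.

.. code:: python

    import pandas as pd

    # Optional sanity check (our own snippet, not required for the tutorial):
    # confirm the tab-separated (label, text) layout and inspect the label
    # balance of the raw German training fold.
    de_raw = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                         sep='\t', header=None, names=['label', 'text'])
    print(de_raw.shape)
    print(de_raw['label'].value_counts())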
.. code:: python

    import pandas as pd
    import warnings
    warnings.filterwarnings('ignore')

    train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_de_df.reset_index(inplace=True, drop=True)

    test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_de_df.reset_index(inplace=True, drop=True)
    print(train_de_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        0  Dieser Film, nur so triefend von Kitsch, ist h...
    1        0  Wie so oft: Das Buch begeistert, der Film entt...
    2        1  Schon immer versuchten Männer ihre Gefühle geg...
    3        1  Wenn man sich durch 10 Minuten Disney-Trailer ...
    4        1  Eine echt geile nummer zum Abtanzen und feiern...
    ..     ...                                                ...
    995      0  Ich dachte dies wäre ein richtig spannendes Bu...
    996      0  Wer sich den Schrott wirklich noch ansehen möc...
    997      0  Sicher, der Film greift ein aktuelles und hoch...
    998      1  Dieser Bildband lässt das Herz von Sarah Kay-F...
    999      1  ...so das war nun mein drittes Buch von Jenny-...

    [1000 rows x 2 columns]

.. code:: python

    train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_en_df.reset_index(inplace=True, drop=True)

    test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_en_df.reset_index(inplace=True, drop=True)
    print(train_en_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        0  This is a film that literally sees little wron...
    1        0  This music is pretty intelligent, but not very...
    2        0  One of the best pieces of rock ever recorded, ...
    3        0  Reading the posted reviews here, is like revis...
    4        1  I've just finished page 341, the last page. It...
    ..     ...                                                ...
    995      1  This album deserves to be (at least) as popula...
    996      1  This book, one of the few that takes a more ac...
    997      1  I loved it because it really did show Sagan th...
    998      1  Stuart Gordons "DAGON" is a unique horror gem ...
    999      0  I've heard Al Lee speak before and thought tha...

    [1000 rows x 2 columns]

Finetune the German BERT
------------------------

Our first approach is to finetune the `German BERT model <https://www.deepset.ai/german-bert>`__ pretrained by deepset. Since ``MultiModalPredictor`` integrates with `Huggingface/Transformers <https://github.com/huggingface/transformers>`__ (as explained in :ref:`sec_textprediction_customization`), we can directly load the German BERT model from Huggingface/Transformers using the key `bert-base-german-cased <https://huggingface.co/bert-base-german-cased>`__. To simplify the experiment, we finetune for just 4 epochs.

.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label='label')
    predictor.fit(train_de_df,
                  hyperparameters={
                      'model.hf_text.checkpoint_name': 'bert-base-german-cased',
                      'optimization.max_epochs': 4
                  })

.. parsed-literal::
    :class: output

    Global seed set to 123
    [model download and training progress output omitted]

.. code:: python

    score = predictor.evaluate(test_de_df)
    print('Score on the German Testset:')
    print(score)

.. parsed-literal::
    :class: output

    Score on the German Testset:
    {'roc_auc': 0.9416065705128206}
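As a quick check of the finetuned model, we can also run inference on a couple of hand-written German reviews. This is a minimal sketch: the two sentences below are made-up examples, not drawn from the dataset.

.. code:: python

    import pandas as pd

    # Made-up German reviews (hypothetical examples) for a quick smoke test.
    example_df = pd.DataFrame({
        'text': [
            'Das Buch hat mir sehr gut gefallen, absolut empfehlenswert!',  # positive tone
            'Leider eine totale Enttäuschung, reine Zeitverschwendung.',    # negative tone
        ]
    })

    # predict() returns hard labels; predict_proba() returns class probabilities.
    print(predictor.predict(example_df))
    print(predictor.predict_proba(example_df))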
Next, let's check how this German-only model performs on the English test set:

.. code:: python

    score = predictor.evaluate(test_en_df)
    print('Score on the English Testset:')
    print(score)

.. parsed-literal::
    :class: output

    Score on the English Testset:
    {'roc_auc': 0.5831574716108934}

We can see that the model achieves good performance on the German dataset but performs poorly on the English dataset. Next, we will show how to enable cross-lingual transfer so you can get a model that magically works for **both German and English**.

Cross-lingual Transfer
----------------------

In real-world scenarios, it is pretty common that you have trained a model for English and would like to extend it to support other languages like German. This setting is also known as cross-lingual transfer. One way to solve the problem is to apply a machine translation model to translate the sentences from the other language (e.g., German) to English and then apply the English model. However, as shown in `"Unsupervised Cross-lingual Representation Learning at Scale" <https://arxiv.org/abs/1911.02116>`__, there is a better and cheaper way for cross-lingual transfer, enabled via large-scale multilingual pretraining. The authors showed that, via large-scale pretraining, the backbone (called XLM-R) is able to conduct *zero-shot* cross-lingual transfer, meaning that you can directly apply a model trained on the English dataset to datasets in other languages. It also outperforms the baseline "TRANSLATE-TEST", i.e., translating the data from other languages into English and applying the English model.
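For intuition, here is a rough sketch of what the TRANSLATE-TEST baseline would look like. It assumes the ``Helsinki-NLP/opus-mt-de-en`` translation model from Huggingface/Transformers and a hypothetical helper named ``translate_test``; neither is part of this tutorial's actual workflow.

.. code:: python

    import pandas as pd
    from transformers import pipeline

    # Hypothetical TRANSLATE-TEST baseline: translate German reviews into
    # English, then score them with an English-only predictor. Illustration only.
    translator = pipeline('translation', model='Helsinki-NLP/opus-mt-de-en')

    def translate_test(english_predictor, de_df):
        outputs = translator(de_df['text'].tolist(), truncation=True)
        translated = pd.DataFrame({
            'label': de_df['label'].values,
            'text': [out['translation_text'] for out in outputs],
        })
        # Evaluate the English model on the machine-translated reviews.
        return english_predictor.evaluate(translated)

Note that this route pays the cost (and accumulates the errors) of a translation model at inference time, which is exactly what multilingual pretraining lets us avoid.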
In AutoGluon, you can just turn on ``presets="multilingual"`` in ``MultiModalPredictor`` to load a backbone that is suitable for zero-shot transfer. Internally, we will automatically use state-of-the-art models like `DeBERTa-V3 <https://arxiv.org/abs/2111.09543>`__.

.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label='label')
    predictor.fit(train_en_df,
                  presets='multilingual',
                  hyperparameters={
                      'optimization.max_epochs': 4
                  })

.. parsed-literal::
    :class: output

    Global seed set to 123
    Auto select gpus: [0]
    GPU available: True, used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name              | Type                         | Params
    -------------------------------------------------------------------
    0 | model             | HFAutoModelForTextPrediction | 278 M
    1 | validation_metric | AUROC                        | 0
    2 | loss_func         | CrossEntropyLoss             | 0
    -------------------------------------------------------------------
    278 M     Trainable params
    0         Non-trainable params
    278 M     Total params
    1,112.881 Total estimated model params size (MB)
    Epoch 0, global step 3: 'val_roc_auc' reached 0.52815 (best 0.52815), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=0-step=3.ckpt' as top 1
    Epoch 0, global step 7: 'val_roc_auc' reached 0.78618 (best 0.78618), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=0-step=7.ckpt' as top 1
    Epoch 1, global step 10: 'val_roc_auc' reached 0.81798 (best 0.81798), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=1-step=10.ckpt' as top 1
    Epoch 1, global step 14: 'val_roc_auc' reached 0.89959 (best 0.89959), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=1-step=14.ckpt' as top 1
    Epoch 2, global step 17: 'val_roc_auc' reached 0.94079 (best 0.94079), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=2-step=17.ckpt' as top 1
    Epoch 2, global step 21: 'val_roc_auc' reached 0.95080 (best 0.95080), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=2-step=21.ckpt' as top 1
    Epoch 3, global step 24: 'val_roc_auc' reached 0.95200 (best 0.95200), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=3-step=24.ckpt' as top 1
    Epoch 3, global step 28: 'val_roc_auc' reached 0.95300 (best 0.95300), saving model to '/var/lib/jenkins/workspace/workspace/autogluon-tutorial-multimodal-v3/docs/_build/eval/tutorials/multimodal/AutogluonModels/ag-20220719_000402/epoch=3-step=28.ckpt' as top 1

.. code:: python

    score_in_en = predictor.evaluate(test_en_df)
    print('Score in the English Testset:')
    print(score_in_en)

.. parsed-literal::
    :class: output

    Score in the English Testset:
    {'roc_auc': 0.9305597427394232}

.. code:: python

    score_in_de = predictor.evaluate(test_de_df)
    print('Score in the German Testset:')
    print(score_in_de)

.. parsed-literal::
    :class: output

    Score in the German Testset:
    {'roc_auc': 0.9344951923076923}

We can see that the model works for both German and English!
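Since the multilingual backbone takes a while to finetune, it is worth persisting the predictor so it can be reused without retraining. A minimal sketch, using an arbitrary local directory name of our own choosing:

.. code:: python

    # Save the fitted predictor to disk; the directory name is arbitrary
    # (our own choice, not prescribed by AutoGluon).
    save_path = './multilingual_sentiment_predictor'
    predictor.save(save_path)

    # Load it back later and use it exactly like the original predictor.
    loaded_predictor = MultiModalPredictor.load(save_path)
    print(loaded_predictor.evaluate(test_de_df))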
Let's also inspect the model's performance on Japanese:

.. code:: python

    test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_jp_df.reset_index(inplace=True, drop=True)
    print(test_jp_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        1  原作はビクトル・ユーゴの長編小説だが、私が子供の頃読んだのは短縮版の「ああ無情」。それでもこ...
    1        1  ほかの作品のレビューにみんな書いているのに、何故この作品について書いている人が一人しかいない...
    2        0  一番の問題点は青島が出ていない事でしょう。 TV番組では『芸人が出ていればバラエティだから...
    3        0  昔、 りんたろう監督によるアニメ「カムイの剣」があった。 「カムイの剣」…を観た人なら本作...
    4        1  以前のアルバムを聴いていないのでなんとも言えないが、クラシックなメタルを聞いてきた耳には、と...
    ..     ...                                                ...
    195      0  原作が面白く、このDVDも期待して観ただけに非常にがっかりしました。 脚本としては単に格闘...
    196      0  フェードインやフェードアウトが多すぎます。
    197      0  流通形態云々については特に革命と言う気はしない。 これからもCDは普通に発売されるだろうし...
    198      1  もうTVとか、最近の映画とか、観なくていいよ。 脳に楽なエンターテイメントだから。 脳を...
    199      0  みんなさんは、手塚治虫先生の「1985への出発」という漫画を読んだことがありますでしょうか?...

    [200 rows x 2 columns]

.. code:: python

    print('Negative label ratio of the Japanese Testset =',
          test_jp_df['label'].value_counts()[0] / len(test_jp_df))
    score_in_jp = predictor.evaluate(test_jp_df)
    print('Score in the Japanese Testset:')
    print(score_in_jp)

.. parsed-literal::
    :class: output

    Negative label ratio of the Japanese Testset = 0.575
    Score in the Japanese Testset:
    {'roc_auc': 0.890230179028133}

Amazingly, the model also works on Japanese, even though it was only finetuned on English data!

Other Examples
--------------

You may go to `AutoMM Examples <https://github.com/awslabs/autogluon/tree/master/examples/automm>`__ to explore other examples about AutoMM.

Customization
-------------

To learn how to customize AutoMM, please refer to :ref:`sec_automm_customization`.