.. _sec_automm_textprediction_multilingual:

AutoMM for Text - Multilingual Problems
=======================================

People around the world speak many languages. According to SIL International's *Ethnologue: Languages of the World*, there are more than **7,100** spoken and signed languages. In fact, web data nowadays is highly multilingual, and many real-world problems involve text written in languages other than English.

In this tutorial, we introduce how ``MultiModalPredictor`` can help you build multilingual models. For the purpose of demonstration, we use the Cross-Lingual Amazon Product Review Sentiment dataset, which comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese. We will demonstrate how to build sentiment classification models on the German fold of this dataset in two ways:

- Finetune the German BERT
- Cross-lingual transfer from English to German

*Note:* We recommend also checking :ref:`sec_automm_efficient_finetuning_basic` to learn how to achieve better performance via parameter-efficient finetuning.

Load Dataset
------------

The Cross-Lingual Amazon Product Review Sentiment dataset contains Amazon product reviews in four languages. Here, we load the English and German folds of the dataset. In the label column, ``0`` means negative sentiment and ``1`` means positive sentiment.

.. code:: python

    !wget --quiet https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
    !unzip -q -o amazon_review_sentiment_cross_lingual.zip -d .

.. code:: python

    import pandas as pd
    import warnings
    warnings.filterwarnings('ignore')

    train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_de_df.reset_index(inplace=True, drop=True)

    test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_de_df.reset_index(inplace=True, drop=True)
    print(train_de_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        0  Dieser Film, nur so triefend von Kitsch, ist h...
    1        0  Wie so oft: Das Buch begeistert, der Film entt...
    2        1  Schon immer versuchten Männer ihre Gefühle geg...
    3        1  Wenn man sich durch 10 Minuten Disney-Trailer ...
    4        1  Eine echt geile nummer zum Abtanzen und feiern...
    ..     ...                                                ...
    995      0  Ich dachte dies wäre ein richtig spannendes Bu...
    996      0  Wer sich den Schrott wirklich noch ansehen möc...
    997      0  Sicher, der Film greift ein aktuelles und hoch...
    998      1  Dieser Bildband lässt das Herz von Sarah Kay-F...
    999      1  ...so das war nun mein drittes Buch von Jenny-...

    [1000 rows x 2 columns]

.. code:: python

    train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_en_df.reset_index(inplace=True, drop=True)

    test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_en_df.reset_index(inplace=True, drop=True)
    print(train_en_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        0  This is a film that literally sees little wron...
    1        0  This music is pretty intelligent, but not very...
    2        0  One of the best pieces of rock ever recorded, ...
    3        0  Reading the posted reviews here, is like revis...
    4        1  I've just finished page 341, the last page. It...
    ..     ...                                                ...
    995      1  This album deserves to be (at least) as popula...
    996      1  This book, one of the few that takes a more ac...
    997      1  I loved it because it really did show Sagan th...
    998      1  Stuart Gordons "DAGON" is a unique horror gem ...
    999      0  I've heard Al Lee speak before and thought tha...

    [1000 rows x 2 columns]
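Before training, it can be useful to sanity-check the class balance of each fold, since we will evaluate with ROC AUC. The quick check below uses plain pandas and is an optional addition, not part of the original workflow:

.. code:: python

    # Optional sanity check (plain pandas): fraction of each label per fold.
    # ROC AUC is rank-based, but knowing the class balance helps interpret scores.
    print(train_de_df['label'].value_counts(normalize=True))
    print(train_en_df['label'].value_counts(normalize=True))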
Finetune the German BERT
------------------------

Our first approach is to finetune the German BERT model pretrained by deepset. Since ``MultiModalPredictor`` integrates with Huggingface/Transformers (as explained in :ref:`sec_textprediction_customization`), we can directly load the German BERT model available in Huggingface/Transformers under the key ``bert-base-german-cased``. To keep the experiment short, we finetune for just 2 epochs.

.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label='label')
    predictor.fit(train_de_df,
                  hyperparameters={
                      'model.hf_text.checkpoint_name': 'bert-base-german-cased',
                      'optimization.max_epochs': 2
                  })

.. parsed-literal::
    :class: output

    Global seed set to 123
    No path specified. Models will be saved in: "AutogluonModels/ag-20230222_233527/"
    AutoMM starts to create your model. ✨

    - Model will be saved to "/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233527".
    - Validation metric is "roc_auc".
    - To track the learning progress, you can open a terminal and launch Tensorboard:
        ```shell
        # Assume you have installed tensorboard
        tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233527
        ```

    Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

    Using 16bit None Automatic Mixed Precision (AMP)
    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name              | Type                         | Params
    -------------------------------------------------------------------
    0 | model             | HFAutoModelForTextPrediction | 109 M
    1 | validation_metric | AUROC                        | 0
    2 | loss_func         | CrossEntropyLoss             | 0
    -------------------------------------------------------------------
    109 M     Trainable params
    0         Non-trainable params
    109 M     Total params
    218.166   Total estimated model params size (MB)
    Epoch 0, global step 3: 'val_roc_auc' reached 0.66800 (best 0.66800), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233527/epoch=0-step=3.ckpt' as top 3
    Epoch 0, global step 7: 'val_roc_auc' reached 0.61966 (best 0.66800), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233527/epoch=0-step=7.ckpt' as top 3
    Epoch 1, global step 10: 'val_roc_auc' reached 0.78260 (best 0.78260), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233527/epoch=1-step=10.ckpt' as top 3
    Epoch 1, global step 14: 'val_roc_auc' reached 0.78631 (best 0.78631), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233527/epoch=1-step=14.ckpt' as top 3
    `Trainer.fit` stopped: `max_epochs=2` reached.
    Start to fuse 3 checkpoints via the greedy soup algorithm.
    AutoMM has created your model 🎉🎉🎉

    - To load the model, use the code below:
        ```python
        from autogluon.multimodal import MultiModalPredictor
        predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233527")
        ```
    - You can open a terminal and launch Tensorboard to visualize the training log:
        ```shell
        # Assume you have installed tensorboard
        tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233527
        ```
    - If you are not satisfied with the model, try to increase the training time, adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html), or post issues on GitHub: https://github.com/autogluon/autogluon

.. code:: python

    score = predictor.evaluate(test_de_df)
    print('Score on the German Testset:')
    print(score)

.. parsed-literal::
    :class: output

    Score on the German Testset:
    {'roc_auc': 0.7999799679487181}

.. code:: python

    score = predictor.evaluate(test_en_df)
    print('Score on the English Testset:')
    print(score)

.. parsed-literal::
    :class: output

    Score on the English Testset:
    {'roc_auc': 0.43256959099587977}
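The aggregate scores already show the gap, but you can also probe the finetuned predictor on raw sentences. The sketch below uses made-up example sentences (not from the dataset); ``predict`` expects a DataFrame with the same ``text`` column used during training:

.. code:: python

    # Hypothetical example sentences for illustration: a positive German review
    # and its English translation. The German-only model should be confident on
    # the first but is unreliable on the second.
    examples = pd.DataFrame({'text': [
        'Dieses Buch ist wunderbar, ich habe es an einem Tag gelesen.',
        'This book is wonderful, I read it in one day.',
    ]})
    print(predictor.predict(examples))        # predicted labels (0/1)
    print(predictor.predict_proba(examples))  # class probabilities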
We find that the model achieves good performance on the German dataset but performs poorly on the English one. Next, we will show how to enable cross-lingual transfer so you can get a model that can magically work for **both German and English**.

Cross-lingual Transfer
----------------------

In real-world scenarios, it is quite common to have trained a model for English and to want to extend it to other languages like German. This setting is also known as cross-lingual transfer. One way to solve the problem is to apply a machine translation model that translates sentences from the other language (e.g., German) into English and then apply the English model. However, as shown in "Unsupervised Cross-lingual Representation Learning at Scale", there is a better and more cost-friendly way to achieve cross-lingual transfer, enabled via large-scale multilingual pretraining. The authors showed that, via large-scale pretraining, the backbone (called XLM-R) is able to conduct *zero-shot* cross-lingual transfer, meaning that you can directly apply a model trained on an English dataset to datasets in other languages. It also outperforms the "TRANSLATE-TEST" baseline, i.e., translating the data from other languages into English and applying the English model.

In AutoGluon, you can just turn on ``presets="multilingual"`` in ``MultiModalPredictor`` to load a backbone that is suitable for zero-shot transfer. Internally, we will automatically use state-of-the-art models like DeBERTa-V3.
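If you prefer to pin the backbone explicitly rather than rely on the preset, you can pass a multilingual checkpoint through the same ``model.hf_text.checkpoint_name`` hyperparameter we used above. The sketch below assumes ``microsoft/mdeberta-v3-base`` (a multilingual DeBERTa-V3 checkpoint on Huggingface) as the backbone; the preset may select a different checkpoint or additional settings depending on your AutoGluon version:

.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    # Sketch: configure a multilingual backbone by hand. The checkpoint name is
    # an assumption for illustration; presets='multilingual' may also adjust
    # other options (e.g., optimization settings) beyond the backbone choice.
    predictor_explicit = MultiModalPredictor(label='label')
    predictor_explicit.fit(train_en_df,
                           hyperparameters={
                               'model.hf_text.checkpoint_name': 'microsoft/mdeberta-v3-base',
                               'optimization.max_epochs': 2,
                           })

In the rest of this tutorial, we simply use the preset: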
.. code:: python

    from autogluon.multimodal import MultiModalPredictor

    predictor = MultiModalPredictor(label='label')
    predictor.fit(train_en_df,
                  presets='multilingual',
                  hyperparameters={
                      'optimization.max_epochs': 2
                  })

.. parsed-literal::
    :class: output

    Global seed set to 123
    No path specified. Models will be saved in: "AutogluonModels/ag-20230222_233716/"
    AutoMM starts to create your model. ✨

    - Model will be saved to "/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233716".
    - Validation metric is "roc_auc".
    - To track the learning progress, you can open a terminal and launch Tensorboard:
        ```shell
        # Assume you have installed tensorboard
        tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233716
        ```

    Enjoy your coffee, and let AutoMM do the job ☕☕☕ Learn more at https://auto.gluon.ai

    GPU available: True (cuda), used: True
    TPU available: False, using: 0 TPU cores
    IPU available: False, using: 0 IPUs
    HPU available: False, using: 0 HPUs
    LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

      | Name              | Type                         | Params
    -------------------------------------------------------------------
    0 | model             | HFAutoModelForTextPrediction | 278 M
    1 | validation_metric | AUROC                        | 0
    2 | loss_func         | CrossEntropyLoss             | 0
    -------------------------------------------------------------------
    278 M     Trainable params
    0         Non-trainable params
    278 M     Total params
    1,112.881 Total estimated model params size (MB)
    Epoch 0, global step 3: 'val_roc_auc' reached 0.53635 (best 0.53635), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233716/epoch=0-step=3.ckpt' as top 1
    Epoch 0, global step 7: 'val_roc_auc' reached 0.66387 (best 0.66387), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233716/epoch=0-step=7.ckpt' as top 1
    Epoch 1, global step 10: 'val_roc_auc' reached 0.69297 (best 0.69297), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233716/epoch=1-step=10.ckpt' as top 1
    Epoch 1, global step 14: 'val_roc_auc' reached 0.70957 (best 0.70957), saving model to '/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233716/epoch=1-step=14.ckpt' as top 1
    `Trainer.fit` stopped: `max_epochs=2` reached.
    AutoMM has created your model 🎉🎉🎉

    - To load the model, use the code below:
        ```python
        from autogluon.multimodal import MultiModalPredictor
        predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233716")
        ```
    - You can open a terminal and launch Tensorboard to visualize the training log:
        ```shell
        # Assume you have installed tensorboard
        tensorboard --logdir /home/ci/autogluon/docs/_build/eval/tutorials/multimodal/text_prediction/AutogluonModels/ag-20230222_233716
        ```
    - If you are not satisfied with the model, try to increase the training time, adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html), or post issues on GitHub: https://github.com/autogluon/autogluon

.. code:: python

    score_in_en = predictor.evaluate(test_en_df)
    print('Score in the English Testset:')
    print(score_in_en)

.. parsed-literal::
    :class: output

    Score in the English Testset:
    {'roc_auc': 0.6678725756205406}

.. code:: python

    score_in_de = predictor.evaluate(test_de_df)
    print('Score in the German Testset:')
    print(score_in_de)

.. parsed-literal::
    :class: output

    Score in the German Testset:
    {'roc_auc': 0.6973157051282051}

We can see that the model works for both German and English! Let's also inspect the model's performance on Japanese:

.. code:: python

    test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_jp_df.reset_index(inplace=True, drop=True)
    print(test_jp_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        1  原作はビクトル・ユーゴの長編小説だが、私が子供の頃読んだのは短縮版の「ああ無情」。それでもこ...
    1        1  ほかの作品のレビューにみんな書いているのに、何故この作品について書いている人が一人しかいない...
    2        0  一番の問題点は青島が出ていない事でしょう。 TV番組では『芸人が出ていればバラエティだから...
    3        0  昔、 りんたろう監督によるアニメ「カムイの剣」があった。 「カムイの剣」…を観た人なら本作...
    4        1  以前のアルバムを聴いていないのでなんとも言えないが、クラシックなメタルを聞いてきた耳には、と...
    ..     ...                                                ...
    195      0  原作が面白く、このDVDも期待して観ただけに非常にがっかりしました。 脚本としては単に格闘...
    196      0  フェードインやフェードアウトが多すぎます。
    197      0  流通形態云々については特に革命と言う気はしない。 これからもCDは普通に発売されるだろうし...
    198      1  もうTVとか、最近の映画とか、観なくていいよ。 脳に楽なエンターテイメントだから。 脳を...
    199      0  みんなさんは、手塚治虫先生の「1985への出発」という漫画を読んだことがありますでしょうか?...

    [200 rows x 2 columns]

.. code:: python

    print('Negative label ratio of the Japanese Testset=',
          test_jp_df['label'].value_counts()[0] / len(test_jp_df))
    score_in_jp = predictor.evaluate(test_jp_df)
    print('Score in the Japanese Testset:')
    print(score_in_jp)

.. parsed-literal::
    :class: output

    Negative label ratio of the Japanese Testset= 0.575
    Score in the Japanese Testset:
    {'roc_auc': 0.7899744245524297}

Amazingly, the model also works for Japanese!
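As a final qualitative check, you can run zero-shot predictions on raw text in all three languages; recall that this predictor was finetuned on English data only. A minimal sketch, using made-up example sentences:

.. code:: python

    # Zero-shot check: the multilingual backbone lets the English-finetuned
    # predictor score German and Japanese text directly.
    zero_shot_examples = pd.DataFrame({'text': [
        'What a fantastic album, I keep replaying it.',  # English
        'Was für ein fantastisches Album!',              # German
        'この映画は本当に素晴らしかったです。',          # Japanese ("This movie was truly wonderful.")
    ]})
    print(predictor.predict(zero_shot_examples))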
Other Examples
--------------

You may go to AutoMM Examples in the AutoGluon GitHub repository (https://github.com/autogluon/autogluon) to explore other examples about AutoMM.

Customization
-------------

To learn how to customize AutoMM, please refer to :ref:`sec_automm_customization`.