.. _sec_textprediction_multilingual:

Text Prediction - Solving Multilingual Problems
===============================================

People around the world speak many languages. According to SIL International’s Ethnologue: Languages of the World, there are more than **7,100** spoken and signed languages. Web data today is highly multilingual, and many real-world problems involve text written in languages other than English.

In this tutorial, we introduce how AutoGluon Text can help you build multilingual models. For the purpose of demonstration, we use the Cross-Lingual Amazon Product Review Sentiment dataset, which comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese. We will demonstrate how to use AutoGluon Text to build sentiment classification models on the German fold of this dataset in two ways:

- Finetune the German BERT
- Cross-lingual transfer from English to German

Load Dataset
------------

The Cross-Lingual Amazon Product Review Sentiment dataset contains Amazon product reviews in four languages. Here, we load the English and German folds of the dataset. In the label column, ``0`` means negative sentiment and ``1`` means positive sentiment.

.. code:: python

    !wget https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip
    !unzip -o amazon_review_sentiment_cross_lingual.zip -d .

.. parsed-literal::
    :class: output

    --2022-07-28 21:14:34-- https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip
    Resolving automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)... 52.217.231.73
    Connecting to automl-mm-bench.s3.amazonaws.com (automl-mm-bench.s3.amazonaws.com)|52.217.231.73|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 250619863 (239M) [application/zip]
    Saving to: ‘amazon_review_sentiment_cross_lingual.zip’

    amazon_review_senti 100%[===================>] 239.01M 41.8MB/s in 6.4s

    2022-07-28 21:14:40 (37.6 MB/s) - ‘amazon_review_sentiment_cross_lingual.zip’ saved [250619863/250619863]

    Archive: amazon_review_sentiment_cross_lingual.zip
       creating: ./amazon_review_sentiment_cross_lingual/
      inflating: ./amazon_review_sentiment_cross_lingual/fr_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/fr_unlabled.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/jp_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/de_unlabled.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/jp_unlabled.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_train.1000.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/jp_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/de_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/fr_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/de_train.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_test.tsv
      inflating: ./amazon_review_sentiment_cross_lingual/en_unlabled.tsv

.. code:: python

    import pandas as pd
    import warnings
    warnings.filterwarnings('ignore')

    train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_de_df.reset_index(inplace=True, drop=True)

    test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_de_df.reset_index(inplace=True, drop=True)
    print(train_de_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        0  Dieser Film, nur so triefend von Kitsch, ist h...
    1        0  Wie so oft: Das Buch begeistert, der Film entt...
    2        1  Schon immer versuchten Männer ihre Gefühle geg...
    3        1  Wenn man sich durch 10 Minuten Disney-Trailer ...
    4        1  Eine echt geile nummer zum Abtanzen und feiern...
    ..     ...                                                ...
    995      0  Ich dachte dies wäre ein richtig spannendes Bu...
    996      0  Wer sich den Schrott wirklich noch ansehen möc...
    997      0  Sicher, der Film greift ein aktuelles und hoch...
    998      1  Dieser Bildband lässt das Herz von Sarah Kay-F...
    999      1  ...so das war nun mein drittes Buch von Jenny-...

    [1000 rows x 2 columns]

.. code:: python

    train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',
                              sep='\t', header=None, names=['label', 'text']) \
                    .sample(1000, random_state=123)
    train_en_df.reset_index(inplace=True, drop=True)

    test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_en_df.reset_index(inplace=True, drop=True)
    print(train_en_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        0  This is a film that literally sees little wron...
    1        0  This music is pretty intelligent, but not very...
    2        0  One of the best pieces of rock ever recorded, ...
    3        0  Reading the posted reviews here, is like revis...
    4        1  I've just finished page 341, the last page. It...
    ..     ...                                                ...
    995      1  This album deserves to be (at least) as popula...
    996      1  This book, one of the few that takes a more ac...
    997      1  I loved it because it really did show Sagan th...
    998      1  Stuart Gordons "DAGON" is a unique horror gem ...
    999      0  I've heard Al Lee speak before and thought tha...

    [1000 rows x 2 columns]

Finetune the German BERT
------------------------

Our first approach is to finetune the German BERT model pretrained by deepset. Since AutoGluon Text integrates with Huggingface/Transformers (as explained in :ref:`sec_textprediction_customization`), we load the German BERT model directly via Huggingface/Transformers, using the key ``bert-base-german-cased``. To keep the experiment short, we finetune for only 4 epochs.

.. code:: python

    from autogluon.text import TextPredictor

    predictor = TextPredictor(label='label')
    predictor.fit(train_de_df,
                  hyperparameters={
                      'model.hf_text.checkpoint_name': 'bert-base-german-cased',
                      'optimization.max_epochs': 4
                  })

.. parsed-literal::
    :class: output

    Global seed set to 123

.. parsed-literal::
    :class: output

    Downloading: 0%| | 0.00/29.0 [00:00

.. code:: python

    score = predictor.evaluate(test_de_df)
    print('Score on the German Testset:')
    print(score)

.. parsed-literal::
    :class: output

    Score on the German Testset:
    {'roc_auc': 0.9416065705128206}

.. code:: python

    score = predictor.evaluate(test_en_df)
    print('Score on the English Testset:')
    print(score)

.. parsed-literal::
    :class: output

    Score on the English Testset:
    {'roc_auc': 0.5831574716108934}

The model achieves good performance on the German dataset (ROC AUC of about 0.94) but performs poorly on the English dataset (about 0.58, not far above the 0.5 of random guessing). Next, we will show how to enable cross-lingual transfer so you can get a model that works for **both German and English**.

Cross-lingual Transfer
----------------------

In real-world scenarios, it is common to have trained a model for English and to want to extend it to support other languages such as German. This setting is also known as cross-lingual transfer. One way to solve the problem is to apply a machine translation model to translate the sentences from the other language (e.g., German) to English and then apply the English model. However, as shown in “Unsupervised Cross-lingual Representation Learning at Scale”, there is a better and more cost-effective way to do cross-lingual transfer, enabled by large-scale multilingual pretraining. The authors showed that via large-scale pretraining, the backbone (called XLM-R) is able to conduct *zero-shot* cross-lingual transfer, meaning that you can directly apply the model trained on the English dataset to datasets in other languages.
It also outperforms the “TRANSLATE-TEST” baseline, which translates the data from other languages to English and applies the English model.

In AutoGluon, you can just turn on ``presets="multilingual"`` to load a backbone that is suitable for zero-shot transfer. Internally, we will automatically use state-of-the-art models like DeBERTa-V3.

.. code:: python

    from autogluon.text import TextPredictor

    predictor = TextPredictor(label='label')
    predictor.fit(train_en_df,
                  presets='multilingual',
                  hyperparameters={
                      'optimization.max_epochs': 4
                  })

.. parsed-literal::
    :class: output

    Global seed set to 123

.. parsed-literal::
    :class: output

    Downloading: 0%| | 0.00/52.0 [00:00

.. code:: python

    score_in_en = predictor.evaluate(test_en_df)
    print('Score in the English Testset:')
    print(score_in_en)

.. parsed-literal::
    :class: output

    Score in the English Testset:
    {'roc_auc': 0.931263189629183}

.. code:: python

    score_in_de = predictor.evaluate(test_de_df)
    print('Score in the German Testset:')
    print(score_in_de)

.. parsed-literal::
    :class: output

    Score in the German Testset:
    {'roc_auc': 0.9345953525641025}

We can see that the model works for both German and English, even though it was trained only on English reviews!

Let’s also inspect the model’s performance on Japanese:

.. code:: python

    test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',
                             sep='\t', header=None, names=['label', 'text']) \
                   .sample(200, random_state=123)
    test_jp_df.reset_index(inplace=True, drop=True)
    print(test_jp_df)

.. parsed-literal::
    :class: output

         label                                               text
    0        1  原作はビクトル・ユーゴの長編小説だが、私が子供の頃読んだのは短縮版の「ああ無情」。それでもこ...
    1        1  ほかの作品のレビューにみんな書いているのに、何故この作品について書いている人が一人しかいない...
    2        0  一番の問題点は青島が出ていない事でしょう。 TV番組では『芸人が出ていればバラエティだから...
    3        0  昔、 りんたろう監督によるアニメ「カムイの剣」があった。 「カムイの剣」…を観た人なら本作...
    4        1  以前のアルバムを聴いていないのでなんとも言えないが、クラシックなメタルを聞いてきた耳には、と...
    ..     ...                                                ...
    195      0  原作が面白く、このDVDも期待して観ただけに非常にがっかりしました。 脚本としては単に格闘...
    196      0  フェードインやフェードアウトが多すぎます。
    197      0  流通形態云々については特に革命と言う気はしない。 これからもCDは普通に発売されるだろうし...
    198      1  もうTVとか、最近の映画とか、観なくていいよ。 脳に楽なエンターテイメントだから。 脳を...
    199      0  みんなさんは、手塚治虫先生の「1985への出発」という漫画を読んだことがありますでしょうか?...

    [200 rows x 2 columns]

.. code:: python

    print('Negative label ratio of the Japanese Testset=', test_jp_df['label'].value_counts()[0] / len(test_jp_df))
    score_in_jp = predictor.evaluate(test_jp_df)
    print('Score in the Japanese Testset:')
    print(score_in_jp)

.. parsed-literal::
    :class: output

    Negative label ratio of the Japanese Testset= 0.575
    Score in the Japanese Testset:
    {'roc_auc': 0.8901278772378517}

Amazingly, the model also works for Japanese!
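
The three scores reported for the multilingual model can be compared side by side. The following is a minimal sketch and not part of AutoGluon: ``evaluate_by_language`` and ``transfer_gaps`` are hypothetical helper names introduced here for illustration, and the only assumption about the predictor is that ``predictor.evaluate(df)`` returns a dictionary with a ``roc_auc`` entry, as in the outputs above. The hard-coded scores are the values printed earlier in this tutorial.

.. code:: python

    from typing import Any, Dict, Mapping


    def evaluate_by_language(predictor: Any, test_sets: Mapping[str, Any]) -> Dict[str, float]:
        """Evaluate one predictor on several test sets; return ROC AUC per language.

        Assumes predictor.evaluate(df) returns a dict such as {'roc_auc': 0.93}.
        """
        return {lang: predictor.evaluate(df)['roc_auc'] for lang, df in test_sets.items()}


    def transfer_gaps(scores: Mapping[str, float], source_lang: str) -> Dict[str, float]:
        """Return (target score - source score) for every non-source language."""
        base = scores[source_lang]
        return {lang: s - base for lang, s in scores.items() if lang != source_lang}


    if __name__ == "__main__":
        # Scores copied from the outputs above (model trained on English only).
        scores = {
            'en': 0.931263189629183,
            'de': 0.9345953525641025,
            'jp': 0.8901278772378517,
        }
        for lang, gap in transfer_gaps(scores, source_lang='en').items():
            print(f"{lang}: roc_auc={scores[lang]:.4f} ({gap:+.4f} vs. en)")

In the tutorial's session, the same ``scores`` dictionary could instead be built with ``evaluate_by_language(predictor, {'en': test_en_df, 'de': test_de_df, 'jp': test_jp_df})``. Keep in mind that each test set here is a 200-review sample, so small differences between languages may not be significant.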