{ "cells": [ { "cell_type": "markdown", "id": "b03a4ef8", "metadata": {}, "source": [ "# AutoMM for Text - Multilingual Problems\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/text_prediction/multilingual_text.ipynb)\n", "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/text_prediction/multilingual_text.ipynb)\n", "\n", "\n", "\n", "People around the world speaks lots of languages. According to [SIL International](https://en.wikipedia.org/wiki/SIL_International)'s [Ethnologue: Languages of the World](https://en.wikipedia.org/wiki/Ethnologue), \n", "there are more than **7,100** spoken and signed languages. In fact, web data nowadays are highly multilingual and lots of \n", "real-world problems involve text written in languages other than English.\n", "\n", "In this tutorial, we introduce how `MultiModalPredictor` can help you build multilingual models. For the purpose of demonstration, \n", "we use the [Cross-Lingual Amazon Product Review Sentiment](https://webis.de/data/webis-cls-10.html) dataset, which \n", "comprises about 800,000 Amazon product reviews in four languages: English, German, French, and Japanese. \n", "We will demonstrate how to use AutoGluon Text to build sentiment classification models on the German fold of this dataset in two ways:\n", "\n", "- Finetune the German BERT\n", "- Cross-lingual transfer from English to German\n", "\n", "*Note:* You are recommended to also check [Single GPU Billion-scale Model Training via Parameter-Efficient Finetuning](../advanced_topics/efficient_finetuning_basic.ipynb) about how to achieve better performance via parameter-efficient finetuning. \n", "\n", "## Load Dataset\n", "\n", "The [Cross-Lingual Amazon Product Review Sentiment](https://webis.de/data/webis-cls-10.html) dataset contains Amazon product reviews in four languages. \n", "Here, we load the English and German fold of the dataset. In the label column, `0` means negative sentiment and `1` means positive sentiment." ] }, { "cell_type": "code", "execution_count": null, "id": "aa00faab-252f-44c9-b8f7-57131aa8251c", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "!pip install autogluon.multimodal\n" ] }, { "cell_type": "code", "execution_count": null, "id": "1568130d", "metadata": {}, "outputs": [], "source": [ "!wget --quiet https://automl-mm-bench.s3.amazonaws.com/multilingual-datasets/amazon_review_sentiment_cross_lingual.zip -O amazon_review_sentiment_cross_lingual.zip\n", "!unzip -q -o amazon_review_sentiment_cross_lingual.zip -d ." 
] }, { "cell_type": "code", "execution_count": null, "id": "0e6d8762", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "import warnings\n", "warnings.filterwarnings('ignore')\n", "\n", "train_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_train.tsv',\n", " sep='\\t', header=None, names=['label', 'text']) \\\n", " .sample(1000, random_state=123)\n", "train_de_df.reset_index(inplace=True, drop=True)\n", "\n", "test_de_df = pd.read_csv('amazon_review_sentiment_cross_lingual/de_test.tsv',\n", " sep='\\t', header=None, names=['label', 'text']) \\\n", " .sample(200, random_state=123)\n", "test_de_df.reset_index(inplace=True, drop=True)\n", "print(train_de_df)" ] }, { "cell_type": "code", "execution_count": null, "id": "fb9af1b4", "metadata": {}, "outputs": [], "source": [ "train_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_train.tsv',\n", " sep='\\t',\n", " header=None,\n", " names=['label', 'text']) \\\n", " .sample(1000, random_state=123)\n", "train_en_df.reset_index(inplace=True, drop=True)\n", "\n", "test_en_df = pd.read_csv('amazon_review_sentiment_cross_lingual/en_test.tsv',\n", " sep='\\t',\n", " header=None,\n", " names=['label', 'text']) \\\n", " .sample(200, random_state=123)\n", "test_en_df.reset_index(inplace=True, drop=True)\n", "print(train_en_df)" ] }, { "cell_type": "markdown", "id": "cd6e6f88", "metadata": {}, "source": [ "## Finetune the German BERT\n", "\n", "Our first approach is to finetune the [German BERT model](https://www.deepset.ai/german-bert) pretrained by deepset. \n", "Since `MultiModalPredictor` integrates with the [Huggingface/Transformers](https://huggingface.co/docs/transformers/index) (as explained in [Customize AutoMM](../advanced_topics/customization.ipynb)), \n", "we directly load the German BERT model available in Huggingface/Transformers, with the key as [bert-base-german-cased](https://huggingface.co/bert-base-german-cased). \n", "To simplify the experiment, we also just finetune for 4 epochs." ] }, { "cell_type": "code", "execution_count": null, "id": "7847c87d", "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal import MultiModalPredictor\n", "\n", "predictor = MultiModalPredictor(label='label')\n", "predictor.fit(train_de_df,\n", " hyperparameters={\n", " 'model.hf_text.checkpoint_name': 'bert-base-german-cased',\n", " 'optimization.max_epochs': 2\n", " })" ] }, { "cell_type": "code", "execution_count": null, "id": "5c74da3e", "metadata": {}, "outputs": [], "source": [ "score = predictor.evaluate(test_de_df)\n", "print('Score on the German Testset:')\n", "print(score)" ] }, { "cell_type": "code", "execution_count": null, "id": "6b579571", "metadata": {}, "outputs": [], "source": [ "score = predictor.evaluate(test_en_df)\n", "print('Score on the English Testset:')\n", "print(score)" ] }, { "cell_type": "markdown", "id": "4ea2ef6b", "metadata": {}, "source": [ "We can find that the model can achieve good performance on the German dataset but performs poorly on the English dataset. \n", "Next, we will show how to enable cross-lingual transfer so you can get a model that can magically work for **both German and English**.\n", "\n", "## Cross-lingual Transfer\n", "\n", "In the real-world scenario, it is pretty common that you have trained a model for English and would like to extend the model to support other languages like German. \n", "This setting is also known as cross-lingual transfer. 
One way to solve the problem is to apply a machine translation model to translate the sentences from the \n", "other language (e.g., German) into English and then apply the English model.\n", "However, as shown in [\"Unsupervised Cross-lingual Representation Learning at Scale\"](https://arxiv.org/pdf/1911.02116.pdf), \n", "there is a better and more cost-effective way to achieve cross-lingual transfer, enabled by large-scale multilingual pretraining.\n", "The authors showed that, thanks to large-scale pretraining, the backbone (called XLM-R) is able to perform *zero-shot* cross-lingual transfer, \n", "meaning that a model trained on the English dataset can be directly applied to datasets in other languages. \n", "It also outperforms the \"TRANSLATE-TEST\" baseline, which translates the data from other languages into English and then applies the English model. \n", "\n", "In AutoGluon, you can simply set `presets=\"multilingual\"` in MultiModalPredictor to load a backbone that is suitable for zero-shot transfer. \n", "Internally, we will automatically use state-of-the-art models like [DeBERTa-V3](https://arxiv.org/abs/2111.09543)." ] }, { "cell_type": "code", "execution_count": null, "id": "64cfea80", "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal import MultiModalPredictor\n", "\n", "predictor = MultiModalPredictor(label='label')\n", "predictor.fit(train_en_df,\n", "              presets='multilingual',\n", "              hyperparameters={\n", "                  'optimization.max_epochs': 2\n", "              })" ] }, { "cell_type": "code", "execution_count": null, "id": "72e73dc9", "metadata": {}, "outputs": [], "source": [ "score_in_en = predictor.evaluate(test_en_df)\n", "print('Score on the English Testset:')\n", "print(score_in_en)" ] }, { "cell_type": "code", "execution_count": null, "id": "e7073e87", "metadata": {}, "outputs": [], "source": [ "score_in_de = predictor.evaluate(test_de_df)\n", "print('Score on the German Testset:')\n", "print(score_in_de)" ] }, { "cell_type": "markdown", "id": "4e057696", "metadata": {}, "source": [ "We can see that the model works for both German and English!\n", "\n", "Let's also inspect the model's performance on Japanese:" ] }, { "cell_type": "code", "execution_count": null, "id": "9f79a6f4", "metadata": {}, "outputs": [], "source": [ "test_jp_df = pd.read_csv('amazon_review_sentiment_cross_lingual/jp_test.tsv',\n", "                         sep='\\t', header=None, names=['label', 'text']) \\\n", "               .sample(200, random_state=123)\n", "test_jp_df.reset_index(inplace=True, drop=True)\n", "print(test_jp_df)" ] }, { "cell_type": "code", "execution_count": null, "id": "c2f64266", "metadata": {}, "outputs": [], "source": [ "print('Negative label ratio of the Japanese Testset =', test_jp_df['label'].value_counts()[0] / len(test_jp_df))\n", "score_in_jp = predictor.evaluate(test_jp_df)\n", "print('Score on the Japanese Testset:')\n", "print(score_in_jp)" ] }, { "cell_type": "markdown", "id": "c4ba2df9", "metadata": {}, "source": [ "Amazingly, the model also works for Japanese!\n", "\n", "## Other Examples\n", "\n", "You may go to [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) to explore other examples of AutoMM.\n", "\n", "## Customization\n", "To learn how to customize AutoMM, please refer to [Customize AutoMM](../advanced_topics/customization.ipynb)." ] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython" } }, "nbformat": 4, "nbformat_minor": 5 }