{ "cells": [ { "cell_type": "markdown", "id": "cc9d9fb8", "metadata": {}, "source": [ "# Text-to-Text Semantic Matching with AutoMM \n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/semantic_matching/text2text_matching.ipynb)\n", "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/semantic_matching/text2text_matching.ipynb)\n", "\n", "\n", "\n", "Computing the similarity between two sentences/passages is a common task in NLP, with several practical applications such as web search, question answering, documents deduplication, plagiarism comparison, natural language inference, recommendation engines, etc. In general, text similarity models will take two sentences/passages as input and transform them into vectors, and then similarity scores calculated using cosine similarity, dot product, or Euclidean distances are used to measure how alike or different of the two text pieces. \n", "\n", "## Prepare your Data\n", "In this tutorial, we will demonstrate how to use AutoMM for text-to-text semantic matching with the Stanford Natural Language Inference ([SNLI](https://nlp.stanford.edu/projects/snli/)) corpus. SNLI is a corpus contains around 570k human-written sentence pairs labeled with *entailment*, *contradiction*, and *neutral*. It is a widely used benchmark for evaluating the representation and inference capbility of machine learning methods. The following table contains three examples taken from this corpus.\n", "\n", "| Premise | Hypothesis | Label |\n", "|-----------------------------------------------------------|----------------------------------------------------------------------|---------------|\n", "| A black race car starts up in front of a crowd of people. | A man is driving down a lonely road. | contradiction |\n", "| An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | neutral |\n", "| A soccer game with multiple males playing. | Some men are playing a sport. | entailment |\n", "\n", "Here, we consider sentence pairs with label *entailment* as positive pairs (labeled as 1) and those with label *contradiction* as negative pairs (labeled as 0). Sentence pairs with neural relationship are discarded. The following code downloads and loads the corpus into dataframes." ] }, { "cell_type": "code", "execution_count": null, "id": "aa00faab-252f-44c9-b8f7-57131aa8251c", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "!pip install autogluon.multimodal\n" ] }, { "cell_type": "code", "execution_count": null, "id": "52934a32", "metadata": {}, "outputs": [], "source": [ "from autogluon.core.utils.loaders import load_pd\n", "import pandas as pd\n", "\n", "snli_train = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_train.csv', delimiter=\"|\")\n", "snli_test = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_test.csv', delimiter=\"|\")\n", "snli_train.head()" ] }, { "cell_type": "markdown", "id": "620d4fdd", "metadata": {}, "source": [ "## Train your Model\n", "\n", "Ideally, we want to obtain a model that can return high/low scores for positive/negative text pairs. Traditional text similarity methods only work on a lexical level without taking the semantic aspect into account, for example, using term frequency or tf-idf vectors. With AutoMM, we can easily train a model that captures the semantic relationship between sentences. Basically, it uses [BERT](https://arxiv.org/abs/1810.04805) to project each sentence into a high-dimensional vector and treat the matching problem as a classification problem following the design in [sentence transformers](https://www.sbert.net/).", "\n", "With AutoMM, you just need to specify the query, response, and label column names and fit the model on the training dataset without worrying the implementation details. Note that the labels should be binary, and we need to specify the `match_label`, which means two sentences have the same semantic meaning. In practice, your tasks may have different labels, e.g., duplicate or not duplicate. You may need to define the `match_label` by considering your specific task contexts." ] }, { "cell_type": "code", "execution_count": null, "id": "340bbd54", "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal import MultiModalPredictor\n", "\n", "# Initialize the model\n", "predictor = MultiModalPredictor(\n", " problem_type=\"text_similarity\",\n", " query=\"premise\", # the column name of the first sentence\n", " response=\"hypothesis\", # the column name of the second sentence\n", " label=\"label\", # the label column name\n", " match_label=1, # the label indicating that query and response have the same semantic meanings.\n", " eval_metric='auc', # the evaluation metric\n", " )\n", "\n", "# Fit the model\n", "predictor.fit(\n", " train_data=snli_train,\n", " time_limit=180,\n", ")" ] }, { "cell_type": "markdown", "id": "ff2bcd53", "metadata": {}, "source": [ "## Evaluate on Test Dataset\n", "You can evaluate the macther on the test dataset to see how it performs with the roc_auc score:" ] }, { "cell_type": "code", "execution_count": null, "id": "08319098", "metadata": {}, "outputs": [], "source": [ "score = predictor.evaluate(snli_test)\n", "print(\"evaluation score: \", score)" ] }, { "cell_type": "markdown", "id": "afcb14db", "metadata": {}, "source": [ "## Predict on a New Sentence Pair\n", "We create a new sentence pair with similar meaning (expected to be predicted as $1$) and make predictions using the trained model." ] }, { "cell_type": "code", "execution_count": null, "id": "0282cff8", "metadata": {}, "outputs": [], "source": [ "pred_data = pd.DataFrame.from_dict({\"premise\":[\"The teacher gave his speech to an empty room.\"], \n", " \"hypothesis\":[\"There was almost nobody when the professor was talking.\"]})\n", "\n", "predictions = predictor.predict(pred_data)\n", "print('Predicted entities:', predictions[0])" ] }, { "cell_type": "markdown", "id": "23e3914d", "metadata": {}, "source": [ "## Predict Matching Probabilities\n", "We can also compute the matching probabilities of sentence pairs." ] }, { "cell_type": "code", "execution_count": null, "id": "c4cc5ce9", "metadata": {}, "outputs": [], "source": [ "probabilities = predictor.predict_proba(pred_data)\n", "print(probabilities)" ] }, { "cell_type": "markdown", "id": "74df0af2", "metadata": {}, "source": [ "## Extract Embeddings\n", "Moreover, we support extracting embeddings separately for two sentence groups." ] }, { "cell_type": "code", "execution_count": null, "id": "2d2db12c", "metadata": {}, "outputs": [], "source": [ "embeddings_1 = predictor.extract_embedding({\"premise\":[\"The teacher gave his speech to an empty room.\"]})\n", "print(embeddings_1.shape)\n", "embeddings_2 = predictor.extract_embedding({\"hypothesis\":[\"There was almost nobody when the professor was talking.\"]})\n", "print(embeddings_2.shape)" ] }, { "cell_type": "markdown", "id": "7301438a", "metadata": {}, "source": [ "## Other Examples\n", "\n", "You may go to [AutoMM Examples](https://github.com/autogluon/autogluon/tree/master/examples/automm) to explore other examples about AutoMM.\n", "\n", "\n", "## Customization\n", "\n", "To learn how to customize AutoMM, please refer to [Customize AutoMM](../advanced_topics/customization.ipynb)." ] } ], "metadata": { "language_info": { "name": "python" } }, "nbformat": 4, "nbformat_minor": 5 }