{ "cells": [ { "cell_type": "markdown", "id": "3d80eab2", "metadata": {}, "source": [ "# Text Semantic Search with AutoMM\n", "\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/semantic_matching/text_semantic_search.ipynb)\n", "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/multimodal/semantic_matching/text_semantic_search.ipynb)\n", "\n", "## 1. Introduction to semantic embedding\n", "\n", "Semantic embedding is one of the main workhorses behind the modern search technology. Instead of directly matching the query to candidates by term frequency (e.g., BM25), a semantic search algorithm matches them by first converting the text $x$ into a feature vector $\\phi(x)$ then comparing the similarities using a distance metric defined in that vector space. These feature vectors, known as a \"vector embedding\", are often trained end-to-end on large text corpus, so that they encode the *semantic* meaning of the text. For example, synonyms are embedded to a similar region of the vector space and relationships between words are often revealed by algebraic operations (see Figure 1 for an example). For these reasons, a vector embedding of text are also known as a **semantic embedding**. With a semantic embedding of the query and the search candidate documents, a search algorithm can often be reduced to finding most similar vectors. This new approach to search is known as **semantic search**.\n", "\n", "![Similar sentences have similar embeddings. Image from [Medium](https://medium.com/towards-data-science/fine-grained-analysis-of-sentence-embeddings-a3ff0a42cce5)](https://miro.medium.com/max/1400/0*esMqhzu9WhLiP3bD.jpg)\n", "\n", "\n", "There are three main advantages of using semantic embeddings for a search problem over classical information-retrieval methods (e.g., bag-of-words or TF/IDF). First, it returns candidates that are related according to the meaning of the text, rather than similar word usage. This helps to discover paraphrased text and similar concepts described in very different ways. Secondly, semantic search is often more computationally efficient. Vector embeddings of the candidates can be pre-computed and stored in data structures. Highly scalable sketching techniques such as locality-sensitive hashing (LSH) and max-inner product search (MIPS) are available for efficiently finding similar vectors in the embedding space. Last but not least, the semantic embedding approach allows us to straightforwardly generalize the same search algorithm beyond text, such as multi-modality search. For example, can we use a text query to search for images without textual annotations? Can we search for a website using an image query? With semantic search, one can simply use the most appropriate vector embedding of these multi-modal objects and jointly train the embeddings using datasets with both text and images.\n", "\n", "This tutorial provides you a gentle entry point in deploying AutoMM to semantic search." 
] }, { "cell_type": "code", "execution_count": null, "id": "aa00faab-252f-44c9-b8f7-57131aa8251c", "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "!pip install autogluon.multimodal\n" ] }, { "cell_type": "code", "execution_count": null, "id": "d5d6e8ce", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "!pip3 install ir_datasets\n", "import ir_datasets\n", "import pandas as pd\n", "pd.set_option('display.max_colwidth', None)" ] }, { "cell_type": "markdown", "id": "90521edf", "metadata": {}, "source": [ "## 2. Dataset\n", "In this tutorial, we will use the NF Corpus (Nutrition Facts) dataset from the `ir_datasets` package.\n", "We also convert the query data, document data, and their relevance data into dataframes." ] }, { "cell_type": "code", "execution_count": null, "id": "bd6a9c12", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "dataset = ir_datasets.load(\"beir/nfcorpus/test\")\n", "\n", "# prepare dataset\n", "doc_data = pd.DataFrame(dataset.docs_iter())\n", "query_data = pd.DataFrame(dataset.queries_iter())\n", "labeled_data = pd.DataFrame(dataset.qrels_iter())\n", "label_col = \"relevance\"\n", "query_id_col = \"query_id\"\n", "doc_id_col = \"doc_id\"\n", "text_col = \"text\"\n", "id_mappings={query_id_col: query_data.set_index(query_id_col)[text_col], doc_id_col: doc_data.set_index(doc_id_col)[text_col]}" ] }, { "cell_type": "markdown", "id": "c96a37bd", "metadata": {}, "source": [ "The labeled data contain the query ids, document ids, and their relevance scores." ] }, { "cell_type": "code", "execution_count": null, "id": "c524865f", "metadata": {}, "outputs": [], "source": [ "labeled_data.head()" ] }, { "cell_type": "markdown", "id": "3abb09d9", "metadata": {}, "source": [ "The query data store the query ids and their corresponding query contents." ] }, { "cell_type": "code", "execution_count": null, "id": "d4f6df27", "metadata": {}, "outputs": [], "source": [ "query_data.head()" ] }, { "cell_type": "markdown", "id": "02144c82", "metadata": {}, "source": [ "We need to remove the urls that are not used in search." ] }, { "cell_type": "code", "execution_count": null, "id": "a4c5c083", "metadata": {}, "outputs": [], "source": [ "query_data = query_data.drop(\"url\", axis=1)\n", "query_data.head()" ] }, { "cell_type": "markdown", "id": "74f30de0", "metadata": {}, "source": [ "The doc data have the document ids as well as the corresponding contents." ] }, { "cell_type": "code", "execution_count": null, "id": "f65065cf", "metadata": {}, "outputs": [], "source": [ "doc_data.head(1)" ] }, { "cell_type": "markdown", "id": "b5d60a8d", "metadata": {}, "source": [ "Similar to the query data, we remove the url column. We also need to concatenate all the valid texts into one column." ] }, { "cell_type": "code", "execution_count": null, "id": "6c5f0d19", "metadata": {}, "outputs": [], "source": [ "doc_data[text_col] = doc_data[[text_col, \"title\"]].apply(\" \".join, axis=1)\n", "doc_data = doc_data.drop([\"title\", \"url\"], axis=1)\n", "doc_data.head(1)" ] }, { "cell_type": "markdown", "id": "ddadbd17", "metadata": {}, "source": [ "There are 323 queries, 3633 documents, and 12334 relevance scores in the dataset.\n", "\n", "\n", "## 3. `NDCG` Evaluation\n", "\n", "Users pay the most attention to the first result, then the second, and etc. \n", "As a result, precision matters the most for the top-ranked results. 
\n", "In this tutorial, we use **Normalized Discounted Cumulative Gain (NDCG)** for measuring the ranking performance.\n", "\n", "### 3.1 CG, DCG, IDCG and NDCG Formulas\n", "\n", "In order to understand the NDCG metric, we must first understand CG (Cumulative Gain) and DCG (Discounted Cumulative Gain), as well as understanding the two assumptions that we make when we use DCG and its related measures:\n", "\n", "1. Highly relevant documents are more useful when appearing earlier in the search engine results list.\n", "2. Highly relevant documents are more useful than marginally relevant documents, which are more useful than non-relevant documents\n", "\n", "First, the primitive **Cumulative Gain (CG)**, which adds the relevance score ($rel$) up to a specified rank position $p$:\n", "\n", "$$ \\mathrm{CG}_p = \\sum_{i=1}^p \\mathrm{rel}_i. $$\n", "\n", "Then, the **Discounted Cumulative Gain (DCG)**, which penalizes each relevance score logarithmically based on its position in the results:\n", "\n", "$$ \\mathrm{DCG}_p = \\sum_{i=1}^p \\frac{\\mathrm{rel}_i}{\\log_2(i + 1)}. $$\n", "\n", "Next, the **Ideal DCG (IDCG)**, which is the DCG of the best possible results based on the given ratings:\n", "\n", "$$ \\mathrm{IDCG}_p = \\sum_{i=1}^{|\\mathrm{REL}_p|} \\frac{\\mathrm{rel}_i}{\\log_2(i + 1)}. $$\n", "\n", "where $|mathrm{REL}_p|$ is the list of relevant documents (ordered by their relevance) in the corpus up to the position $p$.\n", "\n", "And finally, the **NDCG**:\n", "\n", "$$ \\mathrm{NDCG}_p = \\frac{\\mathrm{DCG}_p}{\\mathrm{IDCG}_p}. $$\n", "\n", "We provide an util function to compute the ranking scores. Moreover, we also support measuring NDCG under different cutoffs values." ] }, { "cell_type": "code", "execution_count": null, "id": "18d90aab", "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal.utils import compute_ranking_score\n", "cutoffs = [5, 10, 20]" ] }, { "cell_type": "markdown", "id": "692cb9e7", "metadata": {}, "source": [ "## 4. Use BM25\n", "\n", "BM25 (or Okapi BM25) is a popular ranking algorithm currently used by OpenSearch for scoring document relevancy to a query. \n", "We will use BM25's NDCG scores as baselines in this tutorial.\n", "\n", "### 4.1 Define the formula\n", "\n", "$$ score_{BM25} = \\sum_i^n \\mathrm{IDF}(q_i) \\frac{f(q_i, D) \\cdot (k1 + 1)}{f(q_i, D) + k1 \\cdot (1 - b + b \\cdot \\frac{fieldLen}{avgFieldLen})}$$\n", "\n", "where $\\mathrm{IDF}(q_i)$ is the inverse document frequency of the $i^{th}$ query term, and the actual formula used by BM25 for this part is:\n", "\n", "$$ \\log(1 + \\frac{docCount - f(q_i) + 0.5)}{f(q_i) + 0.5}). $$\n", "\n", "$k1$ is a tunable hyperparameter that limits how much a single query term can affect the score of a given document. In ElasticSearch, it is [default](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html) to be 1.2.\n", "\n", "$b$ is another hyperparameter variable that determines the effect of document length compared to the average document length in the corpus. In ElasticSearch, it is [default](https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules-similarity.html) to be 0.75. 
\n", "\n", "In this tutorial, we will be using the package `rank_bm25` to avoid the complexity of implementing the algorithm from scratch.\n", "\n", "### 4.2 Define functions" ] }, { "cell_type": "code", "execution_count": null, "id": "4355200a", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "!pip3 install rank_bm25\n", "from collections import defaultdict\n", "import string\n", "import nltk\n", "import numpy as np\n", "from nltk.corpus import stopwords\n", "from nltk.tokenize import word_tokenize\n", "from rank_bm25 import BM25Okapi\n", "\n", "nltk.download('stopwords')\n", "nltk.download('punkt')\n", "\n", "def tokenize_corpus(corpus):\n", " stop_words = set(stopwords.words(\"english\") + list(string.punctuation))\n", " \n", " tokenized_docs = []\n", " for doc in corpus:\n", " tokens = nltk.word_tokenize(doc.lower())\n", " tokenized_doc = [w for w in tokens if w not in stop_words and len(w) > 2]\n", " tokenized_docs.append(tokenized_doc)\n", " return tokenized_docs\n", "\n", "def rank_documents_bm25(queries_text, queries_id, docs_id, top_k, bm25):\n", " tokenized_queries = tokenize_corpus(queries_text)\n", " \n", " results = {qid: {} for qid in queries_id}\n", " for query_idx, query in enumerate(tokenized_queries):\n", " scores = bm25.get_scores(query)\n", " scores_top_k_idx = np.argsort(scores)[::-1][:top_k]\n", " for doc_idx in scores_top_k_idx:\n", " results[queries_id[query_idx]][docs_id[doc_idx]] = float(scores[doc_idx])\n", " return results\n", "\n", "def get_qrels(dataset):\n", " \"\"\"\n", " Get the ground truth of relevance score for all queries\n", " \"\"\"\n", " qrel_dict = defaultdict(dict)\n", " for qrel in dataset.qrels_iter():\n", " qrel_dict[qrel.query_id][qrel.doc_id] = qrel.relevance\n", " return qrel_dict\n", "\n", "def evaluate_bm25(doc_data, query_data, qrel_dict, cutoffs):\n", " \n", " tokenized_corpus = tokenize_corpus(doc_data[text_col].tolist())\n", " bm25_model = BM25Okapi(tokenized_corpus, k1=1.2, b=0.75)\n", " \n", " results = rank_documents_bm25(query_data[text_col].tolist(), query_data[query_id_col].tolist(), doc_data[doc_id_col].tolist(), max(cutoffs), bm25_model)\n", " ndcg = compute_ranking_score(results=results, qrel_dict=qrel_dict, metrics=[\"ndcg\"], cutoffs=cutoffs)\n", " \n", " return ndcg" ] }, { "cell_type": "code", "execution_count": null, "id": "f0a62e63", "metadata": {}, "outputs": [], "source": [ "qrel_dict = get_qrels(dataset)\n", "evaluate_bm25(doc_data, query_data, qrel_dict, cutoffs)" ] }, { "cell_type": "markdown", "id": "1a3f93c0", "metadata": {}, "source": [ "## 5. Use AutoMM\n", "AutoMM provides easy-to-use APIs to evaluate the ranking performance, extract embeddings, and conduct semantic search.\n", "\n", "### 5.1 Initialize Predictor\n", "\n", "For text data, we can initialize `MultiModalPredictor` with problem type `text_similarity`. \n", "We need to specify `query`, `response`, and `label` with the corresponding column names in the `labeled_data` dataframe." 
] }, { "cell_type": "code", "execution_count": null, "id": "e8146eca", "metadata": {}, "outputs": [], "source": [ "%%capture\n", "from autogluon.multimodal import MultiModalPredictor\n", "\n", "predictor = MultiModalPredictor(\n", " query=query_id_col,\n", " response=doc_id_col,\n", " label=label_col,\n", " problem_type=\"text_similarity\",\n", " hyperparameters={\"model.hf_text.checkpoint_name\": \"sentence-transformers/all-MiniLM-L6-v2\"}\n", " )" ] }, { "cell_type": "markdown", "id": "cb03db5d", "metadata": {}, "source": [ "### 5.2 Evaluate Ranking\n", "Evaluating the ranking performance is easy with the `evaluate` API. \n", "During evaluation, the predictor automatically extracts embeddings, computes cosine similarities, ranks the results, and computes the scores." ] }, { "cell_type": "code", "execution_count": null, "id": "7d9ea3a2", "metadata": {}, "outputs": [], "source": [ "predictor.evaluate(\n", " labeled_data,\n", " query_data=query_data[[query_id_col]],\n", " response_data=doc_data[[doc_id_col]],\n", " id_mappings=id_mappings,\n", " cutoffs=cutoffs,\n", " metrics=[\"ndcg\"],\n", " )" ] }, { "cell_type": "markdown", "id": "04cd35ea", "metadata": {}, "source": [ "We can find significant improvements over the BM25's performances.\n", "\n", "### 5.3 Semantic Search\n", "We also provide an util function for semantic search." ] }, { "cell_type": "code", "execution_count": null, "id": "c0446b2f", "metadata": {}, "outputs": [], "source": [ "from autogluon.multimodal.utils import semantic_search\n", "hits = semantic_search(\n", " matcher=predictor,\n", " query_data=query_data[text_col].tolist(),\n", " response_data=doc_data[text_col].tolist(),\n", " query_chunk_size=len(query_data),\n", " top_k=max(cutoffs),\n", " )" ] }, { "cell_type": "markdown", "id": "98cc5b31", "metadata": {}, "source": [ "We rank the docs according to cosine similarities between the query and document embeddings.\n", "For simplicity, we use `torch.topk` with [linear complexity](https://github.com/pytorch/pytorch/blob/4262c8913c2bddb8d91565888b4871790301faba/aten/src/ATen/native/cuda/TensorTopK.cu#L92-L121) (O(n+k)) to get the k most similar vector embeddings to the query embedding. However, in practice, more efficient methods for similarity search are often used, e.g. [Faiss](https://github.com/facebookresearch/faiss).\n", "\n", "### 5.4 Extract Embeddings\n", "Extracting embeddings is important to deploy models to industry search engines. In general, a system extracts the embeddings for database items offline. During the online search, it only needs to encode query data and then efficiently matches the query embeddings with the saved database embeddings." ] }, { "cell_type": "code", "execution_count": null, "id": "07136ec3", "metadata": {}, "outputs": [], "source": [ "query_embeds = predictor.extract_embedding(query_data[[query_id_col]], id_mappings=id_mappings, as_tensor=True)\n", "doc_embeds = predictor.extract_embedding(doc_data[[doc_id_col]], id_mappings=id_mappings, as_tensor=True)" ] }, { "cell_type": "markdown", "id": "84846fc2", "metadata": {}, "source": [ "## 6. Hybrid BM25\n", "\n", "We are proposing a new method of search ranking called *Hybrid BM25*, which combines BM25 and semantic embedding for scoring. The key idea is to use BM25 as the first-stage retrieval method (say it recalls 1000 documents for each query), then use a pretrained language model (PLM) to score all the recalled documents (1000 documents). 
\n", "\n", "We then rerank the retrieved documents with the score calculated as:\n", "\n", "$$ score = \\beta * normalized\\_BM25 + ( 1 - \\beta) * score\\_of\\_plm $$\n", "\n", "where \n", "\n", "$$ normalized\\_BM25(q_i, D_j) = \\frac{\\textsf{BM25}(q_i,D_j) - \\min_{a\\in \\mathcal{Q},b\\in\\mathcal{D}}(\\textsf{BM25}(a,b))}{\\max_{a\\in \\mathcal{Q},b\\in\\mathcal{D}}(\\textsf{BM25}(a,b)) - \\min_{a\\in \\mathcal{Q},b\\in\\mathcal{D}}(\\textsf{BM25}(a,b))},$$\n", "\n", "and $\\beta$ is a tunable parameter, which we will default to $0.3$ in our tutorial.\n", "\n", "### 6.1 Define functions" ] }, { "cell_type": "code", "execution_count": null, "id": "56b360ba", "metadata": {}, "outputs": [], "source": [ "import torch\n", "from autogluon.multimodal.utils import compute_semantic_similarity\n", "\n", "def hybridBM25(query_data, query_embeds, doc_data, doc_embeds, recall_num, top_k, beta):\n", " # Recall documents with BM25 scores\n", " tokenized_corpus = tokenize_corpus(doc_data[text_col].tolist())\n", " bm25_model = BM25Okapi(tokenized_corpus, k1=1.2, b=0.75)\n", " bm25_scores = rank_documents_bm25(query_data[text_col].tolist(), query_data[query_id_col].tolist(), doc_data[doc_id_col].tolist(), recall_num, bm25_model)\n", " \n", " all_bm25_scores = [score for scores in bm25_scores.values() for score in scores.values()]\n", " max_bm25_score = max(all_bm25_scores)\n", " min_bm25_score = min(all_bm25_scores)\n", "\n", " q_embeddings = {qid: embed for qid, embed in zip(query_data[query_id_col].tolist(), query_embeds)}\n", " d_embeddings = {did: embed for did, embed in zip(doc_data[doc_id_col].tolist(), doc_embeds)}\n", " \n", " query_ids = query_data[query_id_col].tolist()\n", " results = {qid: {} for qid in query_ids}\n", " for idx, qid in enumerate(query_ids):\n", " rec_docs = bm25_scores[qid]\n", " rec_doc_emb = [d_embeddings[doc_id] for doc_id in rec_docs.keys()]\n", " rec_doc_id = [doc_id for doc_id in rec_docs.keys()]\n", " rec_doc_emb = torch.stack(rec_doc_emb)\n", " scores = compute_semantic_similarity(q_embeddings[qid], rec_doc_emb)\n", " scores[torch.isnan(scores)] = -1\n", " top_k_values, top_k_idxs = torch.topk(\n", " scores,\n", " min(top_k + 1, len(scores[0])),\n", " dim=1,\n", " largest=True,\n", " sorted=False,\n", " )\n", "\n", " for doc_idx, score in zip(top_k_idxs[0], top_k_values[0]):\n", " doc_id = rec_doc_id[int(doc_idx)]\n", " # Hybrid scores from BM25 and cosine similarity of embeddings\n", " results[qid][doc_id] = \\\n", " (1 - beta) * float(score.numpy()) \\\n", " + beta * (bm25_scores[qid][doc_id] - min_bm25_score) / (max_bm25_score - min_bm25_score)\n", " \n", " return results\n", "\n", "\n", "def evaluate_hybridBM25(query_data, query_embeds, doc_data, doc_embeds, recall_num, beta, cutoffs):\n", " results = hybridBM25(query_data, query_embeds, doc_data, doc_embeds, recall_num, max(cutoffs), beta)\n", " ndcg = compute_ranking_score(results=results, qrel_dict=qrel_dict, metrics=[\"ndcg\"], cutoffs=cutoffs)\n", " return ndcg" ] }, { "cell_type": "code", "execution_count": null, "id": "bd754987", "metadata": {}, "outputs": [], "source": [ "recall_num = 1000\n", "beta = 0.3\n", "query_embeds = predictor.extract_embedding(query_data[[query_id_col]], id_mappings=id_mappings, as_tensor=True)\n", "doc_embeds = predictor.extract_embedding(doc_data[[doc_id_col]], id_mappings=id_mappings, as_tensor=True)\n", "evaluate_hybridBM25(query_data, query_embeds, doc_data, doc_embeds, recall_num, beta, cutoffs)" ] }, { "cell_type": "markdown", "id": "6297f7ad", "metadata": {}, "source": 
[ "We were able to improve the ranking scores over the naive BM25.\n", "\n", "## 7. Summary\n", "\n", "In this tutorial, we have demonstrated how to use AutoMM for semantic search, and showcased the obvious improvements over the classical BM25. We further improved the ranking scores by combining BM25 and AutoMM (Hybrid BM25)." ] } ], "metadata": { "language_info": { "name": "python", "pygments_lexer": "ipython" } }, "nbformat": 4, "nbformat_minor": 5 }