AutoGluon-RAG Evaluation Module#
Overview#
The EvaluationModule in AutoGluon-RAG is designed to facilitate the evaluation of the Retrieval-Augmented Generation (RAG) pipeline. This module allows users to easily assess the performance of the RAG pipeline using various datasets and evaluation metrics. The primary purpose of this module is to provide a flexible and extensible framework for evaluating the quality and effectiveness of the responses generated by the RAG pipeline.
The EvaluationModule is created to:
Simplify the evaluation process for the AutoGluon-RAG pipeline.
Support multiple evaluation metrics, including custom metrics.
Provide functionality to save evaluation data and results.
Allow users to preprocess data, extract queries, and expected responses from various datasets.
Usage#
Initialization#
To initialize the EvaluationModule, you need to provide an instance of AutoGluonRAG, the name of the dataset, a list of metrics, and additional optional parameters such as preprocessing functions and paths for saving evaluation data.
from agrag.agrag import AutoGluonRAG
from agrag.evaluation.evaluator import EvaluationModule
from module import preprocessing_fn, query_fn, response_fn  # replace `module` with your own module defining these helpers
evaluation_dir = "evaluation_data"
agrag = AutoGluonRAG(preset_quality="medium_quality", data_dir=evaluation_dir)
evaluator = EvaluationModule(rag_instance=agrag)
evaluator.run_evaluation(
    dataset_name="huggingface_dataset/dataset_name",
    metrics=["exact_match", "transformer_matcher"],
    save_evaluation_data=True,
    evaluation_dir=evaluation_dir,
    preprocessing_fn=preprocessing_fn,
    query_fn=query_fn,
    response_fn=response_fn,
    hf_dataset_params={"name": "dev"},
)
Refer to src/agrag/evaluation/datasets/google_natural_questions/evaluate_agrag.py for a detailed example of how to evaluate AutoGluon-RAG on the Google Natural Questions dataset from HuggingFace.
Arguments to Evaluation Module#
agrag#
Type: AutoGluonRAG
Description: The AutoGluonRAG instance to be evaluated.
from agrag.agrag import AutoGluonRAG
agrag = AutoGluonRAG(preset_quality="medium_quality", data_dir=evaluation_dir)
# Calling agrag.initialize_rag_pipeline() is optional since the EvaluationModule will initialize the pipeline if it has not been done already.
AutoGluon-RAG for Large Datasets#
For large datasets, a naive AutoGluon-RAG configuration may not be sufficient. Here are some steps you can take when working with a large corpus for RAG:
Use optimized indices for the vector DB: Refer to the documentation for the supported vector databases to see how you can use optimized indices, such as clustered or quantized indices, and set the parameters appropriately in your configuration file.
For example, here is an optimized FAISS setup (in the configuration file) using quantization that runs correctly for the Google Natural Questions dataset:
vector_db:
  db_type: faiss
  faiss_index_type: IndexIVFPQ
  faiss_quantized_index_params:
    nlist: 50
    m: 8
    bits: 8
  faiss_index_nprobe: 15
Use GPUs: Make sure to use GPUs appropriately in each module (wherever applicable). You can set the num_gpus parameter in the configuration file under each module.
System Memory: Make sure your system has enough RAM for at least the size of the dataset. You will need additional memory for the documents generated from the dataset, the embeddings, the metadata, and the vector DB index. We recommend running evaluation on a remote instance (such as AWS EC2) instead of running locally. If you would like to run locally, you can evaluate a subset of the data by setting max_eval_size when calling the run_evaluation function (see the next section).
Arguments to run_evaluation function#
NOTE#
Every time you run the run_evaluation function, you may need to set the agrag.data_dir parameter if you change the dataset being used. In that case, you will have to reinitialize the RAG pipeline. Alternatively, you can index all your evaluation datasets at once, or create multiple instances of AutoGluonRAG.
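For example, a minimal sketch of switching datasets with a single instance (this assumes data_dir can be reassigned directly on the AutoGluonRAG instance; the directory name is hypothetical):
agrag.data_dir = "evaluation_data_second_dataset"  # point at the new evaluation data directory (hypothetical path)
agrag.initialize_rag_pipeline()  # re-index the new documents before calling run_evaluation again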
dataset_name#
Type: str
Description: The name of the dataset to use for evaluation.
metrics#
Type: List[Union[str, Callable]]
Description: The list of metrics to use for evaluation. Supported metrics include:
"bertscore": Uses the BERTScore metric from HuggingFace.
"bleu": Uses the BLEU metric from HuggingFace.
"hf_exact_match": Uses the Exact Match metric from HuggingFace.
"inclusive_exact_match": Uses the Inclusive Exact Match metric. This is a custom metric defined in this module; it is more lenient than the HuggingFace exact_match metric and also counts cases where the expected response is contained within the generated response as a success. This metric supports the following optional parameters:
regexes_to_ignore (List[str], optional): A list of regex expressions of characters to ignore when calculating exact matches. Defaults to None. Note: the regex substitutions are applied before capitalization is normalized.
ignore_case (bool, optional): If True, lowercases both strings so that capitalization differences are ignored. Defaults to False.
ignore_punctuation (bool, optional): If True, removes punctuation before comparing strings. Defaults to False.
ignore_numbers (bool, optional): If True, removes all digits before comparing strings. Defaults to False.
"pedant": Uses the PEDANT metric from QA Metrics.
"transformer_matcher": Uses the Transformer Matcher metric from QA Metrics.
<callable_custom_metric>: Any callable Python function, or a function from a Python package (see the sketch below). It must take in at least the arguments predictions and references, where predictions is a List of generated responses and references is a List[List] of expected responses.
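For illustration, here is a minimal sketch of a custom callable metric. The function name, matching logic, and float return value are hypothetical; only the predictions and references arguments are required by the module.
def prefix_match_metric(predictions, references, **kwargs):
    """
    Hypothetical custom metric: a prediction counts as correct if it starts with
    any of the expected responses for that query (case-insensitive).
    """
    if not predictions:
        return 0.0
    correct = 0
    for prediction, expected_responses in zip(predictions, references):
        prediction = prediction.lower()
        if any(prediction.startswith(expected.lower()) for expected in expected_responses):
            correct += 1
    return correct / len(predictions)
Such a function can then be passed directly in the metrics list, for example metrics=["bertscore", prefix_match_metric].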
metric_score_params#
Type: dict
Description: Optional, additional parameters to pass into evaluation metric functions when computing scores.
metric_init_params#
Type: dict
Description: Optional, additional parameters to pass into evaluation metric functions when initializing the functions.
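As an illustration only, here is a sketch of passing score-time options for the "inclusive_exact_match" metric described above. This assumes the dictionary is forwarded as keyword arguments to the metric's scoring call; the exact structure the module expects may differ.
evaluator.run_evaluation(
    dataset_name="huggingface_dataset/dataset_name",
    metrics=["inclusive_exact_match"],
    metric_score_params={"ignore_case": True, "ignore_punctuation": True},
)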
preprocessing_fn#
Type: Callable
Description: A function to preprocess the content before saving. This must be a function that returns the relevant content from the dataset row. For example, to extract text from the Google Natural Questions dataset, you can pass in this function as preprocessing_fn
:
def preprocess_google_nq(row):
    """
    Extracts text from HTML content for the Google NQ dataset.

    Parameters:
    ----------
    row : dict
        A row from the Google NQ dataset containing HTML content.

    Returns:
    -------
    str
        The extracted text content.
    """
    html_content = row["document"]["html"]
    return extract_text_from_html(html_content)  # Function to extract text from HTML
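The helper extract_text_from_html is referenced but not defined above; a minimal sketch using BeautifulSoup (an assumption about the implementation, not necessarily what the reference script uses) could look like this:
from bs4 import BeautifulSoup

def extract_text_from_html(html_content):
    """Strip HTML tags and return the plain text of the document."""
    soup = BeautifulSoup(html_content, "html.parser")
    return soup.get_text(separator=" ", strip=True)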
query_fn#
Type: Callable
Description: A function to extract the query from the dataset row. For example, to extract the query from the Google Natural Questions dataset, you can pass in this function as query_fn:
def get_google_nq_query(row):
    """
    Extracts the query from a row in the Google NQ dataset.

    Parameters:
    ----------
    row : dict
        A row from the Google NQ dataset.

    Returns:
    -------
    str
        The query.
    """
    return row["question"]["text"]
response_fn#
Type: Callable
Description: A function to extract the expected responses from the dataset row. For example, to extract the expected responses from the Google Natural Questions dataset, you can pass in this function as response_fn:
def get_google_nq_responses(row):
    """
    Extracts the expected responses from a row in the Google NQ dataset.

    Parameters:
    ----------
    row : dict
        A row from the Google NQ dataset.

    Returns:
    -------
    List[str]
        A list of expected responses.
    """
    short_answers = row["annotations"]["short_answers"]
    return [answer["text"][0] for answer in short_answers if answer["text"]]
hf_dataset_params#
Type: dict
Description: Additional parameters to pass into the HuggingFace load_dataset function.
split#
Type: str
Description: The dataset split to use (default is “validation”).
save_evaluation_data#
Type: bool
Description: Whether to save evaluation data to files (default is True). Set this to False if you already have a directory of evaluation files to pass into AutoGluon RAG.
evaluation_dir#
Type: str
Description: The directory for evaluation data (default is “./evaluation_data”).
save_csv_path#
Type: str
Description: The path to save the evaluation results as a CSV file (default is None). If no path is provided, the evaluation results will not be saved.
max_eval_size#
Type: int, optional
Description: The maximum number of datapoints to process for evaluation (default is None). If this value is greater than or equal to the total number of datapoints (rows), the entire dataset will be used.
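For example, a quick local run on a small subset of the data (the value 500 and the metric choice are illustrative):
evaluator.run_evaluation(
    dataset_name="huggingface_dataset/dataset_name",
    metrics=["bertscore"],
    max_eval_size=500,  # evaluate at most 500 datapoints
)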
Using a Custom Dataset#
If you would like to use your own dataset, you must structure it with the following columns (an example is sketched at the end of this section):
“content”: Column containing the document content (as plaintext) that will be extracted and passed into the RAG pipeline.
“query”: Column containing a single query for each document that will be passed into the generator.
“expected_responses”: A list of expected responses for each query that will be used for evaluation.
For the parameters preprocessing_fn, query_fn, and response_fn, the EvaluationModule will use the functions in src/agrag/evaluation/dataset_utils.py by default.
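As an illustration of the columns described above, a minimal custom dataset could be built with the HuggingFace datasets library. The values below are placeholders, and how you then make the dataset available to run_evaluation (for example, by publishing it to the HuggingFace Hub) is not shown here.
from datasets import Dataset

custom_dataset = Dataset.from_dict({
    "content": ["Plaintext of document 1 ...", "Plaintext of document 2 ..."],
    "query": ["What does document 1 say about X?", "What does document 2 say about Y?"],
    "expected_responses": [["Answer A", "Alternate phrasing of answer A"], ["Answer B"]],
})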