Using AutoGluon-RAG to process data from documents/websites.#
agrag = AutoGluonRAG(
data_dir="path/to/data",
preset_quality="medium_quality", # or path to config file
)
agrag.initialize_data_module()
processed_data = self.process_data()
Here, instead of calling initialize_rag_pipeline
to initialize the entire pipeline, we simply initialize the data module to process the data.
process_data
returns a pandas DataFrame
with the following columns: "doc_id", "chunk_id", "text"
.
You can obtain the actual text by:
text_list = processed_data["text"].tolist()
text_array = np.array(text_list)