{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# AutoGluon Tabular - Foundational Models\n", "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/master/docs/tutorials/tabular/tabular-foundational-models.ipynb)\n", "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/master/docs/tutorials/tabular/tabular-foundational-models.ipynb)\n", "\n", "In this tutorial, we introduce support for cutting-edge foundational tabular models that leverage pre-training and in-context learning to achieve state-of-the-art performance on tabular datasets. These models represent a significant advancement in automated machine learning for structured data.\n", "\n", "In this tutorial, we'll explore four foundational tabular models:\n", "\n", "1. **Mitra** - AutoGluon's new state-of-the-art tabular foundation model\n", "2. **TabICL** - In-context learning for large tabular datasets\n", "3. **TabPFNv2** - Prior-fitted networks for accurate predictions on small data\n", "\n", "These models excel particularly on small to medium-sized datasets and can run in both zero-shot and fine-tuning modes." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Installation\n", "\n", "First, let's install AutoGluon with support for foundational models:" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "tags": [ "hide-output" ] }, "outputs": [], "source": [ "# Individual model installations:\n", "!pip install uv\n", "!uv pip install autogluon.tabular[mitra] # For Mitra\n", "!uv pip install autogluon.tabular[tabicl] # For TabICL\n", "!uv pip install autogluon.tabular[tabpfn] # For TabPFNv2\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "from autogluon.tabular import TabularDataset, TabularPredictor\n", "from sklearn.model_selection import train_test_split\n", "from sklearn.datasets import load_wine, fetch_california_housing" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example Data\n", "\n", "For this tutorial, we'll demonstrate the foundational models on three different datasets to showcase their versatility:\n", "\n", "1. **Wine Dataset** (Multi-class Classification) - Medium-sized dataset for comparing model performance\n", "3. **California Housing** (Regression) - Regression dataset\n", "\n", "Let's load and prepare these datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Load datasets\n", "\n", "# 1. Wine (Multi-class Classification)\n", "wine_data = load_wine()\n", "wine_df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)\n", "wine_df['target'] = wine_data.target\n", "\n", "# 2. California Housing (Regression)\n", "housing_data = fetch_california_housing()\n", "housing_df = pd.DataFrame(housing_data.data, columns=housing_data.feature_names)\n", "housing_df['target'] = housing_data.target\n", "\n", "print(\"Dataset shapes:\")\n", "print(f\"Wine: {wine_df.shape}\")\n", "print(f\"California Housing: {housing_df.shape}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Create Train/Test Splits\n", "\n", "Let's create train/test splits for our datasets:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create train/test splits (80/20)\n", "wine_train, wine_test = train_test_split(wine_df, test_size=0.2, random_state=42, stratify=wine_df['target'])\n", "housing_train, housing_test = train_test_split(housing_df, test_size=0.2, random_state=42)\n", "\n", "print(\"Training set sizes:\")\n", "print(f\"Wine: {len(wine_train)} samples\")\n", "print(f\"Housing: {len(housing_train)} samples\")\n", "\n", "# Convert to TabularDataset\n", "wine_train_data = TabularDataset(wine_train)\n", "wine_test_data = TabularDataset(wine_test)\n", "housing_train_data = TabularDataset(housing_train)\n", "housing_test_data = TabularDataset(housing_test)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 1. Mitra: AutoGluon's Tabular Foundation Model\n", "\n", "[Mitra](https://huggingface.co/autogluon/mitra-classifier) is a new state-of-the-art tabular foundation model developed by the AutoGluon team, natively supported in AutoGluon with just three lines of code via `predictor.fit())`. Built on the in-context learning paradigm and pretrained exclusively on synthetic data, Mitra introduces a principled pretraining approach by carefully selecting and mixing diverse synthetic priors to promote robust generalization across a wide range of real-world tabular datasets.\n", "\n", "📊 **Mitra achieves state-of-the-art performance** on major benchmarks including TabRepo, TabZilla, AMLB, and TabArena, especially excelling on small tabular datasets with fewer than 5,000 samples and 100 features, for both classification and regression tasks.\n", "\n", "🧠 **Mitra supports both zero-shot and fine-tuning modes** and runs seamlessly on both GPU and CPU. Its weights are fully open-sourced under the Apache-2.0 license, making it a privacy-conscious and production-ready solution for enterprises concerned about data sharing and hosting.\n", "\n", "🔗 **Learn more on Hugging Face:**\n", "- Classification model: [autogluon/mitra-classifier](https://huggingface.co/autogluon/mitra-classifier)\n", "- Regression model: [autogluon/mitra-regressor](https://huggingface.co/autogluon/mitra-regressor)\n", "\n", "### Using Mitra for Classification" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create predictor with Mitra\n", "print(\"Training Mitra classifier on classification dataset...\")\n", "mitra_predictor = TabularPredictor(label='target')\n", "mitra_predictor.fit(\n", " wine_train_data,\n", " hyperparameters={\n", " 'MITRA': {'fine_tune': False}\n", " },\n", " )\n", "\n", "print(\"\\nMitra training completed!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluate Mitra Performance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Make predictions\n", "mitra_predictions = mitra_predictor.predict(wine_test_data)\n", "print(\"Sample Mitra predictions:\")\n", "print(mitra_predictions.head(10))\n", "\n", "# Show prediction probabilities for first few samples\n", "mitra_predictions = mitra_predictor.predict_proba(wine_test_data)\n", "print(mitra_predictions.head())\n", "\n", "# Show model leaderboard\n", "print(\"\\nMitra Model Leaderboard:\")\n", "mitra_predictor.leaderboard(wine_test_data, silent=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Finetuning with Mitra" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mitra_predictor_ft = TabularPredictor(label='target')\n", "mitra_predictor_ft.fit(\n", " wine_train_data,\n", " hyperparameters={\n", " 'MITRA': {'fine_tune': True, 'fine_tune_steps': 10}\n", " },\n", " time_limit=120, # 2 minutes\n", " )\n", "\n", "print(\"\\nMitra fine-tuning completed!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Evaluating Fine-tuned Mitra Performance" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Show model leaderboard\n", "print(\"\\nMitra Model Leaderboard:\")\n", "mitra_predictor_ft.leaderboard(wine_test_data, silent=True)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Using Mitra for Regression" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "\n", "# Create predictor with Mitra for regression\n", "print(\"Training Mitra regressor on California Housing dataset...\")\n", "mitra_reg_predictor = TabularPredictor(\n", " label='target',\n", " path='./mitra_regressor_model',\n", " problem_type='regression'\n", ")\n", "mitra_reg_predictor.fit(\n", " housing_train_data.sample(1000), # sample 1000 rows\n", " hyperparameters={\n", " 'MITRA': {'fine_tune': False}\n", " },\n", ")\n", "\n", "# Evaluate regression performance\n", "mitra_reg_predictor.leaderboard(housing_test_data)\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 2. TabICL: In-Context Learning for Tabular Data\n", "\n", "**TabICL** (\"**Tab**ular **I**n-**C**ontext **L**earning\") is a foundational model designed specifically for in-context learning on large tabular datasets.\n", "\n", "**Paper**: [\"TabICL: A Tabular Foundation Model for In-Context Learning on Large Data\"](https://arxiv.org/abs/2502.05564) \n", "**Authors**: Jingang Qu, David Holzmüller, Gaël Varoquaux, Marine Le Morvan \n", "**GitHub**: https://github.com/soda-inria/tabicl\n", "\n", "TabICL leverages transformer architecture with in-context learning capabilities, making it particularly effective for scenarios where you have limited training data but access to related examples." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Train TabICL on dataset\n", "print(\"Training TabICL on wine dataset...\")\n", "tabicl_predictor = TabularPredictor(\n", " label='target',\n", " path='./tabicl_model'\n", ")\n", "tabicl_predictor.fit(\n", " wine_train_data,\n", " hyperparameters={\n", " 'TABICL': {},\n", " },\n", ")\n", "\n", "# Show prediction probabilities for first few samples\n", "tabicl_predictions = tabicl_predictor.predict_proba(wine_test_data)\n", "print(tabicl_predictions.head())\n", "\n", "# Show TabICL leaderboard\n", "print(\"\\nTabICL Model Details:\")\n", "tabicl_predictor.leaderboard(wine_test_data, silent=True)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## 3. TabPFNv2: Prior-Fitted Networks\n", "\n", "**TabPFNv2** (\"**Tab**ular **P**rior-**F**itted **N**etworks **v2**\") is designed for accurate predictions on small tabular datasets by using prior-fitted network architectures.\n", "\n", "**Paper**: [\"Accurate predictions on small data with a tabular foundation model\"](https://www.nature.com/articles/s41586-024-08328-6) \n", "**Authors**: Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister & Frank Hutter \n", "**GitHub**: https://github.com/PriorLabs/TabPFN\n", "\n", "TabPFNv2 excels on small datasets (< 10,000 samples) by leveraging prior knowledge encoded in the network architecture." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Train TabPFNv2 on Wine dataset (perfect size for TabPFNv2)\n", "print(\"Training TabPFNv2 on Wine dataset...\")\n", "tabpfnv2_predictor = TabularPredictor(\n", " label='target',\n", " path='./tabpfnv2_model'\n", ")\n", "tabpfnv2_predictor.fit(\n", " wine_train_data,\n", " hyperparameters={\n", " 'TABPFNV2': {\n", " # TabPFNv2 works best with default parameters on small datasets\n", " },\n", " },\n", ")\n", "\n", "# Show prediction probabilities for first few samples\n", "tabpfnv2_predictions = tabpfnv2_predictor.predict_proba(wine_test_data)\n", "print(tabpfnv2_predictions.head())\n", "\n", "\n", "tabpfnv2_predictor.leaderboard(wine_test_data)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Advanced Usage: Combining Multiple Foundational Models\n", "\n", "AutoGluon allows you to combine multiple foundational models in a single predictor for enhanced performance through model stacking and ensembling:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Configure multiple foundational models together\n", "multi_foundation_config = {\n", " 'MITRA': {\n", " 'fine_tune': True,\n", " 'fine_tune_steps': 10\n", " },\n", " 'TABPFNV2': {},\n", " 'TABICL': {},\n", "}\n", "\n", "print(\"Training ensemble of foundational models...\")\n", "ensemble_predictor = TabularPredictor(\n", " label='target',\n", " path='./ensemble_foundation_model'\n", ").fit(\n", " wine_train_data,\n", " hyperparameters=multi_foundation_config,\n", " time_limit=300, # More time for multiple models\n", ")\n", "\n", "# Evaluate ensemble performance\n", "ensemble_predictor.leaderboard(wine_test_data)\n" ] } ], "metadata": { "kernelspec": { "display_name": "tutorial", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.18" } }, "nbformat": 4, "nbformat_minor": 4 }