Adding a custom metric to AutoGluon

Tip: If you are new to AutoGluon, review Predicting Columns in a Table - Quick Start to learn the basics of the AutoGluon API.

This tutorial describes how to add a custom evaluation metric to AutoGluon. The metric is then used to compute validation scores and to guide model ensembling, hyperparameter tuning, and more.

In this example, we show a variety of evaluation metrics and how to convert them to an AutoGluon Scorer (Scorer source code), which can then be passed to AutoGluon models and predictors.

First, we will randomly generate 10 ground truth labels and predictions and show how to calculate metric scores from them.

import numpy as np

rng = np.random.default_rng(seed=42)
y_true = rng.integers(low=0, high=2, size=10)
y_pred = rng.integers(low=0, high=2, size=10)

print(f'y_true: {y_true}')
print(f'y_pred: {y_pred}')

y_true: [0 1 1 0 0 1 0 1 0 0]
y_pred: [1 1 1 1 1 1 1 0 1 0]

Ensuring Metric is Serializable

Custom metrics must be defined in a separate Python file and imported so that they can be pickled (pickle is Python’s serialization protocol). If a custom metric is not picklable, AutoGluon will crash during fit when trying to parallelize model training with Ray. For the accuracy example in this tutorial, you would create a new Python file such as my_metrics.py that defines ag_accuracy_scorer, and then import it via from my_metrics import ag_accuracy_scorer.

If your metric is not serializable, you will get many errors similar to: _pickle.PicklingError: Can't pickle. Refer to https://github.com/autogluon/autogluon/issues/1637 for an example. For an example of how to specify a custom metric on Kaggle, refer to this Kaggle Notebook.

For ease of demonstration, the custom metrics in this tutorial are defined inline and are therefore not serializable. If the best_quality preset were used, calls to fit() would crash.
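To see why the definition site matters, here is a minimal sketch that uses only the standard library. It relies on the fact that pickle serializes plain functions by reference (module name plus qualified name), so a function that cannot be looked up by name, such as a lambda, fails to pickle. statistics.mean is only a stand-in for a metric imported from a module like my_metrics.py, and is_picklable is a hypothetical helper that is not part of AutoGluon. This illustrates the general mechanism, not necessarily AutoGluon's exact failure path.

```python
import pickle
import statistics


def is_picklable(obj) -> bool:
    """Return True if `obj` survives pickle.dumps (hypothetical helper)."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# A function importable from a module is pickled by reference to its name.
importable_metric = statistics.mean

# A lambda has no importable name, so pickle cannot reference it.
inline_metric = lambda y_true, y_pred: sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(is_picklable(importable_metric))  # True
print(is_picklable(inline_metric))      # False
```

A quick check like this on your own metric function can surface serialization problems before a long fit() call.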

Custom Accuracy Metric

We will start by creating a custom accuracy metric. A prediction is correct if the predicted value is the same as the true value; otherwise, it is wrong.

First, let’s use the default sklearn accuracy scorer:

import sklearn.metrics

sklearn.metrics.accuracy_score(y_true, y_pred)

0.4

There are a variety of limitations with the above logic. For example, without outside knowledge of the metric it is unknown:

What the optimal value is (1)
If higher values are better (True)
If the metric requires predictions, class predictions, or class probabilities (class predictions)

Now, let’s convert this evaluation metric to an AutoGluon Scorer to address these limitations.

We do this by calling autogluon.core.metrics.make_scorer (Source code: autogluon/core/metrics/__init__.py).

from autogluon.core.metrics import make_scorer

ag_accuracy_scorer = make_scorer(name='accuracy',
                                 score_func=sklearn.metrics.accuracy_score,
                                 optimum=1,
                                 greater_is_better=True,
                                 needs_class=True)

When creating the Scorer, we need to specify a name for the Scorer. This does not need to be any particular value but is used when printing information about the Scorer during training.

Next, we specify the score_func. This is the function we want to wrap, in this case, sklearn’s accuracy_score function.

We then need to specify the optimum value. This is necessary when calculating error (also known as regret) as opposed to score. error is defined as sign * optimum - score, where sign=1 if greater_is_better=True, else sign=-1. It is also useful to identify when a score is optimal and cannot be improved. Because the best possible value from sklearn.metrics.accuracy_score is 1, we specify optimum=1.

Next, we need to specify greater_is_better. In this case, greater_is_better=True because the best value returned is 1 and the worst value returned is 0. It is very important to set this value correctly; otherwise, AutoGluon will try to optimize for the worst model instead of the best.
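The effect of a wrong greater_is_better value can be sketched in plain Python. The validation errors below are made-up numbers, and pick_best is a hypothetical helper that mimics "select the model with the highest score" using the sign convention from this tutorial. It is not AutoGluon's selection code.

```python
# Made-up validation mean squared errors (lower is better) for three models.
val_mse = {"model_a": 0.30, "model_b": 0.12, "model_c": 0.45}


def pick_best(raw_values: dict, greater_is_better: bool) -> str:
    """Select the model with the highest score, where score = sign * raw value."""
    sign = 1 if greater_is_better else -1
    return max(raw_values, key=lambda name: sign * raw_values[name])


print(pick_best(val_mse, greater_is_better=False))  # model_b (lowest error)
print(pick_best(val_mse, greater_is_better=True))   # model_c (highest error)
```

With the flag set incorrectly, the selection picks the model with the largest error.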

Finally, we specify a boolean needs_* argument based on the type of metric we are using. The following options are available: [needs_pred, needs_proba, needs_class, needs_threshold, needs_quantile]. All of them default to False except needs_pred, which is inferred from the other four. At most one of those four can be set to True. If none are specified, the metric is treated as a regression metric (needs_pred=True).

Below is a detailed description of each:

needs_pred : bool | str, default="auto"
    Whether score_func requires the predict model method output as input to scoring.
    If "auto", will be inferred based on the values of the other `needs_*` arguments.
    Defaults to True if all other `needs_*` are False.
    Examples: ["root_mean_squared_error", "mean_squared_error", "r2", "mean_absolute_error", "median_absolute_error", "spearmanr", "pearsonr"]

needs_proba : bool, default=False
    Whether score_func requires predict_proba to get probability estimates out of a classifier.
    These scorers can benefit from calibration methods such as temperature scaling.
    Examples: ["log_loss", "roc_auc_ovo", "roc_auc_ovr", "pac"]

needs_class : bool, default=False
    Whether score_func requires class predictions (classification only).
    This is required to determine if the scorer is impacted by a decision threshold.
    These scorers can benefit from decision threshold calibration methods such as via `predictor.calibrate_decision_threshold()`.
    Examples: ["accuracy", "balanced_accuracy", "f1", "precision", "recall", "mcc", "quadratic_kappa", "f1_micro", "f1_macro", "f1_weighted"]

needs_threshold : bool, default=False
    Whether score_func takes a continuous decision certainty.
    This only works for binary classification.
    These scorers care about the rank order of the prediction probabilities to calculate their scores, and are undefined if given a single sample to score.
    Examples: ["roc_auc", "average_precision"]

needs_quantile : bool, default=False
    Whether score_func is based on quantile predictions.
    This only works for quantile regression.
    Examples: ["pinball_loss"]

Because we are creating an accuracy scorer, we need the class prediction, and therefore we specify needs_class=True.

Advanced Note: optimum must correspond to the optimal value of the original metric callable (in this case sklearn.metrics.accuracy_score). Hypothetically, if a metric callable were greater_is_better=False with an optimal value of -2, you should specify optimum=-2, greater_is_better=False. In this case, if raw_metric_value=-0.5, then Scorer would return score=0.5 to enforce higher_is_better (score = sign * raw_metric_value). Scorer’s error would be error=1.5 because sign (-1) * optimum (-2) - score (0.5) = 1.5.
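The two formulas used in this section, score = sign * raw_metric_value and error = sign * optimum - score, can be checked by hand with a few lines of plain Python. The functions to_score and to_error are hypothetical names for this sketch. They reproduce only the arithmetic described in the text and are not AutoGluon's Scorer implementation.

```python
def to_score(raw_value: float, greater_is_better: bool) -> float:
    """score = sign * raw_value, so that higher is always better."""
    sign = 1 if greater_is_better else -1
    return sign * raw_value


def to_error(raw_value: float, optimum: float, greater_is_better: bool) -> float:
    """error = sign * optimum - score, so that 0 is a perfect result."""
    sign = 1 if greater_is_better else -1
    return sign * optimum - to_score(raw_value, greater_is_better)


# Accuracy from this tutorial: raw value 0.4, optimum 1, higher is better.
acc_score = to_score(0.4, greater_is_better=True)
acc_error = to_error(0.4, optimum=1, greater_is_better=True)
print(acc_score, acc_error)  # 0.4 0.6

# Hypothetical metric from the note: raw value -0.5, optimum -2, lower is better.
hyp_score = to_score(-0.5, greater_is_better=False)
hyp_error = to_error(-0.5, optimum=-2, greater_is_better=False)
print(hyp_score, hyp_error)  # 0.5 1.5
```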

Once created, the AutoGluon Scorer can be called in the same fashion as the original metric to compute score.

# score
ag_accuracy_scorer(y_true, y_pred)

0.4

Alternatively, .score is an alias to the above callable for convenience:

ag_accuracy_scorer.score(y_true, y_pred)

0.4

To get the error instead of score:

# error, error=sign*optimum-score -> error=1*1-score -> error=1-score
ag_accuracy_scorer.error(y_true, y_pred)

# Can also convert score to error and vice-versa:
# score = ag_accuracy_scorer(y_true, y_pred)
# error = ag_accuracy_scorer.convert_score_to_error(score)
# score = ag_accuracy_scorer.convert_error_to_score(error)

# Can also convert score to the original score that would be returned in `score_func`:
# score_orig = ag_accuracy_scorer.convert_score_to_original(score)  # score_orig = sign * score

0.6

Note that score is in higher_is_better format, while error is in lower_is_better format. An error of 0 corresponds to a perfect prediction.

Custom Mean Squared Error Metric

Next, let’s show examples of how to convert regression metrics into Scorers.

First, we generate random ground truth labels and predictions. This time, they are floats instead of integers.

y_true = rng.random(10)
y_pred = rng.random(10)

print(f'y_true: {y_true}')
print(f'y_pred: {y_pred}')

y_true: [0.37079802 0.92676499 0.64386512 0.82276161 0.4434142  0.22723872
 0.55458479 0.06381726 0.82763117 0.6316644 ]
y_pred: [0.75808774 0.35452597 0.97069802 0.89312112 0.7783835  0.19463871
 0.466721   0.04380377 0.15428949 0.68304895]

A common regression metric is Mean Squared Error:

sklearn.metrics.mean_squared_error(y_true, y_pred)

0.11666381947652146

ag_mean_squared_error_scorer = make_scorer(name='mean_squared_error',
                                           score_func=sklearn.metrics.mean_squared_error,
                                           optimum=0,
                                           greater_is_better=False)

In this case, optimum=0 because this is an error metric: the best possible mean squared error is 0, which corresponds to a perfect prediction.

Additionally, greater_is_better=False because sklearn reports error as positive values, and the lower the value is, the better.

A very important point about AutoGluon Scorers is that internally, they will always report scores in greater_is_better=True form. This means if the original metric was greater_is_better=False, AutoGluon’s Scorer will flip the value. Therefore, score will be represented as a negative value.

This is done to ensure consistency between different metrics.

# score
ag_mean_squared_error_scorer(y_true, y_pred)

-0.11666381947652146

# error, error=sign*optimum-score -> error=-1*0-score -> error=-score
ag_mean_squared_error_scorer.error(y_true, y_pred)

0.11666381947652146

We can also specify metrics outside of sklearn. For example, below is a minimal implementation of mean squared error:

def mse_func(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    return ((y_true - y_pred) ** 2).mean()

mse_func(y_true, y_pred)

np.float64(0.11666381947652146)

All that is required is that the function takes two arguments, y_true and y_pred (or y_pred_proba), as NumPy arrays and returns a float value.
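Any function following this contract can be wrapped in the same way. As a further sketch, here is a root mean squared error. The name rmse_func is hypothetical and not part of AutoGluon. It casts the result to a built-in float, since NumPy reductions return np.float64 (as seen in the output of mse_func).

```python
import numpy as np


def rmse_func(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Root mean squared error; lower is better and the optimum is 0."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


# Tiny hand-checkable example: errors are 3 and 4, so RMSE = sqrt((9 + 16) / 2).
demo_true = np.array([0.0, 0.0])
demo_pred = np.array([3.0, 4.0])
print(rmse_func(demo_true, demo_pred))  # approximately 3.5355
```

Such a function would be wrapped with optimum=0 and greater_is_better=False, like the mean squared error scorers in this section.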

With the same code as before, we can create an AutoGluon Scorer.

ag_mean_squared_error_custom_scorer = make_scorer(name='mean_squared_error',
                                                  score_func=mse_func,
                                                  optimum=0,
                                                  greater_is_better=False)
ag_mean_squared_error_custom_scorer(y_true, y_pred)

np.float64(-0.11666381947652146)

Custom ROC AUC Metric

Here we show an example of a thresholding metric, roc_auc. A thresholding metric cares about the relative ordering of predictions, but not their absolute values.

y_true = rng.integers(low=0, high=2, size=10)
y_pred_proba = rng.random(10)

print(f'y_true:       {y_true}')
print(f'y_pred_proba: {y_pred_proba}')

y_true:       [1 1 0 1 0 0 1 0 0 0]
y_pred_proba: [0.18947136 0.12992151 0.47570493 0.22690935 0.66981399 0.43715192
 0.8326782  0.7002651  0.31236664 0.8322598 ]

sklearn.metrics.roc_auc_score(y_true, y_pred_proba)

0.25

We will need to specify needs_threshold=True in order for downstream models to properly use the metric.

# Score functions that need decision values
ag_roc_auc_scorer = make_scorer(name='roc_auc',
                                score_func=sklearn.metrics.roc_auc_score,
                                optimum=1,
                                greater_is_better=True,
                                needs_threshold=True)
ag_roc_auc_scorer(y_true, y_pred_proba)

0.25
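The claim that roc_auc depends only on the relative ordering of predictions can be verified by hand. ROC AUC equals the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The sketch below uses only plain Python, with the labels and probabilities copied from the printed output above as lists. pairwise_auc is a hypothetical helper written for this illustration, not the sklearn or AutoGluon implementation.

```python
y_true_list = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
y_proba_list = [0.18947136, 0.12992151, 0.47570493, 0.22690935, 0.66981399,
                0.43715192, 0.8326782, 0.7002651, 0.31236664, 0.8322598]


def pairwise_auc(labels, scores) -> float:
    """Fraction of (positive, negative) pairs where the positive is ranked higher.

    Ties count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return correct / (len(pos) * len(neg))


auc = pairwise_auc(y_true_list, y_proba_list)
# Squaring is strictly increasing on [0, 1], so the ranking is unchanged.
auc_squared = pairwise_auc(y_true_list, [p ** 2 for p in y_proba_list])
print(auc, auc_squared)  # 0.25 0.25
```

Only one of the four positive samples outranks the six negatives (6 of 24 pairs), which gives the same 0.25 as sklearn, and a monotonic transform of the probabilities leaves the value unchanged.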

Using Custom Metrics in TabularPredictor

Now that we have created several custom Scorers, let’s use them for training and evaluating models.

For this tutorial, we will be using the Adult Income dataset.

from autogluon.tabular import TabularDataset

train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')  # can be local CSV file as well, returns Pandas DataFrame
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')  # another Pandas DataFrame
label = 'class'  # specifies which column we want to predict
train_data = train_data.sample(n=1000, random_state=0)  # subsample dataset for faster demo

train_data.head(5)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	class
6118	51	Private	39264	Some-college	10	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	40	United-States	>50K
23204	58	Private	51662	10th	6	Married-civ-spouse	Other-service	Wife	White	Female	0	0	8	United-States	<=50K
29590	40	Private	326310	Some-college	10	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	44	United-States	<=50K
18116	37	Private	222450	HS-grad	9	Never-married	Sales	Not-in-family	White	Male	0	2339	40	El-Salvador	<=50K
33964	62	Private	109190	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	15024	0	40	United-States	>50K

from autogluon.tabular import TabularPredictor

predictor = TabularPredictor(label=label).fit(train_data, hyperparameters='toy')

predictor.leaderboard(test_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20251219_224533"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version:  1.5.0b20251219
Python Version:     3.12.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Pytorch Version:    2.9.1+cu128
CUDA Version:       12.8
GPU Memory:         GPU 0: 14.57/14.57 GB
Total GPU Memory:   Free: 14.57 GB, Allocated: 0.00 GB, Total: 14.57 GB
GPU Count:          1
Memory Avail:       28.51 GB / 30.95 GB (92.1%)
Disk Space Avail:   204.95 GB / 255.99 GB (80.1%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='extreme'  : New in v1.5: The state-of-the-art for tabular data. Massively better than 'best' on datasets <100000 samples by using new Tabular Foundation Models (TFMs) meta-learned on https://tabarena.ai: TabPFNv2, TabICL, Mitra, TabDPT, and TabM. Requires a GPU and `pip install autogluon.tabular[tabarena]` to install TabPFN, TabICL, and TabDPT.
	presets='best'     : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='best_v150': New in v1.5: Better quality than 'best' and 5x+ faster to train. Give it a try!
	presets='high'     : Strong accuracy with fast inference speed.
	presets='high_v150': New in v1.5: Better quality than 'high' and 5x+ faster to train. Give it a try!
	presets='good'     : Good accuracy with very fast inference speed.
	presets='medium'   : Fast training time, ideal for initial prototyping.
Using hyperparameters preset: hyperparameters='toy'
Beginning AutoGluon training ...
AutoGluon will save models to "/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20251219_224533"
Train Data Rows:    1000
Train Data Columns: 14
Label Column:       class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       binary
Preprocessing data ...
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    29168.38 MB
	Train Data (Original)  Memory Usage: 0.50 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
		('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('int', ['bool']) : 1 | ['sex']
	0.1s = Fit runtime
	14 features in original data used to generate 14 features in processed data.
	Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.09s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200
User-specified model hyperparameters to be fit:
{
	'NN_TORCH': [{'num_epochs': 5}],
	'GBM': [{'num_boost_round': 10}],
	'CAT': [{'iterations': 10}],
	'XGB': [{'n_estimators': 10}],
}
Fitting 4 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBM ...
	Fitting with cpus=4, gpus=0, mem=0.0/28.5 GB
	0.77	 = Validation score   (accuracy)
	0.28s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: CatBoost ...
	Fitting with cpus=4, gpus=0
	0.86	 = Validation score   (accuracy)
	0.11s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: XGBoost ...
	Fitting with cpus=4, gpus=0
	0.84	 = Validation score   (accuracy)
	0.26s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetTorch ...
	Fitting with cpus=4, gpus=0, mem=0.0/28.4 GB
/home/ci/opt/venv/lib/python3.12/site-packages/sklearn/compose/_column_transformer.py:975: FutureWarning: The parameter `force_int_remainder_cols` is deprecated and will be removed in 1.9. It has no effect. Leave it to its default value to avoid this warning.
  warnings.warn(
	0.835	 = Validation score   (accuracy)
	1.55s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	Fitting 1 model on all data | Fitting with cpus=8, gpus=0, mem=0.0/28.3 GB
	Ensemble Weights: {'CatBoost': 1.0}
	0.86	 = Validation score   (accuracy)
	0.05s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 2.42s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 43457.5 rows/s (200 batch size)
Disabling decision threshold calibration for metric `accuracy` due to having fewer than 10000 rows of validation data for calibration, to avoid overfitting (200 rows).
	`accuracy` is generally not improved through threshold calibration. Force calibration via specifying `calibrate_decision_threshold=True`.
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20251219_224533")

	model	score_test	score_val	eval_metric	pred_time_test	pred_time_val	fit_time	pred_time_test_marginal	pred_time_val_marginal	fit_time_marginal	stack_level	can_infer	fit_order
0	CatBoost	0.842768	0.860	accuracy	0.005882	0.003718	0.114652	0.005882	0.003718	0.114652	1	True	2
1	WeightedEnsemble_L2	0.842768	0.860	accuracy	0.007861	0.004602	0.160706	0.001980	0.000884	0.046054	2	True	5
2	XGBoost	0.836831	0.840	accuracy	0.020826	0.005827	0.259298	0.020826	0.005827	0.259298	1	True	3
3	NeuralNetTorch	0.830484	0.835	accuracy	0.047334	0.010368	1.552910	0.047334	0.010368	1.552910	1	True	4
4	LightGBM	0.780940	0.770	accuracy	0.005432	0.002629	0.281562	0.005432	0.002629	0.281562	1	True	1

We can pass our custom metrics into predictor.leaderboard via the extra_metrics argument:

predictor.leaderboard(test_data, extra_metrics=[ag_roc_auc_scorer, ag_accuracy_scorer])

	model	score_test	roc_auc	accuracy	score_val	eval_metric	pred_time_test	pred_time_val	fit_time	pred_time_test_marginal	pred_time_val_marginal	fit_time_marginal	stack_level	can_infer	fit_order
0	CatBoost	0.842768	0.863760	0.842768	0.860	accuracy	0.005619	0.003718	0.114652	0.005619	0.003718	0.114652	1	True	2
1	WeightedEnsemble_L2	0.842768	0.863760	0.842768	0.860	accuracy	0.007517	0.004602	0.160706	0.001898	0.000884	0.046054	2	True	5
2	XGBoost	0.836831	0.890173	0.836831	0.840	accuracy	0.020543	0.005827	0.259298	0.020543	0.005827	0.259298	1	True	3
3	NeuralNetTorch	0.830484	0.879722	0.830484	0.835	accuracy	0.047685	0.010368	1.552910	0.047685	0.010368	1.552910	1	True	4
4	LightGBM	0.780940	0.861131	0.780940	0.770	accuracy	0.005407	0.002629	0.281562	0.005407	0.002629	0.281562	1	True	1

We can also pass our custom metric into the Predictor itself by specifying it during initialization via the eval_metric parameter:

predictor_custom = TabularPredictor(label=label, eval_metric=ag_roc_auc_scorer).fit(train_data, hyperparameters='toy')

predictor_custom.leaderboard(test_data)

No path specified. Models will be saved in: "AutogluonModels/ag-20251219_224538"
Verbosity: 2 (Standard Logging)
=================== System Info ===================
AutoGluon Version:  1.5.0b20251219
Python Version:     3.12.10
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Wed Mar 12 14:53:59 UTC 2025
CPU Count:          8
Pytorch Version:    2.9.1+cu128
CUDA Version:       12.8
GPU Memory:         GPU 0: 14.57/14.57 GB
Total GPU Memory:   Free: 14.57 GB, Allocated: 0.00 GB, Total: 14.57 GB
GPU Count:          1
Memory Avail:       28.30 GB / 30.95 GB (91.4%)
Disk Space Avail:   204.95 GB / 255.99 GB (80.1%)
===================================================
No presets specified! To achieve strong results with AutoGluon, it is recommended to use the available presets. Defaulting to `'medium'`...
	Recommended Presets (For more details refer to https://auto.gluon.ai/stable/tutorials/tabular/tabular-essentials.html#presets):
	presets='extreme'  : New in v1.5: The state-of-the-art for tabular data. Massively better than 'best' on datasets <100000 samples by using new Tabular Foundation Models (TFMs) meta-learned on https://tabarena.ai: TabPFNv2, TabICL, Mitra, TabDPT, and TabM. Requires a GPU and `pip install autogluon.tabular[tabarena]` to install TabPFN, TabICL, and TabDPT.
	presets='best'     : Maximize accuracy. Recommended for most users. Use in competitions and benchmarks.
	presets='best_v150': New in v1.5: Better quality than 'best' and 5x+ faster to train. Give it a try!
	presets='high'     : Strong accuracy with fast inference speed.
	presets='high_v150': New in v1.5: Better quality than 'high' and 5x+ faster to train. Give it a try!
	presets='good'     : Good accuracy with very fast inference speed.
	presets='medium'   : Fast training time, ideal for initial prototyping.
Using hyperparameters preset: hyperparameters='toy'
Beginning AutoGluon training ...
AutoGluon will save models to "/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20251219_224538"
Train Data Rows:    1000
Train Data Columns: 14
Label Column:       class
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
	2 unique label values:  [' >50K', ' <=50K']
	If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
Problem Type:       binary
Preprocessing data ...
Selected class <--> label mapping:  class 1 =  >50K, class 0 =  <=50K
	Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
	To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
	Available Memory:                    28976.62 MB
	Train Data (Original)  Memory Usage: 0.50 MB (0.0% of available memory)
	Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
	Stage 1 Generators:
		Fitting AsTypeFeatureGenerator...
			Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
	Stage 2 Generators:
		Fitting FillNaFeatureGenerator...
	Stage 3 Generators:
		Fitting IdentityFeatureGenerator...
		Fitting CategoryFeatureGenerator...
			Fitting CategoryMemoryMinimizeFeatureGenerator...
	Stage 4 Generators:
		Fitting DropUniqueFeatureGenerator...
	Stage 5 Generators:
		Fitting DropDuplicatesFeatureGenerator...
	Types of features in original data (raw dtype, special dtypes):
		('int', [])    : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
	Types of features in processed data (raw dtype, special dtypes):
		('category', [])  : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
		('int', [])       : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
		('int', ['bool']) : 1 | ['sex']
	0.1s = Fit runtime
	14 features in original data used to generate 14 features in processed data.
	Train Data (Processed) Memory Usage: 0.06 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.09s ...
AutoGluon will gauge predictive performance using evaluation metric: 'roc_auc'
	This metric expects predicted probabilities rather than predicted class labels, so you'll need to use predict_proba() instead of predict()
	To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 800, Val Rows: 200
User-specified model hyperparameters to be fit:
{
	'NN_TORCH': [{'num_epochs': 5}],
	'GBM': [{'num_boost_round': 10}],
	'CAT': [{'iterations': 10}],
	'XGB': [{'n_estimators': 10}],
}
Fitting 4 L1 models, fit_strategy="sequential" ...
Fitting model: LightGBM ...
	Fitting with cpus=4, gpus=0, mem=0.0/28.3 GB
	0.85	 = Validation score   (roc_auc)
	0.19s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: CatBoost ...
	Fitting with cpus=4, gpus=0
	0.8693	 = Validation score   (roc_auc)
	0.03s	 = Training   runtime
	0.0s	 = Validation runtime
Fitting model: XGBoost ...
	Fitting with cpus=4, gpus=0
	0.8616	 = Validation score   (roc_auc)
	0.04s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: NeuralNetTorch ...
	Fitting with cpus=4, gpus=0, mem=0.0/28.3 GB
/home/ci/opt/venv/lib/python3.12/site-packages/sklearn/compose/_column_transformer.py:975: FutureWarning: The parameter `force_int_remainder_cols` is deprecated and will be removed in 1.9. It has no effect. Leave it to its default value to avoid this warning.
  warnings.warn(
	0.8551	 = Validation score   (roc_auc)
	0.54s	 = Training   runtime
	0.01s	 = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
	Fitting 1 model on all data | Fitting with cpus=8, gpus=0, mem=0.0/28.3 GB
	Ensemble Weights: {'XGBoost': 0.333, 'LightGBM': 0.278, 'CatBoost': 0.278, 'NeuralNetTorch': 0.111}
	0.8787	 = Validation score   (roc_auc)
	0.14s	 = Training   runtime
	0.0s	 = Validation runtime
AutoGluon training complete, total runtime = 1.1s ... Best model: WeightedEnsemble_L2 | Estimated inference throughput: 8398.3 rows/s (200 batch size)
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("/home/ci/autogluon/docs/tutorials/tabular/advanced/AutogluonModels/ag-20251219_224538")

	model	score_test	score_val	eval_metric	pred_time_test	pred_time_val	fit_time	pred_time_test_marginal	pred_time_val_marginal	fit_time_marginal	stack_level	can_infer	fit_order
0	WeightedEnsemble_L2	0.900791	0.878668	roc_auc	0.086485	0.023814	0.940180	0.003109	0.001678	0.142725	2	True	5
1	XGBoost	0.890173	0.861627	roc_auc	0.020870	0.005075	0.035173	0.020870	0.005075	0.035173	1	True	3
2	CatBoost	0.887425	0.869325	roc_auc	0.006420	0.003531	0.028206	0.006420	0.003531	0.028206	1	True	2
3	NeuralNetTorch	0.879722	0.855113	roc_auc	0.049622	0.010175	0.543369	0.049622	0.010175	0.543369	1	True	4
4	LightGBM	0.870968	0.849980	roc_auc	0.006464	0.003356	0.190708	0.006464	0.003356	0.190708	1	True	1

That’s all it takes to create and use custom metrics in AutoGluon!

If you create a custom metric, consider submitting a PR so that we can officially add it to AutoGluon!

For a tutorial on implementing custom models in AutoGluon, refer to Adding a custom model to AutoGluon.

For more tutorials, refer to Predicting Columns in a Table - Quick Start and Predicting Columns in a Table - In Depth.