TabularPredictor.fit
TabularPredictor.fit(train_data, tuning_data=None, time_limit=None, presets=None, hyperparameters=None, feature_metadata='infer', infer_limit=None, infer_limit_batch_size=None, fit_weighted_ensemble=True, num_cpus='auto', num_gpus='auto', **kwargs)
Fit models to predict a column of a data table (label) based on the other columns (features).
- Parameters
train_data (str or TabularDataset or pd.DataFrame) – Table of the training data, which is similar to a pandas DataFrame. If str is passed, train_data will be loaded using the str value as the file path.
tuning_data (str or TabularDataset or pd.DataFrame, default = None) – Another dataset containing validation data reserved for tuning processes such as early stopping and hyperparameter tuning. This dataset should be in the same format as train_data. If str is passed, tuning_data will be loaded using the str value as the file path. Note: the final model returned may be fit on tuning_data as well as train_data. Do not provide your evaluation test data here! In particular, when num_bag_folds > 0 or num_stack_levels > 0, models will be trained on both tuning_data and train_data. If tuning_data = None, fit() will automatically hold out some random validation examples from train_data.
time_limit (int, default = None) – Approximately how long fit() should run for (wall-clock time in seconds). If not specified, fit() will run until all models have completed training, but will not repeatedly bag models unless num_bag_sets is specified.
presets (list or str or dict, default = ['medium_quality']) –
List of preset configurations for various arguments in fit(). Can significantly impact the predictive accuracy, memory footprint, and inference latency of trained models, as well as various other properties of the returned predictor. It is recommended to specify presets and avoid specifying most other fit() arguments or model hyperparameters until you are familiar with AutoGluon. For example, to get the most accurate overall predictor (regardless of its efficiency), set presets='best_quality'. To get good quality with minimal disk usage, set presets=['good_quality', 'optimize_for_deployment'] (see the usage sketch after the preset descriptions below). Any user-specified arguments in fit() will override the values used by presets. If specifying a list of presets, later presets will override earlier presets if they alter the same argument. For precise definitions of the provided presets, see file: autogluon/tabular/configs/presets_configs.py. Users can specify custom presets by passing in a dictionary of argument values as an element to the list.
Available Presets: ['best_quality', 'high_quality', 'good_quality', 'medium_quality', 'optimize_for_deployment', 'interpretable', 'ignore_text']
It is recommended to only use one quality-based preset in a given call to fit(), as they alter many of the same arguments and are not compatible with each other.
- In-depth Preset Info:
- best_quality={'auto_stack': True}
Best predictive accuracy with little consideration to inference time or disk usage. Achieve even better results by specifying a large time_limit value. Recommended for applications that benefit from the best possible model accuracy.
- high_quality={'auto_stack': True, 'refit_full': True, 'set_best_to_refit_full': True, '_save_bag_folds': False}
High predictive accuracy with fast inference. ~10x-200x faster inference and ~10x-200x lower disk usage than best_quality. Recommended for applications that require reasonable inference speed and/or model size.
- good_quality={'auto_stack': True, 'refit_full': True, 'set_best_to_refit_full': True, '_save_bag_folds': False, 'hyperparameters': 'light'}
Good predictive accuracy with very fast inference. ~4x faster inference and ~4x lower disk usage than high_quality. Recommended for applications that require fast inference speed.
- medium_quality={'auto_stack': False}
Medium predictive accuracy with very fast inference and very fast training time. ~20x faster training than good_quality. This is the default preset in AutoGluon, but should generally only be used for quick prototyping, as good_quality results in significantly better predictive accuracy and faster inference time.
- optimize_for_deployment={'keep_only_best': True, 'save_space': True}
Optimizes the result immediately for deployment by deleting unused models and removing training artifacts. Often reduces disk usage by ~2-4x with no negative impact on model accuracy or inference speed. This disables numerous pieces of advanced functionality, but has no impact on inference.
Because unused models are deleted under this preset, methods like predictor.leaderboard() and predictor.fit_summary() will no longer show the full set of models that were trained during fit(), making them less informative.
Recommended for applications where the inner details of AutoGluon's training are not important and there is no intention of manually choosing between the final models. This preset pairs well with the other presets, such as good_quality, to make a very compact final model. Identical to calling predictor.delete_models(models_to_keep='best', dry_run=False) and predictor.save_space() directly after fit().
- interpretable={'auto_stack': False, 'hyperparameters': 'interpretable'}
Fits only interpretable rule-based models from the imodels package. Trades off predictive accuracy for conciseness.
- ignore_text={'_feature_generator_kwargs': {'enable_text_ngram_features': False, 'enable_text_special_features': False, 'enable_raw_text_features': False}}
Disables automated feature generation when text features are detected. This is useful to determine how beneficial text features are to the end result, as well as to ensure features are not mistaken for text when they are not. Ignored if feature_generator was also specified.
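For illustration, a minimal sketch of combining presets, reusing the Inc dataset and its 'class' label column from the Examples section at the end of this page:
>>> from autogluon.tabular import TabularDataset, TabularPredictor
>>> train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
>>> # Later presets in the list override earlier ones where they alter the same argument
>>> predictor = TabularPredictor(label='class').fit(
...     train_data,
...     presets=['good_quality', 'optimize_for_deployment'],
... )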
hyperparameters (str or dict, default = 'default') –
Determines the hyperparameters used by the models. If str is passed, will use a preset hyperparameter configuration.
- Valid str options: ['default', 'light', 'very_light', 'toy', 'multimodal']
'default': Default AutoGluon hyperparameters intended to maximize accuracy without significant regard to inference time or disk usage.
'light': Results in smaller models. Generally will make inference speed much faster and disk usage much lower, but with worse accuracy.
'very_light': Results in much smaller models. Behaves similarly to 'light', but in many cases with over 10x less disk usage and a further reduction in accuracy.
'toy': Results in extremely small models. Only use this when prototyping, as the model quality will be severely reduced.
'multimodal': [EXPERIMENTAL] Trains a multimodal transformer model alongside tabular models. Requires that some text columns appear in the data, a GPU, and CUDA-enabled MXNet.
When combined with the 'best_quality' presets option, this can achieve extremely strong results in multimodal data tables that contain columns with text in addition to numeric/categorical columns.
Reference autogluon/tabular/configs/hyperparameter_configs.py for information on the hyperparameters associated with each preset.
- Keys are strings that indicate which model types to train.
- Stable model options include:
'GBM' (LightGBM)
'CAT' (CatBoost)
'XGB' (XGBoost)
'RF' (random forest)
'XT' (extremely randomized trees)
'KNN' (k-nearest neighbors)
'LR' (linear regression)
'NN_TORCH' (neural network implemented in PyTorch)
'FASTAI' (neural network with FastAI backend)
'AG_AUTOMM' (MultimodalPredictor from autogluon.multimodal. Supports Tabular, Text, and Image modalities. GPU is required.)
- Experimental model options include:
'FT_TRANSFORMER' (Tabular Transformer, GPU is recommended. Does not scale well to >100 features.)
'FASTTEXT' (FastText)
'VW' (VowpalWabbit)
'AG_TEXT_NN' (Multimodal Text+Tabular model, GPU is required. Recommended to instead use its successor, 'AG_AUTOMM'.)
'AG_IMAGE_NN' (Image model, GPU is required. Recommended to instead use its successor, 'AG_AUTOMM'.)
If a certain key is missing from hyperparameters, then fit() will not train any models of that type. Omitting a model key from hyperparameters is equivalent to including this model key in excluded_model_types. For example, set hyperparameters = {'NN_TORCH': {...}} if, say, you only want to train (PyTorch) neural networks and no other types of models.
- Values = dict of hyperparameter settings for each model type, or list of dicts.
Each hyperparameter can either be a single fixed value or a search space containing many possible values. Unspecified hyperparameters will be set to default values (or default search spaces if hyperparameter_tune_kwargs='auto'). Caution: any provided search spaces will error if hyperparameter_tune_kwargs=None (the default). To train multiple models of a given type, set the value to a list of hyperparameter dictionaries.
For example, hyperparameters = {'RF': [{'criterion': 'gini'}, {'criterion': 'entropy'}]} will result in 2 random forest models being trained with separate hyperparameters. A combined sketch is shown below.
- Some model types have preset hyperparameter configs keyed under strings as shorthand for a complex model hyperparameter configuration known to work well:
'GBM': ['GBMLarge']
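A sketch combining these options, continuing from the presets sketch above. The import location of the search-space helpers is an assumption (it has varied across AutoGluon releases); 'num_boost_round' and 'learning_rate' are ordinary LightGBM and neural-network hyperparameters used here for illustration:
>>> from autogluon.common import space  # search-space helpers; import path is an assumption
>>> hyperparameters = {
...     'GBM': [{'num_boost_round': 200}, 'GBMLarge'],  # one custom config plus the 'GBMLarge' preset
...     'RF': [{'criterion': 'gini'}, {'criterion': 'entropy'}],  # two RF models
...     'NN_TORCH': {'learning_rate': space.Real(1e-4, 1e-2, log=True)},  # a search space
... }
>>> # Search spaces require HPO to be enabled, otherwise fit() raises an error
>>> predictor = TabularPredictor(label='class').fit(
...     train_data,
...     hyperparameters=hyperparameters,
...     hyperparameter_tune_kwargs='auto',
... )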
- Advanced functionality: Bring your own model / Custom model support
AutoGluon fully supports custom models. For a detailed tutorial on creating and using custom models with AutoGluon, refer to https://auto.gluon.ai/stable/tutorials/tabular_prediction/tabular-custom-model.html
- Advanced functionality: Custom stack levels
By default, AutoGluon re-uses the same models and model hyperparameters at each level during stack ensembling. To customize this behaviour, create a hyperparameters dictionary separately for each stack level, and then add them as values to a new dictionary, with keys equal to the stack level.
Example: hyperparameters = {1: {'RF': rf_params1}, 2: {'CAT': [cat_params1, cat_params2], 'NN_TORCH': {}}} This will result in a stack ensemble that has one custom random forest in level 1 followed by two CatBoost models with custom hyperparameters and a default neural network in level 2, for a total of 4 models.
If a level is not specified in hyperparameters, it will default to using the highest specified level to train models. This can also be explicitly controlled by adding a 'default' key. A minimal sketch follows.
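A minimal sketch of per-level hyperparameters, with placeholder parameter dicts standing in for rf_params1, cat_params1, and cat_params2 from the example above ('depth' is a standard CatBoost hyperparameter; the values are illustrative):
>>> rf_params1 = {'criterion': 'gini'}                     # placeholder custom RF config
>>> cat_params1, cat_params2 = {'depth': 4}, {'depth': 8}  # placeholder CatBoost configs
>>> hyperparameters = {
...     1: {'RF': rf_params1},                                   # level-1 (base) models
...     2: {'CAT': [cat_params1, cat_params2], 'NN_TORCH': {}},  # level-2 stacker models
... }
>>> predictor = TabularPredictor(label='class').fit(
...     train_data,
...     hyperparameters=hyperparameters,
...     num_bag_folds=5,
...     num_stack_levels=1,  # stacking must be enabled for level 2 to train
... )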
- Default:
hyperparameters = {
    'NN_TORCH': {},
    'GBM': [
        {'extra_trees': True, 'ag_args': {'name_suffix': 'XT'}},
        {},
        'GBMLarge',
    ],
    'CAT': {},
    'XGB': {},
    'FASTAI': {},
    'RF': [
        {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
    ],
    'XT': [
        {'criterion': 'gini', 'ag_args': {'name_suffix': 'Gini', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'entropy', 'ag_args': {'name_suffix': 'Entr', 'problem_types': ['binary', 'multiclass']}},
        {'criterion': 'squared_error', 'ag_args': {'name_suffix': 'MSE', 'problem_types': ['regression']}},
    ],
    'KNN': [
        {'weights': 'uniform', 'ag_args': {'name_suffix': 'Unif'}},
        {'weights': 'distance', 'ag_args': {'name_suffix': 'Dist'}},
    ],
}
- Details regarding the hyperparameters you can specify for each model are provided in the following files:
- NN: autogluon.tabular.models.tabular_nn.hyperparameters.parameters
Note: certain hyperparameter settings may cause these neural networks to train much slower.
- GBM: autogluon.tabular.models.lgb.hyperparameters.parameters
See also the LightGBM docs: https://lightgbm.readthedocs.io/en/latest/Parameters.html
- CAT: autogluon.tabular.models.catboost.hyperparameters.parameters
See also the CatBoost docs: https://catboost.ai/docs/concepts/parameter-tuning.html
- XGB: autogluon.tabular.models.xgboost.hyperparameters.parameters
See also the XGBoost docs: https://xgboost.readthedocs.io/en/latest/parameter.html
- FASTAI: autogluon.tabular.models.fastainn.hyperparameters.parameters
See also the FastAI docs: https://docs.fast.ai/tabular.models.html
- RF: See sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
Note: Hyperparameter tuning is disabled for this model.
- XT: See sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
Note: Hyperparameter tuning is disabled for this model.
- KNN: See sklearn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
Note: Hyperparameter tuning is disabled for this model.
- LR: autogluon.tabular.models.lr.hyperparameters.parameters
Note: Hyperparameter tuning is disabled for this model. Note: The 'penalty' parameter can be used for regression to specify the regularization method: 'L1' and 'L2' values are supported.
- Advanced functionality: Custom AutoGluon model arguments
- These arguments are optional and can be specified in any model’s hyperparameters.
Example: hyperparameters = {'RF': {..., 'ag_args': {'name_suffix': 'CustomModelSuffix', 'disable_in_hpo': True}}}
- ag_args: Dictionary of customization options related to meta properties of the model such as its name, the order it is trained, the problem types it is valid for, and the type of HPO it utilizes.
- Valid keys:
name: (str) The name of the model. This overrides AutoGluon's naming logic and all other name arguments if present.
name_main: (str) The main name of the model. Example: 'RandomForest'.
name_prefix: (str) Add a custom prefix to the model name. Unused by default.
name_suffix: (str) Add a custom suffix to the model name. Unused by default.
priority: (int) Determines the order in which the model is trained. Larger values result in the model being trained earlier. Default values range from 100 (KNN) to 0 (custom), dictated by model type. If you want this model to be trained first, set priority = 999.
problem_types: (list) List of valid problem types for the model. problem_types=['binary'] will result in the model only being trained if problem_type is 'binary'.
disable_in_hpo: (bool) If True, the model will only be trained if hyperparameter_tune_kwargs=None.
valid_stacker: (bool) If False, the model will not be trained as a level 2 or higher stacker model.
valid_base: (bool) If False, the model will not be trained as a level 1 (base) model.
hyperparameter_tune_kwargs: (dict) Refer to the TabularPredictor.fit() hyperparameter_tune_kwargs argument. If specified here, will override global HPO settings for this model.
Reference the default hyperparameters for example usage of these options.
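A hedged sketch using the ag_args keys documented above (the 'priority' and 'problem_types' values are illustrative; train_data is from the presets sketch above):
>>> hyperparameters = {
...     'GBM': {'ag_args': {'name_suffix': 'Custom', 'priority': 999}},  # train this model first
...     'RF': {'ag_args': {'problem_types': ['binary', 'multiclass']}},  # skip RF for regression
... }
>>> predictor = TabularPredictor(label='class').fit(train_data, hyperparameters=hyperparameters)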
- ag_args_fit: Dictionary of model fit customization options related to how and with what constraints the model is trained. These parameters affect stacker fold models, but not stacker models themselves.
Clarification: time_limit is the internal time in seconds given to a particular model to train, which is dictated in part by the time_limit argument given during predictor.fit() but is not the same. Valid keys:
stopping_metric: (str or autogluon.core.metrics.Scorer, default=None) The metric to use for early stopping of the model. If None, the model will decide.
max_memory_usage_ratio: (float, default=1.0) The ratio of memory usage relative to the default to allow before early stopping or killing the model. Values greater than 1.0 will be increasingly prone to out-of-memory errors.
max_time_limit_ratio: (float, default=1.0) The ratio of the provided time_limit to use during model fit(). If time_limit=10 and max_time_limit_ratio=0.3, time_limit would be changed to 3. Does not alter max_time_limit or min_time_limit values.
max_time_limit: (float, default=None) Maximum amount of time to allow this model to train for (in sec). If the provided time_limit is greater than this value, it will be replaced by max_time_limit.
min_time_limit: (float, default=0) Allow this model to train for at least this long (in sec), regardless of the time limit it would otherwise be granted. If min_time_limit >= max_time_limit, time_limit will be set to min_time_limit. If min_time_limit=None, time_limit will be set to None and the model will have no training time restriction.
num_cpus: (int or str, default='auto') How many CPUs to use during model fit. If 'auto', the model will decide.
num_gpus: (int or str, default='auto') How many GPUs to use during model fit. If 'auto', the model will decide. Some models can use GPUs but don't by default due to differences in model quality. Set to 0 to disable usage of GPUs.
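A sketch of per-model fit constraints via ag_args_fit, using the keys documented above (the numeric values are illustrative):
>>> hyperparameters = {
...     'GBM': {
...         'ag_args_fit': {
...             'max_time_limit': 600,          # cap this model's training at 10 minutes
...             'max_memory_usage_ratio': 0.8,  # be stricter than the default memory threshold
...             'num_gpus': 0,                  # force CPU-only training for this model
...         },
...     },
... }
>>> predictor = TabularPredictor(label='class').fit(train_data, hyperparameters=hyperparameters)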
- ag_args_ensemble: Dictionary of hyperparameters shared by all models that control how they are ensembled, if bag mode is enabled.
- Valid keys:
use_orig_features: (bool) Whether a stack model will use the original features along with the stack features to train (akin to skip-connections). If the model has no stack features (no base models), this value is ignored and the stack model will use the original features.
max_base_models: (int, default=25) Maximum number of base models whose predictions form the features input to this stacker model. If more than max_base_models base models are available, only the top max_base_models models with the highest validation score are used.
max_base_models_per_type: (int, default=5) Similar to max_base_models. If more than max_base_models_per_type of any particular model type are available, only the top max_base_models_per_type of that type are used. This occurs before the max_base_models filter.
num_folds: (int, default=None) If specified, the number of folds to fit in the bagged model. If specified, overrides any other value used to determine the number of folds, such as the predictor.fit num_bag_folds argument.
- max_sets: (int, default=None) If specified, the maximum number of sets to fit in the bagged model.
The lesser of max_sets and the predictor.fit num_bag_sets argument will be used for the given model. Useful if a particular model is expensive relative to others and you want to avoid repeatedly bagging the expensive model while still repeatedly bagging the cheaper models.
- save_bag_folds: (bool, default=True)
If True, bagged models will save their fold models (the models from each individual fold of bagging). This is required to use bagged models for prediction. If False, bagged models will not save their fold models. This means that bagged models will not be valid models during inference.
This should only be set to False when planning to call predictor.refit_full() or when refit_full is set and set_best_to_refit_full=True. Particularly useful if disk usage is a concern. By not saving the fold models, bagged models will use only very small amounts of disk space during training. In many training runs, this will reduce peak disk usage by >10x.
- fold_fitting_strategy: (AbstractFoldFittingStrategy, default='auto') Whether to fit folds in parallel or sequentially.
If 'parallel_local', folds will be trained in parallel with evenly distributed computing resources. This can bring a 2-4x speedup compared to SequentialLocalFoldFittingStrategy, but may consume much more memory. If 'sequential_local', folds will be trained sequentially. If 'auto', the strategy will be determined by the OS and whether ray is installed. MacOS support for 'parallel_local' is unstable and may crash if enabled.
- num_folds_parallel: (int or str, default='auto') Number of folds to be trained in parallel if using ParallelLocalFoldFittingStrategy. Consider lowering this value if you encounter an out-of-memory issue or a CUDA out-of-memory issue (when training on GPU).
If 'auto', will try to train all folds in parallel.
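A sketch of passing these options globally through the ag_args_ensemble fit argument (described under **kwargs below); the specific values are illustrative:
>>> predictor = TabularPredictor(label='class').fit(
...     train_data,
...     num_bag_folds=8,  # bag mode must be enabled for these options to apply
...     ag_args_ensemble={
...         'fold_fitting_strategy': 'sequential_local',  # trade speed for lower memory use
...         'max_base_models_per_type': 3,                # tighter stacker input filtering
...     },
... )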
feature_metadata (autogluon.tabular.FeatureMetadata or str, default = 'infer') – The feature metadata used in various inner logic in feature preprocessing. If 'infer', will automatically construct a FeatureMetadata object based on the properties of train_data. In this case, train_data is input into autogluon.tabular.FeatureMetadata.from_df() to infer feature_metadata. If 'infer' incorrectly assumes the dtypes of features, consider explicitly specifying feature_metadata.
infer_limit (float, default = None) – The inference time limit in seconds per row to adhere to during fit. If infer_limit=0.05 and infer_limit_batch_size=1000, AutoGluon will avoid training models that take longer than 50 ms/row to predict when given a batch of 1000 rows to predict (must predict 1000 rows in no more than 50 seconds). If bagging is enabled, the inference time limit will be respected based on the estimated inference speed of _FULL models after refit_full is called, NOT on the inference speed of the bagged ensembles. The inference times calculated for models assume predictor.persist_models('all') is called after fit. If None, no limit is enforced. If it is impossible to satisfy the constraint, an exception will be raised.
infer_limit_batch_size (int, default = None) – The batch size to use when predicting in bulk to estimate per-row inference time. Must be an integer greater than 0. If None and infer_limit is specified, will default to 10000. It is recommended to set to 10000 unless you must satisfy an online-inference scenario. Small values, especially infer_limit_batch_size=1, will result in much larger per-row inference times and should be avoided if possible. Refer to infer_limit for more details on how this is used. If specified when infer_limit=None, the inference time will be logged during training but will not be limited.
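A minimal sketch of constraining inference speed during fit (the 0.0001 s/row budget is illustrative):
>>> predictor = TabularPredictor(label='class').fit(
...     train_data,
...     infer_limit=0.0001,            # at most 0.1 ms per row
...     infer_limit_batch_size=10000,  # speed measured on 10000-row batches
... )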
fit_weighted_ensemble (bool, default = True) – If True, a WeightedEnsembleModel will be fit in each stack layer. A weighted ensemble will often be stronger than an individual model while being very fast to train. It is recommended to keep this value set to True to maximize predictive quality.
num_cpus (int or str, default = 'auto') – The total number of CPUs you want the AutoGluon predictor to use. 'auto' means AutoGluon will make the decision based on the total number of CPUs available and the model requirements for best performance. Users generally don't need to set this value.
num_gpus (int or str, default = 'auto') – The total number of GPUs you want the AutoGluon predictor to use. 'auto' means AutoGluon will make the decision based on the total number of GPUs available and the model requirements for best performance. Users generally don't need to set this value.
**kwargs –
- auto_stack : bool, default = False
Whether AutoGluon should automatically utilize bagging and multi-layer stack ensembling to boost predictive accuracy. Set this = True if you are willing to tolerate longer training times in order to maximize predictive accuracy! Automatically sets num_bag_folds and num_stack_levels arguments based on dataset properties. Note: Setting num_bag_folds and num_stack_levels arguments will override auto_stack. Note: This can increase training time (and inference time) by up to 20x, but can greatly improve predictive performance.
- num_bag_folds : int, default = None
Number of folds used for bagging of models. When num_bag_folds = k, training time is roughly increased by a factor of k (set = 0 to disable bagging). Disabled by default (0), but we recommend values between 5-10 to maximize predictive performance. Increasing num_bag_folds will result in models with lower bias but that are more prone to overfitting. num_bag_folds = 1 is an invalid value, and will raise a ValueError. Values > 10 may produce diminishing returns, and can even harm overall results due to overfitting. To further improve predictions, avoid increasing num_bag_folds much beyond 10 and instead increase num_bag_sets.
- num_bag_sets : int, default = None
Number of repeats of kfold bagging to perform (values must be >= 1). Total number of models trained during bagging = num_bag_folds * num_bag_sets. Defaults to 1 if time_limit is not specified, otherwise 20 (always disabled if num_bag_folds is not specified). Values greater than 1 will result in superior predictive performance, especially on smaller problems and with stacking enabled (reduces overall variance).
- num_stack_levels : int, default = None
Number of stacking levels to use in stack ensemble. Roughly increases model training time by factor of num_stack_levels+1 (set = 0 to disable stack ensembling). Disabled by default (0), but we recommend values between 1-3 to maximize predictive performance. To prevent overfitting, num_bag_folds >= 2 must also be set or else a ValueError will be raised.
- holdout_frac : float, default = None
Fraction of train_data to holdout as tuning data for optimizing hyperparameters (ignored unless tuning_data = None, ignored if num_bag_folds != 0 unless use_bag_holdout == True). Default value (if None) is selected based on the number of rows in the training data. Default values range from 0.2 at 2,500 rows to 0.01 at 250,000 rows. Default value is doubled if hyperparameter_tune_kwargs is set, up to a maximum of 0.2. Disabled if num_bag_folds >= 2 unless use_bag_holdout == True.
- use_bag_holdout : bool, default = False
If True, a holdout_frac portion of the data is held-out from model bagging. This held-out data is only used to score models and determine weighted ensemble weights. Enable this if there is a large gap between score_val and score_test in stack models. Note: If tuning_data was specified, tuning_data is used as the holdout data. Disabled if not bagging.
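A sketch combining the bagging and stacking arguments above (fold and level counts are illustrative; auto_stack=True could choose them automatically instead):
>>> predictor = TabularPredictor(label='class').fit(
...     train_data,
...     num_bag_folds=5,       # 5-fold bagging; overrides auto_stack
...     num_bag_sets=1,        # a single repeat of k-fold bagging
...     num_stack_levels=1,    # one stacker layer; requires num_bag_folds >= 2
...     use_bag_holdout=True,  # hold out data to score models and fit ensemble weights
... )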
- hyperparameter_tune_kwargs : str or dict, default = None
Hyperparameter tuning strategy and kwargs (for example, how many HPO trials to run). If None, then hyperparameter tuning will not be performed. Valid preset values:
'auto': Uses the 'random' preset.
'random': Performs HPO via random search using the local scheduler.
The 'searcher' key is required when providing a dict; see the sketch below.
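A sketch of dict-form HPO configuration. Only the 'searcher' key is documented as required here; the 'scheduler' and 'num_trials' keys below are assumptions based on common AutoGluon usage, as is the search-space import path:
>>> from autogluon.common import space  # import path is an assumption; it has varied across releases
>>> predictor = TabularPredictor(label='class').fit(
...     train_data,
...     hyperparameters={'GBM': {'learning_rate': space.Real(0.01, 0.1, log=True)}},
...     hyperparameter_tune_kwargs={
...         'searcher': 'random',  # required key when passing a dict
...         'scheduler': 'local',  # assumed key
...         'num_trials': 5,       # assumed key
...     },
... )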
- feature_prune_kwargs: dict, default = None
Performs layer-wise feature pruning via recursive feature elimination with permutation feature importance. This fits all models in a stack layer once, discovers a pruned set of features, fits all models in the stack layer again with the pruned set of features, and updates input feature lists for models whose validation score improved. If None, do not perform feature pruning. If empty dictionary, perform feature pruning with default configurations. For valid dictionary keys, refer to
autogluon.core.utils.feature_selection.FeatureSelector
and autogluon.core.trainer.abstract_trainer.AbstractTrainer._proxy_model_feature_prune documentation. To force all models to work with the pruned set of features, set force_prune=True in the dictionary.
- ag_args : dict, default = None
Keyword arguments to pass to all models (i.e. common hyperparameters shared by all AutoGluon models). See the ag_args argument from "Advanced functionality: Custom AutoGluon model arguments" in the hyperparameters argument documentation for valid values. Identical to specifying the ag_args parameter for all models in hyperparameters. If a key in ag_args is already specified for a model in hyperparameters, it will not be altered through this argument.
- ag_args_fit : dict, default = None
Keyword arguments to pass to all models. See the ag_args_fit argument from “Advanced functionality: Custom AutoGluon model arguments” in the hyperparameters argument documentation for valid values. Identical to specifying ag_args_fit parameter for all models in hyperparameters. If a key in ag_args_fit is already specified for a model in hyperparameters, it will not be altered through this argument.
- ag_args_ensemble : dict, default = None
Keyword arguments to pass to all models. See the ag_args_ensemble argument from “Advanced functionality: Custom AutoGluon model arguments” in the hyperparameters argument documentation for valid values. Identical to specifying ag_args_ensemble parameter for all models in hyperparameters. If a key in ag_args_ensemble is already specified for a model in hyperparameters, it will not be altered through this argument.
- excluded_model_types : list, default = None
Banned subset of model types to avoid training during fit(), even if present in hyperparameters. Reference hyperparameters documentation for what models correspond to each value. Useful when a particular model type such as ‘KNN’ or ‘custom’ is not desired but altering the hyperparameters dictionary is difficult or time-consuming.
Example: To exclude both 'KNN' and 'custom' models, specify excluded_model_types=['KNN', 'custom'].
- refit_full : bool or str, default = False
Whether to retrain all models on all of the data (training + validation) after the normal training procedure. This is equivalent to calling predictor.refit_full(model=refit_full) after fit. If refit_full=True, it will be treated as refit_full='all'. If refit_full=False, refitting will not occur. Valid str values:
'all': refits all models.
'best': refits only the best model (and its ancestors if it is a stacker model).
{model_name}: refits only the specified model (and its ancestors if it is a stacker model).
- For bagged models:
Reduces a model’s inference time by collapsing bagged ensembles into a single model fit on all of the training data. This process will typically result in a slight accuracy reduction and a large inference speedup. The inference speedup will generally be between 10-200x faster than the original bagged ensemble model.
The inference speedup factor is equivalent to (k * n), where k is the number of folds (num_bag_folds) and n is the number of finished repeats (num_bag_sets) in the bagged ensemble.
- The runtime is generally 10% or less of the original fit runtime.
The runtime can be roughly estimated as 1 / (k * n) of the original fit runtime, with k and n defined above.
- For non-bagged models:
Optimizes a model’s accuracy by retraining on 100% of the data without using a validation set. Will typically result in a slight accuracy increase and no change to inference time. The runtime will be approximately equal to the original fit runtime.
This process does not alter the original models, but instead adds additional models. If stacker models are refit by this process, they will use the refit_full versions of the ancestor models during inference. Models produced by this process will not have validation scores, as they use all of the data for training.
Therefore, it is up to the user to determine if the models are of sufficient quality by including test data in predictor.leaderboard(test_data). If the user does not have additional test data, they should reference the original model’s score for an estimate of the performance of the refit_full model.
Warning: Be aware that utilizing refit_full models without separately verifying on test data means that the model is untested, and has no guarantee of being consistent with the original model.
The time taken by this process is not enforced by time_limit.
- set_best_to_refit_full : bool, default = False
If True, will change the default model that Predictor uses for prediction when model is not specified to the refit_full version of the model that exhibited the highest validation score. Only valid if refit_full is set.
- keep_only_best : bool, default = False
If True, only the best model and its ancestor models are saved in the outputted predictor. All other models are deleted.
If you only care about deploying the most accurate predictor with the smallest file size and no longer need any of the other trained models or functionality beyond prediction on new data, then set: keep_only_best=True, save_space=True. This is equivalent to calling predictor.delete_models(models_to_keep='best', dry_run=False) directly after fit().
If used with refit_full and set_best_to_refit_full, the best model will be the refit_full model, and the original bagged best model will be deleted.
refit_full will be automatically set to 'best' in this case to avoid training models which will later be deleted. A combined deployment sketch follows the save_space description below.
- save_space : bool, default = False
- If True, reduces the memory and disk size of predictor by deleting auxiliary model files that aren’t needed for prediction on new data.
This is equivalent to calling predictor.save_space() directly after fit().
This has NO impact on inference accuracy. It is recommended if the only goal is to use the trained model for prediction. Certain advanced functionality may no longer be available if save_space=True. Refer to predictor.save_space() documentation for more details.
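A sketch of a deployment-focused fit combining these arguments, as referenced under keep_only_best above:
>>> predictor = TabularPredictor(label='class').fit(
...     train_data,
...     num_bag_folds=5,
...     refit_full='best',            # retrain the best model on all data after fit
...     set_best_to_refit_full=True,  # predict with the refit version by default
...     keep_only_best=True,          # delete all other models
...     save_space=True,              # strip auxiliary artifacts from disk
... )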
- feature_generator (autogluon.features.generators.AbstractFeatureGenerator, default = autogluon.features.generators.AutoMLPipelineFeatureGenerator)
The feature generator used by AutoGluon to process the input data to the form sent to the models. This often includes automated feature generation and data cleaning. It is generally recommended to keep the default feature generator unless handling an advanced use-case. To control aspects of the default feature generation process, you can pass in an AutoMLPipelineFeatureGenerator object constructed using some of these kwargs (see the sketch after this list):
- enable_numeric_features : bool, default True
Whether to keep features of 'int' and 'float' raw types. These features are passed without alteration to the models. Appends IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=['int', 'float'])) to the generator group.
- enable_categorical_features : bool, default True
Whether to keep features of 'object' and 'category' raw types. These features are processed into memory-optimized 'category' features. Appends CategoryFeatureGenerator() to the generator group.
- enable_datetime_features : bool, default True
Whether to keep features of 'datetime' raw type and 'object' features identified as 'datetime_as_object' features. These features will be converted to 'int' features representing milliseconds since epoch. Appends DatetimeFeatureGenerator() to the generator group.
- enable_text_special_features : bool, default True
Whether to use 'object' features identified as 'text' features to generate 'text_special' features such as word count, capital letter ratio, and symbol counts. Appends TextSpecialFeatureGenerator() to the generator group.
- enable_text_ngram_features : bool, default True
Whether to use 'object' features identified as 'text' features to generate 'text_ngram' features. Appends TextNgramFeatureGenerator(vectorizer=vectorizer) to the generator group.
- enable_raw_text_features : bool, default False
Whether to keep the raw text features. Appends IdentityFeatureGenerator(infer_features_in_args=dict(required_special_types=['text'])) to the generator group.
- vectorizer : CountVectorizer, default CountVectorizer(min_df=30, ngram_range=(1, 3), max_features=10000, dtype=np.uint8)
sklearn CountVectorizer object to use in TextNgramFeatureGenerator. Only used if enable_text_ngram_features=True.
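A sketch of customizing the default pipeline with the kwargs above (this mirrors what the ignore_text preset does, minus the raw-text flag):
>>> from autogluon.features.generators import AutoMLPipelineFeatureGenerator
>>> feature_generator = AutoMLPipelineFeatureGenerator(
...     enable_text_ngram_features=False,    # skip n-gram generation
...     enable_text_special_features=False,  # skip word-count / symbol-count features
... )
>>> predictor = TabularPredictor(label='class').fit(train_data, feature_generator=feature_generator)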
- unlabeled_data : pd.DataFrame, default = None
[Experimental Parameter] Collection of data without labels that we can use to pretrain on. This is the same schema as train_data, except without the labels. Currently, unlabeled_data is only used for pretraining a TabTransformer model. If you do not specify 'TRANSF' with unlabeled_data, then no pretraining will occur and unlabeled_data will be ignored! After the pretraining step, we will finetune using the TabTransformer model as well. If TabTransformer is ensembled with other models, like in typical AutoGluon fashion, then the output of this pretrain/finetune will be ensembled with other models, which did not use the unlabeled_data. The pretrain/finetune flow is also known as semi-supervised learning. The typical use case for unlabeled_data is to add signal to your model where you may not have sufficient training data, e.g. 500 hand-labeled samples (perhaps a hard human task) while the whole dataset (unlabeled) is thousands/millions of rows. However, this isn't the only use case. Given enough unlabeled data (millions of rows), you may see improvements for any amount of labeled data.
- verbosity : int
If specified, overrides the existing predictor.verbosity value.
- calibrate : bool or str, default = 'auto'
Note: It is recommended to use ['auto', False] as the values and avoid True. If 'auto', calibrate will automatically be set to True if the problem_type and eval_metric are suitable for calibration. If True and the problem_type is classification, temperature scaling will be used to calibrate the Predictor's estimated class probabilities (which may improve metrics like log_loss), and a scalar parameter will be trained on the validation set. If True and the problem_type is quantile regression, conformalization will be used to calibrate the Predictor's estimated quantiles (which may improve the prediction interval coverage, and bagging could further improve it), and a set of scalar parameters will be computed on the validation set.
- Return type
TabularPredictor object. Returns self.
Examples
>>> from autogluon.tabular import TabularDataset, TabularPredictor
>>> train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
>>> label = 'class'
>>> predictor = TabularPredictor(label=label).fit(train_data)
>>> test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
>>> leaderboard = predictor.leaderboard(test_data)
>>> y_test = test_data[label]
>>> test_data = test_data.drop(columns=[label])
>>> y_pred = predictor.predict(test_data)
>>> perf = predictor.evaluate_predictions(y_true=y_test, y_pred=y_pred)
To maximize predictive performance, use the following:
>>> eval_metric = 'roc_auc'  # set this to the metric you ultimately care about
>>> time_limit = 3600  # set as long as you are willing to wait (in sec)
>>> predictor = TabularPredictor(label=label, eval_metric=eval_metric).fit(
...     train_data,
...     presets=['best_quality'],
...     time_limit=time_limit,
... )