autogluon.tabular.TabularPredictor
- class autogluon.tabular.TabularPredictor(label: str, problem_type: str = None, eval_metric: str | Scorer = None, path: str = None, verbosity: int = 2, log_to_file: bool = False, log_file_path: str = 'auto', sample_weight: str = None, weight_evaluation: bool = False, groups: str = None, positive_class: int | str | None = None, **kwargs)
AutoGluon TabularPredictor predicts values in a column of a tabular dataset (classification or regression).
- Parameters:
label (str) – Name of the column that contains the target variable to predict.
problem_type (str, default = None) – Type of prediction problem: binary classification, multiclass classification, regression, or quantile regression (options: ‘binary’, ‘multiclass’, ‘regression’, ‘quantile’). If problem_type = None, the problem type is inferred from the label values in the provided dataset.
eval_metric (str or Scorer, default = None) –
Metric by which predictions will be ultimately evaluated on test data. AutoGluon tunes factors such as hyperparameters, early-stopping, ensemble-weights, etc. in order to improve this metric on validation data.
If eval_metric = None, it is automatically chosen based on problem_type. Defaults to ‘accuracy’ for binary and multiclass classification, ‘root_mean_squared_error’ for regression, and ‘pinball_loss’ for quantile.
- Otherwise, options for classification:
[‘accuracy’, ‘balanced_accuracy’, ‘log_loss’, ‘f1’, ‘f1_macro’, ‘f1_micro’, ‘f1_weighted’, ‘roc_auc’, ‘roc_auc_ovo’, ‘roc_auc_ovo_macro’, ‘roc_auc_ovo_weighted’, ‘roc_auc_ovr’, ‘roc_auc_ovr_macro’, ‘roc_auc_ovr_micro’, ‘roc_auc_ovr_weighted’, ‘average_precision’, ‘precision’, ‘precision_macro’, ‘precision_micro’, ‘precision_weighted’, ‘recall’, ‘recall_macro’, ‘recall_micro’, ‘recall_weighted’, ‘mcc’, ‘pac_score’]
- Options for regression:
[‘root_mean_squared_error’, ‘mean_squared_error’, ‘mean_absolute_error’, ‘median_absolute_error’, ‘mean_absolute_percentage_error’, ‘r2’, ‘symmetric_mean_absolute_percentage_error’]
For more information on these options, see sklearn.metrics: https://scikit-learn.org/stable/modules/classes.html#sklearn-metrics-metrics For metric source code, see autogluon.core.metrics.
You can also pass your own evaluation function here, as long as it follows the format of the functions defined in autogluon.core.metrics. For detailed instructions on creating and using a custom metric, refer to https://auto.gluon.ai/stable/tutorials/tabular/advanced/tabular-custom-metric.html
path (Union[str, pathlib.Path], default = None) – Path to the directory where models and intermediate outputs should be saved. If unspecified, a time-stamped folder called “AutogluonModels/ag-[TIMESTAMP]” will be created in the working directory to store all models. Note: To call fit() twice and keep the results of each fit, you must either specify a different path for each call or leave path unspecified. Otherwise, files from the first fit() will be overwritten by the second fit().
verbosity (int, default = 2) –
Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels). Verbosity levels:
0: Only log exceptions
1: Only log warnings + exceptions
2: Standard logging
3: Verbose logging (ex: log validation score every 50 iterations)
4: Maximally verbose logging (ex: log validation score every iteration)
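The inverse relationship between logger levels and printed output described above can be illustrated with the standard library alone. This is a minimal sketch, not AutoGluon's own logging setup: the logger name "verbosity_demo" and the ListHandler helper are hypothetical and exist only for this example.

```python
import logging

# "verbosity_demo" is a hypothetical logger name used only for this sketch.
logger = logging.getLogger("verbosity_demo")
logger.propagate = False  # keep the demo messages out of the root logger


class ListHandler(logging.Handler):
    """Collects messages in a list so the effect of setLevel is visible."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
logger.addHandler(handler)


def messages_at(level):
    """Return the messages that get through when the logger is set to `level`."""
    handler.messages.clear()
    logger.setLevel(level)
    logger.debug("debug detail")      # level 10
    logger.info("standard progress")  # level 20
    logger.warning("a warning")       # level 30
    logger.error("an error")          # level 40
    return list(handler.messages)


quiet = messages_at(40)   # high level: only errors pass
chatty = messages_at(10)  # low level: everything passes
```

A higher logger level lets fewer messages through, which is the opposite direction from verbosity, where a higher value prints more.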
log_to_file (bool, default = False) – Whether to save the logs to a file for later reference.
log_file_path (str, default = "auto") – File path to save the logs to. If "auto", logs will be saved under predictor_path/logs/predictor_log.txt. Ignored if log_to_file is False.
sample_weight (str, default = None) – If specified, this column-name indicates which column of the data should be treated as sample weights. This column will NOT be considered as a predictive feature. Sample weights should be non-negative (and cannot be nan), with larger values indicating which rows are more important than others. If you want your usage of sample weights to match results obtained outside of this Predictor, then ensure sample weights for your training (or tuning) data sum to the number of rows in the training (or tuning) data. You may also specify two special strings: ‘auto_weight’ (automatically choose a weighting strategy based on the data) or ‘balance_weight’ (equally weight classes in classification, no effect in regression). If specifying your own sample_weight column, make sure its name does not match these special strings.
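The recommendation that sample weights sum to the number of rows can be met by rescaling a weight column before training. The helper below is a hypothetical illustration (normalize_weights is not part of AutoGluon) and operates on a plain list rather than a DataFrame column.

```python
def normalize_weights(weights):
    """Rescale non-negative weights so that they sum to the number of rows."""
    # w != w is True only for NaN, which is not allowed in sample weights.
    if any(w < 0 or w != w for w in weights):
        raise ValueError("sample weights must be non-negative and not NaN")
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be positive")
    n_rows = len(weights)
    return [w * n_rows / total for w in weights]


raw = [1.0, 1.0, 2.0, 4.0]
scaled = normalize_weights(raw)  # relative importance is preserved
```

The rescaled weights keep the same ratios between rows; only their overall scale changes.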
weight_evaluation (bool, default = False) – Only considered when sample_weight column is not None. Determines whether sample weights should be taken into account when computing evaluation metrics on validation/test data. If True, then weighted metrics will be reported based on the sample weights provided in the specified sample_weight (in which case sample_weight column must also be present in test data). In this case, the ‘best’ model used by default for prediction will also be decided based on a weighted version of evaluation metric. Note: we do not recommend specifying weight_evaluation when sample_weight is ‘auto_weight’ or ‘balance_weight’, instead specify appropriate eval_metric.
groups (str, default = None) –
[Experimental] If specified, AutoGluon will use the train_data column whose name is given by groups as the data-splitting indices for bagging during .fit. This column will not be used as a feature during model training. This parameter is ignored if bagging is not enabled. To instead specify a custom validation set with bagging disabled, specify tuning_data in .fit. The data will be split via sklearn.model_selection.LeaveOneGroupOut. Use this option to control the exact split indices AutoGluon uses. It is not recommended to use this option unless it is required for very specific situations. Bugs may arise in edge cases where the provided groups do not allow models to be trained properly, for example when not all classes are present during training in multiclass classification. It is up to the user to sanitize their groups.
As an example, if you want your data folds to preserve adjacent rows in the table without shuffling, then for 3 fold bagging with 6 rows of data, the groups column values should be [0, 0, 1, 1, 2, 2].
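To make the groups example concrete, the sketch below reproduces the leave-one-group-out idea in plain Python: each distinct group value becomes the validation fold once. It is an illustration of the splitting rule only, not AutoGluon's or scikit-learn's implementation, and leave_one_group_out is a hypothetical helper name.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, val_indices), one fold per distinct group."""
    for g in sorted(set(groups)):
        val_idx = [i for i, v in enumerate(groups) if v == g]
        train_idx = [i for i, v in enumerate(groups) if v != g]
        yield g, train_idx, val_idx


# Six rows, three folds, adjacent rows kept together.
groups = [0, 0, 1, 1, 2, 2]
folds = list(leave_one_group_out(groups))

for held_out, train_idx, val_idx in folds:
    print(f"group {held_out}: train rows {train_idx}, validation rows {val_idx}")
```

With these group values, rows 0 and 1 are always held out together, as are rows 2 and 3, and rows 4 and 5.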
positive_class (str or int, default = None) –
Used to determine the positive class in binary classification. This is used for certain metrics such as ‘f1’ which produce different scores depending on which class is considered the positive class. If not set, will be inferred as the second element of the existing unique classes after sorting them.
If classes are [0, 1], then 1 will be selected as the positive class. If classes are [‘def’, ‘abc’], then ‘def’ will be selected as the positive class. If classes are [True, False], then True will be selected as the positive class.
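The inference rule stated above (second element of the sorted unique classes) can be checked against the three examples with a few lines of Python. infer_positive_class is a hypothetical helper written for this sketch; it mirrors the documented rule rather than AutoGluon's internal code.

```python
def infer_positive_class(classes):
    """Return the second element of the sorted unique class labels."""
    unique_sorted = sorted(set(classes))
    if len(unique_sorted) != 2:
        raise ValueError("binary classification requires exactly two classes")
    return unique_sorted[1]


from_ints = infer_positive_class([0, 1])
from_strs = infer_positive_class(["def", "abc"])   # sorted order is ['abc', 'def']
from_bools = infer_positive_class([True, False])   # sorted order is [False, True]
```

If the inferred class is not the one you care about for metrics such as ‘f1’, pass positive_class explicitly.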
**kwargs –
- learner_type (AbstractLearner, default = DefaultLearner) –
A class which inherits from AbstractLearner. This dictates the inner logic of predictor. If you don’t know what this is, keep it as the default.
- learner_kwargs (dict, default = None) –
Kwargs to send to the learner. Options include:
- ignored_columns (list, default = None) –
Banned subset of column names that predictor may not use as predictive features (e.g. unique identifier to a row or user-ID). These columns are ignored during fit().
- label_count_threshold (int, default = 10) –
For multi-class classification problems, this is the minimum number of times a label must appear in dataset in order to be considered an output class. AutoGluon will ignore any classes whose labels do not appear at least this many times in the dataset (i.e. will never predict them).
- cache_data (bool, default = True) –
When enabled, the training and validation data are saved to disk for future reuse. Enables advanced functionality in predictor such as fit_extra() and feature importance calculation on the original data.
- trainer_type (AbstractTabularTrainer, default = AutoTrainer) –
A class inheriting from AbstractTabularTrainer that controls training/ensembling of many models. If you don’t know what this is, keep it as the default.
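The effect of label_count_threshold described in the options above can be previewed on your own label column before training. This is a rough sketch under the documented rule (a class is kept only if it appears at least the threshold number of times); kept_classes is a hypothetical helper and the labels are made-up illustration data.

```python
from collections import Counter


def kept_classes(labels, label_count_threshold=10):
    """Return the classes that appear at least `label_count_threshold` times."""
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n >= label_count_threshold)


# Hypothetical label column: 'c' is too rare to be treated as an output class.
labels = ["a"] * 12 + ["b"] * 10 + ["c"] * 3
kept = kept_classes(labels)
```

A check like this shows in advance which rare classes would never be predicted with the default threshold of 10.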
- __init__(label: str, problem_type: str = None, eval_metric: str | Scorer = None, path: str = None, verbosity: int = 2, log_to_file: bool = False, log_file_path: str = 'auto', sample_weight: str = None, weight_evaluation: bool = False, groups: str = None, positive_class: int | str | None = None, **kwargs)
Methods
Calibrate the decision threshold in binary classification to optimize a given metric.
Clone the predictor and all of its artifacts to a new location on local disk.
Clone the predictor and all of its artifacts to a new location on local disk, then delete the clone's artifacts that are unnecessary for prediction.
Compile models for accelerated prediction.
Deletes models from predictor. This can be helpful to minimize memory usage and disk usage, particularly for model deployment. This will remove all references to the models in predictor. For example, removed models will not appear in predictor.leaderboard(). WARNING: If delete_from_disk=True, this will DELETE ALL FILES in the deleted model directories, regardless if they were created by AutoGluon or not. DO NOT STORE FILES INSIDE OF THE MODEL DIRECTORY THAT ARE UNRELATED TO AUTOGLUON.
Returns the combined size of all files under the predictor.path directory in bytes.
Returns the size of each file under the predictor.path directory in bytes.
[EXPERIMENTAL] Distill AutoGluon's most accurate ensemble-predictor into single models which are simpler/faster and require less memory/compute.
Report the predictive performance evaluated over a given dataset.
Evaluate the provided prediction probabilities against ground truth labels.
Calculates feature importance scores for the given model via permutation importance.
Returns a list of feature names dependent on the value of feature_stage.
Fit models to predict a column of a data table (label) based on the other columns (features).
Fits additional models after the original TabularPredictor.fit() call.
[Advanced] Uses additional data (pseudo_data) to try to achieve better model quality.
Output summary of information about models produced during fit().
Fits new weighted ensemble models to combine predictions of previously-trained models.
[EXPERIMENTAL] Returns a dictionary of predictor metadata. Warning: This functionality is currently in preview mode. The metadata information returned may change in structure in future versions without warning. The definitions of various metadata values are not yet documented. The output of this function should not be used for programmatic decisions. Contains information such as row count, column count, model training time, validation scores, hyperparameters, and much more.
Output summary of information about models produced during fit() as a pd.DataFrame. Includes information on test and validation scores for all models, model training times, inference times, and stack levels. Output DataFrame columns include: 'model': The name of the model.
Retrieves learning curves generated during predictor.fit().
Load a TabularPredictor object previously produced by fit() from file and returns this object.
Loads the internal data representation used during model training. Individual AutoGluon models like the neural network may apply additional feature transformations that are not reflected in this method. This method only applies universal transforms employed by all AutoGluon models. Warning, the internal representation may: Have different features compared to the original data. Have different row counts compared to the original data. Have indices which do not align with the original data. Have label values which differ from those in the original data. Internal data representations should NOT be combined with the original data, in most cases this is not possible.
Load log files of a predictor
[Advanced] Get the model failures that occurred during the fitting of this model, in the form of a pandas DataFrame.
Returns the hyperparameters of a given model.
Returns metadata information about the given model.
Returns the list of model names trained in this predictor object.
Returns a dictionary of original model name -> refit full model name.
Persist models in memory for reduced inference latency.
Output the visualized stack ensemble architecture of a model trained by fit().
Use trained models to produce predictions of label column values for new data.
Given prediction probabilities, convert to predictions.
Returns a dictionary of predictions where the key is the model name and the value is the model's predictions on the data.
Note: This is advanced functionality not intended for normal usage.
Use trained models to produce predicted class probabilities rather than class-labels (if task is classification).
Returns a dictionary of prediction probabilities where the key is the model name and the value is the model's prediction probabilities on the data.
Note: This is advanced functionality not intended for normal usage.
Retrain model on all of the data (training + validation). For bagged models: Optimizes a model's inference time by collapsing bagged ensembles into a single model fit on all of the training data. This process will typically result in a slight accuracy reduction and a large inference speedup. Inference is generally 10x to 200x faster than with the original bagged ensemble model. The inference speedup factor is equivalent to (k * n), where k is the number of folds (num_bag_folds) and n is the number of finished repeats (num_bag_sets) in the bagged ensemble. The runtime is generally 10% or less of the original fit runtime. The runtime can be roughly estimated as 1 / (k * n) of the original fit runtime, with k and n defined above. For non-bagged models: Optimizes a model's accuracy by retraining on 100% of the data without using a validation set. Will typically result in a slight accuracy increase and no change to inference time. The runtime will be approximately equal to the original fit runtime. This process does not alter the original models, but instead adds additional models. If stacker models are refit by this process, they will use the refit_full versions of the ancestor models during inference. Models produced by this process will not have validation scores, as they use all of the data for training. Therefore, it is up to the user to determine if the models are of sufficient quality by including test data in predictor.leaderboard(test_data). If the user does not have additional test data, they should reference the original model's score for an estimate of the performance of the refit_full model. Warning: Be aware that utilizing refit_full models without separately verifying on test data means that the model is untested, and has no guarantee of being consistent with the original model. cache_data must have been set to True during the original training to enable this functionality.
Save this Predictor to file in directory specified by this Predictor's path.
Reduces the memory and disk size of predictor by deleting auxiliary model files that aren't needed for prediction on new data.
Set predictor.decision_threshold.
Sets the model to be used by default when calling predictor.predict(data).
[Advanced] Computes and returns the necessary information to perform zeroshot HPO simulation.
Transforms data features through the AutoGluon feature generator. This is useful to gain an understanding of how AutoGluon interprets the data features. The output of this function can be used to train further models, even outside of AutoGluon. This can be useful for training your own models on the same data representation as AutoGluon. Individual AutoGluon models like the neural network may apply additional feature transformations that are not reflected in this method. This method only applies universal transforms employed by all AutoGluon models. When data=None, base_models=[{best_model}], and bagging was enabled during fit(): This returns the out-of-fold predictions of the best model, which can be used as training input to a custom user stacker model.
Transforms data labels to the internal label representation.
Unpersist models in memory for reduced memory usage.
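The refit entry in the list above (the one beginning "Retrain model on all of the data") gives two rules of thumb for bagged models: an inference speedup of about k * n and a refit runtime of about 1 / (k * n) of the original fit. The sketch below simply turns those stated estimates into arithmetic; refit_full_estimates is a hypothetical helper and the input numbers are made up for illustration.

```python
def refit_full_estimates(num_bag_folds, num_bag_sets, original_fit_seconds):
    """Apply the documented k * n rules of thumb for refitting a bagged model."""
    models_in_ensemble = num_bag_folds * num_bag_sets  # k * n
    return {
        # One refit model replaces k * n fold models at inference time.
        "inference_speedup": models_in_ensemble,
        # Rough estimate: refit runtime is 1 / (k * n) of the original fit.
        "refit_seconds": original_fit_seconds / models_in_ensemble,
    }


# Hypothetical run: 8 folds, 2 finished repeats, 3200 s original fit time.
est = refit_full_estimates(num_bag_folds=8, num_bag_sets=2, original_fit_seconds=3200.0)
```

These are rough estimates only; the actual speedup and runtime depend on the models involved.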
Attributes
can_predict_proba
Return True if predictor can return prediction probabilities via .predict_proba, otherwise return False.
class_labels
Alias to self.classes_
class_labels_internal
For multiclass problems, this list contains the internal class labels in sorted order of internal predict_proba() output. For binary problems, this list contains the internal class labels in sorted order of internal predict_proba(as_multiclass=True) output. The value will always be class_labels_internal=[0, 1] for binary problems, with 0 as the negative class, and 1 as the positive class. For other problem types, will equal None.
class_labels_internal_map
For binary and multiclass classification problems, this dictionary contains the mapping of the original labels to the internal labels. For example, in binary classification, label values of 'True' and 'False' will be mapped to the internal representation 1 and 0. Therefore, class_labels_internal_map would equal {'True': 1, 'False': 0}. For other problem types, will equal None. For multiclass, some label values may have no mapping. This indicates that the internal models will never predict those missing labels, and that training rows associated with the missing labels were dropped.
classes_
For multiclass problems, this list contains the class labels in sorted order of predict_proba() output. For binary problems, this list contains the class labels in sorted order of predict_proba(as_multiclass=True) output. classes_[0] corresponds to internal label = 0 (negative class), classes_[1] corresponds to internal label = 1 (positive class). This is relevant for certain metrics such as F1 where True and False labels impact the metric score differently. For other problem types, will equal None. For example if pred = predict_proba(x, as_multiclass=True), then ith index of pred provides predicted probability that x belongs to class given by classes_[i].
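The index correspondence described for classes_ can be shown with plain lists. The class labels and probabilities below are made-up illustration values standing in for predictor.classes_ and one row of predict_proba(x, as_multiclass=True); no predictor is involved in this sketch.

```python
# Hypothetical stand-ins for predictor.classes_ and one row of predict_proba output.
classes_ = ["no", "yes"]  # classes_[0] is the negative class, classes_[1] the positive class
pred_row = [0.3, 0.7]     # pred_row[i] is the probability of classes_[i]

# Pair each class label with its predicted probability by position.
proba_by_class = dict(zip(classes_, pred_row))

# The most likely class is the label at the index of the largest probability.
best_index = max(range(len(pred_row)), key=pred_row.__getitem__)
most_likely = classes_[best_index]
```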
decision_threshold
The decision threshold used to convert prediction probabilities to predictions.
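As a rough illustration of how a decision threshold turns positive-class probabilities into binary predictions, consider the sketch below. apply_decision_threshold is a hypothetical helper, and treating a probability exactly equal to the threshold as positive is an assumption of this example, not something stated in this reference.

```python
def apply_decision_threshold(positive_proba, threshold, negative_class, positive_class):
    """Map positive-class probabilities to class labels using a threshold."""
    # Assumption: probabilities equal to the threshold count as positive.
    return [positive_class if p >= threshold else negative_class for p in positive_proba]


probs = [0.2, 0.5, 0.8]
default_preds = apply_decision_threshold(probs, 0.5, negative_class=0, positive_class=1)
stricter_preds = apply_decision_threshold(probs, 0.6, negative_class=0, positive_class=1)
```

Raising the threshold makes positive predictions rarer, which typically trades recall for precision.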
eval_metric
The metric used to evaluate predictive performance
feature_metadata
Returns the internal FeatureMetadata.
feature_metadata_in
Returns the input FeatureMetadata.
has_val
Return True if holdout validation data was used during fit, else return False.
is_fit
Return True if predictor.fit has been called, otherwise return False.
label
Name of the column that contains the target variable to predict (often referred to as labels, response variable, target variable, dependent variable, y, etc.).
model_best
Returns the string model name of the best model by validation score that can infer.
original_features
Original features user passed in to fit before processing
path
Path to directory where all models used by this Predictor are stored
positive_class
Returns the positive class name in binary classification.
predictor_file_name
problem_type
What type of prediction problem this Predictor has been trained for
quantile_levels