autogluon.features#

Feature Generators#

AbstractFeatureGenerator

Abstract feature generator implementation from which all AutoGluon feature generators inherit.

AutoMLPipelineFeatureGenerator

Pipeline feature generator with simplified arguments to handle most Tabular data including text and dates adequately.

PipelineFeatureGenerator

PipelineFeatureGenerator is an implementation of BulkFeatureGenerator with various smart defaults and edge case handling functionality to enable robust data handling. It is recommended that users base any custom feature generators meant for end-to-end data transformation on PipelineFeatureGenerator. See AutoMLPipelineFeatureGenerator for an example of extending PipelineFeatureGenerator. It is not recommended that PipelineFeatureGenerator be used as a generator within any other generator's pre or post generators.

BulkFeatureGenerator

BulkFeatureGenerator is used for complex feature generation pipelines where multiple generators are required, with some generators requiring the output of other generators as input (multi-stage generation).

AsTypeFeatureGenerator

Enforces type conversion on the data to match the types seen during fitting.

BinnedFeatureGenerator

BinnedFeatureGenerator bins incoming int and float features to num_bins unique int values, maintaining relative rank order.

CategoryFeatureGenerator

CategoryFeatureGenerator is used to convert object types to category types, as well as remove rare categories and optimize memory usage.

DatetimeFeatureGenerator

Transforms datetime features into numeric features.

DropDuplicatesFeatureGenerator

Drops features which are exact duplicates of other features, leaving only one instance of the data.

DropUniqueFeatureGenerator

Drops features which only have 1 unique value or which have nearly no repeated values (based on max_unique_ratio) and are of category or object type.

DummyFeatureGenerator

Ignores all input features and returns a single int feature with all 0 values.

FillNaFeatureGenerator

Fills missing values in the data.

IdentityFeatureGenerator

IdentityFeatureGenerator simply passes the data along without alterations.

LabelEncoderFeatureGenerator

Converts category features to int features by mapping to the category codes.

CategoryMemoryMinimizeFeatureGenerator

Minimizes memory usage of category features by converting the category values to monotonically increasing int values.

NumericMemoryMinimizeFeatureGenerator

Clips and converts dtype of int features to minimize memory usage.

RenameFeatureGenerator

RenameFeatureGenerator renames the columns without altering their values.

TextNgramFeatureGenerator

Generates ngram features from text features.

TextSpecialFeatureGenerator

TextSpecialFeatureGenerator generates text specific features from incoming raw text features.

AbstractFeatureGenerator#

class autogluon.features.generators.AbstractFeatureGenerator(features_in: list | None = None, feature_metadata_in: FeatureMetadata | None = None, post_generators: list | None = None, pre_enforce_types=False, pre_drop_useless=False, post_drop_duplicates=False, reset_index=False, column_names_as_str=True, name_prefix: str | None = None, name_suffix: str | None = None, infer_features_in_args: dict | None = None, infer_features_in_args_strategy='overwrite', banned_feature_special_types: List[str] | None = None, log_prefix='', verbosity=2)[source]#

Abstract feature generator implementation from which all AutoGluon feature generators inherit. The purpose of a feature generator is to transform data from one form to another in a stateful manner. First, the generator is initialized with various arguments that dictate the way features are generated. Then, the generator is fit through either the .fit() or .fit_transform() methods using training data typically in pandas DataFrame format. Finally, the generator can transform new data with the same initial format as the training data through the .transform() method.
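The initialize, fit, transform lifecycle described above can be illustrated with a small stand-alone sketch. ToyScaler is a hypothetical class written only for this illustration; it does not use AutoGluon and operates on plain dicts of columns rather than DataFrames. It only mirrors the stateful contract, including the AssertionError raised when methods are called in the wrong order.

```python
class ToyScaler:
    """Hypothetical stateful transformer mirroring the fit/transform contract."""

    def __init__(self, name_suffix=""):
        # Arguments given at init dictate how features are generated.
        self.name_suffix = name_suffix
        self._is_fit = False
        self._max = {}

    def fit_transform(self, X):
        assert not self._is_fit, "generator has already been fit"
        # Learn state from the training data (here: the per-column maximum).
        self._max = {col: max(values) or 1 for col, values in X.items()}
        self._is_fit = True
        return self.transform(X)

    def transform(self, X):
        assert self._is_fit, "generator must be fit before transform"
        # Apply the learned state to data shaped like the training data.
        return {
            col + self.name_suffix: [v / self._max[col] for v in X[col]]
            for col in self._max
        }


gen = ToyScaler(name_suffix="_scaled")
train_out = gen.fit_transform({"a": [1, 2, 4]})
test_out = gen.transform({"a": [8]})
```

The state learned during fit (the maximum of 4) is reused unchanged for new data, which is why the test value 8 maps to 2.0 rather than being rescaled to 1.0.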

Parameters:
  • features_in (list, default None) – List of feature names the generator will expect and use in the fit and transform methods. Any feature in an incoming DataFrame that is not present in features_in is dropped and will not influence the transformation logic. If None, infer during fit from the _infer_features_in method. Equivalent to feature_metadata_in.get_features() post-fit.

  • feature_metadata_in (autogluon.common.features.feature_metadata.FeatureMetadata, default None) – FeatureMetadata object corresponding to the training data input features. If None, infer during fit from the _infer_feature_metadata_in method. Any features not present in features_in (if provided) will be removed from feature_metadata_in.

  • post_generators (list of FeatureGenerators, default None) – FeatureGenerators which will fit and transform sequentially after this object’s transformation logic, feeding their output into the next generator’s input. The output of the final FeatureGenerator will be used as the transformed output.

  • pre_enforce_types (bool, default False) – If True, the exact raw types (int64, float32, etc.) of the training data will be enforced on future data, either converting the types to the training types or raising an exception if unable. This is important to set to True on the outer feature generator in a feature generation pipeline to ensure incorrect dtypes are not passed downstream, but is often redundant when used on inner feature generators inside a pipeline.

  • pre_drop_useless (bool, default False) – If True, features_in will be pruned at fit time of features containing only a single unique value across all rows.

  • post_drop_duplicates (bool, default False) – If True, a DropDuplicatesFeatureGenerator will be appended to post_generators. This feature generator will drop any duplicate features found in the data, keeping only one feature within any duplicate feature sets. Warning: For large datasets with many features, this may be very computationally expensive or even computationally infeasible.

  • reset_index (bool, default False) – If True, for the duration of fit and transform, the input data’s index is reset to be monotonically increasing from 0 to N-1 for a dataset of N rows. At the end of fit and transform, the original index is re-applied to the output data. This is important to set to True on the outer feature generator in a feature generation pipeline to ensure that a non-default index does not cause corruption of the inner feature generation if any inner feature generator does not properly handle non-default indices. This index reset is also applied to the y label data if provided during fit.

  • column_names_as_str (bool, default True) – If True, the column names of the input data are converted to string if they were not already. This solves any issues related to downstream FeatureGenerators and models which cannot handle integer column names, and allows column name prefix and suffix operations to avoid errors. Note that for performance purposes, column names are only converted at transform time if they were not strings at fit time. Ensure consistent column names as input to avoid errors.

  • name_prefix (str, default None) – Name prefix to add to all output feature names.

  • name_suffix (str, default None) – Name suffix to add to all output feature names.

  • infer_features_in_args (dict, default None) – Used as the kwargs input to FeatureMetadata.get_features(**kwargs) when inferring self.features_in. This is merged with the output dictionary of self.get_default_infer_features_in_args() depending on the value of infer_features_in_args_strategy. Only used when features_in is None. If None, then self.get_default_infer_features_in_args() is used directly. Refer to FeatureMetadata.get_features documentation for a full description of valid keys. Note: This is advanced functionality that is not necessary for most situations.

  • infer_features_in_args_strategy (str, default 'overwrite') – Determines how infer_features_in_args and self.get_default_infer_features_in_args() are combined to result in self._infer_features_in_args which dictates the features_in inference logic. If ‘overwrite’: infer_features_in_args is used exclusively and self.get_default_infer_features_in_args() is ignored. If ‘update’: self.get_default_infer_features_in_args() is dictionary updated by infer_features_in_args. If infer_features_in_args is None, this is ignored.

  • banned_feature_special_types (List[str], default None) – List of feature special types to additionally exclude from input. Will update self.get_default_infer_features_in_args().

  • log_prefix (str, default '') – Prefix string added to all logging statements made by the generator.

  • verbosity (int, default 2) – Controls the verbosity of logging. 0 will silence logs, 1 will only log warnings, 2 will log info level information, and 3 will log info level information and provide detailed feature type input and output information. Logging is still controlled by the global logger configuration, and therefore a verbosity of 3 does not guarantee that logs will be output.
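The interaction between infer_features_in_args and infer_features_in_args_strategy can be sketched as a plain dictionary merge. This is a minimal illustration of the rules stated above, not AutoGluon's implementation; resolve_infer_args is a hypothetical helper, and the dictionary contents are example values only.

```python
def resolve_infer_args(default_args, user_args, strategy="overwrite"):
    """Combine default and user-supplied inference kwargs (illustrative only)."""
    if user_args is None:
        # The strategy is ignored and the defaults are used directly.
        return dict(default_args)
    if strategy == "overwrite":
        # User arguments are used exclusively; defaults are discarded.
        return dict(user_args)
    if strategy == "update":
        # Defaults are kept and updated key by key with the user arguments.
        merged = dict(default_args)
        merged.update(user_args)
        return merged
    raise ValueError(f"unknown strategy: {strategy!r}")


defaults = {"valid_raw_types": ["int", "float"], "invalid_special_types": ["text"]}
custom = {"valid_raw_types": ["object"]}

overwritten = resolve_infer_args(defaults, custom, "overwrite")
updated = resolve_infer_args(defaults, custom, "update")
```

With 'overwrite' the default exclusion of text features is lost, while 'update' keeps it and only replaces the valid_raw_types entry.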

features_in#

List of feature names the generator will expect and use in the fit and transform methods. Equivalent to feature_metadata_in.get_features() post-fit.

Type:

list of str

features_out#

List of feature names present in the output of fit_transform and transform methods. Equivalent to feature_metadata.get_features() post-fit.

Type:

list of str

feature_metadata_in#

The FeatureMetadata of data pre-transformation (data used as input to fit and transform methods).

Type:

FeatureMetadata

feature_metadata#

The FeatureMetadata of data post-transformation (data outputted by fit_transform and transform methods).

Type:

FeatureMetadata

feature_metadata_real#

The FeatureMetadata of data post-transformation consisting of the exact dtypes as opposed to the grouped raw dtypes found in feature_metadata_in, with grouped raw dtypes substituting for the special dtypes. This is only used in the print_feature_metadata_info method and is intended for introspection. It can be safely set to None to reduce memory and disk usage post-fit.

Type:

FeatureMetadata

Methods

fit(X: DataFrame, **kwargs)[source]#

Fit generator to the provided data. Because of how the generators track output features and types, it is generally required that the data be transformed during fit, so the fit function is rarely useful to implement beyond a simple call to fit_transform.

Parameters:
  • X (DataFrame) – Input data used to fit the generator.

  • **kwargs – Any additional arguments that a particular generator implementation could use. See fit_transform method for common kwargs values.

fit_transform(X: DataFrame, y: Series | None = None, feature_metadata_in: FeatureMetadata | None = None, **kwargs) DataFrame[source]#

Fit generator to the provided data and return the transformed version of the data as if fit and transform were called sequentially with the same data. This is generally more efficient than calling fit and transform separately and can be up to twice as fast if the fit process requires transformation of the data. Calling this method after the generator has already been fit raises an AssertionError.

Parameters:
  • X (DataFrame) – Input data used to fit the generator.

  • y (Series, optional) – Input data’s labels used to fit the generator. Most generators do not utilize labels. y.index must be equal to X.index to avoid misalignment.

  • feature_metadata_in (FeatureMetadata, optional) – Identical to providing feature_metadata_in during generator initialization. Ignored if self.feature_metadata_in is already specified. If neither are set, feature_metadata_in will be inferred from the _infer_feature_metadata_in method.

  • **kwargs – Any additional arguments that a particular generator implementation could use. Passed to _fit_transform and _fit_generators methods.

Returns:

X_out (DataFrame) – The transformed version of the input data X.

get_feature_links()[source]#

Returns feature links including all pre and post generators.

get_feature_links_chain()[source]#

Get the feature dependence chain between this generator and all of its post generators.

get_tags() dict[source]#

Gets the tags for this generator.

is_valid_metadata_in(feature_metadata_in: FeatureMetadata)[source]#

True if input data with feature metadata of feature_metadata_in could result in non-empty output.

This is dictated by feature_metadata_in.get_features(**self._infer_features_in_args) not being empty.

False if the features represented in feature_metadata_in do not contain any usable types for the generator.

For example, if only numeric features are passed as input to TextSpecialFeatureGenerator which requires text input features, this will return False. However, if both numeric and text features are passed, this will return True since the text features would be valid input (the numeric features would simply be dropped).
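The validity check can be pictured as "does at least one input feature carry a type this generator accepts". The sketch below assumes a simplified representation in which each feature maps to a set of type tags; usable_features is a hypothetical helper and does not call FeatureMetadata.

```python
def usable_features(feature_types, required_types):
    """Return the names of features that carry at least one required type tag."""
    return [name for name, types in feature_types.items() if required_types & types]


numeric_only = {"age": {"int"}, "price": {"float"}}
mixed = {"age": {"int"}, "review": {"object", "text"}}

# A text-only generator needs at least one feature tagged as text.
text_required = {"text"}
valid_numeric_only = bool(usable_features(numeric_only, text_required))
valid_mixed = bool(usable_features(mixed, text_required))
```

In the mixed case only the review feature is usable; the numeric feature is simply left out, which matches the behavior described above.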

print_feature_metadata_info(log_level: int = 20)[source]#

Outputs detailed logs of a fit feature generator including the input and output FeatureMetadata objects’ feature types.

Parameters:

log_level (int, default 20) – Log level of the logging statements.

print_generator_info(log_level: int = 20)[source]#

Outputs detailed logs of the generator, such as the fit runtime.

Parameters:

log_level (int, default 20) – Log level of the logging statements.

transform(X: DataFrame) DataFrame[source]#

Transforms input data into the output data format. Will raise an AssertionError if called before the generator has been fit using fit or fit_transform methods.

Parameters:

X (DataFrame) – Input data to be transformed by the generator. Input data must contain all features in features_in, and should have the same dtypes as in the data provided to fit. Extra columns present in X that are not in features_in will be ignored and not affect the output.

Returns:

X_out (DataFrame) – The transformed version of the input data X.

AutoMLPipelineFeatureGenerator#

class autogluon.features.generators.AutoMLPipelineFeatureGenerator(enable_numeric_features=True, enable_categorical_features=True, enable_datetime_features=True, enable_text_special_features=True, enable_text_ngram_features=True, enable_raw_text_features=False, enable_vision_features=True, vectorizer=None, text_ngram_params=None, **kwargs)[source]#

Pipeline feature generator with simplified arguments to handle most Tabular data including text and dates adequately. This is the default feature generation pipeline used by AutoGluon when unspecified. For more customization options, refer to PipelineFeatureGenerator and BulkFeatureGenerator.

Parameters:
  • enable_numeric_features (bool, default True) – Whether to keep features of ‘int’ and ‘float’ raw types. These features are passed without alteration to the models. Appends IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[‘int’, ‘float’])) to the generator group.

  • enable_categorical_features (bool, default True) – Whether to keep features of ‘object’ and ‘category’ raw types. These features are processed into memory optimized ‘category’ features. Appends CategoryFeatureGenerator() to the generator group.

  • enable_datetime_features (bool, default True) – Whether to keep features of ‘datetime’ raw type and ‘object’ features identified as ‘datetime_as_object’ features. These features will be converted to ‘int’ features representing milliseconds since epoch. Appends DatetimeFeatureGenerator() to the generator group.

  • enable_text_special_features (bool, default True) – Whether to use ‘object’ features identified as ‘text’ features to generate ‘text_special’ features such as word count, capital letter ratio, and symbol counts. Appends TextSpecialFeatureGenerator() to the generator group.

  • enable_text_ngram_features (bool, default True) – Whether to use ‘object’ features identified as ‘text’ features to generate ‘text_ngram’ features. Appends TextNgramFeatureGenerator(vectorizer=vectorizer, **text_ngram_params) to the generator group. See text_ngram.py for valid parameters.

  • enable_raw_text_features (bool, default False) – Whether to use the raw text features. The generated raw text features will end up with ‘_raw_text’ suffix. For example, ‘sentence’ –> ‘sentence_raw_text’

  • enable_vision_features (bool, default True) – [Experimental] Whether to keep ‘object’ features identified as ‘image_path’ special type. Features of this form should have a string path to an image file as their value. Only vision models can leverage these features, and these features will not be treated as categorical. Note: ‘image_path’ features will not be automatically inferred. These features must be explicitly specified as such in a custom FeatureMetadata object. Note: It is recommended that the string paths use absolute paths rather than relative, as it will likely be more stable.

  • vectorizer (sklearn.feature_extraction.text.CountVectorizer, default CountVectorizer(min_df=30, ngram_range=(1, 3), max_features=10000, dtype=np.uint8)) – sklearn CountVectorizer object to use in TextNgramFeatureGenerator. Only used if enable_text_ngram_features=True.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid key word arguments.

Examples

>>> from autogluon.tabular import TabularDataset
>>> from autogluon.features.generators import AutoMLPipelineFeatureGenerator
>>>
>>> feature_generator = AutoMLPipelineFeatureGenerator()
>>>
>>> label = 'class'
>>> train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
>>> X_train = train_data.drop(labels=[label], axis=1)
>>> y_train = train_data[label]
>>>
>>> X_train_transformed = feature_generator.fit_transform(X=X_train, y=y_train)
>>>
>>> test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
>>>
>>> X_test_transformed = feature_generator.transform(test_data)

PipelineFeatureGenerator#

class autogluon.features.generators.PipelineFeatureGenerator(pre_generators=None, post_generators=None, pre_drop_useless=True, pre_enforce_types=True, reset_index=True, post_drop_duplicates=True, verbosity=3, **kwargs)[source]#

PipelineFeatureGenerator is an implementation of BulkFeatureGenerator with various smart defaults and edge case handling functionality to enable robust data handling. It is recommended that users base any custom feature generators meant for end-to-end data transformation on PipelineFeatureGenerator.

Reference AutoMLPipelineFeatureGenerator for an example of extending PipelineFeatureGenerator.

It is not recommended that PipelineFeatureGenerator be used as a generator within any other generator’s pre or post generators.

BulkFeatureGenerator#

class autogluon.features.generators.BulkFeatureGenerator(generators: List[List[AbstractFeatureGenerator]], pre_generators: List[AbstractFeatureGenerator] | None = None, **kwargs)[source]#

BulkFeatureGenerator is used for complex feature generation pipelines where multiple generators are required, with some generators requiring the output of other generators as input (multi-stage generation). For ML problems, it is expected that the user uses a feature generator that is an instance of or is inheriting from BulkFeatureGenerator, as single feature generators typically will not satisfy the feature generation needs of all input data types. Unless you are an expert user, we recommend you create custom FeatureGenerators based off of PipelineFeatureGenerator instead of BulkFeatureGenerator.

Parameters:
  • generators (List[List[AbstractFeatureGenerator]]) –

    generators is a list of generator groups, where a generator group is a list of generators. Feature generators within generators[i] (generator group) are all fit on the same data, and their outputs are then concatenated to form the output of generators[i]. generators[i+1] are then fit on the output of generators[i]. The last generator group’s output is the output of _fit_transform and _transform methods. Due to the flexibility of generators, at the time of initialization, generators will prepend pre_generators and append post_generators if they are not None.

    If pre/post generators are specified, the supplied generators will be extended like this:

    pre_generators = [[pre_generator] for pre_generator in pre_generators]
    post_generators = [[post_generator] for post_generator in self._post_generators]
    self.generators: List[List[AbstractFeatureGenerator]] = pre_generators + generators + post_generators
    self._post_generators = []

    This means that self._post_generators will be empty as post_generators will be incorporated into self.generators instead.

    Note that if generators within a generator group produce a feature with the same name, an AssertionError will be raised as features with the same name cannot be present within a valid DataFrame output.

    If both features are desired, specify a name_prefix parameter in one of the generators to prevent name collisions. If experimenting with different generator groups, it is encouraged to try fitting your experimental feature-generators to the data without any ML model training to ensure validity and avoid name collisions.

  • pre_generators (List[AbstractFeatureGenerator], optional) – pre_generators are generators which are sequentially fit prior to generators. Functions identically to the post_generators argument, but pre_generators are called before generators, while post_generators are called after generators. Provided for convenience to classes inheriting from BulkFeatureGenerator. Common pre_generators include AsTypeFeatureGenerator and FillNaFeatureGenerator, which act to prune and clean the data instead of generating entirely new features.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid key word arguments.
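The multi-stage behavior of generator groups, including the name collision check, can be sketched without AutoGluon. In this illustration each "generator" is just a hypothetical function from a dict of columns to a dict of columns, and run_stages is a hypothetical helper; the real class works on DataFrames and fitted generator objects.

```python
def run_stages(stages, columns):
    """Apply generator groups in order; each group sees the previous group's output."""
    for group in stages:
        combined = {}
        for generator in group:
            for name, values in generator(columns).items():
                # Two generators in one group must not emit the same feature name.
                assert name not in combined, f"duplicate feature name: {name}"
                combined[name] = values
        columns = combined
    return columns


def keep_numeric(cols):
    return {k: v for k, v in cols.items() if all(isinstance(x, (int, float)) for x in v)}


def string_length(cols):
    return {k + "_len": [len(x) for x in v] for k, v in cols.items() if all(isinstance(x, str) for x in v)}


def double(cols):
    return {k: [x * 2 for x in v] for k, v in cols.items()}


data = {"n": [1, 2], "s": ["ab", "cde"]}
# Group 1 runs two generators on the same input and concatenates their outputs;
# group 2 then runs on the concatenated result.
result = run_stages([[keep_numeric, string_length], [double]], data)
```

Placing the same generator twice in one group, for example [[keep_numeric, keep_numeric]], triggers the assertion because both emit the feature n; giving one of them a distinct output name avoids the collision.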

Examples

>>> from autogluon.tabular import TabularDataset
>>> from autogluon.features.generators import AsTypeFeatureGenerator, BulkFeatureGenerator, CategoryFeatureGenerator, DropDuplicatesFeatureGenerator, FillNaFeatureGenerator, IdentityFeatureGenerator  # noqa
>>> from autogluon.common.features.types import R_INT, R_FLOAT
>>>
>>> generators = [
...     [AsTypeFeatureGenerator()],  # Convert all input features to the exact same types as they were during fit.
...     [FillNaFeatureGenerator()],  # Fill all NA values in the data
...     [
...         CategoryFeatureGenerator(),  # Convert object types to category types and minimize their memory usage
...         # Carry over all int and float features (without this, they would be dropped).
...         IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
...     ],
...     # CategoryFeatureGenerator and IdentityFeatureGenerator will have their outputs concatenated together
...     # before being fed into DropDuplicatesFeatureGenerator
...     [DropDuplicatesFeatureGenerator()]  # Drops any features which are duplicates of each other
... ]
>>> feature_generator = BulkFeatureGenerator(generators=generators, verbosity=3)
>>>
>>> label = 'class'
>>> train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
>>> X_train = train_data.drop(labels=[label], axis=1)
>>> y_train = train_data[label]
>>>
>>> X_train_transformed = feature_generator.fit_transform(X=X_train, y=y_train)
>>>
>>> test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
>>>
>>> X_test_transformed = feature_generator.transform(test_data)

AsTypeFeatureGenerator#

class autogluon.features.generators.AsTypeFeatureGenerator(convert_bool: bool = True, convert_bool_method: str = 'auto', convert_bool_method_v2_threshold: int = 15, convert_bool_method_v2_row_threshold: int = 128, **kwargs)[source]#

Enforces type conversion on the data to match the types seen during fitting. If a feature cannot be converted to the correct type, an exception will be raised.

Parameters:
  • convert_bool (bool, default True) – Whether to automatically convert features with only two unique values to boolean.

  • convert_bool_method (str, default "auto") – [Advanced] The processing method to convert boolean features. Recommended to keep as “auto”. Valid values:

    “auto” : Attempts to automatically select the best method based on the data.
    “v1” : Uses a simple method that was the default prior to v0.7 (_convert_to_bool_simple).
    “v2” : Uses an optimized method that was introduced in v0.7 (_convert_to_bool_fast).

    Note that “v2” is not always faster than “v1”, and is often slower when there are few boolean columns. All options produce identical results, except in extreme synthetic edge-cases.

  • convert_bool_method_v2_threshold (int, default 15) – [Advanced] If convert_bool_method=”auto”, this value determines which method is used. If the number of boolean features is >= this value, then “v2” is used. Otherwise, “v1” is used. 15 is roughly the optimal value on average.

  • convert_bool_method_v2_row_threshold (int, default 128) – [Advanced] If using the “v2” bool method, the batch method is used instead of the realtime method when the row count is >= this value. 128 is roughly the optimal value on average.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid key word arguments.
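The two thresholds above amount to simple comparisons. The following sketch restates them as code; select_bool_method and select_v2_mode are hypothetical helper names, and the returned strings "batch" and "realtime" are labels chosen for this illustration.

```python
def select_bool_method(method, num_bool_features, v2_threshold=15):
    """Resolve 'auto' to 'v1' or 'v2' based on the number of boolean features."""
    if method != "auto":
        return method
    return "v2" if num_bool_features >= v2_threshold else "v1"


def select_v2_mode(num_rows, row_threshold=128):
    """Pick the batch or realtime variant of the v2 method by row count."""
    return "batch" if num_rows >= row_threshold else "realtime"


few_bools = select_bool_method("auto", num_bool_features=14)
many_bools = select_bool_method("auto", num_bool_features=15)
small_input = select_v2_mode(num_rows=127)
large_input = select_v2_mode(num_rows=128)
```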

BinnedFeatureGenerator#

class autogluon.features.generators.BinnedFeatureGenerator(num_bins=10, **kwargs)[source]#

BinnedFeatureGenerator bins incoming int and float features to num_bins unique int values, maintaining relative rank order.
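One way to map numeric values to a small set of ordered int bins is to learn cut points from the sorted training values and then locate each value among them. The sketch below uses that quantile-style approach as an assumption; the actual binning strategy used by BinnedFeatureGenerator may differ, and fit_bin_edges and to_bins are hypothetical helpers.

```python
from bisect import bisect_right


def fit_bin_edges(values, num_bins=10):
    """Pick up to num_bins - 1 cut points from the sorted training values."""
    ordered = sorted(values)
    n = len(ordered)
    # Repeated cut points are collapsed, so fewer than num_bins bins may result.
    edges = {ordered[(i * n) // num_bins] for i in range(1, num_bins)}
    return sorted(edges)


def to_bins(values, edges):
    """Map each value to an int bin index; a larger value never gets a smaller bin."""
    return [bisect_right(edges, v) for v in values]


train = [0.5, 3.2, 1.1, 9.9, 4.4, 7.0, 2.8, 6.1]
edges = fit_bin_edges(train, num_bins=4)
binned = to_bins(train, edges)
```

Because the edges are learned once and reused, new values outside the training range fall into the first or last bin rather than creating new bins.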

CategoryFeatureGenerator#

class autogluon.features.generators.CategoryFeatureGenerator(stateful_categories=True, minimize_memory=True, cat_order='original', minimum_cat_count: int = 2, maximum_num_cat: int | None = None, fillna: str | None = None, **kwargs)[source]#

CategoryFeatureGenerator is used to convert object types to category types, as well as remove rare categories and optimize memory usage. After fitting, previously unseen categories during transform are treated as missing values.

Parameters:
  • stateful_categories (bool, default True) – If True, categories from training are applied to transformed data, and any unknown categories from input data will be treated as missing values. It is recommended to keep this value as True to avoid strange downstream behaviour.

  • minimize_memory (bool, default True) – If True, minimizes category memory usage by converting all category values to sequential integers. This replaces any string data present in the categories but does not alter the behavior of models when using the category as a feature so long as the original string values are not required downstream. It is recommended to keep this value as True to dramatically reduce memory usage with no cost to accuracy.

  • cat_order (str, default 'original') –

    Determines the order in which categories are stored. This is important when minimize_memory is True, as the order will determine which categories are converted to which integer values. Valid values:

    ‘original’ : Keep the original order. If the feature was originally an object, this is equivalent to ‘alphanumeric’.
    ‘alphanumeric’ : Sort the categories alphanumerically.
    ‘count’ : Sort the categories by frequency (least frequent in front with code of 0).

  • minimum_cat_count (int, default 2) – The minimum number of occurrences a category must have in the training data to avoid being considered a rare category. Rare categories are removed and treated as missing values. If None, no minimum count is required. This includes categories that never occur in the data but are present in the category object as possible categories.

  • maximum_num_cat (int, default None) – The maximum number of categories that can be considered non-rare. Sorted by occurrence count, up to the N highest count categories will be kept if maximum_num_cat=N. All others will be considered rare categories.

  • fillna (str, default None) –

    The method used to handle missing values. Only valid if stateful_categories=True. Missing values include the values that were originally NaN and values converted to NaN from other parameters such as minimum_cat_count. Valid values:

    None : Keep missing values as is. They will appear as NaN and have no category assigned to them.
    ‘mode’ : Set missing values to the most frequent category in their feature.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid key word arguments.
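The combined effect of minimum_cat_count, maximum_num_cat, cat_order='count' and fillna can be sketched with plain Python. This is an illustration of the rules described above under simplifying assumptions (None stands in for NaN, ties are broken by name); fit_categories and encode are hypothetical helpers, not part of AutoGluon.

```python
from collections import Counter


def fit_categories(values, minimum_cat_count=2, maximum_num_cat=None):
    """Return kept categories, least frequent first (as with cat_order='count')."""
    counts = Counter(v for v in values if v is not None)
    kept = [c for c, n in counts.items()
            if minimum_cat_count is None or n >= minimum_cat_count]
    # Sort by descending count to select the top-N; ties are broken by name.
    kept.sort(key=lambda c: (-counts[c], c))
    if maximum_num_cat is not None:
        kept = kept[:maximum_num_cat]
    # The least frequent kept category receives code 0.
    return list(reversed(kept))


def encode(values, categories, fillna=None):
    """Map values to int codes; unseen or rare values become missing (None)."""
    codes = {c: i for i, c in enumerate(categories)}
    # With fillna='mode', missing values take the most frequent kept category.
    fill = categories[-1] if fillna == "mode" and categories else None
    return [codes.get(v, codes.get(fill)) for v in values]


train = ["a", "b", "a", "c", "b", "a", None, "d"]
categories = fit_categories(train, minimum_cat_count=2)
plain = encode(["a", "b", "z", None], categories)
filled = encode(["a", "b", "z", None], categories, fillna="mode")
```

Here c and d are rare and are dropped at fit time, so the unseen value z and the missing value are both treated as missing unless fillna='mode' assigns them to a, the most frequent category.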

DatetimeFeatureGenerator#

class autogluon.features.generators.DatetimeFeatureGenerator(features: list = ['year', 'month', 'day', 'dayofweek'], **kwargs)[source]#

Transforms datetime features into numeric features.

Parameters:

features (list, optional) – A list of datetime features to parse out of dates. For a full list of options see the methods inside pandas.Series.dt at https://pandas.pydata.org/docs/reference/api/pandas.Series.html
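As a rough picture of what the default features list extracts, the sketch below derives the same four components from standard-library datetime objects. It is an assumption-laden illustration: expand_datetime is a hypothetical helper, it works on a plain list rather than a pandas Series, and it does not reproduce any other output the real generator may produce.

```python
from datetime import datetime


def expand_datetime(values, features=("year", "month", "day", "dayofweek")):
    """Split datetimes into one numeric list per requested component."""
    extractors = {
        "year": lambda d: d.year,
        "month": lambda d: d.month,
        "day": lambda d: d.day,
        # pandas' dayofweek uses Monday=0, which matches datetime.weekday().
        "dayofweek": lambda d: d.weekday(),
    }
    return {f: [extractors[f](d) for d in values] for f in features}


dates = [datetime(2024, 2, 29), datetime(2023, 12, 31)]
parts = expand_datetime(dates)
```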

DropDuplicatesFeatureGenerator#

class autogluon.features.generators.DropDuplicatesFeatureGenerator(sample_size_init=500, sample_size_final=2000, **kwargs)[source]#

Drops features which are exact duplicates of other features, leaving only one instance of the data.

Parameters:
  • sample_size_init (int, default 500) – The number of rows to sample when doing an initial filter of duplicate feature candidates. Usually, the majority of features can be filtered out using this smaller amount of rows which greatly speeds up the computation of the final check. If None or greater than the number of rows, no initial filter will occur. This may increase the time to fit immensely for large datasets.

  • sample_size_final (int, default 2000) – The number of rows to sample when doing the final filter to determine duplicate features. This theoretically can lead to features that are very nearly duplicates but not exact duplicates being removed, but should be near impossible in practice. If None or greater than the number of rows, will perform exact duplicate detection (most expensive). It is recommended to keep this value below 100000 to maintain reasonable fit times.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid key word arguments.
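The two sample sizes describe a cheap first pass that narrows the candidates followed by a more expensive confirmation pass. The sketch below illustrates that idea on a dict of columns, using the first rows as the "sample" purely as a simplifying assumption; how AutoGluon actually selects the sampled rows is not specified here, and both helper functions are hypothetical.

```python
def group_by_prefix(columns, names, n_rows):
    """Group column names whose first n_rows values are identical (None = all rows)."""
    groups = {}
    for name in names:
        key = tuple(columns[name][:n_rows])
        groups.setdefault(key, []).append(name)
    return [g for g in groups.values() if len(g) > 1]


def find_duplicates(columns, sample_size_init=500, sample_size_final=2000):
    """Return column names to drop, keeping the first column of each duplicate set."""
    to_drop = []
    # Cheap first pass: only columns that agree on a small sample stay candidates.
    for candidates in group_by_prefix(columns, list(columns), sample_size_init):
        # More expensive second pass, restricted to the surviving candidates.
        for dup_group in group_by_prefix(columns, candidates, sample_size_final):
            to_drop.extend(dup_group[1:])
    return to_drop


data = {"a": [1, 2, 3, 4], "b": [1, 2, 3, 4], "c": [1, 2, 9, 9], "d": [5, 6, 7, 8]}
exact = find_duplicates(data, sample_size_init=2, sample_size_final=None)
approximate = find_duplicates(data, sample_size_init=2, sample_size_final=2)
```

The second call shows the caveat mentioned for sample_size_final: when the final check also looks at only part of the data, column c is dropped even though it differs from a in later rows.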

DropUniqueFeatureGenerator#

class autogluon.features.generators.DropUniqueFeatureGenerator(max_unique_ratio=0.99, **kwargs)[source]#

Drops features which only have 1 unique value or which have nearly no repeated values (based on max_unique_ratio) and are of category or object type.
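The drop rule can be written as a short predicate. This is a sketch of the rule as stated above; should_drop is a hypothetical helper, and the use of a strict greater-than comparison against max_unique_ratio is an assumption.

```python
def should_drop(values, raw_type, max_unique_ratio=0.99):
    """Decide whether a feature is constant or (for category/object) nearly all unique."""
    n_unique = len(set(values))
    if n_unique <= 1:
        # A single unique value carries no information, whatever the type.
        return True
    if raw_type in ("category", "object"):
        # ID-like features: almost every row has its own value.
        return n_unique / len(values) > max_unique_ratio
    return False


constant = should_drop(["x"] * 5, "object")
id_like = should_drop([f"id{i}" for i in range(100)], "object")
numeric_unique = should_drop(list(range(100)), "int")
normal_category = should_drop(["a", "b", "a", "b"], "object")
```

Numeric features with all-unique values are kept, since high cardinality is expected for continuous data and only becomes a problem for categorical inputs.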

DummyFeatureGenerator#

class autogluon.features.generators.DummyFeatureGenerator(features_in='empty', feature_metadata_in='empty', **kwargs)[source]#

Ignores all input features and returns a single int feature with all 0 values. Useful for testing purposes or to avoid crashes if no features were given.

FillNaFeatureGenerator#

class autogluon.features.generators.FillNaFeatureGenerator(fillna_map=None, fillna_default=nan, inplace=False, **kwargs)[source]#

Fills missing values in the data.

Parameters:
  • fillna_map (dict, default {'object': ''}) – Map which dictates the fill values of NaNs. Keys are the raw types of the features as in self.feature_metadata_in.type_map_raw. If a feature’s raw type is not present in fillna_map, its NaN values are filled with fillna_default.

  • fillna_default (default np.nan) – The default fillna value if the feature’s raw type is not present in fillna_map. Be careful about setting this to anything other than np.nan, as not all raw types can handle int, float, or string values.

  • inplace (bool, default False) – If True, then the NaN values are filled inplace without copying the input data. This will alter the input data outside of the scope of this function.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid keyword arguments.
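
The lookup order described for fillna_map and fillna_default can be sketched as follows. This is a minimal plain-Python illustration, not the library's code: `fill_missing`, the dict-of-lists data layout, and the `raw_types` mapping are all hypothetical stand-ins for a DataFrame and its feature metadata.

```python
import math


def fill_missing(columns, raw_types, fillna_map=None, fillna_default=math.nan):
    """Hypothetical helper: fill missing values per raw type, else use fillna_default."""
    if fillna_map is None:
        fillna_map = {"object": ""}  # the documented default map

    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    filled = {}
    for name, values in columns.items():
        # Raw types absent from fillna_map fall back to fillna_default.
        fill_value = fillna_map.get(raw_types[name], fillna_default)
        # A new list is built, so the input is left untouched (like inplace=False).
        filled[name] = [fill_value if is_missing(v) else v for v in values]
    return filled


columns = {"city": ["Oslo", None, "Lima"], "age": [31.0, math.nan, 47.0]}
raw_types = {"city": "object", "age": "float"}

result = fill_missing(columns, raw_types)
custom = fill_missing(columns, raw_types, fillna_map={"object": "", "float": 0.0})
print(result["city"], custom["age"])
```

With the default map, the missing `city` value becomes an empty string while the missing `age` stays NaN, since `float` is not a key in the map and the default fill value is NaN.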

IdentityFeatureGenerator#

class autogluon.features.generators.IdentityFeatureGenerator(features_in: list | None = None, feature_metadata_in: FeatureMetadata | None = None, post_generators: list | None = None, pre_enforce_types=False, pre_drop_useless=False, post_drop_duplicates=False, reset_index=False, column_names_as_str=True, name_prefix: str | None = None, name_suffix: str | None = None, infer_features_in_args: dict | None = None, infer_features_in_args_strategy='overwrite', banned_feature_special_types: List[str] | None = None, log_prefix='', verbosity=2)[source]#

IdentityFeatureGenerator simply passes the data along without alterations.

LabelEncoderFeatureGenerator#

class autogluon.features.generators.LabelEncoderFeatureGenerator(features_in: list | None = None, feature_metadata_in: FeatureMetadata | None = None, post_generators: list | None = None, pre_enforce_types=False, pre_drop_useless=False, post_drop_duplicates=False, reset_index=False, column_names_as_str=True, name_prefix: str | None = None, name_suffix: str | None = None, infer_features_in_args: dict | None = None, infer_features_in_args_strategy='overwrite', banned_feature_special_types: List[str] | None = None, log_prefix='', verbosity=2)[source]#

Converts category features to int features by mapping to the category codes.

CategoryMemoryMinimizeFeatureGenerator#

class autogluon.features.generators.CategoryMemoryMinimizeFeatureGenerator(features_in: list | None = None, feature_metadata_in: FeatureMetadata | None = None, post_generators: list | None = None, pre_enforce_types=False, pre_drop_useless=False, post_drop_duplicates=False, reset_index=False, column_names_as_str=True, name_prefix: str | None = None, name_suffix: str | None = None, infer_features_in_args: dict | None = None, infer_features_in_args_strategy='overwrite', banned_feature_special_types: List[str] | None = None, log_prefix='', verbosity=2)[source]#

Minimizes memory usage of category features by converting the category values to monotonically increasing int values. This is important for category features with string values which can take up significant memory despite the string information not being used downstream.

NumericMemoryMinimizeFeatureGenerator#

class autogluon.features.generators.NumericMemoryMinimizeFeatureGenerator(dtype_out=<class 'numpy.uint8'>, **kwargs)[source]#

Clips and converts dtype of int features to minimize memory usage.

Parameters:
  • dtype_out (np.dtype, default np.uint8) – dtype to clip and convert features to. Clipping will automatically use the correct min and max values for the dtype provided.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid keyword arguments.
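
The clip-then-convert step can be illustrated with NumPy alone. This is a sketch of the general technique, assuming integer input and an integer dtype_out; `clip_and_convert` is a hypothetical helper, not a function in autogluon.features.

```python
import numpy as np


def clip_and_convert(values, dtype_out=np.uint8):
    """Hypothetical helper: clip ints to the range of dtype_out, then cast."""
    info = np.iinfo(dtype_out)  # min and max representable values of the integer dtype
    return np.clip(values, info.min, info.max).astype(dtype_out)


raw = np.array([-5, 0, 17, 255, 300], dtype=np.int64)
small = clip_and_convert(raw)
print(small, small.dtype)
```

Clipping before the cast matters: casting 300 directly to uint8 would wrap around instead of saturating at 255.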

RenameFeatureGenerator#

class autogluon.features.generators.RenameFeatureGenerator(name_prefix=None, name_suffix=None, inplace=False, **kwargs)[source]#

RenameFeatureGenerator renames the columns without altering their values. This can be used to avoid column name collisions when transforming the same feature in multiple ways, or to highlight that a feature was derived from a particular pipeline.

Parameters:
  • name_prefix (str, default None) – Name prefix to add to all output feature names.

  • name_suffix (str, default None) – Name suffix to add to all output feature names.

  • inplace (bool, default False) – If True, then the column names are renamed inplace without copying the input data. This will alter the input data outside of the scope of this function.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid keyword arguments.
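
The effect of name_prefix and name_suffix on column names can be sketched as a simple mapping. `rename_features` is a hypothetical helper written for illustration; it only models the naming, not the generator's handling of data or metadata.

```python
def rename_features(feature_names, name_prefix=None, name_suffix=None):
    """Hypothetical helper: map each old name to its prefixed/suffixed new name."""
    prefix = name_prefix or ""
    suffix = name_suffix or ""
    return {name: f"{prefix}{name}{suffix}" for name in feature_names}


mapping = rename_features(["age", "income"], name_prefix="raw_", name_suffix="_v1")
print(mapping)
```

Applying distinct prefixes to two transformed copies of the same input column is one way to keep their output names from colliding.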

TextNgramFeatureGenerator#

class autogluon.features.generators.TextNgramFeatureGenerator(vectorizer=None, vectorizer_strategy='combined', max_memory_ratio=0.15, prefilter_tokens=False, prefilter_token_count=100, **kwargs)[source]#

Generates ngram features from text features.

Parameters:
  • vectorizer (sklearn.feature_extraction.text.CountVectorizer or sklearn.feature_extraction.text.TfidfVectorizer, default CountVectorizer(min_df=30, ngram_range=(1, 3), max_features=10000, dtype=np.uint8)) – sklearn CountVectorizer which is used to generate the ngrams given the text data. Can also specify a TfidfVectorizer, but note that memory usage will increase by 4-8x relative to CountVectorizer.

  • vectorizer_strategy (str, default 'combined') – If ‘combined’, all text features are concatenated together to fit the vectorizer. Features generated in this way have their names prepended with ‘__nlp__.’. If ‘separate’, all text features are fit separately with their own copy of the vectorizer. Their ngram features are then concatenated together to form the output. If ‘both’, the outputs of ‘combined’ and ‘separate’ are concatenated together to form the output. It is generally recommended to keep vectorizer_strategy as ‘combined’ unless the text features are not associated with each other, as fitting separate vectorizers could increase memory usage and model training time. Valid values: [‘combined’, ‘separate’, ‘both’]

  • max_memory_ratio (float, default 0.15) – Safety measure to avoid out-of-memory errors downstream in model training. The number of ngrams generated will be capped to take at most max_memory_ratio proportion of total available memory, treating the ngrams as float32 values. ngram features will be removed in least frequent to most frequent order. Note: For vectorizer_strategy values other than ‘combined’, the resulting ngrams may use more than this value. It is recommended to only increase this value above 0.15 if confident that higher values will not result in out-of-memory errors.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid keyword arguments.
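
The memory cap implied by max_memory_ratio can be worked through with simple arithmetic. The sketch below is an assumption-laden illustration, not the library's algorithm: `max_ngram_count` and `keep_most_frequent` are hypothetical helpers, and the budget formula (rows × ngrams × 4 bytes per float32 value) is inferred from the parameter description.

```python
def max_ngram_count(n_rows, available_memory_bytes, max_memory_ratio=0.15):
    """Hypothetical helper: how many float32 ngram columns fit in the memory budget."""
    if n_rows <= 0:
        raise ValueError("n_rows must be positive")
    bytes_per_value = 4  # ngrams are treated as float32 values
    budget = available_memory_bytes * max_memory_ratio
    return int(budget // (n_rows * bytes_per_value))


def keep_most_frequent(ngram_counts, limit):
    """Hypothetical helper: drop the least frequent ngrams until `limit` remain."""
    ranked = sorted(ngram_counts, key=ngram_counts.get, reverse=True)
    return ranked[:limit]


# 1,000,000 rows with 8 GiB of available memory (illustrative numbers only).
limit = max_ngram_count(n_rows=1_000_000, available_memory_bytes=8 * 1024**3)
kept = keep_most_frequent({"the cat": 50, "cat": 120, "sat on": 7, "mat": 30}, 2)
print(limit, kept)
```

Under these illustrative numbers, the budget is far below the default max_features=10000 of the vectorizer, which shows why the cap can matter on large datasets.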

TextSpecialFeatureGenerator#

class autogluon.features.generators.TextSpecialFeatureGenerator(symbols: List[str] | None = None, min_occur_ratio=0.01, min_occur_offset=10, bin_features: bool = True, post_drop_duplicates: bool = True, **kwargs)[source]#

TextSpecialFeatureGenerator generates text specific features from incoming raw text features. These include word counts, character counts, symbol counts, capital letter ratios, and much more. Features generated by this generator will have ‘text_special’ as a special type.

Parameters:
  • symbols (List[str], optional) – List of string symbols to compute counts and ratios for as features. If not specified, defaults to [‘!’, ‘?’, ‘@’, ‘%’, ‘$’, ‘*’, ‘&’, ‘#’, ‘^’, ‘.’, ‘:’, ‘ ’, ‘/’, ‘;’, ‘-’, ‘=’]

  • min_occur_ratio (float, default 0.01) – Minimum ratio of symbol occurrence to consider as a feature. If a symbol appears in less than a min_occur_ratio fraction of the samples (with the default of 0.01, fewer than 1 in 100 samples), it will not be used as a feature.

  • min_occur_offset (int, default 10) – Minimum symbol occurrences to consider as a feature. This is added to the threshold calculated from min_occur_ratio.

  • bin_features (bool, default True) – If True, adds a BinnedFeatureGenerator to the front of post_generators such that all features generated from this generator are then binned. This is useful for ‘text_special’ features because it lowers the chance models will overfit on the features and reduces their memory usage.

  • post_drop_duplicates (bool, default True) – Identical to AbstractFeatureGenerator’s post_drop_duplicates, except it is defaulted to True instead of False. This helps to clean the output of this generator when symbols aren’t present in the data.

  • **kwargs – Refer to AbstractFeatureGenerator documentation for details on valid keyword arguments.
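
How min_occur_ratio and min_occur_offset combine into a single threshold can be sketched as below. This is a hedged illustration: `filter_symbols` is a hypothetical helper, counting the rows that contain a symbol (rather than total occurrences) is an assumption, and the library may apply additional rounding or caps that are not modeled here.

```python
def filter_symbols(texts, symbols, min_occur_ratio=0.01, min_occur_offset=10):
    """Hypothetical helper: keep symbols that appear in enough rows to be useful."""
    n_rows = len(texts)
    # The offset is added to the ratio-based threshold, as described above.
    threshold = n_rows * min_occur_ratio + min_occur_offset
    kept = []
    for symbol in symbols:
        # Assumption: an "occurrence" is a row that contains the symbol at least once.
        rows_with_symbol = sum(1 for text in texts if symbol in text)
        if rows_with_symbol >= threshold:
            kept.append(symbol)
    return kept


texts = ["Hello!"] * 30 + ["Price: $5"] * 5 + ["plain text"] * 65
kept = filter_symbols(texts, ["!", "$", "#"])
print(kept)
```

With 100 rows and the default settings the threshold is 11, so `!` (present in 30 rows) is kept while `$` (5 rows) and `#` (0 rows) are filtered out. The offset keeps rare symbols from passing on small datasets where the ratio alone would give a threshold near zero.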