.. _sec_feature_engineering:

Feature Engineering
===================

Introduction
~~~~~~~~~~~~

Feature engineering involves taking raw tabular data and

1. converting it into a format ready for the machine learning model to read, and
2. enhancing some columns ('features' in ML jargon) to give the ML models more information, in the hope of getting more accurate results.

AutoGluon does some of this for you. This document describes how that works and how you can extend it. We describe the default behaviour, much of which is configurable, and give pointers to how to alter it.

Column Types
~~~~~~~~~~~~

AutoGluon Tabular recognises the following types of features, and has separate processing for each of them:

============ ======================
Feature Type Example Values
============ ======================
boolean      A, B
numerical    1.3, 2.0, -1.6
categorical  Red, Blue, Yellow
datetime     1/31/2021, Mar-31
text         Mary had a little lamb
============ ======================

In addition, other AutoGluon prediction modules recognise additional feature types; these can also be enabled in AutoGluon Tabular by using the MultiModal option.

============ =================
Feature Type Example Values
============ =================
image        path/image123.png
============ =================

Column Type Detection
~~~~~~~~~~~~~~~~~~~~~

- Boolean columns are any columns with only 2 unique values.

- Any string columns are deemed categorical unless they are text (see below). Some models perform better if you tell them which columns are categorical and which are continuous.

- Numeric columns are passed through without change, except to identify them as ``float`` or ``int``. Currently, numeric columns are not tested to determine if they are likely to be categorical. You can force them to be treated as categorical with the Pandas syntax ``.astype("category")``; see below.

- Text columns are detected by first checking that most rows are unique. If they are, and multiple separate words are detected in most rows, the column is a text column. For details see ``common/features/infer_types.py`` in the source.

- Datetime columns are detected by trying to convert them to Pandas datetimes. Pandas detects a wide range of datetime formats. If many of the values in a column are successfully converted, they are datetimes. Currently, datetimes that appear to be purely numeric (e.g. 20210530) are not correctly detected. Any NaN values are set to the column mean. For details see ``common/features/infer_types.py``.

Problem Type Detection
~~~~~~~~~~~~~~~~~~~~~~

If the user does not specify whether the problem is a classification problem or a regression problem, the 'label' column is examined to try to guess. Several things point towards a regression problem: the values are floating point non-integers, and there is a large number of unique values. Within classification, both multiclass and binary (n=2 categories) are detected. For details see ``utils/utils.py``.

To override the automatic inference, explicitly pass the problem_type (one of 'binary', 'regression', 'multiclass') to ``TabularPredictor()``. For example:

::

   predictor = TabularPredictor(label='class', problem_type='multiclass').fit(train_data)

Automatic Feature Engineering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Numerical Columns
-----------------

Numeric columns, both integer and floating point, currently have no automated feature engineering.
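
Which of the per-type transformations described in the following subsections a column receives depends on the type AutoGluon infers for it. To check the types inferred for your own data before fitting, you can build the feature metadata directly. A minimal sketch, assuming the ``FeatureMetadata`` helper is importable from ``autogluon.tabular`` as in recent releases; note that some of the deeper checks described under Column Type Detection (such as deciding that a string column is text rather than categorical) may only be applied when the feature generators are actually fit:

.. code:: python

    import pandas as pd
    from autogluon.tabular import FeatureMetadata

    # A small frame with an integer, a float, a string-categorical and a datetime column.
    df = pd.DataFrame({
        'age': [25, 31, 47],
        'height': [1.62, 1.80, 1.75],
        'colour': ['Red', 'Blue', 'Red'],
        'visit': pd.to_datetime(['2021-01-31', '2021-03-31', '2021-05-30']),
    })

    # from_df infers a raw dtype for each column; printing the result lists
    # the columns grouped by their inferred (raw dtype, special dtypes) pair.
    feature_metadata = FeatureMetadata.from_df(df)
    print(feature_metadata)
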

Categorical Columns
-------------------

Since many downstream models require categories to be encoded as integers, each categorical feature is mapped to monotonically increasing integers.

Datetime Columns
----------------

Columns recognised as datetime are converted into several features:

- a numerical Pandas datetime. Note this has maximum and minimum values specified at ``pandas.Timestamp.min`` and ``pandas.Timestamp.max`` respectively, which may affect dates extremely far in the future or past.
- several extracted columns, the default being ``[year, month, day, dayofweek]``. This is configurable via the `DatetimeFeatureGenerator <../../api/autogluon.features.html#datetimefeaturegenerator>`__; a short configuration sketch appears later in this document.

Note that missing, invalid and out-of-range values generated by the above logic will be converted to the mean value across all valid rows.

Text Columns
------------

If the MultiModal option is enabled, then text columns are processed using a full Transformer neural network model with pretrained NLP models. See the `MultiModalPredictor <../multimodal/index.html>`__ section for more information.

Otherwise, they are processed in two simpler ways:

- an n-gram feature generator extracts n-grams (short strings) from the text feature, adding many additional columns, one for each n-gram feature. These columns are 'n-hot' encoded, containing 1 or more if the original feature contains the n-gram 1 or more times, and 0 otherwise. By default, all text columns are concatenated before applying this stage, and the n-grams are individual words, not substrings of words. You can configure this via the `TextNgramFeatureGenerator <../../api/autogluon.features.html#textngramfeaturegenerator>`__ class; a configuration sketch appears near the end of this document. The n-gram generation is done in ``generators/text_ngram.py``.
- some additional numerical features are calculated, such as word counts, character counts, proportion of uppercase characters, etc. This is configurable via the `TextSpecialFeatureGenerator <../../api/autogluon.features.html#textspecialfeaturegenerator>`__. This is done in ``generators/text_special.py``.

Additional Processing
---------------------

- Columns containing only 1 value are dropped before passing to models.
- Columns containing duplicates of other columns are removed before passing to models.

Feature Engineering Example
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default a feature generator called `AutoMLPipelineFeatureGenerator <../../api/autogluon.features.html#autogluon.features.generators.AutoMLPipelineFeatureGenerator>`__ is used. Let's see this in action. We'll create a dataframe containing a floating point column, an integer column, a datetime column, a categorical column and a string column, and first take a look at the raw data we created.

.. code:: python

    from autogluon.tabular import TabularDataset, TabularPredictor
    import pandas as pd
    import numpy as np
    import random
    from sklearn.datasets import make_regression
    from datetime import datetime

    x, y = make_regression(n_samples=100, n_features=5, n_targets=1, random_state=1)
    dfx = pd.DataFrame(x, columns=['A', 'B', 'C', 'D', 'E'])
    dfy = pd.DataFrame(y, columns=['label'])

    # Create an integer column, a datetime column, a categorical column and a string column to demonstrate how they are processed.
    dfx['B'] = (dfx['B']).astype(int)
    dfx['C'] = datetime(2000, 1, 1) + pd.to_timedelta(dfx['C'].astype(int), unit='D')
    dfx['D'] = pd.cut(dfx['D'] * 10, [-np.inf, -5, 0, 5, np.inf], labels=['v', 'w', 'x', 'y'])
    dfx['E'] = pd.Series(list(' '.join(random.choice(["abc", "d", "ef", "ghi", "jkl"]) for i in range(4)) for j in range(100)))
    dataset = TabularDataset(dfx)
    print(dfx)

.. parsed-literal::
    :class: output

                A  B          C  D                E
    0   -0.545774  0 2000-01-01  y        d ef ef d
    1   -0.468674  0 2000-01-02  x     ghi jkl d ef
    2    1.767960  0 1999-12-31  v      d abc d ghi
    3   -0.118771  1 2000-01-01  y     ef ghi d jkl
    4    0.630196  0 1999-12-31  w  abc jkl jkl ghi
    ..        ... ..        ... ..              ...
    95  -1.182318 -1 2000-01-01  v  jkl jkl ghi jkl
    96   0.562761  0 2000-01-01  v   jkl ef abc jkl
    97  -0.797270  0 2000-01-01  w    abc d ghi abc
    98   0.502741  0 1999-12-31  y     abc jkl d ef
    99   2.056356  0 1999-12-30  w  ef abc jkl ghi

    [100 rows x 5 columns]

Now let's call the default feature generator AutoMLPipelineFeatureGenerator with no parameters and see what it does.

.. code:: python

    from autogluon.features.generators import AutoMLPipelineFeatureGenerator
    auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
    auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

.. parsed-literal::
    :class: output

    A B D E C C.year C.month C.day C.dayofweek E.char_count E.symbol_ratio. __nlp__.abc __nlp__.ef __nlp__.ghi __nlp__.jkl __nlp__._total_
    0 -0.545774 0 3 NaN 946684800000000000 2000 1 1 5 1 6 0 2 0 0 1
    1 -0.468674 0 2 NaN 946771200000000000 2000 1 2 6 4 3 0 1 1 1 3
    2 1.767960 0 0 NaN 946598400000000000 1999 12 31 4 3 4 1 0 1 0 2
    3 -0.118771 1 3 NaN 946684800000000000 2000 1 1 5 4 3 0 1 1 1 3
    4 0.630196 0 1 NaN 946598400000000000 1999 12 31 4 7 0 1 0 1 2 3
    .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    95 -1.182318 -1 0 NaN 946684800000000000 2000 1 1 5 7 0 0 0 1 3 2
    96 0.562761 0 0 5 946684800000000000 2000 1 1 5 6 1 1 1 0 2 3
    97 -0.797270 0 1 NaN 946684800000000000 2000 1 1 5 5 2 2 0 1 0 2
    98 0.502741 0 3 0 946598400000000000 1999 12 31 4 4 3 1 1 0 1 3
    99 2.056356 0 1 NaN 946512000000000000 1999 12 30 3 6 1 1 1 1 1 4

    100 rows × 16 columns

We can see that:

- The floating point and integer columns 'A' and 'B' are unchanged.
- The datetime column 'C' has been converted to a raw value (in nanoseconds), as well as parsed into additional columns for the year, month, day and dayofweek.
- The string categorical column 'D' has been mapped 1:1 to integers, since a lot of models only accept numerical input.
- The freeform text column 'E' has been mapped into some summary features ('char_count' etc.) as well as an n-hot matrix saying whether each text contained each word.

To get more details, we should call the pipeline as part of ``TabularPredictor.fit()``. We need to combine the ``dfx`` and ``dfy`` DataFrames, since ``fit()`` expects a single DataFrame.

.. code:: python

    df = pd.concat([dfx, dfy], axis=1)
    predictor = TabularPredictor(label='label')
    predictor.fit(df, hyperparameters={'GBM': {}}, feature_generator=auto_ml_pipeline_feature_generator)

.. parsed-literal::
    :class: output

    No path specified. Models will be saved in: "AutogluonModels/ag-20230222_232417/"
    Beginning AutoGluon training ...
    AutoGluon will save models to "AutogluonModels/ag-20230222_232417/"
    AutoGluon Version:  0.7.0b20230222
    Python Version:     3.8.13
    Operating System:   Linux
    Platform Machine:   x86_64
    Platform Version:   #1 SMP Tue Nov 30 00:17:50 UTC 2021
    Train Data Rows:    100
    Train Data Columns: 5
    Label Column: label
    Preprocessing data ...
    AutoGluon infers your prediction problem is: 'regression' (because dtype of label-column == float and many unique label-values observed).
        Label info (max, min, mean, stddev): (186.98105511749836, -267.99365510467214, 9.38193, 71.29287)
        If 'regression' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
    Using Feature Generators to preprocess the data ...
    AutoMLPipelineFeatureGenerator is already fit, so the training data will be processed via .transform() instead of .fit_transform().
        Types of features in original data (raw dtype, special dtypes):
            ('category', [])     : 1 | ['D']
            ('datetime', [])     : 1 | ['C']
            ('float', [])        : 1 | ['A']
            ('int', [])          : 1 | ['B']
            ('object', ['text']) : 1 | ['E']
        Types of features in processed data (raw dtype, special dtypes):
            ('category', [])                    : 1 | ['D']
            ('category', ['text_as_category'])  : 1 | ['E']
            ('float', [])                       : 1 | ['A']
            ('int', [])                         : 1 | ['B']
            ('int', ['binned', 'text_special']) : 2 | ['E.char_count', 'E.symbol_ratio. ']
            ('int', ['datetime_as_int'])        : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
            ('int', ['text_ngram'])             : 5 | ['__nlp__.abc', '__nlp__.ef', '__nlp__.ghi', '__nlp__.jkl', '__nlp__._total_']
        Data preprocessing and feature engineering runtime = 0.03s ...
    AutoGluon will gauge predictive performance using evaluation metric: 'root_mean_squared_error'
        This metric's sign has been flipped to adhere to being higher_is_better. The metric score can be multiplied by -1 to get the metric value.
        To change this, specify the eval_metric parameter of Predictor()
    Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 80, Val Rows: 20
    Fitting 1 L1 models ...
    Fitting model: LightGBM ...
        -60.6523 = Validation score (-root_mean_squared_error)
        0.88s = Training runtime
        0.0s = Validation runtime
    Fitting model: WeightedEnsemble_L2 ...
        -60.6523 = Validation score (-root_mean_squared_error)
        0.0s = Training runtime
        0.0s = Validation runtime
    AutoGluon training complete, total runtime = 1.54s ... Best model: "WeightedEnsemble_L2"
    TabularPredictor saved.
    To load, use: predictor = TabularPredictor.load("AutogluonModels/ag-20230222_232417/")

Reading the output, note that:

- the string-categorical column 'D', despite being mapped to integers, is still recognised as categorical.
- the integer column 'B' has not been identified as categorical, even though it only has a few unique values:

.. code:: python

    print(len(set(dfx['B'])))

.. parsed-literal::
    :class: output

    5

To mark it as categorical, we can explicitly mark it as categorical in the original dataframe:

.. code:: python

    dfx["B"] = dfx["B"].astype("category")
    auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
    auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

.. parsed-literal::
    :class: output

    Fitting AutoMLPipelineFeatureGenerator...
    Available Memory: 31476.37 MB
    Train Data (Original) Memory Usage: 0.01 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
        Fitting DatetimeFeatureGenerator...
        Fitting TextSpecialFeatureGenerator...
            Fitting BinnedFeatureGenerator...
            Fitting DropDuplicatesFeatureGenerator...
        Fitting TextNgramFeatureGenerator...
            Fitting CountVectorizer for text features: ['E']
            CountVectorizer fit with vocabulary size = 4
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('category', [])     : 2 | ['B', 'D']
        ('datetime', [])     : 1 | ['C']
        ('float', [])        : 1 | ['A']
        ('object', ['text']) : 1 | ['E']
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])                    : 2 | ['B', 'D']
        ('category', ['text_as_category'])  : 1 | ['E']
        ('float', [])                       : 1 | ['A']
        ('int', ['binned', 'text_special']) : 2 | ['E.char_count', 'E.symbol_ratio. ']
        ('int', ['datetime_as_int'])        : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
        ('int', ['text_ngram'])             : 5 | ['__nlp__.abc', '__nlp__.ef', '__nlp__.ghi', '__nlp__.jkl', '__nlp__._total_']
    0.1s = Fit runtime
    5 features in original data used to generate 16 features in processed data.
    Train Data (Processed) Memory Usage: 0.01 MB (0.0% of available memory)

.. parsed-literal::
    :class: output

    A B D E C C.year C.month C.day C.dayofweek E.char_count E.symbol_ratio. __nlp__.abc __nlp__.ef __nlp__.ghi __nlp__.jkl __nlp__._total_
    0 -0.545774 1 3 NaN 946684800000000000 2000 1 1 5 1 6 0 2 0 0 1
    1 -0.468674 1 2 NaN 946771200000000000 2000 1 2 6 4 3 0 1 1 1 3
    2 1.767960 1 0 NaN 946598400000000000 1999 12 31 4 3 4 1 0 1 0 2
    3 -0.118771 2 3 NaN 946684800000000000 2000 1 1 5 4 3 0 1 1 1 3
    4 0.630196 1 1 NaN 946598400000000000 1999 12 31 4 7 0 1 0 1 2 3
    .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    95 -1.182318 0 0 NaN 946684800000000000 2000 1 1 5 7 0 0 0 1 3 2
    96 0.562761 1 0 5 946684800000000000 2000 1 1 5 6 1 1 1 0 2 3
    97 -0.797270 1 1 NaN 946684800000000000 2000 1 1 5 5 2 2 0 1 0 2
    98 0.502741 1 3 0 946598400000000000 1999 12 31 4 4 3 1 1 0 1 3
    99 2.056356 1 1 NaN 946512000000000000 1999 12 30 3 6 1 1 1 1 1 4

    100 rows × 16 columns

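
In this example the datetime column 'C' was expanded into ``C``, ``C.year``, ``C.month``, ``C.day`` and ``C.dayofweek``. As noted in the Datetime Columns section, the set of extracted sub-features is controlled by the ``DatetimeFeatureGenerator``. A minimal sketch of overriding it inside a custom pipeline, assuming the generator accepts a ``features`` list (an assumption; check the `DatetimeFeatureGenerator <../../api/autogluon.features.html#datetimefeaturegenerator>`__ API reference for the exact parameter name):

.. code:: python

    from autogluon.features.generators import PipelineFeatureGenerator, DatetimeFeatureGenerator, IdentityFeatureGenerator
    from autogluon.common.features.types import R_INT, R_FLOAT

    # Extract only year and month (plus the raw nanosecond value) from datetime
    # columns, instead of the default [year, month, day, dayofweek].
    # 'features' is an assumed parameter name - confirm against the API reference.
    datetime_pipeline = PipelineFeatureGenerator(
        generators=[[
            DatetimeFeatureGenerator(features=['year', 'month']),
            IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
        ]]
    )
    datetime_pipeline.fit_transform(X=dfx)

As with the customization example later in this document, any columns that no generator in the list handles (here the categorical and text columns) are reported as unused and dropped.
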

Missing Value Handling
~~~~~~~~~~~~~~~~~~~~~~

To illustrate missing value handling, let's set the first row to all NaNs:

.. code:: python

    dfx.iloc[0] = np.nan
    dfx.head()

.. parsed-literal::
    :class: output

               A    B           C    D                E
    0        NaN  NaN         NaT  NaN              NaN
    1  -0.468674    0  2000-01-02    x     ghi jkl d ef
    2   1.767960    0  1999-12-31    v      d abc d ghi
    3  -0.118771    1  2000-01-01    y     ef ghi d jkl
    4   0.630196    0  1999-12-31    w  abc jkl jkl ghi

Now if we reprocess:

.. code:: python

    auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()
    auto_ml_pipeline_feature_generator.fit_transform(X=dfx)

.. parsed-literal::
    :class: output

    Fitting AutoMLPipelineFeatureGenerator...
    Available Memory: 31476.4 MB
    Train Data (Original) Memory Usage: 0.01 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting IdentityFeatureGenerator...
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
        Fitting DatetimeFeatureGenerator...
        Fitting TextSpecialFeatureGenerator...
            Fitting BinnedFeatureGenerator...
            Fitting DropDuplicatesFeatureGenerator...
        Fitting TextNgramFeatureGenerator...
            Fitting CountVectorizer for text features: ['E']
            CountVectorizer fit with vocabulary size = 4
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Types of features in original data (raw dtype, special dtypes):
        ('category', [])     : 2 | ['B', 'D']
        ('datetime', [])     : 1 | ['C']
        ('float', [])        : 1 | ['A']
        ('object', ['text']) : 1 | ['E']
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])                    : 2 | ['B', 'D']
        ('category', ['text_as_category'])  : 1 | ['E']
        ('float', [])                       : 1 | ['A']
        ('int', ['binned', 'text_special']) : 3 | ['E.char_count', 'E.word_count', 'E.symbol_ratio. ']
        ('int', ['datetime_as_int'])        : 5 | ['C', 'C.year', 'C.month', 'C.day', 'C.dayofweek']
        ('int', ['text_ngram'])             : 5 | ['__nlp__.abc', '__nlp__.ef', '__nlp__.ghi', '__nlp__.jkl', '__nlp__._total_']
    0.1s = Fit runtime
    5 features in original data used to generate 17 features in processed data.
    Train Data (Processed) Memory Usage: 0.01 MB (0.0% of available memory)

.. parsed-literal::
    :class: output

    A B D E C C.year C.month C.day C.dayofweek E.char_count E.word_count E.symbol_ratio. __nlp__.abc __nlp__.ef __nlp__.ghi __nlp__.jkl __nlp__._total_
    0 NaN NaN NaN NaN 946687418181818240 2000 1 1 5 0 0 0 0 0 0 0 0
    1 -0.468674 1 2 NaN 946771200000000000 2000 1 2 6 5 1 4 0 1 1 1 3
    2 1.767960 1 0 NaN 946598400000000000 1999 12 31 4 4 1 5 1 0 1 0 2
    3 -0.118771 2 3 NaN 946684800000000000 2000 1 1 5 5 1 4 0 1 1 1 3
    4 0.630196 1 1 NaN 946598400000000000 1999 12 31 4 8 1 1 1 0 1 2 3
    .. ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
    95 -1.182318 0 0 NaN 946684800000000000 2000 1 1 5 8 1 1 0 0 1 3 2
    96 0.562761 1 0 5 946684800000000000 2000 1 1 5 7 1 2 1 1 0 2 3
    97 -0.797270 1 1 NaN 946684800000000000 2000 1 1 5 6 1 3 2 0 1 0 2
    98 0.502741 1 3 0 946598400000000000 1999 12 31 4 5 1 4 1 1 0 1 3
    99 2.056356 1 1 NaN 946512000000000000 1999 12 30 3 7 1 2 1 1 1 1 4

    100 rows × 17 columns

We see that the floating point, integer, categorical and text fields 'A', 'B', 'D', and 'E' have retained the NaNs, but the datetime column 'C' has been set to the mean of the non-NaN values.

Customization of Feature Engineering
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To customize your feature generation pipeline, it is recommended to call `PipelineFeatureGenerator <../../api/autogluon.features.html#autogluon.features.generators.PipelineFeatureGenerator>`__, passing in non-default parameters to other feature generators as required. For example, if we think downstream models would benefit from removing rare categorical values and replacing them with NaN, we can supply the parameter ``maximum_num_cat`` to ``CategoryFeatureGenerator``, as below:

.. code:: python

    from autogluon.features.generators import PipelineFeatureGenerator, CategoryFeatureGenerator, IdentityFeatureGenerator
    from autogluon.common.features.types import R_INT, R_FLOAT

    mypipeline = PipelineFeatureGenerator(
        generators=[[
            CategoryFeatureGenerator(maximum_num_cat=10),  # Overridden from default.
            IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
        ]]
    )

If we then dump out the transformed data, we can see that all columns have been converted to numeric, because that's what most models require, and the rare categorical values have been replaced with NaN:

.. code:: python

    mypipeline.fit_transform(X=dfx)

.. parsed-literal::
    :class: output

    Fitting PipelineFeatureGenerator...
    Available Memory: 31476.4 MB
    Train Data (Original) Memory Usage: 0.01 MB (0.0% of available memory)
    Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
    Stage 1 Generators:
        Fitting AsTypeFeatureGenerator...
    Stage 2 Generators:
        Fitting FillNaFeatureGenerator...
    Stage 3 Generators:
        Fitting CategoryFeatureGenerator...
            Fitting CategoryMemoryMinimizeFeatureGenerator...
        Fitting IdentityFeatureGenerator...
    Stage 4 Generators:
        Fitting DropUniqueFeatureGenerator...
    Unused Original Features (Count: 1): ['C']
        These features were not used to generate any of the output features. Add a feature generator compatible with these features to utilize them.
        Features can also be unused if they carry very little information, such as being categorical but having almost entirely unique values or being duplicates of other features.
        These features do not need to be present at inference time.
        ('datetime', []) : 1 | ['C']
    Types of features in original data (raw dtype, special dtypes):
        ('category', [])     : 2 | ['B', 'D']
        ('float', [])        : 1 | ['A']
        ('object', ['text']) : 1 | ['E']
    Types of features in processed data (raw dtype, special dtypes):
        ('category', [])                   : 2 | ['B', 'D']
        ('category', ['text_as_category']) : 1 | ['E']
        ('float', [])                      : 1 | ['A']
    0.0s = Fit runtime
    4 features in original data used to generate 4 features in processed data.
    Train Data (Processed) Memory Usage: 0.0 MB (0.0% of available memory)

.. parsed-literal::
    :class: output

          B    D    E          A
    0   NaN  NaN  NaN        NaN
    1     1    2  NaN  -0.468674
    2     1    0  NaN   1.767960
    3     2    3  NaN  -0.118771
    4     1    1  NaN   0.630196
    ..  ...  ...  ...        ...
    95    0    0  NaN  -1.182318
    96    1    0    5   0.562761
    97    1    1  NaN  -0.797270
    98    1    3    0   0.502741
    99    1    1  NaN   2.056356

    100 rows × 4 columns

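
The same approach works for the text processing stages. For instance, the n-gram stage described earlier can be given a custom scikit-learn ``CountVectorizer``. A minimal sketch, assuming ``TextNgramFeatureGenerator`` accepts a ``vectorizer`` argument (see the `TextNgramFeatureGenerator <../../api/autogluon.features.html#textngramfeaturegenerator>`__ API reference to confirm):

.. code:: python

    from sklearn.feature_extraction.text import CountVectorizer
    from autogluon.features.generators import PipelineFeatureGenerator, TextNgramFeatureGenerator, IdentityFeatureGenerator
    from autogluon.common.features.types import R_INT, R_FLOAT

    # Use word unigrams and bigrams, capped at 50 vocabulary entries.
    # 'vectorizer' is assumed here - confirm against the API reference.
    ngram_pipeline = PipelineFeatureGenerator(
        generators=[[
            TextNgramFeatureGenerator(vectorizer=CountVectorizer(ngram_range=(1, 2), max_features=50)),
            IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),
        ]]
    )
    ngram_pipeline.fit_transform(X=dfx)

Only the columns handled by the listed generators (the text column 'E' and the numeric column 'A' here) appear in the output; the rest are dropped, as in the previous example.
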
For more on custom feature engineering, see the detailed example script ``examples/tabular/example_custom_feature_generator.py``.
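
That script illustrates the general pattern: a custom generator subclasses ``AbstractFeatureGenerator`` and implements its fit/transform hooks. A minimal sketch of that pattern is below; the method contract shown here is inferred from the built-in generators and may differ between AutoGluon versions, so treat the linked example as authoritative:

.. code:: python

    from autogluon.features.generators import AbstractFeatureGenerator
    from autogluon.common.features.types import R_INT, R_FLOAT

    class PlusOneFeatureGenerator(AbstractFeatureGenerator):
        """Toy generator that adds 1 to every numeric value (illustration only)."""

        def _fit_transform(self, X, **kwargs):
            # Called once on the training data; returns the transformed frame
            # plus special-type metadata for the generated features (here we
            # pass the input special types through unchanged).
            X_out = self._transform(X)
            return X_out, self.feature_metadata_in.type_group_map_special

        def _transform(self, X):
            # Called on both training and inference data.
            return X + 1

        @staticmethod
        def get_default_infer_features_in_args() -> dict:
            # Restrict this generator to int and float input columns.
            return dict(valid_raw_types=[R_INT, R_FLOAT])

Such a generator can then be placed in a ``PipelineFeatureGenerator`` list alongside the built-in ones, exactly as in the customization example above.
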