{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "id": "e4024454",
   "metadata": {},
   "source": [
    "# AutoGluon Tabular - Feature Engineering\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/tabular-feature-engineering.ipynb)\n",
    "[![Open In SageMaker Studio Lab](https://studiolab.sagemaker.aws/studiolab.svg)](https://studiolab.sagemaker.aws/import/github/autogluon/autogluon/blob/stable/docs/tutorials/tabular/tabular-feature-engineering.ipynb)\n",
    "\n",
    "\n",
    "\n",
    "## Introduction ##\n",
    "\n",
    "Feature engineering involves taking raw tabular data and \n",
    "\n",
    "1. converting it into a format ready for the machine learning model to read\n",
    "2. trying to enhance some columns ('features' in ML jargon) to give the ML models more information, hoping to get more accurate results.\n",
    "\n",
    "AutoGluon does some of this for you.  This document describes how that works, and how you can extend it.  We describe here the default behaviour, much of which is configurable, as well as pointers to how to alter the behaviour from the default.\n",
    "\n",
    "## Column Types ##\n",
    "\n",
    "AutoGluon Tabular recognises the following types of features, and has separate processing for them:\n",
    "\n",
    "| Feature Type        | Example Values           |\n",
    "| :------------------ | :----------------------- |\n",
    "| boolean             | A, B                     |\n",
    "| numerical           | 1.3, 2.0, -1.6           |\n",
    "| categorical         | Red, Blue, Yellow        |\n",
    "| datetime            | 1/31/2021, Mar-31        |\n",
    "| text                | Mary had a little lamb   |\n",
    "\n",
    "In addition, other AutoGluon prediction modules recognise additional feature types, these can also be enabled in AutoGluon Tabular by using the [MultiModal](tabular-multimodal.ipynb) option. \n",
    "\n",
    "| Feature Type        | Example Values           |\n",
    "| :------------------ | :----------------------- |\n",
    "| image               | path/image123.png        |\n",
    "\n",
    "## Column Type Detection ##\n",
    "\n",
    "- Boolean columns are any columns with only 2 unique values.\n",
    "\n",
    "- Any string columns are deemed categorical unless they are text (see below).  Some models perform better if you tell them which columns are categorical and which are continuous. \n",
    "\n",
    "- Numeric columns are passed through without change, except to identify them as `float` or `int`.  Currently, numeric columns are not tested to determine if they are likely to be categorical.  You can force them to be treated as categorical with the Pandas syntax `.astype(\"category\")`, see below.\n",
    "\n",
    "- Text columns are detected by firstly checking that most rows are unique.  If they are, and there are multiple separate words detected in most rows, the row is a text column.  For details see `common/features/infer_types.py` in the source.\n",
    "\n",
    "- Datetime columns are detected by trying to convert them to Pandas datetimes.  Pandas detects a wide range of datetime formats.  If many of the values in a column are successfully converted, they are datetimes.  Currently datetimes that appear to be purely numeric (e.g. 20210530) are not correctly detected.  Any NaN values are set to the column mean.  For details see `common/features/infer_types.py`.\n",
    "\n",
    "\n",
    "## Problem Type Detection ##\n",
    "\n",
    "If the user does not specify whether the problem is a classification problem or a regression problem, the 'label' column is examined to try to guess.  Several things point towards a regression problem : the values are floating point non-integers, and there are a large amount of unique values.  Within classification, both multiclass and binary (n=2 categories) are detected.  For details see `utils/utils.py`.\n",
    "\n",
    "To override the automatic inference, explicitly pass the problem_type (one of 'binary', 'regression', 'multiclass') to `TabularPredictor()`.  For example:"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5e51d285",
   "metadata": {},
   "source": [
    "```\n",
    "predictor = TabularPredictor(label='class', problem_type='multiclass').fit(train_data)\n",
    "```\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7d01b23f",
   "metadata": {},
   "source": [
    "## Automatic Feature Engineering ##\n",
    "\n",
    "## Numerical Columns ##\n",
    "\n",
    "Numeric columns, both integer and floating point, currently have no automated feature engineering.\n",
    "\n",
    "## Categorical Columns ##\n",
    "\n",
    "Since many downstream models require categories to be encoded as integers, each categorical feature is mapped to monotonically increasing integers.\n",
    "\n",
    "## Datetime Columns ##\n",
    "\n",
    "Columns recognised as datetime, are converted into several features:\n",
    "\n",
    "- a numerical Pandas datetime.  Note this has maximum and minimum values specified at [pandas.Timestamp.min](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.min.html) and [pandas.Timestamp.max](https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.min.html) respectively, which may affect extremely dates very far into the future or past.\n",
    "- several extracted columns, the default is `[year, month, day, dayofweek]`.  This is configrable via the [DatetimeFeatureGenerator](../../api/autogluon.features.rst)\n",
    "\n",
    "Note that missing, invalid and out-of-range features generated by the above logic will be converted to the mean value across all valid rows.\n",
    "\n",
    "\n",
    "## Text Columns ##\n",
    "\n",
    "If the [MultiModal](tabular-multimodal.ipynb) option is enabled, then text columns are processed using a full Transformer neural network model with pretrained NLP models.\n",
    "\n",
    "Otherwise, they are processed in two more simple ways:\n",
    "\n",
    "- an n-gram feature generator extracts n-grams (short strings) from the text feature, adding many additional columns, one for each n-gram feature.  These columns are 'n-hot' encoded, containing 1 or more if the original feature contains the n-gram 1 or more times, and 0 otherwise.  By default, all text columns are concatenated before applying this stage, and the n-grams are individual words, not substrings of words.  You can configure this via the [TextNgramFeatureGenerator](../../api/autogluon.features.rst) class. The n-gram generation is done in `generators/text_ngram.py`\n",
    "- Some additional numerical features are calculated, such as word counts, character counts, proportion of uppercase characters, etc.  This is configurable via the [TextSpecialFeatureGenerator](../../api/autogluon.features.rst).  This is done in `generators/text_special.py`\n",
    "\n",
    "## Additional Processing ##\n",
    "\n",
    "- Columns containing only 1 value are dropped before passing to models.\n",
    "- Columns containing duplicates of other columns are removed before passing to models.\n",
    "\n",
    "## Feature Engineering Example ##\n",
    "\n",
    "By default a feature generator called [AutoMLPipelineFeatureGenerator](../../api/autogluon.features.rst) is used.  Let's see this in action.  We'll create a dataframe containing a floating point column, an integer column, a datetime column,  a categorical column.  We'll first take a look at the raw data we created."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa00faab-252f-44c9-b8f7-57131aa8251c",
   "metadata": {
    "tags": [
     "remove-cell"
    ]
   },
   "outputs": [],
   "source": [
    "!pip install autogluon.tabular[all]\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d02d38de",
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.tabular import TabularDataset, TabularPredictor\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import random\n",
    "from sklearn.datasets import make_regression\n",
    "from datetime import datetime\n",
    "\n",
    "x, y = make_regression(n_samples = 100,n_features = 5,n_targets = 1, random_state = 1)\n",
    "dfx = pd.DataFrame(x, columns=['A','B','C','D','E'])\n",
    "dfy = pd.DataFrame(y, columns=['label'])\n",
    "\n",
    "# Create an integer column, a datetime column, a categorical column and a string column to demonstrate how they are processed.\n",
    "dfx['B'] = (dfx['B']).astype(int)\n",
    "dfx['C'] = datetime(2000,1,1) + pd.to_timedelta(dfx['C'].astype(int), unit='D')\n",
    "dfx['D'] = pd.cut(dfx['D'] * 10, [-np.inf,-5,0,5,np.inf],labels=['v','w','x','y'])\n",
    "dfx['E'] = pd.Series(list(' '.join(random.choice([\"abc\", \"d\", \"ef\", \"ghi\", \"jkl\"]) for i in range(4)) for j in range(100)))\n",
    "dataset=TabularDataset(dfx)\n",
    "print(dfx)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e7bd60d",
   "metadata": {},
   "source": [
    "Now let's call the default feature generator AutoMLPipeLineFeatureGenerator with no parameters and see what it does."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "27804af2",
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.features.generators import AutoMLPipelineFeatureGenerator\n",
    "auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()\n",
    "auto_ml_pipeline_feature_generator.fit_transform(X=dfx)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2c38a293",
   "metadata": {},
   "source": [
    "We can see that:\n",
    "\n",
    "- The floating point and integer columns 'A' and 'B' are unchanged.\n",
    "- The datetime column 'C' has been converted to a raw value (in nanoseconds), as well as parsed into additional columns for the year, month, day and dayofweek.\n",
    "- The string categorical column 'D' has been mapped 1:1 to integers - a lot of models only accept numerical input.\n",
    "- The freeform text column has been mapped into some summary features ('char_count' etc) as well as a N-hot matrix saying whether each text contained each word.\n",
    "\n",
    "To get more details, we should call the pipeline as part of `TabularPredictor.fit()`.  We need to combine the `dfx` and `dfy` DataFrames since fit() expects a single dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "79732c1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "df = pd.concat([dfx, dfy], axis=1)\n",
    "predictor = TabularPredictor(label='label')\n",
    "predictor.fit(df, hyperparameters={'GBM' : {}}, feature_generator=auto_ml_pipeline_feature_generator)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2ee6addc",
   "metadata": {},
   "source": [
    "Reading the output, note that:\n",
    "\n",
    "- the string-categorical column 'D', despite being mapped to integers, is still recognised as categorical. \n",
    "- the integer column 'B' has not been identified as categorical, even though it only has a few unique values:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "74e50c54",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(len(set(dfx['B'])))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "22018dfb",
   "metadata": {},
   "source": [
    "To mark it as categorical, we can explicitly mark it as categorical in the original dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2ac7ee1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "dfx[\"B\"] = dfx[\"B\"].astype(\"category\")\n",
    "auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()\n",
    "auto_ml_pipeline_feature_generator.fit_transform(X=dfx)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "59d69e45",
   "metadata": {},
   "source": [
    "## Missing Value Handling ##\n",
    "To illustrate missing value handling, let's set the first row to all NaNs:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "28bc8a4e",
   "metadata": {},
   "outputs": [],
   "source": [
    "dfx.iloc[0] = np.nan\n",
    "dfx.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ac52d679",
   "metadata": {},
   "source": [
    "Now if we reprocess:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "43d2f5dd",
   "metadata": {},
   "outputs": [],
   "source": [
    "auto_ml_pipeline_feature_generator = AutoMLPipelineFeatureGenerator()\n",
    "auto_ml_pipeline_feature_generator.fit_transform(X=dfx)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "02072afe",
   "metadata": {},
   "source": [
    "We see that the floating point, integer, categorical and text fields 'A', 'B', 'D', and 'E' have retained the NaNs, but the datetime column 'C' has been set to the mean of the non-NaN values.\n",
    "\n",
    "\n",
    "## Customization of Feature Engineering ##\n",
    "To customize your feature generation pipeline, it is recommended to call [PipelineFeatureGenerator](../../api/autogluon.features.rst), passing in non-default parameters to other feature generators as required.  For example, if we think downstream models would benefit from removing rare categorical values and replacing with NaN, we can supply the parameter maximum_num_cat to CategoryFeatureGenerator, as below:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ef8a9a08",
   "metadata": {},
   "outputs": [],
   "source": [
    "from autogluon.features.generators import PipelineFeatureGenerator, CategoryFeatureGenerator, IdentityFeatureGenerator\n",
    "from autogluon.common.features.types import R_INT, R_FLOAT\n",
    "mypipeline = PipelineFeatureGenerator(\n",
    "    generators = [[        \n",
    "        CategoryFeatureGenerator(maximum_num_cat=10),  # Overridden from default.\n",
    "        IdentityFeatureGenerator(infer_features_in_args=dict(valid_raw_types=[R_INT, R_FLOAT])),\n",
    "    ]]\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0636a109",
   "metadata": {},
   "source": [
    "If we then dump out the transformed data, we can see that all columns have been converted to numeric, because that's what most models require, and the rare categorical values have been replaced with NaN:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ab04960d",
   "metadata": {},
   "outputs": [],
   "source": [
    "mypipeline.fit_transform(X=dfx)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4b3e9025",
   "metadata": {},
   "source": [
    "For more on custom feature engineering, see the detailed notebook `examples/tabular/example_custom_feature_generator.py`."
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}