Automated Target Variable Analysis
==================================
In this section we explore automated dataset overview functionality.
Automated target variable analysis aims to automatically analyze and
summarize the variable we are trying to predict ( label). The goal of
this analysis is to take a deeper look into target variable structure
and its relationship with other important variables in the dataset.
To simplify outliers and useful patterns discovery. This functionality
introduces components which allow generating descriptive statistics and
visualizing the target distribution and relationships between the target
variable and other variables in the dataset.
Classification Example
----------------------
We will start with getting titanic dataset and performing a quick
one-line overview to get the information.
.. code:: python
import pandas as pd
df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
target_col = 'Survived'
The report consists of multiple parts: statistical information overview
enriched with feature types detection and missing value counts focused
only on the target variable.
Label Insights will highlight dataset features which require attention
(i.e. class imbalance or out-of-domain data in test dataset).
The next component is feature distribution visualization. This is
helpful for choosing data transformations and/or model selection. For
regression tasks, the framework automatically fits multiple
distributions available in scipy. The distributions with the best fit
will be displayed on the chart. Distributions information will be
displayed below the chart.
Next, the report will provide correlation analysis focusing only on
highly-correlated features and visualization of their relationships with
the target.
The last chart is a feature distance. It measures the similarity between
features in a dataset. For example, if two variables are almost
identical, their feature distance will be small. Understanding feature
distance is useful in feature selection, where it can be used to
identify which variables are redundant and should be considered for
removal. To perform the analysis, we need just one line:
.. code:: python
import autogluon.eda.auto as auto
auto.target_analysis(train_data=df_train, label=target_col)
Target variable analysis
------------------------
.. raw:: html
|
count |
mean |
std |
min |
25% |
50% |
75% |
max |
dtypes |
unique |
missing_count |
missing_ratio |
raw_type |
special_types |
Survived |
891 |
0.383838 |
0.486592 |
0.0 |
0.0 |
0.0 |
1.0 |
1.0 |
int64 |
2 |
|
|
int |
|
.. figure:: output_eda-auto-target-analysis_6fd0e1_3_2.png
Target variable correlations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**``train_data`` - ``spearman`` correlation matrix; focus: absolute
correlation for ``Survived`` >= ``0.5``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_3_5.png
**Feature interaction between ``Sex``/``Survived`` in ``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_3_7.png
Regression Example
------------------
In the previous section we tried a classification example. Let’s try a
regression. It has a few differences.
.. code:: python
df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/AmesHousingPriceRegression/train_data.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/AmesHousingPriceRegression/test_data.csv')
target_col = 'SalePrice'
auto.target_analysis(
train_data=df_train, label=target_col,
# Optional; default will try to fit all available distributions
fit_distributions=['laplace_asymmetric', 'johnsonsu', 'exponnorm']
)
Target variable analysis
------------------------
.. raw:: html
|
count |
mean |
std |
min |
25% |
50% |
75% |
max |
dtypes |
unique |
missing_count |
missing_ratio |
raw_type |
special_types |
SalePrice |
2344 |
181794.673635 |
82035.556894 |
12789.0 |
129000.0 |
160500.0 |
214000.0 |
755000.0 |
int64 |
918 |
|
|
int |
|
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_2.png
Distribution fits for target variable
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- `laplace_asymmetric `__
- p-value: 0.784
- Parameters: (kappa: 0.5531863345530886, loc: 127499.99999894513,
scale: 43285.69671350392)
- `johnsonsu `__
- p-value: 0.120
- Parameters: (a: -1.4433009164353976, b: 1.3922853595685476, loc:
97854.76437964055, scale: 52770.348354810485)
- `exponnorm `__
- p-value: 0.063
- Parameters: (K: 2.62854289075631, loc: 107181.37845979724, scale:
28385.801254782047)
Target variable correlations
~~~~~~~~~~~~~~~~~~~~~~~~~~~~
**``train_data`` - ``spearman`` correlation matrix; focus: absolute
correlation for ``SalePrice`` >= ``0.5``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_6.png
**Feature interaction between ``Overall.Qual``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_8.png
**Feature interaction between ``Gr.Liv.Area``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_10.png
**Feature interaction between ``Garage.Cars``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_12.png
**Feature interaction between ``Year.Built``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_14.png
**Feature interaction between ``Garage.Area``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_16.png
**Feature interaction between ``Garage.Yr.Blt``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_18.png
**Feature interaction between ``Full.Bath``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_20.png
**Feature interaction between ``Total.Bsmt.SF``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_22.png
**Feature interaction between ``Year.Remod.Add``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_24.png
**Feature interaction between ``X1st.Flr.SF``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_26.png
**Feature interaction between ``Foundation``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_28.png
**Feature interaction between ``Fireplaces``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_30.png
**Feature interaction between ``TotRms.AbvGrd``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_32.png
**Feature interaction between ``Heating.QC``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_34.png
**Feature interaction between ``Kitchen.Qual``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_36.png
**Feature interaction between ``Exter.Qual``/``SalePrice`` in
``train_data``**
.. figure:: output_eda-auto-target-analysis_6fd0e1_5_38.png