Feature Interaction Charting
============================
This tool provides a quick way to visualize interactions between variables
in a dataset. You can specify the variables to plot via the ``x``, ``y``
and ``hue`` (color) parameters. The tool picks the chart type
automatically based on the detected variable types and renders one-, two-
or three-way interactions.

This feature is useful for exploring patterns, trends and outliers, and
can help identify good predictors for the task.

Using Interaction Charts for Missing Values Filling
---------------------------------------------------
Let’s load the Titanic dataset:

.. code:: python

    import pandas as pd

    df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
    df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
    target_col = 'Survived'

Next we will look at missing data in the variables:
.. code:: python

    import autogluon.eda.auto as auto

    auto.missing_values_analysis(train_data=df_train)

Missing Values Analysis
~~~~~~~~~~~~~~~~~~~~~~~
.. csv-table::
   :header: "", "missing_count", "missing_ratio"

   Age, 177, 0.198653
   Cabin, 687, 0.771044
   Embarked, 2, 0.002245

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_3_2.png
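The counts and ratios in the table above can be approximated with plain
pandas. The sketch below runs on a small made-up frame rather than the
Titanic data, and ``missing_summary`` is a hypothetical helper name used
for illustration only; it is not part of AutoGluon.

.. code:: python

    import numpy as np
    import pandas as pd


    def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
        """Return missing_count and missing_ratio for columns that contain nulls."""
        counts = df.isna().sum()
        summary = pd.DataFrame({
            'missing_count': counts,
            'missing_ratio': counts / len(df),
        })
        # Keep only the columns that actually have missing values
        return summary[summary.missing_count > 0]


    # Toy data for illustration only; the values are made up
    toy = pd.DataFrame({
        'Age': [22.0, np.nan, 26.0, np.nan],
        'Fare': [7.25, 71.28, 7.92, 53.1],
        'Embarked': ['S', 'C', None, 'S'],
    })
    summary = missing_summary(toy)
    print(summary)

Columns without missing values, such as ``Fare`` here, are dropped from
the summary.
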
It looks like there are only two null values in the ``Embarked``
feature. Let’s see what those two null values are:
.. code:: python

    df_train[df_train.Embarked.isna()]

.. csv-table::
   :header: "", "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"

   61, 62, 1, 1, "Icard, Miss. Amelie", female, 38.0, 0, 0, 113572, 80.0, B28, NaN
   829, 830, 1, 1, "Stone, Mrs. George Nelson (Martha Evelyn)", female, 62.0, 0, 0, 113572, 80.0, B28, NaN

We may be able to fill these by looking at the other variables. Both
passengers paid a ``Fare`` of ``$80``, travelled in ``Pclass`` ``1`` and
are ``female``. Let’s see how ``Fare`` is distributed across the
``Pclass`` and ``Embarked`` values:

.. code:: python

    auto.analyze_interaction(train_data=df_train, x='Embarked', y='Fare', hue='Pclass')

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_7_0.png
The average ``Fare`` closest to ``$80`` belongs to passengers with an
``Embarked`` value of ``C`` and a ``Pclass`` of ``1``. Let’s fill in the
missing values with ``C``.

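The reasoning above can be sketched in pandas: compute the mean ``Fare``
per port among first class passengers, pick the port whose mean is closest
to the fare paid by the rows with a missing port, and fill with that value.
The frame below is made-up toy data, so the numbers differ from the real
dataset; the variable names are chosen for this illustration only.

.. code:: python

    import pandas as pd

    # Toy data for illustration only; the values are made up
    toy = pd.DataFrame({
        'Pclass':   [1, 1, 1, 1, 1, 1, 3, 3],
        'Fare':     [85.0, 90.0, 60.0, 55.0, 80.0, 80.0, 8.0, 7.5],
        'Embarked': ['C', 'C', 'S', 'S', None, None, 'S', 'Q'],
    })

    missing = toy.Embarked.isna()
    # Both rows with a missing port paid the same fare in this toy frame
    observed_fare = toy.loc[missing, 'Fare'].iloc[0]

    # Mean Fare per port among first class passengers with a known port
    first_class = toy[(toy.Pclass == 1) & ~missing]
    mean_fare = first_class.groupby('Embarked')['Fare'].mean()

    # Port whose mean first class Fare is closest to the observed Fare
    best_port = (mean_fare - observed_fare).abs().idxmin()

    filled = toy.copy()
    filled['Embarked'] = filled['Embarked'].fillna(best_port)
    print(best_port, filled.Embarked.isna().sum())

Filling a copy keeps the original frame available for comparison.
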
Using Interaction Charts To Learn Information About the Data
------------------------------------------------------------
.. code:: python

    auto.analyze_interaction(x='Pclass', y='Survived', train_data=df_train, test_data=df_test)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_9_0.png
It looks like ``63%`` of first class passengers survived, compared with
``48%`` of second class and only ``24%`` of third class passengers.
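
Rates like these can also be computed directly: because ``Survived`` is a
0/1 column, its mean within each ``Pclass`` group is the survival rate for
that class. The sketch below uses made-up toy data, so its rates are not
the ones quoted above.

.. code:: python

    import pandas as pd

    # Toy data for illustration only; the values are made up
    toy = pd.DataFrame({
        'Pclass':   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
        'Survived': [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
    })

    # Survived is 0/1, so the group mean is the share of survivors
    survival_rate = toy.groupby('Pclass')['Survived'].mean()
    print(survival_rate)
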
Similar information is visible via the ``Fare`` variable:

.. code:: python

    auto.analyze_interaction(x='Fare', hue='Survived', train_data=df_train, test_data=df_test, chart_args=dict(fill=True))

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_11_0.png
.. code:: python

    auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, test_data=df_test)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_12_0.png
The leftmost part of the distribution on this chart suggests that
children and infants may have been given priority.

.. code:: python

    auto.analyze_interaction(x='Fare', y='Age', hue='Survived', train_data=df_train, test_data=df_test)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_14_0.png
This chart highlights three outliers with a Fare of over ``$500``. Let’s
take a look at these:
.. code:: python

    df_train[df_train.Fare > 400]

.. csv-table::
   :header: "", "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"

   258, 259, 1, 1, "Ward, Miss. Anna", female, 35.0, 0, 0, PC 17755, 512.3292, NaN, C
   679, 680, 1, 1, "Cardeza, Mr. Thomas Drake Martinez", male, 36.0, 0, 1, PC 17755, 512.3292, B51 B53 B55, C
   737, 738, 1, 1, "Lesurer, Mr. Gustave J", male, 35.0, 0, 0, PC 17755, 512.3292, B101, C

As you can see, all three passengers share the same ticket
(``PC 17755``), so the recorded ``Fare`` appears to be the price of the
whole ticket rather than a per-person price. Based on these training rows,
the per-person fare would be about 1/3 of this value. This suggests two
new features: the group size, which shows whether passengers travelled in
larger groups, and the fare per person. Let’s create both and take another
look at the Fare-Age relationship.

.. code:: python

    # Count passengers per ticket. size() counts every row, whereas
    # count() on 'Embarked' would skip rows where Embarked is null and
    # could produce a group size of 0.
    ticket_to_count = df_train.groupby(by='Ticket').size().to_dict()
    data = df_train.copy()
    data['GroupSize'] = data.Ticket.map(ticket_to_count)
    data['FarePerPerson'] = data.Fare / data.GroupSize

    auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Survived', train_data=data)
    auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Pclass', train_data=data)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_18_0.png
.. figure:: output_eda-auto-analyze-interaction_1bd8e2_18_1.png
You can now see a cleaner separation between ``Fare``, ``Pclass`` and
``Survived``.
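
A ticket may be shared by passengers who are split between the training
and test sets, in which case a group size counted on the training set
alone undercounts. As a sketch, assuming both frames have a ``Ticket``
column, the count can be taken over the concatenated frames instead. The
data below is made up, and ``train_fe`` is a name chosen for this
illustration.

.. code:: python

    import pandas as pd

    # Toy data for illustration only; the values are made up
    train = pd.DataFrame({
        'Ticket': ['A1', 'A1', 'B2', 'C3'],
        'Fare':   [30.0, 30.0, 10.0, 40.0],
    })
    test = pd.DataFrame({
        'Ticket': ['A1', 'C3'],
        'Fare':   [30.0, 40.0],
    })

    # Count passengers per ticket across both sets
    group_size = pd.concat([train, test])['Ticket'].value_counts()

    train_fe = train.copy()
    train_fe['GroupSize'] = train_fe.Ticket.map(group_size)
    train_fe['FarePerPerson'] = train_fe.Fare / train_fe.GroupSize
    print(train_fe)

Only the ``Ticket`` column of the test set is used here, so no target
information leaks into the new features.
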