Feature Interaction Charting
============================

This tool provides quick visualization of interactions between variables in a dataset. The user specifies
the variables to plot via the ``x``, ``y`` and ``hue`` (color) parameters. The tool automatically picks the
chart type based on the detected variable types and renders 1-, 2- or 3-way interactions.

This feature can be useful for exploring patterns, trends and outliers, and for identifying potentially good
predictors for the task.

Using Interaction Charts for Missing Values Filling
---------------------------------------------------

Let’s load the Titanic dataset:

.. code:: python

   import pandas as pd

   df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
   df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
   target_col = 'Survived'

Next we will look at missing data in the variables:

.. code:: python

   import autogluon.eda.auto as auto

   auto.missing_values_analysis(train_data=df_train)

Missing Values Analysis
~~~~~~~~~~~~~~~~~~~~~~~

.. csv-table::
   :header: "", "missing_count", "missing_ratio"

   Age, 177, 0.198653
   Cabin, 687, 0.771044
   Embarked, 2, 0.002245
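
The counts and ratios in the table above can also be reproduced with plain pandas. The following is a minimal
sketch, not the implementation used by ``auto.missing_values_analysis``; ``missing_summary`` is a hypothetical
helper and the small frame is made up so that the example runs without downloading anything.

.. code:: python

   import pandas as pd

   def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
       """Return missing-value count and ratio for columns that have gaps."""
       missing_count = df.isna().sum()
       summary = pd.DataFrame({
           'missing_count': missing_count,
           'missing_ratio': missing_count / len(df),
       })
       # Keep only the columns with at least one missing value
       return summary[summary.missing_count > 0]

   # Made-up data for illustration; not taken from the Titanic dataset
   toy = pd.DataFrame({
       'Age': [22.0, None, 26.0, None],
       'Cabin': [None, 'C85', None, None],
       'Fare': [7.25, 71.28, 7.92, 8.05],
   })
   summary = missing_summary(toy)
   print(summary)

Calling ``missing_summary(df_train)`` on the real data should give figures comparable to the table above.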

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_3_2.png

There are only two null values in the ``Embarked`` feature. Let’s look at those two rows:

.. code:: python

   df_train[df_train.Embarked.isna()]

.. csv-table::
   :header: "", "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"

   61, 62, 1, 1, "Icard, Miss. Amelie", female, 38.0, 0, 0, 113572, 80.0, B28, NaN
   829, 830, 1, 1, "Stone, Mrs. George Nelson (Martha Evelyn)", female, 62.0, 0, 0, 113572, 80.0, B28, NaN
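
Rows like the two above can be given a fill value by comparing their ``Fare`` with the mean ``Fare`` of each
``Embarked`` group within the same ``Pclass``. The following is a minimal pandas sketch of that idea, offered as
a numeric complement to a chart rather than as part of the AutoGluon API. ``closest_port`` is a hypothetical
helper and the frame holds made-up values.

.. code:: python

   import pandas as pd

   # Made-up data for illustration; the last row is missing Embarked
   toy = pd.DataFrame({
       'Embarked': ['C', 'C', 'S', 'S', 'Q', None],
       'Pclass':   [1, 1, 1, 1, 3, 1],
       'Fare':     [75.0, 85.0, 50.0, 60.0, 8.0, 80.0],
   })

   # Mean Fare per (Pclass, Embarked); groupby skips rows whose key is missing
   mean_fare = toy.groupby(['Pclass', 'Embarked'])['Fare'].mean()

   def closest_port(pclass, fare):
       """Pick the port whose mean Fare in this class is nearest to the given fare."""
       candidates = mean_fare.loc[pclass]
       return (candidates - fare).abs().idxmin()

   missing = toy.Embarked.isna()
   filled = toy.copy()
   filled.loc[missing, 'Embarked'] = [
       closest_port(p, f)
       for p, f in zip(toy.loc[missing, 'Pclass'], toy.loc[missing, 'Fare'])
   ]
   print(filled)

A nearest-mean rule is only a heuristic: it ignores the spread within each group, so it is worth confirming
the choice visually before committing to it.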

We may be able to fill these by looking at other independent variables. Both passengers paid a ``Fare`` of
``$80``, are of ``Pclass`` ``1`` and ``female`` ``Sex``. Let’s see how ``Fare`` is distributed across the
``Pclass`` and ``Embarked`` feature values:

.. code:: python

   auto.analyze_interaction(train_data=df_train, x='Embarked', y='Fare', hue='Pclass')

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_7_0.png

The average ``Fare`` closest to ``$80`` is found for ``Embarked`` value ``C`` where ``Pclass`` is ``1``.
Let’s fill in the missing values as ``C``.

Using Interaction Charts To Learn Information About the Data
------------------------------------------------------------

.. code:: python

   auto.analyze_interaction(x='Pclass', y='Survived', train_data=df_train, test_data=df_test)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_9_0.png

It looks like ``63%`` of first class passengers survived, compared with ``48%`` of second class and only
``24%`` of third class passengers. A similar pattern is visible in the ``Fare`` variable:

.. code:: python

   auto.analyze_interaction(x='Fare', hue='Survived', train_data=df_train, test_data=df_test, chart_args=dict(fill=True))

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_11_0.png

.. code:: python

   auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, test_data=df_test)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_12_0.png

The far left of the distribution on this chart suggests that children and infants may have been given priority.

.. code:: python

   auto.analyze_interaction(x='Fare', y='Age', hue='Survived', train_data=df_train, test_data=df_test)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_14_0.png

This chart highlights three outliers with a ``Fare`` of over ``$500``. Let’s take a look at these:

.. code:: python

   df_train[df_train.Fare > 400]

.. csv-table::
   :header: "", "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "Parch", "Ticket", "Fare", "Cabin", "Embarked"

   258, 259, 1, 1, "Ward, Miss. Anna", female, 35.0, 0, PC 17755, 512.3292, NaN, C
   679, 680, 1, 1, "Cardeza, Mr. Thomas Drake Martinez", male, 36.0, 1, PC 17755, 512.3292, B51 B53 B55, C
   737, 738, 1, 1, "Lesurer, Mr. Gustave J", male, 35.0, 0, PC 17755, 512.3292, B101, C

As you can see, all three passengers share the same ticket, so ``Fare`` appears to be the price of the ticket
rather than the amount paid per person; within the training data the per-person fare would be one third of
this value. This suggests adding a fare-per-person feature to the dataset. Counting the passengers on each
ticket also shows whether some passengers travelled in larger groups. Let’s create these two new features and
take another look at the Fare-Age relationship.

.. code:: python

   # Count rows per ticket; size() also counts rows where other columns (e.g. Embarked) are missing,
   # which avoids a GroupSize of 0 and a division by zero below
   ticket_to_count = df_train.groupby(by='Ticket').size().to_dict()
   data = df_train.copy()
   data['GroupSize'] = data.Ticket.map(ticket_to_count)
   data['FarePerPerson'] = data.Fare / data.GroupSize

   auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Survived', train_data=data)
   auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Pclass', train_data=data)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_18_0.png

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_18_1.png

You can now see a cleaner separation between ``Fare``, ``Pclass`` and ``Survived``.
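
One detail worth checking when deriving a group size per ticket: ``count()`` on a column skips missing values,
whereas ``size()`` counts every row. The sketch below uses a small made-up frame (not the Titanic data) to show
the difference and to compute a per-person fare with ``transform('size')``, an alternative to building a
dictionary and mapping it.

.. code:: python

   import pandas as pd

   # Made-up data for illustration; both rows of ticket B2 lack Embarked
   toy = pd.DataFrame({
       'Ticket':   ['A1', 'A1', 'B2', 'B2', 'C3'],
       'Fare':     [30.0, 30.0, 80.0, 80.0, 10.0],
       'Embarked': ['S', 'S', None, None, 'Q'],
   })

   by_ticket = toy.groupby('Ticket')
   non_null_counts = by_ticket['Embarked'].count()  # ignores missing Embarked values
   row_counts = by_ticket.size()                    # counts every row in the group

   # transform('size') broadcasts the group size back onto each row
   toy['GroupSize'] = by_ticket['Fare'].transform('size')
   toy['FarePerPerson'] = toy.Fare / toy.GroupSize
   print(toy)

With ``count()`` on ``Embarked``, ticket ``B2`` would get a group size of zero and an infinite per-person fare,
which is why counting rows is the safer choice here.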