Feature Interaction Charting

This tool quickly visualizes interactions between variables in a dataset. You can specify which variables to plot via the x, y and hue (color) parameters. The tool automatically picks the chart type based on the detected variable types and renders 1-, 2- or 3-way interactions.

This can be useful for exploring patterns, trends, and outliers, and for identifying potentially good predictors for the task.

Using Interaction Charts for Missing Values Filling

Let’s load the Titanic dataset:

import pandas as pd

df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
target_col = 'Survived'

Next we will look at missing data in the variables:

import autogluon.eda.auto as auto

auto.missing_values_analysis(train_data=df_train)

Missing Values Analysis

missing_count missing_ratio
Age 177 0.198653
Cabin 687 0.771044
Embarked 2 0.002245
../../_images/07a0159c7dfeaf7a8104387601c17967a4f8a3508c8e36bf567dc2463ced398a.png
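
The same summary can be cross-checked with plain pandas. The following is a minimal sketch on a small made-up frame; the `toy` data and the `missing_summary` helper are illustrative assumptions, not part of AutoGluon. On the real data you would pass `df_train` instead.

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return missing_count and missing_ratio for columns that have missing values."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_ratio': counts / len(df),
    })
    # Keep only the columns that actually contain missing values
    return summary[summary['missing_count'] > 0]

# Toy data for illustration only (not the Titanic dataset)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Embarked': ['S', 'C', None, 'Q'],
    'Fare': [7.25, 71.28, 8.05, 53.1],
})
summary = missing_summary(toy)
print(summary)
```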

It looks like there are only two null values in the Embarked feature. Let’s see what those two null values are:

df_train[df_train.Embarked.isna()]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
61 62 1 1 Icard, Miss. Amelie female 38.0 0 0 113572 80.0 B28 NaN
829 830 1 1 Stone, Mrs. George Nelson (Martha Evelyn) female 62.0 0 0 113572 80.0 B28 NaN

We may be able to fill these by looking at the other variables. Both passengers paid a Fare of $80, travelled in Pclass 1 and are female. Let’s see how Fare is distributed across the Pclass and Embarked values:

auto.analyze_interaction(train_data=df_train, x='Embarked', y='Fare', hue='Pclass')
../../_images/01da3cee2cf0671d3bf1ad889086768d3ae17d887c29668587828c434d654e46.png

The average Fare closest to $80 is found among passengers with Embarked value C and Pclass 1. Let’s fill in the missing values with C.
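
One way to apply that fill, as a minimal sketch: the `toy` frame below is made-up data standing in for `df_train`, and `fillna` only touches rows where `Embarked` is missing.

```python
import pandas as pd

# Toy stand-in for df_train (illustration only)
toy = pd.DataFrame({
    'Pclass': [1, 3, 1],
    'Fare': [80.0, 7.25, 80.0],
    'Embarked': [None, 'S', None],
})

# Replace missing Embarked values with 'C'; existing values are left unchanged
toy['Embarked'] = toy['Embarked'].fillna('C')
print(toy)
```

On the tutorial data the equivalent call would be `df_train['Embarked'] = df_train['Embarked'].fillna('C')`.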

Using Interaction Charts To Learn Information About the Data

auto.analyze_interaction(x='Pclass', y='Survived', train_data=df_train, test_data=df_test)
../../_images/dfab5b2627cd756049fd5e4d79ce1ce32152eaf3b41bf72b8b00abbbf3a658f6.png
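
The per-class survival rates that a chart like this displays can also be computed numerically. A minimal sketch, using made-up rows rather than the Titanic data: because `Survived` is coded 0/1, its mean within each `Pclass` group is the share of survivors in that class.

```python
import pandas as pd

# Toy data for illustration only
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 column = fraction of ones in each group
survival_rate = toy.groupby('Pclass')['Survived'].mean()
print(survival_rate)
```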

It looks like 63% of first class passengers survived, compared with 48% of second class and only 24% of third class passengers. Similar information is visible via the Fare variable:

auto.analyze_interaction(x='Fare', hue='Survived', train_data=df_train, test_data=df_test, chart_args=dict(fill=True))
../../_images/583b8adfba690410f93a57d22a3608d97a203b368e0898ed7bf2559b7add38a2.png
auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, test_data=df_test)
../../_images/bea7a13f61efafe9651fc2707723dad882c6d268add6c4d62d173c7621628482.png

The far-left part of the distribution on this chart hints that children and infants may have been given priority.

auto.analyze_interaction(x='Fare', y='Age', hue='Survived', train_data=df_train, test_data=df_test)
../../_images/010a64c0ef739322b62b502aa8f626d1586fee1152351ff9f653dd6e1c7fc612.png

This chart highlights three outliers with a Fare of over $500. Let’s take a look at these:

df_train[df_train.Fare > 400]
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
258 259 1 1 Ward, Miss. Anna female 35.0 0 0 PC 17755 512.3292 NaN C
679 680 1 1 Cardeza, Mr. Thomas Drake Martinez male 36.0 0 1 PC 17755 512.3292 B51 B53 B55 C
737 738 1 1 Lesurer, Mr. Gustave J male 35.0 0 0 PC 17755 512.3292 B101 C

As you can see, all three passengers share the same ticket (PC 17755), so the recorded Fare is most likely the price of the whole ticket rather than a per-person amount. The per-person fare would then be about a third of this value. This suggests adding a new feature, fare per person; counting passengers per ticket also lets us see whether some passengers travelled in larger groups. Let’s create two new features and take a look at the Fare-Age relationship once again.

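# Count passengers per ticket; size() counts rows even when other columns
# (e.g. Embarked) are missing, whereas count() on a column would skip NaN values
ticket_to_count = df_train.groupby(by='Ticket').size().to_dict()
data = df_train.copy()
data['GroupSize'] = data.Ticket.map(ticket_to_count)
# Split the ticket fare evenly across the passengers sharing the ticket
data['FarePerPerson'] = data.Fare / data.GroupSize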

auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Survived', train_data=data)
auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Pclass', train_data=data)
../../_images/f083596b56572c615c7b3c9f26a43c48b1c7fad0c446f63e3da18c755b81ea16.png ../../_images/1befc4e9be45164c79367e830e7e346947d3da823bfb61604305dd223ac88789.png

You can see cleaner separation between Fare, Pclass and Survived now.