Feature Interaction Charting#
This tool quickly visualizes interactions between variables in a dataset. You specify the variables to plot via the x, y and hue (color) parameters. The tool picks the chart type automatically based on the detected variable types and can render 1-, 2- and 3-way interactions.
This can be useful for exploring patterns, trends, and outliers, and for identifying potentially good predictors for the task.
Using Interaction Charts for Missing Values Filling#
Let’s load the Titanic dataset:
import pandas as pd
df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
target_col = 'Survived'
Next we will look at missing data in the variables:
import autogluon.eda.auto as auto
auto.missing_values_analysis(train_data=df_train)
Missing Values Analysis
| | missing_count | missing_ratio |
|---|---|---|
| Age | 177 | 0.198653 |
| Cabin | 687 | 0.771044 |
| Embarked | 2 | 0.002245 |
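For reference, the two columns in this table can be reproduced with plain pandas. The sketch below is not AutoGluon's implementation; it assumes that missing_ratio is missing_count divided by the number of rows, which is consistent with the ratios shown (for example 177 / 891 ≈ 0.1987). The helper name `missing_summary` and the toy frame are hypothetical.

```python
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and ratio of missing values, keeping only columns that have any."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        'missing_count': counts,
        'missing_ratio': counts / len(df),  # assumption: ratio = count / number of rows
    })
    return summary[summary['missing_count'] > 0]

# Toy frame for illustration only; not taken from the Titanic data
toy = pd.DataFrame({'a': [1.0, None, 3.0, None], 'b': ['x', 'y', 'z', 'w']})
summary = missing_summary(toy)
```

Calling `missing_summary(df_train)` on the frame loaded earlier should give a table of the same shape as the one above.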
It looks like there are only two null values in the Embarked feature. Let’s see what those two null values are:
df_train[df_train.Embarked.isna()]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 61 | 62 | 1 | 1 | Icard, Miss. Amelie | female | 38.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
| 829 | 830 | 1 | 1 | Stone, Mrs. George Nelson (Martha Evelyn) | female | 62.0 | 0 | 0 | 113572 | 80.0 | B28 | NaN |
We may be able to fill these in by looking at the other variables. Both passengers are female, travelled in Pclass 1 and paid a Fare of $80. Let’s see how Fare is distributed across the Pclass and Embarked values:
auto.analyze_interaction(train_data=df_train, x='Embarked', y='Fare', hue='Pclass')
The average Fare closest to $80 is found among passengers with Embarked value C and Pclass 1. Let’s fill in the missing values as C.
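The same comparison and the fill itself can be done numerically. This is a minimal sketch in plain pandas, using a small made-up frame in place of df_train; the helper names `mean_fare_table` and `fill_embarked` are hypothetical and not part of AutoGluon.

```python
import pandas as pd

def mean_fare_table(df: pd.DataFrame) -> pd.DataFrame:
    """Mean Fare for each (Embarked, Pclass) pair, with Pclass as columns."""
    return df.groupby(['Embarked', 'Pclass'])['Fare'].mean().unstack('Pclass')

def fill_embarked(df: pd.DataFrame, value: str = 'C') -> pd.DataFrame:
    """Return a copy with missing Embarked values replaced by `value`."""
    out = df.copy()
    out['Embarked'] = out['Embarked'].fillna(value)
    return out

# Toy rows for illustration only; not taken from the Titanic data
toy = pd.DataFrame({
    'Embarked': ['C', 'C', 'S', 'S', None],
    'Pclass':   [1, 1, 1, 3, 1],
    'Fare':     [90.0, 70.0, 50.0, 8.0, 80.0],
})
table = mean_fare_table(toy)   # rows with missing Embarked are excluded from the groups
filled = fill_embarked(toy)    # the original frame is left untouched
```

Working on a copy keeps the raw data available in case a different fill value is preferred later.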
Using Interaction Charts To Learn Information About the Data#
auto.analyze_interaction(x='Pclass', y='Survived', train_data=df_train, test_data=df_test)
It looks like 63% of first class passengers survived, compared with 48% of second class and only 24% of third class passengers. Similar information is visible via the Fare variable:
auto.analyze_interaction(x='Fare', hue='Survived', train_data=df_train, test_data=df_test, chart_args=dict(fill=True))
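The per-class survival rates quoted above can also be computed directly, since the mean of a 0/1 target within a group is the share of survivors in that group. This is a minimal sketch on made-up rows; the helper name `survival_rate_by` is hypothetical.

```python
import pandas as pd

def survival_rate_by(df: pd.DataFrame, by: str, target: str = 'Survived') -> pd.Series:
    """Mean of a 0/1 target per group, i.e. the share of positives in each group."""
    return df.groupby(by)[target].mean()

# Toy rows for illustration only; not taken from the Titanic data
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 0, 0, 0, 1],
})
rates = survival_rate_by(toy, 'Pclass')
```

On the real data, `survival_rate_by(df_train, 'Pclass')` would give the numbers behind the bar chart.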
auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, test_data=df_test)
The far-left part of the distribution on this chart possibly hints that children and infants were given priority.
auto.analyze_interaction(x='Fare', y='Age', hue='Survived', train_data=df_train, test_data=df_test)
This chart highlights three outliers with a Fare of over $500. Let’s take a look at these:
df_train[df_train.Fare > 400]
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 258 | 259 | 1 | 1 | Ward, Miss. Anna | female | 35.0 | 0 | 0 | PC 17755 | 512.3292 | NaN | C |
| 679 | 680 | 1 | 1 | Cardeza, Mr. Thomas Drake Martinez | male | 36.0 | 0 | 1 | PC 17755 | 512.3292 | B51 B53 B55 | C |
| 737 | 738 | 1 | 1 | Lesurer, Mr. Gustave J | male | 35.0 | 0 | 0 | PC 17755 | 512.3292 | B101 | C |
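As you can see, all three passengers listed share the same ticket (PC 17755), so the Fare appears to be the price of the whole ticket rather than a per-person price. The per-person fare would be only a fraction of this value (one third if only these three travelled on it). This suggests adding a fare-per-person feature to the dataset. The ticket group size also lets us see whether some passengers travelled in larger groups. Let’s create these two new features and take another look at the Fare-Age relationship.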
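# Passengers per ticket; size() counts rows, whereas count() on 'Embarked' would skip its nulls
ticket_to_count = df_train.groupby(by='Ticket').size().to_dict()
data = df_train.copy()
data['GroupSize'] = data.Ticket.map(ticket_to_count)
# Split the ticket fare evenly across the passengers sharing it
data['FarePerPerson'] = data.Fare / data.GroupSize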
auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Survived', train_data=data)
auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Pclass', train_data=data)
You can see cleaner separation between Fare, Pclass and Survived now.
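One detail is worth checking when deriving a group size from Ticket. Calling `count()` on a selected column counts only the non-null values in that column, whereas `size()` counts rows. The two passengers with a missing Embarked value share ticket 113572, so a count based on Embarked would give them a group size of 0 and an infinite per-person fare. A minimal sketch with made-up rows shows the difference:

```python
import pandas as pd

# Toy rows for illustration only; ticket 'B' mimics a group whose Embarked is missing
toy = pd.DataFrame({
    'Ticket':   ['A', 'A', 'B', 'B'],
    'Embarked': ['S', 'S', None, None],
    'Fare':     [20.0, 20.0, 80.0, 80.0],
})

non_null_counts = toy.groupby('Ticket')['Embarked'].count()  # skips missing Embarked values
row_counts = toy.groupby('Ticket').size()                    # counts every row in the group

group_size = toy['Ticket'].map(row_counts)
fare_per_person = toy['Fare'] / group_size
```

Here `non_null_counts` reports 0 for ticket B while `row_counts` reports 2, which is the value a per-person fare needs.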