Feature Interaction Charting
============================

This tool provides quick visualization of interactions between variables in a dataset. The user specifies the
variables to plot via the ``x``, ``y`` and ``hue`` (color) parameters. The tool automatically picks the chart type
based on the detected variable types and renders 1-, 2- or 3-way interactions.

This feature can be useful for exploring patterns, trends and outliers, and can potentially help identify good
predictors for the task.

Using Interaction Charts for Missing Values Filling
---------------------------------------------------

Let’s load the Titanic dataset:

.. code:: python

    import pandas as pd

    df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
    df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
    target_col = 'Survived'

Next, we will look at missing data in the variables:

.. code:: python

    import autogluon.eda.auto as auto

    auto.missing_values_analysis(train_data=df_train)

Missing Values Analysis
~~~~~~~~~~~~~~~~~~~~~~~

.. csv-table::
   :header: "", "missing_count", "missing_ratio"

   "Age", 177, 0.198653
   "Cabin", 687, 0.771044
   "Embarked", 2, 0.002245
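The counts and ratios in this table can be cross-checked with plain pandas. The sketch below is not how
``auto.missing_values_analysis`` is implemented; ``missing_summary`` is a hypothetical helper and the frame is toy
data, used only to show the arithmetic (null count per column, divided by the number of rows).

.. code:: python

    import pandas as pd

    def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
        """Hypothetical helper: null count and ratio for columns that have missing values."""
        missing_count = df.isna().sum()
        summary = pd.DataFrame({
            'missing_count': missing_count,
            'missing_ratio': missing_count / len(df),
        })
        # Keep only the columns that actually contain nulls
        return summary[summary.missing_count > 0]

    # Toy frame for illustration only (not the Titanic data)
    toy = pd.DataFrame({
        'Age': [22.0, None, 35.0, None],
        'Embarked': ['S', 'C', None, 'S'],
        'Fare': [7.25, 71.28, 8.05, 53.1],
    })
    summary = missing_summary(toy)
    print(summary)

Calling ``missing_summary(df_train)`` should give comparable columns for the Titanic training data.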
.. figure:: output_eda-auto-analyze-interaction_1bd8e2_3_2.png

There are only two null values in the ``Embarked`` feature. Let’s look at those two rows:

.. code:: python

    df_train[df_train.Embarked.isna()]

.. csv-table::
   :header: "", "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"

   61, 62, 1, 1, "Icard, Miss. Amelie", "female", 38.0, 0, 0, "113572", 80.0, "B28", "NaN"
   829, 830, 1, 1, "Stone, Mrs. George Nelson (Martha Evelyn)", "female", 62.0, 0, 0, "113572", 80.0, "B28", "NaN"
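When only a handful of rows are affected, as here, it can help to compare them numerically with similar complete
rows. The sketch below uses made-up toy values (not Titanic data) and a hypothetical helper ``closest_port``: it
computes the mean fare per ``Pclass`` and ``Embarked`` combination and returns the port whose mean fare within a
given class is nearest to a given fare. It is one possible numeric complement to a chart, not a definitive
imputation method.

.. code:: python

    import pandas as pd

    # Toy data for illustration only (values are made up, not Titanic data)
    toy = pd.DataFrame({
        'Pclass':   [1, 1, 1, 1, 3, 3],
        'Embarked': ['C', 'C', 'S', 'S', 'S', 'Q'],
        'Fare':     [90.0, 100.0, 60.0, 70.0, 8.0, 7.5],
    })

    # Mean fare for every (Pclass, Embarked) combination
    mean_fare = toy.groupby(['Pclass', 'Embarked'])['Fare'].mean()

    def closest_port(mean_fare: pd.Series, pclass: int, fare: float) -> str:
        """Hypothetical helper: port whose mean fare in `pclass` is nearest to `fare`."""
        candidates = mean_fare.loc[pclass]  # Series indexed by Embarked
        return (candidates - fare).abs().idxmin()

    print(closest_port(mean_fare, pclass=1, fare=85.0))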
We may be able to fill these by looking at the other variables. Both passengers paid a ``Fare`` of ``$80``,
travelled in ``Pclass`` ``1`` and are ``female``. Let’s see how ``Fare`` is distributed across the ``Pclass`` and
``Embarked`` values:

.. code:: python

    auto.analyze_interaction(train_data=df_train, x='Embarked', y='Fare', hue='Pclass')

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_7_0.png

Among passengers in ``Pclass`` ``1``, the average ``Fare`` closest to ``$80`` belongs to those with an ``Embarked``
value of ``C``. Let’s fill in the missing values with ``C``.

Using Interaction Charts To Learn Information About the Data
------------------------------------------------------------

.. code:: python

    auto.analyze_interaction(x='Pclass', y='Survived', train_data=df_train, test_data=df_test)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_9_0.png

About ``63%`` of first-class passengers survived, compared with ``47%`` of second-class and only ``24%`` of
third-class passengers. Similar information is visible via the ``Fare`` variable:

.. code:: python

    auto.analyze_interaction(x='Fare', hue='Survived', train_data=df_train, test_data=df_test, chart_args=dict(fill=True))

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_11_0.png

.. code:: python

    auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, test_data=df_test)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_12_0.png

The leftmost part of the distribution on this chart possibly hints that children and infants were given priority.

.. code:: python

    auto.analyze_interaction(x='Fare', y='Age', hue='Survived', train_data=df_train, test_data=df_test)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_14_0.png

This chart highlights three outliers with a ``Fare`` of over ``$500``. Let’s take a look at these:

.. code:: python

    df_train[df_train.Fare > 400]

.. csv-table::
   :header: "", "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked"

   258, 259, 1, 1, "Ward, Miss. Anna", "female", 35.0, 0, 0, "PC 17755", 512.3292, "NaN", "C"
   679, 680, 1, 1, "Cardeza, Mr. Thomas Drake Martinez", "male", 36.0, 0, 1, "PC 17755", 512.3292, "B51 B53 B55", "C"
   737, 738, 1, 1, "Lesurer, Mr. Gustave J", "male", 35.0, 0, 0, "PC 17755", 512.3292, "B101", "C"
As you can see, all three passengers share the same ticket (``PC 17755``). Split evenly, the per-person fare would
be about one third of this value, at least based on the rows in the training data. This suggests adding a
fare-per-person feature to the dataset; the group size also lets us see whether some passengers travelled in larger
groups. Let’s create these two new features and take another look at the Fare-Age relationship.

.. code:: python

    # size() counts every row per ticket; count() on a single column would skip rows
    # where that column is null (e.g. the two passengers with a missing Embarked value)
    ticket_to_count = df_train.groupby(by='Ticket').size().to_dict()
    data = df_train.copy()
    data['GroupSize'] = data.Ticket.map(ticket_to_count)
    data['FarePerPerson'] = data.Fare / data.GroupSize

    auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Survived', train_data=data)
    auto.analyze_interaction(x='FarePerPerson', y='Age', hue='Pclass', train_data=data)

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_18_0.png

.. figure:: output_eda-auto-analyze-interaction_1bd8e2_18_1.png

With the per-person fare, the separation between fare levels, ``Pclass`` and ``Survived`` is now cleaner.
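Passengers sharing a ticket may be split between the training and test sets, so counting tickets in the training
data alone can understate group sizes. One possible refinement, sketched below on toy data, counts each ticket
across both frames before dividing the fare. ``add_group_features`` is a hypothetical helper, and the sketch assumes
both frames have ``Ticket`` and ``Fare`` columns and that ``Fare`` is the total price of the ticket.

.. code:: python

    import pandas as pd

    def add_group_features(train: pd.DataFrame, test: pd.DataFrame) -> pd.DataFrame:
        """Hypothetical helper: add GroupSize and FarePerPerson to a copy of `train`,
        counting ticket holders across both `train` and `test`."""
        ticket_counts = pd.concat([train['Ticket'], test['Ticket']]).value_counts()
        out = train.copy()
        out['GroupSize'] = out['Ticket'].map(ticket_counts)
        out['FarePerPerson'] = out['Fare'] / out['GroupSize']
        return out

    # Toy frames for illustration only (not the Titanic data)
    toy_train = pd.DataFrame({'Ticket': ['A1', 'A1', 'B2'], 'Fare': [30.0, 30.0, 10.0]})
    toy_test = pd.DataFrame({'Ticket': ['A1', 'C3'], 'Fare': [30.0, 5.0]})

    features = add_group_features(toy_train, toy_test)
    print(features)

Under these assumptions, ``add_group_features(df_train, df_test)`` could replace the ``GroupSize`` and
``FarePerPerson`` computation before calling ``auto.analyze_interaction``.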