EDA: Covariate Shift Analysis
=============================

Covariate shift is a phenomenon in machine learning where the distribution of the independent variables differs between the training and test data. It can occur when the training and test data come from different sources or regions, or when the data changes over time. The result can be biased model performance, because the model does not generalize well to the test data.

To address covariate shift, various techniques can be used, such as re-sampling the data, adjusting the model to account for the shift, transforming the data into a form that is not exposed to the shift (e.g. replacing a car's year of manufacture with the car's age), or obtaining additional data to balance the distribution of the independent variables. The goal is to ensure that the model is trained and tested on similar data distributions, so that it generalizes well when deployed to production.

Example
-------

Let’s load the Titanic dataset:

.. code:: python

    import pandas as pd

    df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
    df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
    target_col = 'Survived'

Now we can perform the analysis:

.. code:: python

    import autogluon.eda.auto as auto

    auto.covariate_shift_detection(train_data=df_train, test_data=df_test, label=target_col)

We detected a substantial difference between the training and test X distributions, a type of distribution shift.

**Test results**: We can predict whether a sample is in the test vs. training set with a ``roc_auc`` of ``1.0000`` with a p-value of ``0.0010`` (smaller than the threshold of ``0.0100``).

**Feature importances**: The variables that are the most responsible for this shift are those with high feature importance:

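
+-------------+------------+----------+----------+---+----------+-----------+
|             | importance | stddev   | p_value  | n | p99_high | p99_low   |
+=============+============+==========+==========+===+==========+===========+
| PassengerId | 0.476397   | 0.032325 | 0.000003 | 5 | 0.542955 | 0.409839  |
+-------------+------------+----------+----------+---+----------+-----------+
| Name        | 0.000203   | 0.000114 | 0.008298 | 5 | 0.000439 | -0.000033 |
+-------------+------------+----------+----------+---+----------+-----------+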
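
To build intuition for what a ``roc_auc`` of ``1.0000`` means here, the following minimal sketch measures how well a single numeric column can tell test rows from training rows. It is an illustration only, not AutoGluon's implementation. The helper ``pairwise_auc`` and both toy columns are hypothetical, and the identifier column assumes that ids are assigned in file order, with the test rows numbered after the training rows.

.. code:: python

    def pairwise_auc(train_values, test_values):
        """ROC AUC of one numeric feature used to tell test rows from train rows.

        Equals the probability that a randomly chosen test value exceeds a
        randomly chosen train value, counting ties as one half.
        """
        wins = 0.0
        for test_value in test_values:
            for train_value in train_values:
                if test_value > train_value:
                    wins += 1.0
                elif test_value == train_value:
                    wins += 0.5
        return wins / (len(train_values) * len(test_values))


    # Hypothetical identifier column: the test ids continue after the train ids.
    train_ids = list(range(1, 101))
    test_ids = list(range(101, 151))

    # Hypothetical feature whose values are spread the same way in both sets.
    train_feature = [i % 10 for i in range(100)]
    test_feature = [i % 10 for i in range(50)]

    id_auc = pairwise_auc(train_ids, test_ids)
    feature_auc = pairwise_auc(train_feature, test_feature)
    print(f"identifier AUC: {id_auc:.4f}")
    print(f"stable feature AUC: {feature_auc:.4f}")

In this toy setup the identifier column separates the two sets perfectly, while the evenly spread feature scores 0.5, which is no better than chance. A value far from 0.5 in either direction signals that the column alone reveals which set a row came from.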