EDA: Covariate Shift Analysis
=============================

Covariate shift is a phenomenon in machine learning where the distribution of the independent variables
differs between the training and testing data. This can occur when the training and testing data come from
different sources or regions, or when the data changes over time. The result can be biased model
performance, because the model does not generalize well to the test data.

To address covariate shift, various techniques can be used, such as re-sampling the data, adjusting the
model to account for the shift, transforming the data to a form not exposed to the shift (e.g. car
year of make -> car age), or obtaining additional data to balance the distribution of the independent
variables. The goal is to ensure that the model is trained and tested on similar data distributions, so
that it generalizes well when deployed into production.

Example
-------

Let’s load the Titanic dataset:

.. code:: python

    import pandas as pd

    df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
    df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
    target_col = 'Survived'

Now we can perform analysis:

.. code:: python

    import autogluon.eda.auto as auto

    auto.covariate_shift_detection(train_data=df_train, test_data=df_test, label=target_col)

We detected a substantial difference between the training and test X distributions,
a type of distribution shift.

**Test results**: We can predict whether a sample is in the test vs. training set with a ``roc_auc`` of
``1.0000`` with a p-value of ``0.0010`` (smaller than the threshold of ``0.0100``).

**Feature importances**: The variables that are the most responsible for this shift are those with high feature importance:

.. raw:: html
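
    <table>
      <thead>
        <tr>
          <th></th>
          <th>importance</th>
          <th>stddev</th>
          <th>p_value</th>
          <th>n</th>
          <th>p99_high</th>
          <th>p99_low</th>
        </tr>
      </thead>
      <tbody>
        <tr>
          <th>PassengerId</th>
          <td>0.476397</td>
          <td>0.032325</td>
          <td>0.000003</td>
          <td>5</td>
          <td>0.542955</td>
          <td>0.409839</td>
        </tr>
        <tr>
          <th>Name</th>
          <td>0.000203</td>
          <td>0.000114</td>
          <td>0.008298</td>
          <td>5</td>
          <td>0.000439</td>
          <td>-0.000033</td>
        </tr>
      </tbody>
    </table>
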
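
The test result reported above (a ``roc_auc`` for predicting whether a sample is in the test vs. training set)
can be illustrated with a small classifier-based check: label each row by the dataset it came from, train a
model to tell the two apart, and score it on held-out rows. The following is only a minimal sketch of that idea
using scikit-learn on synthetic data. It is not AutoGluon's implementation; the helper name
``domain_classifier_auc`` and the columns ``row_id`` and ``x`` are assumptions made for illustration.

.. code:: python

    import numpy as np
    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import StratifiedKFold, cross_val_predict


    def domain_classifier_auc(train: pd.DataFrame, test: pd.DataFrame, seed: int = 0) -> float:
        """Hypothetical helper: out-of-fold ROC AUC of a classifier that predicts
        whether a row comes from ``train`` (label 0) or ``test`` (label 1)."""
        X = pd.concat([train, test], ignore_index=True)
        y = np.r_[np.zeros(len(train), dtype=int), np.ones(len(test), dtype=int)]
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        # Shuffled folds, so every fold sees rows from the whole range of each dataset
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        # Out-of-fold probabilities: no row is scored by a model that was fit on it
        proba = cross_val_predict(clf, X, y, cv=cv, method='predict_proba')[:, 1]
        return float(roc_auc_score(y, proba))


    rng = np.random.default_rng(0)
    # Synthetic data (an assumption): 'row_id' is a running ID, 'x' has the same distribution in both sets
    train = pd.DataFrame({'row_id': np.arange(0, 600), 'x': rng.normal(size=600)})
    test = pd.DataFrame({'row_id': np.arange(600, 1000), 'x': rng.normal(size=400)})

    auc_with_id = domain_classifier_auc(train, test)
    auc_without_id = domain_classifier_auc(train.drop(columns='row_id'), test.drop(columns='row_id'))
    print(f'with row_id: {auc_with_id:.3f}, without row_id: {auc_without_id:.3f}')

An AUC close to ``1`` means the two datasets are easy to tell apart, while an AUC close to ``0.5`` means the
classifier does no better than chance, i.e. no shift is detectable from the given columns.
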

**``PassengerId`` values distribution between datasets; p-value: ``0.0000``**

.. figure:: output_eda-auto-covariate-shift_29486e_3_4.png

**``Name`` values distribution between datasets; p-value: ``0.0083``**

Interaction ``Name``/``__dataset__`` is not rendered due to ``Name`` having too many categories (``1307`` > ``30``) for comfortable read.

The detector found that ``Name`` and ``PassengerId`` can distinguish, with very high certainty (``roc_auc`` is ``1``),
whether a row came from the train or the test part of the dataset. We’ll ignore ``Name`` for now, since its
importance is relatively low, and look first at ``PassengerId``.

The graph shows that the feature is uniformly distributed, but over different ranges in the train and test datasets.
In this specific case it is just a monotonically increasing ID, which carries no practical value for this task.
Let’s drop it and try the run again:

.. code:: python

    df_train = df_train.drop(columns='PassengerId')
    df_test = df_test.drop(columns='PassengerId')
    auto.covariate_shift_detection(train_data=df_train, test_data=df_test, label=target_col)

We did not detect a substantial difference between the training and test X distributions.
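
Dropping a column works for an ID, but a shifted feature often carries real signal. In that case one of the
options listed at the top of this page is to transform the feature into a form that is not exposed to the shift,
such as replacing a car's year of make with its age. The following is a minimal sketch of that transformation
with pandas; the data and the column names ``listing_year``, ``model_year`` and ``car_age`` are hypothetical.

.. code:: python

    import pandas as pd

    # Hypothetical listings: training data collected in 2015, test data collected in 2020
    train = pd.DataFrame({'listing_year': [2015] * 4, 'model_year': [2005, 2008, 2011, 2014]})
    test = pd.DataFrame({'listing_year': [2020] * 4, 'model_year': [2010, 2013, 2016, 2019]})

    for df in (train, test):
        # Age is measured relative to the observation time, so it does not drift with the calendar
        df['car_age'] = df['listing_year'] - df['model_year']

    # 'model_year' differs between the datasets, while 'car_age' covers the same values in both
    print(train['model_year'].mean(), test['model_year'].mean())
    print(train['car_age'].tolist(), test['car_age'].tolist())

After such a transformation the original column (here ``model_year``) would be dropped before running the
shift detection again.
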