Reference: Auto components#

This section is a reference for the high-level composite components showcased in the sections above.

autogluon.eda.analysis.auto#

dataset_overview

Shortcut to perform high-level datasets summary overview (counts, frequencies, missing statistics, types info).

target_analysis

Target variable composite analysis.

quick_fit

This helper performs quick model fit analysis and then produces a composite report of the results.

missing_values_analysis

Perform quick analysis of missing values across datasets.

covariate_shift_detection

Shortcut for covariate shift detection analysis.

analyze_interaction

This helper performs simple feature interaction analysis.

partial_dependence_plots

Partial Dependence Plot (PDP)

explain_rows

Kernel SHAP is a method that uses a special weighted linear regression to compute the importance of each feature.

detect_anomalies

Anomaly Detection

analyze

This helper creates BaseAnalysis wrapping passed analyses into Sampler if needed, then fits and renders produced state with specified visualizations.

dataset_overview#

autogluon.eda.auto.simple.dataset_overview(train_data: Optional[DataFrame] = None, test_data: Optional[DataFrame] = None, val_data: Optional[DataFrame] = None, label: Optional[str] = None, state: Union[None, dict, AnalysisState] = None, return_state: bool = False, sample: Union[None, int, float] = 10000, fig_args: Optional[Dict[str, Dict[str, Any]]] = None, chart_args: Optional[Dict[str, Dict[str, Any]]] = None)[source]#

Shortcut to perform high-level datasets summary overview (counts, frequencies, missing statistics, types info).

Supported fig_args/chart_args keys:
  • feature_distance.<property> - feature distance dendrogram chart

  • chart.<variable>.<property> - near-duplicate group visualization charts. If a chart is labeled as a relationship <A>/<B>, then <variable> is <B>

Parameters
  • train_data (Optional[DataFrame], default = None) – training dataset

  • test_data (Optional[DataFrame], default = None) – test dataset

  • val_data (Optional[DataFrame], default = None) – validation dataset

  • label (Optional[str], default = None) – target variable

  • state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.

  • return_state (bool, default = False) – return state if True

  • sample (Union[None, int, float], default = 10000) – sample size; if int, the number of rows to sample; if float, must be between 0.0 and 1.0 and represents the fraction of the dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()

  • fig_args (Optional[Dict[str, Any]], default = None,) – figure args for visualizations; key == component; value = dict of kwargs for component figure

  • chart_args (Optional[Dict[str, Any]], default = None,) – chart args for visualizations; key == component; value = dict of kwargs for component chart

Examples

>>> import autogluon.eda.auto as auto
>>>
>>> auto.dataset_overview(
>>>     train_data=df_train, test_data=df_test, label=target_col,
>>>     chart_args={'feature_distance.orientation': 'left'},
>>>     fig_args={'feature_distance.figsize': (6,6)},
>>> )
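
A minimal additional sketch using only the documented sample and return_state parameters; df_train, df_test and target_col are placeholder names:

>>> import autogluon.eda.auto as auto
>>>
>>> state = auto.dataset_overview(
>>>     train_data=df_train, test_data=df_test, label=target_col,
>>>     sample=5000,        # downsample to 5000 rows before analysis
>>>     return_state=True,  # keep the computed state for later inspection
>>> )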

target_analysis#

autogluon.eda.auto.simple.target_analysis(train_data: DataFrame, label: str, test_data: Optional[DataFrame] = None, problem_type: str = 'auto', fit_distributions: Union[bool, str, List[str]] = True, sample: Union[None, int, float] = 10000, state: Union[None, dict, AnalysisState] = None, return_state: bool = False, fig_args: Optional[Dict[str, Any]] = None, chart_args: Optional[Dict[str, Any]] = None) Optional[AnalysisState][source]#

Target variable composite analysis.

Performs the following analysis components of the label field:
  • basic summary stats

  • feature values distribution charts; adds fitted distributions for numeric targets

  • target correlations analysis; with interaction charts of target vs high-correlated features

Supported fig_args/chart_args keys:
  • correlation.<property> - properties for correlation heatmap

  • chart.<variable_name>.<property> - properties for charts rendered during the analysis.

If <variable_name> matches the label value, then this will modify the top chart; all other values will affect the label/<variable_name> interaction charts

Parameters
  • train_data (DataFrame) – training dataset

  • test_data (Optional[DataFrame], default = None) – test dataset

  • label (str) – target variable

  • problem_type (str, default = 'auto') – problem type to use. Valid problem_type values include ['auto', 'binary', 'multiclass', 'regression', 'quantile', 'softclass']; 'auto' means it will be auto-detected using AutoGluon methods.

  • fit_distributions (Union[bool, str, List[str]], default = True,) – If True, or a list of distributions is provided, then fit distributions. Performed only if y and hue are not present.

  • state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.

  • sample (Union[None, int, float], default = 10000) – sample size; if int, the number of rows to sample; if float, must be between 0.0 and 1.0 and represents the fraction of the dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()

  • return_state (bool, default = False) – return state if True

  • fig_args (Optional[Dict[str, Any]], default = None,) – figure args for visualizations; key == component; value = dict of kwargs for component figure. The args support nested dot syntax: 'a.b.c'. Chart args follow the convention of <variable_name>.<param> (e.g. chart.PassengerId.figsize will result in setting figsize on the <target>/PassengerId figure).

  • chart_args (Optional[Dict[str, Any]], default = None,) – chart args for visualizations; key == component; value = dict of kwargs for component chart. The args support nested dot syntax: 'a.b.c'. Chart args follow the convention of <variable_name>.<param> (e.g. chart.PassengerId.fill will result in setting fill on the <target>/PassengerId chart).

Return type

state after fit call if return_state is True; None otherwise

Examples

>>> import autogluon.eda.auto as auto
>>>
>>> auto.target_analysis(train_data=..., label=...)
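
An illustrative sketch of customizing figures and charts via the documented key conventions (correlation.<property> and chart.<variable_name>.<property>); 'Survived' and 'Age' are placeholder column names:

>>> import autogluon.eda.auto as auto
>>>
>>> # `correlation.figsize` targets the correlation heatmap figure;
>>> # `chart.Age.fill` targets the <target>/Age interaction chart.
>>> auto.target_analysis(
>>>     train_data=df_train, label='Survived',
>>>     fig_args={'correlation.figsize': (8, 6)},
>>>     chart_args={'chart.Age.fill': True},
>>> )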

quick_fit#

autogluon.eda.auto.simple.quick_fit(train_data: DataFrame, label: str, test_data: Optional[DataFrame] = None, path: Optional[str] = None, val_size: float = 0.3, problem_type: str = 'auto', fit_bagging_folds: int = 0, sample: Union[None, int, float] = 10000, state: Union[None, dict, AnalysisState] = None, return_state: bool = False, save_model_to_state: bool = True, verbosity: int = 0, show_feature_importance_barplots: bool = False, estimator_args: Optional[Dict[str, Dict[str, Any]]] = None, fig_args: Optional[Dict[str, Dict[str, Any]]] = None, chart_args: Optional[Dict[str, Dict[str, Any]]] = None, render_analysis: bool = True, **fit_args)[source]#

This helper performs quick model fit analysis and then produces a composite report of the results.

The analysis is structured in a sequence of operations:
  • Sample if sample is specified.

  • Perform train-test split using val_size ratio

  • Fit an AutoGluon estimator given fit_args; if hyperparameters are not present in the args, then use the default ones (Random Forest by default, because it is interpretable)

  • Display report

The reports include:
  • confusion matrix for classification problems; predictions vs actual for regression problems

  • model leaderboard

  • feature importance

  • samples with the highest prediction error - candidates for inspection

  • samples with the least distance from the other class - candidates for labeling

Supported fig_args/chart_args keys:
  • confusion_matrix.<property> - confusion matrix chart for classification predictor

  • regression_eval.<property> - regression predictor results chart

  • feature_importance.<property> - feature importance barplot chart

State attributes

  • model

    trained model

  • model_evaluation.importance

    feature importance calculated using the trained model

  • model_evaluation.leaderboard

    trained models leaderboard

  • model_evaluation.highest_error

    misclassified rows with the highest error between prediction and ground truth

  • model_evaluation.undecided (classification only)

    misclassified rows with the prediction closest to the decision boundary

  • model_evaluation.confusion_matrix (classification only)

    confusion matrix values

Parameters
  • train_data (DataFrame) – training dataset

  • test_data (Optional[DataFrame], default = None) – test dataset

  • label (str) – target variable

  • path (Optional[str], default = None,) – path for models saving

  • problem_type (str, default = 'auto') – problem type to use. Valid problem_type values include ['auto', 'binary', 'multiclass', 'regression', 'quantile', 'softclass']; 'auto' means it will be auto-detected using AutoGluon methods.

  • fit_bagging_folds (int, default = 0,) – shortcut to enable training with bagged folds; disabled if 0 (default)

  • sample (Union[None, int, float], default = 10000) – sample size; if int, the number of rows to sample; if float, must be between 0.0 and 1.0 and represents the fraction of the dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()

  • val_size (float, default = 0.3) – fraction of training set to be assigned as validation set during the split.

  • state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.

  • return_state (bool, default = False) – return state if True

  • save_model_to_state (bool, default = True,) – save the fitted model into state under the model key. This functionality might be helpful in cases when the fitted model could be used for other purposes (e.g. imputers)

  • verbosity (int, default = 0) – Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels).

  • show_feature_importance_barplots (bool, default = False) – if True, then a barplot chart will be added with the feature importance visualization

  • estimator_args (Optional[Dict[str, Dict[str, Any]]], default = None,) – args to pass into the estimator constructor

  • fit_args (Optional[Dict[str, Dict[str, Any]]], default = None,) – kwargs to pass into TabularPredictor fit.

  • fig_args (Optional[Dict[str, Any]], default = None,) – figure args for visualizations; key == component; value = dict of kwargs for component figure. The args support nested dot syntax: 'a.b.c'.

  • chart_args (Optional[Dict[str, Any]], default = None,) – chart args for visualizations; key == component; value = dict of kwargs for component chart. The args support nested dot syntax: 'a.b.c'.

  • render_analysis (bool, default = True) – if False, then don’t render any visualizations; this can be used if user just needs to train a model. It is recommended to use this option with save_model_to_state=True and return_state=True options.

Return type

state after fit call if return_state is True; None otherwise

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.auto as auto
>>>
>>> # Quick fit
>>> state = auto.quick_fit(
>>>     train_data=..., label=...,
>>>     return_state=True,  # return state object from call
>>>     fig_args={"regression_eval.figsize": (8,6)},  # customize regression evaluation `figsize`
>>>     chart_args={"regression_eval.residuals_plot_mode": "hist"},  # customize regression evaluation `residuals_plot_mode`
>>>     hyperparameters={'GBM': {}}  # train specific model
>>> )
>>>
>>> # Using quick fit model
>>> model = state.model
>>> y_pred = model.predict(test_data)
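
When only the trained model is needed and no report should be rendered, render_analysis=False can be combined with return_state=True and save_model_to_state=True, as noted in the parameter descriptions. A minimal sketch:

>>> import autogluon.eda.auto as auto
>>>
>>> # Train quietly and keep only the fitted model; no visualizations are rendered.
>>> state = auto.quick_fit(
>>>     train_data=df_train, label=target_col,
>>>     render_analysis=False,     # skip report rendering
>>>     save_model_to_state=True,  # keep the fitted model under state.model
>>>     return_state=True,
>>> )
>>> predictor = state.model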

missing_values_analysis#

autogluon.eda.auto.simple.missing_values_analysis(train_data: Optional[DataFrame] = None, test_data: Optional[DataFrame] = None, val_data: Optional[DataFrame] = None, graph_type: str = 'matrix', state: Union[None, dict, AnalysisState] = None, return_state: bool = False, sample: Union[None, int, float] = 10000, **chart_args)[source]#

Perform quick analysis of missing values across datasets.

Parameters
  • train_data (Optional[DataFrame]) – training dataset

  • test_data (Optional[DataFrame], default = None) – test dataset

  • val_data (Optional[DataFrame], default = None) – validation dataset

  • graph_type (str, default = 'matrix') –

    One of the following visualization types:

    • matrix - nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion. This visualization will comfortably accommodate up to 50 labelled variables; past that range labels begin to overlap or become unreadable, and by default large displays omit them.

    • bar - visualizes how many rows are non-null vs null in the column. Logarithmic scale can be enabled by specifying log=True in kwargs.

    • heatmap - correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another. Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does). Entries marked <1 or >-1 have a correlation that is close to being exactingly negative or positive but is still not quite perfectly so.

    • dendrogram - the dendrogram allows to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap. The dendrogram uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree, the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.

  • state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.

  • return_state (bool, default = False) – return state if True

  • sample (Union[None, int, float], default = 10000) – sample size; if int, the number of rows to sample; if float, must be between 0.0 and 1.0 and represents the fraction of the dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()

Return type

state after fit call if return_state is True; None otherwise

Examples

>>> import autogluon.eda.auto as auto
>>>
>>> auto.missing_values_analysis(train_data=...)
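
Illustrative sketches of the other graph_type options described above; the log=True keyword is assumed to be forwarded to the chart through **chart_args, as the bar description suggests:

>>> import autogluon.eda.auto as auto
>>>
>>> # Nullity correlation heatmap instead of the default matrix view
>>> auto.missing_values_analysis(train_data=df_train, graph_type='heatmap')
>>>
>>> # Bar view with a logarithmic scale
>>> auto.missing_values_analysis(train_data=df_train, graph_type='bar', log=True)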

covariate_shift_detection#

autogluon.eda.auto.simple.covariate_shift_detection(train_data: DataFrame, test_data: DataFrame, label: str, sample: Union[None, int, float] = 10000, path: Optional[str] = None, state: Union[None, dict, AnalysisState] = None, return_state: bool = False, verbosity: int = 0, fig_args: Optional[Dict[str, Any]] = None, chart_args: Optional[Dict[str, Any]] = None, **fit_args)[source]#

Shortcut for covariate shift detection analysis.

Detects a change in covariate (X) distribution between training and test, which we call XShift. It can tell you if your training set is not representative of your test set distribution. This is done with a Classifier 2 Sample Test.

Supported fig_args/chart_args keys:
  • chart.<variable_name>.<property> - properties for charts rendered during the analysis

Parameters
  • train_data (DataFrame) – training dataset

  • test_data (DataFrame) – test dataset

  • label (str) – target variable

  • state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.

  • sample (Union[None, int, float], default = 10000) – sample size; if int, the number of rows to sample; if float, must be between 0.0 and 1.0 and represents the fraction of the dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()

  • path (Optional[str], default = None,) – path for models saving

  • return_state (bool, default = False) – return state if True

  • verbosity (int, default = 0) – Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels).

  • fit_args – kwargs to pass into TabularPredictor fit

  • fig_args (Optional[Dict[str, Any]], default = None,) – figure args for visualizations; key == component; value = dict of kwargs for component figure. The args support nested dot syntax: 'a.b.c'. Chart args follow the convention of <variable_name>.<param> (e.g. chart.PassengerId.figsize will result in setting figsize on the PassengerId figure).

  • chart_args (Optional[Dict[str, Any]], default = None,) – chart args for visualizations; key == component; value = dict of kwargs for component chart. The args support nested dot syntax: 'a.b.c'. Chart args follow the convention of <variable_name>.<param> (e.g. chart.PassengerId.fill will result in setting fill on the PassengerId chart).

Return type

state after fit call if return_state is True; None otherwise

Examples

>>> import autogluon.eda.auto as auto
>>>
>>> # use default settings
>>> auto.covariate_shift_detection(train_data=..., test_data=..., label=...)
>>>
>>> # customize classifier and verbosity level
>>> auto.covariate_shift_detection(train_data=..., test_data=..., label=..., verbosity=2, hyperparameters = {'GBM': {}})

analyze_interaction#

autogluon.eda.auto.simple.analyze_interaction(train_data: DataFrame, x: Optional[str] = None, y: Optional[str] = None, hue: Optional[str] = None, fit_distributions: Union[bool, str, List[str]] = False, fig_args: Optional[Dict[str, Any]] = None, chart_args: Optional[Dict[str, Any]] = None, **analysis_args)[source]#

This helper performs simple feature interaction analysis.

Parameters
  • train_data (pd.DataFrame) – training dataset

  • x (Optional[str], default = None) –

  • y (Optional[str], default = None) –

  • hue (Optional[str], default = None) –

  • fit_distributions (Union[bool, str, List[str]], default = False,) – If True, or a list of distributions is provided, then fit distributions. Performed only if y and hue are not present.

  • chart_args (Optional[dict], default = None) – kwargs to pass into visualization component

  • fig_args (Optional[Dict[str, Any]], default = None,) – kwargs to pass into visualization component

Examples

>>> import pandas as pd
>>> import autogluon.eda.auto as auto
>>>
>>> df_train = pd.DataFrame(...)
>>>
>>> auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, chart_args=dict(headers=True, alpha=0.2))
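
A minimal sketch of the fit_distributions option; distribution fitting is performed only because y and hue are not specified (see the parameter description above), and 'Fare' is a placeholder column name:

>>> import autogluon.eda.auto as auto
>>>
>>> # Single-variable view with fitted distributions overlaid
>>> auto.analyze_interaction(x='Fare', train_data=df_train, fit_distributions=True)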

partial_dependence_plots#

autogluon.eda.auto.simple.partial_dependence_plots(train_data: DataFrame, label: str, target: Optional[Any] = None, features: Optional[Union[str, List[str]]] = None, two_way: bool = False, path: Optional[str] = None, max_ice_lines: int = 300, sample: Optional[Union[int, float]] = 10000, fig_args: Optional[Dict[str, Dict[str, Any]]] = None, chart_args: Optional[Dict[str, Dict[str, Any]]] = None, show_help_text: bool = True, return_state: bool = False, col_number_warning: int = 20, **fit_args)[source]#

Partial Dependence Plot (PDP)

Analyze and interpret the relationship between a target variable and a specific feature in a machine learning model. PDP helps in understanding the marginal effect of a feature on the predicted outcome while holding other features constant.

The visualizations have two modes:

  • Display Partial Dependence Plots (PDP) with Individual Conditional Expectation (ICE) - this is the default mode of operation

  • Two-Way PDP plots - this mode can be selected by passing two features and setting two_way = True

ICE plots complement PDP by showing the relationship between a feature and the model’s output for each individual instance in the dataset. ICE lines (blue) can be overlaid on PDPs (red) to provide a more detailed view of how the model behaves for specific instances. Here are some points on how to interpret PDPs with ICE lines:

  • Central tendency

    The PDP line represents the average prediction for different values of the feature of interest. Look for the overall trend of the PDP line to understand the average effect of the feature on the model’s output.

  • Variability

    The ICE lines represent the predicted outcomes for individual instances as the feature of interest changes. Examine the spread of ICE lines around the PDP line to understand the variability in predictions for different instances.

  • Non-linear relationships

    Look for any non-linear patterns in the PDP and ICE lines. This may indicate that the model captures a non-linear relationship between the feature and the model’s output.

  • Heterogeneity

    Check for instances where ICE lines have widely varying slopes, indicating different relationships between the feature and the model’s output for individual instances. This may suggest interactions between the feature of interest and other features.

  • Outliers

    Look for any ICE lines that are very different from the majority of the lines. This may indicate potential outliers or instances that have unique relationships with the feature of interest.

  • Confidence intervals

    If available, examine the confidence intervals around the PDP line. Wider intervals may indicate a less certain relationship between the feature and the model’s output, while narrower intervals suggest a more robust relationship.

  • Interactions

    By comparing PDPs and ICE plots for different features, you may detect potential interactions between features. If the ICE lines change significantly when comparing two features, this might suggest an interaction effect.

Two-way PDP can visualize potential interactions between any two features. Here are a few cases when two-way PDP can give good results:

  • Suspected interactions: Even if two features are not highly correlated, they may still interact in the context of the model. If you suspect that there might be interactions between any two features, two-way PDP can help to verify the hypotheses.

  • Moderate to high correlation: If two features have a moderate to high correlation, a two-way PDP can show how the combined effect of these features influences the model’s predictions. In this case, the plot can help reveal whether the relationship between the features is additive, multiplicative, or more complex.

  • Complementary features: If two features provide complementary information, a two-way PDP can help illustrate how the joint effect of these features impacts the model’s predictions. For example, if one feature measures the length of an object and another measures its width, a two-way PDP could show how the combination of these features affects the predicted outcome.

  • Domain knowledge: If domain knowledge suggests that the relationship between two features might be important for the model’s output, a two-way PDP can help to explore and validate these hypotheses.

  • Feature importance: If feature importance analysis ranks both features high in the leaderboard, it might be beneficial to examine their joint effect on the model’s predictions.

State attributes

  • pdp_id_to_category_mappings

    Categorical values are represented in charts as numbers; id-to-value mappings are available in this property.

Parameters
  • train_data (DataFrame) – training dataset

  • label (str) – target variable

  • target (Optional[Any], default = None) – In a multiclass setting, specifies the class for which the PDPs should be computed. Ignored in binary classification or classical regression settings

  • features (Optional[Union[str, List[str]]], default = None) – feature subset to display; None means all features will be rendered.

  • two_way (bool, default = False) – render two-way PDP; this mode works only when two features are specified

  • path (Optional[str], default = None) – location to store the model trained for this task

  • max_ice_lines (int, default = 300) – max number of ICE lines to display for each sub-plot

  • sample (Union[None, int, float], default = 10000) – sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling See also autogluon.eda.analysis.dataset.Sampler()

  • fig_args (Optional[Dict[str, Any]], default = None) – kwargs to pass into chart figure

  • chart_args (Optional[dict], default = None) – kwargs to pass into visualization component

  • show_help_text (bool, default = True) – if True, shows additional information on how to interpret the data

  • return_state (bool, default = False) – return state if True

  • col_number_warning (int, default = 20) – number of features to visualize above which a warning about rendering time will be displayed

  • fit_args (Optional[Dict[str, Dict[str, Any]]], default = None,) – kwargs to pass into TabularPredictor fit.

Return type

state after fit call if return_state is True; None otherwise

Examples

>>> import autogluon.eda.auto as auto
>>>
>>> # Plot all features in a grid
>>> auto.partial_dependence_plots(train_data=..., label=...)
>>>
>>> # Plot two-way feature interaction for features `feature_a` and `feature_b`
>>> auto.partial_dependence_plots(train_data=..., label=..., features=['feature_a', 'feature_b'], two_way=True)
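
A sketch of the multiclass target option combined with a feature subset, using only the documented parameters; the class label and feature names are placeholders:

>>> import autogluon.eda.auto as auto
>>>
>>> # Multiclass setting: compute PDPs for one class of interest only,
>>> # restricted to a subset of features.
>>> auto.partial_dependence_plots(
>>>     train_data=df_train, label='Species',
>>>     target='setosa',               # class for which PDPs are computed
>>>     features=['petal_len', 'petal_wid'],
>>>     show_help_text=False,          # suppress the interpretation help text
>>> )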

See also

PDPInteractions

explain_rows#

autogluon.eda.auto.simple.explain_rows(train_data: DataFrame, model: TabularPredictor, rows: DataFrame, display_rows: bool = False, plot: Optional[str] = 'force', baseline_sample: int = 100, return_state: bool = False, fit_args: Optional[Dict[str, Any]] = None, **kwargs) Optional[AnalysisState][source]#

Kernel SHAP is a method that uses a special weighted linear regression to compute the importance of each feature. The computed importance values are Shapley values from game theory and also coefficients from a local linear regression for the given rows.

The results are rendered either as a force plot or a waterfall plot.

Parameters
  • train_data (DataFrame) – training dataset

  • model (TabularPredictor) – trained AutoGluon predictor

  • rows (pd.DataFrame,) – rows to explain

  • display_rows (bool, default = False) – if True, then display the row before the explanation chart

  • plot (Optional[str], default = 'force') – type of plot to visualize the Shapley values. Supported keys:

    • force - visualize the given SHAP values with an additive force layout

    • waterfall - visualize the given SHAP values with a waterfall layout

    • None - do not use any visualization

  • baseline_sample (int, default = 100) – The background dataset size to use for integrating out features. To determine the impact of a feature, that feature is set to “missing” and the change in the model output is observed.

  • return_state (bool, default = False) – return state if True

  • fit_args (Optional[Dict[str, Any]], default = None,) – kwargs for ShapAnalysis.

  • kwargs

Examples

>>> import autogluon.eda.auto as auto
>>>
>>> state = auto.quick_fit(
>>>     train_data=...,
>>>     label=...,
>>>     return_state=True,
>>> )
>>>
>>> # quick_fit stored model in `state.model`, and can be passed here.
>>> # This will visualize 1st row of rows with the highest errors;
>>> # these rows are stored under `state.model_evaluation.highest_error`
>>> auto.explain_rows(
>>>     train_data=...,
>>>     model=state.model,
>>>     display_rows=True,
>>>     rows=state.model_evaluation.highest_error[:1],
>>>     plot='waterfall',  # visualize as waterfall plot
>>> )
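
Continuing the same sketch: for classification problems quick_fit also stores the rows closest to the decision boundary under state.model_evaluation.undecided (see the quick_fit state attributes), and these can be explained in the same way with the default force plot:

>>> # Visualize the most "undecided" row with an additive force layout
>>> auto.explain_rows(
>>>     train_data=...,
>>>     model=state.model,
>>>     rows=state.model_evaluation.undecided[:1],
>>>     plot='force',
>>> )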

See also

KernelExplainer, ShapAnalysis, ExplainForcePlot, ExplainWaterfallPlot

detect_anomalies#

autogluon.eda.auto.simple.detect_anomalies(train_data: DataFrame, label: str, test_data: Optional[DataFrame] = None, val_data: Optional[DataFrame] = None, explain_top_n_anomalies: Optional[int] = None, show_top_n_anomalies: Optional[int] = 10, threshold_stds: float = 3, show_help_text: bool = True, state: Union[None, dict, AnalysisState] = None, sample: Union[None, int, float] = 10000, return_state: bool = False, fig_args: Optional[Dict[str, Any]] = None, chart_args: Optional[Dict[str, Any]] = None, **anomaly_detector_kwargs) Optional[AnalysisState][source]#

Anomaly Detection

This method is used to identify unusual patterns or behaviors in data that deviate significantly from the norm. It is best suited to finding outliers, rare events, or suspicious activities that could indicate fraud, defects, or system failures.

When interpreting anomaly scores, consider:

  • Threshold:

    Determine a suitable threshold to separate normal from anomalous data points, based on domain knowledge or statistical methods.

  • Context:

    Examine the context of anomalies, including time, location, and surrounding data points, to identify possible causes.

  • False positives/negatives:

    Be aware of the trade-offs between false positives (normal points classified as anomalies) and false negatives (anomalies missed).

  • Feature relevance:

    Ensure the features used for anomaly detection are relevant and contribute to the model’s performance.

  • Model performance:

    Regularly evaluate and update the model to maintain its accuracy and effectiveness.

It’s important to understand the context and domain knowledge before deciding on an appropriate approach to deal with anomalies. The choice of method depends on the data’s nature, the cause of anomalies, and the problem being addressed. The common ways to deal with anomalies:

  • Removal:

    If an anomaly is a result of an error, noise, or irrelevance to the analysis, it can be removed from the dataset to prevent it from affecting the model’s performance.

  • Imputation:

    Replace anomalous values with appropriate substitutes, such as the mean, median, or mode of the feature, or by using more advanced techniques like regression or k-nearest neighbors.

  • Transformation:

    Apply transformations like log, square root, or z-score to normalize the data and reduce the impact of extreme values. Absolute dates might be transformed into relative features like age of the item.

  • Capping:

    Set upper and lower bounds for a feature, and replace values outside these limits with the bounds themselves. This method is also known as winsorizing.

  • Separate modeling:

    Treat anomalies as a distinct group and build a separate model for them, or use specialized algorithms designed for handling outliers, such as robust regression or one-class SVM.

  • Incorporate as a feature:

    Create a new binary feature indicating the presence of an anomaly, which can be useful if anomalies have predictive value.

State attributes

  • anomaly_detection.scores.<dataset>

    scores for each of the datasets passed into analysis (i.e. train_data, test_data)

  • anomaly_detection.anomalies.<dataset>

    data points considered anomalies: original rows with an added score column, sorted in descending score order; defined by the threshold_stds parameter

  • anomaly_detection.anomaly_score_threshold

    anomaly score threshold above which data points are considered anomalies; defined by the threshold_stds parameter

Parameters
  • train_data (DataFrame) – training dataset

  • label (str) – target variable

  • test_data (Optional[pd.DataFrame], default = None) – test dataset

  • val_data (Optional[pd.DataFrame], default = None) – validation dataset

  • explain_top_n_anomalies (Optional[int], default = None) – explain the anomaly scores for n rows with the highest scores; don’t perform analysis if value is None or 0

  • show_top_n_anomalies (Optional[int], default = 10) – display n rows with highest anomaly scores

  • threshold_stds (float, default = 3) – specifies how many standard deviations above the mean anomaly score a data point must be to be considered an anomaly (only needed for visualization; does not affect score calculation)

  • show_help_text (bool, default = True) – if True, shows additional information on how to interpret the data

  • state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.

  • sample (Union[None, int, float], default = 10000) – sample size; if int, the number of rows to sample; if float, must be between 0.0 and 1.0 and represents the fraction of the dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()

  • return_state (bool, default = False) – return state if True

  • fig_args (Optional[Dict[str, Any]], default = None,) – kwargs to pass into visualization component

  • chart_args (Optional[dict], default = None) – kwargs to pass into visualization component

  • anomaly_detector_kwargs – kwargs to pass into AnomalyDetectorAnalysis

Return type

state after fit call if return_state is True; None otherwise

Examples

>>> import autogluon.eda.auto as auto
>>>
>>> state = auto.detect_anomalies(
>>>     train_data=...,
>>>     test_data=...,  # optional
>>>     label=...,
>>>     threshold_stds=3,
>>>     show_top_n_anomalies=5,
>>>     explain_top_n_anomalies=3,
>>>     return_state=True,
>>>     chart_args={
>>>         'normal.color': 'lightgrey',
>>>         'anomaly.color': 'orange',
>>>     }
>>> )
>>>
>>> # Getting anomaly scores from the analysis
>>> train_anomaly_scores = state.anomaly_detection.scores.train_data
>>> test_anomaly_scores = state.anomaly_detection.scores.test_data
>>>
>>> # Anomaly score threshold for specified level - see threshold_stds parameter
>>> anomaly_score_threshold = state.anomaly_detection.anomaly_score_threshold

analyze#

autogluon.eda.auto.simple.analyze(train_data: Optional[DataFrame] = None, test_data: Optional[DataFrame] = None, val_data: Optional[DataFrame] = None, model=None, label: Optional[str] = None, state: Union[None, dict, AnalysisState] = None, sample: Union[None, int, float] = 10000, anlz_facets: Optional[List[AbstractAnalysis]] = None, viz_facets: Optional[List[AbstractVisualization]] = None, return_state: bool = False, verbosity: int = 2, **kwargs) Optional[AnalysisState][source]#

This helper creates BaseAnalysis wrapping passed analyses into Sampler if needed, then fits and renders produced state with specified visualizations.

Parameters
  • train_data – training dataset

  • test_data – test dataset

  • val_data – validation dataset

  • model – trained Predictor

  • label (str) – target variable

  • state (Union[None, dict, AnalysisState], default = None) – pass prior state if necessary; the object will be updated during anlz_facets fit call.

  • sample (Union[None, int, float], default = 10000) – sample size; if int, the number of rows to sample; if float, must be between 0.0 and 1.0 and represents the fraction of the dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler()

  • anlz_facets (List[AbstractAnalysis]) – analyses to add to this composite analysis

  • viz_facets (List[AbstractVisualization]) – visualizations to add to this composite analysis

  • return_state (bool, default = False) – return state if True

  • verbosity (int, default = 2,) – Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels).

Return type

state after fit call if return_state is True; None otherwise

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         # Add analysis chain here
>>>     ],
>>>     viz_facets=[
>>>         # Add visualization facets here
>>>     ]
>>> )
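
A hedged sketch of filling the facet lists with concrete components. It assumes eda.dataset.DatasetSummary and viz.dataset.DatasetStatistics are available as dataset-overview building blocks; treat these component names as assumptions and substitute whichever analyses and visualizations you need:

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>>
>>> # Compute a basic dataset summary and render it as a statistics table.
>>> auto.analyze(
>>>     train_data=df_train, label=target_col,
>>>     anlz_facets=[eda.dataset.DatasetSummary()],
>>>     viz_facets=[viz.dataset.DatasetStatistics()],
>>> )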