Components: anomaly#

autogluon.eda.visualization.anomaly#

AnomalyScoresVisualization

Visualize anomaly scores across datasets.

AnomalyScoresVisualization#

class autogluon.eda.visualization.anomaly.AnomalyScoresVisualization(threshold_stds: float = 3, headers: bool = False, namespace: Optional[str] = None, fig_args: Optional[Dict[str, Any]] = None, **chart_args)[source]#

Visualize anomaly scores across datasets.

The report depends on AnomalyDetectorAnalysis.

Parameters
  • threshold_stds (float, default = 3) – defines how many standard deviations from the mean a score must be before it is marked as an anomaly

  • headers (bool, default = False) – if True then render headers

  • namespace (Optional[str], default = None) – namespace to use; can be nested like ns_a.ns_b.ns_c

  • fig_args (Optional[Dict[str, Any]], default = None) – kwargs to pass into the visualization component.

  • chart_args – kwargs to pass into the visualization component. The chart combines two scatter plots: normal and anomaly data points; both can be customized by passing additional arguments in chart_args – see the example below.

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df_train = pd.DataFrame(...)
>>> df_test = pd.DataFrame(...)
>>> label = 'target'
>>> threshold_stds = 3  # mark 3 standard deviations score values as anomalies
>>>
>>> chart_args = {
>>>     'normal.color': 'lightgrey',
>>>     'anomaly.color': 'orange',
>>> }
>>>
>>> state = auto.analyze(
>>>     train_data=df_train,
>>>     test_data=df_test,
>>>     label=label,
>>>     return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.ProblemTypeControl(),
>>>         eda.transform.ApplyFeatureGenerator(category_to_numbers=True, children=[
>>>             eda.anomaly.AnomalyDetectorAnalysis(
>>>                 store_explainability_data=True  # Store additional functions for explainability
>>>             ),
>>>         ])
>>>     ],
>>>     viz_facets=[
>>>         viz.anomaly.AnomalyScoresVisualization(
>>>             threshold_stds=threshold_stds,
>>>             headers=True,
>>>             fig_args=dict(figsize=(8, 4)),
>>>             **chart_args, # pass chart args customizations
>>>         )
>>>     ]
>>> )
>>>
>>> # explain top anomalies
>>> train_anomaly_scores = state.anomaly_detection.scores.train_data
>>> anomaly_idx = train_anomaly_scores[train_anomaly_scores >= train_anomaly_scores.std() * threshold_stds]
>>> anomaly_idx = anomaly_idx.sort_values(ascending=False).index
>>>
>>> auto.explain_rows(
>>>     # Use helper function stored via `store_explainability_data=True`
>>>     **state.anomaly_detection.explain_rows_fns.train_data(anomaly_idx[:3]),
>>>     plot='waterfall',
>>> )

autogluon.eda.analysis.anomaly#

AnomalyDetectorAnalysis

Anomaly detection analysis.

AnomalyDetector

Wrapper for anomaly detector algorithms.

AnomalyDetectorAnalysis#

class autogluon.eda.analysis.anomaly.AnomalyDetectorAnalysis(n_folds: int = 5, store_explainability_data: bool = False, parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, state: Optional[AnalysisState] = None, **anomaly_detector_kwargs)[source]#

Anomaly detection analysis.

The analysis automatically creates cross-validation splits and fits a detector on each of them using the train_data input. Scores for the training data come from out-of-fold predictions. Scores for all other datasets are the average of the scores from the detectors trained on the individual folds (a bag).

Please note: the analysis expects the data to be ready for fitting; all numeric columns must be free of NaNs. Pre-processing can be performed using ApplyFeatureGenerator and ProblemTypeControl (see the example for more details).

State attributes

  • anomaly_detection.scores.<dataset>

    scores for each of the datasets passed into analysis (i.e. train_data, test_data)

  • anomaly_detection.explain_rows_fns.<dataset>

    if store_explainability_data=True, the analysis stores helper functions in this variable. These functions can be used later via explain_rows() and automatically pre-populate the train_data, model, and rows parameters when called (see the example for more details)

Parameters
  • n_folds (int, default = 5) – number of folds to use when training detectors

  • store_explainability_data (bool, default = False) – if True, the analysis stores helper functions in the state. These functions can be used later via explain_rows() and automatically pre-populate the train_data, model, and rows parameters when called (see the example for more details)

  • parent (Optional[AbstractAnalysis], default = None) – parent Analysis

  • children (Optional[List[AbstractAnalysis]], default = None) – wrapped analyses; these will receive sampled args during the fit call

  • state (Optional[AnalysisState], default = None) – state to be updated by this fit function

  • anomaly_detector_kwargs – kwargs for AnomalyDetector

AnomalyDetector#

class autogluon.eda.analysis.anomaly.AnomalyDetector(label: str, n_folds: int = 5, detector_list: Optional[List[BaseDetector]] = None, silent: bool = True, **detector_kwargs)[source]#

Wrapper for anomaly detector algorithms.

fit_transform() automatically creates cross-validation splits and fits a detector on each of them. Scores for the training data come from out-of-fold predictions.

transform() uses the average of the scores from the detectors trained on the folds.

Please note: the data passed into fit/transform must already be pre-processed; numeric columns must have no NaNs.

Parameters
  • label (str) – dataset’s label column name

  • n_folds (int, default = 5) – number of folds to use when training detectors

  • detector_list (Optional[List[BaseDetector]], default = None) –

    list of detectors to ensemble. If None, the following standard list is used:
    • LOF(n_neighbors=15)

    • LOF(n_neighbors=20)

    • LOF(n_neighbors=25)

    • LOF(n_neighbors=35)

    • COPOD

    • IForest(n_estimators=100)

    • IForest(n_estimators=200)

    See the pyod documentation for the full model list.

  • silent (bool, default = True) – Suppress SUOD logs if True

  • detector_kwargs – kwargs to pass into detector
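The fold-based scoring scheme described above can be sketched roughly as follows. This is an illustrative approximation, not the library's implementation: it substitutes scikit-learn's IsolationForest for the pyod detector ensemble and omits label-column handling and the SUOD wrapper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
train_data = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

detectors = []
oof_scores = pd.Series(np.nan, index=train_data.index)
for fit_idx, val_idx in KFold(n_splits=5).split(train_data):
    det = IsolationForest(n_estimators=100, random_state=0)
    det.fit(train_data.iloc[fit_idx])
    # out-of-fold: each training row is scored by a detector that never saw it
    oof_scores.iloc[val_idx] = -det.score_samples(train_data.iloc[val_idx])
    detectors.append(det)

def transform(x: pd.DataFrame) -> pd.Series:
    # other datasets: average the scores from all fold detectors (a bag)
    return pd.Series(
        np.mean([-d.score_samples(x) for d in detectors], axis=0), index=x.index
    )
```

fit_transform() corresponds to the loop producing oof_scores; transform() corresponds to the bagged averaging function.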

fit_transform(train_data: DataFrame) → Series[source]#

Automatically creates cross-validation splits and fits a detector on each of them. Scores for the training data come from out-of-fold predictions.

Parameters

train_data (pd.DataFrame) – training data; must already be pre-processed; numeric columns must have NaNs filled

Return type

out-of-folds anomaly scores for the training data

predict(x)[source]#

API-compatibility wrapper for transform()

transform(x: DataFrame)[source]#

Predict anomaly scores for the provided inputs. This method uses the average of the scores produced by all the detectors trained on the folds.

Parameters

x (pd.DataFrame) – data to score; must already be pre-processed; numeric columns must have NaNs filled

Return type

anomaly scores for the passed data
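The returned scores can be thresholded with the same convention used elsewhere in this module (score >= scores.std() * threshold_stds). The scores below are simulated stand-ins; in practice they would come from fit_transform() or transform():

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# simulated anomaly scores standing in for AnomalyDetector output
scores = pd.Series(rng.normal(loc=1.0, scale=0.1, size=200))
scores.iloc[:3] = 5.0  # inject a few obvious anomalies

threshold_stds = 3
threshold = scores.std() * threshold_stds
# rows at or above the threshold, most anomalous first
anomaly_idx = scores[scores >= threshold].sort_values(ascending=False).index
```

The resulting anomaly_idx can then be fed to explain_rows() as shown in the AnomalyScoresVisualization example.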