Components: anomaly#

autogluon.eda.visualization.anomaly#

AnomalyScoresVisualization

Visualize anomaly scores across datasets.

AnomalyScoresVisualization#

class autogluon.eda.visualization.anomaly.AnomalyScoresVisualization(threshold_stds: float = 3, headers: bool = False, namespace: Optional[str] = None, fig_args: Optional[Dict[str, Any]] = None, **chart_args)[source]#

Visualize anomaly scores across datasets.

The report depends on AnomalyDetectorAnalysis.

Parameters
  • threshold_stds (float, default = 3) – defines how many standard deviations from the mean a score must be before it is marked as an anomaly

  • headers (bool, default = False) – if True then render headers

  • namespace (Optional[str], default = None) – namespace to use; can be nested like ns_a.ns_b.ns_c

  • fig_args (Optional[Dict[str, Any]], default = None) – kwargs to pass into the visualization component.

  • chart_args – kwargs to pass into the visualization component. The chart combines two scatter plots: normal and anomaly data points; both can be customized by passing additional arguments in chart_args – see the example below.

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df_train = pd.DataFrame(...)
>>> df_test = pd.DataFrame(...)
>>> label = 'target'
>>> threshold_stds = 3  # mark 3 standard deviations score values as anomalies
>>>
>>> chart_args = {
>>>     'normal.color': 'lightgrey',
>>>     'anomaly.color': 'orange',
>>> }
>>>
>>> state = auto.analyze(
>>>     train_data=df_train,
>>>     test_data=df_test,
>>>     label=label,
>>>     return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.ProblemTypeControl(),
>>>         eda.transform.ApplyFeatureGenerator(category_to_numbers=True, children=[
>>>             eda.anomaly.AnomalyDetectorAnalysis(
>>>                 store_explainability_data=True  # Store additional functions for explainability
>>>             ),
>>>         ])
>>>     ],
>>>     viz_facets=[
>>>         viz.anomaly.AnomalyScoresVisualization(
>>>             threshold_stds=threshold_stds,
>>>             headers=True,
>>>             fig_args=dict(figsize=(8, 4)),
>>>             **chart_args, # pass chart args customizations
>>>         )
>>>     ]
>>> )
>>>
>>> # explain top anomalies
>>> train_anomaly_scores = state.anomaly_detection.scores.train_data
>>> anomaly_idx = train_anomaly_scores[train_anomaly_scores >= train_anomaly_scores.std() * threshold_stds]
>>> anomaly_idx = anomaly_idx.sort_values(ascending=False).index
>>>
>>> auto.explain_rows(
>>>     # Use helper function stored via `store_explainability_data=True`
>>>     **state.anomaly_detection.explain_rows_fns.train_data(anomaly_idx[:3]),
>>>     plot='waterfall',
>>> )

autogluon.eda.analysis.anomaly#

AnomalyDetectorAnalysis

Anomaly detection analysis.

AnomalyDetector

Wrapper for anomaly detector algorithms.

AnomalyDetectorAnalysis#

class autogluon.eda.analysis.anomaly.AnomalyDetectorAnalysis(n_folds: int = 5, store_explainability_data: bool = False, parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, state: Optional[AnalysisState] = None, **anomaly_detector_kwargs)[source]#

Anomaly detection analysis.

The analysis automatically creates cross-validation splits and fits a detector on each of them using the train_data input. Scores for the training data come from out-of-fold predictions. Scores for all other datasets are the average of the scores from the detectors trained on the individual folds (a bag).

Please note: the analysis expects the data to be ready for fitting; all numeric columns must be free of NaNs. Pre-processing can be performed using ApplyFeatureGenerator and ProblemTypeControl (see the example for more details).

State attributes

  • anomaly_detection.scores.<dataset>

    scores for each of the datasets passed into analysis (i.e. train_data, test_data)

  • anomaly_detection.explain_rows_fns.<dataset>

    if store_explainability_data=True, the analysis stores helper functions in this variable. These functions can be used later via explain_rows() and automatically pre-populate the train_data, model, and rows parameters when called (see the example for more details)

Parameters
  • n_folds (int, default = 5) – number of folds to use when training detectors

  • store_explainability_data (bool, default = False) – if True, the analysis stores helper functions in the state. These functions can be used later via explain_rows() and automatically pre-populate the train_data, model, and rows parameters when called (see the example for more details)

  • parent (Optional[AbstractAnalysis], default = None) – parent Analysis

  • children (Optional[List[AbstractAnalysis]], default = None) – wrapped analyses; these will receive sampled args during the fit call

  • state (Optional[AnalysisState], default = None) – state to be updated by this fit function

  • anomaly_detector_kwargs – kwargs for AnomalyDetector

AnomalyDetector#

class autogluon.eda.analysis.anomaly.AnomalyDetector(label: str, n_folds: int = 5, detector_list: Optional[List[BaseDetector]] = None, silent: bool = True, **detector_kwargs)[source]#

Wrapper for anomaly detector algorithms.

fit_transform() automatically creates cross-validation splits and fits a detector on each of them. Scores for the training data come from out-of-fold predictions.

transform() uses the average of the scores from the detectors trained on the folds.

Please note: the data passed into fit/transform must already be pre-processed; numeric columns must have no NaNs.

Parameters
  • label (str) – dataset’s label column name

  • n_folds (int, default = 5) – number of folds to use when training detectors

  • detector_list (Optional[List[BaseDetector]], default = None) –

    list of detectors to ensemble. If None, the following standard list is used:
    • LOF(n_neighbors=15)

    • LOF(n_neighbors=20)

    • LOF(n_neighbors=25)

    • LOF(n_neighbors=35)

    • COPOD

    • IForest(n_estimators=100)

    • IForest(n_estimators=200)

    See the pyod documentation for the full model list.

  • silent (bool, default = True) – Suppress SUOD logs if True

  • detector_kwargs – kwargs to pass into detector
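The fold-based scoring scheme described above can be sketched roughly as follows. This is an illustrative approximation, not the library's implementation: it substitutes scikit-learn's IsolationForest for the pyod detector ensemble and omits label-column handling and the SUOD wrapper.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
train_data = pd.DataFrame(rng.normal(size=(100, 3)), columns=["a", "b", "c"])

detectors = []
oof_scores = pd.Series(np.nan, index=train_data.index)
for fit_idx, val_idx in KFold(n_splits=5).split(train_data):
    det = IsolationForest(n_estimators=100, random_state=0)
    det.fit(train_data.iloc[fit_idx])
    # out-of-fold: each training row is scored by a detector that never saw it
    oof_scores.iloc[val_idx] = -det.score_samples(train_data.iloc[val_idx])
    detectors.append(det)

def transform(x: pd.DataFrame) -> pd.Series:
    # other datasets: average the scores from all fold detectors (a bag)
    return pd.Series(
        np.mean([-d.score_samples(x) for d in detectors], axis=0), index=x.index
    )
```

fit_transform() corresponds to the loop producing oof_scores; transform() corresponds to the bagged averaging function.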

fit_transform(train_data: DataFrame) → Series[source]#

Automatically creates cross-validation splits and fits a detector on each of them. Scores for the training data come from out-of-fold predictions.

Parameters

train_data (pd.DataFrame) – training data; must already be pre-processed; numeric columns must have NaNs filled

Return type

out-of-folds anomaly scores for the training data

predict(x)[source]#

API-compatibility wrapper for transform()

transform(x: DataFrame)[source]#

Predict anomaly scores for the provided inputs. This method uses the average of the scores produced by all the detectors trained on the folds.

Parameters

x (pd.DataFrame) – data to score; must already be pre-processed; numeric columns must have NaNs filled

Return type

anomaly scores for the passed data
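The returned scores can be thresholded with the same convention used elsewhere in this module (score >= scores.std() * threshold_stds). The scores below are simulated stand-ins; in practice they would come from fit_transform() or transform():

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# simulated anomaly scores standing in for AnomalyDetector output
scores = pd.Series(rng.normal(loc=1.0, scale=0.1, size=200))
scores.iloc[:3] = 5.0  # inject a few obvious anomalies

threshold_stds = 3
threshold = scores.std() * threshold_stds
# rows at or above the threshold, most anomalous first
anomaly_idx = scores[scores >= threshold].sort_values(ascending=False).index
```

The resulting anomaly_idx can then be fed to explain_rows() as shown in the AnomalyScoresVisualization example.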