Components: anomaly#
autogluon.eda.visualization.anomaly#
AnomalyScoresVisualization#
- class autogluon.eda.visualization.anomaly.AnomalyScoresVisualization(threshold_stds: float = 3, headers: bool = False, namespace: Optional[str] = None, fig_args: Optional[Dict[str, Any]] = None, **chart_args)[source]#
Visualize anomaly scores across datasets.
The report depends on AnomalyDetectorAnalysis.
- Parameters
threshold_stds (float, default = 3) – scores more than this many standard deviations from the mean are marked as anomalies
headers (bool, default = False) – if True then render headers
namespace (Optional[str], default = None) – namespace to use; can be nested like ns_a.ns_b.ns_c
fig_args (Optional[Dict[str, Any]], default = None) – kwargs to pass into the visualization component
chart_args – kwargs to pass into the visualization component. The chart combines two scatter plots: normal and anomaly data points; both can be customized by passing additional arguments in chart_args (see the example below).
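The dotted namespace parameter selects a nested scope inside the analysis state. A minimal sketch of that lookup, assuming the state behaves like a nested mapping; the `resolve_namespace` helper and the `state` dict here are hypothetical illustrations, not the library's API:

```python
# Hypothetical nested state; the real AnalysisState behaves like a nested mapping.
state = {"ns_a": {"ns_b": {"ns_c": {"anomaly_detection": {"scores": {}}}}}}

def resolve_namespace(state, namespace):
    """Walk a dotted namespace such as 'ns_a.ns_b.ns_c' down into the nested state."""
    node = state
    for part in namespace.split("."):
        node = node[part]
    return node

scoped = resolve_namespace(state, "ns_a.ns_b.ns_c")
```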
Examples
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df_train = pd.DataFrame(...)
>>> df_test = pd.DataFrame(...)
>>> label = 'target'
>>> threshold_stds = 3  # mark 3 standard deviations score values as anomalies
>>>
>>> chart_args = {
>>>     'normal.color': 'lightgrey',
>>>     'anomaly.color': 'orange',
>>> }
>>>
>>> state = auto.analyze(
>>>     train_data=df_train,
>>>     test_data=df_test,
>>>     label=label,
>>>     return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.ProblemTypeControl(),
>>>         eda.transform.ApplyFeatureGenerator(category_to_numbers=True, children=[
>>>             eda.anomaly.AnomalyDetectorAnalysis(
>>>                 store_explainability_data=True  # Store additional functions for explainability
>>>             ),
>>>         ])
>>>     ],
>>>     viz_facets=[
>>>         viz.anomaly.AnomalyScoresVisualization(
>>>             threshold_stds=threshold_stds,
>>>             headers=True,
>>>             fig_args=dict(figsize=(8, 4)),
>>>             **chart_args,  # pass chart args customizations
>>>         )
>>>     ]
>>> )
>>>
>>> # explain top anomalies
>>> train_anomaly_scores = state.anomaly_detection.scores.train_data
>>> anomaly_idx = train_anomaly_scores[train_anomaly_scores >= train_anomaly_scores.std() * threshold_stds]
>>> anomaly_idx = anomaly_idx.sort_values(ascending=False).index
>>>
>>> auto.explain_rows(
>>>     # Use helper function stored via `store_explainability_data=True`
>>>     **state.anomaly_detection.explain_rows_fns.train_data(anomaly_idx[:3]),
>>>     plot='waterfall',
>>> )
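The threshold_stds marking rule can be sketched on a synthetic scores Series. This is a minimal numpy/pandas illustration of the cut (mean plus threshold_stds standard deviations), not the component's internal code; the scores data is fabricated:

```python
import numpy as np
import pandas as pd

# Fabricated anomaly scores: 50 "normal" values plus one clear outlier.
# In practice these come from state.anomaly_detection.scores.<dataset>.
rng = np.random.default_rng(0)
scores = pd.Series(np.r_[rng.normal(0.1, 0.02, 50), [0.9]])

threshold_stds = 3  # same meaning as the AnomalyScoresVisualization parameter
cutoff = scores.mean() + threshold_stds * scores.std()

# Rows whose score exceeds the cutoff are the ones drawn as anomalies.
is_anomaly = scores > cutoff
anomaly_idx = scores[is_anomaly].sort_values(ascending=False).index
```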
autogluon.eda.analysis.anomaly#
AnomalyDetectorAnalysis#
- class autogluon.eda.analysis.anomaly.AnomalyDetectorAnalysis(n_folds: int = 5, store_explainability_data: bool = False, parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, state: Optional[AnalysisState] = None, **anomaly_detector_kwargs)[source]#
Anomaly detection analysis.
The analysis automatically creates cross-validation splits and fits a detector on each of them using the train_data input. Scores for the training data are computed from out-of-fold predictions. Scores for all other datasets are the average of the scores from the detectors trained on the individual folds (bag).
Please note: the analysis expects the data to be ready for fitting; numeric columns must not contain NaNs. Pre-processing can be performed using ApplyFeatureGenerator and ProblemTypeControl (see the example for more details).
State attributes
- anomaly_detection.scores.<dataset>
scores for each of the datasets passed into analysis (i.e. train_data, test_data)
- anomaly_detection.explain_rows_fns.<dataset>
if store_explainability_data=True, the analysis stores helper functions into this variable. The functions can be used later via explain_rows() and automatically pre-populate the train_data, model, and rows parameters when called (see the example for more details)
- Parameters
n_folds (int, default = 5) – number of folds to use when training detectors
store_explainability_data (bool, default = False) – if True, the analysis stores helper functions into the state. The functions can be used later via explain_rows() and automatically pre-populate the train_data, model, and rows parameters when called (see the example for more details)
parent (Optional[AbstractAnalysis], default = None) – parent Analysis
children (List[AbstractAnalysis], default = []) – wrapped analyses; these will receive sampled args during fit call
state (Optional[AnalysisState], default = None) – state to be updated by this fit function
anomaly_detector_kwargs – kwargs to pass into AnomalyDetector
AnomalyDetector#
- class autogluon.eda.analysis.anomaly.AnomalyDetector(label: str, n_folds: int = 5, detector_list: Optional[List[BaseDetector]] = None, silent: bool = True, **detector_kwargs)[source]#
Wrapper for anomaly detector algorithms.
fit_transform() automatically creates cross-validation splits and fits a detector on each of them. Scores for the training data are computed from out-of-fold predictions.
transform() uses the average of the scores from the detectors trained on the folds.
Please note: the data passed into fit/transform must already be pre-processed; numeric columns must have no NaNs.
- Parameters
label (str) – dataset’s label column name
n_folds (int, default = 5) – number of folds to use when training detectors
detector_list (Optional[List[BaseDetector]], default = None) –
- list of detectors to ensemble. If None, then use the standard list:
  - LOF(n_neighbors=15)
  - LOF(n_neighbors=20)
  - LOF(n_neighbors=25)
  - LOF(n_neighbors=35)
  - COPOD
  - IForest(n_estimators=100)
  - IForest(n_estimators=200)
See pyod documentation for the full model list.
silent (bool, default = True) – Suppress SUOD logs if True
detector_kwargs – kwargs to pass into detector
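The fold scheme described above (fit on k-1 folds, score the held-out fold; average the fold detectors for unseen data) can be sketched with a stand-in detector. This is a minimal numpy illustration of the out-of-fold and bag-averaging idea, not the pyod-backed implementation; ToyDetector and the two helper functions are hypothetical stand-ins:

```python
import numpy as np

class ToyDetector:
    """Stand-in for a pyod BaseDetector: scores rows by distance from the training mean."""
    def fit(self, x):
        self.center_ = x.mean(axis=0)
        return self

    def decision_function(self, x):
        return np.linalg.norm(x - self.center_, axis=1)

def fit_transform(train, n_folds=5):
    """Out-of-fold scores for the training data, plus the per-fold detectors (the 'bag')."""
    folds = np.array_split(np.arange(len(train)), n_folds)
    detectors, scores = [], np.empty(len(train))
    for fold_idx in folds:
        mask = np.ones(len(train), dtype=bool)
        mask[fold_idx] = False
        det = ToyDetector().fit(train[mask])                       # fit on the other folds
        scores[fold_idx] = det.decision_function(train[fold_idx])  # score the held-out rows
        detectors.append(det)
    return scores, detectors

def transform(detectors, x):
    """Scores for any other dataset: average over the bag of fold detectors."""
    return np.mean([d.decision_function(x) for d in detectors], axis=0)
```

The real wrapper ensembles several pyod detectors per fold; the single toy detector here only shows how the splits and the averaging fit together.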
- fit_transform(train_data: DataFrame) Series [source]#
Automatically creates cross-validation splits and fits a detector on each of them. Scores for the training data are computed from out-of-fold predictions.
- Parameters
train_data (pd.DataFrame) – training data; must be already pre-processed; numeric columns must have NaNs filled
- Return type
pd.Series – out-of-fold anomaly scores for the training data
- predict(x)[source]#
API-compatibility wrapper for transform()
- transform(x: DataFrame)[source]#
Predict anomaly scores for the provided inputs. This method uses the average of the scores produced by the detectors trained on the folds.
- Parameters
x (pd.DataFrame) – data to score; must be already pre-processed; numeric columns must have NaNs filled
- Return type
pd.Series – anomaly scores for the passed data