Reference: Auto components¶
This section is a reference for high-level composite components showcased in sections above.
autogluon.eda.auto¶
- dataset_overview: Shortcut to perform high-level datasets summary overview (counts, frequencies, missing statistics, types info).
- target_analysis: Target variable composite analysis.
- quick_fit: This helper performs quick model fit analysis and then produces a composite report of the results.
- missing_values_analysis: Perform quick analysis of missing values across datasets.
- covariate_shift_detection: Shortcut for covariate shift detection analysis.
- analyze_interaction: This helper performs simple feature interaction analysis.
- analyze: This helper creates BaseAnalysis wrapping passed analyses into Sampler if needed, then fits and renders produced state with specified visualizations.
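All of these shortcuts share the same calling pattern: import autogluon.eda.auto, pass one or more pandas DataFrames, and optionally request the computed state back via return_state. A minimal sketch, assuming df_train is a pandas DataFrame loaded elsewhere:
>>> import autogluon.eda.auto as auto
>>>
>>> # render the overview report inline (e.g. in a notebook)
>>> auto.dataset_overview(train_data=df_train)
>>>
>>> # or capture the computed state for later inspection
>>> state = auto.dataset_overview(train_data=df_train, return_state=True)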
dataset_overview¶
autogluon.eda.auto.simple.dataset_overview(train_data: Optional[pandas.core.frame.DataFrame] = None, test_data: Optional[pandas.core.frame.DataFrame] = None, val_data: Optional[pandas.core.frame.DataFrame] = None, label: Optional[str] = None, state: Union[None, dict, autogluon.eda.state.AnalysisState] = None, return_state: bool = False, sample: Union[None, int, float] = None, fig_args: Optional[Dict[str, Dict[str, Any]]] = None, chart_args: Optional[Dict[str, Dict[str, Any]]] = None)[source]¶
Shortcut to perform high-level datasets summary overview (counts, frequencies, missing statistics, types info).
- Supported fig_args/chart_args keys:
feature_distance - feature distance dendrogram chart
- Parameters
- train_data: Optional[DataFrame], default = None
training dataset
- test_data: Optional[DataFrame], default = None
test dataset
- val_data: Optional[DataFrame], default = None
validation dataset
- label: Optional[str], default = None
target variable
- state: Union[None, dict, AnalysisState], default = None
pass prior state if necessary; the object will be updated during anlz_facets fit call.
- return_state: bool, default = False
return state if True
- sample: Union[None, int, float], default = None
sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler().
- fig_args: Optional[Dict[str, Dict[str, Any]]], default = None
figure args for visualizations; key == component; value = dict of kwargs for component figure
- chart_args: Optional[Dict[str, Dict[str, Any]]], default = None
chart args for visualizations; key == component; value = dict of kwargs for component chart
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> auto.dataset_overview(
>>>     train_data=df_train, test_data=df_test, label=target_col,
>>>     chart_args={'feature_distance': dict(orientation='left')},
>>>     fig_args={'feature_distance': dict(figsize=(6,6))},
>>> )
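Large datasets can be downsampled before the overview is computed via the sample parameter, which is handled by autogluon.eda.analysis.dataset.Sampler; a short sketch under the same assumptions as the example above:
>>> # analyze a random subset of 10,000 rows
>>> auto.dataset_overview(train_data=df_train, test_data=df_test, label=target_col, sample=10000)
>>>
>>> # or a 10% fraction of the rows
>>> auto.dataset_overview(train_data=df_train, test_data=df_test, label=target_col, sample=0.1)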
target_analysis¶
autogluon.eda.auto.simple.target_analysis(train_data: pandas.core.frame.DataFrame, label: str, test_data: Optional[pandas.core.frame.DataFrame] = None, problem_type: str = 'auto', fit_distributions: Union[bool, str, List[str]] = True, sample: Union[None, int, float] = None, state: Union[None, dict, autogluon.eda.state.AnalysisState] = None, return_state: bool = False) → Optional[autogluon.eda.state.AnalysisState][source]¶
Target variable composite analysis.
- Performs the following analysis components of the label field:
basic summary stats
feature values distribution charts; adds fitted distributions for numeric targets
target correlations analysis; with interaction charts of target vs high-correlated features
- Parameters
- train_data: DataFrame
training dataset
- test_data: Optional[DataFrame], default = None
test dataset
- label: str
target variable
- problem_type: str, default = ‘auto’
problem type to use. Valid problem_type values include [‘auto’, ‘binary’, ‘multiclass’, ‘regression’, ‘quantile’, ‘softclass’]; ‘auto’ means the problem type will be auto-detected using AutoGluon methods.
- fit_distributions: Union[bool, str, List[str]], default = True
If True, or if a list of distributions is provided, then fit distributions. Performed only if y and hue are not present.
- state: Union[None, dict, AnalysisState], default = None
pass prior state if necessary; the object will be updated during anlz_facets fit call.
- sample: Union[None, int, float], default = None
sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler().
- return_state: bool, default = False
return state if True
- Returns
- state after fit call if return_state is True; None otherwise
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> auto.target_analysis(train_data=..., label=...)
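Since fit_distributions also accepts a name or a list of names, the fitting can be restricted to specific candidates; the distribution names below ('norm', 'lognorm') are an assumption, following scipy.stats naming:
>>> # fit only selected candidate distributions to a numeric target
>>> auto.target_analysis(train_data=df_train, label=target_col, fit_distributions=['norm', 'lognorm'])
>>>
>>> # skip distribution fitting entirely
>>> auto.target_analysis(train_data=df_train, label=target_col, fit_distributions=False)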
quick_fit¶
autogluon.eda.auto.simple.quick_fit(train_data: pandas.core.frame.DataFrame, label: str, path: Optional[str] = None, val_size: float = 0.3, problem_type: str = 'auto', sample: Union[None, int, float] = None, state: Union[None, dict, autogluon.eda.state.AnalysisState] = None, return_state: bool = False, verbosity: int = 0, show_feature_importance_barplots: bool = False, fig_args: Optional[Dict[str, Dict[str, Any]]] = None, chart_args: Optional[Dict[str, Dict[str, Any]]] = None, **fit_args)[source]¶
This helper performs quick model fit analysis and then produces a composite report of the results.
- The analysis is structured in a sequence of operations:
Sample if sample is specified.
Perform train-test split using the val_size ratio.
Fit an AutoGluon estimator given fit_args; if hyperparameters are not present in the args, then default ones are used (Random Forest by default, because it is interpretable).
Display the report.
- The reports include:
confusion matrix for classification problems; predictions vs actual for regression problems
model leaderboard
feature importance
samples with the highest prediction error - candidates for inspection
samples with the least distance from the other class - candidates for labeling
- Supported fig_args/chart_args keys:
confusion_matrix - confusion matrix chart for classification predictor
regression_eval - regression predictor results chart
feature_importance - feature importance barplot chart
- Parameters
- train_data: DataFrame
training dataset
- label: str
target variable
- path: Optional[str], default = None
path to save the fitted models
- problem_type: str, default = ‘auto’
problem type to use. Valid problem_type values include [‘auto’, ‘binary’, ‘multiclass’, ‘regression’, ‘quantile’, ‘softclass’]; ‘auto’ means the problem type will be auto-detected using AutoGluon methods.
- sample: Union[None, int, float], default = None
sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler().
- val_size: float, default = 0.3
fraction of training set to be assigned as validation set during the split.
- state: Union[None, dict, AnalysisState], default = None
pass prior state if necessary; the object will be updated during anlz_facets fit call.
- return_state: bool, default = False
return state if True
- verbosity: int, default = 0
Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels).
- show_feature_importance_barplots: bool, default = False
if True, then a barplot chart will be added with the feature importance visualization
- fit_args
kwargs to pass into TabularPredictor fit
- fig_args: Optional[Dict[str, Dict[str, Any]]], default = None
figure args for visualizations; key == component; value = dict of kwargs for component figure
- chart_args: Optional[Dict[str, Dict[str, Any]]], default = None
chart args for visualizations; key == component; value = dict of kwargs for component chart
- Returns
- state after fit call if return_state is True; None otherwise
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> # Quick fit
>>> state = auto.quick_fit(
>>>     train_data=..., label=...,
>>>     return_state=True,  # return state object from call
>>>     save_model_to_state=True,  # store fitted model into the state
>>>     hyperparameters={'GBM': {}}  # train specific model
>>> )
>>>
>>> # Using quick fit model
>>> model = state.model
>>> y_pred = model.predict(test_data)
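Because **fit_args is forwarded to the predictor's fit call, standard fit options can be passed straight through; a hedged sketch assuming the usual TabularPredictor.fit signature (time_limit), with hypothetical path and data values:
>>> # cap training time and keep fitted models under a custom path
>>> auto.quick_fit(
>>>     train_data=df_train, label=target_col,
>>>     path='./quick_fit_models',  # where fitted models are saved
>>>     val_size=0.2,               # hold out 20% of train_data for validation
>>>     time_limit=60,              # forwarded to the predictor's fit call
>>> )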
missing_values_analysis¶
autogluon.eda.auto.simple.missing_values_analysis(graph_type: str = 'matrix', train_data: Optional[pandas.core.frame.DataFrame] = None, test_data: Optional[pandas.core.frame.DataFrame] = None, val_data: Optional[pandas.core.frame.DataFrame] = None, state: Union[None, dict, autogluon.eda.state.AnalysisState] = None, return_state: bool = False, sample: Union[None, int, float] = None, **chart_args)[source]¶
Perform quick analysis of missing values across datasets.
- Parameters
- graph_type: str, default = ‘matrix’
One of the following visualization types:
matrix - nullity matrix is a data-dense display which lets you quickly visually pick out patterns in data completion. This visualization will comfortably accommodate up to 50 labelled variables; past that range labels begin to overlap or become unreadable, and by default large displays omit them.
bar - visualizes how many rows are non-null vs null in each column. A logarithmic scale can be enabled by specifying log=True in kwargs.
heatmap - correlation heatmap measures nullity correlation: how strongly the presence or absence of one variable affects the presence of another. Nullity correlation ranges from -1 (if one variable appears the other definitely does not) to 0 (variables appearing or not appearing have no effect on one another) to 1 (if one variable appears the other definitely also does). Entries marked <1 or >-1 have a correlation that is close to being exactingly negative or positive but is still not quite perfectly so.
dendrogram - the dendrogram allows you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heatmap. It uses a hierarchical clustering algorithm (courtesy of scipy) to bin variables against one another by their nullity correlation (measured in terms of binary distance). At each step of the tree the variables are split up based on which combination minimizes the distance of the remaining clusters. The more monotone the set of variables, the closer their total distance is to zero, and the closer their average distance (the y-axis) is to zero.
- train_data: Optional[DataFrame]
training dataset
- test_data: Optional[DataFrame], default = None
test dataset
- val_data: Optional[DataFrame], default = None
validation dataset
- state: Union[None, dict, AnalysisState], default = None
pass prior state if necessary; the object will be updated during anlz_facets fit call.
- return_state: bool, default = False
return state if True
- sample: Union[None, int, float], default = None
sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler().
- Returns
- state after fit call if return_state is True; None otherwise
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> auto.missing_values_analysis(train_data=...)
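The other graph_type values described above are selected the same way; log=True is forwarded through **chart_args, as noted in the bar description (df_train assumed to be loaded elsewhere):
>>> # nullity correlation heatmap
>>> auto.missing_values_analysis(graph_type='heatmap', train_data=df_train)
>>>
>>> # non-null vs null bar chart on a logarithmic scale
>>> auto.missing_values_analysis(graph_type='bar', train_data=df_train, log=True)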
covariate_shift_detection¶
autogluon.eda.auto.simple.covariate_shift_detection(train_data: pandas.core.frame.DataFrame, test_data: pandas.core.frame.DataFrame, label: str, sample: Union[None, int, float] = None, path: Optional[str] = None, state: Union[None, dict, autogluon.eda.state.AnalysisState] = None, return_state: bool = False, verbosity: int = 0, **fit_args)[source]¶
Shortcut for covariate shift detection analysis.
Detects a change in covariate (X) distribution between training and test, which we call XShift. It can tell you if your training set is not representative of your test set distribution. This is done with a Classifier 2 Sample Test.
- Parameters
- train_data: DataFrame
training dataset
- test_data: DataFrame
test dataset
- label: str
target variable
- state: Union[None, dict, AnalysisState], default = None
pass prior state if necessary; the object will be updated during anlz_facets fit call.
- sample: Union[None, int, float], default = None
sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler().
- path: Optional[str], default = None
path to save the fitted models
- return_state: bool, default = False
return state if True
- verbosity: int, default = 0
Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels).
- fit_args
kwargs to pass into TabularPredictor fit
- Returns
- state after fit call if return_state is True; None otherwise
Examples
>>> import autogluon.eda.auto as auto
>>>
>>> # use default settings
>>> auto.covariate_shift_detection(train_data=..., test_data=..., label=...)
>>>
>>> # customize classifier and verbosity level
>>> auto.covariate_shift_detection(train_data=..., test_data=..., label=..., verbosity=2, hyperparameters={'GBM': {}})
analyze_interaction¶
autogluon.eda.auto.simple.analyze_interaction(x: Optional[str] = None, y: Optional[str] = None, hue: Optional[str] = None, fit_distributions: Union[bool, str, List[str]] = False, fig_args: Optional[Dict[str, Any]] = None, chart_args: Optional[Dict[str, Any]] = None, **analysis_args)[source]¶
This helper performs simple feature interaction analysis.
- Parameters
- x: Optional[str], default = None
name of the feature to plot on the x axis
- y: Optional[str], default = None
name of the feature to plot on the y axis
- hue: Optional[str], default = None
name of the feature to use for color encoding
- fit_distributions: Union[bool, str, List[str]], default = False,
If True, or list of distributions is provided, then fit distributions. Performed only if y and hue are not present.
- chart_args: Optional[dict], default = None
kwargs to pass into visualization component
- fig_args: Optional[Dict[str, Any]], default = None
kwargs to pass into chart figure
Examples
>>> import pandas as pd
>>> import autogluon.eda.auto as auto
>>>
>>> df_train = pd.DataFrame(...)
>>>
>>> auto.analyze_interaction(x='Age', hue='Survived', train_data=df_train, chart_args=dict(headers=True, alpha=0.2))
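Passing both x and y analyzes a two-variable interaction, and a single variable can be paired with distribution fitting (which, per the parameter notes, only happens when y and hue are absent); the column names here are hypothetical:
>>> # numeric-vs-numeric interaction
>>> auto.analyze_interaction(x='Fare', y='Age', train_data=df_train)
>>>
>>> # single variable with fitted distributions
>>> auto.analyze_interaction(x='Fare', train_data=df_train, fit_distributions=True)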
analyze¶
autogluon.eda.auto.simple.analyze(train_data=None, test_data=None, val_data=None, model=None, label: Optional[str] = None, state: Union[None, dict, autogluon.eda.state.AnalysisState] = None, sample: Union[None, int, float] = None, anlz_facets: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, viz_facets: Optional[List[autogluon.eda.visualization.base.AbstractVisualization]] = None, return_state: bool = False, verbosity: int = 2) → Optional[autogluon.eda.state.AnalysisState][source]¶
This helper creates BaseAnalysis wrapping passed analyses into Sampler if needed, then fits and renders produced state with specified visualizations.
- Parameters
- train_data
training dataset
- test_data
test dataset
- val_data
validation dataset
- model
trained Predictor
- label: Optional[str], default = None
target variable
- state: Union[None, dict, AnalysisState], default = None
pass prior state if necessary; the object will be updated during anlz_facets fit call.
- sample: Union[None, int, float], default = None
sample size; if int, then row number is used; float must be between 0.0 and 1.0 and represents fraction of dataset to sample; None means no sampling. See also autogluon.eda.analysis.dataset.Sampler().
- anlz_facets: List[AbstractAnalysis]
analyses to add to this composite analysis
- viz_facets: List[AbstractVisualization]
visualizations to add to this composite analysis
- return_state: bool, default = False
return state if True
- verbosity: int, default = 2
Verbosity levels range from 0 to 4 and control how much information is printed. Higher levels correspond to more detailed print statements (you can set verbosity = 0 to suppress warnings). If using logging, you can alternatively control amount of information printed via logger.setLevel(L), where L ranges from 0 to 50 (Note: higher values of L correspond to fewer print statements, opposite of verbosity levels).
- Returns
- state after fit call if return_state is True; None otherwise
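Examples
The sketch below composes one analysis facet with one visualization facet; the Correlation and CorrelationVisualization component names are assumptions drawn from the autogluon.eda.analysis and autogluon.eda.visualization namespaces, not guaranteed by this reference.
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>>
>>> # compute correlations on a sample and render them
>>> state = auto.analyze(
>>>     train_data=df_train, label=target_col,
>>>     sample=10000,  # inputs are wrapped into a Sampler
>>>     return_state=True,
>>>     anlz_facets=[eda.interaction.Correlation()],
>>>     viz_facets=[viz.interaction.CorrelationVisualization()],
>>> )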