Components: dataset
autogluon.eda.visualization.dataset
DatasetStatistics | Display aggregate dataset statistics and dataset-level information.
DatasetTypeMismatch | Display mismatch between raw types between datasets provided.
LabelInsightsVisualization | Render label insights performed by LabelInsightsAnalysis.
DatasetStatistics
class autogluon.eda.visualization.dataset.DatasetStatistics(headers: bool = False, namespace: Optional[str] = None, sort_by: Optional[str] = None, sort_asc: bool = True, **kwargs)
Display aggregate dataset statistics and dataset-level information.
The report is a composite view of the analyses performed: DatasetSummary, RawTypesAnalysis, VariableTypeAnalysis, SpecialTypesAnalysis, MissingValuesAnalysis. The components can be present in any combination (assuming their dependencies are satisfied). The report requires at least one of these analyses to be present in order to render.
- Parameters
- headers: bool, default = False
if True then render headers
- namespace: str, default = None
namespace to use; can be nested like ns_a.ns_b.ns_c
- sort_by: Optional[str], default = None
column to sort the resulting table
- sort_asc: bool, default = True
if sort_by is provided, whether to sort in ascending (True) or descending (False) order
Examples
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.DatasetSummary(),
>>>         eda.dataset.RawTypesAnalysis(),
>>>         eda.dataset.VariableTypeAnalysis(),
>>>         eda.dataset.SpecialTypesAnalysis(),
>>>         eda.missing.MissingValuesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )
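The namespace parameter accepts a dotted path such as ns_a.ns_b.ns_c. As a rough mental model only, the sketch below resolves such a path against nested dictionaries. The `resolve_namespace` helper and the plain-dict state layout are hypothetical and are not AutoGluon's implementation.

```python
from typing import Any, Dict, Optional


def resolve_namespace(state: Dict[str, Any], namespace: Optional[str]) -> Dict[str, Any]:
    """Walk a dotted namespace path through nested dicts (hypothetical helper)."""
    if namespace is None:
        return state  # no namespace: read from the top level
    node = state
    for key in namespace.split("."):
        node = node[key]  # raises KeyError if a level is missing
    return node


# Hypothetical state layout, for illustration only
state = {"ns_a": {"ns_b": {"ns_c": {"dataset_stats": {"rows": 100}}}}}
print(resolve_namespace(state, "ns_a.ns_b.ns_c"))
```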
DatasetTypeMismatch
class autogluon.eda.visualization.dataset.DatasetTypeMismatch(headers: bool = False, namespace: Optional[str] = None, **kwargs)
Display mismatches in raw types between the datasets provided. If a mismatch is found, the row is marked with a warning.
The report requires the RawTypesAnalysis analysis to be present.
- Parameters
- headers: bool, default = False
if True then render headers
- namespace: str, default = None
namespace to use; can be nested like ns_a.ns_b.ns_c
Examples
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(
>>>     train_data=..., test_data=...,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetTypeMismatch()
>>>     ]
>>> )
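To make the idea of a raw-type mismatch concrete, here is a minimal standalone sketch that compares per-dataset column types and reports the columns that disagree. The `find_type_mismatches` function, the input layout and the type names are assumptions for illustration; the component itself works from the output of RawTypesAnalysis.

```python
from typing import Dict, Optional


def find_type_mismatches(
    raw_types: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, Optional[str]]]:
    """Return columns whose raw type differs between datasets.

    raw_types maps dataset name -> {column name: raw type name}.
    """
    columns = sorted({col for types in raw_types.values() for col in types})
    mismatches = {}
    for col in columns:
        per_dataset = {name: types.get(col) for name, types in raw_types.items()}
        # Ignore datasets where the column is absent; compare the rest
        present = {t for t in per_dataset.values() if t is not None}
        if len(present) > 1:
            mismatches[col] = per_dataset
    return mismatches


raw_types = {
    "train_data": {"age": "int", "fare": "float", "name": "object"},
    "test_data": {"age": "float", "fare": "float", "name": "object"},
}
print(find_type_mismatches(raw_types))
```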
LabelInsightsVisualization
class autogluon.eda.visualization.dataset.LabelInsightsVisualization(headers: bool = False, namespace: Optional[str] = None, **kwargs)
Render label insights performed by LabelInsightsAnalysis.
The following insights can be rendered:
classification: low cardinality classes detection
classification: classes present in test data, but not in the train data
regression: out-of-domain labels detection
- Parameters
- headers: bool, default = False
if True then render headers
- namespace: str, default = None
namespace to use; can be nested like ns_a.ns_b.ns_c
Examples
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(
>>>     train_data=..., test_data=..., label=...,
>>>     anlz_facets=[
>>>         eda.dataset.ProblemTypeControl(),
>>>         eda.dataset.LabelInsightsAnalysis(low_cardinality_classes_threshold=50, regression_ood_threshold=0.01),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.LabelInsightsVisualization()
>>>     ]
>>> )
autogluon.eda.analysis.dataset
Sampler | Sampler is a wrapper that provides sampling capabilities for the wrapped analyses.
TrainValidationSplit | This wrapper splits train_data into training and validation sets stored in train_data and val_data for the wrapped analyses.
ProblemTypeControl | Helper component to control problem type.
RawTypesAnalysis | Infers autogluon raw types for the column.
VariableTypeAnalysis | Infers variable types for the column: numeric vs category.
SpecialTypesAnalysis | Infers autogluon special types for the column (e.g. text).
DatasetSummary | Generates dataset summary including counts, number of unique elements, most frequent, dtypes and 7-figure summary (std/mean/min/max/quartiles).
LabelInsightsAnalysis | Analyze label for insights.
Sampler
class autogluon.eda.analysis.dataset.Sampler(sample: Union[None, int, float] = None, parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, **kwargs)
Sampler is a wrapper that provides sampling capabilities for the wrapped analyses. Sampling is performed on all datasets in args; the sampled datasets are passed to all children during the fit call, shadowing the outer parameters.
- Parameters
- sample: Union[None, int, float], default = None
sample size; an int is the number of rows to sample; a float must be between 0.0 and 1.0 and is the fraction of the dataset to sample; None means no sampling
- parent: Optional[AbstractAnalysis], default = None
parent Analysis
- children: Optional[List[AbstractAnalysis]], default = None
wrapped analyses; these will receive sampled args during fit call
Examples
>>> from autogluon.eda.analysis.base import BaseAnalysis
>>> from autogluon.eda.analysis import Sampler
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df_train = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
>>> df_test = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('EFGH'))
>>> analysis = BaseAnalysis(train_data=df_train, test_data=df_test, children=[
>>>     Sampler(sample=5, children=[
>>>         # Analysis here will be performed on a sample of 5 for both train_data and test_data
>>>     ])
>>> ])
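The three forms of the sample parameter (int, float, None) can be sketched in plain Python as follows. This is an illustration of the documented semantics, not the library's code: `sample_rows`, the `seed` argument and the choice to cap an int at the dataset size are all assumptions.

```python
import random
from typing import List, Union


def sample_rows(rows: List[dict], sample: Union[None, int, float] = None, seed: int = 0) -> List[dict]:
    """Return a random sample of rows following the int/float/None convention."""
    if sample is None:
        return rows  # None means no sampling
    if isinstance(sample, float):
        if not 0.0 <= sample <= 1.0:
            raise ValueError("a float sample must be between 0.0 and 1.0")
        n = int(len(rows) * sample)  # fraction of the dataset
    else:
        n = min(sample, len(rows))  # number of rows; capping is an assumption
    return random.Random(seed).sample(rows, n)


rows = [{"id": i} for i in range(20)]
print(len(sample_rows(rows, 5)), len(sample_rows(rows, 0.5)), len(sample_rows(rows)))
```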
TrainValidationSplit
class autogluon.eda.analysis.dataset.TrainValidationSplit(val_size: float = 0.3, parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, **kwargs)
This wrapper splits train_data into training and validation sets stored in train_data and val_data for the wrapped analyses. The split is performed on the datasets in args; the results are passed to all children during the fit call, shadowing the outer parameters.
This component requires ProblemTypeControl to be present in the analysis call to set problem_type.
- Parameters
- val_size: float, default = 0.3
fraction of training set to be assigned as validation set during the split.
- parent: Optional[AbstractAnalysis], default = None
parent Analysis
- children: Optional[List[AbstractAnalysis]], default = None
wrapped analyses; these will receive sampled args during fit call
- kwargs
See also
ProblemTypeControl
Examples
>>> from autogluon.eda.analysis.base import BaseAnalysis, Namespace
>>> from autogluon.eda.analysis.dataset import ProblemTypeControl, TrainValidationSplit
>>> import pandas as pd
>>> import numpy as np
>>>
>>> # SomeAnalysis below is a placeholder for any analysis component
>>> df_train = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
>>> analysis = BaseAnalysis(train_data=df_train, label="D", children=[
>>>     Namespace(namespace="ns_val_split_specified", children=[
>>>         ProblemTypeControl(),
>>>         TrainValidationSplit(val_size=0.4, children=[
>>>             # This analysis sees 60/40 split of df_train between train_data and val_data
>>>             SomeAnalysis()
>>>         ])
>>>     ]),
>>>     Namespace(namespace="ns_val_split_default", children=[
>>>         ProblemTypeControl(),
>>>         TrainValidationSplit(children=[
>>>             # This analysis sees 70/30 split (default) of df_train between train_data and val_data
>>>             SomeAnalysis()
>>>         ])
>>>     ]),
>>>     Namespace(namespace="ns_no_split", children=[
>>>         # This analysis sees only original train_data
>>>         SomeAnalysis()
>>>     ]),
>>> ],
>>> )
>>>
>>> state = analysis.fit()
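The effect of val_size can be illustrated with a plain shuffle-and-cut split. This is a simplified sketch: `train_validation_split` and its `seed` argument are hypothetical, and the real component may split differently (for example, depending on the detected problem_type), which this sketch does not attempt to reproduce.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def train_validation_split(
    rows: Sequence[T], val_size: float = 0.3, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle rows and assign a val_size fraction to the validation set."""
    if not 0.0 < val_size < 1.0:
        raise ValueError("val_size must be strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_size)
    return shuffled[n_val:], shuffled[:n_val]


train, val = train_validation_split(range(100), val_size=0.4)
print(len(train), len(val))
```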
ProblemTypeControl
class autogluon.eda.analysis.dataset.ProblemTypeControl(problem_type: str = 'auto', parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, **kwargs)
Helper component to control problem type. The problem type is auto-detected if problem_type = 'auto'.
- Parameters
- problem_type: str, default = ‘auto’
problem type to use. Valid problem_type values are 'auto', 'binary', 'multiclass', 'regression', 'quantile' and 'softclass'; 'auto' means the problem type is auto-detected using AutoGluon methods.
- parent: Optional[AbstractAnalysis], default = None
parent Analysis
- children: Optional[List[AbstractAnalysis]], default = None
wrapped analyses; these will receive sampled args during fit call
- kwargs
RawTypesAnalysis
class autogluon.eda.analysis.dataset.RawTypesAnalysis(parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, state: Optional[autogluon.eda.state.AnalysisState] = None, **kwargs)
Infers autogluon raw types for the column.
Examples
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )
VariableTypeAnalysis
class autogluon.eda.analysis.dataset.VariableTypeAnalysis(parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, numeric_as_categorical_threshold: int = 20, **kwargs)
Infers variable types for the column: numeric vs category.
This analysis depends on RawTypesAnalysis.
- Parameters
- numeric_as_categorical_threshold: int, default = 20
if a numeric column has fewer unique values than this threshold, the variable is considered categorical
- parent: Optional[AbstractAnalysis], default = None
parent Analysis
- children: Optional[List[AbstractAnalysis]], default = None
wrapped analyses; these will receive sampled args during fit call
Examples
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),
>>>         eda.dataset.VariableTypeAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )
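The role of numeric_as_categorical_threshold can be sketched as a simple rule on the number of distinct values. The `infer_variable_type` function, the raw type names and the treatment of all non-numeric columns as category are assumptions made for this illustration, and the exact comparison used by the library may differ.

```python
from typing import Sequence


def infer_variable_type(
    values: Sequence, raw_type: str, numeric_as_categorical_threshold: int = 20
) -> str:
    """Classify a column as 'numeric' or 'category' (illustrative rule only)."""
    if raw_type in ("int", "float"):
        n_unique = len(set(values))
        # Numeric columns with few distinct values behave like categories
        if n_unique < numeric_as_categorical_threshold:
            return "category"
        return "numeric"
    return "category"  # assumption: non-numeric raw types are categorical


print(infer_variable_type([0, 1, 0, 1, 1], "int"))
print(infer_variable_type(list(range(100)), "int"))
```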
SpecialTypesAnalysis
class autogluon.eda.analysis.dataset.SpecialTypesAnalysis(parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, state: Optional[autogluon.eda.state.AnalysisState] = None, **kwargs)
Infers autogluon special types for the column (e.g. text).
Examples
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.SpecialTypesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )
DatasetSummary
class autogluon.eda.analysis.dataset.DatasetSummary(parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, state: Optional[autogluon.eda.state.AnalysisState] = None, **kwargs)
Generates dataset summary including counts, number of unique elements, most frequent, dtypes and 7-figure summary (std/mean/min/max/quartiles).
Examples
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.DatasetSummary(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )
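For a sense of what the per-column summary contains, the sketch below computes the same kinds of figures (count, unique values, most frequent value, and the std/mean/min/max/quartiles summary) for a single numeric column using only the standard library. The `summarize_column` function and its key names are hypothetical, and the quartile interpolation method is an assumption.

```python
import statistics
from collections import Counter
from typing import Any, Dict, Sequence


def summarize_column(values: Sequence[float]) -> Dict[str, Any]:
    """Compute a per-column summary similar in spirit to a dataset summary."""
    # Quartiles with inclusive (linear) interpolation; the method is an assumption
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    top, freq = Counter(values).most_common(1)[0]
    return {
        "count": len(values),
        "unique": len(set(values)),
        "top": top,    # most frequent value
        "freq": freq,  # how often the most frequent value occurs
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


print(summarize_column([1, 2, 2, 3, 4, 5]))
```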
LabelInsightsAnalysis
class autogluon.eda.analysis.dataset.LabelInsightsAnalysis(low_cardinality_classes_threshold: int = 50, regression_ood_threshold: float = 0.01, class_imbalance_ratio_threshold: float = 0.4, parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, state: Optional[autogluon.eda.state.AnalysisState] = None, **kwargs)
Analyze label for insights:
classification: low cardinality classes detection
classification: classes present in test data, but not in the train data
regression: out-of-domain labels detection
Note: this analysis requires problem_type to be present in state. It can be detected/set via the ProblemTypeControl component.
- Parameters
- low_cardinality_classes_threshold: int, default = 50
minimum number of instances of a class in the dataset; classes with fewer instances are marked as low-cardinality
- regression_ood_threshold: float, default = 0.01
mark results as out-of-domain when the test label range in a regression task is beyond the train data range + regression_ood_threshold margin. This is performed because some algorithms can't extrapolate beyond training data.
- class_imbalance_ratio_threshold: float, default = 0.4
minority class proportion to detect as imbalance.
- parent: Optional[AbstractAnalysis], default = None
parent Analysis
- children: Optional[List[AbstractAnalysis]], default = None
wrapped analyses; these will receive sampled args during fit call
- state: AnalysisState
state object to perform check on
Examples
>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(
>>>     train_data=..., test_data=..., label=...,
>>>     anlz_facets=[
>>>         eda.dataset.ProblemTypeControl(),
>>>         eda.dataset.LabelInsightsAnalysis(low_cardinality_classes_threshold=50, regression_ood_threshold=0.01),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.LabelInsightsVisualization()
>>>     ]
>>> )
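The three label insights listed above can be sketched as small standalone checks. The function names are hypothetical, and the sketch assumes the out-of-domain margin is regression_ood_threshold taken as a fraction of the train label range; the library may define the margin differently.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Set


def low_cardinality_classes(train_labels: Sequence[Hashable], threshold: int = 50) -> Dict[Hashable, int]:
    """Classes with fewer than `threshold` instances in the train labels."""
    return {cls: n for cls, n in Counter(train_labels).items() if n < threshold}


def classes_missing_in_train(train_labels: Sequence[Hashable], test_labels: Sequence[Hashable]) -> Set[Hashable]:
    """Classes that appear in the test labels but never in the train labels."""
    return set(test_labels) - set(train_labels)


def out_of_domain_labels(
    train_labels: Sequence[float], test_labels: Sequence[float], ood_threshold: float = 0.01
) -> List[float]:
    """Test labels outside the train range widened by a margin (assumed definition)."""
    lo, hi = min(train_labels), max(train_labels)
    margin = (hi - lo) * ood_threshold  # assumption: fraction of the train range
    return [y for y in test_labels if y < lo - margin or y > hi + margin]


print(low_cardinality_classes(["a"] * 60 + ["b"] * 3))
print(classes_missing_in_train(["a", "b"], ["a", "c"]))
print(out_of_domain_labels([0.0, 100.0], [50.0, 100.5, 102.0, -5.0]))
```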