Components: dataset

autogluon.eda.visualization.dataset

DatasetStatistics

Display aggregate dataset statistics and dataset-level information.

DatasetTypeMismatch

Display mismatches in raw types between the provided datasets.

LabelInsightsVisualization

Render label insights performed by LabelInsightsAnalysis.

DatasetStatistics

class autogluon.eda.visualization.dataset.DatasetStatistics(headers: bool = False, namespace: Optional[str] = None, sort_by: Optional[str] = None, sort_asc: bool = True, **kwargs)[source]

Display aggregate dataset statistics and dataset-level information.

The report is a composite view of the analyses performed: DatasetSummary, RawTypesAnalysis, VariableTypeAnalysis, SpecialTypesAnalysis, MissingValuesAnalysis. The components can be present in any combination (assuming their dependencies are satisfied).

The report requires at least one of these analyses to be present in order to render.

Parameters
headers: bool, default = False

if True then render headers

namespace: str, default = None

namespace to use; can be nested like ns_a.ns_b.ns_c

sort_by: Optional[str], default = None

column to sort the resulting table by

sort_asc: bool, default = True

if sort_by is provided, whether to sort in ascending (True) or descending (False) order
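The nested namespace form mentioned above (ns_a.ns_b.ns_c) can be pictured as a walk through nested dictionaries. The following is only a minimal sketch under that assumption: resolve_namespace and the contents of the state dict are hypothetical names for illustration and are not part of autogluon.eda.

```python
from typing import Any, Dict, Optional


def resolve_namespace(state: Dict[str, Any], namespace: Optional[str]) -> Dict[str, Any]:
    """Follow a dotted namespace such as 'ns_a.ns_b.ns_c' through nested dicts.

    Hypothetical helper for illustration only; not an autogluon.eda function.
    """
    if not namespace:
        # namespace=None: read from the top level of the state
        return state
    node = state
    for key in namespace.split("."):
        node = node[key]  # raises KeyError if a level is missing
    return node


# Illustrative state layout (the keys below are made up for this sketch)
state = {"ns_a": {"ns_b": {"ns_c": {"dataset_stats": {"rows": 100}}}}}
nested = resolve_namespace(state, "ns_a.ns_b.ns_c")
```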

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.DatasetSummary(),
>>>         eda.dataset.RawTypesAnalysis(),
>>>         eda.dataset.VariableTypeAnalysis(),
>>>         eda.dataset.SpecialTypesAnalysis(),
>>>         eda.missing.MissingValuesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )

DatasetTypeMismatch

class autogluon.eda.visualization.dataset.DatasetTypeMismatch(headers: bool = False, namespace: Optional[str] = None, **kwargs)[source]

Display mismatches in raw types between the provided datasets. If a mismatch is found, the row is marked with a warning.

The report requires RawTypesAnalysis to be present.
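The idea of a type mismatch check can be sketched in plain Python. This is an illustrative approximation, not the component's implementation: find_type_mismatches, the input layout (dataset name mapped to a column/raw-type dict) and the sample type names are all assumptions.

```python
from typing import Dict, List, Tuple


def find_type_mismatches(
    raw_types: Dict[str, Dict[str, str]]
) -> List[Tuple[str, Dict[str, str], bool]]:
    """Compare per-column raw types across datasets.

    raw_types maps dataset name -> {column: raw type}.
    Returns one row per column: (column, types per dataset, warning flag).
    Hypothetical helper for illustration only.
    """
    columns = sorted({col for types in raw_types.values() for col in types})
    rows = []
    for col in columns:
        per_dataset = {ds: types[col] for ds, types in raw_types.items() if col in types}
        # Warn when the same column has more than one distinct raw type
        warning = len(set(per_dataset.values())) > 1
        rows.append((col, per_dataset, warning))
    return rows


mismatches = find_type_mismatches({
    "train_data": {"age": "int", "city": "object"},
    "test_data": {"age": "float", "city": "object"},
})
```

In this sketch the row for age carries a warning flag because its type differs between the two datasets, while city does not.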

Parameters
headers: bool, default = False

if True then render headers

namespace: str, default = None

namespace to use; can be nested like ns_a.ns_b.ns_c

See also

RawTypesAnalysis

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(
>>>     train_data=..., test_data=...,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetTypeMismatch()
>>>     ]
>>> )

LabelInsightsVisualization

class autogluon.eda.visualization.dataset.LabelInsightsVisualization(headers: bool = False, namespace: Optional[str] = None, **kwargs)[source]

Render label insights performed by LabelInsightsAnalysis.

The following insights can be rendered:

  • classification: low cardinality classes detection

  • classification: classes present in test data, but not in the train data

  • regression: out-of-domain labels detection

Parameters
headers: bool, default = False

if True then render headers

namespace: str, default = None

namespace to use; can be nested like ns_a.ns_b.ns_c

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(train_data=..., test_data=..., label=..., anlz_facets=[
>>>     eda.dataset.ProblemTypeControl(),
>>>     eda.dataset.LabelInsightsAnalysis(low_cardinality_classes_threshold=50, regression_ood_threshold=0.01),
>>> ], viz_facets=[
>>>     viz.dataset.LabelInsightsVisualization()
>>> ])

autogluon.eda.analysis.dataset

Sampler

Sampler is a wrapper that provides sampling capabilities for the wrapped analyses.

TrainValidationSplit

This wrapper splits train_data into training and validation sets stored in train_data and val_data for the wrapped analyses.

ProblemTypeControl

Helper component to control problem type.

RawTypesAnalysis

Infers autogluon raw types for the column.

VariableTypeAnalysis

Infers variable types for the column: numeric vs category.

SpecialTypesAnalysis

Infers autogluon special types for the column (e.g. text).

DatasetSummary

Generates dataset summary including counts, number of unique elements, most frequent, dtypes and 7-figure summary (std/mean/min/max/quartiles)

LabelInsightsAnalysis

Analyze the label for insights.

Sampler

class autogluon.eda.analysis.dataset.Sampler(sample: Union[None, int, float] = None, parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, **kwargs)[source]

Sampler is a wrapper that provides sampling capabilities for the wrapped analyses. Sampling is performed on all datasets in args, and the sampled datasets are passed to all children during the fit call, shadowing the outer parameters.

Parameters
sample: Union[None, int, float], default = None

sample size; if int, it is the number of rows to sample; if float, it must be between 0.0 and 1.0 and represents the fraction of the dataset to sample; None means no sampling

parent: Optional[AbstractAnalysis], default = None

parent Analysis

children: Optional[List[AbstractAnalysis]], default None

wrapped analyses; these will receive sampled args during fit call
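The three forms of the sample parameter described above (int, float, None) can be illustrated with a small standard-library sketch. sample_rows is a hypothetical helper, and the handling of an int larger than the dataset (cap at the dataset size) is an assumption, not documented behavior.

```python
import random
from typing import List, Optional, Sequence, Union


def sample_rows(
    rows: Sequence[int],
    sample: Union[None, int, float] = None,
    seed: int = 0,
) -> List[int]:
    """Illustrate how an int / float / None sample size could be interpreted.

    Hypothetical helper for illustration only; not the Sampler implementation.
    """
    if sample is None:
        return list(rows)  # None: no sampling, keep everything
    if isinstance(sample, float):
        if not 0.0 <= sample <= 1.0:
            raise ValueError("float sample must be between 0.0 and 1.0")
        n = int(len(rows) * sample)  # float: fraction of the dataset
    else:
        n = min(sample, len(rows))  # int: row count (capping is an assumption)
    return random.Random(seed).sample(list(rows), n)


rows = list(range(20))
by_count = sample_rows(rows, sample=5)
by_fraction = sample_rows(rows, sample=0.25)
unsampled = sample_rows(rows, sample=None)
```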

Examples

>>> from autogluon.eda.analysis.base import BaseAnalysis
>>> from autogluon.eda.analysis import Sampler
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df_train = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
>>> df_test = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('EFGH'))
>>> analysis = BaseAnalysis(train_data=df_train, test_data=df_test, children=[
>>>     Sampler(sample=5, children=[
>>>         # Analysis here will be performed on a sample of 5 for both train_data and test_data
>>>     ])
>>> ])

TrainValidationSplit

class autogluon.eda.analysis.dataset.TrainValidationSplit(val_size: float = 0.3, parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, **kwargs)[source]

This wrapper splits train_data into training and validation sets stored in train_data and val_data for the wrapped analyses. The split is performed on the datasets in args, and the resulting datasets are passed to all children during the fit call, shadowing the outer parameters.

This component requires ProblemTypeControl to be present in the analysis call to set problem_type.

Parameters
val_size: float, default = 0.3

fraction of training set to be assigned as validation set during the split.

parent: Optional[AbstractAnalysis], default = None

parent Analysis

children: Optional[List[AbstractAnalysis]], default None

wrapped analyses; these will receive the split train_data and val_data during the fit call

kwargs
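As a rough illustration of what val_size means, the sketch below shuffles a list of rows and assigns the given fraction to a validation set. It deliberately ignores problem_type, which the real component requires, so it should be read as a simplified assumption rather than the actual split logic; train_validation_split is a hypothetical name.

```python
import random
from typing import List, Sequence, Tuple


def train_validation_split(
    rows: Sequence[int], val_size: float = 0.3, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Split rows into (train, validation) using a plain random shuffle.

    Hypothetical helper for illustration only; the real component also
    depends on problem_type, which this sketch does not model.
    """
    if not 0.0 < val_size < 1.0:
        raise ValueError("val_size must be strictly between 0.0 and 1.0")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_val = round(len(shuffled) * val_size)  # rows assigned to validation
    return shuffled[n_val:], shuffled[:n_val]


train_rows, val_rows = train_validation_split(range(100), val_size=0.4)
default_train, default_val = train_validation_split(range(100))
```

With 100 rows, val_size=0.4 yields a 60/40 split and the default yields 70/30, matching the comments in the example that follows.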

See also

ProblemTypeControl

Examples

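>>> from autogluon.eda.analysis.base import BaseAnalysis, Namespace
>>> from autogluon.eda.analysis.dataset import ProblemTypeControl, TrainValidationSplit
>>> # SomeAnalysis below is a placeholder for any analysis component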
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df_train = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
>>> analysis = BaseAnalysis(train_data=df_train, label="D", children=[
>>>         Namespace(namespace="ns_val_split_specified", children=[
>>>             ProblemTypeControl(),
>>>             TrainValidationSplit(val_size=0.4, children=[
>>>                 # This analysis sees 60/40 split of df_train between train_data and val_data
>>>                 SomeAnalysis()
>>>             ])
>>>         ]),
>>>         Namespace(namespace="ns_val_split_default", children=[
>>>             ProblemTypeControl(),
>>>             TrainValidationSplit(children=[
>>>                 # This analysis sees 70/30 split (default) of df_train between train_data and val_data
>>>                 SomeAnalysis()
>>>             ])
>>>         ]),
>>>         Namespace(namespace="ns_no_split", children=[
>>>             # This analysis sees only original train_data
>>>             SomeAnalysis()
>>>         ]),
>>>     ],
>>> )
>>>
>>> state = analysis.fit()
>>>

ProblemTypeControl

class autogluon.eda.analysis.dataset.ProblemTypeControl(problem_type: str = 'auto', parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, **kwargs)[source]

Helper component to control problem type. The problem type is auto-detected if problem_type = 'auto'.

Parameters
problem_type: str, default = 'auto'

problem type to use. Valid problem_type values are ['auto', 'binary', 'multiclass', 'regression', 'quantile', 'softclass']; 'auto' means the problem type is auto-detected using AutoGluon methods.

parent: Optional[AbstractAnalysis], default = None

parent Analysis

children: Optional[List[AbstractAnalysis]], default None

wrapped analyses; these will receive sampled args during fit call

kwargs

RawTypesAnalysis

class autogluon.eda.analysis.dataset.RawTypesAnalysis(parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, state: Optional[autogluon.eda.state.AnalysisState] = None, **kwargs)[source]

Infers autogluon raw types for the column.

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )

VariableTypeAnalysis

class autogluon.eda.analysis.dataset.VariableTypeAnalysis(parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, numeric_as_categorical_threshold: int = 20, **kwargs)[source]

Infers variable types for the column: numeric vs category.

This analysis depends on RawTypesAnalysis().

Parameters
numeric_as_categorical_threshold: int, default = 20

if a numeric column has fewer unique values than this threshold, the variable is considered categorical

parent: Optional[AbstractAnalysis], default = None

parent Analysis

children: Optional[List[AbstractAnalysis]], default None

wrapped analyses; these will receive sampled args during fit call
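The numeric-versus-category decision can be sketched as a unique-value count compared with numeric_as_categorical_threshold. This is a simplified assumption about the heuristic: infer_variable_type is a hypothetical helper, and treating "int" and "float" as the numeric raw types is likewise assumed here.

```python
from typing import Sequence


def infer_variable_type(
    values: Sequence[object],
    raw_type: str,
    numeric_as_categorical_threshold: int = 20,
) -> str:
    """Return 'numeric' or 'category' for a column.

    Hypothetical helper for illustration only; not the VariableTypeAnalysis code.
    """
    if raw_type not in ("int", "float"):
        # Non-numeric raw types are treated as categories in this sketch
        return "category"
    n_unique = len(set(values))
    # Numeric columns with few distinct values are treated as categories
    if n_unique < numeric_as_categorical_threshold:
        return "category"
    return "numeric"


rating_type = infer_variable_type([1, 2, 3, 4, 5] * 10, raw_type="int")
price_type = infer_variable_type(list(range(100)), raw_type="int")
```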

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),
>>>         eda.dataset.VariableTypeAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )

SpecialTypesAnalysis

class autogluon.eda.analysis.dataset.SpecialTypesAnalysis(parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, state: Optional[autogluon.eda.state.AnalysisState] = None, **kwargs)[source]

Infers autogluon special types for the column (e.g. text).

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.SpecialTypesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )

DatasetSummary

class autogluon.eda.analysis.dataset.DatasetSummary(parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, state: Optional[autogluon.eda.state.AnalysisState] = None, **kwargs)[source]

Generates dataset summary including counts, number of unique elements, most frequent, dtypes and 7-figure summary (std/mean/min/max/quartiles)
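The seven figures named above (std, mean, min, max and the three quartiles) can be computed for a single numeric column with the standard library. This is a minimal sketch for orientation only: seven_figure_summary is a hypothetical helper, and the "inclusive" quantile method is a choice made for this example rather than a statement about how DatasetSummary computes quartiles.

```python
import statistics
from typing import Dict, Sequence


def seven_figure_summary(values: Sequence[float]) -> Dict[str, float]:
    """Compute std, mean, min, max and the three quartiles for one column.

    Hypothetical helper for illustration only; not the DatasetSummary code.
    """
    # n=4 returns the three cut points that divide the data into quarters
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }


summary = seven_figure_summary([1, 2, 3, 4, 5])
```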

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.DatasetSummary(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )

LabelInsightsAnalysis

class autogluon.eda.analysis.dataset.LabelInsightsAnalysis(low_cardinality_classes_threshold: int = 50, regression_ood_threshold: float = 0.01, class_imbalance_ratio_threshold: float = 0.4, parent: Optional[autogluon.eda.analysis.base.AbstractAnalysis] = None, children: Optional[List[autogluon.eda.analysis.base.AbstractAnalysis]] = None, state: Optional[autogluon.eda.state.AnalysisState] = None, **kwargs)[source]

Analyze label for insights:

  • classification: low cardinality classes detection

  • classification: classes present in test data, but not in the train data

  • regression: out-of-domain labels detection

Note: this analysis requires problem_type to be present in state. It can be detected or set via the ProblemTypeControl component.

Parameters
low_cardinality_classes_threshold: int, default = 50

minimum number of instances a class must have in the dataset; classes with fewer instances are marked as low-cardinality

regression_ood_threshold: float, default = 0.01

mark results as out-of-domain when the test label range in a regression task extends beyond the train data range plus the regression_ood_threshold margin. This check is performed because some algorithms can't extrapolate beyond the training data.

class_imbalance_ratio_threshold: float, default = 0.4

minority class proportion to detect as imbalance.

parent: Optional[AbstractAnalysis], default = None

parent Analysis

children: Optional[List[AbstractAnalysis]], default None

wrapped analyses; these will receive sampled args during fit call

state: AnalysisState

state object to perform check on
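The three insights listed above can be approximated with a few lines of plain Python. All three helpers are hypothetical, and the way the out-of-domain margin is derived (threshold multiplied by the train label range) is an assumption made for this sketch; the library may compute it differently.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Set


def low_cardinality_classes(labels: Sequence[Hashable], threshold: int = 50) -> Dict[Hashable, int]:
    """Classes with fewer than `threshold` instances (illustrative only)."""
    counts = Counter(labels)
    return {cls: n for cls, n in counts.items() if n < threshold}


def classes_missing_in_train(train_labels: Sequence[Hashable], test_labels: Sequence[Hashable]) -> Set[Hashable]:
    """Classes that appear in the test labels but not in the train labels."""
    return set(test_labels) - set(train_labels)


def regression_ood_labels(train_labels: Sequence[float], test_labels: Sequence[float], threshold: float = 0.01) -> List[float]:
    """Test labels outside the train range widened by a margin.

    Assumption: margin = threshold * (train max - train min).
    """
    lo, hi = min(train_labels), max(train_labels)
    margin = (hi - lo) * threshold
    return [y for y in test_labels if y < lo - margin or y > hi + margin]


rare = low_cardinality_classes(["a"] * 60 + ["b"] * 3)
unseen = classes_missing_in_train(["a", "b"], ["a", "c"])
ood = regression_ood_labels([0.0, 100.0], [50.0, 100.5, 102.0, -5.0])
```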

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(train_data=..., test_data=..., label=..., anlz_facets=[
>>>     eda.dataset.ProblemTypeControl(),
>>>     eda.dataset.LabelInsightsAnalysis(low_cardinality_classes_threshold=50, regression_ood_threshold=0.01),
>>> ], viz_facets=[
>>>     viz.dataset.LabelInsightsVisualization()
>>> ])