Components: dataset#

autogluon.eda.visualization.dataset#

DatasetStatistics

Display aggregate dataset statistics and dataset-level information.

DatasetTypeMismatch

Display mismatches in raw types between the datasets provided.

LabelInsightsVisualization

Render label insights performed by LabelInsightsAnalysis.

DatasetStatistics#

class autogluon.eda.visualization.dataset.DatasetStatistics(headers: bool = False, namespace: Optional[str] = None, sort_by: Optional[str] = None, sort_asc: bool = True, **kwargs)[source]#

Display aggregate dataset statistics and dataset-level information.

The report is a composite view combining the performed analyses: DatasetSummary, RawTypesAnalysis, VariableTypeAnalysis, SpecialTypesAnalysis, MissingValuesAnalysis. The components can be present in any combination (assuming their dependencies are satisfied).

The report requires at least one of the analyses to be present in order to be rendered.

Parameters
  • headers (bool, default = False) – if True, render headers

  • namespace (str, default = None) – namespace to use; can be nested like ns_a.ns_b.ns_c

  • sort_by (Optional[str], default = None) – column to sort the resulting table by

  • sort_asc (bool, default = True) – if sort_by is provided, whether to sort ascending or descending

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.DatasetSummary(),
>>>         eda.dataset.RawTypesAnalysis(),
>>>         eda.dataset.VariableTypeAnalysis(),
>>>         eda.dataset.SpecialTypesAnalysis(),
>>>         eda.missing.MissingValuesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )
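The resulting table can also be ordered via sort_by / sort_asc. A minimal sketch for illustration (the column name "unique" is an assumption and depends on which columns end up in the rendered table):

>>> auto.analyze(
>>>     train_data=..., label=...,
>>>     anlz_facets=[
>>>         eda.dataset.DatasetSummary(),
>>>     ],
>>>     viz_facets=[
>>>         # "unique" is an assumed column name; use any column present in the rendered table
>>>         viz.dataset.DatasetStatistics(sort_by="unique", sort_asc=False)
>>>     ]
>>> )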

DatasetTypeMismatch#

class autogluon.eda.visualization.dataset.DatasetTypeMismatch(headers: bool = False, namespace: Optional[str] = None, **kwargs)[source]#

Display mismatches in raw types between the datasets provided. If a mismatch is found, the row is marked with a warning.

The report requires the RawTypesAnalysis analysis to be present.

Parameters
  • headers (bool, default = False) – if True, render headers

  • namespace (str, default = None) – namespace to use; can be nested like ns_a.ns_b.ns_c

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(
>>>     train_data=..., test_data=...,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetTypeMismatch()
>>>     ]
>>> )

See also

RawTypesAnalysis

LabelInsightsVisualization#

class autogluon.eda.visualization.dataset.LabelInsightsVisualization(headers: bool = False, namespace: Optional[str] = None, **kwargs)[source]#

Render label insights performed by LabelInsightsAnalysis.

The following insights can be rendered:

  • classification: low cardinality classes detection

  • classification: classes present in test data, but not in the train data

  • regression: out-of-domain labels detection

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(train_data=..., test_data=..., label=..., anlz_facets=[
>>>     eda.dataset.ProblemTypeControl(),
>>>     eda.dataset.LabelInsightsAnalysis(low_cardinality_classes_threshold=50, regression_ood_threshold=0.01),
>>> ], viz_facets=[
>>>     viz.dataset.LabelInsightsVisualization()
>>> ])

Parameters
  • headers (bool, default = False) – if True, render headers

  • namespace (str, default = None) – namespace to use; can be nested like ns_a.ns_b.ns_c

autogluon.eda.analysis.dataset#

Sampler

Sampler is a wrapper that provides sampling capabilities for the wrapped analyses.

TrainValidationSplit

This wrapper splits train_data into training and validation sets stored in train_data and val_data for the wrapped analyses.

ProblemTypeControl

Helper component to control problem type.

RawTypesAnalysis

Infers autogluon raw types for the column.

VariableTypeAnalysis

Infers variable types for the column: numeric vs category.

SpecialTypesAnalysis

Infers autogluon special types for the column (i.e. text).

DatasetSummary

Generates a dataset summary including counts, number of unique elements, most frequent values, dtypes, and a 7-figure summary (std/mean/min/max/quartiles).

LabelInsightsAnalysis

Analyze label for insights:

Sampler#

class autogluon.eda.analysis.dataset.Sampler(sample: Union[None, int, float] = None, parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, **kwargs)[source]#

Sampler is a wrapper that provides sampling capabilities for the wrapped analyses. The sampling is performed for all datasets in args and passed to all children during the fit call, shadowing outer parameters.

Parameters
  • sample (Union[None, int, float], default = None) – sample size; if int, the number of rows to sample; if float, must be between 0.0 and 1.0 and represents the fraction of the dataset to sample; None means no sampling

  • parent (Optional[AbstractAnalysis], default = None) – parent Analysis

  • children (Optional[List[AbstractAnalysis]], default None) – wrapped analyses; these will receive sampled args during fit call

Examples

>>> from autogluon.eda.analysis.base import BaseAnalysis
>>> from autogluon.eda.analysis import Sampler
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df_train = pd.DataFrame(np.random.randint(0, 100, size=(10, 4)), columns=list('ABCD'))
>>> df_test = pd.DataFrame(np.random.randint(0, 100, size=(20, 4)), columns=list('EFGH'))
>>> analysis = BaseAnalysis(train_data=df_train, test_data=df_test, children=[
>>>     Sampler(sample=5, children=[
>>>         # Analysis here will be performed on a sample of 5 for both train_data and test_data
>>>     ])
>>> ])
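For illustration, a sketch of the same wrapper with a fractional sample (sample=0.5 keeps roughly half of the rows of each dataset):

>>> analysis = BaseAnalysis(train_data=df_train, test_data=df_test, children=[
>>>     Sampler(sample=0.5, children=[
>>>         # Analyses placed here see ~50% of the rows of both train_data and test_data
>>>     ])
>>> ])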

TrainValidationSplit#

class autogluon.eda.analysis.dataset.TrainValidationSplit(val_size: float = 0.3, parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, **kwargs)[source]#

This wrapper splits train_data into training and validation sets stored in train_data and val_data for the wrapped analyses. The split is performed for the datasets in args and passed to all children during the fit call, shadowing outer parameters.

This component requires ProblemTypeControl to be present in the analysis call to set problem_type.

Parameters
  • val_size (float, default = 0.3) – fraction of training set to be assigned as validation set during the split.

  • parent (Optional[AbstractAnalysis], default = None) – parent Analysis

  • children (Optional[List[AbstractAnalysis]], default None) – wrapped analyses; these will receive the split args during the fit call

  • kwargs

Examples

>>> from autogluon.eda.analysis.base import BaseAnalysis
>>> from autogluon.eda.analysis import Namespace, ProblemTypeControl, TrainValidationSplit
>>> import pandas as pd
>>> import numpy as np
>>>
>>> df_train = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list("ABCD"))
>>> analysis = BaseAnalysis(train_data=df_train, label="D", children=[
>>>         Namespace(namespace="ns_val_split_specified", children=[
>>>             ProblemTypeControl(),
>>>             TrainValidationSplit(val_size=0.4, children=[
>>>                 # This analysis sees 60/40 split of df_train between train_data and val_data
>>>                 SomeAnalysis()
>>>             ])
>>>         ]),
>>>         Namespace(namespace="ns_val_split_default", children=[
>>>             ProblemTypeControl(),
>>>             TrainValidationSplit(children=[
>>>                 # This analysis sees 70/30 split (default) of df_train between train_data and val_data
>>>                 SomeAnalysis()
>>>             ])
>>>         ]),
>>>         Namespace(namespace="ns_no_split", children=[
>>>                 # This analysis sees only original train_data
>>>             SomeAnalysis()
>>>         ]),
>>>     ],
>>> )
>>>
>>> state = analysis.fit()

See also

ProblemTypeControl

ProblemTypeControl#

class autogluon.eda.analysis.dataset.ProblemTypeControl(problem_type: str = 'auto', parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, **kwargs)[source]#

Helper component to control the problem type. The problem type is auto-detected if problem_type = ‘auto’.

Parameters
  • problem_type (str, default = 'auto') – problem type to use. Valid problem_type values include [‘auto’, ‘binary’, ‘multiclass’, ‘regression’, ‘quantile’, ‘softclass’]; ‘auto’ means the problem type will be auto-detected using AutoGluon methods.

  • parent (Optional[AbstractAnalysis], default = None) – parent Analysis

  • children (Optional[List[AbstractAnalysis]], default None) – wrapped analyses; these will receive sampled args during fit call

  • kwargs
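Examples

A minimal sketch for illustration, forcing a binary problem type instead of relying on auto-detection, following the auto.analyze pattern used throughout this page:

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(
>>>     train_data=..., test_data=..., label=...,
>>>     anlz_facets=[
>>>         # force 'binary' instead of relying on auto-detection
>>>         eda.dataset.ProblemTypeControl(problem_type='binary'),
>>>         eda.dataset.LabelInsightsAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.LabelInsightsVisualization()
>>>     ]
>>> )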

RawTypesAnalysis#

class autogluon.eda.analysis.dataset.RawTypesAnalysis(parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, state: Optional[AnalysisState] = None, **kwargs)[source]#

Infers autogluon raw types for the column.

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )

VariableTypeAnalysis#

class autogluon.eda.analysis.dataset.VariableTypeAnalysis(parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, numeric_as_categorical_threshold: int = 20, **kwargs)[source]#

Infers variable types for the column: numeric vs category.

This analysis depends on RawTypesAnalysis().

Parameters
  • numeric_as_categorical_threshold (int, default = 20) – if a numeric column has fewer than this many distinct values, the variable is considered categorical

  • parent (Optional[AbstractAnalysis], default = None) – parent Analysis

  • children (Optional[List[AbstractAnalysis]], default None) – wrapped analyses; these will receive sampled args during fit call

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),
>>>         eda.dataset.VariableTypeAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )
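For illustration, a sketch lowering the categorical threshold (the value 10 is arbitrary):

>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.RawTypesAnalysis(),  # required dependency
>>>         # numeric columns with fewer than 10 distinct values are treated as categorical
>>>         eda.dataset.VariableTypeAnalysis(numeric_as_categorical_threshold=10),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )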

SpecialTypesAnalysis#

class autogluon.eda.analysis.dataset.SpecialTypesAnalysis(parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, state: Optional[AnalysisState] = None, **kwargs)[source]#

Infers autogluon special types for the column (i.e. text).

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.SpecialTypesAnalysis(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )

DatasetSummary#

class autogluon.eda.analysis.dataset.DatasetSummary(parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, state: Optional[AnalysisState] = None, **kwargs)[source]#

Generates a dataset summary including counts, number of unique elements, most frequent values, dtypes, and a 7-figure summary (std/mean/min/max/quartiles).

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> state = auto.analyze(
>>>     train_data=..., label=..., return_state=True,
>>>     anlz_facets=[
>>>         eda.dataset.DatasetSummary(),
>>>     ],
>>>     viz_facets=[
>>>         viz.dataset.DatasetStatistics()
>>>     ]
>>> )

LabelInsightsAnalysis#

class autogluon.eda.analysis.dataset.LabelInsightsAnalysis(low_cardinality_classes_threshold: int = 50, regression_ood_threshold: float = 0.01, class_imbalance_ratio_threshold: float = 0.4, parent: Optional[AbstractAnalysis] = None, children: Optional[List[AbstractAnalysis]] = None, state: Optional[AnalysisState] = None, **kwargs)[source]#

Analyze label for insights:

  • classification: low cardinality classes detection

  • classification: classes present in test data, but not in the train data

  • regression: out-of-domain labels detection

Note: this analysis requires problem_type to be present in the state. It can be detected/set via the ProblemTypeControl component.

Parameters
  • low_cardinality_classes_threshold (int, default = 50) – minimum class instances present in the dataset to consider marking a class as low-cardinality

  • regression_ood_threshold (float, default = 0.01) – mark results as out-of-domain when the test label range in a regression task is beyond the train data range plus the regression_ood_threshold margin. This is performed because some algorithms can’t extrapolate beyond the training data.

  • class_imbalance_ratio_threshold (float, default = 0.4) – minority class proportion to detect as imbalance

  • parent (Optional[AbstractAnalysis], default = None) – parent Analysis

  • children (Optional[List[AbstractAnalysis]], default None) – wrapped analyses; these will receive sampled args during fit call

  • state (AnalysisState) – state object to perform check on

Examples

>>> import autogluon.eda.analysis as eda
>>> import autogluon.eda.visualization as viz
>>> import autogluon.eda.auto as auto
>>> auto.analyze(train_data=..., test_data=..., label=..., anlz_facets=[
>>>     eda.dataset.ProblemTypeControl(),
>>>     eda.dataset.LabelInsightsAnalysis(low_cardinality_classes_threshold=50, regression_ood_threshold=0.01),
>>> ], viz_facets=[
>>>     viz.dataset.LabelInsightsVisualization()
>>> ])

See also

ProblemTypeControl

LabelInsightsVisualization