TabularPredictor.fit_pseudolabel

TabularPredictor.fit_pseudolabel(pseudo_data: DataFrame, max_iter: int = 3, return_pred_prob: bool = False, use_ensemble: bool = False, fit_ensemble: bool = False, fit_ensemble_every_iter: bool = False, **kwargs)

[Advanced] Uses additional data (pseudo_data) to try to achieve better model quality. Pseudo data can come either with or without the label column.

If pseudo_data is labeled, then models will be refit using the pseudo_data as additional training data. If bagging, each fold of the bagged ensemble will use all the pseudo_data as additional training data. pseudo_data will never be used for validation/scoring.

If pseudo_data is unlabeled, for example a batch of test data with no ground truth available, then transductive learning is used. In transductive learning, the existing predictor predicts on pseudo_data to identify the most confident rows (for example, all rows with a predicted probability above 95%). These rows are then pseudo-labelled with the label of the most confident class and used as additional training data when fitting the models. If max_iter > 1, this process can repeat, using the new models to predict on the unused pseudo_data rows to see whether any new rows should be added as training data in the next iteration. If the data is unlabeled, we recommend specifying return_pred_prob=True to get the correct prediction probabilities on the pseudo_data, rather than calling predictor.predict_proba(pseudo_data).
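The row-selection step described above can be sketched in plain Python. This is a minimal illustration, not AutoGluon's implementation: the helper name `select_confident_rows`, the dict-based probability format, and the fixed 0.95 threshold are all assumptions made for the example.

```python
def select_confident_rows(pred_proba, threshold=0.95):
    """Return {row_index: pseudo_label} for rows whose top class
    probability exceeds `threshold` (hypothetical helper)."""
    pseudo_labels = {}
    for idx, probs in enumerate(pred_proba):
        # The most confident class is the one with the highest probability.
        best_class = max(probs, key=probs.get)
        if probs[best_class] > threshold:
            pseudo_labels[idx] = best_class
    return pseudo_labels


# Made-up predicted probabilities for three unlabeled rows.
pred_proba = [
    {"cat": 0.98, "dog": 0.02},
    {"cat": 0.60, "dog": 0.40},
    {"cat": 0.03, "dog": 0.97},
]

confident = select_confident_rows(pred_proba)
# Rows that were not confident stay available for a later iteration.
remaining = [i for i in range(len(pred_proba)) if i not in confident]
```

Rows 0 and 2 would be pseudo-labelled and added to the training data, while row 1 stays in the unused pool for the next iteration.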

For example:

Original fit: 10000 train_data rows with 10-fold bagging

Bagged fold models will use 9000 train_data rows for training, and 1000 for validation.

fit_pseudolabel is called with 5000 rows of labelled pseudo_data.

Bagged fold models are then fit again with _PSEUDO suffix. 10000 train_data rows with 10-fold bagging + 5000 pseudo_data rows. Bagged fold models will use 9000 train_data rows + 5000 pseudo_data rows = 14000 rows for training, and 1000 for validation.

Note: The same validation rows will be used as was done in the original fit, so that validation scores are directly comparable.
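The row counts in the labelled case follow from simple arithmetic, sketched here. The helper `fold_sizes` is hypothetical and assumes the training set divides evenly across folds.

```python
def fold_sizes(n_train, n_folds, n_pseudo=0):
    """Rows each bagged fold model trains and validates on (hypothetical helper)."""
    # Each fold holds out one equal slice of the original train_data.
    val_rows = n_train // n_folds
    # pseudo_data is added in full to every fold's training rows
    # and is never used for validation.
    train_rows = n_train - val_rows + n_pseudo
    return train_rows, val_rows


original_fit = fold_sizes(10000, 10)
pseudo_fit = fold_sizes(10000, 10, n_pseudo=5000)
```

The validation size is the same in both calls, which is why the validation scores stay directly comparable.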

Alternatively, fit_pseudolabel is called with 5000 rows of unlabelled pseudo_data.

The predictor predicts on the pseudo_data and finds 965 rows with confident predictions. The ground truth of those 965 rows is set to the most confident prediction. Bagged fold models are then fit with the _PSEUDO suffix: 10000 train_data rows with 10-fold bagging + 965 labelled pseudo_data rows. Bagged fold models will use 9000 train_data rows + 965 pseudo_data rows = 9965 rows for training, and 1000 for validation.

Note: The same validation rows will be used as was done in the original fit, so that validation scores are directly comparable.

Repeat the process using the new pseudo-labelled predictor on the remaining pseudo_data. In the example, let's assume 188 new pseudo_data rows have confident predictions. The total number of labelled pseudo_data rows is now 965 + 188 = 1153. The process then repeats, up to max_iter times, e.g. 10000 train_data rows with 10-fold bagging + 1153 pseudo_data rows. Early stopping triggers if no validation score improvement is observed.
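The iterative bookkeeping can be sketched as a small simulation. This only tracks row counts and a simple early-stopping rule; it does not fit any models. The function `simulate_rounds`, the third-iteration count of 40 rows, and all validation scores are made-up values for illustration.

```python
def simulate_rounds(n_train, n_folds, newly_confident, val_scores,
                    baseline_score, max_iter=3):
    """Track pseudo-label totals and fold training sizes per iteration.

    newly_confident[i] and val_scores[i] are illustrative inputs for
    iteration i; a real run would obtain them from the fitted models.
    """
    val_rows = n_train // n_folds
    total_pseudo = 0
    best_score = baseline_score
    history = []
    for new_rows, score in list(zip(newly_confident, val_scores))[:max_iter]:
        total_pseudo += new_rows
        fold_train_rows = n_train - val_rows + total_pseudo
        history.append((total_pseudo, fold_train_rows))
        if score <= best_score:
            break  # no validation improvement: stop early
        best_score = score
    return history


history = simulate_rounds(
    10000, 10,
    newly_confident=[965, 188, 40],   # 40 is a made-up third-round count
    val_scores=[0.91, 0.92, 0.92],    # made-up validation scores
    baseline_score=0.90,
)
```

Each entry in `history` is (total pseudo-labelled rows, training rows per fold) for one attempted iteration; the third iteration shows no score improvement, so the loop would stop there even with a larger max_iter.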

Note: pseudo_data is only used for L1 models. Support for L2+ models is not yet implemented. L2+ models will only use the original train_data.

Parameters:

pseudo_data (str or TabularDataset or pd.DataFrame) – Extra data to incorporate into training. Pre-labeled test data is allowed. If no labels are provided, the pseudo-labeling algorithm will predict on the data and select which rows to incorporate into training.
max_iter (int, default = 3) – Maximum number of pseudo-labeling iterations allowed.
return_pred_prob (bool, default = False) – Returns held-out predictive probabilities from pseudo-labeling. If pseudo_data is labeled, returns the model’s predictive probabilities.
use_ensemble (bool, default = False) – If True, uses the ensemble pseudo-labeling algorithm. If False, uses only the best model for pseudo-labeling.
fit_ensemble (bool, default = False) – If True, fits a weighted ensemble model using a combination of the best models. The weighted ensemble is fit after all pseudo-labeling training has completed unless otherwise specified. If False, no weighted ensemble is fit over the models trained with pseudo-labeling and the models trained without it.
fit_ensemble_every_iter (bool, default = False) – If True, fits a weighted ensemble model at every iteration of the pseudo-labeling algorithm. If False and fit_ensemble is True, the ensemble is fit once after all pseudo-labeling training is done.
**kwargs – If the predictor is not already fit, kwargs are passed to both ‘fit’ and ‘fit_extra’. If the predictor is already fit, kwargs are passed to ‘fit_extra’ only. Refer to the parameters documentation in TabularPredictor.fit() and TabularPredictor.fit_extra().
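The interaction between fit_ensemble and fit_ensemble_every_iter can be summarized with a small sketch that reads the parameter descriptions above literally. The helper `ensemble_fit_points` is hypothetical. The descriptions do not spell out the combination fit_ensemble=False with fit_ensemble_every_iter=True; this sketch assumes fit_ensemble_every_iter is honored on its own, which may differ from the library's actual behavior.

```python
def ensemble_fit_points(n_iters_run, fit_ensemble=False,
                        fit_ensemble_every_iter=False):
    """Return the 1-based iterations after which a weighted ensemble
    would be fit, per a literal reading of the parameter docs."""
    if fit_ensemble_every_iter:
        # Assumption: honored regardless of fit_ensemble.
        return list(range(1, n_iters_run + 1))
    if fit_ensemble:
        # Fit once, after all pseudo-labeling training is done.
        return [n_iters_run]
    return []


no_ensemble = ensemble_fit_points(3)
once_at_end = ensemble_fit_points(3, fit_ensemble=True)
every_iter = ensemble_fit_points(3, fit_ensemble=True,
                                 fit_ensemble_every_iter=True)
```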

Returns:

self – Returns self, which is an instance of TabularPredictor

Return type:

TabularPredictor
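A call using the documented keyword arguments might look like the following sketch. The wrapper `fit_with_labeled_pseudo_data` is a hypothetical convenience function, and the label column and file names in the commented usage are placeholders. It leaves return_pred_prob at its default of False, so the call is expected to return the predictor itself, as described under Returns.

```python
def fit_with_labeled_pseudo_data(predictor, pseudo_data, **fit_extra_kwargs):
    """Refit a predictor with extra labelled data (hypothetical wrapper).

    `predictor` is expected to be a TabularPredictor; any extra keyword
    arguments are forwarded through fit_pseudolabel's **kwargs.
    """
    return predictor.fit_pseudolabel(
        pseudo_data=pseudo_data,
        max_iter=3,
        fit_ensemble=True,  # fit a weighted ensemble once training is done
        **fit_extra_kwargs,
    )


# Typical use (requires autogluon; label and file names are placeholders):
# from autogluon.tabular import TabularDataset, TabularPredictor
# predictor = TabularPredictor(label="class").fit(TabularDataset("train.csv"))
# predictor = fit_with_labeled_pseudo_data(
#     predictor, TabularDataset("extra_labeled.csv")
# )
```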