.. _sec_tabulardeployment:
Predicting Columns in a Table - Deployment Optimization
=======================================================
This tutorial covers how to perform the end-to-end AutoML process to
create an optimized and deployable AutoGluon artifact for production
use.
This tutorial assumes you have already read :ref:`sec_tabularquick`
and :ref:`sec_tabularadvanced`.
Fitting a TabularPredictor
--------------------------
We will again use the AdultIncome dataset as in the previous tutorials
and train a predictor to predict whether a person’s income exceeds
$50,000, which is recorded in the ``class`` column of this table.
.. code:: python
from autogluon.tabular import TabularDataset, TabularPredictor
train_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/train.csv')
label = 'class'
subsample_size = 500 # subsample subset of data for faster demo, try setting this to much larger values
train_data = train_data.sample(n=subsample_size, random_state=0)
train_data.head()
.. parsed-literal::
:class: output
       age  workclass  fnlwgt     education  education-num      marital-status       occupation   relationship   race     sex  capital-gain  capital-loss  hours-per-week  native-country  class
 6118   51    Private   39264  Some-college             10  Married-civ-spouse  Exec-managerial           Wife  White  Female             0             0              40   United-States   >50K
23204   58    Private   51662          10th              6  Married-civ-spouse    Other-service           Wife  White  Female             0             0               8   United-States  <=50K
29590   40    Private  326310  Some-college             10  Married-civ-spouse     Craft-repair        Husband  White    Male             0             0              44   United-States  <=50K
18116   37    Private  222450       HS-grad              9       Never-married            Sales  Not-in-family  White    Male             0          2339              40     El-Salvador  <=50K
33964   62    Private  109190     Bachelors             13  Married-civ-spouse  Exec-managerial        Husband  White    Male         15024             0              40   United-States   >50K
.. code:: python
save_path = 'agModels-predictClass-deployment' # specifies folder to store trained models
predictor = TabularPredictor(label=label, path=save_path).fit(train_data)
.. parsed-literal::
:class: output
Beginning AutoGluon training ...
AutoGluon will save models to "agModels-predictClass-deployment/"
AutoGluon Version: 0.6.1b20221213
Python Version: 3.8.10
Operating System: Linux
Platform Machine: x86_64
Platform Version: #1 SMP Tue Nov 30 00:17:50 UTC 2021
Train Data Rows: 500
Train Data Columns: 14
Label Column: class
Preprocessing data ...
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values: [' >50K', ' <=50K']
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression'])
Selected class <--> label mapping: class 1 = >50K, class 0 = <=50K
Note: For your binary classification, AutoGluon arbitrarily selected which label-value represents positive ( >50K) vs negative ( <=50K) class.
To explicitly set the positive_class, either rename classes to 1 and 0, or specify positive_class in Predictor init.
Using Feature Generators to preprocess the data ...
Fitting AutoMLPipelineFeatureGenerator...
Available Memory: 31599.62 MB
Train Data (Original) Memory Usage: 0.29 MB (0.0% of available memory)
Inferring data type of each feature based on column values. Set feature_metadata_in to manually specify special dtypes of the features.
Stage 1 Generators:
Fitting AsTypeFeatureGenerator...
Note: Converting 1 features to boolean dtype as they only contain 2 unique values.
Stage 2 Generators:
Fitting FillNaFeatureGenerator...
Stage 3 Generators:
Fitting IdentityFeatureGenerator...
Fitting CategoryFeatureGenerator...
Fitting CategoryMemoryMinimizeFeatureGenerator...
Stage 4 Generators:
Fitting DropUniqueFeatureGenerator...
Types of features in original data (raw dtype, special dtypes):
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('object', []) : 8 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
Types of features in processed data (raw dtype, special dtypes):
('category', []) : 7 | ['workclass', 'education', 'marital-status', 'occupation', 'relationship', ...]
('int', []) : 6 | ['age', 'fnlwgt', 'education-num', 'capital-gain', 'capital-loss', ...]
('int', ['bool']) : 1 | ['sex']
0.1s = Fit runtime
14 features in original data used to generate 14 features in processed data.
Train Data (Processed) Memory Usage: 0.03 MB (0.0% of available memory)
Data preprocessing and feature engineering runtime = 0.09s ...
AutoGluon will gauge predictive performance using evaluation metric: 'accuracy'
To change this, specify the eval_metric parameter of Predictor()
Automatically generating train/validation split with holdout_frac=0.2, Train Rows: 400, Val Rows: 100
Fitting 13 L1 models ...
Fitting model: KNeighborsUnif ...
0.73 = Validation score (accuracy)
0.61s = Training runtime
0.01s = Validation runtime
Fitting model: KNeighborsDist ...
0.65 = Validation score (accuracy)
0.6s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMXT ...
0.83 = Validation score (accuracy)
1.25s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBM ...
0.85 = Validation score (accuracy)
0.82s = Training runtime
0.01s = Validation runtime
Fitting model: RandomForestGini ...
0.84 = Validation score (accuracy)
1.08s = Training runtime
0.06s = Validation runtime
Fitting model: RandomForestEntr ...
0.83 = Validation score (accuracy)
1.06s = Training runtime
0.06s = Validation runtime
Fitting model: CatBoost ...
0.85 = Validation score (accuracy)
1.4s = Training runtime
0.01s = Validation runtime
Fitting model: ExtraTreesGini ...
0.82 = Validation score (accuracy)
1.07s = Training runtime
0.06s = Validation runtime
Fitting model: ExtraTreesEntr ...
0.81 = Validation score (accuracy)
1.06s = Training runtime
0.06s = Validation runtime
Fitting model: NeuralNetFastAI ...
0.82 = Validation score (accuracy)
2.61s = Training runtime
0.01s = Validation runtime
Fitting model: XGBoost ...
0.87 = Validation score (accuracy)
0.26s = Training runtime
0.01s = Validation runtime
Fitting model: NeuralNetTorch ...
0.83 = Validation score (accuracy)
1.02s = Training runtime
0.01s = Validation runtime
Fitting model: LightGBMLarge ...
0.83 = Validation score (accuracy)
0.54s = Training runtime
0.01s = Validation runtime
Fitting model: WeightedEnsemble_L2 ...
0.87 = Validation score (accuracy)
0.32s = Training runtime
0.0s = Validation runtime
AutoGluon training complete, total runtime = 14.27s ... Best model: "WeightedEnsemble_L2"
TabularPredictor saved. To load, use: predictor = TabularPredictor.load("agModels-predictClass-deployment/")
Next, load separate test data to demonstrate how to make predictions on
new examples at inference time:
.. code:: python
test_data = TabularDataset('https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv')
y_test = test_data[label] # values to predict
test_data.head()
.. parsed-literal::
:class: output
Loaded data from: https://autogluon.s3.amazonaws.com/datasets/Inc/test.csv | Columns = 15 / 15 | Rows = 9769 -> 9769
.. parsed-literal::
:class: output
   age         workclass  fnlwgt     education  education-num      marital-status       occupation   relationship   race     sex  capital-gain  capital-loss  hours-per-week  native-country  class
0   31           Private  169085          11th              7  Married-civ-spouse            Sales           Wife  White  Female             0             0              20   United-States  <=50K
1   17  Self-emp-not-inc  226203          12th              8       Never-married            Sales      Own-child  White    Male             0             0              45   United-States  <=50K
2   47           Private   54260     Assoc-voc             11  Married-civ-spouse  Exec-managerial        Husband  White    Male             0          1887              60   United-States   >50K
3   21           Private  176262  Some-college             10       Never-married  Exec-managerial      Own-child  White  Female             0             0              30   United-States  <=50K
4   17           Private  241185          12th              8       Never-married   Prof-specialty      Own-child  White    Male             0             0              20   United-States  <=50K
We use our trained models to make predictions on the new data:
.. code:: python
predictor = TabularPredictor.load(save_path) # unnecessary, just demonstrates how to load previously-trained predictor from file
y_pred = predictor.predict(test_data)
y_pred
.. parsed-literal::
:class: output
0 <=50K
1 <=50K
2 <=50K
3 <=50K
4 <=50K
...
9764 <=50K
9765 <=50K
9766 <=50K
9767 <=50K
9768 <=50K
Name: class, Length: 9769, dtype: object
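If you just want aggregate scores rather than per-model details, one
option (an aside that is not part of the original run) is
``predictor.evaluate``, which scores the best model against the labels
in the test data:
.. code:: python
# Computes the predictor's eval_metric ('accuracy' here) and related metrics
# on labeled data; returns a dict of metric name -> value.
predictor.evaluate(test_data)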
To evaluate the performance of each individual trained model on our
labeled test data, we can use ``predictor.leaderboard()``:
.. code:: python
predictor.leaderboard(test_data, silent=True)
.. parsed-literal::
:class: output
                  model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
 0     RandomForestGini    0.842870       0.84        0.137671       0.055894  1.080855                 0.137671                0.055894           1.080855            1       True          5
 1             CatBoost    0.842461       0.85        0.012603       0.005573  1.403570                 0.012603                0.005573           1.403570            1       True          7
 2     RandomForestEntr    0.841130       0.83        0.140647       0.060857  1.060027                 0.140647                0.060857           1.060027            1       True          6
 3             LightGBM    0.839799       0.85        0.014990       0.008039  0.824368                 0.014990                0.008039           0.824368            1       True          4
 4              XGBoost    0.837445       0.87        0.050143       0.007187  0.261149                 0.050143                0.007187           0.261149            1       True         11
 5  WeightedEnsemble_L2    0.837445       0.87        0.052607       0.007834  0.583509                 0.002464                0.000648           0.322360            2       True         14
 6           LightGBMXT    0.836421       0.83        0.010455       0.005912  1.248788                 0.010455                0.005912           1.248788            1       True          3
 7       ExtraTreesGini    0.834579       0.82        0.139147       0.060351  1.065567                 0.139147                0.060351           1.065567            1       True          8
 8       NeuralNetTorch    0.833555       0.83        0.056062       0.013697  1.024997                 0.056062                0.013697           1.024997            1       True         12
 9       ExtraTreesEntr    0.833350       0.81        0.140015       0.058261  1.058253                 0.140015                0.058261           1.058253            1       True          9
10        LightGBMLarge    0.828949       0.83        0.036233       0.005726  0.544085                 0.036233                0.005726           0.544085            1       True         13
11      NeuralNetFastAI    0.818610       0.82        0.152624       0.013950  2.614331                 0.152624                0.013950           2.614331            1       True         10
12       KNeighborsUnif    0.725970       0.73        0.027956       0.008520  0.609989                 0.027956                0.008520           0.609989            1       True          1
13       KNeighborsDist    0.695158       0.65        0.025601       0.006325  0.603475                 0.025601                0.006325           0.603475            1       True          2
Snapshot a Predictor with .clone()
----------------------------------
Now that we have a working predictor artifact, we may want to alter it
in a variety of ways to better suit our needs. For example, we may want
to delete certain models to reduce disk usage via ``.delete_models()``,
or train additional models on top of the ones we already have via
``.fit_extra()``, as sketched below.
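As a quick illustration (these calls are not executed in this tutorial,
and the ``hyperparameters`` value shown is only a placeholder), such
alterations look roughly like this:
.. code:: python
# Not run in this tutorial; shown only to illustrate the two operations above.
# Keep only the models required by the best model, deleting the rest from disk:
# predictor.delete_models(models_to_keep='best', dry_run=False)
# Train additional models (here, one extra default LightGBM) on top of the existing ones:
# predictor.fit_extra(hyperparameters={'GBM': {}})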
While you can perform all of these operations on your predictor, you
may want to be able to revert to a prior state of the predictor in
case something goes wrong. This is where ``predictor.clone()`` comes in.
``predictor.clone()`` allows you to create a snapshot of the given
predictor, cloning its artifacts to a new location. You can then freely
experiment with the predictor and load the earlier snapshot whenever
you want to undo your actions.
All you need to do to clone a predictor is specify a new directory path
to clone to:
.. code:: python
save_path_clone = save_path + '-clone'
# will return the path to the cloned predictor, identical to save_path_clone
path_clone = predictor.clone(path=save_path_clone)
.. parsed-literal::
:class: output
Cloned TabularPredictor located in 'agModels-predictClass-deployment/' to 'agModels-predictClass-deployment-clone'.
To load the cloned predictor: predictor_clone = TabularPredictor.load(path="agModels-predictClass-deployment-clone")
Note that this logic doubles disk usage, as it completely clones every
predictor artifact on disk to make an exact replica.
Now we can load the cloned predictor:
.. code:: python
predictor_clone = TabularPredictor.load(path=path_clone)
# You can alternatively load the cloned TabularPredictor at the time of cloning:
# predictor_clone = predictor.clone(path=save_path_clone, return_clone=True)
We can see that the cloned predictor has the same leaderboard and
functionality as the original:
.. code:: python
y_pred_clone = predictor_clone.predict(test_data)
y_pred_clone
.. parsed-literal::
:class: output
0 <=50K
1 <=50K
2 <=50K
3 <=50K
4 <=50K
...
9764 <=50K
9765 <=50K
9766 <=50K
9767 <=50K
9768 <=50K
Name: class, Length: 9769, dtype: object
.. code:: python
y_pred.equals(y_pred_clone)
.. parsed-literal::
:class: output
True
.. code:: python
predictor_clone.leaderboard(test_data, silent=True)
.. parsed-literal::
:class: output
                  model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
 0     RandomForestGini    0.842870       0.84        0.135722       0.055894  1.080855                 0.135722                0.055894           1.080855            1       True          5
 1             CatBoost    0.842461       0.85        0.011522       0.005573  1.403570                 0.011522                0.005573           1.403570            1       True          7
 2     RandomForestEntr    0.841130       0.83        0.134074       0.060857  1.060027                 0.134074                0.060857           1.060027            1       True          6
 3             LightGBM    0.839799       0.85        0.015382       0.008039  0.824368                 0.015382                0.008039           0.824368            1       True          4
 4              XGBoost    0.837445       0.87        0.046215       0.007187  0.261149                 0.046215                0.007187           0.261149            1       True         11
 5  WeightedEnsemble_L2    0.837445       0.87        0.048518       0.007834  0.583509                 0.002304                0.000648           0.322360            2       True         14
 6           LightGBMXT    0.836421       0.83        0.010329       0.005912  1.248788                 0.010329                0.005912           1.248788            1       True          3
 7       ExtraTreesGini    0.834579       0.82        0.135083       0.060351  1.065567                 0.135083                0.060351           1.065567            1       True          8
 8       NeuralNetTorch    0.833555       0.83        0.053613       0.013697  1.024997                 0.053613                0.013697           1.024997            1       True         12
 9       ExtraTreesEntr    0.833350       0.81        0.139537       0.058261  1.058253                 0.139537                0.058261           1.058253            1       True          9
10        LightGBMLarge    0.828949       0.83        0.034355       0.005726  0.544085                 0.034355                0.005726           0.544085            1       True         13
11      NeuralNetFastAI    0.818610       0.82        0.143432       0.013950  2.614331                 0.143432                0.013950           2.614331            1       True         10
12       KNeighborsUnif    0.725970       0.73        0.026624       0.008520  0.609989                 0.026624                0.008520           0.609989            1       True          1
13       KNeighborsDist    0.695158       0.65        0.025924       0.006325  0.603475                 0.025924                0.006325           0.603475            1       True          2
Now let’s do some extra logic with the clone, such as calling
``refit_full``:
.. code:: python
predictor_clone.refit_full()
predictor_clone.leaderboard(test_data, silent=True)
.. parsed-literal::
:class: output
Fitting 1 L1 models ...
Fitting model: KNeighborsUnif_FULL ...
0.01s = Training runtime
Fitting 1 L1 models ...
Fitting model: KNeighborsDist_FULL ...
0.01s = Training runtime
Fitting 1 L1 models ...
Fitting model: LightGBMXT_FULL ...
0.14s = Training runtime
Fitting 1 L1 models ...
Fitting model: LightGBM_FULL ...
0.16s = Training runtime
Fitting 1 L1 models ...
Fitting model: RandomForestGini_FULL ...
0.48s = Training runtime
Fitting 1 L1 models ...
Fitting model: RandomForestEntr_FULL ...
0.47s = Training runtime
Fitting 1 L1 models ...
Fitting model: CatBoost_FULL ...
0.03s = Training runtime
Fitting 1 L1 models ...
Fitting model: ExtraTreesGini_FULL ...
0.47s = Training runtime
Fitting 1 L1 models ...
Fitting model: ExtraTreesEntr_FULL ...
0.47s = Training runtime
Fitting 1 L1 models ...
Fitting model: NeuralNetFastAI_FULL ...
No improvement since epoch 0: early stopping
0.39s = Training runtime
Fitting 1 L1 models ...
Fitting model: XGBoost_FULL ...
0.07s = Training runtime
Fitting 1 L1 models ...
Fitting model: NeuralNetTorch_FULL ...
0.56s = Training runtime
Fitting 1 L1 models ...
Fitting model: LightGBMLarge_FULL ...
0.22s = Training runtime
Fitting model: WeightedEnsemble_L2_FULL | Skipping fit via cloning parent ...
0.32s = Training runtime
Updated best model to "WeightedEnsemble_L2_FULL" (Previously "WeightedEnsemble_L2"). AutoGluon will default to using "WeightedEnsemble_L2_FULL" for predict() and predict_proba().
.. parsed-literal::
:class: output
                       model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
 0             CatBoost_FULL    0.842870        NaN        0.011228            NaN  0.026799                 0.011228                     NaN           0.026799            1       True         21
 1          RandomForestGini    0.842870       0.84        0.138639       0.055894  1.080855                 0.138639                0.055894           1.080855            1       True          5
 2                  CatBoost    0.842461       0.85        0.012043       0.005573  1.403570                 0.012043                0.005573           1.403570            1       True          7
 3          RandomForestEntr    0.841130       0.83        0.138774       0.060857  1.060027                 0.138774                0.060857           1.060027            1       True          6
 4             LightGBM_FULL    0.840823        NaN        0.017195            NaN  0.163478                 0.017195                     NaN           0.163478            1       True         18
 5                  LightGBM    0.839799       0.85        0.015824       0.008039  0.824368                 0.015824                0.008039           0.824368            1       True          4
 6     RandomForestGini_FULL    0.839595        NaN        0.140190            NaN  0.478390                 0.140190                     NaN           0.478390            1       True         19
 7     RandomForestEntr_FULL    0.839185        NaN        0.138538            NaN  0.474687                 0.138538                     NaN           0.474687            1       True         20
 8           LightGBMXT_FULL    0.837957        NaN        0.011016            NaN  0.137915                 0.011016                     NaN           0.137915            1       True         17
 9                   XGBoost    0.837445       0.87        0.048745       0.007187  0.261149                 0.048745                0.007187           0.261149            1       True         11
10       WeightedEnsemble_L2    0.837445       0.87        0.051331       0.007834  0.583509                 0.002586                0.000648           0.322360            2       True         14
11                LightGBMXT    0.836421       0.83        0.010284       0.005912  1.248788                 0.010284                0.005912           1.248788            1       True          3
12       ExtraTreesEntr_FULL    0.835910        NaN        0.143991            NaN  0.473450                 0.143991                     NaN           0.473450            1       True         23
13       NeuralNetTorch_FULL    0.835091        NaN        0.058724            NaN  0.559810                 0.058724                     NaN           0.559810            1       True         26
14            ExtraTreesGini    0.834579       0.82        0.142700       0.060351  1.065567                 0.142700                0.060351           1.065567            1       True          8
15       ExtraTreesGini_FULL    0.833862        NaN        0.141119            NaN  0.472204                 0.141119                     NaN           0.472204            1       True         22
16            NeuralNetTorch    0.833555       0.83        0.057129       0.013697  1.024997                 0.057129                0.013697           1.024997            1       True         12
17            ExtraTreesEntr    0.833350       0.81        0.140146       0.058261  1.058253                 0.140146                0.058261           1.058253            1       True          9
18              XGBoost_FULL    0.831610        NaN        0.044393            NaN  0.069248                 0.044393                     NaN           0.069248            1       True         25
19  WeightedEnsemble_L2_FULL    0.831610        NaN        0.047146            NaN  0.391608                 0.002753                     NaN           0.322360            2       True         28
20             LightGBMLarge    0.828949       0.83        0.038662       0.005726  0.544085                 0.038662                0.005726           0.544085            1       True         13
21        LightGBMLarge_FULL    0.820964        NaN        0.041921            NaN  0.220074                 0.041921                     NaN           0.220074            1       True         27
22           NeuralNetFastAI    0.818610       0.82        0.155864       0.013950  2.614331                 0.155864                0.013950           2.614331            1       True         10
23      NeuralNetFastAI_FULL    0.769270        NaN        0.151720            NaN  0.386512                 0.151720                     NaN           0.386512            1       True         24
24            KNeighborsUnif    0.725970       0.73        0.025264       0.008520  0.609989                 0.025264                0.008520           0.609989            1       True          1
25       KNeighborsUnif_FULL    0.725151        NaN        0.023710            NaN  0.005904                 0.023710                     NaN           0.005904            1       True         15
26            KNeighborsDist    0.695158       0.65        0.027080       0.006325  0.603475                 0.027080                0.006325           0.603475            1       True          2
27       KNeighborsDist_FULL    0.685434        NaN        0.025221            NaN  0.005437                 0.025221                     NaN           0.005437            1       True         16
We can see that we were able to fit additional models, but for whatever
reason we may want to undo this operation.
Luckily, our original predictor is untouched!
.. code:: python
predictor.leaderboard(test_data, silent=True)
.. parsed-literal::
:class: output
                  model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
 0     RandomForestGini    0.842870       0.84        0.140122       0.055894  1.080855                 0.140122                0.055894           1.080855            1       True          5
 1             CatBoost    0.842461       0.85        0.011801       0.005573  1.403570                 0.011801                0.005573           1.403570            1       True          7
 2     RandomForestEntr    0.841130       0.83        0.139719       0.060857  1.060027                 0.139719                0.060857           1.060027            1       True          6
 3             LightGBM    0.839799       0.85        0.016043       0.008039  0.824368                 0.016043                0.008039           0.824368            1       True          4
 4              XGBoost    0.837445       0.87        0.049586       0.007187  0.261149                 0.049586                0.007187           0.261149            1       True         11
 5  WeightedEnsemble_L2    0.837445       0.87        0.052166       0.007834  0.583509                 0.002579                0.000648           0.322360            2       True         14
 6           LightGBMXT    0.836421       0.83        0.010703       0.005912  1.248788                 0.010703                0.005912           1.248788            1       True          3
 7       ExtraTreesGini    0.834579       0.82        0.140917       0.060351  1.065567                 0.140917                0.060351           1.065567            1       True          8
 8       NeuralNetTorch    0.833555       0.83        0.060173       0.013697  1.024997                 0.060173                0.013697           1.024997            1       True         12
 9       ExtraTreesEntr    0.833350       0.81        0.139170       0.058261  1.058253                 0.139170                0.058261           1.058253            1       True          9
10        LightGBMLarge    0.828949       0.83        0.034843       0.005726  0.544085                 0.034843                0.005726           0.544085            1       True         13
11      NeuralNetFastAI    0.818610       0.82        0.160457       0.013950  2.614331                 0.160457                0.013950           2.614331            1       True         10
12       KNeighborsUnif    0.725970       0.73        0.015760       0.008520  0.609989                 0.015760                0.008520           0.609989            1       True          1
13       KNeighborsDist    0.695158       0.65        0.024812       0.006325  0.603475                 0.024812                0.006325           0.603475            1       True          2
We can simply clone a new predictor from our original, and we will no
longer be impacted by the call to ``refit_full`` on the prior clone, as
sketched below.
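For example (a minimal sketch; the ``'-clone-2'`` suffix is just an
illustrative choice of path):
.. code:: python
# Snapshot the original again; this fresh clone contains none of the _FULL
# models that refit_full() added to the first clone.
path_clone_2 = predictor.clone(path=save_path + '-clone-2')
predictor_clone_2 = TabularPredictor.load(path=path_clone_2)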
Snapshot a deployment-optimized Predictor via .clone_for_deployment()
---------------------------------------------------------------------
Instead of cloning an exact copy, we can clone a copy that contains
only the minimal set of artifacts needed for prediction.
Note that this optimized clone will have very limited functionality
outside of calling ``predict`` and ``predict_proba``. For example, it
will be unable to train more models.
.. code:: python
save_path_clone_opt = save_path + '-clone-opt'
# will return the path to the cloned predictor, identical to save_path_clone_opt
path_clone_opt = predictor.clone_for_deployment(path=save_path_clone_opt)
.. parsed-literal::
:class: output
Cloned TabularPredictor located in 'agModels-predictClass-deployment/' to 'agModels-predictClass-deployment-clone-opt'.
To load the cloned predictor: predictor_clone = TabularPredictor.load(path="agModels-predictClass-deployment-clone-opt")
Clone: Keeping minimum set of models required to predict with best model 'WeightedEnsemble_L2'...
Deleting model KNeighborsUnif. All files under agModels-predictClass-deployment-clone-opt/models/KNeighborsUnif/ will be removed.
Deleting model KNeighborsDist. All files under agModels-predictClass-deployment-clone-opt/models/KNeighborsDist/ will be removed.
Deleting model LightGBMXT. All files under agModels-predictClass-deployment-clone-opt/models/LightGBMXT/ will be removed.
Deleting model LightGBM. All files under agModels-predictClass-deployment-clone-opt/models/LightGBM/ will be removed.
Deleting model RandomForestGini. All files under agModels-predictClass-deployment-clone-opt/models/RandomForestGini/ will be removed.
Deleting model RandomForestEntr. All files under agModels-predictClass-deployment-clone-opt/models/RandomForestEntr/ will be removed.
Deleting model CatBoost. All files under agModels-predictClass-deployment-clone-opt/models/CatBoost/ will be removed.
Deleting model ExtraTreesGini. All files under agModels-predictClass-deployment-clone-opt/models/ExtraTreesGini/ will be removed.
Deleting model ExtraTreesEntr. All files under agModels-predictClass-deployment-clone-opt/models/ExtraTreesEntr/ will be removed.
Deleting model NeuralNetFastAI. All files under agModels-predictClass-deployment-clone-opt/models/NeuralNetFastAI/ will be removed.
Deleting model NeuralNetTorch. All files under agModels-predictClass-deployment-clone-opt/models/NeuralNetTorch/ will be removed.
Deleting model LightGBMLarge. All files under agModels-predictClass-deployment-clone-opt/models/LightGBMLarge/ will be removed.
Clone: Removing artifacts unnecessary for prediction. NOTE: Clone can no longer fit new models, and most functionality except for predict and predict_proba will no longer work
.. code:: python
predictor_clone_opt = TabularPredictor.load(path=path_clone_opt)
We can see that the optimized clone still makes the same predictions:
.. code:: python
y_pred_clone_opt = predictor_clone_opt.predict(test_data)
y_pred_clone_opt
.. parsed-literal::
:class: output
0 <=50K
1 <=50K
2 <=50K
3 <=50K
4 <=50K
...
9764 <=50K
9765 <=50K
9766 <=50K
9767 <=50K
9768 <=50K
Name: class, Length: 9769, dtype: object
.. code:: python
y_pred.equals(y_pred_clone_opt)
.. parsed-literal::
:class: output
True
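As an additional sanity check (this call was not part of the original
run), ``predict_proba`` also still works on the optimized clone:
.. code:: python
# Class probabilities remain available; only fitting-related artifacts were stripped.
y_pred_proba_clone_opt = predictor_clone_opt.predict_proba(test_data)
y_pred_proba_clone_opt.head()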
.. code:: python
predictor_clone_opt.leaderboard(test_data, silent=True)
.. parsed-literal::
:class: output
                 model  score_test  score_val  pred_time_test  pred_time_val  fit_time  pred_time_test_marginal  pred_time_val_marginal  fit_time_marginal  stack_level  can_infer  fit_order
0              XGBoost    0.837445       0.87        0.038097       0.007187  0.261149                 0.038097                0.007187           0.261149            1       True          1
1  WeightedEnsemble_L2    0.837445       0.87        0.040542       0.007834  0.583509                 0.002445                0.000648           0.322360            2       True          2
We can check the disk usage of the optimized clone compared to the
original:
.. code:: python
size_original = predictor.get_size_disk()
size_opt = predictor_clone_opt.get_size_disk()
print(f'Size Original: {size_original} bytes')
print(f'Size Optimized: {size_opt} bytes')
print(f'Optimized predictor achieved a {round((1 - (size_opt/size_original)) * 100, 1)}% reduction in disk usage.')
.. parsed-literal::
:class: output
Size Original: 16966478 bytes
Size Optimized: 601220 bytes
Optimized predictor achieved a 96.5% reduction in disk usage.
We can also compare the individual files that make up the original and
the optimized predictors.
Original:
.. code:: python
predictor.get_size_disk_per_file()
.. parsed-literal::
:class: output
models/ExtraTreesGini/model.pkl 4567890
models/ExtraTreesEntr/model.pkl 4530305
models/RandomForestGini/model.pkl 3076492
models/RandomForestEntr/model.pkl 2949158
models/XGBoost/xgb.ubj 564906
models/LightGBMLarge/model.pkl 470889
models/NeuralNetTorch/net.params 234610
models/NeuralNetFastAI/model-internals.pkl 167374
models/LightGBM/model.pkl 146038
models/LightGBMXT/model.pkl 42071
models/KNeighborsDist/model.pkl 39986
models/KNeighborsUnif/model.pkl 39985
utils/data/X.pkl 27655
models/CatBoost/model.pkl 21562
models/NeuralNetTorch/model.pkl 18149
learner.pkl 10719
metadata.json 8632
utils/data/X_val.pkl 8421
models/WeightedEnsemble_L2/model.pkl 8122
utils/data/y.pkl 7488
models/XGBoost/model.pkl 5475
models/trainer.pkl 5124
models/NeuralNetFastAI/model.pkl 3352
utils/data/y_val.pkl 2381
models/WeightedEnsemble_L2/utils/model_template.pkl 1024
models/WeightedEnsemble_L2/utils/oof.pkl 764
predictor.pkl 742
utils/attr/NeuralNetTorch/y_pred_proba_val.pkl 550
utils/attr/XGBoost/y_pred_proba_val.pkl 550
utils/attr/NeuralNetFastAI/y_pred_proba_val.pkl 550
utils/attr/ExtraTreesEntr/y_pred_proba_val.pkl 550
utils/attr/ExtraTreesGini/y_pred_proba_val.pkl 550
utils/attr/CatBoost/y_pred_proba_val.pkl 550
utils/attr/RandomForestEntr/y_pred_proba_val.pkl 550
utils/attr/RandomForestGini/y_pred_proba_val.pkl 550
utils/attr/LightGBM/y_pred_proba_val.pkl 550
utils/attr/LightGBMXT/y_pred_proba_val.pkl 550
utils/attr/KNeighborsDist/y_pred_proba_val.pkl 550
utils/attr/KNeighborsUnif/y_pred_proba_val.pkl 550
utils/attr/LightGBMLarge/y_pred_proba_val.pkl 550
__version__ 14
Name: size, dtype: int64
Optimized:
.. code:: python
predictor_clone_opt.get_size_disk_per_file()
.. parsed-literal::
:class: output
models/XGBoost/xgb.ubj 564906
learner.pkl 10719
metadata.json 8632
models/WeightedEnsemble_L2/model.pkl 8286
models/XGBoost/model.pkl 5495
models/trainer.pkl 2426
predictor.pkl 742
__version__ 14
Name: size, dtype: int64
Now all that is left is to upload the optimized predictor to a
centralized storage location such as S3. To use this predictor on a new
machine or system, simply download the artifact to local disk and load
the predictor. Ensure that when loading a predictor you use the same
Python version and AutoGluon version used during training to avoid
instability.
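As a rough sketch of that workflow (the bucket name and key below are
hypothetical placeholders, and this assumes AWS credentials are already
configured for ``boto3``):
.. code:: python
import shutil
import boto3
bucket = 'my-model-bucket'  # hypothetical bucket name
key = 'predictors/agModels-predictClass-deployment-clone-opt.zip'
# Package the predictor directory into a single zip archive and upload it.
archive_path = shutil.make_archive(path_clone_opt, 'zip', root_dir=path_clone_opt)
boto3.client('s3').upload_file(archive_path, bucket, key)
# On the new machine (same Python and AutoGluon versions as used in training):
boto3.client('s3').download_file(bucket, key, 'predictor-download.zip')
shutil.unpack_archive('predictor-download.zip', 'predictor-download')
predictor_prod = TabularPredictor.load('predictor-download')
y_pred_prod = predictor_prod.predict(test_data)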