EDA: Automated Quick Model Fit
==============================
The purpose of this feature is to provide a quick and easy way to obtain
a preliminary understanding of the relationships between the target
variable and the independent variables in a dataset.

This functionality automatically splits the training data, fits a simple
regression or classification model and generates insights: model
performance metrics, feature importance and prediction quality details.
To inspect the prediction quality, a confusion matrix is displayed for
classification problems and a scatter plot of actual vs. predicted values
for regression problems. Both representations allow the user to see the
difference between actual and predicted values.
The insights highlight two subsets of the model predictions:

- predictions with the largest classification error. Rows listed in
  this section are candidates for inspecting why the model made the
  mistakes.
- predictions with the least distance from the other class. Rows in
  this category are the most ‘undecided’. They are useful as examples of
  data close to the decision boundary between the classes; the model
  would benefit from having more data for similar cases. A minimal sketch
  of how this distance relates to the predicted class probabilities is
  shown below.
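The ‘distance from the other class’ is simply the gap between the predicted
class probabilities. The following is an illustrative sketch of that idea
(not the library's internal implementation), using probability values that
appear in the example tables later in this tutorial:

.. code:: python

    import pandas as pd

    # Predicted probabilities for class 0 and class 1 (values copied from the
    # example tables further below).
    proba = pd.DataFrame({0: [0.046788, 0.503872], 1: [0.953212, 0.496128]})

    # A small difference between the two class scores means the row sits
    # close to the decision boundary.
    score_diff = (proba[1] - proba[0]).abs()
    print(score_diff)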
Classification Example
----------------------
We will start by loading the Titanic dataset and performing a quick
one-line overview to get the information.
.. code:: python

    import pandas as pd
    import autogluon.eda.auto as auto

    df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/train.csv')
    df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/titanic/test.csv')
    target_col = 'Survived'

    auto.quick_fit(df_train, target_col, show_feature_importance_barplots=True)

.. parsed-literal::
    :class: output

    No path specified. Models will be saved in: "AutogluonModels/ag-20230204_010022/"
Model Prediction for Survived
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. figure:: output_eda-auto-quick-fit_abf798_1_2.png
Model Leaderboard
~~~~~~~~~~~~~~~~~
.. csv-table::
    :header: "", "model", "score_test", "score_val", "pred_time_test", "pred_time_val", "fit_time", "pred_time_test_marginal", "pred_time_val_marginal", "fit_time_marginal", "stack_level", "can_infer", "fit_order"

    0, LightGBMXT, 0.809701, 0.856, 0.004289, 0.003677, 0.769681, 0.004289, 0.003677, 0.769681, 1, True, 1
Feature Importance for Trained Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. csv-table::
    :header: "", "importance", "stddev", "p_value", "n", "p99_high", "p99_low"

    Sex, 0.112687, 0.013033, 0.000021, 5, 0.139522, 0.085851
    Name, 0.055970, 0.009140, 0.000082, 5, 0.074789, 0.037151
    SibSp, 0.026119, 0.010554, 0.002605, 5, 0.047850, 0.004389
    Fare, 0.012687, 0.009730, 0.021720, 5, 0.032721, -0.007348
    Embarked, 0.011194, 0.006981, 0.011525, 5, 0.025567, -0.003179
    Age, 0.010448, 0.003122, 0.000853, 5, 0.016876, 0.004020
    PassengerId, 0.008955, 0.005659, 0.012022, 5, 0.020607, -0.002696
    Cabin, 0.002985, 0.006675, 0.186950, 5, 0.016729, -0.010758
    Pclass, 0.002239, 0.005659, 0.213159, 5, 0.013890, -0.009413
    Parch, 0.001493, 0.002044, 0.088904, 5, 0.005701, -0.002716
    Ticket, 0.000000, 0.000000, 0.500000, 5, 0.000000, 0.000000
.. figure:: output_eda-auto-quick-fit_abf798_1_7.png
Rows with the highest prediction error
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rows in this category are worth inspecting for the causes of the error;
a short snippet for pulling them out of the training data is shown after
the table.
.. csv-table::
    :header: "", "PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked", "Survived", "0", "1", "error"

    498, 499, 1, "Allison, Mrs. Hudson J C (Bessie Waldo Daniels)", female, 25.0, 1, 2, 113781, 151.5500, C22 C26, S, 0, 0.046788, 0.953212, 0.953212
    267, 268, 3, "Persson, Mr. Ernst Ulrik", male, 25.0, 1, 0, 347083, 7.7750, NaN, S, 1, 0.932024, 0.067976, 0.932024
    569, 570, 3, "Jonsson, Mr. Carl", male, 32.0, 0, 0, 350417, 7.8542, NaN, S, 1, 0.922265, 0.077735, 0.922265
    283, 284, 3, "Dorking, Mr. Edward Arthur", male, 19.0, 0, 0, A/5. 10482, 8.0500, NaN, S, 1, 0.921180, 0.078820, 0.921180
    821, 822, 3, "Lulic, Mr. Nikola", male, 27.0, 0, 0, 315098, 8.6625, NaN, S, 1, 0.919709, 0.080291, 0.919709
    301, 302, 3, "McCoy, Mr. Bernard", male, NaN, 2, 0, 367226, 23.2500, NaN, Q, 1, 0.918546, 0.081454, 0.918546
    288, 289, 2, "Hosono, Mr. Masabumi", male, 42.0, 0, 0, 237798, 13.0000, NaN, S, 1, 0.907043, 0.092957, 0.907043
    36, 37, 3, "Mamee, Mr. Hanna", male, NaN, 0, 0, 2677, 7.2292, NaN, C, 1, 0.906803, 0.093197, 0.906803
    127, 128, 3, "Madsen, Mr. Fridtjof Arne", male, 24.0, 0, 0, C 17369, 7.1417, NaN, S, 1, 0.906605, 0.093395, 0.906605
    391, 392, 3, "Jansson, Mr. Carl Olof", male, 21.0, 0, 0, 350034, 7.7958, NaN, S, 1, 0.905367, 0.094633, 0.905367
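The row indexes shown in the table can be used to pull the original
training rows for a closer look. This is a hedged illustration, not part of
the ``quick_fit`` output; the ``high_error_idx`` list below simply repeats
indexes from the table above:

.. code:: python

    # Indexes of the worst-predicted rows, copied from the table above.
    high_error_idx = [498, 267, 569, 283, 821]

    # Inspect the raw training rows behind the largest errors.
    df_train.loc[high_error_idx, ['Name', 'Sex', 'Age', 'Pclass', 'Fare', 'Survived']]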
Rows with the least distance vs other class
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rows in this category are the closest to the decision boundary versus the
other class and are good candidates for additional labeling.
.. csv-table::
    :header: "", "PassengerId", "Pclass", "Name", "Sex", "Age", "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked", "Survived", "0", "1", "score_diff"

    182, 183, 3, "Asplund, Master. Clarence Gustaf Hugo", male, 9.0, 4, 2, 347077, 31.3875, NaN, S, 0, 0.503872, 0.496128, 0.007743
    475, 476, 1, "Clifford, Mr. George Quincy", male, NaN, 0, 0, 110465, 52.0000, A14, S, 0, 0.509178, 0.490822, 0.018356
    347, 348, 3, "Davison, Mrs. Thomas Henry (Mary E Finck)", female, NaN, 1, 0, 386525, 16.1000, NaN, S, 1, 0.510786, 0.489214, 0.021572
    192, 193, 3, "Andersen-Jensen, Miss. Carla Christine Nielsine", female, 19.0, 1, 0, 350046, 7.8542, NaN, S, 1, 0.512167, 0.487833, 0.024334
    330, 331, 3, "McCoy, Miss. Agnes", female, NaN, 2, 0, 367226, 23.2500, NaN, Q, 1, 0.478502, 0.521498, 0.042996
    572, 573, 1, "Flynn, Mr. John Irwin (""Irving"")", male, 36.0, 0, 0, PC 17474, 26.3875, E25, S, 1, 0.478234, 0.521766, 0.043532
    792, 793, 3, "Sage, Miss. Stella Anna", female, NaN, 8, 2, CA. 2343, 69.5500, NaN, S, 0, 0.525041, 0.474959, 0.050082
    172, 173, 3, "Johnson, Miss. Eleanor Ileen", female, 1.0, 1, 1, 347742, 11.1333, NaN, S, 1, 0.526793, 0.473207, 0.053585
    328, 329, 3, "Goldsmith, Mrs. Frank John (Emily Alice Brown)", female, 31.0, 1, 1, 363291, 20.5250, NaN, S, 1, 0.531574, 0.468426, 0.063149
    593, 594, 3, "Bourke, Miss. Mary", female, NaN, 0, 2, 364848, 7.7500, NaN, Q, 0, 0.463840, 0.536160, 0.072319
Regression Example
------------------
In the previous section we tried a classification example. Let's now try
regression. It has a few differences. We are also going to store the
fitted model by specifying the ``return_state`` and ``save_model_to_state``
parameters. This will allow us to use the model to predict test values
later.

It is a large dataset, so we'll keep only a few columns for this
tutorial.
.. code:: python

    df_train = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/AmesHousingPriceRegression/train_data.csv')
    df_test = pd.read_csv('https://autogluon.s3.amazonaws.com/datasets/AmesHousingPriceRegression/test_data.csv')
    target_col = 'SalePrice'

    keep_cols = [
        'Overall.Qual', 'Gr.Liv.Area', 'Neighborhood', 'Total.Bsmt.SF', 'BsmtFin.SF.1',
        'X1st.Flr.SF', 'Bsmt.Qual', 'Garage.Cars', 'Half.Bath', 'Year.Remod.Add', target_col
    ]
    df_train = df_train[[c for c in df_train.columns if c in keep_cols]][:500]
    df_test = df_test[[c for c in df_test.columns if c in keep_cols]][:500]

    state = auto.quick_fit(df_train, target_col, return_state=True, save_model_to_state=True)

.. parsed-literal::
    :class: output

    No path specified. Models will be saved in: "AutogluonModels/ag-20230204_010025/"
Model Prediction for SalePrice
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. figure:: output_eda-auto-quick-fit_abf798_3_2.png
Model Leaderboard
~~~~~~~~~~~~~~~~~
.. csv-table::
    :header: "", "model", "score_test", "score_val", "pred_time_test", "pred_time_val", "fit_time", "pred_time_test_marginal", "pred_time_val_marginal", "fit_time_marginal", "stack_level", "can_infer", "fit_order"

    0, LightGBMXT, -29100.820216, -31075.785774, 0.003136, 0.003143, 0.722433, 0.003136, 0.003143, 0.722433, 1, True, 1
Feature Importance for Trained Model
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.. csv-table::
    :header: "", "importance", "stddev", "p_value", "n", "p99_high", "p99_low"

    Overall.Qual, 16126.273271, 1376.470881, 0.000006, 5, 18960.445840, 13292.100702
    Gr.Liv.Area, 8862.693281, 480.183424, 0.000001, 5, 9851.397587, 7873.988974
    Total.Bsmt.SF, 5299.844900, 870.222500, 0.000084, 5, 7091.645055, 3508.044746
    Garage.Cars, 4472.147453, 660.340484, 0.000055, 5, 5831.797636, 3112.497270
    X1st.Flr.SF, 3804.848692, 692.065035, 0.000126, 5, 5229.820166, 2379.877219
    BsmtFin.SF.1, 3725.145846, 369.988099, 0.000012, 5, 4486.956454, 2963.335237
    Year.Remod.Add, 3562.868687, 1081.770172, 0.000906, 5, 5790.248423, 1335.488951
    Half.Bath, 3020.213571, 1041.365031, 0.001457, 5, 5164.398563, 876.028580
    Neighborhood, 624.378438, 297.685200, 0.004689, 5, 1237.316379, 11.440497
    Bsmt.Qual, 0.000000, 0.000000, 0.500000, 5, 0.000000, 0.000000
Rows with the highest prediction error
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Rows in this category are worth inspecting for the causes of the error.
.. csv-table::
    :header: "", "Neighborhood", "Overall.Qual", "Year.Remod.Add", "Bsmt.Qual", "BsmtFin.SF.1", "Total.Bsmt.SF", "X1st.Flr.SF", "Gr.Liv.Area", "Half.Bath", "Garage.Cars", "SalePrice", "SalePrice_pred", "error"

    134, Edwards, 6, 1966, Gd, 0.0, 697.0, 1575, 2201, 0, 2.0, 274970, 150482.062500, 124487.937500
    90, Timber, 10, 2007, Ex, 0.0, 1824.0, 1824, 1824, 0, 3.0, 392000, 277549.312500, 114450.687500
    468, NridgHt, 9, 2003, Ex, 1972.0, 2452.0, 2452, 2452, 0, 3.0, 445000, 344158.687500, 100841.312500
    45, NridgHt, 9, 2006, Ex, 0.0, 1704.0, 1722, 2758, 1, 3.0, 418000, 327791.531250, 90208.468750
    118, Somerst, 7, 2006, Gd, 788.0, 960.0, 960, 2318, 1, 2.0, 294323, 218767.953125, 75555.046875
    318, Crawfor, 7, 2002, Gd, 1406.0, 1902.0, 1902, 1902, 0, 2.0, 335000, 265314.125000, 69685.875000
    26, Mitchel, 5, 2006, NaN, 0.0, 0.0, 1771, 1771, 0, 2.0, 115000, 179189.171875, 64189.171875
    233, NoRidge, 8, 2000, Gd, 655.0, 1145.0, 1145, 2198, 1, 3.0, 250000, 311554.031250, 61554.031250
    322, NoRidge, 8, 1993, Gd, 1129.0, 1390.0, 1402, 2225, 1, 3.0, 285000, 341849.781250, 56849.781250
    340, ClearCr, 7, 2005, Gd, 226.0, 1385.0, 1363, 1363, 0, 2.0, 241500, 185450.109375, 56049.890625
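The ``error`` column above is simply the absolute difference between the
actual and predicted sale price. A quick, purely illustrative check for the
two largest errors, using values copied from the table:

.. code:: python

    import pandas as pd

    # Two rows copied from the table above; 'error' is |SalePrice - SalePrice_pred|.
    worst = pd.DataFrame({'SalePrice': [274970, 392000],
                          'SalePrice_pred': [150482.0625, 277549.3125]})
    worst['error'] = (worst['SalePrice'] - worst['SalePrice_pred']).abs()
    print(worst)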
Using a fitted model
--------------------
Now let's get the ``model`` from ``state``, perform the prediction on
``df_test`` and quickly visualize the results using the
``auto.analyze_interaction()`` tool:
.. code:: python

    model = state.model
    y_pred = model.predict(df_test)
    auto.analyze_interaction(
        train_data=pd.DataFrame({'SalePrice_Pred': y_pred}),
        x='SalePrice_Pred',
        fit_distributions=['johnsonsu', 'norm', 'exponnorm']
    )
.. figure:: output_eda-auto-quick-fit_abf798_5_0.png
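As a hedged follow-up (assuming ``df_test`` still contains the
``SalePrice`` column after the column filtering above), the hold-out
predictions can also be scored directly:

.. code:: python

    # Score the hold-out predictions against the actual prices; skipped if the
    # target column is not present in df_test.
    from sklearn.metrics import mean_absolute_error

    if target_col in df_test.columns:
        mae = mean_absolute_error(df_test[target_col], y_pred)
        print(f'Test MAE: {mae:,.0f}')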