.. _cloud_aws_sagemaker_fit: Cloud Training with AWS SageMaker ================================= To help with AutoGluon models training, AWS developed a set of training and inference `deep learning containers `__. The containers can be used to train models with CPU and GPU instances and deployed as a SageMaker endpoint or used as a batch transform job. The full end-to-end example is available in `amazon-sagemaker-examples `__ repository. Pre-requisites -------------- Before starting ensure that the latest version of sagemaker python API is installed via (``pip install --upgrade sagemaker``). This is required to ensure the information about newly released containers is available. Training Scripts ---------------- To start using the containers, a user training script and the `wrapper classes `__ are required. When authoring a training/inference `scripts `__, please refer to SageMaker `documentation `__. Here is one of the possible training scripts, which takes AutoGluon parameters as a YAML config and outputs predictions, models leaderboard and feature importance: .. code:: python import argparse import os from pprint import pprint import yaml from autogluon.tabular import TabularDataset, TabularPredictor def get_input_path(path): file = os.listdir(path)[0] if len(os.listdir(path)) > 1: print(f"WARN: more than one file is found in {channel} directory") print(f"Using {file}") filename = f"{path}/{file}" return filename def get_env_if_present(name): result = None if name in os.environ: result = os.environ[name] return result if __name__ == "__main__": # Disable Autotune os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0" # ------------------------------------------------------------ Arguments parsing print("Starting AG") parser = argparse.ArgumentParser() # Data, model, and output directories parser.add_argument( "--output-data-dir", type=str, default=get_env_if_present("SM_OUTPUT_DATA_DIR") ) parser.add_argument("--model-dir", type=str, default=get_env_if_present("SM_MODEL_DIR")) parser.add_argument("--n_gpus", type=str, default=get_env_if_present("SM_NUM_GPUS")) parser.add_argument("--training_dir", type=str, default=get_env_if_present("SM_CHANNEL_TRAIN")) parser.add_argument( "--test_dir", type=str, required=False, default=get_env_if_present("SM_CHANNEL_TEST") ) parser.add_argument("--ag_config", type=str, default=get_env_if_present("SM_CHANNEL_CONFIG")) args, _ = parser.parse_known_args() print(f"Args: {args}") # See SageMaker-specific environment variables: https://sagemaker.readthedocs.io/en/stable/overview.html#prepare-a-training-script os.makedirs(args.output_data_dir, mode=0o777, exist_ok=True) config_file = get_input_path(args.ag_config) with open(config_file) as f: config = yaml.safe_load(f) # AutoGluon-specific config if args.n_gpus: config["num_gpus"] = int(args.n_gpus) print("Running training job with the config:") pprint(config) # ---------------------------------------------------------------- Training train_file = get_input_path(args.training_dir) train_data = TabularDataset(train_file) ag_predictor_args = config["ag_predictor_args"] ag_predictor_args["path"] = args.model_dir ag_fit_args = config["ag_fit_args"] predictor = TabularPredictor(**ag_predictor_args).fit(train_data, **ag_fit_args) # --------------------------------------------------------------- Inference if args.test_dir: test_file = get_input_path(args.test_dir) test_data = TabularDataset(test_file) # Predictions y_pred_proba = predictor.predict_proba(test_data) if config.get("output_prediction_format", "csv") == "parquet": y_pred_proba.to_parquet(f"{args.output_data_dir}/predictions.parquet") else: y_pred_proba.to_csv(f"{args.output_data_dir}/predictions.csv") # Leaderboard if config.get("leaderboard", False): lb = predictor.leaderboard(test_data, silent=False) lb.to_csv(f"{args.output_data_dir}/leaderboard.csv") # Feature importance if config.get("feature_importance", False): feature_importance = predictor.feature_importance(test_data) feature_importance.to_csv(f"{args.output_data_dir}/feature_importance.csv") else: if config.get("leaderboard", False): lb = predictor.leaderboard(silent=False) lb.to_csv(f"{args.output_data_dir}/leaderboard.csv") YAML config: .. code:: yaml # AutoGluon Predictor constructor arguments # - see https://github.com/awslabs/autogluon/blob/ef3a5312dc2eaa0c6afde042d671860ac42cbafb/tabular/src/autogluon/tabular/predictor/predictor.py#L51-L159 ag_predictor_args: eval_metric: roc_auc label: class # AutoGluon Predictor.fit arguments # - see https://github.com/awslabs/autogluon/blob/ef3a5312dc2eaa0c6afde042d671860ac42cbafb/tabular/src/autogluon/tabular/predictor/predictor.py#L280-L651 ag_fit_args: presets: "medium_quality_faster_train" num_bag_folds: 2 num_bag_sets: 1 num_stack_levels: 0 output_prediction_format: csv # predictions output format: csv or parquet feature_importance: true # calculate and save feature importance if true leaderboard: true # save leaderboard output if true Training -------- To train AutoGluon model, set up a SageMaker session: .. code:: python import sagemaker # Helper wrappers referred earlier from ag_model import ( AutoGluonTraining, AutoGluonInferenceModel, AutoGluonTabularPredictor, ) from sagemaker import utils role = sagemaker.get_execution_role() sagemaker_session = sagemaker.session.Session() region = sagemaker_session._region_name bucket = sagemaker_session.default_bucket() s3_prefix = f"autogluon_sm/{utils.sagemaker_timestamp()}" output_path = f"s3://{bucket}/{s3_prefix}/output/" Create a training task: .. code:: python ag = AutoGluonTraining( role=role, entry_point="scripts/tabular_train.py", region=region, instance_count=1, instance_type="ml.m5.2xlarge", framework_version="0.4", py_version="py38", base_job_name="autogluon-tabular-train", ) Upload the required inputs, via SageMaker session (in this case it is a training set, test set and training YAML config) and start the training job: .. code:: python s3_prefix = f"autogluon_sm/{utils.sagemaker_timestamp()}" train_input = ag.sagemaker_session.upload_data( path=os.path.join("data", "train.csv"), key_prefix=s3_prefix ) eval_input = ag.sagemaker_session.upload_data( path=os.path.join("data", "test.csv"), key_prefix=s3_prefix ) config_input = ag.sagemaker_session.upload_data( path=os.path.join("config", "config-med.yaml"), key_prefix=s3_prefix ) job_name = utils.unique_name_from_base("test-autogluon-image") ag.fit( {"config": config_input, "train": train_input, "test": eval_input}, job_name=job_name, ) Once the models are trained, they will be available in S3 location specified in ``ag.model_data`` field. The model is fully portable and can be downloaded locally if needed. Conclusion ---------- In this tutorial we explored how to train AutoGluon models using SageMaker. Learn how to deploy the trained models using AWS SageMaker - :ref:`cloud_aws_sagemaker_deploy` or AWS Lambda - :ref:`cloud_aws_lambda_deploy`.