.. _cloud_aws_sagemaker_fit: Cloud Training with AWS SageMaker ================================= To help with AutoGluon models training, AWS developed a set of training and inference `deep learning containers `__. The containers can be used to train models with CPU and GPU instances and deployed as a SageMaker endpoint or used as a batch transform job. The full end-to-end example is available in `amazon-sagemaker-examples `__ repository. Pre-requisites -------------- Before starting ensure that the latest version of sagemaker python API is installed via (``pip install --upgrade sagemaker``). This is required to ensure the information about newly released containers is available. Training Scripts ---------------- To start using the containers, a user training script and the `wrapper classes `__ are required. When authoring a training/inference `scripts `__, please refer to SageMaker `documentation `__. Here is one of the possible ``TabularPredictor`` training scripts, which takes AutoGluon parameters as a YAML config and outputs predictions, models leaderboard and feature importance: .. code:: python import argparse import os from pprint import pprint import yaml from autogluon.tabular import TabularDataset, TabularPredictor def get_input_path(path): file = os.listdir(path)[0] if len(os.listdir(path)) > 1: print(f"WARN: more than one file is found in {channel} directory") print(f"Using {file}") filename = f"{path}/{file}" return filename def get_env_if_present(name): result = None if name in os.environ: result = os.environ[name] return result if __name__ == "__main__": # Disable Autotune os.environ["MXNET_CUDNN_AUTOTUNE_DEFAULT"] = "0" # ------------------------------------------------------------ Arguments parsing print("Starting AG") parser = argparse.ArgumentParser() # Data, model, and output directories parser.add_argument( "--output-data-dir", type=str, default=get_env_if_present("SM_OUTPUT_DATA_DIR") ) parser.add_argument("--model-dir", type=str, default=get_env_if_present("SM_MODEL_DIR")) parser.add_argument("--n_gpus", type=str, default=get_env_if_present("SM_NUM_GPUS")) parser.add_argument("--training_dir", type=str, default=get_env_if_present("SM_CHANNEL_TRAIN")) parser.add_argument( "--test_dir", type=str, required=False, default=get_env_if_present("SM_CHANNEL_TEST") ) parser.add_argument("--ag_config", type=str, default=get_env_if_present("SM_CHANNEL_CONFIG")) args, _ = parser.parse_known_args() print(f"Args: {args}") # See SageMaker-specific environment variables: https://sagemaker.readthedocs.io/en/stable/overview.html#prepare-a-training-script os.makedirs(args.output_data_dir, mode=0o777, exist_ok=True) config_file = get_input_path(args.ag_config) with open(config_file) as f: config = yaml.safe_load(f) # AutoGluon-specific config if args.n_gpus: config["num_gpus"] = int(args.n_gpus) print("Running training job with the config:") pprint(config) # ---------------------------------------------------------------- Training train_file = get_input_path(args.training_dir) train_data = TabularDataset(train_file) ag_predictor_args = config["ag_predictor_args"] ag_predictor_args["path"] = args.model_dir ag_fit_args = config["ag_fit_args"] predictor = TabularPredictor(**ag_predictor_args).fit(train_data, **ag_fit_args) # --------------------------------------------------------------- Inference if args.test_dir: test_file = get_input_path(args.test_dir) test_data = TabularDataset(test_file) # Predictions y_pred_proba = predictor.predict_proba(test_data) if config.get("output_prediction_format", "csv") == "parquet": y_pred_proba.to_parquet(f"{args.output_data_dir}/predictions.parquet") else: y_pred_proba.to_csv(f"{args.output_data_dir}/predictions.csv") # Leaderboard if config.get("leaderboard", False): lb = predictor.leaderboard(test_data, silent=False) lb.to_csv(f"{args.output_data_dir}/leaderboard.csv") # Feature importance if config.get("feature_importance", False): feature_importance = predictor.feature_importance(test_data) feature_importance.to_csv(f"{args.output_data_dir}/feature_importance.csv") else: if config.get("leaderboard", False): lb = predictor.leaderboard(silent=False) lb.to_csv(f"{args.output_data_dir}/leaderboard.csv") For training other types of AutoGluon Predictors, i.e. MultiModalPredictor, the training script you provided will be quite similar to the one above. Mostly, you just need to replace ``TabularPredictor`` to be ``MultiModalPredictor`` for example. Keep in mind that the specific Predictor type you want to train might not support the same feature sets as ``TabularPredictor``. For example, ``leaderboard`` does not exist for all Predictors. Notes for Training ~~~~~~~~~~~~~~~~~~ 1. If your use case involves image modality, you will need to pass the images as a compressed file to the training container (similarly to how you would pass in train data), decompress the file in the training container, and update the training data columns with the updated image path in the container. 2. If you wish to deploy or do batch inference on the trained TextPredictor/MultiModalPredictor on sagemaker later, you will need to save the model with ``standalone`` flag, which avoids internet access to load the model later. For example, ``predictor.save(path='MY_PATH', standalone=True)``. SageMaker container is known to have issue connecting to HuggingFace. That’s why we need to save the artifacts in offline mode. Tabular example YAML config: .. code:: yaml # AutoGluon Predictor constructor arguments # - see https://github.com/autogluon/autogluon/blob/v0.5.2/tabular/src/autogluon/tabular/predictor/predictor.py#L56-L181 ag_predictor_args: eval_metric: roc_auc label: class # AutoGluon Predictor.fit arguments # - see https://github.com/autogluon/autogluon/blob/v0.5.2/tabular/src/autogluon/tabular/predictor/predictor.py#L286-L711 ag_fit_args: presets: "medium_quality_faster_train" num_bag_folds: 2 num_bag_sets: 1 num_stack_levels: 0 output_prediction_format: csv # predictions output format: csv or parquet feature_importance: true # calculate and save feature importance if true leaderboard: true # save leaderboard output if true Another example, MultiModal example YAML config: .. code:: yaml # AutoGluon Predictor constructor arguments # - see https://github.com/autogluon/autogluon/blob/v0.5.2/multimodal/src/autogluon/multimodal/predictor.py#L123-L180 ag_predictor_args: eval_metric: acc label: label # AutoGluon Predictor.fit arguments # - see https://github.com/autogluon/autogluon/blob/v0.5.2/multimodal/src/autogluon/multimodal/predictor.py#L246-L363 ag_fit_args: presets: "high_quality" time_limit: 120 output_prediction_format: csv # predictions output format: csv or parquet Other predictors would follow similar format as the previous two examples. Training -------- To train AutoGluon model, set up a SageMaker session: .. code:: python import sagemaker # Helper wrappers referred earlier from ag_model import ( AutoGluonTraining, AutoGluonInferenceModel, AutoGluonTabularPredictor, ) from sagemaker import utils role = sagemaker.get_execution_role() sagemaker_session = sagemaker.session.Session() region = sagemaker_session._region_name bucket = sagemaker_session.default_bucket() s3_prefix = f"autogluon_sm/{utils.sagemaker_timestamp()}" output_path = f"s3://{bucket}/{s3_prefix}/output/" Create a training task: .. code:: python ag = AutoGluonTraining( role=role, entry_point="YOUR_TRAINING_SCRIPT_PATH", region=region, instance_count=1, instance_type="ml.m5.2xlarge", # You might want to use GPU instances for Text/Image/MultiModal Predictors etc framework_version="0.5.2", # Replace this with the AutoGLuon DLC container version you want to use py_version="py38", base_job_name="YOUR_JOB_NAME", ) Upload the required inputs, via SageMaker session (in this case it is a training set, test set and training YAML config) and start the training job: .. code:: python s3_prefix = f"autogluon_sm/{utils.sagemaker_timestamp()}" train_input = ag.sagemaker_session.upload_data( path=os.path.join("data", "train.csv"), key_prefix=s3_prefix ) eval_input = ag.sagemaker_session.upload_data( path=os.path.join("data", "test.csv"), key_prefix=s3_prefix ) config_input = ag.sagemaker_session.upload_data( path=os.path.join("config", "config-med.yaml"), key_prefix=s3_prefix ) job_name = utils.unique_name_from_base("test-autogluon-image") ag.fit( {"config": config_input, "train": train_input, "test": eval_input}, job_name=job_name, ) Once the models are trained, they will be available in S3 location specified in ``ag.model_data`` field. The model is fully portable and can be downloaded locally if needed. Conclusion ---------- In this tutorial we explored how to train AutoGluon models using SageMaker. Learn how to deploy the trained models using AWS SageMaker - :ref:`cloud_aws_sagemaker_deploy` or AWS Lambda - :ref:`cloud_aws_lambda_deploy`.