Serverless Pipelines For Automated Model Training and Deployment Using Amazon SageMaker, Step Functions and AWS Lambda
--
What is MLOps?
MLOps is the discipline of moving a model from research to production: it establishes a set of processes and capabilities to operationalize and maintain models in production. Building a machine learning model begins with an iterative experimentation and research phase in which data scientists try different kinds of models on the data, perform feature engineering, fine-tune model hyperparameters, and prototype candidate models that can then be trained at scale on larger datasets and deployed to production. MLOps comprises a set of practices that operationalize models by moving them rapidly from the experimentation phase to production, automating every step to reduce errors and cost, monitoring models, and providing a feedback mechanism to capture new data points for retraining.
The following are the high-level components of a mature MLOps capability within an enterprise:
- Establish relevant pipelines to orchestrate and automate the training and deployment of the models
- Set up a monitoring system to identify any kind of degradation in the models
- Model registry to catalog and version models
- Determine how models are to be consumed by downstream systems and make appropriate architectural choices to deploy the model (real time inference, batch inference, streaming predictions, on-device predictions)
- Dynamic scaling of models to serve increased inference traffic
- Capability to explain the output of a model inference
- Ability to trace an inference back to the version of model and data used to produce the inference
- Capability to roll back models quickly, perform canary deployments and A/B testing on different versions of models
- CI/CD pipelines to move models across different environments (DEV, QA, PROD)
Model Training and Deployment Pipeline Technology Stack
In this article, we provide an overview of a simple pipeline that automates the training and deployment of a model on the AWS cloud. The pipeline uses serverless technologies on the AWS platform to avoid having to provision EC2 instances, maintain and patch them, handle scaling, and so on. It uses Step Functions to orchestrate the steps in the workflow, SageMaker to train the models, and SageMaker hosting to deploy the models as endpoints.
The following are the primary AWS services used in this pipeline:
- Step Functions — Workflow orchestration
- SageMaker — Model training and hosting
- AWS Serverless Application Model (SAM) — Framework to build serverless applications and provision resources
- API Gateway — Fronts the SageMaker hosted endpoints and provides monitoring and security
- CodeBuild — Builds training and inference Docker images
- Elastic Container Registry — Stores Docker images
- Lambda — Serverless compute
- Simple Storage Service — Stores training, test, and validation data
- Simple Notification Service — Notifies the operations team of successes, failures, and errors during pipeline execution
- EventBridge — Serverless event bus
- Systems Manager (Parameter Store) — Store parameters of the pipeline and model hyperparameters
High Level Architecture
The following diagram shows a high-level architecture of the pipeline to train and deploy a machine learning model:
Detailed Steps
Trigger Pipeline Execution: The model training and deployment Step Functions state machine is triggered by an EventBridge rule at a set cadence (daily/weekly/monthly). The rule can instead be configured to trigger on events such as the availability of new training data or changes to the model algorithm. The ‘initiate-pipeline’ Lambda function sends a notification to an SNS topic alerting the operations team that the training process has started. This Lambda can also perform any initial setup, such as creating temporary workspaces where intermediate data for this execution is stored, recording the execution start time in monitoring systems, and taking snapshots of data and code versions.
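As an illustration, here is a minimal sketch of what such a handler could look like; the topic ARN environment variable and the event field names are assumptions for this example, not the actual pipeline code.

```python
import json
import os
import boto3

sns = boto3.client("sns")

def handler(event, context):
    # 'executionId' and PIPELINE_TOPIC_ARN are hypothetical names for illustration.
    execution_id = event.get("executionId", "unknown")
    # Notify the operations team that a new pipeline execution has started.
    sns.publish(
        TopicArn=os.environ["PIPELINE_TOPIC_ARN"],
        Subject="Model training pipeline started",
        Message=json.dumps({"executionId": execution_id, "status": "STARTED"}),
    )
    # Any other one-time setup (temporary workspaces, data/code snapshots,
    # recording the start time in monitoring systems) would go here.
    return {"executionId": execution_id, "status": "STARTED"}
```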
Build Training Image: We use the native Step Functions integration with the CodeBuild service to start a CodeBuild job that checks out the model training code from a GitHub repository, pulls the appropriate AWS-provided base training image (TensorFlow/PyTorch/MXNet), with or without GPU support, builds a Docker image of the training code, and pushes it to Elastic Container Registry (ECR).
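For reference, the Task state for this step could look roughly like the sketch below (expressed as a Python dict for readability; the project name, environment variable, and state names are assumptions). The '.sync' suffix tells Step Functions to wait for the build to complete before moving on.

```python
# Step Functions Task state that starts a CodeBuild job and waits for it.
build_training_image_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::codebuild:startBuild.sync",
    "Parameters": {
        "ProjectName": "model-training-image-build",  # hypothetical CodeBuild project
        "EnvironmentVariablesOverride": [
            # Tag the image with the execution name so training runs are traceable.
            {"Name": "IMAGE_TAG", "Value.$": "$$.Execution.Name", "Type": "PLAINTEXT"}
        ],
    },
    "Next": "StartModelTraining",  # hypothetical next state
}
```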
Start Model Training Job: We start a SageMaker training job using the native SageMaker integration with Step Functions. Using native integrations reduces the amount of code required to start a job, poll for completion, and handle errors. The SageMaker training job pulls the latest training image (or a specific version) from ECR and runs model training using the training data in S3. Training-specific parameters can be exposed to the pipeline through SSM Parameter Store.
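The equivalent boto3 calls look roughly like the following sketch (the bucket names, job name, image URI, role ARN, and parameter path are all placeholders); inside the state machine itself this maps to a Task state with the arn:aws:states:::sagemaker:createTrainingJob.sync resource.

```python
import boto3

ssm = boto3.client("ssm")
sagemaker = boto3.client("sagemaker")

# Pull a training setting that was externalized to SSM Parameter Store.
instance_type = ssm.get_parameter(
    Name="/ml-pipeline/training/instance-type"  # hypothetical parameter
)["Parameter"]["Value"]

sagemaker.create_training_job(
    TrainingJobName="demo-training-job-001",  # hypothetical job name
    AlgorithmSpecification={
        "TrainingImage": "<account>.dkr.ecr.<region>.amazonaws.com/training:latest",
        "TrainingInputMode": "File",
    },
    RoleArn="arn:aws:iam::<account>:role/SageMakerExecutionRole",
    InputDataConfig=[{
        "ChannelName": "train",
        "DataSource": {"S3DataSource": {
            "S3DataType": "S3Prefix",
            "S3Uri": "s3://my-bucket/train/",
            "S3DataDistributionType": "FullyReplicated",
        }},
    }],
    OutputDataConfig={"S3OutputPath": "s3://my-bucket/model-artifacts/"},
    ResourceConfig={"InstanceType": instance_type, "InstanceCount": 1, "VolumeSizeInGB": 50},
    StoppingCondition={"MaxRuntimeInSeconds": 3600},
)
```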
Deploy Model To SageMaker Hosting: SageMaker stores the model artifacts in S3 once the training job completes successfully. We can deploy the model as a real-time endpoint or as a batch transform job, depending on the requirements of the use case. If the consuming application needs predictions on dynamic real-time data with low latency, the model should be deployed as an API.
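For the batch case, a transform job can be created from the trained model instead of standing up a persistent endpoint; a minimal sketch with placeholder names:

```python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_transform_job(
    TransformJobName="demo-batch-transform-001",  # hypothetical job name
    ModelName="demo-model",  # a SageMaker Model created from the training artifacts
    TransformInput={"DataSource": {"S3DataSource": {
        "S3DataType": "S3Prefix",
        "S3Uri": "s3://my-bucket/batch-input/",
    }}},
    TransformOutput={"S3OutputPath": "s3://my-bucket/batch-output/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)
```

The remaining steps below cover the real-time endpoint path.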
Real Time Endpoint — Build Serving Image: Include a step in the state machine that triggers a CodeBuild job to pull the model inference application from the code repository (GitHub), build a Docker image of the inference code, and push it to Elastic Container Registry (ECR).
Real Time Endpoint — Create SageMaker Model: Include a step in the state machine that invokes SageMaker to combine the Docker image from the previous stage with the trained model artifacts into a ‘SageMaker Model’ object.
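A minimal sketch of creating the SageMaker Model object from the serving image and the trained artifacts (the model name, image URI, artifact path, and role ARN are placeholders):

```python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.create_model(
    ModelName="demo-model",  # hypothetical model name
    PrimaryContainer={
        # Serving image built in the previous step.
        "Image": "<account>.dkr.ecr.<region>.amazonaws.com/inference:latest",
        # Artifacts written by the training job.
        "ModelDataUrl": "s3://my-bucket/model-artifacts/demo-training-job-001/output/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::<account>:role/SageMakerExecutionRole",
)
```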
Real Time Endpoint — Create Endpoint: The pipeline creates a SageMaker endpoint configuration (number of instances, instance type, traffic routing, etc.). The endpoint configuration, along with the SageMaker Model object, is used to deploy the endpoint on SageMaker hosting instances.
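A sketch of the endpoint configuration and endpoint creation, continuing with the assumed names from the previous step:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Endpoint configuration: instance count/type and per-variant traffic weight.
sagemaker.create_endpoint_config(
    EndpointConfigName="demo-endpoint-config-v1",  # hypothetical
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "demo-model",
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 2,
        "InitialVariantWeight": 1.0,
    }],
)

# Deploy the endpoint from the configuration.
sagemaker.create_endpoint(
    EndpointName="demo-endpoint",  # hypothetical
    EndpointConfigName="demo-endpoint-config-v1",
)
```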
Real Time Endpoint — Inference: The SageMaker endpoint is fronted by an API Gateway backed by a Lambda inference processor. Consumers make inference requests to the API Gateway; the Lambda pre-processes each request and hands it off to SageMaker for inference, then post-processes the result and returns it to the consumer via API Gateway.
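A minimal sketch of such an inference Lambda, assuming an API Gateway proxy integration and a JSON payload (the endpoint name environment variable and payload shape are assumptions):

```python
import json
import os
import boto3

runtime = boto3.client("sagemaker-runtime")

def handler(event, context):
    # Pre-process the API Gateway request body into the model's input format.
    payload = json.loads(event["body"])

    # Hand off to the SageMaker endpoint for inference.
    response = runtime.invoke_endpoint(
        EndpointName=os.environ["ENDPOINT_NAME"],  # hypothetical env var
        ContentType="application/json",
        Body=json.dumps(payload),
    )

    # Post-process the prediction and return it to the consumer.
    prediction = json.loads(response["Body"].read())
    return {"statusCode": 200, "body": json.dumps(prediction)}
```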
Error Notification: Any failure during pipeline execution triggers error notifications (emails) to the operations support team for triage and resolution. Each step in the workflow has a potential error state that leads to a fail state and the subsequent notification. We must ensure that a failure in any step or branch of the state machine fails the pipeline in its entirety.
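One way to express this convention is to attach a catch-all Catch to every Task state that routes to an SNS publish followed by a Fail state; a sketch as Python dicts (the topic ARN and state names are illustrative):

```python
# Attach this Catch to every Task state so any error fails the execution.
catch_all = [{
    "ErrorEquals": ["States.ALL"],
    "ResultPath": "$.error",
    "Next": "NotifyFailure",
}]

# Publish the error details to the operations SNS topic.
notify_failure_state = {
    "Type": "Task",
    "Resource": "arn:aws:states:::sns:publish",
    "Parameters": {
        "TopicArn": "arn:aws:sns:<region>:<account>:ops-alerts",  # hypothetical topic
        "Message.$": "States.JsonToString($.error)",
    },
    "Next": "PipelineFailed",
}

# Terminal Fail state marks the whole execution as failed.
pipeline_failed_state = {"Type": "Fail", "Error": "PipelineExecutionFailed"}
```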
Pipeline Deployment Environments: The model training and deployment pipeline is deployed to the various environments (DEV/STAGE/PROD) via CI/CD pipelines. The CI/CD pipeline includes governance to ensure the models have been tested appropriately and their metrics are acceptable before promotion to higher environments. For sensitive or external-facing models, we recommend a manual approval stage in the CI/CD pipeline. Advanced deployment strategies such as canary deployments can also be used to expose the model to a limited set of consumers before it goes live on all traffic, as sketched below.
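As a hedged sketch of the canary idea, SageMaker's blue/green deployment guardrails can shift a fraction of traffic to a new endpoint configuration and roll back automatically on alarm (the endpoint, configuration, and alarm names are assumptions):

```python
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.update_endpoint(
    EndpointName="demo-endpoint",
    EndpointConfigName="demo-endpoint-config-v2",  # hypothetical new config
    DeploymentConfig={
        "BlueGreenUpdatePolicy": {
            "TrafficRoutingConfiguration": {
                "Type": "CANARY",
                # Send 10% of capacity to the new fleet first.
                "CanarySize": {"Type": "CAPACITY_PERCENT", "Value": 10},
                # Bake time before shifting the remaining traffic.
                "WaitIntervalInSeconds": 600,
            },
            "TerminationWaitInSeconds": 300,
        },
        # Roll back automatically if this CloudWatch alarm fires.
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "demo-endpoint-5xx-alarm"}]  # hypothetical alarm
        },
    },
)
```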
We have briefly described the automation aspect of a mature training and deployment pipeline on AWS. In subsequent articles, we will provide more detail on other aspects of MLOps, such as model monitoring, model registries, and autoscaling, from an AWS platform standpoint. Stay tuned for more in this series!