Enterprise ML Deployment Patterns: Part 1, ML Gateway

Ali Arsanjani
Dec 5, 2021


Ali Arsanjani, with Bobby Lindsey and Tony Chen

All code for this pattern can be found on GitHub.


Data science projects very often start in an experimental phase: feature transformations are tried out, algorithms are selected and tested to see whether they fit the data distribution well enough for reliable predictions, hyperparameters are tuned, and so on.

As an organization matures in its machine learning (ML) journey, it transitions to an automated ML or MLOps phase in which the pipelines for data preparation, training, deployment, and monitoring all need to be automated.

To raise projects to an enterprise scale that can fulfill business needs and sustain business-level continuity, scale, security, and performance, integrating data science experiments with ML deployment patterns and best practices grows in importance, and will save you time and money.

In this blog series on ML patterns, we start by focusing on deployment patterns and best practices within the ML lifecycle: the considerations and options that present themselves after training, in the serving, inference, and prediction phases.

There are many ways to expose a model that was deployed as a hosted SageMaker endpoint. These variations are summarized in the ML Gateway pattern, with its mandatory and optional components. Throughout this series we outline the options, their context, and their pros and cons, to help you decide which components to use for your specific workload and use case.

ML Gateway — Direct Integration: API Gateway to SageMaker

Figure 1: Direct ML Gateway Pattern: API Gateway to SageMaker

Problem: How do we expose the ML model we just trained as an API endpoint in a scalable manner?

Solution: API Gateway can be used to front an Amazon SageMaker inference endpoint as (part of) a REST API, by making use of an API Gateway feature called mapping templates. This feature makes it possible for the REST API to be integrated directly with an Amazon SageMaker runtime endpoint, thereby avoiding the use of any intermediate compute resource (such as AWS Lambda or Amazon ECS containers) to invoke the endpoint. The result is a solution that is simpler, faster, and cheaper to run.
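To make the direct integration more concrete, here is a minimal sketch that only assembles the arguments for an API Gateway "AWS"-type integration pointing at a SageMaker runtime endpoint. The function names, the payload shape in the mapping template, and all identifiers passed in are assumptions for illustration; the integration URI follows the form commonly used for this integration, so verify it against your own account before relying on it.

```python
# Assumption: request bodies look like {"data": [...]} and the model expects
# {"instances": [...]}. The mapping template (VTL) does the reshaping inside
# API Gateway, so no Lambda or container is needed in between.
REQUEST_TEMPLATE = '{"instances": $input.json(\'$.data\')}'


def sagemaker_integration_uri(region, endpoint_name):
    """Integration URI that targets the SageMaker runtime InvokeEndpoint path."""
    return (
        f"arn:aws:apigateway:{region}:runtime.sagemaker:"
        f"path//endpoints/{endpoint_name}/invocations"
    )


def build_put_integration_args(rest_api_id, resource_id, region,
                               endpoint_name, role_arn):
    """Keyword arguments for the API Gateway put_integration call."""
    return {
        "restApiId": rest_api_id,
        "resourceId": resource_id,
        "httpMethod": "POST",
        "type": "AWS",                    # direct AWS service integration
        "integrationHttpMethod": "POST",
        "uri": sagemaker_integration_uri(region, endpoint_name),
        "credentials": role_arn,          # role allowed to call sagemaker:InvokeEndpoint
        "requestTemplates": {"application/json": REQUEST_TEMPLATE},
        "passthroughBehavior": "NEVER",   # reject content types without a template
    }
```

With boto3 installed and credentials configured, these arguments could be passed as `boto3.client("apigateway").put_integration(**args)`.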


1. A/B Testing and Canary Testing: Deploy as Production Variants

  • Performing A/B testing between a new model and an old model with production traffic can be an effective final step in the validation process for a new model. In A/B testing, you test different variants of your models and compare how each variant performs. If the newer version of the model performs better than the existing version, replace the old version with the new version in production. SageMaker allows you to test multiple models or model versions behind the same endpoint using Production Variants. Each Production Variant maps to a single model that is deployed in its own container. You can distribute endpoint invocation requests across multiple Production Variants by providing the traffic distribution for each variant, or you can invoke a specific variant directly for each request.
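As a rough illustration of the traffic distribution described above, the sketch below builds a `ProductionVariants` list in the shape accepted by SageMaker's `create_endpoint_config` and computes the share of traffic each variant would receive (its weight divided by the sum of all weights). The model names, instance type, and helper names are hypothetical.

```python
from typing import Dict, List


def production_variants(model_weights: Dict[str, float],
                        instance_type: str = "ml.m5.large") -> List[dict]:
    """One Production Variant per model, each with its own instances."""
    return [
        {
            "VariantName": f"{model_name}-variant",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": instance_type,
            "InitialVariantWeight": weight,
        }
        for model_name, weight in model_weights.items()
    ]


def traffic_share(variants: List[dict]) -> Dict[str, float]:
    """Fraction of requests routed to each variant: weight / sum of weights."""
    total = sum(v["InitialVariantWeight"] for v in variants)
    return {v["VariantName"]: v["InitialVariantWeight"] / total for v in variants}


def invoke_variant_args(endpoint_name: str, variant_name: str, csv_row: str) -> dict:
    """Arguments for invoke_endpoint that pin a request to one variant."""
    return {
        "EndpointName": endpoint_name,
        "TargetVariant": variant_name,
        "ContentType": "text/csv",
        "Body": csv_row,
    }


# Canary-style split: the existing model keeps most of the traffic.
variants = production_variants({"model-current": 9.0, "model-candidate": 1.0})
shares = traffic_share(variants)
```

Shifting more traffic to the candidate is then a matter of changing the weights; pinning a request with `TargetVariant` is useful when you want to compare both variants on the same input.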

2. Deploy as Multi-model Endpoints: many models on one hosted endpoint

  • With SageMaker Multi-Model Endpoints, you can host multiple models behind a single endpoint. Unlike Production Variants, each model does not need its own container and resources. Instead, all models share the same container and resources, which usually decreases the cost of hosting. Consider this deployment pattern if you have many models that need hosting, for example a use case where each user has their own personalized model.
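For the per-user model case, a request to a multi-model endpoint names the model artifact it wants through the `TargetModel` parameter of `invoke_endpoint`. The sketch below only builds those arguments; the naming scheme that maps a user ID to an artifact (`<user_id>.tar.gz`) is an assumption for illustration.

```python
def build_multi_model_invoke_args(endpoint_name, user_id, csv_row):
    """Arguments for invoke_endpoint against a multi-model endpoint."""
    return {
        "EndpointName": endpoint_name,
        # Artifact path relative to the endpoint's S3 model prefix (hypothetical scheme).
        "TargetModel": f"{user_id}.tar.gz",
        "ContentType": "text/csv",
        "Body": csv_row.encode("utf-8"),
    }


args_for_user = build_multi_model_invoke_args("personalized-models", "user-123", "0.4,1.7,3")
```

With boto3, this would be sent as `boto3.client("sagemaker-runtime").invoke_endpoint(**args_for_user)`.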

Trade-offs:

[+] You can use Amazon API Gateway to extend the API with

  • traffic management,
  • authorization and access control,
  • monitoring, and
  • API version management.

This ML Gateway pattern (direct integration) is especially useful for applications that see high request volumes (requests per second, RPS) at peak demand times. To handle the increase in load from quiet periods to peak times, Amazon SageMaker supports automatic scaling for inference endpoints, as shown below. Here is code you can use for more detail about autoscaling Amazon SageMaker endpoints.

Figure 2: Autoscaling with SageMaker Endpoints to handle peak traffic
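As one possible way to set this up, the sketch below assembles the two Application Auto Scaling requests for a target-tracking policy on an endpoint variant. The capacity limits, cooldowns, target value, and names are placeholder assumptions, not recommendations; the right target comes from load testing your own endpoint.

```python
def scaling_requests(endpoint_name, variant_name, min_instances=1,
                     max_instances=4, invocations_per_instance=100.0):
    """Return (register_scalable_target args, put_scaling_policy args)."""
    common = {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    }
    target = dict(common, MinCapacity=min_instances, MaxCapacity=max_instances)
    policy = dict(
        common,
        PolicyName=f"{endpoint_name}-{variant_name}-invocations",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            # Keep average invocations per instance per minute near this value.
            "TargetValue": invocations_per_instance,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleOutCooldown": 60,    # seconds; placeholder
            "ScaleInCooldown": 300,    # seconds; placeholder
        },
    )
    return target, policy


target_args, policy_args = scaling_requests("my-endpoint", "AllTraffic")
```

With boto3, the two dictionaries map onto `register_scalable_target(**target_args)` and `put_scaling_policy(**policy_args)` on the `application-autoscaling` client.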

Also, API Gateway will automatically scale to match demand, ensuring that there is always sufficient capacity to process incoming requests, and that you only pay for what you use.

This pattern shows how a direct integration between API Gateway and Amazon SageMaker works.

In subsequent sections we will expand the pattern to show additional optional components that fit within the ML gateway pattern and can be used to balance the forces and trade-offs in the problem space.

ML Gateway — Indirect Integration: API Gateway to Lambda to SageMaker

You can add architectural components to the base ML Gateway pattern, for the use-case and architectural reasons we outline below. As the first addition, we place AWS Lambda as a layer between API Gateway and the SageMaker endpoint. This variation can reduce the work done by the endpoint's compute layer by validating and reshaping payloads in the Lambda function before they reach the model. In some workloads this may ease the need for autoscaling at the endpoint.

The Lambda Function(s). You can add a Lambda function as an intermediary, replacing the direct connection to the highly available and scalable SageMaker endpoint.

Figure 3: Lambda Indirect Connection
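A minimal sketch of such an intermediary is shown below. It assumes an API Gateway proxy-style event whose JSON body carries a `features` list and a model that accepts one CSV row; both are illustrative assumptions. The runtime client is injected so the handler can be exercised locally with the stub included here; in Lambda you would pass `boto3.client("sagemaker-runtime")` instead.

```python
import io
import json


def make_handler(runtime_client, endpoint_name):
    """Build a Lambda-style handler that forwards requests to a SageMaker endpoint."""

    def handler(event, context):
        try:
            features = json.loads(event["body"])["features"]
        except (KeyError, TypeError, ValueError):
            # Reject malformed requests before they reach the endpoint.
            return {"statusCode": 400, "body": json.dumps({"error": "expected a 'features' list"})}

        # Reshape the JSON payload into the CSV row the model expects.
        csv_row = ",".join(str(value) for value in features)
        response = runtime_client.invoke_endpoint(
            EndpointName=endpoint_name,
            ContentType="text/csv",
            Body=csv_row,
        )
        prediction = response["Body"].read().decode("utf-8").strip()
        return {"statusCode": 200, "body": json.dumps({"prediction": prediction})}

    return handler


class StubRuntime:
    """Local stand-in for the SageMaker runtime client; always answers 0.5."""

    def invoke_endpoint(self, **kwargs):
        self.last_call = kwargs
        return {"Body": io.BytesIO(b"0.5\n")}


stub = StubRuntime()
handler = make_handler(stub, "my-endpoint")  # endpoint name is hypothetical
result = handler({"body": json.dumps({"features": [1, 2.5, "a"]})}, None)
```

Injecting the client keeps the reshaping logic testable without AWS access, at the cost of one extra line of wiring in the deployed function.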

ML Gateway Enterprise: Adding Feature Store for Inference and Monitoring for Data and Model Drift

Figure 4: Adding the Feature Store

The Online Feature Store. For enterprise-wide consistency, you can use an online feature store as a high-throughput, low-latency store of the features required for serving predictions. These range from features prepared in batch over a historical period (e.g. last week’s aggregated claim amounts) to real-time incoming features (e.g. incoming claims from all geos).

Trade-offs:

1. [+] Single point of reference (a.k.a. gold standard) for the initial features sent to the model for serving

2. [-] One more hop on the request path
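To show what the extra hop involves, the sketch below turns a response in the shape returned by the Feature Store runtime `get_record` call (a `Record` list of `FeatureName`/`ValueAsString` pairs) into the ordered CSV row a model would consume. The feature names and values in the sample are made up for illustration.

```python
def record_to_csv(get_record_response, feature_order):
    """Order the features of one online-store record into a CSV row."""
    values = {
        feature["FeatureName"]: feature["ValueAsString"]
        for feature in get_record_response.get("Record", [])
    }
    missing = [name for name in feature_order if name not in values]
    if missing:
        # Fail loudly rather than send the model a silently shifted vector.
        raise KeyError(f"features missing from record: {missing}")
    return ",".join(values[name] for name in feature_order)


# Sample response with hypothetical feature names.
sample_response = {
    "Record": [
        {"FeatureName": "claim_id", "ValueAsString": "c-42"},
        {"FeatureName": "weekly_claim_total", "ValueAsString": "1830.5"},
        {"FeatureName": "num_claims", "ValueAsString": "3"},
    ]
}
csv_row = record_to_csv(sample_response, ["weekly_claim_total", "num_claims"])
```

In a live setup the response would come from `boto3.client("sagemaker-featurestore-runtime").get_record(FeatureGroupName=..., RecordIdentifierValueAsString=...)`.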

Pros and cons:

The Model Monitor. Monitor your hosted SageMaker Endpoint with a Model Monitor so you can detect concept or data drift and trigger pipelines for data prep or for model retraining.

Figure 5: Monitoring the Hosted Endpoint

Event [CloudWatch] Triggers. These can be used when Model Monitor detects concept or data drift, to signal that the existing endpoint needs to be updated with a newly trained model.

Figure 6: Add CloudWatch / EventBridge Triggers
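One way to wire such a trigger is a small function that reacts to a CloudWatch alarm state-change event delivered through EventBridge and starts a retraining pipeline. The sketch below assumes the drift alarm already exists and that retraining is packaged as a SageMaker Pipeline; the pipeline name and the stub client are hypothetical.

```python
def make_drift_handler(sagemaker_client, pipeline_name):
    """Build a handler that starts retraining when a drift alarm fires."""

    def handler(event, context):
        # "CloudWatch Alarm State Change" events carry the new state here.
        state = event.get("detail", {}).get("state", {}).get("value")
        if state != "ALARM":
            return {"started": False}
        response = sagemaker_client.start_pipeline_execution(PipelineName=pipeline_name)
        return {"started": True, "executionArn": response["PipelineExecutionArn"]}

    return handler


class StubSageMaker:
    """Local stand-in for the SageMaker client; records which pipelines were started."""

    def __init__(self):
        self.started = []

    def start_pipeline_execution(self, PipelineName):
        self.started.append(PipelineName)
        return {"PipelineExecutionArn": f"arn:stub:pipeline/{PipelineName}/execution/1"}


stub_sm = StubSageMaker()
drift_handler = make_drift_handler(stub_sm, "retrain-claims-model")
fired = drift_handler({"detail": {"state": {"value": "ALARM"}}}, None)
ignored = drift_handler({"detail": {"state": {"value": "OK"}}}, None)
```

In a deployed function, `boto3.client("sagemaker")` would take the place of the stub.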

Overall Consolidated Architecture for this Pattern

Figure 7: Overall Architecture for ML Gateway Pattern

Here we break down the example in this blog into four parts:

1. Data prep

  • For preparation, we load the CSV into S3
  • Then we create and populate a Feature Store that can be used for training our model
  • Later, we use Athena to load the data from the Feature Store into a dataframe

2. Training and deployment

3. Inference

4. MLOps: deployment of a CloudFormation template
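For the data prep step, populating a feature group from a CSV comes down to converting each row into the list of `FeatureName`/`ValueAsString` pairs that the Feature Store `put_record` call accepts. The sketch below shows only that conversion, on a made-up two-row CSV; the column names are assumptions, and a real feature group also needs its record identifier and event time columns to be present.

```python
import csv
import io


def rows_to_records(csv_text):
    """Convert CSV text into one put_record-style Record list per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        [{"FeatureName": name, "ValueAsString": value} for name, value in row.items()]
        for row in reader
    ]


# Hypothetical sample data.
sample_csv = "claim_id,amount,event_time\nc-1,120.0,1638662400\nc-2,75.5,1638666000\n"
records = rows_to_records(sample_csv)
```

Each element of `records` could then be sent with `put_record(FeatureGroupName=..., Record=record)` on the `sagemaker-featurestore-runtime` client.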

Architectural Decisions and Considerations

Feature Store Warm-up

The very first call to the Amazon SageMaker online Feature Store may experience cold-start latency as it warms up its cache. If needed, you can add a Lambda or EC2 warmer for the Feature Store within the ML Gateway pattern, to keep it warm so that even the first incoming requests are served quickly.

Lambda Provisioned Concurrency

To alleviate cold-start issues for Lambda, we can configure Provisioned Concurrency.
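As a rough sizing aid, the concurrency a function needs is approximately its request rate multiplied by its average duration. The sketch below applies that estimate and builds the arguments for Lambda's `put_provisioned_concurrency_config` call; the function name, alias, and traffic figures are assumptions for illustration.

```python
import math


def required_concurrency(requests_per_second, avg_duration_seconds):
    """Estimate concurrent executions: request rate x average duration, rounded up."""
    return max(1, math.ceil(requests_per_second * avg_duration_seconds))


def provisioned_concurrency_args(function_name, alias, requests_per_second,
                                 avg_duration_seconds):
    """Arguments for put_provisioned_concurrency_config on a published alias."""
    return {
        "FunctionName": function_name,
        "Qualifier": alias,  # must be a version or alias, not $LATEST
        "ProvisionedConcurrentExecutions": required_concurrency(
            requests_per_second, avg_duration_seconds
        ),
    }


# Hypothetical traffic: 40 requests/s at 250 ms each.
pc_args = provisioned_concurrency_args("ml-gateway-invoke", "live", 40, 0.25)
```

With boto3, this maps onto `boto3.client("lambda").put_provisioned_concurrency_config(**pc_args)`.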

Autoscaling Considerations

A more detailed exposition of modulating autoscaling on SageMaker Hosted Endpoints can be found in this blog post.

As we increase the number of requests per second on an endpoint, the endpoint's response latency will eventually tend to increase, and some requests will ultimately error out. To mitigate this, SageMaker endpoints allow you to define autoscaling for your instances, so that compute capacity is added automatically as incoming requests increase, keeping response latency acceptable and improving prediction throughput. Autoscaling looks like this.

Autoscaling requires a CloudWatch alarm to trigger scale-in. SageMaker endpoints do not publish invocation metrics while they receive no requests, so the alarm has no data to evaluate. To trigger the CloudWatch alarm when your workload decreases or ends, consider the following actions:

  • Create a step scaling policy for scale-in that uses the CloudWatch metric math FILL() function. This tells CloudWatch: “if there’s no data, pretend this was the metric value when evaluating the alarm.” This is only possible with step scaling, because target tracking creates the alarms for you (and Auto Scaling periodically recreates them, so any manual changes get deleted)
  • Have scheduled scaling set the size back down to 1 periodically, when you anticipate low load, e.g. every evening
  • Ensure some traffic continues, even at a low level, for some duration
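The first two options above can be sketched as request builders. The first assembles a CloudWatch alarm whose metric math expression `FILL(m1, 0)` substitutes zero for missing data points; the second assembles a scheduled action that pins the variant to one instance in the evening. The metric choice, threshold, evaluation window, and cron schedule are placeholder assumptions, and the step scaling policy ARN is expected to come from a policy you have already created.

```python
def scale_in_alarm_args(endpoint_name, variant_name, scale_in_policy_arn):
    """Arguments for CloudWatch put_metric_alarm using FILL() for missing data."""
    return {
        "AlarmName": f"{endpoint_name}-{variant_name}-scale-in",
        "Metrics": [
            {
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/SageMaker",
                        "MetricName": "Invocations",
                        "Dimensions": [
                            {"Name": "EndpointName", "Value": endpoint_name},
                            {"Name": "VariantName", "Value": variant_name},
                        ],
                    },
                    "Period": 60,
                    "Stat": "Sum",
                },
                "ReturnData": False,
            },
            {
                "Id": "e1",
                "Expression": "FILL(m1, 0)",  # no data is treated as zero invocations
                "Label": "InvocationsFilled",
                "ReturnData": True,
            },
        ],
        "ComparisonOperator": "LessThanOrEqualToThreshold",
        "Threshold": 0.0,
        "EvaluationPeriods": 15,              # 15 quiet minutes; placeholder
        "AlarmActions": [scale_in_policy_arn],
    }


def nightly_scale_down_args(endpoint_name, variant_name):
    """Arguments for Application Auto Scaling put_scheduled_action."""
    return {
        "ServiceNamespace": "sagemaker",
        "ScheduledActionName": f"{endpoint_name}-nightly-scale-down",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "Schedule": "cron(0 20 * * ? *)",     # every day at 20:00 UTC; placeholder
        "ScalableTargetAction": {"MinCapacity": 1, "MaxCapacity": 1},
    }


alarm_args = scale_in_alarm_args("my-endpoint", "AllTraffic", "arn:aws:autoscaling:placeholder")
schedule_args = nightly_scale_down_args("my-endpoint", "AllTraffic")
```

Note that a scheduled action which sets both capacities to 1 needs a matching morning action to raise the maximum again, otherwise the endpoint cannot scale out the next day.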

Model Monitoring for Data and Concept Drift

We can add Model Monitor as a means of detecting and controlling data or concept drift, and of re-triggering a pipeline that retrains your models.


1. https://github.com/awsdocs/amazon-sagemaker-developer-guide/blob/master/doc_source/ex1-model-deployment.md#ex1-deploy-model


3. https://github.com/awsdocs/amazon-sagemaker-developer-guide/blob/master/doc_source/endpoint-auto-scaling.md

4. https://aws.amazon.com/blogs/machine-learning/load-test-and-optimize-an-amazon-sagemaker-endpoint-using-automatic-scaling/


Dr. Ali Arsanjani is WW Tech Leader for AI/ML Specialist Solution Architecture and Chief Principal AI/ML Architect. Previously, Ali was IBM Distinguished Engineer, CTO for Analytics and Machine Learning and IBM Master Inventor. Ali is an Adjunct Professor at San Jose State University teaching in the Master of Science in Data Science Program, and advising Master’s Projects.

Bobby Lindsey is a Machine Learning Specialist at Amazon Web Services. He’s been in technology for over a decade, spanning various technologies and multiple roles. He is currently focused on combining his background in software engineering, DevOps, and machine learning to help customers deliver machine learning workflows at scale. In his spare time, he enjoys reading, research, hiking, biking, and trail running.

Tony Chen is a Machine Learning Specialist at Amazon Web Services.


