Enterprise ML Deployment Patterns: Part 1, ML Gateway


ML Gateway — Direct Integration: API Gateway to SageMaker

Figure 1: Direct ML Gateway Pattern: API Gateway to SageMaker
  • A/B testing a new model against the current one on production traffic can be an effective final validation step. In A/B testing, you run different variants of your model side by side and compare how each performs; if the new version outperforms the existing one, you promote it to replace the old version in production. SageMaker supports this through Production Variants, which let you test multiple models or model versions behind the same endpoint. Each Production Variant maps to a single model deployed in its own container. You can distribute invocation requests across variants by assigning each a traffic weight, or you can invoke a specific variant directly on a per-request basis.
  • With SageMaker Multi-Model Endpoints, you can host many models behind one endpoint. Unlike Production Variants, the models do not each need their own container and resources; they all share a single container, which usually lowers hosting costs. Consider this pattern when you have a large number of models to host, for example when each user has their own personalized model.
  • Fronting the endpoint with API Gateway additionally lets you extend the API for traffic management,
  • authorization and access control,
  • monitoring, and
  • API version management.
Figure 2: Autoscaling with SageMaker Endpoints to handle peak traffic

ML Gateway — Indirect Integration: API Gateway to Lambda to SageMaker

Figure 3: Lambda Indirect Connection

ML Gateway Enterprise: Adding Feature Store for Inference and Monitoring for Data and Model Drift

Figure 4: Adding the Feature Store
Figure 5: Monitoring the Hosted Endpoint
Figure 6: Adding Amazon EventBridge (formerly CloudWatch Events) Triggers

Overall Consolidated Architecture for this Pattern

Figure 7: Overall Architecture for ML Gateway Pattern
  • For preparation, we will load the CSV into Amazon S3.
  • Then we create and populate a Feature Store that can be used for training our model.
  • Later, we will use Athena to load the data from the offline feature store into a dataframe.

Architectural Decisions and Considerations

Feature Store Warm-up

Lambda Provisioned Concurrency

Autoscaling Considerations

  • Create a step scaling policy using the CloudWatch metric math FILL() function for your scale-in. This tells CloudWatch: “if there’s no data, pretend this was the metric value when evaluating the alarm.” This is only possible with step scaling, since target tracking creates the alarms for you (and Auto Scaling will periodically recreate them, so any manual changes you make will get deleted).
  • Have scheduled scaling set the size back down to 1 periodically when you anticipate low load, e.g., every evening.
  • Ensure some traffic continues, even at a low level, for some duration.

Model Monitoring for Data and Concept Drift






Ali Arsanjani
Director at Google, AI/ML | EX: WW Tech Leader, Chief Principal AI/ML Solution Architect, AWS | IBM Distinguished Engineer and CTO Analytics & Machine Learning