How to Build and Run your Entire End-to-end ML Life-cycle with Scalable Components
End-to-end Enterprise Scale MLOps Projects: An Overview
Enterprise-scale, industrial-strength machine learning (ML) requires robust, scalable components that cover the entire ML life-cycle. In this six-part series I will explore how to build the architectural components underpinning each phase of the ML life-cycle, using a detailed music recommendation example. The entire code for this end-to-end example can be found here.
Getting all of the personnel elements to work together challenged the program managers, regardless of whether or not they were civil service, industry, or university personnel. There were various communities … that differed over priorities and competed for resources [:] … engineers and scientists.
…engineers … worked in teams to build hardware that could carry out the missions necessary to a successful Moon landing ... Their … goal involved building vehicles that would function reliably within the fiscal resources allocated to Apollo.
…space scientists engaged in pure research and were more concerned with designing experiments that would expand scientific knowledge about the Moon. … tended to be individualists, unaccustomed to regimentation and unwilling to concede gladly the direction of projects to outside entities.
The two groups contended with each other over a great variety of issues ... scientists disliked having to configure payloads so that they could meet time, money, or launch vehicle constraints.
…engineers, …, resented changes to scientific packages added after project definition because these threw their hardware efforts out of kilter. Both had valid complaints and had to maintain an uneasy cooperation to accomplish Project Apollo.
I first introduced the notion of the end-to-end ML Lifecycle in Feb 2021 with Architect and build the full machine learning lifecycle with AWS. In August I published the Music Recommender example to GitHub. This second end-to-end example covers more tasks, including building a recommender system, coupling training with SageMaker Debugger, using SageMaker Model Monitor to monitor deployed models, and new processing steps in SageMaker Pipelines.
In this series of blogs, we will explore the key components of large scale ML projects, and discuss how they can be realized by a robust and scalable set of components across the machine learning lifecycle, yielding an ML Architecture that supports your business priorities and gives you competitive advantage.
We’ll be using a Music Recommendation use-case to demonstrate all the stages and phases of the ML Life-cycle and the capabilities that are required at each phase to implement a robust, secure and scalable architecture that leverages machine learning.
Section 1: A Reference Architecture for scalable ML
The diagram below provides an overview of an ML Reference Architecture that contains a generic set of phases and tasks, along with their implementation on AWS. It simplifies the iterative nature of the major phases of the ML lifecycle. Each phase has a corresponding notebook in github.
We will be describing each phase in greater detail. Let’s start with what takes anywhere from 80–90% of the time of the project, the Data Prep Phase.
Section 2: Data Prep
Let’s explore the synthetic data that we generated for this use-case. It is based on the distributions of real-life data without including any of the actual records, which preserves the privacy of the original data. There are three main data sources: tracks and their musical characteristics; ratings of those tracks by various users (who rated which track, and when); and the 5-star ratings, which capture user preferences.
Example track (tracks.csv) and user ratings (ratings.csv) data is provided in a publicly available S3 bucket found here:
s3://sagemaker-sample-files/datasets/tabular/synthetic-music. We’ll be running a notebook to download the data in the demo so no need to manually download it from here just yet.
tracks.csv: track music characteristics
- trackId: unique identifier for each song/track
- length: song length in seconds (numerical)
- energy: (numerical)
- acousticness: (numerical)
- valence: (numerical)
- speechiness: (numerical)
- instrumentalness: (numerical)
- liveness: (numerical)
- tempo: (numerical)
- genre: (categorical)
ratings.csv: users rating of tracks
- ratingEventId: unique identifier for each rating
- ts: timestamp of rating event (datetime in seconds since 1970)
- userId: unique id for each user
- trackId: unique id for each song/track
- sessionId: unique id for the user’s session
- Rating: user’s rating of song on scale from 1 to 5
userprefs (5star.csv): a specific user’s preferences.
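To make the schemas above concrete, here is a minimal sketch that parses a few rows shaped like tracks.csv and ratings.csv using only Python's standard library. The sample rows are invented for illustration (they are not taken from the real dataset), and only a subset of the track columns is shown to keep the example short. The helper names are hypothetical.

```python
import csv
import io

# Hypothetical sample rows, invented for illustration only.
TRACKS_CSV = """trackId,length,energy,tempo,genre
t1,210.0,0.8,120.0,Rock
t2,185.5,0.3,90.0,Jazz
"""

RATINGS_CSV = """ratingEventId,ts,userId,trackId,sessionId,Rating
r1,1600000000,u1,t1,s1,5
r2,1600000300,u1,t2,s1,2
r3,1600000600,u2,t1,s2,4
"""

NUMERIC_TRACK_COLUMNS = ("length", "energy", "tempo")


def load_tracks(text):
    """Return {trackId: row} with the numerical columns converted to float."""
    tracks = {}
    for row in csv.DictReader(io.StringIO(text)):
        for col in NUMERIC_TRACK_COLUMNS:
            row[col] = float(row[col])
        tracks[row["trackId"]] = row
    return tracks


def load_ratings(text):
    """Return a list of rating events with ts and Rating converted to int."""
    ratings = []
    for row in csv.DictReader(io.StringIO(text)):
        row["ts"] = int(row["ts"])
        row["Rating"] = int(row["Rating"])
        ratings.append(row)
    return ratings


tracks = load_tracks(TRACKS_CSV)
ratings = load_ratings(RATINGS_CSV)
```

In the actual demo the notebooks read the files from S3, so this sketch is only meant to show the shape of the records.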
For this tutorial, we’ll be using our own generated track and user ratings data, but publicly available datasets/apis such as the Million Song Dataset and open-source song ratings APIs are available for personal research purposes.
Section 3: The Solution Implementation
In the following notebooks we’ll take 2 different approaches with the same modeling solution to create our music recommender.
1. Exploratory ML: Run each notebook, 02a_ to 05_, in sequence
01_music_dataprep.flow: This .json file was created by SageMaker Data Wrangler. It contains the definitions of the data inputs and transformation steps that we defined using Data Wrangler in SageMaker Studio.
02a_export_fs_tracks.ipynb: define, transform and export our tracks data created in Data Wrangler to a feature store
02b_export_fs_5star_features.ipynb: define, transform and export our 5-star rated tracks data created in Data Wrangler to a feature store
02c_fs_create_ratings.ipynb: define, transform and export our user ratings data created in Data Wrangler to a feature store
03_train_model_lineage_registry_debugger.ipynb: train the model using xgboost to predict each song rating for each user
04_inference_explainability.ipynb: serve the model, go over feature importances using SHAP values
05_model_monitor.ipynb: set up SageMaker Model Monitor to periodically monitor the hosted endpoint
2. MLOps Automated Workflow: Set up a SageMaker Pipelines automated workflow that performs all of the above steps from a single notebook so that it can be run automatically.
06_pipeline.ipynb: create an automated workflow using SageMaker Pipelines
Phase 1: Data Prep
A Music Recommendation Use-case
Amazon SageMaker helps data scientists and developers prepare data, build, train, and deploy machine learning models. In this chapter, we’ll be working in SageMaker Studio, an IDE for machine learning, where the output will be a curated playlist of recommended music based on prior user ratings of songs.
You can find the entire code in the AWS SageMaker Examples github, here.
With advancements in machine learning, user personalization now offers organizations the ability to improve brand loyalty, grow revenue, and specifically tailor content to users. I’ll show you how you can generate your own personalized music playlists by using SageMaker to build a model that predicts a given user’s rating for a given song. The model is trained using each user’s ratings to learn about their music tastes. When we provide the model with a song that the user has not rated in the past, the model looks at attributes of the song, such as genre, to find similarities to other songs that the user has rated and comes up with a rating for the song in question.
Let’s jump into the process of training and deploying this model so we can create a recommended music playlist for different users. Once we land on the SageMaker console, we select SageMaker Studio on the top left corner and will end up in the SageMaker Studio control panel. Here you can see all SageMaker Studio users associated with your account. If there are no users, you can create one by clicking the “add user” icon.
Once we’re in SageMaker Studio, the launcher has a variety of different features that we can use. We can start by creating a new dataflow (.flow) file, a new project or we can start by creating a new Jupyter notebook.
When we open the files icon we can see any folders or notebooks available. For this demo, you will need to clone the AWS SageMaker Examples repo and navigate to the end_to_end folder and then into the music_recommendation example folder. To do so, click the git icon, click “clone a repository”, type in the link to the AWS SageMaker Examples repo, and click clone. Now that the repo is cloned, we can open the repo folder and see all of the files needed for this demo.
You’ll see that the repo contains a series of sequential notebooks and a .flow file, which we’ll be going over. In notebooks 2 through 5, we’ll walk through each step of creating the music recommender model. In this first part of the demo, I’ll show you how to use SageMaker Data Wrangler, SageMaker ML Lineage Tracking, SageMaker Model Registry, and SageMaker Clarify to build, train, and deploy the playlist music recommender model. In the final notebook, 06, I will show you how to create an automated SageMaker Pipeline that ties all these steps together.
Get the Data
First, let’s download our music data from the public S3 bucket. We’ll launch the
00_overview_arch_data notebook and run all cells which will both download the tracks and user ratings datasets and upload them to your own default SageMaker S3 bucket that was created when you initially created your Studio workspace. This notebook also updates the data source path in the
.flow file to point to your default S3 bucket.
Data Preparation with Data Wrangler
Let’s explore how we can use a GUI based tool, in this case, SageMaker Studio’s Data Wrangler, to define feature transforms on our data. This allows us to define transformations as we do when we perform feature engineering on our raw datasets. This task involves data cleansing and defining custom feature transformations such as converting categorical values, removing nulls, etc., in order to prepare the data for the next phase, model training.
SageMaker Data Wrangler is a feature of SageMaker Studio that radically simplifies the highly interactive process of data preparation and feature engineering by loading in a portion of your dataset and giving you tools to define transformations, joins and custom feature engineering tasks. It allows you to complete each step of the data preparation workflow, including data selection, cleansing, exploration, and visualization all from a single visual interface. SageMaker Data Wrangler contains over 300 built-in data transformations for your data, such as the ability to encode categorical variables or deal with missing data or outliers.
To follow along, first run notebook
00_overview_arch_data.ipynb to download the synthetic data, and set the paths in the Data Wrangler .flow file to your default bucket.
You can see that we have two data sources here: tracks.csv and ratings.csv. While we are using Amazon S3 as the source for this example, Data Wrangler can also handle other input sources such as Amazon Athena, Amazon Redshift, or Snowflake. If we click into the data types, we can see a preview of the data [click on data types for tracks.csv].
Our first dataset,
tracks.csv, provides information about different tracks in our music catalog. This includes the track ID and characteristics of the song such as length in seconds, energy, acousticness, and genre. We can go back to viewing the flow by clicking the
“Back to data flow” tab. Next, we can inspect the second dataset,
ratings.csv [click data types for ratings.csv]. This data source contains ratings given to the songs by different users, as well as metadata about each rating, such as the timestamp at which it occurred.
Let’s go back to the data flow. If we click into the “steps” node, we can see the transformations that were applied to the data set.
Let’s take a look at the custom formula step [click “steps” → “custom formula”]. When we expand the latest transform, the custom formula, we see that we created a new feature called danceability which is a combination of several of the other features.
If we click on 6.custom formula we will see the custom formula that was defined as shown below:
Let’s go back to the data flow and check out the joins.
Let’s double click on it.
Here, we can see that we also join the tracks data with the ratings data, using trackId as the join key. After joining, we apply some additional transforms. We’ve also created a feature timestamp column, which contains the time at which we created the feature, and converted it to a float type. This is important because timestamps are required when we store data in SageMaker Feature Store, and in this example the timestamp needs to be a float data type.
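The join and timestamp steps described here can be sketched in plain Python. This is not the code that Data Wrangler generates; it is a minimal illustration under stated assumptions: the column name `EventTime`, the helper name, and the sample rows are all hypothetical.

```python
import time


def join_ratings_with_tracks(ratings, tracks, event_time=None):
    """Inner-join rating events to track attributes on trackId and stamp
    each joined row with a float event time (seconds since 1970)."""
    stamp = float(time.time() if event_time is None else event_time)
    joined = []
    for rating in ratings:
        track = tracks.get(rating["trackId"])
        if track is None:
            continue  # inner join: drop ratings that reference unknown tracks
        row = {**track, **rating}
        row["EventTime"] = stamp  # hypothetical column name, stored as a float
        joined.append(row)
    return joined


# Invented sample rows for illustration.
tracks = {"t1": {"trackId": "t1", "genre": "Rock", "energy": 0.8}}
ratings = [
    {"ratingEventId": "r1", "userId": "u1", "trackId": "t1", "Rating": 5},
    {"ratingEventId": "r2", "userId": "u1", "trackId": "t9", "Rating": 3},
]
rows = join_ratings_with_tracks(ratings, tracks, event_time=1600000000)
```

Passing `event_time` explicitly keeps the result reproducible; leaving it out stamps rows with the current time.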
Create a Feature Store
When our transforms are complete, we can export the transformations so we can use them for model building. To do so, go to the export tab in SageMaker Data Wrangler, click the node that you want to export, click on the last transform, and then click the export icon in the top right.
There are 4 export options:
1. save as a csv in S3
2. as a pipeline step for SageMaker Pipelines
3. as Python code
4. export to a feature store
We’ll be exporting to a feature store for this example.
Since we have three sets of features from three datasets to transform, we will use the code generation feature of Data Wrangler to generate the Processing Job notebook, and the actual export to the Feature Store for us.
To do so, we choose ‘export’ three times, each time from the relevant node: once for tracks, once for ratings, and once for user preferences. This creates separate notebooks in which our input csv data is processed (using SageMaker Processing Jobs) according to the transformations we have defined in our
.flow file, and then exported to its own feature group: first, a feature group for the tracks data after transformations; then, one for the ratings data; and third, to generate the user preferences, we use the final node after all of our transforms and the join.
Note: The repo already contains these three notebooks, which I generated from Data Wrangler. Each one processes the data with SageMaker Processing using the transforms defined in the .flow file and then creates the corresponding feature group, three in total.
We’ll run the three notebooks, 02a, 02b and 02c, in sequence to create the three feature groups that will live in the feature store. Each of these notebooks pulls its data transformation instructions from the .flow file.
Note that in each of our datasets, we have a feature that is a Record identifier and another that is an Event time feature name. These are required parameters for feature group creation.
- Record identifier name is the name of the feature defined in the feature group’s feature definitions whose value uniquely identifies a Record defined in the feature group’s feature definitions.
- Event time feature name is the name of the EventTime feature of a Record in FeatureGroup. An EventTime is a timestamp that represents the point in time when a new event occurs that corresponds to the creation or update of a Record in the FeatureGroup. All Records in the FeatureGroup must have a corresponding EventTime.
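As a rough illustration of how these two required parameters fit into a feature group definition, the sketch below builds the core fields of a definition from one sample record. The field names follow the shape of the Feature Store create-feature-group request, but the request is only assembled here, not sent, and the store configuration and IAM role are omitted. The group name, the `EventTime` column, and the sample record are assumptions made for the example.

```python
def infer_feature_type(value):
    """Map a Python value to a Feature Store feature type name."""
    if isinstance(value, bool):
        return "String"  # no dedicated boolean type here; store as text
    if isinstance(value, int):
        return "Integral"
    if isinstance(value, float):
        return "Fractional"
    return "String"


def build_feature_group_spec(name, sample_record, record_id, event_time):
    """Build the core fields of a feature group definition from one record."""
    for required in (record_id, event_time):
        if required not in sample_record:
            raise ValueError(f"missing required feature: {required}")
    return {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": record_id,
        "EventTimeFeatureName": event_time,
        "FeatureDefinitions": [
            {"FeatureName": key, "FeatureType": infer_feature_type(val)}
            for key, val in sample_record.items()
        ],
    }


# Hypothetical group name and record, for illustration only.
spec = build_feature_group_spec(
    "track-features-demo",
    {"trackId": "t1", "energy": 0.8, "plays": 3, "EventTime": 1600000000.0},
    record_id="trackId",
    event_time="EventTime",
)
```

The check at the top mirrors the rule stated above: a record without its identifier or event time cannot go into the feature group.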
In the next blog post we will explore training for the music recommender system. In the meantime, here is an overview of the next phases in the ML Lifecycle with the corresponding code in github.
Phase 2: The Model: Train and Debug, Store Version in Model Registry
Now that we’ve saved the data in the feature store, we are ready to move onto training. Let’s open the
03 notebook. We’ll query the feature stores using Amazon Athena and join the results to get our final dataframe for training. We do a standard train/test split and save these splits to S3.
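A standard train/test split can be sketched with the standard library alone. This is not the notebook's code; the 80/20 ratio, the seed, and the function name are assumptions chosen for the example.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of rows deterministically and return (train, test)."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(10))  # stand-in for the joined training rows
train, test = train_test_split(rows)
```

Fixing the seed means the same rows land in the test set on every run, which makes later model versions comparable.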
With our train/test datasets ready for training we need to select an algorithm. For our purposes, we are trying to predict what rating a user would give a song based off the song’s features and the user’s preferences. We will treat this as a regression problem since we are trying to predict the star-rating. For this type of regression, we have selected XGBoost.
Train the Model
Here we define our Estimator and pass it the XGBoost algorithm image URI. This tells SageMaker hosted training that we want to train an XGBoost model. SageMaker natively supports many other algorithms, so check them out to find the best model for your data distribution and use-case.
We can also pass the hyper-parameters for the model to use and you’ll notice the debugger hook config, which indicates that we want to store the model metrics and feature importance SHAP values. We also set a debugger rule, which is Python code used to detect certain conditions during training. In this case, we are detecting whether the loss is not decreasing, meaning the model is not improving. We will talk more about the debugger shortly when we view the outputs of the debugger report.
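To give a feel for what a "loss not decreasing" condition checks, here is a simplified stand-alone sketch. It is not the implementation of the SageMaker Debugger rule; the `window` and `min_improvement` parameters and the comparison logic are assumptions made for illustration.

```python
def loss_not_decreasing(losses, window=3, min_improvement=0.0):
    """Return True if the best loss in the last `window` steps is no better
    than the best loss seen before them (by more than min_improvement)."""
    if len(losses) <= window:
        return False  # not enough history to judge
    best_before = min(losses[:-window])
    best_recent = min(losses[-window:])
    return best_before - best_recent <= min_improvement


improving = loss_not_decreasing([1.0, 0.8, 0.6, 0.5, 0.4])
stalled = loss_not_decreasing([1.0, 0.5, 0.5, 0.6, 0.55])
```

A rule like this is evaluated repeatedly while training runs, so a stalled job can be flagged before it finishes.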
Once our estimator is defined, we can train the model by calling
.fit() and passing our training and validation datasets that we created above.
SageMaker Debugger is used here to save metrics about the model training such as loss and accuracy every half second. This lineage of metrics enables us to better debug our model during training. Once the model is trained, we use SageMaker Debugger to create a trial based off of the trained estimator.
We can view the Debugger report by calling
create_trial() on our estimator’s debugger path. Now we can access the Debugger report and plot some of the logged metrics throughout the training iterations.
Once our model is trained, we will want to use SageMaker ML Lineage Tracking to track different artifacts about the model. An important aspect of transparency in machine learning is to be able to link a model with the code and data used to train it so we can easily find how any given model was produced. In the event that a model begins to start performing poorly, we should be able to quickly find all of the artifacts used to create it so we can debug the problem at the source. To do this, we need to create an artifact for every aspect of training we’d like to save.
Next, we create a model package group in SageMaker Model Registry that can contain different versions of a model. You can then register each model you train, and the model registry adds it to the group as a new model version. So if you plan on tracking multiple versions of a model, you need to create a model package group first.
We will only show one version of a model here, but in practice you may have different versions as you try different features, different hyper-parameters, or update your model over time. Here we include every metric that is associated with our training job. Finally, for model governance reasons, we can tag the model package with a status, so it’s easy to tell which models are actually approved or declined for deployment into production.
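The relationship between a model package group, its versions, and their approval status can be pictured with a toy in-memory stand-in. This is not the SageMaker Model Registry API; the class, its methods, the group name, and the S3 URIs are hypothetical. Only the three status strings are meant to match the values used for model approval.

```python
from dataclasses import dataclass, field

VALID_STATUSES = {"PendingManualApproval", "Approved", "Rejected"}


@dataclass
class ModelPackageGroup:
    """Toy in-memory stand-in for a model package group."""
    name: str
    versions: list = field(default_factory=list)

    def register(self, model_uri, metrics, status="PendingManualApproval"):
        """Add a new model version and return its 1-based version number."""
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown approval status: {status}")
        self.versions.append({"uri": model_uri, "metrics": metrics, "status": status})
        return len(self.versions)

    def set_status(self, version, status):
        """Change the approval status of an existing version."""
        if status not in VALID_STATUSES:
            raise ValueError(f"unknown approval status: {status}")
        self.versions[version - 1]["status"] = status

    def latest_approved(self):
        """Return the highest approved version number, or None."""
        for number in range(len(self.versions), 0, -1):
            if self.versions[number - 1]["status"] == "Approved":
                return number
        return None


group = ModelPackageGroup("music-rec-demo")  # hypothetical group name
v1 = group.register("s3://example-bucket/model-1.tar.gz", {"rmse": 1.1})
group.set_status(v1, "Approved")
v2 = group.register("s3://example-bucket/model-2.tar.gz", {"rmse": 1.0})
```

Here version 2 is registered but still pending, so a deployment step that asks for the latest approved model would keep using version 1.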
Phase 3: Model Deployment, Serving & Explainability
In order to use our model for inference, we’ll deploy our model using
.deploy() and create an endpoint where the model is available online so that we can make inferences on it in the next step. We can use the SageMaker Studio Components and Registries tab to see what model Endpoints we have running.
Inference & Recommendation Explainability
Now that we’ve built our model, we’ll want to make playlist predictions with it. Let’s open our next notebook and look at a single customer to predict a list of tracks that we would recommend to them.
We’ll grab a single user and the tracks data from our feature store. We’ll also reload our trained model using
.predictor() from our model endpoint we registered in our previous notebook. Next, we can send tracks to predict the user’s star rating for each song and write out the predictions to our S3 bucket.
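Turning per-track rating predictions into a playlist comes down to scoring the tracks a user has not rated and keeping the best ones. The sketch below assumes a `predict_rating` callable that stands in for the deployed endpoint; `fake_predict`, the sample tracks, and the tie-breaking rule are all hypothetical.

```python
def recommend_playlist(tracks, predict_rating, already_rated=(), top_n=3):
    """Score unrated tracks with predict_rating and return the top_n track ids."""
    rated = set(already_rated)
    scored = [
        (predict_rating(track), track["trackId"])
        for track in tracks
        if track["trackId"] not in rated
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best rating first, ties by id
    return [track_id for _, track_id in scored[:top_n]]


def fake_predict(track):
    """Hypothetical stand-in for the endpoint: favors high-energy tracks."""
    return 1.0 + 4.0 * track["energy"]


# Invented sample tracks for illustration.
catalog = [
    {"trackId": "t1", "energy": 0.9},
    {"trackId": "t2", "energy": 0.2},
    {"trackId": "t3", "energy": 0.5},
    {"trackId": "t4", "energy": 0.7},
]
playlist = recommend_playlist(catalog, fake_predict, already_rated=("t1",), top_n=2)
```

In the demo the scoring call would go to the hosted model rather than a local function, but the ranking step stays the same.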
Explainability of the Results
Once we have our track rating predictions, we’ll use SageMaker Clarify to explore what was most useful in predicting this user’s track ratings. Trained models may consider some model inputs more strongly than others when generating predictions. SageMaker Clarify lets you look into these important features as well as detect class imbalances that can lead to bias.
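Feature importance of the SHAP kind attributes a prediction to each input by averaging that input's marginal contribution over all orderings of the features. The sketch below computes exact Shapley values for a tiny, made-up rating function. It is not how SageMaker Clarify computes them (exact enumeration is only feasible for a handful of features); `toy_model`, the baseline, and the instance are assumptions for illustration.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance` relative to `baseline`.

    Features not yet added in an ordering keep their baseline value. The cost
    grows factorially with the number of features, so use tiny examples only.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for order in orderings:
        current = dict(baseline)
        previous = predict(current)
        for name in order:
            current[name] = instance[name]
            value = predict(current)
            totals[name] += value - previous  # marginal contribution of `name`
            previous = value
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(x):
    """Hypothetical rating model with an interaction term."""
    return 1.0 + 2.0 * x["energy"] + 1.0 * x["valence"] + 2.0 * x["energy"] * x["valence"]


baseline = {"energy": 0.0, "valence": 0.0}
instance = {"energy": 1.0, "valence": 0.5}
phi = shapley_values(toy_model, instance, baseline)
```

A useful sanity check is that the values add up to the difference between the prediction for the instance and the prediction for the baseline.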
Phase 4: Monitor the Model
Another important part of maintaining a model is monitoring its performance. This is where SageMaker Model Monitor comes into play. In this next notebook, we’ll go over how we can set up model monitoring to check when data and predictions change beyond a normal baseline. Changes in real-world data can cause your model to give different weights to model inputs, which changes its behavior over time. SageMaker Clarify is also integrated with SageMaker Model Monitor to alert you if the importance of model inputs shifts, causing model behavior to change.
Establish a Baseline
Using DataCaptureConfig(), we’ll sample our training dataset in order to establish a normal range, or baseline statistics, for the data.
baseline_statistics() will provide stats about each feature in the baseline dataset such as the mean and standard deviation of features, while
suggested_constraints() suggests some constraints that future data should follow for each feature, such as whether or not it may contain null values.
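The kind of output a baselining step produces can be approximated with a few lines of standard-library Python. This is a simplified illustration, not the Model Monitor implementation; the function names, the output layout, and the sample rows are assumptions.

```python
import statistics


def baseline_stats(rows, features):
    """Compute mean, population standard deviation and null count per feature."""
    stats = {}
    for feature in features:
        values = [row[feature] for row in rows if row.get(feature) is not None]
        stats[feature] = {
            "mean": statistics.fmean(values),
            "std": statistics.pstdev(values),
            "nulls": len(rows) - len(values),
        }
    return stats


def suggest_constraints(stats):
    """Suggest one simple constraint per feature: allow nulls only if the
    baseline data already contained some."""
    return {name: {"allow_nulls": s["nulls"] > 0} for name, s in stats.items()}


# Invented sample rows for illustration.
rows = [
    {"energy": 0.2, "tempo": 100.0},
    {"energy": 0.4, "tempo": None},
    {"energy": 0.6, "tempo": 120.0},
]
stats = baseline_stats(rows, ["energy", "tempo"])
constraints = suggest_constraints(stats)
```

The statistics describe what "normal" looked like at training time, and the constraints turn that description into rules that later data can be checked against.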
Enable and Schedule Monitoring
We can enable Model Monitor by kicking off a cron job when we create a schedule with
_monitoring_schedule(). SageMaker Model Monitor can later be integrated into a SageMaker Pipeline to automatically alert you to data drift, that is, when the underlying data deviates from the baseline statistics or constraints set up during baselining.
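One very simple way to picture a drift check is to compare the mean of newly captured values with the baseline mean, measured in baseline standard deviations. This is a deliberately simplified stand-in for what a scheduled monitoring job evaluates; the function names, the threshold of 2.0, and the sample values are assumptions.

```python
import statistics


def mean_shift(baseline_mean, baseline_std, new_values):
    """Distance between the new mean and the baseline mean, in baseline stds."""
    new_mean = statistics.fmean(new_values)
    if baseline_std == 0:
        return 0.0 if new_mean == baseline_mean else float("inf")
    return abs(new_mean - baseline_mean) / baseline_std


def has_drifted(baseline_mean, baseline_std, new_values, threshold=2.0):
    """Flag drift when the mean shift exceeds the (hypothetical) threshold."""
    return mean_shift(baseline_mean, baseline_std, new_values) > threshold


stable = has_drifted(0.5, 0.1, [0.5, 0.55, 0.45])
drifted = has_drifted(0.5, 0.1, [0.9, 0.95, 0.85])
```

Running a check like this on a schedule is what turns a one-off baseline into ongoing monitoring.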
Phase 5: MLOps: Automate the Entire Pipeline
In our final phase, we’ll look at how to combine all these steps into a SageMaker Pipeline so that your model can be run and updated on an automated schedule. With SageMaker Pipelines, you can create, automate, and manage end-to-end ML workflows at scale. Pipelines logs every step of your workflow, creating an audit trail of model components such as training data, platform configurations, and model hyper-parameters. Pipelines can be triggered either by updating training data in your designated S3 bucket or when code is committed to a connected repo.
Kicking off the SageMaker Pipeline notebook, we’ll define some parameters that can be changed later when the pipeline is actually called to run. This is a feature with Pipelines so that parameters can be changed without changing the entire pipeline itself.
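The idea of run-time parameters can be shown without any SageMaker code: the pipeline definition carries defaults, and each execution may override some of them. The parameter names, default values, and helper below are hypothetical and only illustrate the pattern.

```python
# Hypothetical parameter names and defaults, for illustration only.
DEFAULT_PARAMETERS = {
    "TrainInstanceCount": 1,
    "ModelApprovalStatus": "PendingManualApproval",
}


def resolve_parameters(defaults, overrides=None):
    """Merge per-execution overrides into the defaults, rejecting unknown names."""
    overrides = overrides or {}
    unknown = set(overrides) - set(defaults)
    if unknown:
        raise KeyError(f"unknown pipeline parameters: {sorted(unknown)}")
    return {**defaults, **overrides}


run_with_defaults = resolve_parameters(DEFAULT_PARAMETERS)
run_approved = resolve_parameters(DEFAULT_PARAMETERS, {"ModelApprovalStatus": "Approved"})
```

Because only the values change between runs, the pipeline definition itself stays untouched.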
SageMaker Data Wrangler?
In our previous notebooks, we exported feature store creation notebooks generated by Data Wrangler. For our pipeline, we’ll need to reference the specific nodes in our Data Wrangler steps in order to create our tracks and user rating feature stores. Pipelines will then reference the
.flow file for the necessary data transformation steps.
We’ve located the output node IDs here in the notebook from the generated notebooks from Data Wrangler. They’ll be the same IDs for you as well until a new
.flow file is created or edited. Keep in mind these IDs will need to be updated whenever the
.flow is updated. We’ll define the creation of the feature stores here and grab the container URI from the
.flow file as well.
Automate the Dataset Preparation
Next we’ll create our train and test splits from our dataset which will be created with a file included in this repo named
.ProcessingStep() is then created for this data processing action.
Build a Recommendation Model
We’ll then define an
.Estimator() to capture parameters for training and then a
.TrainingStep() that creates a step that will tell Pipelines how to handle model training. Then we’ll create a
.Model() object and register the trained model by defining a
.RegisterModel() step. Finally, we define a
.ProcessingStep() to deploy the model using our
deploy_model.py script found in this repo.
Create the Pipeline
Once we’ve defined each step in the Pipeline, we’ll create a
.Pipeline() instance and pass it the parameters we defined at the beginning of this notebook,
model_approval_status, as well as each pipeline step we just went over.
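A pipeline is essentially a graph of steps whose execution order follows from their dependencies. The sketch below models that idea with Python's `graphlib` (Python 3.9+); it does not use the SageMaker SDK, and the step names and dependency edges are assumptions loosely based on the steps described in this section.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each step maps to the set of steps it depends on.
STEP_DEPENDENCIES = {
    "CreateTracksFeatureGroup": set(),
    "CreateRatingsFeatureGroup": set(),
    "CreateTrainTestSplit": {"CreateTracksFeatureGroup", "CreateRatingsFeatureGroup"},
    "TrainModel": {"CreateTrainTestSplit"},
    "RegisterModel": {"TrainModel"},
    "DeployModel": {"RegisterModel"},
}


def execution_order(dependencies):
    """Return one valid execution order for the steps.

    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(STEP_DEPENDENCIES)
```

The two feature group steps have no dependency on each other, so either may come first; everything downstream waits for both.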
Run the Pipeline
We call .upsert() to pass the pipeline definition to the SageMaker Pipelines service. We then kick off the feature store creations as an offline source in parquet format. We wait until the feature stores are available, and then kick off creating the pipeline.
We can view the pipeline by navigating to SageMaker Studio’s Components and Registries tab and selecting Pipelines. Here we can view currently running and past pipelines. This is a great spot to find more detailed logs about your pipeline and see an overall graph of each step in the pipeline.
Once we’re done with this solution, we might not want to keep our model endpoints and pipelines running. To avoid ongoing charges, we can clean up the resources we spun up during this tutorial by running this notebook. This will delete the associated
music-recommendation S3 bucket and resources spun up during this session.
We have shown how to use robust and scalable components in AWS to implement the architectural components required as you journey across your end-to-end ML Lifecycle. You can find the code for the entire Music Recommendation Implementation here.