Build an MLOps end-to-end NLP Pipeline to understand trends in company valuation

Dec 5, 2021

Using Amazon SageMaker Pipelines, Amazon Jumpstart Industry SDK and HuggingFace Transformers

by Ali Arsanjani

NLP Pipeline for Company Earnings Analysis

In this series on end-to-end pipelines for MLOps, we explore Natural Language Processing (NLP) pipelines. Previous series covered the tabular/numeric aspects of end-to-end MLOps automation of the ML lifecycle for Fraud Detection and Music Recommendation.

In this post, our use case is using NLP to better understand trends in company valuation based on Securities and Exchange Commission (SEC) earnings reports and news from media outlets.

You can find the full code on github, here.

We will use the SageMaker JumpStart Industry SDK to pull the SEC filings, create a custom Docker container, and use SageMaker Pipelines to create an end-to-end NLP MLOps pipeline that automates the ML lifecycle: preparing the data and training two models, one for text summarization and another for sentiment analysis. Both models are applied to a specific textual section of the SEC filings called the MDNA, or Management’s Discussion and Analysis of Financial Condition and Results of Operations.

SageMaker Pipelines uses purpose-built Docker containers behind the scenes to run jobs (aka Steps) in a sequence that you define as a directed acyclic graph (DAG), much like a DevOps CI/CD pipeline. You can also build your own Docker container with Python 3, the Boto3 SDK, and the SageMaker Python SDK, so that your custom data processing scripts can call various AWS APIs (including the Amazon Fraud Detector APIs) via Boto3 and access SageMaker constructs such as SageMaker Feature Store.

Create a custom container

To achieve that, you first need to build a Docker image and push it to an Amazon ECR (Elastic Container Registry) repository in your account. Typically, this is done with the docker CLI and the aws CLI on your local machine. SageMaker Studio makes it even easier: a purpose-built tool known as `sagemaker-studio-image-build` lets you build a custom container and push it to your ECR repository directly from Studio, and then use the custom container image in your notebooks for your ML projects.

For more information, see: Using the Amazon SageMaker Studio Image Build CLI to build container images from your Studio notebooks.

Next, install this required CLI tool into the SageMaker environment.

Here is the code. We will walk through the major steps below.

Grant appropriate permissions to SageMaker

In order to use `sagemaker-studio-image-build`, we first need to add permissions to SageMaker's IAM role so that it may perform actions on your behalf. Specifically, add Amazon ECR and AWS CodeBuild permissions by attaching the `AmazonEC2ContainerRegistryFullAccess` and `AWSCodeBuildAdminAccess` policies to your SageMaker default role.
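A minimal sketch of attaching the two managed policies with Boto3, assuming you pass in an IAM client and the name of your SageMaker execution role. The helper name `attach_image_build_policies` is illustrative and not from the original notebook. The policy ARNs assume the standard `arn:aws:iam::aws:policy/<name>` form for AWS managed policies.

```python
# Managed policies named in the text, written in the standard
# AWS managed-policy ARN form (an assumption; confirm in the IAM console).
REQUIRED_POLICY_ARNS = [
    "arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryFullAccess",
    "arn:aws:iam::aws:policy/AWSCodeBuildAdminAccess",
]


def attach_image_build_policies(iam_client, role_name):
    """Attach the ECR and CodeBuild managed policies to a role.

    `iam_client` is expected to behave like a Boto3 IAM client,
    e.g. boto3.client("iam"). Hypothetical helper for illustration;
    returns the list of ARNs it attached.
    """
    attached = []
    for policy_arn in REQUIRED_POLICY_ARNS:
        iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
        attached.append(policy_arn)
    return attached
```

In a notebook this could be called as `attach_image_build_policies(boto3.client("iam"), "MySageMakerExecutionRole")`, where the role name is a placeholder for your own.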

Build a custom Docker image

Next we will write a custom Dockerfile and use the CLI tool we installed earlier to build the image from it. Our Docker image is pretty simple: it is based on the open source python:3.7-slim-buster image and adds an installation of the AWS Boto3 SDK, the SageMaker SDK, Pandas, and NumPy. For our end-to-end NLP pipeline, a number of tasks depend on Boto3 and the SageMaker SDK.

Leverage the SageMaker JumpStart Industry Python SDK for obtaining SEC Filing Data

We will also use the SageMaker JumpStart Industry Python SDK to download 10-K/10-Q reports from the SEC’s EDGAR system. We install all of these dependencies in the container and use the custom container in the ScriptProcessor step of our MLOps pipeline.
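The image described above can be sketched as a small Python helper that renders and writes the Dockerfile. This is an illustration, not the notebook's actual file: the helper names are hypothetical, `smjsindustry` is assumed to be the PyPI package name of the JumpStart Industry SDK, and the `ENV` and `ENTRYPOINT` lines are common choices for processing containers rather than something the post specifies.

```python
from pathlib import Path

# Packages named in the text. `smjsindustry` is assumed to be the PyPI
# name of the SageMaker JumpStart Industry Python SDK.
PIP_PACKAGES = ("boto3", "sagemaker", "pandas", "numpy", "smjsindustry")


def render_dockerfile(packages=PIP_PACKAGES):
    """Return Dockerfile text for a slim Python image with the given packages."""
    lines = [
        "FROM python:3.7-slim-buster",
        "RUN pip3 install --no-cache-dir " + " ".join(packages),
        # Unbuffered output so logs show up promptly (an assumed convenience).
        "ENV PYTHONUNBUFFERED=TRUE",
        'ENTRYPOINT ["python3"]',
    ]
    return "\n".join(lines) + "\n"


def write_dockerfile(directory="."):
    """Write the rendered Dockerfile into `directory` and return its path."""
    path = Path(directory) / "Dockerfile"
    path.write_text(render_dockerfile())
    return path
```

Calling `write_dockerfile()` in the notebook's working directory leaves a `Dockerfile` there, ready for the image build step.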

Notebook 1: Summary of Steps

Use the `sagemaker-studio-image-build` CLI tool, which helps us build and publish custom Docker images for our ML workstream.
Set up IAM permissions for the CLI tool.
Build a Docker image that includes the Boto3 and SageMaker SDK libraries using the `sm-docker build` command.
Initialize the variable `CONTAINER_IMAGE_URI` with the resulting image URI and store it in the notebook cache (via a notebook magic command) for use in the subsequent notebook.
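The steps above can be sketched in Python roughly as follows. The function names, repository name, and tag are hypothetical. The sketch assumes that `sm-docker build` accepts a `--repository name:tag` option and that the pushed image follows the usual `<account>.dkr.ecr.<region>.amazonaws.com/<repository>:<tag>` URI form.

```python
import subprocess
import sys


def install_cli_command():
    """Command that installs the sagemaker-studio-image-build CLI with pip."""
    return [sys.executable, "-m", "pip", "install", "sagemaker-studio-image-build"]


def sm_docker_build_command(repository, tag, context_dir="."):
    """Command that builds `context_dir` and pushes it to `repository:tag`."""
    return ["sm-docker", "build", context_dir, "--repository", f"{repository}:{tag}"]


def ecr_image_uri(account_id, region, repository, tag):
    """Compose an ECR image URI in the usual <account>.dkr.ecr.<region> form."""
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repository}:{tag}"


def build_and_push(account_id, region, repository, tag):
    """Install the CLI, run the build, and return the resulting image URI.

    Needs SageMaker Studio credentials and the IAM permissions set up
    earlier; nothing runs until this function is called.
    """
    subprocess.run(install_cli_command(), check=True)
    subprocess.run(sm_docker_build_command(repository, tag), check=True)
    return ecr_image_uri(account_id, region, repository, tag)
```

In the notebook, `CONTAINER_IMAGE_URI = build_and_push(...)` followed by `%store CONTAINER_IMAGE_URI` would make the URI available to the next notebook, assuming the IPython `%store` magic is the caching mechanism meant above.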

Company Earnings Analysis using NLP Summarization and Sentiment Analysis

Next, we will orchestrate an NLP pipeline that analyzes the sentiment found in a summarization of the MDNA section of the SEC earnings report, as well as in the news from media outlets associated with that filing, to help better understand trends in the valuation of companies. We will use two different HuggingFace Transformers, one for summarization and another for sentiment analysis.

In notebook 2, we demonstrate how to summarize and derive sentiments from Securities and Exchange Commission reports filed by a publicly traded organization. We will derive the overall market sentiment about the organization from financial news articles within the same financial period, to give a more balanced perspective of organization vs. market sentiment and to better understand trends in outlooks about the company’s overall valuation and performance. In addition, we will identify the most popular keywords and entities within the news articles about that organization.

In order to implement this, we will be using multiple SageMaker Hugging Face based NLP transformers for the downstream NLP tasks of Summarization (e.g., of the news and SEC MDNA sections) and Sentiment Analysis (of the resulting summaries).

Using SageMaker Pipelines

Amazon SageMaker Pipelines is a purpose-built, easy-to-use continuous integration and continuous delivery (CI/CD) service for machine learning (ML). With SageMaker Pipelines, you can create, automate, and manage end-to-end ML workflows at scale.

Orchestrating workflows across each step of the machine learning process (e.g. exploring and preparing data, experimenting with different algorithms and parameters, training and tuning models, and deploying models to production) can take months of coding.

Since it is purpose-built for machine learning, SageMaker Pipelines helps you automate different steps of the ML workflow, including data loading, data transformation, training and tuning, and deployment. With SageMaker Pipelines, you can build dozens of ML models a week, manage massive volumes of data, thousands of training experiments, and hundreds of different model versions. You can share and re-use workflows to recreate or optimize models, helping you scale ML throughout your organization.

Understanding trends in company valuation using NLP

We are going to demonstrate how to summarize and derive sentiments from Securities and Exchange Commission reports filed by a publicly traded organization. We will also derive the overall market sentiment about that organization from financial news articles within the same financial period, to present a fair view of organization vs. market sentiment and outlook about the company’s overall valuation and performance. In addition, we will identify the most popular keywords and entities within the news articles about that organization.

In order to achieve the above, we will use multiple SageMaker Hugging Face based NLP transformers for the downstream tasks of summarization and sentiment analysis.

Summarization of financial text from SEC reports and news articles will be done using the Pegasus for Financial Summarization model, based on the paper Towards Human-Centered Summarization: A Case Study on Financial News.
Sentiment analysis of the summarized SEC financial reports and news articles will be done with FinBERT, a pre-trained NLP model for analyzing the sentiment of financial text. The paper describing FinBERT in more detail is FinBERT: Financial Sentiment Analysis with Pre-trained Language Models.
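The summarize-then-score flow described above can be sketched with the Hugging Face `transformers` library running locally, rather than on SageMaker as in the pipeline itself. The two Hub model IDs are assumed to be the public checkpoints that match the papers cited above, and the helper names are hypothetical.

```python
# Assumed Hugging Face Hub IDs for the two models named in the text.
PEGASUS_MODEL_ID = "human-centered-summarization/financial-summarization-pegasus"
FINBERT_MODEL_ID = "ProsusAI/finbert"


def load_hf_pipelines():
    """Create the summarization and sentiment pipelines.

    The import is local because loading downloads both models,
    so nothing heavy happens until this function is called.
    """
    from transformers import pipeline

    summarizer = pipeline("summarization", model=PEGASUS_MODEL_ID)
    classifier = pipeline("text-classification", model=FINBERT_MODEL_ID)
    return summarizer, classifier


def summarize_then_score(text, summarizer, classifier):
    """Summarize `text`, then classify the sentiment of the summary."""
    # truncation=True clips inputs longer than the model's maximum length,
    # which long MDNA sections are likely to exceed.
    summary = summarizer(text, truncation=True)[0]["summary_text"]
    sentiment = classifier(summary)[0]
    return {
        "summary": summary,
        "label": sentiment["label"],
        "score": sentiment["score"],
    }
```

With the models loaded via `summarizer, classifier = load_hf_pipelines()`, calling `summarize_then_score(mdna_text, summarizer, classifier)` returns the summary together with a sentiment label and its score. Scoring the summary rather than the full text mirrors the order of the two tasks in this post.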

The SEC Dataset

The starting point for a vast amount of financial NLP is text in SEC filings. The SEC requires companies to report different types of information related to various events involving companies. A list of SEC forms can be found here.

SEC filings are widely used by financial services companies as a source of information about companies in order to make trading, lending, investment, and risk management decisions. Because these filings are required by regulation, they are of high quality and veracity. They also contain forward-looking information that helps with forecasts. In addition, in recent times, the value of historical time-series data has degraded, since economies have been structurally transformed by trade wars, pandemics, and political upheavals. Therefore, text as a source of forward-looking information has been increasing in relevance.

Obtain the dataset using the SageMaker JumpStart Industry Python SDK

Downloading SEC filings is done from the SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) website, which provides open data access. EDGAR is the primary system under the U.S. Securities And Exchange Commission (SEC) for companies and others submitting documents under the Securities Act of 1933, the Securities Exchange Act of 1934, the Trust Indenture Act of 1939, and the Investment Company Act of 1940. EDGAR contains millions of company and individual filings. The system processes about 3,000 filings per day, serves up 3,000 terabytes of data to the public annually, and accommodates 40,000 new filers per year on average.

There are several ways to download the data, and some open source packages are available to extract the text from these filings. However, these require extensive programming and are not always easy to use. We provide a single API call that creates a dataset in a few lines of code, for any period of time and for numerous tickers.

We have wrapped the extraction functionality into a SageMaker processing container and provide this notebook to enable users to download a dataset of filings with metadata, such as dates, and parsed plain text that can then be used for machine learning with other SageMaker tools. This is included in the SageMaker JumpStart Industry library for financial language models. Users only need to specify a date range and a list of ticker symbols, and the library takes care of the rest.
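To make the inputs concrete, here is a small sketch of assembling such a request. The helper `edgar_dataset_kwargs` is hypothetical, and its key names mirror what the SDK's `EDGARDataSetConfig` is assumed to accept. Check them against the installed `smjsindustry` version before relying on them.

```python
from datetime import date


def edgar_dataset_kwargs(tickers_or_ciks, form_types, start, end, email):
    """Build keyword arguments describing an EDGAR download request.

    Key names are an assumption about EDGARDataSetConfig's parameters;
    this helper only validates and formats the inputs.
    """
    if start > end:
        raise ValueError("filing_date_start must not be after filing_date_end")
    return {
        "tickers_or_ciks": list(tickers_or_ciks),
        "form_types": list(form_types),
        "filing_date_start": start.isoformat(),
        "filing_date_end": end.isoformat(),
        # Contact email sent along with requests to EDGAR (placeholder below).
        "email_as_user_agent": email,
    }


request = edgar_dataset_kwargs(
    ["amzn"], ["10-K"], date(2020, 1, 1), date(2020, 12, 31), "user@example.com"
)
# With the SDK installed, this would be handed over roughly as (assumed usage):
#   from smjsindustry.finance.processor_config import EDGARDataSetConfig
#   dataset_config = EDGARDataSetConfig(**request)
```

The ticker, dates, and email address are placeholders. In the pipeline the ticker or CIK would come from a pipeline parameter instead.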

As of now, the solution supports extracting a popular subset of SEC forms in plain text (excluding tables): 10-K, 10-Q, 8-K, 497, 497K, S-3ASR, and N-1A. For each of these, we provide examples throughout this notebook and a brief description of each form. For the 10-K and 10-Q forms, filed annually and quarterly respectively, we also extract the Management’s Discussion and Analysis (MDNA) section, which is the primary forward-looking section in the filing and the one most widely used in financial text analysis. We therefore provide this section automatically in a separate column of the dataframe, alongside the full text of the filing. The extracted dataframe is written to S3 storage and to the local notebook instance.

News articles related to the stock symbol — dataset

We will use the MIT-licensed NewsCatcher API to identify the top 4–5 articles about the specific organization using filters. Other sources, such as social media feeds and RSS feeds, could also be used.

The first step in the pipeline is to fetch the SEC report from the EDGAR database using the SageMaker JumpStart Industry library for financial language models. This library provides easy-to-use functionality to obtain one or multiple SEC reports for one or more ticker symbols or CIKs. The ticker or CIK number is passed to the SageMaker Pipeline using the pipeline parameter `inference_ticker_cik`. For demo purposes, we will focus on a single ticker/CIK number at a time and on the MDNA section of the 10-K form. The first processing step extracts the MDNA from the 10-K form for a company and also gathers a few news articles related to the company from the NewsCatcher API. This data is ultimately used for summarization and then sentiment analysis.

MLOps for NLP using SageMaker Pipelines

We will set up the following SageMaker Pipeline. The pipeline has two flows, depending on the value of the `model_register_deploy` pipeline parameter.

If the value is set to `Y`, the pipeline registers the model and deploys the latest version of the model from the model registry to the SageMaker endpoint.

If the value is set to `N`, the pipeline simply runs inferences using the FinBERT and Pegasus models for the ticker symbol (or CIK number) passed to the pipeline in the `inference_ticker_cik` pipeline parameter.
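The two flows can be sketched with the SageMaker Python SDK's `ParameterString` and `ConditionStep`. This is an illustration rather than the pipeline's actual definition: the step and pipeline names, the default ticker, and both helper functions are placeholders, and the step lists are assumed to be built elsewhere.

```python
def expected_flow(model_register_deploy):
    """Plain-Python mirror of the branch: which flow a parameter value selects."""
    if model_register_deploy not in ("Y", "N"):
        raise ValueError("model_register_deploy must be 'Y' or 'N'")
    return "register_and_deploy" if model_register_deploy == "Y" else "inference_only"


def build_pipeline(register_deploy_steps, inference_steps, name="nlp-earnings-pipeline"):
    """Assemble a two-branch pipeline from already-defined step lists.

    Imports are local so this module loads without the SageMaker SDK.
    """
    from sagemaker.workflow.condition_step import ConditionStep
    from sagemaker.workflow.conditions import ConditionEquals
    from sagemaker.workflow.parameters import ParameterString
    from sagemaker.workflow.pipeline import Pipeline

    register_flag = ParameterString(
        name="model_register_deploy", default_value="Y", enum_values=["Y", "N"]
    )
    # "amzn" is only a placeholder default for the demo ticker.
    ticker_cik = ParameterString(name="inference_ticker_cik", default_value="amzn")

    branch = ConditionStep(
        name="CheckRegisterDeploy",
        conditions=[ConditionEquals(left=register_flag, right="Y")],
        if_steps=register_deploy_steps,  # run when the flag is "Y"
        else_steps=inference_steps,      # run when the flag is "N"
    )
    return Pipeline(name=name, parameters=[register_flag, ticker_cik], steps=[branch])
```

After `pipeline.upsert(role_arn=role)`, an inference-only run could be started with `pipeline.start(parameters={"model_register_deploy": "N", "inference_ticker_cik": "amzn"})`.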

Once your pipeline completes you can look inside your SageMaker Studio Pipelines and see something similar to this image below

Company Earnings Sentiment Pipeline in SageMaker Studio

and if you double click on the graph you will see:

The Execution DAG for the NLP MLOps Pipeline for Company Earnings Sentiment Analysis

Standard Directed Acyclical Graph for your Pipeline

Conclusion

In this post, we have shown how to build an automated pipeline with SageMaker Pipelines that gathers data from SEC filings via the SageMaker JumpStart Industry SDK, runs one HuggingFace Transformer model to summarize the MDNA part of the filing as well as the news, and uses a second HuggingFace Transformer model to gauge the sentiment associated with both the news and the SEC filing.