Part 1: Key Concepts and Workflows
The common AI/ML lifecycle consists of data collection, preparation, training, evaluation, deployment, and monitoring, all encompassed within an MLOps pipeline.
Generative AI (GenAI) is a transformational technology whose effects will continue to drive major industry shifts in the coming months and years. In these early stages it has raised a lot of hype, which distracts from the fundamental shift that underlies its promise.
Using generative AI in the enterprise in a reproducible, scalable, and responsible manner, one that minimizes risks around AI safety, misuse, and model robustness, calls for an augmentation of the common ML lifecycle.
In this article, we will explore the initial set of nuances and augmentations needed to adopt generative AI in the enterprise.
Let’s start by contrasting a couple of terms in the AI/ML Spectrum.
Predictive AI and Generative AI
Generative AI and predictive AI are two different types of artificial intelligence (AI) that are used for different purposes. Generative AI is used to create new content, such as music, images, and text, while predictive AI is used for tasks such as clustering, classification, and regression; it often relies on supervised learning over historical training sets to build models that predict future states or events. For example, a predictive AI model trained on historical stock market data may be able to make predictions about the future prices of stocks.
Generative AI models are built by training on a large dataset of general examples (such as Wikipedia, Common Crawl, etc.) and then using that knowledge to generate new examples that are similar to the training data. The idea, in short, is to generate new data. For example, a generative AI model trained on a dataset of cat images may be able to generate new images of cats that resemble the training data.
Generative AI models are often used in creative applications, such as creating new works of art or music. Predictive AI models are often used in business applications, such as predicting customer behavior or making financial decisions.
Both generative AI and predictive AI models have the potential to improve our lives in many ways. Generative AI models can be used to create new products and services, to automate tasks, and to improve the efficiency of businesses. Predictive AI models can be used to make better decisions, to improve customer service, and to prevent crimes.
However, both generative AI and predictive AI models also have the potential to be used for harmful purposes. Generative AI models can be used to create fake news or to spread misinformation. Predictive AI models can be used to discriminate against people or to invade their privacy.
Let’s highlight the contrasting aspects between generative and predictive AI.
Generative AI:
- Focuses on creating new content, such as music, images, and text, using deep learning algorithms, often through the emergent behavior of large language models (LLMs).
- Involves training on a large dataset of general examples (such as Wikipedia, Common Crawl, etc.) and using that knowledge to generate new examples that are similar to the training data.
- Used in creative applications, such as creating new works of art or music.
- Requires a reproducible, scalable, and responsible approach that minimizes risks around AI safety, misuse, and model robustness.
- Requires data collection, preparation, training, evaluation, deployment, and monitoring, all encompassed within an MLOps pipeline, but with specific nuances and augmentations needed to adopt generative AI in the enterprise.
Predictive AI:
- Focuses on making predictions about future states or events, often relying on supervised learning and historical training sets that create models for clustering, classification, and regression.
- Involves training on a dataset of historical data and making predictions based on that data.
- Used in business applications, such as predicting customer behavior or making financial decisions.
- Requires a reproducible, scalable, and responsible approach that minimizes risks around AI safety, misuse, and model robustness.
- Requires data collection, preparation, training, evaluation, deployment, and monitoring, all encompassed within an MLOps pipeline, but with specific nuances and augmentations needed to adopt predictive AI in the enterprise.
Regardless of whether we are doing predictive or generative AI, we need to incorporate the principles of Responsible AI. It is important to use AI models responsibly and to be aware of the potential risks and how to check and mitigate them. We need to develop and adopt ethical guidelines for the development and use of these technologies: guidelines that minimize AI safety issues, discourage abuse and misuse of AI models, and monitor and guard model robustness.
As a subfield of AI, generative AI focuses on creating new content, such as music, images, and text, using deep learning algorithms, often through the emergent behavior of large language models (LLMs). It is a rapidly growing area, with applications ranging from art and entertainment to scientific discovery and marketing content generation.
Because this is a rapidly developing field with the potential to revolutionize many industries, we need to be cognizant not just of the promises but also of the challenges that need to be addressed, and we need to weigh the compromises as generative AI becomes widely adopted. That adoption is coming: the floodgates have already opened, and hype abounds. Amid the frenzy and noise, it remains important to continue to develop generative AI models that are fair, accurate, safe, and secure. It is also important to look to and advocate for the development of regulations that will help ensure that generative AI is used responsibly.
The Generative AI Lifecycle
Figure 1 below describes the overall GenAI Lifecycle.
Now let’s dive into each one of the main phases within the lifecycle and gain an understanding of that stage, its tasks and the underlying nuances and considerations that may be more or less specific to GenAI.
Data Collection and Preparation
- Data Collection: Collect data from various sources, including the internet, social media, sensors, and, ideally, managed stores such as Google BigQuery (a data warehouse) or Google Cloud Storage (object storage). When collecting data to train generative AI models, ensure that the dataset represents a diverse range of perspectives, backgrounds, and sources. Be cautious about relying solely on data from a single source or domain, as it may introduce bias.
- Data Creation: Optionally generate/use synthetic data to augment existing datasets to achieve better performance metrics on trained models.
- Data Curation: Clean and organize data through filtering, transformation, integration and labeling of data for supervised learning. Programmatic labeling may be an indispensable part of how we can accelerate training and fine-tuning.
- Feature Storage: Store the gold standard of features that have been painstakingly curated and prepared. We do this primarily for AI governance and for consistency across the enterprise and the many projects that will draw on these features. We want to store and manage the features used in training and inference, often distinguishing between static and dynamic (or train-time and run-time) feature store capabilities. The goal is to answer questions from development teams such as: “Where did we get the data from? Who touched the data? Did they bias the data during curation/labeling?” For example, you can use the Vertex AI Feature Store to support this task.
- Data Bias Check. This is an essential step in the machine learning lifecycle, especially for generative AI models, to ensure that the training data and the generated content are fair, unbiased, and representative of the underlying data distribution. Bias in generative AI models can lead to outputs that reinforce stereotypes, discriminate against certain groups, or misrepresent reality. We can address data bias in both predictive and generative AI right after collecting and curating the data. By incorporating data bias checks into the machine learning lifecycle for generative AI, we can work towards models that generate fair, unbiased, and representative content, and towards AI systems that align with ethical guidelines and societal values.
Data Inspection: Examine the collected data to identify potential biases or imbalances. This can be done through exploratory data analysis, visualization, or statistical tests. Analyze the data for potential issues, such as underrepresentation of certain groups, overrepresentation of specific topics, or skewed sentiment distributions.
Bias Mitigation Techniques: Apply various techniques to mitigate the identified biases in the data. These techniques may include re-sampling, re-weighting, or generating synthetic examples to balance the dataset. Additionally, consider using de-biasing algorithms during the model training process to minimize the impact of biases on the model’s output.
Model Evaluation: Develop evaluation metrics and techniques that specifically assess the model’s fairness and bias. These might include measuring the model’s performance across different demographic groups, evaluating the diversity of generated samples, or analyzing the model’s behavior when given prompts that are prone to biased outputs.
Continuous Monitoring: After deploying the generative AI model, monitor its performance and outputs to ensure that biases do not emerge over time. Set up feedback loops to gather user feedback and input, which can be used to identify and address any newly discovered biases.
Transparency and Accountability: Document and share the steps taken to address data bias in the generative AI model. Communicate the model’s limitations and potential biases to end-users and stakeholders, fostering trust and setting realistic expectations for the model’s performance.
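To make the inspection and re-weighting steps above more concrete, here is a minimal sketch in plain Python. The `source` attribute, the function names, and the ten-document dataset are all hypothetical; a real project would inspect many attributes and would likely use a dedicated fairness toolkit.

```python
from collections import Counter


def group_shares(labels):
    """Return each group's share of the dataset (a simple data inspection step)."""
    counts = Counter(labels)
    total = len(labels)
    return {group: count / total for group, count in counts.items()}


def balancing_weights(labels):
    """Inverse-frequency weights, so every group carries the same total weight."""
    counts = Counter(labels)
    n_groups = len(counts)
    total = len(labels)
    return [total / (n_groups * counts[label]) for label in labels]


# Hypothetical 'source' attribute for ten training documents.
sources = ["news"] * 7 + ["forum"] * 2 + ["academic"] * 1

shares = group_shares(sources)        # reveals that 'news' dominates the dataset
weights = balancing_weights(sources)  # rare groups receive larger per-example weights
```

The weights could then be passed to a sampler or a weighted loss during training; which of the two is appropriate depends on the training framework.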
Model Training & Experimentation
- Experimentation: This is the name of the game in machine learning.
- Pre-trained model training: Train a model using a pre-trained model as a starting point.
- Fine-tune models: Update an existing model with new data or for a new task.
- Model registry: This is a mechanism for model governance and model risk management. We use the model registry to store and manage models for versioning, reuse, and auditing, recording model metadata that shows where the data came from (data provenance and lineage), how the model was trained, and so on.
- Data parallelism: Split the data across multiple devices and train on the shards simultaneously to speed up training.
- Model parallelism: Split the model across multiple devices and train the parts simultaneously to fit larger models into memory.
- Federated learning: Train models collaboratively across multiple devices or organizations while preserving privacy and security.
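As an illustration of the data parallelism item above, the following toy sketch splits a batch into two equal shards, computes a gradient per shard, and averages them before updating the weight. Everything here (the one-parameter model, the function names, the learning rate) is an assumption made for illustration; real frameworks run the shard computations on separate devices and use a collective operation such as all-reduce for the averaging.

```python
def mse_gradient(w, xs, ys):
    """Gradient of mean squared error for the 1-D model y_hat = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_step(w, shards, lr):
    """One SGD step: each 'device' computes a local gradient, then they are averaged."""
    local_grads = [mse_gradient(w, xs, ys) for xs, ys in shards]  # would run in parallel
    avg_grad = sum(local_grads) / len(local_grads)  # stands in for an all-reduce
    return w - lr * avg_grad


xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by w = 2
shards = [(xs[:2], ys[:2]), (xs[2:], ys[2:])]  # two equal-sized shards

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards, lr=0.05)
```

With equal-sized shards, the averaged gradient equals the full-batch gradient, so the result matches single-device training while the per-shard work can proceed in parallel.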
Adapt a pre-trained model to a new domain or task
Update an existing model with new data or for a new task
AI Safety Mitigation
Ensure the model is safe, ethical, and trustworthy by mitigating misuse, improving robustness, and ensuring privacy and security.
Train or create a much smaller model with similar performance and downstream task capabilities, either by using a student/teacher few-shot learning approach or by pruning a model to arrive at a smaller deployable footprint.
There are several methods to distill or compress large language models (LLMs) into significantly smaller models with comparable performance. Let’s explore some common approaches.
- Pruning: This involves removing unimportant weights or neurons from the LLM while preserving its accuracy. Pruning can be done using various techniques such as magnitude-based pruning, sensitivity-based pruning, and weight-rewinding.
- Quantization: This involves reducing the precision of the LLM’s weights and activations. For example, weights that were initially represented as 32-bit floating-point numbers can be quantized to 8-bit integers. This reduces the model’s memory footprint without significantly impacting its performance.
- Knowledge Distillation: This involves training a smaller model to mimic the output of the LLM on a set of training examples. The idea is to transfer the knowledge from the larger model to the smaller one. In favorable cases, this approach can produce models that are much smaller than the original with only a modest reduction in accuracy.
- Low-rank Factorization: This involves decomposing the weight matrices of the LLM into products of low-rank matrices. This reduces the number of parameters in the model while aiming to preserve its accuracy.
- Compact Embeddings: This involves reducing the dimensionality of the input and output embeddings of the LLM. This reduces the number of parameters in the model and speeds up inference.
- Architectural Changes: This involves changing the architecture of the LLM to make it more efficient. For example, the transformer architecture can be modified to reduce its memory footprint or the number of attention heads can be reduced.
- Parameter Sharing: This involves sharing parameters between different parts of the model. For example, the same weight matrix can be used for multiple layers in a neural network, reducing the number of parameters and improving the model’s efficiency.
Depending on the specific use case, a combination of these techniques may be used to achieve the desired level of performance and efficiency.
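Two of the techniques above, magnitude-based pruning and 8-bit quantization, can be sketched on a flat list of weights. This is a simplified illustration, not how any particular toolkit implements them: the function names are hypothetical, and the symmetric per-tensor quantization scheme is one assumed choice among several.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of the weights."""
    k = int(len(weights) * sparsity)  # number of weights to drop
    if k == 0:
        return list(weights)
    threshold = sorted(abs(w) for w in weights)[k - 1]
    pruned, dropped = [], 0
    for w in weights:
        if abs(w) <= threshold and dropped < k:
            pruned.append(0.0)
            dropped += 1
        else:
            pruned.append(w)
    return pruned


def quantize_int8(weights):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized, scale):
    """Map the integers back to approximate float weights."""
    return [v * scale for v in quantized]


weights = [0.82, -0.05, 0.31, -1.27, 0.02, 0.64]
pruned = magnitude_prune(weights, sparsity=0.5)  # the three smallest weights become 0.0
quantized, scale = quantize_int8(pruned)
restored = dequantize(quantized, scale)          # each value is within scale / 2 of the original
```

The example also shows how the techniques combine: pruning first, then quantizing what remains.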
Another approach is the teacher-student few-shot method for producing smaller models. It is a variant of knowledge distillation that specifically focuses on few-shot learning, that is, the ability to learn from a small number of examples.
In the teacher-student few-shot approach, a large LLM (the teacher model) is used to generate synthetic examples, which are used to train a smaller LLM (the student model). The idea is to use the teacher model to generate examples that are representative of the underlying distribution of natural language, so that the student can learn well even though only a small number of labeled examples are available.
The teacher-student process involves the following steps:
- Train a large LLM (the teacher model) on a large corpus of text.
- Generate synthetic examples using the teacher model by either sampling from the model’s distribution or by using specific prompts.
- Train a smaller LLM (the student model) on a small number of labeled examples, along with the synthetic examples generated by the teacher model.
- Fine-tune the student model on a small validation set to optimize its performance.
The teacher-student few-shot approach can significantly reduce the number of examples required to train a high-quality LLM. This is particularly useful in scenarios where labeled examples are scarce or when fine-tuning is required for specific domains or tasks. However, it should be noted that the quality of the synthetic examples generated by the teacher model can impact the performance of the student model. Therefore, it is important to carefully evaluate the quality of the synthetic examples before using them to train the student model.
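The data flow of the teacher-student steps can be sketched with toy stand-ins. In this hypothetical example the "teacher" is just a fixed function and the "student" is a line fitted by least squares, so it shows only how teacher-labeled synthetic examples are combined with a few labeled ones; it does not demonstrate model compression, and the final fine-tuning step is omitted.

```python
import random


def teacher(x):
    """Stand-in for a large teacher model (hypothetical: a fixed function)."""
    return 3.0 * x - 1.0


def fit_line(points):
    """Closed-form least squares for the student model y = a * x + b."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    a = cov / var
    return a, mean_y - a * mean_x


rng = random.Random(0)

# The teacher is assumed to exist already; here it labels synthetic inputs.
synthetic = [(x, teacher(x)) for x in (rng.uniform(-1, 1) for _ in range(50))]

# A small number of (noisy) labeled examples.
labeled = [(0.0, -0.6), (1.0, 2.5)]

few_shot_only = fit_line(labeled)        # student trained on labeled data alone
student = fit_line(labeled + synthetic)  # student trained on labeled + synthetic data
```

In this toy setting the student trained with synthetic examples lands closer to the teacher's behavior than the one trained on the two noisy labels alone, which is the effect the approach relies on.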
Let’s discuss how to serve the model using Google Cloud’s Vertex AI endpoint. Serving a model on a Vertex AI endpoint involves packaging the model into a container image, uploading the image to a container registry, creating a model resource and endpoint in the Vertex AI console, deploying the model to the endpoint, testing the endpoint, and monitoring its performance.
- Prepare the Model. First, you need to prepare the trained model for deployment. This typically involves packaging the model and any required dependencies into a container image, such as a Docker image.
- Upload the Container Image. Next, you need to upload the container image to a container registry, such as Google Artifact Registry (the successor to Google Container Registry). This will allow you to easily deploy the image to a Vertex AI endpoint.
- Create a Model Resource. In the Vertex AI console, create a model resource that will represent your deployed model. This involves specifying the model’s name, description, and any other metadata that may be required.
- Create an Endpoint. Create a Vertex AI endpoint to serve your model. This involves specifying the endpoint’s name, description, and the region where it will be deployed.
- Deploy the Model. Once the endpoint is created, you can deploy your model to the endpoint. This involves specifying the container image to use and any required settings, such as the number of replicas to deploy.
- Test the Endpoint. Once the model is deployed, you can test the endpoint by sending sample requests to it. This will allow you to verify that the endpoint is serving the model correctly and that the model is generating accurate predictions.
- Monitor the Endpoint. After the endpoint is deployed, you should monitor its performance to ensure that it is meeting the required service level agreements (SLAs). This may involve monitoring latency, throughput, and error rates, and making adjustments as needed.
- Edge computing. Deploy models on edge devices for low-latency, privacy-preserving, real-time inference. Distill models for edge compute, as LLMs, and foundation models in general, are huge and will not fit in the footprint of small devices.
- Cloud deployment. Deploy models on cloud platforms for scalability, accessibility, and cost-effectiveness.
Monitoring and governance. Monitor model performance, detect issues, and enforce compliance with regulations and policies.
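The model resource, endpoint, and deployment steps described above can also be scripted instead of performed in the console. The sketch below assumes the `google-cloud-aiplatform` Python package and a container image that has already been pushed to a registry. The function name, the `sdk` parameter (included only so the flow can be exercised without cloud access), the endpoint naming, and the machine type are assumptions for illustration, not recommendations.

```python
def deploy_container_model(project, location, image_uri, display_name, sdk=None):
    """Upload a container-backed model and deploy it to a new Vertex AI endpoint."""
    if sdk is None:
        # Imported lazily so the sketch can be read without the SDK installed.
        from google.cloud import aiplatform as sdk

    sdk.init(project=project, location=location)

    # Create a Model resource that points at the serving container image.
    model = sdk.Model.upload(
        display_name=display_name,
        serving_container_image_uri=image_uri,
    )

    # Create an Endpoint, then deploy the model to it.
    endpoint = sdk.Endpoint.create(display_name=f"{display_name}-endpoint")
    model.deploy(
        endpoint=endpoint,
        machine_type="n1-standard-4",  # assumed machine type; choose one that fits the model
        min_replica_count=1,
        max_replica_count=1,
    )
    return endpoint
```

Once deployed, the returned endpoint can be tested with `endpoint.predict(instances=[...])`; the shape of each instance depends on what the serving container expects.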
Here are some considerations for serving the model or running inference/generating content:
- Prepare the Input Data. Once the model is loaded, the next step is to prepare the input data that will be fed into the model. This may involve preprocessing the input data, such as converting text to numerical embeddings or resizing images.
- Feed Input to the Model. After the input data is prepared, it can be fed into the loaded model for inference. This typically involves calling the model’s generate method and passing in the input data.
- Generate Output. The model will then generate output based on the input data. The output can take many forms depending on the specific task and model architecture. For example, a generative text model may generate a sequence of words or sentences, while a generative image model may generate an image or a sequence of images.
- Postprocess the Output. After the output is generated, it may require postprocessing to make it usable for downstream tasks. This may involve converting the output from numerical embeddings to text, resizing images, or applying other transformations.
- Data and model drift. Monitor and manage changes in data distribution and model performance over time.
- Model versioning. Manage different versions of the model for traceability, reproducibility, and comparison.
- Prompt-tuning, Prompt design: Optimize prompts used in language generation models to improve quality, coherence, and relevance.
- Hardware acceleration. Optimize the model for specific hardware accelerators to improve inference speed and efficiency.
- Quantization. Reduce the model size and computation requirements by converting it to lower-precision data types.
- Pruning. Reduce the model size and computation requirements by removing unimportant weights and neurons.
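The prepare, feed, generate, and postprocess steps in the list above can be sketched end to end with a toy model. The `ToyGenerativeModel` class, its `generate` signature, and the five-word vocabulary are hypothetical stand-ins; a real model would come with its own tokenizer and generation API.

```python
class ToyGenerativeModel:
    """Hypothetical stand-in for a loaded model; it returns the input ids reversed."""

    def generate(self, input_ids, max_new_tokens=3):
        return list(reversed(input_ids))[:max_new_tokens]


VOCAB = ["<unk>", "hello", "world", "generative", "ai"]
TOKEN_TO_ID = {token: index for index, token in enumerate(VOCAB)}


def preprocess(text):
    """Prepare the input: lowercase, split on whitespace, map tokens to ids."""
    return [TOKEN_TO_ID.get(token, 0) for token in text.lower().split()]


def postprocess(output_ids):
    """Convert generated ids back into text."""
    return " ".join(VOCAB[i] for i in output_ids)


def run_inference(model, text):
    input_ids = preprocess(text)            # prepare the input data
    output_ids = model.generate(input_ids)  # feed input and generate output
    return postprocess(output_ids)          # postprocess the output


result = run_inference(ToyGenerativeModel(), "Hello generative AI")
```

Keeping the three stages as separate functions makes it easier to swap in a real tokenizer or model later without touching the rest of the pipeline.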
Prompt-tuning, Prompt Design, Instruction-tuning
It is important to compare and contrast the various ways in which we can tune or customize the models.
Prompt design. In this phase, you design the initial prompt that will be used to generate responses from your language model. The prompt should be carefully crafted to elicit the desired type of response and to provide enough context for the model to generate coherent and relevant output.
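Prompt tuning. Once you have a draft prompt, you’ll need to refine it to improve the quality of the responses generated by your language model. This might involve tweaking the wording of the prompt, adjusting the level of detail provided, or adding additional context or constraints to guide the model’s output. (In the research literature, “prompt tuning” also names a parameter-efficient method that learns soft prompt embeddings while the model’s weights stay frozen; here the term is used in the looser sense of iterating on the prompt text.)
Instruction tuning. In addition to tuning the prompt itself, you may also need to provide additional instructions or guidance to the model to help it generate the type of output you’re looking for. This might involve specifying certain criteria for the output, providing sample outputs to use as references, or setting constraints on the content or style of the generated text. (Elsewhere, “instruction tuning” more commonly refers to fine-tuning a model on a dataset of instruction-response pairs; here it is used in the looser sense of adding instructions at prompt time.)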
Fine-tuning. Once you have a well-designed prompt and clear instructions for your model, you’ll need to fine-tune the model itself to optimize its performance on your specific task. This involves training the model on a dataset of examples and adjusting its parameters to minimize the loss function and improve the quality of the generated text. The goal of fine-tuning is to create a model that can consistently generate high-quality responses that meet the criteria specified in the prompt and instructions.
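To ground the prompt-side options above, here is a small sketch of a prompt builder that combines an instruction, optional constraints, and sample outputs with the new input. The function name, the `Input:`/`Output:` layout, and the sentiment task are assumptions chosen for illustration; the best layout depends on the model being used.

```python
def build_prompt(instruction, examples, query, constraints=None):
    """Assemble the instruction, optional constraints, sample outputs, and the new input."""
    lines = [instruction.strip()]
    if constraints:
        lines.append("Constraints: " + "; ".join(constraints))
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)


prompt = build_prompt(
    instruction="Classify the sentiment of the review as positive or negative.",
    examples=[
        ("Great battery life.", "positive"),
        ("Stopped working in a week.", "negative"),
    ],
    query="The screen is sharp and bright.",
    constraints=["answer with one word"],
)
```

Refining the prompt then amounts to changing the arguments (wording, examples, constraints) and comparing the responses, without touching the model's weights.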
Variation 1: Prompt-first GenAI
Drastically Reduce Application Development Time with Prompt-Based Human Interaction
One of the ways generative AI boosts productivity is in the early stages of the software development lifecycle. From rapidly prototyping synthetic data sets to generating code snippets, we can use prompt design to shorten turnaround times and break through initial barriers of doubt to arrive at a definitive solution path: I will use this type of data with this type of algorithm to accomplish this specific task.
Thus, you can almost think of the Generative AI Lifecycle as starting with prompt design or tuning, with data gathering happening in parallel or after the initial prototype has been determined to be feasible.
Prompt design and tuning can be incorporated into the early stages of the GenAI lifecycle to optimize the AI model’s performance and effectiveness even before the data preparation stage. This can help ensure that the AI model aligns better with the project goals and provides more accurate and relevant responses to subsequent inputs, prompts, and inferences.
When you start with prompt design and tuning in the early stages of the GenAI lifecycle, you lay a validated, solid foundation for the next phases and tasks. This includes data preparation, where you can focus on the data you need and on how generated data can guide data prep, as well as the model training, tuning, and deployment phases. This prompt-based variation of the GenAI lifecycle gives us a proactive approach to prompt optimization that can lead to a more efficient AI development lifecycle and better-aligned models, with improved performance on specific lifecycle tasks or industry domains. The relevant tasks within this early stage consist of the following.
Problem formulation and planning
Before starting with data preparation, it’s essential to define the problem you want the AI model to solve. During this stage, you can identify the specific tasks or instructions the model should be capable of handling. This helps set the stage for designing appropriate prompts and planning for potential tuning strategies later on.
Preliminary research and exploration
During this stage, you can explore existing AI models and solutions to understand their strengths, weaknesses, and limitations. This can give you insights into which prompt design strategies or tuning techniques might be effective for your specific problem. Also, you can gather information on potential pitfalls and challenges that may need to be addressed during prompt design or tuning.
Designing and prototyping prompts
Instead of waiting until the data preparation stage, you can start prototyping potential prompts early in the process. This enables you to experiment with different phrasing, context, and formatting to test the AI model’s responsiveness and effectiveness. It’s essential to iterate and refine your prompt design to optimize the AI model’s performance at this stage.
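One way to picture this iteration is a small loop that runs candidate prompts through a model and scores the responses. In the sketch below, `stub_model` is a hypothetical stand-in for a real LLM call, and exact-match scoring is an assumed, deliberately simple metric; the prompt templates and test cases are made up for illustration.

```python
def stub_model(prompt):
    """Hypothetical stand-in for an LLM call: answers tersely only when asked to."""
    if "one word" in prompt:
        return "positive"
    return "The review seems positive overall."


def score(response, expected):
    """Exact-match scoring; real projects would likely need a richer metric."""
    return 1.0 if response.strip().lower() == expected else 0.0


candidates = {
    "v1": "What is the sentiment of this review? {review}",
    "v2": "Answer in one word (positive or negative). Review: {review}",
}
test_cases = [("Love it.", "positive"), ("Works great.", "positive")]

results = {}
for name, template in candidates.items():
    scores = [
        score(stub_model(template.format(review=review)), expected)
        for review, expected in test_cases
    ]
    results[name] = sum(scores) / len(scores)

best = max(results, key=results.get)  # the prompt variant with the highest average score
```

Replacing `stub_model` with a call to an actual model turns this into a basic harness for comparing phrasing, context, and formatting choices.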
Feedback and collaboration
Collaborating with domain experts or target users during the early stages of the ML lifecycle can help you gain insights into the specific needs and requirements of your AI model. This feedback can be invaluable in shaping your prompt design or tuning strategy and ensuring the AI model’s relevance and effectiveness in its intended domain.
Artificial intelligence is an overwhelmingly fast-accelerating field with the potential to revolutionize many industries and the way we work. It may produce a 10x improvement in productivity, but also a 10x increase in risk if adopted without Responsible AI.
Two main types of AI are generative AI and predictive AI, which are used for different purposes. Understanding the differences between these two types of AI is crucial for anyone working in the field.
Both generative AI and predictive AI have the potential to improve our lives in many ways, but they also have the potential to be used for harmful purposes. Therefore, incorporating and developing ethical guidelines that minimize AI safety issues, discourage abuse/misuse of AI models, and monitor and guard model robustness for the development and use of these technologies is essential. As the floodgates of adoption have already been opened, it is important to continue to develop generative AI and predictive AI models that are fair, accurate, safe, and secure, as well as advocating for the development of regulations that will help to ensure that AI is used responsibly.
Here are the key additions to the traditional AI lifecycle when adopting generative AI:
- Prompt-based Prototyping: In this variation of the starting phase, before data preparation in the GenAI lifecycle, we proactively focus on designing and testing candidate prompts for an AI language model. This method allows experimentation with data sets, various input phrasings, contexts, and formats to evaluate the model’s performance in addressing desired tasks or goals. Iterating and refining prompts based on performance helps identify and resolve potential issues in the model’s comprehension and output. By implementing prompt-based prototyping early, you can better align the AI model with its intended purpose, ultimately leading to more effective data preparation strategies.
- Data Generation and Programmatic Data Labelling: In addition to data collection and preparation, generative AI requires the usage or creation of large datasets for training generative models. This involves generating synthetic data or using unsupervised learning techniques to learn patterns from existing, unlabelled data.
- Prompt-, instruction-, and fine-tuning: This has huge implications for model usage and human-computer interaction, enabling higher efficiency and productivity by using GenAI as a human augmentation mechanism.
- Model Selection: With so many models in multiple Model Gardens and hubs, which one to choose?
- Model Evaluation: Traditional model evaluation techniques such as accuracy, precision, and recall may not be suitable for evaluating generative models. New evaluation metrics need to be developed, such as sample diversity, quality, and consistency.
- Interpretability: Generative models can be difficult to interpret, and understanding how they generate outputs is important for ensuring their responsible use. Techniques such as visualization, attribution, and counterfactual analysis can be used to improve interpretability.
- Responsible AI & Ethical Considerations: Generative AI can raise ethical concerns related to bias, fairness, privacy, and safety. These concerns need to be addressed through the development of ethical guidelines, transparency, and accountability mechanisms.
- Distillation: Creating smaller deployable models for cost and edge computing restrictions.
- Deployment: Generative AI models may have different deployment requirements than traditional predictive models. They may require specialized hardware, longer inference times, and different deployment architectures.
- Tuning: Even after deployment, tune the deployed model to structure the desired types of responses.
- Monitoring: Generative models can generate unexpected outputs over time, and monitoring their performance is critical for ensuring their ongoing reliability and safety. Techniques such as drift detection and feedback loops can be used to monitor and improve model performance.
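As one concrete example of the sample-diversity metrics mentioned under Model Evaluation, the sketch below computes a distinct-n score: the ratio of unique n-grams to total n-grams across a set of generated samples. Whitespace tokenization and the sample sentences are simplifying assumptions, and a single diversity number says nothing about quality or consistency, which need their own checks.

```python
def distinct_n(samples, n=2):
    """Ratio of unique n-grams to total n-grams across generated samples."""
    total = 0
    unique = set()
    for text in samples:
        tokens = text.lower().split()
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / total if total else 0.0


repetitive = ["the cat sat", "the cat sat", "the cat sat"]
varied = ["the cat sat", "a dog ran", "birds fly south"]

low_diversity = distinct_n(repetitive, n=2)  # many repeated bigrams
high_diversity = distinct_n(varied, n=2)     # every bigram is unique
```

A score near 1.0 indicates that the samples rarely repeat phrases, while a low score can flag a model that has collapsed onto a few outputs.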