Generative AI Lifecycle Patterns
Part 1: Enterprise GenAI: Patterns, Cycles and Strategies for Deriving Business Value
In this article we will explore some optional paths to greater enterprise and production-scale maturity in the journey of adopting Generative AI. How do we scale our applications from research to prototype to production? Consider these patterns as ways of maturing the development process as you move towards enterprise-scale production.
We will explore a non-exhaustive list of techniques that are often combined into composite patterns for dealing with the typical problems and challenges you encounter as you adopt Gen AI at the Enterprise level.
You can use this as a checklist of patterns to adopt when using Gen AI at production scale in an enterprise or industrial environment. You can also use it to prepare your enterprise for Gen AI by becoming aware of the many skills you may need to overcome common challenges along the way.
Fundamentally, ML is about creating or selecting a model and seeing how it performs on some data as it attempts a downstream prediction or generation task. This process is highly experimental and iterative, much like the ML algorithms themselves, which use backpropagation to converge on a set of weights that decreases the loss for the downstream task.
Iterations and Cycles. LangChain has been a highly popular and useful library for creating a chain of tasks for Gen AI. But these chains are not one and done. Each is fundamentally an experiment, so we must prepare ourselves, our teams and our enterprises to cycle through these chains of tasks and iterate on our experiments.
Let’s explore some frequently encountered solutions to problems in context: patterns. The patterns we cover here are iteration or “cycle” patterns in your chain of tasks. Take, for example, the chain that starts with you prompting a large language model (LLM) to obtain a completion, or “get an answer”. It’s important to establish a cycle: evaluate the outcome and iterate. I’ve organized the patterns as a progression of increasingly mature, more complex and more sophisticated strategies for fulfilling tasks given to the LLM.
Below is a diagram that includes almost all of the cycles and iterations we discuss in this article. Please use it as an indicative reference for the art of the possible, and adapt / add as you need to accommodate nuances of your specific enterprise needs.
Collectively, these cycles represent more sophisticated or more mature ways of handling the complexities of generative AI for the Enterprise.
Each of the subsequent “cycles” takes the basic pattern
prompt → FM (foundation model) → completion
and expands it into one of
prompt → Tuned Model → completion
prompt → RAG | FLARE → Model → completion
prompt → Model → completion → Grounding
prompt → ToT → Model → ToT → completion
For example, let’s start with a simple one:
Prompt → Foundation Model → Adaptation → Completion
(“Translate to French”) → (Initial Translation) → (Validation) → (“Bonjour, comment ça va?”)
Maturity Level 1. Prompt, In-context Learning and Chaining.
Prompt it & TICL it (Textual In-Context Learning).
Select a model, prompt it, get a response, assess the response, and re-prompt until your responses cumulatively give you what you want: a product description, a summary in a certain format, a SQL statement that runs, generated Python code, and so on.
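The prompt, assess, re-prompt cycle can be sketched as a small loop. This is a minimal illustration, not a real integration: `call_model` is a hypothetical stand-in for whatever LLM client you use, and `single_sentence` is an invented acceptance check.

```python
from typing import Callable, List, Optional, Tuple


def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; it only obeys an explicit format hint."""
    if "one sentence" in prompt:
        return "A lightweight, waterproof jacket for everyday hiking."
    return "This jacket is great. It is light. It is waterproof. People like it."


def iterate_prompt(
    base_prompt: str,
    refinements: List[str],
    is_acceptable: Callable[[str], bool],
) -> Optional[Tuple[str, str]]:
    """Try the base prompt, then add one refinement at a time until the check passes."""
    prompt = base_prompt
    for extra in [""] + refinements:
        prompt = f"{prompt} {extra}".strip()
        completion = call_model(prompt)
        if is_acceptable(completion):
            return prompt, completion
    return None  # no variant passed; time to consider tuning or retrieval


def single_sentence(text: str) -> bool:
    """Toy acceptance check: exactly one full stop."""
    return text.count(".") == 1


result = iterate_prompt(
    "Describe the product.",
    ["Answer in one sentence."],
    single_sentence,
)
```

In practice the acceptance check is where most of the effort goes: it might run the generated SQL, validate a JSON schema, or ask a human reviewer.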
In-context learning is a method of prompt engineering that allows language models to learn tasks from a few examples. In this method, a model is given a prompt with examples of a task in natural language. The model learns to solve the task without any change to its weights. ICL has become a new paradigm for NLP.
ICL has very similar goals to few-shot learning: to enable models to handle a task from contextual examples without extensive tuning. The difference from fine-tuning is that fine-tuning involves a supervised learning setup on a target dataset and updates the model’s weights, whereas in ICL the model is prompted with a series of input–label pairs and its parameters are never updated.
Experience has shown that LLMs can perform quite an array of complex tasks through ICL, even as complex as solving mathematical reasoning problems [1].
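To make the input–label idea concrete, here is a minimal sketch of how a few-shot prompt can be assembled. The `Review:` / `Sentiment:` layout and the example reviews are assumptions chosen for illustration; any consistent format works.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an in-context learning prompt from input-label pairs."""
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Review: {text}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The final item has no label; the model is expected to complete it.
    lines.append(f"Review: {query}")
    lines.append("Sentiment:")
    return "\n".join(lines)


examples = [
    ("The battery lasts all day.", "positive"),
    ("It broke after a week.", "negative"),
]
prompt = build_few_shot_prompt(
    "Classify the sentiment of each review.",
    examples,
    "Setup was quick and painless.",
)
```

Nothing about the model changes here. The "learning" lives entirely in the prompt, which is why swapping the examples is such a cheap experiment.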
Chain it.
Beyond the basic Prompt → FM → Adapt → Completion pattern, we typically need to extract data from somewhere, maybe run a predictive AI algorithm, and then send the results to a generative AI foundation model. This Chain of Tasks (CoTA, to be distinguished from Chain of Thought, CoT) pattern is exemplified as:
Chain: Extract Data / Analytics → Run a Predictive [set of] ML Model[s] → Send the Result to an LLM → Generate an Output
Example: Marketing Activation. Start by running a SQL statement against, say, BigQuery to get the segments of customers you wish to reach in a marketing campaign. Next, run a predictive AI ranking algorithm to get the top n customers in the segments, or the top set of micro-segments. Then send the data for the top segments or top customers to the LLM for a generative AI step that produces a personalized social media post or email for those customers, increasing the chances that they respond.
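The marketing activation chain can be sketched in plain Python before reaching for a framework. Everything here is a stand-in: `extract_customers` replaces the SQL step, `propensity_score` is a toy rule in place of a trained ranking model, and `generate` is a hypothetical LLM call that just echoes its prompt so the chain runs offline.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    segment: str
    recent_purchases: int
    days_since_last_visit: int


def extract_customers() -> list:
    """Stand-in for the SQL step; a real chain would query the warehouse here."""
    return [
        Customer("Ana", "outdoor", 5, 3),
        Customer("Ben", "outdoor", 1, 40),
        Customer("Chloe", "outdoor", 3, 10),
    ]


def propensity_score(c: Customer) -> float:
    """Toy ranking rule standing in for a trained predictive model."""
    return c.recent_purchases / (1 + c.days_since_last_visit)


def top_n(customers: list, n: int) -> list:
    return sorted(customers, key=propensity_score, reverse=True)[:n]


def build_prompt(c: Customer) -> str:
    return (
        f"Write a short, friendly email to {c.name}, a customer in the "
        f"'{c.segment}' segment who made {c.recent_purchases} recent purchases."
    )


def generate(prompt: str) -> str:
    """Hypothetical LLM call; echoes the prompt so the chain runs offline."""
    return f"[draft for prompt: {prompt}]"


def run_chain(n: int) -> dict:
    """Extract -> rank -> prompt -> generate, one draft per selected customer."""
    selected = top_n(extract_customers(), n)
    return {c.name: generate(build_prompt(c)) for c in selected}


drafts = run_chain(2)
```

Keeping each link a plain function makes it easy to swap one out, or to rerun only the generative step while you iterate on the prompt.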
You can use a library like LangChain to accomplish many of these chained tasks. LangChain includes Models, Chains and Agents.
- Models. LangChain supports a variety of LLMs, including Google Vertex AI, OpenAI and Hugging Face models.
- Chains. Chains are sequences of operations that LangChain can perform on text or other data. Chains can be used to perform tasks such as text analysis, summarization, and translation.
- Agents. Agents are programs that use LLMs to make decisions and take actions. Agents can be used to build applications such as chatbots and code analysis tools.
LangChain also provides integrations with other tools and APIs as well as end to end chains of tasks needed to complete a workflow. For example:
- Integrations with other tools: LangChain can be integrated with other tools, such as Google Search and Python REPL, to extend its capabilities.
- End-to-end chains for common applications: LangChain provides pre-built chains for common applications, such as document analysis and summarization.
LangChain agents are particularly powerful because they can use LLMs to make decisions and take actions in a dynamic and data-driven way. For example, a LangChain agent could be used to build a chatbot that decides, turn by turn, which tools to call based on the conversation so far.
LangChain can be used for a variety of use cases. For example:
- Document analysis and summarization: LangChain can be used to analyze and summarize documents, such as legal documents or scientific papers.
- Chatbots: LangChain can be used to build chatbots that can interact with users in a natural and informative way.
- Code analysis: LangChain can be used to analyze code and identify potential bugs or security vulnerabilities.
Overall, LangChain is a powerful framework that can be used to build a wide variety of applications using LLMs. It is particularly well-suited for building dynamic and data-responsive applications.
LangChain agents use an LLM to decide what actions to take and the order to take them in. They make future decisions by observing the outcome of prior actions. This allows LangChain agents to adapt as a task unfolds, adjusting their plan to what earlier steps returned.
LangChain agents can be used to build a variety of applications, such as chatbots, code analysis tools, and customer service assistants. They are particularly well-suited for tasks that require reasoning, planning, and decision-making.
Maturity Level 2. The above section is a very typical set of patterns used in an iterative cycle to leverage Generative AI. Now let’s explore a more mature level that augments the above.
Tune it.
As you evaluate the model’s responses, you may find them wanting even after substantial effort in prompt engineering and in-context learning. Here you may need to tune the foundation model: adapt it to a domain, an industry, a type of output format, or a certain brevity rather than rambling output (e.g., as in classification of a set of symptoms).
Parameter-efficient fine-tuning (PEFT) is a family of techniques for fine-tuning LLMs at lower computational cost than traditional fine-tuning. PEFT works by training only a small number of parameters, either a subset of the LLM’s own or a small set of newly added ones, while the rest stay frozen. Two common techniques are adapter tuning and LoRA (Low-Rank Adaptation of Large Language Models).
Adapter tuning involves inserting small new layers (adapters) into the LLM that are specific to the task at hand. Only the new layers are trained, on a small dataset of labeled examples. This allows the LLM to learn the specific features of the task without having to fine-tune all of its parameters.
LoRA freezes the LLM’s pretrained weights and represents the change to each adapted weight matrix as the product of two much smaller low-rank matrices. Only these low-rank matrices are trained on a small dataset of labeled examples, and after training their product can be merged back into the original weights. This allows the LLM to learn the specific features of the task without having to fine-tune all of its parameters.
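A small numeric sketch may help show why LoRA is cheap. In the LoRA formulation, the effective weight is the frozen matrix W0 plus a scaled low-rank product, W0 + (alpha / r) · B · A, where B is d × r and A is r × k. The code below uses plain Python lists and made-up sizes purely for illustration. It is not how a training library implements it.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_weight(w0, b, a, alpha, r):
    """Effective weight W0 + (alpha / r) * B @ A; W0 stays frozen, only B and A train."""
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]


def trainable_params(d, k, r):
    """Parameters updated by full fine-tuning vs. LoRA for one d x k weight matrix."""
    return {"full": d * k, "lora": r * (d + k)}


# Tiny 2 x 2 example with rank r = 1.
w0 = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]   # d x r
a = [[0.5, 0.5]]     # r x k
w_eff = lora_weight(w0, b, a, alpha=1, r=1)

# Illustrative sizes: one 4096 x 4096 matrix adapted with rank 8.
counts = trainable_params(d=4096, k=4096, r=8)
```

For the illustrative 4096 × 4096 matrix, rank 8 trains 65,536 values instead of 16,777,216, which is roughly 0.4% of the original count.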
Full fine-tuning is the traditional approach to fine-tuning LLMs. In full fine-tuning, all of the LLM’s parameters are fine-tuned on a large dataset of labeled examples. This can be computationally expensive, but it can lead to the best performance on the target task.
Tuning allows the very important introduction of domain-specific LLMs. For example, see how Vertex AI can do this for you at a nominal cost.
Reinforcement Learning from Human Feedback (RLHF) can be used to further enhance the fine-tuning. More on that in Part 2.
Maturity Level 3. Now let’s retrieve data before we send the prompt and add even more context to the input, decreasing the likelihood of hallucination by the LLM.
RAG it.
Access similar documents using semantic search. How is this done? The documents you supply are chunked (read ‘split’) sentence by sentence, by paragraph, by page, etc. Each chunk is converted into an embedding with an embedding model such as textembedding-gecko@latest and stored in a vector database such as Google’s Vertex AI Vector Search. At query time the prompt is embedded the same way and the closest chunks are retrieved via Approximate Nearest Neighbor (ANN) search, which is what makes the search semantic rather than keyword-based. Adding the retrieved chunks to the input may significantly decrease the possibility of hallucination and gives the model enough relevant context to be more knowledgeable about the topic and return more ‘sensible’ and relevant completions. This process is known as Retrieval Augmented Generation, or RAG. So RAG it.
RAG works by:
- Creating an initial prompt from the user’s query or statement.
- Augmenting the prompt with context retrieved from the Vector Store.
- Sending the augmented prompt to the LLM.
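The steps above can be sketched end to end with toy components. The assumptions are loud ones: `embed` is a bag-of-words counter rather than a real embedding model, `retrieve` does an exact scan where a vector database would use ANN, and the policy text is invented. The final call to the LLM is left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def chunk(document: str) -> list:
    """Split on sentence boundaries; paragraph or page splits work the same way."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]


def retrieve(query: str, store: list, k: int = 2) -> list:
    """Exact nearest-neighbour scan; a vector database would use ANN instead."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def augment(query: str, passages: list) -> str:
    """Build the augmented prompt that would be sent to the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


document = (
    "Returns are accepted within 30 days of delivery. "
    "Shipping is free for orders over 50 euros. "
    "Gift cards never expire."
)
store = [(c, embed(c)) for c in chunk(document)]
question = "How many days do I have for returns?"
passages = retrieve(question, store, k=1)
prompt = augment(question, passages)
```

Chunk size and `k` are the two knobs you will most likely revisit as you cycle: chunks that are too small lose context, while too many passages dilute the prompt.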
Ground it.
Use an expanded search capability to increase factual grounding by allowing, or requesting, the model to return a reference to where it found the responses it just gave. RAG does provide grounding, but prior to submission to the LLM. Grounding in this sense happens after the model issues its output tokens: find a citation and send it back with the response. Many vendors, such as Google Cloud AI, provide multiple ways of factual grounding.
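As a rough sketch of grounding after generation, the snippet below pairs each sentence of a completion with the source passage that best supports it, or flags it as unsupported. Token overlap is a deliberately crude support measure chosen for illustration, and the source ids, passages and the `min_overlap` threshold are all invented. Vendor grounding services work differently.

```python
import re


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def ground(completion: str, sources: dict, min_overlap: float = 0.5) -> list:
    """Pair each generated sentence with its best-supporting source id, or None."""
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", completion.strip()):
        words = tokens(sentence)
        best_id, best_score = None, 0.0
        for source_id, passage in sources.items():
            score = len(words & tokens(passage)) / len(words) if words else 0.0
            if score > best_score:
                best_id, best_score = source_id, score
        # Below the threshold the sentence is reported as having no citation.
        results.append((sentence, best_id if best_score >= min_overlap else None))
    return results


sources = {
    "policy.md#returns": "Returns are accepted within 30 days of delivery.",
    "policy.md#shipping": "Shipping is free for orders over 50 euros.",
}
completion = "Returns are accepted within 30 days. Refunds arrive in two hours."
grounded = ground(completion, sources)
```

A sentence that comes back with `None` is a candidate hallucination: you can drop it, rewrite it, or route it to a human.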
Note: Factual Grounding vs RAG
Factual grounding and RAG are both approaches to improving the accuracy and relevance of LLMs. However, they have different goals and use different techniques. Factual grounding is the process of ensuring that an LLM’s generated text is consistent with factual knowledge. This can be done by providing the LLM with access to a knowledge base of factual statements, and by training the LLM to generate text that is consistent with these statements.
RAG is a framework for augmenting LLMs with access to external knowledge bases. This allows LLMs to generate more accurate and informative text, even on complex and challenging tasks. RAG works by first retrieving relevant passages from the knowledge base. The LLM then uses these passages to generate its response.
The main difference between factual grounding and RAG is that factual grounding focuses on ensuring that the LLM’s generated text is consistent with factual knowledge, while RAG focuses on generating more accurate and informative text. Also factual grounding typically uses a knowledge base of factual statements, while RAG can use any type of external knowledge base, including text documents, code repositories, and databases.
The two also apply at different points in the cycle. RAG acts at inference time, before generation, by adding retrieved context to the prompt. Grounding, as used in this article, acts after generation by checking the completion against sources and attaching citations. Neither requires retraining the model, so they can be combined in the same chain.
FLARE it.
Forward-looking Active Retrieval Augmented Generation. FLARE is a variation of RAG in which the system actively decides when and what to retrieve. It generates a tentative upcoming sentence to anticipate future content and, if that sentence contains low-confidence tokens, uses it as the query to retrieve relevant documents and then regenerates the sentence with the retrieved context.
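The control flow is easier to see in code. This is a simplified sketch under stated assumptions: `draft_next_sentence` is a hypothetical LLM step that returns a sentence together with per-token confidences, `retrieve` is a toy lookup, and the facts and the `threshold` value are invented. The published method also covers masking low-confidence tokens and generating explicit queries, both omitted here.

```python
FACTS = {"founded": "The company was founded in 1998."}


def retrieve(query):
    """Toy retriever: return every fact whose key appears in the query."""
    return [fact for key, fact in FACTS.items() if key in query]


def draft_next_sentence(question, answer_so_far, context):
    """Hypothetical LLM step returning (sentence, per-token confidences)."""
    step = len(answer_so_far)
    if step == 0:
        return "Acme sells hiking gear.", [0.9, 0.9, 0.8, 0.9]
    if step == 1:
        if context:
            return context[0], [0.9] * 6
        return "It was founded in 2005.", [0.9, 0.9, 0.8, 0.9, 0.2]
    return None, []


def flare_generate(question, threshold=0.6, max_sentences=5):
    """Generate sentence by sentence, retrieving only when the draft looks uncertain."""
    answer, retrievals = [], 0
    for _ in range(max_sentences):
        draft, confidences = draft_next_sentence(question, answer, context=None)
        if draft is None:
            break
        if min(confidences) < threshold:
            # The tentative sentence itself becomes the retrieval query.
            context = retrieve(draft)
            retrievals += 1
            draft, _ = draft_next_sentence(question, answer, context=context)
        answer.append(draft)
    return " ".join(answer), retrievals


text, retrieval_count = flare_generate("Tell me about Acme.")
```

The confident first sentence goes through untouched, and only the shaky second one triggers a lookup. That is the saving over retrieving on every step.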
Maturity Level 4. We are getting into a very interesting domain here, where you can start to ask your LLM how it is reasoning and what steps it takes to accomplish its task.
CoT it or ToT it. GoT it?
Use a prompt to derive the Chain/Tree/Graph of Thought output as a set of steps. Each step can actively use RAG to pull from documents in a vector DB. ToT maintains a tree of thoughts, where thoughts are coherent language sequences that reflect how the LLM is “thinking” about the intermediate steps it would use to solve your problem. This approach enables an LLM to self-evaluate, through a deliberate reasoning process, the progress that intermediate steps/thoughts make towards solving a problem.
The Tree of Thoughts (ToT) framework is a new approach to AI reasoning. It’s different from the Chain of Thought (CoT) approach, which guides language models along a single path. In a CoT diagram, each sentence is a direct continuation of the previous one, forming a linear chain. In a ToT diagram, the main idea branches off into several related ideas.
In a ToT diagram, each node is a “thought”. A thought is a coherent chunk of text that represents an intermediate reasoning step. This allows the language model to explore multiple reasoning paths and evaluate the progress of different thoughts towards solving the problem.
Tree of Thoughts allows multiple step analysis, multiple comparisons, and increased options after each step. It also allows the system to restart at the first or earlier steps to look again for new options.
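A toy search shows the difference between following one chain and keeping a tree. In this sketch a "thought" is a partial sequence of arithmetic operations, `propose` and `evaluate` are plain functions standing in for the LLM calls that would generate and score thoughts, and the puzzle (reach 11 from 2 using +3 and *2) is invented for illustration.

```python
OPS = {"+3": lambda x: x + 3, "*2": lambda x: x * 2}


def propose(state):
    """Branch: every thought extends the current path by one operation."""
    value, path = state
    return [(fn(value), path + [name]) for name, fn in OPS.items()]


def evaluate(state, target):
    """Score a thought; lower is better. An LLM would do this self-evaluation."""
    return abs(target - state[0])


def tree_of_thoughts(start, target, beam_width=2, max_depth=4):
    """Breadth-first search that keeps the beam_width best thoughts per level."""
    frontier = [(start, [])]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for value, path in candidates:
            if value == target:
                return path
        # Keep only the most promising thoughts; the rest are pruned.
        frontier = sorted(candidates, key=lambda s: evaluate(s, target))[:beam_width]
    return None


# beam_width=1 behaves like a single greedy chain; beam_width=2 keeps alternatives.
greedy = tree_of_thoughts(start=2, target=11, beam_width=1)
solution = tree_of_thoughts(start=2, target=11, beam_width=2)
```

With a single path the search commits to the locally best step and overshoots, so `greedy` is `None`. Keeping one extra branch alive is enough to find `["+3", "+3", "+3"]`.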
OK you get the idea for the Graph of Thought. GoT it?
Graph of Thought (GoT) is a framework that models the reasoning process of large language models (LLMs) as a graph. In a Chain of Thought, each sentence is a direct continuation of the previous one, forming a linear chain. In a Tree of Thought, the main idea branches off into several related ideas.
GoT allows for dynamic data flow without a fixed sequence. This flexibility is important in AI, where data can come from multiple sources and may need to be processed non-linearly.
GoT models each thought generated by an LLM as a node (vertex) within a graph. Dependencies are represented by edges that connect these nodes. This allows prompting the LLM to solve problems through networked, non-linear reasoning.
This strategy can be considered a generalization of Chain of Thought and Tree of Thoughts. In addition, it provides more flexibility, for example by refining a single thought or aggregating multiple thoughts together.
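Aggregation is the operation a tree cannot express, so here is a minimal sketch of it. The task (sort a list by splitting it, sorting the parts, and merging) and the node names are illustrative assumptions, and Python's `sorted` and `heapq.merge` stand in for the LLM calls that would produce each thought.

```python
import heapq


def graph_of_thoughts_sort(numbers, parts=2):
    """Split, solve, and aggregate; returns the result and the thought graph edges."""
    thoughts = {"input": list(numbers)}
    edges = []  # (parent, child) dependencies between thoughts

    size = -(-len(numbers) // parts)  # ceiling division
    sorted_ids = []
    for i in range(parts):
        split_id, sorted_id = f"split{i}", f"sorted{i}"
        thoughts[split_id] = numbers[i * size:(i + 1) * size]
        edges.append(("input", split_id))
        # Refinement: one thought is improved into another.
        thoughts[sorted_id] = sorted(thoughts[split_id])
        edges.append((split_id, sorted_id))
        sorted_ids.append(sorted_id)

    # Aggregation: one thought with several parents, which a tree cannot express.
    thoughts["merged"] = list(heapq.merge(*(thoughts[s] for s in sorted_ids)))
    edges.extend((s, "merged") for s in sorted_ids)
    return thoughts["merged"], edges


result, edges = graph_of_thoughts_sort([5, 2, 9, 1, 7, 3])
parents_of_merged = [parent for parent, child in edges if child == "merged"]
```

The `merged` node has two parents, `sorted0` and `sorted1`. That single many-to-one edge pattern is what turns the tree into a graph.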
Let’s explain this table.
The Chain of Thought model does very well on simplicity, intuitiveness, and ease of training. It follows a linear, step-by-step process that suits tasks naturally aligned with sequential logic. That same linearity limits its ability to handle complex reasoning tasks that require considering multiple variables, alternative options, or alternative outcomes. Once it sets “its mind” on a particular ‘chain’, the model may find it challenging to backtrack or explore other avenues, which may lead to less than optimal outcomes.
The Tree of Thought model is characterized by its ability to represent complex reasoning hierarchically. This structure lets it tackle multi-faceted problems by branching into sub-problems or conditions, and it remains easier to interpret than more complex models such as a graph. The additional complexity comes at the cost of higher computational needs and a possible risk of overfitting. Compared with a single chain, its branching can also make it harder to trace the exact reasoning path, which does not help interpretability.
The Graph of Thought model stands out for its ability to handle high-complexity tasks involving multiple interconnected variables. Its flexibility allows it to model non-linear and interconnected relationships, making it highly suitable for real-world problems with complex, interrelated variables. However, this complexity demands significant computational resources and sophisticated algorithms for effective training. The Graph of Thought model is also the most challenging to interpret; its non-linear interconnected structure doesn’t lend itself to straightforward explanations, making it difficult to understand the reasoning behind its decisions and use it for explainability.
ReAct — Plan it.
RAG gets data and passes it to the LLM so that it doesn’t look stupid and make things up when it doesn’t know your domain. ReAct goes further: it lets the LLM itself make an external call, via an API for example, to retrieve information while it is reasoning about what to tell you in response to your prompt. ReAct is a method that combines reasoning and acting to help LLMs “learn” new tasks and make decisions. Using this technique, we prompt LLMs to generate verbal reasoning traces and actions for a task, interleaved. This allows the model to perform dynamic reasoning to create, maintain, and adjust high-level plans for acting.
ReAct is designed for tasks in which the LLM is allowed to perform certain actions. For example, an LLM may be able to interact with external APIs to retrieve information. It addresses issues that LLMs sometimes face, like producing incorrect facts and compounding errors.
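The Thought / Action / Observation cycle can be sketched as a short loop. The assumptions: `scripted_llm` replays a fixed trace instead of calling a model, `lookup_population` is an invented tool with one hard-coded toy value, and the `Action: tool[argument]` text format is one common convention rather than a fixed standard.

```python
import re

TOOLS = {
    # Invented tool with hard-coded toy data, standing in for an external API.
    "lookup_population": lambda city: {"Paris": "2.1 million"}.get(city, "unknown"),
}


def scripted_llm(transcript: str) -> str:
    """Hypothetical model: replays a fixed trace so the loop runs offline."""
    if "Observation:" not in transcript:
        return "Thought: I need the population figure.\nAction: lookup_population[Paris]"
    observation = transcript.rsplit("Observation: ", 1)[1].strip()
    return (
        "Thought: I have what I need.\n"
        f"Final Answer: Paris has about {observation} people."
    )


def react(question: str, llm, max_steps: int = 5) -> str:
    """Alternate model steps and tool calls until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)
        transcript += step + "\n"
        final = re.search(r"Final Answer: (.*)", step)
        if final:
            return final.group(1)
        action = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if action is None:
            break
        tool, argument = action.groups()
        result = TOOLS[tool](argument) if tool in TOOLS else f"unknown tool {tool}"
        # Feed the tool output back so the next reasoning step can use it.
        transcript += f"Observation: {result}\n"
    return "No answer within the step budget."


answer = react("How many people live in Paris?", scripted_llm)
```

The `max_steps` budget matters in production: a model that keeps choosing actions without converging would otherwise loop, and cost, indefinitely.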
Conclusion
This article reviewed a set of patterns, and combinations of techniques, commonly encountered as you cycle through deriving business value from Gen AI: both as you seek to make Gen AI Enterprise ready and, conversely, as you seek to mature the Enterprise from prototype to product for and with Gen AI.
Topics and References
I have covered or alluded to the following topics in this article.
- [Textual] In-Context Learning (TICL). Okay, I made the [textual] word up to rhyme and make it a semantic verb aligned with the rest. guilty :-) !
- Few-shot learning
- Supervised learning
- Retrieval Augmented Generation (RAG)
- FLARE, a form of Active RAG.
- Factual grounding
- Chain of Thought (CoT)
- Tree of Thoughts (ToT)
- Graph of Thought (GoT)
- ReAct: Reasoning & Action.
Here are some specific references that you may find helpful.
[1] LLMs can perform complex tasks through ICL, even as complex as solving mathematical reasoning problems:
This paper shows that LLMs can be prompted to perform complex tasks with only a few examples, using a technique called chain of thought prompting; no fine-tuning is involved. The authors demonstrate that LLMs can be used to solve mathematical reasoning problems, translate languages, and perform other complex tasks with high accuracy.
- A Survey on In-context Learning, Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, Lei Li, Zhifang Sui.
This survey paper provides a comprehensive overview of ICL for LLMs. The authors discuss the different ways in which ICL can be used to have LLMs perform new tasks without updating their weights, and they provide examples of ICL being used to solve complex tasks such as mathematical reasoning and code generation.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou
This paper introduces the chain of thought prompting technique for getting LLMs to perform complex tasks. The authors demonstrate that LLMs prompted with chain of thought exemplars can solve mathematical reasoning problems, even when the problems are presented in a new format. They “explore how generating a chain of thought -- a series of intermediate reasoning steps -- significantly improves the ability of large language models to perform complex reasoning.” In particular, they show how such reasoning capabilities are an emergent behavior that surfaces “naturally in sufficiently large language models …, where a few chain of thought demonstrations are provided as exemplars in prompting. Experiments on three large language models show that chain of thought prompting improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks.”
[2] (Textual) In-Context Learning (TICL)
- Brown, Tom B., et al. “Language models are few-shot learners.” arXiv preprint arXiv:2005.14165 (2020).
- Raffel, Colin, et al. “Exploring the limits of transfer learning with a unified text-to-text transformer.” arXiv preprint arXiv:1910.10683 (2019).
[3] Few-shot learning
Prototypical networks are a type of prototype classifier that is used for few-shot learning. Few-shot learning is a classification technique that uses a small dataset to adapt to a specific task. Prototypical networks are based on the idea that each class can be represented by the mean of its examples in a representation space learned by a neural network.
- Snell, Jake, Kevin Swersky, and Richard Zemel. “Prototypical networks for few-shot learning.” arXiv preprint arXiv:1703.05175 (2017).
The authors of the model-agnostic meta-learning paper listed below propose an algorithm that is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning.
The goal of meta-learning is to train a model on a variety of tasks so that it can quickly learn new tasks with only a few training examples. In this approach, the model parameters are trained to be easy to fine-tune with a small number of gradient steps on a new task.
They demonstrate that their approach achieves state-of-the-art results on two few-shot image classification benchmarks, performs well on few-shot regression, and accelerates fine-tuning for policy gradient reinforcement learning with neural network policies.
- Finn, Chelsea, et al. “Model-agnostic meta-learning for fast adaptation of deep networks.”
[4] Supervised learning.
- Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press, 2012.
- Hastie, Trevor, Robert Tibshirani, and Jerome Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media, 2009.
[5] Retrieval Augmented Generation (RAG)
The paper on retrieval augmented generation (RAG) was written by Patrick Lewis, et al. RAG is a framework for augmenting large language models (LLMs) with access to external knowledge bases. This allows LLMs to generate more accurate and informative text, even on complex and challenging tasks.
RAG has been shown to be effective for a variety of tasks, including question answering, summarization, and translation. It is a promising new approach to generative AI, and it has the potential to revolutionize the way we interact with computers.
- Lewis, Patrick, et al. “Retrieval-augmented generation for knowledge-intensive NLP tasks.” arXiv preprint arXiv:2005.11401 (2020).
[6] Factual grounding
- Baolin Peng, et al. “Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback”. This paper proposes a method for improving the factual accuracy of LLMs by providing them with feedback on their generated text. The feedback is based on a knowledge base of factual statements.
- Wei, Jason, et al. “Chain of thought prompting elicits reasoning in large language models.”
[7] Chain of Thought (CoT)
The paper explores how generating a chain of thought — a series of intermediate reasoning steps — significantly improves the ability of large language models to perform complex reasoning.
The authors propose the chain-of-thought prompting technique, which enables large language models to perform complex reasoning tasks by generating a few intermediate reasoning steps, without explicit training. Key insights and lessons learned from the paper are that large language models can reason at a level comparable to models that have been extensively trained, and that chain-of-thought prompting is a simpler and more scalable mechanism for eliciting reasoning steps from language models.
- Wei, Jason, et al. “Chain of thought prompting elicits reasoning in large language models.”
[8] Tree of Thoughts (ToT)
- Shunyu Yao, et al. “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” arXiv preprint arXiv:2305.10601 (2023).
- Jieyi Long. “Large Language Model Guided Tree-of-Thought”
[9] Graph of Thought (GoT)
[10] ReAct: Reasoning and Action
- Denny Zhou et al. “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models”
- Shunyu Yao, et al. “ReAct: Synergizing Reasoning and Acting in Language Models.” Also, see this official blog.
[11] Applications of chains of tasks.