Who says RAG is only about Vector Stores?!

Ali Arsanjani
5 min read · Sep 5, 2024


Retrieval-Augmented Generation (RAG) has emerged as the de facto method of grounding LLMs so that they obtain current, presumably reliable and factual information prior to generation. It has been widely adopted, as reflected in its place at level 2 of maturity in the GenAI Maturity Model.

But there’s a common misconception that needs debunking: retrieval to augment the prompt should not be only about vector stores!

Vector databases have indeed revolutionized the way we handle semantic search and similarity matching in RAG systems (see Google Cloud AI’s Vertex AI Search low-code approach or its Vector Search ANN embeddings store).

Vertex AI Search functions as an out-of-the-box RAG system for information retrieval. Under the hood with Vertex AI Search, we’ve simplified the end-to-end search and discovery process of managing ETL, OCR, chunking, embedding, indexing, storing, input cleaning, schema adjustments, information retrieval, and summarization to just a few clicks. This makes it easy for you to build RAG-powered apps using Vertex AI Search as your retrieval engine.

Vector stores, however, are just one piece of a much larger puzzle in the context of retrieval augmentation for enterprise AI applications that seek to reduce hallucinations and increase precision, recall, and accuracy.

Your true power in using RAG lies in your ability to tap into a diverse ecosystem of data sources — not just vector embeddings! — each bringing its unique strengths to the table.

Consider a use case where your AI assistant doesn’t rely on pre-trained knowledge alone, but can integrate information from vector databases, from traditional SQL databases, data lakes, or data warehouses (e.g., BigQuery, which also has native support for vector embeddings of text), navigate complex relationships in graph databases, and leverage the structured knowledge of expansive knowledge graphs. This is the level of maturity of more advanced RAG systems (beyond Naive RAG), and it’s timely to elaborate on the full spectrum of data sources that can make this possible.

In this blog post, we’ll look at the alternative data sources that power state-of-the-art RAG systems. The spectrum starts with the “traditional” vector embeddings space, moves through SQL-based structured databases, warehouses, and lakehouses, and extends to the richer representations of graph databases and knowledge graphs. We will explore how each type of data source contributes to making AI-generated responses more accurate, contextually relevant, and up-to-date.

Think about how to leverage these various data sources in your own AI projects as we go through them. It’s important to challenge the notion that RAG is all about vector stores and uncover the true diversity that makes this technology so powerful!

Types of Retrieval Representations for RAG

Understanding Data Sources for RAG

RAG is a powerful approach in LLM-based applications that combines the strengths of LLMs with external knowledge sources. By leveraging various types of databases and knowledge representations, RAG systems can provide more accurate, up-to-date, and contextually relevant responses. In this blog post, we’ll explore the four main types of data sources used in RAG systems: Structured Databases, Vector Databases, Graph Databases, and Knowledge Graph Representations.
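
The retrieve-then-augment pattern described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the `Retriever` type, the `build_augmented_prompt` function, and the three lambda "backends" are hypothetical names invented for this example, and the snippets they return are made-up data.

```python
from typing import Callable, Dict, List

# A retriever is any function that maps a question to a list of text snippets.
Retriever = Callable[[str], List[str]]


def build_augmented_prompt(question: str, retrievers: Dict[str, Retriever]) -> str:
    """Collect snippets from every source and place them ahead of the question."""
    context_lines = []
    for source_name, retrieve in retrievers.items():
        for snippet in retrieve(question):
            # Tag each snippet with its source so the answer can be traced back.
            context_lines.append(f"[{source_name}] {snippet}")
    context = "\n".join(context_lines)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )


# Hypothetical stand-ins for real backends (SQL, vector store, graph database).
retrievers: Dict[str, Retriever] = {
    "sql": lambda q: ["order 1042 shipped on 2024-09-01"],
    "vector": lambda q: ["Returns are accepted within 30 days of delivery."],
    "graph": lambda q: ["customer C7 -> placed -> order 1042"],
}

prompt = build_augmented_prompt("Can order 1042 still be returned?", retrievers)
print(prompt)
```

The point of the sketch is that the prompt builder does not care where a snippet came from. Each data source discussed below only has to expose a function from a question to text.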

Figure: Structured data (warehouses, databases, lakes)

1. Structured Databases

Structured databases are the backbone of many information systems, offering organized and easily queryable data storage.

SQL Databases

  • Tables: Data is organized into tables with predefined schemas.
  • Rows and Columns: Information is stored in rows (records) and columns (fields).
  • ACID properties: Ensure data integrity through Atomicity, Consistency, Isolation, and Durability.
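
As a rough sketch of how a SQL table can feed a RAG prompt, the example below uses Python's built-in `sqlite3` module with an in-memory database. The `orders` table, its columns, its rows, and the `retrieve_orders` helper are all assumptions made up for illustration; a production system would query its own schema in a warehouse such as BigQuery.

```python
import sqlite3
from typing import List

# In-memory database with a hypothetical "orders" table (illustrative data only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, status TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (id, customer, status, total) VALUES (?, ?, ?, ?)",
    [
        (1, "acme", "shipped", 120.0),
        (2, "acme", "pending", 75.5),
        (3, "globex", "shipped", 310.0),
    ],
)
conn.commit()


def retrieve_orders(customer: str) -> List[str]:
    """Return one plain-text fact per matching row, ready to paste into a prompt."""
    rows = conn.execute(
        "SELECT id, status, total FROM orders WHERE customer = ? ORDER BY id",
        (customer,),  # parameterized query: user input is never formatted into SQL
    ).fetchall()
    return [
        f"Order {order_id} is {status} with total {total:.2f}"
        for order_id, status, total in rows
    ]


facts = retrieve_orders("acme")
print("\n".join(facts))
```

Turning each row into a short sentence is one design choice among several. Passing rows as JSON or as a small table can work equally well, depending on the model and the prompt.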

NoSQL Databases

  • Document-based: Stores data in flexible, JSON-like documents.
  • Key-value pairs: Simple storage model for rapid data retrieval.
  • Wide-column stores: Optimized for queries over large datasets.

Figure: Vector Stores

2. Vector Databases

Vector databases are crucial for RAG systems, enabling efficient similarity search and semantic understanding.

Embedding-based Search

  • Dense vectors: Represent data as compact, fixed-size numerical vectors.
  • High-dimensional space: Allows for nuanced representation of complex data.
  • Semantic representation: Captures meaning and context, not just keywords.

Similarity Search

  • Nearest neighbor search: Finds the most similar items in the vector space.
  • Cosine similarity: Measures the cosine of the angle between vectors.
  • Euclidean distance: Calculates the straight-line distance between points in vector space.
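
The two measures and the nearest neighbor search above can be sketched in plain Python. The three-dimensional vectors and document names in `index` are toy values invented for this example, since real embeddings typically have hundreds or thousands of dimensions. The brute-force scan shown here is for clarity only; vector stores generally rely on approximate nearest neighbor (ANN) indexes instead.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a: List[float], b: List[float]) -> float:
    """Straight-line distance between two points in vector space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest_neighbors(
    query: List[float], index: Dict[str, List[float]], k: int = 2
) -> List[Tuple[str, float]]:
    """Brute-force search: score every vector, return the k most similar."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (hypothetical values, not produced by a real model).
index = {
    "refund_policy": [0.9, 0.1, 0.0],
    "shipping_times": [0.1, 0.9, 0.1],
    "return_window": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

top = nearest_neighbors(query, index, k=2)
print([doc_id for doc_id, _ in top])
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to magnitude. For vectors normalized to unit length, the two produce the same ranking.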

Figure: Graph Databases

3. Graph Databases

Graph databases excel at representing and querying interconnected data, which is particularly useful for complex relationships.

Nodes and Relationships

  • Vertices: Represent entities or data points.
  • Edges: Define connections between nodes.
  • Properties: Additional information attached to nodes and edges.

Graph Traversal

  • Path finding: Efficiently discovers connections between distant nodes.
  • Depth-first search: Explores as far as possible along each branch before backtracking.
  • Breadth-first search: Explores all nodes at the current depth before moving to the next level.
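
Path finding with breadth-first search can be sketched over a small adjacency list. The node names and the `shortest_path` helper are hypothetical, and a real graph database would expose traversal through its own query language rather than through Python dictionaries.

```python
from collections import deque
from typing import Dict, List, Optional

# Toy undirected graph as an adjacency list (hypothetical entities).
graph: Dict[str, List[str]] = {
    "alice": ["order_1042"],
    "order_1042": ["alice", "widget"],
    "widget": ["order_1042", "supplier_x"],
    "supplier_x": ["widget"],
    "bob": [],
}


def shortest_path(
    graph: Dict[str, List[str]], start: str, goal: str
) -> Optional[List[str]]:
    """Breadth-first search: returns a path with the fewest edges, or None."""
    if start not in graph or goal not in graph:
        return None
    queue = deque([[start]])  # each queue entry is a full path from start
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in graph[node]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


path = shortest_path(graph, "alice", "supplier_x")
print(" -> ".join(path) if path else "no path")
```

Because breadth-first search visits nodes level by level, the first path that reaches the goal uses the fewest hops. In a RAG setting, the nodes along such a path can be rendered as text and added to the prompt as relationship context.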

Figure: Knowledge Graphs

4. Knowledge Graph Representations

Knowledge graphs provide a structured way to represent real-world information and relationships.

Entities

  • Real-world objects: Represent tangible things or concepts.
  • Abstract concepts: Capture ideas, categories, or theoretical constructs.
  • Unique identifiers: Ensure each entity is distinctly recognizable.

Relations

  • Semantic connections: Define meaningful links between entities.
  • Directional links: Specify the nature and direction of relationships.
  • Typed relationships: Categorize connections for more precise querying.

Attributes

  • Properties: Describe characteristics of entities.
  • Metadata: Provide additional context or classification information.
  • Values: Store specific data points associated with entities.
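
One common way to hold entities, typed relations, and attribute values together is as subject-predicate-object triples. The sketch below assumes that representation; the `ex:` identifiers, the `Triple` type, the `match` helper, and all the facts are invented for illustration and do not come from any real knowledge graph.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str    # unique identifier of an entity
    predicate: str  # typed, directional relation or attribute name
    obj: str        # another entity identifier, or a literal value


# Hypothetical facts: relations between entities plus attribute values.
triples = [
    Triple("ex:AcmeCorp", "ex:headquarteredIn", "ex:Springfield"),
    Triple("ex:AcmeCorp", "ex:foundedYear", "1999"),
    Triple("ex:AcmeCorp", "ex:supplies", "ex:GlobexInc"),
    Triple("ex:GlobexInc", "ex:headquarteredIn", "ex:Shelbyville"),
]


def match(
    triples: List[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> List[Triple]:
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        t
        for t in triples
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


# Everything known about one entity, rendered as text for a prompt.
acme_facts = [f"{t.subject} {t.predicate} {t.obj}" for t in match(triples, subject="ex:AcmeCorp")]
print("\n".join(acme_facts))

# A typed-relationship query: who is headquartered where?
hq = {t.subject: t.obj for t in match(triples, predicate="ex:headquarteredIn")}
print(hq)
```

Typed predicates are what make the second query precise: it returns only headquarters relations, without touching any other fact about the same entities.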

Conclusion

RAG systems benefit from the diverse strengths of these data sources. Structured databases provide organized, queryable information. Vector databases enable semantic search and similarity matching. Graph databases excel at representing complex relationships. Knowledge graphs offer a comprehensive way to model real-world information.

Use varied data sources, so your RAG systems can provide more accurate, contextually relevant, and up-to-date responses, enhancing the capabilities of LLMs in a wide range of applications.


Ali Arsanjani

Director Google, AI | EX: WW Tech Leader, Chief Principal AI/ML Solution Architect, AWS | IBM Distinguished Engineer and CTO Analytics & ML