

🚀 Level 4: Master

RAG (Retrieval-Augmented Generation)

Grounding AI responses in your own data using retrieval techniques.

RAG is one of the most practical techniques for making AI work with your own data. Instead of fine-tuning a model on your documents (expensive and inflexible), RAG retrieves relevant information at query time and includes it in the prompt. The model then generates answers grounded in your actual data rather than its training knowledge.

A typical RAG pipeline: embed your documents into vectors, store them in a vector database, and at query time retrieve the most relevant chunks to include in the context. This pattern powers knowledge bases, customer support bots, code assistants, and enterprise search. Getting the retrieval right is 80% of the challenge.

Key Topics Covered
RAG Architecture
Three phases: Retrieve (find relevant documents), Augment (add them to the prompt), Generate (LLM produces grounded answer). Simple in concept, nuanced in execution.
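A minimal sketch of the three phases in Python, assuming the open-source sentence-transformers package and its all-MiniLM-L6-v2 model (an illustrative choice; any embedding model from the next section works). Generation is left as a comment because any LLM API can consume the augmented prompt:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9am to 5pm CET.",
    "Premium plans include priority email and phone support.",
]
doc_vectors = model.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    # Phase 1: Retrieve. Cosine similarity reduces to a dot product
    # because the vectors are normalized.
    query_vector = model.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector
    top = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top]

def build_prompt(query: str) -> str:
    # Phase 2: Augment. Splice the retrieved chunks into the prompt.
    context = "\n".join(retrieve(query))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Phase 3: Generate. Send build_prompt(query) to any LLM API.
print(build_prompt("How long do I have to return a product?"))
```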
Embeddings
Dense vector representations of text that capture semantic meaning. Similar texts have similar vectors. Models: OpenAI text-embedding-3, Cohere embed-v3, open-source BGE and E5.
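To see "similar texts have similar vectors" concretely, here is a hedged sketch using the text-embedding-3-small model named above. It assumes the openai package and an OPENAI_API_KEY environment variable; the embed helper is ours, not part of the API:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(texts: list[str]) -> np.ndarray:
    # Illustrative helper: one API call, one vector per input text.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([item.embedding for item in resp.data])

a, b, c = embed([
    "How do I reset my password?",
    "I forgot my login credentials.",
    "What is your refund policy?",
])
cosine = lambda x, y: float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))
print(cosine(a, b))  # paraphrases: high similarity
print(cosine(a, c))  # unrelated topics: noticeably lower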
Vector Databases
Specialized databases for storing and querying embeddings. Pinecone (managed), Qdrant (open-source), Weaviate, ChromaDB (lightweight). Each optimizes for different scale and feature needs.
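A quick start with ChromaDB, the lightweight option above (a sketch; by default Chroma embeds the documents itself with a small built-in model, so no embedding code is needed):

```python
import chromadb

client = chromadb.Client()  # in-memory; use PersistentClient(path=...) to keep data
collection = client.create_collection("docs")

collection.add(
    ids=["1", "2"],
    documents=[
        "RAG retrieves context from your data at query time.",
        "Fine-tuning bakes knowledge into model weights.",
    ],
)

results = collection.query(query_texts=["How does RAG get its context?"], n_results=1)
print(results["documents"][0])  # best-matching chunk(s) for the query
```

The managed and heavier options follow the same upsert-then-query shape through their own clients.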
Chunking Strategies
How you split documents into chunks dramatically affects retrieval quality. Fixed-size, sentence-based, semantic, recursive, and document-structure-aware chunking each suit different content types.
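The simplest strategy, fixed-size chunking with overlap, as a sketch (the size and overlap defaults are illustrative, not a recommendation):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    # Fixed-size sliding window: each chunk repeats the last `overlap`
    # characters of the previous one, so content cut at a boundary
    # still appears intact somewhere.
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "RAG retrieves relevant chunks at query time. " * 40
chunks = chunk_text(doc, size=120, overlap=20)
print(len(chunks), repr(chunks[0][:60]))
```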
Hybrid Search
Combining semantic search (embeddings) with keyword search (BM25). Hybrid catches both conceptually similar and keyword-exact matches. Most production RAG systems use hybrid search.
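Because BM25 scores and cosine similarities live on different scales, production systems often fuse ranks rather than raw scores. A minimal sketch of reciprocal rank fusion (RRF) over toy document ids; k=60 is the conventional constant:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    # Each ranking lists doc ids best-first. A doc's fused score is the
    # sum of 1 / (k + rank) across all rankings it appears in.
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

semantic = ["d3", "d1", "d7"]   # from embedding search
keyword  = ["d1", "d9", "d3"]   # from BM25
print(reciprocal_rank_fusion([semantic, keyword]))  # d1 and d3 rise to the top
```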
Reranking
After initial retrieval, a cross-encoder reranker scores each chunk against the query more accurately than embedding similarity can. Examples: Cohere Rerank, BGE reranker. Reranking dramatically improves retrieval precision.
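A sketch using the sentence-transformers CrossEncoder class with the commonly used ms-marco-MiniLM-L-6-v2 model (an illustrative choice; the BGE reranker loads the same way, and Cohere Rerank is an API call instead):

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How long is the return window?"
candidates = [
    "Returns are accepted within 30 days of purchase.",
    "Our offices are closed on public holidays.",
    "Window installation takes two to four hours.",
]
# The cross-encoder reads query and chunk together, so it scores
# relevance more precisely than comparing two separate embeddings.
scores = reranker.predict([(query, doc) for doc in candidates])
ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[0][0])  # the 30-day returns chunk should score highest
```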
Advanced RAG Patterns
CRAG (Corrective RAG): verify retrieval quality before generating. Self-RAG: model decides when retrieval is needed. Graph RAG: combine vector search with knowledge graphs for richer context.
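The corrective-RAG idea reduces to a guard around retrieval. A minimal sketch in which retrieve_with_scores and web_search are trivial stubs standing in for a real vector store and fallback search, and the 0.6 threshold is illustrative:

```python
def retrieve_with_scores(query: str) -> list[tuple[str, float]]:
    return [("Returns accepted within 30 days.", 0.42)]  # stub retriever

def web_search(query: str) -> list[str]:
    return ["(web result) Returns accepted within 30 days of delivery."]  # stub

def corrective_rag(query: str, threshold: float = 0.6) -> str:
    # Keep only chunks the retriever is confident about.
    chunks = retrieve_with_scores(query)
    good = [text for text, score in chunks if score >= threshold]
    if not good:
        # Retrieval judged unreliable: correct it with a fallback source.
        good = web_search(query)
    context = "\n".join(good)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(corrective_rag("How long is the return window?"))
```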
Multi-Modal RAG
RAG beyond text: retrieving images, tables, and code snippets. Vision models can process retrieved images. Table extraction and code understanding require specialized chunking.
Evaluation
Measuring RAG quality: retrieval metrics (precision, recall, MRR) and generation metrics (faithfulness, relevance, completeness). RAGAS framework automates RAG evaluation.
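The retrieval metrics are simple enough to hand-roll on toy data (a sketch; frameworks like RAGAS compute these alongside the generation metrics automatically):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    # Fraction of the top-k retrieved docs that are actually relevant.
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    # Reciprocal rank of the first relevant result; 0 if none appears.
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d4", "d1", "d9"]
relevant = {"d1", "d2"}
print(precision_at_k(retrieved, relevant, k=3))  # 0.333...
print(mrr(retrieved, relevant))                  # 0.5 (first hit at rank 2)
```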
Common Pitfalls
Too-small chunks lose context; too-large chunks waste tokens. Poor embeddings retrieve irrelevant content. No reranking means noise in the top results. Always evaluate retrieval quality independently of generation.
Key Terms
Embedding: Dense vector representation of text that captures semantic meaning for similarity search.
Vector Database: Database optimized for storing embeddings and performing fast similarity search (Pinecone, Qdrant, Weaviate).
Chunking: The process of splitting documents into smaller pieces for embedding and retrieval.
Reranking: Second-stage scoring of retrieved results using a cross-encoder model for improved precision.