
Complete RAG Guide 2026: Retrieval-Augmented Generation

28 min read · by DevToolBox Team

Retrieval-Augmented Generation (RAG) is the dominant pattern for building LLM applications grounded in real, up-to-date data. Instead of relying solely on what a model memorized during training, RAG fetches relevant documents at query time and feeds them as context to the LLM. This guide covers the complete RAG pipeline — from document ingestion and embedding to vector search, advanced retrieval strategies, evaluation with RAGAS, and production deployment patterns — with working Python code examples using LangChain and OpenAI.

TL;DR: RAG = Retrieve relevant documents + Augment the prompt + Generate a grounded answer. It sharply reduces hallucinations, keeps responses current, and avoids expensive fine-tuning. The pipeline is: Load docs, split into chunks, embed with a model like text-embedding-3-small, store in a vector DB (Pinecone, Qdrant, pgvector), retrieve top-k results at query time, and pass them to the LLM. Advanced techniques include HyDE, parent-child chunks, hybrid search, and re-ranking. Evaluate with RAGAS (faithfulness, relevance, context recall). Production tips: cache embeddings, stream responses, monitor retrieval quality.

Key Takeaways

  • RAG grounds LLM responses in real data, reducing hallucinations by 80-95% compared to vanilla prompting.
  • Choose chunk size based on your content: 512 tokens for Q&A, 1024 for summarization, with 10-20% overlap.
  • text-embedding-3-small offers the best cost/performance ratio for most use cases at $0.02/1M tokens.
  • Hybrid search (dense vectors + sparse BM25) consistently outperforms pure vector similarity search.
  • Always evaluate with RAGAS before going to production — measure faithfulness, answer relevance, and context recall.
  • Re-ranking with a cross-encoder after initial retrieval significantly improves answer quality for minimal latency cost.

What Is RAG and Why It Matters

Large language models are trained on static datasets with a knowledge cutoff. They cannot access your internal documents, latest API docs, or real-time data. When asked about information outside their training data, they either refuse or — worse — hallucinate confident-sounding but incorrect answers.

RAG solves this by introducing a retrieval step before generation. At query time, the system searches a knowledge base for relevant documents, injects them into the prompt as context, and instructs the LLM to answer based on that context. The LLM becomes a reasoning engine over your data rather than a memorization engine.
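The retrieve-augment-generate loop can be sketched in a few lines of plain Python. This is a conceptual sketch only: `search` here is a stand-in keyword-overlap retriever, not a real vector store, and the prompt would go to an LLM rather than be printed.

```python
# Minimal RAG sketch: retrieve, augment, generate (stub components)

def tokens(text: str) -> set[str]:
    return {w.strip(".,?!").lower() for w in text.split()}

def search(query: str, docs: list[str], k: int = 2) -> list[str]:
    # Stand-in retriever: rank docs by word overlap with the query
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

def build_prompt(query: str, context: list[str]) -> str:
    # Augment: inject the retrieved chunks into the prompt as grounding context
    joined = "\n".join(f"- {c}" for c in context)
    return f"Answer ONLY from this context:\n{joined}\n\nQuestion: {query}"

docs = [
    "Deploy with Docker using the provided Dockerfile.",
    "Authentication uses OAuth 2.0 and API keys.",
]
prompt = build_prompt("How do I deploy?", search("How do I deploy?", docs, k=1))
print(prompt)  # the grounded prompt an LLM would receive
```

The rest of this guide replaces each stub with a production-grade component.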

First described in the 2020 paper by Lewis et al. at Meta, RAG has become the standard architecture for enterprise AI applications, customer support bots, documentation assistants, and any system that needs factual, verifiable answers.

RAG vs Fine-Tuning vs Prompt Engineering

Before building a RAG system, understand when to use each approach:

| Criteria | Prompt Engineering | RAG | Fine-Tuning |
|---|---|---|---|
| Knowledge update | Manual prompt edits | Update docs, re-embed | Retrain the model |
| Cost | Low (no infra) | Medium (vector DB + embeddings) | High (GPU training) |
| Latency | Lowest | +100-300ms retrieval | Same as base model |
| Accuracy on domain data | Low | High | High |
| Hallucination control | Poor | Excellent (citations) | Moderate |
| Data freshness | Static | Real-time possible | Stale after training |
| Best for | Simple tasks, formatting | Knowledge-heavy apps | Style/behavior changes |

In practice, most production systems combine all three: prompt engineering for formatting, RAG for knowledge, and fine-tuning for specialized behavior. But if you are building a knowledge-heavy application, start with RAG — it gives you the most impact with the least investment.

RAG Architecture: The Complete Pipeline

A RAG system has two main phases: an offline indexing pipeline that processes your documents, and an online query pipeline that handles user questions. Here is the complete architecture:

RAG Architecture Overview
========================

INDEXING PIPELINE (Offline)
--------------------------
                                                         
  [Documents]    [Text Splitter]    [Embedding]    [Vector Store]
  PDF, HTML  -->  Chunk into    -->  Convert to -->  Store in
  Markdown       256-512 tokens     dense vectors   Pinecone/
  Database       with overlap       (1536-dim)      Qdrant/PG
                                                         
QUERY PIPELINE (Online)
-----------------------
                                                         
  [User Query]   [Embed Query]   [Retrieve]   [Re-rank]   [Generate]
  "How do I  --> Convert to  --> Top-k     --> Cross-   --> LLM with
   deploy?"     query vector    similar      encoder      context
                                chunks       scoring      + prompt
                   |                            |            |
                   v                            v            v
              Same embedding              Filter &      Grounded
              model as indexing           compress      answer with
                                         context       citations

The indexing pipeline runs once (or on a schedule) to process your documents. The query pipeline runs for every user question, typically completing in 1-3 seconds end-to-end.

Note: The indexing pipeline is a one-time or periodic batch job that can run in the background. The query pipeline is the latency-sensitive critical path — every millisecond counts here. Focus your optimization efforts on the query pipeline.

Step 1: Document Loading

The first step is loading your source documents into a format the pipeline can process. LangChain provides loaders for virtually every format:

from langchain_community.document_loaders import (
    PyPDFLoader,
    UnstructuredHTMLLoader,
    UnstructuredMarkdownLoader,
    TextLoader,
    CSVLoader,
    DirectoryLoader,
    WebBaseLoader,
)

# Load a single PDF
pdf_docs = PyPDFLoader("report.pdf").load()

# Load all markdown files from a directory
md_docs = DirectoryLoader(
    "./docs/", glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader
).load()

# Load a web page
web_docs = WebBaseLoader("https://docs.example.com/api").load()

# Load CSV data
csv_docs = CSVLoader("products.csv").load()

print(f"Loaded {len(pdf_docs)} pages from PDF")
print(f"First doc metadata: {pdf_docs[0].metadata}")

Each loader returns a list of Document objects with page_content (the text) and metadata (source, page number, etc.). Metadata is critical for citations — always preserve it.

Step 2: Text Splitting (Chunking)

Raw documents are too large to embed or fit in a prompt. You need to split them into chunks. The chunk size directly impacts retrieval quality — this is the single most important parameter to tune.

Common splitting strategies:

  • RecursiveCharacterTextSplitter — The default choice. Splits on paragraphs, then sentences, then words. Maintains semantic coherence.
  • Semantic chunking — Uses embeddings to detect topic boundaries. Produces variable-size chunks that respect content structure.
  • Markdown/HTML splitters — Splits on headers (h1, h2, h3). Perfect for documentation and structured content.
  • Token-based splitting — Splits by token count rather than characters. More accurate for LLM context window management.

from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    MarkdownHeaderTextSplitter,
)

# Default: RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = text_splitter.split_documents(pdf_docs)
print(f"Split into {len(chunks)} chunks")

# Markdown-aware splitting
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "h1"),
        ("##", "h2"),
        ("###", "h3"),
    ]
)
md_chunks = md_splitter.split_text(md_content)  # md_content: a raw markdown string

Chunk size guidelines: For Q&A systems, use 256-512 tokens with 50-100 token overlap. For summarization, use 1024-2048 tokens. For code, split by function/class boundaries. Always test multiple sizes and measure retrieval accuracy.
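The overlap guideline is easiest to see in a plain-Python sliding window. This character-based sketch illustrates the mechanics that token-based splitters apply to token sequences; the sizes are illustrative:

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 15) -> list[str]:
    # Slide a window of chunk_size characters, stepping chunk_size - overlap,
    # so adjacent chunks share `overlap` characters at the boundary
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "".join(chr(97 + i % 26) for i in range(250))  # 250 chars of sample text
chunks = chunk_text(text)
print([len(c) for c in chunks])           # [100, 100, 80]
print(chunks[0][-15:] == chunks[1][:15])  # True: boundaries overlap by 15 chars
```

The overlap is what keeps a sentence that straddles a boundary retrievable from at least one chunk.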

Step 3: Embedding Models

Embeddings convert text chunks into dense numerical vectors that capture semantic meaning. Similar texts produce vectors that are close together in vector space, enabling semantic search.
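"Close together in vector space" is usually measured with cosine similarity. A minimal pure-Python illustration with toy 3-dimensional vectors (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Cosine of the angle between two vectors: 1.0 = same direction
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings of related vs unrelated texts
v_query = [0.9, 0.1, 0.0]
v_match = [0.8, 0.2, 0.1]   # semantically close to the query
v_other = [0.0, 0.1, 0.9]   # unrelated content

print(round(cosine_similarity(v_query, v_match), 3))  # close to 1.0
print(round(cosine_similarity(v_query, v_other), 3))  # close to 0.0
```

Vector stores compute exactly this comparison (or an approximation of it) across millions of stored vectors.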

| Model | Dimensions | Max Tokens | Cost (1M tokens) | Best For |
|---|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | 8191 | $0.02 | General purpose, best cost/perf |
| OpenAI text-embedding-3-large | 3072 | 8191 | $0.13 | High-accuracy requirements |
| Cohere embed-v3 | 1024 | 512 | $0.10 | Multilingual, search-optimized |
| sentence-transformers (all-MiniLM) | 384 | 512 | Free (self-hosted) | Privacy-sensitive, offline |
| Ollama (nomic-embed-text) | 768 | 8192 | Free (local) | Local development, no API costs |
| Voyage AI voyage-3 | 1024 | 16000 | $0.06 | Code search, long context |

from langchain_openai import OpenAIEmbeddings

# OpenAI embeddings (recommended for most cases)
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    # dimensions=512  # optional: reduce dimensions for cost savings
)

# Embed a single query
query_vector = embeddings.embed_query("How to deploy a RAG app?")
print(f"Vector dimensions: {len(query_vector)}")  # 1536

# Embed multiple documents (batched automatically)
doc_vectors = embeddings.embed_documents(
    [chunk.page_content for chunk in chunks]
)

Tip: Start with text-embedding-3-small for prototyping. It handles most use cases well at the lowest cost. Switch to a specialized model only if evaluation metrics show a need.

Step 4: Vector Stores

Vector stores index your embeddings for fast similarity search. The choice depends on scale, infrastructure, and query patterns.

| Vector Store | Type | Max Vectors | Hybrid Search | Pricing | Best For |
|---|---|---|---|---|---|
| Pinecone | Managed cloud | Billions | Yes | Free tier + pay-as-you-go | Production SaaS, zero-ops |
| Weaviate | Self-hosted / cloud | Billions | Yes (BM25) | Open source + cloud | Hybrid search, multi-modal |
| Qdrant | Self-hosted / cloud | Billions | Yes (sparse) | Open source + cloud | Filtering, payload search |
| ChromaDB | Embedded / self-hosted | Millions | No | Open source | Prototyping, local dev |
| pgvector | PostgreSQL extension | Millions | With pg_trgm | Free (your Postgres) | Existing Postgres infra |
| FAISS | In-memory library | Billions | No | Free (Meta) | Research, batch processing |

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Create and persist a vector store
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
    collection_name="my_docs"
)

# Load an existing vector store
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=embeddings,
    collection_name="my_docs"
)

# Basic similarity search
results = vectorstore.similarity_search("deployment guide", k=4)
for doc in results:
    print(doc.page_content[:100])

Recommendation: Use ChromaDB or FAISS for prototyping. For production, Pinecone offers the lowest operational burden. If you already run PostgreSQL, pgvector avoids adding a new service. For hybrid search requirements, Weaviate or Qdrant are excellent choices.

Step 5: Retrieval Strategies

Retrieval quality determines RAG quality. A perfect LLM cannot generate correct answers from irrelevant context. Here are the key retrieval strategies:

  • Similarity search (k-NN) — Basic cosine similarity. Fast and simple. Returns the k most similar chunks.
  • Maximum Marginal Relevance (MMR) — Balances relevance with diversity. Prevents returning k near-duplicate chunks.
  • Hybrid search — Combines dense vector search with sparse keyword search (BM25). Catches both semantic matches and exact keyword matches.
  • Re-ranking — Uses a cross-encoder model to re-score initial results. Dramatically improves precision at minimal latency cost (+50-100ms).
  • Contextual compression — Extracts only the relevant sentences from retrieved chunks, reducing noise in the LLM context.

For most production systems, the winning combination is: hybrid search (dense + BM25) followed by cross-encoder re-ranking. This consistently outperforms pure vector similarity across benchmarks.
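MMR from the list above is simple enough to sketch in pure Python. Each candidate is scored as λ·sim(query, doc) − (1−λ)·max sim(doc, already selected); the toy vectors and λ value here are illustrative only:

```python
def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def mmr(query_vec, doc_vecs, k=2, lam=0.7):
    # Greedily pick docs relevant to the query but dissimilar to docs
    # already selected; returns indices into doc_vecs
    selected: list[int] = []
    candidates = list(range(len(doc_vecs)))
    while candidates and len(selected) < k:
        def score(i):
            relevance = dot(query_vec, doc_vecs[i])
            redundancy = max((dot(doc_vecs[i], doc_vecs[j]) for j in selected),
                             default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

query = [1.0, 0.0, 0.0]
docs = [[0.9, 0.4, 0.2],    # relevant
        [0.9, 0.4, 0.1],    # near-duplicate of doc 0
        [0.85, -0.3, 0.4]]  # slightly less relevant but diverse
print(mmr(query, docs, k=2))  # picks doc 0, then diverse doc 2 over duplicate doc 1
```

Plain top-k would have returned docs 0 and 1 and wasted a context slot on redundant content.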

Step 6: Generation with Context

The final step passes retrieved context to the LLM with a well-crafted prompt. The prompt template is critical — it tells the LLM to use only the provided context and to cite sources.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o", temperature=0)

prompt = ChatPromptTemplate.from_template(
    """Answer the question based ONLY on the following context.
If the context does not contain the answer, say "I don't have 
enough information to answer this question."

Context:
{context}

Question: {question}

Answer (cite sources):"""
)

# Build the RAG chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

def format_docs(docs):
    return "\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}]\n"
        + d.page_content for d in docs
    )

rag_chain = (
    {"context": retriever | format_docs,
     "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

answer = rag_chain.invoke("How do I deploy to production?")
print(answer)

Key prompt engineering tips for RAG: (1) Explicitly instruct the model to answer only from context. (2) Include source metadata for citations. (3) Tell the model to say "I don't know" when context is insufficient. (4) Use system messages for consistent behavior.

Advanced RAG Techniques

Once your basic RAG pipeline works, these techniques can significantly improve quality:

HyDE (Hypothetical Document Embeddings): Instead of embedding the user query directly, first ask the LLM to generate a hypothetical answer, then embed that answer for retrieval. This bridges the gap between question-style queries and document-style content. Improves recall by 10-25% on many benchmarks.

Parent-child chunks: Index small chunks (256 tokens) for precise retrieval, but return the parent chunk (1024-2048 tokens) as context. This gives the LLM more surrounding context while maintaining retrieval precision.

Multi-query retrieval: Generate 3-5 reformulations of the user query using an LLM, run retrieval for each, and merge results. Catches relevant documents that a single query phrasing might miss.

Contextual compression: After retrieval, use a smaller LLM to extract only the relevant sentences from each chunk. Reduces noise and fits more useful information in the context window.

Query routing: Classify the query type (factual lookup, comparison, how-to, etc.) and route to different retrieval strategies or knowledge bases. A simple classifier can dramatically improve results.
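Multi-query retrieval needs a way to merge several ranked result lists. Reciprocal rank fusion (RRF) is a common, embedding-free choice; a minimal sketch (document IDs are illustrative):

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    # Each inner list is doc IDs ranked best-first; a doc's fused score is
    # the sum of 1/(k + rank) over every list it appears in
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Results from three reformulations of the same user question
runs = [
    ["doc_a", "doc_b", "doc_c"],
    ["doc_b", "doc_a", "doc_d"],
    ["doc_b", "doc_c", "doc_e"],
]
print(reciprocal_rank_fusion(runs))  # doc_b first: ranked highly in all three runs
```

The constant k=60 is the conventional default; it dampens the advantage of a single first-place ranking over consistent mid-list appearances.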

Complete RAG Implementation (Python)

Here is a complete working RAG pipeline using LangChain, OpenAI, and ChromaDB. This covers document loading, chunking, embedding, indexing, and querying:

# Complete RAG pipeline: pip install langchain langchain-openai
# pip install langchain-chroma unstructured pypdf

import os
os.environ["OPENAI_API_KEY"] = "sk-..."

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_chroma import Chroma
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

# 1. Load documents
docs = PyPDFLoader("knowledge_base.pdf").load()

# 2. Split into chunks
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=50
)
chunks = splitter.split_documents(docs)

# 3. Create vector store with embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    chunks, embeddings, persist_directory="./db"
)

# 4. Build RAG chain
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
llm = ChatOpenAI(model="gpt-4o", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Answer based ONLY on this context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

rag_chain = (
    {"context": retriever | (lambda docs: "\n".join(
        d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)

# 5. Query
print(rag_chain.invoke("What is the deployment process?"))

The code above creates a complete RAG pipeline in under 50 lines. For production, you would add error handling, streaming, caching, and swap ChromaDB for a managed vector store.

Advanced: Hybrid Search with Re-ranking

This example adds BM25 hybrid search and cross-encoder re-ranking for significantly better retrieval quality:

# pip install rank-bm25 sentence-transformers

from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder
import numpy as np

# BM25 sparse retriever
corpus = [doc.page_content for doc in chunks]
tokenized = [doc.split() for doc in corpus]
bm25 = BM25Okapi(tokenized)

def hybrid_search(query, k=10):
    # Dense search via vector store
    dense_results = vectorstore.similarity_search(query, k=k)
    
    # Sparse search via BM25
    bm25_scores = bm25.get_scores(query.split())
    top_bm25_idx = np.argsort(bm25_scores)[-k:][::-1]
    sparse_results = [chunks[i] for i in top_bm25_idx]
    
    # Merge and deduplicate
    seen = set()
    merged = []
    for doc in dense_results + sparse_results:
        if doc.page_content not in seen:
            seen.add(doc.page_content)
            merged.append(doc)
    return merged[:k]

# Cross-encoder re-ranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_k=4):
    pairs = [[query, d.page_content] for d in docs]
    scores = reranker.predict(pairs)
    # sort by score only; Document objects are not comparable on ties
    ranked = sorted(zip(scores, docs), key=lambda x: x[0], reverse=True)
    return [doc for _, doc in ranked[:top_k]]

# Usage: hybrid search + re-rank
candidates = hybrid_search("How to handle auth?", k=10)
final_docs = rerank("How to handle auth?", candidates, top_k=4)

Evaluating RAG Quality with RAGAS

You cannot improve what you do not measure. RAGAS (Retrieval-Augmented Generation Assessment) is the standard framework for evaluating RAG pipelines. It measures four key metrics:

  • Faithfulness — Are the generated claims supported by the retrieved context? (Reduces hallucinations)
  • Answer relevance — Is the answer actually relevant to the question asked?
  • Context precision — Are the retrieved chunks relevant? Are irrelevant chunks excluded?
  • Context recall — Did the retriever find all the relevant information needed to answer?

# pip install ragas

from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "How do I deploy to production?",
        "What authentication methods are supported?",
    ],
    "answer": [  # generated by your RAG pipeline
        rag_chain.invoke("How do I deploy to production?"),
        rag_chain.invoke("What auth methods are supported?"),
    ],
    "contexts": [  # retrieved chunks for each question
        [d.page_content for d in retriever.invoke(
            "How do I deploy to production?")],
        [d.page_content for d in retriever.invoke(
            "What auth methods are supported?")],
    ],
    "ground_truth": [  # human-written correct answers
        "Deploy using Docker with the provided Dockerfile...",
        "OAuth 2.0, API keys, and JWT are supported...",
    ],
}

dataset = Dataset.from_dict(eval_data)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy,
             context_precision, context_recall],
)
print(results)
# {faithfulness: 0.92, answer_relevancy: 0.89,
#  context_precision: 0.85, context_recall: 0.78}

Run RAGAS on a test set of 50-100 questions with known answers. A faithfulness score below 0.8 indicates your RAG system is hallucinating. Context recall below 0.7 means your retriever is missing relevant documents.

Production Considerations

Moving from prototype to production requires addressing several concerns:

Caching: Cache embedding results for identical queries. Cache LLM responses for identical query + context combinations. Use Redis or a simple in-memory LRU cache. This reduces costs by 40-60% in typical workloads.
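The embedding-cache idea can be sketched with a plain dict keyed by a content hash. In production you would back this with Redis; `fake_embed` here is a stand-in for the real embeddings API call:

```python
import hashlib

class EmbeddingCache:
    def __init__(self, embed_fn):
        self._embed_fn = embed_fn
        self._cache: dict[str, list[float]] = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text: str) -> list[float]:
        # Key on a hash of the exact text so identical inputs reuse the vector
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self._cache:
            self.hits += 1
        else:
            self.misses += 1
            self._cache[key] = self._embed_fn(text)
        return self._cache[key]

def fake_embed(text: str) -> list[float]:
    # Stand-in embedder; a real one would call the embeddings API
    return [float(len(text)), 0.0]

cache = EmbeddingCache(fake_embed)
cache.embed("how do I deploy?")
cache.embed("how do I deploy?")  # second call served from cache
print(cache.hits, cache.misses)  # 1 1
```

The same pattern applies one level up: hash (query + retrieved context) to cache full LLM responses.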

Streaming: Stream the LLM response token by token. Users perceive the system as faster when they see partial responses. LangChain and OpenAI both support streaming natively.

Cost optimization: Use text-embedding-3-small (not large) unless evaluation shows a need. Batch embedding requests. Cache aggressively. Consider a smaller model (GPT-4o-mini) for simple queries and route complex ones to GPT-4o.

Monitoring: Log every query, retrieved chunks, and generated answer. Track retrieval latency, generation latency, and user feedback. Set up alerts for faithfulness drops. Tools: LangSmith, Phoenix, Weights & Biases.

Security: Implement access control on documents — users should only retrieve documents they are authorized to see. Sanitize user queries to prevent prompt injection. Never expose raw retrieved chunks in the UI without filtering.

Common Pitfalls and How to Avoid Them

These are the mistakes that trip up most RAG implementations:

  • Chunk size too large — Retrieves irrelevant content that dilutes the answer. Start with 512 tokens and tune down.
  • No overlap between chunks — Sentences at chunk boundaries get split. Always use 10-20% overlap.
  • Ignoring metadata — Without source tracking, you cannot provide citations or filter by document type/date.
  • Wrong embedding model — Using a generic model for specialized content (code, legal, medical). Evaluate domain-specific models.
  • Too many retrieved chunks — Stuffing 20 chunks into the context adds noise. Use 3-5 chunks with re-ranking for better results.
  • No evaluation — Flying blind without RAGAS or similar metrics. Always measure before and after changes.
  • Context window overflow — Not accounting for chunk size x number of chunks x prompt template tokens. Calculate total tokens and stay within limits.
  • Stale embeddings — Updating documents but not re-embedding them. Set up incremental indexing pipelines.
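The context-window arithmetic from the list above is worth automating rather than eyeballing. A quick budget check (the limits and sizes here are illustrative):

```python
def fits_context(chunk_tokens: int, num_chunks: int, prompt_tokens: int,
                 max_output_tokens: int, context_window: int) -> bool:
    # Total input = retrieved chunks + prompt template; reserve room for output
    total_input = chunk_tokens * num_chunks + prompt_tokens
    return total_input + max_output_tokens <= context_window

# 512-token chunks, ~200-token template, 1024 tokens reserved for the answer
print(fits_context(512, 4, 200, 1024, context_window=8192))   # True: k=4 fits
print(fits_context(512, 20, 200, 1024, context_window=8192))  # False: 20 chunks overflow
```

Run this check whenever you change chunk size, k, or the prompt template, not just at launch.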

RAG Frameworks Comparison

Several frameworks simplify RAG development. Here is how the major options compare:

| Framework | Language | Strengths | Weaknesses | Best For |
|---|---|---|---|---|
| LangChain | Python, JS | Largest ecosystem, most integrations, excellent docs | Abstraction overhead, frequent breaking changes | General RAG, rapid prototyping |
| LlamaIndex | Python | Purpose-built for RAG, advanced indexing strategies | Steeper learning curve, smaller community | Complex document indexing |
| Haystack | Python | Pipeline-first design, production-focused | Less flexible for custom flows | Enterprise pipelines |
| Semantic Kernel | C#, Python | Microsoft ecosystem, Azure integration | Newer, fewer examples | .NET / Azure shops |
| Vercel AI SDK | TypeScript | Streaming-first, React integration | Less mature RAG tooling | Next.js web apps |

Our recommendation: Start with LangChain for its ecosystem breadth and documentation. If your use case is heavily document-oriented with complex indexing needs, evaluate LlamaIndex. For TypeScript web applications, the Vercel AI SDK is increasingly capable.

Real-World RAG Use Cases

RAG is not a theoretical concept — it powers thousands of production applications today:

  • Internal knowledge base Q&A: Employees ask questions about company policies, HR documents, engineering runbooks. RAG retrieves from Confluence, Notion, or SharePoint and answers with citations.
  • Customer support chatbots: AI agents search product documentation, past support tickets, and FAQs to resolve customer issues. Reduces human agent workload by 40-60%.
  • Legal document analysis: Lawyers query thousands of contracts, regulations, and case law. RAG finds relevant clauses and summarizes precedents.
  • Code assistant: Developers ask questions about their codebase. RAG searches indexed source code, READMEs, and architecture docs to provide context-aware answers.
  • Medical research: Researchers query PubMed papers, clinical trial data, and drug interaction databases. RAG surfaces relevant studies with proper citations.

Cost Estimation Guide

Understanding RAG costs helps with budgeting. Here is a breakdown for a typical knowledge base with 10,000 documents (averaging 2,000 tokens each):

| Component | Calculation | Monthly Cost |
|---|---|---|
| Initial embedding (one-time) | 20M tokens x $0.02/1M | $0.40 |
| Vector storage (Pinecone) | ~100K vectors, free tier | $0 |
| Query embeddings (1K queries/day) | 30K queries x 100 tokens avg | $0.06 |
| LLM generation (GPT-4o-mini) | 30K queries x 2K tokens avg | $9.00 |
| LLM generation (GPT-4o) | 30K queries x 2K tokens avg | $75.00 |
| Re-ranking (optional, self-hosted) | GPU instance | $50-150 |

Key insight: Embedding costs are negligible. The LLM generation model choice dominates your bill. Use GPT-4o-mini for 80% of queries and route only complex ones to GPT-4o. This "model routing" strategy can cut costs by 60-70%.
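Using the table's numbers, the routing math works out like this (the 80/20 split is the suggested heuristic, not a measured figure):

```python
def blended_monthly_cost(cost_small: float, cost_large: float,
                         small_fraction: float) -> float:
    # Route a fraction of queries to the cheap model, the rest to the large one
    return small_fraction * cost_small + (1 - small_fraction) * cost_large

all_gpt4o = 75.00  # 30K queries/month, all on GPT-4o (from the table)
all_mini = 9.00    # same volume, all on GPT-4o-mini

routed = blended_monthly_cost(all_mini, all_gpt4o, small_fraction=0.8)
savings = 1 - routed / all_gpt4o
print(f"${routed:.2f}/month, {savings:.0%} saved")  # $22.20/month, 70% saved
```

The routing classifier itself can be a cheap model or even a keyword heuristic; its cost is negligible next to the generation savings.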

Monitoring and Observability

A RAG system without monitoring is a system waiting to fail silently. You need visibility into every stage of the pipeline:

  • Retrieval latency: Measure time from query to retrieved chunks. Alert if it exceeds 500ms. Slow retrieval usually means your index needs optimization or your vector store is undersized.
  • Retrieval relevance: Sample 1% of queries and have annotators rate whether retrieved chunks are relevant. Declining relevance often means your knowledge base has changed but embeddings are stale.
  • Generation quality: Track user feedback (thumbs up/down, corrections). Run weekly RAGAS evaluations on a fixed test set. Faithfulness drops indicate prompt or retrieval regressions.
  • Token usage and cost: Track tokens consumed per query (embedding + generation). Set budget alerts. Identify expensive queries that could be served by a smaller model.
  • Error rates: Monitor embedding API errors, vector store timeouts, and LLM failures. Set up fallbacks: if primary retrieval fails, serve a cached response or gracefully degrade.

LangSmith, Phoenix (Arize), and Weights & Biases Traces are the leading observability tools for LLM applications. LangSmith integrates seamlessly with LangChain and provides trace visualization, evaluation, and cost tracking out of the box.

Production Setup: pgvector with PostgreSQL

If your team already runs PostgreSQL, pgvector is the easiest path to production. No new infrastructure, no new vendor — just an extension on your existing database:

-- Enable the extension
CREATE EXTENSION IF NOT EXISTS vector;

-- Create a table for document chunks
CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    metadata JSONB DEFAULT '{}',
    embedding vector(1536),  -- matches text-embedding-3-small
    created_at TIMESTAMP DEFAULT NOW()
);

-- Create an HNSW index for fast similarity search
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Similarity search query
SELECT id, content, metadata,
    1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE 1 - (embedding <=> $1::vector) > 0.7
ORDER BY embedding <=> $1::vector
LIMIT 5;

# Python: LangChain + pgvector
# pip install langchain-postgres psycopg2-binary

from langchain_postgres import PGVector
from langchain_openai import OpenAIEmbeddings

CONNECTION = "postgresql://user:pass@localhost:5432/mydb"

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = PGVector.from_documents(
    documents=chunks,
    embedding=embeddings,
    connection=CONNECTION,
    collection_name="my_docs",
)

# Query with metadata filtering
results = vectorstore.similarity_search(
    "deployment guide",
    k=4,
    filter={"source": "docs/deploy.md"}
)

pgvector supports IVFFlat and HNSW indexes for fast approximate nearest neighbor search. For up to 1 million vectors, it performs well without sharding. Beyond that, consider a dedicated vector store.

RAG vs Agentic RAG

Traditional RAG follows a fixed pipeline: retrieve then generate. Agentic RAG gives the LLM control over the retrieval process itself — it can decide when to search, what to search for, whether to search again, and when it has enough context to answer.

In an agentic RAG system, the LLM acts as a reasoning agent that can call retrieval as a tool. If the first retrieval does not return relevant results, the agent reformulates the query and tries again. It can also combine information from multiple searches, verify facts across sources, and decide when to give up gracefully.

Agentic RAG is particularly powerful for complex questions that require multi-step reasoning: "Compare the pricing of products X, Y, and Z" requires three separate retrievals that a fixed pipeline cannot handle elegantly. Frameworks like LangGraph and CrewAI make it straightforward to build agentic RAG systems.
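The agentic loop boils down to plain control flow. In this sketch, `retrieve`, `is_sufficient`, and `reformulate` are stand-in stubs for what would be LLM-driven tool calls in a real agent:

```python
def agentic_retrieve(question, retrieve, is_sufficient, reformulate,
                     max_attempts=3):
    # Let the agent retry retrieval with reformulated queries until the
    # gathered context looks sufficient, or give up gracefully
    query, gathered = question, []
    for _ in range(max_attempts):
        gathered += retrieve(query)
        if is_sufficient(question, gathered):
            return gathered
        query = reformulate(question, gathered)
    return gathered  # caller should answer "not enough information"

# Stub behavior: the raw question misses, the reformulated query hits
kb = {"pricing of product X": ["X costs $10/mo"]}
retrieve = lambda q: kb.get(q, [])
is_sufficient = lambda question, ctx: len(ctx) > 0
reformulate = lambda question, ctx: "pricing of product X"

print(agentic_retrieve("how much is X?", retrieve, is_sufficient, reformulate))
```

A fixed pipeline would have stopped after the first empty retrieval; the agentic loop recovers by rewriting the query.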

Multi-Modal RAG: Beyond Text

Modern RAG is not limited to text. Multi-modal RAG can index and retrieve from images, tables, and structured data:

  • Table extraction: Use Unstructured or Docling to extract tables from PDFs. Convert tables to markdown or JSON before chunking. Tables contain dense information that pure text extraction misses.
  • Image understanding: Use GPT-4o or Claude to generate text descriptions of diagrams, charts, and screenshots. Index these descriptions as searchable text alongside the original image reference.
  • Structured data: For databases and APIs, generate natural language summaries of schema and sample data. Use text-to-SQL for queries that need real-time database access.
  • Code repositories: Index functions, classes, and docstrings separately. Use code-specific embedding models (Voyage AI voyage-code-3) for better semantic matching on code content.

Multi-modal RAG adds complexity but significantly improves coverage. Start with text-only RAG, then add modalities based on what your users are asking about but not finding answers for.

Document Preprocessing Tips

The quality of your input documents directly determines RAG quality. Garbage in, garbage out. Here are preprocessing best practices:

  • Clean HTML artifacts: Strip navigation, footers, ads, and boilerplate before indexing. Use readability libraries or Unstructured to extract main content.
  • Normalize formatting: Convert all documents to consistent markdown. Standardize headers, lists, and code blocks across sources.
  • Deduplicate: Remove duplicate or near-duplicate documents before indexing. Duplicates waste storage and can cause redundant retrieval results.
  • Add metadata: Enrich each document with title, author, date, category, and source URL. This metadata enables filtering and improves citation quality.
  • Handle OCR content: PDFs from scanned documents need OCR. Use Tesseract or cloud OCR services. Post-process OCR output to fix common errors.
  • Validate encoding: Ensure all text is UTF-8. Mixed encodings cause silent corruption in embeddings and retrieval.
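Exact-duplicate removal from the list above is a one-pass hash check. Near-duplicate detection would need MinHash or embedding similarity; the normalization here is deliberately simple:

```python
import hashlib

def deduplicate(docs: list[str]) -> list[str]:
    # Normalize whitespace and case before hashing so trivial variants collapse
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        normalized = " ".join(doc.lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

docs = [
    "Deploy with Docker.",
    "deploy  with docker.",      # same content, different whitespace/case
    "Authenticate with OAuth.",
]
print(len(deduplicate(docs)))  # 2
```

Run deduplication before embedding, not after: it saves embedding cost as well as storage.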

Getting Started Checklist

Follow this checklist to build your first RAG pipeline from scratch:

  1. Define your knowledge source — Identify the documents, databases, or APIs that contain the answers your users need. Start with a small, well-curated dataset of 50-100 documents.
  2. Choose your tech stack — For most teams: Python + LangChain + OpenAI + ChromaDB. This gets you from zero to working prototype in under an hour.
  3. Set up document loading — Use appropriate loaders for your document types. Preserve metadata (source URL, page number, section title) for citations.
  4. Tune your chunking — Start with RecursiveCharacterTextSplitter at 512 tokens with 50 token overlap. Adjust based on your content type and retrieval accuracy.
  5. Generate and store embeddings — Use text-embedding-3-small. Store in ChromaDB locally or Pinecone for cloud persistence.
  6. Build the retrieval chain — Start with simple similarity search (k=4). Add MMR if you see duplicate results.
  7. Write your prompt template — Instruct the LLM to answer only from context and cite sources. Test with 10 representative questions.
  8. Evaluate with RAGAS — Create a test set of 50+ questions with ground-truth answers. Measure faithfulness, relevance, and recall.
  9. Iterate and improve — Tune chunk size, try hybrid search, add re-ranking. Each change should improve RAGAS scores.
  10. Deploy — Add streaming, caching, error handling, and monitoring. Use LangSmith or a similar observability tool.
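To make step 4 concrete, here is what the chunking defaults look like as a bare sliding-window splitter, assuming roughly 4 characters per token for English text. This is a simplification: `RecursiveCharacterTextSplitter` additionally prefers to break at paragraph and sentence boundaries, which this sketch ignores.

```python
def split_text(text: str, chunk_tokens: int = 512, overlap_tokens: int = 50,
               chars_per_token: int = 4) -> list[str]:
    """Sliding-window splitter: fixed-size chunks with overlap, sized in
    characters as a rough proxy for tokens (~4 chars/token for English)."""
    chunk_size = chunk_tokens * chars_per_token
    step = (chunk_tokens - overlap_tokens) * chars_per_token
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence that straddles a boundary is still retrievable from at least one chunk intact.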

Streaming RAG Responses

Users expect real-time responses. Streaming delivers tokens as they are generated rather than waiting for the complete response. Here is how to implement streaming with LangChain:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

llm = ChatOpenAI(model="gpt-4o", temperature=0, streaming=True)

prompt = ChatPromptTemplate.from_template(
    "Answer based ONLY on this context:\n{context}\n\n"
    "Question: {question}\nAnswer:"
)

# `vectorstore` is the store built during ingestion (e.g. ChromaDB)
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

rag_chain = (
    {"context": retriever | (lambda docs: "\n".join(
        d.page_content for d in docs)),
     "question": RunnablePassthrough()}
    | prompt | llm | StrOutputParser()
)

# Synchronous streaming
for chunk in rag_chain.stream("How do I set up auth?"):
    print(chunk, end="", flush=True)

# Async streaming (for web frameworks)
async def stream_response(question: str):
    async for chunk in rag_chain.astream(question):
        yield chunk  # Send to client via SSE or WebSocket

# FastAPI example
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.get("/ask")
async def ask(q: str):
    return StreamingResponse(
        stream_response(q),
        media_type="text/event-stream"
    )

Streaming cuts perceived latency from a 3-5 second wait for the full response to under 500 ms to first token. Always implement streaming in user-facing applications; the impact on user experience is dramatic.

Frequently Asked Questions

What is RAG in AI and how does it work?

RAG (Retrieval-Augmented Generation) is an architecture that enhances LLM responses by retrieving relevant documents from a knowledge base at query time and injecting them into the prompt. The LLM then generates answers grounded in this retrieved context rather than relying solely on its training data, reducing hallucinations and enabling access to current information.

When should I use RAG vs fine-tuning?

Use RAG when you need the LLM to answer from specific documents, databases, or frequently updated content. Use fine-tuning when you need to change the model's behavior or tone, or teach it specialized reasoning patterns. RAG is faster to implement, cheaper to maintain, and better at providing citations. Most production systems start with RAG and add fine-tuning only when needed.

What is the best chunk size for RAG?

There is no universal best chunk size. For Q&A systems, 256-512 tokens with 50-100 token overlap works well. For summarization, 1024-2048 tokens. For code, split by function or class boundaries. The key is to measure: create a test set, try different chunk sizes, and pick the one with the highest retrieval accuracy using RAGAS metrics.

Which vector database should I use for RAG?

For prototyping, use ChromaDB (embedded, zero config) or FAISS (in-memory). For production SaaS, Pinecone offers the best managed experience. If you already use PostgreSQL, pgvector avoids new infrastructure. For advanced hybrid search, Weaviate or Qdrant are strong choices. The best vector DB is the one that fits your existing infrastructure.

How do I reduce hallucinations in RAG?

Five strategies: (1) Improve retrieval quality with hybrid search and re-ranking so the LLM gets better context. (2) Use explicit prompt instructions telling the model to answer only from provided context. (3) Add citations — require the model to reference specific source documents. (4) Use contextual compression to remove irrelevant content from chunks. (5) Measure faithfulness with RAGAS and iterate.
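Strategies (2) and (3) come down to prompt construction. The sketch below shows one way to format retrieved chunks with source IDs and instruct the model to answer only from them; the exact wording and the `build_prompt` helper are illustrative, not a canonical template:

```python
GROUNDED_PROMPT = """You are a careful assistant. Answer the question using ONLY
the context below. Cite the source id in [brackets] after each claim.
If the context does not contain the answer, say "I don't know."

Context:
{context}

Question: {question}
Answer:"""

def build_prompt(question: str, chunks: list[tuple[str, str]]) -> str:
    """Format retrieved (source_id, text) chunks into the grounding prompt."""
    context = "\n\n".join(f"[{sid}] {text}" for sid, text in chunks)
    return GROUNDED_PROMPT.format(context=context, question=question)
```

Giving each chunk an explicit ID is what makes citation checking possible downstream: you can verify that every `[id]` the model emits actually appeared in the retrieved context.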

What embedding model is best for RAG in 2026?

OpenAI text-embedding-3-small is the best general-purpose choice at $0.02 per million tokens. For multilingual use cases, Cohere embed-v3 excels. For private or offline deployments, sentence-transformers or Ollama nomic-embed-text are strong free options. For code search, Voyage AI voyage-code-3 leads. Always benchmark on your specific data before choosing.

How do I evaluate RAG pipeline quality?

Use the RAGAS framework which measures four metrics: faithfulness (are answers supported by context), answer relevance (does the answer address the question), context precision (are retrieved chunks relevant), and context recall (did retrieval find all needed information). Create a test set of 50-100 questions with ground-truth answers and run RAGAS evaluation after every pipeline change.

What is hybrid search in RAG and why is it better?

Hybrid search combines dense vector similarity search with sparse keyword search like BM25. Dense search catches semantic meaning (synonyms, paraphrases) while sparse search catches exact keyword matches (product names, error codes, acronyms). Together they outperform either approach alone. Most modern vector stores support hybrid search natively.
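A common way to merge the two ranked result lists is Reciprocal Rank Fusion (RRF), which several hybrid-capable stores use internally. A minimal sketch (the document IDs are placeholders; `k=60` is the conventional smoothing constant):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists: each doc scores sum(1 / (k + rank)) over every
    list it appears in, so docs ranked well by both retrievers rise to the top."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d2"]   # vector-similarity order
sparse = ["d1", "d4", "d3"]  # BM25 order
print(reciprocal_rank_fusion([dense, sparse]))  # → ['d1', 'd3', 'd4', 'd2']
```

Note that `d1`, which appears high in both lists, beats `d3`, which tops only one: RRF rewards agreement between retrievers without needing to calibrate their incomparable raw scores.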

RAG is the most practical and impactful architecture for building LLM applications that need grounded, accurate, and up-to-date answers. Start simple with a basic pipeline (LangChain + OpenAI + ChromaDB), measure with RAGAS, then iterate. Add hybrid search and re-ranking when you need better retrieval. The ecosystem is mature and the tools are production-ready — the key differentiator is how well you tune your chunking, retrieval, and prompts for your specific data and use case.

