
Complete Guide to Vector Databases 2026: Pinecone vs Weaviate vs Qdrant vs ChromaDB

30 min read, by DevToolBox Team

TL;DR

For production at scale, choose Pinecone (managed) or Qdrant/Milvus (self-hosted). For prototyping, use ChromaDB. If you already use PostgreSQL, pgvector is the simplest path. Weaviate excels at hybrid search combining vectors with keyword filtering.

Key Takeaways

  • Vector databases store embeddings (numerical representations of data) and enable fast similarity search across millions or billions of vectors.
  • HNSW is the dominant indexing algorithm for most use cases, offering the best balance of speed and recall.
  • Managed solutions (Pinecone) minimize ops overhead; open-source options (Qdrant, Milvus, Weaviate) offer more control and lower cost at scale.
  • pgvector lets you add vector search to existing PostgreSQL databases without introducing new infrastructure.
  • For RAG pipelines, tight integration with LangChain or LlamaIndex is critical — all major vector databases support both.
  • Embedding model choice matters more than database choice for search quality — use text-embedding-3-large for best results.

Vector databases have become the backbone of modern AI applications. From retrieval-augmented generation (RAG) to semantic search, recommendation engines, and anomaly detection, vector databases store and query high-dimensional embeddings at scale. This guide compares the leading vector databases in 2026 — Pinecone, Weaviate, Qdrant, ChromaDB, pgvector, Milvus, and FAISS — covering architecture, performance, pricing, and practical code examples to help you choose the right solution.

What Are Vector Databases and Why They Matter

A vector database is a specialized storage system designed to index, store, and query high-dimensional vectors (embeddings). Traditional databases organize data in rows and columns with exact-match queries. Vector databases instead organize data by similarity in a continuous vector space, enabling "find me things similar to this" queries that power modern AI.

Why do we need them? Large language models and embedding models convert text, images, audio, and code into dense numerical vectors (typically 384 to 3072 dimensions). A sentence like "How to deploy a Docker container" becomes a float array like [0.023, -0.156, 0.891, ...]. Finding the most similar vectors among millions requires specialized indexing structures that traditional B-tree or hash indexes cannot efficiently handle.

The vector database market has exploded since 2023 with the rise of LLM applications. Every RAG system, semantic search engine, and AI-powered recommendation system relies on vector similarity search. Understanding these tools is now essential for any developer building AI-powered features.

# Traditional DB: exact match
SELECT * FROM products WHERE category = 'electronics'

# Vector DB: similarity search
# "Find products semantically similar to this query"
query_vector = model.encode("wireless noise-canceling headphones")
results = collection.query(query_vector, top_k=10)
# Returns: ranked list of most semantically similar products
# Even matches "Bluetooth ANC over-ear headset" (different words, same meaning)

How Vector Search Works: Embeddings and Similarity

Vector search involves three stages: generating embeddings with a model, indexing them in a database for fast retrieval, and querying with a similarity metric to find nearest neighbors.

# Step 1: Generate embeddings with an embedding model
from openai import OpenAI
client = OpenAI()

text = "How to deploy a Docker container"
response = client.embeddings.create(
    model="text-embedding-3-small",
    input=text
)
vector = response.data[0].embedding  # [0.023, -0.156, 0.891, ...]
# Length: 1536 float values

# Step 2: Store vector in database with metadata
# Step 3: Query by computing similarity to find nearest neighbors

Similarity Metrics Explained

  • Cosine Similarity: Measures the angle between two vectors, ignoring magnitude. Best for text embeddings, where direction matters more than length. Range: -1 to 1 (1 = identical direction). The most widely used metric.
  • Euclidean Distance (L2): Measures straight-line distance between two points in vector space. Best when magnitude matters. Lower values = more similar. Good for image embeddings and spatial data.
  • Dot Product (Inner Product): Combines direction and magnitude. Fastest to compute. Works well when vectors are already normalized. Higher values = more similar. Preferred for performance-critical applications.

How to choose: Use cosine similarity as the default for text. Use Euclidean for spatial or image data. Use dot product when vectors are pre-normalized and you need maximum speed.

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    return np.linalg.norm(np.array(a) - np.array(b))

def dot_product(a, b):
    return np.dot(a, b)

v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.4, 0.6]
print(f"Cosine: {cosine_similarity(v1, v2):.4f}")    # 0.9939
print(f"Euclidean: {euclidean_distance(v1, v2):.4f}") # 0.1732
print(f"Dot product: {dot_product(v1, v2):.4f}")      # 0.4400

Vector Database Comparison Overview

| Database | Type | Language | Indexing | Hybrid Search | Cloud/Managed |
|---|---|---|---|---|---|
| Pinecone | Managed SaaS | - | Proprietary | Sparse+Dense | Yes (only) |
| Weaviate | Open Source | Go | HNSW, Flat | Native BM25+Vector | Weaviate Cloud |
| Qdrant | Open Source | Rust | HNSW | Payload Filtering | Qdrant Cloud |
| ChromaDB | Open Source | Python | HNSW (hnswlib) | Metadata Filter | Chroma Cloud |
| pgvector | PG Extension | C | IVFFlat, HNSW | SQL WHERE | Any Managed PG |
| Milvus | Open Source | Go/C++ | HNSW, IVF, DiskANN | Sparse+Dense | Zilliz Cloud |
| FAISS | Library | C++/Python | HNSW, IVF, PQ | N/A | N/A |

Pinecone: Fully Managed Vector Database

Pinecone is a fully managed, serverless vector database. You do not run any infrastructure — Pinecone handles sharding, replication, scaling, and backups. It offers a simple API with namespaces for multi-tenancy. The serverless pricing model charges per read/write unit and per GB stored, making it cost-effective for bursty workloads. Pinecone supports metadata filtering, sparse-dense hybrid search, and integrates natively with all major AI frameworks.

# Pinecone: Serverless vector database
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("my-index")

# Upsert vectors with metadata
index.upsert(vectors=[
    {"id": "doc1", "values": embedding1,
     "metadata": {"source": "wiki", "topic": "docker"}},
    {"id": "doc2", "values": embedding2,
     "metadata": {"source": "docs", "topic": "kubernetes"}},
])

# Query with metadata filter
results = index.query(
    vector=query_embedding,
    top_k=5,
    filter={"source": {"$eq": "docs"}},
    include_metadata=True
)
for match in results.matches:
    print(f"Score: {match.score:.4f}, ID: {match.id}")

Pros: Zero ops burden, automatic scaling, strong consistency, serverless pricing, excellent documentation, namespace isolation for multi-tenancy.

Cons: Vendor lock-in, higher cost at very large scale (10M+ vectors), limited query flexibility compared to open-source options, availability limited to select US/EU cloud regions, no self-hosted option.

Weaviate: Hybrid Search Pioneer

Weaviate is an open-source vector database written in Go. Its standout feature is native hybrid search that combines dense vector search with BM25 keyword search in a single query. It supports multi-modal data (text, images, audio) with built-in vectorizer modules for OpenAI, Cohere, Hugging Face, and others. Weaviate also supports generative search — combining retrieval with LLM generation in one API call.

# Weaviate: Hybrid search (BM25 + vector)
import weaviate
import weaviate.classes.query as wq

client = weaviate.connect_to_local()  # or connect_to_weaviate_cloud()
collection = client.collections.get("Article")

# Hybrid search: combine vector similarity + keyword matching
results = collection.query.hybrid(
    query="machine learning model deployment",
    alpha=0.75,  # 0 = pure BM25, 1 = pure vector
    limit=10,
    return_metadata=wq.MetadataQuery(score=True)
)
for obj in results.objects:
    print(f"{obj.properties['title']} — score: {obj.metadata.score:.4f}")

client.close()

Pros: True hybrid search (BM25+vector), multi-modal support, GraphQL and REST APIs, built-in vectorizers, generative search module, strong community and documentation.

Cons: Higher memory usage than Qdrant/Milvus, Go codebase harder to extend for Python-heavy teams, Weaviate Cloud Services pricing can add up, complex schema management.

Qdrant: Rust-Powered Performance Leader

Qdrant is an open-source vector database built in Rust for maximum performance and memory efficiency. It supports rich payload filtering with indexed fields, scalar and product quantization for memory reduction, and distributed deployment with automatic sharding. Qdrant consistently ranks at or near the top in independent ANN benchmarks for both speed and recall.

# Qdrant: Rust-powered high-performance vector search
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")

# Create collection with HNSW index
client.create_collection(
    collection_name="documents",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)

# Upsert vectors with payload (metadata)
client.upsert(collection_name="documents", points=[
    PointStruct(id=1, vector=emb1, payload={"title": "Docker Guide", "lang": "en"}),
    PointStruct(id=2, vector=emb2, payload={"title": "K8s Tutorial", "lang": "en"}),
])

# Search with payload filtering
results = client.query_points(
    collection_name="documents",
    query=query_vector,
    query_filter=Filter(must=[
        FieldCondition(key="lang", match=MatchValue(value="en"))
    ]),
    limit=5
).points

Pros: Exceptional performance and low latency, smallest memory footprint with quantization, rich payload filtering, gRPC and REST APIs, simple Docker deployment, very active development cycle.

Cons: Smaller community than Weaviate or Milvus, no built-in BM25 hybrid search (requires external full-text engine like Elasticsearch), Qdrant Cloud available in limited regions.

ChromaDB: Lightweight and Python-Native

ChromaDB is an open-source embedding database designed for simplicity and developer experience. It runs in-process with your Python application — no separate server needed for development. ChromaDB automatically handles embedding generation if you provide documents instead of vectors. It is the most popular choice for prototyping RAG applications, tutorials, and local development.

# ChromaDB: Simplest vector database for prototyping
import chromadb

# In-memory (dev) or persistent (production-lite)
client = chromadb.PersistentClient(path="./chroma_db")

collection = client.create_collection("my_docs")

# Add documents — ChromaDB auto-embeds with default model
collection.add(
    documents=[
        "Docker is a containerization platform for packaging apps",
        "Kubernetes orchestrates containers across clusters",
        "Nginx is a high-performance web server and reverse proxy"
    ],
    ids=["doc1", "doc2", "doc3"],
    metadatas=[{"topic": "docker"}, {"topic": "k8s"}, {"topic": "nginx"}]
)

# Query by text (auto-embeds the query too)
results = collection.query(
    query_texts=["container orchestration tools"],
    n_results=2
)
print(results["documents"])  # [[doc2, doc1]]

Pros: Simplest API of any vector database, runs embedded in Python (zero config), auto-embeds documents, great for prototyping, seamless LangChain/LlamaIndex integration.

Cons: Not designed for production scale (struggles above 1M vectors), limited distributed capabilities, fewer indexing options, no enterprise features like RBAC or audit logs.

pgvector: Vector Search in PostgreSQL

pgvector is a PostgreSQL extension that adds vector similarity search to your existing Postgres database. If your application already uses PostgreSQL, pgvector eliminates the need for a separate vector database entirely. It supports IVFFlat and HNSW indexes, and you can combine vector similarity with standard SQL WHERE clauses, JOINs, and transactions in a single query.

-- pgvector: Vector search inside PostgreSQL
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
    id SERIAL PRIMARY KEY,
    title TEXT NOT NULL,
    content TEXT,
    embedding vector(768)  -- 768-dimensional vector column
);

-- Create HNSW index for fast cosine search
CREATE INDEX ON documents
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Find 5 most similar documents with SQL filtering
SELECT id, title,
       1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE title ILIKE '%docker%'  -- combine with any SQL
ORDER BY embedding <=> $1::vector
LIMIT 5;

Pros: No new infrastructure needed, familiar SQL interface, ACID transactions, combine vector search with relational queries and JOINs, mature PostgreSQL ecosystem and tooling.

Cons: Slower than purpose-built vector databases at scale (5M+ vectors), limited to PostgreSQL, fewer advanced features (no built-in quantization), HNSW index build can be slow on large datasets.

Milvus: Enterprise-Scale Open Source

Milvus is an open-source vector database designed for billion-scale similarity search. Backed by Zilliz, it features a cloud-native distributed architecture with separate storage and compute layers, GPU acceleration through NVIDIA partnership, and support for multiple index types including DiskANN for datasets larger than RAM. Zilliz Cloud offers a fully managed version.

# Milvus: Billion-scale vector search
from pymilvus import connections, Collection, FieldSchema
from pymilvus import CollectionSchema, DataType

connections.connect("default", host="localhost", port="19530")

fields = [
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
    FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
]
schema = CollectionSchema(fields, description="Document store")
collection = Collection("documents", schema)

# Build HNSW index
collection.create_index("embedding", {
    "metric_type": "COSINE",
    "index_type": "HNSW",
    "params": {"M": 16, "efConstruction": 256}
})
collection.load()

# Search
results = collection.search(
    data=[query_vector], anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=5, output_fields=["title"]
)

Pros: Billion-scale production-proven, GPU acceleration (NVIDIA RAFT), multiple index types including DiskANN, strong enterprise features, active CNCF project, hybrid sparse-dense search.

Cons: Complex deployment stack (requires etcd, MinIO, Pulsar for distributed mode), steep learning curve, heavy resource requirements even for small datasets, API can be verbose.

FAISS: Meta AI Research Library

FAISS (Facebook AI Similarity Search) is not a database but a library for efficient similarity search and clustering of dense vectors. It provides the core algorithms (HNSW, IVF, PQ) that many vector databases use internally. Use FAISS directly when you need maximum control over indexing, when building a custom solution, or when you need GPU-accelerated search without database overhead.

# FAISS: Low-level similarity search library by Meta
import faiss
import numpy as np

dimension = 768
num_vectors = 1_000_000

# Create HNSW index
index = faiss.IndexHNSWFlat(dimension, 32)  # M=32 neighbors per layer
index.hnsw.efConstruction = 200
index.hnsw.efSearch = 64

# Add vectors (must be float32 numpy arrays)
vectors = np.random.rand(num_vectors, dimension).astype("float32")
faiss.normalize_L2(vectors)  # normalize for cosine similarity
index.add(vectors)

# Search: find 10 nearest neighbors
query = np.random.rand(1, dimension).astype("float32")
faiss.normalize_L2(query)
distances, indices = index.search(query, k=10)
print(f"Nearest IDs: {indices[0]}, Distances: {distances[0]}")

Indexing Algorithms Deep Dive

The indexing algorithm determines how vectors are organized for fast approximate nearest neighbor (ANN) retrieval. The choice directly impacts query latency, memory usage, recall (accuracy), and build time.

  • HNSW (Hierarchical Navigable Small World): The most popular algorithm in production. Builds a multi-layer proximity graph where each node connects to nearby neighbors. Offers excellent recall (>99%) with sub-millisecond queries. Trade-off: requires the full graph in RAM, so memory usage is high.
  • IVF (Inverted File Index): Partitions vectors into clusters using k-means. At query time, only the nearest clusters (nprobe) are searched. Good for large datasets where you can tolerate slightly lower recall for significantly faster speed and much lower memory usage.
  • PQ (Product Quantization): Compresses vectors by splitting them into sub-vectors and quantizing each to a codebook entry. Achieves 8-32x memory compression. Often combined with IVF (IVF-PQ) for billion-scale deployments. Trade-off: noticeable recall loss, especially on smaller datasets.
  • ScaNN (Scalable Nearest Neighbors): Developed by Google, combines tree-based partitioning with anisotropic vector quantization optimized for maximum inner product search. Achieves excellent recall-vs-speed trade-offs. Used internally at Google Search and YouTube.
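To make the IVF idea above concrete, here is a toy pure-NumPy sketch (all names and sizes are illustrative, not any library's API). Real IVF indexes learn centroids with k-means; this sketch simply samples them from the data, then probes only the nprobe nearest inverted lists at query time instead of scanning every vector.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_vectors, n_clusters, nprobe = 64, 5000, 16, 4
data = rng.standard_normal((n_vectors, dim)).astype("float32")

# Centroids would come from k-means in a real IVF index; here we sample them
centroids = data[rng.choice(n_vectors, n_clusters, replace=False)]

# Assign every vector to its nearest centroid -> inverted lists
assign = np.argmin(((data[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
inverted = {c: np.where(assign == c)[0] for c in range(n_clusters)}

def ivf_search(query: np.ndarray, k: int = 5) -> np.ndarray:
    # Probe only the nprobe nearest clusters instead of scanning all vectors
    probe = np.argsort(((centroids - query) ** 2).sum(-1))[:nprobe]
    cand = np.concatenate([inverted[c] for c in probe])
    dists = ((data[cand] - query) ** 2).sum(-1)
    return cand[np.argsort(dists)[:k]]

print(ivf_search(data[42])[0])  # 42: the query's own cluster is always probed
```

The speed win comes from the candidate set: with nprobe=4 of 16 clusters, each query touches roughly a quarter of the vectors; recall drops only for true neighbors that fall in unprobed clusters.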

Embedding Models for Vector Databases

The embedding model you choose directly impacts search quality — often more than the database itself. Here are the leading models in 2026 ranked by quality on the MTEB benchmark:

# Compare embedding models on your data
from openai import OpenAI
import cohere
from sentence_transformers import SentenceTransformer

# OpenAI (best quality, paid API)
openai = OpenAI()
resp = openai.embeddings.create(
    model="text-embedding-3-small", input="test query"
)
oai_vec = resp.data[0].embedding  # 1536 dims

# Cohere (multilingual, paid API)
co = cohere.Client("YOUR_API_KEY")
resp = co.embed(
    texts=["test query"],
    model="embed-english-v3.0",
    input_type="search_query"
)
cohere_vec = resp.embeddings[0]  # 1024 dims

# Open-source (free, self-hosted)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
oss_vec = model.encode("test query")  # 1024 dims

print(f"OpenAI dims: {len(oai_vec)}")
print(f"Cohere dims: {len(cohere_vec)}")
print(f"BGE dims: {len(oss_vec)}")

| Model | Dims | Provider | Best For | Cost |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | OpenAI | Highest quality text | $0.13/1M tokens |
| text-embedding-3-small | 1536 | OpenAI | Cost-efficient general use | $0.02/1M tokens |
| embed-v3 | 1024 | Cohere | Multilingual (100+ langs) | $0.10/1M tokens |
| BGE-large-en-v1.5 | 1024 | BAAI | Best open-source English | Free (self-host) |
| E5-large-v2 | 1024 | Microsoft | Strong open-source | Free (self-host) |
| Gemini embedding | 768 | Google | Multimodal (text+image) | $0.004/1M tokens |

Use Cases for Vector Databases

RAG (Retrieval-Augmented Generation)

The most common use case in 2026. Store document chunks as vectors. When a user asks a question, retrieve the most relevant chunks and feed them to an LLM as context. This grounds the LLM response in your actual data and dramatically reduces hallucination. Every enterprise chatbot, knowledge base assistant, and document Q&A system uses this pattern.

# Complete RAG pipeline: LangChain + Qdrant
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain.chains import RetrievalQA

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = QdrantVectorStore.from_existing_collection(
    embedding=embeddings,
    collection_name="knowledge_base",
    url="http://localhost:6333"
)

# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)
answer = qa_chain.invoke("How do I configure Nginx reverse proxy?")
print(answer["result"])

Semantic Search

Go beyond keyword matching to search by meaning. A query for "how to fix a broken pipe" returns both plumbing articles AND Linux socket error guides because the embeddings capture semantic relationships. E-commerce, documentation sites, and support ticket systems all benefit from semantic search.

# Semantic search: Find documents by meaning, not keywords
from qdrant_client import QdrantClient
from openai import OpenAI

openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

def semantic_search(query: str, collection: str, top_k: int = 5):
    # Embed the query
    response = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    )
    query_vector = response.data[0].embedding

    # Search by similarity
    results = qdrant.query_points(
        collection_name=collection,
        query=query_vector,
        limit=top_k
    )
    return [
        {"title": r.payload["title"], "score": r.score}
        for r in results.points
    ]

# Query: "fix broken pipe error"
# Returns: Linux SIGPIPE guide, plumbing repair tips, Node.js stream errors
# All semantically related despite different keywords

Recommendation Systems

Represent users and items as vectors in the same embedding space. Find similar items by nearest-neighbor search. Combine with collaborative filtering signals stored as metadata for hybrid recommendations that are both content-aware and behavior-aware.

# Recommendation system with vector similarity
# Embed product descriptions and find similar items

# Given a product the user liked:
liked_product_vector = get_embedding("Ergonomic mechanical keyboard with RGB")

# Find similar products
similar = qdrant.query_points(
    collection_name="products",
    query=liked_product_vector,
    query_filter={
        "must": [{"key": "in_stock", "match": {"value": True}}],
        "must_not": [{"key": "id", "match": {"value": liked_product_id}}]
    },
    limit=10
)
# Returns: similar keyboards, ergonomic mice, desk accessories
# Items are ranked by semantic similarity to the liked product

Anomaly Detection

Embed normal behavior patterns as vectors. New observations far from any cluster in the vector space indicate anomalies. Used in production for fraud detection, network intrusion detection, manufacturing quality control, and medical imaging analysis.

# Anomaly detection with vector distance
import numpy as np

# Embed a new transaction
new_transaction = embed("$15,000 wire transfer to unknown account at 3AM")

# Search for similar past transactions
neighbors = qdrant.query_points(
    collection_name="transactions",
    query=new_transaction,
    limit=5
)

# If nearest neighbors are far away, flag as anomaly
max_score = max(r.score for r in neighbors.points)
ANOMALY_THRESHOLD = 0.7  # cosine similarity threshold

if max_score < ANOMALY_THRESHOLD:
    alert(f"Anomaly detected! Nearest similarity: {max_score:.4f}")
    # Trigger fraud review pipeline

Self-Hosted vs Managed: Decision Guide

The build vs buy decision depends on your team size, operational maturity, data sovereignty requirements, and scale. Here is a practical comparison to help you decide:

| Factor | Managed | Self-Hosted |
|---|---|---|
| Setup Time | Minutes | Hours to days |
| Ops Burden | None (vendor handles) | Monitoring, backups, upgrades |
| Cost (small scale) | $50-200/mo | $20-60/mo (VPS) |
| Cost (large scale) | $500-5000/mo | $200-1000/mo |
| Data Sovereignty | Provider regions only | Full control, any location |
| Customization | Limited to API options | Full source code access |
| Auto-scaling | Built-in | Manual or K8s-based |
| SLA Guarantee | Yes (99.9%+) | Self-managed uptime |

Performance Benchmarks (2026)

Based on the ann-benchmarks project and independent testing with 1M vectors (768 dimensions, HNSW index, recall@10 > 0.95). Results vary by hardware — these numbers use a 4-core, 16GB RAM server:

| Database | p99 Latency | QPS | Memory (1M vec) | Index Build |
|---|---|---|---|---|
| FAISS (library) | ~1ms | ~12,000 | ~3.0 GB | ~50s |
| Qdrant | ~2ms | ~8,500 | ~3.2 GB | ~45s |
| Milvus | ~3ms | ~7,200 | ~3.8 GB | ~60s |
| Weaviate | ~4ms | ~5,800 | ~4.1 GB | ~55s |
| Pinecone | ~5ms | ~6,000 | Managed | Managed |
| pgvector (HNSW) | ~8ms | ~2,500 | ~4.5 GB | ~120s |
| ChromaDB | ~10ms | ~1,800 | ~3.5 GB | ~40s |

Note: These benchmarks are approximate and vary significantly with hardware, data distribution, index parameters, and filter complexity. Always benchmark with your own data and workload patterns. FAISS is a library (not a database) and lacks persistence, replication, and filtering — raw speed comparisons are not apples-to-apples.

Framework Integration: LangChain, LlamaIndex, Haystack

All major vector databases integrate with the three dominant AI/LLM orchestration frameworks. Here are unified examples showing how each database plugs into LangChain — the most popular framework:

# LangChain unified retriever interface for all vector DBs

# --- Pinecone ---
from langchain_pinecone import PineconeVectorStore
vs = PineconeVectorStore(index_name="docs", embedding=embeddings)

# --- Weaviate ---
from langchain_weaviate import WeaviateVectorStore
vs = WeaviateVectorStore(
    client=weaviate_client, index_name="Docs", embedding=embeddings)

# --- Qdrant ---
from langchain_qdrant import QdrantVectorStore
vs = QdrantVectorStore.from_existing_collection(
    url="http://localhost:6333",
    collection_name="docs", embedding=embeddings)

# --- ChromaDB ---
from langchain_chroma import Chroma
vs = Chroma(persist_directory="./chroma", embedding_function=embeddings)

# --- pgvector ---
from langchain_postgres import PGVector
vs = PGVector(
    connection=conn_string, embeddings=embeddings,
    collection_name="docs")

# All use the SAME retriever interface:
retriever = vs.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("How to configure CORS headers?")

LlamaIndex Integration

# LlamaIndex: Alternative to LangChain for RAG
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

# Connect to Qdrant
qdrant = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
    client=qdrant, collection_name="llama_docs"
)

# Load documents and build an index backed by the Qdrant store
documents = SimpleDirectoryReader("./data/").load_data()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)

# Query with natural language
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query(
    "What are the best practices for Docker networking?"
)
print(response.response)
print(f"Sources: {len(response.source_nodes)} chunks retrieved")

Haystack Integration

# Haystack: Pipeline-based approach
from haystack import Pipeline
from haystack.components.builders import PromptBuilder
from haystack.components.generators import OpenAIGenerator
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever

document_store = QdrantDocumentStore(
    url="http://localhost:6333",
    index="haystack_docs",
    embedding_dim=768
)

retriever = QdrantEmbeddingRetriever(document_store=document_store)
prompt_builder = PromptBuilder(template="""
Answer the question using the context below.
Context:
{% for doc in documents %}{{ doc.content }}{% endfor %}
Question: {{ query }}
""")
generator = OpenAIGenerator(model="gpt-4o")

# Build pipeline: retriever -> prompt builder -> LLM
pipe = Pipeline()
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("generator", generator)
pipe.connect("retriever.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "generator.prompt")

result = pipe.run({
    "retriever": {"query_embedding": query_vec},
    "prompt_builder": {"query": "How do I configure Nginx?"}
})

Docker Quick Deploy: Qdrant, Weaviate, Milvus

# Deploy Qdrant with Docker (simplest)
docker run -p 6333:6333 -p 6334:6334 \
  -v $(pwd)/qdrant_storage:/qdrant/storage:z \
  qdrant/qdrant:latest

# Deploy Weaviate with Docker Compose
# docker-compose.yml:
services:
  weaviate:
    image: semitechnologies/weaviate:latest
    ports:
      - "8080:8080"
      - "50051:50051"
    environment:
      QUERY_DEFAULTS_LIMIT: 25
      AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
      PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
      DEFAULT_VECTORIZER_MODULE: "none"
      CLUSTER_HOSTNAME: "node1"
    volumes:
      - weaviate_data:/var/lib/weaviate

# Deploy Milvus standalone with Docker Compose
# curl -O https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh
# bash standalone_embed.sh start
# Milvus Standalone runs on port 19530

Cost Analysis and Pricing Comparison

Monthly cost estimate for 1M vectors (768 dimensions) with 100 queries per second sustained. Prices as of early 2026:

| Solution | Managed Cost | Self-Hosted Cost | Notes |
|---|---|---|---|
| Pinecone | $70-100/mo | N/A | Serverless: pay per read/write unit |
| Weaviate Cloud | $100-300/mo | $40-80/mo | Tiered: Sandbox free, Standard, Business |
| Qdrant Cloud | $65-150/mo | $30-60/mo | Most affordable managed option |
| Zilliz (Milvus) | $100-400/mo | $60-120/mo | GPU instances increase cost |
| pgvector | From $15/mo* | $0 extra | * = managed Postgres cost (RDS, Supabase) |
| ChromaDB | Beta pricing | $20-40/mo | Primarily for dev and prototyping |

When NOT to Use a Vector Database

Vector databases are powerful but not always the right choice. Avoid them in these scenarios:

1. Exact match only: If you only need precise keyword or ID lookups, a traditional database with full-text search (PostgreSQL, Elasticsearch) is simpler and faster.
2. Small datasets under 10K items: For tiny datasets, brute-force cosine similarity in NumPy or a simple SQL query is often fast enough without the overhead of a vector database.
3. Structured data queries: If your queries are primarily filtering, sorting, and aggregating structured fields, a relational database or data warehouse is more appropriate.
4. Real-time streaming: Vector databases are optimized for batch-insert and query workloads, not high-frequency streaming updates. Use a streaming platform with periodic batch updates instead.
5. When explainability is required: Vector similarity search is a black box — you cannot easily explain why two items are similar. If audit trails matter, consider hybrid approaches.
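For point 2 above, "brute force is fast enough" looks like this: a single matrix-vector product over normalized embeddings gives exact top-k cosine results with no index, no server, and no recall loss. A minimal NumPy sketch (the random corpus and the `top_k_cosine` helper are illustrative):

```python
import numpy as np

def top_k_cosine(query: np.ndarray, corpus: np.ndarray, k: int = 5):
    """Exact top-k cosine similarity by brute force -- fine below ~10K items."""
    corpus_n = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    scores = corpus_n @ query_n          # one matrix-vector product
    idx = np.argsort(-scores)[:k]        # indices of the k best matches
    return idx, scores[idx]

rng = np.random.default_rng(1)
corpus = rng.standard_normal((10_000, 384)).astype("float32")
query = corpus[7] + 0.01 * rng.standard_normal(384).astype("float32")
idx, scores = top_k_cosine(query, corpus)
print(idx[0])  # 7 -- the slightly perturbed source vector wins
```

Even at 10K vectors this runs in milliseconds on a laptop; only when the corpus grows well past that does an ANN index start paying for its operational overhead.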

How to Benchmark Your Vector Database

Do not rely solely on published benchmarks — test with your actual data and query patterns. Here is a systematic approach:

# Benchmark your vector database with your actual data
import time
import numpy as np
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
collection_name = "benchmark_test"

# 1. Prepare test data
num_queries = 100
dimension = 768
test_queries = np.random.rand(num_queries, dimension).astype("float32")

# 2. Measure query latency
latencies = []
for query in test_queries:
    start = time.perf_counter()
    results = client.query_points(
        collection_name=collection_name,
        query=query.tolist(),
        limit=10
    )
    latency = (time.perf_counter() - start) * 1000
    latencies.append(latency)

latencies.sort()
print(f"p50 latency: {latencies[49]:.2f}ms")
print(f"p95 latency: {latencies[94]:.2f}ms")
print(f"p99 latency: {latencies[98]:.2f}ms")
print(f"Mean latency: {np.mean(latencies):.2f}ms")

# 3. Measure throughput (QPS)
start = time.perf_counter()
for query in test_queries:
    client.query_points(collection_name, query=query.tolist(), limit=10)
elapsed = time.perf_counter() - start
qps = num_queries / elapsed
print(f"Throughput: {qps:.0f} QPS")

# 4. Measure recall (requires ground truth)
# Compare approximate results vs brute-force exact search
# recall = len(set(approx_ids) & set(exact_ids)) / k

Document Chunking Strategies for Vector Databases

How you split documents into chunks before embedding dramatically affects retrieval quality. There is no universal best strategy — it depends on your data and use case. Here are the four main approaches:

  • Fixed-Size Chunking: Split text every N tokens (e.g., 512 tokens) with overlap (e.g., 50 tokens). Simplest to implement. Works well for uniform content like documentation and articles. Use this as your starting point.
  • Sentence-Based Chunking: Split on sentence boundaries. Preserves semantic coherence within each chunk. Better than fixed-size for conversational or narrative content. Requires sentence detection (spaCy, NLTK).
  • Semantic Chunking: Use an embedding model to detect topic shifts and split at natural boundaries. Produces the highest quality chunks but is slower and more complex. LangChain and LlamaIndex both offer semantic chunkers.
  • Parent-Child Chunking: Index small chunks (sentences) for precise retrieval, but return the parent chunk (paragraph/section) for context. Gives the best of both worlds: precise matching with sufficient context for the LLM.

Best practices: Start with 512 tokens and 10-20% overlap. Test retrieval quality with your actual queries. Smaller chunks improve precision but may lose context. Larger chunks preserve context but reduce precision.

# Chunking example with LangChain
from langchain.text_splitter import (
    RecursiveCharacterTextSplitter,
    SentenceTransformersTokenTextSplitter
)

with open("document.txt") as f:  # your source document
    document_text = f.read()

# Fixed-size with overlap (most common)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)

# Token-based (more precise for LLM token limits)
token_splitter = SentenceTransformersTokenTextSplitter(
    chunk_overlap=50,
    tokens_per_chunk=256
)
token_chunks = token_splitter.split_text(document_text)

print(f"Fixed-size: {len(chunks)} chunks")
print(f"Token-based: {len(token_chunks)} chunks")
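Parent-child chunking, described above, is worth seeing without a framework. The following is a minimal framework-free sketch with a toy two-paragraph document and naive sentence splitting on ". "; LangChain's `ParentDocumentRetriever` and LlamaIndex offer production versions of this pattern, and the keyword match here stands in for vector similarity.

```python
# Parent-child sketch: index small children, return the parent for context
document = (
    "Docker packages apps. It bundles dependencies too.\n\n"
    "Kubernetes schedules containers. It scales them across nodes."
)

parents = document.split("\n\n")          # parent chunks (paragraphs)
children = []                             # (child_text, parent_index) pairs
for p_idx, parent in enumerate(parents):
    for sentence in parent.split(". "):   # naive sentence split for the demo
        s = sentence.strip().rstrip(".")
        if s:
            children.append((s + ".", p_idx))

def retrieve(keyword):
    # Match against the small child chunk, return the larger parent chunk.
    # In a real pipeline, the match would be vector similarity, not keywords.
    for child, p_idx in children:
        if keyword.lower() in child.lower():
            return parents[p_idx]
    return None

print(retrieve("scales"))
# Kubernetes schedules containers. It scales them across nodes.
```

The child matched on a single sentence, but the LLM receives the full paragraph, which is exactly the precision/context trade-off parent-child chunking resolves.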

Monitoring Vector Databases in Production

Vector databases require specific monitoring beyond standard database metrics. Track these key indicators:

  • Query Latency: Track p50, p95, and p99 latency. Set alerts if p99 exceeds your SLA (typically 50-100ms for user-facing search). Latency increases as collections grow, so plan for index maintenance.
  • Recall Drift: Periodically evaluate recall@k against a golden test set. If recall drops below your threshold (e.g., 95%), you may need to re-tune index parameters or rebuild the index.
  • Memory Usage: HNSW indexes live in RAM. Monitor memory consumption and set alerts before you hit capacity. Plan for 3-5 GB per million 768-dimension vectors with HNSW.
  • Index Health: Monitor index build progress after bulk inserts, segment count in Milvus, and compaction status. Fragmented indexes degrade performance over time.
# Monitoring: Qdrant collection health check
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# Get collection info
info = client.get_collection("documents")
print(f"Vectors: {info.vectors_count}")
print(f"Indexed: {info.indexed_vectors_count}")
print(f"Points: {info.points_count}")
print(f"Segments: {info.segments_count}")
print(f"Status: {info.status}")

# Latency spot-check with a random query vector
import numpy as np
import time

test_vector = np.random.rand(768).tolist()
start = time.time()
approx = client.query_points("documents", query=test_vector, limit=10)
latency_ms = (time.time() - start) * 1000
print(f"Query latency: {latency_ms:.1f}ms")
print(f"Results: {len(approx.points)} points returned")
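The memory-planning rule of thumb above (3-5 GB per million 768-dimension vectors) is easy to sanity-check with arithmetic. This is a rough model with an assumed HNSW graph overhead of 2*M neighbor ids per vector; real usage varies by engine, M parameter, and payload size.

```python
def hnsw_memory_gb(num_vectors, dim, m=16):
    # Raw float32 vectors: 4 bytes per dimension
    vector_bytes = num_vectors * dim * 4
    # HNSW graph links: roughly 2*m neighbor ids (4 bytes each) per vector
    graph_bytes = num_vectors * 2 * m * 4
    return (vector_bytes + graph_bytes) / 1e9

print(f"{hnsw_memory_gb(1_000_000, 768):.2f} GB")   # 3.20 GB
print(f"{hnsw_memory_gb(1_000_000, 1536):.2f} GB")  # 6.27 GB
```

Note how dimensionality dominates: moving from a 768-dimension to a 1536-dimension embedding model roughly doubles RAM, which is why embedding model choice and capacity planning go together.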

Decision Flowchart: Which Vector Database Should You Choose?

Use this decision tree to narrow down your choice quickly based on your primary constraints:

  • If you want zero infrastructure management and the fastest time-to-production: choose Pinecone, the gold standard for managed vector search with serverless pricing.
  • If you need combined keyword + vector search in one query: choose Weaviate. Its native BM25 + vector hybrid search is unmatched.
  • If raw performance and memory efficiency are your top priorities: choose Qdrant. Written in Rust, it consistently wins ANN benchmarks.
  • If you need billion-scale with GPU acceleration: choose Milvus. Its distributed architecture and NVIDIA partnership handle massive datasets.
  • If you already use PostgreSQL and want to add vector search: choose pgvector. Zero new infrastructure, familiar SQL interface.
  • If you are prototyping or building a tutorial: choose ChromaDB. Three lines of Python to get started.

Quick Start: Your First Vector Search in 5 Minutes

Here is the fastest path from zero to working vector search using ChromaDB and OpenAI embeddings. This pattern works for prototyping any RAG or semantic search application:

# Quick Start: Vector search in 5 minutes
# pip install chromadb openai

import chromadb
from openai import OpenAI

# 1. Initialize clients
chroma = chromadb.PersistentClient(path="./vector_db")
openai_client = OpenAI()  # uses OPENAI_API_KEY env var

# 2. Create a helper to generate embeddings
def embed(texts: list[str]) -> list[list[float]]:
    resp = openai_client.embeddings.create(
        model="text-embedding-3-small", input=texts
    )
    return [d.embedding for d in resp.data]

# 3. Create collection and add documents
collection = chroma.get_or_create_collection(
    name="docs", metadata={"hnsw:space": "cosine"}
)

documents = [
    "Docker containers package applications with dependencies",
    "Kubernetes orchestrates container deployment at scale",
    "Nginx serves as a reverse proxy and load balancer",
    "PostgreSQL is a powerful open-source relational database",
    "Redis provides in-memory caching for low-latency access",
]

collection.add(
    ids=[f"doc_{i}" for i in range(len(documents))],
    embeddings=embed(documents),
    documents=documents,
    metadatas=[{"index": i} for i in range(len(documents))]
)

# 4. Query!
query = "How do I manage containers in production?"
results = collection.query(
    query_embeddings=embed([query]),
    n_results=3
)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
    print(f"[{1-dist:.4f}] {doc}")
# Output:
# [0.8912] Kubernetes orchestrates container deployment at scale
# [0.8234] Docker containers package applications with dependencies
# [0.6102] Nginx serves as a reverse proxy and load balancer

Quantization: Reducing Memory Usage

Quantization compresses vectors to use less memory while preserving search quality. This is critical for large-scale deployments where storing millions of full-precision vectors would be prohibitively expensive.

  • Scalar Quantization: Converts float32 vectors to int8, reducing memory by 4x with minimal recall loss (typically less than 1%). Supported by Qdrant and Milvus.
  • Product Quantization (PQ): Splits each vector into sub-vectors and replaces each with a codebook index. Achieves 8-32x compression. Used by FAISS and Milvus for billion-scale deployments.
  • Binary Quantization: Converts each dimension to a single bit (positive = 1, negative = 0). Extreme 32x compression but noticeable recall loss. Best used as a first-pass filter with re-ranking.
# Qdrant: Enable scalar quantization for 4x memory savings
from qdrant_client import QdrantClient
from qdrant_client.models import (
    VectorParams, Distance, ScalarQuantization,
    ScalarQuantizationConfig, ScalarType,
    SearchParams, QuantizationSearchParams
)

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="docs_quantized",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,  # clip outliers
            always_ram=True  # keep quantized vectors in RAM
        )
    )
)

# Search with quantization (rescore from disk for accuracy)
query_vector = [0.1] * 768  # replace with a real query embedding
results = client.query_points(
    collection_name="docs_quantized",
    query=query_vector,
    search_params=SearchParams(
        quantization=QuantizationSearchParams(
            rescore=True  # re-rank using original vectors from disk
        )
    ),
    limit=10
)
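What scalar quantization actually does can be illustrated with plain NumPy. This is a toy sketch of the float32-to-int8 mapping with quantile clipping, not Qdrant's internal implementation; the clipping quantiles and random data are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(1000, 768)).astype("float32")

# Map float32 to int8: clip outliers at the 1%/99% quantiles, then
# spread the remaining range across 256 integer levels
lo, hi = np.quantile(vectors, 0.01), np.quantile(vectors, 0.99)
scale = (hi - lo) / 255.0
quantized = (np.clip(np.round((vectors - lo) / scale), 0, 255) - 128).astype("int8")

print(f"compression: {vectors.nbytes // quantized.nbytes}x")  # compression: 4x

# Dequantize and measure the reconstruction error
restored = (quantized.astype("float32") + 128) * scale + lo
print(f"mean abs error: {np.abs(vectors - restored).mean():.4f}")
```

The mean error stays within a fraction of a quantization step, which is why recall loss is typically under 1%; the clipped outliers are what the `quantile=0.99` setting in the Qdrant config above controls.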

Production Best Practices

1. Choose your embedding model first, database second. The embedding model has a larger impact on search quality than the database choice. Benchmark with your actual data.
2. Always test recall@k on your data before going to production. Create a golden test set of query-document pairs and measure recall at k=5, k=10, and k=20.
3. Use metadata filtering to reduce the search space. Filtering before vector search (pre-filtering) is faster than filtering after (post-filtering) for most databases.
4. Implement chunking carefully for RAG. Chunk size (256-1024 tokens), overlap (10-20%), and chunking strategy (sentence, paragraph, semantic) dramatically affect retrieval quality.
5. Monitor your vector database in production. Track p99 latency, recall degradation over time, index size growth, and memory usage. Set alerts for latency spikes.
6. Plan for embedding model upgrades. When you switch to a new embedding model, all vectors must be re-embedded. Design your pipeline to support batch re-indexing.
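The pre- vs post-filtering distinction in point 3 can be shown database-agnostically. The sketch below uses NumPy brute-force search on toy data; real engines apply the metadata filter inside the index, but the failure mode of post-filtering (returning fewer than k results) is the same.

```python
import numpy as np

rng = np.random.default_rng(1)
vectors = rng.random((10_000, 64)).astype("float32")
category = rng.integers(0, 10, size=10_000)  # a metadata field
query = rng.random(64).astype("float32")

# Pre-filtering: restrict to matching metadata BEFORE similarity search
candidates = np.flatnonzero(category == 3)
sims = vectors[candidates] @ query
pre_top5 = candidates[np.argsort(-sims)[:5]]

# Post-filtering: search everything, then drop non-matching results.
# If matches are rare in the global top-k, you get fewer than k results.
global_top50 = np.argsort(-(vectors @ query))[:50]
post_top5 = [i for i in global_top50 if category[i] == 3][:5]

print(f"pre-filter results: {len(pre_top5)}, post-filter results: {len(post_top5)}")
```

Pre-filtering always returns the full k results from the allowed subset; post-filtering can silently return fewer, which is why most vector databases implement filtered search as pre-filtering.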

Migration Strategies Between Vector Databases

Moving between vector databases requires careful planning. The general approach follows five steps:

  • 1. Export: Read all vectors and metadata from the source database in batches (typically 1000-5000 vectors per batch).
  • 2. Transform: Adapt the data format to the target schema — field names, metadata structure, ID format, and vector normalization if metrics differ.
  • 3. Load: Batch-insert vectors into the target database. Build indexes after bulk load for faster import.
  • 4. Validate: Run a test query suite against both databases and compare recall@k to ensure search quality parity.
  • 5. Switch: Use a feature flag or traffic split to gradually route queries to the new database. Monitor latency, recall, and error rates.

For zero-downtime migration, implement a dual-write pattern: write new vectors to both databases during the transition period, then cut over reads once validation passes.

# Migration example: ChromaDB -> Qdrant (batch export/import)
import chromadb
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance

chroma = chromadb.PersistentClient(path="./chroma_db")
qdrant = QdrantClient(url="http://localhost:6333")

# Create target collection
qdrant.create_collection("migrated_docs",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE))

# Export from ChromaDB in batches
collection = chroma.get_collection("my_docs")
batch_size = 1000
offset = 0

while True:
    batch = collection.get(
        limit=batch_size, offset=offset,
        include=["embeddings", "metadatas", "documents"]
    )
    if not batch["ids"]:
        break
    points = [
        PointStruct(
            id=i + offset, vector=emb,
            payload={"text": doc, **(meta or {})}
        )
        for i, (emb, doc, meta) in enumerate(zip(
            batch["embeddings"], batch["documents"],
            batch["metadatas"]
        ))
    ]
    qdrant.upsert("migrated_docs", points=points)
    offset += batch_size
    print(f"Migrated {offset} vectors...")

print("Migration complete. Validate recall before switching traffic.")
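The dual-write pattern mentioned above can be sketched in a few lines. Plain dicts stand in for the two databases, and the `DualWriteStore` class name is hypothetical; in production each write would go to the real source and target clients, with reads staying on the source until recall validation passes.

```python
class DualWriteStore:
    """During migration, write every vector to both stores; read from primary."""

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary

    def upsert(self, doc_id, vector, payload=None):
        record = {"vector": vector, "payload": payload}
        self.primary[doc_id] = record
        self.secondary[doc_id] = record  # keeps the migration target in sync

    def query(self, doc_id):
        # Reads stay on the primary until validation passes, then cut over
        return self.primary.get(doc_id)

old_db, new_db = {}, {}
store = DualWriteStore(old_db, new_db)
store.upsert("doc_1", [0.1, 0.2, 0.3], {"source": "docs"})
print(old_db.keys() == new_db.keys())  # True
```

Combined with the batch backfill above for pre-existing vectors, dual writes ensure no vector added during the transition window is lost when you cut over.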

FAQ

What is a vector database and how is it different from a traditional database?

A vector database stores high-dimensional numerical vectors (embeddings) and enables similarity-based queries using distance metrics like cosine similarity. Traditional databases use exact-match lookups on structured data with SQL. Vector databases find the "most similar" items rather than exact matches, making them essential for AI applications like semantic search, RAG, and recommendations.

Which vector database is best for RAG applications in 2026?

For production RAG, Pinecone or Qdrant are the top choices. Pinecone offers zero-ops managed convenience with strong performance. Qdrant provides the best raw performance with open-source flexibility. For prototyping RAG pipelines, ChromaDB is the fastest to get started with — it runs embedded in Python with zero configuration.

Can I use PostgreSQL as a vector database with pgvector?

Yes. pgvector adds vector similarity search to PostgreSQL via a simple extension. It is ideal if you already use Postgres and need vector search without adding new infrastructure. For datasets under 5 million vectors with moderate QPS, pgvector performs well. For larger scale or higher throughput, consider a dedicated vector database like Qdrant or Milvus.

What is HNSW and why is it the most popular indexing algorithm?

HNSW (Hierarchical Navigable Small World) builds a multi-layer proximity graph for approximate nearest neighbor search. It offers the best balance of high recall (typically above 99 percent), low latency (sub-millisecond for most datasets), and reasonable memory usage. Almost every production vector database uses HNSW as its primary or default index type.

How much does a vector database cost in production?

Costs vary widely by scale and provider. Pinecone serverless starts around 70 dollars per month for 1M vectors at 100 QPS. Self-hosted Qdrant on a 40 dollar per month VPS can handle similar workloads. pgvector adds minimal cost if you already have a Postgres instance. Enterprise managed services like Zilliz Cloud or Weaviate Cloud Services range from 100 to 500 dollars per month depending on tier.

Should I use a managed or self-hosted vector database?

Use managed if your team is small, you want to ship fast, and your budget allows it. Self-host if you need data sovereignty, have DevOps capacity, want to minimize costs at scale, or need custom configurations. Many teams start with managed for prototyping and migrate to self-hosted as they scale and understand their workload patterns.

What embedding model should I use with my vector database?

For general-purpose text search, OpenAI text-embedding-3-large (3072 dimensions) or Cohere embed-v3 offer the best quality. For cost-efficient applications, text-embedding-3-small (1536 dimensions) is excellent. For fully open-source deployments, BGE-large-en-v1.5 or E5-large-v2 are strong choices. Always match embedding dimensions to your performance and cost budget.

How do I migrate from one vector database to another?

Export vectors and metadata in batches from the source, transform the data format to match the target schema, batch-insert into the new database, validate recall parity by running a test query suite against both databases, then gradually switch traffic using feature flags. For zero downtime, use a dual-write pattern during the transition period.
