TL;DR
For large-scale production use, choose Pinecone (managed) or Qdrant/Milvus (self-hosted). Use ChromaDB for prototyping. If you already run PostgreSQL, pgvector is the path of least resistance. Weaviate excels at hybrid search that combines vectors with keywords.
Key Takeaways
- Vector databases store embeddings (numerical representations of data) and support fast similarity search across millions or billions of vectors.
- HNSW is the dominant index algorithm for most scenarios, offering the best balance between speed and recall.
- Managed options (Pinecone) minimize operational burden; open-source options (Qdrant, Milvus, Weaviate) offer more control and lower cost.
- pgvector adds vector search to an existing PostgreSQL database without introducing new infrastructure.
- For RAG pipelines, tight integration with LangChain or LlamaIndex is essential.
- The choice of embedding model affects search quality more than the choice of database.
Vector databases have become core infrastructure for modern AI applications. From retrieval-augmented generation (RAG) to semantic search, recommendation systems, and anomaly detection, vector databases store and query high-dimensional embeddings at scale. This guide compares the leading vector databases of 2026 (Pinecone, Weaviate, Qdrant, ChromaDB, pgvector, Milvus, and FAISS), covering architecture, performance, pricing, and practical code examples.
What Is a Vector Database, and Why Does It Matter?
A vector database is a specialized storage system for indexing, storing, and querying high-dimensional vectors (embeddings). Traditional databases organize data in rows and columns and answer exact-match queries; vector databases organize data by similarity in a continuous vector space.
Large language models and embedding models convert text, images, audio, and code into dense numerical vectors, typically 384 to 3072 dimensions. Finding the most similar vectors among millions of candidates requires specialized index structures.
The vector database market has grown explosively since LLM applications took off in 2023. Every RAG system, semantic search engine, and AI recommendation system depends on vector similarity search.
# Traditional DB: exact match
SELECT * FROM products WHERE category = 'electronics'
# Vector DB: similarity search
# "Find products semantically similar to this query"
query_vector = model.encode("wireless noise-canceling headphones")
results = collection.query(query_vector, top_k=10)
# Returns: ranked list of most semantically similar products
# Even matches "Bluetooth ANC over-ear headset" (different words, same meaning)
How Vector Search Works: Embeddings and Similarity
Vector search has three stages: generate embeddings with a model, index them in a database for fast retrieval, and query for nearest neighbors using a similarity metric.
# Step 1: Generate embeddings with an embedding model
from openai import OpenAI
client = OpenAI()
text = "How to deploy a Docker container"
response = client.embeddings.create(
model="text-embedding-3-small",
input=text
)
vector = response.data[0].embedding # [0.023, -0.156, 0.891, ...]
# Length: 1536 float values
# Step 2: Store vector in database with metadata
# Step 3: Query by computing similarity to find nearest neighbors
Similarity Metrics Explained
- Cosine similarity: measures the angle between two vectors, ignoring magnitude. Best for text embeddings. Range: -1 to 1. The most widely used metric.
- Euclidean distance (L2): measures the straight-line distance between two points in vector space. Best when magnitude matters. Lower values mean more similar.
- Dot product (inner product): combines direction and magnitude. The fastest to compute. Works well when vectors are already normalized. Higher values mean more similar.
Rule of thumb: default to cosine similarity for text. Use Euclidean distance for spatial or image data. Use dot product for speed when vectors are already normalized.
import numpy as np
def cosine_similarity(a, b):
return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
def euclidean_distance(a, b):
return np.linalg.norm(np.array(a) - np.array(b))
def dot_product(a, b):
return np.dot(a, b)
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.4, 0.6]
print(f"Cosine: {cosine_similarity(v1, v2):.4f}") # 0.9946
print(f"Euclidean: {euclidean_distance(v1, v2):.4f}") # 0.1732
print(f"Dot product: {dot_product(v1, v2):.4f}") # 0.4400向量数据库综合对比
| Database | Type | Language | Indexes | Hybrid Search | Cloud/Managed |
|---|---|---|---|---|---|
| Pinecone | Managed SaaS | - | Proprietary | Sparse+Dense | Yes (only) |
| Weaviate | Open Source | Go | HNSW, Flat | Native BM25+Vector | Weaviate Cloud |
| Qdrant | Open Source | Rust | HNSW | Payload Filtering | Qdrant Cloud |
| ChromaDB | Open Source | Python | HNSW (hnswlib) | Metadata Filter | Chroma Cloud |
| pgvector | PG Extension | C | IVFFlat, HNSW | SQL WHERE | Any Managed PG |
| Milvus | Open Source | Go/C++ | HNSW, IVF, DiskANN | Sparse+Dense | Zilliz Cloud |
| FAISS | Library | C++/Python | HNSW, IVF, PQ | N/A | N/A |
Pinecone: The Fully Managed Vector Database
Pinecone is a fully managed, serverless vector database that handles sharding, replication, scaling, and backups automatically. Billing is per read/write unit plus storage, which suits bursty workloads. It supports metadata filtering and sparse-dense hybrid search.
# Pinecone: Serverless vector database
from pinecone import Pinecone
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("my-index")
# Upsert vectors with metadata
index.upsert(vectors=[
{"id": "doc1", "values": embedding1,
"metadata": {"source": "wiki", "topic": "docker"}},
{"id": "doc2", "values": embedding2,
"metadata": {"source": "docs", "topic": "kubernetes"}},
])
# Query with metadata filter
results = index.query(
vector=query_embedding,
top_k=5,
filter={"source": {"$eq": "docs"}},
include_metadata=True
)
for match in results.matches:
print(f"Score: {match.score:.4f}, ID: {match.id}")优点:零运维、自动扩缩、强一致性、无服务器定价、优秀文档、命名空间多租户隔离。
缺点:供应商锁定、大规模成本较高、查询灵活性有限、仅AWS区域、无自托管选项。
Weaviate: The Hybrid Search Pioneer
Weaviate is an open-source vector database written in Go with native hybrid search that combines dense vector search with BM25 keyword search. It supports multimodal data and ships built-in vectorization modules.
# Weaviate: Hybrid search (BM25 + vector)
import weaviate
import weaviate.classes.query as wq
client = weaviate.connect_to_local() # or connect_to_weaviate_cloud()
collection = client.collections.get("Article")
# Hybrid search: combine vector similarity + keyword matching
results = collection.query.hybrid(
query="machine learning model deployment",
alpha=0.75, # 0 = pure BM25, 1 = pure vector
limit=10,
return_metadata=wq.MetadataQuery(score=True)
)
for obj in results.objects:
print(f"{obj.properties['title']} — score: {obj.metadata.score:.4f}")
client.close()
Pros: true hybrid search, multimodal support, GraphQL and REST APIs, built-in vectorizers, generative search modules.
Cons: relatively high memory usage, a Go codebase that Python-centric teams find harder to extend, complicated cloud pricing.
Qdrant: The Rust-Powered Performance Leader
Qdrant is an open-source vector database built in Rust for raw performance and memory efficiency. It offers rich payload filtering, scalar and product quantization, and distributed deployment with automatic sharding, and it consistently places near the top of independent benchmarks.
# Qdrant: Rust-powered high-performance vector search
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue
)
client = QdrantClient(url="http://localhost:6333")
# Create collection with HNSW index
client.create_collection(
collection_name="documents",
vectors_config=VectorParams(size=768, distance=Distance.COSINE)
)
# Upsert vectors with payload (metadata)
client.upsert(collection_name="documents", points=[
PointStruct(id=1, vector=emb1, payload={"title": "Docker Guide", "lang": "en"}),
PointStruct(id=2, vector=emb2, payload={"title": "K8s Tutorial", "lang": "en"}),
])
# Search with payload filtering (filters are built from Filter models)
results = client.query_points(
    collection_name="documents",
    query=query_vector,
    query_filter=Filter(must=[
        FieldCondition(key="lang", match=MatchValue(value="en"))
    ]),
    limit=5
).points
Pros: excellent performance and low latency, minimal memory footprint with quantization, rich payload filtering, gRPC and REST APIs, simple Docker deployment.
Cons: smaller community, no built-in BM25 hybrid search, limited cloud regions.
ChromaDB: Lightweight and Python-Native
ChromaDB is an open-source embedding database designed for simplicity and developer experience. It runs inside your Python process, making it the default choice for RAG prototyping.
# ChromaDB: Simplest vector database for prototyping
import chromadb
# In-memory (dev) or persistent (production-lite)
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection("my_docs")
# Add documents — ChromaDB auto-embeds with default model
collection.add(
documents=[
"Docker is a containerization platform for packaging apps",
"Kubernetes orchestrates containers across clusters",
"Nginx is a high-performance web server and reverse proxy"
],
ids=["doc1", "doc2", "doc3"],
metadatas=[{"topic": "docker"}, {"topic": "k8s"}, {"topic": "nginx"}]
)
# Query by text (auto-embeds the query too)
results = collection.query(
query_texts=["container orchestration tools"],
n_results=2
)
print(results["documents"]) # [[doc2, doc1]]优点:最简API、Python内嵌运行零配置、自动文档嵌入、原型开发极快。
缺点:不适合生产规模(百万以上向量吃力)、分布式能力有限、无企业功能。
pgvector: Vector Search Inside PostgreSQL
pgvector is a PostgreSQL extension that adds vector similarity search to an existing Postgres database. It supports IVFFlat and HNSW indexes and combines vector similarity with WHERE clauses and JOINs in a single SQL statement.
-- pgvector: Vector search inside PostgreSQL
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE documents (
id SERIAL PRIMARY KEY,
title TEXT NOT NULL,
content TEXT,
embedding vector(768) -- 768-dimensional vector column
);
-- Create HNSW index for fast cosine search
CREATE INDEX ON documents
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
-- Find 5 most similar documents with SQL filtering
SELECT id, title,
1 - (embedding <=> $1::vector) AS similarity
FROM documents
WHERE title ILIKE '%docker%' -- combine with any SQL
ORDER BY embedding <=> $1::vector
LIMIT 5;
Pros: no new infrastructure, a SQL interface, ACID transactions, direct combination with relational queries and JOINs.
Cons: slower than dedicated vector databases at scale, PostgreSQL only, no built-in quantization, and index builds can be slow.
Milvus: The Enterprise-Grade Open-Source Option
Milvus is an open-source vector database designed for billion-scale similarity search, with a cloud-native distributed architecture, GPU acceleration, and DiskANN support.
# Milvus: Billion-scale vector search
from pymilvus import connections, Collection, FieldSchema
from pymilvus import CollectionSchema, DataType
connections.connect("default", host="localhost", port="19530")
fields = [
FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
FieldSchema(name="title", dtype=DataType.VARCHAR, max_length=512),
]
schema = CollectionSchema(fields, description="Document store")
collection = Collection("documents", schema)
# Build HNSW index
collection.create_index("embedding", {
"metric_type": "COSINE",
"index_type": "HNSW",
"params": {"M": 16, "efConstruction": 256}
})
collection.load()
# Search
results = collection.search(
data=[query_vector], anns_field="embedding",
param={"metric_type": "COSINE", "params": {"ef": 64}},
limit=5, output_fields=["title"]
)
Pros: proven at billion scale, GPU acceleration, multiple index types including DiskANN, enterprise features, an LF AI & Data Foundation project.
Cons: complex deployment (requires etcd, MinIO, and Pulsar), steep learning curve, heavy resource footprint for small datasets.
FAISS: The Meta AI Research Library
FAISS is not a database but a library for efficient similarity search, providing the core algorithms many vector databases use internally. Use it directly when you need maximum control.
# FAISS: Low-level similarity search library by Meta
import faiss
import numpy as np
dimension = 768
num_vectors = 1_000_000
# Create HNSW index
index = faiss.IndexHNSWFlat(dimension, 32) # M=32 neighbors per layer
index.hnsw.efConstruction = 200
index.hnsw.efSearch = 64
# Add vectors (must be float32 numpy arrays)
vectors = np.random.rand(num_vectors, dimension).astype("float32")
faiss.normalize_L2(vectors) # normalize for cosine similarity
index.add(vectors)
# Search: find 10 nearest neighbors
query = np.random.rand(1, dimension).astype("float32")
faiss.normalize_L2(query)
distances, indices = index.search(query, k=10)
print(f"Nearest IDs: {indices[0]}, Distances: {distances[0]}")索引算法深度解析
索引算法决定向量如何组织以实现快速近似最近邻(ANN)检索,直接影响查询延迟、内存使用、召回率和构建时间。
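To make the trade-off concrete, here is a minimal sketch (FAISS, with random vectors standing in for real embeddings) that compares exact brute-force search against an HNSW index on the same data and reports recall@10:
# Exact flat search vs approximate HNSW: latency and recall trade-off
import time
import faiss
import numpy as np

dim, n = 128, 100_000
data = np.random.rand(n, dim).astype("float32")
queries = np.random.rand(100, dim).astype("float32")

# Brute-force baseline: perfect recall, cost grows linearly with n
flat = faiss.IndexFlatL2(dim)
flat.add(data)
t0 = time.perf_counter()
_, exact_ids = flat.search(queries, 10)
flat_ms = (time.perf_counter() - t0) * 1000

# HNSW graph: approximate, speed/recall tunable via efSearch
hnsw = faiss.IndexHNSWFlat(dim, 32)
hnsw.hnsw.efSearch = 64
hnsw.add(data)
t0 = time.perf_counter()
_, approx_ids = hnsw.search(queries, 10)
hnsw_ms = (time.perf_counter() - t0) * 1000

# recall@10 = overlap between approximate and exact neighbor sets
recall = np.mean([len(set(a) & set(e)) / 10
                  for a, e in zip(approx_ids, exact_ids)])
print(f"Flat: {flat_ms:.0f}ms, HNSW: {hnsw_ms:.0f}ms, recall@10: {recall:.3f}")
Raising efSearch pushes recall toward the exact baseline at the cost of latency; the same knob exists under different names in every HNSW-based database.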
Choosing an Embedding Model for Your Vector Database
The choice of embedding model directly determines search quality, often more than the database itself. Here are the leading models in 2026:
# Compare embedding models on your data
from openai import OpenAI
import cohere
from sentence_transformers import SentenceTransformer
# OpenAI (best quality, paid API)
openai = OpenAI()
resp = openai.embeddings.create(
model="text-embedding-3-small", input="test query"
)
oai_vec = resp.data[0].embedding # 1536 dims
# Cohere (multilingual, paid API)
co = cohere.Client("YOUR_API_KEY")
resp = co.embed(
texts=["test query"],
model="embed-english-v3.0",
input_type="search_query"
)
cohere_vec = resp.embeddings[0] # 1024 dims
# Open-source (free, self-hosted)
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
oss_vec = model.encode("test query") # 1024 dims
print(f"OpenAI dims: {len(oai_vec)}")
print(f"Cohere dims: {len(cohere_vec)}")
print(f"BGE dims: {len(oss_vec)}")| Model | Dims | Provider | Best For | Cost |
|---|---|---|---|---|
| text-embedding-3-large | 3072 | OpenAI | Highest quality text | $0.13/1M tokens |
| text-embedding-3-small | 1536 | OpenAI | Cost-efficient general | $0.02/1M tokens |
| embed-v3 | 1024 | Cohere | Multilingual (100+ langs) | $0.10/1M tokens |
| BGE-large-en-v1.5 | 1024 | BAAI | Best open-source English | Free (self-host) |
| E5-large-v2 | 1024 | Microsoft | Strong open-source | Free (self-host) |
| Gemini embedding | 768 | Google | Multimodal (text+image) | $0.004/1M tokens |
Vector Database Use Cases
RAG (Retrieval-Augmented Generation)
The most common use case in 2026. Document chunks are stored as vectors; when a user asks a question, the most relevant chunks are retrieved as context for the LLM, sharply reducing hallucinations. Every enterprise chatbot and document Q&A system uses this pattern.
# Complete RAG pipeline: LangChain + Qdrant
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_qdrant import QdrantVectorStore
from langchain.chains import RetrievalQA
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = QdrantVectorStore.from_existing_collection(
embedding=embeddings,
collection_name="knowledge_base",
url="http://localhost:6333"
)
# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
llm=ChatOpenAI(model="gpt-4o"),
retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
)
answer = qa_chain.invoke("How do I configure Nginx reverse proxy?")
print(answer["result"])语义搜索
超越关键词匹配,按语义搜索。嵌入捕获语义关系,电商、文档站和工单系统都受益于语义搜索。
# Semantic search: Find documents by meaning, not keywords
from qdrant_client import QdrantClient
from openai import OpenAI
openai_client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")
def semantic_search(query: str, collection: str, top_k: int = 5):
# Embed the query
response = openai_client.embeddings.create(
model="text-embedding-3-small", input=query
)
query_vector = response.data[0].embedding
# Search by similarity
results = qdrant.query_points(
collection_name=collection,
query=query_vector,
limit=top_k
)
return [
{"title": r.payload["title"], "score": r.score}
for r in results.points
]
# Query: "fix broken pipe error"
# Returns: Linux SIGPIPE guide, plumbing repair tips, Node.js stream errors
# All semantically related despite different keywords
Recommendation Systems
Represent users and items as vectors in the same embedding space, find similar items with nearest-neighbor search, and combine with collaborative-filtering signals for hybrid recommendations.
# Recommendation system with vector similarity
# Embed product descriptions and find similar items
# Given a product the user liked:
liked_product_vector = get_embedding("Ergonomic mechanical keyboard with RGB")
# Find similar products, filtering with Qdrant Filter models
from qdrant_client.models import Filter, FieldCondition, MatchValue
similar = qdrant.query_points(
    collection_name="products",
    query=liked_product_vector,
    query_filter=Filter(
        must=[FieldCondition(key="in_stock", match=MatchValue(value=True))],
        must_not=[FieldCondition(key="id", match=MatchValue(value=liked_product_id))]
    ),
    limit=10
)
# Returns: similar keyboards, ergonomic mice, desk accessories
# Items are ranked by semantic similarity to the liked product
Anomaly Detection
Embed normal behavior as vectors; a new observation far from every cluster signals an anomaly. Used for fraud detection, intrusion detection, and quality control.
# Anomaly detection with vector distance
import numpy as np
# Embed a new transaction (embed() is a placeholder for your embedding helper)
new_transaction = embed("$15,000 wire transfer to unknown account at 3AM")
# Search for similar past transactions
neighbors = qdrant.query_points(
collection_name="transactions",
query=new_transaction,
limit=5
)
# If nearest neighbors are far away, flag as anomaly
max_score = max(r.score for r in neighbors.points)
ANOMALY_THRESHOLD = 0.7 # cosine similarity threshold
if max_score < ANOMALY_THRESHOLD:
alert(f"Anomaly detected! Nearest similarity: {max_score:.4f}")
# Trigger fraud review pipeline
Self-Hosted vs Managed: A Decision Guide
Build versus buy comes down to team size, operational maturity, data sovereignty requirements, and scale.
| Factor | Managed | Self-Hosted |
|---|---|---|
| Setup Time | Minutes | Hours to days |
| Ops Burden | None (vendor handles) | Monitoring, backups, upgrades |
| Cost (small scale) | $50-200/mo | $20-60/mo (VPS) |
| Cost (large scale) | $500-5000/mo | $200-1000/mo |
| Data Sovereignty | Provider regions only | Full control, any location |
| Customization | Limited to API options | Full source code access |
| Auto-scaling | Built-in | Manual or K8s-based |
| SLA Guarantee | Yes (99.9%+) | Self-managed uptime |
Performance Benchmarks (2026)
Based on ann-benchmarks and independent tests: 1M vectors (768 dimensions, HNSW index, recall@10 > 0.95) on a 4-core, 16 GB server:
| Database | p99 Latency | QPS | Memory (1M vec) | Index Build |
|---|---|---|---|---|
| FAISS (library) | ~1ms | ~12,000 | ~3.0 GB | ~50s |
| Qdrant | ~2ms | ~8,500 | ~3.2 GB | ~45s |
| Milvus | ~3ms | ~7,200 | ~3.8 GB | ~60s |
| Weaviate | ~4ms | ~5,800 | ~4.1 GB | ~55s |
| Pinecone | ~5ms | ~6,000 | Managed | Managed |
| pgvector (HNSW) | ~8ms | ~2,500 | ~4.5 GB | ~120s |
| ChromaDB | ~10ms | ~1,800 | ~3.5 GB | ~40s |
Note: These benchmarks are approximate and vary significantly with hardware, data distribution, index parameters, and filter complexity. Always benchmark with your own data and workload patterns. FAISS is a library (not a database) and lacks persistence, replication, and filtering — raw speed comparisons are not apples-to-apples.
Framework Integrations: LangChain, LlamaIndex, Haystack
All major vector databases integrate with the three leading AI/LLM orchestration frameworks. Here is how each plugs into LangChain:
# LangChain unified retriever interface for all vector DBs
# --- Pinecone ---
from langchain_pinecone import PineconeVectorStore
vs = PineconeVectorStore(index_name="docs", embedding=embeddings)
# --- Weaviate ---
from langchain_weaviate import WeaviateVectorStore
vs = WeaviateVectorStore(
client=weaviate_client, index_name="Docs", embedding=embeddings)
# --- Qdrant ---
from langchain_qdrant import QdrantVectorStore
vs = QdrantVectorStore.from_existing_collection(
url="http://localhost:6333",
collection_name="docs", embedding=embeddings)
# --- ChromaDB ---
from langchain_chroma import Chroma
vs = Chroma(persist_directory="./chroma", embedding_function=embeddings)
# --- pgvector ---
from langchain_postgres import PGVector
vs = PGVector(
connection=conn_string, embeddings=embeddings,
collection_name="docs")
# All use the SAME retriever interface:
retriever = vs.as_retriever(search_kwargs={"k": 5})
docs = retriever.invoke("How to configure CORS headers?")
LlamaIndex Integration
# LlamaIndex: Alternative to LangChain for RAG
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.vector_stores.qdrant import QdrantVectorStore
from qdrant_client import QdrantClient
# Connect to Qdrant
qdrant = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
client=qdrant, collection_name="llama_docs"
)
# Load documents and build index (vector stores attach via a StorageContext)
from llama_index.core import StorageContext
documents = SimpleDirectoryReader("./data/").load_data()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context
)
# Query with natural language
query_engine = index.as_query_engine(similarity_top_k=5)
response = query_engine.query(
"What are the best practices for Docker networking?"
)
print(response.response)
print(f"Sources: {len(response.source_nodes)} chunks retrieved")Haystack Integration
# Haystack: Pipeline-based approach
from haystack_integrations.document_stores.qdrant import QdrantDocumentStore
from haystack_integrations.components.retrievers.qdrant import QdrantEmbeddingRetriever
from haystack.components.generators import OpenAIGenerator
from haystack import Pipeline
document_store = QdrantDocumentStore(
url="http://localhost:6333",
index="haystack_docs",
embedding_dim=768
)
retriever = QdrantEmbeddingRetriever(document_store=document_store)
generator = OpenAIGenerator(model="gpt-4o")
# OpenAIGenerator consumes a prompt, not documents, so a PromptBuilder
# renders the retrieved documents into the prompt template
from haystack.components.builders import PromptBuilder
prompt_builder = PromptBuilder(template=(
    "Answer using these documents:\n"
    "{% for doc in documents %}{{ doc.content }}\n{% endfor %}"
    "Question: {{ question }}"
))
# Build pipeline
pipe = Pipeline()
pipe.add_component("retriever", retriever)
pipe.add_component("prompt_builder", prompt_builder)
pipe.add_component("generator", generator)
pipe.connect("retriever.documents", "prompt_builder.documents")
pipe.connect("prompt_builder.prompt", "generator.prompt")
# query_vec is a precomputed query embedding (placeholder)
result = pipe.run({
    "retriever": {"query_embedding": query_vec},
    "prompt_builder": {"question": "How do I configure Docker networking?"}
})
Docker Quick Deploy: Qdrant, Weaviate, Milvus
# Deploy Qdrant with Docker (simplest)
docker run -p 6333:6333 -p 6334:6334 \
-v $(pwd)/qdrant_storage:/qdrant/storage:z \
qdrant/qdrant:latest
# Deploy Weaviate with Docker Compose
# docker-compose.yml:
services:
weaviate:
image: semitechnologies/weaviate:latest
ports:
- "8080:8080"
- "50051:50051"
environment:
QUERY_DEFAULTS_LIMIT: 25
AUTHENTICATION_ANONYMOUS_ACCESS_ENABLED: "true"
PERSISTENCE_DATA_PATH: "/var/lib/weaviate"
DEFAULT_VECTORIZER_MODULE: "none"
CLUSTER_HOSTNAME: "node1"
volumes:
- weaviate_data:/var/lib/weaviate
# Deploy Milvus standalone with Docker Compose
# curl -O https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh
# bash standalone_embed.sh start
# Milvus Standalone runs on port 19530
Cost Analysis and Pricing Comparison
Estimated monthly cost for 1M vectors (768 dimensions) under sustained 100 QPS load (early-2026 prices):
| Option | Managed Cost | Self-Hosted Cost | Notes |
|---|---|---|---|
| Pinecone | $70-100/mo | N/A | Serverless: pay per read/write unit |
| Weaviate Cloud | $100-300/mo | $40-80/mo | Tiered: Sandbox free, Standard, Business |
| Qdrant Cloud | $65-150/mo | $30-60/mo | Most affordable managed option |
| Zilliz (Milvus) | $100-400/mo | $60-120/mo | GPU instances increase cost |
| pgvector | From $15/mo* | $0 extra | * = managed Postgres cost (RDS, Supabase) |
| ChromaDB | Beta pricing | $20-40/mo | Primarily for dev and prototyping |
When Not to Use a Vector Database
Vector databases are powerful, but they are not always the right choice. Skip them when your queries are exact-match lookups a relational database already handles, when the dataset is small enough to brute-force in memory, or when nothing in your pipeline produces embeddings in the first place.
How to Benchmark a Vector Database
Do not rely on published benchmarks alone; test with your actual data and query patterns. Here is a systematic approach:
# Benchmark your vector database with your actual data
import time
import numpy as np
from qdrant_client import QdrantClient
client = QdrantClient(url="http://localhost:6333")
collection_name = "benchmark_test"
# 1. Prepare test data
# (assumes the "benchmark_test" collection already exists and holds vectors)
num_queries = 100
dimension = 768
test_queries = np.random.rand(num_queries, dimension).astype("float32")
# 2. Measure query latency
latencies = []
for query in test_queries:
start = time.perf_counter()
results = client.query_points(
collection_name=collection_name,
query=query.tolist(),
limit=10
)
latency = (time.perf_counter() - start) * 1000
latencies.append(latency)
latencies.sort()
print(f"p50 latency: {latencies[49]:.2f}ms")
print(f"p95 latency: {latencies[94]:.2f}ms")
print(f"p99 latency: {latencies[98]:.2f}ms")
print(f"Mean latency: {np.mean(latencies):.2f}ms")
# 3. Measure throughput (QPS)
start = time.perf_counter()
for query in test_queries:
client.query_points(collection_name, query=query.tolist(), limit=10)
elapsed = time.perf_counter() - start
qps = num_queries / elapsed
print(f"Throughput: {qps:.0f} QPS")
# 4. Measure recall (requires ground truth from exact search)
# Qdrant can bypass the HNSW index with SearchParams(exact=True)
from qdrant_client.models import SearchParams
k = 10
sample = test_queries[0].tolist()
approx_ids = {p.id for p in client.query_points(
    collection_name, query=sample, limit=k).points}
exact_ids = {p.id for p in client.query_points(
    collection_name, query=sample, limit=k,
    search_params=SearchParams(exact=True)).points}
print(f"recall@{k}: {len(approx_ids & exact_ids) / k:.2f}")
Document Chunking Strategies for Vector Databases
How you split documents before embedding has an outsized impact on retrieval quality. There is no universally best strategy; it depends on your data and use case. Common approaches include fixed-size splitting with overlap, recursive sentence-aware splitting, token-based splitting, and semantic chunking.
Best practice: start with 512-token chunks and 10-20% overlap, then test retrieval quality with real queries. Smaller chunks improve precision but can lose context; larger chunks preserve context but reduce precision.
# Chunking example with LangChain
from langchain.text_splitter import (
RecursiveCharacterTextSplitter,
SentenceTransformersTokenTextSplitter
)
# Fixed-size with overlap (most common)
splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = splitter.split_text(document_text)
# Token-based (more precise for LLM token limits)
token_splitter = SentenceTransformersTokenTextSplitter(
chunk_overlap=50,
tokens_per_chunk=256
)
token_chunks = token_splitter.split_text(document_text)
print(f"Fixed-size: {len(chunks)} chunks")
print(f"Token-based: {len(token_chunks)} chunks")生产环境监控向量数据库
向量数据库需要标准数据库指标之外的特定监控。跟踪以下关键指标:
- Latency: 查询延迟:跟踪p50、p95和p99延迟。如果p99超过SLA(通常50-100ms),设置告警。随集合增长延迟增加。
- Recall: 召回率漂移:定期对黄金测试集评估recall@k。如果低于阈值(如95%),可能需要重新调整索引参数。
- Memory: 内存使用:HNSW索引在RAM中。监控内存消耗,在接近容量前设置告警。每百万768维向量规划3-5 GB。
- Index: 索引健康:监控批量插入后的索引构建进度、Milvus的段数量和压缩状态。碎片化的索引会降低性能。
# Monitoring: Qdrant collection health check
from qdrant_client import QdrantClient
client = QdrantClient(url="http://localhost:6333")
# Get collection info
info = client.get_collection("documents")
print(f"Vectors: {info.vectors_count}")
print(f"Indexed: {info.indexed_vectors_count}")
print(f"Points: {info.points_count}")
print(f"Segments: {len(info.segments)}")
print(f"Status: {info.status}")
# Recall spot check: compare approximate (HNSW) vs exact search
import numpy as np
import time
from qdrant_client.models import SearchParams
test_vector = np.random.rand(768).tolist()
start = time.time()
approx = client.query_points("documents", query=test_vector, limit=10)
latency_ms = (time.time() - start) * 1000
exact = client.query_points("documents", query=test_vector, limit=10,
                            search_params=SearchParams(exact=True))
overlap = {p.id for p in approx.points} & {p.id for p in exact.points}
print(f"Query latency: {latency_ms:.1f}ms")
print(f"Results: {len(approx.points)} points returned")
print(f"recall@10: {len(overlap) / 10:.2f}")
Decision Flowchart: Which Vector Database Should You Choose?
Use this decision tree to narrow the field quickly, based on your primary constraint:
- Need zero operations and a managed service: Pinecone.
- Self-hosting at production scale: Qdrant or Milvus.
- Prototyping or building a demo: ChromaDB.
- Already running PostgreSQL: pgvector.
- Need hybrid keyword + vector search: Weaviate.
- Need a library rather than a server: FAISS.
Quick Start: Your First Vector Search in 5 Minutes
The fastest path from zero to working vector search uses ChromaDB and OpenAI embeddings. This pattern works for prototyping any RAG or semantic search application:
# Quick Start: Vector search in 5 minutes
# pip install chromadb openai
import chromadb
from openai import OpenAI
# 1. Initialize clients
chroma = chromadb.PersistentClient(path="./vector_db")
openai_client = OpenAI() # uses OPENAI_API_KEY env var
# 2. Create a helper to generate embeddings
def embed(texts: list[str]) -> list[list[float]]:
resp = openai_client.embeddings.create(
model="text-embedding-3-small", input=texts
)
return [d.embedding for d in resp.data]
# 3. Create collection and add documents
collection = chroma.get_or_create_collection(
name="docs", metadata={"hnsw:space": "cosine"}
)
documents = [
"Docker containers package applications with dependencies",
"Kubernetes orchestrates container deployment at scale",
"Nginx serves as a reverse proxy and load balancer",
"PostgreSQL is a powerful open-source relational database",
"Redis provides in-memory caching for low-latency access",
]
collection.add(
ids=[f"doc_{i}" for i in range(len(documents))],
embeddings=embed(documents),
documents=documents,
metadatas=[{"index": i} for i in range(len(documents))]
)
# 4. Query!
query = "How do I manage containers in production?"
results = collection.query(
query_embeddings=embed([query]),
n_results=3
)
for doc, dist in zip(results["documents"][0], results["distances"][0]):
print(f"[{1-dist:.4f}] {doc}")
# Output:
# [0.8912] Kubernetes orchestrates container deployment at scale
# [0.8234] Docker containers package applications with dependencies
# [0.6102] Nginx serves as a reverse proxy and load balancer
Quantization: Reducing Memory Usage
Quantization compresses vectors to use less memory while preserving search quality. It is essential for large-scale deployments where storing millions of full-precision vectors is prohibitively expensive.
- Scalar quantization: converts float32 vectors to int8 for a 4x memory reduction with minimal recall loss (typically <1%). Supported by Qdrant and Milvus.
- Product quantization (PQ): splits each vector into subvectors and replaces them with codebook indexes for 8-32x compression. Used by FAISS and Milvus for billion-scale deployments (see the FAISS sketch after this list).
- Binary quantization: converts each dimension to a single bit for an extreme 32x compression with a noticeable recall drop. Best as a coarse first pass followed by rescoring.
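To illustrate product quantization concretely, here is a minimal FAISS IVF+PQ sketch (random vectors as stand-ins; the cluster and subquantizer counts are illustrative, not tuned):
# Product quantization (IVF+PQ) sketch with FAISS
import faiss
import numpy as np

dim = 768
data = np.random.rand(50_000, dim).astype("float32")

# 1024 coarse clusters; 96 subquantizers x 8 bits compress each 768-dim
# float32 vector (3072 bytes) down to 96 bytes, roughly 32x smaller
quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(quantizer, dim, 1024, 96, 8)
index.train(data)  # PQ codebooks and IVF centroids need a training pass
index.add(data)
index.nprobe = 16  # coarse clusters scanned per query; higher = better recall
distances, ids = index.search(data[:1], 10)
print(ids[0])
Qdrant's built-in scalar quantization, shown next, trades less compression (4x) for simpler setup and smaller recall loss.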
# Qdrant: Enable scalar quantization for 4x memory savings
from qdrant_client.models import (
VectorParams, Distance, ScalarQuantization,
ScalarQuantizationConfig, ScalarType, QuantizationSearchParams, SearchParams
)
client.create_collection(
collection_name="docs_quantized",
vectors_config=VectorParams(size=768, distance=Distance.COSINE),
quantization_config=ScalarQuantization(
scalar=ScalarQuantizationConfig(
type=ScalarType.INT8,
quantile=0.99, # clip outliers
always_ram=True # keep quantized vectors in RAM
)
)
)
# Search with quantization (rescore from disk for accuracy)
results = client.query_points(
collection_name="docs_quantized",
query=query_vector,
search_params={"quantization": QuantizationSearchParams(
rescore=True # re-rank using original vectors from disk
)},
limit=10
)生产环境最佳实践
Migration Strategies Between Vector Databases
Migrating between vector databases takes careful planning. The general approach has five steps:
- 1. Export: read all vectors and metadata from the source database in batches (typically 1,000-5,000 vectors per batch).
- 2. Transform: adapt the data to the target schema: field names, metadata structure, ID format.
- 3. Load: bulk-insert the vectors into the target database; building the index after the bulk import is faster.
- 4. Validate: run a test query set against both databases and compare recall@k to confirm search quality is unchanged.
- 5. Cut over: route queries to the new database gradually behind a feature flag, monitoring latency and error rates.
For zero-downtime migrations, use a dual-write pattern: during the transition, new vectors are written to both databases, and reads switch once validation passes. A minimal sketch of the dual-write wrapper follows; the batch migration example comes after it.
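This sketch assumes the chroma and qdrant clients created in the migration example below; dual_write itself is a hypothetical helper, not a library API:
# Dual-write wrapper for the transition window (hypothetical helper)
import logging
from qdrant_client.models import PointStruct

log = logging.getLogger("migration")

def dual_write(point_id: int, vector: list[float], text: str, meta: dict):
    # The old database remains the system of record until validation passes
    source = chroma.get_collection("my_docs")
    source.add(ids=[f"doc_{point_id}"], embeddings=[vector],
               documents=[text], metadatas=[meta])
    try:
        qdrant.upsert("migrated_docs", points=[
            PointStruct(id=point_id, vector=vector,
                        payload={"text": text, **meta})
        ])
    except Exception as exc:
        # A failure on the new database must not block production writes
        log.warning("dual-write to Qdrant failed: %s", exc)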
# Migration example: ChromaDB -> Qdrant (zero downtime)
import chromadb
from qdrant_client import QdrantClient
from qdrant_client.models import PointStruct, VectorParams, Distance
chroma = chromadb.PersistentClient(path="./chroma_db")
qdrant = QdrantClient(url="http://localhost:6333")
# Create target collection
qdrant.create_collection("migrated_docs",
vectors_config=VectorParams(size=768, distance=Distance.COSINE))
# Export from ChromaDB in batches
collection = chroma.get_collection("my_docs")
batch_size = 1000
offset = 0
while True:
batch = collection.get(
limit=batch_size, offset=offset,
include=["embeddings", "metadatas", "documents"]
)
if not batch["ids"]:
break
points = [
PointStruct(
id=i + offset, vector=emb,
payload={"text": doc, **(meta or {})}
)
for i, (emb, doc, meta) in enumerate(zip(
batch["embeddings"], batch["documents"],
batch["metadatas"]
))
]
qdrant.upsert("migrated_docs", points=points)
offset += batch_size
print(f"Migrated {offset} vectors...")
print("Migration complete. Validate recall before switching traffic.")FAQ
What is a vector database, and how does it differ from a traditional database?
A vector database stores high-dimensional numerical vectors and answers similarity-based queries using distance metrics. Traditional databases answer exact-match SQL queries. A vector database finds the items most similar to a query rather than exact matches.
Which vector database is best for RAG applications in 2026?
For production RAG, Pinecone or Qdrant. Pinecone offers zero-ops managed hosting; Qdrant offers the best raw performance plus open-source flexibility. For prototyping, ChromaDB is the first choice.
Can I use PostgreSQL's pgvector as my vector database?
Yes. pgvector is an extension that adds vector search to PostgreSQL. It performs well up to roughly 5 million vectors at moderate QPS; beyond that, a dedicated vector database is recommended.
What is HNSW, and why is it so popular?
HNSW builds a multi-layer neighbor graph for approximate nearest neighbor search, offering the best balance of high recall (>99%), low latency (sub-millisecond), and reasonable memory usage.
How much does a vector database cost in production?
Pinecone serverless runs about $70/month for 1M vectors. A self-hosted Qdrant VPS runs about $40/month. pgvector adds almost nothing if you already run Postgres. Managed enterprise plans run $100-500/month.
Should I use a managed or self-hosted vector database?
Go managed if you are a small team iterating quickly; self-host if you need data sovereignty, have DevOps capacity, or want tight cost control. Many teams prototype on a managed service and migrate to self-hosting at scale.
Which embedding model should I use?
For general text search, OpenAI text-embedding-3-large or Cohere embed-v3. For open source, BGE-large or E5-large-v2. Match the dimensionality to your performance and cost budget.
How do I migrate between vector databases?
Export vectors and metadata in batches, transform them to the target schema, bulk-import into the new database, validate recall with test queries, then shift traffic gradually behind a feature flag. Use dual writes for zero downtime.