
Elasticsearch Complete Guide: Query DSL, Mappings, Aggregations, ELK Stack, and Performance

18 min read · by DevToolBox

TL;DR

Elasticsearch is a distributed, RESTful search and analytics engine built on Apache Lucene. It stores data as JSON documents in indices, distributes them across shards, and queries them with a powerful Query DSL. Use match queries for full-text search, term queries for exact matches, and bool queries to combine conditions. Install locally with Docker in minutes. Connect from Node.js with @elastic/elasticsearch or from Python with elasticsearch-py. For log analytics, use the full ELK Stack: Elasticsearch + Logstash + Kibana. In production, run at least 3 nodes with 1 replica per shard, set your JVM heap to no more than half of available RAM, and use ILM (Index Lifecycle Management) to automate data retention. Use our JSON Formatter to validate query bodies before sending them to Elasticsearch.

What Is Elasticsearch and When Should You Use It?

Elasticsearch is an open-source, distributed search and analytics engine built on top of Apache Lucene. Originally created by Shay Banon in 2010 and now maintained by Elastic NV, it has become the most widely deployed search engine in the world — used by GitHub, Wikipedia, Netflix, Uber, Walmart, and millions of other applications.

Elasticsearch stores data as structured JSON documents and indexes them in an inverted index — a data structure that maps every unique term to the list of documents that contain it. This is the same core structure used by virtually every full-text search engine, and it is what enables extremely fast full-text search even across billions of documents.
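As an illustration of the core idea (not Lucene's actual on-disk structure, which adds compression, term dictionaries, and scoring data), an inverted index can be sketched in a few lines of Python:

```python
from collections import defaultdict

def build_inverted_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercased term to the set of document IDs containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return dict(index)

docs = {
    "1": "wireless bluetooth headphones",
    "2": "wired headphones",
    "3": "bluetooth speaker",
}
index = build_inverted_index(docs)
# Looking up a term is a single dictionary access, regardless of corpus size
print(sorted(index["headphones"]))  # ['1', '2']
print(sorted(index["bluetooth"]))   # ['1', '3']
```

Searching becomes a lookup per query term plus set intersections/unions, which is why the structure scales so well.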

When to Use Elasticsearch

  • Full-text search — Product search, document search, knowledge base search with relevance ranking, highlighting, and fuzzy matching.
  • Log and event analytics — Centralized logging with the ELK Stack. Ingest, search, and visualize application logs, access logs, and security events.
  • E-commerce search — Product catalog search with faceted navigation, filters, boosting popular items, and personalized ranking.
  • Autocomplete and suggestions — Real-time search-as-you-type using completion suggesters or edge n-gram tokenizers.
  • Observability and APM — Application performance monitoring, distributed tracing, uptime monitoring, and infrastructure metrics.
  • Geospatial search — Find documents near a location, within a bounding box, or within a polygon using Elasticsearch geo queries.
  • Business analytics — Real-time aggregations and dashboards over large datasets with Kibana.

When NOT to Use Elasticsearch

  • Primary transactional database — Elasticsearch does not support ACID transactions, foreign keys, or multi-document atomicity.
  • Complex relational joins — There is no SQL-style JOIN. Model your data differently (denormalize) or use a relational database.
  • Strict consistency requirements — Elasticsearch is eventually consistent by default. For financial systems requiring strict consistency, use PostgreSQL or a similar ACID database.
  • Low-traffic simple search — For small datasets under 10,000 documents, PostgreSQL full-text search (tsvector) or SQLite FTS5 may be simpler and sufficient.

Core Concepts: Index, Document, Shard, and Replica

Understanding Elasticsearch requires mapping its concepts to familiar database terms and then going deeper into how they differ.

Elasticsearch Concepts vs Relational Database:

  Elasticsearch          SQL Database
  ─────────────────────────────────────────
  Index             ≈    Table
  Document          ≈    Row
  Field             ≈    Column
  Mapping           ≈    Schema
  Shard             ≈    Partition
  Replica           ≈    Read replica
  Node              ≈    Database server
  Cluster           ≈    Database cluster

Key differences:
  - No JOINs between indices
  - No ACID transactions
  - Schema-flexible (dynamic mappings)
  - Built-in horizontal sharding
  - Inverted index (not B-tree)

Index

An index is a logical namespace for a collection of documents. All documents in an index should have a similar structure. An index has a name (lowercase), a mapping that defines field types, and settings that configure sharding and analysis. Think of it as a database table, but more flexible.

Document

A document is the basic unit of data in Elasticsearch. It is a JSON object stored within an index. Each document has a unique _id (auto-generated or user-specified), an _index field indicating where it belongs, and a _source field containing the original JSON. Documents are immutable — updating a document creates a new version and marks the old one for deletion.

Shard

A shard is a horizontal partition of an index. When Elasticsearch stores an index, it automatically divides it into one or more shards and distributes them across nodes in the cluster. Each shard is itself a fully functional Lucene index. By dividing data into shards, Elasticsearch can store data larger than the disk space of any single node and parallelize queries across all shards simultaneously.

The number of primary shards is fixed at index creation time. A rule of thumb is to target 10–50 GB per shard. For a 100 GB index, 3–5 primary shards is reasonable.
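That rule of thumb can be expressed as a quick back-of-the-envelope calculation. This is an illustrative helper, not an official Elastic formula — real sizing also depends on query patterns, node count, and growth projections:

```python
import math

def suggest_primary_shards(index_size_gb: float, target_shard_gb: float = 30.0) -> int:
    """Rough starting point for primary shard count, targeting ~10-50 GB per shard."""
    return max(1, math.ceil(index_size_gb / target_shard_gb))

print(suggest_primary_shards(100))  # 4  (100 GB at ~30 GB per shard)
print(suggest_primary_shards(5))    # 1  (small indices need only one shard)
```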

Replica

A replica shard is a copy of a primary shard. Replicas serve two purposes: fault tolerance (if a primary shard fails, a replica is promoted to primary) and increased read throughput (queries can be executed on both primary and replica shards). The number of replicas can be changed dynamically at any time. In production, always use at least 1 replica.

# Cluster with 3 nodes, index with 3 primary shards and 1 replica:

  Node 1            Node 2            Node 3
  ─────────────     ─────────────     ─────────────
  Primary P0        Replica R0        Primary P1
  Replica  R2       Primary P2        Replica R1

  Total shards: 6 (3 primary + 3 replica)
  If Node 1 fails: R0 on Node 2 is promoted to P0
  Cluster stays green (all shards available)

Elasticsearch vs Traditional SQL Databases

Elasticsearch and SQL databases are complementary, not competitors. Understanding their respective strengths helps you architect systems that leverage both effectively.

  Feature                   Elasticsearch                                    PostgreSQL / MySQL
  ─────────────────────────────────────────────────────────────────────────────────────────────
  Full-text search          Excellent (inverted index, relevance scoring)    Good (tsvector, but limited ranking)
  ACID transactions         No                                               Yes
  Complex JOINs             No (denormalize data)                            Yes (multi-table joins)
  Horizontal scaling        Native (sharding built-in)                       Requires partitioning/sharding middleware
  Aggregations/analytics    Excellent (nested, pipeline, metric aggs)        Good (but slow on very large tables)
  Schema                    Flexible (dynamic mappings)                      Rigid (ALTER TABLE required)
  Consistency               Eventually consistent                            Strongly consistent
  Best for                  Search, logs, analytics                          Transactional data, reporting

A common architecture is to use PostgreSQL or MySQL as the source of truth for user data, orders, and other transactional data, and synchronize relevant fields to Elasticsearch for fast search. The sync can be done via application-level dual writes, change data capture (CDC) with Debezium, or Logstash JDBC input.
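Whichever sync mechanism you choose, the core step is projecting a relational row onto the search document shape. A minimal sketch (the field names here are hypothetical — keep only what search actually needs, and leave the database as the source of truth):

```python
def product_row_to_document(row: dict) -> dict:
    """Project a relational product row onto the Elasticsearch document shape."""
    return {
        "name": row["name"],
        "price": float(row["price"]),          # DECIMAL columns often arrive as strings
        "category": row["category_slug"],
        "in_stock": row["stock_quantity"] > 0, # derive the search-friendly boolean
        "created_at": row["created_at"],
    }

row = {
    "id": 42,
    "name": "Mechanical Keyboard",
    "price": "149.99",
    "category_slug": "electronics",
    "stock_quantity": 7,
    "internal_cost": "80.00",   # intentionally NOT synced to the search index
    "created_at": "2026-01-16T09:00:00Z",
}
doc = product_row_to_document(row)
print(doc["in_stock"])  # True
```

Keeping the projection in one function makes dual writes and CDC consumers share the same mapping logic.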

Installing and Setting Up Elasticsearch

The fastest way to get started is with Docker. For production, use Elastic Cloud or install on dedicated servers.

Option 1: Docker (Development)

# Single-node Elasticsearch 8.x (security disabled for local dev)
docker run -d \
  --name elasticsearch \
  -p 9200:9200 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms512m -Xmx512m" \
  docker.elastic.co/elasticsearch/elasticsearch:8.12.0

# Verify it is running
curl http://localhost:9200
# {"name":"...","cluster_name":"docker-cluster","version":{"number":"8.12.0",...}}

Option 2: Docker Compose with Kibana (Recommended for Development)

# docker-compose.yml
version: "3.8"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:8.12.0
    environment:
      - discovery.type=single-node
      - xpack.security.enabled=false
      - ES_JAVA_OPTS=-Xms1g -Xmx1g
    ports:
      - "9200:9200"
    volumes:
      - es-data:/usr/share/elasticsearch/data

  kibana:
    image: docker.elastic.co/kibana/kibana:8.12.0
    ports:
      - "5601:5601"
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    depends_on:
      - elasticsearch

volumes:
  es-data:

# Start the stack
docker compose up -d

# Open Kibana at http://localhost:5601

Option 3: Elastic Cloud (Production)

Elastic Cloud is the fully managed hosted service. It offers automatic upgrades, built-in monitoring, Kibana, and integrations. There is a 14-day free trial. Alternative managed options include Amazon OpenSearch Service (AWS) and Elastic on Google Cloud / Azure.

CRUD Operations with the REST API

Elasticsearch exposes a fully RESTful HTTP API. Every operation — creating indices, indexing documents, searching, and managing the cluster — is done via HTTP requests with JSON bodies. Use curl, Kibana Dev Tools, or any HTTP client.

Create an Index

# Create index with explicit settings and mappings
PUT /products
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop", "snowball"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": { "type": "text", "analyzer": "product_analyzer" },
      "description": { "type": "text" },
      "price": { "type": "double" },
      "category": { "type": "keyword" },
      "tags": { "type": "keyword" },
      "in_stock": { "type": "boolean" },
      "created_at": { "type": "date" }
    }
  }
}

Index (Create) a Document

# POST: auto-generate ID
POST /products/_doc
{
  "name": "Wireless Bluetooth Headphones",
  "description": "Premium noise-cancelling headphones with 30-hour battery life",
  "price": 199.99,
  "category": "electronics",
  "tags": ["audio", "wireless", "noise-cancelling"],
  "in_stock": true,
  "created_at": "2026-01-15T10:30:00Z"
}
# Response: { "_id": "abc123...", "result": "created", ... }

# PUT: specify your own ID
PUT /products/_doc/prod-001
{
  "name": "Mechanical Keyboard",
  "price": 149.99,
  "category": "electronics",
  "in_stock": true,
  "created_at": "2026-01-16T09:00:00Z"
}

# PUT with _create: fail if document already exists
PUT /products/_create/prod-001
{ "name": "Keyboard" }
# Returns 409 Conflict if prod-001 already exists

Read (Get) a Document

# Get by ID
GET /products/_doc/prod-001

# Response:
{
  "_index": "products",
  "_id": "prod-001",
  "_version": 1,
  "found": true,
  "_source": {
    "name": "Mechanical Keyboard",
    "price": 149.99,
    "category": "electronics",
    ...
  }
}

# Check existence without fetching source
HEAD /products/_doc/prod-001   # 200 OK or 404

# Get only specific fields
GET /products/_doc/prod-001?_source_includes=name,price

# Multi-get
POST /products/_mget
{
  "ids": ["prod-001", "prod-002", "prod-003"]
}

Update a Document

# Partial update (only specified fields)
POST /products/_update/prod-001
{
  "doc": {
    "price": 139.99,
    "in_stock": false
  }
}

# Update with script (increment a counter)
POST /products/_update/prod-001
{
  "script": {
    "source": "ctx._source.view_count += params.increment",
    "params": { "increment": 1 }
  }
}

# Upsert: create if not exists, update if exists
POST /products/_update/prod-999
{
  "doc": { "price": 99.99 },
  "upsert": { "name": "New Product", "price": 99.99 }
}

# Update by query (update all matching documents)
POST /products/_update_by_query
{
  "query": { "term": { "category": "electronics" } },
  "script": {
    "source": "ctx._source.on_sale = true"
  }
}

Delete a Document

# Delete by ID
DELETE /products/_doc/prod-001

# Delete by query
POST /products/_delete_by_query
{
  "query": {
    "bool": {
      "must": [
        { "term": { "in_stock": false } },
        { "range": { "created_at": { "lt": "2024-01-01" } } }
      ]
    }
  }
}

# Delete entire index (irreversible!)
DELETE /products

Bulk API (High-Performance Indexing)

# The bulk API processes multiple operations in a single request.
# Each operation is a pair of lines: action metadata + document body.

POST /products/_bulk
{ "index": { "_id": "prod-100" } }
{ "name": "USB-C Hub", "price": 49.99, "category": "electronics" }
{ "index": { "_id": "prod-101" } }
{ "name": "Monitor Stand", "price": 79.99, "category": "accessories" }
{ "update": { "_id": "prod-001" } }
{ "doc": { "price": 129.99 } }
{ "delete": { "_id": "prod-999" } }

# Response includes per-operation status:
{
  "took": 5,
  "errors": false,
  "items": [
    { "index": { "_id": "prod-100", "result": "created", "status": 201 } },
    { "index": { "_id": "prod-101", "result": "created", "status": 201 } },
    { "update": { "_id": "prod-001", "result": "updated", "status": 200 } },
    { "delete": { "_id": "prod-999", "result": "deleted", "status": 200 } }
  ]
}

# Best practice: use bulk batches of 5–15 MB or 1,000–5,000 documents
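To make the batching practice concrete, here is a sketch of building the NDJSON payload and splitting it at a byte budget. This is illustrative only — the official clients (e.g. Python's elasticsearch.helpers.bulk) handle batching for you:

```python
import json

def build_bulk_batches(docs: list[dict], max_bytes: int = 5 * 1024 * 1024) -> list[str]:
    """Serialize documents into _bulk NDJSON payloads, splitting at ~max_bytes."""
    batches: list[str] = []
    current: list[str] = []
    current_size = 0
    for doc in docs:
        action = json.dumps({"index": {"_id": doc["id"]}})
        body = json.dumps(doc)
        pair_size = len(action) + len(body) + 2  # + two newlines
        if current and current_size + pair_size > max_bytes:
            batches.append("\n".join(current) + "\n")
            current, current_size = [], 0
        current.extend([action, body])
        current_size += pair_size
    if current:
        batches.append("\n".join(current) + "\n")  # _bulk requires a trailing newline
    return batches

docs = [{"id": f"prod-{i}", "name": f"Item {i}"} for i in range(3)]
payloads = build_bulk_batches(docs)
print(len(payloads))            # 1 (three tiny docs fit in one request)
print(payloads[0].count("\n"))  # 6 lines: 3 action lines + 3 document lines
```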

Mappings and Index Configuration

Mappings define the schema for your documents — the data type of each field, how it is analyzed, and whether it is indexed. Getting mappings right is critical for search relevance and query performance.

Core Field Types

PUT /articles
{
  "mappings": {
    "properties": {
      # Text: analyzed for full-text search, not sortable/aggregatable
      "title":        { "type": "text", "analyzer": "english" },

      # Keyword: exact match, sortable, aggregatable (facets, filters)
      "status":       { "type": "keyword" },
      "author_id":    { "type": "keyword" },

      # Multi-field: both text AND keyword on the same field
      "category": {
        "type": "text",
        "fields": {
          "keyword": { "type": "keyword", "ignore_above": 256 }
        }
      },

      # Numeric types
      "view_count":   { "type": "integer" },
      "rating":       { "type": "float" },
      "price":        { "type": "double" },
      "clicks":       { "type": "long" },

      # Boolean
      "is_published": { "type": "boolean" },

      # Date (ISO 8601 or custom format)
      "published_at": { "type": "date", "format": "yyyy-MM-dd'T'HH:mm:ssZ" },

      # Nested objects (for arrays of objects with queries)
      "comments": {
        "type": "nested",
        "properties": {
          "user_id": { "type": "keyword" },
          "body":    { "type": "text" }
        }
      },

      # Geo-point (latitude/longitude)
      "location":     { "type": "geo_point" },

      # Dense vector (for ML and k-NN search)
      "embedding":    { "type": "dense_vector", "dims": 768 }
    }
  }
}

Text vs Keyword Fields

The text vs keyword distinction is the single most important mapping decision in Elasticsearch:

  • text: The value is analyzed — tokenized, lowercased, and stemmed. Used for fields you want to search with full-text queries (match, match_phrase). Cannot be used for exact filtering, sorting, or aggregations (use .keyword subfield).
  • keyword: The value is stored as-is. Used for exact match filters, sorting, and aggregations (faceted navigation). Cannot perform full-text search. Examples: status, tag, ID, email, URL.
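The decision maps directly onto which query type you build. A small illustrative helper (not part of any client library) makes the pairing explicit:

```python
def clause_for(field: str, value: str, field_type: str) -> dict:
    """Build the appropriate query clause for a field's mapping type:
    full-text 'match' for analyzed text fields, exact 'term' for keyword fields."""
    if field_type == "text":
        return {"match": {field: value}}
    if field_type == "keyword":
        return {"term": {field: {"value": value}}}
    raise ValueError(f"unsupported field type: {field_type}")

print(clause_for("name", "bluetooth headphones", "text"))
# {'match': {'name': 'bluetooth headphones'}}
print(clause_for("status", "published", "keyword"))
# {'term': {'status': {'value': 'published'}}}
```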

Dynamic Mappings and Strict Mode

# By default, Elasticsearch auto-maps new fields (dynamic: true).
# This is convenient but can cause mapping explosion.
# In production, use strict mode to prevent unexpected fields:

PUT /logs
{
  "mappings": {
    "dynamic": "strict",   # reject documents with unmapped fields
    "properties": {
      "timestamp":  { "type": "date" },
      "level":      { "type": "keyword" },
      "message":    { "type": "text" },
      "service":    { "type": "keyword" }
    }
  }
}

# dynamic: "false"  — unknown fields are ignored (not indexed)
# dynamic: "true"   — unknown fields are auto-mapped (default)
# dynamic: "strict" — unknown fields cause an error (safest for production)
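The behavioral difference between the three modes can be mimicked in a few lines. This sketch only models how unmapped fields are treated (in real Elasticsearch, dynamic: "false" still keeps unknown fields in _source — they are simply not indexed or searchable):

```python
def check_document(doc: dict, mapped_fields: set[str], dynamic: str) -> dict:
    """Mimic how the three dynamic-mapping modes treat unmapped fields."""
    unmapped = set(doc) - mapped_fields
    if not unmapped:
        return doc
    if dynamic == "strict":
        raise ValueError(f"strict mapping rejects unmapped fields: {sorted(unmapped)}")
    if dynamic == "false":
        # unknown fields are not indexed; drop them from the indexed view
        return {k: v for k, v in doc.items() if k in mapped_fields}
    return doc  # "true": unknown fields would be auto-mapped

mapped = {"timestamp", "level", "message", "service"}
doc = {"timestamp": "2026-01-15T10:30:00Z", "level": "error", "surprise_field": 1}

print(check_document(doc, mapped, "false"))  # surprise_field gone from the indexed view
try:
    check_document(doc, mapped, "strict")
except ValueError as e:
    print(e)
```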

Query DSL: match, term, bool, range, and Aggregations

Elasticsearch Query DSL (Domain Specific Language) is a JSON-based API for defining searches. Queries are divided into leaf queries (matching a single condition) and compound queries (combining multiple conditions). Every clause also runs in either query context or filter context — a critical distinction for performance.

Query Context vs Filter Context

# Query context: calculates relevance score (_score)
#   - Used for "how well does this document match?"
#   - NOT cached
#   - Slower for repeated conditions

# Filter context: binary yes/no (no score calculation)
#   - Used for "does this document match exactly?"
#   - Cached automatically by Elasticsearch
#   - Faster for structured data (status, date ranges, IDs)

# Best practice: use filter context for all structured conditions
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "bluetooth headphones" } }  // query context (scored)
      ],
      "filter": [
        { "term": { "category": "electronics" } },       // filter context (cached)
        { "term": { "in_stock": true } },                // filter context (cached)
        { "range": { "price": { "lte": 200 } } }         // filter context (cached)
      ]
    }
  }
}

match Query (Full-Text Search)

# Basic match: tokenizes the query and finds documents with any token
GET /products/_search
{
  "query": {
    "match": {
      "name": "bluetooth headphones"
      # Finds documents containing "bluetooth" OR "headphones"
    }
  }
}

# match with AND operator: both tokens must be present
GET /products/_search
{
  "query": {
    "match": {
      "name": {
        "query": "bluetooth headphones",
        "operator": "and"
      }
    }
  }
}

# match_phrase: tokens must appear in order and adjacent
GET /products/_search
{
  "query": {
    "match_phrase": {
      "description": "noise cancelling headphones"
    }
  }
}

# match_phrase_prefix: for search-as-you-type
GET /products/_search
{
  "query": {
    "match_phrase_prefix": {
      "name": "bluetoo"  # matches "bluetooth", "bluetooth 5.0", etc.
    }
  }
}

# multi_match: search across multiple fields
GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "headphones",
      "fields": ["name^3", "description", "tags^2"],
      # ^3 means name field is boosted 3x in relevance scoring
      "type": "best_fields"  # cross_fields, most_fields, phrase
    }
  }
}

term and terms Queries (Exact Match)

# term: exact match on keyword fields
GET /products/_search
{
  "query": {
    "term": {
      "category": { "value": "electronics" }
    }
  }
}

# Shorthand term
GET /products/_search
{
  "query": { "term": { "status": "published" } }
}

# terms: match any of the listed values (SQL IN)
GET /products/_search
{
  "query": {
    "terms": {
      "category": ["electronics", "computers", "audio"]
    }
  }
}

# IMPORTANT: Do NOT use term on analyzed text fields!
# This will NOT find "Electronics" stored as "electronics":
# { "term": { "category_text": "Electronics" } }   WRONG
# Use keyword mapping for fields you want to filter exactly.

bool Query (Combining Conditions)

# bool query combines any mix of leaf queries
# must:     all clauses must match (AND, contributes to score)
# should:   at least one should match (OR, boosts score)
# filter:   must match, no score (AND, cached — use for structured data)
# must_not: must not match (NOT, no score)

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "name": "headphones" } }
      ],
      "filter": [
        { "term":  { "in_stock": true } },
        { "range": { "price": { "gte": 50, "lte": 300 } } }
      ],
      "should": [
        { "term": { "tags": "noise-cancelling" } },
        { "term": { "tags": "wireless" } }
      ],
      "minimum_should_match": 1,
      "must_not": [
        { "term": { "category": "refurbished" } }
      ]
    }
  }
}

# Nested bool queries for complex logic:
# (category = electronics OR accessories) AND price < 200
GET /products/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "bool": {
            "should": [
              { "term": { "category": "electronics" } },
              { "term": { "category": "accessories" } }
            ]
          }
        },
        { "range": { "price": { "lt": 200 } } }
      ]
    }
  }
}

range Query

# Numeric range
GET /products/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 50,    # greater than or equal
        "lte": 200    # less than or equal
        # gt:  strictly greater than
        # lt:  strictly less than
      }
    }
  }
}

# Date range with relative expressions
GET /logs/_search
{
  "query": {
    "range": {
      "timestamp": {
        "gte": "now-7d/d",    # start of 7 days ago
        "lte": "now/d",       # start of today
        "time_zone": "+08:00"
      }
    }
  }
}

# Date range with explicit ISO dates
GET /articles/_search
{
  "query": {
    "range": {
      "published_at": {
        "gte": "2026-01-01",
        "lt":  "2026-02-01",
        "format": "yyyy-MM-dd"
      }
    }
  }
}

Aggregations: Analytics and Faceted Search

# Aggregations run alongside a query and return analytics.
# Main types: Metric (avg, sum, min, max), Bucket (group-by), Pipeline.

GET /products/_search
{
  "size": 0,       # don't return documents, only aggregations
  "query": {
    "term": { "in_stock": true }
  },
  "aggs": {
    # terms agg: faceted navigation (like SQL GROUP BY)
    "by_category": {
      "terms": {
        "field": "category",
        "size": 10,
        "order": { "_count": "desc" }
      },
      # Nested agg: avg price per category
      "aggs": {
        "avg_price": { "avg": { "field": "price" } }
      }
    },

    # stats agg: min, max, avg, sum, count in one
    "price_stats": {
      "stats": { "field": "price" }
    },

    # date_histogram: aggregate events by time bucket
    "orders_by_day": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "day",
        "format": "yyyy-MM-dd"
      }
    },

    # range agg: bucket by price range
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50,    "key": "under_50" },
          { "from": 50,  "to": 100, "key": "50_to_100" },
          { "from": 100, "to": 200, "key": "100_to_200" },
          { "from": 200, "key": "over_200" }
        ]
      }
    }
  }
}

# Example response for by_category:
# { "buckets": [
#     { "key": "electronics", "doc_count": 542, "avg_price": { "value": 124.5 } },
#     { "key": "accessories", "doc_count": 189, "avg_price": { "value": 45.2 } }
#   ]
# }

Full-Text Search vs Exact Match: Analysis and Tokenization

The most common source of confusion in Elasticsearch is the difference between full-text search and exact-match queries. This comes down to how Elasticsearch processes text through an analyzer.

What is an Analyzer?

# An analyzer transforms text into tokens (terms) for indexing.
# It consists of: character filters → tokenizer → token filters

# Example: "english" analyzer on "Running quickly and Efficiently"
# 1. Tokenizer (standard):  ["Running", "quickly", "and", "Efficiently"]
# 2. Lowercase filter:      ["running", "quickly", "and", "efficiently"]
# 3. Stop word filter:      ["running", "quickly", "efficiently"]   (removes "and")
# 4. Snowball stemmer:      ["run", "quick", "effici"]

# What gets stored in the inverted index: run, quick, effici
# A match query for "runs" will also find this document (stemmed to "run")
# A term query for "Running" will NOT find this document

# Test your analyzer with the _analyze API:
POST /products/_analyze
{
  "analyzer": "english",
  "text": "Running quickly and Efficiently"
}

# Built-in analyzers:
# standard  - tokenize + lowercase (default for text fields)
# english   - + stop words + Porter stemming
# french    - + French stop words + French stemmer
# simple    - tokenize on non-letters, lowercase
# keyword   - no analysis (same as keyword field type)
# whitespace - split on whitespace only
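The pipeline above (tokenizer → lowercase → stop words → stemmer) can be approximated in plain Python. This is a toy: the stemmer below is a crude suffix-stripper, so its stems differ slightly from a real Snowball/Porter stemmer — use the _analyze API to see what Elasticsearch actually produces:

```python
STOP_WORDS = {"and", "or", "the", "a", "an"}

def naive_stem(token: str) -> str:
    """Crude suffix stripping, for illustration only (real analyzers use Snowball)."""
    for suffix in ("iently", "ning", "ing", "ly", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def analyze(text: str) -> list[str]:
    tokens = text.split()                                # 1. tokenizer (very rough)
    tokens = [t.lower() for t in tokens]                 # 2. lowercase filter
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 3. stop word filter
    return [naive_stem(t) for t in tokens]               # 4. stemming

print(analyze("Running quickly and Efficiently"))
# ['run', 'quick', 'effic']  (a real Snowball stemmer yields slightly different stems)
```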

Fuzzy Search and Autocomplete

# Fuzzy search: handles typos using Levenshtein edit distance
GET /products/_search
{
  "query": {
    "match": {
      "name": {
        "query": "hedphones",        # typo for "headphones"
        "fuzziness": "AUTO",         # AUTO: 0 for 1-2 chars, 1 for 3-5, 2 for 6+
        "fuzzy_transpositions": true # allows "ab" -> "ba"
      }
    }
  }
}
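The AUTO rule in the comment above is simple enough to state directly. With the default thresholds (AUTO:3,6), terms shorter than 3 characters must match exactly, terms of 3-5 characters allow one edit, and longer terms allow two:

```python
def auto_fuzziness(term: str, low: int = 3, high: int = 6) -> int:
    """Edit distance chosen by fuzziness AUTO with default thresholds (AUTO:3,6)."""
    if len(term) < low:
        return 0  # 1-2 chars: exact match only
    if len(term) < high:
        return 1  # 3-5 chars: one edit allowed
    return 2      # 6+ chars: two edits allowed

print(auto_fuzziness("tv"))         # 0
print(auto_fuzziness("lamp"))       # 1
print(auto_fuzziness("hedphones"))  # 2 — enough to reach "headphones"
```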

# Search-as-you-type field type (Elasticsearch 7.2+)
# Optimized for real-time autocomplete:
PUT /products
{
  "mappings": {
    "properties": {
      "name": {
        "type": "search_as_you_type"
        # Creates: name, name._2gram, name._3gram, name._index_prefix
      }
    }
  }
}

GET /products/_search
{
  "query": {
    "multi_match": {
      "query": "blue",
      "type": "bool_prefix",
      "fields": ["name", "name._2gram", "name._3gram"]
    }
  }
}

Node.js Client: @elastic/elasticsearch

The official Elasticsearch client for Node.js is @elastic/elasticsearch. It supports all Elasticsearch APIs, TypeScript, async/await, connection pooling, cluster node sniffing, and automatic retries.

# Install
npm install @elastic/elasticsearch
// client.ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({
  node: 'http://localhost:9200',
  // For production with security:
  // node: 'https://my-cluster.elastic-cloud.com:9243',
  // auth: { apiKey: process.env.ES_API_KEY },
  // tls: { rejectUnauthorized: true }
});

export default client;
// search.ts — Full product search function
import client from './client';

interface SearchParams {
  query: string;
  category?: string;
  minPrice?: number;
  maxPrice?: number;
  from?: number;
  size?: number;
}

export async function searchProducts(params: SearchParams) {
  const { query, category, minPrice, maxPrice, from = 0, size = 20 } = params;

  const filters: object[] = [{ term: { in_stock: true } }];

  if (category) {
    filters.push({ term: { category } });
  }

  if (minPrice !== undefined || maxPrice !== undefined) {
    const rangeFilter: Record<string, number> = {};
    if (minPrice !== undefined) rangeFilter.gte = minPrice;
    if (maxPrice !== undefined) rangeFilter.lte = maxPrice;
    filters.push({ range: { price: rangeFilter } });
  }

  const response = await client.search({
    index: 'products',
    from,
    size,
    body: {
      query: {
        bool: {
          must: query
            ? [{ multi_match: { query, fields: ['name^3', 'description', 'tags^2'] } }]
            : [{ match_all: {} }],
          filter: filters,
        },
      },
      sort: [
        { _score: { order: 'desc' } },
        { 'price': { order: 'asc' } },
      ],
      highlight: {
        fields: {
          name: { pre_tags: ['<mark>'], post_tags: ['</mark>'] },
          description: { number_of_fragments: 2, fragment_size: 150 },
        },
      },
      aggs: {
        by_category: { terms: { field: 'category', size: 10 } },
        price_stats: { stats: { field: 'price' } },
      },
    },
  });

  return {
    total: response.hits.total,
    hits: response.hits.hits.map((hit) => ({
      id: hit._id,
      score: hit._score,
      source: hit._source,
      highlight: hit.highlight,
    })),
    aggregations: response.aggregations,
  };
}

// Index a document
export async function indexProduct(product: object, id?: string) {
  return client.index({
    index: 'products',
    id,
    document: product,
    refresh: 'wait_for', // wait until indexed before returning
  });
}

// Bulk index
export async function bulkIndex(products: Array<{ id: string; [key: string]: unknown }>) {
  const operations = products.flatMap((product) => [
    { index: { _index: 'products', _id: product.id } },
    product,
  ]);

  const bulkResponse = await client.bulk({ operations, refresh: true });

  if (bulkResponse.errors) {
    const errors = bulkResponse.items
      .filter((item) => item.index?.error)
      .map((item) => item.index?.error);
    console.error('Bulk indexing errors:', errors);
  }

  return bulkResponse;
}

Python Client: elasticsearch-py

The official Python client for Elasticsearch is elasticsearch-py. It supports both synchronous and async (asyncio) usage, type hints, connection pooling, and all Elasticsearch APIs.

# Install
pip install elasticsearch
# search_products.py
from elasticsearch import Elasticsearch
from typing import Optional

# Connect to local Elasticsearch
es = Elasticsearch("http://localhost:9200")

# For production with authentication:
# es = Elasticsearch(
#     "https://my-cluster.elastic-cloud.com:9243",
#     api_key="your_api_key_here"
# )

def search_products(
    query: str,
    category: Optional[str] = None,
    min_price: Optional[float] = None,
    max_price: Optional[float] = None,
    page: int = 1,
    size: int = 20
) -> dict:
    filters = [{"term": {"in_stock": True}}]

    if category:
        filters.append({"term": {"category": category}})

    if min_price is not None or max_price is not None:
        price_range = {}
        if min_price is not None:
            price_range["gte"] = min_price
        if max_price is not None:
            price_range["lte"] = max_price
        filters.append({"range": {"price": price_range}})

    body = {
        "from": (page - 1) * size,
        "size": size,
        "query": {
            "bool": {
                "must": [
                    {
                        "multi_match": {
                            "query": query,
                            "fields": ["name^3", "description", "tags^2"],
                            "fuzziness": "AUTO"
                        }
                    }
                ] if query else [{"match_all": {}}],
                "filter": filters
            }
        },
        "sort": [
            {"_score": {"order": "desc"}},
            {"price": {"order": "asc"}}
        ],
        "highlight": {
            "fields": {
                "name": {},
                "description": {"number_of_fragments": 2}
            }
        },
        "aggs": {
            "by_category": {"terms": {"field": "category", "size": 10}},
            "price_stats": {"stats": {"field": "price"}}
        }
    }

    response = es.search(index="products", body=body)

    return {
        "total": response["hits"]["total"]["value"],
        "hits": [
            {
                "id": hit["_id"],
                "score": hit["_score"],
                "source": hit["_source"],
                "highlight": hit.get("highlight", {})
            }
            for hit in response["hits"]["hits"]
        ],
        "aggregations": response.get("aggregations", {})
    }


def index_product(product: dict, product_id: Optional[str] = None) -> dict:
    return es.index(
        index="products",
        id=product_id,
        document=product,
        refresh="wait_for"
    )


def bulk_index_products(products: list[dict]) -> dict:
    from elasticsearch.helpers import bulk

    actions = [
        {
            "_index": "products",
            "_id": product.get("id"),
            "_source": product
        }
        for product in products
    ]

    success_count, errors = bulk(es, actions, raise_on_error=False)
    return {"indexed": success_count, "errors": errors}

Kibana and the ELK Stack

Kibana is the visualization and management UI for Elasticsearch. The ELK Stack (now officially called the Elastic Stack) is the combination of:

  • Elasticsearch — Storage, search, and analytics engine.
  • Logstash — Server-side data processing pipeline that ingests data from multiple sources, transforms it, and sends it to Elasticsearch.
  • Kibana — Data visualization, dashboard creation, and cluster management UI.
  • Beats (added later) — Lightweight data shippers: Filebeat (log files), Metricbeat (system metrics), Packetbeat (network data), Heartbeat (uptime monitoring).

Setting Up Log Analytics with Filebeat

# filebeat.yml — ship Nginx access logs to Elasticsearch
filebeat.inputs:
  - type: filestream   # replaces the deprecated "log" input type
    id: nginx-access   # filestream inputs need a unique id
    enabled: true
    paths:
      - /var/log/nginx/access.log
    fields:
      service: nginx
      env: production
    fields_under_root: true

output.elasticsearch:
  hosts: ["http://localhost:9200"]
  index: "nginx-logs-%{+yyyy.MM.dd}"  # daily indices for log rotation

setup.kibana:
  host: "http://localhost:5601"

# Processors to enrich events
processors:
  - add_host_metadata: ~
  - add_cloud_metadata: ~
  # Note: GeoIP enrichment is not a Beats processor. Handle it server-side
  # with an Elasticsearch ingest pipeline that runs the geoip processor.

Index Lifecycle Management (ILM)

For log analytics with time-series data, use ILM to automatically move indices through lifecycle phases: hot (actively written to), warm (still queried but no longer written to), cold (rarely queried), and delete.

# Create an ILM policy for log data
PUT /_ilm/policy/logs-policy
{
  "policy": {
    "phases": {
      "hot": {
        "min_age": "0ms",
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_age": "1d"
          },
          "set_priority": { "priority": 100 }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "shrink": { "number_of_shards": 1 },
          "forcemerge": { "max_num_segments": 1 },
          "set_priority": { "priority": 50 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "set_priority": { "priority": 0 }
          # the old "freeze" action is deprecated in 8.x — cold data is
          # handled by data tiers and searchable snapshots instead
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
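The phase timing in a policy like this is easy to get wrong. A small standalone sketch (the helper names here are my own, not part of any Elasticsearch client) can verify that the min_age values increase from hot to delete:

```python
# Sanity-check that ILM phase min_age values are monotonically increasing.
# Illustrative helper, not part of the Elasticsearch API.

def age_to_hours(age: str) -> float:
    """Convert an ILM age string like '7d', '12h', or '0ms' to hours."""
    units = {"ms": 1 / 3_600_000, "s": 1 / 3600, "m": 1 / 60, "h": 1, "d": 24}
    for suffix in sorted(units, key=len, reverse=True):  # try 'ms' before 'm'/'s'
        if age.endswith(suffix):
            return float(age[: -len(suffix)]) * units[suffix]
    raise ValueError(f"unrecognized age: {age}")

def phases_in_order(policy: dict) -> bool:
    """True when the min_age of each present phase is non-decreasing."""
    order = ["hot", "warm", "cold", "delete"]
    ages = [age_to_hours(policy["phases"][p]["min_age"])
            for p in order if p in policy["phases"]]
    return all(a <= b for a, b in zip(ages, ages[1:]))

policy = {"phases": {
    "hot":    {"min_age": "0ms"},
    "warm":   {"min_age": "7d"},
    "cold":   {"min_age": "30d"},
    "delete": {"min_age": "90d"},
}}
print(phases_in_order(policy))  # True
```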

Performance Tuning and Best Practices

JVM Heap Settings

# config/jvm.options or JVM_OPTS environment variable

# Rule 1: Set heap to no more than 50% of available RAM
# Rule 2: Never exceed 31 GB (compressed OOPs threshold)
# Rule 3: Set Xms = Xmx (avoid heap resizing at runtime)

# For 16 GB RAM machine:
-Xms8g
-Xmx8g

# For 64 GB RAM machine:
-Xms31g
-Xmx31g    # NOT 32g — stay under the compressed OOPs limit

# The other 50% of RAM is used by Lucene's file system cache
# — equally important for search performance!
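The three rules reduce to one line of arithmetic. A hypothetical helper (not an official Elastic tool) that applies them:

```python
# Rule-of-thumb heap sizing: heap = min(RAM / 2, 31 GB), with Xms == Xmx.
# Assumed helper for illustration only.

def recommended_heap_gb(ram_gb: int) -> int:
    return min(ram_gb // 2, 31)

for ram in (8, 16, 64, 128):
    heap = recommended_heap_gb(ram)
    print(f"{ram} GB RAM -> -Xms{heap}g -Xmx{heap}g")
```

Note how the cap kicks in at 64 GB and above: more RAM still helps, but it goes to the file system cache rather than the JVM heap.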

Indexing Performance

# 1. Use the Bulk API — never index one document at a time in production.
#    Optimal batch size: 5–15 MB uncompressed, or 1,000–5,000 documents.

# 2. Increase refresh_interval during bulk ingestion
PUT /my-index/_settings
{
  "index": {
    "refresh_interval": "30s"   # default is 1s; use -1 to disable during bulk load
  }
}

# 3. Disable replicas during initial bulk load, re-enable after
PUT /my-index/_settings
{
  "index": { "number_of_replicas": 0 }
}
# ... bulk load ...
PUT /my-index/_settings
{
  "index": { "number_of_replicas": 1 }
}

# 4. Set translog durability to async during bulk load
PUT /my-index/_settings
{
  "index": {
    "translog": {
      "durability": "async",
      "sync_interval": "60s"
    }
  }
}

# 5. Force merge read-only indices (logs from last month)
POST /logs-2026-01/_forcemerge?max_num_segments=1
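The batch-size guidance in tip #1 can be enforced client-side before calling the Bulk API. A hedged sketch of a batching generator (names are illustrative) that caps each request by uncompressed byte size and document count:

```python
import json

# Split documents into bulk batches capped by serialized size (default 10 MB,
# inside the recommended 5-15 MB window) and by document count (default 5,000).

def bulk_batches(docs, max_bytes=10 * 1024 * 1024, max_docs=5000):
    batch, batch_bytes = [], 0
    for doc in docs:
        size = len(json.dumps(doc).encode("utf-8"))
        if batch and (batch_bytes + size > max_bytes or len(batch) >= max_docs):
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch

docs = [{"id": i, "name": f"product-{i}"} for i in range(12000)]
batches = list(bulk_batches(docs))
print([len(b) for b in batches])  # [5000, 5000, 2000]
```

Each yielded batch would then be passed to `elasticsearch.helpers.bulk` as in the `bulk_index_products` example earlier.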

Search Performance

# 1. Use filter context for all structured conditions (exact match, date range)
#    Filters are cached; query context scores are not.

# 2. Avoid deep pagination with from/size — use search_after instead
# BAD (slow for deep pages):
GET /products/_search
{ "from": 10000, "size": 20 }

# GOOD (consistent performance):
GET /products/_search
{
  "size": 20,
  "sort": [{ "created_at": "desc" }, { "_id": "desc" }],
  "search_after": ["2026-01-15T10:00:00Z", "prod-12345"]
}

# 3. Use _source filtering to reduce network transfer
GET /products/_search
{
  "_source": ["name", "price", "category"],   # only fetch needed fields
  "query": { "match_all": {} }
}

# 4. Avoid wildcard and leading-wildcard queries on large indices
# BAD:  { "wildcard": { "name": "*phone*" } }
# GOOD: use a proper text field with match query

# 5. Profile slow queries
GET /products/_search
{
  "profile": true,
  "query": { "match": { "name": "headphones" } }
}
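The search_after pattern in tip #2 is driven by the `sort` array of the last hit in each response. A minimal cursor helper, using a fabricated response dict in place of a real `es.search()` call:

```python
from typing import Optional

# Build the follow-up search_after request from the previous response.
# The response shape below is hand-written for illustration.

def next_request(base_request: dict, last_page: dict) -> Optional[dict]:
    """Return the next page's request body, or None when results are exhausted."""
    hits = last_page["hits"]["hits"]
    if not hits:
        return None
    return {**base_request, "search_after": hits[-1]["sort"]}

base = {
    "size": 20,
    "sort": [{"created_at": "desc"}, {"_id": "desc"}],
}
page = {"hits": {"hits": [
    {"_id": "prod-1", "sort": ["2026-01-15T10:00:00Z", "prod-1"]},
]}}
print(next_request(base, page)["search_after"])  # ['2026-01-15T10:00:00Z', 'prod-1']
```

The tie-breaker sort on `_id` matters: without it, documents sharing a `created_at` value could be skipped or repeated between pages.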

Elasticsearch vs OpenSearch vs Solr

Choosing a search engine is a long-term architectural decision. Here is a side-by-side comparison of the three main Lucene-based search engines in 2026.

Feature              | Elasticsearch 8.x              | OpenSearch 2.x                  | Apache Solr 9.x
License              | Elastic License 2.0 / AGPL     | Apache 2.0 (fully open)         | Apache 2.0 (fully open)
Backed by            | Elastic NV (commercial)        | AWS + community                 | Apache Software Foundation
Query language       | Query DSL + EQL + ES|QL        | Query DSL (ES 7.10 compatible)  | Lucene query syntax + JSON
Managed cloud        | Elastic Cloud                  | Amazon OpenSearch Service       | No official managed offering
Vector search (k-NN) | Yes (HNSW, dense_vector)       | Yes (k-NN plugin)               | Yes (Solr 9.0+, HNSW)
Kibana equivalent    | Kibana                         | OpenSearch Dashboards           | Banana / custom Grafana
Schema required      | Optional (dynamic mapping)     | Optional (dynamic mapping)      | Yes (schema.xml or schemaless)
Ecosystem maturity   | Very mature, largest community | Growing rapidly                 | Mature, enterprise adoption
Best for             | General search + APM + SIEM    | AWS workloads, cost-sensitive   | Enterprise document search, XML

Which Should You Choose?

  • Choose Elasticsearch if you need the latest features (ES|QL, new vector search improvements, Elastic AI), want Elastic Cloud managed hosting, or are building an observability/SIEM platform using the official integrations.
  • Choose OpenSearch if you are running on AWS, need a fully open-source (Apache 2.0) solution, or want to avoid commercial licensing concerns. The API is compatible with Elasticsearch 7.10 clients.
  • Choose Solr if you need deep XML document support, are in an enterprise environment already using Solr, or need the Apache Software Foundation governance model.

Use Cases: Log Analysis, E-Commerce Search, and Autocomplete

Log Analysis Pipeline

# Application log structure in Elasticsearch
PUT /app-logs
{
  "mappings": {
    "properties": {
      "@timestamp":   { "type": "date" },
      "level":        { "type": "keyword" },   # INFO, WARN, ERROR
      "service":      { "type": "keyword" },
      "trace_id":     { "type": "keyword" },
      "span_id":      { "type": "keyword" },
      "message":      { "type": "text" },
      "error.type":   { "type": "keyword" },
      "error.stack":  { "type": "text", "index": false },  # not searchable, saves space
      "http.method":  { "type": "keyword" },
      "http.status":  { "type": "integer" },
      "http.url":     { "type": "keyword" },
      "duration_ms":  { "type": "long" }
    }
  }
}

# Query: find all ERROR logs from the payment service in the last hour
GET /app-logs/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term":  { "level": "ERROR" } },
        { "term":  { "service": "payment-service" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "aggs": {
    "errors_by_type": { "terms": { "field": "error.type" } },
    "errors_over_time": {
      "date_histogram": {
        "field": "@timestamp",
        "fixed_interval": "5m"
      }
    }
  }
}
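The query above can be parameterized so that dashboards or alerting jobs can reuse it. A sketch with assumed helper names:

```python
# Build the error-log query body for a given service and time window.
# Function name and defaults are illustrative, not a library API.

def error_log_query(service: str, window: str = "1h", level: str = "ERROR") -> dict:
    return {
        "query": {"bool": {"filter": [
            {"term":  {"level": level}},
            {"term":  {"service": service}},
            {"range": {"@timestamp": {"gte": f"now-{window}"}}},
        ]}},
        "sort": [{"@timestamp": "desc"}],
    }

body = error_log_query("payment-service")
print(body["query"]["bool"]["filter"][2])
# {'range': {'@timestamp': {'gte': 'now-1h'}}}
```

All three conditions sit in filter context on purpose: they are exact or range matches, so they should be cached rather than scored.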

E-Commerce Product Search with Facets

# E-commerce search: full-text + filters + facets + sorting
GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "wireless headphones",
            "fields": ["name^4", "brand^2", "description", "tags"],
            "type": "cross_fields",
            "operator": "and"
          }
        }
      ],
      "filter": [
        { "term": { "in_stock": true } },
        { "terms": { "category": ["electronics", "audio"] } },
        { "range": { "price": { "gte": 30, "lte": 300 } } },
        { "range": { "rating_count": { "gte": 10 } } }  # at least 10 reviews
      ],
      "should": [
        { "term": { "is_prime": true } },                 # boost Prime items
        { "term": { "sponsored": false } }                # slight boost for organic
      ]
    }
  },
  "sort": [
    { "_score": "desc" },
    { "sales_rank": "asc" },                              # secondary sort
    { "rating": "desc" }
  ],
  "from": 0,
  "size": 24,
  "aggs": {
    "brands":     { "terms": { "field": "brand", "size": 20 } },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "key": "Under $50",    "to": 50 },
          { "key": "$50 - $100",  "from": 50,  "to": 100 },
          { "key": "$100 - $200", "from": 100, "to": 200 },
          { "key": "Over $200",    "from": 200 }
        ]
      }
    },
    "avg_rating":    { "avg": { "field": "rating" } },
    "connectivity": { "terms": { "field": "features", "size": 10 } }
  }
}
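In a real storefront, the filter clauses are usually assembled from whatever facets the user has selected. An illustrative builder (function and field names are assumptions, not a library API):

```python
# Translate user-selected facets into bool-filter clauses for a product search.

def facet_filters(selected: dict) -> list[dict]:
    clauses = [{"term": {"in_stock": True}}]  # always applied
    if selected.get("categories"):
        clauses.append({"terms": {"category": selected["categories"]}})
    if selected.get("brands"):
        clauses.append({"terms": {"brand": selected["brands"]}})
    if selected.get("price"):
        clauses.append({"range": {"price": dict(selected["price"])}})
    return clauses

filters = facet_filters({"categories": ["audio"], "price": {"gte": 30, "lte": 300}})
print(len(filters))  # 3
```

The resulting list drops straight into the `filter` array of the bool query above; because these are filters, each combination is cached independently of the full-text part of the query.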

Autocomplete with Completion Suggester

# Step 1: Map the suggest field
PUT /products
{
  "mappings": {
    "properties": {
      "name_suggest": {
        "type": "completion",
        "analyzer": "simple",
        "max_input_length": 50
      }
    }
  }
}

# Step 2: Index documents with suggest input
PUT /products/_doc/prod-001
{
  "name": "Wireless Bluetooth Headphones",
  "name_suggest": {
    "input": [
      "Wireless Bluetooth Headphones",
      "Bluetooth Headphones",
      "Wireless Headphones",
      "Headphones"
    ],
    "weight": 100   # boost popular products
  }
}

# Step 3: Query autocomplete suggestions
GET /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "bluetoo",
      "completion": {
        "field": "name_suggest",
        "size": 5,
        "skip_duplicates": true,
        "fuzzy": { "fuzziness": 1 }
      }
    }
  }
}

# Response includes:
# "options": [
#   { "text": "Bluetooth Headphones", "_score": 100 },
#   { "text": "Bluetooth Speaker",    "_score": 95 }
# ]

Frequently Asked Questions

What is Elasticsearch and what is it used for?

Elasticsearch is an open-source, distributed search and analytics engine built on Apache Lucene. It is used for full-text search, log analytics, real-time application monitoring, e-commerce product search, autocomplete, and business intelligence. Its core strengths are near-real-time search, horizontal scalability, and powerful aggregations.

What is the difference between an Index, Document, and Shard?

An Index is a collection of documents with similar characteristics (analogous to a database table). A Document is a JSON object stored in an index (analogous to a row). A Shard is a horizontal partition of an index — Elasticsearch automatically splits large indices into shards to distribute data across multiple nodes. Each shard is itself a fully functional Lucene index.

What is the difference between a match query and a term query?

A match query is for full-text search on analyzed fields. The query string is tokenized and analyzed before matching — great for user-facing search. A term query finds documents with an exact term value. Use it for keyword fields, IDs, status codes, and other structured data. Using term on an analyzed text field will often return no results because the stored tokens are lowercased.

How do I choose the right number of shards?

Target 10–50 GB per shard. For a 100 GB index, 3–5 primary shards is a good starting point. Avoid over-sharding: each shard consumes heap memory and file handles. For indices under 10 GB, a single primary shard is sufficient. The number of primary shards is fixed at index creation time; replicas can be changed at any time.
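That sizing guidance reduces to a one-line calculation. A rough sketch targeting about 30 GB per shard, the midpoint of the recommended range (the helper is illustrative, not an official formula):

```python
import math

# Estimate primary shard count from expected index size, targeting ~30 GB/shard.

def primary_shards(index_size_gb: float, target_gb: float = 30) -> int:
    return max(1, math.ceil(index_size_gb / target_gb))

for size in (5, 100, 600):
    print(f"{size} GB -> {primary_shards(size)} primary shard(s)")
```

Because primary shard count cannot be changed after creation, size against expected growth, not just the current data volume.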

What is the ELK Stack?

The ELK Stack is Elasticsearch + Logstash + Kibana. It is a popular open-source log management and analytics platform. Elasticsearch stores and indexes the data, Logstash ingests and transforms it, and Kibana provides visualization. Modern deployments also include Beats (lightweight data shippers like Filebeat and Metricbeat).

What is the difference between Elasticsearch and OpenSearch?

OpenSearch is a community-driven, Apache 2.0 licensed fork of Elasticsearch 7.10 created by AWS in 2021. They are largely REST API compatible. OpenSearch is free under Apache 2.0 while newer Elasticsearch versions use the Elastic License 2.0 or AGPL. Choose OpenSearch for AWS workloads or when Apache 2.0 licensing is required.

Why is my term query returning no results?

The most common cause is using a term query on an analyzed text field. When Elasticsearch indexes a text field, it applies an analyzer that lowercases tokens. If your mapping uses type: text for a field like category, the stored token is electronics (lowercase), but your term query might be sending Electronics (capitalized). Solution: use keyword type for fields you want to filter exactly, or use the .keyword subfield: term: { "category.keyword": "Electronics" }.
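The mismatch is easy to demonstrate without a cluster. A simplified stand-in for the standard analyzer (real Lucene analysis does more: it handles Unicode word boundaries, stop words, and more) shows why the verbatim term value never matches:

```python
import re

# Toy model of the standard analyzer: tokenize on word characters, lowercase.
# A term query does NOT run this analysis on its input; a match query does.

def standard_analyze(text: str) -> list[str]:
    return [t.lower() for t in re.findall(r"\w+", text)]

indexed_tokens = standard_analyze("Electronics")  # what's stored: ['electronics']
term_query_value = "Electronics"                  # sent verbatim by a term query

print(term_query_value in indexed_tokens)                     # False: no match
print(standard_analyze(term_query_value) == indexed_tokens)   # True: match query analyzes first
```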

How do I improve Elasticsearch query performance?

Key tips: use filter context instead of query context for exact-match conditions (filters are cached), avoid wildcard queries on high-cardinality fields, use keyword fields for exact matches, set JVM heap to no more than 50% of RAM (never exceed 31 GB), use search_after for deep pagination instead of from/size, increase refresh_interval to 30s or higher for write-heavy indices, and use bulk indexing over single-document indexing for batch ingestion.

Key Takeaways

  • Use text for full-text search, keyword for exact match: this is the single most important mapping decision. Use the .keyword subfield on text fields to support both use cases.
  • Put structured conditions in filter context: filters are cached and do not calculate relevance scores. Only use query context (must) for fields that should affect ranking.
  • Size your shards at 10–50 GB: avoid over-sharding. Too many small shards hurt performance more than too few large ones. Set primary shard count at index creation — it cannot be changed without reindexing.
  • Set JVM heap to max 50% of RAM, never above 31 GB: the other 50% is needed by Lucene for OS file system cache, which is critical for search performance.
  • Use the Bulk API for production indexing: batch 1,000–5,000 documents per request for dramatically better throughput. Never index one document at a time in a loop.
  • Use ILM for time-series data: automate hot/warm/cold/delete phases to manage index lifecycle, control storage costs, and maintain query performance on large log datasets.
  • For pagination, use search_after not from/size: deep from/size pagination is extremely expensive because Elasticsearch must sort and discard all preceding documents. search_after keeps the cost of each page roughly constant regardless of depth.
  • Elasticsearch is a search layer, not a primary database: always maintain a source of truth in a transactional database and synchronize relevant fields to Elasticsearch for search.

Related Articles

Database Design Guide: Normalization, ERD, Indexing, SQL vs NoSQL, and Performance Optimization

Master database design fundamentals. Covers normalization (1NF-BCNF), ERD design, primary/foreign keys, indexing strategies, SQL vs NoSQL trade-offs, ACID transactions, real-world schemas (e-commerce, blog, social media), and PostgreSQL performance optimization.

API Design Guide: REST Best Practices, OpenAPI, Auth, Pagination, and Caching

Master API design. Covers REST principles, versioning strategies, JWT/OAuth 2.0 authentication, OpenAPI/Swagger specification, rate limiting, RFC 7807 error handling, pagination patterns, ETags caching, and REST vs GraphQL vs gRPC vs tRPC comparison.

Redis Complete Guide: Caching, Pub/Sub, Streams, and Production Patterns

Master Redis with this complete guide. Covers data types, Node.js ioredis, caching patterns, session storage, Pub/Sub, Streams, Python redis-py, rate limiting, transactions, and production setup.