
Prometheus Complete Guide: Monitoring and Alerting for Modern Infrastructure

22 min read · by DevToolBox Team
TL;DR

Prometheus is an open-source monitoring and alerting toolkit that uses a pull model to collect time series data from /metrics endpoints. It features a powerful PromQL query language, multi-dimensional data model, and native alerting. Pair it with Alertmanager for alert routing, Grafana for dashboards, and Thanos or Cortex for long-term storage. Prometheus is the de facto standard for Kubernetes monitoring.

Key Takeaways
  • Prometheus uses a pull model, actively scraping metrics from target /metrics endpoints
  • Four metric types: Counter, Gauge, Histogram, and Summary
  • PromQL is a powerful functional query language for real-time time series selection and aggregation
  • Alerting is two-part: Prometheus defines rules, Alertmanager handles routing and notification
  • A rich exporter ecosystem covers databases, hardware, message queues, and more
  • Thanos and Cortex address long-term storage and global query view needs

What Is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud in 2012. In 2016, Prometheus became the second project to join the Cloud Native Computing Foundation (CNCF) after Kubernetes and graduated in 2018. It uses a multi-dimensional data model, identifying time series by metric name and key-value label pairs.

Core features of Prometheus include: a pull-based HTTP scrape model, the powerful PromQL query language, a local time-series database with no distributed storage dependency, target discovery via service discovery or static configuration, multiple graphing and dashboard modes, and built-in alert management.

Architecture & Components

The Prometheus ecosystem consists of multiple components, most of which are optional. Understanding how these components work together is fundamental to operating Prometheus effectively.

| Component | Responsibility |
|---|---|
| Prometheus Server | Scrapes and stores time series data |
| Alertmanager | Handles alert deduplication, grouping, routing, and notifications |
| Pushgateway | Allows short-lived jobs to push metrics |
| Exporters | Translate third-party system metrics into Prometheus format |
| Client Libraries | Instrument application code and expose metrics |
| Service Discovery | Automatically discovers scrape targets |

Installing Prometheus

Install with Docker

Docker is the fastest way to get started. Mount your configuration file and a data volume for persistence.

# Pull and run Prometheus with Docker
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest

# Verify it is running
curl http://localhost:9090/-/healthy

Install from Binary

# Download Prometheus binary (Linux amd64)
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar xvfz prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64

# Start Prometheus
./prometheus --config.file=prometheus.yml

# Create a systemd service for production
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
sudo cp prometheus.yml /etc/prometheus/

Docker Compose Full Stack

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command: ["--config.file=/etc/prometheus/prometheus.yml",
              "--storage.tsdb.retention.time=30d"]
  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
volumes:
  prometheus-data:

Configuring prometheus.yml

prometheus.yml is the core configuration file that defines global settings, scrape configurations, alerting rule file paths, and Alertmanager addresses.

# prometheus.yml - complete example
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules
  scrape_timeout: 10s       # Timeout per scrape request

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert-rules.yml"
  - "recording-rules.yml"

scrape_configs:
  # Monitor Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Monitor node exporter
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
    scrape_interval: 10s

  # Monitor application with relabeling
  - job_name: "webapp"
    metrics_path: "/metrics"
    scheme: "https"
    static_configs:
      - targets: ["app1:8080", "app2:8080"]
        labels:
          env: "production"

Metric Types

Prometheus defines four core metric types, each suited for different measurement scenarios. Choosing the correct type is essential for effective monitoring.

| Type | Behavior | Example |
|---|---|---|
| Counter | Monotonically increasing, resets on restart | http_requests_total |
| Gauge | Value that can go up or down | node_memory_available_bytes |
| Histogram | Buckets observations into configurable bins | http_request_duration_seconds |
| Summary | Calculates quantiles over a sliding window | rpc_duration_seconds |

Here is example /metrics output for the counter, gauge, and histogram types (a summary looks similar, exposing quantile-labeled series instead of le buckets).

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="POST",status="201"} 56

# HELP node_memory_available_bytes Available memory in bytes
# TYPE node_memory_available_bytes gauge
node_memory_available_bytes 4.294967296e+09

# HELP http_request_duration_seconds Request duration histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 2400
http_request_duration_seconds_bucket{le="0.1"} 2650
http_request_duration_seconds_bucket{le="0.5"} 2800
http_request_duration_seconds_bucket{le="+Inf"} 2834
http_request_duration_seconds_sum 150.72
http_request_duration_seconds_count 2834

PromQL Basics

PromQL is the functional query language of Prometheus for real-time time series selection and aggregation. It is central to building dashboards and alerting rules.

Selectors & Matchers

# Instant vector - select all time series for a metric
http_requests_total

# Label matching - exact match
http_requests_total{method="GET"}

# Regex matching
http_requests_total{status=~"5.."}

# Negative matching
http_requests_total{method!="DELETE"}

# Range vector - select 5 minutes of data
http_requests_total{method="GET"}[5m]

# Offset - query data from 1 hour ago
http_requests_total offset 1h

Common Functions

# rate() - per-second average rate of increase (for counters)
rate(http_requests_total[5m])

# irate() - instant rate based on last two data points
irate(http_requests_total[5m])

# increase() - total increase over a range
increase(http_requests_total[1h])

# histogram_quantile() - calculate percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# predict_linear() - predict value N seconds from now
predict_linear(node_filesystem_avail_bytes[6h], 24*3600)

# delta() - difference between first and last value
delta(process_resident_memory_bytes[1h])
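
To make histogram_quantile() less of a black box, here is a minimal Python sketch of the bucket interpolation it performs, applied to the cumulative bucket counts from the /metrics example earlier. This is a simplification: the real implementation also handles counter resets, NaN edge cases, and native histograms.

```python
# Cumulative (upper_bound, count) pairs, matching the /metrics histogram example above.
buckets = [(0.05, 2400), (0.1, 2650), (0.5, 2800), (float("inf"), 2834)]

def histogram_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside the bucket
    that contains the q-th ranked observation (simplified sketch)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if upper == float("inf"):
                return prev_bound  # fall back to the last finite bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count

print(round(histogram_quantile(0.95, buckets), 4))  # 0.2128
```

The 0.95 rank (2692.3 of 2834 observations) falls in the (0.1, 0.5] bucket, so the estimate is interpolated between those bounds — which is also why bucket boundaries, not the raw data, determine the accuracy of your percentiles.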

Aggregation Operators

# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum by specific label
sum by (method) (rate(http_requests_total[5m]))

# Average across instances
avg by (instance) (node_cpu_seconds_total)

# Top 5 by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))

# Count of targets with >80% CPU
count(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80)

Recording Rules

Recording rules precompute frequently used or computationally expensive PromQL expressions, storing results as new time series. This improves dashboard query performance and simplifies alerting rule definitions.

# recording-rules.yml
groups:
  - name: http_rules
    interval: 15s
    rules:
      # Request rate per service
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # 95th percentile latency
      - record: job:http_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

Alerting Rules & Alertmanager

Alerting in Prometheus is a two-stage process: the Prometheus server evaluates alerting rules and sends firing alerts to Alertmanager, which deduplicates, groups, silences, and inhibits them, then routes them to the correct receivers.

Alerting Rules Example

# alert-rules.yml
groups:
  - name: critical_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 min."

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

Alertmanager Configuration

# alertmanager.yml
route:
  group_by: ["alertname", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default-email"
  routes:
    - match: { severity: critical }
      receiver: "pagerduty-critical"
    - match: { severity: warning }
      receiver: "slack-warnings"

receivers:
  - name: "default-email"
    email_configs:
      - to: "team@example.com"
  - name: "slack-warnings"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/XXXX"
        channel: "#alerts"
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "your-pagerduty-key"

Service Discovery

Prometheus supports multiple service discovery mechanisms to automatically find scrape targets without maintaining static configuration manually.

scrape_configs:
  # File-based service discovery
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.json"
        refresh_interval: 30s

  # Consul service discovery
  - job_name: "consul"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        services: ["webapp", "api"]

  # DNS-based discovery
  - job_name: "dns"
    dns_sd_configs:
      - names: ["_prometheus._tcp.example.com"]
        type: SRV
        refresh_interval: 30s

  # EC2 discovery
  - job_name: "ec2"
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: env
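
Relabeling rules like the EC2 one above follow a simple contract: join the source label values, match them against the regex, and on a match write the expansion into the target label. Here is a minimal Python sketch of that replace action — not the real implementation (it handles only $1-style references and the replace action), just the core match/write logic:

```python
import re

def relabel(labels, source_labels, regex, target_label,
            replacement="$1", separator=";"):
    """Sketch of Prometheus 'action: replace' relabeling semantics."""
    value = separator.join(labels.get(l, "") for l in source_labels)
    m = re.fullmatch(regex, value)
    if m:  # on no match, 'replace' leaves the label set untouched
        # Convert Prometheus-style $1 references to Python's \1 and expand.
        labels[target_label] = m.expand(replacement.replace("$", "\\"))
    return labels

target = {"__meta_ec2_tag_Environment": "production"}
relabel(target, ["__meta_ec2_tag_Environment"], "(.+)", "env")
print(target["env"])  # production
```

The same mechanics power the keep/drop filtering and __address__ rewriting used in the blackbox and Kubernetes examples later in this guide.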

Exporters

Exporters translate third-party system metrics into Prometheus format. Below are the most commonly used exporters.

| Exporter | Port | Purpose |
|---|---|---|
| node_exporter | 9100 | Linux hardware and OS metrics |
| blackbox_exporter | 9115 | HTTP/TCP/ICMP/DNS probing |
| mysqld_exporter | 9104 | MySQL server metrics |
| postgres_exporter | 9187 | PostgreSQL server metrics |
| redis_exporter | 9121 | Redis server metrics |
| nginx-exporter | 9113 | Nginx connection and request metrics |

Deploying node_exporter

# Run node_exporter with Docker
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

# Verify metrics endpoint
curl http://localhost:9100/metrics | head -20

Blackbox Exporter Configuration

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      follow_redirects: true

# prometheus.yml - scrape config for blackbox
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Instrumenting Applications

Prometheus provides official client libraries to define and expose custom metrics in your application code. Below are examples for Go, Python, and Node.js.

Go

package main

import (
    "net/http"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "myapp_http_requests_total",
        Help: "Total HTTP requests.",
    },
    []string{"method", "status"},
)

func init() { prometheus.MustRegister(httpRequests) }

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Python

# pip install prometheus-client
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter("myapp_requests_total", "Total requests", ["method", "endpoint"])
REQUEST_LATENCY = Histogram(
    "myapp_request_duration_seconds", "Request latency",
    ["endpoint"], buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

def handle_request(method, endpoint):
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        process_request()

start_http_server(8000)  # Expose metrics on :8000/metrics

Node.js

// npm install prom-client express
const client = require("prom-client");
const express = require("express");
const app = express();

client.collectDefaultMetrics();

const httpRequests = new client.Counter({
  name: "myapp_http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
});

app.use((req, res, next) => {
  res.on("finish", () => {
    httpRequests.inc({ method: req.method, route: req.path, status: res.statusCode });
  });
  next();
});

app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3000);

Grafana Integration

Grafana is the most popular visualization tool for Prometheus. After adding Prometheus as a data source in Grafana, you can use PromQL to build rich dashboards.

# Grafana data source provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

Recommended community dashboards: Node Exporter Full (ID: 1860), Prometheus Stats (ID: 2), Kubernetes Cluster (ID: 6417). Here are common panel PromQL queries.

# CPU Usage per Instance (percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage (percentage)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk I/O (reads per second)
rate(node_disk_reads_completed_total[5m])

# Network traffic (bytes per second)
rate(node_network_receive_bytes_total{device!="lo"}[5m])

# HTTP request rate by status code
sum by (status) (rate(http_requests_total[5m]))

Federation

Federation allows one Prometheus server to scrape selected time series from another server. This is useful for multi-datacenter deployments or hierarchical aggregation of metrics.

# Global Prometheus scraping from datacenter instances
scrape_configs:
  - job_name: "federate-dc1"
    scrape_interval: 30s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="node"}'
        - '{job="webapp"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-dc1.example.com:9090"]
        labels:
          datacenter: "dc1"

Long-Term Storage: Thanos & Cortex

Prometheus local storage is suited for short-term retention (typically 15-30 days). For long-term storage and a global query view, Thanos and Cortex are the two leading solutions.

Thanos Architecture

# Thanos Sidecar - runs alongside Prometheus
docker run -d --name thanos-sidecar \
  quay.io/thanos/thanos:latest sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://prometheus:9090 \
  --objstore.config-file=/etc/thanos/bucket.yml

# bucket.yml - S3 object storage
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.amazonaws.com"

# Thanos Querier - global query view
docker run -d --name thanos-querier \
  quay.io/thanos/thanos:latest query \
  --store=thanos-sidecar-dc1:10901 \
  --store=thanos-sidecar-dc2:10901

Cortex Remote Write

# prometheus.yml - remote write to Cortex
remote_write:
  - url: http://cortex-distributor:9009/api/v1/push
    queue_config:
      max_shards: 30
      max_samples_per_send: 1000

| Feature | Thanos | Cortex |
|---|---|---|
| Data Ingestion | Sidecar uploads TSDB blocks | Receives via remote_write |
| Multi-tenancy | Limited | Native support |
| Deployment | Simpler, attaches to existing Prometheus | More complex, standalone services |
| Downsampling | Built-in | External dependency |

Kubernetes Monitoring

Prometheus is the de facto standard for Kubernetes monitoring. The kube-prometheus-stack Helm chart provides a complete out-of-the-box monitoring solution.

# Install kube-prometheus-stack with Helm
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=admin

ServiceMonitor Custom Resource

# ServiceMonitor for auto-discovering app metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
      - staging

Kubernetes Service Discovery

# prometheus.yml - Kubernetes SD (without Operator)
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)
      # Use custom path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Best Practices

Here are key best practices for running Prometheus that help you avoid common pitfalls.

  • Follow metric naming conventions: use snake_case with unit suffixes (_seconds, _bytes, _total)
  • Limit label cardinality: avoid high-cardinality labels like user IDs or request IDs
  • Use recording rules to precompute frequent queries instead of real-time calculation each time
  • Set a reasonable "for" duration on alerts to avoid noise from flapping conditions
  • Monitor Prometheus itself: up, prometheus_tsdb_head_series, prometheus_tsdb_compaction_duration_seconds
  • Use relabel_configs to filter and transform labels at scrape time to reduce storage overhead
  • Set appropriate retention: default is 15 days, adjust based on storage capacity and query needs
  • For queries across multiple Prometheus instances, use federation or Thanos/Cortex
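
The cardinality warning above deserves emphasis, because labels multiply: Prometheus stores one time series per unique label combination. A back-of-the-envelope check (the label value counts here are hypothetical for a single metric name):

```python
# Every label multiplies the series count for a metric.
from math import prod

labels = {"method": 5, "status": 8, "instance": 50}
print(prod(labels.values()))   # 2000 series: manageable

labels["user_id"] = 100_000    # the high-cardinality label the guideline warns about
print(prod(labels.values()))   # 200000000 series: enough to exhaust TSDB memory
```

This is why IDs, emails, and full URLs belong in logs or traces, not in metric labels.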

Metric Naming Examples

# Good metric names
http_requests_total              # counter with _total suffix
http_request_duration_seconds    # histogram with unit suffix
node_memory_available_bytes      # gauge with unit suffix
process_open_fds                 # gauge, no unit needed

# Bad metric names - avoid these
httpRequests                     # no camelCase
request_duration                 # missing unit suffix
http_request_duration_ms         # use base units (seconds not ms)
requests{user_id="12345"}        # high-cardinality label
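
The conventions above are easy to enforce in code review or CI. Here is a tiny checker for the two most common mistakes — a style sketch, not the official metric-name character-set validation:

```python
import re

SNAKE_CASE = re.compile(r"[a-z_][a-z0-9_]*")

def check(name):
    """Flag common metric-naming mistakes (sketch of the conventions above)."""
    issues = []
    if not SNAKE_CASE.fullmatch(name):
        issues.append("not snake_case")
    if name.endswith("_ms"):
        issues.append("use base units (_seconds)")
    return issues or ["ok"]

print(check("http_requests_total"))       # ['ok']
print(check("httpRequests"))              # ['not snake_case']
print(check("http_request_duration_ms"))  # ['use base units (_seconds)']
```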

Prometheus vs Alternatives

| Feature | Prometheus | Datadog | InfluxDB | VictoriaMetrics |
|---|---|---|---|---|
| Type | Open-source, self-hosted | Commercial SaaS | Open-source / commercial | Open-source, self-hosted |
| Data Model | Pull-based, multi-dimensional labels | Push-based, tags + host | Push-based, measurement + tag | Prometheus-compatible |
| Query Language | PromQL | Proprietary | InfluxQL / Flux | MetricsQL (PromQL superset) |
| Long-term Storage | Requires Thanos/Cortex | Built-in | Built-in | Built-in, high compression |
| Cost | Free (operational cost) | Per host/metric pricing | Free OSS / paid enterprise | Free (operational cost) |
| K8s Integration | Native, de facto standard | Via agent | Via Telegraf | Prometheus ecosystem compatible |

Conclusion

Prometheus is the cornerstone of modern infrastructure monitoring. Its pull model, multi-dimensional data model, and powerful PromQL make it the monitoring tool of choice for cloud-native environments. From single-node Docker deployments to large-scale Kubernetes clusters, Prometheus provides reliable metric collection and alerting capabilities. Pair it with Grafana for visualization dashboards, Alertmanager for intelligent alert routing, and Thanos or Cortex for long-term storage to build a complete, production-ready monitoring platform. Whether you are just starting with monitoring or need to scale an existing solution, Prometheus is a powerful tool worth investing in.
