Prometheus is an open-source monitoring and alerting toolkit that uses a pull model to collect time series data from /metrics endpoints. It features a powerful PromQL query language, multi-dimensional data model, and native alerting. Pair it with Alertmanager for alert routing, Grafana for dashboards, and Thanos or Cortex for long-term storage. Prometheus is the de facto standard for Kubernetes monitoring.
- Prometheus uses a pull model, actively scraping metrics from target /metrics endpoints
- Four metric types: Counter, Gauge, Histogram, and Summary
- PromQL is a powerful functional query language for real-time time series selection and aggregation
- Alerting is two-part: Prometheus defines rules, Alertmanager handles routing and notification
- A rich exporter ecosystem covers databases, hardware, message queues, and more
- Thanos and Cortex address long-term storage and global query view needs
What Is Prometheus?
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud in 2012. In 2016, Prometheus became the second project to join the Cloud Native Computing Foundation (CNCF) after Kubernetes and graduated in 2018. It uses a multi-dimensional data model, identifying time series by metric name and key-value label pairs.
Core features of Prometheus include: a pull-based HTTP scrape model, the powerful PromQL query language, a local time-series database with no distributed storage dependency, target discovery via service discovery or static configuration, multiple graphing and dashboard modes, and built-in alert management.
Architecture & Components
The Prometheus ecosystem consists of multiple components, most of which are optional. Understanding how these components work together is fundamental to operating Prometheus effectively.
| Component | Responsibility |
|---|---|
| Prometheus Server | Scrapes and stores time series data |
| Alertmanager | Handles alert deduplication, grouping, routing, and notifications |
| Pushgateway | Allows short-lived jobs to push metrics |
| Exporters | Translate third-party system metrics into Prometheus format |
| Client Libraries | Instrument application code and expose metrics |
| Service Discovery | Automatically discovers scrape targets |
Installing Prometheus
Install with Docker
Docker is the fastest way to get started. Mount your configuration file and a data volume for persistence.
# Pull and run Prometheus with Docker
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest
# Verify it is running
curl http://localhost:9090/-/healthy
Install from Binary
# Download Prometheus binary (Linux amd64)
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar xvfz prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64
# Start Prometheus
./prometheus --config.file=prometheus.yml
# Prepare a dedicated user and directories for a systemd service
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
sudo cp prometheus.yml /etc/prometheus/
Docker Compose Full Stack
# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command: ["--config.file=/etc/prometheus/prometheus.yml",
              "--storage.tsdb.retention.time=30d"]
  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
volumes:
  prometheus-data:
Configuring prometheus.yml
prometheus.yml is the core configuration file that defines global settings, scrape configurations, alerting rule file paths, and Alertmanager addresses.
# prometheus.yml - complete example
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules
  scrape_timeout: 10s       # Timeout per scrape request
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093
rule_files:
  - "alert-rules.yml"
  - "recording-rules.yml"
scrape_configs:
  # Monitor Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]
  # Monitor node exporter
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
    scrape_interval: 10s
  # Monitor application over HTTPS with static labels
  - job_name: "webapp"
    metrics_path: "/metrics"
    scheme: "https"
    static_configs:
      - targets: ["app1:8080", "app2:8080"]
        labels:
          env: "production"
Metric Types
Prometheus defines four core metric types, each suited for different measurement scenarios. Choosing the correct type is essential for effective monitoring.
| Type | Behavior | Example |
|---|---|---|
| Counter | Monotonically increasing, resets on restart | http_requests_total |
| Gauge | Value that can go up or down | node_memory_available_bytes |
| Histogram | Buckets observations into configurable bins | http_request_duration_seconds |
| Summary | Calculates quantiles over a sliding window | rpc_duration_seconds |
Here is example output for each type on the /metrics endpoint.
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="POST",status="201"} 56
# HELP node_memory_available_bytes Available memory in bytes
# TYPE node_memory_available_bytes gauge
node_memory_available_bytes 4.294967296e+09
# HELP http_request_duration_seconds Request duration histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 2400
http_request_duration_seconds_bucket{le="0.1"} 2650
http_request_duration_seconds_bucket{le="0.5"} 2800
http_request_duration_seconds_bucket{le="+Inf"} 2834
http_request_duration_seconds_sum 150.72
http_request_duration_seconds_count 2834
PromQL Basics
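The exposition format above is plain text, which makes it easy to inspect by hand. As a rough illustration of its structure, here is a minimal stdlib-only Python sketch that parses sample lines into tuples; this is a simplification (it ignores escaping and commas inside label values), not the official client's parser.

```python
import re

# name, optional {label="value",...} block, then the sample value
SAMPLE = re.compile(r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
                    r'(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$')

def parse_metrics(text):
    """Parse Prometheus text exposition lines into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        m = SAMPLE.match(line)
        if not m:
            continue
        labels = {}
        if m.group("labels"):
            # naive split: breaks on commas inside quoted label values
            for pair in m.group("labels").split(","):
                key, value = pair.split("=", 1)
                labels[key.strip()] = value.strip().strip('"')
        samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples

exposition = '''# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
node_memory_available_bytes 4.294967296e+09'''
print(parse_metrics(exposition))
```

In practice you never parse this yourself: Prometheus scrapes it, and the client libraries shown later generate it.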
PromQL is Prometheus's functional query language for selecting and aggregating time series in real time. It is central to building dashboards and alerting rules.
Selectors & Matchers
# Instant vector - select all time series for a metric
http_requests_total
# Label matching - exact match
http_requests_total{method="GET"}
# Regex matching
http_requests_total{status=~"5.."}
# Negative matching
http_requests_total{method!="DELETE"}
# Range vector - select 5 minutes of data
http_requests_total{method="GET"}[5m]
# Offset - query data from 1 hour ago
http_requests_total offset 1h
Common Functions
# rate() - per-second average rate of increase (for counters)
rate(http_requests_total[5m])
# irate() - instant rate based on last two data points
irate(http_requests_total[5m])
# increase() - total increase over a range
increase(http_requests_total[1h])
# histogram_quantile() - calculate percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# predict_linear() - predict value N seconds from now
predict_linear(node_filesystem_avail_bytes[6h], 24*3600)
# delta() - difference between first and last value
delta(process_resident_memory_bytes[1h])
Aggregation Operators
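A detail worth internalizing about the counter functions: they compensate for counter resets, where the value drops after a process restart. The stdlib-only sketch below shows the core idea over a list of (timestamp, value) samples; it is a simplification, since real Prometheus also extrapolates to the edges of the range window.

```python
def increase(samples):
    """Total increase of a counter over (timestamp, value) samples.
    A drop in value is treated as a reset: counting resumes from zero."""
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur  # reset: count from 0
    return total

def rate(samples):
    """Per-second average rate of increase over the sampled window."""
    window = samples[-1][0] - samples[0][0]
    return increase(samples) / window

# Counter resets from 1000 back to 50 midway (e.g. a process restart)
samples = [(0, 900), (30, 1000), (60, 50), (90, 150)]
print(increase(samples))  # 100 + 50 + 100 = 250.0
print(rate(samples))      # 250 / 90 ≈ 2.78 per second
```

This is why you always wrap counters in rate() or increase() instead of graphing the raw value: the raw series saw a drop, but the derived rate stayed meaningful.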
# Sum across all instances
sum(rate(http_requests_total[5m]))
# Sum by specific label
sum by (method) (rate(http_requests_total[5m]))
# Average per-second CPU rate per instance (wrap counters in rate())
avg by (instance) (rate(node_cpu_seconds_total[5m]))
# Top 5 by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))
# Count of targets with >80% CPU
count(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80)
Recording Rules
Recording rules precompute frequently used or computationally expensive PromQL expressions, storing results as new time series. This improves dashboard query performance and simplifies alerting rule definitions.
# recording-rules.yml
groups:
  - name: http_rules
    interval: 15s
    rules:
      # Request rate per service
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Error rate percentage
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
      # 95th percentile latency
      - record: job:http_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
Alerting Rules & Alertmanager
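The p95 rule above relies on histogram_quantile(), which estimates a quantile by finding the bucket containing the target rank and interpolating linearly inside it. A simplified stdlib-only sketch of that estimation, using the cumulative buckets from the histogram example earlier in this article (real Prometheus operates on per-series bucket rates, not raw counts):

```python
def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets,
    interpolating linearly inside the bucket containing the target rank."""
    buckets = sorted(buckets)
    total = buckets[-1][1]            # the +Inf bucket holds the total count
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if upper_bound == float("inf"):
                return lower_bound    # quantile falls in the +Inf bucket
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count

# Buckets from the /metrics example in the Metric Types section
buckets = [(0.05, 2400), (0.1, 2650), (0.5, 2800), (float("inf"), 2834)]
print(histogram_quantile(0.95, buckets))  # ≈ 0.213
```

The interpolation explains why bucket boundaries matter: the estimate can only be as precise as the bucket the quantile lands in.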
Alerting in Prometheus is a two-stage process: the Prometheus server evaluates alerting rules and sends firing alerts to Alertmanager, which handles deduplication, grouping, silencing, inhibition, and routing alerts to the correct receivers.
Alerting Rules Example
# alert-rules.yml
groups:
  - name: critical_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 min."
      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"
Alertmanager Configuration
# alertmanager.yml
route:
  group_by: ["alertname", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default-email"
  routes:
    - match: { severity: critical }
      receiver: "pagerduty-critical"
    - match: { severity: warning }
      receiver: "slack-warnings"
receivers:
  - name: "default-email"
    email_configs:
      - to: "team@example.com"
  - name: "slack-warnings"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/XXXX"
        channel: "#alerts"
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "your-pagerduty-key"
Service Discovery
Prometheus supports multiple service discovery mechanisms to automatically find scrape targets without maintaining static configuration manually.
scrape_configs:
  # File-based service discovery
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.json"
        refresh_interval: 30s
  # Consul service discovery
  - job_name: "consul"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        services: ["webapp", "api"]
  # DNS-based discovery
  - job_name: "dns"
    dns_sd_configs:
      - names: ["_prometheus._tcp.example.com"]
        type: SRV
        refresh_interval: 30s
  # EC2 discovery
  - job_name: "ec2"
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: env
Exporters
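The file_sd_configs job above reads target lists from JSON (or YAML) files, which Prometheus re-reads automatically when they change. A hypothetical /etc/prometheus/targets/webapp.json might look like this:

```json
[
  {
    "targets": ["app1:8080", "app2:8080"],
    "labels": {
      "env": "production",
      "team": "platform"
    }
  }
]
```

Because edits are picked up without a reload, file-based discovery is a simple bridge from any configuration management or deployment tool.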
Exporters translate third-party system metrics into Prometheus format. Below are the most commonly used exporters.
| Exporter | Port | Purpose |
|---|---|---|
| node_exporter | 9100 | Linux hardware and OS metrics |
| blackbox_exporter | 9115 | HTTP/TCP/ICMP/DNS probing |
| mysqld_exporter | 9104 | MySQL server metrics |
| postgres_exporter | 9187 | PostgreSQL server metrics |
| redis_exporter | 9121 | Redis server metrics |
| nginx-exporter | 9113 | Nginx connection and request metrics |
Deploying node_exporter
# Run node_exporter with Docker
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host
# Verify metrics endpoint
curl http://localhost:9100/metrics | head -20
Blackbox Exporter Configuration
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      follow_redirects: true
# prometheus.yml - scrape config for blackbox
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
Instrumenting Applications
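The three relabel rules above redirect the scrape: the URL you want probed becomes the ?target= parameter, and the request itself goes to the exporter. A small Python sketch of what they do to one target's label set (illustrative only; real relabeling is driven by the regex/replacement machinery in Prometheus):

```python
def blackbox_relabel(labels):
    """Mimic the three relabel_configs rules from the scrape config above."""
    labels = dict(labels)
    # 1. copy the original address into the probe's ?target= parameter
    labels["__param_target"] = labels["__address__"]
    # 2. copy the probed URL into the user-visible instance label
    labels["instance"] = labels["__param_target"]
    # 3. point the actual scrape at the blackbox exporter itself
    labels["__address__"] = "blackbox-exporter:9115"
    return labels

result = blackbox_relabel({"__address__": "https://example.com"})
print(result)
# Prometheus then scrapes:
#   http://blackbox-exporter:9115/probe?module=http_2xx&target=https://example.com
```

The instance label rewrite in step 2 matters for dashboards: without it, every probe would show up as the exporter's own address.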
Prometheus provides official client libraries to define and expose custom metrics in your application code. Below are examples for Go, Python, and Node.js.
Go
package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "myapp_http_requests_total",
        Help: "Total HTTP requests.",
    },
    []string{"method", "status"},
)

func init() { prometheus.MustRegister(httpRequests) }

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}
Python
# pip install prometheus-client
import time
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter("myapp_requests_total", "Total requests", ["method", "endpoint"])
REQUEST_LATENCY = Histogram(
    "myapp_request_duration_seconds", "Request latency",
    ["endpoint"], buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

def process_request():
    time.sleep(0.02)  # placeholder for real request handling

def handle_request(method, endpoint):
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        process_request()

start_http_server(8000)  # Expose metrics on :8000/metrics
Node.js
// npm install prom-client express
const client = require("prom-client");
const express = require("express");
const app = express();

client.collectDefaultMetrics();

const httpRequests = new client.Counter({
  name: "myapp_http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
});

app.use((req, res, next) => {
  res.on("finish", () => {
    httpRequests.inc({ method: req.method, route: req.path, status: res.statusCode });
  });
  next();
});

app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});

app.listen(3000);
Grafana Integration
Grafana is the most popular visualization tool for Prometheus. After adding Prometheus as a data source in Grafana, you can use PromQL to build rich dashboards.
# Grafana data source provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
Recommended community dashboards: Node Exporter Full (ID: 1860), Prometheus Stats (ID: 2), Kubernetes Cluster (ID: 6417). Here are common panel PromQL queries.
# CPU Usage per Instance (percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory Usage (percentage)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk I/O (reads per second)
rate(node_disk_reads_completed_total[5m])
# Network traffic (bytes per second)
rate(node_network_receive_bytes_total{device!="lo"}[5m])
# HTTP request rate by status code
sum by (status) (rate(http_requests_total[5m]))Federation
Federation allows one Prometheus server to scrape selected time series from another server. This is useful for multi-datacenter deployments or hierarchical aggregation of metrics.
# Global Prometheus scraping from datacenter instances
scrape_configs:
  - job_name: "federate-dc1"
    scrape_interval: 30s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="node"}'
        - '{job="webapp"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-dc1.example.com:9090"]
        labels:
          datacenter: "dc1"
Long-Term Storage: Thanos & Cortex
Prometheus local storage is suited for short-term retention (typically 15-30 days). For long-term storage and a global query view, Thanos and Cortex are the two leading solutions.
Thanos Architecture
# Thanos Sidecar - runs alongside Prometheus
docker run -d --name thanos-sidecar \
  quay.io/thanos/thanos:latest sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://prometheus:9090 \
  --objstore.config-file=/etc/thanos/bucket.yml
# bucket.yml - S3 object storage
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.amazonaws.com"
# Thanos Querier - global query view
docker run -d --name thanos-querier \
  quay.io/thanos/thanos:latest query \
  --store=thanos-sidecar-dc1:10901 \
  --store=thanos-sidecar-dc2:10901
Cortex Remote Write
# prometheus.yml - remote write to Cortex
remote_write:
  - url: http://cortex-distributor:9009/api/v1/push
    queue_config:
      max_shards: 30
      max_samples_per_send: 1000
| Feature | Thanos | Cortex |
|---|---|---|
| Data Ingestion | Sidecar uploads TSDB blocks | Receives via remote_write |
| Multi-tenancy | Limited | Native support |
| Deployment | Simpler, attaches to existing Prometheus | More complex, standalone services |
| Downsampling | Built-in | External dependency |
Kubernetes Monitoring
Prometheus is the de facto standard for Kubernetes monitoring. The kube-prometheus-stack Helm chart provides a complete out-of-the-box monitoring solution.
# Install kube-prometheus-stack with Helm
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=admin
ServiceMonitor Custom Resource
# ServiceMonitor for auto-discovering app metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
      - staging
Kubernetes Service Discovery
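Note that a ServiceMonitor selects Services, not Pods, and matches the endpoint port by name. For the monitor above to find anything, a Service along these lines (hypothetical names, matching the labels and port name used above) must exist in the production or staging namespace:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: webapp
  namespace: production
  labels:
    app: webapp          # matched by the ServiceMonitor's matchLabels
spec:
  selector:
    app: webapp
  ports:
    - name: metrics      # matched by the ServiceMonitor's endpoint port
      port: 8080
      targetPort: 8080
```

A mismatch between the port name here and the ServiceMonitor endpoint is one of the most common reasons targets silently fail to appear.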
# prometheus.yml - Kubernetes SD (without Operator)
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the address to use the port from the annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      # Use custom path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
Best Practices
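With the annotation-driven config above, workloads opt in through pod metadata. A hypothetical Deployment's pod template would carry annotations like these:

```yaml
# Pod template metadata for annotation-based scraping
metadata:
  labels:
    app: webapp
  annotations:
    prometheus.io/scrape: "true"   # matched by the keep rule
    prometheus.io/port: "8080"     # rewritten into __address__
    prometheus.io/path: "/metrics" # rewritten into __metrics_path__
```

The annotation values must be strings (quoted), since Kubernetes annotations are string-typed.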
Here are key best practices for running Prometheus that help you avoid common pitfalls.
- Follow metric naming conventions: use snake_case with unit suffixes (_seconds, _bytes, _total)
- Limit label cardinality: avoid high-cardinality labels like user IDs or request IDs
- Use recording rules to precompute frequent queries instead of real-time calculation each time
- Set a reasonable "for" duration on alerts to avoid noise from flapping
- Monitor Prometheus itself: up, prometheus_tsdb_head_series, prometheus_tsdb_compaction_duration_seconds
- Use relabel_configs to filter and transform labels at scrape time to reduce storage overhead
- Set appropriate retention: default is 15 days, adjust based on storage capacity and query needs
- For queries across multiple Prometheus instances, use federation or Thanos/Cortex
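On label cardinality: the number of time series for one metric is roughly the product of each label's distinct value counts, which is why a single user-ID label is fatal. A quick back-of-the-envelope sketch (the label counts are illustrative):

```python
from math import prod

def series_estimate(label_cardinalities):
    """Rough upper bound on time series for one metric name:
    the product of distinct values per label."""
    return prod(label_cardinalities.values())

# A reasonable metric: 5 methods x 8 status codes x 50 instances
print(series_estimate({"method": 5, "status": 8, "instance": 50}))  # 2000
# The same metric with a user_id label added explodes to 200 million
print(series_estimate({"method": 5, "status": 8, "instance": 50, "user_id": 100_000}))
```

Watch prometheus_tsdb_head_series to catch this kind of explosion before it exhausts memory.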
Metric Naming Examples
# Good metric names
http_requests_total              # counter with _total suffix
http_request_duration_seconds    # histogram with unit suffix
node_memory_available_bytes      # gauge with unit suffix
process_open_fds                 # gauge, no unit needed
# Bad metric names - avoid these
httpRequests                     # no camelCase
request_duration                 # missing unit suffix
http_request_duration_ms         # use base units (seconds, not ms)
requests{user_id="12345"}        # high-cardinality label
Prometheus vs Alternatives
| Feature | Prometheus | Datadog | InfluxDB | Victoria Metrics |
|---|---|---|---|---|
| Type | Open-source, self-hosted | Commercial SaaS | Open-source / commercial | Open-source, self-hosted |
| Data Model | Pull-based, multi-dimensional labels | Push-based, tags + host | Push-based, measurement + tag | Prometheus-compatible |
| Query Language | PromQL | Proprietary | InfluxQL / Flux | MetricsQL (PromQL superset) |
| Long-term Storage | Requires Thanos/Cortex | Built-in | Built-in | Built-in, high compression |
| Cost | Free (operational cost) | Per host/metric pricing | Free OSS / paid enterprise | Free (operational cost) |
| K8s Integration | Native, de facto standard | Via agent | Via Telegraf | Prometheus ecosystem compatible |
Conclusion
Prometheus is the cornerstone of modern infrastructure monitoring. Its pull model, multi-dimensional data model, and powerful PromQL make it the monitoring tool of choice for cloud-native environments. From single-node Docker deployments to large-scale Kubernetes clusters, Prometheus provides reliable metric collection and alerting capabilities. Pair it with Grafana for visualization dashboards, Alertmanager for intelligent alert routing, and Thanos or Cortex for long-term storage to build a complete, production-ready monitoring platform. Whether you are just starting with monitoring or need to scale an existing solution, Prometheus is a powerful tool worth investing in.