
Prometheus Complete Guide: Monitoring and Alerting for Modern Infrastructure

22 min read · by DevToolBox Team
TL;DR

Prometheus is an open-source monitoring and alerting toolkit that uses a pull model to collect time series data from /metrics endpoints. It features a powerful PromQL query language, multi-dimensional data model, and native alerting. Pair it with Alertmanager for alert routing, Grafana for dashboards, and Thanos or Cortex for long-term storage. Prometheus is the de facto standard for Kubernetes monitoring.

Key Takeaways
  • Prometheus uses a pull model, actively scraping metrics from target /metrics endpoints
  • Four metric types: Counter, Gauge, Histogram, and Summary
  • PromQL is a powerful functional query language for real-time time series selection and aggregation
  • Alerting is two-part: Prometheus defines rules, Alertmanager handles routing and notification
  • A rich exporter ecosystem covers databases, hardware, message queues, and more
  • Thanos and Cortex address long-term storage and global query view needs

What Is Prometheus?

Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud in 2012. In 2016, Prometheus became the second project to join the Cloud Native Computing Foundation (CNCF) after Kubernetes and graduated in 2018. It uses a multi-dimensional data model, identifying time series by metric name and key-value label pairs.

Core features of Prometheus include: a pull-based HTTP scrape model, the powerful PromQL query language, a local time-series database with no distributed storage dependency, target discovery via service discovery or static configuration, multiple graphing and dashboard modes, and built-in alert management.

Architecture & Components

The Prometheus ecosystem consists of multiple components, most of which are optional. Understanding how these components work together is fundamental to operating Prometheus effectively.

| Component | Responsibility |
|---|---|
| Prometheus Server | Scrapes and stores time series data |
| Alertmanager | Handles alert deduplication, grouping, routing, and notifications |
| Pushgateway | Allows short-lived jobs to push metrics |
| Exporters | Translate third-party system metrics into Prometheus format |
| Client Libraries | Instrument application code and expose metrics |
| Service Discovery | Automatically discovers scrape targets |

Installing Prometheus

Install with Docker

Docker is the fastest way to get started. Mount your configuration file and a data volume for persistence.

# Pull and run Prometheus with Docker
docker run -d \
  --name prometheus \
  -p 9090:9090 \
  -v /path/to/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v prometheus-data:/prometheus \
  prom/prometheus:latest

# Verify it is running
curl http://localhost:9090/-/healthy

Install from Binary

# Download Prometheus binary (Linux amd64)
wget https://github.com/prometheus/prometheus/releases/download/v2.53.0/prometheus-2.53.0.linux-amd64.tar.gz
tar xvfz prometheus-2.53.0.linux-amd64.tar.gz
cd prometheus-2.53.0.linux-amd64

# Start Prometheus
./prometheus --config.file=prometheus.yml

# Create a systemd service for production
sudo useradd --no-create-home --shell /bin/false prometheus
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo cp prometheus promtool /usr/local/bin/
sudo cp prometheus.yml /etc/prometheus/

Docker Compose Full Stack

# docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:latest
    ports: ["9090:9090"]
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command: ["--config.file=/etc/prometheus/prometheus.yml",
              "--storage.tsdb.retention.time=30d"]
  alertmanager:
    image: prom/alertmanager:latest
    ports: ["9093:9093"]
  grafana:
    image: grafana/grafana:latest
    ports: ["3000:3000"]
volumes:
  prometheus-data:

Configuring prometheus.yml

prometheus.yml is the core configuration file that defines global settings, scrape configurations, alerting rule file paths, and Alertmanager addresses.

# prometheus.yml - complete example
global:
  scrape_interval: 15s      # How often to scrape targets
  evaluation_interval: 15s  # How often to evaluate rules
  scrape_timeout: 10s       # Timeout per scrape request

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - "alert-rules.yml"
  - "recording-rules.yml"

scrape_configs:
  # Monitor Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Monitor node exporter
  - job_name: "node"
    static_configs:
      - targets: ["node-exporter:9100"]
    scrape_interval: 10s

  # Monitor application with relabeling
  - job_name: "webapp"
    metrics_path: "/metrics"
    scheme: "https"
    static_configs:
      - targets: ["app1:8080", "app2:8080"]
        labels:
          env: "production"

Metric Types

Prometheus defines four core metric types, each suited for different measurement scenarios. Choosing the correct type is essential for effective monitoring.

| Type | Behavior | Example |
|---|---|---|
| Counter | Monotonically increasing, resets on restart | http_requests_total |
| Gauge | Value that can go up or down | node_memory_available_bytes |
| Histogram | Buckets observations into configurable bins | http_request_duration_seconds |
| Summary | Calculates quantiles over a sliding window | rpc_duration_seconds |

Here is example /metrics output for the counter, gauge, and histogram types (a summary looks similar, exposing quantile-labeled series instead of le buckets).

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="POST",status="201"} 56

# HELP node_memory_available_bytes Available memory in bytes
# TYPE node_memory_available_bytes gauge
node_memory_available_bytes 4.294967296e+09

# HELP http_request_duration_seconds Request duration histogram
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.05"} 2400
http_request_duration_seconds_bucket{le="0.1"} 2650
http_request_duration_seconds_bucket{le="0.5"} 2800
http_request_duration_seconds_bucket{le="+Inf"} 2834
http_request_duration_seconds_sum 150.72
http_request_duration_seconds_count 2834

PromQL Basics

PromQL is the functional query language of Prometheus for real-time time series selection and aggregation. It is central to building dashboards and alerting rules.

Selectors & Matchers

# Instant vector - select all time series for a metric
http_requests_total

# Label matching - exact match
http_requests_total{method="GET"}

# Regex matching
http_requests_total{status=~"5.."}

# Negative matching
http_requests_total{method!="DELETE"}

# Range vector - select 5 minutes of data
http_requests_total{method="GET"}[5m]

# Offset - query data from 1 hour ago
http_requests_total offset 1h

Common Functions

# rate() - per-second average rate of increase (for counters)
rate(http_requests_total[5m])

# irate() - instant rate based on last two data points
irate(http_requests_total[5m])

# increase() - total increase over a range
increase(http_requests_total[1h])

# histogram_quantile() - calculate percentiles
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))

# predict_linear() - predict value N seconds from now
predict_linear(node_filesystem_avail_bytes[6h], 24*3600)

# delta() - difference between first and last value
delta(process_resident_memory_bytes[1h])
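
To make histogram_quantile() less of a black box, here is a minimal Python sketch of the bucket interpolation it performs, applied to the cumulative bucket counts from the /metrics example earlier. This is a simplification: the real implementation also handles counter resets, NaN edge cases, and native histograms.

```python
# Cumulative (upper_bound, count) pairs, matching the /metrics histogram example above.
buckets = [(0.05, 2400), (0.1, 2650), (0.5, 2800), (float("inf"), 2834)]

def histogram_quantile(q, buckets):
    """Estimate the q-quantile by linear interpolation inside the bucket
    that contains the q-th ranked observation (simplified sketch)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            if upper == float("inf"):
                return prev_bound  # fall back to the last finite bound
            # Assume observations are spread evenly within the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = upper, count

print(round(histogram_quantile(0.95, buckets), 4))  # 0.2128
```

The 0.95 rank (2692.3 of 2834 observations) falls in the (0.1, 0.5] bucket, so the estimate is interpolated between those bounds — which is also why bucket boundaries, not the raw data, determine the accuracy of your percentiles.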

Aggregation Operators

# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum by specific label
sum by (method) (rate(http_requests_total[5m]))

# Average across instances
avg by (instance) (node_cpu_seconds_total)

# Top 5 by request rate
topk(5, sum by (handler) (rate(http_requests_total[5m])))

# Count of targets with >80% CPU
count(100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80)

Recording Rules

Recording rules precompute frequently used or computationally expensive PromQL expressions, storing results as new time series. This improves dashboard query performance and simplifies alerting rule definitions.

# recording-rules.yml
groups:
  - name: http_rules
    interval: 15s
    rules:
      # Request rate per service
      - record: job:http_requests_total:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # 95th percentile latency
      - record: job:http_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))

Alerting Rules & Alertmanager

Alerting in Prometheus is a two-stage process: the Prometheus server evaluates alerting rules and sends firing alerts to Alertmanager, which deduplicates, groups, silences, and inhibits them, then routes them to the correct receivers.

Alerting Rules Example

# alert-rules.yml
groups:
  - name: critical_alerts
    rules:
      - alert: HighErrorRate
        expr: job:http_errors:ratio5m > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | humanizePercentage }} for 5 min."

      - alert: InstanceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) < 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space below 10% on {{ $labels.instance }}"

Alertmanager Configuration

# alertmanager.yml
route:
  group_by: ["alertname", "job"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "default-email"
  routes:
    - match: { severity: critical }
      receiver: "pagerduty-critical"
    - match: { severity: warning }
      receiver: "slack-warnings"

receivers:
  - name: "default-email"
    email_configs:
      - to: "team@example.com"
  - name: "slack-warnings"
    slack_configs:
      - api_url: "https://hooks.slack.com/services/T00/B00/XXXX"
        channel: "#alerts"
  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "your-pagerduty-key"

Service Discovery

Prometheus supports multiple service discovery mechanisms to automatically find scrape targets without maintaining static configuration manually.

scrape_configs:
  # File-based service discovery
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - "/etc/prometheus/targets/*.json"
        refresh_interval: 30s

  # Consul service discovery
  - job_name: "consul"
    consul_sd_configs:
      - server: "consul.example.com:8500"
        services: ["webapp", "api"]

  # DNS-based discovery
  - job_name: "dns"
    dns_sd_configs:
      - names: ["_prometheus._tcp.example.com"]
        type: SRV
        refresh_interval: 30s

  # EC2 discovery
  - job_name: "ec2"
    ec2_sd_configs:
      - region: us-east-1
        port: 9100
    relabel_configs:
      - source_labels: [__meta_ec2_tag_Environment]
        target_label: env
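
Relabeling rules like the EC2 one above follow a simple contract: join the source label values, match them against the regex, and on a match write the expansion into the target label. Here is a minimal Python sketch of that replace action — not the real implementation (it handles only $1-style references and the replace action), just the core match/write logic:

```python
import re

def relabel(labels, source_labels, regex, target_label,
            replacement="$1", separator=";"):
    """Sketch of Prometheus 'action: replace' relabeling semantics."""
    value = separator.join(labels.get(l, "") for l in source_labels)
    m = re.fullmatch(regex, value)
    if m:  # on no match, 'replace' leaves the label set untouched
        # Convert Prometheus-style $1 references to Python's \1 and expand.
        labels[target_label] = m.expand(replacement.replace("$", "\\"))
    return labels

target = {"__meta_ec2_tag_Environment": "production"}
relabel(target, ["__meta_ec2_tag_Environment"], "(.+)", "env")
print(target["env"])  # production
```

The same mechanics power the keep/drop filtering and __address__ rewriting used in the blackbox and Kubernetes examples later in this guide.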

Exporters

Exporters translate third-party system metrics into Prometheus format. Below are the most commonly used exporters.

| Exporter | Port | Purpose |
|---|---|---|
| node_exporter | 9100 | Linux hardware and OS metrics |
| blackbox_exporter | 9115 | HTTP/TCP/ICMP/DNS probing |
| mysqld_exporter | 9104 | MySQL server metrics |
| postgres_exporter | 9187 | PostgreSQL server metrics |
| redis_exporter | 9121 | Redis server metrics |
| nginx-exporter | 9113 | Nginx connection and request metrics |

Deploying node_exporter

# Run node_exporter with Docker
docker run -d \
  --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  quay.io/prometheus/node-exporter:latest \
  --path.rootfs=/host

# Verify metrics endpoint
curl http://localhost:9100/metrics | head -20

Blackbox Exporter Configuration

# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      follow_redirects: true

# prometheus.yml - scrape config for blackbox
scrape_configs:
  - job_name: "blackbox-http"
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115

Instrumenting Applications

Prometheus provides official client libraries to define and expose custom metrics in your application code. Below are examples for Go, Python, and Node.js.

Go

package main

import (
    "net/http"
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "myapp_http_requests_total",
        Help: "Total HTTP requests.",
    },
    []string{"method", "status"},
)

func init() { prometheus.MustRegister(httpRequests) }

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Python

# pip install prometheus-client
from prometheus_client import Counter, Histogram, start_http_server

REQUEST_COUNT = Counter("myapp_requests_total", "Total requests", ["method", "endpoint"])
REQUEST_LATENCY = Histogram(
    "myapp_request_duration_seconds", "Request latency",
    ["endpoint"], buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 5.0]
)

def handle_request(method, endpoint):
    REQUEST_COUNT.labels(method=method, endpoint=endpoint).inc()
    with REQUEST_LATENCY.labels(endpoint=endpoint).time():
        process_request()

start_http_server(8000)  # Expose metrics on :8000/metrics

Node.js

// npm install prom-client express
const client = require("prom-client");
const express = require("express");
const app = express();

client.collectDefaultMetrics();

const httpRequests = new client.Counter({
  name: "myapp_http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "route", "status"],
});

app.use((req, res, next) => {
  res.on("finish", () => {
    httpRequests.inc({ method: req.method, route: req.path, status: res.statusCode });
  });
  next();
});

app.get("/metrics", async (req, res) => {
  res.set("Content-Type", client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3000);

Grafana Integration

Grafana is the most popular visualization tool for Prometheus. After adding Prometheus as a data source in Grafana, you can use PromQL to build rich dashboards.

# Grafana data source provisioning
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
      httpMethod: POST

Recommended community dashboards: Node Exporter Full (ID: 1860), Prometheus Stats (ID: 2), Kubernetes Cluster (ID: 6417). Here are common panel PromQL queries.

# CPU Usage per Instance (percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Memory Usage (percentage)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk I/O (reads per second)
rate(node_disk_reads_completed_total[5m])

# Network traffic (bytes per second)
rate(node_network_receive_bytes_total{device!="lo"}[5m])

# HTTP request rate by status code
sum by (status) (rate(http_requests_total[5m]))

Federation

Federation allows one Prometheus server to scrape selected time series from another server. This is useful for multi-datacenter deployments or hierarchical aggregation of metrics.

# Global Prometheus scraping from datacenter instances
scrape_configs:
  - job_name: "federate-dc1"
    scrape_interval: 30s
    honor_labels: true
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="node"}'
        - '{job="webapp"}'
        - '{__name__=~"job:.*"}'
    static_configs:
      - targets: ["prometheus-dc1.example.com:9090"]
        labels:
          datacenter: "dc1"

Long-Term Storage: Thanos & Cortex

Prometheus local storage is suited for short-term retention (typically 15-30 days). For long-term storage and a global query view, Thanos and Cortex are the two leading solutions.

Thanos Architecture

# Thanos Sidecar - runs alongside Prometheus
docker run -d --name thanos-sidecar \
  quay.io/thanos/thanos:latest sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://prometheus:9090 \
  --objstore.config-file=/etc/thanos/bucket.yml

# bucket.yml - S3 object storage
type: S3
config:
  bucket: "thanos-metrics"
  endpoint: "s3.amazonaws.com"

# Thanos Querier - global query view
docker run -d --name thanos-querier \
  quay.io/thanos/thanos:latest query \
  --store=thanos-sidecar-dc1:10901 \
  --store=thanos-sidecar-dc2:10901

Cortex Remote Write

# prometheus.yml - remote write to Cortex
remote_write:
  - url: http://cortex-distributor:9009/api/v1/push
    queue_config:
      max_shards: 30
      max_samples_per_send: 1000

| Feature | Thanos | Cortex |
|---|---|---|
| Data Ingestion | Sidecar uploads TSDB blocks | Receives via remote_write |
| Multi-tenancy | Limited | Native support |
| Deployment | Simpler, attaches to existing Prometheus | More complex, standalone services |
| Downsampling | Built-in | External dependency |

Kubernetes Monitoring

Prometheus is the de facto standard for Kubernetes monitoring. The kube-prometheus-stack Helm chart provides a complete out-of-the-box monitoring solution.

# Install kube-prometheus-stack with Helm
helm repo add prometheus-community \
  https://prometheus-community.github.io/helm-charts
helm repo update

helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set grafana.adminPassword=admin

ServiceMonitor Custom Resource

# ServiceMonitor for auto-discovering app metrics
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: webapp-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: webapp
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
  namespaceSelector:
    matchNames:
      - production
      - staging

Kubernetes Service Discovery

# prometheus.yml - Kubernetes SD (without Operator)
scrape_configs:
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with annotation prometheus.io/scrape=true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use custom port from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: (.+)
      # Use custom path from annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Best Practices

Here are key best practices for running Prometheus that help you avoid common pitfalls.

  • Follow metric naming conventions: use snake_case with unit suffixes (_seconds, _bytes, _total)
  • Limit label cardinality: avoid high-cardinality labels like user IDs or request IDs
  • Use recording rules to precompute frequent queries instead of real-time calculation each time
  • Set a reasonable "for" duration on alerts to avoid noise from flapping conditions
  • Monitor Prometheus itself: up, prometheus_tsdb_head_series, prometheus_tsdb_compaction_duration_seconds
  • Use relabel_configs to filter and transform labels at scrape time to reduce storage overhead
  • Set appropriate retention: default is 15 days, adjust based on storage capacity and query needs
  • For queries across multiple Prometheus instances, use federation or Thanos/Cortex
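
The cardinality warning above deserves emphasis, because labels multiply: Prometheus stores one time series per unique label combination. A back-of-the-envelope check (the label value counts here are hypothetical for a single metric name):

```python
# Every label multiplies the series count for a metric.
from math import prod

labels = {"method": 5, "status": 8, "instance": 50}
print(prod(labels.values()))   # 2000 series: manageable

labels["user_id"] = 100_000    # the high-cardinality label the guideline warns about
print(prod(labels.values()))   # 200000000 series: enough to exhaust TSDB memory
```

This is why IDs, emails, and full URLs belong in logs or traces, not in metric labels.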

Metric Naming Examples

# Good metric names
http_requests_total              # counter with _total suffix
http_request_duration_seconds    # histogram with unit suffix
node_memory_available_bytes      # gauge with unit suffix
process_open_fds                 # gauge, no unit needed

# Bad metric names - avoid these
httpRequests                     # no camelCase
request_duration                 # missing unit suffix
http_request_duration_ms         # use base units (seconds not ms)
requests{user_id="12345"}        # high-cardinality label
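
The conventions above are easy to enforce in code review or CI. Here is a tiny checker for the two most common mistakes — a style sketch, not the official metric-name character-set validation:

```python
import re

SNAKE_CASE = re.compile(r"[a-z_][a-z0-9_]*")

def check(name):
    """Flag common metric-naming mistakes (sketch of the conventions above)."""
    issues = []
    if not SNAKE_CASE.fullmatch(name):
        issues.append("not snake_case")
    if name.endswith("_ms"):
        issues.append("use base units (_seconds)")
    return issues or ["ok"]

print(check("http_requests_total"))       # ['ok']
print(check("httpRequests"))              # ['not snake_case']
print(check("http_request_duration_ms"))  # ['use base units (_seconds)']
```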

Prometheus vs Alternatives

| Feature | Prometheus | Datadog | InfluxDB | VictoriaMetrics |
|---|---|---|---|---|
| Type | Open-source, self-hosted | Commercial SaaS | Open-source / commercial | Open-source, self-hosted |
| Data Model | Pull-based, multi-dimensional labels | Push-based, tags + host | Push-based, measurement + tag | Prometheus-compatible |
| Query Language | PromQL | Proprietary | InfluxQL / Flux | MetricsQL (PromQL superset) |
| Long-term Storage | Requires Thanos/Cortex | Built-in | Built-in | Built-in, high compression |
| Cost | Free (operational cost) | Per host/metric pricing | Free OSS / paid enterprise | Free (operational cost) |
| K8s Integration | Native, de facto standard | Via agent | Via Telegraf | Prometheus ecosystem compatible |

Conclusion

Prometheus is the cornerstone of modern infrastructure monitoring. Its pull model, multi-dimensional data model, and powerful PromQL make it the monitoring tool of choice for cloud-native environments. From single-node Docker deployments to large-scale Kubernetes clusters, Prometheus provides reliable metric collection and alerting capabilities. Pair it with Grafana for visualization dashboards, Alertmanager for intelligent alert routing, and Thanos or Cortex for long-term storage to build a complete, production-ready monitoring platform. Whether you are just starting with monitoring or need to scale an existing solution, Prometheus is a powerful tool worth investing in.
