
OpenTelemetry Complete Guide: Unified Observability for Modern Applications

20 min read · by DevToolBox Team
TL;DR: OpenTelemetry is the CNCF open-source observability framework that unifies distributed traces, metrics, and logs. It provides a standardized API across languages, a vendor-agnostic Collector for data collection and export, supports auto and manual instrumentation, and is the foundation for building modern observability platforms.
Key Takeaways
  • OTel unifies three signals: traces, metrics, and logs with correlated context
  • Architecture splits into API (interfaces), SDK (implementation), and Collector (data pipeline)
  • Auto-instrumentation generates telemetry data with zero code changes
  • OTLP is the standard protocol supported by all major backends
  • Tail-based sampling reduces costs while retaining error traces
  • Kubernetes Operator simplifies in-cluster deployment and management

What Is OpenTelemetry?

OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework for generating, collecting, and exporting telemetry data. Hosted by the CNCF, it was formed by merging OpenTracing and OpenCensus, providing a unified set of APIs, SDKs, and tools covering traces, metrics, and logs.

OpenTelemetry Architecture

OTel architecture consists of three layers: API (defines interfaces), SDK (provides implementation), and Collector (data pipeline).

Application Layer           Collector Layer            Backend Layer
+--------------------+     +--------------------+    +------------+
| OTel API           |     | Receivers          |    | Jaeger     |
|  TracerProvider    | --> |  otlp, jaeger,     | -> | Tempo      |
|  MeterProvider     |     |  prometheus, zipkin|    | Zipkin     |
|  LoggerProvider    |     +--------------------+    +------------+
+--------------------+     | Processors         |    | Prometheus |
| OTel SDK           |     |  batch, filter,    | -> | Mimir      |
|  SpanProcessor     |     |  attributes, sample|    | Datadog    |
|  MetricReader      |     +--------------------+    +------------+
|  LogRecordProcessor|     | Exporters          |    | New Relic  |
|  OTLP Exporter     |     |  otlp, prometheus, | -> | Grafana    |
+--------------------+     |  datadog, debug    |    | Loki       |
                           +--------------------+    +------------+

API Layer

The API defines zero-dependency interfaces (TracerProvider, MeterProvider, LoggerProvider). Library authors instrument against the API without pulling in specific implementations.
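The API/SDK split follows a familiar pattern: the API ships no-op defaults, and a registered implementation replaces them at runtime. A minimal illustrative sketch of that pattern (simplified TypeScript, not the actual OTel source):

```typescript
// Illustrative sketch of the API/SDK split (not actual OTel code).
// The "API" defines interfaces plus a no-op default; the "SDK"
// registers a real implementation at application startup.
interface Span { end(): void }
interface Tracer { startSpan(name: string): Span }

class NoopTracer implements Tracer {
  startSpan(_name: string): Span { return { end() {} }; }
}

let registered: Tracer = new NoopTracer();

// Library code depends only on this accessor (the "API").
export function getTracer(): Tracer { return registered; }

// Application code installs the concrete implementation (the "SDK").
export function setTracerProvider(tracer: Tracer): void { registered = tracer; }

// A library instrumented against the API works even with no SDK installed:
const span = getTracer().startSpan('lib.operation');
span.end(); // no-op until an SDK is registered
```

This is why library authors can safely depend on `@opentelemetry/api`: with no SDK present, every call degrades to a cheap no-op.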

SDK Layer

The SDK provides concrete implementations including Span processors, metric aggregators, and exporters. Application developers configure the SDK at the entry point.

Collector Layer

The Collector is a standalone service that receives, processes, and exports data. It decouples applications from backends, supporting batching, retries, and multi-destination export.

Three Signals: Traces, Metrics, and Logs

Distributed Traces

Traces record the complete path a request takes through a distributed system. A Trace is composed of multiple Spans forming a tree via parent-child relationships.

Trace: [trace_id: abc123]
|
+-- Span A: API Gateway (root span, 250ms)
|   attributes: http.method=GET, http.url=/api/orders
|   +-- Span B: Order Service (200ms)
|   |   +-- Span C: DB Query (45ms, db.system=postgresql)
|   |   +-- Span D: Cache Lookup (3ms, db.system=redis)
|   +-- Span E: Payment Service (35ms, status=ERROR)

Metrics

OTel Metrics defines Counter (monotonically increasing), Histogram (distribution), Gauge (instantaneous), and UpDownCounter (bidirectional).

Logs

OTel Logs integrates with existing frameworks (Log4j, SLF4J, Python logging) via a Bridge API, correlating log records with trace context.

Installation and Setup

Node.js

npm install @opentelemetry/api @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http
// tracing.ts - Initialize OpenTelemetry
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from
  '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from
  '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from
  '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from
  '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://localhost:4318/v1/metrics',
    }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'my-node-service',
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

Python

pip install opentelemetry-api opentelemetry-sdk \
  opentelemetry-exporter-otlp opentelemetry-instrumentation

# Auto-instrument - no code changes needed:
opentelemetry-instrument \
  --service_name my-python-service \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --exporter_otlp_endpoint http://localhost:4317 \
  python app.py

Go

go get go.opentelemetry.io/otel \
  go.opentelemetry.io/otel/sdk \
  go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp

// main.go
import (
  "context"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
  "go.opentelemetry.io/otel/sdk/resource"
  sdktrace "go.opentelemetry.io/otel/sdk/trace"
  semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func initTracer() (func(context.Context) error, error) {
  exporter, err := otlptracehttp.New(
    context.Background(),
    otlptracehttp.WithEndpoint("localhost:4318"),
    otlptracehttp.WithInsecure(),
  )
  if err != nil { return nil, err }
  tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(exporter),
    sdktrace.WithResource(resource.NewWithAttributes(
      semconv.SchemaURL,
      semconv.ServiceNameKey.String("my-go-service"),
    )),
  )
  otel.SetTracerProvider(tp)
  return tp.Shutdown, nil
}

Java

# Download the Java agent and run with your app
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/\
opentelemetry-java-instrumentation/releases/latest/\
download/opentelemetry-javaagent.jar

java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-java-service \
  -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
  -jar myapp.jar

Auto-Instrumentation

Auto-instrumentation patches popular libraries to generate spans and propagate context without code changes. Commonly supported libraries:

  • Node.js: Express, Fastify, HTTP, gRPC, pg, mysql2, Redis, MongoDB, AWS SDK
  • Python: Flask, Django, FastAPI, requests, psycopg2, SQLAlchemy, Redis, Celery
  • Go: net/http, gRPC, database/sql, Gin, Echo
  • Java: Spring Boot, Servlet, JDBC, Hibernate, Kafka, gRPC, OkHttp
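Instrumentation scope is tunable without code changes. In Node.js, for example, individual instrumentations can be toggled via environment variables (variable names assume the conventions documented for `@opentelemetry/auto-instrumentations-node`; check your installed version):

```shell
# Enable only the instrumentations you need (comma-separated suffixes)
export OTEL_NODE_ENABLED_INSTRUMENTATIONS="http,express,pg"

# ...or keep everything on and disable noisy ones
export OTEL_NODE_DISABLED_INSTRUMENTATIONS="fs,dns"
```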

Manual Instrumentation

Manual instrumentation gives full control over telemetry data for business-specific spans, custom attributes, and metrics.

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service', '1.0.0');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.addEvent('order.validation.started');
      const order = await validateOrder(orderId);
      span.addEvent('order.validation.completed', {
        'order.items_count': order.items.length,
      });
      // Nested span for payment; end it in finally so it
      // closes even if chargePayment throws
      await tracer.startActiveSpan('processPayment',
        async (paymentSpan) => {
          try {
            paymentSpan.setAttribute(
              'payment.method', order.paymentMethod);
            await chargePayment(order);
          } finally {
            paymentSpan.end();
          }
        });
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Custom Metrics

import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('order-service', '1.0.0');

// Counter
const orderCounter = meter.createCounter('orders.processed.total',
  { description: 'Total orders processed', unit: 'orders' });

// Histogram
const durationHist = meter.createHistogram(
  'orders.processing.duration',
  { description: 'Processing time', unit: 'ms' });

// Observable Gauge
const activeGauge = meter.createObservableGauge(
  'orders.active.count',
  { description: 'Active orders' });
activeGauge.addCallback((r) => r.observe(getActiveCount()));

orderCounter.add(1, { 'order.type': 'standard' });
durationHist.record(245, { 'order.type': 'standard' });

Context Propagation

Context propagation links spans across services into complete traces. W3C Trace Context injects trace ID, span ID, and sampling flags via the traceparent header.

// W3C Trace Context header:
// traceparent: 00-<trace-id>-<parent-span-id>-<flags>

import { context, propagation } from '@opentelemetry/api';

// Inject context into outgoing request
function makeRequest(url: string) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return fetch(url, { headers });
}

// Extract context from incoming request
function handleRequest(req: Request) {
  const ctx = propagation.extract(
    context.active(), req.headers);
  return context.with(ctx, () => {
    return tracer.startActiveSpan('handle', (span) => {
      // child of caller span
      span.end();
    });
  });
}
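The traceparent header itself is easy to inspect. A self-contained parser for the W3C format, independent of the OTel SDK (for illustration only; in practice the propagator handles this):

```typescript
// Parse a W3C traceparent header: version-traceid-spanid-flags
// e.g. 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
interface TraceParent {
  version: string;
  traceId: string;   // 32 lowercase hex chars
  spanId: string;    // 16 lowercase hex chars
  sampled: boolean;  // lowest bit of the flags byte
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/
    .exec(header.trim());
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  // All-zero trace or span IDs are invalid per the spec
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const tp = parseTraceparent(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01');
// tp?.sampled === true
```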

Span Attributes and Events

Attributes are key-value metadata attached to a span; events are timestamped records of points in time within it. The OTel Semantic Conventions standardize common attribute names.

// Semantic Conventions examples:
// HTTP: http.request.method, http.response.status_code, url.full
// DB:   db.system, db.statement, db.operation.name
// RPC:  rpc.system, rpc.service, rpc.method

span.setAttribute('http.request.method', 'POST');
span.setAttribute('http.response.status_code', 200);
span.setAttribute('db.system', 'postgresql');
span.setAttribute('db.statement', 'SELECT * FROM orders WHERE id=?');

span.addEvent('cache.miss', {
  'cache.key': 'user:1234',
  'cache.backend': 'redis',
});

span.recordException(new Error('Connection timeout'));

Exporters

Exporters send telemetry data to backends. OTLP is the native protocol supported by all major backends.

  • OTLP (gRPC / HTTP): Recommended standard, supports all three signals
  • Jaeger: Legacy direct export (recent Jaeger versions ingest OTLP natively, so prefer OTLP)
  • Zipkin: Zipkin-compatible backends
  • Prometheus: Expose /metrics endpoint for scraping
  • Console/Debug: For development debugging
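Exporter selection and endpoints can also be set through the standard SDK environment variables defined by the OpenTelemetry specification, which most language SDKs honor:

```shell
# Standard OTel SDK environment variables (per the specification)
export OTEL_SERVICE_NAME="my-service"
export OTEL_TRACES_EXPORTER="otlp"
export OTEL_METRICS_EXPORTER="otlp"
export OTEL_EXPORTER_OTLP_ENDPOINT="http://localhost:4318"
export OTEL_EXPORTER_OTLP_PROTOCOL="http/protobuf"   # or "grpc"
export OTEL_RESOURCE_ATTRIBUTES="deployment.environment=production"
```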

Collector Configuration

The Collector is configured via YAML defining receivers, processors, exporters, and pipelines:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  prometheus:
    config:
      scrape_configs:
        - job_name: app-metrics
          scrape_interval: 15s
          static_configs:
            - targets: ['app:9090']
  jaeger:
    protocols:
      thrift_http: { endpoint: 0.0.0.0:14268 }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  otlp/mimir:
    endpoint: mimir:4317
    tls: { insecure: true }
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, filter, resource, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/mimir]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [debug]

Running the Collector

# Docker
docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:latest

# Docker Compose
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Collector metrics

Sampling Strategies

Collecting every trace in high-traffic systems is impractical. Sampling strategies balance observability and cost.

Head-Based Sampling (SDK)

import {
  TraceIdRatioBasedSampler,
  ParentBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// ParentBased: respect parent decision, sample 10% of roots
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

const sdk = new NodeSDK({ sampler, /* ... */ });

Tail-Based Sampling (Collector)

Tail-based sampling decides in the Collector after seeing complete traces, ideal for retaining all error and high-latency traces.

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
      - name: string-attr-policy
        type: string_attribute
        string_attribute:
          key: priority
          values: [high, critical]

Integrating with Observability Backends

Grafana Stack (Tempo + Mimir + Loki)

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  otlphttp/mimir:
    endpoint: http://mimir:9009/otlp
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]

Datadog

exporters:
  datadog:
    api:
      key: "${DD_API_KEY}"
      site: datadoghq.com
    traces:
      span_name_as_resource_name: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]

New Relic

exporters:
  otlp/newrelic:
    endpoint: otlp.nr-data.net:4317
    headers:
      api-key: "${NEW_RELIC_LICENSE_KEY}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/newrelic]

Kubernetes Deployment

The OpenTelemetry Operator is the recommended way to manage OTel in Kubernetes, providing CRDs for Collectors and auto-instrumentation injection.

# Install cert-manager + OTel Operator
kubectl apply -f https://github.com/cert-manager/cert-manager/\
releases/download/v1.14.0/cert-manager.yaml

helm repo add open-telemetry \
  https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-operator \
  open-telemetry/opentelemetry-operator \
  --namespace otel-system --create-namespace

Collector CRD (DaemonSet Mode)

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: otel-system
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc: { endpoint: 0.0.0.0:4317 }
          http: { endpoint: 0.0.0.0:4318 }
    processors:
      batch: { timeout: 5s }
      k8sattributes:
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      otlp:
        endpoint: tempo.observability:4317
        tls: { insecure: true }
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]

Auto-Instrumentation Injection

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.otel-system:4317
  propagators: [tracecontext, baggage]
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
---
# Annotate Deployment for auto-instrumentation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-node-app
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-nodejs: "true"
    spec:
      containers:
        - name: app
          image: my-node-app:latest

Best Practices

1. Follow Semantic Conventions

Use standard attribute names (e.g., http.request.method not method) for cross-service consistency and automatic backend parsing.

2. Set Resource Attributes

import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
  ATTR_DEPLOYMENT_ENVIRONMENT_NAME,
} from '@opentelemetry/semantic-conventions';

const resource = new Resource({
  [ATTR_SERVICE_NAME]: 'order-service',
  [ATTR_SERVICE_VERSION]: '2.1.0',
  [ATTR_DEPLOYMENT_ENVIRONMENT_NAME]: 'production',
});

3. Control Span Granularity

Create spans for network calls, database operations, and critical business operations. Avoid creating spans in tight loops.

4. Handle Span Lifecycle Correctly

Always end spans in a finally block. Use startActiveSpan for async operations to maintain context.

5. Production Sampling

Never use AlwaysOn sampling in production. Start with ParentBased + TraceIdRatio(0.1), then add tail-based sampling in the Collector to retain error traces.

6. Use the Collector

The Collector provides buffering, retries, batching, and multi-destination export. Reduces network connections and resource usage on the application side.

7. Correlate All Three Signals

// Inject trace context into logs (Node.js + Winston)
import { trace, context } from '@opentelemetry/api';
import winston from 'winston';

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format((info) => {
      const span = trace.getSpan(context.active());
      if (span) {
        const ctx = span.spanContext();
        info.trace_id = ctx.traceId;
        info.span_id = ctx.spanId;
      }
      return info;
    })(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});
// Output: {"message":"Order processed",
//  "trace_id":"abc...","span_id":"def..."}

8. Set Collector Resource Limits

Configure memory_limiter to prevent OOM, set resources.limits in K8s, and monitor the Collector built-in metrics.
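In the Operator CRD, container limits sit alongside the memory_limiter processor. A sketch with illustrative values (keep `limit_mib` below the container memory limit so the processor triggers before the kernel OOM-killer does):

```yaml
# Illustrative resource limits on the Collector CRD
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
spec:
  mode: daemonset
  resources:
    requests: { cpu: 100m, memory: 256Mi }
    limits: { cpu: "1", memory: 512Mi }   # memory_limiter's limit_mib should be lower
```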

Conclusion

OpenTelemetry is becoming the de facto standard for observability. Its vendor-neutral design lets you instrument once and export to any backend, unified context correlation across all three signals makes debugging distributed systems manageable, and the Collector's flexible pipelines make data processing straightforward. Start with auto-instrumentation for quick wins, then gradually add manual instrumentation, tune sampling, and deploy Collector pipelines to build a production-ready, full-stack observability platform.

