- OTel unifies three signals: traces, metrics, and logs with correlated context
- Architecture splits into API (interfaces), SDK (implementation), and Collector (data pipeline)
- Auto-instrumentation generates telemetry data with zero code changes
- OTLP is OpenTelemetry's native protocol, supported by most major backends
- Tail-based sampling reduces costs while retaining error traces
- Kubernetes Operator simplifies in-cluster deployment and management
What Is OpenTelemetry?
OpenTelemetry (OTel) is a vendor-neutral, open-source observability framework for generating, collecting, and exporting telemetry data. Hosted by the CNCF, it was formed by merging OpenTracing and OpenCensus, providing a unified set of APIs, SDKs, and tools covering traces, metrics, and logs.
OpenTelemetry Architecture
OTel architecture consists of three layers: API (defines interfaces), SDK (provides implementation), and Collector (data pipeline).
Application Layer            Collector Layer              Backend Layer
+--------------------+       +--------------------+       +------------+
| OTel API           |       | Receivers          |       | Jaeger     |
|  TracerProvider    | ----> |  otlp, jaeger,     | ----> | Tempo      |
|  MeterProvider     |       |  prometheus, zipkin|       | Zipkin     |
|  LoggerProvider    |       +--------------------+       +------------+
+--------------------+       | Processors         |       | Prometheus |
| OTel SDK           |       |  batch, filter,    | ----> | Mimir      |
|  SpanProcessor     |       |  attributes, sample|       | Datadog    |
|  MetricReader      |       +--------------------+       +------------+
|  LogRecordProcessor|       | Exporters          |       | New Relic  |
|  OTLP Exporter     |       |  otlp, prometheus, | ----> | Grafana    |
+--------------------+       |  datadog, debug    |       | Loki       |
                             +--------------------+       +------------+

API Layer
The API defines zero-dependency interfaces (TracerProvider, MeterProvider, LoggerProvider). Library authors instrument against the API without pulling in specific implementations.
SDK Layer
The SDK provides concrete implementations including Span processors, metric aggregators, and exporters. Application developers configure the SDK at the entry point.
Collector Layer
The Collector is a standalone service that receives, processes, and exports data. It decouples applications from backends, supporting batching, retries, and multi-destination export.
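To see why batching matters, here is a toy sketch (illustrative only, not the Collector's implementation) of the queue-and-flush behavior the batch processor provides: spans accumulate and are sent as a group once the batch fills or a timeout fires, so the backend sees a few large requests instead of many small ones.

```typescript
// Toy batcher: spans are queued and flushed either when the batch
// is full or (in the real Collector) when a timeout elapses.
type Span = { name: string };

class ToyBatcher {
  private queue: Span[] = [];
  public flushed: Span[][] = []; // stands in for "send to backend"

  constructor(private maxBatch: number) {}

  add(span: Span): void {
    this.queue.push(span);
    if (this.queue.length >= this.maxBatch) this.flush();
  }

  flush(): void {
    if (this.queue.length === 0) return;
    this.flushed.push(this.queue);
    this.queue = [];
  }
}

const batcher = new ToyBatcher(3);
for (let i = 0; i < 7; i++) batcher.add({ name: `span-${i}` });
batcher.flush(); // timeout-style final flush
console.log(batcher.flushed.map((b) => b.length)); // [ 3, 3, 1 ]
```

Seven spans become three backend requests instead of seven; the real batch processor adds timeouts, size limits, and retries on top of the same idea.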
Three Signals: Traces, Metrics, and Logs
Distributed Traces
Traces record the complete path a request takes through a distributed system. A Trace is composed of multiple Spans forming a tree via parent-child relationships.
Trace: [trace_id: abc123]
|
+-- Span A: API Gateway (root span, 250ms)
    |     attributes: http.method=GET, http.url=/api/orders
    +-- Span B: Order Service (200ms)
    |     +-- Span C: DB Query (45ms, db.system=postgresql)
    |     +-- Span D: Cache Lookup (3ms, db.system=redis)
    +-- Span E: Payment Service (35ms, status=ERROR)

Metrics
OTel Metrics defines Counter (monotonically increasing), Histogram (distribution), Gauge (instantaneous), and UpDownCounter (bidirectional).
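Conceptually, a Histogram condenses many recorded values into bucket counts plus a sum, which is what the SDK exports instead of raw samples. A simplified sketch of explicit-bucket aggregation (the boundaries here are made up for illustration, not the SDK defaults):

```typescript
// Simplified explicit-bucket histogram aggregation: each recorded
// value increments the count of the first bucket whose upper
// boundary is >= the value; larger values go to an overflow bucket.
const boundaries = [5, 10, 25, 50, 100, 250, 500, 1000]; // ms

function bucketIndex(value: number): number {
  const i = boundaries.findIndex((b) => value <= b);
  return i === -1 ? boundaries.length : i; // overflow bucket
}

const counts = new Array(boundaries.length + 1).fill(0);
let sum = 0;
for (const ms of [3, 7, 42, 180, 1200]) {
  counts[bucketIndex(ms)]++;
  sum += ms;
}
console.log(counts); // [ 1, 1, 0, 1, 0, 1, 0, 0, 1 ]
console.log(sum);    // 1432
```

Backends reconstruct percentiles from these bucket counts, which is why choosing boundaries that match your latency range matters.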
Logs
OTel Logs integrates with existing frameworks (Log4j, SLF4J, Python logging) via a Bridge API, correlating log records with trace context.
Installation and Setup
Node.js
npm install @opentelemetry/api @opentelemetry/sdk-node \
  @opentelemetry/sdk-metrics \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http

// tracing.ts - Initialize OpenTelemetry
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://localhost:4318/v1/metrics',
    }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'my-node-service',
});

sdk.start();
process.on('SIGTERM', () => sdk.shutdown());

Python
pip install opentelemetry-distro opentelemetry-exporter-otlp
# Install instrumentations for the libraries detected in your environment:
opentelemetry-bootstrap -a install

# Auto-instrument - no code changes needed:
opentelemetry-instrument \
  --service_name my-python-service \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --exporter_otlp_endpoint http://localhost:4317 \
  python app.py

Go
go get go.opentelemetry.io/otel \
go.opentelemetry.io/otel/sdk \
go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp
// main.go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

func initTracer() (func(context.Context) error, error) {
	exporter, err := otlptracehttp.New(
		context.Background(),
		otlptracehttp.WithEndpoint("localhost:4318"),
		otlptracehttp.WithInsecure(),
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-go-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	return tp.Shutdown, nil
}

Java
# Download the Java agent and run with your app
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-java-service \
  -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
  -jar myapp.jar

Auto-Instrumentation
Auto-instrumentation patches popular libraries to generate spans and propagate context without code changes. Commonly supported libraries:
- Node.js: Express, Fastify, HTTP, gRPC, pg, mysql2, Redis, MongoDB, AWS SDK
- Python: Flask, Django, FastAPI, requests, psycopg2, SQLAlchemy, Redis, Celery
- Go: net/http, gRPC, database/sql, Gin, Echo
- Java: Spring Boot, Servlet, JDBC, Hibernate, Kafka, gRPC, OkHttp
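Under the hood, auto-instrumentation wraps these libraries' entry points at load time so every call is timed and given a span without touching call sites. A stripped-down illustration of the wrapping idea (the real packages patch module loading and emit OTel spans via the API; this toy version just records timings):

```typescript
// Toy illustration of function wrapping, the core trick behind
// auto-instrumentation. Real instrumentations hook module loading
// (e.g. require-in-the-middle) and create OTel spans instead.
type FakeSpan = { name: string; durationMs: number };
const recorded: FakeSpan[] = [];

function instrument<A extends unknown[], R>(
  name: string,
  fn: (...args: A) => R,
): (...args: A) => R {
  return (...args: A): R => {
    const start = Date.now();
    try {
      return fn(...args);
    } finally {
      // Record a "span" even if the wrapped call throws.
      recorded.push({ name, durationMs: Date.now() - start });
    }
  };
}

// "Patch" a library function without changing its call sites.
const query = instrument('db.query', (sql: string) => `rows for: ${sql}`);
console.log(query('SELECT 1'));  // rows for: SELECT 1
console.log(recorded[0].name);   // db.query
```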
Manual Instrumentation
Manual instrumentation gives full control over telemetry data for business-specific spans, custom attributes, and metrics.
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service', '1.0.0');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.addEvent('order.validation.started');
      const order = await validateOrder(orderId);
      span.addEvent('order.validation.completed', {
        'order.items_count': order.items.length,
      });
      // Nested span for payment
      await tracer.startActiveSpan('processPayment', async (paymentSpan) => {
        paymentSpan.setAttribute('payment.method', order.paymentMethod);
        await chargePayment(order);
        paymentSpan.end();
      });
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Custom Metrics
import { metrics } from '@opentelemetry/api';

const meter = metrics.getMeter('order-service', '1.0.0');

// Counter
const orderCounter = meter.createCounter('orders.processed.total', {
  description: 'Total orders processed',
  unit: 'orders',
});

// Histogram
const durationHist = meter.createHistogram('orders.processing.duration', {
  description: 'Processing time',
  unit: 'ms',
});

// Observable Gauge
const activeGauge = meter.createObservableGauge('orders.active.count', {
  description: 'Active orders',
});
activeGauge.addCallback((r) => r.observe(getActiveCount()));

orderCounter.add(1, { 'order.type': 'standard' });
durationHist.record(245, { 'order.type': 'standard' });

Context Propagation
Context propagation links spans across services into complete traces. W3C Trace Context injects trace ID, span ID, and sampling flags via the traceparent header.
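The header itself is plain text, so a small hypothetical parser makes the format concrete (production code should rely on the propagation API rather than parsing headers by hand):

```typescript
// Hypothetical parser for the W3C traceparent header, to show what
// the propagator injects and extracts.
interface TraceParent {
  version: string;
  traceId: string;  // 16 bytes, lowercase hex
  spanId: string;   // 8 bytes, lowercase hex
  sampled: boolean; // lowest bit of the trace-flags field
}

function parseTraceparent(header: string): TraceParent | null {
  const m = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header,
  );
  if (!m) return null;
  const [, version, traceId, spanId, flags] = m;
  return { version, traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const tp = parseTraceparent(
  '00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01',
);
console.log(tp?.traceId); // 0af7651916cd43dd8448eb211c80319c
console.log(tp?.sampled); // true
```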
// W3C Trace Context header:
// traceparent: 00-<trace-id>-<parent-span-id>-<flags>
import { context, propagation } from '@opentelemetry/api';

// Inject context into an outgoing request
function makeRequest(url: string) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return fetch(url, { headers });
}

// Extract context from an incoming request
function handleRequest(req: Request) {
  const ctx = propagation.extract(context.active(), req.headers);
  return context.with(ctx, () => {
    return tracer.startActiveSpan('handle', (span) => {
      // child of the caller's span
      span.end();
    });
  });
}

Span Attributes and Events
Attributes are key-value metadata attached to a span; events are timestamped, point-in-time records within it. The OTel Semantic Conventions standardize common attribute names.
// Semantic Conventions examples:
// HTTP: http.request.method, http.response.status_code, url.full
// DB: db.system, db.statement, db.operation.name
// RPC: rpc.system, rpc.service, rpc.method
span.setAttribute('http.request.method', 'POST');
span.setAttribute('http.response.status_code', 200);
span.setAttribute('db.system', 'postgresql');
span.setAttribute('db.statement', 'SELECT * FROM orders WHERE id=?');
span.addEvent('cache.miss', {
  'cache.key': 'user:1234',
  'cache.backend': 'redis',
});
span.recordException(new Error('Connection timeout'));

Exporters
Exporters send telemetry data to backends. OTLP is OpenTelemetry's native protocol and is supported by most major backends.
- OTLP (gRPC / HTTP): Recommended standard, supports all three signals
- Jaeger: direct export to Jaeger (legacy; recent Jaeger releases ingest OTLP natively)
- Zipkin: Zipkin-compatible backends
- Prometheus: Expose /metrics endpoint for scraping
- Console/Debug: For development debugging
Collector Configuration
The Collector is configured via YAML defining receivers, processors, exporters, and pipelines:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: "0.0.0.0:4317" }
      http: { endpoint: "0.0.0.0:4318" }
  prometheus:
    config:
      scrape_configs:
        - job_name: app-metrics
          scrape_interval: 15s
          static_configs:
            - targets: ['app:9090']
  jaeger:
    protocols:
      thrift_http: { endpoint: "0.0.0.0:14268" }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  otlp/mimir:
    endpoint: mimir:4317
    tls: { insecure: true }
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, filter, resource, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp/mimir]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [debug]

Running the Collector
# Docker
docker run -d --name otel-collector \
-p 4317:4317 -p 4318:4318 \
-v ./otel-collector-config.yaml:/etc/otelcol/config.yaml \
otel/opentelemetry-collector-contrib:latest
# Docker Compose
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Collector metrics

Sampling Strategies
Collecting every trace in high-traffic systems is impractical. Sampling strategies balance observability and cost.
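Both families of strategies need a per-trace decision that every service can reach independently; deriving it from the trace ID achieves that. A simplified sketch of the idea behind ratio sampling (the SDK's TraceIdRatioBasedSampler differs in its exact algorithm):

```typescript
// Simplified trace-ID ratio sampler: the decision is a pure function
// of the trace ID, so every service sampling the same trace agrees.
function shouldSample(traceId: string, ratio: number): boolean {
  // Treat the last 8 hex chars (32 bits) of the trace ID as a
  // uniformly distributed value and compare against the ratio.
  const slice = parseInt(traceId.slice(-8), 16);
  return slice < ratio * 0x100000000;
}

const id = '0af7651916cd43dd8448eb211c80319c';
console.log(shouldSample(id, 1.0)); // true  (sample everything)
console.log(shouldSample(id, 0.0)); // false (sample nothing)
```

Because trace IDs are random, roughly `ratio` of all traces pass the check, and the decision is reproducible anywhere the trace ID is known.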
Head-Based Sampling (SDK)
import {
TraceIdRatioBasedSampler,
ParentBasedSampler,
} from '@opentelemetry/sdk-trace-base';
// ParentBased: respect parent decision, sample 10% of roots
const sampler = new ParentBasedSampler({
root: new TraceIdRatioBasedSampler(0.1),
});
const sdk = new NodeSDK({ sampler, /* ... */ });

Tail-Based Sampling (Collector)
Tail-based sampling decides in the Collector after seeing complete traces, ideal for retaining all error and high-latency traces.
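The decision logic can be pictured as a pure function over a buffered trace; this toy version mirrors error, latency, and probabilistic policies conceptually (it is not the Collector's implementation):

```typescript
// Toy tail-sampling decision: after buffering a complete trace, keep
// it if any span errored or latency exceeds a threshold; otherwise
// keep a small random fraction.
type BufferedSpan = { durationMs: number; isError: boolean };

function keepTrace(
  spans: BufferedSpan[],
  latencyThresholdMs: number,
  probabilisticPct: number,
  rand: () => number = Math.random, // injectable for determinism
): boolean {
  if (spans.some((s) => s.isError)) return true;
  const maxLatency = Math.max(...spans.map((s) => s.durationMs));
  if (maxLatency >= latencyThresholdMs) return true;
  return rand() * 100 < probabilisticPct;
}

console.log(keepTrace([{ durationMs: 50, isError: true }], 2000, 5));    // true
console.log(keepTrace([{ durationMs: 3000, isError: false }], 2000, 5)); // true
console.log(keepTrace([{ durationMs: 50, isError: false }], 2000, 5, () => 0.99)); // false
```

The `decision_wait` setting below corresponds to how long traces are buffered before this kind of decision runs.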
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
      - name: string-attr-policy
        type: string_attribute
        string_attribute:
          key: priority
          values: [high, critical]

Integrating with Observability Backends
Grafana Stack (Tempo + Mimir + Loki)
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  otlphttp/mimir:
    endpoint: http://mimir:9009/otlp
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]

Datadog
exporters:
  datadog:
    api:
      key: "${env:DD_API_KEY}"
      site: datadoghq.com
    traces:
      span_name_as_resource_name: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]

New Relic
exporters:
  otlp/newrelic:
    endpoint: otlp.nr-data.net:4317
    headers:
      api-key: "${env:NEW_RELIC_LICENSE_KEY}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/newrelic]

Kubernetes Deployment
The OpenTelemetry Operator is the recommended way to manage OTel in Kubernetes, providing CRDs for Collectors and auto-instrumentation injection.
# Install cert-manager + OTel Operator
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

helm repo add open-telemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-operator open-telemetry/opentelemetry-operator \
  --namespace otel-system --create-namespace

Collector CRD (DaemonSet Mode)
apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: otel-system
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc: { endpoint: "0.0.0.0:4317" }
          http: { endpoint: "0.0.0.0:4318" }
    processors:
      batch: { timeout: 5s }
      k8sattributes:
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      otlp:
        endpoint: tempo.observability:4317
        tls: { insecure: true }
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]

Auto-Instrumentation Injection
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.otel-system:4317
  propagators: [tracecontext, baggage]
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
---
# Annotate the Deployment to opt in to auto-instrumentation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-node-app
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-nodejs: "true"
    spec:
      containers:
        - name: app
          image: my-node-app:latest

Best Practices
1. Follow Semantic Conventions
Use standard attribute names (e.g., http.request.method not method) for cross-service consistency and automatic backend parsing.
2. Set Resource Attributes
import { Resource } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';
// deployment.environment.name is not yet stable; it lives in the
// incubating entry point of the semantic-conventions package.
import { ATTR_DEPLOYMENT_ENVIRONMENT_NAME } from '@opentelemetry/semantic-conventions/incubating';

const resource = new Resource({
  [ATTR_SERVICE_NAME]: 'order-service',
  [ATTR_SERVICE_VERSION]: '2.1.0',
  [ATTR_DEPLOYMENT_ENVIRONMENT_NAME]: 'production',
});

3. Control Span Granularity
Create spans for network calls, database operations, and critical business operations. Avoid creating spans in tight loops.
4. Handle Span Lifecycle Correctly
Always end spans in a finally block. Use startActiveSpan for async operations to maintain context.
5. Production Sampling
Avoid AlwaysOn sampling in production. Start with ParentBased + TraceIdRatio (for example 0.1), then add tail-based sampling in the Collector to retain error traces.
6. Use the Collector
The Collector provides buffering, retries, batching, and multi-destination export. Reduces network connections and resource usage on the application side.
7. Correlate All Three Signals
// Inject trace context into logs (Node.js + Winston)
import { trace, context } from '@opentelemetry/api';
import winston from 'winston';
const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format((info) => {
      const span = trace.getSpan(context.active());
      if (span) {
        const ctx = span.spanContext();
        info.trace_id = ctx.traceId;
        info.span_id = ctx.spanId;
      }
      return info;
    })(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});

// Output: {"message":"Order processed",
//          "trace_id":"abc...","span_id":"def..."}

8. Set Collector Resource Limits
Configure memory_limiter to prevent OOM kills, set resources.limits in Kubernetes, and monitor the Collector's built-in metrics (exposed on port 8888).
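In the Operator CRD this pairs naturally with container limits; an illustrative fragment (values are examples, tune them to your traffic, and keep limit_mib at roughly 80% of the container memory limit):

```yaml
# Illustrative resource limits for an Operator-managed Collector.
spec:
  resources:
    requests:
      cpu: 200m
      memory: 400Mi
    limits:
      cpu: "1"
      memory: 512Mi
```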
Conclusion
OpenTelemetry is becoming the de facto standard for observability. Its vendor-neutral design lets you instrument once and export to any backend, and unified context correlation across all three signals makes debugging distributed systems manageable. The Collector's flexible pipelines make data processing straightforward. Start with auto-instrumentation for quick wins, then gradually add manual instrumentation, tune sampling, and deploy Collector pipelines to build a production-ready, full-stack observability platform.