The Complete OpenTelemetry Guide: Unified Observability for Modern Applications

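20 min read, by DevToolBox Team

TL;DR: OpenTelemetry is the CNCF's open-source observability framework. It unifies distributed tracing, metrics, and logs behind standard cross-language APIs, uses the Collector for vendor-neutral collection and export, and supports both automatic and manual instrumentation, which makes it a solid foundation for a modern observability platform.

Key Takeaways
  • OTel unifies three signals (traces, metrics, and logs) and correlates them through shared context
  • The architecture is split into the API (interfaces), the SDK (implementation), and the Collector (data pipeline)
  • Auto-instrumentation can produce telemetry with little or no change to application code
  • OTLP is the standard wire protocol and is accepted by most major backends
  • Tail sampling can cut cost while still keeping error traces
  • The Kubernetes Operator simplifies in-cluster deployment and management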

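What Is OpenTelemetry?

OpenTelemetry (OTel for short) is a vendor-neutral, open-source observability framework for generating, collecting, and exporting telemetry data. It is hosted by the CNCF and was formed by merging OpenTracing and OpenCensus. It provides a single set of APIs, SDKs, and tools covering three signals: traces, metrics, and logs.

OpenTelemetry Architecture

The OTel architecture has three layers: the API (defines interfaces), the SDK (provides the implementation), and the Collector (the data pipeline).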

Application Layer           Collector Layer           Backend Layer
+---------------------+     +--------------------+    +------------+
| OTel API            |     | Receivers          |    | Jaeger     |
|  TracerProvider     | --> |  otlp, jaeger,     | -> | Tempo      |
|  MeterProvider      |     |  prometheus, zipkin|    | Zipkin     |
|  LoggerProvider     |     +--------------------+    +------------+
+---------------------+     | Processors         |    | Prometheus |
| OTel SDK            |     |  batch, filter,    | -> | Mimir      |
|  SpanProcessor      |     |  attributes, sample|    | Datadog    |
|  MetricReader       |     +--------------------+    +------------+
|  LogRecordProcessor |     | Exporters          |    | New Relic  |
|  OTLP Exporter      |     |  otlp, prometheus, | -> | Grafana    |
+---------------------+     |  datadog, debug    |    | Loki       |
                            +--------------------+    +------------+

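API Layer

The API defines the interfaces (TracerProvider, MeterProvider, LoggerProvider) and ships with minimal dependencies. Without an SDK registered, API calls are no-ops, so library authors can instrument against the API without forcing a particular implementation on their users.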
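
The split between a dependency-free API and a pluggable SDK can be sketched in plain TypeScript. This is not the real @opentelemetry/api: MiniSpan, MiniTracer, registerTracer, and the other names are hypothetical, and the sketch only illustrates the no-op default and late registration.

```typescript
// Hypothetical mini "API": interfaces plus a no-op default, no dependencies.
interface MiniSpan {
  end(): void;
}
interface MiniTracer {
  startSpan(spanName: string): MiniSpan;
}

const noopSpan: MiniSpan = { end() {} };
const noopTracer: MiniTracer = { startSpan: () => noopSpan };

// Global slot that an SDK may fill in; until then every call is a no-op.
let activeTracer: MiniTracer = noopTracer;

function registerTracer(tracer: MiniTracer): void {
  activeTracer = tracer;
}

// Library code depends only on the interfaces above. The proxy looks up the
// active tracer on every call, so registration order does not matter.
const libraryTracer: MiniTracer = {
  startSpan: (spanName) => activeTracer.startSpan(spanName),
};

function libraryOperation(): void {
  const span = libraryTracer.startSpan('library.operation');
  span.end();
}

// Hypothetical mini "SDK": a concrete tracer that records finished spans.
const finishedSpans: string[] = [];
const recordingTracer: MiniTracer = {
  startSpan: (spanName) => ({
    end: () => {
      finishedSpans.push(spanName);
    },
  }),
};

libraryOperation(); // no SDK registered yet: nothing is recorded
const beforeRegistration = finishedSpans.length;

registerTracer(recordingTracer);
libraryOperation(); // same library code, now recorded
const afterRegistration = finishedSpans.length;

console.log(beforeRegistration, afterRegistration); // prints "0 1"
```

The library function never changes; only the application decides whether an implementation is present, which is the property that makes it safe for libraries to depend on the API alone.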

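SDK Layer

The SDK is the concrete implementation of the API, including span processors, metric aggregation, and exporters. Application developers configure the SDK once, at the application entry point.

Collector Layer

The Collector is a standalone service that receives, processes, and exports telemetry. It decouples applications from backends and supports batching, retries, and exporting to multiple destinations.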
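
As a rough mental model of that receive, process, export flow, here is a minimal in-memory sketch. TelemetryItem, MiniPipeline, and both processors are hypothetical names for illustration; the real Collector is a separate Go service configured in YAML and adds retries, queues, and backpressure that this sketch leaves out.

```typescript
// Hypothetical telemetry record; real Collector data follows the OTLP model.
interface TelemetryItem {
  signal: 'trace' | 'metric' | 'log';
  body: string;
  attributes: Record<string, string>;
}

type ItemProcessor = (items: TelemetryItem[]) => TelemetryItem[];
type ItemExporter = (batch: TelemetryItem[]) => void;

// Processor: drop health-check noise.
const dropHealthChecks: ItemProcessor = (items) =>
  items.filter((item) => item.attributes['http.target'] !== '/health');

// Processor: stamp every item with an environment attribute.
const addEnvironment: ItemProcessor = (items) =>
  items.map((item) => ({
    ...item,
    attributes: { ...item.attributes, environment: 'production' },
  }));

class MiniPipeline {
  private buffer: TelemetryItem[] = [];
  private readonly processors: ItemProcessor[];
  private readonly exporters: ItemExporter[];
  private readonly batchSize: number;

  constructor(processors: ItemProcessor[], exporters: ItemExporter[], batchSize: number) {
    this.processors = processors;
    this.exporters = exporters;
    this.batchSize = batchSize;
  }

  // "Receiver" side: accept items, run processors in order, buffer the result.
  receive(items: TelemetryItem[]): void {
    const processed = this.processors.reduce((acc, step) => step(acc), items);
    this.buffer.push(...processed);
    while (this.buffer.length >= this.batchSize) {
      this.exportBatch(this.buffer.splice(0, this.batchSize));
    }
  }

  // Flush whatever is left, for example on shutdown.
  flush(): void {
    if (this.buffer.length > 0) {
      this.exportBatch(this.buffer.splice(0, this.buffer.length));
    }
  }

  // Fan out: every exporter receives every batch.
  private exportBatch(batch: TelemetryItem[]): void {
    for (const exporter of this.exporters) exporter(batch);
  }
}

const backendA: TelemetryItem[][] = [];
const backendB: TelemetryItem[][] = [];

const pipeline = new MiniPipeline(
  [dropHealthChecks, addEnvironment],
  [(batch) => backendA.push(batch), (batch) => backendB.push(batch)],
  2,
);

pipeline.receive([
  { signal: 'trace', body: 'GET /api/orders', attributes: { 'http.target': '/api/orders' } },
  { signal: 'trace', body: 'GET /health', attributes: { 'http.target': '/health' } },
  { signal: 'log', body: 'order created', attributes: {} },
  { signal: 'metric', body: 'orders.processed.total=1', attributes: {} },
]);
pipeline.flush();

console.log(backendA.map((batch) => batch.length)); // batch sizes: 2, then 1
```

The application only ever talks to the pipeline; adding a second backend is a one-line change on the exporter list, which is the decoupling the Collector provides at service scale.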

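The Three Signals

Distributed Tracing (Traces)

A trace records the full path of a request through a distributed system. A trace consists of multiple spans, linked by parent-child relationships into a tree.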

Trace: [trace_id: abc123]
|
+-- Span A: API Gateway (root span, 250ms)
|   attributes: http.method=GET, http.url=/api/orders
|   +-- Span B: Order Service (200ms)
|   |   +-- Span C: DB Query (45ms, db.system=postgresql)
|   |   +-- Span D: Cache Lookup (3ms, db.system=redis)
|   +-- Span E: Payment Service (35ms, status=ERROR)
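
Backends receive spans as a flat stream and rebuild the tree from parent IDs. The sketch below shows one way that reconstruction could work, using the five spans from the trace above. SpanRecord, groupByParent, and renderTrace are hypothetical names, and the short span IDs are placeholders rather than real 16-hex-character IDs.

```typescript
interface SpanRecord {
  spanId: string;
  parentSpanId?: string; // undefined marks the root span
  label: string;
  durationMs: number;
  failed?: boolean;
}

// The spans from the trace above, in arbitrary arrival order.
const spanRecords: SpanRecord[] = [
  { spanId: 'c', parentSpanId: 'b', label: 'DB Query', durationMs: 45 },
  { spanId: 'a', label: 'API Gateway', durationMs: 250 },
  { spanId: 'b', parentSpanId: 'a', label: 'Order Service', durationMs: 200 },
  { spanId: 'd', parentSpanId: 'b', label: 'Cache Lookup', durationMs: 3 },
  { spanId: 'e', parentSpanId: 'a', label: 'Payment Service', durationMs: 35, failed: true },
];

// Index children by parent span ID.
function groupByParent(records: SpanRecord[]): Map<string, SpanRecord[]> {
  const children = new Map<string, SpanRecord[]>();
  for (const record of records) {
    if (record.parentSpanId === undefined) continue;
    const siblings = children.get(record.parentSpanId) ?? [];
    siblings.push(record);
    children.set(record.parentSpanId, siblings);
  }
  return children;
}

// Depth-first walk from the root, indenting two spaces per level.
function renderTrace(records: SpanRecord[]): string[] {
  const children = groupByParent(records);
  const root = records.find((record) => record.parentSpanId === undefined);
  if (root === undefined) return [];
  const lines: string[] = [];
  const visit = (record: SpanRecord, depth: number): void => {
    const marker = record.failed ? ' [ERROR]' : '';
    lines.push(`${'  '.repeat(depth)}${record.label} (${record.durationMs}ms)${marker}`);
    for (const child of children.get(record.spanId) ?? []) visit(child, depth + 1);
  };
  visit(root, 0);
  return lines;
}

const renderedTrace = renderTrace(spanRecords);
console.log(renderedTrace.join('\n'));
```

Running it prints the gateway span at the top, the order service with its two children beneath it, and the failed payment span last, matching the diagram.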

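Metrics

OTel metrics define four main instrument kinds: Counter (monotonically increasing), Histogram (distribution of values), Gauge (point-in-time value), and UpDownCounter (can go up or down).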
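
The difference between the four kinds is easiest to see in what each one keeps. The classes below are hypothetical in-memory stand-ins, not SDK types; the real SDK also tracks attributes and aggregation temporality, and the bucket boundaries here are an arbitrary assumption.

```typescript
// Counter: only goes up. Negative values are ignored in this sketch.
class MiniCounter {
  private total = 0;
  add(value: number): void {
    if (value < 0) return;
    this.total += value;
  }
  read(): number {
    return this.total;
  }
}

// UpDownCounter: a running sum that may rise or fall.
class MiniUpDownCounter {
  private total = 0;
  add(value: number): void {
    this.total += value;
  }
  read(): number {
    return this.total;
  }
}

// Gauge: remembers only the most recent measurement.
class MiniGauge {
  private latest: number | undefined;
  record(value: number): void {
    this.latest = value;
  }
  read(): number | undefined {
    return this.latest;
  }
}

// Histogram: counts measurements per bucket and keeps their sum.
class MiniHistogram {
  private readonly boundaries: number[];
  private readonly counts: number[];
  private sum = 0;

  constructor(boundaries: number[]) {
    this.boundaries = boundaries;
    this.counts = new Array<number>(boundaries.length + 1).fill(0);
  }

  record(value: number): void {
    // Bucket i holds values <= boundaries[i]; the last bucket is overflow.
    let index = this.boundaries.findIndex((upper) => value <= upper);
    if (index === -1) index = this.boundaries.length;
    this.counts[index] = (this.counts[index] ?? 0) + 1;
    this.sum += value;
  }

  read(): { counts: number[]; sum: number } {
    return { counts: [...this.counts], sum: this.sum };
  }
}

const requestCount = new MiniCounter();
requestCount.add(1);
requestCount.add(1);
requestCount.add(-5); // ignored: counters are monotonic

const inFlight = new MiniUpDownCounter();
inFlight.add(3);
inFlight.add(-2);

const queueDepth = new MiniGauge();
queueDepth.record(61);
queueDepth.record(58);

const latencyMs = new MiniHistogram([10, 100, 1000]);
for (const sample of [3, 45, 245, 5000]) latencyMs.record(sample);

console.log(requestCount.read(), inFlight.read(), queueDepth.read(), latencyMs.read());
```

Here the counter reads 2, the up-down counter 1, the gauge 58, and the histogram holds one sample in each of its four buckets with a sum of 5293.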

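Logs

OTel logs integrate with existing logging frameworks (Log4j, SLF4J, Python logging) through the Logs Bridge API, and attach trace context to log records so logs can be correlated with traces.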

Installation and Setup

Node.js

npm install @opentelemetry/api @opentelemetry/sdk-node \
  @opentelemetry/sdk-metrics \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/exporter-metrics-otlp-http

// tracing.ts - Initialize OpenTelemetry (load this before the rest of the app)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from
  '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from
  '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from
  '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from
  '@opentelemetry/sdk-metrics';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://localhost:4318/v1/traces',
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: 'http://localhost:4318/v1/metrics',
    }),
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'my-node-service',
});

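sdk.start();
// Flush pending telemetry, then exit; a SIGTERM handler replaces the default exit.
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});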

Python

pip install opentelemetry-distro opentelemetry-exporter-otlp
# Install instrumentation packages for the libraries found in your environment
opentelemetry-bootstrap -a install

# Auto-instrument - no code changes needed:
opentelemetry-instrument \
  --service_name my-python-service \
  --traces_exporter otlp \
  --metrics_exporter otlp \
  --exporter_otlp_endpoint http://localhost:4317 \
  python app.py

Go

go get go.opentelemetry.io/otel \
  go.opentelemetry.io/otel/sdk \
  go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp

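// main.go
package main

import (
  "context"

  "go.opentelemetry.io/otel"
  "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
  "go.opentelemetry.io/otel/sdk/resource"
  sdktrace "go.opentelemetry.io/otel/sdk/trace"
  // Assumption: use the semconv version that matches your SDK release.
  semconv "go.opentelemetry.io/otel/semconv/v1.26.0"
)

// initTracer wires an OTLP/HTTP exporter into a batching TracerProvider
// and returns the provider's Shutdown function to call at exit.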
func initTracer() (func(context.Context) error, error) {
  exporter, err := otlptracehttp.New(
    context.Background(),
    otlptracehttp.WithEndpoint("localhost:4318"),
    otlptracehttp.WithInsecure(),
  )
  if err != nil { return nil, err }
  tp := sdktrace.NewTracerProvider(
    sdktrace.WithBatcher(exporter),
    sdktrace.WithResource(resource.NewWithAttributes(
      semconv.SchemaURL,
      semconv.ServiceNameKey.String("my-go-service"),
    )),
  )
  otel.SetTracerProvider(tp)
  return tp.Shutdown, nil
}

Java

# Download the Java agent and run with your app
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=my-java-service \
  -Dotel.exporter.otlp.endpoint=http://localhost:4317 \
  -jar myapp.jar

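Auto-Instrumentation

Auto-instrumentation intercepts calls to common libraries through patching or agent mechanisms, creating spans and propagating context without changes to business code. In Go, where runtime patching is not available, the same coverage usually comes from wrapping handlers and clients with instrumentation libraries. Commonly supported libraries by language: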

  • Node.js: Express, Fastify, HTTP, gRPC, pg, mysql2, Redis, MongoDB, AWS SDK
  • Python: Flask, Django, FastAPI, requests, psycopg2, SQLAlchemy, Redis, Celery
  • Go: net/http, gRPC, database/sql, Gin, Echo
  • Java: Spring Boot, Servlet, JDBC, Hibernate, Kafka, gRPC, OkHttp
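
The patching mechanism mentioned above can be illustrated without any OTel packages. In this sketch, fakeDbClient stands in for a third-party library, and instrument and recordedCalls are hypothetical helpers; real instrumentation packages create spans rather than pushing to an array, and they handle async calls, which this synchronous sketch does not.

```typescript
interface PatchedCall {
  operation: string;
  durationMs: number;
  ok: boolean;
}

const recordedCalls: PatchedCall[] = [];

// Hypothetical "library" the application already uses.
const fakeDbClient = {
  query(sql: string): string[] {
    if (sql.includes('DROP')) throw new Error('forbidden statement');
    return ['row-1', 'row-2'];
  },
};

// Wrap a function so every call is timed and its outcome recorded.
function instrument<A extends unknown[], R>(
  operation: string,
  original: (...args: A) => R,
): (...args: A) => R {
  return (...args: A): R => {
    const startedAt = Date.now();
    let ok = true;
    try {
      return original(...args);
    } catch (error) {
      ok = false;
      throw error; // the caller still sees the original error
    } finally {
      recordedCalls.push({ operation, durationMs: Date.now() - startedAt, ok });
    }
  };
}

// "Patching": swap the method in place. Callers keep using fakeDbClient.query.
fakeDbClient.query = instrument('db.query', fakeDbClient.query.bind(fakeDbClient));

const rows = fakeDbClient.query('SELECT * FROM orders');
let failureSeen = false;
try {
  fakeDbClient.query('DROP TABLE orders');
} catch {
  failureSeen = true;
}

console.log(rows.length, failureSeen, recordedCalls.map((call) => call.ok));
```

Application code calls the same method as before, yet both the successful and the failing call end up recorded. That is why the order of loading matters in practice: the patch has to be applied before the application grabs a reference to the original function.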

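Manual Instrumentation

Manual instrumentation gives you full control over telemetry and is the right tool for business-specific spans, custom attributes, and metrics.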

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service', '1.0.0');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId);
      span.addEvent('order.validation.started');
      const order = await validateOrder(orderId);
      span.addEvent('order.validation.completed', {
        'order.items_count': order.items.length,
      });
      // Nested span for payment; end it even if the charge fails
      await tracer.startActiveSpan('processPayment',
        async (paymentSpan) => {
          try {
            paymentSpan.setAttribute(
              'payment.method', order.paymentMethod);
            await chargePayment(order);
          } finally {
            paymentSpan.end();
          }
        });
      span.setStatus({ code: SpanStatusCode.OK });
      return order;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

Custom Metrics

import { metrics } from '@opentelemetry/api';
const meter = metrics.getMeter('order-service', '1.0.0');

// Counter
const orderCounter = meter.createCounter('orders.processed.total',
  { description: 'Total orders processed', unit: 'orders' });

// Histogram
const durationHist = meter.createHistogram(
  'orders.processing.duration',
  { description: 'Processing time', unit: 'ms' });

// Observable Gauge
const activeGauge = meter.createObservableGauge(
  'orders.active.count',
  { description: 'Active orders' });
activeGauge.addCallback((r) => r.observe(getActiveCount()));

orderCounter.add(1, { 'order.type': 'standard' });
durationHist.record(245, { 'order.type': 'standard' });

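Context Propagation

Context propagation links spans from different services into one trace. W3C Trace Context carries the trace ID, parent span ID, and sampling flag in the traceparent header.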

// W3C Trace Context header:
// traceparent: 00-<trace-id>-<parent-span-id>-<flags>

import { context, propagation, trace } from '@opentelemetry/api';
import type { IncomingMessage } from 'http';

const tracer = trace.getTracer('order-service', '1.0.0');

// Inject context into outgoing request
function makeRequest(url: string) {
  const headers: Record<string, string> = {};
  propagation.inject(context.active(), headers);
  return fetch(url, { headers });
}

// Extract context from incoming request.
// The default getter reads plain objects such as Node's req.headers;
// a Fetch API Headers instance would need a custom TextMapGetter.
function handleRequest(req: IncomingMessage) {
  const ctx = propagation.extract(
    context.active(), req.headers);
  return context.with(ctx, () => {
    return tracer.startActiveSpan('handle', (span) => {
      // child of caller span
      span.end();
    });
  });
}
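
To show what the propagator actually reads and writes, here is a small sketch that parses and formats a traceparent value by hand. parseTraceParent and formatTraceParent are hypothetical helpers, not part of @opentelemetry/api, and the sketch only handles the four-field layout with the sampled flag; the sample header is the one used in the W3C Trace Context specification.

```typescript
interface TraceParent {
  version: string;
  traceId: string; // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;
}

const TRACEPARENT_PATTERN =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return undefined;
  const [, version = '', traceId = '', parentSpanId = '', flags = ''] = match;
  // Version "ff" and all-zero IDs are invalid.
  if (version === 'ff') return undefined;
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return undefined;
  // Bit 0 of the flags byte is the "sampled" flag.
  const sampled = (parseInt(flags, 16) & 0x01) === 0x01;
  return { version, traceId, parentSpanId, sampled };
}

// Only the sampled bit is written back; other flag bits are dropped here.
function formatTraceParent(value: TraceParent): string {
  const flags = value.sampled ? '01' : '00';
  return `${value.version}-${value.traceId}-${value.parentSpanId}-${flags}`;
}

const exampleHeader = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const parsedHeader = parseTraceParent(exampleHeader);

console.log(parsedHeader);
console.log(parseTraceParent('not-a-header')); // undefined
```

A downstream service keeps the trace ID, replaces the parent span ID with the ID of its own span, and forwards the result, which is how spans from separate processes end up in one tree.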

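Span Attributes and Events

Attributes are key-value metadata; events are timestamped records within a span. The OTel semantic conventions standardize common attribute names.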

// Semantic Conventions examples:
// HTTP: http.request.method, http.response.status_code, url.full
// DB:   db.system.name, db.query.text, db.operation.name
//       (older releases used db.system and db.statement)
// RPC:  rpc.system, rpc.service, rpc.method

span.setAttribute('http.request.method', 'POST');
span.setAttribute('http.response.status_code', 200);
span.setAttribute('db.system.name', 'postgresql');
span.setAttribute('db.query.text', 'SELECT * FROM orders WHERE id=?');

span.addEvent('cache.miss', {
  'cache.key': 'user:1234',
  'cache.backend': 'redis',
});

span.recordException(new Error('Connection timeout'));

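Exporters

Exporters send telemetry to a backend. OTLP is OpenTelemetry's native protocol and is accepted by most major backends.

  • OTLP (gRPC / HTTP): the recommended standard protocol, covering all three signals
  • Jaeger: recent Jaeger versions ingest OTLP directly; the dedicated Jaeger exporters (Thrift or gRPC) are deprecated in the OTel SDKs
  • Zipkin: for Zipkin-compatible backends
  • Prometheus: exposes a /metrics endpoint for scraping
  • Console/Debug: for development and debugging

Collector Configuration

The Collector is configured in YAML that defines receivers, processors, exporters, and pipelines. The following is a fuller example to adapt for production; it keeps TLS disabled and a verbose debug exporter for brevity: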

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc: { endpoint: 0.0.0.0:4317 }
      http: { endpoint: 0.0.0.0:4318 }
  prometheus:
    config:
      scrape_configs:
        - job_name: app-metrics
          scrape_interval: 15s
          static_configs:
            - targets: ['app:9090']
  jaeger:
    protocols:
      thrift_http: { endpoint: 0.0.0.0:14268 }

processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert
  filter:
    error_mode: ignore
    traces:
      span:
        - 'attributes["http.target"] == "/health"'
  memory_limiter:
    check_interval: 1s
    limit_mib: 2048
    spike_limit_mib: 512

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  # Mimir ingests OTLP over HTTP, so use the otlphttp exporter
  otlphttp/mimir:
    endpoint: http://mimir:9009/otlp
  debug:
    verbosity: detailed

service:
  pipelines:
    # memory_limiter runs first and batch runs last
    traces:
      receivers: [otlp, jaeger]
      processors: [memory_limiter, filter, resource, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, resource, batch]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [debug]

Running the Collector

# Docker (the contrib image reads /etc/otelcol-contrib/config.yaml by default)
docker run -d --name otel-collector \
  -p 4317:4317 -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml" \
  otel/opentelemetry-collector-contrib:latest

# Docker Compose
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otelcol/config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otelcol/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # Collector metrics

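Sampling Strategies

In high-traffic systems, collecting every trace is usually impractical. Sampling balances observability against cost.

Head Sampling (SDK Side)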

import { NodeSDK } from '@opentelemetry/sdk-node';
import {
  TraceIdRatioBasedSampler,
  ParentBasedSampler,
} from '@opentelemetry/sdk-trace-base';

// ParentBased: respect parent decision, sample 10% of roots
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
});

const sdk = new NodeSDK({ sampler, /* ... */ });
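
The key property of ratio-based head sampling is that the decision is a pure function of the trace ID, so every service reaches the same verdict for the same trace. The sketch below illustrates that idea with a simplified rule (compare the last 32 bits of the trace ID with a threshold). It is an assumption for illustration, not the algorithm the SDK samplers use, and ratioSample and parentBasedSample are hypothetical names.

```typescript
interface ParentDecision {
  sampled: boolean;
}

// Deterministic ratio decision: the same trace ID always gives the same answer.
function ratioSample(traceId: string, ratio: number): boolean {
  if (ratio >= 1) return true;
  if (ratio <= 0) return false;
  // Interpret the last 8 hex characters as an unsigned 32-bit number.
  const value = parseInt(traceId.slice(-8), 16);
  return value < ratio * 0x100000000;
}

// Parent-based wrapper: follow the caller's decision when there is one,
// and fall back to the ratio rule only for root spans.
function parentBasedSample(
  traceId: string,
  ratio: number,
  parent?: ParentDecision,
): boolean {
  if (parent !== undefined) return parent.sampled;
  return ratioSample(traceId, ratio);
}

const lowTraceId = '0'.repeat(24) + '00000001';
const highTraceId = 'f'.repeat(32);

console.log(parentBasedSample(lowTraceId, 0.1)); // true: value is below the 10% threshold
console.log(parentBasedSample(highTraceId, 0.1)); // false: value is above the threshold
console.log(parentBasedSample(highTraceId, 0.1, { sampled: true })); // true: parent decided
```

Respecting the parent's decision is what keeps traces whole: without it, each service would sample independently and most traces would arrive with missing spans.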

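Tail Sampling (Collector Side)

Tail sampling makes the decision in the Collector after the whole trace has been seen, which makes it a good fit for keeping all error and high-latency traces. A trace is kept if any policy matches, and all spans of a trace must reach the same Collector instance for the decision to be complete.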

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 2000 }
      - name: probabilistic-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 5 }
      - name: string-attr-policy
        type: string_attribute
        string_attribute:
          key: priority
          values: [high, critical]
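
The four policies above combine as a logical OR. The following sketch mirrors that evaluation in plain TypeScript under a few stated assumptions: FinishedSpan, TracePolicy, and the helper names are hypothetical, the latency check uses the longest single span as a stand-in for trace duration, and the random source is injected so the example is repeatable.

```typescript
interface FinishedSpan {
  statusCode: 'UNSET' | 'OK' | 'ERROR';
  durationMs: number;
  attributes: Record<string, string>;
}

// A policy looks at all spans of one completed trace and votes to keep it or not.
type TracePolicy = (spans: FinishedSpan[]) => boolean;

const hasError: TracePolicy = (spans) =>
  spans.some((span) => span.statusCode === 'ERROR');

const slowerThan = (thresholdMs: number): TracePolicy => (spans) =>
  spans.some((span) => span.durationMs > thresholdMs);

const attributeIn = (key: string, allowed: string[]): TracePolicy => (spans) =>
  spans.some((span) => {
    const value = span.attributes[key];
    return value !== undefined && allowed.includes(value);
  });

const probabilistic = (percentage: number, random: () => number): TracePolicy =>
  () => random() * 100 < percentage;

// Keep the trace if any policy matches.
function shouldKeep(spans: FinishedSpan[], policies: TracePolicy[]): boolean {
  return policies.some((policy) => policy(spans));
}

const policies: TracePolicy[] = [
  hasError,
  slowerThan(2000),
  probabilistic(5, () => 0.5), // fixed "random" value: never selects
  attributeIn('priority', ['high', 'critical']),
];

const routineTrace: FinishedSpan[] = [
  { statusCode: 'OK', durationMs: 120, attributes: {} },
];
const failedTrace: FinishedSpan[] = [
  { statusCode: 'OK', durationMs: 250, attributes: {} },
  { statusCode: 'ERROR', durationMs: 35, attributes: {} },
];
const slowTrace: FinishedSpan[] = [
  { statusCode: 'OK', durationMs: 3200, attributes: {} },
];
const priorityTrace: FinishedSpan[] = [
  { statusCode: 'OK', durationMs: 80, attributes: { priority: 'high' } },
];

console.log(
  shouldKeep(routineTrace, policies),
  shouldKeep(failedTrace, policies),
  shouldKeep(slowTrace, policies),
  shouldKeep(priorityTrace, policies),
); // prints "false true true true"
```

Because the decision needs the full span list, the Collector has to hold spans in memory for decision_wait before it can run the policies, which is the main cost of tail sampling compared with head sampling.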

Integrating Observability Backends

Grafana Stack (Tempo + Mimir + Loki)

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  otlphttp/mimir:
    endpoint: http://mimir:9009/otlp
  otlphttp/loki:
    endpoint: http://loki:3100/otlp

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/mimir]
    logs:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlphttp/loki]

Datadog

exporters:
  datadog:
    api:
      key: "${env:DD_API_KEY}"
      site: datadoghq.com
    traces:
      span_name_as_resource_name: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog]

New Relic

exporters:
  otlp/newrelic:
    endpoint: otlp.nr-data.net:4317
    headers:
      api-key: "${env:NEW_RELIC_LICENSE_KEY}"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/newrelic]

Kubernetes Deployment

The OpenTelemetry Operator is the recommended way to manage OTel on Kubernetes. It provides CRDs for managing Collectors and injecting auto-instrumentation.

# Install cert-manager + OTel Operator
kubectl apply -f https://github.com/cert-manager/cert-manager/releases/download/v1.14.0/cert-manager.yaml

helm repo add open-telemetry \
  https://open-telemetry.github.io/opentelemetry-helm-charts
helm install otel-operator \
  open-telemetry/opentelemetry-operator \
  --namespace otel-system --create-namespace

Collector CRD (DaemonSet)

apiVersion: opentelemetry.io/v1beta1
kind: OpenTelemetryCollector
metadata:
  name: otel
  namespace: otel-system
spec:
  mode: daemonset
  config:
    receivers:
      otlp:
        protocols:
          grpc: { endpoint: 0.0.0.0:4317 }
          http: { endpoint: 0.0.0.0:4318 }
    processors:
      batch: { timeout: 5s }
      k8sattributes:
        extract:
          metadata:
            - k8s.pod.name
            - k8s.namespace.name
            - k8s.deployment.name
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      otlp:
        endpoint: tempo.observability:4317
        tls: { insecure: true }
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, k8sattributes, batch]
          exporters: [otlp]

Auto-Instrumentation Injection

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: otel-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector.otel-system:4317
  propagators: [tracecontext, baggage]
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"
  nodejs:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:latest
  python:
    image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:latest
---
# Annotate Deployment for auto-instrumentation
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-node-app
spec:
  template:
    metadata:
      annotations:
        instrumentation.opentelemetry.io/inject-nodejs: "true"
    spec:
      containers:
        - name: app
          image: my-node-app:latest

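Best Practices

1. Follow Semantic Conventions

Use standard attribute names (for example http.request.method rather than method) to keep telemetry consistent across services and let backends parse it automatically.

2. Set Resource Attributes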

// @opentelemetry/resources 2.x; on 1.x use `new Resource({ ... })` instead
import { resourceFromAttributes } from '@opentelemetry/resources';
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from '@opentelemetry/semantic-conventions';

const resource = resourceFromAttributes({
  [ATTR_SERVICE_NAME]: 'order-service',
  [ATTR_SERVICE_VERSION]: '2.1.0',
  // Not yet in the stable entry point, so the key is spelled out
  'deployment.environment.name': 'production',
});

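3. Control Span Granularity

Create spans for network calls, database operations, and key business operations. Avoid creating spans inside tight loops.

4. Handle the Span Lifecycle Correctly

Always end spans in a finally block. For asynchronous work, use startActiveSpan so the context is preserved.

5. Sample in Production

Avoid AlwaysOn for high-traffic services. Start with ParentBased + TraceIdRatio(0.1), and pair it with tail sampling to retain error traces.

6. Use a Collector

The Collector provides buffering, retries, batching, and multi-destination export, which reduces network connections and resource usage on the application side.

7. Correlate the Three Signals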

// Inject trace context into logs (Node.js + Winston)
import { trace, context } from '@opentelemetry/api';
import winston from 'winston';

const logger = winston.createLogger({
  format: winston.format.combine(
    winston.format((info) => {
      const span = trace.getSpan(context.active());
      if (span) {
        const ctx = span.spanContext();
        info.trace_id = ctx.traceId;
        info.span_id = ctx.spanId;
      }
      return info;
    })(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
});
// Output: {"message":"Order processed",
//  "trace_id":"abc...","span_id":"def..."}

8. Limit Collector Resources

Configure memory_limiter to guard against OOM, set resources.limits in Kubernetes, and monitor the Collector's own metrics.
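
How limit_mib and spike_limit_mib interact is easy to misread, so here is a small sketch of the decision as I understand it from the processor's documentation: the soft limit is limit_mib minus spike_limit_mib, data is refused above the soft limit, and a garbage collection is forced above the hard limit. The function name, the state labels, and the exact comparison operators are assumptions for illustration.

```typescript
interface MemoryLimiterConfig {
  limitMiB: number; // hard limit (limit_mib)
  spikeLimitMiB: number; // expected spike between checks (spike_limit_mib)
}

type LimiterState = 'accept' | 'refuse' | 'refuse_and_gc';

function evaluateMemory(usedMiB: number, config: MemoryLimiterConfig): LimiterState {
  const softLimitMiB = config.limitMiB - config.spikeLimitMiB;
  if (usedMiB >= config.limitMiB) return 'refuse_and_gc';
  if (usedMiB >= softLimitMiB) return 'refuse';
  return 'accept';
}

// Values from the Collector configuration earlier in this guide.
const limiterConfig: MemoryLimiterConfig = { limitMiB: 2048, spikeLimitMiB: 512 };

console.log(evaluateMemory(1000, limiterConfig)); // "accept"
console.log(evaluateMemory(1600, limiterConfig)); // "refuse" (soft limit is 1536 MiB)
console.log(evaluateMemory(2100, limiterConfig)); // "refuse_and_gc"
```

With those settings the Collector starts pushing back at 1536 MiB, so a container memory limit set only slightly above 2048 MiB leaves little headroom; sizing resources.limits with that soft limit in mind avoids surprises.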

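Summary

OpenTelemetry is becoming the de facto standard for observability. Its vendor-neutral design lets you instrument once and export to any backend; shared context across the three signals makes debugging distributed systems far more tractable; and the Collector's flexible pipelines simplify data processing. Start with auto-instrumentation for quick value, then add manual instrumentation, tune sampling, and deploy a Collector on the way to a production-ready, full-stack observability platform.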
