Request Tracing and Distributed Tracing: Building Observable Microservice Systems
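1. Distributed Tracing Overview
1.1 Why Tracing Is Needed
In a microservice architecture, a single request may pass through many cooperating services. Without tracing, this causes several problems:
- Hard to locate faults: when something fails, it is difficult to tell quickly which service is responsible
- Unclear performance bottlenecks: there is no view of latency across the whole request path
- Complex dependencies: the call relationships between services are hard to untangle
- Opaque call chains: the complete path of a request cannot be followed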
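1.2 Core Tracing Concepts
| Concept | Description |
|---|---|
| Trace | The complete path of a single request, identified by a shared trace ID |
| Span | One unit of work within a trace |
| Annotation | An event marked at a point in time |
| Baggage | Context data carried along with the request |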
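To make these concepts concrete, here is a small JDK-only sketch. The `TraceContext` class and its methods are hypothetical and written for illustration; they are not the API of Sleuth, Brave, or OpenTelemetry. It shows that a child span keeps the trace ID and baggage of its parent while getting its own span ID, and how such a context could be encoded in the W3C `traceparent` header format.

```java
import java.security.SecureRandom;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical, library-free model of Trace, Span, and Baggage. */
final class TraceContext {
    private static final SecureRandom RANDOM = new SecureRandom();

    final String traceId;               // shared by every span of one request
    final String spanId;                // identifies this unit of work
    final String parentSpanId;          // null for the root span
    final boolean sampled;              // whether this trace is recorded
    final Map<String, String> baggage;  // context data carried with the request

    private TraceContext(String traceId, String spanId, String parentSpanId,
                         boolean sampled, Map<String, String> baggage) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
        this.sampled = sampled;
        this.baggage = Collections.unmodifiableMap(new HashMap<>(baggage));
    }

    /** Starts a new trace: fresh trace ID, no parent span. */
    static TraceContext newRoot(boolean sampled) {
        return new TraceContext(randomHex(16), randomHex(8), null, sampled,
                new HashMap<>());
    }

    /** Creates a child span: same trace ID and baggage, new span ID. */
    TraceContext newChild() {
        return new TraceContext(traceId, randomHex(8), spanId, sampled, baggage);
    }

    /** Returns a copy of this context with one more baggage entry. */
    TraceContext withBaggage(String key, String value) {
        Map<String, String> copy = new HashMap<>(baggage);
        copy.put(key, value);
        return new TraceContext(traceId, spanId, parentSpanId, sampled, copy);
    }

    /** Encodes the context as a W3C traceparent value: version-traceId-spanId-flags. */
    String toTraceparent() {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    /** Returns 2 * bytes lowercase hex characters of random data. */
    private static String randomHex(int bytes) {
        byte[] buf = new byte[bytes];
        RANDOM.nextBytes(buf);
        StringBuilder sb = new StringBuilder(bytes * 2);
        for (byte b : buf) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        TraceContext root = TraceContext.newRoot(true).withBaggage("user-id", "42");
        TraceContext child = root.newChild();
        System.out.println("root : " + root.toTraceparent());
        System.out.println("child: " + child.toTraceparent());
    }
}
```

In a real system the tracing library creates and propagates this context; the sketch only illustrates which fields stay constant across a trace and which change per span.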
1.3 Tracing Architecture
```text
                  Distributed tracing architecture

┌─────────┐     ┌───────────┐     ┌───────────┐     ┌───────────┐
│ Client  │────▶│ Service A │────▶│ Service B │────▶│ Service C │
└─────────┘     └───────────┘     └───────────┘     └───────────┘
     │                │                 │                 │
     ▼                ▼                 ▼                 ▼
┌──────────────────────────────────────────────────────────────────┐
│                          Trace Context                           │
│ traceId: abc123 | spanId: 1 | parentSpanId: null | sampled: true │
└──────────────────────────────────────────────────────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │    Collector    │
                        │ (Zipkin/Jaeger) │
                        └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │     Storage     │
                        │   (ES/MySQL)    │
                        └─────────────────┘
```
2. Spring Cloud Sleuth Configuration
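2.1 Basic Dependencies
```xml
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-starter-sleuth</artifactId>
</dependency>

<!-- Optional: OpenTelemetry support. In Sleuth 3.x this comes from the
     experimental spring-cloud-sleuth-otel project; when using it, exclude
     spring-cloud-sleuth-brave from the starter above. -->
<dependency>
    <groupId>org.springframework.cloud</groupId>
    <artifactId>spring-cloud-sleuth-otel-autoconfigure</artifactId>
</dependency>
<dependency>
    <groupId>io.opentelemetry</groupId>
    <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
```
2.2 Sleuth Configuration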
```yaml
spring:
  application:
    name: user-service
  sleuth:
    sampler:
      probability: 1.0  # sampling probability, 0 to 1
      rate: 100         # maximum number of sampled traces per second
    propagation:
      type: B3
      w3c:
        enabled: true
    baggage:
      remote-fields:
        - user-id
        - request-id
      correlation-enabled: true
      header-names:
        user-id: X-User-Id
    instrument:
      web:
        enabled: true
      reactor:
        enabled: true
      mongo:
        enabled: true
      redis:
        enabled: true
      logs:
        enabled: true
```
2.3 Creating Spans Manually
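```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.Tracer;
import org.springframework.stereotype.Service;

@Service
public class UserService {
    private static final Logger log = LoggerFactory.getLogger(UserService.class);

    private final Tracer tracer;
    private final UserRepository userRepository;

    public UserService(Tracer tracer, UserRepository userRepository) {
        this.tracer = tracer;
        this.userRepository = userRepository;
    }

    public User getUserById(Long id) {
        // Span for the whole lookup; it becomes a child of the current span, if there is one
        Span span = tracer.nextSpan().name("getUserById").start();
        // Sleuth 3.x: withSpan(...) puts the span in scope
        // (Brave's own Tracer names this method withSpanInScope)
        try (Tracer.SpanInScope inScope = tracer.withSpan(span)) {
            log.info("Getting user by id: {}", id);

            // Child span for the database call
            Span dbSpan = tracer.nextSpan().name("queryDatabase").start();
            try (Tracer.SpanInScope dbScope = tracer.withSpan(dbSpan)) {
                dbSpan.tag("db.system", "mysql");
                dbSpan.tag("db.statement", "SELECT * FROM users WHERE id = ?");
                return userRepository.findById(id).orElse(null);
            } finally {
                dbSpan.end();
            }
        } finally {
            span.end();
        }
    }
}
```
3. Jaeger Integration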
3.1 Jaeger Server Configuration
```yaml
version: '3.8'
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "16686:16686"    # UI
      - "6831:6831/udp"  # jaeger.thrift (compact)
      - "14250:14250"    # gRPC
      - "4317:4317"      # OTLP over gRPC (enabled by COLLECTOR_OTLP_ENABLED)
      - "4318:4318"      # OTLP over HTTP, used by the client configuration in 3.2
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=elasticsearch
      - ES_SERVER_URLS=http://elasticsearch:9200
    depends_on:
      - elasticsearch
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.17.0
    environment:
      - discovery.type=single-node
      - "ES_JAVA_OPTS=-Xms512m -Xmx512m"
    ports:
      - "9200:9200"
```
3.2 Integrating Jaeger with Spring Boot
```yaml
spring:
  application:
    name: user-service
  autoconfigure:
    exclude:
      - org.springframework.cloud.sleuth.autoconfig.SleuthReactorInstrumentationAutoConfiguration
management:
  # Spring Boot 3 reads OTLP tracing settings from management.otlp.tracing
  otlp:
    tracing:
      endpoint: http://localhost:4318/v1/traces
      headers:
        Authorization: Bearer your-token
  tracing:
    sampling:
      probability: 1.0
    propagation:
      type: w3c
    exclusions:
      - /actuator/**
      - /health
```
3.3 Custom Jaeger Configuration
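```java
// NOTE: Configurer, Propagation.B3 and TracingClientHttpRequestInterceptor(tracer)
// appear here as in the original snippet. The source does not say which Jaeger
// client library provides them, so adapt the names to the library in use.
@Configuration
public class JaegerConfig {

    @Bean
    public Configurer samplerConfigurer() {
        return builder -> builder
                .withLogSpans(true)
                .withCodec(Propagation.B3)
                .withSampler(new ProbabilisticSampler(0.5)); // sample 50% of traces
    }

    @Bean
    public RestTemplateCustomizer jaegerRestTemplateCustomizer(Tracer tracer) {
        return restTemplate -> {
            // Copy the existing interceptors, then append the tracing one
            List<ClientHttpRequestInterceptor> interceptors =
                    new ArrayList<>(restTemplate.getInterceptors());
            interceptors.add(new TracingClientHttpRequestInterceptor(tracer));
            restTemplate.setInterceptors(interceptors);
        };
    }
}
```
4. Zipkin Integration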
4.1 Zipkin Server Configuration
```yaml
# docker-compose.yml
version: '3.8'
services:
  zipkin:
    image: openzipkin/zipkin:latest
    ports:
      - "9411:9411"
    environment:
      - STORAGE_TYPE=elasticsearch
      - ES_HOSTS=http://elasticsearch:9200
      - RABBIT_URI=amqp://guest:guest@rabbit:5672
    depends_on:
      - elasticsearch  # the elasticsearch service is defined as in section 3.1
```
4.2 Integrating Zipkin with Spring Boot
```yaml
spring:
  application:
    name: user-service
  zipkin:
    base-url: http://localhost:9411
    sender:
      type: web  # HTTP sender; alternatives are rabbit and kafka
    locator:
      discovery:
        enabled: true  # discover the Zipkin server through Eureka
  sleuth:
    sampler:
      probability: 1.0  # sampling probability
```
4.3 Asynchronous Sending
```yaml
spring:
  zipkin:
    sender:
      type: rabbit
    rabbit:
      queue: zipkin
      connection-name: zipkin-sender
  rabbitmq:
    host: localhost
    port: 5672
    username: guest
    password: guest
management:
  metrics:
    export:
      zipkin:
        enabled: true
```
5. OpenTelemetry Integration
5.1 OpenTelemetry SDK Configuration
```yaml
spring:
  application:
    name: user-service
otel:
  exporter:
    otlp:
      endpoint: http://localhost:4317
      headers:
        api-key: your-api-key
  service:
    name: ${spring.application.name}
    version: 1.0.0
  traces:
    exporter: otlp
  metrics:
    exporter: otlp
  logs:
    exporter: otlp
  sampler:
    ratio: 1.0
    parent-based: true
```
5.2 Custom Span Configuration
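```java
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.Tracer;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.HandlerInterceptor;

// Written against the Sleuth 3.x Tracer/Span API. HandlerInterceptorAdapter is
// deprecated, so the interceptor implements HandlerInterceptor directly.
@Component
public class TracingInterceptor implements HandlerInterceptor {
    private static final String SPAN_ATTR = "currentSpan";
    private static final String SCOPE_ATTR = "currentSpanScope";

    private final Tracer tracer;

    public TracingInterceptor(Tracer tracer) {
        this.tracer = tracer;
    }

    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response,
                             Object handler) {
        Span span = tracer.nextSpan()
                .name(request.getMethod() + " " + request.getRequestURI())
                .tag("http.method", request.getMethod())
                .tag("http.url", request.getRequestURL().toString())
                .tag("http.host", request.getRemoteHost())
                .start();
        // Keep the span and its scope so afterCompletion closes exactly these objects
        request.setAttribute(SPAN_ATTR, span);
        request.setAttribute(SCOPE_ATTR, tracer.withSpan(span));
        return true;
    }

    @Override
    public void afterCompletion(HttpServletRequest request, HttpServletResponse response,
                                Object handler, Exception ex) {
        Span span = (Span) request.getAttribute(SPAN_ATTR);
        Tracer.SpanInScope scope = (Tracer.SpanInScope) request.getAttribute(SCOPE_ATTR);
        if (span == null) {
            return;
        }
        try {
            span.tag("http.status_code", String.valueOf(response.getStatus()));
            if (ex != null) {
                span.tag("error", "true");
                span.tag("error.message", String.valueOf(ex.getMessage()));
                span.error(ex); // record the exception on the span
            }
        } finally {
            if (scope != null) {
                scope.close(); // leave the scope opened in preHandle
            }
            span.end();
        }
    }
}
```
5.3 Database Tracing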
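```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.Tracer;
import org.springframework.jdbc.datasource.DelegatingDataSource;

// Wrap the real DataSource explicitly (for example in a @Bean method). As a
// @Component this class would compete with the DataSource it is meant to wrap.
public class TracingDataSourceDecorator extends DelegatingDataSource {
    private final Tracer tracer;

    public TracingDataSourceDecorator(DataSource delegate, Tracer tracer) {
        super(delegate);
        this.tracer = tracer;
    }

    @Override
    public Connection getConnection() throws SQLException {
        // This span only measures how long it takes to obtain a connection.
        // Per-statement spans would need a Connection wrapper, which is not shown.
        Span span = tracer.nextSpan().name("db.query").start();
        try (Tracer.SpanInScope inScope = tracer.withSpan(span)) {
            span.tag("db.system", "mysql");
            return super.getConnection();
        } catch (SQLException | RuntimeException e) {
            span.tag("error", "true");
            span.error(e);
            throw e;
        } finally {
            span.end();
        }
    }
}
```
6. Request Context Propagation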
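6.1 Context Propagation Configuration
```java
// NOTE: registerContextPropagator, TextMapPropagator.getDefault(), BaggageRegistry,
// BaggageHandler and MDCEntryToContextCarrier are reproduced from the original
// snippet. The source does not say which library provides them, so read this
// as pseudocode for the idea rather than a compilable class.
@Configuration
public class ContextPropagationConfig {

    @Autowired
    private BeanFactory beanFactory;

    @Bean
    public ContextRegistry contextRegistry() {
        ContextRegistry registry = ContextRegistry.getInstance();
        registry.registerContextPropagator(TextMapPropagator.getDefault());
        return registry;
    }

    @Bean
    public BaggageRegistry baggageRegistry() {
        // Copy every baggage entry into the logging MDC by default
        BaggageRegistry registry = BaggageRegistry.newBuilder()
                .addDefaultBaggageHandler((key, value) -> MDC.put(key, value))
                .build();
        registry.register(BaggageHandler.forEntry(
                Entry.of("user-id", new MDCEntryToContextCarrier())));
        return registry;
    }
}
```
6.2 MDC Integration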
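```java
import java.io.IOException;
import javax.servlet.FilterChain;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.slf4j.MDC;
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.Tracer;
import org.springframework.stereotype.Component;
import org.springframework.web.filter.OncePerRequestFilter;

@Component
public class MdcTracingFilter extends OncePerRequestFilter {
    private static final String TRACE_ID = "traceId";
    private static final String SPAN_ID = "spanId";

    private final Tracer tracer;

    public MdcTracingFilter(Tracer tracer) {
        this.tracer = tracer;
    }

    @Override
    protected void doFilterInternal(HttpServletRequest request, HttpServletResponse response,
                                    FilterChain chain) throws ServletException, IOException {
        Span currentSpan = tracer.currentSpan();
        if (currentSpan != null) {
            // Expose the IDs to the logging pattern, e.g. %X{traceId}
            MDC.put(TRACE_ID, currentSpan.context().traceId());
            MDC.put(SPAN_ID, currentSpan.context().spanId());
        }
        try {
            chain.doFilter(request, response);
        } finally {
            // Remove only the keys set here, so MDC entries owned by other filters survive
            MDC.remove(TRACE_ID);
            MDC.remove(SPAN_ID);
        }
    }
}
```
6.3 Passing Context Across Services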
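```java
import org.springframework.cloud.sleuth.Span;
import org.springframework.cloud.sleuth.Tracer;
import org.springframework.cloud.sleuth.propagation.Propagator;
import org.springframework.http.HttpEntity;
import org.springframework.http.HttpHeaders;
import org.springframework.http.HttpMethod;
import org.springframework.http.ResponseEntity;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

// Assumes Sleuth 3.x, which exposes a Propagator bean for manual injection.
// A RestTemplate bean instrumented by Sleuth already does this automatically.
@Service
public class UserServiceClient {
    private final RestTemplate restTemplate;
    private final Tracer tracer;
    private final Propagator propagator;

    public UserServiceClient(RestTemplate restTemplate, Tracer tracer, Propagator propagator) {
        this.restTemplate = restTemplate;
        this.tracer = tracer;
        this.propagator = propagator;
    }

    public User getUserById(Long id) {
        HttpHeaders headers = new HttpHeaders();

        // Inject the current span's context into the outgoing HTTP headers
        Span span = tracer.currentSpan();
        if (span != null) {
            propagator.inject(span.context(), headers, HttpHeaders::set);
        }

        HttpEntity<Void> entity = new HttpEntity<>(headers);
        ResponseEntity<User> response = restTemplate.exchange(
                "http://user-service/api/users/{id}",
                HttpMethod.GET,
                entity,
                User.class,
                id);
        return response.getBody();
    }
}
```
7. Trace Analysis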
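7.1 Slow Query Analysis
```java
// NOTE: SpanData and getChildSpans(...) are not defined in the original snippet;
// they stand for whatever span store and lookup the application provides.
@Service
public class SlowQueryAnalyzer {
    private static final Logger log = LoggerFactory.getLogger(SlowQueryAnalyzer.class);

    @Autowired
    private Tracer tracer;

    public void analyze() {
        Span currentSpan = tracer.currentSpan();
        if (currentSpan == null) {
            return;
        }

        // Fetch the child spans of the current span
        Collection<SpanData> childSpans = getChildSpans(currentSpan.context().spanId());

        // Keep spans slower than one second, slowest first
        List<SpanData> slowSpans = childSpans.stream()
                .filter(span -> span.durationMs() > 1000)
                .sorted(Comparator.comparing(SpanData::durationMs).reversed())
                .collect(Collectors.toList());

        if (!slowSpans.isEmpty()) {
            log.warn("Slow spans detected: {}", slowSpans);
        }
    }
}
```
7.2 Call Chain Analysis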
```java
// SpanRepository, SpanData, CallGraph, Node and Path are application types
// that the original snippet does not define.
@Service
public class TraceAnalyzer {

    @Autowired
    private SpanRepository spanRepository;

    public CallGraph buildCallGraph(String traceId) {
        List<SpanData> spans = spanRepository.findByTraceId(traceId);
        CallGraph graph = new CallGraph();

        for (SpanData span : spans) {
            // One node per span
            Node node = new Node(
                    span.getSpanId(),
                    span.getOperationName(),
                    span.getDurationMs());
            graph.addNode(node);

            // One edge from parent to child; the root span has no parent
            if (span.getParentSpanId() != null) {
                graph.addEdge(span.getParentSpanId(), span.getSpanId());
            }
        }
        return graph;
    }

    public List<Path> findCriticalPath(String traceId) {
        CallGraph graph = buildCallGraph(traceId);
        return graph.findLongestPath();
    }
}
```
7.3 Dependency Analysis
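```java
// SpanRepository, SpanData, ServiceDependencyGraph and extractPeerService(...)
// are application code that the original snippet does not define.
@Service
public class DependencyAnalyzer {
    private final SpanRepository spanRepository;

    public DependencyAnalyzer(SpanRepository spanRepository) {
        this.spanRepository = spanRepository;
    }

    public ServiceDependencyGraph buildDependencyGraph() {
        List<SpanData> allSpans = spanRepository.findAll();
        Map<String, Set<String>> dependencies = new HashMap<>();

        for (SpanData span : allSpans) {
            String service = span.getServiceName();
            // Tags starting with "peer." describe the remote side of a call
            span.getTags().forEach((key, value) -> {
                if (key.startsWith("peer.")) {
                    String peerService = extractPeerService(value);
                    if (peerService != null) {
                        dependencies.computeIfAbsent(service, k -> new HashSet<>())
                                .add(peerService);
                    }
                }
            });
        }
        return new ServiceDependencyGraph(dependencies);
    }
}
```
8. Alerting Configuration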
8.1 Error Rate Alerts
```yaml
# Prometheus alerting rules
groups:
  - name: tracing-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(spring_sleuth_spans{tag_error="true"}[5m])) by (service)
          /
          sum(rate(spring_sleuth_spans_count[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate in {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: SlowResponseTime
        expr: |
          histogram_quantile(0.95,
            sum(rate(spring_sleuth_spans_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow response time in {{ $labels.service }}"
          description: "95th percentile is {{ $value | humanizeDuration }}"
```
8.2 Latency Alerts
```yaml
      # Belongs under `rules:` of the tracing-alerts group from 8.1.
      # Fires when the 5-minute mean latency exceeds 1.5x the mean over the past hour.
      - alert: LatencyIncrease
        expr: |
          sum(rate(spring_sleuth_spans_duration_seconds_sum[5m])) by (service)
          /
          sum(rate(spring_sleuth_spans_duration_seconds_count[5m])) by (service)
          > 1.5 * avg_over_time(
            (
              sum(rate(spring_sleuth_spans_duration_seconds_sum[1h])) by (service)
              /
              sum(rate(spring_sleuth_spans_duration_seconds_count[1h])) by (service)
            )[1h:5m]
          )
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Latency increased in {{ $labels.service }}"
```
9. Grafana Dashboards
9.1 Tracing Panels
```json
{
  "title": "Request Trace Overview",
  "panels": [
    {
      "title": "Request Rate by Service",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(spring_sleuth_spans_count[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "Error Rate",
      "type": "graph",
      "targets": [
        {
          "expr": "sum(rate(spring_sleuth_spans{tag_error=\"true\"}[5m])) by (service)",
          "legendFormat": "{{ service }}"
        }
      ]
    },
    {
      "title": "P99 Latency",
      "type": "graph",
      "targets": [
        {
          "expr": "histogram_quantile(0.99, sum(rate(spring_sleuth_spans_duration_seconds_bucket[5m])) by (le, service))",
          "legendFormat": "{{ service }}"
        }
      ]
    }
  ]
}
```
10. Best Practices
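10.1 Sampling Strategies
| Strategy | When to use | Configuration |
|---|---|---|
| Full sampling | Development and debugging | probability: 1.0 |
| Probability sampling | Routine production traffic | probability: 0.1-0.5 |
| Head-based sampling | One sampling decision made at the request entry point | sampler: HeadBased |
| Adaptive sampling | Dynamic adjustment | Raise the sampling rate when errors occur |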
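As an illustration of probability sampling combined with a head-based decision, the sketch below derives the decision from the trace ID itself, so any service that sees the same trace ID reaches the same answer. The `ProbabilitySampler` class is hypothetical and uses only the JDK; it is not the sampler shipped with Sleuth or OpenTelemetry, and it assumes trace IDs are hex strings.

```java
import java.util.Random;

/** Hypothetical trace-ID-based probability sampler (illustration only). */
final class ProbabilitySampler {
    private final double probability;

    ProbabilitySampler(double probability) {
        if (probability < 0.0 || probability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.probability = probability;
    }

    /**
     * Maps the low 64 bits of the hex trace ID to a fraction in [0, 1) and
     * samples the trace when that fraction is below the configured probability.
     */
    boolean isSampled(String traceIdHex) {
        // Use at most the last 16 hex characters, i.e. the low 64 bits
        String low = traceIdHex.substring(Math.max(0, traceIdHex.length() - 16));
        long value = Long.parseUnsignedLong(low, 16);
        // Keep the top 53 bits so the value converts to a double without rounding
        double fraction = (value >>> 11) / (double) (1L << 53);
        return fraction < probability;
    }

    public static void main(String[] args) {
        ProbabilitySampler sampler = new ProbabilitySampler(0.1);
        Random random = new Random();
        int sampled = 0;
        int total = 10_000;
        for (int i = 0; i < total; i++) {
            String traceId = String.format("%016x", random.nextLong());
            if (sampler.isSampled(traceId)) {
                sampled++;
            }
        }
        // With uniformly random IDs, roughly 10% of traces should be kept
        System.out.println("sampled " + sampled + " of " + total);
    }
}
```

In practice the entry service usually makes this decision once and propagates it as the `sampled` flag of the trace context, so downstream services do not need to recompute it.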
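10.2 Performance Recommendations
- Asynchronous sending: send trace data asynchronously through Kafka or RabbitMQ
- Sampling strategy: adjust the sampling rate dynamically according to traffic
- Data compression: enable compression for trace data
- Batch sending: aggregate multiple spans and send them in one batch
- Storage optimization: choose a suitable storage backend and indexing strategy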
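The batch-sending idea can be sketched with the JDK alone. The `BatchingSpanReporter` class below is hypothetical: it buffers already-encoded spans and hands them to a sender callback once the batch is full. The sender stands in for whatever transport is used (HTTP, Kafka, RabbitMQ), which is an assumption of this example and not part of any tracing library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical reporter that groups spans into batches before sending. */
final class BatchingSpanReporter {
    private final int batchSize;
    private final Consumer<List<String>> sender; // delivers one batch to the collector
    private final List<String> buffer = new ArrayList<>();

    BatchingSpanReporter(int batchSize, Consumer<List<String>> sender) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be at least 1");
        }
        this.batchSize = batchSize;
        this.sender = sender;
    }

    /** Buffers one encoded span and sends the batch once it is full. */
    synchronized void report(String encodedSpan) {
        buffer.add(encodedSpan);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is buffered; does nothing when the buffer is empty. */
    synchronized void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        sender.accept(batch);
    }

    public static void main(String[] args) {
        BatchingSpanReporter reporter = new BatchingSpanReporter(
                3, batch -> System.out.println("sending " + batch.size() + " spans: " + batch));
        for (int i = 0; i < 7; i++) {
            reporter.report("span-" + i);
        }
        reporter.flush(); // send the remainder, e.g. on shutdown
    }
}
```

A production reporter would typically also flush on a timer and send from a background thread, so that a slow collector never blocks request handling.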
10.3 Security Considerations
```yaml
# Filtering sensitive data
spring:
  sleuth:
    instrument:
      exclude:
        - org.springframework.web.servlet.Filter
    propagation:
      type: w3c
    baggage:
      correlation-enabled: false  # disable automatic MDC correlation
  data:
    redis:
      customizers:
        - tracing-repository-customizer
```
11. Summary
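Tracing is a core component of microservice observability. This article covered:
- Tracing overview: core concepts such as Trace, Span, and Annotation
- Spring Cloud Sleuth: the basic building block for distributed tracing in Spring applications
- Jaeger integration: a tracing system hosted by the CNCF
- Zipkin integration: a tracing system open-sourced by Twitter
- OpenTelemetry: a cross-language tracing standard
- Context propagation: passing trace context across services
- Trace analysis: slow queries, call chains, and dependency analysis
- Alerting: Prometheus-based alert rules
- Grafana dashboards: visualizing trace data
With a complete tracing setup in place, you can locate problems quickly, optimize performance, and understand how the system behaves, which is what makes a microservice system observable in practice.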
