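A first look at Jaeger, a distributed tracing tool

Official documentation

Jaegertracing

Introduction to Jaeger

Jaeger is an open-source, end-to-end distributed tracing system for monitoring and troubleshooting transactions in complex distributed systems.
The figure below compares the common open-source full-link tracing solutions. SkyWalking and Pinpoint are currently the more widely used, but Jaeger's client libraries cover more languages, C++ in particular, so it is the one tested here.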

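Problems Jaeger addresses

  • Distributed transaction monitoring
  • Performance and latency optimization
  • Root cause analysis
  • Service dependency analysis
  • Distributed context propagation

Jaeger architecture diagram

Jaeger components

  • Jaeger Agent communicates with the client and reports the trace data it collects to the Jaeger Collector
  • Jaeger Collector writes the data it receives to a database or other storage backend
  • Jaeger Query serves queries against the trace data
  • Jaeger Ingester is a service that reads from a Kafka topic and writes to another storage backend (Cassandra, Elasticsearch)
  • Jaeger UI handles user interaction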

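Jaeger ports

Agent
5775 UDP, accepts zipkin.thrift over the compact Thrift protocol
6831 UDP, accepts jaeger.thrift over the compact Thrift protocol
6832 UDP, accepts jaeger.thrift over the binary Thrift protocol
5778 HTTP, serves configuration (for example sampling strategies) to clients; not meant for high-volume span data

Collector
14267 TCP, receives spans from the agent in jaeger.thrift format
14250 TCP, receives spans from the agent in protobuf format (gRPC underneath)
14268 HTTP, accepts spans directly from clients
14269 HTTP, health check

Query
16686 HTTP, the Jaeger front end and the interface exposed to users
16687 HTTP, health check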
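The health-check ports in the table can be turned into probe URLs with a small helper. This is only a sketch: the function name `health_url` and the `ADMIN_PORTS` table are made up for illustration, and the "/" route is an assumption based on the "Mounting health check on admin server" line in the startup logs shown later in this post.

```python
# Admin (health-check) ports for the collector and query services,
# taken from the port table above.
ADMIN_PORTS = {
    "collector": 14269,
    "query": 16687,
}


def health_url(component: str, host: str) -> str:
    """Build a health-check URL for a Jaeger component.

    Assumption: the health check is served at "/" on the admin port,
    as suggested by the admin-server startup log lines.
    """
    try:
        port = ADMIN_PORTS[component]
    except KeyError:
        raise ValueError(f"no admin port known for {component!r}") from None
    return f"http://{host}:{port}/"
```

The resulting URL can be passed to curl or to a Kubernetes HTTP probe; the helper itself performs no network calls.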

Deploying Jaeger

1. Create the namespace

[root@VM-0-123-centos jaeger]# kubectl create namespace jaeger 

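2. Deploy the Jaeger Operator
The Jaeger Operator for Kubernetes simplifies deploying and running Jaeger on Kubernetes.
The Jaeger Operator is an implementation of a Kubernetes operator. An operator is a piece of software that eases the operational complexity of running another piece of software. More technically, operators are a method of packaging, deploying, and managing a Kubernetes application.
Each Jaeger Operator version tracks one version of the Jaeger components (Query, Collector, Agent). When a new version of the Jaeger components is released, a new version of the operator is released that knows how to upgrade running instances of the previous version to the new one.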

[root@VM-0-123-centos jaeger]# kubectl create -n jaeger -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/crds/jaegertracing.io_jaegers_crd.yaml 
[root@VM-0-123-centos jaeger]# kubectl create -n jaeger -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/service_account.yaml
[root@VM-0-123-centos jaeger]# kubectl create -n jaeger -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/role.yaml
[root@VM-0-123-centos jaeger]# kubectl create -n jaeger -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/role_binding.yaml
[root@VM-0-123-centos jaeger]# kubectl create -n jaeger -f https://raw.githubusercontent.com/jaegertracing/jaeger-operator/master/deploy/operator.yaml

Check the status

[root@VM-0-123-centos jaeger]# kubectl get all -n jaeger
NAME                                         READY   STATUS        RESTARTS   AGE
pod/jaeger-operator-6ff67bdd4b-4nffk         1/1     Running       0          14d
pod/simple-prod-collector-59fc47bf5c-h26mq   0/1     Terminating   0          9d

NAME                              TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)             AGE
service/jaeger-operator-metrics   ClusterIP   172.20.253.138   <none>        8383/TCP,8686/TCP   14d

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/jaeger-operator   1/1     1            1           14d

NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/jaeger-operator-6ff67bdd4b   1         1         1       14d

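3. Create a Jaeger instance
Create a jaeger.yaml file that points at the ES cluster and limits the CPU and memory of the Deployment/simple-prod-collector containers. The collector can scale out to at most 10 pods.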

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simple-prod
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://10.0.16.3:9200
        index-prefix: zhjt
  collector:
    maxReplicas: 10
    resources:
      limits:
        cpu: 500m
        memory: 512Mi
[root@VM-0-123-centos jaeger]# kubectl apply -f  jaeger.yaml  -n jaeger
jaeger.jaegertracing.io/simple-prod created

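List the Jaeger objects
Note: with the all-in-one example from the official site the status appears to be a normal Running. Here the status shows Failed, but this does not affect use.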

[root@VM-0-123-centos jaeger]# kubectl get jaegers -n jaeger
NAME          STATUS   VERSION   STRATEGY     STORAGE         AGE
simple-prod   Failed   1.22.0    production   elasticsearch   9d

Get the pod names

[root@VM-0-123-centos jaeger]# kubectl get pods -l app.kubernetes.io/instance=simple-prod -n jaeger
NAME                                              READY   STATUS      RESTARTS   AGE
simple-prod-collector-59fc47bf5c-h26mq            1/1     Running     0          9d
simple-prod-query-85689b7bbd-g5jw9                2/2     Running     0          9d

Get the pod logs

[root@VM-0-123-centos jaeger]# kubectl  logs simple-prod-query-85689b7bbd-g5jw9 jaeger-agent  -n jaeger
2021/04/28 04:55:34 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
{"level":"info","ts":1619585734.2081811,"caller":"flags/service.go:117","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1619585734.2082183,"caller":"flags/service.go:123","msg":"Mounting expvar handler on admin server","route":"/debug/vars"}
{"level":"info","ts":1619585734.2083232,"caller":"flags/admin.go:105","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1619585734.2083883,"caller":"flags/admin.go:111","msg":"Starting admin HTTP server","http-addr":":14271"}
{"level":"info","ts":1619585734.2084124,"caller":"flags/admin.go:97","msg":"Admin server started","http.host-port":"[::]:14271","health-status":"unavailable"}
{"level":"info","ts":1619585734.2089527,"caller":"grpc/builder.go:70","msg":"Agent requested insecure grpc connection to collector(s)"}
{"level":"info","ts":1619585734.2089992,"caller":"grpc@v1.29.1/clientconn.go:243","msg":"parsed scheme: \"dns\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.21038,"caller":"command-line-arguments/main.go:84","msg":"Starting agent"}
{"level":"info","ts":1619585734.2104166,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1619585734.2108943,"caller":"grpc/builder.go:108","msg":"Checking connection to collector"}
{"level":"info","ts":1619585734.210908,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"IDLE"}
{"level":"info","ts":1619585734.211061,"caller":"app/agent.go:69","msg":"Starting jaeger-agent HTTP server","http-port":5778}
{"level":"info","ts":1619585734.3344934,"caller":"grpc@v1.29.1/resolver_conn_wrapper.go:143","msg":"ccResolverWrapper: sending update to cc: {[{172.20.0.88:14250  <nil> 0 <nil>}] <nil> <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.3345578,"caller":"grpc@v1.29.1/clientconn.go:667","msg":"ClientConn switching balancer to \"round_robin\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.3345697,"caller":"grpc@v1.29.1/clientconn.go:682","msg":"Channel switches to new LB policy \"round_robin\"","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.3346283,"caller":"grpc@v1.29.1/clientconn.go:1056","msg":"Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.33467,"caller":"grpc@v1.29.1/clientconn.go:1193","msg":"Subchannel picks a new address \"172.20.0.88:14250\" to connect","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.334736,"caller":"grpc@v1.29.1/clientconn.go:417","msg":"Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.3347983,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"CONNECTING"}
{"level":"info","ts":1619585734.335669,"caller":"grpc@v1.29.1/clientconn.go:1056","msg":"Subchannel Connectivity change to READY","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.3357751,"caller":"base/balancer.go:200","msg":"roundrobinPicker: newPicker called with info: {map[0xc0002f5ea0:{{172.20.0.88:14250  <nil> 0 <nil>}}]}","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.3357947,"caller":"grpc@v1.29.1/clientconn.go:417","msg":"Channel Connectivity change to READY","system":"grpc","grpc_log":true}
{"level":"info","ts":1619585734.335807,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"READY"}
{"level":"info","ts":1619592172.4516647,"caller":"grpc@v1.29.1/clientconn.go:1056","msg":"Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.4517512,"caller":"grpc@v1.29.1/clientconn.go:1193","msg":"Subchannel picks a new address \"172.20.0.88:14250\" to connect","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.4517596,"caller":"base/balancer.go:200","msg":"roundrobinPicker: newPicker called with info: {map[]}","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.4517772,"caller":"grpc@v1.29.1/clientconn.go:417","msg":"Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.4517884,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"CONNECTING"}
{"level":"warn","ts":1619592172.4523218,"caller":"grpc@v1.29.1/clientconn.go:1275","msg":"grpc: addrConn.createTransport failed to connect to {172.20.0.88:14250  <nil> 0 <nil>}. Err: connection error: desc = \"transport: Error while dialing dial tcp 172.20.0.88:14250: connect: connection refused\". Reconnecting...","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.4523551,"caller":"grpc@v1.29.1/clientconn.go:1056","msg":"Subchannel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.452386,"caller":"grpc@v1.29.1/clientconn.go:417","msg":"Channel Connectivity change to TRANSIENT_FAILURE","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.4523947,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"TRANSIENT_FAILURE"}
{"level":"info","ts":1619592172.6118224,"caller":"grpc@v1.29.1/resolver_conn_wrapper.go:143","msg":"ccResolverWrapper: sending update to cc: {[{172.20.0.178:14250  <nil> 0 <nil>}] <nil> <nil>}","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.6118581,"caller":"grpc@v1.29.1/clientconn.go:1056","msg":"Subchannel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.6118758,"caller":"grpc@v1.29.1/clientconn.go:1056","msg":"Subchannel Connectivity change to SHUTDOWN","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.611892,"caller":"grpc@v1.29.1/clientconn.go:417","msg":"Channel Connectivity change to CONNECTING","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.6119003,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"CONNECTING"}
{"level":"info","ts":1619592172.6119049,"caller":"grpc@v1.29.1/clientconn.go:1193","msg":"Subchannel picks a new address \"172.20.0.178:14250\" to connect","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.612726,"caller":"grpc@v1.29.1/clientconn.go:1056","msg":"Subchannel Connectivity change to READY","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.6127572,"caller":"base/balancer.go:200","msg":"roundrobinPicker: newPicker called with info: {map[0xc0003df970:{{172.20.0.178:14250  <nil> 0 <nil>}}]}","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.6127682,"caller":"grpc@v1.29.1/clientconn.go:417","msg":"Channel Connectivity change to READY","system":"grpc","grpc_log":true}
{"level":"info","ts":1619592172.6127849,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"READY"}
[root@VM-0-123-centos jaeger]# kubectl  logs simple-prod-query-85689b7bbd-g5jw9 jaeger-query   -n jaeger
2021/04/28 04:55:29 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
{"level":"info","ts":1619585729.8951077,"caller":"flags/service.go:117","msg":"Mounting metrics handler on admin server","route":"/metrics"}
{"level":"info","ts":1619585729.8951416,"caller":"flags/service.go:123","msg":"Mounting expvar handler on admin server","route":"/debug/vars"}
{"level":"info","ts":1619585729.8952546,"caller":"flags/admin.go:105","msg":"Mounting health check on admin server","route":"/"}
{"level":"info","ts":1619585729.8953054,"caller":"flags/admin.go:111","msg":"Starting admin HTTP server","http-addr":":16687"}
{"level":"info","ts":1619585729.8953238,"caller":"flags/admin.go:97","msg":"Admin server started","http.host-port":"[::]:16687","health-status":"unavailable"}
{"level":"info","ts":1619585729.9169888,"caller":"config/config.go:183","msg":"Elasticsearch detected","version":7}
{"level":"info","ts":1619585729.9174955,"caller":"app/static_handler.go:181","msg":"UI config path not provided, config file will not be watched"}
{"level":"info","ts":1619585729.9175768,"caller":"app/server.go:170","msg":"Query server started"}
{"level":"info","ts":1619585729.9175944,"caller":"healthcheck/handler.go:128","msg":"Health Check state change","status":"ready"}
{"level":"info","ts":1619585729.9176183,"caller":"app/server.go:249","msg":"Starting GRPC server","port":16685,"addr":":16685"}
{"level":"info","ts":1619585729.9176335,"caller":"app/server.go:230","msg":"Starting HTTP server","port":16686,"addr":":16686"}
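The agent log above is mostly one JSON object per line, so the collector connection history can be pulled out with a few lines of Python. The following is a minimal sketch, not part of Jaeger: the function name `connection_states` is hypothetical, and the sample text is a handful of lines copied from the output above.

```python
import json

# A few lines copied from the agent log above, including one non-JSON banner.
SAMPLE_LOG = """\
2021/04/28 04:55:34 maxprocs: Leaving GOMAXPROCS=4: CPU quota undefined
{"level":"info","ts":1619585734.21038,"caller":"command-line-arguments/main.go:84","msg":"Starting agent"}
{"level":"info","ts":1619585734.210908,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"IDLE"}
{"level":"info","ts":1619585734.3347983,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"CONNECTING"}
{"level":"info","ts":1619585734.335807,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"READY"}
{"level":"info","ts":1619592172.4523947,"caller":"grpc/builder.go:119","msg":"Agent collector connection state change","dialTarget":"dns:///simple-prod-collector-headless.jaeger.svc:14250","status":"TRANSIENT_FAILURE"}
"""


def connection_states(log_text: str) -> list[tuple[float, str]]:
    """Return (timestamp, status) for each agent-to-collector state change."""
    states = []
    for line in log_text.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip plain-text lines such as the maxprocs banner
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if entry.get("msg") == "Agent collector connection state change":
            states.append((entry["ts"], entry["status"]))
    return states


if __name__ == "__main__":
    for ts, status in connection_states(SAMPLE_LOG):
        print(ts, status)
```

In the full log, a TRANSIENT_FAILURE followed shortly by READY at a new address lines up with the collector pod being replaced, which is why the "connection refused" warning there is transient.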

4. View the Jaeger resources

[root@VM-0-123-centos jaeger]# kubectl get all -n jaeger
NAME                                                  READY   STATUS      RESTARTS   AGE
pod/jaeger-operator-6ff67bdd4b-4nffk                  1/1     Running     0          14d
pod/simple-prod-collector-59fc47bf5c-h26mq            1/1     Running     0          8d
pod/simple-prod-query-85689b7bbd-g5jw9                2/2     Running     0          8d

NAME                                     TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                                  AGE
service/jaeger-operator-metrics          ClusterIP   172.20.253.138   <none>        8383/TCP,8686/TCP                        14d
service/simple-prod-collector            ClusterIP   172.20.255.184   <none>        9411/TCP,14250/TCP,14267/TCP,14268/TCP   8d
service/simple-prod-collector-headless   ClusterIP   None             <none>        9411/TCP,14250/TCP,14267/TCP,14268/TCP   8d
service/simple-prod-query                ClusterIP   172.20.254.102   <none>        16686/TCP                                8d

NAME                                    READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/jaeger-operator         1/1     1            1           14d
deployment.apps/simple-prod-collector   1/1     1            1           8d
deployment.apps/simple-prod-query       1/1     1            1           8d

NAME                                               DESIRED   CURRENT   READY   AGE
replicaset.apps/jaeger-operator-6ff67bdd4b         1         1         1       14d
replicaset.apps/simple-prod-collector-59fc47bf5c   1         1         1       8d
replicaset.apps/simple-prod-query-85689b7bbd       1         1         1       8d

NAME                                                        REFERENCE                          TARGETS             MINPODS   MAXPODS   REPLICAS   AGE
horizontalpodautoscaler.autoscaling/simple-prod-collector   Deployment/simple-prod-collector   1457m/90, 137m/90   1         10        1          8d
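The HorizontalPodAutoscaler row above bounds the collector between 1 and 10 pods with a target of 90. As a rough illustration, the basic scaling rule from the Kubernetes documentation is desired = ceil(current replicas × current metric / target metric), clamped to the min/max bounds. The sketch below implements only that core rule; the function name and the sample utilization numbers are hypothetical, and the real controller also applies a tolerance and stabilization behaviour that are left out here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Core HPA rule: scale proportionally to metric/target, then clamp.

    The defaults (1 and 10) mirror MINPODS/MAXPODS in the output above.
    """
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # Hypothetical readings against the 90% target from the HPA row.
    print(desired_replicas(1, 180, 90))
    print(desired_replicas(4, 300, 90))
```

With `maxReplicas: 10` in jaeger.yaml, the clamp is what keeps the collector from scaling past 10 pods no matter how high the measured utilization goes.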

If traffic is heavy and the load on ES needs to be reduced, a Kafka cluster can be placed in front of it by modifying the jaeger.yaml file

apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: simple-streaming
spec:
  strategy: streaming
  collector:
    options:
      kafka:
        producer:
          topic: jaeger-spans
          brokers: my-cluster-kafka-brokers.kafka:9092   # change to your Kafka address
  ingester:
    options:
      kafka:
        consumer:
          topic: jaeger-spans
          brokers: my-cluster-kafka-brokers.kafka:9092  # change to your Kafka address
      ingester:
        deadlockInterval: 5s
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200   # change to your ES address

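5. Deploy the agent

The agent is a proxy for the Jaeger client: the client sends the trace data it collects to the agent, and the agent forwards it to the collector. Because the client talks to the agent over UDP, the agent is usually deployed close to the client.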
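To make the UDP point concrete, here is a small loopback sketch of the fire-and-forget pattern the client uses towards the agent. It does not speak the Jaeger wire format: the payload is a placeholder, the function name `send_and_receive` is made up, and the receiver simply stands in for an agent listening on a local UDP port.

```python
import socket


def send_and_receive(payload: bytes, timeout: float = 1.0) -> bytes:
    """Send one UDP datagram to a local receiver and return what arrived.

    The receiver plays the role of the agent; the sender plays the client.
    """
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # Port 0 lets the OS pick a free port; a real agent would listen on 6831.
        receiver.bind(("127.0.0.1", 0))
        receiver.settimeout(timeout)
        # sendto returns as soon as the datagram is handed to the OS:
        # no connection, no acknowledgement, no retry.
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(65535)
        return data
    finally:
        sender.close()
        receiver.close()


if __name__ == "__main__":
    print(send_and_receive(b"placeholder-span"))
```

Because the sender never learns whether a datagram arrived, keeping the agent on the same host or in the same pod as the client keeps the lossy hop as short as possible; from the agent onwards, spans travel to the collector over gRPC.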

The agent can be installed in several ways

1) Docker

Download: jaegertracing/jaeger-agent Tags (docker.com)

docker run -d -p 6831:6831/udp -p 6832:6832/udp -p 5778:5778/tcp jaegertracing/jaeger-agent:1.12 --reporter.grpc.host-port=xx.xx.xx.xx:14250

2) Kubernetes, which offers two options

Sidecar

DaemonSet

Reference: Operator for Kubernetes — Jaeger documentation (jaegertracing.io)

3) Binary

Download: Jaeger – Download Jaeger (jaegertracing.io)

nohup ./jaeger-agent --collector.host-port=xxxx:14267 1>1.log 2>2.log &