Datacenter: id=5, name=Kaliti
Asset: id=885, name=KLT-IGW-T9K001-NPB06

The cause of the anomaly in the monitoring chart AssetInfo dashboard -> TSG-9140-NPB-TRAFFIC -> Throughput - DP1/L4 is analyzed below.

Problem description:
  Time range of the chart anomaly: 2023-06-09 06:58:13 - 2023-06-09 07:00:13 (UTC+3)
  Prometheus scrape interval: 1 minute
  Prometheus scrape status metrics:
    up: 1 means the target service is healthy, i.e. reachable; 0 means the scrape failed
    scrape_samples_scraped: the number of samples the target exposed

1. up
  Data: the value is 1 at 06:58:13, 06:59:13, and 07:00:13.
  Analysis: the service at 10.217.11.6:9000 was healthy (the connection was established successfully).

    {
      "metric": {
        "__name__": "up",
        "asset": "KLT-IGW-T9K001-NPB06",
        "asset_id": "885",
        "datacenter": "Kaliti",
        "datacenter_id": "5",
        "device_group": "KLT-IGW",
        "endpoint": "MRAPM-STREAM-KLT-IGW-T9K001-NPB06",
        "endpoint_id": "5703",
        "instance": "10.217.11.6:9000",
        "job": "5703",
        "module": "MRAPM-STREAM",
        "module_id": "13",
        "nz_agent_id": "43",
        "project": "TSG-9140",
        "project_id": "3"
      },
      "values": [
        [ 1686283093.501, "1" ],  // 2023-06-09 06:58:13
        [ 1686283153.501, "1" ],  // 2023-06-09 06:59:13
        [ 1686283213.501, "1" ]   // 2023-06-09 07:00:13
      ]
    }

2. scrape_samples_scraped
  Data: the value is 36 at 06:58:13 and 07:00:13, and 0 at 06:59:13.
  Analysis: combined with the up metric, at 06:59:13 we have up=1 and scrape_samples_scraped=0. At that time Prometheus connected to the target service successfully but did not pull any metric data from it.

    {
      "metric": {
        "__name__": "scrape_samples_scraped",
        "asset": "KLT-IGW-T9K001-NPB06",
        "asset_id": "885",
        "datacenter": "Kaliti",
        "datacenter_id": "5",
        "device_group": "KLT-IGW",
        "endpoint": "MRAPM-STREAM-KLT-IGW-T9K001-NPB06",
        "endpoint_id": "5703",
        "instance": "10.217.11.6:9000",
        "job": "5703",
        "module": "MRAPM-STREAM",
        "module_id": "13",
        "nz_agent_id": "43",
        "project": "TSG-9140",
        "project_id": "3"
      },
      "values": [
        [ 1686283093.501, "36" ],  // 2023-06-09 06:58:13
        [ 1686283153.501, "0" ],   // 2023-06-09 06:59:13
        [ 1686283213.501, "36" ]   // 2023-06-09 07:00:13
      ]
    }

3. rx_bits_total
  Description: rx_bits_total is the metric used by the Throughput - DP1/L4 monitoring chart.
  Data: samples exist at 06:58:13 and 07:00:13, but the sample for 06:59:13 is missing.
  Analysis: the query expression in the chart is irate(rx_bits_total{asset="KLT-IGW-T9K001-NPB06", app="sapp4", device="eth_vf_raw"}[2m]). irate computes the rate from the last two samples inside the range window. With a 1-minute scrape interval and a 2m window, one missing scrape cycle can leave fewer than two samples in the window, so the query returns no value at those evaluation times and the chart shows a break.

    {
      "metric": {
        "__name__": "rx_bits_total",
        "app": "sapp4",
        "asset": "KLT-IGW-T9K001-NPB06",
        "asset_id": "885",
        "datacenter": "Kaliti",
        "datacenter_id": "5",
        "device": "eth_vf_raw",
        "device_group": "KLT-IGW",
        "endpoint": "MRAPM-STREAM-KLT-IGW-T9K001-NPB06",
        "endpoint_id": "5703",
        "module": "MRAPM-STREAM",
        "module_id": "13",
        "nz_agent_id": "43",
        "project": "TSG-9140",
        "project_id": "3"
      },
      "values": [
        [ 1686283093.501, "3346270294310792" ],  // 2023-06-09 06:58:13
        [ 1686283213.501, "3346325185776760" ]   // 2023-06-09 07:00:13
      ]
    }

Summary:
  1. At 06:58:13, 06:59:13, and 07:00:13, the service at 10.217.11.6:9000 was healthy.
  2. At 06:59:13, Prometheus performed a scrape, but the service at 10.217.11.6:9000 did not provide any metric data.
  3. At 06:59:13, one cycle of metric data is missing; because the chart uses the irate function, this caused the break in the chart.