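Monitoring and Alerting System

The p99 latency of a production Agent service spikes to 45 seconds and user complaints pour in, but there is no dashboard, no alerting, and nobody even knows when the problem started. It takes two hours of digging through logs to discover that LLM API rate limiting triggered a retry storm.

With a complete monitoring and alerting system, the problem would have been detected within five minutes. So how do you build one?

The Three Pillars of Monitoring: Metrics, Logs, Traces

Before building any monitoring, get one basic concept straight. The industry splits observability into three pillars, each answering a different question: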

The Three Pillars of Observability
=============================

  Metrics              Logs                 Traces
  ─────────────        ──────────           ─────────────────
  "What happened"      "Why it happened"    "Where it happened"
  Numeric time series  Structured or        Path of a request
                       unstructured text    across services

  ┌─────────────┐      ┌─────────────┐      ┌─────────────────┐
  │ CPU: 73%    │      │ ERROR: LLM  │      │ [Gateway]       │
  │ QPS: 1200   │      │ timeout at  │      │  └─ [Agent]     │
  │ p99: 2.3s   │      │ 2024-01-15  │      │     ├─ [LLM]    │
  │ Error: 0.1% │      │ 03:15:22    │      │     └─ [Tool]   │
  └─────────────┘      └─────────────┘      └─────────────────┘

  Aggregated values    Detailed records     Full call chain
  Trend analysis       Pattern detection    Bottleneck location

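The three are complementary, not interchangeable. Metrics tell you that something went wrong, logs tell you why, and traces tell you exactly which hop went wrong. A mature monitoring system needs all three.

This layering matters even more for Agent services, because the call path of an Agent is inherently multi-hop: a user request enters through the Gateway, reaches the Agent Orchestrator, and may then call the LLM, RAG, and Tools. Any hop can fail.

Overall Monitoring Architecture

Start with the big picture, then work through the pieces one by one: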

Agent Service Monitoring Architecture
======================================================================

  Clients                Services                   Monitoring
  ───────                ────────                   ──────────

  ┌──────┐   HTTP/gRPC   ┌──────────┐    scrape     ┌──────────────┐
  │ User │ ────────────► │ Gateway  │ ────────────► │  Prometheus  │
  └──────┘               └──────────┘               │              │
                              │                     │  · Metrics   │
                              ▼                     │  · Rules     │
                         ┌──────────┐    expose     │  · Storage   │
                         │  Agent   │ ────────────► └──────┬───────┘
                         │ Orchestr │                      │
                         └──────────┘                      │
                            │    │                         │
                       ┌────┘    └────┐                    ▼
                       ▼              ▼             ┌──────────────┐
                   ┌───────┐      ┌───────┐         │   Grafana    │
                   │  LLM  │      │ Tool  │         │              │
                   │  API  │      │Service│         │ · Dashboards │
                   └───────┘      └───────┘         │ · Alerts     │
                                                    └──────┬───────┘
  ┌──────┐                                                 │
  │ Log  │ ─── push ───► ┌──────────┐                      │
  │output│               │   Loki   │                      │
  └──────┘               │    /     │                      ▼
                         │   ELK    │               ┌──────────────┐
  ┌──────┐               └──────────┘               │ AlertManager │
  │Jaeger│ ◄── OTLP spans from services             │              │
  │/Tempo│                                          │  · Route     │
  └──────┘                                          │  · Silence   │
                                                    │  · Notify    │
                                                    └──────┬───────┘
                                                           │
                                                ┌──────────┼──────────┐
                                                ▼          ▼          ▼
                                            ┌────────┐ ┌────────┐ ┌────────┐
                                            │ Slack  │ │ Pager  │ │ Email  │
                                            │        │ │ Duty   │ │        │
                                            └────────┘ └────────┘ └────────┘

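Read left to right, the data flow is straightforward: services emit metrics, logs, and traces, which flow into Prometheus, Loki, and Jaeger respectively; Grafana provides unified visualization, and AlertManager handles alert routing and notification.

Prometheus: The Core of Metrics Collection

Why Prometheus

For Agent services, Prometheus is the most natural choice, for three reasons:

Pull model -- Prometheus actively scrapes metrics instead of relying on services to push them. Even when an Agent service goes down, Prometheus still knows about it: a failed scrape sets the target's up metric to 0, which you can alert on
PromQL -- a powerful query language that makes aggregation and quantile calculation easy
Mature ecosystem -- almost every piece of middleware has a ready-made Exporter

Data Model

The core data model of Prometheus is the time series. Each series is uniquely identified by its metric name plus labels: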

# Format (text exposition format: the value comes first, the timestamp is optional and in milliseconds)
metric_name{label1="value1", label2="value2"} value [timestamp]

# Examples
agent_request_total{service="agent-orchestrator", method="POST", status="200"} 15234 1705300000000
agent_llm_latency_seconds{service="agent-orchestrator", model="gpt-4"} 2.341 1705300000000
agent_tool_call_total{service="tool-service", tool_name="search", status="success"} 892 1705300000000

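There are four basic metric types to understand:

Counter -- a monotonically increasing count, such as total requests or total errors
Gauge -- an instantaneous value that can go up or down, such as CPU usage or current connections
Histogram -- bucketed observations, such as a request latency distribution
Summary -- quantiles computed on the client; cheaper to store than a Histogram, but the quantiles cannot be aggregated across instances

For Agent services, Histogram is the most frequently used type, because the latency distribution is the metric you care about most.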
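
To make the Histogram type concrete, here is a minimal Python sketch of how observations end up in cumulative `le` buckets. It only illustrates the bookkeeping and is not the client library's implementation; the function name `cumulative_buckets` and the sample latencies are made up for this example.

```python
import math
from bisect import bisect_left

def cumulative_buckets(observations, bounds):
    """Return [(le, cumulative_count), ...] in the shape a Histogram exposes.

    `bounds` must be sorted ascending; an implicit +Inf bucket is appended.
    """
    per_bucket = [0] * (len(bounds) + 1)  # last slot is the +Inf bucket
    for value in observations:
        # Index of the first bound with value <= bound ("le" means less-or-equal).
        per_bucket[bisect_left(bounds, value)] += 1
    result, running = [], 0
    for le, count in zip(list(bounds) + [math.inf], per_bucket):
        running += count  # buckets are cumulative
        result.append((le, running))
    return result

# Hypothetical request latencies in seconds.
latencies = [0.3, 0.9, 1.0, 4.2, 70.0]
buckets = cumulative_buckets(latencies, bounds=[0.5, 1.0, 5.0, 60.0])
for le, count in buckets:
    print(f"le={le}: {count}")
```

Because each bucket counts everything at or below its bound, the +Inf bucket always equals the total number of observations.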

Custom Metrics for Agent Services

Below is a set of core metric definitions that I actually use in production:

from prometheus_client import Counter, Histogram, Gauge

# ========== Request-level metrics ==========

# Total Agent requests (Counter)
agent_request_total = Counter(
    'agent_request_total',
    'Total agent requests',
    ['service', 'method', 'status', 'user_type']
)

# Agent request latency (Histogram)
agent_request_duration_seconds = Histogram(
    'agent_request_duration_seconds',
    'Agent request latency in seconds',
    ['service', 'method'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0, 15.0, 30.0, 60.0]
)

# ========== LLM metrics ==========

# LLM call latency (Histogram)
llm_call_duration_seconds = Histogram(
    'llm_call_duration_seconds',
    'LLM API call latency',
    ['model', 'provider', 'operation'],
    buckets=[0.5, 1.0, 2.0, 5.0, 10.0, 15.0, 30.0, 60.0]
)

# Tokens consumed (Counter; apply rate() to get the consumption rate)
llm_tokens_total = Counter(
    'llm_tokens_total',
    'Total tokens consumed by LLM',
    ['model', 'provider', 'type']  # type: input / output
)

# LLM call cost (Counter)
llm_cost_dollars_total = Counter(
    'llm_cost_dollars_total',
    'Total LLM API cost in dollars',
    ['model', 'provider']
)

# ========== Tool call metrics ==========

# Total Tool calls (Counter)
tool_call_total = Counter(
    'tool_call_total',
    'Total tool calls',
    ['tool_name', 'status']  # status: success / failure / timeout
)

# Tool call latency (Histogram)
tool_call_duration_seconds = Histogram(
    'tool_call_duration_seconds',
    'Tool call latency',
    ['tool_name'],
    buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# ========== Agent runtime metrics ==========

# Currently active Agent sessions (Gauge)
agent_active_sessions = Gauge(
    'agent_active_sessions',
    'Number of active agent sessions',
    ['service']
)

# Iterations per Agent run (Histogram)
agent_loop_iterations = Histogram(
    'agent_loop_iterations',
    'Number of iterations per agent run',
    ['service'],
    buckets=[1, 2, 3, 5, 8, 10, 15, 20, 30, 50]
)

# ========== Queue and cache metrics ==========

# Request queue depth (Gauge)
agent_queue_depth = Gauge(
    'agent_queue_depth',
    'Current depth of agent request queue',
    ['service', 'priority']
)

# Cache operations (Counter; the hit rate is derived from hit / total)
agent_cache_total = Counter(
    'agent_cache_total',
    'Cache operations',
    ['cache_type', 'result']  # result: hit / miss
)
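
A cost counter such as llm_cost_dollars_total has to be incremented with a dollar amount computed per call. The sketch below shows one way to derive that amount from token counts. The price table, the model name "model-a", and the helper `estimate_cost_dollars` are all hypothetical; real prices come from your provider.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Price:
    input_per_1k: float   # dollars per 1,000 input tokens
    output_per_1k: float  # dollars per 1,000 output tokens

# Hypothetical prices for illustration only.
PRICES = {
    "model-a": Price(input_per_1k=0.01, output_per_1k=0.03),
}

def estimate_cost_dollars(model, input_tokens, output_tokens):
    """Estimate the cost of one LLM call from its token usage."""
    price = PRICES[model]
    return (
        (input_tokens / 1000) * price.input_per_1k
        + (output_tokens / 1000) * price.output_per_1k
    )

cost = estimate_cost_dollars("model-a", input_tokens=1200, output_tokens=450)
print(f"estimated cost: ${cost:.4f}")
```

The result would then be passed to the counter's `inc()` after selecting the label values with `labels()`, so that rate() and increase() over the counter yield spend per unit of time.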

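PromQL in Practice

Instrumentation alone is not enough; you also need to be able to write queries. Here are some PromQL queries commonly used for Agent services: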

# 1. Agent service p99 latency (last 5 minutes)
histogram_quantile(0.99,
  sum by (le) (rate(agent_request_duration_seconds_bucket[5m]))
)

# 2. p95 latency of LLM calls, grouped by model
#    (le must stay in the aggregation, otherwise histogram_quantile has no buckets to work with)
histogram_quantile(0.95,
  sum by (model, le) (
    rate(llm_call_duration_seconds_bucket[5m])
  )
)

# 3. Token consumption rate (tokens/minute)
sum by (model) (
  rate(llm_tokens_total[1m])
) * 60

# 4. LLM cost per minute (dollars/minute)
sum(rate(llm_cost_dollars_total[1m])) * 60

# 5. Tool call success rate
sum(rate(tool_call_total{status="success"}[5m]))
/
sum(rate(tool_call_total[5m]))

# 6. Agent request error rate (5xx)
sum(rate(agent_request_total{status=~"5.."}[5m]))
/
sum(rate(agent_request_total[5m]))

# 7. LLM rate-limit events (last 10 minutes)
#    Note: llm_call_duration_seconds as defined above has no reason label;
#    this query only returns data if that label is added to the histogram
sum(increase(llm_call_duration_seconds_count{
  reason="rate_limit"
}[10m]))

# 8. Average tokens consumed per request
sum(rate(llm_tokens_total[5m]))
/
sum(rate(agent_request_total[5m]))

# 9. Agent loop iteration distribution (P50, P95, P99)
histogram_quantile(0.50, sum by (le) (rate(agent_loop_iterations_bucket[5m])))
histogram_quantile(0.95, sum by (le) (rate(agent_loop_iterations_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(agent_loop_iterations_bucket[5m])))

# 10. Cache hit rate
sum(rate(agent_cache_total{result="hit"}[5m]))
/
sum(rate(agent_cache_total[5m]))

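A few PromQL points are easy to mix up:

rate() vs increase(): rate() returns the per-second rate, while increase() returns the total increase over the window. Both are meant for Counters; for Gauges use delta() instead
Choosing the time window: too short and the result is noisy, too long and it reacts slowly. Five minutes is a common balance
histogram_quantile is computed from buckets and interpolates inside the matching bucket, so poorly chosen bucket boundaries produce inaccurate quantiles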
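
The last point is easier to see with numbers. The sketch below reimplements the linear interpolation that histogram_quantile performs over cumulative buckets, in plain Python. It is a simplified illustration (no native histograms, no handling of malformed input), and the two bucket layouts are invented for the comparison.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate quantile q from [(upper_bound, cumulative_count), ...].

    The list must end with a (+Inf, total) bucket, as Prometheus histograms do.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # The quantile falls into +Inf: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - below) / (count - below)

# Same 100 hypothetical requests, recorded with two different bucket layouts.
fine = [(0.5, 0), (1.0, 50), (5.0, 90), (10.0, 100), (math.inf, 100)]
coarse = [(1.0, 50), (60.0, 100), (math.inf, 100)]

p99_fine = histogram_quantile(0.99, fine)
p99_coarse = histogram_quantile(0.99, coarse)
print(p99_fine, p99_coarse)
```

With the coarse layout, every request slower than 1 second lands in the single 1 to 60 second bucket, so the estimated p99 is pushed toward 60 seconds even though no request took longer than 10. Bucket boundaries should be dense around the latencies you alert on.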

Prometheus Configuration in Detail

Below is a production-style Prometheus configuration. Pay particular attention to the scrape_configs, rule_files, and alerting sections:

# prometheus.yml
global:
  scrape_interval: 15s       # global scrape interval
  evaluation_interval: 15s   # rule evaluation interval
  scrape_timeout: 10s        # scrape timeout

# Alert rule files
rule_files:
  - "/etc/prometheus/rules/*.yml"

# AlertManager endpoints
alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - "alertmanager:9093"

# Scrape configuration
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Agent Orchestrator service
  - job_name: "agent-orchestrator"
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - "agent-system"
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Rewrite the scrape address to <pod IP>:<annotated port>;
      # the address must keep the host part, not just the port
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: "$1:$2"

  # Tool service
  - job_name: "tool-service"
    static_configs:
      - targets: ["tool-service:8000"]
    metrics_path: "/metrics"
    scrape_interval: 10s      # the Tool service is scraped more often

  # LLM API proxy (if you run your own proxy layer)
  - job_name: "llm-proxy"
    static_configs:
      - targets: ["llm-proxy:8080"]
    scrape_interval: 10s

  # Redis cache
  - job_name: "redis"
    static_configs:
      - targets: ["redis-exporter:9121"]

  # PostgreSQL
  - job_name: "postgres"
    static_configs:
      - targets: ["postgres-exporter:9187"]
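
For pod discovery, the scrape address usually has to end up as host:port, with the port taken from the prometheus.io/port annotation. The following Python sketch mimics what a replace-style relabel step computes for that case. The pattern `([^:]+)(?::\d+)?;(\d+)` is the one commonly used for this purpose, and the function name `relabel_address` is made up for the illustration.

```python
import re

# Source label values are joined with ";" before the regex is applied:
# "<host>[:<discovered port>];<annotated port>"
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")

def relabel_address(address, annotated_port):
    """Return the scrape address after an __address__ rewrite."""
    joined = f"{address};{annotated_port}"
    match = ADDRESS_RE.fullmatch(joined)  # relabel regexes are fully anchored
    if match is None:
        return address  # no match: the target label is left unchanged
    return f"{match.group(1)}:{match.group(2)}"

print(relabel_address("10.0.0.5:9102", "8000"))
print(relabel_address("10.0.0.5", "8000"))
print(relabel_address("10.0.0.5:9102", ""))
```

A rule that copies only the port annotation into __address__ would drop the host part, leaving Prometheus with nothing to connect to, which is why both source labels appear in the pattern.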

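Grafana: Visualization and Dashboards

Prometheus handles collection and storage; Grafana is what makes the data readable. A good dashboard should let you judge within 30 seconds whether the service is healthy.

Dashboard Design Approach

I usually split dashboards into three layers: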

Dashboard Layering
=============================

  ┌─────────────────────────────────────────┐
  │  L1: Overview                           │
  │  See the whole picture at a glance      │
  │  · Total QPS / error rate / p99 latency │
  │  · Per-service status (red/yellow/green)│
  └───────────────────┬─────────────────────┘
                      │ drill down after spotting an anomaly
                      ▼
  ┌─────────────────────────────────────────┐
  │  L2: Service                            │
  │  Split by service to find the faulty one│
  │  · Agent Orchestrator metrics           │
  │  · LLM API metrics                      │
  │  · Tool Service metrics                 │
  └───────────────────┬─────────────────────┘
                      │ go deeper once the service is identified
                      ▼
  ┌─────────────────────────────────────────┐
  │  L3: Detail                             │
  │  Internal details of a single service   │
  │  · Trace of an individual request       │
  │  · LLM token consumption breakdown      │
  │  · Tool call success rate distribution  │
  └─────────────────────────────────────────┘

Grafana Dashboard JSON Example

Below is the JSON for the core Agent service dashboard (simplified, with the key structure kept):

{
  "dashboard": {
    "title": "Agent Service Overview",
    "uid": "agent-overview",
    "timezone": "browser",
    "refresh": "30s",
    "time": {
      "from": "now-6h",
      "to": "now"
    },
    "panels": [
      {
        "id": 1,
        "title": "Agent Request Rate (QPS)",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 0 },
        "targets": [
          {
            "expr": "sum(rate(agent_request_total[5m]))",
            "legendFormat": "Total QPS",
            "refId": "A"
          },
          {
            "expr": "sum(rate(agent_request_total{status=~\"5..\"}[5m]))",
            "legendFormat": "Error QPS",
            "refId": "B"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "reqps",
            "custom": {
              "drawStyle": "line",
              "lineWidth": 2,
              "fillOpacity": 10
            }
          }
        }
      },
      {
        "id": 2,
        "title": "Request Latency P99",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 12, "y": 0 },
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum by (le) (rate(agent_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p99",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.95, sum by (le) (rate(agent_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p95",
            "refId": "B"
          },
          {
            "expr": "histogram_quantile(0.50, sum by (le) (rate(agent_request_duration_seconds_bucket[5m])))",
            "legendFormat": "p50",
            "refId": "C"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "s",
            "custom": {
              "drawStyle": "line",
              "lineWidth": 2,
              "fillOpacity": 10
            }
          }
        }
      },
      {
        "id": 3,
        "title": "LLM Token Consumption Rate",
        "type": "timeseries",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 8 },
        "targets": [
          {
            "expr": "sum by (model) (rate(llm_tokens_total[5m])) * 60",
            "legendFormat": "{{model}} tokens/min",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "short",
            "custom": {
              "drawStyle": "line",
              "lineWidth": 2,
              "fillOpacity": 10
            }
          }
        }
      },
      {
        "id": 4,
        "title": "LLM API Cost ($/hour)",
        "type": "stat",
        "gridPos": { "h": 8, "w": 6, "x": 12, "y": 8 },
        "targets": [
          {
            "expr": "sum(increase(llm_cost_dollars_total[1h]))",
            "legendFormat": "Last 1h Cost",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "currencyUSD",
            "thresholds": {
              "steps": [
                { "color": "green", "value": null },
                { "color": "yellow", "value": 50 },
                { "color": "red", "value": 100 }
              ]
            }
          }
        }
      },
      {
        "id": 5,
        "title": "Tool Call Success Rate",
        "type": "gauge",
        "gridPos": { "h": 8, "w": 6, "x": 18, "y": 8 },
        "targets": [
          {
            "expr": "sum(rate(tool_call_total{status=\"success\"}[5m])) / sum(rate(tool_call_total[5m])) * 100",
            "legendFormat": "Success Rate",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "color": "red", "value": null },
                { "color": "yellow", "value": 95 },
                { "color": "green", "value": 99 }
              ]
            }
          }
        }
      },
      {
        "id": 6,
        "title": "Agent Loop Iterations Distribution",
        "type": "heatmap",
        "gridPos": { "h": 8, "w": 12, "x": 0, "y": 16 },
        "targets": [
          {
            "expr": "sum(increase(agent_loop_iterations_bucket[5m])) by (le)",
            "legendFormat": "{{le}}",
            "refId": "A",
            "format": "heatmap"
          }
        ]
      },
      {
        "id": 7,
        "title": "Active Sessions",
        "type": "stat",
        "gridPos": { "h": 8, "w": 6, "x": 12, "y": 16 },
        "targets": [
          {
            "expr": "agent_active_sessions",
            "legendFormat": "{{service}}",
            "refId": "A"
          }
        ]
      },
      {
        "id": 8,
        "title": "Cache Hit Rate",
        "type": "gauge",
        "gridPos": { "h": 8, "w": 6, "x": 18, "y": 16 },
        "targets": [
          {
            "expr": "sum(rate(agent_cache_total{result=\"hit\"}[5m])) / sum(rate(agent_cache_total[5m])) * 100",
            "legendFormat": "Cache Hit Rate",
            "refId": "A"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "unit": "percent",
            "min": 0,
            "max": 100,
            "thresholds": {
              "steps": [
                { "color": "red", "value": null },
                { "color": "yellow", "value": 50 },
                { "color": "green", "value": 80 }
              ]
            }
          }
        }
      }
    ]
  }
}

Log Aggregation: Loki vs ELK

Logs are the last line of defense when troubleshooting. The two mainstream options today are Loki and ELK (Elasticsearch + Logstash + Kibana).

Comparison

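Loki vs ELK
=============================

  Dimension          Loki                           ELK
  ─────────          ────                           ───
  Storage cost       Low (indexes labels only)      High (full-text index)
  Query language     LogQL (similar to PromQL)      KQL / Lucene
  Deployment         Simple (single binary)         Complex (three components)
  Full-text search   Weak (grep-style scanning)     Strong (inverted index)
  Prometheus fit     Native integration             Needs extra configuration
  Best fit           Small/medium, cloud native     Large scale, full-text search

  Recommendations:
  · Already on Prometheus + Grafana → choose Loki (unified ecosystem)
  · Need complex log analysis or full-text search → choose ELK
  · Limited budget → choose Loki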

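Logging Conventions for Agent Services

Whichever option you pick, the format and content of logs must be standardized. For Agent services I recommend structured logging: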

import json
import logging
from datetime import datetime, timezone

# Agent-specific context fields copied from the log record when present
CONTEXT_FIELDS = (
    "trace_id", "session_id", "user_id", "agent_step",
    "llm_model", "tool_name", "token_count", "latency_ms", "retry_after",
)

class AgentLogFormatter(logging.Formatter):
    """Structured (JSON) log formatter for the Agent service."""

    def format(self, record):
        log_entry = {
            # Use the time the record was created, not the time it is formatted
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "agent-orchestrator",
            "message": record.getMessage(),
        }

        # Values passed via extra={...} become attributes on the record
        for field in CONTEXT_FIELDS:
            if hasattr(record, field):
                log_entry[field] = getattr(record, field)

        return json.dumps(log_entry, ensure_ascii=False)

# Usage example: the formatter only takes effect once it is attached to a handler
handler = logging.StreamHandler()
handler.setFormatter(AgentLogFormatter())
logger = logging.getLogger("agent")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Log an LLM call
logger.info(
    "LLM call completed",
    extra={
        "trace_id": "abc-123",
        "session_id": "sess-456",
        "llm_model": "gpt-4",
        "token_count": {"input": 1200, "output": 450},
        "latency_ms": 2341,
    }
)

# Log a Tool call
logger.info(
    "Tool call completed",
    extra={
        "trace_id": "abc-123",
        "session_id": "sess-456",
        "tool_name": "web_search",
        "latency_ms": 890,
    }
)

# Log an error
logger.error(
    "LLM API rate limit exceeded",
    extra={
        "trace_id": "abc-123",
        "session_id": "sess-456",
        "llm_model": "gpt-4",
        "retry_after": 30,
    }
)
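
Passing trace_id through extra on every call is easy to forget. One possible alternative, sketched below with only the standard library, is to keep the current trace ID in a context variable and let a logging filter copy it onto each record. The names `current_trace_id`, `TraceContextFilter`, and the logger name "agent.trace_demo" are assumptions made for this example.

```python
import contextvars
import logging

# Hypothetical context variable; set it once per request at the entry point.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)

class TraceContextFilter(logging.Filter):
    """Attach the current trace ID to every record that does not carry one."""

    def filter(self, record):
        if not hasattr(record, "trace_id"):
            record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it

logger = logging.getLogger("agent.trace_demo")
logger.addFilter(TraceContextFilter())

# At the start of request handling:
current_trace_id.set("abc-123")
# Later calls need no extra={"trace_id": ...}; the filter fills it in.
logger.warning("LLM call completed")
```

Because context variables follow asyncio tasks, concurrent requests each see their own trace ID. A filter added to a logger applies only to records logged on that logger, so for child loggers it is usually attached to the handler instead.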

LogQL Query Examples for Loki

# All error logs
{service="agent-orchestrator"} |= "ERROR"

# All logs for a given trace_id across services
{service=~".+"} | json | trace_id="abc-123"

# Logs with a latency above 10 seconds
{service="agent-orchestrator"} | json | latency_ms > 10000

# Number of errors per minute (count_over_time counts lines; rate would return a per-second value)
sum by (service) (count_over_time({service="agent-orchestrator"} |= "ERROR" [1m]))

# Agent session logs for a specific user
{service=~".+"} | json | user_id="user-789" | line_format "{{.timestamp}} {{.level}} {{.message}}"

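Distributed Tracing: OpenTelemetry + Jaeger

The call chain of an Agent service is inherently distributed: a single request may pass through multiple services, several LLM calls, and several Tool calls. Without distributed tracing you have no idea where the time goes.

OpenTelemetry Integration

OpenTelemetry (OTel) is the de facto standard for distributed tracing. Here is an integration example for a Python Agent service: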

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.resources import Resource

# Initialize the tracer
resource = Resource.create({
    "service.name": "agent-orchestrator",
    "service.version": "1.2.0",
    "deployment.environment": "production",
})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(
        # insecure=True assumes a plaintext gRPC endpoint inside the cluster
        OTLPSpanExporter(endpoint="jaeger-collector:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Trace one Agent run.
# understand_intent, rag_retrieve, llm_call and execute_tool are
# application functions defined elsewhere.
async def run_agent(user_query: str):
    with tracer.start_as_current_span("agent.run") as root_span:
        root_span.set_attribute("user.query", user_query)

        # Step 1: understand the user's intent
        with tracer.start_as_current_span("agent.understand_intent") as span:
            intent = await understand_intent(user_query)
            span.set_attribute("intent.type", intent.type)
            span.set_attribute("intent.confidence", intent.confidence)

        # Step 2: RAG retrieval
        with tracer.start_as_current_span("agent.rag_retrieval") as span:
            context = await rag_retrieve(user_query)
            span.set_attribute("rag.chunks_count", len(context.chunks))
            span.set_attribute("rag.retrieval_time_ms", context.latency_ms)

        # Step 3: LLM call
        with tracer.start_as_current_span("agent.llm_call") as span:
            span.set_attribute("llm.model", "gpt-4")
            response = await llm_call(user_query, context)
            span.set_attribute("llm.input_tokens", response.input_tokens)
            span.set_attribute("llm.output_tokens", response.output_tokens)
            span.set_attribute("llm.latency_ms", response.latency_ms)

        # Step 4: Tool calls (if any)
        if response.tool_calls:
            for tool_call in response.tool_calls:
                with tracer.start_as_current_span(
                    f"agent.tool_call.{tool_call.name}"
                ) as span:
                    span.set_attribute("tool.name", tool_call.name)
                    result = await execute_tool(tool_call)
                    span.set_attribute("tool.status", result.status)
                    span.set_attribute("tool.latency_ms", result.latency_ms)

        return response

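The Practical Value of Trace Data

With the Jaeger UI you can quickly answer questions such as:

What is the total duration of an Agent request, and which steps took the time?
Is the LLM call the bottleneck? Is the token volume reasonable?
Did any Tool call time out? Which Tool is the slowest?
Are there unnecessary serial calls?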
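
The first two questions boil down to a per-step time breakdown. As a rough illustration of what the trace view computes, the sketch below ranks the child spans of one request by duration. The span tuples and the helper `time_breakdown` are invented for this example; a real trace would be read from the tracing backend.

```python
def time_breakdown(spans, total_ms):
    """Return (name, duration_ms, share_of_total) rows, slowest step first."""
    rows = [
        (name, end_ms - start_ms, (end_ms - start_ms) / total_ms)
        for name, start_ms, end_ms in spans
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Hypothetical child spans of one agent.run trace: (name, start_ms, end_ms).
spans = [
    ("agent.understand_intent", 0, 400),
    ("agent.rag_retrieval", 400, 700),
    ("agent.llm_call", 700, 3000),
    ("agent.tool_call.web_search", 3000, 3900),
]

breakdown = time_breakdown(spans, total_ms=3900)
for name, duration_ms, share in breakdown:
    print(f"{name:<30} {duration_ms:>5} ms  {share:.0%}")
```

In this made-up trace every span starts exactly where the previous one ends, which is the signature of fully serial execution and a prompt to check whether any of the steps could run concurrently.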

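Agent-Specific Monitoring Metrics

Beyond generic metrics, Agent services have some monitoring dimensions of their own. Traditional web services can ignore them; Agent services cannot: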

1. LLM Latency and Token Consumption

# LLM call p99 latency (grouped by model)
histogram_quantile(0.99,
  sum by (model, le) (
    rate(llm_call_duration_seconds_bucket[5m])
  )
)

# Token consumption trend (per minute)
sum by (model) (rate(llm_tokens_total[5m])) * 60

# Average tokens consumed per request
sum(rate(llm_tokens_total[5m]))
/
sum(rate(agent_request_total[5m]))

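2. Cost per Request

# Average LLM cost per Agent request (dollars)
sum(rate(llm_cost_dollars_total[5m]))
/
sum(rate(agent_request_total[5m]))

# LLM spend over the last hour (dollars)
sum(increase(llm_cost_dollars_total[1h]))

# Share of cost by model (percent)
sum by (model) (increase(llm_cost_dollars_total[24h]))
/ ignoring(model) group_left
sum(increase(llm_cost_dollars_total[24h])) * 100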

3. Tool Call Success Rate

# Success rate per Tool
sum by (tool_name) (rate(tool_call_total{status="success"}[5m]))
/
sum by (tool_name) (rate(tool_call_total[5m]))

# Overall Tool call failure rate trend
1 - (
  sum(rate(tool_call_total{status="success"}[5m]))
  / sum(rate(tool_call_total[5m]))
)

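4. Agent Loop Efficiency

# Median (p50) iterations per Agent run
histogram_quantile(0.50,
  sum by (le) (rate(agent_loop_iterations_bucket[5m]))
)

# Share of abnormal runs (more than 20 iterations):
# 1 minus the share of runs that finished within 20 iterations.
# The Python client typically renders bucket bounds as floats, hence le="20.0"
1 - (
  sum(rate(agent_loop_iterations_bucket{le="20.0"}[5m]))
  /
  sum(rate(agent_loop_iterations_count[5m]))
)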

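Alert Rule Design

Alerting is the "last mile" of a monitoring system. Good alert rules avoid missed alerts, false alarms, and alert fatigue.

Prometheus AlertManager Configuration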

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: "smtp.example.com:587"
  smtp_from: "alertmanager@example.com"
  smtp_auth_username: "alertmanager@example.com"
  smtp_auth_password: "password"

# Routing rules
route:
  receiver: "default-receiver"
  group_by: ["alertname", "service"]
  group_wait: 30s        # wait before sending the first notification for a group
  group_interval: 5m     # interval between notifications for the same group
  repeat_interval: 4h    # interval before repeating a notification
  routes:
    # Cost alerts get their own route. It is listed first because LLMCostHigh
    # also carries severity: warning and would otherwise stop at the Slack route
    - match_re:
        alertname: "LLMCostHigh.*"
      receiver: "finance-slack"
    # P0 alerts go to PagerDuty
    - match:
        severity: "critical"
      receiver: "pagerduty-critical"
      continue: true
    # P1 alerts go to Slack
    - match:
        severity: "warning"
      receiver: "slack-warning"

# Receivers
receivers:
  - name: "default-receiver"
    email_configs:
      - to: "oncall@example.com"
        send_resolved: true

  - name: "pagerduty-critical"
    pagerduty_configs:
      - service_key: "<your-pagerduty-key>"
        severity: "critical"
        description: "{{ .GroupLabels.alertname }}: {{ .CommonAnnotations.summary }}"

  - name: "slack-warning"
    slack_configs:
      - api_url: "<your-slack-webhook>"
        channel: "#agent-alerts"
        title: "{{ .GroupLabels.alertname }}"
        text: >-
          {{ range .Alerts }}
          *Alert:* {{ .Annotations.summary }}
          *Severity:* {{ .Labels.severity }}
          *Details:* {{ .Annotations.description }}
          {{ end }}
        send_resolved: true

  - name: "finance-slack"
    slack_configs:
      - api_url: "<your-slack-webhook>"
        channel: "#agent-cost-alerts"

# Inhibit rules (optional)
# inhibit_rules:
#   - source_match:
#       severity: "critical"
#     target_match:
#       severity: "warning"
#     equal: ["alertname", "service"]

Core Alert Rules

# agent-alerts.yml
groups:
  # ===== Infrastructure alerts =====
  - name: infrastructure
    rules:
      - alert: AgentServiceDown
        expr: up{job="agent-orchestrator"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent service instance is down"
          description: "{{ $labels.instance }} has been unreachable for more than 1 minute"

      - alert: HighCPUUsage
        # process_cpu_seconds_total is a counter; rate() turns it into CPU cores in use
        expr: rate(process_cpu_seconds_total[5m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is too high"
          description: "{{ $labels.instance }} is using more than 80% of one CPU core"

      - alert: HighMemoryUsage
        expr: process_resident_memory_bytes / 1024 / 1024 > 2048
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage exceeds 2GB"

  # ===== Agent service alerts =====
  - name: agent-service
    rules:
      - alert: AgentHighErrorRate
        expr: |
          sum(rate(agent_request_total{status=~"5.."}[5m]))
          / sum(rate(agent_request_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Agent service error rate exceeds 5%"
          description: "Current error rate: {{ $value | humanizePercentage }}"

      - alert: AgentHighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(agent_request_duration_seconds_bucket[5m]))
          ) > 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent service p99 latency exceeds 15 seconds"
          description: "Current p99 latency: {{ $value }}s"

      - alert: AgentHighLatencyCritical
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(agent_request_duration_seconds_bucket[5m]))
          ) > 30
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "Agent service p99 latency exceeds 30 seconds"
          description: "Current p99 latency: {{ $value }}s, investigate immediately"

  # ===== LLM alerts =====
  - name: llm-monitoring
    rules:
      - alert: LLMLatencyHigh
        expr: |
          histogram_quantile(0.99,
            sum by (model, le) (rate(llm_call_duration_seconds_bucket[5m]))
          ) > 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM API latency is too high"
          description: "Model {{ $labels.model }} p99 latency: {{ $value }}s"

      # Note: this rule assumes llm_call_duration_seconds carries a status label.
      # The metric definition shown earlier does not include one, so add it
      # (or use a dedicated error counter) before relying on this alert.
      - alert: LLMAPIFailure
        expr: |
          sum by (model) (rate(llm_call_duration_seconds_count{status=~"5.."}[5m]))
          / sum by (model) (rate(llm_call_duration_seconds_count[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LLM API error rate exceeds 10%"
          description: "Model {{ $labels.model }} error rate: {{ $value | humanizePercentage }}"

      - alert: LLMTokenRateHigh
        expr: |
          sum(rate(llm_tokens_total[5m])) * 60 > 100000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Token consumption rate is abnormally high"
          description: "Currently consuming {{ $value }} tokens per minute, possibly a runaway loop"

      - alert: LLMCostHigh
        expr: |
          sum(increase(llm_cost_dollars_total[1h])) > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM API cost is above expectations"
          description: "Spent ${{ $value }} in the last hour"

  # ===== Tool call alerts =====
  - name: tool-monitoring
    rules:
      - alert: ToolCallFailureRateHigh
        expr: |
          sum by (tool_name) (rate(tool_call_total{status="failure"}[5m]))
          / sum by (tool_name) (rate(tool_call_total[5m])) > 0.15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tool call failure rate is too high"
          description: "Tool {{ $labels.tool_name }} failure rate: {{ $value | humanizePercentage }}"

      - alert: ToolCallLatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (tool_name, le) (rate(tool_call_duration_seconds_bucket[5m]))
          ) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tool call latency is too high"
          description: "Tool {{ $labels.tool_name }} p95 latency: {{ $value }}s"

  # ===== Agent loop alerts =====
  - name: agent-loop
    rules:
      - alert: AgentLoopTooMany
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(agent_loop_iterations_bucket[5m]))
          ) > 20
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent loop iteration count is abnormally high"
          description: "Possible runaway or infinite loop, p99 iterations: {{ $value }}"

      - alert: AgentQueueDepthHigh
        expr: agent_queue_depth > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Agent request queue is backing up"
          description: "Current queue depth: {{ $value }}, scaling out may be needed"

SLO/SLA Definition

Alert rules should be designed around SLOs (Service Level Objectives). Define the SLOs first, then derive the alert thresholds from them.

Agent Service SLOs
=============================

  Metric                   SLO target       Error budget (monthly)
  ──────                   ──────────       ──────────────────────
  Availability             99.9%            43.8 minutes
  p99 latency              ≤ 15 s           -
  Tool call success rate   ≥ 98%            -
  LLM API error rate       ≤ 2%             -
  Cost per request         ≤ $0.05          -

  Error budget policy:
  ┌─────────────────┬──────────────────────────┐
  │ Budget remaining│ Policy                   │
  ├─────────────────┼──────────────────────────┤
  │ > 50%           │ Normal releases          │
  │ 20% ~ 50%       │ Fixes only               │
  │ < 20%           │ Release freeze, stabilize│
  │ Exhausted       │ Halt all feature work    │
  └─────────────────┴──────────────────────────┘
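
The 43.8-minute figure and the policy table can be reproduced with a few lines of arithmetic. The sketch below is one way to write it down; the function names are invented, and the default of 30.4375 days is simply an average month (365.25 / 12).

```python
def error_budget_minutes(slo, days=30.4375):
    """Allowed downtime in minutes for an availability SLO over `days` days."""
    return (1 - slo) * days * 24 * 60

def release_policy(budget_remaining):
    """Map the remaining budget fraction (0.0 to 1.0) to the policy table above."""
    if budget_remaining <= 0:
        return "halt all feature work"
    if budget_remaining < 0.2:
        return "release freeze"
    if budget_remaining <= 0.5:
        return "fixes only"
    return "normal releases"

monthly_budget = error_budget_minutes(0.999)
print(f"monthly budget: {monthly_budget:.1f} minutes")

# Example: 30 minutes of downtime already used this month.
remaining = (monthly_budget - 30) / monthly_budget
print(f"remaining: {remaining:.0%} -> {release_policy(remaining)}")
```

Tracking the remaining fraction rather than raw downtime keeps the policy independent of the SLO value: tightening the SLO shrinks the budget but leaves the thresholds unchanged.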

On-Call Strategy

Once the alerts are designed, who responds, and how?

On-Call Rotation Design

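On-Call Rotation
=============================

  Role               Responsibility                       Response time
  ────               ──────────────                       ─────────────
  Primary On-Call    First responder                      5 minutes
  Secondary On-Call  Backup if Primary does not respond   15 minutes
  Manager On-Call    Escalation point for major incidents 30 minutes

  Rotation period: weekly
  Handover: Monday 10:00 AM

  Response flow:
  Alert received → Acknowledge (5 min) → Initial triage (15 min) → Escalate/fix
     │
     ▼
  Alert severity levels:
  ┌─────────┬─────────────────────────────────┐
  │ P0      │ Service completely unavailable  │
  │         │ Response: 5 min, act immediately│
  ├─────────┼─────────────────────────────────┤
  │ P1      │ Partial degradation, UX impacted│
  │         │ Response: 15 min, high priority │
  ├─────────┼─────────────────────────────────┤
  │ P2      │ Non-critical issue, core path OK│
  │         │ Response: 1 hour, normal backlog│
  └─────────┴─────────────────────────────────┘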
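
The rotation table implies an escalation ladder. The sketch below encodes one possible reading of it: the Primary is paged immediately, the Secondary once the Primary's 5-minute window passes without acknowledgement, and the Manager once the Secondary's 15-minute window passes. That interpretation, and the names used, are assumptions for illustration.

```python
# (role, minutes without acknowledgement after which the role is paged).
# The 5 and 15 minute steps are an assumed reading of the rotation table.
ESCALATION_LADDER = [
    ("Primary On-Call", 0),
    ("Secondary On-Call", 5),
    ("Manager On-Call", 15),
]

def roles_to_page(minutes_unacknowledged):
    """Return every role that should have been paged by now."""
    return [
        role
        for role, after_minutes in ESCALATION_LADDER
        if minutes_unacknowledged >= after_minutes
    ]

for minutes in (3, 7, 20):
    print(minutes, roles_to_page(minutes))
```

Paging tools generally let you configure such a ladder declaratively; writing it out mainly helps the team agree on the timings before they are entered there.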

Common On-Call Runbooks

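## Alert Handling SOP

### AgentServiceDown (service down)
1. Check Pod status: kubectl get pods -n agent-system
2. View Pod logs: kubectl logs -n agent-system <pod-name> --tail=100
3. Check for an OOM kill: kubectl describe pod -n agent-system <pod-name>
4. If it was an OOM kill, raise the memory limit or investigate a memory leak
5. If the health check is failing, check the /health endpoint

### AgentHighErrorRate (error rate too high)
1. Open the Grafana dashboard and identify the error type
2. Check the Loki logs, grouped by trace_id
3. Check the LLM API status page
4. Check whether the Tool services are healthy
5. If the LLM is rate limiting, consider switching to a fallback model

### LLMTokenRateHigh (abnormal token consumption)
1. Look for Agent sessions with abnormally many loop iterations
2. Check for prompt injection attacks
3. Check whether RAG retrieval is returning oversized context
4. Consider a token cap per request

### ToolCallFailureRateHigh (Tool failure rate high)
1. Check the health of the affected Tool service
2. Review the error logs of the Tool calls
3. Check network connectivity
4. If it is an external API, check the provider's status page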

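Common Pitfalls and Solutions

Pitfall 1: Alert Storms

Problem: one foundational service goes down and every downstream service alerts at the same time. Dozens of alerts arrive within minutes and the one that matters is buried.

Solution: use AlertManager's group_by and inhibit_rules. Alerts in the same group are merged into one notification, and higher-priority alerts suppress lower-priority ones.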

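# Inhibit rule example
inhibit_rules:
  - source_match:
      alertname: "AgentServiceDown"
    target_match:
      alertname: "AgentHighErrorRate"
  # While the service is down, suppress the error-rate alert it causes.
  # There is no equal: ["instance"] here: AgentHighErrorRate is aggregated with
  # sum() and carries no instance label, so that condition would never match.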
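
To see what group_by does to an alert storm, the sketch below groups a handful of alerts by the same label set used in the routing configuration (alertname and service). It is a simplified model of the grouping step only, with made-up alerts; timing parameters such as group_wait are not modelled.

```python
from collections import defaultdict

def group_alerts(alerts, group_by):
    """Bucket alerts by the values of the group_by labels (missing label = "")."""
    groups = defaultdict(list)
    for alert in alerts:
        key = tuple(alert["labels"].get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)

# Hypothetical alerts firing at the same time.
alerts = [
    {"labels": {"alertname": "ToolCallFailureRateHigh", "service": "tool-service", "tool_name": "search"}},
    {"labels": {"alertname": "ToolCallFailureRateHigh", "service": "tool-service", "tool_name": "browser"}},
    {"labels": {"alertname": "ToolCallFailureRateHigh", "service": "tool-service", "tool_name": "sql"}},
    {"labels": {"alertname": "AgentHighErrorRate", "service": "agent-orchestrator"}},
]

groups = group_alerts(alerts, group_by=["alertname", "service"])
for key, members in groups.items():
    print(key, len(members))
```

Four alerts collapse into two notifications here. Adding a high-variety label such as tool_name to group_by would undo the effect, so the grouping labels are best kept coarse.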

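Pitfall 2: Metric Explosion

Problem: in the name of "comprehensive monitoring", high-cardinality labels such as user_id get attached to every metric, and Prometheus can no longer cope with the storage or the queries.

Solution: keep label cardinality under strict control. High-cardinality information such as user_id belongs in trace attributes and structured logs, where it can be looked up per request, not in metric labels.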

# Bad: label cardinality is too high
agent_request_total{user_id="user-12345"}  # user_id may have millions of values

# Good: low-cardinality labels only
agent_request_total{service="agent-orchestrator", status="200"}
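
The cost of a label is multiplicative: the worst-case number of time series for a metric is the product of the distinct values of each label. A quick back-of-the-envelope check in Python, with invented cardinalities, makes the difference obvious.

```python
from math import prod

def max_series(label_cardinalities):
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical cardinalities for agent_request_total.
low = max_series({"service": 3, "method": 4, "status": 5})
high = max_series({"service": 3, "method": 4, "status": 5, "user_id": 1_000_000})

print(f"without user_id: {low:,} series")
print(f"with user_id:    {high:,} series")
```

For a Histogram the number is multiplied again by the bucket count, since every bucket is its own series.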

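Pitfall 3: No SLO, Thresholds Picked by Gut Feeling

Problem: alert thresholds are set from intuition, so they end up either too loose (a problem occurs and nothing fires) or too tight (frequent false alarms).

Solution: define SLOs from historical data and business requirements. Collect a week of normal metric distributions first, then set each SLO at a reasonable boundary of the normal range.

Pitfall 4: Monitoring Without Analysis

Problem: there are plenty of dashboards, but nobody looks at them. There are plenty of alerts, but nobody knows what to do when one fires.

Solution: set up a regular review. Go through the dashboards as a team once a week and look for abnormal trends. Write a postmortem after every alert and fold the lessons into the SOP document.

Pitfall 5: Logs and Traces Are Not Correlated

Problem: something breaks, the dashboard shows high latency, but the logs contain nothing you can tie to it, because logs and traces were never linked through a trace_id.

Solution: inject the trace_id into logs from day one, so that you can move between the three pillars (Metrics, Logs, Traces). Grafana can jump from a metric panel to the matching traces (via exemplars) and from a trace to its logs.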

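Metrics → Traces → Logs Correlation Path
=============================

  1. Spot an abnormal metric on the dashboard
  2. Click the abnormal point in time to view the matching traces
  3. In the trace detail, inspect the logs of each span
  4. Search Loki by trace_id for the complete logs

  This closes the loop across the three pillars of observability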

References

Prometheus documentation: https://prometheus.io/docs/
Grafana documentation: https://grafana.com/docs/
OpenTelemetry documentation: https://opentelemetry.io/docs/
Loki documentation: https://grafana.com/docs/loki/latest/
AlertManager configuration: https://prometheus.io/docs/alerting/latest/configuration/
Google SRE Book: https://sre.google/sre-book/table-of-contents/
Awesome Prometheus: https://github.com/roaldnefs/awesome-prometheus
