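Monitoring Metric Definitions

Overview

This document defines the metrics exposed by the monitoring system and describes how LLM cost is tracked. It also covers histogram buckets, health checks, alert rules, and monitoring integration.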

Prometheus Metrics

Location: backend/app/monitoring/metrics.py

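API-Layer Metrics

Metric name	Type	Labels	Description
geo_api_requests_total	Counter	method, endpoint, status	Total number of HTTP requests
geo_api_request_duration_seconds	Histogram	method, endpoint	Request latency distribution
geo_api_requests_in_progress	Gauge	method, endpoint	Number of requests currently being processed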

Agent-Layer Metrics

Metric name	Type	Labels	Description
geo_agent_executions_total	Counter	agent_name, status	Total number of Agent executions
geo_agent_execution_duration_seconds	Histogram	agent_name	Agent execution duration
geo_agent_running_tasks	Gauge	agent_name	Number of tasks currently running

LLM-Layer Metrics

Metric name	Type	Labels	Description
geo_llm_requests_total	Counter	provider, model, status	Total number of LLM requests
geo_llm_request_duration_seconds	Histogram	provider, model	LLM request duration
geo_llm_tokens_total	Counter	provider, model, token_type	Total tokens consumed
geo_llm_cost_estimated	Gauge	provider, model	Estimated cost (USD)

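Business-Layer Metrics

Metric name	Type	Labels	Description
geo_brands_total	Gauge	-	Total number of brands
geo_queries_total	Counter	platform, status	Total number of queries
geo_content_generated_total	Counter	-	Total number of content items generated
geo_citations_detected_total	Counter	platform	Total number of citations detected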

LLM Cost Tracking

Location: backend/app/monitoring/llm_metrics.py

Cost Estimation Table

LLM_COST_PER_TOKEN = {
    # OpenAI
    ("openai", "gpt-4o"): {
        "prompt": 0.000005,      # $5/1M tokens
        "completion": 0.000015,   # $15/1M tokens
    },
    ("openai", "gpt-4o-mini"): {
        "prompt": 0.00000015,    # $0.15/1M tokens
        "completion": 0.0000006,  # $0.60/1M tokens
    },
    ("openai", "gpt-4-turbo"): {
        "prompt": 0.00001,       # $10/1M tokens
        "completion": 0.00003,    # $30/1M tokens
    },
    # DeepSeek
    ("deepseek", "deepseek-chat"): {
        "prompt": 0.00000014,    # $0.14/1M tokens
        "completion": 0.00000028, # $0.28/1M tokens
    },
    ("deepseek", "deepseek-coder"): {
        "prompt": 0.00000014,
        "completion": 0.00000028,
    },
}

Cost Formula

total_cost = prompt_tokens * prompt_price + completion_tokens * completion_price
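As a worked illustration of the formula, the sketch below applies it to two entries copied from the cost table. The names `PRICES` and `estimate_cost` are hypothetical, and returning zero for a model without a price entry is an assumption, not something this document specifies.

```python
from typing import Dict, Tuple

# Per-token USD prices copied from two entries of the cost table.
PRICES: Dict[Tuple[str, str], Dict[str, float]] = {
    ("openai", "gpt-4o"): {"prompt": 0.000005, "completion": 0.000015},
    ("deepseek", "deepseek-chat"): {"prompt": 0.00000014, "completion": 0.00000028},
}


def estimate_cost(provider: str, model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Return the estimated USD cost of one request."""
    price = PRICES.get((provider, model))
    if price is None:
        return 0.0  # assumption: models without a price entry count as zero cost
    return prompt_tokens * price["prompt"] + completion_tokens * price["completion"]


# 1,000 prompt tokens and 500 completion tokens on gpt-4o:
# 1000 * 0.000005 + 500 * 0.000015 = 0.005 + 0.0075 = 0.0125 USD
cost = estimate_cost("openai", "gpt-4o", 1000, 500)
```

Because the prices are stored per token rather than per million tokens, the formula needs no unit conversion at call time.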

LLMMetricsWrapper

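from typing import Optional


class LLMMetricsWrapper:
    def record_request(
        self,
        status: str,
        duration: float,
        prompt_tokens: Optional[int] = None,
        completion_tokens: Optional[int] = None,
    ) -> None:
        """
        Record metrics for one LLM request.

        1. Record the request count and duration.
        2. Record token consumption.
        3. Estimate and record the cost.
        """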
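The three steps in the docstring could be implemented roughly as follows. This is a minimal sketch that stores values in plain Python containers instead of Prometheus metrics. The class name `InMemoryLLMMetrics`, its constructor parameters, and the decision to treat missing token counts as zero are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional


class InMemoryLLMMetrics:
    """Hypothetical stand-in for LLMMetricsWrapper backed by plain containers."""

    def __init__(self, prompt_price: float, completion_price: float) -> None:
        self.prompt_price = prompt_price          # USD per prompt token
        self.completion_price = completion_price  # USD per completion token
        self.requests: Dict[str, int] = defaultdict(int)  # keyed by status
        self.durations: List[float] = []                   # seconds
        self.tokens: Dict[str, int] = defaultdict(int)    # keyed by token_type
        self.cost_usd = 0.0

    def record_request(
        self,
        status: str,
        duration: float,
        prompt_tokens: Optional[int] = None,
        completion_tokens: Optional[int] = None,
    ) -> None:
        # Step 1: the request count and duration are always recorded.
        self.requests[status] += 1
        self.durations.append(duration)
        # Step 2: token counts may be missing (for example on a failed call);
        # this sketch treats a missing count as zero.
        prompt = prompt_tokens or 0
        completion = completion_tokens or 0
        self.tokens["prompt"] += prompt
        self.tokens["completion"] += completion
        # Step 3: accumulate the estimated cost.
        self.cost_usd += prompt * self.prompt_price + completion * self.completion_price


# gpt-4o-mini prices from the cost table
metrics = InMemoryLLMMetrics(prompt_price=0.00000015, completion_price=0.0000006)
metrics.record_request("success", 1.8, prompt_tokens=2000, completion_tokens=1000)
metrics.record_request("error", 0.4)  # no usage data available
```

After these two calls the tracker holds one request per status, 2,000 prompt tokens, 1,000 completion tokens, and an accumulated cost of about 0.0009 USD.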

Histogram buckets

API Latency

buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)
# Unit: seconds
# Used to estimate P50/P90/P99

Agent Execution Duration

buckets=(0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, 120.0)
# Unit: seconds

LLM Request Duration

buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0)
# Unit: seconds
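Percentiles such as P50/P90/P99 are not stored directly; they are estimated from cumulative bucket counts. The sketch below shows one common approach, linear interpolation inside the bucket that contains the target rank, which is similar in spirit to what Prometheus does at query time. The function name `estimate_quantile` and the sample counts are made up for illustration.

```python
from typing import Sequence

API_BUCKETS = (0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)


def estimate_quantile(q: float, bounds: Sequence[float], cumulative: Sequence[int]) -> float:
    """Estimate quantile q (0 < q <= 1) by linear interpolation within a bucket.

    bounds[i] is the upper bound of bucket i and cumulative[i] is the number of
    observations less than or equal to bounds[i]. Observations above the last
    bound are ignored in this sketch.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = cumulative[-1]
    if total == 0:
        raise ValueError("no observations")
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in zip(bounds, cumulative):
        if count >= rank:
            # Assume observations are spread evenly inside this bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return bounds[-1]


# Made-up cumulative counts for 100 requests, one entry per API bucket.
counts = (5, 20, 50, 80, 90, 95, 98, 99, 100, 100)
p50 = estimate_quantile(0.50, API_BUCKETS, counts)
p75 = estimate_quantile(0.75, API_BUCKETS, counts)
```

The estimate can never be more precise than the bucket width, which is why the API buckets are densest in the sub-second range where most requests are expected to fall.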

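Health Checks

Location: backend/app/services/health_checker.py

Endpoints

Path	Description	Checks
GET /health	Comprehensive health check	All checks
GET /health/ready	Readiness check	Database, Redis
GET /health/live	Liveness check	Service running status

Checks

Check	Description	Timeout
database	Database connection	5s
redis	Redis connection	5s
disk	Disk space	-
memory	Memory usage	-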
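One way to enforce a per-check timeout like the 5s limit above is `asyncio.wait_for`. The following is a minimal sketch under that assumption; it is not taken from health_checker.py, and the helper names and stub checks are hypothetical. The demo uses a short timeout so that it finishes quickly.

```python
import asyncio
from typing import Awaitable, Callable, Dict

CHECK_TIMEOUT_SECONDS = 5.0  # matches the database and redis rows


async def run_check(check: Callable[[], Awaitable[bool]], timeout: float) -> str:
    """Run one check and map the outcome to "ok", "fail", or "timeout"."""
    try:
        healthy = await asyncio.wait_for(check(), timeout=timeout)
    except asyncio.TimeoutError:
        return "timeout"
    except Exception:
        return "fail"
    return "ok" if healthy else "fail"


async def run_all(
    checks: Dict[str, Callable[[], Awaitable[bool]]],
    timeout: float = CHECK_TIMEOUT_SECONDS,
) -> Dict[str, str]:
    """Run all checks concurrently and return a status per check name."""
    names = list(checks)
    results = await asyncio.gather(*(run_check(checks[n], timeout) for n in names))
    return dict(zip(names, results))


# Stub checks for the demo: one healthy dependency and one that hangs.
async def database_stub() -> bool:
    return True


async def redis_stub() -> bool:
    await asyncio.sleep(1.0)
    return True


report = asyncio.run(run_all({"database": database_stub, "redis": redis_stub}, timeout=0.05))
```

Running the checks concurrently keeps the worst-case response time of the endpoint close to a single timeout rather than the sum of all timeouts.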

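Alert Rules

Alert Conditions

Rule	Condition	Level	Description
API response timeout	p99 > 5s	Warning	P99 response time exceeds 5 seconds
High API error rate	error_rate > 5%	Error	Error rate exceeds 5%
Queue backlog	queue_depth > 1000	Warning	Queue depth exceeds 1000
Agent offline	heartbeat_timeout	Critical	Heartbeat timed out

Alert Levels

Level	Description	Notification
Warning	Warning	Log
Error	Error	Log + email
Critical	Critical	Log + email + SMS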
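The rules and levels above can be read as a small lookup: each rule compares a metric against a threshold, and the rule's level selects the notification channels. The sketch below encodes that reading. The metric keys (`p99_seconds`, `error_rate`, `queue_depth`, `heartbeat_timeout`) and the function name `evaluate` are assumptions, and the heartbeat rule is modelled as a boolean flag because no timeout value is given.

```python
from typing import Any, Dict, List, Tuple

# (rule, metric key, threshold, level), following the alert conditions table.
THRESHOLD_RULES = [
    ("API response timeout", "p99_seconds", 5.0, "Warning"),
    ("High API error rate", "error_rate", 0.05, "Error"),
    ("Queue backlog", "queue_depth", 1000, "Warning"),
]

# Notification channels per level, following the alert levels table.
CHANNELS = {
    "Warning": ["log"],
    "Error": ["log", "email"],
    "Critical": ["log", "email", "sms"],
}


def evaluate(metrics: Dict[str, Any]) -> List[Tuple[str, str, List[str]]]:
    """Return (rule, level, channels) for every rule whose condition holds."""
    fired = []
    for rule, key, threshold, level in THRESHOLD_RULES:
        value = metrics.get(key)
        if value is not None and value > threshold:  # strictly greater, as in the table
            fired.append((rule, level, CHANNELS[level]))
    if metrics.get("heartbeat_timeout"):  # assumption: supplied as a boolean flag
        fired.append(("Agent offline", "Critical", CHANNELS["Critical"]))
    return fired


alerts = evaluate({"p99_seconds": 6.2, "error_rate": 0.01, "queue_depth": 1000})
```

In this example only the P99 rule fires: the error rate is below 5%, and a queue depth of exactly 1000 does not exceed the threshold.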

Monitoring Integration

Prometheus Endpoint

GET /metrics

Returns metric data in Prometheus format.
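To show what the response body looks like, the sketch below renders a single counter in the Prometheus text exposition format using only the standard library. In practice a client library normally generates this output; the function `render_counter` is hypothetical, the sample value is made up, and label-value escaping is omitted for brevity.

```python
from typing import Dict, Tuple

LabelSet = Tuple[Tuple[str, str], ...]


def render_counter(name: str, help_text: str, samples: Dict[LabelSet, float]) -> str:
    """Render one counter as HELP and TYPE lines plus one line per label set."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{key}="{val}"' for key, val in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "geo_api_requests_total",
    "Total number of HTTP requests",
    {(("method", "GET"), ("endpoint", "/health"), ("status", "200")): 42.0},
)
```

Each label combination becomes its own line, so high-cardinality labels (for example raw URLs with IDs in the `endpoint` label) directly increase the size of every scrape.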

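Grafana Dashboards

Recommended dashboards:

API Dashboard
- QPS
- Latency distribution (P50/P90/P99)
- Error rate
Agent Dashboard
- Executions per Agent
- Execution duration
- Success rate
LLM Dashboard
- Request count
- Token consumption
- Cost trend
Business Dashboard
- Brand count
- Content generation volume
- Citation detection volume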

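Metric Collection Flow

1. API request → middleware records (method, endpoint, status, duration)
2. Agent execution → AgentHooks records (agent_name, status, duration)
3. LLM call → LLMMetricsWrapper records (provider, model, tokens, cost)
4. Background task → periodically aggregates the metrics and writes them to the database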
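Step 1 of the flow can be sketched as a wrapper around the request handler that maintains the in-progress count, measures the duration, and counts the request by status. This is an illustration only: the function `handle_with_metrics`, the in-memory stores, and the choice to record a failed handler as status 500 are assumptions rather than the project's actual middleware.

```python
import time
from collections import Counter, defaultdict
from typing import Callable, DefaultDict, List, Tuple

requests_total: Counter = Counter()   # keyed by (method, endpoint, status)
in_progress: Counter = Counter()      # keyed by (method, endpoint)
durations: DefaultDict[Tuple[str, str], List[float]] = defaultdict(list)


def handle_with_metrics(method: str, endpoint: str, handler: Callable[[], int]) -> int:
    """Call handler(), which returns an HTTP status code, and record metrics."""
    key = (method, endpoint)
    in_progress[key] += 1
    start = time.perf_counter()
    status = 500  # assumption: an unhandled exception is counted as a 500
    try:
        status = handler()
        return status
    finally:
        # Runs on success and on failure, so the in-progress count cannot leak.
        in_progress[key] -= 1
        durations[key].append(time.perf_counter() - start)
        requests_total[(method, endpoint, status)] += 1


def failing_handler() -> int:
    raise RuntimeError("boom")


handle_with_metrics("GET", "/health", lambda: 200)
try:
    handle_with_metrics("GET", "/health", failing_handler)
except RuntimeError:
    pass  # the exception still propagates to the caller after being counted
```

Recording in a `finally` block is the important design choice here: the duration and the request count are captured even when the handler raises.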
