
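# Monitoring Metrics Definition

## Overview

This document defines the metrics exposed by the monitoring system and describes how LLM cost is tracked. It also covers health checks, alert rules, and monitoring integration.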

## Prometheus Metrics

Location: `backend/app/monitoring/metrics.py`

### API-Layer Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| geo_api_requests_total | Counter | method, endpoint, status | Total number of HTTP requests |
| geo_api_request_duration_seconds | Histogram | method, endpoint | Request latency distribution |
| geo_api_requests_in_progress | Gauge | method, endpoint | Number of requests currently being processed |

### Agent-Layer Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| geo_agent_executions_total | Counter | agent_name, status | Total number of Agent executions |
| geo_agent_execution_duration_seconds | Histogram | agent_name | Agent execution duration |
| geo_agent_running_tasks | Gauge | agent_name | Number of tasks currently running |

### LLM-Layer Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| geo_llm_requests_total | Counter | provider, model, status | Total number of LLM requests |
| geo_llm_request_duration_seconds | Histogram | provider, model | LLM request duration |
| geo_llm_tokens_total | Counter | provider, model, token_type | Total tokens consumed |
| geo_llm_cost_estimated | Gauge | provider, model | Estimated cost (USD) |

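### Business-Layer Metrics

| Metric | Type | Labels | Description |
|--------|------|--------|-------------|
| geo_brands_total | Gauge | - | Total number of brands |
| geo_queries_total | Counter | platform, status | Total number of queries |
| geo_content_generated_total | Counter | - | Total number of generated content items |
| geo_citations_detected_total | Counter | platform | Total number of citations detected |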

## LLM Cost Tracking

Location: `backend/app/monitoring/llm_metrics.py`

### Cost Estimation Table

```python
LLM_COST_PER_TOKEN = {
    # OpenAI
    ("openai", "gpt-4o"): {
        "prompt": 0.000005,      # $5/1M tokens
        "completion": 0.000015,   # $15/1M tokens
    },
    ("openai", "gpt-4o-mini"): {
        "prompt": 0.00000015,    # $0.15/1M tokens
        "completion": 0.0000006,  # $0.60/1M tokens
    },
    ("openai", "gpt-4-turbo"): {
        "prompt": 0.00001,       # $10/1M tokens
        "completion": 0.00003,    # $30/1M tokens
    },
    # DeepSeek
    ("deepseek", "deepseek-chat"): {
        "prompt": 0.00000014,    # $0.14/1M tokens
        "completion": 0.00000028, # $0.28/1M tokens
    },
    ("deepseek", "deepseek-coder"): {
        "prompt": 0.00000014,
        "completion": 0.00000028,
    },
}
```

### Cost Formula

```
total_cost = prompt_tokens * prompt_price + completion_tokens * completion_price
```
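As a worked illustration of the formula, the sketch below looks up a price entry and applies it. The function name `estimate_cost` and the choice to return `None` for models without a price entry are assumptions made for this example, not necessarily what `llm_metrics.py` does; only two rows of the price table are copied here.

```python
from typing import Optional

# Illustrative subset of the price table above (USD per token).
LLM_COST_PER_TOKEN = {
    ("openai", "gpt-4o"): {"prompt": 0.000005, "completion": 0.000015},
    ("deepseek", "deepseek-chat"): {"prompt": 0.00000014, "completion": 0.00000028},
}


def estimate_cost(
    provider: str,
    model: str,
    prompt_tokens: int,
    completion_tokens: int,
) -> Optional[float]:
    """Return the estimated cost in USD, or None if no price entry exists.

    Hypothetical helper: the name and the None fallback are assumptions.
    """
    prices = LLM_COST_PER_TOKEN.get((provider, model))
    if prices is None:
        return None
    return (
        prompt_tokens * prices["prompt"]
        + completion_tokens * prices["completion"]
    )
```

For example, 1,000 prompt tokens and 500 completion tokens on `gpt-4o` come to 1000 × $0.000005 + 500 × $0.000015 = $0.0125.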

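### LLMMetricsWrapper

```python
from typing import Optional


class LLMMetricsWrapper:
    def record_request(
        self,
        status: str,
        duration: float,
        prompt_tokens: Optional[int] = None,
        completion_tokens: Optional[int] = None,
    ) -> None:
        """
        Record metrics for one LLM request.

        1. Record the request count and duration.
        2. Record token consumption.
        3. Estimate and record the cost.
        """
```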

## Histogram Buckets

### API Latency

```python
buckets=(0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0)
# Unit: seconds
# Used to estimate P50/P90/P99
```

### Agent Execution Duration

```python
buckets=(0.1, 0.5, 1.0, 5.0, 10.0, 30.0, 60.0, 120.0)
# Unit: seconds
```

### LLM Request Duration

```python
buckets=(0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0)
# Unit: seconds
```
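Bucket boundaries determine how precisely percentiles can be estimated, because a quantile is interpolated within whichever bucket it falls into. The sketch below shows one way to do that estimate from cumulative bucket counts, using linear interpolation in the same spirit as Prometheus's `histogram_quantile`. The function name and input layout are assumptions for illustration, and the project itself may simply rely on PromQL instead.

```python
from typing import Sequence


def estimate_quantile(
    q: float,
    upper_bounds: Sequence[float],
    cumulative_counts: Sequence[int],
) -> float:
    """Estimate the q-quantile (0 <= q <= 1) from cumulative bucket counts.

    cumulative_counts[i] is the number of observations <= upper_bounds[i];
    one extra trailing entry holds the +Inf bucket, i.e. the total count.
    """
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    if len(cumulative_counts) != len(upper_bounds) + 1:
        raise ValueError("expected one count per bound plus the +Inf bucket")

    total = cumulative_counts[-1]
    if total == 0:
        return float("nan")

    rank = q * total
    for i, bound in enumerate(upper_bounds):
        if cumulative_counts[i] >= rank:
            lower = upper_bounds[i - 1] if i > 0 else 0.0
            prev = cumulative_counts[i - 1] if i > 0 else 0
            in_bucket = cumulative_counts[i] - prev
            if in_bucket == 0:
                return lower
            # Assume observations are spread evenly inside the bucket.
            return lower + (bound - lower) * (rank - prev) / in_bucket

    # The quantile lies in the +Inf bucket; the best available answer
    # is the largest finite boundary.
    return upper_bounds[-1]
```

One consequence of this scheme: any quantile that lands beyond the last finite boundary is reported as that boundary, so the top bucket should sit above the latency you want to alert on.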

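## Health Checks

Location: `backend/app/services/health_checker.py`

### Endpoints

| Path | Description | Checks |
|------|-------------|--------|
| GET /health | Comprehensive health check | All check items |
| GET /health/ready | Readiness check | Database, Redis |
| GET /health/live | Liveness check | Service running status |

### Check Items

| Check | Description | Timeout |
|-------|-------------|---------|
| database | Database connection | 5s |
| redis | Redis connection | 5s |
| disk | Disk space | - |
| memory | Memory usage | - |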
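The per-check timeouts above can be enforced with `asyncio.wait_for`. The following is a minimal sketch under that assumption; `run_check`, `run_checks`, and the `"ok"` / `"timeout"` / `"error"` result strings are hypothetical names for illustration and are not taken from `health_checker.py`.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# Matches the 5s timeout listed for the database and redis checks.
CHECK_TIMEOUT_SECONDS = 5.0

Check = Callable[[], Awaitable[None]]


async def run_check(check: Check, timeout: float = CHECK_TIMEOUT_SECONDS) -> str:
    """Run one check and classify the outcome as ok, timeout, or error."""
    try:
        await asyncio.wait_for(check(), timeout=timeout)
        return "ok"
    except asyncio.TimeoutError:
        return "timeout"
    except Exception:
        return "error"


async def run_checks(
    checks: Dict[str, Check],
    timeout: float = CHECK_TIMEOUT_SECONDS,
) -> Dict[str, str]:
    """Run all checks concurrently so one slow dependency does not delay the rest."""
    names = list(checks)
    results = await asyncio.gather(
        *(run_check(checks[name], timeout) for name in names)
    )
    return dict(zip(names, results))
```

Running the checks concurrently keeps the worst-case response time of `GET /health` close to a single timeout rather than the sum of all of them.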

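## Alert Rules

### Alert Conditions

| Rule | Condition | Level | Description |
|------|-----------|-------|-------------|
| API response timeout | p99 > 5s | Warning | 99th-percentile response time exceeds 5 seconds |
| High API error rate | error_rate > 5% | Error | Error rate exceeds 5% |
| Queue backlog | queue_depth > 1000 | Warning | Queue depth exceeds 1000 |
| Agent offline | heartbeat_timeout | Critical | Agent heartbeat timed out |

### Alert Levels

| Level | Description | Notification |
|-------|-------------|--------------|
| Warning | Warning | Log |
| Error | Error | Log + email |
| Critical | Critical | Log + email + SMS |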
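To make the two tables concrete, the sketch below evaluates the four conditions against a snapshot of current values and maps each firing rule to its notification channels. The `Snapshot` fields, `AlertRule`, and `evaluate` are hypothetical names introduced for this example; the source does not specify how heartbeat timeouts are detected, so that is modeled as a plain boolean.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Notification channels per level, as listed in the table above.
NOTIFY_CHANNELS = {
    "Warning": ["log"],
    "Error": ["log", "email"],
    "Critical": ["log", "email", "sms"],
}


@dataclass
class Snapshot:
    """Hypothetical container for the values the rules inspect."""
    p99_seconds: float = 0.0
    error_rate: float = 0.0  # fraction: 0.05 means 5%
    queue_depth: int = 0
    heartbeat_timed_out: bool = False


@dataclass(frozen=True)
class AlertRule:
    name: str
    level: str
    is_firing: Callable[[Snapshot], bool]


RULES = [
    AlertRule("API response timeout", "Warning", lambda s: s.p99_seconds > 5.0),
    AlertRule("High API error rate", "Error", lambda s: s.error_rate > 0.05),
    AlertRule("Queue backlog", "Warning", lambda s: s.queue_depth > 1000),
    AlertRule("Agent offline", "Critical", lambda s: s.heartbeat_timed_out),
]


def evaluate(snapshot: Snapshot) -> List[Tuple[str, str, List[str]]]:
    """Return (rule name, level, channels) for every rule that is firing."""
    return [
        (rule.name, rule.level, NOTIFY_CHANNELS[rule.level])
        for rule in RULES
        if rule.is_firing(snapshot)
    ]
```

Note that every condition uses a strict comparison, so a queue depth of exactly 1000 does not fire the backlog rule.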

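## Monitoring Integration

### Prometheus Endpoint

```
GET /metrics
```

Returns metric data in the Prometheus exposition format.

### Grafana Dashboards

Recommended dashboards:

1. **API Dashboard**
   - QPS
   - Latency distribution (P50/P90/P99)
   - Error rate

2. **Agent Dashboard**
   - Executions per Agent
   - Execution duration
   - Success rate

3. **LLM Dashboard**
   - Request count
   - Token consumption
   - Cost trend

4. **Business Dashboard**
   - Brand count
   - Content generated
   - Citations detected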

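## Metrics Collection Flow

```
1. API request → middleware records (method, endpoint, status, duration)
2. Agent execution → AgentHooks records (agent_name, status, duration)
3. LLM call → LLMMetricsWrapper records (provider, model, tokens, cost)
4. Background task → periodically aggregates and writes to the database
```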
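Steps 1 to 3 share one pattern: time an operation, note whether it succeeded, and record the result with its labels. The sketch below shows that pattern as a context manager that writes to an in-memory list instead of a real metrics backend. The name `track`, the `RECORDED` list, and the `"success"` / `"error"` status values are assumptions for illustration, not the project's middleware or `AgentHooks` API.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator, List, Tuple

# In-memory stand-in for the metrics backend: (metric, labels, duration in seconds).
RECORDED: List[Tuple[str, Dict[str, str], float]] = []


@contextmanager
def track(metric: str, **labels: str) -> Iterator[Dict[str, str]]:
    """Time the wrapped block and record its labels, status, and duration."""
    start = time.perf_counter()
    labels["status"] = "success"
    try:
        # Yield the labels so the caller can adjust them (e.g. an HTTP status).
        yield labels
    except Exception:
        labels["status"] = "error"
        raise
    finally:
        RECORDED.append((metric, dict(labels), time.perf_counter() - start))
```

A caller would wrap the work in `with track("agent_execution", agent_name="writer"): ...`; the exception is re-raised after the failure is recorded, so instrumentation does not change control flow.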