fischer-agentkit/docs/plans/2026-06-15-002-feat-e2e-cap...

---
title: "feat: E2E capability analysis framework improvements and smarter routing"
type: feat
status: active
created: 2026-06-15
plan-depth: standard
---

# E2E Capability Analysis Framework Improvements and Smarter Routing

## Summary

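Improve the E2E capability analysis framework to address its core problems: the benchmark dataset does not match the actual skills, coverage is narrow (only 19 cases), and metric judgments are oversimplified. The plan also integrates ExpertTeamRouter into the CostAwareRouter auto-trigger path, adds a direct router backtest layer, and expands the benchmark to 60 cases so that recall, F1, and overfitting-detection metrics become statistically meaningful.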

## Problem Frame

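The current E2E capability analysis framework has four key problems:

1. **Benchmark data is out of sync with the actual skills**: the `expected_skill` values in `benchmark_dataset.py` (such as `email_composer` and `i18n_translator`) do not correspond to the 15 actual skills in `configs/skills/`, so routing backtest results are meaningless
2. **Coverage is too narrow**: with only 19 benchmark cases, precision/recall/F1 (PRF) statistics are unstable, and there are no dedicated benchmarks for SemanticRouter, ExpertTeamRouter, or AlignmentGuard
3. **Metric judgments are coarse**: `complexity_correct` is set equal to `execution_mode_correct`, so complexity estimation cannot be evaluated independently; `target_module` in the improvement strategies references old file names
4. **Team routing is not integrated automatically**: `ExpertTeamRouter` and `CostAwareRouter` run independently, so `TEAM_COLLAB` mode cannot be triggered automatically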

## Requirements

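- R1: Every `expected_skill` in the benchmark dataset must map one-to-one to an actual skill in `configs/skills/`
- R2: Expand the benchmark to 60 cases covering five dimensions: routing, execution, team, consistency, and alignment guard
- R3: Add a direct router backtest layer (bypassing the HTTP API) that can distinguish routing errors from API-layer errors
- R4: `complexity_correct` is independent of `execution_mode_correct` and is judged by mapping the HeuristicClassifier score to the expected complexity level
- R5: Integrate ExpertTeamRouter into `CostAwareRouter.route()` so that high-complexity tasks trigger TEAM_COLLAB automatically
- R6: Add a dedicated SemanticRouter benchmark (similarity score distribution, precision for each of the three tiers)
- R7: Add an AlignmentGuard constraint-check benchmark
- R8: Fix the `target_module` file paths in the improvement strategies
- R9: Report output stays in Chinese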

## Key Technical Decisions

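### KTD1: Two-layer backtest architecture

**Decision**: Add a direct router backtest layer on top of the existing HTTP API-layer E2E tests.

**Rationale**: API-only tests cannot distinguish two failure modes: "the router picked the wrong skill" and "the API layer passed parameters incorrectly". The direct backtest layer calls `CostAwareRouter.route()` and records the full set of `SkillRoutingResult` fields (`match_method`, `match_confidence`, `execution_trace`), so root-cause analysis can pinpoint the specific routing layer.

**Alternative**: Keep API-layer tests only → rejected, because this cannot meet the precise-diagnosis requirement in R3.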
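
To make the diagnostic benefit concrete, here is a minimal sketch of how a failed case might be attributed once both layers have run. The `FailureLayer` enum and `attribute_failure` helper are hypothetical names introduced for illustration; they are not part of the existing codebase.

```python
from __future__ import annotations

from enum import Enum


class FailureLayer(Enum):
    """Where a benchmark case went wrong (hypothetical helper type)."""

    NONE = "none"
    ROUTER = "router"
    API = "api"


def attribute_failure(
    expected_skill: str | None,
    direct_skill: str | None,
    api_skill: str | None,
) -> FailureLayer:
    """Compare the direct-router result and the API result against the expectation."""
    if direct_skill != expected_skill:
        # The router itself chose the wrong skill, so the router is the first suspect.
        return FailureLayer.ROUTER
    if api_skill != expected_skill:
        # The router was right, but the HTTP path produced a different skill.
        return FailureLayer.API
    return FailureLayer.NONE
```

With API-only tests, only `api_skill` is observable, so the first two branches cannot be told apart.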

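### KTD2: How ExpertTeamRouter is integrated

**Decision**: Add an ExpertTeamRouter checkpoint at the end of `CostAwareRouter._route_layer2()`. When Layer 2 decides `execution_mode=REACT` and `complexity >= 0.7`, call `ExpertTeamRouter.resolve()` to decide whether to upgrade to `TEAM_COLLAB`.

**Rationale**: This keeps the progressive three-layer routing architecture unchanged and adds team-mode upgrade logic only at the Layer 2 exit, minimizing intrusion into the existing routing flow.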

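### KTD3: Strategy for judging complexity correctness

**Decision**: Judge correctness by mapping the floating-point complexity score returned by HeuristicClassifier onto intervals for the expected complexity level: `low=[0, 0.3)`, `medium=[0.3, 0.7)`, `high=[0.7, 1.0]`.

**Rationale**: Using the float score directly is more precise than comparing execution modes alone. It can tell apart "score 0.29 classified as low while medium was expected" from "score 0.65 classified as medium with medium expected".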
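
The interval mapping can be sketched as follows. The function names `score_to_level` and `complexity_correct` are illustrative assumptions; only the interval bounds come from this plan.

```python
from __future__ import annotations


def score_to_level(score: float) -> str:
    """Map a complexity score to a level using the KTD3 intervals.

    low = [0, 0.3), medium = [0.3, 0.7), high = [0.7, 1.0]
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"complexity score out of range: {score}")
    if score < 0.3:
        return "low"
    if score < 0.7:
        return "medium"
    return "high"


def complexity_correct(actual_score: float | None, expected_level: str) -> bool | None:
    """Return None when no score was recorded, so the case can be excluded from the rate."""
    if actual_score is None:
        return None
    return score_to_level(actual_score) == expected_level
```

For the two cases named in the rationale, `complexity_correct(0.29, "medium")` is `False` and `complexity_correct(0.65, "medium")` is `True`, regardless of which execution mode was chosen.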

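### KTD4: Aligning benchmark cases with actual skills

**Decision**: Extract `intent.keywords` and `intent.description` from the 15 actual skills in `configs/skills/` and generate each benchmark case's `expected_skill` automatically instead of hard-coding it by hand.

**Rationale**: Hand-maintained skill names easily drift from the actual configuration (the current problem). Automatic alignment ensures the benchmark data always reflects the latest skill configuration.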

---

## Implementation Units

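### U1. Align the benchmark dataset with actual skills

**Goal**: Fix the mapping between `expected_skill` in benchmark_dataset.py and the actual skills, and expand to 60 cases

**Dependencies**: None

**Files**:
- `tests/e2e/benchmark_dataset.py`: rewrite the benchmark dataset
- `tests/e2e/benchmark_generator.py`: new; generates benchmark cases from skill configs

**Approach**:
1. Add a `BenchmarkGenerator` class that reads `configs/skills/*.yaml`, extracts each skill's `intent.keywords`, `intent.description`, and `intent.examples`, and generates `BenchmarkCase` objects automatically
2. Generate 3-5 benchmark cases per skill: 1 original input + 2-4 paraphrases
3. Keep the manually defined boundary cases (greetings, identity recognition, no-match fallback)
4. Add new dimensions: `alignment` (alignment guard) and `semantic_router` (dedicated semantic routing)
5. Overall targets: routing 20+, execution 15+, team 10+, consistency 10+, alignment guard 5+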
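
A minimal sketch of the generation step, under stated assumptions: the skill config has already been parsed into a dict (YAML loading is left out), a frozen dataclass stands in for the real Pydantic `BenchmarkCase`, and the field names `query`, `paraphrases`, and `category` are guesses. It takes one reading of step 2, where the first `intent.examples` entry becomes the original input and the remaining entries become paraphrases.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkCase:
    """Stand-in for the real frozen Pydantic model; field names are assumptions."""

    query: str
    expected_skill: str
    paraphrases: tuple[str, ...] = ()
    category: str = "routing"


def generate_cases(
    skill_name: str, skill_config: dict, max_paraphrases: int = 4
) -> list[BenchmarkCase]:
    """Build a benchmark case from one parsed skill config."""
    intent = skill_config.get("intent", {})
    examples = [e for e in intent.get("examples", []) if e.strip()]
    if not examples:
        # Fall back to the description when the skill declares no examples.
        description = intent.get("description", "").strip()
        examples = [description] if description else []
    if not examples:
        return []
    primary, rest = examples[0], examples[1:]
    return [
        BenchmarkCase(
            query=primary,
            # Taken from the config file, never typed by hand (KTD4).
            expected_skill=skill_name,
            paraphrases=tuple(rest[:max_paraphrases]),
        )
    ]
```

Because `expected_skill` is always the name of the config being read, a generated case cannot reference a skill that does not exist.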

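**Patterns to follow**: the frozen Pydantic model pattern used by `BenchmarkCase`

**Test scenarios**:
- Every `expected_skill` in the generated cases exists in configs/skills/
- Every skill has at least 1 benchmark case
- More than 60% of cases have non-empty paraphrases
- Total case count >= 60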
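
The four checks above could be collected into one validator, sketched below. `Case` is a minimal stand-in for `BenchmarkCase`, and treating `expected_skill=None` as a boundary case (greeting, no-match fallback) is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """Minimal stand-in for BenchmarkCase."""

    expected_skill: str | None  # None marks a boundary case (assumption)
    paraphrases: tuple[str, ...] = ()


def validate_benchmarks(
    cases: list[Case],
    skill_names: set[str],
    min_total: int = 60,
    min_paraphrase_ratio: float = 0.6,
) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks pass."""
    problems: list[str] = []
    referenced = {c.expected_skill for c in cases if c.expected_skill is not None}
    unknown = referenced - skill_names
    if unknown:
        problems.append(f"unknown skills: {sorted(unknown)}")
    uncovered = skill_names - referenced
    if uncovered:
        problems.append(f"skills without cases: {sorted(uncovered)}")
    with_paraphrases = sum(1 for c in cases if c.paraphrases)
    # The scenario requires strictly more than 60% of cases to carry paraphrases.
    if not cases or with_paraphrases / len(cases) <= min_paraphrase_ratio:
        problems.append("paraphrase ratio too low")
    if len(cases) < min_total:
        problems.append(f"only {len(cases)} cases, need {min_total}")
    return problems
```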

**Verification**: Run `python -c "from tests.e2e.benchmark_dataset import ALL_BENCHMARKS; print(len(ALL_BENCHMARKS))"` and confirm the count is >= 60

### U2. Direct router backtest layer

**Goal**: Add a direct router backtest that bypasses the HTTP API and records the full routing result

**Dependencies**: U1

**Files**:
- `tests/e2e/test_capability_router_direct.py`: new; direct router backtest
- `tests/e2e/conftest.py`: add a router fixture

**Approach**:
1. Add a `cost_aware_router` fixture in conftest.py that instantiates `CostAwareRouter` directly (using MockLLMProvider)
2. Add `test_capability_router_direct.py`, which calls `router.route(query)` for every benchmark case and records the full `SkillRoutingResult`
3. Recorded fields: `skill_name`, `execution_mode`, `complexity`, `match_method` (layer0/layer1/layer1.5/layer2), `match_confidence`, `execution_trace`
4. Write the router backtest results to MetricsCollector as well, adding `match_method` and `match_confidence` fields
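
The backtest loop in steps 2 and 3 might look like the sketch below. `SkillRoutingResult` here is a local stand-in carrying the fields listed in step 3, and `StubRouter` is a hypothetical replacement for `CostAwareRouter` that only handles the `@skill:` prefix. The sketch also assumes `route()` is synchronous; the `"REACT"` mode returned for the prefix case is a placeholder.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class SkillRoutingResult:
    """Local stand-in with the fields listed in step 3."""

    skill_name: str | None
    execution_mode: str
    complexity: float
    match_method: str
    match_confidence: float
    execution_trace: dict = field(default_factory=dict)


class StubRouter:
    """Hypothetical stand-in for CostAwareRouter; implements only a Layer 0 rule."""

    def route(self, query: str) -> SkillRoutingResult:
        if query.startswith("@skill:"):
            name = query.split()[0].removeprefix("@skill:")
            return SkillRoutingResult(name, "REACT", 0.0, "layer0", 1.0)
        return SkillRoutingResult(None, "DIRECT_CHAT", 0.1, "layer0", 1.0)


def backtest(router, cases: list[tuple[str, str | None]]) -> list[dict]:
    """Call the router directly for each (query, expected_skill) pair and record the result."""
    rows = []
    for query, expected_skill in cases:
        result = router.route(query)
        rows.append(
            {
                "query": query,
                "expected_skill": expected_skill,
                "actual_skill": result.skill_name,
                "skill_correct": result.skill_name == expected_skill,
                "execution_mode": result.execution_mode,
                "complexity": result.complexity,
                "match_method": result.match_method,
                "match_confidence": result.match_confidence,
                "execution_trace": result.execution_trace,
            }
        )
    return rows
```

In the real test file, the rows would be handed to MetricsCollector instead of being returned, and the loop would typically become a parametrized pytest test over the benchmark cases.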

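**Patterns to follow**: the parametrized test pattern in the existing `test_capability_routing.py`

**Test scenarios**:
- Layer 0 rule matching: greeting → DIRECT_CHAT, @skill:xxx → the corresponding skill
- Layer 1 complexity classification: simple Q&A → low, multi-step analysis → high
- Layer 1.5 semantic routing: paraphrase → same skill, similarity > 0.6
- Layer 2 capability matching: high complexity → REACT/TEAM_COLLAB
- Agreement between router backtest and API backtest results > 90%

**Verification**: Run `pytest tests/e2e/test_capability_router_direct.py -v`; all tests pass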

### U3. Metrics enhancements

**Goal**: Fix the `complexity_correct` logic, add semantic routing and team routing metrics, and correct the `target_module` paths

**Dependencies**: U1

**Files**:
- `tests/e2e/capability_metrics.py`: enhance the metric models and analyzer
- `tests/e2e/benchmark_dataset.py`: add the semantic_router / alignment categories

**Approach**:
1. Add `actual_complexity_score: float | None`, `actual_match_method: str | None`, and `actual_match_confidence: float | None` fields to `CapabilityObservation`
2. Change `complexity_correct` to use the score-interval mapping (KTD3)
3. Add an `analyze_semantic_router()` method to `MetricsAnalyzer`: compute precision for each of the high/medium/low tiers
4. Add an `analyze_team_routing()` method to `MetricsAnalyzer`: compare success rates for `explicit_team` vs `complexity_suggestion`
5. Correct every `target_module` in `plan_improvements()`: `cost_aware_router.py` → `chat/skill_routing.py`
6. Add "Semantic Routing Analysis" and "Team Routing Analysis" sections to the report
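
As an illustration of step 3, the sketch below computes per-tier precision for cases matched by Layer 1.5. The tier boundaries (0.8 and 0.6) are assumptions, since this plan does not define them, and the observations are plain dicts rather than `CapabilityObservation` objects.

```python
from __future__ import annotations


def confidence_tier(confidence: float, high: float = 0.8, medium: float = 0.6) -> str:
    """Bucket a match confidence into high/medium/low (thresholds are assumptions)."""
    if confidence >= high:
        return "high"
    if confidence >= medium:
        return "medium"
    return "low"


def analyze_semantic_router(observations: list[dict]) -> dict[str, float | None]:
    """Precision per confidence tier, counting only cases matched by Layer 1.5."""
    counts = {tier: [0, 0] for tier in ("high", "medium", "low")}  # [correct, total]
    for obs in observations:
        if obs["actual_match_method"] != "layer1.5":
            continue
        tier = confidence_tier(obs["actual_match_confidence"])
        counts[tier][1] += 1
        counts[tier][0] += int(obs["skill_correct"])
    # None means the tier had no observations, which differs from 0% precision.
    return {
        tier: (correct / total if total else None)
        for tier, (correct, total) in counts.items()
    }
```

Reporting `None` for an empty tier keeps a missing bucket from being read as a failing one in the generated report.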

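**Patterns to follow**: the analysis-method pattern in the existing `MetricsAnalyzer`

**Test scenarios**:
- `complexity_correct` is independent of `execution_mode_correct`
- Three-tier semantic routing precision is computed correctly
- Team routing success rate is computed correctly
- `target_module` paths match the actual code
- The Chinese report output contains the new sections

**Verification**: Run `pytest tests/e2e/test_capability_routing.py tests/e2e/test_capability_react.py -v`; all tests pass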

### U4. Integrate ExpertTeamRouter into CostAwareRouter

**Goal**: High-complexity tasks trigger TEAM_COLLAB mode automatically

**Dependencies**: U2

**Files**:
- `src/agentkit/chat/skill_routing.py`: modify `_route_layer2()` to add the team upgrade logic
- `src/agentkit/experts/router.py`: add a `can_handle()` method for the router to query
- `tests/unit/chat/test_skill_routing.py`: add team routing unit tests

**Approach**:
1. At the end of `CostAwareRouter._route_layer2()`, when `execution_mode == REACT` and `complexity >= COMPLEXITY_THRESHOLD`, call `ExpertTeamRouter.resolve(content, complexity)`
2. If `ExpertTeamRouter` returns a valid result, upgrade `execution_mode` to `TEAM_COLLAB` and record `"team_upgrade": True` in `execution_trace`
3. Add a `can_handle(content: str) -> bool` method to `ExpertTeamRouter` that checks whether a matching expert template exists
4. Stay backward compatible: if `ExpertTeamRouter` is unavailable (no expert templates configured), skip silently
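
The upgrade checkpoint could be sketched as a standalone function. `RoutingDecision` and `KeywordTeamRouter` are hypothetical stand-ins for the real routing result and `ExpertTeamRouter`, the keyword matching is purely illustrative, and the `0.7` threshold is taken from KTD2.

```python
from __future__ import annotations

from dataclasses import dataclass, field, replace

COMPLEXITY_THRESHOLD = 0.7  # value taken from KTD2


@dataclass(frozen=True)
class RoutingDecision:
    """Stand-in for the Layer 2 routing result."""

    execution_mode: str
    complexity: float
    execution_trace: dict = field(default_factory=dict)


class KeywordTeamRouter:
    """Hypothetical stand-in for ExpertTeamRouter using keyword templates."""

    def __init__(self, templates: dict[str, list[str]]) -> None:
        self._templates = templates

    def resolve(self, content: str, complexity: float) -> str | None:
        text = content.lower()
        for team, keywords in self._templates.items():
            if any(keyword in text for keyword in keywords):
                return team
        return None

    def can_handle(self, content: str) -> bool:
        return self.resolve(content, 0.0) is not None


def maybe_upgrade_to_team(
    decision: RoutingDecision, content: str, team_router: KeywordTeamRouter | None
) -> RoutingDecision:
    """Upgrade REACT to TEAM_COLLAB when complexity is high and a team matches."""
    if team_router is None:
        return decision  # backward compatible: no team router configured
    if decision.execution_mode != "REACT" or decision.complexity < COMPLEXITY_THRESHOLD:
        return decision
    if not team_router.can_handle(content):
        return decision  # cheap pre-check before resolve()
    if team_router.resolve(content, decision.complexity) is None:
        return decision
    trace = {**decision.execution_trace, "team_upgrade": True}
    return replace(decision, execution_mode="TEAM_COLLAB", execution_trace=trace)
```

Returning the original decision unchanged on every non-matching path is what keeps existing routing behavior intact when no expert template applies.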

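**Patterns to follow**: the Vickrey auction path pattern in the existing `_route_layer2()`

**Test scenarios**:
- High complexity + expert template available → TEAM_COLLAB
- High complexity + no expert template → stays REACT
- Low complexity → team routing is not triggered
- @team prefix → TEAM_COLLAB directly (handled by Layer 0)
- `execution_trace` contains the team_upgrade marker

**Verification**: Run `pytest tests/unit/chat/test_skill_routing.py -v -k team`; all tests pass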

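### U5. AlignmentGuard and CascadeDetector metrics integration

**Goal**: Include alignment guard constraint violations and cascade alerts in E2E metrics collection

**Dependencies**: U3

**Files**:
- `tests/e2e/test_capability_alignment.py`: new; alignment guard benchmark tests
- `tests/e2e/capability_metrics.py`: add alignment dimension metrics

**Approach**:
1. Add `test_capability_alignment.py` with 5+ alignment guard benchmark cases:
   - Negative constraint test ("do not mention prices" → output contains no prices)
   - Positive constraint test ("must include a summary" → output contains a summary)
   - Cascade alert test (5 consecutive similar queries → triggers CascadeAlert)
2. Add `alignment_violations: int` and `cascade_alert: bool` fields to `CapabilityObservation`
3. Add an `analyze_alignment()` method to `MetricsAnalyzer`
4. Add an "Alignment Guard Analysis" section to the report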
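
The two observation fields from step 2 could be populated roughly as sketched below. This is not how AlignmentGuard or CascadeDetector work internally, which this plan does not describe. Substring matching and `difflib` similarity are simplifications chosen for illustration, and the `0.8` similarity threshold is an assumption.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def count_alignment_violations(
    output: str, forbidden: list[str], required: list[str]
) -> int:
    """Count forbidden terms that appear plus required terms that are missing."""
    text = output.lower()
    violations = sum(1 for term in forbidden if term.lower() in text)
    violations += sum(1 for term in required if term.lower() not in text)
    return violations


def cascade_alert(queries: list[str], window: int = 5, threshold: float = 0.8) -> bool:
    """True when each of the last `window` queries is similar to the one before it."""
    if len(queries) < window:
        return False
    recent = queries[-window:]
    return all(
        SequenceMatcher(None, a, b).ratio() >= threshold
        for a, b in zip(recent, recent[1:])
    )
```

A benchmark test would then assert `alignment_violations == 0` for compliant outputs and `cascade_alert is True` after the fifth similar query.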

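**Patterns to follow**: the test pattern in the existing `test_capability_team.py`

**Test scenarios**:
- Negative constraint: output does not contain forbidden content
- Positive constraint: output contains required content
- Cascade alert: consecutive interactions trigger an alert
- No constraint: passes normally

**Verification**: Run `pytest tests/e2e/test_capability_alignment.py -v`; all tests pass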

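### U6. Run script and CI integration

**Goal**: Update the run script to support layered backtests and CI integration

**Dependencies**: U2, U3, U4, U5

**Files**:
- `scripts/run_e2e.sh`: add direct backtest and layered run options
- `tests/e2e/conftest.py`: make sure `pytest_sessionfinish` generates the report correctly

**Approach**:
1. Add a `--direct` option to `run_e2e.sh` (runs only the direct router backtest)
2. Add an `--alignment` option to `run_e2e.sh` (runs only the alignment guard tests)
3. Add a `--full` option to `run_e2e.sh` (runs everything: API + direct + alignment)
4. Make sure the report output directory `test-results/e2e/` is uploaded as an artifact in CI
5. Add a `--baseline` option: compare against the previous report and output metric trends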
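
The comparison behind `--baseline` might reduce to a function like the one below. It assumes both reports expose a flat mapping from metric name to value and that higher is better for every metric; the `tolerance` parameter and the trend labels are illustrative.

```python
from __future__ import annotations


def compare_to_baseline(
    current: dict[str, float], baseline: dict[str, float], tolerance: float = 0.01
) -> dict[str, dict]:
    """Report the delta and trend for every metric present in both reports."""
    trends: dict[str, dict] = {}
    for name in sorted(current.keys() & baseline.keys()):
        delta = current[name] - baseline[name]
        if delta > tolerance:
            trend = "improved"
        elif delta < -tolerance:
            trend = "regressed"
        else:
            trend = "unchanged"  # changes within tolerance are treated as noise
        trends[name] = {"delta": round(delta, 4), "trend": trend}
    return trends
```

Metrics that exist in only one of the two reports are skipped, so adding a new metric does not break the comparison against an older baseline.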

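**Patterns to follow**: the option pattern in the existing `run_e2e.sh`

**Test scenarios**:
- `--direct` runs only the direct router backtest
- `--alignment` runs only the alignment guard tests
- `--full` runs all capability tests
- `--analyze` generates the complete Chinese report
- Report files are saved correctly to test-results/e2e/

**Verification**: Run `./scripts/run_e2e.sh --direct` and `./scripts/run_e2e.sh --analyze` to verify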

---

## Scope Boundaries

### In Scope
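- Aligning the benchmark dataset with actual skills and expanding it to 60 cases
- Direct router backtest layer
- Metrics enhancements (complexity, semantic routing, team routing)
- Integrating ExpertTeamRouter into CostAwareRouter
- AlignmentGuard metrics integration
- Run script updates

### Out of Scope
- Rewriting the three-layer CostAwareRouter architecture
- Adding new LLM Providers
- Frontend UI changes
- Production deployment
- Embedding intent.examples into SemanticRouter (a possible follow-up optimization)
- The disambiguation_keywords config field (already planned in the improvement strategies, but it is an independent improvement at the skill configuration level)

### Deferred to Follow-Up Work
- Continuous expansion of benchmark cases based on real user query logs
- Training a complexity evaluation model (to replace the heuristic rules)
- GitHub Actions configuration for the intent-generalization CI guard
- Correlation analysis between OutputStandardizer.quality_score and routing decisions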

---

## Risks & Mitigations

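| Risk | Impact | Mitigation |
|------|------|----------|
| ExpertTeamRouter integration may affect existing routing performance | Layer 2 gains one extra resolve() call | Trigger only when complexity >= 0.7, with can_handle() returning quickly |
| Auto-generated benchmark cases may be low quality | PRF metrics become distorted | Review auto-generated cases manually; keep the hand-written boundary cases |
| Direct router backtest requires full MockLLMProvider support | Some routing paths cannot be tested | Cover Layer 0/1 first; mark Layer 1.5/2 as requiring a real LLM |
| 60 cases may lengthen E2E run time | CI pipeline slows down | Run in groups by dimension; support a `--fast` fail-fast mode |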

---

## System-Wide Impact

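- **Routing layer**: `skill_routing.py` gains an ExpertTeamRouter call site, which affects routing decisions for all high-complexity requests
- **Test layer**: 3 new test files, 2 new fixtures in conftest.py, and 4 new run script options
- **Report layer**: the capability analysis report gains 3 sections (semantic routing, team routing, alignment guard)
- **Config layer**: no config file changes (disambiguation_keywords is deferred to follow-up work)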