title	type	status	created	plan-depth
feat: E2E capability analysis framework improvements and smarter routing	feat	active	2026-06-15	standard

E2E Capability Analysis Framework Improvements and Smarter Routing

Summary

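Improve the E2E capability analysis framework to fix its core problems: the benchmark dataset does not match the actual skills, coverage is narrow (only 19 cases), and metric judgments are oversimplified. The plan also integrates ExpertTeamRouter into the CostAwareRouter auto-trigger path, adds a direct router backtest layer, and expands the benchmark to 60 cases so that recall, F1, and overfitting-detection metrics become statistically meaningful.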

Problem Frame

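The current E2E capability analysis framework has four key problems:

Benchmark data out of sync with actual skills: the expected_skill values in benchmark_dataset.py (e.g. email_composer, i18n_translator) do not correspond to the 15 actual skills in configs/skills/, so routing backtest results are meaningless
Coverage too narrow: there are only 19 benchmark cases, so precision/recall/F1 (PRF) statistics are unstable, and there are no dedicated benchmarks for SemanticRouter, ExpertTeamRouter, or AlignmentGuard
Coarse metric judgments: complexity_correct is simply set equal to execution_mode_correct, so complexity estimation cannot be evaluated independently; target_module in the improvement strategies references old file names
Team routing not integrated automatically: ExpertTeamRouter and CostAwareRouter run independently, so TEAM_COLLAB mode cannot be triggered automatically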

Requirements

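R1: Every expected_skill in the benchmark dataset must correspond one-to-one with an actual skill in configs/skills/
R2: Expand the benchmark to 60 cases covering five dimensions: routing, execution, team, consistency, and alignment guard
R3: Add a direct router backtest layer (bypassing the HTTP API) that can distinguish routing errors from API-layer errors
R4: complexity_correct is independent of execution_mode_correct and is judged by mapping the HeuristicClassifier score to the expected complexity level
R5: Integrate ExpertTeamRouter into CostAwareRouter.route() so that high-complexity tasks automatically trigger TEAM_COLLAB
R6: Add a dedicated SemanticRouter benchmark (similarity score distribution, precision for each of three tiers)
R7: Add an AlignmentGuard constraint-check benchmark
R8: Fix the target_module file paths in the improvement strategies
R9: Report output remains in Chinese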

Key Technical Decisions

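KTD1: Two-Layer Backtest Architecture

Decision: Add a direct router backtest layer on top of the existing HTTP API-level E2E tests.

Rationale: API-only tests cannot distinguish between two failure modes: "the router picked the wrong skill" and "the API layer passed parameters incorrectly". The direct backtest layer calls CostAwareRouter.route() and records the full set of SkillRoutingResult fields (match_method, match_confidence, execution_trace), so root-cause analysis can pinpoint the specific routing layer.

Alternative considered: keep API-only testing. Rejected because it cannot meet the precise diagnosis requirement in R3.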
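
The two-layer idea can be illustrated with a small attribution helper. This is only a sketch under assumptions: `FailureOrigin`, `CaseOutcome`, and `attribute_failure` are hypothetical names that do not come from the codebase, and the comparison is reduced to the selected skill name.

```python
from dataclasses import dataclass
from enum import Enum


class FailureOrigin(Enum):
    NONE = "none"            # both layers returned the expected skill
    ROUTER = "router"        # the router itself picked the wrong skill
    API_LAYER = "api_layer"  # the router was right, the API result was not


@dataclass(frozen=True)
class CaseOutcome:
    """Hypothetical record pairing both backtest layers for one benchmark case."""
    expected_skill: str
    direct_skill: str  # skill returned by calling the router directly
    api_skill: str     # skill observed through the HTTP API


def attribute_failure(outcome: CaseOutcome) -> FailureOrigin:
    """Attribute a failure to the first layer that deviates from the expectation."""
    if outcome.direct_skill != outcome.expected_skill:
        return FailureOrigin.ROUTER
    if outcome.api_skill != outcome.expected_skill:
        return FailureOrigin.API_LAYER
    return FailureOrigin.NONE
```

Checking the direct result first means a router mistake is never misreported as an API-layer problem, which is the distinction R3 asks for.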

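KTD2: ExpertTeamRouter Integration Approach

Decision: Add an ExpertTeamRouter checkpoint at the end of CostAwareRouter._route_layer2(). When Layer 2 decides execution_mode=REACT and complexity >= 0.7, call ExpertTeamRouter.resolve() to decide whether to upgrade to TEAM_COLLAB.

Rationale: This keeps the progressive three-layer routing architecture intact and adds the team-mode upgrade logic only at the Layer 2 exit, minimizing intrusion into the existing routing flow.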

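KTD3: Complexity Correctness Strategy

Decision: Judge correctness by mapping the floating-point complexity score returned by HeuristicClassifier onto the interval for the expected complexity level: low=[0, 0.3), medium=[0.3, 0.7), high=[0.7, 1.0].

Rationale: Using the floating-point score directly is more precise than comparing execution modes alone. It can tell apart a case where a score of 0.29 is classified as low but medium was expected from a case where a score of 0.65 is classified as medium and medium was expected.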
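
A minimal sketch of the interval mapping follows. The interval bounds are the ones stated in KTD3; the function names and the choice to return None when no score was recorded are assumptions for illustration, not the project's actual API.

```python
from typing import Optional


def score_to_level(score: float) -> str:
    """Map a complexity score to low=[0, 0.3), medium=[0.3, 0.7), high=[0.7, 1.0]."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"complexity score out of range: {score}")
    if score < 0.3:
        return "low"
    if score < 0.7:
        return "medium"
    return "high"


def complexity_correct(score: Optional[float], expected_level: str) -> Optional[bool]:
    """Compare the mapped level with the expectation.

    Returns None when no score was recorded (an assumption of this sketch),
    so that the case can be excluded instead of being counted as wrong.
    """
    if score is None:
        return None
    return score_to_level(score) == expected_level
```

With this mapping, a score of 0.29 against an expected level of medium is reported as incorrect, while 0.65 against medium is reported as correct, regardless of which execution mode was chosen.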

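KTD4: Aligning Benchmark Cases with Actual Skills

Decision: Extract intent.keywords and intent.description from the 15 actual skills in configs/skills/ and generate each benchmark case's expected_skill automatically instead of hard-coding it by hand.

Rationale: Hand-maintained skill names easily drift from the actual configuration (the current problem). Automatic alignment ensures the benchmark data always reflects the latest skill configuration.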

Implementation Units

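U1. Align the Benchmark Dataset with Actual Skills

Goal: Fix the mapping between expected_skill in benchmark_dataset.py and the actual skills, and expand to 60 cases

Dependencies: None

Files:

tests/e2e/benchmark_dataset.py: rewrite the benchmark dataset
tests/e2e/benchmark_generator.py: new, generates benchmark cases automatically from skill configs

Approach:

Add a BenchmarkGenerator class that reads configs/skills/*.yaml, extracts each skill's intent.keywords, intent.description, and intent.examples, and generates BenchmarkCase instances automatically
Generate 3-5 benchmark cases per skill: 1 original input plus 2-4 paraphrases
Keep the hand-defined edge cases (greetings, identity recognition, no-match fallback)
Add new dimensions: alignment (alignment guard) and semantic_router (dedicated semantic routing)
Overall target: routing 20+, execution 15+, team 10+, consistency 10+, alignment guard 5+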
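
The generation step might look roughly like the sketch below. It assumes the skill YAML files have already been parsed into a mapping from skill name to its intent.examples list; the dataclass is a stand-in for the project's frozen Pydantic BenchmarkCase, and its fields, the function names, and the way paraphrases are attached are all assumptions.

```python
from dataclasses import dataclass
from typing import List, Mapping, Sequence, Set, Tuple

MAX_CASES_PER_SKILL = 5  # upper bound of the 3-5 cases per skill target


@dataclass(frozen=True)
class BenchmarkCase:
    """Dataclass stand-in for the frozen Pydantic model (fields are assumed)."""
    query: str
    expected_skill: str
    paraphrases: Tuple[str, ...] = ()


def generate_cases(skill_examples: Mapping[str, Sequence[str]]) -> List[BenchmarkCase]:
    """Build one case per example; the skill's other examples become its paraphrases."""
    cases: List[BenchmarkCase] = []
    for skill_name, examples in skill_examples.items():
        for index, query in enumerate(examples[:MAX_CASES_PER_SKILL]):
            others = tuple(e for i, e in enumerate(examples) if i != index)
            cases.append(BenchmarkCase(query, skill_name, others))
    return cases


def unknown_skills(cases: Sequence[BenchmarkCase], configured: Set[str]) -> Set[str]:
    """Return expected_skill values that have no matching skill config."""
    return {case.expected_skill for case in cases} - configured
```

Because expected_skill is taken from the config key itself, generated cases cannot reference a skill that does not exist; unknown_skills is mainly useful for the hand-defined edge cases that are kept alongside them.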

Patterns to follow: the BenchmarkCase Pydantic frozen model pattern

Test scenarios:

Every generated benchmark case's expected_skill exists in configs/skills/
Every skill has at least 1 benchmark case
More than 60% of cases have non-empty paraphrases
Total case count >= 60

Verification: Run python -c "from tests.e2e.benchmark_dataset import ALL_BENCHMARKS; print(len(ALL_BENCHMARKS))" and confirm the result is >= 60

U2. Direct Router Backtest Layer

Goal: Add a direct router backtest that bypasses the HTTP API and records the full routing result

Dependencies: U1

Files:

tests/e2e/test_capability_router_direct.py: new, direct router backtest
tests/e2e/conftest.py: add a router fixture

Approach:

Add a cost_aware_router fixture to conftest.py that instantiates CostAwareRouter directly (using MockLLMProvider)
Add test_capability_router_direct.py, which calls router.route(query) for every benchmark case and records the full SkillRoutingResult
Recorded fields: skill_name, execution_mode, complexity, match_method (layer0/layer1/layer1.5/layer2), match_confidence, execution_trace
Also write the router backtest results to MetricsCollector, adding match_method and match_confidence fields

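Patterns to follow: the parametrized test pattern in the existing test_capability_routing.py

Test scenarios:

Layer 0 rule matching: greeting → DIRECT_CHAT, @skill:xxx → the corresponding skill
Layer 1 complexity classification: simple Q&A → low, multi-step analysis → high
Layer 1.5 semantic routing: synonymous paraphrase → same skill, similarity > 0.6
Layer 2 capability matching: high complexity → REACT/TEAM_COLLAB
Agreement between router backtest and API backtest results > 90%

Verification: Run pytest tests/e2e/test_capability_router_direct.py -v and confirm all tests pass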

U3. Metrics Enhancements

Goal: Fix the complexity_correct logic, add semantic routing and team routing metrics, and correct the target_module paths

Dependencies: U1

Files:

tests/e2e/capability_metrics.py: extend the metrics models and analyzer
tests/e2e/benchmark_dataset.py: add the semantic_router / alignment categories

Approach:

Add actual_complexity_score: float | None, actual_match_method: str | None, and actual_match_confidence: float | None fields to CapabilityObservation
Change complexity_correct to use the score-interval mapping (KTD3)
Add analyze_semantic_router() to MetricsAnalyzer: compute precision for each of the high/medium/low tiers
Add analyze_team_routing() to MetricsAnalyzer: compare success rates for explicit_team vs complexity_suggestion
Fix every target_module in plan_improvements(): cost_aware_router.py → chat/skill_routing.py
Add "Semantic Routing Analysis" and "Team Routing Analysis" sections to the report
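
The per-tier precision computation could be sketched as follows. The plan does not state where the high/medium/low boundaries lie, so the two thresholds below are placeholders, and the input shape (pairs of match confidence and correctness) is an assumption rather than the real CapabilityObservation model.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Placeholder tier boundaries; the plan does not specify the real values.
HIGH_THRESHOLD = 0.8
MEDIUM_THRESHOLD = 0.6

TIERS = ("high", "medium", "low")


def tier_of(confidence: float) -> str:
    """Bucket a match confidence into one of the three tiers."""
    if confidence >= HIGH_THRESHOLD:
        return "high"
    if confidence >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def precision_by_tier(
    observations: Iterable[Tuple[float, bool]],
) -> Dict[str, Optional[float]]:
    """Return correct/total per tier, or None for a tier with no observations."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for confidence, correct in observations:
        tier = tier_of(confidence)
        totals[tier] += 1
        hits[tier] += int(correct)
    return {t: (hits[t] / totals[t] if totals[t] else None) for t in TIERS}
```

Returning None for an empty tier keeps a missing sample from being read as 0% precision in the report.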

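Patterns to follow: the analysis-method pattern in the existing MetricsAnalyzer

Test scenarios:

complexity_correct is independent of execution_mode_correct
Three-tier semantic routing precision is computed correctly
Team routing success rate is computed correctly
target_module paths match the actual code
The Chinese report output includes the new sections

Verification: Run pytest tests/e2e/test_capability_routing.py tests/e2e/test_capability_react.py -v and confirm it passes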

U4. Integrate ExpertTeamRouter into CostAwareRouter

Goal: High-complexity tasks automatically trigger TEAM_COLLAB mode

Dependencies: U2

Files:

src/agentkit/chat/skill_routing.py: modify _route_layer2() to add the team upgrade logic
src/agentkit/experts/router.py: add a can_handle() method for the router to query
tests/unit/chat/test_skill_routing.py: add team routing unit tests

Approach:

At the end of CostAwareRouter._route_layer2(), when execution_mode == REACT and complexity >= COMPLEXITY_THRESHOLD, call ExpertTeamRouter.resolve(content, complexity)
If ExpertTeamRouter returns a valid result, upgrade execution_mode to TEAM_COLLAB and record "team_upgrade": True in execution_trace
Add a can_handle(content: str) -> bool method to ExpertTeamRouter that checks whether a matching expert template exists
Stay backward compatible: if ExpertTeamRouter is unavailable (no expert templates configured), skip silently
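
The upgrade check described above can be sketched in isolation. Everything here is a simplified stand-in: `RoutingResult`, `TeamRouter`, and `maybe_upgrade_to_team` are hypothetical names, the real SkillRoutingResult has more fields, and treating a None return from resolve() as "no valid result" is an assumption.

```python
from dataclasses import dataclass, field, replace
from enum import Enum
from typing import Optional, Protocol

COMPLEXITY_THRESHOLD = 0.7  # matches the 0.7 cut-off in KTD2


class ExecutionMode(Enum):
    DIRECT_CHAT = "direct_chat"
    REACT = "react"
    TEAM_COLLAB = "team_collab"


@dataclass(frozen=True)
class RoutingResult:
    """Reduced stand-in for SkillRoutingResult."""
    skill_name: str
    execution_mode: ExecutionMode
    complexity: float
    execution_trace: dict = field(default_factory=dict)


class TeamRouter(Protocol):
    def resolve(self, content: str, complexity: float) -> Optional[object]: ...


def maybe_upgrade_to_team(
    result: RoutingResult, content: str, team_router: Optional[TeamRouter]
) -> RoutingResult:
    """Upgrade REACT to TEAM_COLLAB only when every precondition holds."""
    if team_router is None:
        return result  # no expert templates configured: skip silently
    if result.execution_mode is not ExecutionMode.REACT:
        return result
    if result.complexity < COMPLEXITY_THRESHOLD:
        return result
    if team_router.resolve(content, result.complexity) is None:
        return result  # assumed meaning: no matching expert team
    trace = {**result.execution_trace, "team_upgrade": True}
    return replace(result, execution_mode=ExecutionMode.TEAM_COLLAB, execution_trace=trace)
```

Returning the original object unchanged on every early exit keeps the existing routing behaviour identical whenever the team path does not apply.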

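Patterns to follow: the Vickrey auction path pattern in the existing _route_layer2()

Test scenarios:

High complexity + expert template available → TEAM_COLLAB
High complexity + no expert template → stays REACT
Low complexity → team routing is not triggered
@team prefix → TEAM_COLLAB directly (handled by Layer 0)
execution_trace contains the team_upgrade marker

Verification: Run pytest tests/unit/chat/test_skill_routing.py -v -k team and confirm it passes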

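U5. AlignmentGuard and CascadeDetector Metrics Integration

Goal: Include alignment guard constraint violations and cascade alerts in E2E metrics collection

Dependencies: U3

Files:

tests/e2e/test_capability_alignment.py: new, alignment guard benchmark tests
tests/e2e/capability_metrics.py: add alignment dimension metrics

Approach:

Add test_capability_alignment.py with 5+ alignment guard benchmark cases:
- Negative constraint test ("do not mention the price" → output contains no price)
- Positive constraint test ("must include a summary" → output contains a summary)
- Cascade alert test (5 consecutive similar queries → triggers CascadeAlert)
Add alignment_violations: int and cascade_alert: bool fields to CapabilityObservation
Add an analyze_alignment() method to MetricsAnalyzer
Add an "Alignment Guard Analysis" section to the report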
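
To make the alignment_violations count concrete, here is a deliberately naive sketch that checks negative and positive constraints by substring matching. The real AlignmentGuard presumably does something richer; `Constraint` and `count_violations` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence

VALID_KINDS = ("forbid", "require")


@dataclass(frozen=True)
class Constraint:
    """Hypothetical constraint: forbid a phrase or require it in the output."""
    kind: str
    phrase: str


def count_violations(output: str, constraints: Sequence[Constraint]) -> int:
    """Count constraints the output breaks, using plain substring checks."""
    violations = 0
    for constraint in constraints:
        if constraint.kind not in VALID_KINDS:
            raise ValueError(f"unknown constraint kind: {constraint.kind}")
        present = constraint.phrase in output
        if constraint.kind == "forbid" and present:
            violations += 1
        elif constraint.kind == "require" and not present:
            violations += 1
    return violations
```

A benchmark case with no constraints yields zero violations, which corresponds to the "no constraints: passes normally" scenario.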

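Patterns to follow: the test pattern in the existing test_capability_team.py

Test scenarios:

Negative constraint: the output does not contain forbidden content
Positive constraint: the output contains required content
Cascade alert: consecutive interactions trigger an alert
No constraints: passes normally

Verification: Run pytest tests/e2e/test_capability_alignment.py -v and confirm it passes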

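U6. Run Script and CI Integration

Goal: Update the run script to support layered backtesting and CI integration

Dependencies: U2, U3, U4, U5

Files:

scripts/run_e2e.sh: add direct backtest and layered run options
tests/e2e/conftest.py: make sure pytest_sessionfinish generates the report correctly

Approach:

Add a --direct option to run_e2e.sh (run only the direct router backtest)
Add an --alignment option to run_e2e.sh (run only the alignment guard tests)
Add a --full option to run_e2e.sh (run everything: API + direct + alignment)
Make sure the report output directory test-results/e2e/ is uploaded as an artifact in CI
Add a --baseline option: compare against the previous report and print the metric trends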
Patterns to follow: 现有 run_e2e.sh 的选项模式

Test scenarios:

--direct 仅运行路由器直接回测
--alignment 仅运行对齐守卫测试
--full 运行所有能力测试
--analyze 生成完整中文报告
报告文件正确保存到 test-results/e2e/

Verification: 运行 ./scripts/run_e2e.sh --direct 和 ./scripts/run_e2e.sh --analyze 验证

Scope Boundaries

In Scope

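Aligning the benchmark dataset with actual skills and expanding it to 60 cases
Direct router backtest layer
Metrics enhancements (complexity, semantic routing, team routing)
Integrating ExpertTeamRouter into CostAwareRouter
AlignmentGuard metrics integration
Run script updates

Out of Scope

Rewriting the CostAwareRouter three-layer architecture
Adding new LLM providers
Front-end UI changes
Production deployment
Embedding intent.examples into SemanticRouter (possible follow-up optimization)
The disambiguation_keywords config field (already planned in the improvement strategies, but it is a separate improvement at the skill configuration level)

Deferred to Follow-Up Work

Continuously growing the benchmark from real user query logs
Training a complexity evaluation model (to replace the heuristic rules)
GitHub Actions configuration for the intent-generalization CI guard
Correlation analysis between OutputStandardizer.quality_score and routing decisions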

Risks & Mitigations

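Risk	Impact	Mitigation
ExpertTeamRouter integration may affect existing routing performance	Layer 2 gains one extra resolve() call	Trigger only when complexity >= 0.7, and can_handle() returns quickly
Auto-generated benchmark cases may be low quality	Distorted PRF metrics	Manually review auto-generated cases and keep the hand-defined edge cases
Direct router backtesting needs full MockLLMProvider support	Some routing paths cannot be tested	Cover Layer 0/1 first; mark Layer 1.5/2 as requiring a real LLM
60 cases may increase E2E run time	Slower CI pipeline	Run grouped by dimension and support a `--fast` fail-fast mode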

System-Wide Impact

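Routing layer: skill_routing.py gains an ExpertTeamRouter call site, which affects routing decisions for all high-complexity requests
Test layer: 3 new test files, 2 new fixtures in conftest.py, and 4 new run script options
Report layer: the capability analysis report gains 3 sections (semantic routing, team routing, alignment guard)
Config layer: no config file changes (disambiguation_keywords is deferred to follow-up work)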
