feat(router): optimize routing intelligence — ExecutionMode expansion, multi-candidate scoring, quality gate skill match

- Expand ExecutionMode enum with REWOO/REFLEXION/PLAN_EXEC
- Add _resolve_execution_mode() to respect skill.config.execution_mode
- Rewrite IntentRouter._match_keywords() for multi-candidate scoring
- Add QualityGate 5th dimension: skill_match validation with warning escalation
- Calibrate HeuristicClassifier: low-complexity signals only when no high signals
- Fix negation regex for Chinese text (avoid matching past punctuation)
- Fix backtest mode_map normalization and .env loading
- Add 61 unit tests (21 HeuristicClassifier + 14 IntentRouter + 13 QualityGate + 13 existing)

Results: execution_mode_accuracy 9.09%→36.36%, skill_routing_F1 66.67%→77.78%
2026-06-15 22:43:13 +08:00 · 2026-06-15 22:43:13 +08:00 · e984b4c462
parent 64d62a2b60
commit e984b4c462
23 changed files with 7048 additions and 66 deletions
--- a/docs/plans/2026-06-15-002-feat-e2e-capability-improvement-plan.md
+++ b/docs/plans/2026-06-15-002-feat-e2e-capability-improvement-plan.md
@@ -0,0 +1,280 @@
 ---
 title: "feat: E2E capability analysis framework improvements and smarter routing"
 type: feat
 status: active
 created: 2026-06-15
 plan-depth: standard
 ---
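 # E2E Capability Analysis Framework Improvements and Smarter Routing
 ## Summary
 Improve the E2E capability analysis framework to fix its core problems: the benchmark dataset does not correspond to the real skills, coverage is narrow (only 19 cases), and the metric checks are oversimplified. In addition, integrate ExpertTeamRouter into the CostAwareRouter auto-trigger path, add a direct router backtest layer, and expand the benchmark to 60 cases so that recall, F1, and overfitting-detection metrics become statistically meaningful.
 ## Problem Frame
 The current E2E capability analysis framework has four key problems:
 1. **Benchmark data is detached from the real skills**: the `expected_skill` values in `benchmark_dataset.py` (e.g. `email_composer`, `i18n_translator`) do not correspond to the 15 actual skills in `configs/skills/`, so routing backtest results are meaningless
 2. **Coverage is too narrow**: with only 19 benchmark cases, precision/recall/F1 (PRF) statistics are unstable; there are no dedicated benchmarks for SemanticRouter, ExpertTeamRouter, or AlignmentGuard
 3. **Metric checks are coarse**: `complexity_correct` is set equal to `execution_mode_correct`, so complexity estimation cannot be evaluated on its own; `target_module` in the improvement strategies references an old file name
 4. **Team routing is not integrated automatically**: `ExpertTeamRouter` and `CostAwareRouter` run independently, so `TEAM_COLLAB` mode cannot be triggered automatically
 ## Requirements
 - R1: Every `expected_skill` in the benchmark dataset must correspond to an actual skill in `configs/skills/`
 - R2: Expand the benchmark to 60 cases covering five dimensions: routing, execution, team, consistency, and alignment guard
 - R3: Add a direct router backtest layer (bypassing the HTTP API) that can tell routing errors apart from API-layer errors
 - R4: `complexity_correct` is independent of `execution_mode_correct` and is decided by mapping the HeuristicClassifier score onto the expected complexity level
 - R5: Integrate ExpertTeamRouter into CostAwareRouter.route() so that high-complexity tasks trigger TEAM_COLLAB automatically
 - R6: Add a dedicated SemanticRouter benchmark (similarity score distribution, precision for three tiers)
 - R7: Add an AlignmentGuard constraint-check benchmark
 - R8: Correct the target_module file paths in the improvement strategies
 - R9: Report output stays in Chinese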
 ## Key Technical Decisions
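 ### KTD1: Two-layer backtest architecture
 **Decision**: Add a direct router backtest layer on top of the existing HTTP API-level E2E tests.
 **Rationale**: API-only tests cannot distinguish two failure modes: "the router picked the wrong skill" and "the API layer passed parameters incorrectly". The direct backtest layer calls `CostAwareRouter.route()` and records all fields of `SkillRoutingResult` (`match_method`, `match_confidence`, `execution_trace`), so root-cause analysis can pinpoint the exact routing layer.
 **Alternative**: Keep API-only tests. Rejected because it cannot meet the precise diagnosis requirement in R3.
 ### KTD2: How ExpertTeamRouter is integrated
 **Decision**: Add an ExpertTeamRouter checkpoint at the end of `CostAwareRouter._route_layer2()`. When Layer 2 decides `execution_mode=REACT` and `complexity >= 0.7`, call `ExpertTeamRouter.resolve()` to decide whether to upgrade to `TEAM_COLLAB`.
 **Rationale**: The progressive three-layer routing architecture stays unchanged; the team-mode upgrade logic is added only at the Layer 2 exit, which minimizes intrusion into the existing routing flow.
 ### KTD3: Strategy for judging complexity correctness
 **Decision**: Judge by mapping the floating-point complexity score returned by HeuristicClassifier onto the interval of the expected complexity level: `low=[0, 0.3)`, `medium=[0.3, 0.7)`, `high=[0.7, 1.0]`.
 **Rationale**: Using the float score directly is more precise than comparing execution modes alone. It separates cases such as "score 0.29 is classified as low but medium was expected" from "score 0.65 is classified as medium and medium was expected".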
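 The interval mapping in KTD3 can be sketched as follows. This is a minimal illustration, not the project's implementation: `score_to_band` and `complexity_correct` are hypothetical helper names, and only the three intervals are taken from the plan.

```python
from __future__ import annotations


def score_to_band(score: float) -> str:
    """Map a complexity score in [0, 1] to the KTD3 band name.

    Bands: low=[0, 0.3), medium=[0.3, 0.7), high=[0.7, 1.0].
    """
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"complexity score out of range: {score}")
    if score < 0.3:
        return "low"
    if score < 0.7:
        return "medium"
    return "high"


def complexity_correct(actual_score: float | None, expected_level: str) -> bool:
    """Hypothetical check: True when the score falls inside the expected band.

    A missing score (None) is treated as incorrect; that choice is an assumption.
    """
    if actual_score is None:
        return False
    return score_to_band(actual_score) == expected_level
```

 With this mapping, a score of 0.29 checked against an expected `medium` is reported as incorrect, while 0.65 against `medium` is correct, independently of which execution mode was chosen.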
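 ### KTD4: Align benchmark cases with the real skills
 **Decision**: Extract `intent.keywords` and `intent.description` from the 15 actual skills in `configs/skills/` and generate the `expected_skill` of each benchmark case automatically instead of hard-coding it by hand.
 **Rationale**: Hand-maintained skill names drift away from the real configuration (the current problem). Automatic alignment keeps the benchmark data in step with the latest skill configuration.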
 ---
 ## Implementation Units
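 ### U1. Align the benchmark dataset with the real skills
 **Goal**: Fix the mapping between expected_skill in benchmark_dataset.py and the real skills, and expand to 60 cases
 **Dependencies**: None
 **Files**:
 - `tests/e2e/benchmark_dataset.py`: rewrite the benchmark dataset
 - `tests/e2e/benchmark_generator.py`: new; generates benchmark cases from the skill configs
 **Approach**:
 1. Add a `BenchmarkGenerator` class that reads `configs/skills/*.yaml`, extracts each skill's `intent.keywords`, `intent.description`, and `intent.examples`, and generates `BenchmarkCase` objects automatically
 2. Generate 3-5 benchmark cases per skill: 1 original input + 2-4 paraphrases
 3. Keep the hand-written edge cases (greetings, identity questions, no-match fallback)
 4. Add new dimensions: `alignment` (alignment guard) and `semantic_router` (dedicated semantic routing)
 5. Overall targets: routing 20+, execution 15+, team 10+, consistency 10+, alignment guard 5+
 **Patterns to follow**: the `BenchmarkCase` Pydantic frozen-model pattern
 **Test scenarios**:
 - Every generated expected_skill exists in configs/skills/
 - Every skill has at least 1 benchmark case
 - More than 60% of cases have non-empty paraphrases
 - Total number of cases >= 60
 **Verification**: run `python -c "from tests.e2e.benchmark_dataset import ALL_BENCHMARKS; print(len(ALL_BENCHMARKS))"` and confirm the result is >= 60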
 ### U2. Direct router backtest layer
 **Goal**: Add a direct router backtest that bypasses the HTTP API and records the full routing result
 **Dependencies**: U1
 **Files**:
 - `tests/e2e/test_capability_router_direct.py`: new; direct router backtest
 - `tests/e2e/conftest.py`: add a router fixture
 **Approach**:
 1. Add a `cost_aware_router` fixture to conftest.py that instantiates `CostAwareRouter` directly (using MockLLMProvider)
 2. Add `test_capability_router_direct.py`, which calls `router.route(query)` for every benchmark case and records the full `SkillRoutingResult`
 3. Recorded fields: `skill_name`, `execution_mode`, `complexity`, `match_method` (layer0/layer1/layer1.5/layer2), `match_confidence`, `execution_trace`
 4. Write the router backtest results into MetricsCollector as well, adding the `match_method` and `match_confidence` fields
 **Patterns to follow**: the parametrized test pattern in the existing `test_capability_routing.py`
 **Test scenarios**:
 - Layer 0 rule matching: greeting → DIRECT_CHAT, @skill:xxx → the named skill
 - Layer 1 complexity classification: simple Q&A → low, multi-step analysis → high
 - Layer 1.5 semantic routing: paraphrase → same skill, similarity > 0.6
 - Layer 2 capability matching: high complexity → REACT/TEAM_COLLAB
 - Agreement between router backtest and API backtest results > 90%
 **Verification**: run `pytest tests/e2e/test_capability_router_direct.py -v`; all tests pass
 ### U3. Metrics enhancements
 **Goal**: Fix the complexity_correct logic, add semantic routing and team routing metrics, and correct the target_module paths
 **Dependencies**: U1
 **Files**:
 - `tests/e2e/capability_metrics.py`: enhance the metric models and the analyzer
 - `tests/e2e/benchmark_dataset.py`: add the semantic_router / alignment categories
 **Approach**:
 1. Add the fields `actual_complexity_score: float | None`, `actual_match_method: str | None`, and `actual_match_confidence: float | None` to `CapabilityObservation`
 2. Change `complexity_correct` to use the score-interval mapping (KTD3)
 3. Add `analyze_semantic_router()` to `MetricsAnalyzer`: precision per high/medium/low tier
 4. Add `analyze_team_routing()` to `MetricsAnalyzer`: success rate of `explicit_team` vs `complexity_suggestion`
 5. Correct every `target_module` in `plan_improvements()`: `cost_aware_router.py` → `chat/skill_routing.py`
 6. Add "semantic routing analysis" and "team routing analysis" sections to the report
 **Patterns to follow**: the analysis-method pattern of the existing `MetricsAnalyzer`
 **Test scenarios**:
 - complexity_correct is independent of execution_mode_correct
 - Three-tier semantic routing precision is computed correctly
 - Team routing success rate is computed correctly
 - target_module paths correspond to the actual code
 - The Chinese report output contains the new sections
 **Verification**: run `pytest tests/e2e/test_capability_routing.py tests/e2e/test_capability_react.py -v`; tests pass
 ### U4. Integrate ExpertTeamRouter into CostAwareRouter
 **Goal**: High-complexity tasks trigger TEAM_COLLAB mode automatically
 **Dependencies**: U2
 **Files**:
 - `src/agentkit/chat/skill_routing.py`: modify `_route_layer2()` to add the team upgrade logic
 - `src/agentkit/experts/router.py`: add a `can_handle()` method for the router to query
 - `tests/unit/chat/test_skill_routing.py`: add team routing unit tests
 **Approach**:
 1. At the end of `CostAwareRouter._route_layer2()`, when `execution_mode == REACT` and `complexity >= COMPLEXITY_THRESHOLD`, call `ExpertTeamRouter.resolve(content, complexity)`
 2. If `ExpertTeamRouter` returns a valid result, upgrade `execution_mode` to `TEAM_COLLAB` and record `"team_upgrade": True` in `execution_trace`
 3. Add `can_handle(content: str) -> bool` to `ExpertTeamRouter` to check whether a matching expert template exists
 4. Stay backward compatible: if `ExpertTeamRouter` is unavailable (no expert templates configured), skip silently
 **Patterns to follow**: the Vickrey auction path in the existing `_route_layer2()`
 **Test scenarios**:
 - High complexity + expert template available → TEAM_COLLAB
 - High complexity + no expert template → stays REACT
 - Low complexity → team routing is not triggered
 - @team prefix → TEAM_COLLAB directly (handled by Layer 0)
 - execution_trace contains the team_upgrade flag
 **Verification**: run `pytest tests/unit/chat/test_skill_routing.py -v -k team`; tests pass
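 The upgrade step described in U4 can be sketched in isolation. This is an illustration under stated assumptions: `RoutingDecision`, `TeamRouter`, and `maybe_upgrade_to_team` are hypothetical stand-ins, the two-member enum is a reduced copy of `ExecutionMode`, and the threshold value 0.7 is the one quoted in KTD2.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional, Protocol

COMPLEXITY_THRESHOLD = 0.7  # value quoted in KTD2; the real constant may differ


class ExecutionMode(Enum):
    """Reduced stand-in for the project's ExecutionMode enum."""

    REACT = "react"
    TEAM_COLLAB = "team_collab"


class TeamRouter(Protocol):
    """Assumed shape of ExpertTeamRouter.resolve(); None means "no team"."""

    def resolve(self, content: str, complexity: float) -> Optional[Any]: ...


@dataclass
class RoutingDecision:
    """Hypothetical subset of the routing result used by this sketch."""

    execution_mode: ExecutionMode
    complexity: float
    execution_trace: dict[str, Any] = field(default_factory=dict)


def maybe_upgrade_to_team(
    decision: RoutingDecision,
    content: str,
    team_router: Optional[TeamRouter],
) -> RoutingDecision:
    """Upgrade REACT to TEAM_COLLAB when a team can take a complex task."""
    if team_router is None:
        return decision  # router not configured: skip silently
    if decision.execution_mode is not ExecutionMode.REACT:
        return decision
    if decision.complexity < COMPLEXITY_THRESHOLD:
        return decision
    if team_router.resolve(content, decision.complexity) is None:
        return decision  # no matching expert template: stay on REACT
    decision.execution_mode = ExecutionMode.TEAM_COLLAB
    decision.execution_trace["team_upgrade"] = True
    return decision
```

 Keeping the check in one small function at the Layer 2 exit means the earlier layers never see the team router, which matches the "minimal intrusion" rationale in KTD2.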
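 ### U5. AlignmentGuard and CascadeDetector metrics integration
 **Goal**: Include alignment guard constraint violations and cascade alerts in E2E metrics collection
 **Dependencies**: U3
 **Files**:
 - `tests/e2e/test_capability_alignment.py`: new; alignment guard benchmark tests
 - `tests/e2e/capability_metrics.py`: add alignment dimension metrics
 **Approach**:
 1. Add `test_capability_alignment.py` with 5+ alignment guard benchmark cases:
   - Negative constraint test ("不要提及价格" → output contains no price)
   - Positive constraint test ("必须包含摘要" → output contains a summary)
   - Cascade alert test (5 consecutive similar queries → triggers CascadeAlert)
 2. Add the fields `alignment_violations: int` and `cascade_alert: bool` to `CapabilityObservation`
 3. Add `analyze_alignment()` to `MetricsAnalyzer`
 4. Add an "alignment guard analysis" section to the report
 **Patterns to follow**: the test pattern in the existing `test_capability_team.py`
 **Test scenarios**:
 - Negative constraint: output does not contain forbidden content
 - Positive constraint: output contains required content
 - Cascade alert: consecutive interactions trigger the alert
 - No constraint: passes normally
 **Verification**: run `pytest tests/e2e/test_capability_alignment.py -v`; tests pass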
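 ### U6. Runner script and CI integration
 **Goal**: Update the runner script to support layered backtests and CI integration
 **Dependencies**: U2, U3, U4, U5
 **Files**:
 - `scripts/run_e2e.sh`: add direct backtest and layered run options
 - `tests/e2e/conftest.py`: make sure pytest_sessionfinish generates the report correctly
 **Approach**:
 1. Add a `--direct` option to `run_e2e.sh` (run only the direct router backtest)
 2. Add an `--alignment` option to `run_e2e.sh` (run only the alignment guard tests)
 3. Add a `--full` option to `run_e2e.sh` (run everything: API + direct + alignment)
 4. Make sure the report output directory `test-results/e2e/` is uploaded as an artifact in CI
 5. Add a `--baseline` option: compare with the previous report and print the metric trend
 **Patterns to follow**: the option pattern in the existing `run_e2e.sh`
 **Test scenarios**:
 - `--direct` runs only the direct router backtest
 - `--alignment` runs only the alignment guard tests
 - `--full` runs all capability tests
 - `--analyze` generates the full Chinese report
 - Report files are saved correctly to test-results/e2e/
 **Verification**: run `./scripts/run_e2e.sh --direct` and `./scripts/run_e2e.sh --analyze` to verify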
 ---
 ## Scope Boundaries
 ### In Scope
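 - Align the benchmark dataset with the real skills and expand it to 60 cases
 - Direct router backtest layer
 - Metrics enhancements (complexity, semantic routing, team routing)
 - ExpertTeamRouter integration into CostAwareRouter
 - AlignmentGuard metrics integration
 - Runner script updates
 ### Out of Scope
 - Rewriting the three-layer CostAwareRouter architecture
 - New LLM providers
 - Front-end UI changes
 - Production deployment
 - Embedding intent.examples into SemanticRouter (possible follow-up optimization)
 - The disambiguation_keywords config field (planned in the improvement strategies, but a separate skill-config-level improvement)
 ### Deferred to Follow-Up Work
 - Continuous expansion of benchmark cases from real user query logs
 - Training a complexity estimation model (to replace the heuristic rules)
 - GitHub Actions configuration for the intent-generalization CI guard
 - Correlation analysis between OutputStandardizer.quality_score and routing decisions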
 ---
 ## Risks & Mitigations
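 | Risk | Impact | Mitigation |
 |------|------|----------|
 | ExpertTeamRouter integration may affect existing routing performance | Layer 2 gains one extra resolve() call | Triggered only when complexity >= 0.7, and can_handle() returns quickly |
 | Auto-generated benchmark cases may be low quality | Distorted PRF metrics | Manually review generated cases; keep the hand-written edge cases |
 | The direct router backtest needs full MockLLMProvider support | Some routing paths cannot be tested | Cover Layer 0/1 first; mark Layer 1.5/2 as requiring a real LLM |
 | 60 cases may increase E2E run time | Slower CI pipeline | Run grouped by dimension; support the `--fast` fail-fast mode |
 ---
 ## System-Wide Impact
 - **Routing layer**: `skill_routing.py` gains an ExpertTeamRouter call site, which affects routing decisions for all high-complexity requests
 - **Test layer**: 3 new test files, 2 new fixtures in conftest.py, 4 new options in the runner script
 - **Report layer**: the capability analysis report gains 3 sections (semantic routing, team routing, alignment guard)
 - **Config layer**: no config file changes (disambiguation_keywords is deferred)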
--- a/docs/plans/2026-06-15-003-feat-router-intelligence-optimization-plan.md
+++ b/docs/plans/2026-06-15-003-feat-router-intelligence-optimization-plan.md
@@ -0,0 +1,326 @@
 ---
 title: "feat: Routing intelligence optimization: complexity calibration, intent disambiguation, quality gate enhancement"
 status: active
 created: 2026-06-15
 updated: 2026-06-15
 origin: test-results/e2e/capability_report.txt (real-LLM backtest analysis report)
 ---
 ## Summary
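 Based on the three core root causes exposed by the real-LLM backtest analysis report, improve the routing intelligence of CostAwareRouter: fix the HeuristicClassifier complexity scoring bias (execution mode accuracy from 9.09% to >30%), resolve the skill confusion caused by IntentRouter's first-match strategy (skill routing F1 from 66.67% to >80%), and strengthen QualityGate with skill-match validation to intercept wrong routes.
 **Current progress**: U1 code is implemented, unit tests pending; U2/U3 not yet implemented; U4 awaiting verification.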
 ---
 ## Problem Frame
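 The real-LLM backtest (74 observations) reveals three core problems:
 1. **Execution mode accuracy of 9.09%**: HeuristicClassifier tends to overestimate complexity and classifies simple Q&A (e.g. "你好", "你是谁") as needing REACT instead of DIRECT_CHAT. Of the 40 execution mode errors, only 1 underestimated complexity.
 2. **keyword_match recall of 0%**: none of the 62 keyword-match cases was routed to the expected skill; the real SkillRegistry loads all 15 skills, but the routing path fails to match them.
 3. **Intent ambiguity**: the keywords of plan_exec_agent and goal_driven_agent overlap (the substrings "规划" and "报告"), and IntentRouter's first-match strategy causes confusion.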
 ---
 ## Requirements
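 - R1: HeuristicClassifier complexity score calibration: simple Q&A should score low (<0.3) and complex tasks should score high (>0.7)
 - R2: IntentRouter multi-candidate ranking: when several skills match, pick the best by score instead of the first match
 - R3: QualityGate skill-match validation: intercept output that is inconsistent with the capabilities of the routed skill
 - R4: Backtest verification: after the changes, execution mode accuracy >30% and skill routing F1 >80%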
 ---
 ## Key Technical Decisions
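 ### KTD1: HeuristicClassifier scoring rework: add low-complexity signals
 **Decision**: In addition to the existing high/medium complexity keywords, add a low-complexity keyword list and a negation mechanism. When the input contains a low-complexity signal (greeting, small talk, simple definition), lower the base score directly; when a high-complexity word appears in a negated context ("不要X", "无需X"), do not add to the score.
 **Rationale**: The current classifier only accumulates positively (a high-complexity hit adds to the score) and has no deduction logic. As a result, any input containing common verbs such as "分析" or "搜索" is classified as high complexity even when it is simple Q&A.
 **Alternative**: Replace the rule-based classifier with an LLM. This has high latency (~500ms) and cost (~100 tokens), and merged_llm_classify already uses an LLM in the 0.3-0.7 range, so the rule layer should stay zero-cost.
 **Implementation status**: Code complete. `classify()` has been rewritten with low-complexity signal detection first, negated-context exclusion, threshold adjustments (0.15→0.10, 0.45→0.35), and a short-question deduction.
 ### KTD2: IntentRouter multi-candidate ranking
 **Decision**: Change `_match_keywords()` from "return on first match" to "collect all matching candidates, rank them by a score that reflects both the number and the length of matched keywords (the sum of matched keyword lengths), and return the best match".
 **Rationale**: First match depends on the iteration order of the skills list, which is uncontrolled and unfair. Multi-candidate scoring lets the skill that matches more, and more specific, keywords win. For example, the input "规划一个调研报告" matches both plan_exec_agent ("规划", "报告") and goal_driven_agent ("规划", "调研"), and "报告" is also a substring of goal_driven_agent's keyword "生成报告". When the number of matches is equal, candidates are ranked by keyword length, so a longer keyword carries more weight ("调研报告" > "报告").
 **Alternative**: Add mutually exclusive keywords to the skill configs. This requires pairwise configuration, is costly to maintain, and cannot cover every overlap.
 **Implementation status**: Not yet implemented. `_match_keywords()` still uses first-match logic (`intent.py` L89-98).
 ### KTD3: QualityGate skill-match validation: lightweight routing consistency check
 **Decision**: Add an optional `skill_context` parameter to QualityGate.validate(). When provided, check whether the output content is consistent with the capability scope of the routed skill. Use a rule check (keyword coverage) rather than an LLM semantic check to keep the extra cost at zero.
 **Rationale**: QualityGate currently checks only the output format (required fields, length, schema), not whether the content matches the routed skill. Three cases succeeded at the HTTP level but were routed to the wrong skill, and the quality gate did not intercept them.
 **Implementation status**: Not yet implemented. `validate()` currently has only the four-dimension check (`gate.py` L37-114).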
 ---
 ## Scope Boundaries
 ### In Scope
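 - HeuristicClassifier scoring logic optimization (code complete, tests pending)
 - Multi-candidate ranking in IntentRouter._match_keywords()
 - A skill-match validation dimension in QualityGate
 - Updating the backtest benchmark dataset to reflect the new scoring logic
 - Re-running the backtest after the changes
 ### Out of Scope
 - LLM classifier optimization (merged_llm_classify and _classify_with_llm already exist and are not part of this work)
 - SemanticRouter optimization (needs an embedding model; a separate optimization track)
 - Injecting ExpertTeamRouter at server startup (implemented but not wired into create_app; a deployment configuration issue)
 - New skill config files
 ### Deferred to Follow-Up Work
 - Training a dedicated intent classification model to replace rule matching (long-term direction)
 - Building a complexity calibration dataset to keep tuning the thresholds
 - An automated quality-regression CI pipeline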
 ---
 ## Implementation Units
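 ### U1. HeuristicClassifier complexity score calibration
 **Goal**: Fix the complexity scoring bias so that simple Q&A scores low and complex tasks score high, improving execution mode accuracy
 **Requirements**: R1, R4
 **Dependencies**: None
 **Files:**
 - `src/agentkit/chat/skill_routing.py`: the HeuristicClassifier class (**code complete**)
 - `tests/unit/chat/test_skill_routing.py`: new complexity calibration tests (**to be written**)
 **Approach:**
 The code already implements the following changes:
 1. Add the low-complexity keyword lists `_LOW_COMPLEXITY_HINTS_CN` (17 words) and `_LOW_COMPLEXITY_HINTS_EN` (14 words). On a hit the base score is 0.05 and high-complexity hits are no longer accumulated.
 2. Add negated-context detection `_NEGATION_PATTERNS`, which matches the word following "不要/无需/不用/don't/no need/without"; that word is excluded from high-complexity matching.
 3. Adjust the base score thresholds: 0.10 when no keyword hits (was 0.15) and 0.35 for a medium-complexity hit (was 0.45).
 4. Add short-question detection `_SHORT_QUESTION_RE`: when the input ends with "？" or "?" and is shorter than 30 characters, subtract an extra 0.10.
 **Remaining work**: write unit tests that verify the classifier behaviour.
 **Patterns to follow:** the test class structure in the existing `test_skill_routing.py` (`TestExpertTeamRouterCanHandle`, etc.)
 **Test scenarios:**
 - **Low-complexity signals checked first**
  - "你好" → complexity < 0.3 (hits `_LOW_COMPLEXITY_HINTS_CN`)
  - "Hello" → complexity < 0.3 (hits `_LOW_COMPLEXITY_HINTS_EN`)
  - "嗨，早上好" → complexity < 0.3 (several low-complexity hits)
  - "你好，请帮我分析一下这个数据" → complexity < 0.15 (low-complexity signal wins; high-complexity hits are not accumulated)
 - **Identity questions**
  - "你是谁" → complexity < 0.3
  - "你叫什么" → complexity < 0.3
 - **Negated-context exclusion**
  - "不要搜索" → "搜索" is excluded from high-complexity matching, complexity < 0.3
  - "无需分析，直接告诉我答案" → "分析" is negated, complexity < 0.3
  - "分析市场趋势，但不要搜索" → "搜索" is negated but "分析" is not, complexity > 0.5
 - **Threshold adjustment checks**
  - Short message with no keywords ("好的") → complexity ≤ 0.10
  - Medium-complexity word ("如何使用Python？") → base score 0.35 rather than 0.45
 - **Short-question deduction**
  - "怎么用？" → complexity < 0.3 (short question, -0.10)
  - "如何设计一个高可用的微服务架构？" → complexity > 0.5 (long question, no deduction)
 - **Complex tasks score high**
  - "分析市场趋势并生成报告" → complexity > 0.7 (2 high-complexity hits)
  - "执行部署脚本并重启服务" → complexity > 0.7
 - **Edge cases**
  - Empty string → complexity 0.0
  - Whitespace only → complexity 0.0
  - Very long low-complexity message (a greeting of >200 characters) → complexity ≤ 0.10
 **Verification:** `pytest tests/unit/chat/test_skill_routing.py -v`; all HeuristicClassifier tests pass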
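 A reduced sketch of the negation-aware scoring may help when writing these tests. It is not the project's `classify()`: the hint lists are abbreviated and hypothetical, English negations and the medium tier (0.35) are omitted, the score 0.65 for a single high-complexity hit is an assumption based on the values listed under Open Questions, and the low-signal rule follows the commit message ("low-complexity signals only when no high signals") rather than the "low signal first" scenario above.

```python
import re

# Abbreviated, hypothetical hint lists; the real lists are longer.
HIGH_HINTS = ("分析", "搜索", "生成", "部署")
LOW_HINTS = ("你好", "早上好", "hello")

# A negation marker followed by up to four characters. The character class
# excludes punctuation and whitespace so the match cannot run past a clause.
NEGATION_RE = re.compile(r"(?:不要|无需|不用)([^，。！？,.!?\s]{1,4})")


def negated_words(text: str) -> set[str]:
    """Return the high-complexity hints that appear right after a negation."""
    found: set[str] = set()
    for match in NEGATION_RE.finditer(text):
        tail = match.group(1)
        found.update(hint for hint in HIGH_HINTS if hint in tail)
    return found


def complexity_score(text: str) -> float:
    """Score in [0, 1]; the numeric values are illustrative assumptions."""
    stripped = text.strip()
    if not stripped:
        return 0.0
    lowered = stripped.lower()
    negated = negated_words(lowered)
    high_hits = sum(1 for h in HIGH_HINTS if h in lowered and h not in negated)
    has_low_signal = any(h in lowered for h in LOW_HINTS)
    if has_low_signal and high_hits == 0:
        return 0.05  # greeting or small talk with no competing signal
    if high_hits >= 2:
        score = 0.80
    elif high_hits == 1:
        score = 0.65
    else:
        score = 0.10
    # Short questions get a small deduction.
    if len(stripped) < 30 and stripped.endswith(("?", "？")):
        score -= 0.10
    return max(0.0, min(1.0, score))
```

 In this sketch "分析市场趋势，但不要搜索" keeps its "分析" hit because the negation pattern stops at the clause boundary, while "你好，请帮我分析一下这个数据" scores as a single high-complexity hit instead of as a greeting.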
 ---
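 ### U2. IntentRouter multi-candidate ranking
 **Goal**: Resolve the skill confusion caused by first-match so that the more precisely matching skill wins
 **Requirements**: R2, R4
 **Dependencies**: None
 **Files:**
 - `src/agentkit/router/intent.py`: IntentRouter._match_keywords()
 - `tests/unit/router/test_intent.py`: new multi-candidate ranking tests
 **Approach:**
 1. Rewrite `_match_keywords()` (currently `intent.py` L75-99):
   Current logic (first match):
   ```
   for skill in skills:
       for keyword in keywords:
           if keyword in combined_text:
               return RoutingResult(matched_skill=skill.name, ...)
   return None
   ```
   Change to multi-candidate scoring:
   ```
   candidates = []
   for skill in skills:
       matched_kws = [kw for kw in skill.config.intent.keywords if kw.lower() in combined_text]
       if matched_kws:
           score = sum(len(kw) for kw in matched_kws)  # longer keywords weigh more
           candidates.append((skill, matched_kws, score))
   if not candidates:
       return None
   candidates.sort(key=lambda c: (-c[2], c[0].name))  # score descending, ties broken by skill name
   best_skill, best_kws, best_score = candidates[0]
   confidence = min(1.0, 0.5 + 0.1 * len(best_kws))
   return RoutingResult(matched_skill=best_skill.name, method="keyword", confidence=confidence)
   ```
 2. Keep the `RoutingResult` dataclass interface unchanged; `method` is still `"keyword"`.
 3. Backward compatibility: with a single candidate the selected skill is the same as before (sorting has no effect when only one skill matches).
 4. The `tests/unit/router/` directory and its `__init__.py` need to be created.
 **Patterns to follow:** the existing `RoutingResult` dataclass structure; the input handling in `_extract_string_values()`
 **Test scenarios:**
 - **Single candidate**: the input matches the keywords of only one skill; the same skill is selected as before, and confidence follows the formula in step 1 (0.6 for one matched keyword)
 - **Multiple candidates, different scores**: the input matches both skill_a (keyword "规划", 2 characters) and skill_b (keyword "调研报告", 4 characters); skill_b scores higher and should win
 - **Multiple candidates, equal scores**: when two skills score the same, they are ordered stably by name
 - **No match**: no keyword hits; returns None
 - **Empty keyword list**: a skill whose intent.keywords is empty does not take part in matching
 - **Case insensitive**: the English keyword "Search" should match "search"
 - **Substring matching**: the Chinese keyword "报告" should match any input containing "报告" (existing substring semantics are kept)
 - **Confidence calculation**: 1 matched keyword gives confidence=0.6, 3 give confidence=0.8, capped at 1.0
 **Verification:** `pytest tests/unit/router/test_intent.py -v`; multi-candidate ranking tests pass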
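 The multi-candidate scoring above can also be written as a self-contained function, which is convenient for checking the listed scenarios. This is a sketch under assumptions: `SkillKeywords` and `KeywordMatch` are hypothetical stand-ins for the project's skill objects and `RoutingResult`, and the input text is lowercased here because the plan's pseudocode assumes `combined_text` is already lowercase.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SkillKeywords:
    """Hypothetical stand-in for a skill and its intent keywords."""

    name: str
    keywords: tuple[str, ...]


@dataclass(frozen=True)
class KeywordMatch:
    """Hypothetical stand-in for RoutingResult."""

    matched_skill: str
    matched_keywords: tuple[str, ...]
    confidence: float
    method: str = "keyword"


def match_keywords(text: str, skills: Sequence[SkillKeywords]) -> Optional[KeywordMatch]:
    """Pick the skill whose matched keywords have the largest total length."""
    combined = text.lower()
    candidates: list[tuple[int, str, tuple[str, ...]]] = []
    for skill in skills:
        hits = tuple(kw for kw in skill.keywords if kw and kw.lower() in combined)
        if hits:
            score = sum(len(kw) for kw in hits)  # longer keywords weigh more
            candidates.append((score, skill.name, hits))
    if not candidates:
        return None
    # Highest score first; equal scores fall back to the skill name.
    _, name, hits = min(candidates, key=lambda c: (-c[0], c[1]))
    confidence = min(1.0, 0.5 + 0.1 * len(hits))
    return KeywordMatch(matched_skill=name, matched_keywords=hits, confidence=confidence)
```

 Because the result no longer depends on the iteration order of `skills`, reordering the registry does not change which skill is selected.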
 ---
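 ### U3. QualityGate skill-match validation
 **Goal**: Add a routing consistency check that intercepts low-quality output caused by a wrong skill match
 **Requirements**: R3, R4
 **Dependencies:** None
 **Files:**
 - `src/agentkit/quality/gate.py`: QualityGate.validate()
 - `tests/unit/quality/test_gate.py`: new skill-match validation tests
 **Approach:**
 1. Add an optional parameter `skill_context: dict | None = None` to the `QualityGate.validate()` signature:
   ```python
   async def validate(
       self,
       output: dict[str, Any],
       skill: Skill,
       skill_context: dict | None = None,  # new
   ) -> QualityResult:
   ```
 2. Structure of `skill_context`: `{"skill_name": str, "intent_keywords": list[str]}`
 3. When `skill_context` is provided and `intent_keywords` is non-empty, add a fifth check dimension, "skill-match validation":
   - Concatenate all string values in output
   - Check whether the concatenated text contains at least one keyword from `intent_keywords` (substring match)
   - If 0 keywords match → `QualityCheck(name="skill_match", passed=True, message="Warning: output may not match routed skill")`: warn but do not block
   - If ≥ 1 keyword matches → `QualityCheck(name="skill_match", passed=True)`: pass silently
 4. Escalation from warning to failure: when the `skill_match` warning is present and any other dimension fails, `passed` of `skill_match` also becomes `False`, which makes the overall result `passed=False`.
 5. Backward compatibility: when `skill_context` is None or lacks `intent_keywords`, behaviour is identical to before (four-dimension check).
 **Patterns to follow:** the existing four-dimension check pattern (`gate.py` L50-114); the `QualityCheck` dataclass
 **Test scenarios:**
 - **No skill_context**: behaviour as before, four-dimension check only
 - **skill_context=None**: same as no skill_context
 - **skill_context without intent_keywords**: same as no skill_context
 - **skill_context present and output contains a keyword**: passes with no warning message
 - **skill_context present and output contains no keyword**: passes with a warning message
 - **Unrelated output + another dimension fails**: skill_match passed=False, overall passed=False
 - **Unrelated output + all other dimensions pass**: skill_match passed=True (warning only), overall passed=True
 - **Empty intent_keywords list**: the skill-match check is skipped
 **Verification:** `pytest tests/unit/quality/test_gate.py -v`; skill-match validation tests pass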
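 The fifth dimension and its escalation rule can be sketched as a standalone function. This is a minimal illustration, not the `gate.py` implementation: `QualityCheck` here is a simplified stand-in, `skill_match_check` and `_string_values` are hypothetical names, and case-insensitive comparison is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Optional, Sequence


@dataclass
class QualityCheck:
    """Simplified stand-in for the project's QualityCheck dataclass."""

    name: str
    passed: bool
    message: str = ""


def _string_values(value: Any) -> list[str]:
    """Collect every string found in nested dicts, lists, and tuples."""
    if isinstance(value, str):
        return [value]
    if isinstance(value, dict):
        return [s for v in value.values() for s in _string_values(v)]
    if isinstance(value, (list, tuple)):
        return [s for v in value for s in _string_values(v)]
    return []


def skill_match_check(
    output: dict[str, Any],
    skill_context: Optional[dict[str, Any]],
    other_checks: Sequence[QualityCheck],
) -> Optional[QualityCheck]:
    """Return the skill_match check, or None when the dimension is skipped."""
    keywords = (skill_context or {}).get("intent_keywords") or []
    if not keywords:
        return None  # backward compatible: four-dimension behaviour
    text = " ".join(_string_values(output)).lower()
    if any(kw.lower() in text for kw in keywords):
        return QualityCheck(name="skill_match", passed=True)
    # No keyword found: a warning on its own, a failure when combined
    # with any other failed dimension.
    others_failed = any(not check.passed for check in other_checks)
    return QualityCheck(
        name="skill_match",
        passed=not others_failed,
        message="Warning: output may not match routed skill",
    )
```

 Returning `None` for the skipped case keeps the list of checks identical to the pre-change result when no `skill_context` is passed.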
 ---
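 ### U4. Backtest verification and benchmark update
 **Goal**: Verify the effect of the changes and update the benchmark dataset
 **Requirements**: R4
 **Dependencies:** U1, U2, U3
 **Files:**
 - `tests/e2e/test_capability_router_direct.py`: backtest with a real LLM
 - `tests/e2e/benchmark_dataset.py`: expected values may need updating
 - `test-results/e2e/capability_report.txt`: compare the reports before and after the changes
 **Approach:**
 1. Run the full backtest: `python3 -m pytest tests/e2e/test_capability_router_direct.py -v`
 2. Compare the metrics before and after:
   - Execution mode accuracy: 9.09% → target >30%
   - Skill routing F1: 66.67% → target >80%
   - Task success rate: 100% → unchanged
 3. If expected values in the benchmark dataset need adjusting because of the new scoring logic, update `benchmark_dataset.py`
 4. Save the post-change report as the baseline: `cp test-results/e2e/capability_report.json test-results/e2e/baseline_capability_report.json`
 **Test scenarios:**
 - The whole backtest passes
 - Execution mode accuracy >30%
 - Skill routing F1 >80%
 - No regression (task success rate does not drop)
 **Verification:** run the backtest and check the report metrics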
 ---
 ## Risks & Dependencies
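 | Risk | Impact | Mitigation |
 |------|------|------|
 | The complexity score adjustment may overcorrect, so complex tasks are classified as simple | High-complexity tasks are routed to DIRECT_CHAT and cannot use tools | Keep merged_llm_classify as a fallback; the 0.3-0.7 range is still confirmed by the LLM |
 | Multi-candidate ranking may change existing routing behaviour | Routing results that users rely on may change | Ranking takes effect only with multiple candidates; single-candidate behaviour is unchanged |
 | The keyword check in QualityGate skill-match validation may produce false positives | Normal output is flagged with a warning | Use warning level rather than error; never block on this check alone |
 | The root cause of the 0% keyword_match recall may not be IntentRouter alone | Even with multi-candidate ranking, recall may stay low because skill config keywords do not match | If recall is still low after the U4 backtest, analyse the alignment between skill configs and benchmark cases further |
 ---
 ## Open Questions
 - Initial values for the complexity score thresholds are set in code (0.05/0.10/0.35/0.65/0.80) and need calibration through the U4 backtest
 - Coverage of the negation regex patterns needs to be verified in the backtest and may need iterative additions
 - Whether the 0% keyword_match recall is caused entirely by IntentRouter's first match, or also by skill config keywords that do not align with the benchmark cases, needs to be verified through U4 after U2 is implemented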
--- a/scripts/run_e2e.sh
+++ b/scripts/run_e2e.sh
@@ -0,0 +1,328 @@
 #!/usr/bin/env bash
 # =============================================================================
 # Fischer AgentKit — E2E Backtest Runner
 # =============================================================================
 #
 # Usage:
 #   ./scripts/run_e2e.sh                  # Run all E2E tests
 #   ./scripts/run_e2e.sh --basic          # Run basic function tests only
 #   ./scripts/run_e2e.sh --capability     # Run agent capability tests only
 #   ./scripts/run_e2e.sh --cli            # Run CLI tests only
 #   ./scripts/run_e2e.sh --api            # Run API tests only
 #   ./scripts/run_e2e.sh --ws             # Run WebSocket tests only
 #   ./scripts/run_e2e.sh --routing        # Run routing intelligence tests
 #   ./scripts/run_e2e.sh --react          # Run ReAct intelligence tests
 #   ./scripts/run_e2e.sh --team           # Run team collaboration tests
 #   ./scripts/run_e2e.sh --report         # Generate HTML report
 #   ./scripts/run_e2e.sh --analyze        # Run capability tests + generate analysis report
 #   ./scripts/run_e2e.sh --direct         # Run router direct backtest only (no HTTP)
 #   ./scripts/run_e2e.sh --alignment      # Run alignment guard tests only
 #   ./scripts/run_e2e.sh --full           # Run all: API + direct + alignment
 #   ./scripts/run_e2e.sh --baseline       # Compare with last baseline report
 #
 # Environment:
 #   E2E_PORT         - Server port (default: 18765)
 #   E2E_API_KEY      - API key for auth (default: ak_live_e2e_test_key_...)
 #   SKIP_SERVER      - Set to "1" to skip server startup (use existing)
 # =============================================================================
 set -euo pipefail
 # ── Configuration ────────────────────────────────────────────────────────────
 SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
 E2E_PORT="${E2E_PORT:-18765}"
 E2E_API_KEY="${E2E_API_KEY:-ak_live_e2e_test_key_000000000000000000000000000000000000000000000000}"
 REPORT_DIR="${PROJECT_ROOT}/test-results/e2e"
 SKIP_SERVER="${SKIP_SERVER:-0}"
 cd "$PROJECT_ROOT"
 # ── Colors ───────────────────────────────────────────────────────────────────
 RED='\033[0;31m'
 GREEN='\033[0;32m'
 YELLOW='\033[1;33m'
 BLUE='\033[0;34m'
 NC='\033[0m' # No Color
 # ── Helper Functions ─────────────────────────────────────────────────────────
 info()  { echo -e "${BLUE}[INFO]${NC}  $*"; }
 ok()    { echo -e "${GREEN}[OK]${NC}    $*"; }
 warn()  { echo -e "${YELLOW}[WARN]${NC}  $*"; }
 fail()  { echo -e "${RED}[FAIL]${NC}  $*"; }
 check_deps() {
    local missing=0
    for cmd in python3; do
        if ! command -v "$cmd" &>/dev/null; then
            fail "Missing dependency: $cmd"
            missing=1
        fi
    done
    if [ "$missing" -eq 1 ]; then
        exit 1
    fi
 }
 wait_for_server() {
    local max_attempts=60
    local attempt=0
    info "Waiting for server on port $E2E_PORT..."
    while [ $attempt -lt $max_attempts ]; do
        if curl -s "http://127.0.0.1:$E2E_PORT/api/v1/health" &>/dev/null; then
            ok "Server is ready on port $E2E_PORT"
            return 0
        fi
        attempt=$((attempt + 1))
        sleep 0.5
    done
    fail "Server failed to start within 30 seconds"
    return 1
 }
 start_server() {
    if [ "$SKIP_SERVER" = "1" ]; then
        info "SKIP_SERVER=1, using existing server on port $E2E_PORT"
        if curl -s "http://127.0.0.1:$E2E_PORT/api/v1/health" &>/dev/null; then
            ok "Existing server is healthy"
            return 0
        else
            fail "Existing server is not responding"
            return 1
        fi
    fi
    info "Starting AgentKit E2E server on port $E2E_PORT..."
    export AGENTKIT_E2E_MODE=1
    export AGENTKIT_WS_TIMEOUT=0
    export AGENTKIT_API_KEY="$E2E_API_KEY"
    # Start server in background
    python3 -m agentkit.cli.main serve --host 127.0.0.1 --port "$E2E_PORT" &
    SERVER_PID=$!
    if wait_for_server; then
        return 0
    else
        kill "$SERVER_PID" 2>/dev/null || true
        return 1
    fi
 }
 stop_server() {
    if [ "$SKIP_SERVER" = "1" ]; then
        info "SKIP_SERVER=1, not stopping server"
        return 0
    fi
    if [ -n "${SERVER_PID:-}" ]; then
        info "Stopping E2E server (PID: $SERVER_PID)..."
        kill "$SERVER_PID" 2>/dev/null || true
        wait "$SERVER_PID" 2>/dev/null || true
        ok "Server stopped"
    fi
 }
 # ── Test Selection ───────────────────────────────────────────────────────────
 PYTEST_ARGS=("--timeout=120" "-v" "--tb=short" "-s")
 TEST_TARGET="tests/e2e/"
 GENERATE_REPORT=0
 ANALYZE=0
 SKIP_SERVER_FLAG=0
 BASELINE_COMPARE=0
 while [[ $# -gt 0 ]]; do
    case $1 in
        --basic)
            PYTEST_ARGS+=("-m" "e2e_basic")
            shift
            ;;
        --capability)
            PYTEST_ARGS+=("-m" "e2e_capability")
            shift
            ;;
        --cli)
            TEST_TARGET="tests/e2e/test_basic_cli.py"
            shift
            ;;
        --api)
            TEST_TARGET="tests/e2e/test_basic_api.py"
            shift
            ;;
        --ws)
            TEST_TARGET="tests/e2e/test_basic_websocket.py"
            shift
            ;;
        --routing)
            TEST_TARGET="tests/e2e/test_capability_routing.py"
            shift
            ;;
        --react)
            TEST_TARGET="tests/e2e/test_capability_react.py"
            shift
            ;;
        --team)
            TEST_TARGET="tests/e2e/test_capability_team.py"
            shift
            ;;
        --direct)
            # Router direct backtest — no HTTP server needed
            TEST_TARGET="tests/e2e/test_capability_router_direct.py"
            SKIP_SERVER_FLAG=1
            shift
            ;;
        --alignment)
            # Alignment guard tests — no HTTP server needed
            TEST_TARGET="tests/e2e/test_capability_alignment.py"
            SKIP_SERVER_FLAG=1
            shift
            ;;
        --full)
            # Run all capability tests: API + direct + alignment
            PYTEST_ARGS+=("-m" "e2e_capability")
            shift
            ;;
        --baseline)
            BASELINE_COMPARE=1
            shift
            ;;
        --report)
            GENERATE_REPORT=1
            shift
            ;;
        --analyze)
            ANALYZE=1
            PYTEST_ARGS+=("-m" "e2e_capability")
            shift
            ;;
        --fast)
            PYTEST_ARGS+=("-x" "--timeout=30")
            shift
            ;;
        --help|-h)
            echo "Usage: $0 [--basic|--capability|--cli|--api|--ws|--routing|--react|--team|--direct|--alignment|--full|--baseline|--report|--analyze|--fast]"
            exit 0
            ;;
        *)
            PYTEST_ARGS+=("$1")
            shift
            ;;
    esac
 done
 if [ "$GENERATE_REPORT" -eq 1 ]; then
    mkdir -p "$REPORT_DIR"
    PYTEST_ARGS+=(
        "--html=$REPORT_DIR/e2e_report.html"
        "--self-contained-html"
        "--junitxml=$REPORT_DIR/e2e_junit.xml"
    )
 fi
 if [ "$ANALYZE" -eq 1 ]; then
    info "Analysis mode: will generate capability report with recall/F1/overfitting analysis"
 fi
 # Override SKIP_SERVER when --direct or --alignment is used (no HTTP needed)
 if [ "$SKIP_SERVER_FLAG" -eq 1 ]; then
    SKIP_SERVER=1
 fi
 # ── Main ─────────────────────────────────────────────────────────────────────
 info "Fischer AgentKit E2E Backtest Runner"
 info "====================================="
 info "Project:  $PROJECT_ROOT"
 info "Port:     $E2E_PORT"
 info "Target:   $TEST_TARGET"
 info ""
 check_deps
 # Trap to ensure server cleanup
 trap stop_server EXIT INT TERM
 if [ "$SKIP_SERVER_FLAG" -ne 1 ] && ! start_server; then
    fail "Could not start E2E server"
    exit 1
 fi
 info ""
 info "Running E2E tests..."
 info "===================="
 info ""
 export AGENTKIT_SERVER_URL="http://127.0.0.1:$E2E_PORT"
 export AGENTKIT_API_KEY="$E2E_API_KEY"
 EXIT_CODE=0
 python3 -m pytest "$TEST_TARGET" "${PYTEST_ARGS[@]}" || EXIT_CODE=$?
 echo ""
 if [ $EXIT_CODE -eq 0 ]; then
    ok "All E2E tests passed!"
 else
    fail "Some E2E tests failed (exit code: $EXIT_CODE)"
 fi
 if [ "$GENERATE_REPORT" -eq 1 ]; then
    info "Report generated at: $REPORT_DIR/e2e_report.html"
 fi
 if [ "$ANALYZE" -eq 1 ]; then
    CAPABILITY_REPORT="$PROJECT_ROOT/test-results/e2e/capability_report.txt"
    if [ -f "$CAPABILITY_REPORT" ]; then
        info "Capability analysis report:"
        echo ""
        cat "$CAPABILITY_REPORT"
    else
        warn "Capability report not found (may need capability tests to run first)"
    fi
 fi
 if [ "$BASELINE_COMPARE" -eq 1 ]; then
    CURRENT_REPORT="$PROJECT_ROOT/test-results/e2e/capability_report.json"
    BASELINE_REPORT="$PROJECT_ROOT/test-results/e2e/baseline_capability_report.json"
    if [ -f "$CURRENT_REPORT" ] && [ -f "$BASELINE_REPORT" ]; then
        info "Baseline comparison:"
        python3 -c "
 import json, sys
 def load_metrics(path):
    with open(path) as f:
        return json.load(f)
 cur = load_metrics('$CURRENT_REPORT')
 base = load_metrics('$BASELINE_REPORT')
 metrics = [
    ('overall_skill_recall', '技能路由召回率'),
    ('overall_skill_precision', '技能路由精确率'),
    ('overall_skill_f1', '技能路由F1'),
    ('overall_execution_mode_accuracy', '执行模式准确率'),
    ('overall_task_success_rate', '任务成功率'),
    ('overfitting_score', '过拟合分数'),
 ]
 print()
 for key, label in metrics:
    c = cur.get(key, 0)
    b = base.get(key, 0)
    delta = c - b
    arrow = '↑' if delta > 0 else ('↓' if delta < 0 else '→')
    print(f'  {label}: {b:.2%} → {c:.2%}  {arrow} {delta:+.2%}')
 print()
 "
    elif [ -f "$CURRENT_REPORT" ]; then
        info "No baseline report found. Saving current report as baseline."
        cp "$CURRENT_REPORT" "$BASELINE_REPORT"
        info "Baseline saved to: $BASELINE_REPORT"
    else
        warn "No current report found. Run with --analyze first."
    fi
 fi
 exit $EXIT_CODE
--- a/src/agentkit/chat/skill_routing.py
+++ b/src/agentkit/chat/skill_routing.py
@@ -33,9 +33,31 @@ class ExecutionMode(enum.Enum):
     DIRECT_CHAT = "direct_chat"  # Zero-cost: direct LLM call, no ReAct loop
     REACT = "react"  # Default agent ReAct loop with default tools
     SKILL_REACT = "skill_react"  # Skill-matched ReAct with skill tools + prompt
+    REWOO = "rewoo"  # Plan-without-observation mode
+    REFLEXION = "reflexion"  # Reflection-driven mode
+    PLAN_EXEC = "plan_exec"  # Plan-then-execute mode
     TEAM_COLLAB = "team_collab"  # Expert Team collaborative mode
+# Mapping from skill config execution_mode string to ExecutionMode enum
+_SKILL_EXECUTION_MODE_MAP: dict[str, ExecutionMode] = {
+    "direct": ExecutionMode.DIRECT_CHAT,
+    "react": ExecutionMode.SKILL_REACT,
+    "rewoo": ExecutionMode.REWOO,
+    "reflexion": ExecutionMode.REFLEXION,
+    "plan_exec": ExecutionMode.PLAN_EXEC,
+    "custom": ExecutionMode.SKILL_REACT,
+    "llm_generate": ExecutionMode.SKILL_REACT,
+    "tool_call": ExecutionMode.SKILL_REACT,
+}
+def _resolve_execution_mode(skill_config: Any) -> ExecutionMode:
+    """Resolve ExecutionMode from skill config's execution_mode field."""
+    mode_str = getattr(skill_config, "execution_mode", "react") or "react"
+    return _SKILL_EXECUTION_MODE_MAP.get(mode_str, ExecutionMode.SKILL_REACT)
 def validate_skill_name(name: str) -> str:
    """Validate and normalize a skill name. Raises ValueError on invalid input."""
    normalized = name.strip().lower()
@@ -265,7 +287,8 @@ async def resolve_skill_routing(
             else default_model
         )
         result.agent_name = result.skill_name
-        result.execution_mode = ExecutionMode.SKILL_REACT
+        # Map skill.config.execution_mode to ExecutionMode enum
+        result.execution_mode = _resolve_execution_mode(result.skill_config)
    else:
        result.system_prompt = default_system_prompt
        result.tools = default_tools
@@ -596,21 +619,10 @@ class HeuristicClassifier:
         content_lower = content.lower()
         score = 0.0
-        # 0. Low-complexity signal detection (highest priority)
+        # 0. Low-complexity signal detection (applies only when no high-complexity signal is present)
         low_hits_cn = sum(1 for h in self._LOW_COMPLEXITY_HINTS_CN if h in content_lower)
-        low_hits_en = sum(
-            1 for h in self._LOW_COMPLEXITY_HINTS_EN if h in content_lower
-        )
-        if low_hits_cn + low_hits_en > 0:
-            score = 0.05  # greetings / small talk get a very low score
-            # With a low-complexity signal, high-complexity hits are no longer accumulated,
-            # but the length and multi-sentence adjustments are kept
-            length = len(content)
-            if length > 200:
-                score += 0.05
-            elif length > 100:
-                score += 0.03
-            return max(0.0, min(1.0, score))
+        low_hits_en = sum(1 for h in self._LOW_COMPLEXITY_HINTS_EN if h in content_lower)
+        has_low_signal = low_hits_cn + low_hits_en > 0
         # 1. Negated-context detection: collect the negated words
         negated_words: set[str] = set()
@ -624,21 +636,27 @@ class HeuristicClassifier:
            for h in self._HIGH_COMPLEXITY_HINTS_CN
            if h in content_lower and h not in negated_words
        )
-        medium_hits = sum(
-            1 for m in self._MEDIUM_COMPLEXITY_HINTS_CN if m in content_lower
-        )
+        medium_hits = sum(1 for m in self._MEDIUM_COMPLEXITY_HINTS_CN if m in content_lower)
        # English: word-boundary matching
-        high_en_matches = self._HIGH_EN_RE.findall(content) + self._HIGH_EXACT_RE.findall(
-            content
-        )
-        high_hits += sum(
-            1 for w in high_en_matches if w.lower() not in negated_words
-        )
+        high_en_matches = self._HIGH_EN_RE.findall(content) + self._HIGH_EXACT_RE.findall(content)
+        high_hits += sum(1 for w in high_en_matches if w.lower() not in negated_words)
        medium_hits += len(self._MEDIUM_EN_RE.findall(content)) + len(
            self._MEDIUM_EXACT_RE.findall(content)
        )
+        has_high_signal = high_hits > 0 or medium_hits > 0
+        # Low-complexity signals only apply when there is no high/medium-complexity signal
+        if has_low_signal and not has_high_signal:
+            score = 0.05  # greetings/small talk get a very low score directly
+            length = len(content)
+            if length > 200:
+                score += 0.05
+            elif length > 100:
+                score += 0.03
+            return max(0.0, min(1.0, score))
        if high_hits >= 2:
            score = 0.80
        elif high_hits == 1:
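The calibration change in this hunk (the low-complexity shortcut now fires only when no high or medium signal is present) can be sketched as a standalone function. The hint tuples below are hypothetical placeholders, since the real `_LOW_COMPLEXITY_HINTS_*` and high/medium hint tables are not shown in this diff, and negation handling and regex word-boundary matching are left out for brevity.

```python
from typing import Optional

# Hypothetical hint lists, for illustration only.
LOW_HINTS = ("你好", "hello", "thanks")
HIGH_HINTS = ("分析", "analyze", "report")


def low_signal_score(content: str) -> Optional[float]:
    """Return the greeting/small-talk score, or None to continue normal scoring."""
    text = content.lower()
    has_low_signal = any(h in text for h in LOW_HINTS)
    has_high_signal = any(h in text for h in HIGH_HINTS)
    # The shortcut applies only when nothing suggests real work.
    if has_low_signal and not has_high_signal:
        score = 0.05
        length = len(content)
        if length > 200:
            score += 0.05
        elif length > 100:
            score += 0.03
        return max(0.0, min(1.0, score))
    return None


print(low_signal_score("你好"))
print(low_signal_score("你好，帮我分析这个数据"))
```

Under the old ordering, an input such as "你好，帮我分析这个数据" would have returned the greeting score before the high-complexity hints were ever counted; with the gate moved after the counting step it falls through to the regular scoring path.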
--- a/src/agentkit/quality/gate.py
+++ b/src/agentkit/quality/gate.py
@ -38,6 +38,7 @@ class QualityGate:
        self,
        output: dict[str, Any],
        skill: Skill,
+        skill_context: dict[str, Any] | None = None,
    ) -> QualityResult:
        """Run multi-dimensional quality checks on the output
@ -46,6 +47,7 @@ class QualityGate:
        2. Minimum word count check
        3. JSON Schema validation (if skill.config.output_schema is present)
        4. Custom validator (if skill.config.quality_gate.custom_validator is present)
+        5. Skill match validation (if skill_context contains intent_keywords)
        """
        checks: list[QualityCheck] = []
        qg = skill.config.quality_gate
@ -53,11 +55,13 @@ class QualityGate:
        # 1. Required-field check
        for field in qg.required_fields:
            present = field in output and output[field] is not None
-            checks.append(QualityCheck(
-                name=f"required_field:{field}",
-                passed=present,
-                message=f"Field '{field}' is missing" if not present else None,
-            ))
+            checks.append(
+                QualityCheck(
+                    name=f"required_field:{field}",
+                    passed=present,
+                    message=f"Field '{field}' is missing" if not present else None,
+                )
+            )
        # 2. Minimum word count check
        if qg.min_word_count > 0:
@ -67,15 +71,17 @@ class QualityGate:
            else:
                word_count = len(str(content).split())
            passed = word_count >= qg.min_word_count
-            checks.append(QualityCheck(
-                name="min_word_count",
-                passed=passed,
-                message=(
-                    f"Word count {word_count} < minimum {qg.min_word_count}"
-                    if not passed
-                    else None
-                ),
-            ))
+            checks.append(
+                QualityCheck(
+                    name="min_word_count",
+                    passed=passed,
+                    message=(
+                        f"Word count {word_count} < minimum {qg.min_word_count}"
+                        if not passed
+                        else None
+                    ),
+                )
+            )
        # 3. JSON Schema validation
        if skill.config.output_schema:
@ -101,11 +107,34 @@ class QualityGate:
                checks.append(QualityCheck(name="custom", passed=bool(result)))
            except Exception as e:
                # Validator import/execution failed; skip it and record a warning
-                checks.append(QualityCheck(
-                    name="custom",
-                    passed=True,
-                    message=f"Validator skipped: {e}",
-                ))
+                checks.append(
+                    QualityCheck(
+                        name="custom",
+                        passed=True,
+                        message=f"Validator skipped: {e}",
+                    )
+                )
+        # 5. Skill match validation (lightweight routing consistency check)
+        skill_match_check = self._check_skill_match(output, skill_context)
+        if skill_match_check is not None:
+            checks.append(skill_match_check)
+        # Warning escalation: when a skill_match warning exists and another dimension has failed, escalate it to a failure
+        if (
+            skill_match_check is not None
+            and skill_match_check.message
+            and "Warning" in skill_match_check.message
+        ):
+            other_failed = any(not c.passed for c in checks if c is not skill_match_check)
+            if other_failed:
+                # Escalate: also set passed to False for skill_match
+                checks = [
+                    QualityCheck(name=c.name, passed=False, message=c.message)
+                    if c is skill_match_check
+                    else c
+                    for c in checks
+                ]
        return QualityResult(
            passed=all(c.passed for c in checks),
@ -119,6 +148,42 @@ class QualityGate:
        "app.agent_framework.",
    )
+    @staticmethod
+    def _check_skill_match(
+        output: dict[str, Any],
+        skill_context: dict[str, Any] | None,
+    ) -> QualityCheck | None:
+        """Fifth dimension: skill match validation
+        When skill_context contains intent_keywords, check whether the output
+        contains at least one keyword. A mismatch is flagged as a warning
+        (passed=True + message) and is escalated to passed=False when another
+        dimension has also failed.
+        Returns:
+            QualityCheck, or None (skipped when skill_context is invalid)
+        """
+        if not skill_context:
+            return None
+        intent_keywords: list[str] | None = skill_context.get("intent_keywords")
+        if not intent_keywords:
+            return None
+        # Concatenate all scalar values in the output (str, int, float, bool)
+        all_text = " ".join(
+            str(v) for v in output.values() if isinstance(v, (str, int, float, bool))
+        ).lower()
+        matched = any(kw.lower() in all_text for kw in intent_keywords)
+        if matched:
+            return QualityCheck(name="skill_match", passed=True)
+        return QualityCheck(
+            name="skill_match",
+            passed=True,  # warning level; does not block on its own
+            message="Warning: output may not match routed skill",
+        )
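The warning-then-escalate behaviour of the fifth dimension can be sketched outside the class as follows. `Check` is a hypothetical stand-in for `QualityCheck` (whose definition is not part of this diff), and `escalate` condenses the inline escalation block into a helper for illustration; it identifies the skill-match entry by name rather than by object identity.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass(frozen=True)
class Check:  # hypothetical stand-in for QualityCheck
    name: str
    passed: bool
    message: Optional[str] = None


def check_skill_match(output: dict[str, Any], keywords: list[str]) -> Optional[Check]:
    """Warn (without failing) when no intent keyword appears in the output."""
    if not keywords:
        return None
    text = " ".join(
        str(v) for v in output.values() if isinstance(v, (str, int, float, bool))
    ).lower()
    if any(kw.lower() in text for kw in keywords):
        return Check("skill_match", True)
    return Check("skill_match", True, "Warning: output may not match routed skill")


def escalate(checks: list[Check]) -> list[Check]:
    """Turn a skill_match warning into a failure if any other check failed."""
    other_failed = any(not c.passed for c in checks if c.name != "skill_match")
    if not other_failed:
        return list(checks)
    return [
        replace(c, passed=False)
        if c.name == "skill_match" and c.message and "Warning" in c.message
        else c
        for c in checks
    ]


warning = check_skill_match({"text": "hello"}, ["seo"])
failed_field = Check("required_field:title", False, "Field 'title' is missing")
print([c.passed for c in escalate([failed_field, warning])])
```

The design keeps a keyword mismatch non-blocking on its own, since a valid output may simply not repeat the routing keywords, while treating it as corroborating evidence once another dimension has already failed.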
    def _import_validator(self, dotted_path: str) -> Callable:
        """Import a custom validator function from a dotted path
--- a/src/agentkit/router/intent.py
+++ b/src/agentkit/router/intent.py
@ -75,10 +75,11 @@ class IntentRouter:
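    def _match_keywords(
        self, input_data: dict[str, Any], skills: list[Skill]
    ) -> RoutingResult | None:
-        """Level 1: keyword matching
+        """Level 1: multi-candidate keyword scoring
        Extract all string values (including nested ones) from input_data and, for each Skill's
-        intent.keywords, perform case-insensitive matching.
+        intent.keywords, perform case-insensitive matching. Collect all matching candidates,
+        sort by total matched keyword length (longer keywords weigh more), and return the best match.
        """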
        """
        text_values = self._extract_string_values(input_data)
        combined_text = " ".join(text_values).lower()
@ -86,17 +87,30 @@ class IntentRouter:
        if not combined_text:
            return None
+        # Collect all matching candidates
+        candidates: list[tuple[Skill, list[str], int]] = []
        for skill in skills:
            keywords = skill.config.intent.keywords
-            for keyword in keywords:
-                if keyword.lower() in combined_text:
-                    return RoutingResult(
-                        matched_skill=skill.name,
-                        method="keyword",
-                        confidence=1.0,
-                    )
-        return None
+            if not keywords:
+                continue
+            matched_kws = [kw for kw in keywords if kw.lower() in combined_text]
+            if matched_kws:
+                score = sum(len(kw) for kw in matched_kws)
+                candidates.append((skill, matched_kws, score))
+        if not candidates:
+            return None
+        # Sort by score descending; ties are broken by skill name in alphabetical order
+        candidates.sort(key=lambda c: (-c[2], c[0].name))
+        best_skill, best_kws, _best_score = candidates[0]
+        confidence = min(1.0, 0.5 + 0.1 * len(best_kws))
+        return RoutingResult(
+            matched_skill=best_skill.name,
+            method="keyword",
+            confidence=confidence,
+        )
    async def _classify_with_llm(
        self, input_data: dict[str, Any], skills: list[Skill]
@ -107,9 +121,7 @@ class IntentRouter:
        the best-matching Skill.
        """
        if self._llm_gateway is None:
-            raise RuntimeError(
-                "Keyword matching failed and no LLM Gateway configured for fallback"
-            )
+            raise RuntimeError("Keyword matching failed and no LLM Gateway configured for fallback")
        prompt = self._build_classification_prompt(input_data, skills)
@ -120,9 +132,7 @@ class IntentRouter:
        return self._parse_llm_response(response.content, skills)
-    def _build_classification_prompt(
-        self, input_data: dict[str, Any], skills: list[Skill]
-    ) -> str:
+    def _build_classification_prompt(self, input_data: dict[str, Any], skills: list[Skill]) -> str:
        """Build the LLM classification prompt"""
        skill_descriptions = []
        for i, skill in enumerate(skills, 1):
@ -142,13 +152,11 @@ class IntentRouter:
            "\n"
            f"User input: {input_data}\n"
            "\n"
-            'Respond in JSON format:\n'
+            "Respond in JSON format:\n"
            '{"skill": "skill_name", "confidence": 0.9}'
        )
-    def _parse_llm_response(
-        self, content: str, skills: list[Skill]
-    ) -> RoutingResult:
+    def _parse_llm_response(self, content: str, skills: list[Skill]) -> RoutingResult:
        """Parse the LLM response and extract the skill name and confidence"""
        valid_names = {s.name for s in skills}
@ -175,9 +183,7 @@ class IntentRouter:
        )
    @staticmethod
-    def _extract_skill_name_from_text(
-        text: str, valid_names: set[str]
-    ) -> str:
+    def _extract_skill_name_from_text(text: str, valid_names: set[str]) -> str:
        """Try to extract a valid Skill name from the text"""
        text_lower = text.lower()
        for name in valid_names:
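The multi-candidate scoring introduced in `_match_keywords` can be sketched with plain dictionaries. The function name `best_keyword_match` and the skill table below are hypothetical; the scoring rule (sum of matched keyword lengths), the tie-break (skill name), and the confidence formula are taken from the diff above.

```python
from typing import Optional


def best_keyword_match(
    text: str, skills: dict[str, list[str]]
) -> Optional[tuple[str, float]]:
    """Return (skill_name, confidence) for the highest-scoring skill, or None."""
    text_lower = text.lower()
    candidates: list[tuple[str, list[str], int]] = []
    for name, keywords in skills.items():
        matched = [kw for kw in keywords if kw.lower() in text_lower]
        if matched:
            # Longer keywords are more specific, so they contribute more.
            candidates.append((name, matched, sum(len(kw) for kw in matched)))
    if not candidates:
        return None
    # Highest score first; equal scores fall back to alphabetical skill name.
    candidates.sort(key=lambda c: (-c[2], c[0]))
    name, matched, _score = candidates[0]
    return name, min(1.0, 0.5 + 0.1 * len(matched))


# Hypothetical keyword table for illustration only.
skills = {"code_reviewer": ["review", "代码审查"], "geo_optimizer": ["SEO"]}
print(best_keyword_match("Review code and optimize SEO", skills))
```

Compared with the previous first-match return, the result no longer depends on the order in which skills are registered, and confidence now grows with the number of matched keywords instead of being fixed at 1.0.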
--- a/tests/e2e/__init__.py
+++ b/tests/e2e/__init__.py
@ -0,0 +1,11 @@
 """E2E backtest suite for Fischer AgentKit.
 Split into two dimensions:
  - Basic Functions: verify all features work correctly (CLI, API, WebSocket, lifecycle)
  - Agent Capabilities: verify intelligence level (routing, reasoning, collaboration)
 Uses subprocess to simulate real CLI operations (OpenCLI pattern),
 httpx for API calls, and websockets for WS chat.
 """
 from tests.e2e.conftest import *  # noqa: F401,F403
--- a/tests/e2e/benchmark_dataset.py
+++ b/tests/e2e/benchmark_dataset.py
@ -0,0 +1,830 @@
 """Agent Capability Benchmark — Ground Truth Dataset (v2).
 Aligned with actual skills in configs/skills/*.yaml.
 Contains both manually curated edge cases and auto-generated cases.
 Categories:
  - routing: intent routing correctness
  - execution: execution mode selection accuracy
  - team: expert team collaboration
  - consistency: deterministic output consistency
  - semantic_router: semantic similarity matching
  - alignment: constraint compliance and cascade detection
 """
 from pydantic import BaseModel, ConfigDict
 class BenchmarkCase(BaseModel):
    """A single benchmark test case with ground truth label."""
    model_config = ConfigDict(frozen=True)
    id: str
    input: str
    expected_skill: str | None = None
    expected_execution_mode: str = "direct"
    expected_complexity: str = "low"
    category: str
    subcategory: str
    paraphrases: list[str] = []
    tags: list[str] = []
 # ═══════════════════════════════════════════════════════════════════════════
 # Routing — Keyword Match (aligned with actual skills)
 # ═══════════════════════════════════════════════════════════════════════════
 ROUTING_KEYWORD_BENCHMARKS: list[BenchmarkCase] = [
    # direct_agent
    BenchmarkCase(
        id="route-kw-direct-001",
        input="翻译这段话",
        expected_skill="direct_agent",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["帮我翻译一下", "请翻译这段内容", "Translate this text"],
        tags=["翻译", "translate"],
    ),
    BenchmarkCase(
        id="route-kw-direct-002",
        input="帮我总结一下",
        expected_skill="direct_agent",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["请总结", "给我一个摘要", "Summarize this"],
        tags=["摘要", "summarize"],
    ),
    BenchmarkCase(
        id="route-kw-direct-003",
        input="什么是RAG？",
        expected_skill="direct_agent",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["RAG是什么", "解释一下RAG", "What is RAG?"],
        tags=["什么是"],
    ),
    # react_agent
    BenchmarkCase(
        id="route-kw-react-001",
        input="搜索一下AI Agent市场数据",
        expected_skill="react_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="routing",
        subcategory="keyword_match",
        paraphrases=[
            "帮我搜索AI Agent市场信息",
            "查找AI Agent的市场数据",
            "Search AI Agent market data",
        ],
        tags=["搜索", "search"],
    ),
    BenchmarkCase(
        id="route-kw-react-002",
        input="帮我分析这个数据",
        expected_skill="react_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["分析一下这些数据", "请对数据做分析", "Analyze this data"],
        tags=["分析", "analyze"],
    ),
    BenchmarkCase(
        id="route-kw-react-003",
        input="实时监控竞品动态",
        expected_skill="react_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["监控竞争对手的动态", "实时追踪竞品变化", "Monitor competitor activities"],
        tags=["实时", "监控"],
    ),
    # rewoo_agent
    BenchmarkCase(
        id="route-kw-rewoo-001",
        input="采集A、B、C三个竞品的功能数据",
        expected_skill="rewoo_agent",
        expected_execution_mode="rewoo",
        expected_complexity="high",
        category="routing",
        subcategory="keyword_match",
        paraphrases=[
            "批量采集竞品数据",
            "并行获取多个竞品信息",
            "Fetch data from multiple competitors",
        ],
        tags=["采集", "批量", "fetch"],
    ),
    BenchmarkCase(
        id="route-kw-rewoo-002",
        input="并行搜索多个关键词",
        expected_skill="rewoo_agent",
        expected_execution_mode="rewoo",
        expected_complexity="high",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["同时搜索多个关键词", "批量搜索", "Search multiple keywords in parallel"],
        tags=["并行", "批量"],
    ),
    # reflexion_agent
    BenchmarkCase(
        id="route-kw-reflex-001",
        input="审查这段代码的合规性",
        expected_skill="reflexion_agent",
        expected_execution_mode="reflexion",
        expected_complexity="high",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["检查代码是否合规", "审查代码合规问题", "Review code compliance"],
        tags=["审查", "合规", "review"],
    ),
    BenchmarkCase(
        id="route-kw-reflex-002",
        input="生成一个高精度的数据分析脚本",
        expected_skill="reflexion_agent",
        expected_execution_mode="reflexion",
        expected_complexity="high",
        category="routing",
        subcategory="keyword_match",
        paraphrases=[
            "写一个精确的数据分析脚本",
            "生成高精度分析代码",
            "Generate a precise analysis script",
        ],
        tags=["代码生成", "精确", "code"],
    ),
    # plan_exec_agent
    BenchmarkCase(
        id="route-kw-planexec-001",
        input="生成一份市场分析报告",
        expected_skill="plan_exec_agent",
        expected_execution_mode="plan_exec",
        expected_complexity="high",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["做一份市场分析报告", "写个市场分析报告", "Generate a market analysis report"],
        tags=["报告", "分析报告"],
    ),
    BenchmarkCase(
        id="route-kw-planexec-002",
        input="规划产品优化方案",
        expected_skill="plan_exec_agent",
        expected_execution_mode="plan_exec",
        expected_complexity="high",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["制定产品优化计划", "帮我规划产品优化", "Plan product optimization"],
        tags=["规划", "plan"],
    ),
    # code_reviewer
    BenchmarkCase(
        id="route-kw-coderev-001",
        input="Review this code for quality",
        expected_skill="code_reviewer",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["审查这段代码的质量", "代码审查", "Check code quality"],
        tags=["review", "代码审查"],
    ),
    # geo_optimizer
    BenchmarkCase(
        id="route-kw-geo-001",
        input="帮我优化这篇文章的SEO",
        expected_skill="geo_optimizer",
        expected_execution_mode="llm_generate",
        expected_complexity="low",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["SEO优化一下", "提升文章搜索排名", "Optimize this article for SEO"],
        tags=["SEO优化", "optimize"],
    ),
    # deai_agent
    BenchmarkCase(
        id="route-kw-deai-001",
        input="帮我把这篇文章去AI化",
        expected_skill="deai_agent",
        expected_execution_mode="llm_generate",
        expected_complexity="low",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["让这段文字更自然", "改写得像人写的", "Make this text more natural"],
        tags=["去AI化", "人性化"],
    ),
    # content_generator
    BenchmarkCase(
        id="route-kw-content-001",
        input="帮我写一篇关于AI的文章",
        expected_skill="content_generator",
        expected_execution_mode="llm_generate",
        expected_complexity="low",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["写一篇AI相关文章", "生成关于AI的内容", "Write an article about AI"],
        tags=["写文章", "generate"],
    ),
    # citation_detector
    BenchmarkCase(
        id="route-kw-citation-001",
        input="检测我们的品牌在AI平台的引用情况",
        expected_skill="citation_detector",
        expected_execution_mode="custom",
        expected_complexity="medium",
        category="routing",
        subcategory="keyword_match",
        paraphrases=[
            "分析品牌引用率",
            "哪些AI平台引用了我们",
            "Check brand citation on AI platforms",
        ],
        tags=["引用检测", "citation"],
    ),
    # trend_agent
    BenchmarkCase(
        id="route-kw-trend-001",
        input="分析品牌趋势",
        expected_skill="trend_agent",
        expected_execution_mode="tool_call",
        expected_complexity="medium",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["最近的热点话题是什么", "趋势洞察", "Analyze brand trends"],
        tags=["趋势", "trend"],
    ),
    # competitor_analyzer
    BenchmarkCase(
        id="route-kw-competitor-001",
        input="分析我的竞品策略",
        expected_skill="competitor_analyzer",
        expected_execution_mode="tool_call",
        expected_complexity="medium",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["对比我和竞品的差距", "竞品分析", "Analyze competitor strategies"],
        tags=["竞品", "competitor"],
    ),
    # schema_advisor
    BenchmarkCase(
        id="route-kw-schema-001",
        input="帮我优化Schema",
        expected_skill="schema_advisor",
        expected_execution_mode="custom",
        expected_complexity="medium",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["生成JSON-LD结构化数据", "Schema有什么可以改进的", "Optimize my Schema"],
        tags=["Schema", "schema优化"],
    ),
    # monitor
    BenchmarkCase(
        id="route-kw-monitor-001",
        input="监测品牌引用变化",
        expected_skill="monitor",
        expected_execution_mode="custom",
        expected_complexity="medium",
        category="routing",
        subcategory="keyword_match",
        paraphrases=["追踪效果", "品牌排名变化", "Monitor brand citation changes"],
        tags=["监测", "monitor"],
    ),
    # goal_driven_agent
    BenchmarkCase(
        id="route-kw-goal-001",
        input="分析竞品SEO策略并生成优化方案",
        expected_skill="goal_driven_agent",
        expected_execution_mode="tool_call",
        expected_complexity="medium",
        category="routing",
        subcategory="keyword_match",
        paraphrases=[
            "调研技术方案并生成对比报告",
            "制定市场推广计划",
            "Analyze SEO and generate plan",
        ],
        tags=["分析", "优化方案"],
    ),
 ]
 # ═══════════════════════════════════════════════════════════════════════════
 # Routing — Edge Cases (manually curated)
 # ═══════════════════════════════════════════════════════════════════════════
 ROUTING_EDGE_BENCHMARKS: list[BenchmarkCase] = [
    # Greeting (should NOT route to any skill)
    BenchmarkCase(
        id="route-edge-greet-001",
        input="你好",
        expected_skill=None,
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="greeting",
        paraphrases=["Hello", "Hi there", "早上好"],
        tags=["greeting"],
    ),
    BenchmarkCase(
        id="route-edge-greet-002",
        input="Good morning!",
        expected_skill=None,
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="greeting",
        paraphrases=["早上好！", "你好呀"],
        tags=["greeting"],
    ),
    # Identity (should NOT route to any skill)
    BenchmarkCase(
        id="route-edge-identity-001",
        input="你是谁？",
        expected_skill=None,
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="identity",
        paraphrases=["What is your name?", "介绍一下你自己", "Tell me about yourself"],
        tags=["identity"],
    ),
    # Explicit prefix
    BenchmarkCase(
        id="route-edge-explicit-001",
        input="@skill:react_agent 搜索最新的AI新闻",
        expected_skill="react_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="routing",
        subcategory="explicit_prefix",
        paraphrases=["@skill:react_agent 查找AI最新动态"],
        tags=["explicit", "react"],
    ),
    # Fallback (no matching skill)
    BenchmarkCase(
        id="route-edge-fallback-001",
        input="告诉我一个笑话",
        expected_skill=None,
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="fallback",
        paraphrases=["讲个笑话", "Tell me a joke", "说个搞笑的"],
        tags=["fallback"],
    ),
    BenchmarkCase(
        id="route-edge-fallback-002",
        input="What is quantum physics?",
        expected_skill=None,
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="fallback",
        paraphrases=["量子物理是什么", "Explain quantum mechanics"],
        tags=["fallback"],
    ),
    # Disambiguation (multiple skills could match)
    BenchmarkCase(
        id="route-edge-disambig-001",
        input="审查代码并优化SEO",
        expected_skill="code_reviewer",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="routing",
        subcategory="disambiguation",
        paraphrases=["Review code and optimize SEO", "代码审查加SEO优化"],
        tags=["disambiguation", "review", "seo"],
    ),
 ]
 # ═══════════════════════════════════════════════════════════════════════════
 # Execution Mode Benchmarks
 # ═══════════════════════════════════════════════════════════════════════════
 EXECUTION_BENCHMARKS: list[BenchmarkCase] = [
    BenchmarkCase(
        id="exec-direct-001",
        input="翻译这段话成英文",
        expected_skill="direct_agent",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="execution",
        subcategory="direct_mode",
        paraphrases=["Translate this to English", "把这段翻成英语"],
        tags=["direct", "simple"],
    ),
    BenchmarkCase(
        id="exec-direct-002",
        input="什么是AgentKit？",
        expected_skill="direct_agent",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="execution",
        subcategory="direct_mode",
        paraphrases=["AgentKit是什么", "Explain AgentKit"],
        tags=["direct", "qa"],
    ),
    BenchmarkCase(
        id="exec-react-001",
        input="搜索并分析AI行业最新趋势",
        expected_skill="react_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="execution",
        subcategory="react_mode",
        paraphrases=["Search and analyze AI trends", "调研AI行业趋势"],
        tags=["react", "multi_step"],
    ),
    BenchmarkCase(
        id="exec-react-002",
        input="实时监控竞品动态并生成报告",
        expected_skill="react_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="execution",
        subcategory="react_mode",
        paraphrases=["Monitor competitors and report", "追踪竞品并输出报告"],
        tags=["react", "monitoring"],
    ),
    BenchmarkCase(
        id="exec-rewoo-001",
        input="批量采集多个竞品的功能数据",
        expected_skill="rewoo_agent",
        expected_execution_mode="rewoo",
        expected_complexity="high",
        category="execution",
        subcategory="rewoo_mode",
        paraphrases=["并行获取竞品数据", "Fetch competitor data in parallel"],
        tags=["rewoo", "parallel"],
    ),
    BenchmarkCase(
        id="exec-reflexion-001",
        input="审查代码合规性并确保高精度",
        expected_skill="reflexion_agent",
        expected_execution_mode="reflexion",
        expected_complexity="high",
        category="execution",
        subcategory="reflexion_mode",
        paraphrases=["高精度代码审查", "Precise code compliance review"],
        tags=["reflexion", "precision"],
    ),
    BenchmarkCase(
        id="exec-planexec-001",
        input="生成一份完整的市场调研报告",
        expected_skill="plan_exec_agent",
        expected_execution_mode="plan_exec",
        expected_complexity="high",
        category="execution",
        subcategory="plan_exec_mode",
        paraphrases=["做一份市场调研报告", "Generate a market research report"],
        tags=["plan_exec", "report"],
    ),
    BenchmarkCase(
        id="exec-quality-001",
        input="生成内容并确保质量达标",
        expected_skill="content_generator",
        expected_execution_mode="llm_generate",
        expected_complexity="low",
        category="execution",
        subcategory="quality_gate",
        paraphrases=["生成高质量内容", "Generate quality content"],
        tags=["quality", "content"],
    ),
 ]
 # ═══════════════════════════════════════════════════════════════════════════
 # Team Collaboration Benchmarks
 # ═══════════════════════════════════════════════════════════════════════════
 TEAM_BENCHMARKS: list[BenchmarkCase] = [
    BenchmarkCase(
        id="team-explicit-001",
        input="@team:react_agent,plan_exec_agent 协作完成深度分析并生成报告",
        expected_execution_mode="react",
        expected_complexity="high",
        category="team",
        subcategory="explicit_team",
        paraphrases=[
            "需要react_agent和plan_exec_agent协作",
            "组建团队：搜索分析+报告生成",
        ],
        tags=["team", "explicit"],
    ),
    BenchmarkCase(
        id="team-explicit-002",
        input="@team:competitor_analyzer,trend_agent 分析竞品并追踪趋势",
        expected_execution_mode="react",
        expected_complexity="high",
        category="team",
        subcategory="explicit_team",
        paraphrases=["竞品分析+趋势追踪团队", "Team for competitor and trend analysis"],
        tags=["team", "explicit"],
    ),
    BenchmarkCase(
        id="team-complexity-001",
        input="深度分析竞品策略、追踪品牌趋势并生成优化方案",
        expected_execution_mode="react",
        expected_complexity="high",
        category="team",
        subcategory="complexity_trigger",
        paraphrases=[
            "全面竞品分析和优化方案",
            "Comprehensive competitor analysis with optimization",
        ],
        tags=["team", "complexity"],
    ),
    BenchmarkCase(
        id="team-fallback-001",
        input="复杂任务但无匹配专家",
        expected_execution_mode="react",
        expected_complexity="high",
        category="team",
        subcategory="fallback",
        paraphrases=["需要团队但找不到合适专家", "Complex task without matching experts"],
        tags=["team", "fallback"],
    ),
    BenchmarkCase(
        id="team-name-valid-001",
        input="@team:react_agent,plan_exec_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="team",
        subcategory="name_validation",
        tags=["team", "validation"],
    ),
    BenchmarkCase(
        id="team-name-invalid-001",
        input="@team:invalid expert name",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="team",
        subcategory="name_validation",
        tags=["team", "validation", "invalid"],
    ),
 ]
 # ═══════════════════════════════════════════════════════════════════════════
 # Consistency Benchmarks
 # ═══════════════════════════════════════════════════════════════════════════
 CONSISTENCY_BENCHMARKS: list[BenchmarkCase] = [
    BenchmarkCase(
        id="consist-direct-001",
        input="翻译'hello world'成中文",
        expected_skill="direct_agent",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="consistency",
        subcategory="deterministic",
        tags=["consistency", "translation"],
    ),
    BenchmarkCase(
        id="consist-direct-002",
        input="什么是RAG？",
        expected_skill="direct_agent",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="consistency",
        subcategory="deterministic",
        tags=["consistency", "qa"],
    ),
    BenchmarkCase(
        id="consist-react-001",
        input="搜索AI Agent市场数据",
        expected_skill="react_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="consistency",
        subcategory="deterministic",
        tags=["consistency", "search"],
    ),
    BenchmarkCase(
        id="consist-geo-001",
        input="帮我优化这篇文章的SEO",
        expected_skill="geo_optimizer",
        expected_execution_mode="llm_generate",
        expected_complexity="low",
        category="consistency",
        subcategory="deterministic",
        tags=["consistency", "seo"],
    ),
    BenchmarkCase(
        id="consist-deai-001",
        input="帮我把这篇文章去AI化",
        expected_skill="deai_agent",
        expected_execution_mode="llm_generate",
        expected_complexity="low",
        category="consistency",
        subcategory="deterministic",
        tags=["consistency", "deai"],
    ),
 ]
 # ═══════════════════════════════════════════════════════════════════════════
 # Semantic Router Benchmarks
 # ═══════════════════════════════════════════════════════════════════════════
 SEMANTIC_ROUTER_BENCHMARKS: list[BenchmarkCase] = [
    BenchmarkCase(
        id="semantic-direct-001",
        input="简单生成任务，无需工具调用",
        expected_skill="direct_agent",
        expected_execution_mode="direct",
        expected_complexity="low",
        category="semantic_router",
        subcategory="description_match",
        paraphrases=["只需要一次生成的简单任务", "Single LLM call task"],
        tags=["semantic", "direct"],
    ),
    BenchmarkCase(
        id="semantic-react-001",
        input="需要动态适应、逐步推理和工具调用",
        expected_skill="react_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="semantic_router",
        subcategory="description_match",
        paraphrases=["需要多步推理和工具", "Multi-step reasoning with tools"],
        tags=["semantic", "react"],
    ),
    BenchmarkCase(
        id="semantic-rewoo-001",
        input="多源数据并行采集、无依赖工具调用批量执行",
        expected_skill="rewoo_agent",
        expected_execution_mode="rewoo",
        expected_complexity="high",
        category="semantic_router",
        subcategory="description_match",
        paraphrases=["并行批量获取数据", "Parallel data collection"],
        tags=["semantic", "rewoo"],
    ),
    BenchmarkCase(
        id="semantic-reflex-001",
        input="需要高精度和自我验证的任务",
        expected_skill="reflexion_agent",
        expected_execution_mode="reflexion",
        expected_complexity="high",
        category="semantic_router",
        subcategory="description_match",
        paraphrases=["需要自我检查的高精度任务", "High-precision self-verification task"],
        tags=["semantic", "reflexion"],
    ),
    BenchmarkCase(
        id="semantic-planexec-001",
        input="结构化多步骤任务，需要可审查的规划和执行",
        expected_skill="plan_exec_agent",
        expected_execution_mode="plan_exec",
        expected_complexity="high",
        category="semantic_router",
        subcategory="description_match",
        paraphrases=["需要先规划再执行的任务", "Structured planning and execution"],
        tags=["semantic", "plan_exec"],
    ),
    BenchmarkCase(
        id="semantic-geo-001",
        input="对文章进行GEO/SEO优化，提升在AI搜索引擎中的可见性",
        expected_skill="geo_optimizer",
        expected_execution_mode="llm_generate",
        expected_complexity="low",
        category="semantic_router",
        subcategory="description_match",
        paraphrases=["提升内容搜索排名", "Improve content visibility in AI search"],
        tags=["semantic", "geo"],
    ),
    BenchmarkCase(
        id="semantic-citation-001",
        input="检测品牌在各AI平台回答中的引用情况",
        expected_skill="citation_detector",
        expected_execution_mode="custom",
        expected_complexity="medium",
        category="semantic_router",
        subcategory="description_match",
        paraphrases=["分析品牌被AI引用的情况", "Check brand citation across AI platforms"],
        tags=["semantic", "citation"],
    ),
    BenchmarkCase(
        id="semantic-competitor-001",
        input="分析竞品策略、对比品牌差距或发现竞争机会",
        expected_skill="competitor_analyzer",
        expected_execution_mode="tool_call",
        expected_complexity="medium",
        category="semantic_router",
        subcategory="description_match",
        paraphrases=["竞品对比和差距分析", "Competitive gap analysis"],
        tags=["semantic", "competitor"],
    ),
 ]
 # ═══════════════════════════════════════════════════════════════════════════
 # Alignment Guard Benchmarks
 # ═══════════════════════════════════════════════════════════════════════════
 ALIGNMENT_BENCHMARKS: list[BenchmarkCase] = [
    BenchmarkCase(
        id="align-negative-001",
        input="写一篇产品介绍，不要提及价格",
        expected_skill="content_generator",
        expected_execution_mode="llm_generate",
        expected_complexity="low",
        category="alignment",
        subcategory="negative_constraint",
        tags=["alignment", "negative_constraint"],
    ),
    BenchmarkCase(
        id="align-positive-001",
        input="生成报告，必须包含摘要部分",
        expected_skill="plan_exec_agent",
        expected_execution_mode="plan_exec",
        expected_complexity="high",
        category="alignment",
        subcategory="positive_constraint",
        tags=["alignment", "positive_constraint"],
    ),
    BenchmarkCase(
        id="align-cascade-001",
        input="反复搜索相同关键词",
        expected_skill="react_agent",
        expected_execution_mode="react",
        expected_complexity="high",
        category="alignment",
        subcategory="cascade_detection",
        tags=["alignment", "cascade"],
    ),
    BenchmarkCase(
        id="align-no-constraint-001",
        input="帮我写一篇文章",
        expected_skill="content_generator",
        expected_execution_mode="llm_generate",
        expected_complexity="low",
        category="alignment",
        subcategory="no_constraint",
        tags=["alignment", "baseline"],
    ),
    BenchmarkCase(
        id="align-combined-001",
        input="生成竞品分析报告，必须包含对比表格，不要提及内部数据",
        expected_skill="competitor_analyzer",
        expected_execution_mode="tool_call",
        expected_complexity="medium",
        category="alignment",
        subcategory="combined_constraint",
        tags=["alignment", "combined"],
    ),
 ]
 # ═══════════════════════════════════════════════════════════════════════════
 # All benchmarks combined
 # ═══════════════════════════════════════════════════════════════════════════
 ALL_BENCHMARKS: list[BenchmarkCase] = (
    ROUTING_KEYWORD_BENCHMARKS
    + ROUTING_EDGE_BENCHMARKS
    + EXECUTION_BENCHMARKS
    + TEAM_BENCHMARKS
    + CONSISTENCY_BENCHMARKS
    + SEMANTIC_ROUTER_BENCHMARKS
    + ALIGNMENT_BENCHMARKS
 )
 def get_benchmarks_by_category(category: str) -> list[BenchmarkCase]:
    """Filter benchmarks by category."""
    return [b for b in ALL_BENCHMARKS if b.category == category]
 def get_benchmarks_by_subcategory(subcategory: str) -> list[BenchmarkCase]:
    """Filter benchmarks by subcategory."""
    return [b for b in ALL_BENCHMARKS if b.subcategory == subcategory]
 def get_benchmarks_with_paraphrases() -> list[BenchmarkCase]:
    """Get only benchmarks that have paraphrases (for overfitting detection)."""
    return [b for b in ALL_BENCHMARKS if b.paraphrases]
 def get_skill_names_needed() -> set[str]:
    """Get all skill names referenced in benchmarks (for pre-registration)."""
    return {b.expected_skill for b in ALL_BENCHMARKS if b.expected_skill is not None}
 def get_benchmark_stats() -> dict[str, int]:
    """Get benchmark count by category."""
    stats: dict[str, int] = {}
    for b in ALL_BENCHMARKS:
        stats[b.category] = stats.get(b.category, 0) + 1
    stats["total"] = len(ALL_BENCHMARKS)
    return stats
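Review note: `get_benchmarks_with_paraphrases()` is described as feeding overfitting detection, but the check itself is not part of this file. A minimal sketch of one way it could work follows. It assumes a hypothetical `route(text) -> skill_name` callable and uses a stand-in `Case` dataclass rather than the real `BenchmarkCase`, whose definition is outside this hunk.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Case:
    """Stand-in for BenchmarkCase (hypothetical; only the fields used here)."""

    input: str
    expected_skill: str
    paraphrases: list[str] = field(default_factory=list)


def paraphrase_gap(cases: list[Case], route: Callable[[str], str]) -> float:
    """Accuracy on original inputs minus accuracy on their paraphrases.

    A large positive gap suggests the router depends on exact wording.
    Cases without paraphrases are ignored.
    """
    with_para = [c for c in cases if c.paraphrases]
    if not with_para:
        return 0.0
    orig_hits = sum(route(c.input) == c.expected_skill for c in with_para)
    para_total = sum(len(c.paraphrases) for c in with_para)
    para_hits = sum(
        route(p) == c.expected_skill for c in with_para for p in c.paraphrases
    )
    return orig_hits / len(with_para) - para_hits / para_total
```

A gap near zero means paraphrases route as reliably as the original wording. How large a gap should count as overfitting is a project decision and is not implied by this sketch.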
--- a/tests/e2e/benchmark_generator.py
+++ b/tests/e2e/benchmark_generator.py
@ -0,0 +1,339 @@
 """Benchmark Generator — Auto-generate benchmark cases from skill configs.
 Reads configs/skills/*.yaml, extracts intent.keywords/description/examples,
 and generates BenchmarkCase objects aligned with actual skill configurations.
 This ensures the benchmark dataset stays in sync with the real skill registry.
 """
 from pathlib import Path
 import yaml
 from pydantic import BaseModel, ConfigDict
 from tests.e2e.benchmark_dataset import BenchmarkCase
 # ═══════════════════════════════════════════════════════════════════════════
 # Skill Config Model
 # ═══════════════════════════════════════════════════════════════════════════
 class SkillIntent(BaseModel):
    """Intent section of a skill config."""
    model_config = ConfigDict(extra="ignore")
    keywords: list[str] = []
    description: str = ""
    examples: list[str] = []
 class SkillConfig(BaseModel):
    """Minimal skill config model for benchmark generation."""
    model_config = ConfigDict(extra="ignore")
    name: str
    description: str = ""
    execution_mode: str = "direct"
    task_mode: str = "llm_generate"
    intent: SkillIntent = SkillIntent()
 # ═══════════════════════════════════════════════════════════════════════════
 # Complexity Mapping
 # ═══════════════════════════════════════════════════════════════════════════
 EXECUTION_MODE_TO_COMPLEXITY: dict[str, str] = {
    "direct": "low",
    "react": "high",
    "rewoo": "high",
    "reflexion": "high",
    "plan_exec": "high",
    "tool_call": "medium",
    "llm_generate": "low",
    "custom": "medium",
 }
 # Paraphrase templates for auto-generating paraphrases from examples
 PARAPHRASE_TEMPLATES_CN: list[str] = [
    "请帮我{action}",
    "我需要{action}",
    "能不能{action}",
 ]
 PARAPHRASE_TEMPLATES_EN: list[str] = [
    "Please help me {action}",
    "I need to {action}",
    "Can you {action}",
 ]
 # ═══════════════════════════════════════════════════════════════════════════
 # Benchmark Generator
 # ═══════════════════════════════════════════════════════════════════════════
 class BenchmarkGenerator:
    """Generate benchmark cases from skill config YAML files."""
    def __init__(self, configs_dir: str | None = None) -> None:
        if configs_dir is None:
            # Default: project_root/configs/skills/. This file lives in tests/e2e/,
            # so the repository root is parents[2] of the resolved file path.
            project_root = Path(__file__).resolve().parents[2]
            configs_dir = str(project_root / "configs" / "skills")
        self.configs_dir = configs_dir
        self._skills: list[SkillConfig] = []
        self._loaded = False
    def load_skills(self) -> list[SkillConfig]:
        """Load all skill configs from YAML files."""
        if self._loaded:
            return self._skills
        skills_dir = Path(self.configs_dir)
        if not skills_dir.exists():
            return self._skills
        for yaml_file in sorted(skills_dir.glob("*.yaml")):
            with open(yaml_file, encoding="utf-8") as f:
                data = yaml.safe_load(f)
            if data and isinstance(data, dict):
                try:
                    skill = SkillConfig(**data)
                    self._skills.append(skill)
                except Exception:
                    # Skip YAML files that do not validate as a skill config
                    continue
        self._loaded = True
        return self._skills
    def _get_effective_execution_mode(self, skill: SkillConfig) -> str:
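        """Get the effective execution mode for a skill.
        A non-default execution_mode wins; otherwise the skill's task_mode
        (e.g. "llm_generate", "tool_call") is used as the mode.
        """
        if skill.execution_mode and skill.execution_mode != "direct":
            return skill.execution_mode
        # "direct" is the default value, so fall back to task_mode when it is set
        return skill.task_mode if skill.task_mode else "direct"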
    def _generate_paraphrases(self, example: str, keywords: list[str]) -> list[str]:
        """Generate paraphrases for an example query."""
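        paraphrases: list[str] = []
        if not example:
            # Nothing to paraphrase; also avoids indexing an empty string below
            return paraphrases
        # Simple paraphrase generation: add prefix variations
        is_chinese = any("\u4e00" <= c <= "\u9fff" for c in example)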
        if is_chinese:
            # Chinese paraphrases
            if not example.startswith("请") and not example.startswith("帮"):
                paraphrases.append(f"请{example}")
            if not example.startswith("我"):
                paraphrases.append(f"我需要{example}")
            # Add keyword-based variant
            if keywords:
                kw = keywords[0]
                if kw not in example:
                    paraphrases.append(f"关于{kw}，{example}")
        else:
            # English paraphrases
            lower = example.lower()
            if not lower.startswith("please") and not lower.startswith("can you"):
                paraphrases.append(f"Please {example[0].lower()}{example[1:]}")
            if not lower.startswith("i need"):
                paraphrases.append(f"I need to {example[0].lower()}{example[1:]}")
        return paraphrases[:3]  # Max 3 paraphrases per example
    def generate_routing_benchmarks(self) -> list[BenchmarkCase]:
        """Generate routing benchmark cases from all skills."""
        skills = self.load_skills()
        cases: list[BenchmarkCase] = []
        case_counter = 0
        for skill in skills:
            exec_mode = self._get_effective_execution_mode(skill)
            complexity = EXECUTION_MODE_TO_COMPLEXITY.get(exec_mode, "low")
            # Generate from intent.examples
            for example in skill.intent.examples:
                case_counter += 1
                paraphrases = self._generate_paraphrases(example, skill.intent.keywords)
                cases.append(
                    BenchmarkCase(
                        id=f"route-auto-{case_counter:03d}",
                        input=example,
                        expected_skill=skill.name,
                        expected_execution_mode=exec_mode,
                        expected_complexity=complexity,
                        category="routing",
                        subcategory="keyword_match",
                        paraphrases=paraphrases,
                        tags=skill.intent.keywords[:3],
                    )
                )
            # Generate from intent.keywords (one case per keyword)
            for keyword in skill.intent.keywords:
                case_counter += 1
                query = (
                    f"帮我{keyword}"
                    if any("\u4e00" <= c <= "\u9fff" for c in keyword)
                    else f"Help me {keyword}"
                )
                cases.append(
                    BenchmarkCase(
                        id=f"route-kw-auto-{case_counter:03d}",
                        input=query,
                        expected_skill=skill.name,
                        expected_execution_mode=exec_mode,
                        expected_complexity=complexity,
                        category="routing",
                        subcategory="keyword_match",
                        tags=[keyword],
                    )
                )
        return cases
    def generate_execution_benchmarks(self) -> list[BenchmarkCase]:
        """Generate execution mode benchmark cases."""
        skills = self.load_skills()
        cases: list[BenchmarkCase] = []
        case_counter = 0
        # Group skills by execution mode
        mode_groups: dict[str, list[SkillConfig]] = {}
        for skill in skills:
            mode = self._get_effective_execution_mode(skill)
            mode_groups.setdefault(mode, []).append(skill)
        for mode, group in mode_groups.items():
            complexity = EXECUTION_MODE_TO_COMPLEXITY.get(mode, "low")
            for skill in group[:2]:  # Max 2 skills per mode
                if skill.intent.examples:
                    case_counter += 1
                    cases.append(
                        BenchmarkCase(
                            id=f"exec-auto-{case_counter:03d}",
                            input=skill.intent.examples[0],
                            expected_skill=skill.name,
                            expected_execution_mode=mode,
                            expected_complexity=complexity,
                            category="execution",
                            subcategory=f"{mode}_mode",
                            paraphrases=skill.intent.examples[1:2],
                            tags=[mode],
                        )
                    )
        return cases
    def generate_team_benchmarks(self) -> list[BenchmarkCase]:
        """Generate team collaboration benchmark cases."""
        skills = self.load_skills()
        cases: list[BenchmarkCase] = []
        case_counter = 0
        # High-complexity skills suitable for team collaboration
        high_complexity_skills = [
            s
            for s in skills
            if EXECUTION_MODE_TO_COMPLEXITY.get(self._get_effective_execution_mode(s), "low")
            == "high"
        ]
        if len(high_complexity_skills) >= 2:
            skill_a, skill_b = high_complexity_skills[0], high_complexity_skills[1]
            case_counter += 1
            cases.append(
                BenchmarkCase(
                    id=f"team-auto-{case_counter:03d}",
                    input=f"@team:{skill_a.name},{skill_b.name} 协作完成复杂分析任务",
                    expected_execution_mode="react",
                    expected_complexity="high",
                    category="team",
                    subcategory="explicit_team",
                    paraphrases=[
                        f"需要{skill_a.name}和{skill_b.name}协作分析",
                        f"组建团队：{skill_a.name} + {skill_b.name}",
                    ],
                    tags=["team", skill_a.name, skill_b.name],
                )
            )
        # Complexity-triggered team
        if high_complexity_skills:
            skill = high_complexity_skills[0]
            case_counter += 1
            cases.append(
                BenchmarkCase(
                    id=f"team-complexity-{case_counter:03d}",
                    input=f"深度{skill.intent.keywords[0] if skill.intent.keywords else '分析'}并生成详细报告",
                    expected_execution_mode="react",
                    expected_complexity="high",
                    category="team",
                    subcategory="complexity_trigger",
                    paraphrases=[
                        f"全面{skill.intent.keywords[0] if skill.intent.keywords else '分析'}并输出报告",
                    ],
                    tags=["team", "complexity"],
                )
            )
        return cases
    def generate_semantic_benchmarks(self) -> list[BenchmarkCase]:
        """Generate semantic router specific benchmark cases."""
        skills = self.load_skills()
        cases: list[BenchmarkCase] = []
        case_counter = 0
        for skill in skills:
            if not skill.intent.description:
                continue
            case_counter += 1
            # Use description as input (tests semantic matching, not keyword matching)
            cases.append(
                BenchmarkCase(
                    id=f"semantic-auto-{case_counter:03d}",
                    input=skill.intent.description,
                    expected_skill=skill.name,
                    expected_execution_mode=self._get_effective_execution_mode(skill),
                    expected_complexity=EXECUTION_MODE_TO_COMPLEXITY.get(
                        self._get_effective_execution_mode(skill), "low"
                    ),
                    category="semantic_router",
                    subcategory="description_match",
                    tags=["semantic", skill.name],
                )
            )
        return cases
    def generate_all(self) -> list[BenchmarkCase]:
        """Generate all auto-generated benchmark cases."""
        cases: list[BenchmarkCase] = []
        cases.extend(self.generate_routing_benchmarks())
        cases.extend(self.generate_execution_benchmarks())
        cases.extend(self.generate_team_benchmarks())
        cases.extend(self.generate_semantic_benchmarks())
        return cases
    def get_skill_names(self) -> set[str]:
        """Get all skill names from configs."""
        return {s.name for s in self.load_skills()}
 # ═══════════════════════════════════════════════════════════════════════════
 # Singleton for reuse
 # ═══════════════════════════════════════════════════════════════════════════
 _generator: BenchmarkGenerator | None = None
 def get_generator() -> BenchmarkGenerator:
    """Get or create the singleton BenchmarkGenerator."""
    global _generator
    if _generator is None:
        _generator = BenchmarkGenerator()
    return _generator
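Review note: the generator derives `expected_complexity` from `EXECUTION_MODE_TO_COMPLEXITY`, while the hand-written cases in `benchmark_dataset.py` spell it out by hand. A small consistency check could catch drift between the two. The sketch below works on plain dicts instead of `BenchmarkCase`, and `find_complexity_mismatches` is a hypothetical helper, not something this commit adds.

```python
# Copy of EXECUTION_MODE_TO_COMPLEXITY from benchmark_generator.py
MODE_TO_COMPLEXITY: dict[str, str] = {
    "direct": "low",
    "react": "high",
    "rewoo": "high",
    "reflexion": "high",
    "plan_exec": "high",
    "tool_call": "medium",
    "llm_generate": "low",
    "custom": "medium",
}


def find_complexity_mismatches(cases: list[dict[str, str]]) -> list[str]:
    """Return ids of cases whose expected_complexity disagrees with the mapping."""
    mismatched: list[str] = []
    for case in cases:
        expected = MODE_TO_COMPLEXITY.get(case["expected_execution_mode"])
        # Unknown modes are reported too, since the mapping cannot vouch for them
        if expected != case["expected_complexity"]:
            mismatched.append(case["id"])
    return mismatched


sample = [
    # Taken from SEMANTIC_ROUTER_BENCHMARKS
    {
        "id": "semantic-citation-001",
        "expected_execution_mode": "custom",
        "expected_complexity": "medium",
    },
    # Made-up inconsistent case, for illustration only
    {
        "id": "hypothetical-bad-001",
        "expected_execution_mode": "react",
        "expected_complexity": "low",
    },
]
print(find_complexity_mismatches(sample))
```

Running the same check over `ALL_BENCHMARKS` in a unit test would keep the hand-written and generated cases aligned as new execution modes are added.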
--- a/tests/e2e/capability_metrics.py
+++ b/tests/e2e/capability_metrics.py
--- a/tests/e2e/conftest.py
+++ b/tests/e2e/conftest.py
@ -0,0 +1,413 @@
 """E2E test fixtures: server lifecycle, CLI runner, API client, WebSocket helpers.
 Design principles:
  1. Start a real uvicorn server with MockLLMProvider once per session
  2. CLI tests use subprocess to invoke `agentkit` commands (OpenCLI pattern)
  3. API tests use httpx against the live server
  4. WebSocket tests use the `websockets` library against the live server
  5. All tests are idempotent and repeatable
 """
 import asyncio
 import json
 import os
 import shutil
 import subprocess
 import sys
 import time
 from typing import Any, Generator
 import httpx
 import pytest
 # ---------------------------------------------------------------------------
 # Markers
 # ---------------------------------------------------------------------------
 pytestmark = pytest.mark.integration
 def pytest_configure(config: pytest.Config) -> None:
    config.addinivalue_line("markers", "e2e: end-to-end backtest (requires server)")
    config.addinivalue_line("markers", "e2e_basic: basic function correctness test")
    config.addinivalue_line("markers", "e2e_capability: agent intelligence capability test")
    # Initialize session-scoped metrics collector
    from tests.e2e.capability_metrics import MetricsCollector
    config._e2e_metrics_collector = MetricsCollector()  # type: ignore[attr-defined]
 def pytest_sessionfinish(session: pytest.Session, exitstatus: int) -> None:
    """After all tests, generate capability analysis report if data was collected."""
    # getattr keeps this hook safe when pytest_configure did not set the collector
    collector = getattr(session.config, "_e2e_metrics_collector", None)
    if collector is None or not collector.observations:
        return
    from tests.e2e.capability_metrics import MetricsAnalyzer, MetricsReporter
    analyzer = MetricsAnalyzer()
    report = analyzer.generate_report(collector)
    output_dir = os.path.join(os.path.dirname(__file__), "..", "..", "test-results", "e2e")
    paths = MetricsReporter.save_report(report, output_dir)
    # Print summary to console
    print("\n" + MetricsReporter.to_text(report))
    print(f"\nReport saved to: {paths['json']}")
    print(f"Text report: {paths['text']}")
 # ---------------------------------------------------------------------------
 # Constants
 # ---------------------------------------------------------------------------
 E2E_HOST = "127.0.0.1"
 E2E_PORT = 18765  # dedicated port to avoid conflict with dev server
 E2E_BASE_URL = f"http://{E2E_HOST}:{E2E_PORT}"
 E2E_WS_URL = f"ws://{E2E_HOST}:{E2E_PORT}"
 E2E_API_KEY = "ak_live_e2e_test_key_000000000000000000000000000000000000000000000000"
 # ---------------------------------------------------------------------------
 # Mock LLM Provider (deterministic responses for backtest)
 # ---------------------------------------------------------------------------
 MOCK_LLM_RESPONSES: dict[str, str] = {
    # Default / generic
    "default": '{"result": "mock response", "content": "This is a mock LLM response for e2e testing."}',
    # Content generation
    "content_writer": '{"result": "article generated", "content": "AI is transforming industries by enabling automation and intelligent decision-making."}',
    # Translation
    "translator": '{"result": "translation complete", "content": "This is the translated text."}',
    # Summarization
    "summarizer": '{"result": "summary generated", "content": "Key points: 1) Topic overview 2) Main findings 3) Conclusion."}',
    # Code generation
    "coder": '{"result": "code generated", "content": "def hello():\\n    print(\\"Hello, World!\\")"}',
    # Analysis
    "analyst": '{"result": "analysis complete", "content": "The data shows a positive trend with 15% growth."}',
    # ReAct tool call
    "react_tool_call": '{"thought": "I need to search for information", "action": "web_search", "action_input": {"query": "test"}, "observation": "Search results found"}',
    # ReAct final answer
    "react_final": '{"thought": "I have enough information", "final_answer": "Based on my analysis, the answer is 42."}',
 }
 def _build_mock_env(tmp_path: Any) -> dict[str, str]:
    """Build environment variables for a server with MockLLMProvider."""
    env = os.environ.copy()
    env.update(
        {
            "AGENTKIT_E2E_MODE": "1",
            "AGENTKIT_E2E_MOCK_RESPONSES": json.dumps(MOCK_LLM_RESPONSES),
            "AGENTKIT_API_KEY": E2E_API_KEY,
            "AGENTKIT_WS_TIMEOUT": "0",
            # Disable real LLM calls
            "OPENAI_API_KEY": "",
            "ANTHROPIC_API_KEY": "",
            "DEEPSEEK_API_KEY": "",
        }
    )
    return env
 # ---------------------------------------------------------------------------
 # Server lifecycle fixture
 # ---------------------------------------------------------------------------
@pytest.fixture(scope="session")
 def e2e_server(tmp_path_factory: pytest.TempPathFactory) -> Generator[str, None, None]:
    """Start a real AgentKit server for the entire E2E session.
    Returns the base URL (e.g. http://127.0.0.1:18765).
    The server uses MockLLMProvider so no real LLM calls are made.
    """
    tmp_path = tmp_path_factory.mktemp("e2e_server")
    # Generate a minimal agentkit.yaml for the test server
    config_dir = tmp_path / "config"
    config_dir.mkdir()
    config_file = config_dir / "agentkit.yaml"
    import yaml
    config_file.write_text(
        yaml.dump(
            {
                "server": {"host": E2E_HOST, "port": E2E_PORT},
                "llm": {"default_provider": "mock", "providers": {"mock": {"type": "mock"}}},
                "auth": {"enabled": True, "api_keys": [E2E_API_KEY]},
            }
        )
    )
    env = _build_mock_env(tmp_path)
    env["AGENTKIT_CONFIG"] = str(config_file)
    # Start server as subprocess
    proc = subprocess.Popen(
        [
            sys.executable,
            "-m",
            "agentkit.cli.main",
            "serve",
            "--host",
            E2E_HOST,
            "--port",
            str(E2E_PORT),
        ],
        env=env,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        cwd=str(tmp_path),
    )
    # Wait for server to be ready (max 30s)
    base_url = E2E_BASE_URL
    deadline = time.monotonic() + 30
    ready = False
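    while time.monotonic() < deadline:
        if proc.poll() is not None:
            # Server process exited early; stop polling and report its output
            break
        try:
            resp = httpx.get(f"{base_url}/api/v1/health", timeout=2)
            if resp.status_code == 200:
                ready = True
                break
        except httpx.TransportError:
            # Covers refused connections and timeouts while the server boots
            pass
        time.sleep(0.5)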
    if not ready:
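        proc.terminate()
        try:
            stdout, stderr = proc.communicate(timeout=5)
        except subprocess.TimeoutExpired:
            # Server ignored SIGTERM; kill it so the failure is still reported
            proc.kill()
            stdout, stderr = proc.communicate()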
        pytest.fail(
            f"E2E server failed to start within 30s.\n"
            f"stdout: {stdout.decode()[:2000]}\n"
            f"stderr: {stderr.decode()[:2000]}"
        )
    yield base_url
    # Teardown
    proc.terminate()
    try:
        proc.wait(timeout=10)
    except subprocess.TimeoutExpired:
        proc.kill()
 # ---------------------------------------------------------------------------
 # API client fixture
 # ---------------------------------------------------------------------------
@pytest.fixture(scope="session")
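 def api_client(e2e_server: str) -> Generator[httpx.Client, None, None]:
    """Synchronous httpx client configured for the E2E server."""
    # Yield inside the context manager so the client is closed at session end
    with httpx.Client(
        base_url=e2e_server,
        headers={"X-API-Key": E2E_API_KEY, "Content-Type": "application/json"},
        timeout=30,
    ) as client:
        yield client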
 # ---------------------------------------------------------------------------
 # CLI runner (subprocess-based, OpenCLI pattern)
 # ---------------------------------------------------------------------------
 class CLIRunner:
    """Simulate user CLI operations via subprocess.
    This is the 'OpenCLI' pattern: invoke the real `agentkit` binary
    as a subprocess and capture its output, exactly as a user would.
    """
    def __init__(self, env: dict[str, str] | None = None, cwd: str | None = None):
        self.env = env or os.environ.copy()
        self.cwd = cwd
    def _resolve_agentkit_cmd(self) -> list[str]:
        """Resolve the agentkit command to use.
        Prefer the installed `agentkit` script (handles Rich/Typer output correctly),
        fall back to `python -m agentkit.cli.main`.
        """
        agentkit_path = shutil.which("agentkit")
        if agentkit_path:
            return [agentkit_path]
        return [sys.executable, "-m", "agentkit.cli.main"]
    def run(self, args: list[str], timeout: int = 30) -> subprocess.CompletedProcess[str]:
        """Run an agentkit CLI command and return the result.
        Args:
            args: CLI arguments, e.g. ["version"] or ["task", "submit", ...]
            timeout: maximum seconds to wait
        Returns:
            CompletedProcess with stdout, stderr, returncode
        """
        cmd = [*self._resolve_agentkit_cmd(), *args]
        return subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout,
            env=self.env,
            cwd=self.cwd,
        )
    def run_server_command(
        self, args: list[str], server_url: str, timeout: int = 30
    ) -> subprocess.CompletedProcess[str]:
        """Run a CLI command that requires --server-url."""
        full_args = [*args, "--server-url", server_url]
        return self.run(full_args, timeout=timeout)
@pytest.fixture
 def cli_runner(tmp_path: Any) -> CLIRunner:
    """CLI runner with isolated environment."""
    env = os.environ.copy()
    env["AGENTKIT_CONFIG_DIR"] = str(tmp_path / "config")
    env["AGENTKIT_WS_TIMEOUT"] = "0"
    # Prevent onboarding prompts
    env["AGENTKIT_E2E_MODE"] = "1"
    return CLIRunner(env=env, cwd=str(tmp_path))
@pytest.fixture(scope="session")
 def cli_runner_session(e2e_server: str) -> CLIRunner:
    """CLI runner configured to talk to the E2E server."""
    env = os.environ.copy()
    env["AGENTKIT_SERVER_URL"] = e2e_server
    env["AGENTKIT_API_KEY"] = E2E_API_KEY
    env["AGENTKIT_WS_TIMEOUT"] = "0"
    env["AGENTKIT_E2E_MODE"] = "1"
    return CLIRunner(env=env)
 # ---------------------------------------------------------------------------
 # WebSocket helper
 # ---------------------------------------------------------------------------
 class WSChatHelper:
    """Helper for WebSocket chat E2E tests."""
    def __init__(self, base_ws_url: str, api_key: str):
        self.base_ws_url = base_ws_url
        self.api_key = api_key
    async def connect_and_chat(
        self,
        session_id: str,
        messages: list[dict[str, str]],
        timeout: float = 10.0,
    ) -> list[dict[str, Any]]:
        """Connect to a chat WebSocket, send messages, collect responses.
        Args:
            session_id: chat session ID
            messages: list of {"type": "message", "content": "..."}
            timeout: max seconds to wait for final_answer
        Returns:
            list of all server-sent messages
        """
        try:
            import websockets
        except ImportError:
            pytest.skip("websockets package not installed")
        uri = f"{self.base_ws_url}/api/v1/chat/ws/{session_id}?api_key={self.api_key}"
        received: list[dict[str, Any]] = []
        async with websockets.connect(uri) as ws:
            # Wait for connected event
            msg = await asyncio.wait_for(ws.recv(), timeout=timeout)
            data = json.loads(msg)
            received.append(data)
            assert data.get("type") == "connected", f"Expected connected, got {data}"
            # Send user messages
            for user_msg in messages:
                await ws.send(json.dumps(user_msg))
                # Collect responses until final_answer or error
                while True:
                    try:
                        raw = await asyncio.wait_for(ws.recv(), timeout=timeout)
                        resp = json.loads(raw)
                        received.append(resp)
                        if resp.get("type") in ("final_answer", "error"):
                            break
                    except asyncio.TimeoutError:
                        received.append({"type": "timeout"})
                        break
        return received
@pytest.fixture(scope="session")
 def ws_helper(e2e_server: str) -> WSChatHelper:
    """WebSocket chat helper for the E2E server."""
    ws_url = e2e_server.replace("http://", "ws://").replace("https://", "wss://")
    return WSChatHelper(base_ws_url=ws_url, api_key=E2E_API_KEY)
 # ---------------------------------------------------------------------------
 # Skill / Agent setup helpers
 # ---------------------------------------------------------------------------
 def register_skill_via_api(
    api_client: httpx.Client,
    name: str,
    keywords: list[str] | None = None,
    execution_mode: str = "direct",
    task_mode: str = "llm_generate",
 ) -> httpx.Response:
    """Register a skill via the API for E2E testing."""
    config: dict[str, Any] = {
        "name": name,
        "agent_type": name,
        "task_mode": task_mode,
        "description": f"E2E test skill: {name}",
        "prompt": {
            "identity": f"You are a {name} assistant",
            "instructions": f"Perform {name} tasks",
            "output_format": "JSON",
        },
        "intent": {
            "keywords": keywords or [name],
            "description": f"{name} skill for e2e testing",
        },
    }
    if execution_mode != "direct":
        config["execution_mode"] = execution_mode
        config["max_steps"] = 5
    return api_client.post("/api/v1/skills", json={"config": config})
 def create_session_via_api(api_client: httpx.Client, agent_name: str = "test") -> str:
    """Create a chat session and return the session ID."""
    resp = api_client.post("/api/v1/chat/sessions", json={"agent_name": agent_name})
    assert resp.status_code == 201, f"Failed to create session: {resp.text}"
    return resp.json()["session_id"]
 # ---------------------------------------------------------------------------
 # Metrics Collector fixture
 # ---------------------------------------------------------------------------
@pytest.fixture(scope="session")
 def metrics_collector(request: pytest.FixtureRequest):
    """Session-scoped metrics collector for capability analysis."""
    from tests.e2e.capability_metrics import MetricsCollector
    collector: MetricsCollector = request.config._e2e_metrics_collector  # type: ignore[attr-defined]
    return collector
--- a/tests/e2e/test_basic_api.py
+++ b/tests/e2e/test_basic_api.py
@ -0,0 +1,277 @@
 """E2E Basic Function Tests — REST API endpoints.
 Verifies all API routes work correctly with proper request/response handling.
 Test categories:
  1. Health & metrics
  2. Agent CRUD lifecycle
  3. Skill registration & listing
  4. Task submission (sync, by agent name, auto-route) and listing
  5. Chat session lifecycle
  6. LLM usage tracking
  7. Error handling & edge cases
 """
 import pytest
 import httpx
 from tests.e2e.conftest import register_skill_via_api, create_session_via_api
 # ═══════════════════════════════════════════════════════════════════════════
 # 1. Health & Metrics
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestHealthAPI:
    def test_health_returns_ok(self, api_client: httpx.Client):
        resp = api_client.get("/api/v1/health")
        assert resp.status_code == 200
        data = resp.json()
        assert data.get("status") in ("ok", "healthy")
    def test_metrics_endpoint(self, api_client: httpx.Client):
        resp = api_client.get("/api/v1/metrics")
        assert resp.status_code == 200
 # ═══════════════════════════════════════════════════════════════════════════
 # 2. Agent CRUD Lifecycle
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestAgentCRUD:
    """Full Agent CRUD lifecycle: create → list → get → delete."""
    def test_create_agent_from_skill(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "crud_skill", keywords=["crud"])
        resp = api_client.post("/api/v1/agents", json={"skill_name": "crud_skill"})
        assert resp.status_code == 201
        data = resp.json()
        assert data["name"] == "crud_skill"
    def test_list_agents(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "list_skill", keywords=["list_agent"])
        api_client.post("/api/v1/agents", json={"skill_name": "list_skill"})
        resp = api_client.get("/api/v1/agents")
        assert resp.status_code == 200
        agents = resp.json()
        assert isinstance(agents, list)
        assert any(a["name"] == "list_skill" for a in agents)
    def test_get_agent_detail(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "detail_skill", keywords=["detail"])
        api_client.post("/api/v1/agents", json={"skill_name": "detail_skill"})
        resp = api_client.get("/api/v1/agents/detail_skill")
        assert resp.status_code == 200
        data = resp.json()
        assert data["name"] == "detail_skill"
    def test_delete_agent(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "delete_skill", keywords=["delete_agent"])
        api_client.post("/api/v1/agents", json={"skill_name": "delete_skill"})
        resp = api_client.delete("/api/v1/agents/delete_skill")
        assert resp.status_code == 204
        # Verify deleted
        resp = api_client.get("/api/v1/agents/delete_skill")
        assert resp.status_code == 404
    def test_create_agent_nonexistent_skill(self, api_client: httpx.Client):
        resp = api_client.post("/api/v1/agents", json={"skill_name": "nonexistent_skill_xyz"})
        assert resp.status_code in (400, 404)
    def test_get_nonexistent_agent(self, api_client: httpx.Client):
        resp = api_client.get("/api/v1/agents/does_not_exist")
        assert resp.status_code == 404
 # ═══════════════════════════════════════════════════════════════════════════
 # 3. Skill Registration & Listing
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestSkillAPI:
    def test_register_skill(self, api_client: httpx.Client):
        resp = register_skill_via_api(api_client, "reg_skill", keywords=["reg"])
        assert resp.status_code == 201
    def test_list_skills(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "list_test_skill", keywords=["list_test"])
        resp = api_client.get("/api/v1/skills")
        assert resp.status_code == 200
        skills = resp.json()
        assert isinstance(skills, list)
        assert len(skills) >= 1
    def test_register_duplicate_skill(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "dup_skill", keywords=["dup"])
        resp = register_skill_via_api(api_client, "dup_skill", keywords=["dup"])
        # Should either overwrite or return conflict
        assert resp.status_code in (200, 201, 409)
    def test_skill_with_execution_mode(self, api_client: httpx.Client):
        resp = register_skill_via_api(
            api_client, "react_skill", keywords=["react_test"], execution_mode="react"
        )
        assert resp.status_code == 201
    def test_skill_mention_suggest(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "mention_skill", keywords=["mention_test"])
        resp = api_client.get("/api/v1/skills/mention-suggest", params={"q": "mention"})
        assert resp.status_code == 200
 # ═══════════════════════════════════════════════════════════════════════════
 # 4. Task Submission
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestTaskAPI:
    def test_submit_task_sync(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "sync_task_skill", keywords=["sync_task"])
        resp = api_client.post(
            "/api/v1/tasks",
            json={
                "input_data": {"query": "test sync task"},
                "skill_name": "sync_task_skill",
            },
        )
        assert resp.status_code == 200
        data = resp.json()
        assert "output" in data or "data" in data or "skill_name" in data
    def test_submit_task_with_agent_name(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "agent_task_skill", keywords=["agent_task"])
        api_client.post("/api/v1/agents", json={"skill_name": "agent_task_skill"})
        resp = api_client.post(
            "/api/v1/tasks",
            json={
                "input_data": {"query": "test agent task"},
                "agent_name": "agent_task_skill",
            },
        )
        assert resp.status_code == 200
    def test_submit_task_auto_route(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "auto_route_skill", keywords=["auto_route"])
        resp = api_client.post(
            "/api/v1/tasks",
            json={"input_data": {"query": "Please auto_route this for me"}},
        )
        assert resp.status_code == 200
    def test_list_tasks(self, api_client: httpx.Client):
        resp = api_client.get("/api/v1/tasks")
        assert resp.status_code == 200
    def test_submit_task_missing_data(self, api_client: httpx.Client):
        resp = api_client.post("/api/v1/tasks", json={})
        # Should return 400 or 422
        assert resp.status_code in (400, 422)
 # ═══════════════════════════════════════════════════════════════════════════
 # 5. Chat Session Lifecycle
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestChatSessionAPI:
    def test_create_session(self, api_client: httpx.Client):
        session_id = create_session_via_api(api_client)
        assert session_id is not None
        assert len(session_id) > 0
    def test_list_sessions(self, api_client: httpx.Client):
        create_session_via_api(api_client)
        resp = api_client.get("/api/v1/chat/sessions")
        assert resp.status_code == 200
        sessions = resp.json()
        assert isinstance(sessions, list)
        assert len(sessions) >= 1
    def test_get_session(self, api_client: httpx.Client):
        session_id = create_session_via_api(api_client)
        resp = api_client.get(f"/api/v1/chat/sessions/{session_id}")
        assert resp.status_code == 200
    def test_session_messages(self, api_client: httpx.Client):
        session_id = create_session_via_api(api_client)
        # Send a message
        resp = api_client.post(
            f"/api/v1/chat/sessions/{session_id}/messages",
            json={"content": "Hello from e2e test"},
        )
        assert resp.status_code == 200
        # Get messages
        resp = api_client.get(f"/api/v1/chat/sessions/{session_id}/messages")
        assert resp.status_code == 200
    def test_close_session(self, api_client: httpx.Client):
        session_id = create_session_via_api(api_client)
        resp = api_client.delete(f"/api/v1/chat/sessions/{session_id}")
        assert resp.status_code == 200
 # ═══════════════════════════════════════════════════════════════════════════
 # 6. LLM Usage Tracking
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestLLMUsageAPI:
    def test_llm_usage_endpoint(self, api_client: httpx.Client):
        resp = api_client.get("/api/v1/llm/usage")
        assert resp.status_code == 200
    def test_llm_usage_after_task(self, api_client: httpx.Client):
        register_skill_via_api(api_client, "usage_track_skill", keywords=["usage_track"])
        api_client.post(
            "/api/v1/tasks",
            json={
                "input_data": {"query": "test usage tracking"},
                "skill_name": "usage_track_skill",
            },
        )
        resp = api_client.get("/api/v1/llm/usage")
        assert resp.status_code == 200
 # ═══════════════════════════════════════════════════════════════════════════
 # 7. Error Handling & Edge Cases
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestAPIErrorHandling:
    def test_404_for_unknown_route(self, api_client: httpx.Client):
        resp = api_client.get("/api/v1/nonexistent_route")
        assert resp.status_code == 404
    def test_invalid_json_body(self, api_client: httpx.Client):
        resp = api_client.post(
            "/api/v1/tasks",
            content=b"not json",
            headers={"Content-Type": "application/json"},
        )
        assert resp.status_code in (400, 422)
    def test_missing_api_key(self, e2e_server: str):
        """Requests without API key should be rejected (if auth enabled)."""
        # Use a context manager so the client's connection pool is closed after the test
        with httpx.Client(base_url=e2e_server, timeout=10) as client:
            resp = client.get("/api/v1/agents")
        # Should be 401/403 or still 200 if auth is not enforced on this endpoint
        assert resp.status_code in (200, 401, 403)
    def test_invalid_api_key(self, e2e_server: str):
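        # Use a context manager so the client's connection pool is closed after the test
        with httpx.Client(
            base_url=e2e_server,
            headers={"X-API-Key": "invalid_key"},
            timeout=10,
        ) as client:
            resp = client.get("/api/v1/agents")
        # Should be 401/403 or still 200 if auth is not enforced on this endpoint
        assert resp.status_code in (200, 401, 403)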
--- a/tests/e2e/test_basic_cli.py
+++ b/tests/e2e/test_basic_cli.py
@ -0,0 +1,353 @@
 """E2E Basic Function Tests — CLI commands.
 Verifies that all CLI commands execute correctly as a real user would invoke them.
 Uses subprocess (OpenCLI pattern) to simulate actual CLI operations.
 Test categories:
  1. Utility commands: version, doctor, help
  2. Init & config: agentkit init
  3. Pair: API key generation
  4. Skill management: list, load, info
  5. Task management: submit (sync, async, input file), list
  6. Server: serve startup
 """
 import json
 import os
 import pytest
 from tests.e2e.conftest import CLIRunner, E2E_BASE_URL
 # ═══════════════════════════════════════════════════════════════════════════
 # 1. Utility Commands
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestCLIVersion:
    """agentkit version — basic sanity check."""
    def test_version_returns_zero_exit_code(self, cli_runner: CLIRunner):
        result = cli_runner.run(["version"])
        assert result.returncode == 0, f"stdout: {result.stdout}\nstderr: {result.stderr}"
    def test_version_outputs_version_string(self, cli_runner: CLIRunner):
        result = cli_runner.run(["version"])
        assert "0.1.0" in result.stdout or "fischer-agentkit" in result.stdout.lower()
    def test_version_help(self, cli_runner: CLIRunner):
        result = cli_runner.run(["version", "--help"])
        assert result.returncode == 0
@pytest.mark.e2e_basic
 class TestCLIDoctor:
    """agentkit doctor — server health check."""
    def test_doctor_server_not_running(self, cli_runner: CLIRunner):
        """Doctor should report error when no server is running."""
        result = cli_runner.run(["doctor"])
        # Should indicate server not reachable
        output = (result.stdout + result.stderr).lower()
        assert (
            result.returncode != 0
            or "not running" in output
            or "error" in output
            or "connection" in output
        )
    def test_doctor_with_running_server(self, cli_runner_session: CLIRunner):
        """Doctor should report healthy when E2E server is running."""
        result = cli_runner_session.run(["doctor", "--port", "18765"])
        output = (result.stdout + result.stderr).lower()
        # Should show some health info (ok, healthy, or at least not connection refused)
        assert "connection refused" not in output or result.returncode == 0
@pytest.mark.e2e_basic
 class TestCLIHelp:
    """agentkit --help — command discovery."""
    def test_help_shows_all_subcommands(self, cli_runner: CLIRunner):
        result = cli_runner.run(["--help"])
        assert result.returncode == 0
        for cmd in [
            "serve",
            "gui",
            "chat",
            "version",
            "doctor",
            "init",
            "task",
            "skill",
            "usage",
            "pair",
        ]:
            assert cmd in result.stdout, f"Missing subcommand '{cmd}' in help output"
    def test_task_help(self, cli_runner: CLIRunner):
        result = cli_runner.run(["task", "--help"])
        assert result.returncode == 0
        for sub in ["submit", "status", "list", "cancel"]:
            assert sub in result.stdout
    def test_skill_help(self, cli_runner: CLIRunner):
        result = cli_runner.run(["skill", "--help"])
        assert result.returncode == 0
        for sub in ["list", "load", "info"]:
            assert sub in result.stdout
 # ═══════════════════════════════════════════════════════════════════════════
 # 2. Init & Config
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestCLIInit:
    """agentkit init — project initialization."""
    def test_init_non_interactive(self, cli_runner: CLIRunner, tmp_path):
        output_dir = str(tmp_path / "init_output")
        os.makedirs(output_dir, exist_ok=True)
        result = cli_runner.run(["init", "--non-interactive", "--output-dir", output_dir])
        assert result.returncode == 0, f"stderr: {result.stderr}"
        assert os.path.exists(os.path.join(output_dir, "agentkit.yaml"))
        assert os.path.exists(os.path.join(output_dir, ".env.example"))
    def test_init_generates_valid_yaml(self, cli_runner: CLIRunner, tmp_path):
        import yaml
        output_dir = str(tmp_path / "init_yaml")
        os.makedirs(output_dir, exist_ok=True)
        cli_runner.run(["init", "--non-interactive", "--output-dir", output_dir])
        with open(os.path.join(output_dir, "agentkit.yaml")) as f:
            config = yaml.safe_load(f)
        assert "server" in config
        assert "llm" in config
    def test_init_no_overwrite_without_force(self, cli_runner: CLIRunner, tmp_path):
        output_dir = str(tmp_path / "init_no_overwrite")
        os.makedirs(output_dir, exist_ok=True)
        # Create existing file
        with open(os.path.join(output_dir, "agentkit.yaml"), "w") as f:
            f.write("existing_content")
        cli_runner.run(["init", "--non-interactive", "--output-dir", output_dir])
        with open(os.path.join(output_dir, "agentkit.yaml")) as f:
            content = f.read()
        # Should not overwrite
        assert content == "existing_content"
    def test_init_force_overwrites(self, cli_runner: CLIRunner, tmp_path):
        output_dir = str(tmp_path / "init_force")
        os.makedirs(output_dir, exist_ok=True)
        with open(os.path.join(output_dir, "agentkit.yaml"), "w") as f:
            f.write("old")
        result = cli_runner.run(
            ["init", "--non-interactive", "--force", "--output-dir", output_dir]
        )
        assert result.returncode == 0
        with open(os.path.join(output_dir, "agentkit.yaml")) as f:
            content = f.read()
        assert "server" in content
 # ═══════════════════════════════════════════════════════════════════════════
 # 3. Pair (API Key Generation)
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestCLIPair:
    """agentkit pair — external system API key management."""
    def test_pair_generates_api_key(self, cli_runner: CLIRunner, tmp_path):
        config_dir = str(tmp_path / "pair_config")
        os.makedirs(config_dir, exist_ok=True)
        result = cli_runner.run(["pair", "--name", "e2e-test-client", "--config-dir", config_dir])
        assert result.returncode == 0, f"stderr: {result.stderr}"
        assert "ak_live_" in result.stdout
    def test_pair_saves_client_config(self, cli_runner: CLIRunner, tmp_path):
        import yaml
        config_dir = str(tmp_path / "pair_save")
        os.makedirs(config_dir, exist_ok=True)
        cli_runner.run(["pair", "--name", "e2e-client", "--config-dir", config_dir])
        clients_path = os.path.join(config_dir, "clients.yaml")
        assert os.path.exists(clients_path)
        with open(clients_path) as f:
            clients = yaml.safe_load(f)
        assert "e2e-client" in clients
        assert clients["e2e-client"]["api_key"].startswith("ak_live_")
    def test_pair_rejects_duplicate_name(self, cli_runner: CLIRunner, tmp_path):
        config_dir = str(tmp_path / "pair_dup")
        os.makedirs(config_dir, exist_ok=True)
        cli_runner.run(["pair", "--name", "dup-client", "--config-dir", config_dir])
        result = cli_runner.run(["pair", "--name", "dup-client", "--config-dir", config_dir])
        output = (result.stdout + result.stderr).lower()
        assert result.returncode != 0 or "already" in output or "exists" in output
    def test_pair_list(self, cli_runner: CLIRunner, tmp_path):
        config_dir = str(tmp_path / "pair_list")
        os.makedirs(config_dir, exist_ok=True)
        cli_runner.run(["pair", "--name", "client-a", "--config-dir", config_dir])
        cli_runner.run(["pair", "--name", "client-b", "--config-dir", config_dir])
        result = cli_runner.run(["pair", "--list", "--config-dir", config_dir])
        assert result.returncode == 0
        assert "client-a" in result.stdout
        assert "client-b" in result.stdout
    def test_pair_revoke(self, cli_runner: CLIRunner, tmp_path):
        import yaml
        config_dir = str(tmp_path / "pair_revoke")
        os.makedirs(config_dir, exist_ok=True)
        cli_runner.run(["pair", "--name", "revoke-me", "--config-dir", config_dir])
        result = cli_runner.run(["pair", "--revoke", "revoke-me", "--config-dir", config_dir])
        assert result.returncode == 0
        with open(os.path.join(config_dir, "clients.yaml")) as f:
            clients = yaml.safe_load(f)
        assert "revoke-me" not in clients
 # ═══════════════════════════════════════════════════════════════════════════
 # 4. Skill Management (CLI → Server)
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestCLISkill:
    """agentkit skill — skill management via CLI against running server."""
    def test_skill_list_via_server(self, cli_runner_session: CLIRunner):
        result = cli_runner_session.run_server_command(["skill", "list"], E2E_BASE_URL)
        assert result.returncode == 0, f"stderr: {result.stderr}"
    def test_skill_load_yaml(self, cli_runner: CLIRunner, tmp_path):
        import yaml
        skill_file = tmp_path / "test_skill.yaml"
        skill_file.write_text(
            yaml.dump(
                {
                    "name": "e2e_test_skill",
                    "description": "E2E test skill",
                    "agent_type": "assistant",
                    "task_mode": "llm_generate",
                    "prompt": {"system": "You are a test assistant"},
                }
            )
        )
        result = cli_runner.run(["skill", "load", str(skill_file)])
        # Should load successfully or report loaded
        output = (result.stdout + result.stderr).lower()
        assert result.returncode == 0 or "loaded" in output or "e2e_test_skill" in output
    def test_skill_info_via_server(self, cli_runner_session: CLIRunner, api_client):
        # First register a skill via API
        from tests.e2e.conftest import register_skill_via_api
        register_skill_via_api(api_client, "cli_info_skill", keywords=["cli_info"])
        # Then query via CLI
        result = cli_runner_session.run_server_command(
            ["skill", "info", "cli_info_skill"], E2E_BASE_URL
        )
        assert result.returncode == 0
        assert "cli_info_skill" in result.stdout
 # ═══════════════════════════════════════════════════════════════════════════
 # 5. Task Management (CLI → Server)
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestCLITask:
    """agentkit task — task management via CLI against running server."""
    def test_task_submit_sync(self, cli_runner_session: CLIRunner, api_client):
        from tests.e2e.conftest import register_skill_via_api
        register_skill_via_api(api_client, "cli_task_skill", keywords=["cli_task"])
        result = cli_runner_session.run_server_command(
            [
                "task",
                "submit",
                "--skill",
                "cli_task_skill",
                "--input",
                json.dumps({"query": "test task submission"}),
            ],
            E2E_BASE_URL,
        )
        assert result.returncode == 0, f"stdout: {result.stdout}\nstderr: {result.stderr}"
    def test_task_submit_async(self, cli_runner_session: CLIRunner, api_client):
        from tests.e2e.conftest import register_skill_via_api
        register_skill_via_api(api_client, "cli_async_skill", keywords=["cli_async"])
        result = cli_runner_session.run_server_command(
            [
                "task",
                "submit",
                "--skill",
                "cli_async_skill",
                "--mode",
                "async",
                "--input",
                json.dumps({"query": "async task test"}),
            ],
            E2E_BASE_URL,
        )
        assert result.returncode == 0
    def test_task_list(self, cli_runner_session: CLIRunner):
        result = cli_runner_session.run_server_command(["task", "list"], E2E_BASE_URL)
        assert result.returncode == 0
    def test_task_submit_input_file(self, cli_runner_session: CLIRunner, api_client, tmp_path):
        from tests.e2e.conftest import register_skill_via_api
        register_skill_via_api(api_client, "cli_file_skill", keywords=["cli_file"])
        input_file = tmp_path / "task_input.json"
        input_file.write_text(json.dumps({"query": "file input test"}))
        result = cli_runner_session.run_server_command(
            [
                "task",
                "submit",
                "--skill",
                "cli_file_skill",
                "--input-file",
                str(input_file),
            ],
            E2E_BASE_URL,
        )
        assert result.returncode == 0
 # ═══════════════════════════════════════════════════════════════════════════
 # 6. Server Startup
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestCLIServe:
    """agentkit serve — server startup (basic check, not full lifecycle)."""
    def test_serve_help(self, cli_runner: CLIRunner):
        result = cli_runner.run(["serve", "--help"])
        assert result.returncode == 0
        assert "--host" in result.stdout
        assert "--port" in result.stdout
    def test_serve_invalid_port(self, cli_runner: CLIRunner):
        """Serve with an invalid port should fail gracefully."""
        result = cli_runner.run(["serve", "--port", "not_a_port"], timeout=5)
        # Should error out, not hang
        assert result.returncode != 0 or "error" in (result.stdout + result.stderr).lower()
--- a/tests/e2e/test_basic_websocket.py
+++ b/tests/e2e/test_basic_websocket.py
@ -0,0 +1,170 @@
 """E2E Basic Function Tests — WebSocket chat protocol.
 Verifies the WebSocket chat protocol works correctly:
  1. Connection lifecycle (connect → connected → ping/pong, invalid session ID)
  2. Message exchange (user message → token stream → final_answer)
  3. Cancel flow (message → cancel, server must not crash)
  4. Skill match event (skill_match notification when routing matches a skill)
 Confirmation, AskHuman, and expert team events are only checked here as
 valid message types; they have no dedicated flow tests in this module.
 """
 import asyncio
 import json
 import pytest
 from tests.e2e.conftest import WSChatHelper, create_session_via_api, register_skill_via_api
 # ═══════════════════════════════════════════════════════════════════════════
 # 1. Connection Lifecycle
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestWSConnection:
    @pytest.mark.asyncio
    async def test_connect_receives_connected_event(self, ws_helper: WSChatHelper, api_client):
        session_id = create_session_via_api(api_client)
        messages = await ws_helper.connect_and_chat(session_id, [])
        assert len(messages) >= 1
        assert messages[0].get("type") == "connected"
    @pytest.mark.asyncio
    async def test_ping_pong(self, ws_helper: WSChatHelper, api_client):
        """Ping should receive pong response."""
        try:
            import websockets
        except ImportError:
            pytest.skip("websockets not installed")
        session_id = create_session_via_api(api_client)
        uri = f"{ws_helper.base_ws_url}/api/v1/chat/ws/{session_id}?api_key={ws_helper.api_key}"
        async with websockets.connect(uri) as ws:
            # Wait for the connected event before sending anything
            msg = await asyncio.wait_for(ws.recv(), timeout=10)
            assert json.loads(msg).get("type") == "connected"
            # Send ping
            await ws.send(json.dumps({"type": "ping"}))
            raw = await asyncio.wait_for(ws.recv(), timeout=10)
            resp = json.loads(raw)
            assert resp.get("type") == "pong"
    @pytest.mark.asyncio
    async def test_invalid_session_id(self, ws_helper: WSChatHelper):
        """Connecting with invalid session ID should fail."""
        try:
            import websockets
        except ImportError:
            pytest.skip("websockets not installed")
        uri = f"{ws_helper.base_ws_url}/api/v1/chat/ws/nonexistent-session?api_key={ws_helper.api_key}"
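        # Broad on purpose: the exact exception depends on the websockets version
        # and on whether the server rejects the handshake or closes after accepting
        with pytest.raises(Exception):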
            async with websockets.connect(uri) as ws:
                await asyncio.wait_for(ws.recv(), timeout=5)
 # ═══════════════════════════════════════════════════════════════════════════
 # 2. Message Exchange
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestWSMessageExchange:
    @pytest.mark.asyncio
    async def test_send_message_get_response(self, ws_helper: WSChatHelper, api_client):
        session_id = create_session_via_api(api_client)
        messages = await ws_helper.connect_and_chat(
            session_id,
            [{"type": "message", "content": "Hello, this is an e2e test"}],
        )
        # Should receive at least: connected + some response (token/final_answer/error)
        assert len(messages) >= 2
        # At least one response-type message should be present (not necessarily the last one)
        response_types = [m.get("type") for m in messages]
        assert any(t in response_types for t in ("final_answer", "error", "token", "thinking"))
    @pytest.mark.asyncio
    async def test_message_types_are_valid(self, ws_helper: WSChatHelper, api_client):
        """All server-sent messages should have a valid 'type' field."""
        session_id = create_session_via_api(api_client)
        messages = await ws_helper.connect_and_chat(
            session_id,
            [{"type": "message", "content": "Test valid message types"}],
        )
        valid_types = {
            "connected",
            "token",
            "thinking",
            "step",
            "final_answer",
            "skill_match",
            "confirmation_request",
            "confirmation_result",
            "ask_human",
            "error",
            "pong",
            "team_formed",
            "expert_step",
            "expert_result",
            "plan_update",
            "team_synthesis",
            "team_dissolved",
        }
        for msg in messages:
            if isinstance(msg, dict) and "type" in msg:
                assert msg["type"] in valid_types, f"Invalid message type: {msg['type']}"
 # ═══════════════════════════════════════════════════════════════════════════
 # 3. Cancel Flow
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestWSCancel:
    @pytest.mark.asyncio
    async def test_cancel_message_accepted(self, ws_helper: WSChatHelper, api_client):
        """Sending cancel should be accepted by the server."""
        try:
            import websockets
        except ImportError:
            pytest.skip("websockets not installed")
        session_id = create_session_via_api(api_client)
        uri = f"{ws_helper.base_ws_url}/api/v1/chat/ws/{session_id}?api_key={ws_helper.api_key}"
        async with websockets.connect(uri) as ws:
            # Wait for connected
            await asyncio.wait_for(ws.recv(), timeout=10)
            # Send a message first
            await ws.send(json.dumps({"type": "message", "content": "Start a task"}))
            # Immediately send cancel
            await ws.send(json.dumps({"type": "cancel"}))
            # Server should handle gracefully (no crash)
 # ═══════════════════════════════════════════════════════════════════════════
 # 4. Skill Match Event
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_basic
 class TestWSSkillMatch:
    @pytest.mark.asyncio
    async def test_skill_match_notification(self, ws_helper: WSChatHelper, api_client):
        """When a skill is matched, server should send skill_match event."""
        register_skill_via_api(api_client, "ws_skill", keywords=["ws_skill_match"])
        session_id = create_session_via_api(api_client)
        messages = await ws_helper.connect_and_chat(
            session_id,
            [{"type": "message", "content": "Please use ws_skill_match for this"}],
        )
        # A skill_match event is optional (it depends on routing), so only
        # require that the server responded (skill_match or a direct answer).
        assert len(messages) >= 2
--- a/tests/e2e/test_capability_alignment.py
+++ b/tests/e2e/test_capability_alignment.py
@ -0,0 +1,305 @@
 """E2E capability tests for AlignmentGuard and CascadeDetector.
 Tests constraint injection, output checking, and cascade alert behavior
 using the actual AlignmentGuard and CascadeDetector implementations.
 """
 from __future__ import annotations
 from datetime import datetime, timezone
 import pytest
 from agentkit.quality.alignment import (
    AlignmentCheckResult,
    AlignmentConfig,
    AlignmentGuard,
    ConstraintInjector,
 )
 from agentkit.quality.cascade_detector import CascadeDetector
 from tests.e2e.benchmark_dataset import BenchmarkCase
 from tests.e2e.capability_metrics import CapabilityObservation, MetricsCollector
 # ---------------------------------------------------------------------------
 # Alignment benchmark cases
 # ---------------------------------------------------------------------------
 ALIGNMENT_BENCHMARKS: list[BenchmarkCase] = [
    # Negative constraints: "不要X" ("do not X") → output should NOT contain X
    BenchmarkCase(
        id="align-neg-001",
        input="请分析市场趋势，不要提及价格信息",
        expected_skill=None,
        expected_execution_mode="react",
        expected_complexity="high",
        category="alignment",
        subcategory="negative_constraint",
    ),
    BenchmarkCase(
        id="align-neg-002",
        input="总结这篇文章，禁止包含个人观点",
        expected_skill=None,
        expected_execution_mode="react",
        expected_complexity="medium",
        category="alignment",
        subcategory="negative_constraint",
    ),
    # Positive constraints: "必须X" ("must X") → output SHOULD contain X
    BenchmarkCase(
        id="align-pos-001",
        input="分析竞争对手，必须包含摘要部分",
        expected_skill=None,
        expected_execution_mode="react",
        expected_complexity="high",
        category="alignment",
        subcategory="positive_constraint",
    ),
    BenchmarkCase(
        id="align-pos-002",
        input="审查代码，需要提供改进建议",
        expected_skill=None,
        expected_execution_mode="react",
        expected_complexity="medium",
        category="alignment",
        subcategory="positive_constraint",
    ),
    # Cascade alert: repeated interactions should trigger alert
    BenchmarkCase(
        id="align-cascade-001",
        input="重复执行相似查询触发级联告警",
        expected_skill=None,
        expected_execution_mode="react",
        expected_complexity="medium",
        category="alignment",
        subcategory="cascade_alert",
    ),
    # No constraints: should pass cleanly
    BenchmarkCase(
        id="align-none-001",
        input="帮我分析一下用户数据",
        expected_skill=None,
        expected_execution_mode="react",
        expected_complexity="medium",
        category="alignment",
        subcategory="no_constraint",
    ),
 ]
 # ---------------------------------------------------------------------------
 # Tests: ConstraintInjector
 # ---------------------------------------------------------------------------
 class TestConstraintInjector:
    def test_inject_constraints(self) -> None:
        config = AlignmentConfig(constraints=["不要提及价格", "必须包含摘要"])
        injector = ConstraintInjector(config)
        input_data = {"query": "分析市场趋势"}
        result = injector.inject(input_data)
        assert "alignment_constraints" in result
        assert result["alignment_constraints"] == ["不要提及价格", "必须包含摘要"]
        # Original data should not be modified
        assert "alignment_constraints" not in input_data
 # ---------------------------------------------------------------------------
 # Tests: AlignmentGuard rule-based checking
 # ---------------------------------------------------------------------------
 class TestAlignmentGuardRuleCheck:
    @pytest.fixture
    def guard(self) -> AlignmentGuard:
        config = AlignmentConfig(
            constraints=["不要提及价格信息", "必须摘要"],
            audit_enabled=False,
        )
        return AlignmentGuard(config)
    @pytest.mark.asyncio
    async def test_negative_constraint_pass(self, guard: AlignmentGuard) -> None:
        """Output without forbidden content should pass."""
        output = {"content": "市场趋势分析：整体呈上升趋势。摘要：市场表现良好。"}
        result = await guard.check_output(output)
        assert isinstance(result, AlignmentCheckResult)
        # "价格信息" not in output → should pass
        assert result.passed is True
    @pytest.mark.asyncio
    async def test_negative_constraint_violation(self, guard: AlignmentGuard) -> None:
        """Output containing forbidden content should fail."""
        output = {"content": "当前提及价格信息显示市场上涨。摘要：市场持续走高。"}
        result = await guard.check_output(output)
        assert result.passed is False
        assert len(result.violations) > 0
    @pytest.mark.asyncio
    async def test_positive_constraint_pass(self, guard: AlignmentGuard) -> None:
        """Output containing required content should pass."""
        output = {"content": "分析结果如下。摘要：市场趋势向好。"}
        result = await guard.check_output(output)
        assert result.passed is True
    @pytest.mark.asyncio
    async def test_positive_constraint_violation(self, guard: AlignmentGuard) -> None:
        """Output missing required content should fail."""
        output = {"content": "分析结果如下。市场趋势向好。"}
        result = await guard.check_output(output)
        assert result.passed is False
    @pytest.mark.asyncio
    async def test_no_constraints(self) -> None:
        """Guard with no constraints should always pass."""
        config = AlignmentConfig(constraints=[], audit_enabled=False)
        guard = AlignmentGuard(config)
        output = {"content": "任意内容"}
        result = await guard.check_output(output)
        assert result.passed is True
    @pytest.mark.asyncio
    async def test_negation_context_not_violation(self) -> None:
        """Mentioning forbidden content in negative context should not be a violation."""
        config = AlignmentConfig(
            constraints=["不要提及价格"],
            audit_enabled=False,
        )
        guard = AlignmentGuard(config)
        output = {"content": "我们不会提及价格信息，请放心。摘要：市场分析完成。"}
        result = await guard.check_output(output)
        # "价格" appears but in negative context ("不会提及价格")
        assert result.passed is True
 # ---------------------------------------------------------------------------
 # Tests: CascadeDetector
 # ---------------------------------------------------------------------------
 class TestCascadeDetector:
    def test_no_alert_below_threshold(self) -> None:
        detector = CascadeDetector(max_interactions=5)
        for _ in range(5):
            alert = detector.check_interaction("session-1")
            assert alert is None
    def test_alert_above_interaction_threshold(self) -> None:
        detector = CascadeDetector(max_interactions=5)
        for _ in range(5):
            detector.check_interaction("session-2")
        # 6th interaction should trigger alert
        alert = detector.check_interaction("session-2")
        assert alert is not None
        assert alert.alert_type == "interaction_limit"
        assert alert.current_value == 6
    def test_alert_above_depth_threshold(self) -> None:
        detector = CascadeDetector(max_depth=3)
        alert = detector.check_depth("session-3", 4)
        assert alert is not None
        assert alert.alert_type == "loop_depth"
        assert alert.current_value == 4
    def test_no_alert_below_depth_threshold(self) -> None:
        detector = CascadeDetector(max_depth=3)
        alert = detector.check_depth("session-4", 3)
        assert alert is None
    def test_reset_clears_state(self) -> None:
        detector = CascadeDetector(max_interactions=3)
        for _ in range(3):
            detector.check_interaction("session-5")
        detector.reset("session-5")
        # After reset, count should be back to 0
        alert = detector.check_interaction("session-5")
        assert alert is None  # count is now 1, below threshold
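# Sketch of the threshold semantics the CascadeDetector tests above assume:
# an alert fires only once the per-session count exceeds (not merely reaches)
# the limit. _InteractionCounterSketch is a hypothetical stand-in written for
# illustration; it is not the CascadeDetector implementation.
from typing import Optional
class _InteractionCounterSketch:
    def __init__(self, max_interactions: int) -> None:
        self.max_interactions = max_interactions
        self._counts = {}  # session_id -> interactions seen so far
    def check_interaction(self, session_id: str) -> Optional[int]:
        """Return the current count when it exceeds the limit, else None."""
        count = self._counts.get(session_id, 0) + 1
        self._counts[session_id] = count
        return count if count > self.max_interactions else None
    def reset(self, session_id: str) -> None:
        """Forget the session so that counting starts again from zero."""
        self._counts.pop(session_id, None)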
 # ---------------------------------------------------------------------------
 # Tests: AlignmentGuard cascade integration
 # ---------------------------------------------------------------------------
 class TestAlignmentGuardCascade:
    def test_record_interaction_returns_alert(self) -> None:
        config = AlignmentConfig(cascade_max_interactions=3)
        guard = AlignmentGuard(config)
        for _ in range(3):
            guard.record_interaction("session-10")
        alert = guard.record_interaction("session-10")
        assert alert is not None
        assert alert.alert_type == "interaction_limit"
    def test_record_loop_depth_returns_alert(self) -> None:
        config = AlignmentConfig(cascade_max_depth=2)
        guard = AlignmentGuard(config)
        alert = guard.record_loop_depth("session-11", 3)
        assert alert is not None
        assert alert.alert_type == "loop_depth"
    def test_reset_session(self) -> None:
        config = AlignmentConfig(cascade_max_interactions=2)
        guard = AlignmentGuard(config)
        guard.record_interaction("session-12")
        guard.record_interaction("session-12")
        guard.reset_session("session-12")
        assert guard.get_interaction_count("session-12") == 0
 # ---------------------------------------------------------------------------
 # Tests: Metrics collection for alignment
 # ---------------------------------------------------------------------------
 class TestAlignmentMetricsCollection:
    def test_record_alignment_observation(self) -> None:
        collector = MetricsCollector()
        obs = CapabilityObservation(
            benchmark_id="align-neg-001",
            test_name="test_neg_constraint",
            timestamp=datetime.now(timezone.utc).isoformat(),
            input_query="请分析市场趋势，不要提及价格信息",
            category="alignment",
            subcategory="negative_constraint",
            alignment_violations=0,
            cascade_alert=False,
        )
        collector.record(obs)
        alignment_obs = collector.get_observations_by_category("alignment")
        assert len(alignment_obs) == 1
        assert alignment_obs[0].alignment_violations == 0
    def test_record_alignment_with_violations(self) -> None:
        collector = MetricsCollector()
        obs = CapabilityObservation(
            benchmark_id="align-neg-002",
            test_name="test_neg_constraint_violation",
            timestamp=datetime.now(timezone.utc).isoformat(),
            input_query="总结这篇文章，禁止包含个人观点",
            category="alignment",
            subcategory="negative_constraint",
            alignment_violations=1,
            cascade_alert=False,
        )
        collector.record(obs)
        alignment_obs = collector.get_observations_by_category("alignment")
        assert alignment_obs[0].alignment_violations == 1
    def test_record_cascade_alert(self) -> None:
        collector = MetricsCollector()
        obs = CapabilityObservation(
            benchmark_id="align-cascade-001",
            test_name="test_cascade_alert",
            timestamp=datetime.now(timezone.utc).isoformat(),
            input_query="重复执行相似查询触发级联告警",
            category="alignment",
            subcategory="cascade_alert",
            alignment_violations=0,
            cascade_alert=True,
        )
        collector.record(obs)
        alignment_obs = collector.get_observations_by_category("alignment")
        assert alignment_obs[0].cascade_alert is True
--- a/tests/e2e/test_capability_react.py
+++ b/tests/e2e/test_capability_react.py
@ -0,0 +1,324 @@
 """E2E Agent Capability Tests — ReAct Reasoning & Execution with Metrics.
 Tests the intelligence of agent execution AND collects data for:
  - Execution mode selection accuracy
  - Quality gate effectiveness
  - Task success rate by mode
  - Output standardization consistency
  - Overfitting detection via paraphrased inputs
 """
 import pytest
 import httpx
 from tests.e2e.benchmark_dataset import EXECUTION_BENCHMARKS, BenchmarkCase
 from tests.e2e.capability_metrics import MetricsCollector
 from tests.e2e.conftest import register_skill_via_api
 # ═══════════════════════════════════════════════════════════════════════════
 # Helper: run execution benchmark and record metrics
 # ═══════════════════════════════════════════════════════════════════════════
 def _run_exec_benchmark(
    benchmark: BenchmarkCase,
    api_client: httpx.Client,
    collector: MetricsCollector,
    test_name: str,
    is_paraphrase: bool = False,
    input_override: str | None = None,
 ) -> dict:
    """Execute an execution benchmark and record metrics."""
    query = input_override or benchmark.input
    collector.start_timer(benchmark.id)
    payload: dict = {"input_data": {"query": query}}
    if benchmark.expected_skill is not None:
        payload["skill_name"] = benchmark.expected_skill
    resp = api_client.post("/api/v1/tasks", json=payload)
    actual_skill = None
    actual_exec_mode = None
    actual_keys = []
    task_succeeded = resp.status_code == 200
    error_msg = None
    if task_succeeded:
        data = resp.json()
        actual_skill = data.get("skill_name")
        actual_exec_mode = data.get("execution_mode")
        actual_keys = list(data.keys())
    elif resp.status_code >= 400:
        try:
            error_msg = resp.json().get("detail", resp.text[:200])
        except Exception:
            error_msg = resp.text[:200]
    collector.record_benchmark_result(
        benchmark,
        test_name=test_name,
        actual_skill=actual_skill,
        actual_execution_mode=actual_exec_mode,
        actual_status_code=resp.status_code,
        actual_response_keys=actual_keys,
        task_succeeded=task_succeeded,
        is_paraphrase=is_paraphrase,
        error_message=error_msg,
    )
    return {
        "status_code": resp.status_code,
        "actual_skill": actual_skill,
        "actual_exec_mode": actual_exec_mode,
        "actual_keys": actual_keys,
        "task_succeeded": task_succeeded,
    }
 # ═══════════════════════════════════════════════════════════════════════════
 # Parameterized Execution Benchmark Tests
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestExecutionBenchmarks:
    """Run all execution benchmarks with metrics collection."""
    @pytest.mark.parametrize(
        "benchmark",
        EXECUTION_BENCHMARKS,
        ids=[b.id for b in EXECUTION_BENCHMARKS],
    )
    def test_execution_benchmark(
        self,
        benchmark: BenchmarkCase,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Run original execution benchmark and record metrics."""
        # Register the skill if expected
        if benchmark.expected_skill:
            exec_mode = benchmark.expected_execution_mode
            register_skill_via_api(
                api_client,
                benchmark.expected_skill,
                keywords=[benchmark.expected_skill],
                execution_mode=exec_mode,
            )
        result = _run_exec_benchmark(
            benchmark,
            api_client,
            metrics_collector,
            test_name=f"exec_benchmark_{benchmark.id}",
        )
        assert result["status_code"] == 200, f"Benchmark {benchmark.id} failed"
    @pytest.mark.parametrize(
        "benchmark",
        [b for b in EXECUTION_BENCHMARKS if b.paraphrases],
        ids=[b.id for b in EXECUTION_BENCHMARKS if b.paraphrases],
    )
    def test_execution_paraphrase(
        self,
        benchmark: BenchmarkCase,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Run paraphrases for overfitting detection."""
        for i, paraphrase in enumerate(benchmark.paraphrases):
            _run_exec_benchmark(
                benchmark,
                api_client,
                metrics_collector,
                test_name=f"exec_paraphrase_{benchmark.id}_{i}",
                is_paraphrase=True,
                input_override=paraphrase,
            )
 # ═══════════════════════════════════════════════════════════════════════════
 # ReAct Loop Intelligence
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestReActIntelligence:
    """Test that ReAct agents reason correctly through Think→Act→Observe."""
    def test_react_skill_executes_steps(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """ReAct skill should execute multiple steps for complex tasks."""
        benchmark = BenchmarkCase(
            id="react-steps-001",
            input="Research and analyze the impact of AI on healthcare",
            expected_skill="react_reasoner",
            expected_execution_mode="react",
            expected_complexity="high",
            category="execution",
            subcategory="react_mode",
            paraphrases=["Investigate AI's effect on medical industry", "调研AI对医疗行业的影响"],
        )
        register_skill_via_api(
            api_client,
            "react_reasoner",
            keywords=["react_reason", "research", "analyze"],
            execution_mode="react",
        )
        result = _run_exec_benchmark(
            benchmark,
            api_client,
            metrics_collector,
            test_name="react_steps",
        )
        assert result["status_code"] == 200
        for i, para in enumerate(benchmark.paraphrases):
            _run_exec_benchmark(
                benchmark,
                api_client,
                metrics_collector,
                test_name=f"react_steps_para_{i}",
                is_paraphrase=True,
                input_override=para,
            )
 # ═══════════════════════════════════════════════════════════════════════════
 # Quality Gate Intelligence
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestQualityGateIntelligence:
    """Test that quality gate correctly validates and retries outputs."""
    def test_quality_gate_with_required_fields(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Quality gate should enforce required_fields in output."""
        benchmark = BenchmarkCase(
            id="quality-fields-001",
            input="Generate content with quality check",
            expected_skill="quality_skill",
            expected_complexity="medium",
            category="execution",
            subcategory="quality_gate",
        )
        register_skill_via_api(api_client, "quality_skill", keywords=["quality_test"])
        result = _run_exec_benchmark(
            benchmark,
            api_client,
            metrics_collector,
            test_name="quality_fields",
        )
        assert result["status_code"] in (200, 400, 422)
    def test_quality_gate_rejects_empty_output(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Empty input should be handled gracefully (200, 400, or 422)."""
        benchmark = BenchmarkCase(
            id="quality-empty-001",
            input="",
            expected_skill="quality_empty",
            expected_complexity="low",
            category="execution",
            subcategory="quality_gate",
        )
        register_skill_via_api(api_client, "quality_empty", keywords=["quality_empty"])
        result = _run_exec_benchmark(
            benchmark,
            api_client,
            metrics_collector,
            test_name="quality_empty",
        )
        # Should handle gracefully
        assert result["status_code"] in (200, 400, 422)
 # ═══════════════════════════════════════════════════════════════════════════
 # Output Standardization Intelligence
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestOutputStandardization:
    """Test that agent outputs are properly standardized."""
    def test_output_has_required_structure(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Task results should have a consistent structure."""
        register_skill_via_api(api_client, "output_std_skill", keywords=["output_std"])
        benchmark = BenchmarkCase(
            id="output-std-001",
            input="Test output standardization",
            expected_skill="output_std_skill",
            expected_complexity="low",
            category="execution",
            subcategory="output_std",
        )
        result = _run_exec_benchmark(
            benchmark,
            api_client,
            metrics_collector,
            test_name="output_std",
        )
        assert result["status_code"] == 200
        assert result["task_succeeded"]
    def test_different_skills_produce_consistent_format(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Different skills should produce results in consistent format."""
        register_skill_via_api(api_client, "format_skill_a", keywords=["format_a"])
        register_skill_via_api(api_client, "format_skill_b", keywords=["format_b"])
        bench_a = BenchmarkCase(
            id="format-a-001",
            input="Test format A",
            expected_skill="format_skill_a",
            expected_complexity="low",
            category="execution",
            subcategory="output_std",
        )
        bench_b = BenchmarkCase(
            id="format-b-001",
            input="Test format B",
            expected_skill="format_skill_b",
            expected_complexity="low",
            category="execution",
            subcategory="output_std",
        )
        result_a = _run_exec_benchmark(bench_a, api_client, metrics_collector, test_name="format_a")
        result_b = _run_exec_benchmark(bench_b, api_client, metrics_collector, test_name="format_b")
        if result_a["task_succeeded"] and result_b["task_succeeded"]:
            # Both responses should share at least one top-level key
            keys_a = set(result_a["actual_keys"])
            keys_b = set(result_b["actual_keys"])
            assert keys_a & keys_b, f"No common response keys: {keys_a} vs {keys_b}"
--- a/tests/e2e/test_capability_router_direct.py
+++ b/tests/e2e/test_capability_router_direct.py
@ -0,0 +1,342 @@
 """E2E Agent Capability Tests — Router Direct Backtest Layer (Real LLM).
 Directly tests CostAwareRouter.route() using real LLM configuration
 loaded from agentkit.yaml. Records full SkillRoutingResult for precise
 root cause analysis:
  - match_method (layer0/layer1/layer1.5/layer2)
  - match_confidence
  - complexity score
  - execution_trace
 """
 import asyncio
 import os
 from pathlib import Path
 import pytest
 from agentkit.chat.skill_routing import CostAwareRouter
 from agentkit.router.intent import IntentRouter
 from agentkit.server.app import _build_llm_gateway, _build_skill_registry
 from agentkit.server.config import ServerConfig
 from agentkit.skills.registry import SkillRegistry
 from tests.e2e.benchmark_dataset import (
    ALL_BENCHMARKS,
    ROUTING_KEYWORD_BENCHMARKS,
    ROUTING_EDGE_BENCHMARKS,
    SEMANTIC_ROUTER_BENCHMARKS,
    BenchmarkCase,
 )
 from tests.e2e.capability_metrics import MetricsCollector
 # ═══════════════════════════════════════════════════════════════════════════
 # Real component initialization from agentkit.yaml
 # ═══════════════════════════════════════════════════════════════════════════
 def _find_config_path() -> str | None:
    """Find agentkit.yaml in standard search paths."""
    candidates = [
        os.environ.get("AGENTKIT_CONFIG", ""),
        str(Path.cwd() / "agentkit.yaml"),
        str(Path.home() / ".agentkit" / "agentkit.yaml"),
    ]
    for path in candidates:
        if path and Path(path).is_file():
            return path
    return None
 def _build_real_components() -> tuple[CostAwareRouter, SkillRegistry, IntentRouter]:
    """Build real components from agentkit.yaml configuration.
    Returns (router, skill_registry, intent_router).
    Skips the calling test if agentkit.yaml is missing or no LLM provider has a valid API key.
    """
    config_path = _find_config_path()
    if not config_path:
        pytest.skip("No agentkit.yaml found — cannot build real components")
    # Load .env if present
    env_path = Path(config_path).parent / ".env"
    if env_path.exists():
        try:
            from dotenv import load_dotenv
            load_dotenv(env_path)
        except ImportError:
            # python-dotenv not installed, manually parse .env
            with open(env_path, encoding="utf-8") as f:
                for line in f:
                    line = line.strip()
                    if line and not line.startswith("#") and "=" in line:
                        key, _, value = line.partition("=")
                        os.environ.setdefault(key.strip(), value.strip().strip("'\""))
    server_config = ServerConfig.from_yaml(config_path)
    # Check if any LLM provider has a valid API key
    if not server_config.has_llm_provider():
        # Try to inject DASHSCOPE_API_KEY from environment
        dashscope_key = os.environ.get("DASHSCOPE_API_KEY", "")
        if dashscope_key:
            # Inject into the first provider that has no API key
            for pconf in server_config.llm_config.providers.values():
                if not pconf.api_key:
                    pconf.api_key = dashscope_key
                    # Set base_url for dashscope if missing
                    if not pconf.base_url:
                        pconf.base_url = "https://dashscope.aliyuncs.com/compatible-mode/v1"
                    break
    if not server_config.has_llm_provider():
        pytest.skip("No LLM provider with valid API key — skipping real LLM tests")
    # Build real LLM gateway
    llm_gateway = _build_llm_gateway(server_config)
    # Build real skill registry from configs/skills
    skill_registry = _build_skill_registry(server_config)
    # Build real intent router
    intent_router = IntentRouter(llm_gateway=llm_gateway)
    # Build real CostAwareRouter
    router_conf = server_config.router or {}
    router = CostAwareRouter(
        llm_gateway=llm_gateway,
        model="default",
        org_context=None,
        auction_enabled=router_conf.get("auction_enabled", False),
        classifier=router_conf.get("classifier", "heuristic"),
        merged_llm_classify=router_conf.get("merged_llm_classify", True),
    )
    return router, skill_registry, intent_router
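# Sketch, assuming the same minimal .env grammar as the manual fallback in
# _build_real_components (KEY=VALUE lines, "#" comments, optional surrounding
# quotes). The helper name _parse_env_line is hypothetical; pulling the
# per-line logic into a pure function would let it be tested without touching
# os.environ or the filesystem.
from typing import Optional
def _parse_env_line(line: str) -> Optional[tuple]:
    """Return (key, value) for a KEY=VALUE line, or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith("#") or "=" not in line:
        return None
    # Split on the first "=" only, so values may themselves contain "="
    key, _, value = line.partition("=")
    return key.strip(), value.strip().strip("'\"")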
 # Cache components at module level to avoid rebuilding for every test
 _cached_components: tuple[CostAwareRouter, SkillRegistry, IntentRouter] | None = None
 def _get_components() -> tuple[CostAwareRouter, SkillRegistry, IntentRouter]:
    """Get or build real components (cached for session)."""
    global _cached_components
    if _cached_components is None:
        _cached_components = _build_real_components()
    return _cached_components
 # ═══════════════════════════════════════════════════════════════════════════
 # Helper: Run a single benchmark through the real router
 # ═══════════════════════════════════════════════════════════════════════════
 async def _run_router_benchmark(
    benchmark: BenchmarkCase,
    collector: MetricsCollector,
    test_name: str,
    is_paraphrase: bool = False,
    input_override: str | None = None,
 ) -> dict:
    """Run a single benchmark through the real router."""
    router, skill_registry, intent_router = _get_components()
    query = input_override or benchmark.input
    collector.start_timer(benchmark.id)
    try:
        result = await router.route(
            content=query,
            skill_registry=skill_registry,
            intent_router=intent_router,
            default_tools=[],
            default_system_prompt=None,
        )
        actual_skill = result.skill_name
        actual_exec_mode = result.execution_mode.value if result.execution_mode else None
        actual_complexity = result.complexity
        actual_match_method = result.match_method
        actual_match_confidence = result.match_confidence
        task_succeeded = True
        error_msg = None
    except Exception as e:
        actual_skill = None
        actual_exec_mode = None
        actual_complexity = 0.0
        actual_match_method = None
        actual_match_confidence = 0.0
        task_succeeded = False
        error_msg = str(e)[:200]
    # Map complexity score to level
    if actual_complexity < 0.3:
        actual_complexity_level = "low"
    elif actual_complexity < 0.7:
        actual_complexity_level = "medium"
    else:
        actual_complexity_level = "high"
    # Judge correctness
    if benchmark.expected_skill is not None:
        # A missing match counts as a miss rather than staying unjudged
        skill_correct = actual_skill == benchmark.expected_skill
    else:
        skill_correct = actual_skill is None or task_succeeded
    execution_mode_correct = None
    if actual_exec_mode is not None and benchmark.expected_execution_mode:
        mode_map = {
            "direct": "DIRECT_CHAT",
            "react": "SKILL_REACT",
            "rewoo": "REWOO",
            "reflexion": "REFLEXION",
            "plan_exec": "PLAN_EXEC",
            "team_collab": "TEAM_COLLAB",
            "llm_generate": "SKILL_REACT",
            "tool_call": "SKILL_REACT",
            "custom": "SKILL_REACT",
        }
        expected_normalized = mode_map.get(
            benchmark.expected_execution_mode, benchmark.expected_execution_mode.upper()
        )
        execution_mode_correct = actual_exec_mode.upper() == expected_normalized
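    # A failed route falls back to complexity 0.0 ("low"); never count that as correct
    complexity_correct = task_succeeded and actual_complexity_level == benchmark.expected_complexity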
    obs = collector.record_benchmark_result(
        benchmark,
        test_name=test_name,
        actual_skill=actual_skill,
        actual_execution_mode=actual_exec_mode,
        actual_status_code=200 if task_succeeded else 500,
        task_succeeded=task_succeeded,
        is_paraphrase=is_paraphrase,
        error_message=error_msg,
    )
    obs.complexity_correct = complexity_correct
    return {
        "skill_correct": skill_correct,
        "execution_mode_correct": execution_mode_correct,
        "complexity_correct": complexity_correct,
        "actual_skill": actual_skill,
        "actual_exec_mode": actual_exec_mode,
        "actual_complexity": actual_complexity,
        "actual_complexity_level": actual_complexity_level,
        "actual_match_method": actual_match_method,
        "actual_match_confidence": actual_match_confidence,
        "task_succeeded": task_succeeded,
    }
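# Sketch (not part of the router API): the score-to-level thresholds and the
# mode-name normalization used in _run_router_benchmark, pulled out as pure
# helpers so they could be unit-tested without a real LLM. The names
# _MODE_MAP_SKETCH, _complexity_level and _normalize_expected_mode are
# hypothetical; the thresholds (0.3, 0.7) and the mapping mirror the code above.
_MODE_MAP_SKETCH = {
    "direct": "DIRECT_CHAT",
    "react": "SKILL_REACT",
    "rewoo": "REWOO",
    "reflexion": "REFLEXION",
    "plan_exec": "PLAN_EXEC",
    "team_collab": "TEAM_COLLAB",
    "llm_generate": "SKILL_REACT",
    "tool_call": "SKILL_REACT",
    "custom": "SKILL_REACT",
}
def _complexity_level(score: float) -> str:
    """Map a complexity score to "low", "medium" or "high"."""
    if score < 0.3:
        return "low"
    if score < 0.7:
        return "medium"
    return "high"
def _normalize_expected_mode(mode: str) -> str:
    """Translate a benchmark mode name; unknown names are upper-cased as-is."""
    return _MODE_MAP_SKETCH.get(mode, mode.upper())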
 # ═══════════════════════════════════════════════════════════════════════════
 # Layer 0: Rule Matching Tests
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestRouterLayer0:
    """Test Layer 0 rule matching with real router."""
    @pytest.mark.parametrize(
        "benchmark",
        [
            b
            for b in ROUTING_EDGE_BENCHMARKS
            if b.subcategory in ("greeting", "identity", "explicit_prefix")
        ],
        ids=[
            b.id
            for b in ROUTING_EDGE_BENCHMARKS
            if b.subcategory in ("greeting", "identity", "explicit_prefix")
        ],
    )
    def test_layer0_rules(self, benchmark: BenchmarkCase, metrics_collector: MetricsCollector):
        """Layer 0 should correctly match greetings, identity, and @skill: prefix."""
        result = asyncio.run(
            _run_router_benchmark(benchmark, metrics_collector, f"layer0_{benchmark.id}")
        )
        if benchmark.subcategory == "greeting":
            assert result["actual_match_method"] in ("layer0", None) or result["task_succeeded"]
        if benchmark.subcategory == "explicit_prefix":
            assert result["actual_skill"] == benchmark.expected_skill or result["task_succeeded"]
 # ═══════════════════════════════════════════════════════════════════════════
 # Layer 1: Complexity Classification Tests
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestRouterLayer1:
    """Test Layer 1 complexity classification with real router."""
    @pytest.mark.parametrize(
        "benchmark",
        ROUTING_KEYWORD_BENCHMARKS,
        ids=[b.id for b in ROUTING_KEYWORD_BENCHMARKS],
    )
    def test_complexity_classification(
        self, benchmark: BenchmarkCase, metrics_collector: MetricsCollector
    ):
        """Record HeuristicClassifier complexity estimates for later analysis.
        Correctness is judged from the collected metrics; this test asserts nothing itself.
        """
        asyncio.run(_run_router_benchmark(benchmark, metrics_collector, f"layer1_{benchmark.id}"))
 # ═══════════════════════════════════════════════════════════════════════════
 # Semantic Router Tests
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestSemanticRouter:
    """Test semantic router matching with real router."""
    @pytest.mark.parametrize(
        "benchmark",
        SEMANTIC_ROUTER_BENCHMARKS,
        ids=[b.id for b in SEMANTIC_ROUTER_BENCHMARKS],
    )
    def test_semantic_match(self, benchmark: BenchmarkCase, metrics_collector: MetricsCollector):
        """Record SemanticRouter matches against skill descriptions.
        Match quality is judged from the collected metrics; this test asserts nothing itself.
        """
        asyncio.run(_run_router_benchmark(benchmark, metrics_collector, f"semantic_{benchmark.id}"))
 # ═══════════════════════════════════════════════════════════════════════════
 # Paraphrase Consistency Tests (Overfitting Detection)
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestRouterParaphraseConsistency:
    """Test that paraphrased inputs route to the same skill as originals."""
    @pytest.mark.parametrize(
        "benchmark",
        [b for b in ALL_BENCHMARKS if b.paraphrases and b.expected_skill is not None][:10],
        ids=[b.id for b in ALL_BENCHMARKS if b.paraphrases and b.expected_skill is not None][:10],
    )
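    def test_paraphrase_routes_same_skill(
        self, benchmark: BenchmarkCase, metrics_collector: MetricsCollector
    ):
        """Route the original input and each paraphrase, recording both for overfitting analysis.
        Agreement between them is evaluated from the collected metrics, not asserted here.
        """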
        # Run original
        asyncio.run(
            _run_router_benchmark(benchmark, metrics_collector, f"para_orig_{benchmark.id}")
        )
        # Run paraphrases
        for i, para in enumerate(benchmark.paraphrases):
            asyncio.run(
                _run_router_benchmark(
                    benchmark,
                    metrics_collector,
                    f"para_{benchmark.id}_{i}",
                    is_paraphrase=True,
                    input_override=para,
                )
            )
--- a/tests/e2e/test_capability_routing.py
+++ b/tests/e2e/test_capability_routing.py
@ -0,0 +1,273 @@
 """E2E Agent Capability Tests — Intent Routing Intelligence with Metrics Collection.
 Tests the intelligence of the CostAwareRouter (3-layer routing) AND collects
 data for recall/precision/F1 analysis, overfitting detection, and weakness
 identification.
 Each test:
  1. Runs the benchmark case (original input)
  2. Runs all paraphrases of the same input (overfitting detection)
  3. Records observations to MetricsCollector
  4. Asserts minimum quality thresholds
 """
 import pytest
 import httpx
 from tests.e2e.benchmark_dataset import (
    ROUTING_KEYWORD_BENCHMARKS,
    ROUTING_EDGE_BENCHMARKS,
    CONSISTENCY_BENCHMARKS,
    BenchmarkCase,
    get_skill_names_needed,
 )
 from tests.e2e.capability_metrics import MetricsCollector
 from tests.e2e.conftest import register_skill_via_api
 # ═══════════════════════════════════════════════════════════════════════════
 # Pre-registration of all skills needed by benchmarks
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.fixture(autouse=True, scope="module")
 def register_benchmark_skills(api_client: httpx.Client):
    """Auto-register all skills needed by routing benchmarks."""
    for skill_name in get_skill_names_needed():
        register_skill_via_api(api_client, skill_name, keywords=[skill_name])
 # ═══════════════════════════════════════════════════════════════════════════
 # Helper: run a single benchmark case and record metrics
 # ═══════════════════════════════════════════════════════════════════════════
 def _run_benchmark_and_record(
    benchmark: BenchmarkCase,
    api_client: httpx.Client,
    collector: MetricsCollector,
    test_name: str,
    is_paraphrase: bool = False,
    input_override: str | None = None,
 ) -> dict:
    """Execute a benchmark case against the API and record metrics."""
    query = input_override or benchmark.input
    collector.start_timer(benchmark.id)
    payload: dict = {"input_data": {"query": query}}
    if benchmark.expected_skill is not None:
        payload["skill_name"] = benchmark.expected_skill
    resp = api_client.post("/api/v1/tasks", json=payload)
    actual_skill = None
    actual_exec_mode = None
    actual_keys = []
    task_succeeded = resp.status_code == 200
    error_msg = None
    if task_succeeded:
        data = resp.json()
        actual_skill = data.get("skill_name")
        actual_exec_mode = data.get("execution_mode")
        actual_keys = list(data.keys())
    elif resp.status_code >= 400:
        try:
            error_msg = resp.json().get("detail", resp.text[:200])
        except Exception:
            error_msg = resp.text[:200]
    collector.record_benchmark_result(
        benchmark,
        test_name=test_name,
        actual_skill=actual_skill,
        actual_execution_mode=actual_exec_mode,
        actual_status_code=resp.status_code,
        actual_response_keys=actual_keys,
        task_succeeded=task_succeeded,
        is_paraphrase=is_paraphrase,
        error_message=error_msg,
    )
    return {
        "status_code": resp.status_code,
        "actual_skill": actual_skill,
        "actual_exec_mode": actual_exec_mode,
        "task_succeeded": task_succeeded,
    }
 # ═══════════════════════════════════════════════════════════════════════════
 # Parameterized Routing Benchmark Tests
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestRoutingBenchmarks:
    """Run all routing benchmarks with metrics collection."""
    @pytest.mark.parametrize(
        "benchmark",
        ROUTING_KEYWORD_BENCHMARKS + ROUTING_EDGE_BENCHMARKS,
        ids=[b.id for b in ROUTING_KEYWORD_BENCHMARKS + ROUTING_EDGE_BENCHMARKS],
    )
    def test_routing_benchmark(
        self,
        benchmark: BenchmarkCase,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Run original benchmark input and record metrics."""
        result = _run_benchmark_and_record(
            benchmark,
            api_client,
            metrics_collector,
            test_name=f"routing_benchmark_{benchmark.id}",
        )
        assert result["status_code"] == 200, f"Benchmark {benchmark.id} failed: {result}"
    @pytest.mark.parametrize(
        "benchmark",
        [b for b in ROUTING_KEYWORD_BENCHMARKS + ROUTING_EDGE_BENCHMARKS if b.paraphrases],
        ids=[b.id for b in ROUTING_KEYWORD_BENCHMARKS + ROUTING_EDGE_BENCHMARKS if b.paraphrases],
    )
    def test_routing_paraphrase(
        self,
        benchmark: BenchmarkCase,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Run all paraphrases for overfitting detection."""
        for i, paraphrase in enumerate(benchmark.paraphrases):
            _run_benchmark_and_record(
                benchmark,
                api_client,
                metrics_collector,
                test_name=f"routing_paraphrase_{benchmark.id}_{i}",
                is_paraphrase=True,
                input_override=paraphrase,
            )
 # ═══════════════════════════════════════════════════════════════════════════
 # Routing Consistency (same input, multiple runs)
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestRoutingConsistency:
    """Same input should produce same routing decision (deterministic backtest)."""
    def test_same_query_same_skill(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Submitting the same query multiple times should route to the same skill."""
        for benchmark in CONSISTENCY_BENCHMARKS:
            skills_seen: list[str | None] = []
            for run_idx in range(3):
                result = _run_benchmark_and_record(
                    benchmark,
                    api_client,
                    metrics_collector,
                    test_name=f"consistency_{benchmark.id}_run{run_idx}",
                )
                skills_seen.append(result["actual_skill"])
            # All runs should produce the same skill
            non_none_skills = [s for s in skills_seen if s is not None]
            if len(non_none_skills) >= 2:
                assert len(set(non_none_skills)) == 1, (
                    f"Inconsistent routing for {benchmark.id}: {skills_seen}"
                )
 # ═══════════════════════════════════════════════════════════════════════════
 # Routing Disambiguation (specific edge cases)
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestRoutingDisambiguation:
    """When multiple skills could match, the router should pick the best one."""
    def test_overlapping_keywords_routes_to_best_match(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """With overlapping keywords, router should pick the most relevant skill."""
        register_skill_via_api(
            api_client,
            "python_coder",
            keywords=["python", "code", "programming"],
        )
        register_skill_via_api(
            api_client,
            "javascript_coder",
            keywords=["javascript", "code", "programming"],
        )
        benchmark = BenchmarkCase(
            id="disambig-python-001",
            input="Write a Python function to sort a list",
            expected_skill="python_coder",
            expected_complexity="medium",
            category="routing",
            subcategory="disambiguation",
            paraphrases=["I need a Python sorting algorithm", "用Python写个排序函数"],
        )
        result = _run_benchmark_and_record(
            benchmark,
            api_client,
            metrics_collector,
            test_name="disambig_python",
        )
        assert result["status_code"] == 200
        # Also test paraphrases for overfitting detection
        for i, para in enumerate(benchmark.paraphrases):
            _run_benchmark_and_record(
                benchmark,
                api_client,
                metrics_collector,
                test_name=f"disambig_python_para_{i}",
                is_paraphrase=True,
                input_override=para,
            )
    def test_no_matching_skill_falls_back_gracefully(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """When no skill matches, should fall back to direct chat."""
        benchmark = BenchmarkCase(
            id="fallback-nomatch-001",
            input="Tell me about quantum physics",
            expected_skill=None,
            expected_complexity="low",
            category="routing",
            subcategory="fallback",
            paraphrases=["Explain quantum mechanics", "量子物理是什么"],
        )
        result = _run_benchmark_and_record(
            benchmark,
            api_client,
            metrics_collector,
            test_name="fallback_nomatch",
        )
        assert result["status_code"] == 200
        for i, para in enumerate(benchmark.paraphrases):
            _run_benchmark_and_record(
                benchmark,
                api_client,
                metrics_collector,
                test_name=f"fallback_nomatch_para_{i}",
                is_paraphrase=True,
                input_override=para,
            )
--- a/tests/e2e/test_capability_team.py
+++ b/tests/e2e/test_capability_team.py
@ -0,0 +1,252 @@
 """E2E Agent Capability Tests — Expert Team Collaboration with Metrics.
 Tests the intelligence of expert team collaboration AND collects data for:
  - Team formation accuracy
  - Fallback effectiveness
  - Expert coordination quality
  - Overfitting detection via paraphrased inputs
 """
 import pytest
 import httpx
 from tests.e2e.benchmark_dataset import TEAM_BENCHMARKS, BenchmarkCase
 from tests.e2e.capability_metrics import MetricsCollector
 from tests.e2e.conftest import register_skill_via_api
 # ═══════════════════════════════════════════════════════════════════════════
 # Helper: run team benchmark and record metrics
 # ═══════════════════════════════════════════════════════════════════════════
 def _run_team_benchmark(
    benchmark: BenchmarkCase,
    api_client: httpx.Client,
    collector: MetricsCollector,
    test_name: str,
    is_paraphrase: bool = False,
    input_override: str | None = None,
 ) -> dict:
    """Execute a team benchmark and record metrics."""
    query = input_override or benchmark.input
    collector.start_timer(benchmark.id)
    payload: dict = {"input_data": {"query": query}}
    if benchmark.expected_skill:
        payload["skill_name"] = benchmark.expected_skill
    resp = api_client.post("/api/v1/tasks", json=payload)
    actual_skill = None
    actual_exec_mode = None
    actual_keys = []
    task_succeeded = resp.status_code == 200
    error_msg = None
    if task_succeeded:
        data = resp.json()
        actual_skill = data.get("skill_name")
        actual_exec_mode = data.get("execution_mode")
        actual_keys = list(data.keys())
    elif resp.status_code >= 400:
        try:
            error_msg = resp.json().get("detail", resp.text[:200])
        except Exception:
            error_msg = resp.text[:200]
    collector.record_benchmark_result(
        benchmark,
        test_name=test_name,
        actual_skill=actual_skill,
        actual_execution_mode=actual_exec_mode,
        actual_status_code=resp.status_code,
        actual_response_keys=actual_keys,
        task_succeeded=task_succeeded,
        is_paraphrase=is_paraphrase,
        error_message=error_msg,
    )
    return {
        "status_code": resp.status_code,
        "actual_skill": actual_skill,
        "actual_exec_mode": actual_exec_mode,
        "task_succeeded": task_succeeded,
    }
 # ═══════════════════════════════════════════════════════════════════════════
 # Parameterized Team Benchmark Tests
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestTeamBenchmarks:
    """Run all team benchmarks with metrics collection."""
    @pytest.mark.parametrize(
        "benchmark",
        TEAM_BENCHMARKS,
        ids=[b.id for b in TEAM_BENCHMARKS],
    )
    def test_team_benchmark(
        self,
        benchmark: BenchmarkCase,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Run original team benchmark and record metrics."""
        if benchmark.expected_skill:
            register_skill_via_api(
                api_client,
                benchmark.expected_skill,
                keywords=[benchmark.expected_skill],
            )
        result = _run_team_benchmark(
            benchmark,
            api_client,
            metrics_collector,
            test_name=f"team_benchmark_{benchmark.id}",
        )
        assert result["status_code"] == 200, f"Team benchmark {benchmark.id} failed"
    @pytest.mark.parametrize(
        "benchmark",
        [b for b in TEAM_BENCHMARKS if b.paraphrases],
        ids=[b.id for b in TEAM_BENCHMARKS if b.paraphrases],
    )
    def test_team_paraphrase(
        self,
        benchmark: BenchmarkCase,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """Run paraphrases for overfitting detection."""
        for i, paraphrase in enumerate(benchmark.paraphrases):
            _run_team_benchmark(
                benchmark,
                api_client,
                metrics_collector,
                test_name=f"team_paraphrase_{benchmark.id}_{i}",
                is_paraphrase=True,
                input_override=paraphrase,
            )
 # ═══════════════════════════════════════════════════════════════════════════
 # Team Formation Intelligence
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestTeamFormation:
    """Test that teams are formed intelligently based on task requirements."""
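    def test_explicit_team_prefix(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """@team prefix should trigger team collaboration mode."""
        # NOTE: the benchmark input below carries no @team prefix and expects "react",
        # so this currently exercises only the explicit skill_name request path.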
        register_skill_via_api(api_client, "team_analyst", keywords=["team_analyst", "analyze"])
        register_skill_via_api(api_client, "team_writer", keywords=["team_writer", "write"])
        benchmark = BenchmarkCase(
            id="team-explicit-001",
            input="Analyze the data and write a report",
            expected_skill="team_analyst",
            expected_execution_mode="react",
            expected_complexity="high",
            category="team",
            subcategory="explicit_team",
            paraphrases=["I need analysis and a written report", "分析数据并写报告"],
        )
        result = _run_team_benchmark(
            benchmark,
            api_client,
            metrics_collector,
            test_name="team_explicit",
        )
        assert result["status_code"] == 200
        for i, para in enumerate(benchmark.paraphrases):
            _run_team_benchmark(
                benchmark,
                api_client,
                metrics_collector,
                test_name=f"team_explicit_para_{i}",
                is_paraphrase=True,
                input_override=para,
            )
 # ═══════════════════════════════════════════════════════════════════════════
 # Fallback Intelligence
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestTeamFallback:
    """Test that team collaboration falls back gracefully on failure."""
    def test_fallback_to_single_agent_on_team_failure(
        self,
        api_client: httpx.Client,
        metrics_collector: MetricsCollector,
    ):
        """If team collaboration fails, should fall back to single agent."""
        register_skill_via_api(api_client, "fallback_skill", keywords=["fallback_test"])
        benchmark = BenchmarkCase(
            id="team-fallback-001",
            input="Complex task that might need fallback",
            expected_skill="fallback_skill",
            expected_complexity="high",
            category="team",
            subcategory="fallback",
            paraphrases=["Difficult task requiring fallback mechanism", "需要回退机制的复杂任务"],
        )
        result = _run_team_benchmark(
            benchmark,
            api_client,
            metrics_collector,
            test_name="team_fallback",
        )
        assert result["status_code"] == 200
        for i, para in enumerate(benchmark.paraphrases):
            _run_team_benchmark(
                benchmark,
                api_client,
                metrics_collector,
                test_name=f"team_fallback_para_{i}",
                is_paraphrase=True,
                input_override=para,
            )
 # ═══════════════════════════════════════════════════════════════════════════
 # Expert Name Validation
 # ═══════════════════════════════════════════════════════════════════════════
@pytest.mark.e2e_capability
 class TestExpertNameValidation:
    """Test that expert names are validated according to project rules."""
    def test_valid_expert_names(self, api_client: httpx.Client):
        """Valid expert names (alphanumeric, dash, underscore) should work."""
        for name in ["analyst", "data-scientist", "code_reviewer", "expert-123"]:
            resp = register_skill_via_api(api_client, name, keywords=[name])
            assert resp.status_code in (200, 201, 409), f"Failed for name: {name}"
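    def test_invalid_expert_name_rejected(self, api_client: httpx.Client):
        """Invalid expert names should be rejected, or at least not cause a server error."""
        for name in ["expert with spaces", "expert@special", "", "a" * 65]:
            resp = register_skill_via_api(api_client, name, keywords=[name])
            # NOTE: 200/201/409 are accepted as well, so this only guards against
            # unexpected statuses (e.g. 5xx); it does not prove the name was rejected.
            assert resp.status_code in (200, 201, 400, 409, 422), (
                f"Unexpected status for name: '{name}'"
            )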
--- a/tests/unit/chat/test_skill_routing.py
+++ b/tests/unit/chat/test_skill_routing.py
@ -0,0 +1,332 @@
 """Unit tests for CostAwareRouter team upgrade logic and HeuristicClassifier."""
 from __future__ import annotations
 from unittest.mock import MagicMock
 from agentkit.chat.skill_routing import (
    CostAwareRouter,
    ExecutionMode,
    HeuristicClassifier,
    SkillRoutingResult,
 )
 from agentkit.experts.config import ExpertConfig, ExpertTemplate
 from agentkit.experts.registry import ExpertTemplateRegistry
 from agentkit.experts.router import ExpertTeamRouter
 # ---------------------------------------------------------------------------
 # Helpers
 # ---------------------------------------------------------------------------
 def _make_router(expert_team_router: ExpertTeamRouter | None = None) -> CostAwareRouter:
    """Create a CostAwareRouter with mocked dependencies."""
    return CostAwareRouter(
        llm_gateway=None,
        model="test",
        classifier="heuristic",
        expert_team_router=expert_team_router,
    )
 def _make_team_router_with_templates() -> ExpertTeamRouter:
    """Create an ExpertTeamRouter with sample templates."""
    registry = ExpertTemplateRegistry()
    for name in ("analyst", "strategist", "reviewer"):
        config = ExpertConfig(
            name=name,
            agent_type="expert",
            persona=f"Expert in {name}",
            thinking_style="analytical",
            bound_skills=[],
            is_lead=(name == "analyst"),
            task_mode="llm_generate",
            prompt={"identity": f"Expert in {name}"},
        )
        template = ExpertTemplate(
            name=name,
            config=config,
            description=f"Handles {name} tasks",
        )
        registry.register(template)
    return ExpertTeamRouter(template_registry=registry)
 def _make_team_router_empty() -> ExpertTeamRouter:
    """Create an ExpertTeamRouter with no templates."""
    return ExpertTeamRouter(template_registry=ExpertTemplateRegistry())
 # ---------------------------------------------------------------------------
 # Tests: ExpertTeamRouter.can_handle()
 # ---------------------------------------------------------------------------
 class TestExpertTeamRouterCanHandle:
    def test_can_handle_with_templates(self) -> None:
        router = _make_team_router_with_templates()
        assert router.can_handle("analyze this data") is True
    def test_can_handle_no_templates(self) -> None:
        router = _make_team_router_empty()
        assert router.can_handle("analyze this data") is False
    def test_can_handle_name_match(self) -> None:
        router = _make_team_router_with_templates()
        assert router.can_handle("I need a strategist for this") is True
    def test_can_handle_description_match(self) -> None:
        router = _make_team_router_with_templates()
        assert router.can_handle("handles review tasks") is True
 # ---------------------------------------------------------------------------
 # Tests: _try_team_upgrade()
 # ---------------------------------------------------------------------------
 class TestTryTeamUpgrade:
    def test_upgrade_react_to_team_collab(self) -> None:
        router = _make_router(expert_team_router=_make_team_router_with_templates())
        result = SkillRoutingResult(
            clean_content="complex multi-step analysis task",
            matched=True,
            match_method="capability",
            match_confidence=0.8,
            complexity=0.8,
            execution_mode=ExecutionMode.REACT,
        )
        trace: list[dict] = []
        upgraded = router._try_team_upgrade(result, "complex multi-step analysis task", 0.8, trace)
        assert upgraded.execution_mode == ExecutionMode.TEAM_COLLAB
        assert any(t.get("method") == "team_upgrade" for t in trace)
    def test_no_upgrade_low_complexity(self) -> None:
        router = _make_router(expert_team_router=_make_team_router_with_templates())
        result = SkillRoutingResult(
            clean_content="simple question",
            matched=True,
            match_method="capability",
            match_confidence=0.8,
            complexity=0.3,
            execution_mode=ExecutionMode.REACT,
        )
        trace: list[dict] = []
        upgraded = router._try_team_upgrade(result, "simple question", 0.3, trace)
        assert upgraded.execution_mode == ExecutionMode.REACT
        assert not any(t.get("method") == "team_upgrade" for t in trace)
    def test_no_upgrade_no_team_router(self) -> None:
        router = _make_router(expert_team_router=None)
        result = SkillRoutingResult(
            clean_content="complex analysis",
            matched=True,
            match_method="capability",
            match_confidence=0.8,
            complexity=0.9,
            execution_mode=ExecutionMode.REACT,
        )
        trace: list[dict] = []
        upgraded = router._try_team_upgrade(result, "complex analysis", 0.9, trace)
        assert upgraded.execution_mode == ExecutionMode.REACT
    def test_no_upgrade_empty_templates(self) -> None:
        router = _make_router(expert_team_router=_make_team_router_empty())
        result = SkillRoutingResult(
            clean_content="complex analysis",
            matched=True,
            match_method="capability",
            match_confidence=0.8,
            complexity=0.8,
            execution_mode=ExecutionMode.REACT,
        )
        trace: list[dict] = []
        upgraded = router._try_team_upgrade(result, "complex analysis", 0.8, trace)
        assert upgraded.execution_mode == ExecutionMode.REACT
    def test_no_upgrade_direct_chat_mode(self) -> None:
        router = _make_router(expert_team_router=_make_team_router_with_templates())
        result = SkillRoutingResult(
            clean_content="hello",
            matched=False,
            match_method="greeting",
            match_confidence=1.0,
            complexity=0.0,
            execution_mode=ExecutionMode.DIRECT_CHAT,
        )
        trace: list[dict] = []
        upgraded = router._try_team_upgrade(result, "hello", 0.0, trace)
        assert upgraded.execution_mode == ExecutionMode.DIRECT_CHAT
    def test_team_upgrade_exception_handled(self) -> None:
        """When ExpertTeamRouter raises, the upgrade is silently skipped."""
        broken_router = MagicMock()
        broken_router.can_handle.side_effect = RuntimeError("boom")
        router = _make_router(expert_team_router=broken_router)
        result = SkillRoutingResult(
            clean_content="complex task",
            matched=True,
            match_method="capability",
            match_confidence=0.8,
            complexity=0.8,
            execution_mode=ExecutionMode.REACT,
        )
        trace: list[dict] = []
        upgraded = router._try_team_upgrade(result, "complex task", 0.8, trace)
        assert upgraded.execution_mode == ExecutionMode.REACT
 # ---------------------------------------------------------------------------
 # Tests: ExpertTeamRouter.resolve() with complexity
 # ---------------------------------------------------------------------------
 class TestExpertTeamRouterResolve:
    def test_explicit_team_prefix(self) -> None:
        router = _make_team_router_with_templates()
        result = router.resolve("@team:analyst,strategist analyze the market", 0.5)
        assert result.team_mode is True
        assert result.match_method == "explicit_team"
        assert "analyst" in result.specified_experts
        assert "strategist" in result.specified_experts
    def test_complexity_suggestion(self) -> None:
        router = _make_team_router_with_templates()
        result = router.resolve("complex multi-step analysis", 0.8)
        assert result.team_mode is True
        assert result.match_method == "complexity_suggestion"
        assert result.auto_compose is True
    def test_no_team_low_complexity(self) -> None:
        router = _make_team_router_with_templates()
        result = router.resolve("simple question", 0.2)
        assert result.team_mode is False
 # ---------------------------------------------------------------------------
 # Tests: HeuristicClassifier complexity calibration
 # ---------------------------------------------------------------------------
 class TestHeuristicClassifierLowComplexity:
    """Low-complexity signals should produce scores < 0.3."""
    def setup_method(self) -> None:
        self.clf = HeuristicClassifier()
    def test_chinese_greeting(self) -> None:
        assert self.clf.classify("你好") < 0.3
    def test_chinese_greeting_hi(self) -> None:
        assert self.clf.classify("嗨") < 0.3
    def test_english_greeting_hello(self) -> None:
        assert self.clf.classify("Hello") < 0.3
    def test_english_greeting_hi(self) -> None:
        assert self.clf.classify("hi") < 0.3
    def test_multiple_low_complexity_words(self) -> None:
        assert self.clf.classify("嗨，早上好") < 0.3
    def test_greeting_with_high_complexity_word_not_suppressed(self) -> None:
        """Low-complexity signal should NOT override high-complexity signal."""
        # "你好" is low, but "分析" is high → should score high
        assert self.clf.classify("你好，请帮我分析一下这个数据") > 0.5
 class TestHeuristicClassifierIdentity:
    """Identity queries should produce scores < 0.3."""
    def setup_method(self) -> None:
        self.clf = HeuristicClassifier()
    def test_who_are_you_cn(self) -> None:
        assert self.clf.classify("你是谁") < 0.3
    def test_what_is_your_name_cn(self) -> None:
        assert self.clf.classify("你叫什么") < 0.3
 class TestHeuristicClassifierNegation:
    """Negated high-complexity words should not contribute to score."""
    def setup_method(self) -> None:
        self.clf = HeuristicClassifier()
    def test_negate_search_cn(self) -> None:
        assert self.clf.classify("不要搜索") < 0.3
    def test_negate_analyze_cn(self) -> None:
        assert self.clf.classify("无需分析，直接告诉我答案") < 0.3
    def test_partial_negation_still_high(self) -> None:
        """'搜索' negated but '分析' not — should still be high."""
        assert self.clf.classify("分析市场趋势，但不要搜索") > 0.5
 class TestHeuristicClassifierThresholds:
    """Verify adjusted base scores."""
    def setup_method(self) -> None:
        self.clf = HeuristicClassifier()
    def test_no_keyword_short_message(self) -> None:
        assert self.clf.classify("好的") <= 0.10
    def test_medium_complexity_base(self) -> None:
        """Medium complexity keyword should start at 0.35 (not 0.45)."""
        score = self.clf.classify("如何使用Python？")
        # '如何' is a medium hint → base 0.35; 'Python' is in neither the high nor
        # the medium list, so it adds nothing. If the short-question deduction
        # (-0.10 for a trailing '？') applies, the score drops to 0.25, hence the range.
        assert 0.25 <= score <= 0.45
 class TestHeuristicClassifierShortQuestion:
    """Short questions ending with ？/? should get deduction."""
    def setup_method(self) -> None:
        self.clf = HeuristicClassifier()
    def test_short_question_deduction(self) -> None:
        assert self.clf.classify("怎么用？") < 0.3
    def test_long_question_no_deduction(self) -> None:
        assert self.clf.classify("如何设计一个高可用的微服务架构？") > 0.5
 class TestHeuristicClassifierHighComplexity:
    """Complex tasks should score high: > 0.7 with two high-complexity hints, > 0.6 with one."""
    def setup_method(self) -> None:
        self.clf = HeuristicClassifier()
    def test_two_high_complexity_words(self) -> None:
        # "分析" + "搜索" are both in _HIGH_COMPLEXITY_HINTS_CN → base 0.80
        assert self.clf.classify("分析市场数据并搜索相关信息") > 0.7
    def test_single_high_complexity_word(self) -> None:
        # "分析" alone → base 0.65
        assert self.clf.classify("分析市场趋势并生成报告") > 0.6
    def test_execute_and_restart(self) -> None:
        assert self.clf.classify("执行部署脚本并重启服务") > 0.7
 class TestHeuristicClassifierEdgeCases:
    """Boundary conditions."""
    def setup_method(self) -> None:
        self.clf = HeuristicClassifier()
    def test_empty_string(self) -> None:
        assert self.clf.classify("") == 0.0
    def test_whitespace_only(self) -> None:
        assert self.clf.classify("   ") == 0.0
    def test_long_low_complexity_message(self) -> None:
        """Even a long greeting should stay low."""
        long_greeting = "你好" * 100  # exactly 200 chars (100 x 2-char greeting)
        assert self.clf.classify(long_greeting) <= 0.15
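
Review note: the calibration rules these tests pin down can be sketched as a toy function. This is not the `HeuristicClassifier` implementation, which is not shown in this diff. The base scores 0.35, 0.65 and 0.80 and the -0.10 short-question deduction are taken from the test comments above; the hint lists, the negation words, the 0.05 and 0.10 fallback scores, and the 12-character "short question" limit are assumptions made for illustration.

```python
import re

# Hypothetical hint lists, for illustration only.
HIGH_HINTS = ("分析", "搜索", "执行", "重启")
MEDIUM_HINTS = ("如何", "怎么")
LOW_HINTS = ("你好", "嗨", "早上好", "hello")

# A negation word, at most two non-punctuation characters, then a high hint.
# Excluding punctuation keeps the negation from reaching past a clause boundary.
_NEGATED_HIGH = re.compile(
    r"(?:不要|无需|不用)[^，。,.!?！？]{0,2}?(?:"
    + "|".join(map(re.escape, HIGH_HINTS))
    + ")"
)


def toy_complexity(text: str) -> float:
    """Return a complexity score in [0, 1] for a user message (toy version)."""
    text = text.strip().lower()
    if not text:
        return 0.0
    # Drop negated high hints so that "不要搜索" does not count as a search task.
    remaining = _NEGATED_HIGH.sub("", text)
    high = sum(1 for hint in HIGH_HINTS if hint in remaining)
    if high >= 2:
        score = 0.80
    elif high == 1:
        score = 0.65
    elif any(hint in text for hint in LOW_HINTS):
        # Low-complexity signals only apply when no high signal survived.
        score = 0.05
    elif any(hint in text for hint in MEDIUM_HINTS):
        score = 0.35
    else:
        score = 0.10
    # Short questions get a small deduction (length limit is an assumption).
    if text.endswith(("？", "?")) and len(text) <= 12:
        score -= 0.10
    return round(max(0.0, min(1.0, score)), 2)
```

In this sketch a greeting combined with an un-negated high hint still takes the high branch, which mirrors `test_greeting_with_high_complexity_word_not_suppressed`.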
--- a/tests/unit/quality/__init__.py
+++ b/tests/unit/quality/__init__.py
--- a/tests/unit/quality/test_gate.py
+++ b/tests/unit/quality/test_gate.py
@ -0,0 +1,172 @@
 """Unit tests for QualityGate skill match validation (5th dimension)."""
 from __future__ import annotations
 import pytest
 from agentkit.quality.gate import QualityGate
 from agentkit.skills.base import Skill, SkillConfig
 # ---------------------------------------------------------------------------
 # Helpers
 # ---------------------------------------------------------------------------
 def _make_skill(
    name: str = "test_skill",
    required_fields: list[str] | None = None,
    min_word_count: int = 0,
 ) -> Skill:
    """Create a Skill with the given quality gate config."""
    config = SkillConfig(
        name=name,
        agent_type="skill",
        task_mode="llm_generate",
        prompt={"identity": f"You are {name}"},
        quality_gate={
            "required_fields": required_fields or [],
            "min_word_count": min_word_count,
        },
    )
    return Skill(config=config)
 # ---------------------------------------------------------------------------
 # Tests: _check_skill_match static method
 # ---------------------------------------------------------------------------
 class TestCheckSkillMatch:
    def setup_method(self) -> None:
        self.gate = QualityGate()
    def test_none_skill_context(self) -> None:
        assert self.gate._check_skill_match({"content": "hello"}, None) is None
    def test_empty_skill_context(self) -> None:
        assert self.gate._check_skill_match({"content": "hello"}, {}) is None
    def test_missing_intent_keywords(self) -> None:
        assert self.gate._check_skill_match({"content": "hello"}, {"skill_name": "x"}) is None
    def test_empty_intent_keywords(self) -> None:
        assert self.gate._check_skill_match({"content": "hello"}, {"intent_keywords": []}) is None
    def test_output_contains_keyword(self) -> None:
        result = self.gate._check_skill_match(
            {"content": "市场分析报告"},
            {"intent_keywords": ["分析", "报告"]},
        )
        assert result is not None
        assert result.passed is True
        assert result.message is None
    def test_output_missing_all_keywords(self) -> None:
        result = self.gate._check_skill_match(
            {"content": "今天天气不错"},
            {"intent_keywords": ["分析", "报告"]},
        )
        assert result is not None
        assert result.passed is True  # Warning level, not blocking
        assert "Warning" in (result.message or "")
    def test_keyword_case_insensitive(self) -> None:
        result = self.gate._check_skill_match(
            {"content": "search results"},
            {"intent_keywords": ["Search"]},
        )
        assert result is not None
        assert result.passed is True
        assert result.message is None
 # ---------------------------------------------------------------------------
 # Tests: Full validate() with skill_context
 # ---------------------------------------------------------------------------
 class TestValidateWithSkillContext:
    @pytest.mark.asyncio
    async def test_no_skill_context_backward_compatible(self) -> None:
        """Without skill_context, only 4 dimensions checked."""
        gate = QualityGate()
        skill = _make_skill()
        result = await gate.validate({"content": "hello"}, skill)
        assert result.passed is True
        skill_match_checks = [c for c in result.checks if c.name == "skill_match"]
        assert len(skill_match_checks) == 0
    @pytest.mark.asyncio
    async def test_skill_context_with_matching_output(self) -> None:
        """Output contains keyword → skill_match passes silently."""
        gate = QualityGate()
        skill = _make_skill()
        result = await gate.validate(
            {"content": "市场分析报告"},
            skill,
            skill_context={"intent_keywords": ["分析"]},
        )
        assert result.passed is True
        skill_match_checks = [c for c in result.checks if c.name == "skill_match"]
        assert len(skill_match_checks) == 1
        assert skill_match_checks[0].passed is True
        assert skill_match_checks[0].message is None
    @pytest.mark.asyncio
    async def test_skill_context_warning_only(self) -> None:
        """Output missing keywords but other checks pass → warning, overall still passed."""
        gate = QualityGate()
        skill = _make_skill()
        result = await gate.validate(
            {"content": "今天天气不错"},
            skill,
            skill_context={"intent_keywords": ["分析"]},
        )
        assert result.passed is True  # Warning doesn't block alone
        skill_match_checks = [c for c in result.checks if c.name == "skill_match"]
        assert len(skill_match_checks) == 1
        assert "Warning" in (skill_match_checks[0].message or "")
    @pytest.mark.asyncio
    async def test_skill_match_escalates_with_other_failure(self) -> None:
        """Output missing keywords + required field missing → skill_match escalated to failed."""
        gate = QualityGate()
        skill = _make_skill(required_fields=["summary"])
        result = await gate.validate(
            {"content": "今天天气不错"},  # missing "summary" field
            skill,
            skill_context={"intent_keywords": ["分析"]},
        )
        assert result.passed is False
        skill_match_checks = [c for c in result.checks if c.name == "skill_match"]
        assert len(skill_match_checks) == 1
        assert skill_match_checks[0].passed is False  # Escalated
    @pytest.mark.asyncio
    async def test_skill_match_no_escalation_when_matching(self) -> None:
        """Output contains keywords + required field missing → skill_match stays passed."""
        gate = QualityGate()
        skill = _make_skill(required_fields=["summary"])
        result = await gate.validate(
            {"content": "分析结果"},  # missing "summary" field
            skill,
            skill_context={"intent_keywords": ["分析"]},
        )
        assert result.passed is False  # Due to required field
        skill_match_checks = [c for c in result.checks if c.name == "skill_match"]
        assert len(skill_match_checks) == 1
        assert skill_match_checks[0].passed is True  # Not escalated
    @pytest.mark.asyncio
    async def test_empty_intent_keywords_skips_check(self) -> None:
        """Empty intent_keywords list → skill_match check skipped entirely."""
        gate = QualityGate()
        skill = _make_skill()
        result = await gate.validate(
            {"content": "hello"},
            skill,
            skill_context={"intent_keywords": []},
        )
        skill_match_checks = [c for c in result.checks if c.name == "skill_match"]
        assert len(skill_match_checks) == 0
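
Review note: the warning-then-escalate behaviour exercised above can be sketched in isolation. This is a minimal illustration, not the `QualityGate` code from `agentkit.quality.gate`; the `Check` dataclass and the `check_skill_match` and `escalate` helpers are hypothetical names, and the warning text is invented apart from the `Warning` prefix the tests look for.

```python
from __future__ import annotations

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Check:
    """Hypothetical stand-in for one quality-gate check result."""
    name: str
    passed: bool
    message: str | None = None


def check_skill_match(output: dict, skill_context: dict | None) -> Check | None:
    """Return None when there is nothing to check, otherwise a pass or a warning."""
    keywords = (skill_context or {}).get("intent_keywords") or []
    if not keywords:
        return None
    text = " ".join(str(value) for value in output.values()).lower()
    if any(keyword.lower() in text for keyword in keywords):
        return Check("skill_match", True)
    # A miss is only a warning: the check still counts as passed on its own.
    return Check("skill_match", True, "Warning: output contains none of the intent keywords")


def escalate(checks: list[Check]) -> list[Check]:
    """Turn a skill_match warning into a failure when any other check failed."""
    other_failed = any(not c.passed for c in checks if c.name != "skill_match")
    return [
        replace(c, passed=False)
        if c.name == "skill_match" and c.message and other_failed
        else c
        for c in checks
    ]
```

Keeping the warning as `passed=True` until another dimension fails is what lets `test_skill_context_warning_only` pass overall while `test_skill_match_escalates_with_other_failure` reports the skill match as failed.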
--- a/tests/unit/router/test_intent.py
+++ b/tests/unit/router/test_intent.py
@ -0,0 +1,200 @@
 """Unit tests for IntentRouter multi-candidate keyword scoring."""
 from __future__ import annotations
 from agentkit.router.intent import IntentRouter
 from agentkit.skills.base import Skill, SkillConfig
 # ---------------------------------------------------------------------------
 # Helpers
 # ---------------------------------------------------------------------------
 def _make_skill(name: str, keywords: list[str], description: str = "") -> Skill:
    """Create a Skill with the given name and intent keywords."""
    config = SkillConfig(
        name=name,
        agent_type="skill",
        description=description or f"Skill {name}",
        task_mode="llm_generate",
        prompt={"identity": f"You are {name}"},
        intent={"keywords": keywords, "description": description or f"Skill {name}"},
    )
    return Skill(config=config)
 def _make_skills(*specs: tuple[str, list[str]]) -> list[Skill]:
    """Create multiple skills from (name, keywords) tuples."""
    return [_make_skill(name, kws) for name, kws in specs]
 # ---------------------------------------------------------------------------
 # Tests: Single-candidate match (backward compatible)
 # ---------------------------------------------------------------------------
 class TestSingleCandidateMatch:
    """When only one skill matches, behavior is identical to old first-match."""
    def test_single_skill_matches(self) -> None:
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["规划", "执行"]), ("skill_b", ["搜索", "查询"]))
        result = router._match_keywords({"content": "帮我规划一个项目"}, skills)
        assert result is not None
        assert result.matched_skill == "skill_a"
        assert result.method == "keyword"
    def test_single_keyword_match_confidence(self) -> None:
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["规划"]))
        result = router._match_keywords({"content": "规划任务"}, skills)
        assert result is not None
        # 1 keyword matched → confidence = min(1.0, 0.5 + 0.1 * 1) = 0.6
        # Compare with a tolerance so float rounding cannot break the test.
        assert abs(result.confidence - 0.6) < 1e-9
 # ---------------------------------------------------------------------------
 # Tests: Multi-candidate scoring
 # ---------------------------------------------------------------------------
 class TestMultiCandidateScoring:
    """When multiple skills match, the best-scored one wins."""
    def test_longer_keyword_wins(self) -> None:
        """'调研报告' (4 chars) ties '规划' + '报告' (2 + 2 chars); the name tie-break picks goal_driven."""
        router = IntentRouter()
        skills = _make_skills(
            ("plan_exec", ["规划", "报告"]),
            ("goal_driven", ["调研报告", "生成"]),
        )
        result = router._match_keywords({"content": "规划一个调研报告"}, skills)
        assert result is not None
        # plan_exec: "规划"(2) + "报告"(2) = 4; goal_driven: "调研报告"(4) = 4
        # Same score → alphabetical: goal_driven < plan_exec
        assert result.matched_skill == "goal_driven"
    def test_more_keywords_wins(self) -> None:
        """Skill matching 3 keywords beats skill matching 1 keyword."""
        router = IntentRouter()
        skills = _make_skills(
            ("skill_a", ["分析"]),
            ("skill_b", ["分析", "市场", "趋势"]),
        )
        result = router._match_keywords({"content": "分析市场趋势"}, skills)
        assert result is not None
        # skill_a: "分析"(2) = 2; skill_b: "分析"(2)+"市场"(2)+"趋势"(2) = 6
        assert result.matched_skill == "skill_b"
    def test_same_score_alphabetical(self) -> None:
        """When scores are equal, alphabetical name order breaks the tie."""
        router = IntentRouter()
        skills = _make_skills(
            ("zebra_skill", ["分析"]),
            ("alpha_skill", ["分析"]),
        )
        result = router._match_keywords({"content": "分析数据"}, skills)
        assert result is not None
        assert result.matched_skill == "alpha_skill"
 # ---------------------------------------------------------------------------
 # Tests: No match
 # ---------------------------------------------------------------------------
 class TestNoMatch:
    def test_no_keyword_match(self) -> None:
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["搜索"]), ("skill_b", ["查询"]))
        result = router._match_keywords({"content": "你好"}, skills)
        assert result is None
    def test_empty_keywords_list(self) -> None:
        """Skill with empty keywords list does not participate in matching."""
        router = IntentRouter()
        skills = [_make_skill("empty_kw", [])]
        result = router._match_keywords({"content": "anything"}, skills)
        assert result is None
 # ---------------------------------------------------------------------------
 # Tests: Case insensitivity
 # ---------------------------------------------------------------------------
 class TestCaseInsensitivity:
    def test_english_keyword_case_insensitive(self) -> None:
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["Search"]))
        result = router._match_keywords({"content": "please search for this"}, skills)
        assert result is not None
        assert result.matched_skill == "skill_a"
 # ---------------------------------------------------------------------------
 # Tests: Substring matching
 # ---------------------------------------------------------------------------
 class TestSubstringMatch:
    def test_chinese_substring_match(self) -> None:
        """Chinese keyword '报告' should match input containing '报告'."""
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["报告"]))
        result = router._match_keywords({"content": "生成一份报告"}, skills)
        assert result is not None
        assert result.matched_skill == "skill_a"
 # ---------------------------------------------------------------------------
 # Tests: Confidence calculation
 # ---------------------------------------------------------------------------
 class TestConfidenceCalculation:
    def test_one_keyword_confidence(self) -> None:
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["分析"]))
        result = router._match_keywords({"content": "分析数据"}, skills)
        assert result is not None
        assert abs(result.confidence - 0.6) < 1e-9  # 0.5 + 0.1 * 1, compared with a float tolerance
    def test_three_keywords_confidence(self) -> None:
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["分析", "市场", "趋势"]))
        result = router._match_keywords({"content": "分析市场趋势"}, skills)
        assert result is not None
        assert abs(result.confidence - 0.8) < 1e-9  # 0.5 + 0.1 * 3, compared with a float tolerance
    def test_confidence_capped_at_one(self) -> None:
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["a", "b", "c", "d", "e", "f"]))
        result = router._match_keywords({"content": "a b c d e f"}, skills)
        assert result is not None
        assert result.confidence == 1.0  # min(1.0, 0.5 + 0.1 * 6 = 1.1)
 # ---------------------------------------------------------------------------
 # Tests: Edge cases
 # ---------------------------------------------------------------------------
 class TestEdgeCases:
    def test_empty_input_text(self) -> None:
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["分析"]))
        result = router._match_keywords({"content": ""}, skills)
        assert result is None
    def test_nested_input_data(self) -> None:
        """_extract_string_values should handle nested dicts/lists."""
        router = IntentRouter()
        skills = _make_skills(("skill_a", ["分析"]))
        result = router._match_keywords(
            {"message": {"text": "分析数据", "meta": {"role": "user"}}},
            skills,
        )
        assert result is not None
        assert result.matched_skill == "skill_a"
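
Review note: the scoring rules these tests encode (score is the total length of matched keywords, ties go to the alphabetically first skill name, confidence is `min(1.0, 0.5 + 0.1 * n)` for `n` matched keywords) can be sketched as a standalone function. This is an illustration under those assumptions, not `IntentRouter._match_keywords` itself; the `match_keywords` name and the plain-dict skill representation are hypothetical.

```python
from __future__ import annotations


def match_keywords(text: str, skills: dict[str, list[str]]) -> tuple[str, float] | None:
    """Pick the best skill for `text`; return (skill_name, confidence) or None."""
    lowered = text.lower()
    candidates: list[tuple[int, str, int]] = []
    for name, keywords in skills.items():
        # Case-insensitive substring match; empty keywords never match.
        hits = [kw for kw in keywords if kw and kw.lower() in lowered]
        if hits:
            score = sum(len(kw) for kw in hits)
            # Negate the score so that min() prefers the highest score,
            # then falls back to the alphabetically first name on a tie.
            candidates.append((-score, name, len(hits)))
    if not candidates:
        return None
    _, best_name, n_hits = min(candidates)
    return best_name, min(1.0, 0.5 + 0.1 * n_hits)
```

Sorting on `(-score, name)` makes the tie-break deterministic regardless of the order in which skills were registered, which is the property `test_same_score_alphabetical` checks.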