feat: add LLM and GUI benchmark modes with real agent testing

2026-06-17 12:55:19 +08:00 · 2026-06-17 12:55:19 +08:00 · a1318df420
parent 1fbfd9d132
commit a1318df420
5 changed files with 1639 additions and 858 deletions
--- a/configs/skills/benchmark_runner.yaml
+++ b/configs/skills/benchmark_runner.yaml
@ -40,55 +40,92 @@ prompt:
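    Follows industry benchmark methodology (in the style of SWE-bench / AgentBench / ToolBench)
    and reports a full set of metrics: Accuracy / Precision / Recall / F1 / Latency / Consistency.
-    ## Available commands
+    ## Test modes (--mode)
    Three test modes are supported and can be combined:
    ### Mock mode (default; fast, no LLM dependency)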
    ```bash
    python3 -m agentkit.cli.main benchmark --mode mock --report --verbose
    ```
    Uses mock data throughout: 7 dimensions, 53 cases. Suited to fast CI/CD regression runs.
    ### LLM mode (uses a real LLM)
    ```bash
    python3 -m agentkit.cli.main benchmark --mode llm --report --verbose
    ```
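    Loads the real LLM configuration from agentkit.yaml and tests LLM reasoning:
    - Intent understanding: whether the LLM correctly identifies the user's intent
    - Tool selection: whether the LLM picks the correct tool
    - Multi-step reasoning: whether the LLM can decompose a complex task
    - Code generation: whether the LLM can generate executable code
    - Error recovery: whether the LLM can suggest a fix
    Requires a valid LLM API key configured in agentkit.yaml.
    ### GUI mode (starts a real server for end-to-end testing)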
    ```bash
    python3 -m agentkit.cli.main benchmark --mode gui --report --verbose
    ```
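    Automatically starts the agentkit gui server and tests:
    - Service startup: whether agentkit gui --port XXXX starts successfully
    - API availability: /api/v1/health, /api/v1/skills, /api/v1/chat
    - WebSocket connection: ws://localhost:XXXX/api/v1/ws
    - Frontend assets: whether HTML/JS/CSS are reachable
    The server is shut down automatically once the tests finish.
    ### All modes (Mock + LLM + GUI)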
    ```bash
    python3 -m agentkit.cli.main benchmark --mode all --report --verbose
    ```
    Runs all 63 test cases across 9 dimensions; the most comprehensive evaluation.
    ### Full benchmark run (recommended)
    ```bash
-    python3 -m agentkit.cli.main benchmark --report --verbose
+    python3 -m agentkit.cli.main benchmark --mode all --report --verbose
    ```
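-    Runs all 53 standardized test cases across 7 dimensions and generates JSON + Markdown reports.
+    Runs all 63 test cases across 9 dimensions (7 Mock + 1 LLM + 1 GUI).
    By default runs 3 times and reports mean ± standard deviation, with a 95% Wilson confidence interval.
    ### Quick run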
    ```bash
-    python3 -m agentkit.cli.main benchmark --fast --report
+    python3 -m agentkit.cli.main benchmark --mode mock --fast --report
    ```
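-    Runs the core cases (about 22); suited to quick checks during development.
+    Runs the Mock-mode core cases (about 22); suited to quick checks during development.
    ### Single-dimension run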
    ```bash
    python3 -m agentkit.cli.main benchmark --dimension <dim> --verbose
    ```
-    Available dimensions: preprocessing, overfitting, efficiency, tool_search, event_model, spec_management, verification
+    Available dimensions: preprocessing, overfitting, efficiency, tool_search, event_model,
    spec_management, verification, llm_reasoning, gui_integration
    ### Multiple runs, averaged (--runs)
    ```bash
-    python3 -m agentkit.cli.main benchmark --runs 5 --report
+    python3 -m agentkit.cli.main benchmark --mode all --runs 3 --report
    ```
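    Sets the number of runs (default 3) and computes accuracy_mean ± accuracy_std plus a 95% confidence interval.
    Useful for stability assessment and regression detection.
    ### Baseline comparison (--baseline)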
    ```bash
-    python3 -m agentkit.cli.main benchmark --baseline --report
+    python3 -m agentkit.cli.main benchmark --mode all --baseline --report
    ```
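    The first run creates the baseline (baseline.json) automatically; later runs are compared against it and show ↑/↓ trends.
    Useful for CI/CD regression monitoring.
    ### Markdown report (default)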
    ```bash
-    python3 -m agentkit.cli.main benchmark --report --format markdown
+    python3 -m agentkit.cli.main benchmark --mode all --report --format markdown
    ```
    Generates a human-readable Markdown report with metric tables, failed-case analysis, and improvement suggestions.
    ### HTML report
    ```bash
-    python3 -m agentkit.cli.main benchmark --report --format html
+    python3 -m agentkit.cli.main benchmark --mode all --report --format html
    ```
    ### JSON report
    ```bash
-    python3 -m agentkit.cli.main benchmark --report --format json
+    python3 -m agentkit.cli.main benchmark --mode all --report --format json
    ```
    Generates only a JSON report; suited to machine parsing and CI integration.
@ -96,11 +133,11 @@ prompt:
    ```bash
    python3 -m pytest tests/e2e/test_capability_comprehensive.py -v -m e2e_capability
    ```
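-    Runs 64 tests (10 dimensions, including standard Benchmark framework integration tests) and generates comprehensive_report.
+    Runs 64 tests (including standard Benchmark framework integration tests) and generates comprehensive_report.
    ### Specify the output directory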
    ```bash
-    python3 -m agentkit.cli.main benchmark --report -o ./my-results
+    python3 -m agentkit.cli.main benchmark --mode all --report -o ./my-results
    ```
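    ## Test dimensions
@ -113,14 +150,24 @@ prompt:
    - **Consistency**: consistency (overfitting detection; stability under paraphrased inputs)
    - **95% CI**: Wilson confidence interval (for multiple runs)
-    Dimension list:
+    Dimension list (9 dimensions, grouped by mode):
-    1. **preprocessing**: preprocessing accuracy (greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT)
+
-    2. **overfitting**: overfitting detection (consistency across different phrasings of the same intent; Consistency metric)
+    **Mock mode (7 dimensions, 53 cases)**:
-    3. **efficiency**: execution efficiency (preprocessing latency < 50ms, tool search latency < 10ms; Latency metric)
+    1. **preprocessing** [Mock]: preprocessing accuracy (greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT)
-    4. **tool_search**: tool search accuracy (BM25 relevance ranking; P/R/F1 metrics)
+    2. **overfitting** [Mock]: overfitting detection (consistency across different phrasings of the same intent)
-    5. **event_model**: event model integrity (SQ/EQ dual-queue lifecycle)
+    3. **efficiency** [Mock]: execution efficiency (preprocessing latency < 50ms, tool search latency < 10ms)
-    6. **spec_management**: spec management (CRUD operations)
+    4. **tool_search** [Mock]: tool search accuracy (BM25 relevance ranking)
-    7. **verification**: verification loop (verify/retry behavior)
+    5. **event_model** [Mock]: event model integrity (SQ/EQ dual-queue lifecycle)
    6. **spec_management** [Mock]: spec management (CRUD operations)
    7. **verification** [Mock]: verification loop (verify/retry behavior)
    **LLM mode (1 dimension, 5 cases)**:
    8. **llm_reasoning** [LLM]: LLM reasoning (intent understanding / tool selection / multi-step reasoning / code generation / error recovery)
       Uses real LLM calls and records token usage and response latency.
    **GUI mode (1 dimension, 5 cases)**:
    9. **gui_integration** [GUI]: GUI integration tests (service startup / API availability / WebSocket / frontend assets)
       Starts the agentkit gui server automatically and cleans up after the tests finish.
    ## Report locations
    - CLI reports: `test-results/benchmark/benchmark_report.{json,md,html}`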
@ -131,10 +178,15 @@ prompt:
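    1. Run the test command
    2. Read the generated report files (JSON + Markdown)
    3. Show the user a summary table of results with Accuracy / P / R / F1 / Latency for each dimension
-    4. If any cases failed, analyze the root cause (wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded)
+    4. Label the mode used by each dimension ([Mock] / [LLM] / [GUI])
-    5. Compare against the baseline report (if --baseline is used) and show the ↑/↓ accuracy trend for each dimension
+    5. If any cases failed, analyze the root cause (wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded / gui_failure)
-    6. Watch the key metrics: flag a performance issue when P95 latency > 100ms, and an overfitting risk when Consistency < 100%
+    6. Compare against the baseline report (if --baseline is used) and show the ↑/↓ accuracy trend for each dimension
-    7. Give targeted improvement suggestions based on metric data rather than subjective judgment
+    7. Watch the key metrics:
       - P95 latency > 100ms: flag a performance issue
       - Consistency < 100%: flag an overfitting risk
       - A timeout in the LLM dimension: flag a slow model response or a timeout threshold that needs adjusting
       - A failure in the GUI dimension: flag a server configuration or port problem
    8. Give targeted improvement suggestions based on metric data rather than subjective judgment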
 llm:
  model: "default"
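
Review note: the result-presentation rules in the prompt (baseline trend, P95 and Consistency checks) can be sketched as a small helper. This is a hypothetical illustration, not code taken from `benchmark.py`: the names `trend` and `review_dimension` and their parameters are assumptions. Only the thresholds (P95 > 100ms, Consistency < 100%) and the ↑/↓ markers come from the prompt text.

```python
from typing import List, Optional, Tuple

NO_CHANGE = "\u2014"  # the dash the report prints when accuracy is unchanged


def trend(baseline: Optional[float], current: float) -> str:
    """Return the marker for an accuracy change against the baseline."""
    if baseline is None or abs(current - baseline) < 1e-9:
        return NO_CHANGE
    return "↑" if current > baseline else "↓"


def review_dimension(
    name: str,
    p95_ms: float,
    consistency: float,
    accuracy: float,
    baseline_accuracy: Optional[float] = None,
) -> Tuple[str, List[str]]:
    """Apply the prompt's key-metric rules to one dimension (hypothetical helper)."""
    notes: List[str] = []
    if p95_ms > 100.0:  # threshold stated in the prompt
        notes.append(f"{name}: P95 latency {p95_ms:.2f}ms exceeds 100ms (performance issue)")
    if consistency < 1.0:  # anything below 100% is flagged
        notes.append(f"{name}: consistency {consistency:.1%} is below 100% (overfitting risk)")
    return trend(baseline_accuracy, accuracy), notes


if __name__ == "__main__":
    # Values copied from the verification dimension in this commit's report.
    print(review_dimension("verification", 411.57, 1.0, 1.0, 1.0))
```

Keeping the thresholds in one function makes it easier to keep the prompt and the report's suggestions section consistent if either threshold changes.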
--- a/src/agentkit/cli/benchmark.py
+++ b/src/agentkit/cli/benchmark.py
--- a/test-results/benchmark/benchmark_report.json
+++ b/test-results/benchmark/benchmark_report.json
--- a/test-results/benchmark/benchmark_report.md
+++ b/test-results/benchmark/benchmark_report.md
@ -1,10 +1,11 @@
 # AgentKit Capability Benchmark Report
 ## Test summary
- Time: 2026-06-17T04:00:50.738066+00:00
+- Time: 2026-06-17T04:52:53.863927+00:00
 - Version: 0.1.0
- Runs: 3
+- Mode: all
- Overall accuracy: 100.0% ± 0.0%
+- Runs: 1
 - Overall accuracy: 95.2% ± 0.0%
 ## Comparison with industry benchmarks
@ -16,7 +17,7 @@
 ## Dimension results
-### 1. Preprocessing Accuracy
+### 1. Preprocessing Accuracy [Mock]
 | Metric | Value |
 |---|---|
@ -26,8 +27,8 @@
 | Recall | 100.0% |
 | F1 | 100.0% |
 | Latency p50 | 0.01ms |
-| Latency p95 | 0.03ms |
+| Latency p95 | 0.06ms |
-| Latency p99 | 0.06ms |
+| Latency p99 | 0.11ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 15 / 15 / 0 |
@ -48,7 +49,7 @@
 | medium | 7 | 7 | 100.0% |
 | hard | 3 | 3 | 100.0% |
-### 2. Overfitting Detection
+### 2. Overfitting Detection [Mock]
 | Metric | Value |
 |---|---|
@ -57,9 +58,9 @@
 | Precision | 100.0% |
 | Recall | 100.0% |
 | F1 | 100.0% |
-| Latency p50 | 0.04ms |
+| Latency p50 | 0.03ms |
 | Latency p95 | 0.06ms |
-| Latency p99 | 0.07ms |
+| Latency p99 | 0.06ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 5 / 0 |
@ -81,7 +82,7 @@
 | easy | 1 | 1 | 100.0% |
 | hard | 1 | 1 | 100.0% |
-### 3. Efficiency
+### 3. Efficiency [Mock]
 | Metric | Value |
 |---|---|
@ -90,9 +91,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 0.40ms |
+| Latency p50 | 0.33ms |
-| Latency p95 | 0.77ms |
+| Latency p95 | 0.62ms |
-| Latency p99 | 0.82ms |
+| Latency p99 | 0.66ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 5 / 0 |
@ -110,7 +111,7 @@
 | easy | 2 | 2 | 100.0% |
 | medium | 3 | 3 | 100.0% |
-### 4. Tool Search
+### 4. Tool Search [Mock]
 | Metric | Value |
 |---|---|
@ -119,9 +120,9 @@
 | Precision | 83.3% |
 | Recall | 83.3% |
 | F1 | 83.3% |
-| Latency p50 | 0.01ms |
+| Latency p50 | 0.02ms |
-| Latency p95 | 0.02ms |
+| Latency p95 | 0.03ms |
-| Latency p99 | 0.02ms |
+| Latency p99 | 0.03ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 10 / 10 / 0 |
@ -141,7 +142,7 @@
 | easy | 7 | 7 | 100.0% |
 | medium | 3 | 3 | 100.0% |
-### 5. Event Model
+### 5. Event Model [Mock]
 | Metric | Value |
 |---|---|
@ -150,9 +151,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 0.04ms |
+| Latency p50 | 0.06ms |
-| Latency p95 | 15.68ms |
+| Latency p95 | 16.00ms |
-| Latency p99 | 19.84ms |
+| Latency p99 | 20.24ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 6 / 6 / 0 |
@ -169,7 +170,7 @@
 |---|---|---|---|
 | easy | 6 | 6 | 100.0% |
-### 6. Spec Management
+### 6. Spec Management [Mock]
 | Metric | Value |
 |---|---|
@ -178,9 +179,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 1.41ms |
+| Latency p50 | 1.38ms |
-| Latency p95 | 3.60ms |
+| Latency p95 | 3.46ms |
-| Latency p99 | 4.04ms |
+| Latency p99 | 4.01ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 7 / 7 / 0 |
@ -198,7 +199,7 @@
 | easy | 6 | 6 | 100.0% |
 | medium | 1 | 1 | 100.0% |
-### 7. Verification Loop
+### 7. Verification Loop [Mock]
 | Metric | Value |
 |---|---|
@ -207,9 +208,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 25.44ms |
+| Latency p50 | 22.00ms |
-| Latency p95 | 413.42ms |
+| Latency p95 | 411.57ms |
-| Latency p99 | 488.32ms |
+| Latency p99 | 487.06ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 5 / 0 |
@ -229,6 +230,84 @@
 | easy | 2 | 2 | 100.0% |
 | medium | 3 | 3 | 100.0% |
 ### 8. LLM Reasoning [LLM]
 | Metric | Value |
 |---|---|
 | Accuracy | 60.0% ± 0.0% |
 | 95% CI | [23.1%, 88.2%] |
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
 | Latency p50 | 25149.49ms |
 | Latency p95 | 30001.17ms |
 | Latency p99 | 30001.23ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 3 / 2 |
 #### By category
 | Category | Cases | Passed | Accuracy |
 |---|---|---|---|
 | intent_understanding | 1 | 1 | 100.0% |
 | tool_selection | 1 | 1 | 100.0% |
 | multi_step | 1 | 0 | 0.0% |
 | code_generation | 1 | 1 | 100.0% |
 | error_recovery | 1 | 0 | 0.0% |
 #### By difficulty
 | Difficulty | Cases | Passed | Accuracy |
 |---|---|---|---|
 | easy | 1 | 1 | 100.0% |
 | medium | 2 | 2 | 100.0% |
 | hard | 2 | 0 | 0.0% |
 #### Failed-case analysis
 | Case ID | Category | Difficulty | Expected | Actual | Root cause |
 |---|---|---|---|---|---|
 | llm-003 | multi_step | hard | react | timeout | timeout |
 | llm-005 | error_recovery | hard | react | timeout | timeout |
 ### 9. GUI Integration [GUI]
 | Metric | Value |
 |---|---|
 | Accuracy | 80.0% ± 0.0% |
 | 95% CI | [37.5%, 96.4%] |
 | Precision | 80.0% |
 | Recall | 80.0% |
 | F1 | 80.0% |
 | Latency p50 | 0.00ms |
 | Latency p95 | 0.00ms |
 | Latency p99 | 0.00ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 4 / 1 |
 #### By category
 | Category | Cases | Passed | Accuracy |
 |---|---|---|---|
 | service_startup | 1 | 1 | 100.0% |
 | api_availability | 2 | 2 | 100.0% |
 | websocket | 1 | 0 | 0.0% |
 | frontend | 1 | 1 | 100.0% |
 #### By difficulty
 | Difficulty | Cases | Passed | Accuracy |
 |---|---|---|---|
 | easy | 2 | 2 | 100.0% |
 | medium | 2 | 2 | 100.0% |
 | hard | 1 | 0 | 0.0% |
 #### Failed-case analysis
 | Case ID | Category | Difficulty | Expected | Actual | Root cause |
 |---|---|---|---|---|---|
 | gui-004 | websocket | hard | connected | failed | gui_failure |
 ## Baseline comparison
 | Dimension | Baseline accuracy | Current accuracy | Change |
@ -240,7 +319,12 @@
 | event_model | 100.0% | 100.0% | — |
 | spec_management | 100.0% | 100.0% | — |
 | verification | 100.0% | 100.0% | — |
 | llm_reasoning | 0.0% | 60.0% | ↑ |
 | gui_integration | 0.0% | 80.0% | ↑ |
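 ## Issues and improvement suggestions
- **verification**: P95 latency of 413.42ms is high; consider optimizing performance
+- **verification**: P95 latency of 411.57ms is high; consider optimizing performance
 - **llm_reasoning**: accuracy of 60.0% is below 90%; review the failed cases and optimize
 - **llm_reasoning**: P95 latency of 30001.17ms is high; consider optimizing performance
 - **gui_integration**: accuracy of 80.0% is below 90%; review the failed cases and optimize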
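
Review note: the 95% CI rows for the two new dimensions can be checked with the standard Wilson score interval. The sketch below is an independent check, not the implementation in `benchmark.py`; the function name `wilson_interval` and the choice of z = 1.96 are assumptions. Under those assumptions it gives roughly 23.1% to 88.2% for 3/5 passes and roughly 37.55% to 96.4% for 4/5 passes, which agrees with the report up to rounding of the last digit.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate of passed/total (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Pass counts copied from the report: llm_reasoning 3/5, gui_integration 4/5.
    for label, passed, total in [("llm_reasoning", 3, 5), ("gui_integration", 4, 5)]:
        low, high = wilson_interval(passed, total)
        print(f"{label}: [{low:.4f}, {high:.4f}]")
```

With only 5 cases per dimension the intervals span more than 50 percentage points, so the 60.0% and 80.0% accuracies are weak evidence on their own.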
--- a/test-results/benchmark/benchmark_report.txt
+++ b/test-results/benchmark/benchmark_report.txt
@ -1,7 +1,7 @@
 ======================================================================
 AgentKit Benchmark Report
 ======================================================================
-Timestamp:      2026-06-17T03:26:25.072956+00:00
+Timestamp:      2026-06-17T03:31:00.118497+00:00
 Version:        0.1.0
 Overall Score:  98.0%
 Summary:        50/51 tests passed (1 failed) across 7 dimensions.
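
Review note: the overall figures in both reports are consistent with a pooled pass rate (total passed divided by total cases). That reading is an assumption; the diff does not show how `benchmark.py` aggregates. The sketch below uses the per-dimension Total / Pass counts from `benchmark_report.md`, and `pooled_accuracy` is a hypothetical helper name.

```python
from typing import Dict, Tuple

# (total, passed) per dimension, copied from benchmark_report.md in this commit.
RESULTS: Dict[str, Tuple[int, int]] = {
    "preprocessing": (15, 15),
    "overfitting": (5, 5),
    "efficiency": (5, 5),
    "tool_search": (10, 10),
    "event_model": (6, 6),
    "spec_management": (7, 7),
    "verification": (5, 5),
    "llm_reasoning": (5, 3),
    "gui_integration": (5, 4),
}


def pooled_accuracy(results: Dict[str, Tuple[int, int]]) -> Tuple[int, int, float]:
    """Return (passed, total, passed/total) summed over all dimensions."""
    total = sum(t for t, _ in results.values())
    passed = sum(p for _, p in results.values())
    return passed, total, passed / total


if __name__ == "__main__":
    passed, total, acc = pooled_accuracy(RESULTS)
    print(f"{passed}/{total} = {acc:.1%}")  # 60/63 = 95.2%
    print(f"50/51 = {50 / 51:.1%}")         # 50/51 = 98.0%
```

The same arithmetic reproduces the 98.0% score for the 50/51 result in `benchmark_report.txt`.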