diff --git a/configs/skills/benchmark_runner.yaml b/configs/skills/benchmark_runner.yaml index 159ccbf..cff4e92 100644 --- a/configs/skills/benchmark_runner.yaml +++ b/configs/skills/benchmark_runner.yaml @@ -40,55 +40,92 @@ prompt: 采用行业 Benchmark 方法论(SWE-bench / AgentBench / ToolBench 风格), 提供 Accuracy / Precision / Recall / F1 / Latency / Consistency 等完整指标。 - ## 可用命令 + ## 测试模式(--mode) + 支持三种测试模式,可组合使用: + + ### Mock 模式(默认,快速、无 LLM 依赖) + ```bash + python3 -m agentkit.cli.main benchmark --mode mock --report --verbose + ``` + 全部使用 Mock 数据,7 个维度 53 个用例,适合 CI/CD 快速回归。 + + ### LLM 模式(使用真实 LLM) + ```bash + python3 -m agentkit.cli.main benchmark --mode llm --report --verbose + ``` + 从 agentkit.yaml 加载真实 LLM 配置,测试 LLM 推理能力: + - 意图理解:LLM 是否正确识别用户意图 + - 工具选择:LLM 是否选择正确工具 + - 多步推理:LLM 是否能分解复杂任务 + - 代码生成:LLM 是否能生成可执行代码 + - 错误恢复:LLM 是否能给出修复建议 + 需要 agentkit.yaml 中配置了有效的 LLM API key。 + + ### GUI 模式(启动真实服务器测试端到端) + ```bash + python3 -m agentkit.cli.main benchmark --mode gui --report --verbose + ``` + 自动启动 agentkit gui 服务器,测试: + - 服务启动:agentkit gui --port XXXX 能否成功启动 + - API 可用性:/api/v1/health, /api/v1/skills, /api/v1/chat + - WebSocket 连接:ws://localhost:XXXX/api/v1/ws + - 前端资源:HTML/JS/CSS 是否可访问 + 测试完成后自动关闭服务器。 + + ### 全部模式(Mock + LLM + GUI) + ```bash + python3 -m agentkit.cli.main benchmark --mode all --report --verbose + ``` + 运行所有 9 个维度共 63 个测试用例,最全面的评估。 ### 完整回测(推荐) ```bash - python3 -m agentkit.cli.main benchmark --report --verbose + python3 -m agentkit.cli.main benchmark --mode all --report --verbose ``` - 运行所有 7 个维度共 53 个标准化测试用例,生成 JSON + Markdown 报告。 + 运行所有 9 个维度(7 Mock + 1 LLM + 1 GUI)共 63 个测试用例。 默认运行 3 次取均值 ± 标准差,附带 95% Wilson 置信区间。 ### 快速回测 ```bash - python3 -m agentkit.cli.main benchmark --fast --report + python3 -m agentkit.cli.main benchmark --mode mock --fast --report ``` - 运行核心用例(约 22 个),适合开发时快速验证。 + 运行 Mock 模式核心用例(约 22 个),适合开发时快速验证。 ### 单维度回测 ```bash python3 -m agentkit.cli.main benchmark --dimension --verbose ``` - 可选维度:preprocessing, overfitting, efficiency, 
tool_search, event_model, spec_management, verification + 可选维度:preprocessing, overfitting, efficiency, tool_search, event_model, + spec_management, verification, llm_reasoning, gui_integration ### 多次运行取均值(--runs) ```bash - python3 -m agentkit.cli.main benchmark --runs 5 --report + python3 -m agentkit.cli.main benchmark --mode all --runs 3 --report ``` 指定运行次数(默认 3),计算 accuracy_mean ± accuracy_std 和 95% 置信区间。 适用于稳定性评估和回归检测。 ### 基线对比(--baseline) ```bash - python3 -m agentkit.cli.main benchmark --baseline --report + python3 -m agentkit.cli.main benchmark --mode all --baseline --report ``` 首次运行自动创建基线(baseline.json),后续运行与基线对比,显示 ↑/↓ 变化趋势。 适用于 CI/CD 回归监控。 ### Markdown 报告(默认) ```bash - python3 -m agentkit.cli.main benchmark --report --format markdown + python3 -m agentkit.cli.main benchmark --mode all --report --format markdown ``` 生成人类可读的 Markdown 报告,包含指标表格、失败用例分析、改进建议。 ### HTML 报告 ```bash - python3 -m agentkit.cli.main benchmark --report --format html + python3 -m agentkit.cli.main benchmark --mode all --report --format html ``` ### JSON 报告 ```bash - python3 -m agentkit.cli.main benchmark --report --format json + python3 -m agentkit.cli.main benchmark --mode all --report --format json ``` 仅生成 JSON 报告,适合机器解析和 CI 集成。 @@ -96,11 +133,11 @@ prompt: ```bash python3 -m pytest tests/e2e/test_capability_comprehensive.py -v -m e2e_capability ``` - 运行 64 个测试(10 维度,含标准 Benchmark 框架集成测试),生成 comprehensive_report。 + 运行 64 个测试(含标准 Benchmark 框架集成测试),生成 comprehensive_report。 ### 指定输出目录 ```bash - python3 -m agentkit.cli.main benchmark --report -o ./my-results + python3 -m agentkit.cli.main benchmark --mode all --report -o ./my-results ``` ## 测试维度说明 @@ -113,14 +150,24 @@ prompt: - **Consistency** — 一致性(过拟合检测,改写输入的稳定性) - **95% CI** — Wilson 置信区间(多次运行时) - 维度清单: - 1. **preprocessing** — 预处理准确度:greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT - 2. **overfitting** — 过拟合检测:同一意图不同表达的一致性(Consistency 指标) - 3. **efficiency** — 执行效率:预处理延迟 < 50ms, 工具搜索延迟 < 10ms(Latency 指标) - 4. 
**tool_search** — 工具搜索准确度:BM25 相关性排序(P/R/F1 指标) - 5. **event_model** — 事件模型完整性:SQ/EQ 双队列生命周期 - 6. **spec_management** — Spec 管理:CRUD 操作 - 7. **verification** — 验证循环:verify/retry 行为 + 维度清单(9 个维度,按模式分组): + + **Mock 模式(7 维度,53 用例)**: + 1. **preprocessing** [Mock] — 预处理准确度:greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT + 2. **overfitting** [Mock] — 过拟合检测:同一意图不同表达的一致性 + 3. **efficiency** [Mock] — 执行效率:预处理延迟 < 50ms, 工具搜索延迟 < 10ms + 4. **tool_search** [Mock] — 工具搜索准确度:BM25 相关性排序 + 5. **event_model** [Mock] — 事件模型完整性:SQ/EQ 双队列生命周期 + 6. **spec_management** [Mock] — Spec 管理:CRUD 操作 + 7. **verification** [Mock] — 验证循环:verify/retry 行为 + + **LLM 模式(1 维度,5 用例)**: + 8. **llm_reasoning** [LLM] — LLM 推理能力:意图理解/工具选择/多步推理/代码生成/错误恢复 + 使用真实 LLM 调用,记录 Token 使用量和响应延迟。 + + **GUI 模式(1 维度,5 用例)**: + 9. **gui_integration** [GUI] — GUI 集成测试:服务启动/API 可用性/WebSocket/前端资源 + 自动启动 agentkit gui 服务器,测试完成后自动清理。 ## 报告位置 - CLI 报告:`test-results/benchmark/benchmark_report.{json,md,html}` @@ -131,10 +178,15 @@ prompt: 1. 运行测试命令 2. 读取生成的报告文件(JSON + Markdown) 3. 向用户展示结果摘要表格,包含各维度的 Accuracy / P / R / F1 / Latency - 4. 如有失败用例,分析根因(wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded) - 5. 对比基线报告(如使用 --baseline),展示各维度准确率的 ↑/↓ 变化趋势 - 6. 关注关键指标:P95 延迟 > 100ms 需提示性能问题,Consistency < 100% 需提示过拟合风险 - 7. 给出针对性改进建议,基于指标数据而非主观判断 + 4. 标注每个维度使用的模式([Mock] / [LLM] / [GUI]) + 5. 如有失败用例,分析根因(wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded / gui_failure) + 6. 对比基线报告(如使用 --baseline),展示各维度准确率的 ↑/↓ 变化趋势 + 7. 关注关键指标: + - P95 延迟 > 100ms 需提示性能问题 + - Consistency < 100% 需提示过拟合风险 + - LLM 维度 timeout 需提示模型响应慢或超时阈值需调整 + - GUI 维度失败需提示服务器配置或端口问题 + 8. 
给出针对性改进建议,基于指标数据而非主观判断 llm: model: "default" diff --git a/src/agentkit/cli/benchmark.py b/src/agentkit/cli/benchmark.py index b52e257..f56a2ca 100644 --- a/src/agentkit/cli/benchmark.py +++ b/src/agentkit/cli/benchmark.py @@ -8,26 +8,36 @@ Implements industry-standard benchmark methodology (SWE-bench / AgentBench / Too - Markdown + JSON + HTML report generation - Baseline comparison (↑/↓) -Tests core AgentKit components directly (no pytest subprocess, no real LLM): -- preprocessing: RequestPreprocessor routing accuracy -- overfitting: routing consistency across paraphrases -- efficiency: component execution timing -- tool_search: ToolSearchIndex BM25 relevance -- event_model: SubmissionQueue / EventQueue lifecycle -- spec_management: SpecManager CRUD operations -- verification: VerificationLoop execute/retry behavior +Execution modes via --mode (three modes, plus `all` to run them together): +- mock: use mocks throughout (default; fast, no LLM dependency) +- llm: use a real LLM (requires agentkit.yaml configuration) +- gui: start a real GUI server and test end to end +- all: run every mode (Mock + LLM + GUI) + +Tests core AgentKit components: +- preprocessing: RequestPreprocessor routing accuracy [Mock] +- overfitting: routing consistency across paraphrases [Mock] +- efficiency: component execution timing [Mock] +- tool_search: ToolSearchIndex BM25 relevance [Mock] +- event_model: SubmissionQueue / EventQueue lifecycle [Mock] +- spec_management: SpecManager CRUD operations [Mock] +- verification: VerificationLoop execute/retry behavior [Mock] +- llm_reasoning: Real LLM intent/tool/multi-step/code/error [LLM] +- gui_integration: agentkit gui end-to-end (API/WS/frontend) [GUI] Usage: - agentkit benchmark # run all dimensions + agentkit benchmark # run all mock dimensions + agentkit benchmark --mode mock # explicit mock mode (default) + agentkit benchmark --mode llm --report # LLM mode with report + agentkit benchmark --mode gui --report # GUI mode with report + agentkit benchmark --mode all --report # all modes agentkit benchmark -d preprocessing # single dimension - agentkit benchmark --report 
# generate reports agentkit benchmark --fast # core cases only agentkit benchmark --verbose # detailed output agentkit benchmark --format html # HTML format agentkit benchmark -o ./results # output directory agentkit benchmark --runs 3 # multiple runs (default 3) agentkit benchmark --baseline # compare with baseline - agentkit benchmark --format markdown # Markdown report (default) """ from __future__ import annotations @@ -75,9 +85,38 @@ class BenchmarkDimension(str, Enum): EVENT_MODEL = "event_model" SPEC_MANAGEMENT = "spec_management" VERIFICATION = "verification" + LLM_REASONING = "llm_reasoning" + GUI_INTEGRATION = "gui_integration" ALL = "all" +class BenchmarkMode(str, Enum): + """Benchmark execution mode. + + MOCK: use mocks throughout (fast, no LLM dependency) + LLM: use a real LLM (requires agentkit.yaml) + GUI: start a real GUI server and test against it + ALL: run every mode (Mock + LLM + GUI) + """ + + MOCK = "mock" + LLM = "llm" + GUI = "gui" + ALL = "all" + + +# Mock dimensions (no LLM dependency) +_MOCK_DIMENSIONS: list[BenchmarkDimension] = [ + BenchmarkDimension.PREPROCESSING, + BenchmarkDimension.OVERFITTING, + BenchmarkDimension.EFFICIENCY, + BenchmarkDimension.TOOL_SEARCH, + BenchmarkDimension.EVENT_MODEL, + BenchmarkDimension.SPEC_MANAGEMENT, + BenchmarkDimension.VERIFICATION, +] + + # --------------------------------------------------------------------------- # Data structures # --------------------------------------------------------------------------- @@ -108,6 +147,7 @@ class BenchmarkTask: tags: list[str] description: str paraphrases: list[str] = field(default_factory=list) + expected_keywords: list[str] = field(default_factory=list) @dataclass @@ -188,576 +228,195 @@ class BenchmarkContext: # --------------------------------------------------------------------------- +# fmt: off TASK_SET: list[BenchmarkTask] = [ # === Preprocessing (15 tasks) === - BenchmarkTask( - "prep-001", - "preprocessing", - "greeting", - "easy", - "你好", - "direct_chat", - ["regex", "chinese"], - "中文问候应路由到 DIRECT_CHAT", - ), - 
BenchmarkTask( - "prep-002", - "preprocessing", - "greeting", - "easy", - "hello", - "direct_chat", - ["regex", "english"], - "英文问候应路由到 DIRECT_CHAT", - ), - BenchmarkTask( - "prep-003", - "preprocessing", - "greeting", - "easy", - "谢谢", - "direct_chat", - ["regex", "chitchat"], - "感谢语应路由到 DIRECT_CHAT", - ), - BenchmarkTask( - "prep-004", - "preprocessing", - "greeting", - "easy", - "你是谁", - "direct_chat", - ["regex", "identity"], - "身份询问应路由到 DIRECT_CHAT", - ), - BenchmarkTask( - "prep-005", - "preprocessing", - "tool_query", - "medium", - "搜索golang教程", - "react", - ["search", "default"], - "搜索类请求应路由到 REACT", - ), - BenchmarkTask( - "prep-006", - "preprocessing", - "tool_query", - "medium", - "执行ls命令", - "react", - ["shell", "default"], - "Shell 执行类请求应路由到 REACT", - ), - BenchmarkTask( - "prep-007", - "preprocessing", - "tool_query", - "medium", - "翻译hello为中文", - "react", - ["translate", "default"], - "翻译类请求应路由到 REACT", - ), - BenchmarkTask( - "prep-008", - "preprocessing", - "tool_query", - "medium", - "什么是机器学习", - "react", - ["knowledge", "default"], - "知识查询类请求应路由到 REACT", - ), - BenchmarkTask( - "prep-009", - "preprocessing", - "tool_query", - "medium", - "帮我分析数据", - "react", - ["analysis", "default"], - "分析类请求应路由到 REACT", - ), - BenchmarkTask( - "prep-010", - "preprocessing", - "skill_prefix", - "medium", - "@skill:react_agent 查看ip", - "skill_react", - ["skill", "react"], - "有效 skill 前缀应路由到 SKILL_REACT", - ), - BenchmarkTask( - "prep-011", - "preprocessing", - "skill_prefix", - "medium", - "@skill:chat_only 你好", - "direct_chat", - ["skill", "direct"], - "direct 模式 skill 前缀应路由到 DIRECT_CHAT", - ), - BenchmarkTask( - "prep-012", - "preprocessing", - "skill_prefix", - "hard", - "@skill:nonexistent 做点什么", - "react", - ["skill", "fallback"], - "无效 skill 前缀应回退到 REACT", - ), - BenchmarkTask( - "prep-013", - "preprocessing", - "complex", - "hard", - "帮我分析这个数据并生成报告", - "react", - ["multi_step"], - "多步骤复杂任务应路由到 REACT", - ), - BenchmarkTask( - "prep-014", - "preprocessing", 
- "complex", - "easy", - "随便聊聊", - "react", - ["chitchat", "default"], - "非匹配闲聊应回退到 REACT", - ), - BenchmarkTask( - "prep-015", - "preprocessing", - "complex", - "hard", + BenchmarkTask("prep-001", "preprocessing", "greeting", "easy", "你好", + "direct_chat", ["regex", "chinese"], "中文问候应路由到 DIRECT_CHAT"), + BenchmarkTask("prep-002", "preprocessing", "greeting", "easy", "hello", + "direct_chat", ["regex", "english"], "英文问候应路由到 DIRECT_CHAT"), + BenchmarkTask("prep-003", "preprocessing", "greeting", "easy", "谢谢", + "direct_chat", ["regex", "chitchat"], "感谢语应路由到 DIRECT_CHAT"), + BenchmarkTask("prep-004", "preprocessing", "greeting", "easy", "你是谁", + "direct_chat", ["regex", "identity"], "身份询问应路由到 DIRECT_CHAT"), + BenchmarkTask("prep-005", "preprocessing", "tool_query", "medium", "搜索golang教程", + "react", ["search", "default"], "搜索类请求应路由到 REACT"), + BenchmarkTask("prep-006", "preprocessing", "tool_query", "medium", "执行ls命令", + "react", ["shell", "default"], "Shell 执行类请求应路由到 REACT"), + BenchmarkTask("prep-007", "preprocessing", "tool_query", "medium", "翻译hello为中文", + "react", ["translate", "default"], "翻译类请求应路由到 REACT"), + BenchmarkTask("prep-008", "preprocessing", "tool_query", "medium", "什么是机器学习", + "react", ["knowledge", "default"], "知识查询类请求应路由到 REACT"), + BenchmarkTask("prep-009", "preprocessing", "tool_query", "medium", "帮我分析数据", + "react", ["analysis", "default"], "分析类请求应路由到 REACT"), + BenchmarkTask("prep-010", "preprocessing", "skill_prefix", "medium", "@skill:react_agent 查看ip", + "skill_react", ["skill", "react"], "有效 skill 前缀应路由到 SKILL_REACT"), + BenchmarkTask("prep-011", "preprocessing", "skill_prefix", "medium", "@skill:chat_only 你好", + "direct_chat", ["skill", "direct"], "direct 模式 skill 前缀应路由到 DIRECT_CHAT"), + BenchmarkTask("prep-012", "preprocessing", "skill_prefix", "hard", "@skill:nonexistent 做点什么", + "react", ["skill", "fallback"], "无效 skill 前缀应回退到 REACT"), + BenchmarkTask("prep-013", "preprocessing", "complex", "hard", "帮我分析这个数据并生成报告", + "react", 
["multi_step"], "多步骤复杂任务应路由到 REACT"), + BenchmarkTask("prep-014", "preprocessing", "complex", "easy", "随便聊聊", + "react", ["chitchat", "default"], "非匹配闲聊应回退到 REACT"), + BenchmarkTask("prep-015", "preprocessing", "complex", "hard", "请帮我完成以下任务:1. 查询天气 2. 生成报告", - "react", - ["multi_step"], - "多步骤任务应路由到 REACT", - ), + "react", ["multi_step"], "多步骤任务应路由到 REACT"), # === Overfitting (5 groups) === - BenchmarkTask( - "over-001", - "overfitting", - "ip_check", - "medium", - "查下ip", - "react", - ["colloquial"], - "IP 查询改写一致性", - paraphrases=["查下ip", "查看当前ip", "获取ip地址", "看下ip", "帮我查一下ip"], - ), - BenchmarkTask( - "over-002", - "overfitting", - "search", - "medium", - "搜索golang教程", - "react", - ["search"], - "搜索改写一致性", - paraphrases=["搜索golang教程", "搜一下golang教程", "找下golang学习资料"], - ), - BenchmarkTask( - "over-003", - "overfitting", - "greeting", - "easy", - "你好", - "direct_chat", - ["greeting"], - "问候改写一致性", - paraphrases=["你好", "hello", "hi", "嗨", "哈喽"], - ), - BenchmarkTask( - "over-004", - "overfitting", - "tool_use", - "medium", - "执行ls命令", - "react", - ["shell"], - "工具使用改写一致性", - paraphrases=["执行ls命令", "运行ls", "跑一下ls"], - ), - BenchmarkTask( - "over-005", - "overfitting", - "complex", - "hard", - "帮我分析数据", - "react", - ["analysis"], - "复杂任务改写一致性", - paraphrases=["帮我分析数据", "分析一下数据", "看看这些数据"], - ), + BenchmarkTask("over-001", "overfitting", "ip_check", "medium", "查下ip", + "react", ["colloquial"], "IP 查询改写一致性", + paraphrases=["查下ip", "查看当前ip", "获取ip地址", "看下ip", "帮我查一下ip"]), + BenchmarkTask("over-002", "overfitting", "search", "medium", "搜索golang教程", + "react", ["search"], "搜索改写一致性", + paraphrases=["搜索golang教程", "搜一下golang教程", "找下golang学习资料"]), + BenchmarkTask("over-003", "overfitting", "greeting", "easy", "你好", + "direct_chat", ["greeting"], "问候改写一致性", + paraphrases=["你好", "hello", "hi", "嗨", "哈喽"]), + BenchmarkTask("over-004", "overfitting", "tool_use", "medium", "执行ls命令", + "react", ["shell"], "工具使用改写一致性", + paraphrases=["执行ls命令", "运行ls", "跑一下ls"]), + 
BenchmarkTask("over-005", "overfitting", "complex", "hard", "帮我分析数据", + "react", ["analysis"], "复杂任务改写一致性", + paraphrases=["帮我分析数据", "分析一下数据", "看看这些数据"]), # === Efficiency (5 tasks) === - BenchmarkTask( - "eff-001", - "efficiency", - "preprocess_latency", - "easy", - "你好", - "<=50ms", - ["greeting", "preprocess"], - "问候预处理延迟 < 50ms", - ), - BenchmarkTask( - "eff-002", - "efficiency", - "preprocess_latency", - "medium", - "查下ip", - "<=50ms", - ["react", "preprocess"], - "REACT 预处理延迟 < 50ms", - ), - BenchmarkTask( - "eff-003", - "efficiency", - "preprocess_latency", - "medium", - "@skill:react_agent test", - "<=50ms", - ["skill", "preprocess"], - "Skill 前缀预处理延迟 < 50ms", - ), - BenchmarkTask( - "eff-004", - "efficiency", - "tool_search_latency", - "medium", - "read file", - "<=10ms", - ["tool_search", "bm25"], - "工具搜索延迟 < 10ms", - ), - BenchmarkTask( - "eff-005", - "efficiency", - "tool_search_latency", - "easy", - "", - "<=5ms", - ["tool_search", "empty"], - "空查询工具搜索延迟 < 5ms", - ), + BenchmarkTask("eff-001", "efficiency", "preprocess_latency", "easy", "你好", + "<=50ms", ["greeting", "preprocess"], "问候预处理延迟 < 50ms"), + BenchmarkTask("eff-002", "efficiency", "preprocess_latency", "medium", "查下ip", + "<=50ms", ["react", "preprocess"], "REACT 预处理延迟 < 50ms"), + BenchmarkTask("eff-003", "efficiency", "preprocess_latency", "medium", "@skill:react_agent test", + "<=50ms", ["skill", "preprocess"], "Skill 前缀预处理延迟 < 50ms"), + BenchmarkTask("eff-004", "efficiency", "tool_search_latency", "medium", "read file", + "<=10ms", ["tool_search", "bm25"], "工具搜索延迟 < 10ms"), + BenchmarkTask("eff-005", "efficiency", "tool_search_latency", "easy", "", + "<=5ms", ["tool_search", "empty"], "空查询工具搜索延迟 < 5ms"), # === Tool Search (10 tasks) === - BenchmarkTask( - "ts-001", - "tool_search", - "exact_match", - "easy", - "read file", - "read_file", - ["bm25", "exact"], - "精确匹配 read_file", - ), - BenchmarkTask( - "ts-002", - "tool_search", - "exact_match", - "easy", - "write file content", - 
"write_file", - ["bm25", "exact"], - "精确匹配 write_file", - ), - BenchmarkTask( - "ts-003", - "tool_search", - "exact_match", - "easy", - "search web information", - "web_search", - ["bm25", "exact"], - "精确匹配 web_search", - ), - BenchmarkTask( - "ts-004", - "tool_search", - "exact_match", - "easy", - "execute shell command", - "shell_exec", - ["bm25", "exact"], - "精确匹配 shell_exec", - ), - BenchmarkTask( - "ts-005", - "tool_search", - "exact_match", - "easy", - "send http request url", - "http_request", - ["bm25", "exact"], - "精确匹配 http_request", - ), - BenchmarkTask( - "ts-006", - "tool_search", - "fuzzy_match", - "medium", - "io file", - "read_file", - ["bm25", "fuzzy", "tag"], - "标签模糊匹配 io file", - ), - BenchmarkTask( - "ts-007", - "tool_search", - "fuzzy_match", - "medium", - "search query engine", - "web_search", - ["bm25", "fuzzy", "multi"], - "多关键词模糊匹配", - ), - BenchmarkTask( - "ts-008", - "tool_search", - "no_match", - "easy", - "", - "__none__", - ["bm25", "empty"], - "空查询应返回空结果", - ), - BenchmarkTask( - "ts-009", - "tool_search", - "no_match", - "easy", - "zzzznonexistent", - "__none__", - ["bm25", "no_match"], - "无匹配查询应返回空结果", - ), - BenchmarkTask( - "ts-010", - "tool_search", - "top_k", - "medium", - "file", - "read_file", - ["bm25", "top_k"], - "top_k=1 限制返回数", - ), + BenchmarkTask("ts-001", "tool_search", "exact_match", "easy", "read file", + "read_file", ["bm25", "exact"], "精确匹配 read_file"), + BenchmarkTask("ts-002", "tool_search", "exact_match", "easy", "write file content", + "write_file", ["bm25", "exact"], "精确匹配 write_file"), + BenchmarkTask("ts-003", "tool_search", "exact_match", "easy", "search web information", + "web_search", ["bm25", "exact"], "精确匹配 web_search"), + BenchmarkTask("ts-004", "tool_search", "exact_match", "easy", "execute shell command", + "shell_exec", ["bm25", "exact"], "精确匹配 shell_exec"), + BenchmarkTask("ts-005", "tool_search", "exact_match", "easy", "send http request url", + "http_request", ["bm25", "exact"], "精确匹配 
http_request"), + BenchmarkTask("ts-006", "tool_search", "fuzzy_match", "medium", "io file", + "read_file", ["bm25", "fuzzy", "tag"], "标签模糊匹配 io file"), + BenchmarkTask("ts-007", "tool_search", "fuzzy_match", "medium", "search query engine", + "web_search", ["bm25", "fuzzy", "multi"], "多关键词模糊匹配"), + BenchmarkTask("ts-008", "tool_search", "no_match", "easy", "", + "__none__", ["bm25", "empty"], "空查询应返回空结果"), + BenchmarkTask("ts-009", "tool_search", "no_match", "easy", "zzzznonexistent", + "__none__", ["bm25", "no_match"], "无匹配查询应返回空结果"), + BenchmarkTask("ts-010", "tool_search", "top_k", "medium", "file", + "read_file", ["bm25", "top_k"], "top_k=1 限制返回数"), # === Event Model (6 tasks) === - BenchmarkTask( - "ev-001", - "event_model", - "sq_lifecycle", - "easy", - "submit+drain", - "passed", - ["sq", "submit"], - "SQ 提交并消费", - ), - BenchmarkTask( - "ev-002", - "event_model", - "sq_lifecycle", - "easy", - "cancel", - "passed", - ["sq", "cancel"], - "SQ 取消任务", - ), - BenchmarkTask( - "ev-003", - "event_model", - "sq_lifecycle", - "easy", - "close", - "passed", - ["sq", "close"], - "SQ 关闭后拒绝提交", - ), - BenchmarkTask( - "ev-004", - "event_model", - "eq_lifecycle", - "easy", - "emit+replay", - "passed", - ["eq", "replay"], - "EQ 发射并回放", - ), - BenchmarkTask( - "ev-005", - "event_model", - "eq_lifecycle", - "easy", - "close", - "passed", - ["eq", "close"], - "EQ 关闭哨兵退出", - ), - BenchmarkTask( - "ev-006", - "event_model", - "eq_lifecycle", - "easy", - "subscriber_count", - "passed", - ["eq", "count"], - "EQ 初始订阅者计数", - ), + BenchmarkTask("ev-001", "event_model", "sq_lifecycle", "easy", "submit+drain", + "passed", ["sq", "submit"], "SQ 提交并消费"), + BenchmarkTask("ev-002", "event_model", "sq_lifecycle", "easy", "cancel", + "passed", ["sq", "cancel"], "SQ 取消任务"), + BenchmarkTask("ev-003", "event_model", "sq_lifecycle", "easy", "close", + "passed", ["sq", "close"], "SQ 关闭后拒绝提交"), + BenchmarkTask("ev-004", "event_model", "eq_lifecycle", "easy", "emit+replay", + "passed", ["eq", 
"replay"], "EQ 发射并回放"), + BenchmarkTask("ev-005", "event_model", "eq_lifecycle", "easy", "close", + "passed", ["eq", "close"], "EQ 关闭哨兵退出"), + BenchmarkTask("ev-006", "event_model", "eq_lifecycle", "easy", "subscriber_count", + "passed", ["eq", "count"], "EQ 初始订阅者计数"), # === Spec Management (7 tasks) === - BenchmarkTask( - "sm-001", - "spec_management", - "crud", - "easy", - "create", - "passed", - ["create"], - "Spec 创建", - ), - BenchmarkTask( - "sm-002", - "spec_management", - "crud", - "easy", - "get", - "passed", - ["read"], - "Spec 读取", - ), - BenchmarkTask( - "sm-003", - "spec_management", - "crud", - "easy", - "update", - "passed", - ["update"], - "Spec 更新", - ), - BenchmarkTask( - "sm-004", - "spec_management", - "crud", - "easy", - "delete", - "passed", - ["delete"], - "Spec 删除", - ), - BenchmarkTask( - "sm-005", - "spec_management", - "crud", - "easy", - "list", - "passed", - ["list"], - "Spec 列表", - ), - BenchmarkTask( - "sm-006", - "spec_management", - "edge", - "medium", - "confirm", - "passed", - ["confirm"], - "Spec 确认", - ), - BenchmarkTask( - "sm-007", - "spec_management", - "edge", - "easy", - "missing", - "passed", - ["missing"], - "Spec 不存在返回 None", - ), + BenchmarkTask("sm-001", "spec_management", "crud", "easy", "create", + "passed", ["create"], "Spec 创建"), + BenchmarkTask("sm-002", "spec_management", "crud", "easy", "get", + "passed", ["read"], "Spec 读取"), + BenchmarkTask("sm-003", "spec_management", "crud", "easy", "update", + "passed", ["update"], "Spec 更新"), + BenchmarkTask("sm-004", "spec_management", "crud", "easy", "delete", + "passed", ["delete"], "Spec 删除"), + BenchmarkTask("sm-005", "spec_management", "crud", "easy", "list", + "passed", ["list"], "Spec 列表"), + BenchmarkTask("sm-006", "spec_management", "edge", "medium", "confirm", + "passed", ["confirm"], "Spec 确认"), + BenchmarkTask("sm-007", "spec_management", "edge", "easy", "missing", + "passed", ["missing"], "Spec 不存在返回 None"), # === Verification (5 tasks) === - BenchmarkTask( - 
"vf-001", - "verification", - "basic", - "easy", - "pass", - "passed", - ["pass"], - "验证通过命令", - ), - BenchmarkTask( - "vf-002", - "verification", - "basic", - "easy", - "fail", - "passed", - ["fail"], - "验证失败命令", - ), - BenchmarkTask( - "vf-003", - "verification", - "retry", - "medium", - "fix_callback", - "passed", - ["retry", "callback"], - "重试与修复回调", - ), - BenchmarkTask( - "vf-004", - "verification", - "timeout", - "medium", - "timeout", - "passed", - ["timeout"], - "超时检测", - ), - BenchmarkTask( - "vf-005", - "verification", - "multi", - "medium", - "multi_command", - "passed", - ["multi"], - "多命令验证", - ), + BenchmarkTask("vf-001", "verification", "basic", "easy", "pass", + "passed", ["pass"], "验证通过命令"), + BenchmarkTask("vf-002", "verification", "basic", "easy", "fail", + "passed", ["fail"], "验证失败命令"), + BenchmarkTask("vf-003", "verification", "retry", "medium", "fix_callback", + "passed", ["retry", "callback"], "重试与修复回调"), + BenchmarkTask("vf-004", "verification", "timeout", "medium", "timeout", + "passed", ["timeout"], "超时检测"), + BenchmarkTask("vf-005", "verification", "multi", "medium", "multi_command", + "passed", ["multi"], "多命令验证"), ] +# fmt: on +# fmt: off _FAST_CORE_IDS: set[str] = { - "prep-001", - "prep-005", - "prep-010", - "prep-012", - "over-001", - "over-003", - "eff-001", - "eff-004", - "ts-001", - "ts-003", - "ts-008", - "ts-010", - "ev-001", - "ev-004", - "ev-005", - "sm-001", - "sm-002", - "sm-006", - "sm-004", - "vf-001", - "vf-002", - "vf-003", + "prep-001", "prep-005", "prep-010", "prep-012", "over-001", "over-003", + "eff-001", "eff-004", "ts-001", "ts-003", "ts-008", "ts-010", + "ev-001", "ev-004", "ev-005", "sm-001", "sm-002", "sm-006", "sm-004", + "vf-001", "vf-002", "vf-003", "llm-001", "llm-003", "gui-001", "gui-002", "gui-004", } +# fmt: on + + +# --------------------------------------------------------------------------- +# LLM Reasoning tasks (require real LLM via agentkit.yaml) +# 
--------------------------------------------------------------------------- + + +# fmt: off +LLM_REASONING_TASKS: list[BenchmarkTask] = [ + BenchmarkTask("llm-001", "llm_reasoning", "intent_understanding", "easy", + "帮我查看当前服务器的IP地址", "react", ["intent", "tool_use"], + "LLM 应识别需要使用工具查看 IP", + expected_keywords=["ip", "地址", "ifconfig", "hostname", "网络"]), + BenchmarkTask("llm-002", "llm_reasoning", "tool_selection", "medium", + "搜索最新的 AI Agent 论文", "react", ["tool_selection", "web_search"], + "LLM 应选择 web_search 工具", + expected_keywords=["search", "搜索", "web", "论文", "paper", "agent"]), + BenchmarkTask("llm-003", "llm_reasoning", "multi_step", "hard", + "分析这段代码的性能问题并给出优化建议:def fib(n): return fib(n-1)+fib(n-2) if n>1 else n", + "react", ["multi_step", "code_analysis"], "LLM 应分析代码并给出优化建议", + expected_keywords=["fib", "递归", "优化", "缓存", "memo", "迭代", "动态规划", "性能"]), + BenchmarkTask("llm-004", "llm_reasoning", "code_generation", "medium", + "写一个 Python 函数来计算斐波那契数列", "react", ["code_gen"], + "LLM 应生成可执行的 Python 代码", + expected_keywords=["def", "fib", "return", "python"]), + BenchmarkTask("llm-005", "llm_reasoning", "error_recovery", "hard", + "这个报错怎么解决:ModuleNotFoundError: No module named 'agentkit'", + "react", ["error_recovery"], "LLM 应给出 pip install 建议", + expected_keywords=["pip", "install", "agentkit", "安装", "模块"]), +] +# fmt: on + + +# --------------------------------------------------------------------------- +# GUI Integration tasks (require starting real agentkit gui server) +# --------------------------------------------------------------------------- + + +# fmt: off +GUI_INTEGRATION_TASKS: list[BenchmarkTask] = [ + BenchmarkTask("gui-001", "gui_integration", "service_startup", "easy", + "agentkit gui --port {port}", "started", ["startup", "subprocess"], + "GUI 服务应成功启动并响应健康检查"), + BenchmarkTask("gui-002", "gui_integration", "api_availability", "medium", + "GET /api/v1/health, GET /api/v1/skills", "200", ["api", "http"], + "核心 API 端点应返回 200"), + 
BenchmarkTask("gui-003", "gui_integration", "api_availability", "medium", + "POST /api/v1/chat", "reachable", ["api", "chat"], + "Chat API 端点应可达(不要求成功,要求响应)"), + BenchmarkTask("gui-004", "gui_integration", "websocket", "hard", + "ws://localhost:{port}/api/v1/ws/{session}", "connected", + ["websocket", "realtime"], "WebSocket 端点应能建立连接并交换 ping/pong"), + BenchmarkTask("gui-005", "gui_integration", "frontend", "easy", + "GET /", "html", ["frontend", "static"], "前端首页应返回 HTML 内容"), +] +# fmt: on # --------------------------------------------------------------------------- @@ -892,6 +551,468 @@ def _make_context(tmp_dir: Path) -> BenchmarkContext: ) +# --------------------------------------------------------------------------- +# Real component builder (loads from agentkit.yaml for LLM mode) +# --------------------------------------------------------------------------- + + +def _find_config_path() -> str | None: + """Find agentkit.yaml: $AGENTKIT_CONFIG first, then cwd, then ~/.agentkit/.""" + import os as _os + + candidates = [ + _os.environ.get("AGENTKIT_CONFIG", ""), + str(Path.cwd() / "agentkit.yaml"), + str(Path.home() / ".agentkit" / "agentkit.yaml"), + ] + for path in candidates: + if path and Path(path).is_file(): + return path + return None + + +def _build_real_components() -> tuple[object, object, object] | None: + """Build real components from agentkit.yaml for LLM mode. + + Returns (preprocessor, skill_registry, llm_gateway) or None if config + is missing or no LLM provider is available. 
+ """ + import os as _os + + from agentkit.chat.request_preprocessor import RequestPreprocessor + from agentkit.server.app import _build_llm_gateway, _build_skill_registry + from agentkit.server.config import load_config_with_dotenv + + config_path = _find_config_path() + if not config_path: + console.print("[yellow]No agentkit.yaml found — skipping LLM mode.[/yellow]") + return None + + server_config = load_config_with_dotenv(config_path) + + # Fallback: inject DASHSCOPE_API_KEY from env if providers lack keys + if not server_config.has_llm_provider(): + dashscope_key = _os.environ.get("DASHSCOPE_API_KEY", "") + if dashscope_key: + for _name, pconf in server_config.llm_config.providers.items(): + if not pconf.api_key: + pconf.api_key = dashscope_key + if not pconf.base_url: + if dashscope_key.startswith("sk-sp-"): + pconf.base_url = "https://coding.dashscope.aliyuncs.com/v1" + else: + pconf.base_url = "https://dashscope.aliyuncs.com/compatible-mode/v1" + break + + if not server_config.has_llm_provider(): + console.print("[yellow]No LLM provider with valid API key — skipping LLM mode.[/yellow]") + return None + + skill_registry = _build_skill_registry(server_config) + preprocessor = RequestPreprocessor(skill_registry=skill_registry) + llm_gateway = _build_llm_gateway(server_config) + return preprocessor, skill_registry, llm_gateway + + +# --------------------------------------------------------------------------- +# LLM Reasoning dimension executor +# --------------------------------------------------------------------------- + + +async def _execute_llm_reasoning_task( + task: BenchmarkTask, + preprocessor: object, + llm_gateway: object, +) -> ExecutionResult: + """Execute a single LLM reasoning task. + + Steps: + 1. Call RequestPreprocessor.preprocess() to get execution mode. + 2. If REACT mode, call LLMGateway.chat() with 30s timeout. + 3. Check LLM response for expected keywords. + 4. Record latency and token usage. 
+ """ + start = time.perf_counter() + + # Step 1: preprocess to get execution mode + routing = await preprocessor.preprocess(content=task.input) # type: ignore[attr-defined] + actual_mode = routing.execution_mode.value + + # Step 2: if REACT, call LLM and check keywords + if actual_mode == "react": + try: + response = await asyncio.wait_for( + llm_gateway.chat( # type: ignore[attr-defined] + messages=[{"role": "user", "content": task.input}], + model="default", + agent_name="benchmark", + max_tokens=512, + ), + timeout=30.0, + ) + content = (response.content or "").lower() + tokens = response.usage.total_tokens if response.usage else 0 + + # Step 3: check expected keywords + if task.expected_keywords: + passed = any(kw.lower() in content for kw in task.expected_keywords) + else: + passed = bool(content.strip()) + + elapsed = (time.perf_counter() - start) * 1000 + return ExecutionResult( + actual=f"mode=react tokens={tokens} len={len(content)}", + passed=passed, + duration_ms=round(elapsed, 4), + detail=f"mode={actual_mode} keywords={task.expected_keywords}", + ) + except asyncio.TimeoutError: + elapsed = (time.perf_counter() - start) * 1000 + return ExecutionResult( + actual="timeout", + passed=False, + duration_ms=round(elapsed, 4), + detail="LLM call timed out after 30s", + ) + except Exception as e: + elapsed = (time.perf_counter() - start) * 1000 + return ExecutionResult( + actual=f"error:{type(e).__name__}", + passed=False, + duration_ms=round(elapsed, 4), + detail=f"LLM error: {e}", + ) + else: + # Non-REACT mode: check if matches expected + passed = actual_mode == task.expected + elapsed = (time.perf_counter() - start) * 1000 + return ExecutionResult( + actual=f"mode={actual_mode}", + passed=passed, + duration_ms=round(elapsed, 4), + detail=f"Expected {task.expected}, got {actual_mode}", + ) + + +async def _run_llm_reasoning( + runs: int, + fast: bool, + verbose: bool, + preprocessor: object, + llm_gateway: object, +) -> DimensionResult: + """Run LLM reasoning 
benchmark dimension with real LLM calls.""" + tasks = list(LLM_REASONING_TASKS) + if fast: + tasks = [t for t in tasks if t.task_id in _FAST_CORE_IDS] + + all_runs_cases: list[list[CaseResult]] = [] + accuracies: list[float] = [] + + for _run_idx in range(runs): + cases: list[CaseResult] = [] + for task in tasks: + try: + result = await _execute_llm_reasoning_task(task, preprocessor, llm_gateway) + except Exception as e: + result = ExecutionResult( + actual=f"__exception__:{type(e).__name__}", + passed=False, + duration_ms=0.0, + detail=str(e), + ) + root_cause = "none" if result.passed else _classify_llm_root_cause(result) + case = CaseResult( + task_id=task.task_id, + dimension=task.dimension, + category=task.category, + difficulty=task.difficulty, + passed=result.passed, + expected=task.expected, + actual=result.actual, + duration_ms=result.duration_ms, + root_cause=root_cause, + detail=result.detail, + consistency=result.consistency, + ) + cases.append(case) + if verbose: + status = "[green]✓[/green]" if case.passed else "[red]✗[/red]" + console.print( + f" {status} {task.task_id}: {result.actual} ({result.duration_ms:.2f}ms)" + ) + all_runs_cases.append(cases) + passed_count = sum(1 for c in cases if c.passed) + accuracies.append(passed_count / len(cases) if cases else 0.0) + + final_cases = all_runs_cases[-1] if all_runs_cases else [] + metrics = _compute_metrics(final_cases, accuracies if runs > 1 else None) + return DimensionResult( + dimension="llm_reasoning", + metrics=metrics, + cases=final_cases, + by_category=_aggregate_by(final_cases, "category"), + by_difficulty=_aggregate_by(final_cases, "difficulty"), + ) + + +def _classify_llm_root_cause(result: ExecutionResult) -> str: + """Classify root cause for LLM reasoning failures.""" + if "timeout" in result.actual: + return "timeout" + if "error" in result.actual or "__exception__" in result.actual: + return "exception" + if "mode=" in result.actual and "react" not in result.actual: + return "wrong_mode" 
+ return "keyword_miss" + + +# --------------------------------------------------------------------------- +# GUI Integration dimension executor +# --------------------------------------------------------------------------- + + +def _find_free_port() -> int: + """Find a free TCP port for the GUI server.""" + import socket + + with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s: + s.bind(("", 0)) + return int(s.getsockname()[1]) + + +async def _wait_for_server(base_url: str, timeout_s: float = 30.0) -> bool: + """Poll health endpoint until server is ready or timeout.""" + import httpx + + deadline = time.perf_counter() + timeout_s + while time.perf_counter() < deadline: + try: + async with httpx.AsyncClient(timeout=2.0) as client: + resp = await client.get(f"{base_url}/api/v1/health") + if resp.status_code == 200: + return True + except Exception: + await asyncio.sleep(0.5) + return False + + +async def _run_gui_integration( + runs: int, + fast: bool, + verbose: bool, +) -> DimensionResult: + """Run GUI integration benchmark by starting a real agentkit gui server.""" + import os as _os + import subprocess + import sys + + import httpx + + tasks = list(GUI_INTEGRATION_TASKS) + if fast: + tasks = [t for t in tasks if t.task_id in _FAST_CORE_IDS] + + def _case( + tid: str, cat: str, diff: str, actual: str, expected: str, passed: bool, detail: str + ) -> CaseResult: + return CaseResult( + tid, + "gui_integration", + cat, + diff, + passed, + expected, + actual, + 0.0, + "none" if passed else "gui_failure", + detail, + ) + + def _log(tid: str, passed: bool, label: str) -> None: + if verbose: + status = "[green]✓[/green]" if passed else "[red]✗[/red]" + console.print(f" {status} {tid}: {label}") + + all_runs_cases: list[list[CaseResult]] = [] + accuracies: list[float] = [] + + for _ in range(runs): + cases: list[CaseResult] = [] + port = _find_free_port() + base_url = f"http://localhost:{port}" + proc = subprocess.Popen( + [ + sys.executable, + "-m", + "agentkit", 
+ "gui", + "--port", + str(port), + "--no-open", + "--host", + "127.0.0.1", + ], + stdout=subprocess.DEVNULL, + stderr=subprocess.DEVNULL, + env={**_os.environ, "AGENTKIT_GUI_MODE": "1"}, + ) + try: + # gui-001: service startup + startup_pass = await _wait_for_server(base_url, timeout_s=30.0) + cases.append( + _case( + "gui-001", + "service_startup", + "easy", + "started" if startup_pass else "failed", + "started", + startup_pass, + f"port={port} pid={proc.pid}", + ) + ) + _log("gui-001", startup_pass, f"port={port}") + + if not startup_pass: + for task in tasks[1:]: + cases.append( + _case( + task.task_id, + task.category, + task.difficulty, + "skipped", + task.expected, + False, + "server not started", + ) + ) + all_runs_cases.append(cases) + accuracies.append(0.0) + continue + + # gui-002: API availability (health + skills) + api_pass = False + api_detail = "N/A" + try: + async with httpx.AsyncClient(timeout=5.0) as client: + h_resp = await client.get(f"{base_url}/api/v1/health") + s_resp = await client.get(f"{base_url}/api/v1/skills") + api_pass = h_resp.status_code == 200 and s_resp.status_code == 200 + api_detail = f"health={h_resp.status_code} skills={s_resp.status_code}" + except Exception as e: + api_detail = f"error: {e}" + cases.append( + _case( + "gui-002", + "api_availability", + "medium", + "200" if api_pass else "error", + "200", + api_pass, + api_detail, + ) + ) + _log("gui-002", api_pass, "health+skills") + + # gui-003: chat API reachability + chat_pass = False + chat_detail = "N/A" + try: + async with httpx.AsyncClient(timeout=5.0) as client: + c_resp = await client.post( + f"{base_url}/api/v1/chat", + json={"message": "ping", "session_id": "bench-test"}, + ) + chat_pass = c_resp.status_code < 500 + chat_detail = f"status={c_resp.status_code}" + except Exception as e: + chat_detail = f"error: {e}" + cases.append( + _case( + "gui-003", + "api_availability", + "medium", + "reachable" if chat_pass else "unreachable", + "reachable", + chat_pass, + 
chat_detail, + ) + ) + _log("gui-003", chat_pass, "chat API") + + # gui-004: WebSocket connection + ws_pass = False + ws_detail = "N/A" + try: + import websockets + + ws_url = f"ws://localhost:{port}/api/v1/ws/bench-session" + async with websockets.connect(ws_url, open_timeout=5.0) as ws: + await ws.send('{"type": "ping"}') + msg = await asyncio.wait_for(ws.recv(), timeout=5.0) + ws_pass = "pong" in str(msg).lower() or "error" in str(msg).lower() + ws_detail = f"msg={str(msg)[:50]}" + except Exception as e: + ws_detail = f"error: {e}" + cases.append( + _case( + "gui-004", + "websocket", + "hard", + "connected" if ws_pass else "failed", + "connected", + ws_pass, + ws_detail, + ) + ) + _log("gui-004", ws_pass, "websocket") + + # gui-005: frontend resources + fe_pass = False + fe_detail = "N/A" + try: + async with httpx.AsyncClient(timeout=5.0) as client: + r_resp = await client.get(f"{base_url}/") + fe_pass = r_resp.status_code == 200 and " 1 else None) + return DimensionResult( + dimension="gui_integration", + metrics=metrics, + cases=final_cases, + by_category=_aggregate_by(final_cases, "category"), + by_difficulty=_aggregate_by(final_cases, "difficulty"), + ) + + # --------------------------------------------------------------------------- # Utility functions # --------------------------------------------------------------------------- @@ -1634,6 +1755,7 @@ def _generate_markdown_report( timestamp = str(report_data.get("timestamp", "")) version = str(report_data.get("version", "")) + mode = str(report_data.get("mode", "mock")) runs = int(report_data.get("runs", 1)) overall = float(report_data.get("overall_accuracy", 0.0)) overall_mean = float(report_data.get("overall_accuracy_mean", overall)) @@ -1645,6 +1767,7 @@ def _generate_markdown_report( lines.append("## 测试概要") lines.append(f"- 时间: {timestamp}") lines.append(f"- 版本: {version}") + lines.append(f"- 模式: {mode}") lines.append(f"- 运行次数: {runs}") lines.append(f"- 总体准确率: {overall_mean:.1%} ± {overall_std:.1%}") 
lines.append("") @@ -1670,13 +1793,15 @@ def _generate_markdown_report( dimensions = {} dim_titles = { - "preprocessing": "1. 预处理准确度 (Preprocessing Accuracy)", - "overfitting": "2. 过拟合检测 (Overfitting Detection)", - "efficiency": "3. 效率测试 (Efficiency)", - "tool_search": "4. 工具搜索 (Tool Search)", - "event_model": "5. 事件模型 (Event Model)", - "spec_management": "6. 规格管理 (Spec Management)", - "verification": "7. 验证循环 (Verification Loop)", + "preprocessing": "1. 预处理准确度 (Preprocessing Accuracy) [Mock]", + "overfitting": "2. 过拟合检测 (Overfitting Detection) [Mock]", + "efficiency": "3. 效率测试 (Efficiency) [Mock]", + "tool_search": "4. 工具搜索 (Tool Search) [Mock]", + "event_model": "5. 事件模型 (Event Model) [Mock]", + "spec_management": "6. 规格管理 (Spec Management) [Mock]", + "verification": "7. 验证循环 (Verification Loop) [Mock]", + "llm_reasoning": "8. LLM 推理能力 (LLM Reasoning) [LLM]", + "gui_integration": "9. GUI 集成测试 (GUI Integration) [GUI]", } lines.append("## 维度结果") @@ -1929,6 +2054,7 @@ def _generate_html_report( timestamp = str(report_data.get("timestamp", "")) version = str(report_data.get("version", "")) + mode = str(report_data.get("mode", "mock")) runs = int(report_data.get("runs", 1)) html = f""" @@ -1956,6 +2082,7 @@ def _generate_html_report(

Timestamp: {timestamp}
Version: {version}
+ Mode: {mode}
Runs: {runs}
Overall Accuracy: {overall:.1%}

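The `--mode` / `--dimension` resolution added in the `benchmark()` hunk below can be read as a small pure function. The following is a minimal sketch, not the project's code: `Mode`, `MOCK_DIMENSIONS`, and `resolve_plan` are hypothetical names (the diff uses `BenchmarkMode`, `BenchmarkDimension`, and `_MOCK_DIMENSIONS` inline), and the seven mock dimensions are assumed to be the ones listed in the removed `dims_to_run` block.

```python
from enum import Enum


class Mode(str, Enum):
    """Hypothetical stand-in for BenchmarkMode."""

    MOCK = "mock"
    LLM = "llm"
    GUI = "gui"
    ALL = "all"


# Assumption: mirrors _MOCK_DIMENSIONS (the seven dimensions tagged [Mock]).
MOCK_DIMENSIONS = (
    "preprocessing",
    "overfitting",
    "efficiency",
    "tool_search",
    "event_model",
    "spec_management",
    "verification",
)


def resolve_plan(mode: Mode, dimension: str = "all") -> tuple[list[str], bool, bool]:
    """Return (mock_dims, run_llm, run_gui) for a mode and dimension filter."""
    want_all = dimension == "all"

    mock_dims: list[str] = []
    if mode in (Mode.MOCK, Mode.ALL):
        if want_all:
            mock_dims = list(MOCK_DIMENSIONS)
        elif dimension in MOCK_DIMENSIONS:
            mock_dims = [dimension]

    # LLM and GUI each own exactly one dimension, so the filter must name it
    # (or be "all") and the mode must include that backend.
    run_llm = mode in (Mode.LLM, Mode.ALL) and (want_all or dimension == "llm_reasoning")
    run_gui = mode in (Mode.GUI, Mode.ALL) and (want_all or dimension == "gui_integration")
    return mock_dims, run_llm, run_gui
```

Under these assumptions, a mismatched pair such as `resolve_plan(Mode.MOCK, "llm_reasoning")` yields an empty plan, which corresponds to the "No dimensions were run" warning path in the diff.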
@@ -2118,6 +2245,11 @@ def benchmark( "-d", help="Benchmark dimension to run (default: all)", ), + mode: BenchmarkMode = typer.Option( + BenchmarkMode.MOCK, + "--mode", + help="Execution mode: mock (default), llm, gui, or all", + ), report: bool = typer.Option(False, "--report", help="Generate report files"), format: str = typer.Option( "markdown", @@ -2138,18 +2270,22 @@ def benchmark( ): """Run AgentKit capability benchmarks with standardized metrics. - Tests core components directly (no LLM, no pytest subprocess): - preprocessing, overfitting, efficiency, tool_search, event_model, - spec_management, verification. + Supports three execution modes via --mode, plus `all` to combine them: + - mock: everything uses mocks (default; fast, no LLM dependency) + - llm: uses a real LLM (requires agentkit.yaml configuration) + - gui: starts a real GUI server and tests end to end + - all: runs every mode (Mock + LLM + GUI) Produces Accuracy / Precision / Recall / F1 / Latency / Consistency metrics with multi-run averaging and 95% confidence intervals. """ import tempfile - # Normalize dimension (Typer may pass string) + # Normalize enums (Typer may pass strings) if isinstance(dimension, str): dimension = BenchmarkDimension(dimension) + if isinstance(mode, str): + mode = BenchmarkMode(mode) # Normalize format fmt = format.lower() @@ -2160,6 +2296,7 @@ def benchmark( console.print( Panel.fit( "[bold cyan]AgentKit Benchmark[/bold cyan]\n" + f"Mode: [yellow]{mode.value}[/yellow] " f"Dimension: [yellow]{dimension.value}[/yellow] " f"Runs: [yellow]{runs}[/yellow] " f"Fast: [yellow]{fast}[/yellow] " @@ -2169,26 +2306,82 @@ def benchmark( ) console.print() - # Determine which dimensions to run - if dimension == BenchmarkDimension.ALL: - dims_to_run = [ - BenchmarkDimension.PREPROCESSING, - BenchmarkDimension.OVERFITTING, - BenchmarkDimension.EFFICIENCY, - BenchmarkDimension.TOOL_SEARCH, - BenchmarkDimension.EVENT_MODEL, - BenchmarkDimension.SPEC_MANAGEMENT, - BenchmarkDimension.VERIFICATION, - ] - else: - dims_to_run = [dimension] + # Determine which dimensions to run based on mode and dimension filter
+ mock_dims: list[BenchmarkDimension] = [] + run_llm = False + run_gui = False + + if mode == BenchmarkMode.MOCK: + if dimension == BenchmarkDimension.ALL: + mock_dims = list(_MOCK_DIMENSIONS) + elif dimension in _MOCK_DIMENSIONS: + mock_dims = [dimension] + elif mode == BenchmarkMode.LLM: + if dimension in (BenchmarkDimension.ALL, BenchmarkDimension.LLM_REASONING): + run_llm = True + elif mode == BenchmarkMode.GUI: + if dimension in (BenchmarkDimension.ALL, BenchmarkDimension.GUI_INTEGRATION): + run_gui = True + elif mode == BenchmarkMode.ALL: + if dimension == BenchmarkDimension.ALL: + mock_dims = list(_MOCK_DIMENSIONS) + run_llm = True + run_gui = True + elif dimension in _MOCK_DIMENSIONS: + mock_dims = [dimension] + elif dimension == BenchmarkDimension.LLM_REASONING: + run_llm = True + elif dimension == BenchmarkDimension.GUI_INTEGRATION: + run_gui = True results: dict[str, DimensionResult] = {} - with tempfile.TemporaryDirectory(prefix="agentkit-benchmark-") as tmp: - tmp_path = Path(tmp) - ctx = _make_context(tmp_path) + # --- Mock dimensions --- + if mock_dims: + with tempfile.TemporaryDirectory(prefix="agentkit-benchmark-") as tmp: + tmp_path = Path(tmp) + ctx = _make_context(tmp_path) + with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + BarColumn(), + TaskProgressColumn(), + console=console, + ) as progress: + for dim in mock_dims: + task = progress.add_task(f"Running [mock] {dim.value}...", total=None) + dim_result = asyncio.run(_run_dimension(dim.value, runs, fast, verbose, ctx)) + results[dim.value] = dim_result + progress.update(task, completed=True, total=1) + + # --- LLM reasoning dimension --- + if run_llm: + console.print("[cyan]Loading real components for LLM mode...[/cyan]") + components = _build_real_components() + if components is None: + console.print( + "[yellow]⚠ LLM mode skipped — no valid agentkit.yaml or API key.[/yellow]" + ) + else: + preprocessor, _skill_registry, llm_gateway = components + 
with Progress( + SpinnerColumn(), + TextColumn("[progress.description]{task.description}"), + BarColumn(), + TaskProgressColumn(), + console=console, + ) as progress: + task = progress.add_task("Running [llm] llm_reasoning...", total=None) + dim_result = asyncio.run( + _run_llm_reasoning(runs, fast, verbose, preprocessor, llm_gateway) + ) + results["llm_reasoning"] = dim_result + progress.update(task, completed=True, total=1) + + # --- GUI integration dimension --- + if run_gui: + console.print("[cyan]Starting GUI integration tests...[/cyan]") with Progress( SpinnerColumn(), TextColumn("[progress.description]{task.description}"), @@ -2196,11 +2389,14 @@ def benchmark( TaskProgressColumn(), console=console, ) as progress: - for dim in dims_to_run: - task = progress.add_task(f"Running {dim.value}...", total=None) - dim_result = asyncio.run(_run_dimension(dim.value, runs, fast, verbose, ctx)) - results[dim.value] = dim_result - progress.update(task, completed=True, total=1) + task = progress.add_task("Running [gui] gui_integration...", total=None) + dim_result = asyncio.run(_run_gui_integration(runs, fast, verbose)) + results["gui_integration"] = dim_result + progress.update(task, completed=True, total=1) + + if not results: + console.print("[yellow]⚠ No dimensions were run.[/yellow]") + return # Display summary table console.print() @@ -2252,6 +2448,7 @@ def benchmark( report_data: dict[str, object] = { "timestamp": timestamp, "version": version, + "mode": mode.value, "runs": runs, "fast": fast, "overall_accuracy": round(overall_score, 4), diff --git a/test-results/benchmark/benchmark_report.json b/test-results/benchmark/benchmark_report.json index a38ea17..48bc2f3 100644 --- a/test-results/benchmark/benchmark_report.json +++ b/test-results/benchmark/benchmark_report.json @@ -1,12 +1,13 @@ { - "timestamp": "2026-06-17T04:00:50.738066+00:00", + "timestamp": "2026-06-17T04:52:53.863927+00:00", "version": "0.1.0", - "runs": 3, + "mode": "all", + "runs": 1, "fast": 
false, - "overall_accuracy": 1.0, - "overall_accuracy_mean": 1.0, + "overall_accuracy": 0.9524, + "overall_accuracy_mean": 0.9524, "overall_accuracy_std": 0.0, - "summary": "All 53 tests passed across 7 dimensions.", + "summary": "60/63 tests passed (3 failed) across 9 dimensions.", "dimensions": { "preprocessing": { "metrics": { @@ -14,9 +15,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.006, - "latency_p95_ms": 0.0295, - "latency_p99_ms": 0.0569, + "latency_p50_ms": 0.0128, + "latency_p95_ms": 0.057, + "latency_p99_ms": 0.1086, "consistency": 1.0, "total": 15, "passed": 15, @@ -32,9 +33,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0069, - "latency_p95_ms": 0.0111, - "latency_p99_ms": 0.0117, + "latency_p50_ms": 0.0133, + "latency_p95_ms": 0.026, + "latency_p99_ms": 0.0275, "consistency": 1.0, "total": 4, "passed": 4, @@ -49,9 +50,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0051, - "latency_p95_ms": 0.0052, - "latency_p99_ms": 0.0052, + "latency_p50_ms": 0.0115, + "latency_p95_ms": 0.0166, + "latency_p99_ms": 0.0172, "consistency": 1.0, "total": 5, "passed": 5, @@ -66,9 +67,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0149, - "latency_p95_ms": 0.0588, - "latency_p99_ms": 0.0627, + "latency_p50_ms": 0.0294, + "latency_p95_ms": 0.1123, + "latency_p99_ms": 0.1197, "consistency": 1.0, "total": 3, "passed": 3, @@ -83,9 +84,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0056, - "latency_p95_ms": 0.0074, - "latency_p99_ms": 0.0076, + "latency_p50_ms": 0.0101, + "latency_p95_ms": 0.0125, + "latency_p99_ms": 0.0127, "consistency": 1.0, "total": 3, "passed": 3, @@ -102,9 +103,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0066, - "latency_p95_ms": 0.0109, - "latency_p99_ms": 0.0116, + "latency_p50_ms": 0.0115, + "latency_p95_ms": 0.0253, + "latency_p99_ms": 0.0274, "consistency": 1.0, "total": 5, "passed": 5, @@ -119,9 +120,9 
@@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0051, - "latency_p95_ms": 0.0132, - "latency_p99_ms": 0.0146, + "latency_p50_ms": 0.0136, + "latency_p95_ms": 0.0263, + "latency_p99_ms": 0.0288, "consistency": 1.0, "total": 7, "passed": 7, @@ -136,9 +137,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0076, - "latency_p95_ms": 0.0581, - "latency_p99_ms": 0.0626, + "latency_p50_ms": 0.0128, + "latency_p95_ms": 0.1106, + "latency_p99_ms": 0.1193, "consistency": 1.0, "total": 3, "passed": 3, @@ -158,7 +159,7 @@ "passed": true, "expected": "direct_chat", "actual": "direct_chat", - "duration_ms": 0.0118, + "duration_ms": 0.0279, "root_cause": "none", "detail": "input='你好' method=regex_direct", "consistency": 1.0 @@ -171,7 +172,7 @@ "passed": true, "expected": "direct_chat", "actual": "direct_chat", - "duration_ms": 0.0071, + "duration_ms": 0.0151, "root_cause": "none", "detail": "input='hello' method=regex_direct", "consistency": 1.0 @@ -184,7 +185,7 @@ "passed": true, "expected": "direct_chat", "actual": "direct_chat", - "duration_ms": 0.0066, + "duration_ms": 0.0111, "root_cause": "none", "detail": "input='谢谢' method=regex_direct", "consistency": 1.0 @@ -197,7 +198,7 @@ "passed": true, "expected": "direct_chat", "actual": "direct_chat", - "duration_ms": 0.006, + "duration_ms": 0.0115, "root_cause": "none", "detail": "input='你是谁' method=regex_direct", "consistency": 1.0 @@ -210,7 +211,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0052, + "duration_ms": 0.0136, "root_cause": "none", "detail": "input='搜索golang教程' method=default_react", "consistency": 1.0 @@ -223,7 +224,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0046, + "duration_ms": 0.0115, "root_cause": "none", "detail": "input='执行ls命令' method=default_react", "consistency": 1.0 @@ -236,7 +237,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0051, + "duration_ms": 0.0174, 
"root_cause": "none", "detail": "input='翻译hello为中文' method=default_react", "consistency": 1.0 @@ -249,7 +250,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0051, + "duration_ms": 0.0113, "root_cause": "none", "detail": "input='什么是机器学习' method=default_react", "consistency": 1.0 @@ -262,7 +263,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0047, + "duration_ms": 0.0109, "root_cause": "none", "detail": "input='帮我分析数据' method=default_react", "consistency": 1.0 @@ -275,7 +276,7 @@ "passed": true, "expected": "skill_react", "actual": "skill_react", - "duration_ms": 0.0149, + "duration_ms": 0.0294, "root_cause": "none", "detail": "input='@skill:react_agent 查看ip' method=skill_prefix", "consistency": 1.0 @@ -288,7 +289,7 @@ "passed": true, "expected": "direct_chat", "actual": "direct_chat", - "duration_ms": 0.0092, + "duration_ms": 0.0191, "root_cause": "none", "detail": "input='@skill:chat_only 你好' method=skill_prefix", "consistency": 1.0 @@ -301,7 +302,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0637, + "duration_ms": 0.1215, "root_cause": "none", "detail": "input='@skill:nonexistent 做点什么' method=skill_not_found_fallback", "consistency": 1.0 @@ -314,7 +315,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0076, + "duration_ms": 0.0101, "root_cause": "none", "detail": "input='帮我分析这个数据并生成报告' method=default_react", "consistency": 1.0 @@ -327,7 +328,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0056, + "duration_ms": 0.0099, "root_cause": "none", "detail": "input='随便聊聊' method=default_react", "consistency": 1.0 @@ -340,7 +341,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0047, + "duration_ms": 0.0128, "root_cause": "none", "detail": "input='请帮我完成以下任务:1. 查询天气 2. 
生成报告' method=default_react", "consistency": 1.0 @@ -353,9 +354,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0426, - "latency_p95_ms": 0.0644, - "latency_p99_ms": 0.0675, + "latency_p50_ms": 0.025, + "latency_p95_ms": 0.0557, + "latency_p99_ms": 0.0596, "consistency": 1.0, "total": 5, "passed": 5, @@ -371,9 +372,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0426, - "latency_p95_ms": 0.0426, - "latency_p99_ms": 0.0426, + "latency_p50_ms": 0.0362, + "latency_p95_ms": 0.0362, + "latency_p99_ms": 0.0362, "consistency": 1.0, "total": 1, "passed": 1, @@ -388,9 +389,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0309, - "latency_p95_ms": 0.0309, - "latency_p99_ms": 0.0309, + "latency_p50_ms": 0.0243, + "latency_p95_ms": 0.0243, + "latency_p99_ms": 0.0243, "consistency": 1.0, "total": 1, "passed": 1, @@ -405,9 +406,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.049, - "latency_p95_ms": 0.049, - "latency_p99_ms": 0.049, + "latency_p50_ms": 0.0606, + "latency_p95_ms": 0.0606, + "latency_p99_ms": 0.0606, "consistency": 1.0, "total": 1, "passed": 1, @@ -422,9 +423,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0252, - "latency_p95_ms": 0.0252, - "latency_p99_ms": 0.0252, + "latency_p50_ms": 0.0233, + "latency_p95_ms": 0.0233, + "latency_p99_ms": 0.0233, "consistency": 1.0, "total": 1, "passed": 1, @@ -439,9 +440,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0683, - "latency_p95_ms": 0.0683, - "latency_p99_ms": 0.0683, + "latency_p50_ms": 0.025, + "latency_p95_ms": 0.025, + "latency_p99_ms": 0.025, "consistency": 1.0, "total": 1, "passed": 1, @@ -458,9 +459,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0309, - "latency_p95_ms": 0.0414, - "latency_p99_ms": 0.0424, + "latency_p50_ms": 0.0243, + "latency_p95_ms": 0.035, + "latency_p99_ms": 0.036, "consistency": 1.0, "total": 3, "passed": 3, @@ -475,9 +476,9 @@ 
"precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.049, - "latency_p95_ms": 0.049, - "latency_p99_ms": 0.049, + "latency_p50_ms": 0.0606, + "latency_p95_ms": 0.0606, + "latency_p99_ms": 0.0606, "consistency": 1.0, "total": 1, "passed": 1, @@ -492,9 +493,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0683, - "latency_p95_ms": 0.0683, - "latency_p99_ms": 0.0683, + "latency_p50_ms": 0.025, + "latency_p95_ms": 0.025, + "latency_p99_ms": 0.025, "consistency": 1.0, "total": 1, "passed": 1, @@ -514,7 +515,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0426, + "duration_ms": 0.0362, "root_cause": "none", "detail": "paraphrases=5 modes=['react', 'react', 'react', 'react', 'react']", "consistency": 1.0 @@ -527,7 +528,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0309, + "duration_ms": 0.0243, "root_cause": "none", "detail": "paraphrases=3 modes=['react', 'react', 'react']", "consistency": 1.0 @@ -540,7 +541,7 @@ "passed": true, "expected": "direct_chat", "actual": "direct_chat", - "duration_ms": 0.049, + "duration_ms": 0.0606, "root_cause": "none", "detail": "paraphrases=5 modes=['direct_chat', 'direct_chat', 'direct_chat', 'direct_chat', 'direct_chat']", "consistency": 1.0 @@ -553,7 +554,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0252, + "duration_ms": 0.0233, "root_cause": "none", "detail": "paraphrases=3 modes=['react', 'react', 'react']", "consistency": 1.0 @@ -566,7 +567,7 @@ "passed": true, "expected": "react", "actual": "react", - "duration_ms": 0.0683, + "duration_ms": 0.025, "root_cause": "none", "detail": "paraphrases=3 modes=['react', 'react', 'react']", "consistency": 1.0 @@ -579,9 +580,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.4, - "latency_p95_ms": 0.768, - "latency_p99_ms": 0.8176, + "latency_p50_ms": 0.33, + "latency_p95_ms": 0.622, + "latency_p99_ms": 0.6604, "consistency": 1.0, "total": 
5, "passed": 5, @@ -597,9 +598,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.4, - "latency_p95_ms": 0.508, - "latency_p99_ms": 0.5176, + "latency_p50_ms": 0.33, + "latency_p95_ms": 0.42, + "latency_p99_ms": 0.428, "consistency": 1.0, "total": 3, "passed": 3, @@ -614,9 +615,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.44, - "latency_p95_ms": 0.791, - "latency_p99_ms": 0.8222, + "latency_p50_ms": 0.355, + "latency_p95_ms": 0.6385, + "latency_p99_ms": 0.6637, "consistency": 1.0, "total": 2, "passed": 2, @@ -633,9 +634,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.2, - "latency_p95_ms": 0.335, - "latency_p99_ms": 0.347, + "latency_p50_ms": 0.165, + "latency_p95_ms": 0.2775, + "latency_p99_ms": 0.2875, "consistency": 1.0, "total": 2, "passed": 2, @@ -650,9 +651,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.52, - "latency_p95_ms": 0.799, - "latency_p99_ms": 0.8238, + "latency_p50_ms": 0.43, + "latency_p95_ms": 0.646, + "latency_p99_ms": 0.6652, "consistency": 1.0, "total": 3, "passed": 3, @@ -671,10 +672,10 @@ "difficulty": "easy", "passed": true, "expected": "<=50ms", - "actual": "0.004ms", - "duration_ms": 0.35, + "actual": "0.003ms", + "duration_ms": 0.29, "root_cause": "none", - "detail": "iterations=100 avg=0.004ms threshold=50.0ms", + "detail": "iterations=100 avg=0.003ms threshold=50.0ms", "consistency": 1.0 }, { @@ -684,10 +685,10 @@ "difficulty": "medium", "passed": true, "expected": "<=50ms", - "actual": "0.004ms", - "duration_ms": 0.4, + "actual": "0.003ms", + "duration_ms": 0.33, "root_cause": "none", - "detail": "iterations=100 avg=0.004ms threshold=50.0ms", + "detail": "iterations=100 avg=0.003ms threshold=50.0ms", "consistency": 1.0 }, { @@ -697,10 +698,10 @@ "difficulty": "medium", "passed": true, "expected": "<=50ms", - "actual": "0.005ms", - "duration_ms": 0.52, + "actual": "0.004ms", + "duration_ms": 0.43, "root_cause": "none", - "detail": 
"iterations=100 avg=0.005ms threshold=50.0ms", + "detail": "iterations=100 avg=0.004ms threshold=50.0ms", "consistency": 1.0 }, { @@ -710,10 +711,10 @@ "difficulty": "medium", "passed": true, "expected": "<=10ms", - "actual": "0.008ms", - "duration_ms": 0.83, + "actual": "0.007ms", + "duration_ms": 0.67, "root_cause": "none", - "detail": "iterations=100 avg=0.008ms threshold=10.0ms", + "detail": "iterations=100 avg=0.007ms threshold=10.0ms", "consistency": 1.0 }, { @@ -724,7 +725,7 @@ "passed": true, "expected": "<=5ms", "actual": "0.000ms", - "duration_ms": 0.05, + "duration_ms": 0.04, "root_cause": "none", "detail": "iterations=100 avg=0.000ms threshold=5.0ms", "consistency": 1.0 @@ -737,9 +738,9 @@ "precision": 0.8333, "recall": 0.8333, "f1": 0.8333, - "latency_p50_ms": 0.0112, - "latency_p95_ms": 0.0153, - "latency_p99_ms": 0.0163, + "latency_p50_ms": 0.0192, + "latency_p95_ms": 0.0278, + "latency_p99_ms": 0.0326, "consistency": 1.0, "total": 10, "passed": 10, @@ -755,9 +756,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0124, - "latency_p95_ms": 0.016, - "latency_p99_ms": 0.0165, + "latency_p50_ms": 0.0199, + "latency_p95_ms": 0.0203, + "latency_p99_ms": 0.0204, "consistency": 1.0, "total": 5, "passed": 5, @@ -772,9 +773,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0108, - "latency_p95_ms": 0.0111, - "latency_p99_ms": 0.0111, + "latency_p50_ms": 0.0264, + "latency_p95_ms": 0.0331, + "latency_p99_ms": 0.0337, "consistency": 1.0, "total": 2, "passed": 2, @@ -789,9 +790,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.0044, - "latency_p95_ms": 0.0071, - "latency_p99_ms": 0.0073, + "latency_p50_ms": 0.0118, + "latency_p95_ms": 0.0122, + "latency_p99_ms": 0.0123, "consistency": 1.0, "total": 2, "passed": 2, @@ -806,9 +807,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0091, - "latency_p95_ms": 0.0091, - "latency_p99_ms": 0.0091, + "latency_p50_ms": 0.016, + 
"latency_p95_ms": 0.016, + "latency_p99_ms": 0.016, "consistency": 1.0, "total": 1, "passed": 1, @@ -825,9 +826,9 @@ "precision": 0.8333, "recall": 0.8333, "f1": 0.8333, - "latency_p50_ms": 0.0124, - "latency_p95_ms": 0.0158, - "latency_p99_ms": 0.0164, + "latency_p50_ms": 0.0194, + "latency_p95_ms": 0.0203, + "latency_p99_ms": 0.0204, "consistency": 1.0, "total": 7, "passed": 7, @@ -842,9 +843,9 @@ "precision": 1.0, "recall": 1.0, "f1": 1.0, - "latency_p50_ms": 0.0105, - "latency_p95_ms": 0.011, - "latency_p99_ms": 0.0111, + "latency_p50_ms": 0.019, + "latency_p95_ms": 0.0323, + "latency_p99_ms": 0.0335, "consistency": 1.0, "total": 3, "passed": 3, @@ -864,7 +865,7 @@ "passed": true, "expected": "read_file", "actual": "read_file", - "duration_ms": 0.0166, + "duration_ms": 0.0199, "root_cause": "none", "detail": "query='read file' top_k=5 results=2", "consistency": 1.0 @@ -877,7 +878,7 @@ "passed": true, "expected": "write_file", "actual": "write_file", - "duration_ms": 0.0138, + "duration_ms": 0.0204, "root_cause": "none", "detail": "query='write file content' top_k=5 results=2", "consistency": 1.0 @@ -890,7 +891,7 @@ "passed": true, "expected": "web_search", "actual": "web_search", - "duration_ms": 0.0124, + "duration_ms": 0.02, "root_cause": "none", "detail": "query='search web information' top_k=5 results=2", "consistency": 1.0 @@ -903,7 +904,7 @@ "passed": true, "expected": "shell_exec", "actual": "shell_exec", - "duration_ms": 0.0113, + "duration_ms": 0.018, "root_cause": "none", "detail": "query='execute shell command' top_k=5 results=1", "consistency": 1.0 @@ -916,7 +917,7 @@ "passed": true, "expected": "http_request", "actual": "http_request", - "duration_ms": 0.0124, + "duration_ms": 0.0194, "root_cause": "none", "detail": "query='send http request url' top_k=5 results=1", "consistency": 1.0 @@ -929,7 +930,7 @@ "passed": true, "expected": "read_file", "actual": "read_file", - "duration_ms": 0.0105, + "duration_ms": 0.0338, "root_cause": "none", "detail": 
"query='io file' top_k=5 results=2", "consistency": 1.0 @@ -942,7 +943,7 @@ "passed": true, "expected": "web_search", "actual": "web_search", - "duration_ms": 0.0111, + "duration_ms": 0.019, "root_cause": "none", "detail": "query='search query engine' top_k=5 results=1", "consistency": 1.0 @@ -955,7 +956,7 @@ "passed": true, "expected": "__none__", "actual": "[]", - "duration_ms": 0.0015, + "duration_ms": 0.0112, "root_cause": "none", "detail": "query='' top_k=5 results=0", "consistency": 1.0 @@ -968,7 +969,7 @@ "passed": true, "expected": "__none__", "actual": "[]", - "duration_ms": 0.0074, + "duration_ms": 0.0123, "root_cause": "none", "detail": "query='zzzznonexistent' top_k=5 results=0", "consistency": 1.0 @@ -981,7 +982,7 @@ "passed": true, "expected": "read_file", "actual": "read_file", - "duration_ms": 0.0091, + "duration_ms": 0.016, "root_cause": "none", "detail": "query='file' top_k=1 results=1", "consistency": 1.0 @@ -994,9 +995,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.0409, - "latency_p95_ms": 15.6839, - "latency_p99_ms": 19.8446, + "latency_p50_ms": 0.057, + "latency_p95_ms": 15.9984, + "latency_p99_ms": 20.2369, "consistency": 1.0, "total": 6, "passed": 6, @@ -1012,9 +1013,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.038, - "latency_p95_ms": 0.0773, - "latency_p99_ms": 0.0808, + "latency_p50_ms": 0.046, + "latency_p95_ms": 0.0982, + "latency_p99_ms": 0.1028, "consistency": 1.0, "total": 3, "passed": 3, @@ -1029,9 +1030,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.0438, - "latency_p95_ms": 18.8006, - "latency_p99_ms": 20.4679, + "latency_p50_ms": 0.0681, + "latency_p95_ms": 19.1737, + "latency_p99_ms": 20.8719, "consistency": 1.0, "total": 3, "passed": 3, @@ -1048,9 +1049,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 0.0409, - "latency_p95_ms": 15.6839, - "latency_p99_ms": 19.8446, + "latency_p50_ms": 0.057, + "latency_p95_ms": 15.9984, + 
"latency_p99_ms": 20.2369, "consistency": 1.0, "total": 6, "passed": 6, @@ -1070,9 +1071,9 @@ "passed": true, "expected": "passed", "actual": "drained=['hello']", - "duration_ms": 0.0817, + "duration_ms": 0.104, "root_cause": "none", - "detail": "task_id=b0a1c409...", + "detail": "task_id=09dccea9...", "consistency": 1.0 }, { @@ -1083,7 +1084,7 @@ "passed": true, "expected": "passed", "actual": "cancelled=True", - "duration_ms": 0.038, + "duration_ms": 0.046, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1096,7 +1097,7 @@ "passed": true, "expected": "passed", "actual": "raised=True closed=True", - "duration_ms": 0.0091, + "duration_ms": 0.0115, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1109,7 +1110,7 @@ "passed": true, "expected": "passed", "actual": "received=1", - "duration_ms": 0.0438, + "duration_ms": 0.0681, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1122,7 +1123,7 @@ "passed": true, "expected": "passed", "actual": "events=1 closed=True", - "duration_ms": 20.8847, + "duration_ms": 21.2965, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1135,7 +1136,7 @@ "passed": true, "expected": "passed", "actual": "subscribers=0", - "duration_ms": 0.0045, + "duration_ms": 0.007, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1148,9 +1149,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 1.414, - "latency_p95_ms": 3.5951, - "latency_p99_ms": 4.0383, + "latency_p50_ms": 1.3834, + "latency_p95_ms": 3.4578, + "latency_p99_ms": 4.0077, "consistency": 1.0, "total": 7, "passed": 7, @@ -1166,9 +1167,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 1.414, - "latency_p95_ms": 3.6332, - "latency_p99_ms": 4.0459, + "latency_p50_ms": 1.3834, + "latency_p95_ms": 3.6044, + "latency_p99_ms": 4.037, "consistency": 1.0, "total": 5, "passed": 5, @@ -1183,9 +1184,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 1.1783, - "latency_p95_ms": 2.1899, - "latency_p99_ms": 
2.2798, + "latency_p50_ms": 0.9497, + "latency_p95_ms": 1.7635, + "latency_p99_ms": 1.8358, "consistency": 1.0, "total": 2, "passed": 2, @@ -1202,9 +1203,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 1.3787, - "latency_p95_ms": 3.5042, - "latency_p99_ms": 4.0201, + "latency_p50_ms": 1.3659, + "latency_p95_ms": 3.4693, + "latency_p99_ms": 4.01, "consistency": 1.0, "total": 6, "passed": 6, @@ -1219,9 +1220,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 2.3023, - "latency_p95_ms": 2.3023, - "latency_p99_ms": 2.3023, + "latency_p50_ms": 1.8539, + "latency_p95_ms": 1.8539, + "latency_p99_ms": 1.8539, "consistency": 1.0, "total": 1, "passed": 1, @@ -1241,9 +1242,9 @@ "passed": true, "expected": "passed", "actual": "exists=True", - "duration_ms": 1.414, + "duration_ms": 1.3484, "root_cause": "none", - "detail": "path=/var/folders/6b/ljk5bdq50yxcsth24frf05200000gn/T/agentkit-benchmark-pz2hpb1l/run-2/specs/sm-001/test-spec.yaml", + "detail": "path=/var/folders/6b/ljk5bdq50yxcsth24frf05200000gn/T/agentkit-benchmark-wll_nqgl/run-0/specs/sm-001/test-spec.yaml", "consistency": 1.0 }, { @@ -1254,7 +1255,7 @@ "passed": true, "expected": "passed", "actual": "steps=2", - "duration_ms": 1.3435, + "duration_ms": 1.3834, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1267,7 +1268,7 @@ "passed": true, "expected": "passed", "actual": "goal=Updated goal", - "duration_ms": 1.5695, + "duration_ms": 1.4414, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1280,7 +1281,7 @@ "passed": true, "expected": "passed", "actual": "deleted=True remaining=0", - "duration_ms": 1.1556, + "duration_ms": 1.0766, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1293,7 +1294,7 @@ "passed": true, "expected": "passed", "actual": "count=2", - "duration_ms": 4.1491, + "duration_ms": 4.1452, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1306,7 +1307,7 @@ "passed": true, "expected": "passed", "actual": "status=confirmed", - 
"duration_ms": 2.3023, + "duration_ms": 1.8539, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1319,7 +1320,7 @@ "passed": true, "expected": "passed", "actual": "result=None", - "duration_ms": 0.0544, + "duration_ms": 0.0454, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1332,9 +1333,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 25.4393, - "latency_p95_ms": 413.4245, - "latency_p99_ms": 488.3185, + "latency_p50_ms": 22.0041, + "latency_p95_ms": 411.5705, + "latency_p99_ms": 487.0649, "consistency": 1.0, "total": 5, "passed": 5, @@ -1350,9 +1351,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 12.9474, - "latency_p95_ms": 13.0775, - "latency_p99_ms": 13.0891, + "latency_p50_ms": 11.4916, + "latency_p95_ms": 11.8303, + "latency_p99_ms": 11.8604, "consistency": 1.0, "total": 2, "passed": 2, @@ -1367,9 +1368,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 38.9547, - "latency_p95_ms": 38.9547, - "latency_p99_ms": 38.9547, + "latency_p50_ms": 34.0985, + "latency_p95_ms": 34.0985, + "latency_p99_ms": 34.0985, "consistency": 1.0, "total": 1, "passed": 1, @@ -1384,9 +1385,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 507.042, - "latency_p95_ms": 507.042, - "latency_p99_ms": 507.042, + "latency_p50_ms": 505.9385, + "latency_p95_ms": 505.9385, + "latency_p99_ms": 505.9385, "consistency": 1.0, "total": 1, "passed": 1, @@ -1401,9 +1402,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 25.4393, - "latency_p95_ms": 25.4393, - "latency_p99_ms": 25.4393, + "latency_p50_ms": 22.0041, + "latency_p95_ms": 22.0041, + "latency_p99_ms": 22.0041, "consistency": 1.0, "total": 1, "passed": 1, @@ -1420,9 +1421,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 12.9474, - "latency_p95_ms": 13.0775, - "latency_p99_ms": 13.0891, + "latency_p50_ms": 11.4916, + "latency_p95_ms": 11.8303, + "latency_p99_ms": 11.8604, "consistency": 1.0, "total": 2, 
"passed": 2, @@ -1437,9 +1438,9 @@ "precision": 0.0, "recall": 0.0, "f1": 0.0, - "latency_p50_ms": 38.9547, - "latency_p95_ms": 460.2333, - "latency_p99_ms": 497.6803, + "latency_p50_ms": 34.0985, + "latency_p95_ms": 458.7545, + "latency_p99_ms": 496.5017, "consistency": 1.0, "total": 3, "passed": 3, @@ -1459,7 +1460,7 @@ "passed": true, "expected": "passed", "actual": "passed=True attempts=1", - "duration_ms": 13.092, + "duration_ms": 11.8679, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1472,7 +1473,7 @@ "passed": true, "expected": "passed", "actual": "passed=False errors=1", - "duration_ms": 12.8029, + "duration_ms": 11.1154, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1485,7 +1486,7 @@ "passed": true, "expected": "passed", "actual": "attempts=3 callbacks=2", - "duration_ms": 38.9547, + "duration_ms": 34.0985, "root_cause": "none", "detail": "", "consistency": 1.0 @@ -1498,7 +1499,7 @@ "passed": true, "expected": "passed", "actual": "passed=False errors=1", - "duration_ms": 507.042, + "duration_ms": 505.9385, "root_cause": "none", "detail": "errors=['Command timed out after 0.5s: sleep 10']", "consistency": 1.0 @@ -1511,12 +1512,447 @@ "passed": true, "expected": "passed", "actual": "passed=False", - "duration_ms": 25.4393, + "duration_ms": 22.0041, "root_cause": "none", "detail": "", "consistency": 1.0 } ] + }, + "llm_reasoning": { + "metrics": { + "accuracy": 0.6, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 25149.4865, + "latency_p95_ms": 30001.1677, + "latency_p99_ms": 30001.2291, + "consistency": 1.0, + "total": 5, + "passed": 3, + "failed": 2, + "accuracy_mean": 0.6, + "accuracy_std": 0.0, + "ci_lower": 0.2307, + "ci_upper": 0.8824 + }, + "by_category": { + "intent_understanding": { + "accuracy": 1.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 21288.4177, + "latency_p95_ms": 21288.4177, + "latency_p99_ms": 21288.4177, + "consistency": 1.0, + "total": 1, + "passed": 1, + 
"failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.2065, + "ci_upper": 1.0 + }, + "tool_selection": { + "accuracy": 1.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 5894.9682, + "latency_p95_ms": 5894.9682, + "latency_p99_ms": 5894.9682, + "consistency": 1.0, + "total": 1, + "passed": 1, + "failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.2065, + "ci_upper": 1.0 + }, + "multi_step": { + "accuracy": 0.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 30000.8609, + "latency_p95_ms": 30000.8609, + "latency_p99_ms": 30000.8609, + "consistency": 1.0, + "total": 1, + "passed": 0, + "failed": 1, + "accuracy_mean": 0.0, + "accuracy_std": 0.0, + "ci_lower": 0.0, + "ci_upper": 0.7935 + }, + "code_generation": { + "accuracy": 1.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 25149.4865, + "latency_p95_ms": 25149.4865, + "latency_p99_ms": 25149.4865, + "consistency": 1.0, + "total": 1, + "passed": 1, + "failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.2065, + "ci_upper": 1.0 + }, + "error_recovery": { + "accuracy": 0.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 30001.2444, + "latency_p95_ms": 30001.2444, + "latency_p99_ms": 30001.2444, + "consistency": 1.0, + "total": 1, + "passed": 0, + "failed": 1, + "accuracy_mean": 0.0, + "accuracy_std": 0.0, + "ci_lower": 0.0, + "ci_upper": 0.7935 + } + }, + "by_difficulty": { + "easy": { + "accuracy": 1.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 21288.4177, + "latency_p95_ms": 21288.4177, + "latency_p99_ms": 21288.4177, + "consistency": 1.0, + "total": 1, + "passed": 1, + "failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.2065, + "ci_upper": 1.0 + }, + "medium": { + "accuracy": 1.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 15522.2273, + "latency_p95_ms": 24186.7606, + 
"latency_p99_ms": 24956.9413, + "consistency": 1.0, + "total": 2, + "passed": 2, + "failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.3424, + "ci_upper": 1.0 + }, + "hard": { + "accuracy": 0.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 30001.0526, + "latency_p95_ms": 30001.2252, + "latency_p99_ms": 30001.2406, + "consistency": 1.0, + "total": 2, + "passed": 0, + "failed": 2, + "accuracy_mean": 0.0, + "accuracy_std": 0.0, + "ci_lower": 0.0, + "ci_upper": 0.6576 + } + }, + "cases": [ + { + "task_id": "llm-001", + "dimension": "llm_reasoning", + "category": "intent_understanding", + "difficulty": "easy", + "passed": true, + "expected": "react", + "actual": "mode=react tokens=1116 len=974", + "duration_ms": 21288.4177, + "root_cause": "none", + "detail": "mode=react keywords=['ip', '地址', 'ifconfig', 'hostname', '网络']", + "consistency": 1.0 + }, + { + "task_id": "llm-002", + "dimension": "llm_reasoning", + "category": "tool_selection", + "difficulty": "medium", + "passed": true, + "expected": "react", + "actual": "mode=react tokens=205 len=87", + "duration_ms": 5894.9682, + "root_cause": "none", + "detail": "mode=react keywords=['search', '搜索', 'web', '论文', 'paper', 'agent']", + "consistency": 1.0 + }, + { + "task_id": "llm-003", + "dimension": "llm_reasoning", + "category": "multi_step", + "difficulty": "hard", + "passed": false, + "expected": "react", + "actual": "timeout", + "duration_ms": 30000.8609, + "root_cause": "timeout", + "detail": "LLM call timed out after 30s", + "consistency": 1.0 + }, + { + "task_id": "llm-004", + "dimension": "llm_reasoning", + "category": "code_generation", + "difficulty": "medium", + "passed": true, + "expected": "react", + "actual": "mode=react tokens=1359 len=1001", + "duration_ms": 25149.4865, + "root_cause": "none", + "detail": "mode=react keywords=['def', 'fib', 'return', 'python']", + "consistency": 1.0 + }, + { + "task_id": "llm-005", + "dimension": "llm_reasoning", + 
"category": "error_recovery", + "difficulty": "hard", + "passed": false, + "expected": "react", + "actual": "timeout", + "duration_ms": 30001.2444, + "root_cause": "timeout", + "detail": "LLM call timed out after 30s", + "consistency": 1.0 + } + ] + }, + "gui_integration": { + "metrics": { + "accuracy": 0.8, + "precision": 0.8, + "recall": 0.8, + "f1": 0.8, + "latency_p50_ms": 0.0, + "latency_p95_ms": 0.0, + "latency_p99_ms": 0.0, + "consistency": 1.0, + "total": 5, + "passed": 4, + "failed": 1, + "accuracy_mean": 0.8, + "accuracy_std": 0.0, + "ci_lower": 0.3755, + "ci_upper": 0.9638 + }, + "by_category": { + "service_startup": { + "accuracy": 1.0, + "precision": 1.0, + "recall": 1.0, + "f1": 1.0, + "latency_p50_ms": 0.0, + "latency_p95_ms": 0.0, + "latency_p99_ms": 0.0, + "consistency": 1.0, + "total": 1, + "passed": 1, + "failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.2065, + "ci_upper": 1.0 + }, + "api_availability": { + "accuracy": 1.0, + "precision": 1.0, + "recall": 1.0, + "f1": 1.0, + "latency_p50_ms": 0.0, + "latency_p95_ms": 0.0, + "latency_p99_ms": 0.0, + "consistency": 1.0, + "total": 2, + "passed": 2, + "failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.3424, + "ci_upper": 1.0 + }, + "websocket": { + "accuracy": 0.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 0.0, + "latency_p95_ms": 0.0, + "latency_p99_ms": 0.0, + "consistency": 1.0, + "total": 1, + "passed": 0, + "failed": 1, + "accuracy_mean": 0.0, + "accuracy_std": 0.0, + "ci_lower": 0.0, + "ci_upper": 0.7935 + }, + "frontend": { + "accuracy": 1.0, + "precision": 1.0, + "recall": 1.0, + "f1": 1.0, + "latency_p50_ms": 0.0, + "latency_p95_ms": 0.0, + "latency_p99_ms": 0.0, + "consistency": 1.0, + "total": 1, + "passed": 1, + "failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.2065, + "ci_upper": 1.0 + } + }, + "by_difficulty": { + "easy": { + "accuracy": 1.0, + "precision": 1.0, + "recall": 1.0, + 
"f1": 1.0, + "latency_p50_ms": 0.0, + "latency_p95_ms": 0.0, + "latency_p99_ms": 0.0, + "consistency": 1.0, + "total": 2, + "passed": 2, + "failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.3424, + "ci_upper": 1.0 + }, + "medium": { + "accuracy": 1.0, + "precision": 1.0, + "recall": 1.0, + "f1": 1.0, + "latency_p50_ms": 0.0, + "latency_p95_ms": 0.0, + "latency_p99_ms": 0.0, + "consistency": 1.0, + "total": 2, + "passed": 2, + "failed": 0, + "accuracy_mean": 1.0, + "accuracy_std": 0.0, + "ci_lower": 0.3424, + "ci_upper": 1.0 + }, + "hard": { + "accuracy": 0.0, + "precision": 0.0, + "recall": 0.0, + "f1": 0.0, + "latency_p50_ms": 0.0, + "latency_p95_ms": 0.0, + "latency_p99_ms": 0.0, + "consistency": 1.0, + "total": 1, + "passed": 0, + "failed": 1, + "accuracy_mean": 0.0, + "accuracy_std": 0.0, + "ci_lower": 0.0, + "ci_upper": 0.7935 + } + }, + "cases": [ + { + "task_id": "gui-001", + "dimension": "gui_integration", + "category": "service_startup", + "difficulty": "easy", + "passed": true, + "expected": "started", + "actual": "started", + "duration_ms": 0.0, + "root_cause": "none", + "detail": "port=64767 pid=20993", + "consistency": 1.0 + }, + { + "task_id": "gui-002", + "dimension": "gui_integration", + "category": "api_availability", + "difficulty": "medium", + "passed": true, + "expected": "200", + "actual": "200", + "duration_ms": 0.0, + "root_cause": "none", + "detail": "health=200 skills=200", + "consistency": 1.0 + }, + { + "task_id": "gui-003", + "dimension": "gui_integration", + "category": "api_availability", + "difficulty": "medium", + "passed": true, + "expected": "reachable", + "actual": "reachable", + "duration_ms": 0.0, + "root_cause": "none", + "detail": "status=405", + "consistency": 1.0 + }, + { + "task_id": "gui-004", + "dimension": "gui_integration", + "category": "websocket", + "difficulty": "hard", + "passed": false, + "expected": "connected", + "actual": "failed", + "duration_ms": 0.0, + "root_cause": "gui_failure", + 
"detail": "error: server rejected WebSocket connection: HTTP 403",
+ "consistency": 1.0
+ },
+ {
+ "task_id": "gui-005",
+ "dimension": "gui_integration",
+ "category": "frontend",
+ "difficulty": "easy",
+ "passed": true,
+ "expected": "html",
+ "actual": "html",
+ "duration_ms": 0.0,
+ "root_cause": "none",
+ "detail": "status=200 len=465",
+ "consistency": 1.0
+ }
+ ]
 }
 },
 "baseline_comparison": {
@@ -1563,6 +1999,18 @@
 "current_accuracy": 1.0,
 "change": 0.0,
 "direction": "—"
+ },
+ "llm_reasoning": {
+ "baseline_accuracy": 0.0,
+ "current_accuracy": 0.6,
+ "change": 0.6,
+ "direction": "↑"
+ },
+ "gui_integration": {
+ "baseline_accuracy": 0.0,
+ "current_accuracy": 0.8,
+ "change": 0.8,
+ "direction": "↑"
 }
 }
 }
diff --git a/test-results/benchmark/benchmark_report.md b/test-results/benchmark/benchmark_report.md
index 87c6399..a8dde39 100644
--- a/test-results/benchmark/benchmark_report.md
+++ b/test-results/benchmark/benchmark_report.md
@@ -1,10 +1,11 @@
 # AgentKit 能力基准测试报告
 
 ## 测试概要
-- 时间: 2026-06-17T04:00:50.738066+00:00
+- 时间: 2026-06-17T04:52:53.863927+00:00
 - 版本: 0.1.0
-- 运行次数: 3
-- 总体准确率: 100.0% ± 0.0%
+- 模式: all
+- 运行次数: 1
+- 总体准确率: 95.2% ± 0.0%
 
 ## 与行业 Benchmark 对比
 
@@ -16,7 +17,7 @@
 
 ## 维度结果
 
-### 1. 预处理准确度 (Preprocessing Accuracy)
+### 1. 预处理准确度 (Preprocessing Accuracy) [Mock]
 
 | 指标 | 值 |
 |---|---|
@@ -26,8 +27,8 @@
 | Recall | 100.0% |
 | F1 | 100.0% |
 | Latency p50 | 0.01ms |
-| Latency p95 | 0.03ms |
-| Latency p99 | 0.06ms |
+| Latency p95 | 0.06ms |
+| Latency p99 | 0.11ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 15 / 15 / 0 |
 
@@ -48,7 +49,7 @@
 | medium | 7 | 7 | 100.0% |
 | hard | 3 | 3 | 100.0% |
 
-### 2. 过拟合检测 (Overfitting Detection)
+### 2. 过拟合检测 (Overfitting Detection) [Mock]
 
 | 指标 | 值 |
 |---|---|
@@ -57,9 +58,9 @@
 | Precision | 100.0% |
 | Recall | 100.0% |
 | F1 | 100.0% |
-| Latency p50 | 0.04ms |
+| Latency p50 | 0.03ms |
 | Latency p95 | 0.06ms |
-| Latency p99 | 0.07ms |
+| Latency p99 | 0.06ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 5 / 0 |
 
@@ -81,7 +82,7 @@
 | easy | 1 | 1 | 100.0% |
 | hard | 1 | 1 | 100.0% |
 
-### 3. 效率测试 (Efficiency)
+### 3. 效率测试 (Efficiency) [Mock]
 
 | 指标 | 值 |
 |---|---|
@@ -90,9 +91,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 0.40ms |
-| Latency p95 | 0.77ms |
-| Latency p99 | 0.82ms |
+| Latency p50 | 0.33ms |
+| Latency p95 | 0.62ms |
+| Latency p99 | 0.66ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 5 / 0 |
 
@@ -110,7 +111,7 @@
 | easy | 2 | 2 | 100.0% |
 | medium | 3 | 3 | 100.0% |
 
-### 4. 工具搜索 (Tool Search)
+### 4. 工具搜索 (Tool Search) [Mock]
 
 | 指标 | 值 |
 |---|---|
@@ -119,9 +120,9 @@
 | Precision | 83.3% |
 | Recall | 83.3% |
 | F1 | 83.3% |
-| Latency p50 | 0.01ms |
-| Latency p95 | 0.02ms |
-| Latency p99 | 0.02ms |
+| Latency p50 | 0.02ms |
+| Latency p95 | 0.03ms |
+| Latency p99 | 0.03ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 10 / 10 / 0 |
 
@@ -141,7 +142,7 @@
 | easy | 7 | 7 | 100.0% |
 | medium | 3 | 3 | 100.0% |
 
-### 5. 事件模型 (Event Model)
+### 5. 事件模型 (Event Model) [Mock]
 
 | 指标 | 值 |
 |---|---|
@@ -150,9 +151,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 0.04ms |
-| Latency p95 | 15.68ms |
-| Latency p99 | 19.84ms |
+| Latency p50 | 0.06ms |
+| Latency p95 | 16.00ms |
+| Latency p99 | 20.24ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 6 / 6 / 0 |
 
@@ -169,7 +170,7 @@
 |---|---|---|---|
 | easy | 6 | 6 | 100.0% |
 
-### 6. 规格管理 (Spec Management)
+### 6. 规格管理 (Spec Management) [Mock]
 
 | 指标 | 值 |
 |---|---|
@@ -178,9 +179,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 1.41ms |
-| Latency p95 | 3.60ms |
-| Latency p99 | 4.04ms |
+| Latency p50 | 1.38ms |
+| Latency p95 | 3.46ms |
+| Latency p99 | 4.01ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 7 / 7 / 0 |
 
@@ -198,7 +199,7 @@
 | easy | 6 | 6 | 100.0% |
 | medium | 1 | 1 | 100.0% |
 
-### 7. 验证循环 (Verification Loop)
+### 7. 验证循环 (Verification Loop) [Mock]
 
 | 指标 | 值 |
 |---|---|
@@ -207,9 +208,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 25.44ms |
-| Latency p95 | 413.42ms |
-| Latency p99 | 488.32ms |
+| Latency p50 | 22.00ms |
+| Latency p95 | 411.57ms |
+| Latency p99 | 487.06ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 5 / 0 |
 
@@ -229,6 +230,84 @@
 | easy | 2 | 2 | 100.0% |
 | medium | 3 | 3 | 100.0% |
 
+### 8. LLM 推理能力 (LLM Reasoning) [LLM]
+
+| 指标 | 值 |
+|---|---|
+| Accuracy | 60.0% ± 0.0% |
+| 95% CI | [23.1%, 88.2%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 25149.49ms |
+| Latency p95 | 30001.17ms |
+| Latency p99 | 30001.23ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 3 / 2 |
+
+#### 按类别分布
+
+| 类别 | 用例数 | 通过 | 准确率 |
+|---|---|---|---|
+| intent_understanding | 1 | 1 | 100.0% |
+| tool_selection | 1 | 1 | 100.0% |
+| multi_step | 1 | 0 | 0.0% |
+| code_generation | 1 | 1 | 100.0% |
+| error_recovery | 1 | 0 | 0.0% |
+
+#### 按难度分布
+
+| 难度 | 用例数 | 通过 | 准确率 |
+|---|---|---|---|
+| easy | 1 | 1 | 100.0% |
+| medium | 2 | 2 | 100.0% |
+| hard | 2 | 0 | 0.0% |
+
+#### 失败用例分析
+
+| 用例 ID | 类别 | 难度 | 期望 | 实际 | 根因 |
+|---|---|---|---|---|---|
+| llm-003 | multi_step | hard | react | timeout | timeout |
+| llm-005 | error_recovery | hard | react | timeout | timeout |
+
+### 9. GUI 集成测试 (GUI Integration) [GUI]
+
+| 指标 | 值 |
+|---|---|
+| Accuracy | 80.0% ± 0.0% |
+| 95% CI | [37.5%, 96.4%] |
+| Precision | 80.0% |
+| Recall | 80.0% |
+| F1 | 80.0% |
+| Latency p50 | 0.00ms |
+| Latency p95 | 0.00ms |
+| Latency p99 | 0.00ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 4 / 1 |
+
+#### 按类别分布
+
+| 类别 | 用例数 | 通过 | 准确率 |
+|---|---|---|---|
+| service_startup | 1 | 1 | 100.0% |
+| api_availability | 2 | 2 | 100.0% |
+| websocket | 1 | 0 | 0.0% |
+| frontend | 1 | 1 | 100.0% |
+
+#### 按难度分布
+
+| 难度 | 用例数 | 通过 | 准确率 |
+|---|---|---|---|
+| easy | 2 | 2 | 100.0% |
+| medium | 2 | 2 | 100.0% |
+| hard | 1 | 0 | 0.0% |
+
+#### 失败用例分析
+
+| 用例 ID | 类别 | 难度 | 期望 | 实际 | 根因 |
+|---|---|---|---|---|---|
+| gui-004 | websocket | hard | connected | failed | gui_failure |
+
 ## 基线对比
 
 | 维度 | 基线准确率 | 当前准确率 | 变化 |
@@ -240,7 +319,12 @@
 | event_model | 100.0% | 100.0% | — |
 | spec_management | 100.0% | 100.0% | — |
 | verification | 100.0% | 100.0% | — |
+| llm_reasoning | 0.0% | 60.0% | ↑ |
+| gui_integration | 0.0% | 80.0% | ↑ |
 
 ## 问题总结与改进建议
 
-- **verification**: P95 延迟 413.42ms 较高,建议优化性能
+- **verification**: P95 延迟 411.57ms 较高,建议优化性能
+- **llm_reasoning**: 准确率 60.0% 低于 90%,建议检查失败用例并优化
+- **llm_reasoning**: P95 延迟 30001.17ms 较高,建议优化性能
+- **gui_integration**: 准确率 80.0% 低于 90%,建议检查失败用例并优化
diff --git a/test-results/benchmark/benchmark_report.txt b/test-results/benchmark/benchmark_report.txt
index 7b8c1f0..53131f4 100644
--- a/test-results/benchmark/benchmark_report.txt
+++ b/test-results/benchmark/benchmark_report.txt
@@ -1,7 +1,7 @@
 ======================================================================
 AgentKit Benchmark Report
 ======================================================================
-Timestamp: 2026-06-17T03:26:25.072956+00:00
+Timestamp: 2026-06-17T03:31:00.118497+00:00
 Version: 0.1.0
 Overall Score: 98.0%
 Summary: 50/51 tests passed (1 failed) across 7 dimensions.
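Review note on the `ci_lower` / `ci_upper` fields and the "95% CI" rows in the reports above: the values appear consistent with a Wilson score interval at z = 1.96 computed from `passed` and `total`. The sketch below is only an illustration for checking the numbers by hand. The function name `wilson_interval` and the fixed z value are assumptions; the benchmark's actual implementation is not part of this diff.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (hypothetical helper, not AgentKit code).

    z = 1.96 is assumed here; it corresponds to roughly 95% coverage.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    z2 = z * z
    denom = 1.0 + z2 / total
    # Centre of the interval is pulled toward 0.5 for small samples.
    center = (p + z2 / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z2 / (4.0 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Pass counts taken from the llm_reasoning and gui_integration metrics in this diff.
for name, passed, total in [("llm_reasoning", 3, 5), ("gui_integration", 4, 5)]:
    lo, hi = wilson_interval(passed, total)
    print(f"{name}: [{lo:.4f}, {hi:.4f}]")
```

With only five cases per dimension and a single run, the intervals are wide (for example 3/5 spans about 23% to 88%), so the ↑ entries against a 0.0% baseline for the two new dimensions say little about a real trend.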