# AgentKit Capability Benchmark Report

## Test Summary

- Time: 2026-06-17T04:52:53.863927+00:00
- Version: 0.1.0
- Mode: all
- Runs: 1
- Overall accuracy: 95.2% ± 0.0%

## Comparison with Industry Benchmarks

| Benchmark | What it tests | AgentKit counterpart |
|---|---|---|
| SWE-bench | LLM code repair | None (tests the LLM, not the framework) |
| ToolBench | Tool calling | `tool_search` dimension |
| AgentBench | Agent systems | All dimensions |

## Results by Dimension

### 1. Preprocessing Accuracy [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [79.6%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.06 ms |
| Latency p99 | 0.11 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 15 / 15 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| greeting | 4 | 4 | 100.0% |
| tool_query | 5 | 5 | 100.0% |
| skill_prefix | 3 | 3 | 100.0% |
| complex | 3 | 3 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 5 | 5 | 100.0% |
| medium | 7 | 7 | 100.0% |
| hard | 3 | 3 | 100.0% |

### 2. Overfitting Detection [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.03 ms |
| Latency p95 | 0.06 ms |
| Latency p99 | 0.06 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| ip_check | 1 | 1 | 100.0% |
| search | 1 | 1 | 100.0% |
| greeting | 1 | 1 | 100.0% |
| tool_use | 1 | 1 | 100.0% |
| complex | 1 | 1 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| medium | 3 | 3 | 100.0% |
| easy | 1 | 1 | 100.0% |
| hard | 1 | 1 | 100.0% |

### 3. Efficiency [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.33 ms |
| Latency p95 | 0.62 ms |
| Latency p99 | 0.66 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| preprocess_latency | 3 | 3 | 100.0% |
| tool_search_latency | 2 | 2 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |

### 4. Tool Search [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [72.2%, 100.0%] |
| Precision | 83.3% |
| Recall | 83.3% |
| F1 | 83.3% |
| Latency p50 | 0.02 ms |
| Latency p95 | 0.03 ms |
| Latency p99 | 0.03 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 10 / 10 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| exact_match | 5 | 5 | 100.0% |
| fuzzy_match | 2 | 2 | 100.0% |
| no_match | 2 | 2 | 100.0% |
| top_k | 1 | 1 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 7 | 7 | 100.0% |
| medium | 3 | 3 | 100.0% |

### 5. Event Model [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [61.0%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.06 ms |
| Latency p95 | 16.00 ms |
| Latency p99 | 20.24 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 6 / 6 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| sq_lifecycle | 3 | 3 | 100.0% |
| eq_lifecycle | 3 | 3 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 6 | 6 | 100.0% |

### 6. Spec Management [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [64.6%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 1.38 ms |
| Latency p95 | 3.46 ms |
| Latency p99 | 4.01 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 7 / 7 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| crud | 5 | 5 | 100.0% |
| edge | 2 | 2 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 6 | 6 | 100.0% |
| medium | 1 | 1 | 100.0% |

### 7. Verification Loop [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 22.00 ms |
| Latency p95 | 411.57 ms |
| Latency p99 | 487.06 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| basic | 2 | 2 | 100.0% |
| retry | 1 | 1 | 100.0% |
| timeout | 1 | 1 | 100.0% |
| multi | 1 | 1 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |

### 8. LLM Reasoning [LLM]

| Metric | Value |
|---|---|
| Accuracy | 60.0% ± 0.0% |
| 95% CI | [23.1%, 88.2%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 25149.49 ms |
| Latency p95 | 30001.17 ms |
| Latency p99 | 30001.23 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 3 / 2 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| intent_understanding | 1 | 1 | 100.0% |
| tool_selection | 1 | 1 | 100.0% |
| multi_step | 1 | 0 | 0.0% |
| code_generation | 1 | 1 | 100.0% |
| error_recovery | 1 | 0 | 0.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 1 | 1 | 100.0% |
| medium | 2 | 2 | 100.0% |
| hard | 2 | 0 | 0.0% |

#### Failed Case Analysis

| Case ID | Category | Difficulty | Expected | Actual | Root cause |
|---|---|---|---|---|---|
| llm-003 | multi_step | hard | react | timeout | timeout |
| llm-005 | error_recovery | hard | react | timeout | timeout |

### 9. GUI Integration [GUI]

| Metric | Value |
|---|---|
| Accuracy | 80.0% ± 0.0% |
| 95% CI | [37.5%, 96.4%] |
| Precision | 80.0% |
| Recall | 80.0% |
| F1 | 80.0% |
| Latency p50 | 0.00 ms |
| Latency p95 | 0.00 ms |
| Latency p99 | 0.00 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 4 / 1 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| service_startup | 1 | 1 | 100.0% |
| api_availability | 2 | 2 | 100.0% |
| websocket | 1 | 0 | 0.0% |
| frontend | 1 | 1 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 2 | 2 | 100.0% |
| medium | 2 | 2 | 100.0% |
| hard | 1 | 0 | 0.0% |

#### Failed Case Analysis

| Case ID | Category | Difficulty | Expected | Actual | Root cause |
|---|---|---|---|---|---|
| gui-004 | websocket | hard | connected | failed | gui_failure |

## Baseline Comparison

| Dimension | Baseline accuracy | Current accuracy | Change |
|---|---|---|---|
| preprocessing | 100.0% | 100.0% | no change |
| overfitting | 100.0% | 100.0% | no change |
| efficiency | 100.0% | 100.0% | no change |
| tool_search | 100.0% | 100.0% | no change |
| event_model | 100.0% | 100.0% | no change |
| spec_management | 100.0% | 100.0% | no change |
| verification | 100.0% | 100.0% | no change |
| llm_reasoning | 0.0% | 60.0% | ↑ |
| gui_integration | 0.0% | 80.0% | ↑ |

## Issues and Recommendations

- **verification**: P95 latency of 411.57 ms is high; consider optimizing performance.
- **llm_reasoning**: Accuracy of 60.0% is below 90%; review the failed cases (llm-003 and llm-005, both timeouts) and optimize.
- **llm_reasoning**: P95 latency of 30001.17 ms is high; consider optimizing performance.
- **gui_integration**: Accuracy of 80.0% is below 90%; review the failed case (gui-004, websocket connection failed) and optimize.
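
The recommendations above read as if they come from two simple rules: flag a dimension when its accuracy is below 90%, and flag it when its P95 latency is high. The following Python sketch shows one way such rules could be written. It is not AgentKit's actual implementation: the names `DimensionResult` and `recommend` are hypothetical, and the 100 ms latency cutoff is an assumption (the report only shows that 411.57 ms was flagged and 16.00 ms was not).

```python
from dataclasses import dataclass

ACCURACY_FLOOR = 90.0          # percent; from the "below 90%" wording in the report
P95_LATENCY_LIMIT_MS = 100.0   # hypothetical cutoff; the report does not state one


@dataclass
class DimensionResult:
    name: str
    accuracy: float            # percent
    p95_latency_ms: float


def recommend(results):
    """Return one message per triggered rule, in input order."""
    messages = []
    for r in results:
        if r.accuracy < ACCURACY_FLOOR:
            messages.append(
                f"{r.name}: accuracy {r.accuracy:.1f}% is below {ACCURACY_FLOOR:.0f}%"
            )
        if r.p95_latency_ms > P95_LATENCY_LIMIT_MS:
            messages.append(
                f"{r.name}: P95 latency {r.p95_latency_ms:.2f} ms is high"
            )
    return messages


# Accuracy and P95 latency values copied from the tables in this report.
RESULTS = [
    DimensionResult("preprocessing", 100.0, 0.06),
    DimensionResult("overfitting", 100.0, 0.06),
    DimensionResult("efficiency", 100.0, 0.62),
    DimensionResult("tool_search", 100.0, 0.03),
    DimensionResult("event_model", 100.0, 16.00),
    DimensionResult("spec_management", 100.0, 3.46),
    DimensionResult("verification", 100.0, 411.57),
    DimensionResult("llm_reasoning", 60.0, 30001.17),
    DimensionResult("gui_integration", 80.0, 0.00),
]

for message in recommend(RESULTS):
    print(message)
```

With the assumed cutoff, the sketch flags the same four items listed above, in the same order.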
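
The report does not say how the 95% CI rows are computed. The Wilson score interval reproduces several of the reported bounds (15 of 15 passes gives a lower bound of about 79.6%, and 3 of 5 gives roughly [23.1%, 88.2%]), so the sketch below assumes that method. The function name `wilson_interval` and the choice of z = 1.96 are illustrative assumptions, not taken from AgentKit.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a pass rate; returns (lower, upper) as fractions."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Pass counts taken from the "Total / Pass / Fail" rows above.
for label, passed, total in [("preprocessing", 15, 15), ("llm_reasoning", 3, 5)]:
    low, high = wilson_interval(passed, total)
    print(f"{label}: [{low * 100:.1f}%, {high * 100:.1f}%]")
```

With only 5 to 15 cases per dimension the intervals are wide even at 100% accuracy, which is worth keeping in mind when reading the per-dimension scores.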