AgentKit Capability Benchmark Report

Test Summary
- Time: 2026-06-17T04:52:53.863927+00:00
- Version: 0.1.0
- Mode: all
- Runs: 1
- Overall accuracy: 95.2% ± 0.0%
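
The report does not state how the overall accuracy is aggregated. The 95.2% figure is consistent with pooling all cases across dimensions (60 passed out of 63) rather than averaging the nine per-dimension accuracies. The sketch below assumes that pooling; the names `RESULTS` and `overall_accuracy` are illustrative and not part of AgentKit.

```python
# (passed, total) per dimension, copied from the "Total / Pass / Fail" rows of this report.
RESULTS = {
    "preprocessing": (15, 15),
    "overfitting": (5, 5),
    "efficiency": (5, 5),
    "tool_search": (10, 10),
    "event_model": (6, 6),
    "spec_management": (7, 7),
    "verification": (5, 5),
    "llm_reasoning": (3, 5),
    "gui_integration": (4, 5),
}


def overall_accuracy(results):
    """Pooled (micro-averaged) accuracy: all passed cases over all cases."""
    passed = sum(p for p, _ in results.values())
    total = sum(t for _, t in results.values())
    return passed / total


print(f"{overall_accuracy(RESULTS):.1%}")  # 95.2%
```

With a single run there is no run-to-run spread, so the ± 0.0% term carries no information about variability.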
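
Comparison with Industry Benchmarks

| Benchmark | What it tests | AgentKit counterpart |
| --- | --- | --- |
| SWE-bench | LLM code repair | None (it tests the LLM, not the framework) |
| ToolBench | Tool calling | tool_search dimension |
| AgentBench | Agent systems | All dimensions |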

Results by Dimension

1. Preprocessing Accuracy [Mock]

| Metric | Value |
| --- | --- |
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [79.6%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.06 ms |
| Latency p99 | 0.11 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 15 / 15 / 0 |

Breakdown by category

| Category | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| greeting | 4 | 4 | 100.0% |
| tool_query | 5 | 5 | 100.0% |
| skill_prefix | 3 | 3 | 100.0% |
| complex | 3 | 3 | 100.0% |

Breakdown by difficulty

| Difficulty | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| easy | 5 | 5 | 100.0% |
| medium | 7 | 7 | 100.0% |
| hard | 3 | 3 | 100.0% |
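
The report does not say how the 95% CI rows are computed. The reported bounds agree, to within about 0.1 percentage point, with a Wilson score interval on the pass count using z = 1.96. For 15 passes out of 15 this gives a lower bound of about 79.6%. The sketch below assumes that method; `wilson_interval` is an illustrative name, not an AgentKit function.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a binomial proportion (assumed method)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


# Pass counts taken from this report.
for name, passed, total in [
    ("preprocessing", 15, 15),
    ("llm_reasoning", 3, 5),
    ("gui_integration", 4, 5),
]:
    low, high = wilson_interval(passed, total)
    print(f"{name}: [{low * 100:.2f}%, {high * 100:.2f}%]")
```

Even with a perfect pass rate, a dimension with only 5 cases has a lower bound near 56.5%, so the small suites here cannot support strong claims on their own.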

2. Overfitting Detection [Mock]

| Metric | Value |
| --- | --- |
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.03 ms |
| Latency p95 | 0.06 ms |
| Latency p99 | 0.06 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

Breakdown by category

| Category | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| ip_check | 1 | 1 | 100.0% |
| search | 1 | 1 | 100.0% |
| greeting | 1 | 1 | 100.0% |
| tool_use | 1 | 1 | 100.0% |
| complex | 1 | 1 | 100.0% |

Breakdown by difficulty

| Difficulty | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| medium | 3 | 3 | 100.0% |
| easy | 1 | 1 | 100.0% |
| hard | 1 | 1 | 100.0% |

3. Efficiency [Mock]

| Metric | Value |
| --- | --- |
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.33 ms |
| Latency p95 | 0.62 ms |
| Latency p99 | 0.66 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |
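
Breakdown by category

| Category | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| preprocess_latency | 3 | 3 | 100.0% |
| tool_search_latency | 2 | 2 | 100.0% |

Breakdown by difficulty

| Difficulty | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |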
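
The report does not document how the p50, p95, and p99 latency rows are derived. Several 5-case dimensions report a p99 that differs from the p95, which suggests interpolation between samples rather than a nearest-rank rule, but that is an inference. The sketch below shows linear-interpolation percentiles under that assumption; `percentile` is an illustrative helper and the sample values are made up, not measurements from this run.

```python
import math


def percentile(samples, q):
    """q-th percentile (0-100) using linear interpolation between ranks."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical latencies in milliseconds for a 5-case suite.
latencies_ms = [0.20, 0.31, 0.33, 0.42, 0.67]
for q in (50, 95, 99):
    print(f"p{q}: {percentile(latencies_ms, q):.2f} ms")
```

With only 5 samples, p95 and p99 both fall between the two slowest cases, so a single slow case dominates the tail figures.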

4. Tool Search [Mock]

| Metric | Value |
| --- | --- |
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [72.2%, 100.0%] |
| Precision | 83.3% |
| Recall | 83.3% |
| F1 | 83.3% |
| Latency p50 | 0.02 ms |
| Latency p95 | 0.03 ms |
| Latency p99 | 0.03 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 10 / 10 / 0 |
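
Breakdown by category

| Category | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| exact_match | 5 | 5 | 100.0% |
| fuzzy_match | 2 | 2 | 100.0% |
| no_match | 2 | 2 | 100.0% |
| top_k | 1 | 1 | 100.0% |

Breakdown by difficulty

| Difficulty | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| easy | 7 | 7 | 100.0% |
| medium | 3 | 3 | 100.0% |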

5. Event Model [Mock]

| Metric | Value |
| --- | --- |
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [61.0%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.06 ms |
| Latency p95 | 16.00 ms |
| Latency p99 | 20.24 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 6 / 6 / 0 |
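
Breakdown by category

| Category | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| sq_lifecycle | 3 | 3 | 100.0% |
| eq_lifecycle | 3 | 3 | 100.0% |

Breakdown by difficulty

| Difficulty | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| easy | 6 | 6 | 100.0% |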

6. Spec Management [Mock]

| Metric | Value |
| --- | --- |
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [64.6%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 1.38 ms |
| Latency p95 | 3.46 ms |
| Latency p99 | 4.01 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 7 / 7 / 0 |
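
Breakdown by category

| Category | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| crud | 5 | 5 | 100.0% |
| edge | 2 | 2 | 100.0% |

Breakdown by difficulty

| Difficulty | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| easy | 6 | 6 | 100.0% |
| medium | 1 | 1 | 100.0% |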

7. Verification Loop [Mock]

| Metric | Value |
| --- | --- |
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 22.00 ms |
| Latency p95 | 411.57 ms |
| Latency p99 | 487.06 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |
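
Breakdown by category

| Category | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| basic | 2 | 2 | 100.0% |
| retry | 1 | 1 | 100.0% |
| timeout | 1 | 1 | 100.0% |
| multi | 1 | 1 | 100.0% |

Breakdown by difficulty

| Difficulty | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |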

8. LLM Reasoning [LLM]

| Metric | Value |
| --- | --- |
| Accuracy | 60.0% ± 0.0% |
| 95% CI | [23.1%, 88.2%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 25149.49 ms |
| Latency p95 | 30001.17 ms |
| Latency p99 | 30001.23 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 3 / 2 |

Breakdown by category

| Category | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| intent_understanding | 1 | 1 | 100.0% |
| tool_selection | 1 | 1 | 100.0% |
| multi_step | 1 | 0 | 0.0% |
| code_generation | 1 | 1 | 100.0% |
| error_recovery | 1 | 0 | 0.0% |

Breakdown by difficulty

| Difficulty | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| easy | 1 | 1 | 100.0% |
| medium | 2 | 2 | 100.0% |
| hard | 2 | 0 | 0.0% |

Failed case analysis

| Case ID | Category | Difficulty | Expected | Actual | Root cause |
| --- | --- | --- | --- | --- | --- |
| llm-003 | multi_step | hard | react | timeout | timeout |
| llm-005 | error_recovery | hard | react | timeout | timeout |

9. GUI Integration [GUI]

| Metric | Value |
| --- | --- |
| Accuracy | 80.0% ± 0.0% |
| 95% CI | [37.5%, 96.4%] |
| Precision | 80.0% |
| Recall | 80.0% |
| F1 | 80.0% |
| Latency p50 | 0.00 ms |
| Latency p95 | 0.00 ms |
| Latency p99 | 0.00 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 4 / 1 |

Breakdown by category

| Category | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| service_startup | 1 | 1 | 100.0% |
| api_availability | 2 | 2 | 100.0% |
| websocket | 1 | 0 | 0.0% |
| frontend | 1 | 1 | 100.0% |

Breakdown by difficulty

| Difficulty | Cases | Passed | Accuracy |
| --- | --- | --- | --- |
| easy | 2 | 2 | 100.0% |
| medium | 2 | 2 | 100.0% |
| hard | 1 | 0 | 0.0% |

Failed case analysis

| Case ID | Category | Difficulty | Expected | Actual | Root cause |
| --- | --- | --- | --- | --- | --- |
| gui-004 | websocket | hard | connected | failed | gui_failure |

Baseline Comparison

| Dimension | Baseline accuracy | Current accuracy | Change |
| --- | --- | --- | --- |
| preprocessing | 100.0% | 100.0% | unchanged |
| overfitting | 100.0% | 100.0% | unchanged |
| efficiency | 100.0% | 100.0% | unchanged |
| tool_search | 100.0% | 100.0% | unchanged |
| event_model | 100.0% | 100.0% | unchanged |
| spec_management | 100.0% | 100.0% | unchanged |
| verification | 100.0% | 100.0% | unchanged |
| llm_reasoning | 0.0% | 60.0% | ↑ |
| gui_integration | 0.0% | 80.0% | ↑ |
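
The change column can be reproduced by comparing each dimension's current accuracy with its baseline. The sketch below is one way to do that; `trend`, its tolerance, and the "unchanged" label for rows without an arrow are assumptions for illustration, not taken from AgentKit.

```python
def trend(baseline, current, tolerance=1e-9):
    """Return the direction of change between two accuracies (in percent)."""
    delta = current - baseline
    if abs(delta) <= tolerance:
        return "unchanged"
    return "↑" if delta > 0 else "↓"


# (baseline, current) accuracy in percent, copied from the table above.
COMPARISON = {
    "verification": (100.0, 100.0),
    "llm_reasoning": (0.0, 60.0),
    "gui_integration": (0.0, 80.0),
}

for name, (baseline, current) in COMPARISON.items():
    print(f"{name}: {trend(baseline, current)} ({current - baseline:+.1f} points)")
```

Reporting the signed difference in percentage points next to the arrow makes the size of a regression or improvement visible, which the arrow alone does not.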
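
Issues and Recommendations
- verification: P95 latency is high at 411.57 ms; performance optimization is recommended.
- llm_reasoning: accuracy is 60.0%, below the 90% threshold; review the failed cases (llm-003 and llm-005, both timeouts) and optimize.
- llm_reasoning: P95 latency is high at 30001.17 ms; performance optimization is recommended.
- gui_integration: accuracy is 80.0%, below the 90% threshold; review the failed case (gui-004, websocket) and optimize.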
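
The items above look rule-generated: an accuracy check against 90% and a P95 latency check. The 90% threshold comes from the report itself. The latency threshold is not stated; the report flags 411.57 ms but not 16.00 ms, so the 100 ms used below is only an assumption that fits those two data points. `recommendations` and the constant names are illustrative, not AgentKit APIs.

```python
ACCURACY_FLOOR = 90.0    # percent, stated in the report
P95_CEILING_MS = 100.0   # ASSUMPTION: the real threshold is not documented

# (accuracy in percent, P95 latency in ms), copied from this report.
REPORT_METRICS = {
    "preprocessing": (100.0, 0.06),
    "overfitting": (100.0, 0.06),
    "efficiency": (100.0, 0.62),
    "tool_search": (100.0, 0.03),
    "event_model": (100.0, 16.00),
    "spec_management": (100.0, 3.46),
    "verification": (100.0, 411.57),
    "llm_reasoning": (60.0, 30001.17),
    "gui_integration": (80.0, 0.00),
}


def recommendations(metrics):
    """Flag dimensions with low accuracy or high P95 latency."""
    notes = []
    for name, (accuracy, p95_ms) in metrics.items():
        if accuracy < ACCURACY_FLOOR:
            notes.append(f"{name}: accuracy {accuracy:.1f}% is below {ACCURACY_FLOOR:.0f}%")
        if p95_ms > P95_CEILING_MS:
            notes.append(f"{name}: P95 latency {p95_ms:.2f} ms is high")
    return notes


for note in recommendations(REPORT_METRICS):
    print(note)
```

Under these assumptions the sketch yields the same four findings in the same order as the list above.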