AgentKit Capability Benchmark Report

Test Summary
- Time: 2026-06-17T15:47:33.591101+00:00
- Version: 0.1.0
- Mode: mock
- Runs: 1
- Overall accuracy: 100.0% ± 0.0%
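
Comparison with Industry Benchmarks

| Benchmark | What It Tests | AgentKit Counterpart |
|---|---|---|
| SWE-bench | LLM code repair | None (tests the LLM, not the framework) |
| ToolBench | Tool calling | tool_search dimension |
| AgentBench | Agent systems | All dimensions |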

Results by Dimension

1. Preprocessing Accuracy [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [79.6%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.07 ms |
| Latency p99 | 0.11 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 15 / 15 / 0 |

Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| greeting | 4 | 4 | 100.0% |
| tool_query | 5 | 5 | 100.0% |
| skill_prefix | 3 | 3 | 100.0% |
| complex | 3 | 3 | 100.0% |

Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 5 | 5 | 100.0% |
| medium | 7 | 7 | 100.0% |
| hard | 3 | 3 | 100.0% |
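
The 95% CI rows in this report match what a Wilson score interval gives for an all-pass result: with 15 of 15 cases passing and z = 1.96, the lower bound is 15 / (15 + 1.96²) ≈ 79.6%. The sketch below assumes that this is the method behind the report (the source does not say so), and `wilson_interval` is a hypothetical helper name, not part of AgentKit.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Case counts taken from the "Total / Pass / Fail" rows of this report.
for n in (15, 5, 10, 6, 7, 18):
    low, high = wilson_interval(n, n)
    print(f"n={n:2d}: [{low:.1%}, {high:.1%}]")
```

Under this assumption, the wide intervals for the 5-case dimensions reflect the small sample size rather than any observed failure: a perfect score on 5 cases only bounds the true pass rate above roughly 56.5%.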

2. Overfitting Detection [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.03 ms |
| Latency p99 | 0.03 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| ip_check | 1 | 1 | 100.0% |
| search | 1 | 1 | 100.0% |
| greeting | 1 | 1 | 100.0% |
| tool_use | 1 | 1 | 100.0% |
| complex | 1 | 1 | 100.0% |

Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| medium | 3 | 3 | 100.0% |
| easy | 1 | 1 | 100.0% |
| hard | 1 | 1 | 100.0% |
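
The category and difficulty tables are group-by summaries over per-case outcomes. As a rough sketch of that aggregation, assume each case records a category, a difficulty, and a pass flag. The `CaseResult` record and `breakdown` helper are hypothetical and not AgentKit's API. The category names and counts come from the five Overfitting Detection cases, but the pairing of category with difficulty is an illustrative guess.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseResult:
    category: str
    difficulty: str
    passed: bool


def breakdown(results, key):
    """Return {group: (cases, passed, accuracy)} grouped by the given attribute."""
    counts = defaultdict(lambda: [0, 0])  # group -> [cases, passed]
    for r in results:
        group = getattr(r, key)
        counts[group][0] += 1
        counts[group][1] += int(r.passed)
    return {g: (n, ok, ok / n) for g, (n, ok) in counts.items()}


# Hypothetical pairing of category and difficulty (not stated in the report).
results = [
    CaseResult("ip_check", "medium", True),
    CaseResult("search", "medium", True),
    CaseResult("greeting", "easy", True),
    CaseResult("tool_use", "medium", True),
    CaseResult("complex", "hard", True),
]

for group, (n, ok, acc) in breakdown(results, "difficulty").items():
    print(f"| {group} | {n} | {ok} | {acc:.1%} |")
```

Because dictionaries preserve insertion order, groups are listed in the order they are first seen. That would also explain a table listing `medium` before `easy`, though the source does not confirm how rows are ordered.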

3. Efficiency [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.33 ms |
| Latency p95 | 0.64 ms |
| Latency p99 | 0.67 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| preprocess_latency | 3 | 3 | 100.0% |
| tool_search_latency | 2 | 2 | 100.0% |

Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |

4. Tool Search [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [72.2%, 100.0%] |
| Precision | 83.3% |
| Recall | 83.3% |
| F1 | 83.3% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.02 ms |
| Latency p99 | 0.02 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 10 / 10 / 0 |

Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| exact_match | 5 | 5 | 100.0% |
| fuzzy_match | 2 | 2 | 100.0% |
| no_match | 2 | 2 | 100.0% |
| top_k | 1 | 1 | 100.0% |

Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 7 | 7 | 100.0% |
| medium | 3 | 3 | 100.0% |
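
Precision, recall, and F1 for a tool-search query can be computed from the set of returned tools and the set of expected tools. The sketch below assumes that set-based definition. The report does not state how its 83.3% figures were computed or aggregated, and the tool names and the `precision_recall_f1` helper are made up for illustration. It also shows why the three values coincide whenever precision equals recall, since F1 is their harmonic mean.

```python
def precision_recall_f1(returned: set[str], expected: set[str]) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1; empty denominators yield 0.0 by convention."""
    hits = len(returned & expected)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical query: 6 tools returned, 6 expected, 5 in common.
returned = {"get_ip", "web_search", "read_file", "write_file", "run_shell", "send_mail"}
expected = {"get_ip", "web_search", "read_file", "write_file", "run_shell", "list_dir"}

p, r, f1 = precision_recall_f1(returned, expected)
print(f"P={p:.1%} R={r:.1%} F1={f1:.1%}")  # P=83.3% R=83.3% F1=83.3%
```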

5. Event Model [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [61.0%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.05 ms |
| Latency p95 | 15.87 ms |
| Latency p99 | 20.08 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 6 / 6 / 0 |

Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| sq_lifecycle | 3 | 3 | 100.0% |
| eq_lifecycle | 3 | 3 | 100.0% |

Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 6 | 6 | 100.0% |
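
With only six cases in this dimension, the tail percentiles are driven almost entirely by the slowest one or two samples, which is how a 0.05 ms median can sit next to a 20.08 ms p99. The sketch below uses linear interpolation between order statistics. The sample latencies are invented for illustration, `percentile` is a hypothetical helper, and the source does not say which percentile method the report uses.

```python
def percentile(samples: list[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between closest ranks."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    pos = (len(ordered) - 1) * q / 100
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    frac = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# Hypothetical latencies in milliseconds: five fast cases and one slow outlier.
latencies_ms = [0.04, 0.04, 0.05, 0.05, 0.06, 20.0]

for q in (50, 95, 99):
    print(f"p{q}: {percentile(latencies_ms, q):.2f} ms")
```

In this invented sample a single outlier lifts p95 and p99 by orders of magnitude while the median stays put, so tail latencies from very small case counts say more about one sample than about a distribution.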

6. Spec Management [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [64.6%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 1.94 ms |
| Latency p95 | 2.94 ms |
| Latency p99 | 3.25 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 7 / 7 / 0 |

Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| crud | 5 | 5 | 100.0% |
| edge | 2 | 2 | 100.0% |

Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 6 | 6 | 100.0% |
| medium | 1 | 1 | 100.0% |

7. Verification Loop [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 22.22 ms |
| Latency p95 | 47.79 ms |
| Latency p99 | 50.93 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| basic | 2 | 2 | 100.0% |
| retry | 1 | 1 | 100.0% |
| timeout | 1 | 1 | 100.0% |
| multi | 1 | 1 | 100.0% |

Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |

8. Board Meeting Routing [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [82.4%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.39 ms |
| Latency p99 | 1.19 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 18 / 18 / 0 |

Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| default_template | 3 | 3 | 100.0% |
| explicit_experts | 3 | 3 | 100.0% |
| topic_extraction | 3 | 3 | 100.0% |
| no_match | 3 | 3 | 100.0% |
| name_validation | 3 | 3 | 100.0% |
| stop_command | 3 | 3 | 100.0% |

Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 11 | 11 | 100.0% |
| medium | 7 | 7 | 100.0% |

Issue Summary and Improvement Recommendations