# AgentKit Capability Benchmark Report

## Test Summary

- Time: 2026-06-17T15:47:33.591101+00:00
- Version: 0.1.0
- Mode: mock
- Runs: 1
- Overall accuracy: 100.0% ± 0.0%

## Comparison with Industry Benchmarks

| Benchmark | What it tests | AgentKit equivalent |
|---|---|---|
| SWE-bench | LLM code repair | None (tests the LLM, not the framework) |
| ToolBench | Tool calling | tool_search dimension |
| AgentBench | Agent systems | All dimensions |

## Results by Dimension

### 1. Preprocessing Accuracy [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [79.6%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.07 ms |
| Latency p99 | 0.11 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 15 / 15 / 0 |

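The report does not state how the 95% CI is computed. The lower bounds in its tables are, however, consistent with a Wilson score interval on the pass count, so the sketch below assumes that method. `wilson_interval` is a hypothetical helper name, not part of AgentKit.

```python
from math import sqrt
from statistics import NormalDist
from typing import Tuple


def wilson_interval(passed: int, total: int, confidence: float = 0.95) -> Tuple[float, float]:
    """Wilson score interval for a pass rate, returned as fractions in [0, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Case counts taken from the dimensions in this report (all cases passed).
for n in (5, 6, 7, 10, 15, 18):
    low, high = wilson_interval(n, n)
    print(f"{n}/{n}: [{low:.1%}, {high:.1%}]")
```

For 15 of 15 passing cases this gives a lower bound of roughly 79.6%, in line with the table above. For the 5-case dimensions it gives about 56.55%, which the report shows as 56.5%, so the harness may round slightly differently. The wide intervals reflect the small case counts, not any observed failure.
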
#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| greeting | 4 | 4 | 100.0% |
| tool_query | 5 | 5 | 100.0% |
| skill_prefix | 3 | 3 | 100.0% |
| complex | 3 | 3 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 5 | 5 | 100.0% |
| medium | 7 | 7 | 100.0% |
| hard | 3 | 3 | 100.0% |

### 2. Overfitting Detection [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.03 ms |
| Latency p99 | 0.03 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| ip_check | 1 | 1 | 100.0% |
| search | 1 | 1 | 100.0% |
| greeting | 1 | 1 | 100.0% |
| tool_use | 1 | 1 | 100.0% |
| complex | 1 | 1 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 1 | 1 | 100.0% |
| medium | 3 | 3 | 100.0% |
| hard | 1 | 1 | 100.0% |

### 3. Efficiency [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.33 ms |
| Latency p95 | 0.64 ms |
| Latency p99 | 0.67 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| preprocess_latency | 3 | 3 | 100.0% |
| tool_search_latency | 2 | 2 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |

### 4. Tool Search [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [72.2%, 100.0%] |
| Precision | 83.3% |
| Recall | 83.3% |
| F1 | 83.3% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.02 ms |
| Latency p99 | 0.02 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 10 / 10 / 0 |

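The report does not say how Precision, Recall, and F1 are derived for tool search, nor why they sit at 83.3% while every case passes. One common convention, assumed here purely for illustration, is set-based scoring of the returned tools against the expected tools. Under it, a case can pass because the expected tool is found while an extra returned tool still lowers precision. The function name and tool names below are hypothetical.

```python
from typing import Iterable, Tuple


def precision_recall_f1(returned: Iterable[str], expected: Iterable[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and F1 for one search case."""
    returned_set, expected_set = set(returned), set(expected)
    hits = len(returned_set & expected_set)
    # Empty denominators are scored as 0.0 (an assumed convention).
    precision = hits / len(returned_set) if returned_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical case: the expected tool is found, plus one irrelevant tool.
p, r, f1 = precision_recall_f1(["web_search", "file_search"], ["web_search"])
print(f"precision={p:.1%} recall={r:.1%} f1={f1:.1%}")
```

Under this convention, a case with no expected and no returned items scores 0.0 on all three metrics. That is one way a dimension could show 0.0% Precision, Recall, and F1 alongside 100% accuracy, although the report does not confirm it.
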
#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| exact_match | 5 | 5 | 100.0% |
| fuzzy_match | 2 | 2 | 100.0% |
| no_match | 2 | 2 | 100.0% |
| top_k | 1 | 1 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 7 | 7 | 100.0% |
| medium | 3 | 3 | 100.0% |

### 5. Event Model [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [61.0%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.05 ms |
| Latency p95 | 15.87 ms |
| Latency p99 | 20.08 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 6 / 6 / 0 |

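The report does not specify its percentile method. As a minimal sketch, assuming linear interpolation between order statistics and one latency sample per case, the following shows how a single slow case among six can leave p50 near the fast cases while pulling p95 and p99 up sharply. The sample values are invented for illustration and are not the measured latencies.

```python
from typing import Sequence


def percentile(samples: Sequence[float], q: float) -> float:
    """q-th percentile (0-100) using linear interpolation between sorted samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = (len(ordered) - 1) * q / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


# Hypothetical latencies in milliseconds: five fast cases and one slow one.
latencies_ms = [0.03, 0.04, 0.05, 0.05, 0.08, 21.0]
for q in (50, 95, 99):
    print(f"p{q}: {percentile(latencies_ms, q):.2f} ms")
```

With so few samples, p95 and p99 are both interpolated between the two largest values, so they mostly describe the slowest case and not a stable tail.
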
#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| sq_lifecycle | 3 | 3 | 100.0% |
| eq_lifecycle | 3 | 3 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 6 | 6 | 100.0% |

### 6. Spec Management [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [64.6%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 1.94 ms |
| Latency p95 | 2.94 ms |
| Latency p99 | 3.25 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 7 / 7 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| crud | 5 | 5 | 100.0% |
| edge | 2 | 2 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 6 | 6 | 100.0% |
| medium | 1 | 1 | 100.0% |

### 7. Verification Loop [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 22.22 ms |
| Latency p95 | 47.79 ms |
| Latency p99 | 50.93 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| basic | 2 | 2 | 100.0% |
| retry | 1 | 1 | 100.0% |
| timeout | 1 | 1 | 100.0% |
| multi | 1 | 1 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |

### 8. Board Meeting Routing [Mock]

| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [82.4%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01 ms |
| Latency p95 | 0.39 ms |
| Latency p99 | 1.19 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 18 / 18 / 0 |

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| default_template | 3 | 3 | 100.0% |
| explicit_experts | 3 | 3 | 100.0% |
| topic_extraction | 3 | 3 | 100.0% |
| no_match | 3 | 3 | 100.0% |
| name_validation | 3 | 3 | 100.0% |
| stop_command | 3 | 3 | 100.0% |

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 11 | 11 | 100.0% |
| medium | 7 | 7 | 100.0% |

## Issues and Recommendations

- All 71 test cases across the 8 dimensions passed in mock mode, so no specific improvements are recommended.

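The per-dimension pass counts reported above can be tallied as follows. This is a small sketch, not the harness's own aggregation code: the report does not say whether the overall accuracy is pooled over all cases (micro) or averaged per dimension (macro), so both are shown.

```python
# (passed, total) pairs copied from the "Total / Pass / Fail" rows of this report.
results = {
    "Preprocessing Accuracy": (15, 15),
    "Overfitting Detection": (5, 5),
    "Efficiency": (5, 5),
    "Tool Search": (10, 10),
    "Event Model": (6, 6),
    "Spec Management": (7, 7),
    "Verification Loop": (5, 5),
    "Board Meeting Routing": (18, 18),
}

passed = sum(p for p, _ in results.values())
total = sum(t for _, t in results.values())

# Micro average pools all cases; macro average weights each dimension equally.
micro = passed / total
macro = sum(p / t for p, t in results.values()) / len(results)

print(f"{passed}/{total} cases passed; micro={micro:.1%}, macro={macro:.1%}")
```

Both averages coincide here because every case passed. They would diverge once failures are concentrated in the smaller dimensions.
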