fischer-agentkit/test-results/benchmark/benchmark_report.md

# AgentKit Capability Benchmark Report

## Test Summary
- Time: 2026-06-20T03:18:35.937935+00:00
- Version: 0.1.0
- Mode: llm
- Runs: 1
- Overall accuracy: 60.0% ± 0.0%

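## Comparison with Industry Benchmarks

| Benchmark | What it tests | AgentKit counterpart |
|---|---|---|
| SWE-bench | LLM code repair | None (measures the LLM, not the framework) |
| ToolBench | Tool calling | `tool_search` dimension |
| AgentBench | Agent systems | All dimensions |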

## Results by Dimension

### 9. LLM Reasoning [LLM]

| Metric | Value |
|---|---|
| Accuracy | 60.0% ± 0.0% |
| 95% CI | [23.1%, 88.2%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 35309.32 ms |
| Latency p95 | 41704.39 ms |
| Latency p99 | 42044.76 ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 3 / 2 |
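
The report does not say how the 95% CI is computed. The reported bounds are, however, consistent with a Wilson score interval on 3 passes out of 5 cases, so the following minimal sketch assumes that method. The function name `wilson_interval` is illustrative and not part of AgentKit.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    # The interval is centered on a value shrunk toward 0.5, not on p itself.
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# 3 of 5 cases passed in this run.
low, high = wilson_interval(3, 5)
print(f"[{low:.1%}, {high:.1%}]")  # prints "[23.1%, 88.2%]"
```

The width of the interval (roughly 65 percentage points) reflects the small sample: with only five cases, the 60.0% point estimate is a weak signal on its own.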

#### Breakdown by Category

| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| intent_understanding | 1 | 0 | 0.0% |
| tool_selection | 1 | 1 | 100.0% |
| multi_step | 1 | 1 | 100.0% |
| code_generation | 1 | 0 | 0.0% |
| error_recovery | 1 | 1 | 100.0% |
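
A breakdown like the one above can be derived from per-case outcomes with a simple group-and-count. This is a sketch only: the `results` list is transcribed from the category table, and the `breakdown` helper is a hypothetical name, not the AgentKit reporting code.

```python
from collections import defaultdict

# (category, passed) for the five cases, transcribed from the category table.
results = [
    ("intent_understanding", False),
    ("tool_selection", True),
    ("multi_step", True),
    ("code_generation", False),
    ("error_recovery", True),
]


def breakdown(rows):
    """Return {group: (cases, passed, accuracy)} in first-seen order."""
    counts = defaultdict(lambda: [0, 0])  # group -> [cases, passed]
    for group, passed in rows:
        counts[group][0] += 1
        counts[group][1] += int(passed)
    return {g: (n, k, k / n) for g, (n, k) in counts.items()}


by_category = breakdown(results)
for group, (n, k, acc) in by_category.items():
    print(f"| {group} | {n} | {k} | {acc:.1%} |")
```

The same helper would work for the difficulty breakdown if each row carried the difficulty label instead of the category.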

#### Breakdown by Difficulty

| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 1 | 0 | 0.0% |
| medium | 2 | 1 | 50.0% |
| hard | 2 | 2 | 100.0% |

#### Failed Case Analysis

| Case ID | Category | Difficulty | Expected | Actual | Root cause |
|---|---|---|---|---|---|
| llm-001 | intent_understanding | easy | react | timeout | timeout |
| llm-004 | code_generation | medium | react | timeout | timeout |

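## Issues and Recommendations

- **llm_reasoning**: accuracy of 60.0% (3 of 5 cases) is below the 90% threshold. Both failures (llm-001 and llm-004) were timeouts, so start by reviewing those two cases.
- **llm_reasoning**: P95 latency of 41704.39 ms is high; performance optimization is recommended.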
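
The two findings above read like the output of simple threshold rules. The sketch below shows one way such rules might look. The 90% accuracy threshold comes from this report, but the latency limit `max_p95_ms` is a purely hypothetical value (the report does not state one), and `summarize` is an illustrative name rather than an AgentKit function.

```python
def summarize(dimension, accuracy, p95_ms, min_accuracy=0.90, max_p95_ms=10_000.0):
    """Return a list of findings for one benchmark dimension.

    min_accuracy matches the 90% threshold quoted in the report;
    max_p95_ms is an assumed placeholder, not a documented limit.
    """
    findings = []
    if accuracy < min_accuracy:
        findings.append(
            f"{dimension}: accuracy {accuracy:.1%} is below {min_accuracy:.0%}"
        )
    if p95_ms > max_p95_ms:
        findings.append(
            f"{dimension}: P95 latency {p95_ms:.2f} ms exceeds {max_p95_ms:.0f} ms"
        )
    return findings


findings = summarize("llm_reasoning", accuracy=0.60, p95_ms=41704.39)
for line in findings:
    print(f"- {line}")
```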