AgentKit Capability Benchmark Report

Test Summary

Time: 2026-06-20T11:05:39.446588+00:00
Version: 0.1.0
Mode: llm
Runs: 3
Overall accuracy: 93.3% ± 0.0%

Comparison with Industry Benchmarks

Benchmark	What it tests	AgentKit equivalent
SWE-bench	LLM code repair	N/A (tests the LLM, not the framework)
ToolBench	Tool calling	tool_search dimension
AgentBench	Agent systems	All dimensions

Results by Dimension

9. LLM Reasoning [LLM]

Metric	Value
Accuracy	93.3% ± 9.4%
95% CI	[37.5%, 96.4%]
Precision	0.0%
Recall	0.0%
F1	0.0%
Latency p50	40798.45 ms
Latency p95	56307.93 ms
Latency p99	59262.53 ms
Consistency	100.0%
Total / Pass / Fail	5 / 4 / 1
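
The report does not say how the interval or the ± figure was derived. As a hedged cross-check, the sketch below computes a Wilson score interval for 4 passes out of 5, which lands at roughly 0.3755 to 0.9638 and so agrees with the reported 95% CI to within rounding. It also shows that three hypothetical per-run accuracies (100%, 100%, 80%) would give 93.3% ± 9.4% under a population standard deviation. Both the choice of the Wilson method and the per-run values are assumptions; the report lists neither.

```python
import math
import statistics


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def mean_and_pstdev(values: list[float]) -> tuple[float, float]:
    """Mean and population standard deviation of per-run accuracies."""
    return statistics.mean(values), statistics.pstdev(values)


# 4 of 5 cases passed, taken from the "Total / Pass / Fail" row.
low, high = wilson_interval(4, 5)
print(f"Wilson 95% CI: [{low:.4f}, {high:.4f}]")

# HYPOTHETICAL per-run accuracies: the report does not list them.
# They are chosen only to show one way the 93.3% ± 9.4% figure could arise.
mean_acc, spread = mean_and_pstdev([1.0, 1.0, 0.8])
print(f"mean ± pstdev: {mean_acc:.3f} ± {spread:.3f}")
```

If this reading is right, the accuracy row is an average over the 3 runs, while the interval and the pass/fail counts describe a single 5-case run, which would explain why 93.3% and 80.0% both appear in the report.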

Breakdown by Category

Category	Cases	Passed	Accuracy
intent_understanding	1	1	100.0%
tool_selection	1	0	0.0%
multi_step	1	1	100.0%
code_generation	1	1	100.0%
error_recovery	1	1	100.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	1	1	100.0%
medium	2	1	50.0%
hard	2	2	100.0%

Failed Case Analysis

Case ID	Category	Difficulty	Expected	Actual	Root cause
llm-002	tool_selection	medium	react	timeout	timeout

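Issues and Recommendations

llm_reasoning: accuracy is 80.0% (4 of 5 cases passed), below 90%. Review the failed case (llm-002, which timed out) and optimize accordingly.
llm_reasoning: P95 latency of 56307.93 ms is high; performance optimization is recommended.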
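
A minimal sketch of how recommendations like the two above could be generated from a dimension's results. The 90% accuracy target comes from the message in this report. The `DimensionResult` type, the `recommendations` function, and the 30-second P95 limit are hypothetical and are not taken from AgentKit.

```python
from dataclasses import dataclass

ACCURACY_TARGET = 0.90           # from the report's "below 90%" message
LATENCY_P95_LIMIT_MS = 30_000.0  # ASSUMPTION: the report does not state a limit


@dataclass
class DimensionResult:
    """Hypothetical container for one dimension's results."""
    name: str
    passed: int
    total: int
    latency_p95_ms: float


def recommendations(result: DimensionResult) -> list[str]:
    """Return one message per rule that the result violates."""
    messages = []
    accuracy = result.passed / result.total
    if accuracy < ACCURACY_TARGET:
        messages.append(
            f"{result.name}: accuracy {accuracy:.1%} is below "
            f"{ACCURACY_TARGET:.0%}; review the failed cases"
        )
    if result.latency_p95_ms > LATENCY_P95_LIMIT_MS:
        messages.append(
            f"{result.name}: P95 latency {result.latency_p95_ms:.2f} ms is high; "
            "consider performance optimization"
        )
    return messages


# Values taken from the LLM Reasoning table in this report.
for line in recommendations(DimensionResult("llm_reasoning", 4, 5, 56307.93)):
    print(line)
```

Keeping the thresholds as named constants makes it visible which limits are policy decisions, so they can be tuned without touching the rule logic.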