AgentKit Capability Benchmark Report

Test Summary

Time: 2026-06-20T03:18:35.937935+00:00
Version: 0.1.0
Mode: llm
Runs: 1
Overall accuracy: 60.0% ± 0.0%

Comparison with Industry Benchmarks

Benchmark	What it tests	AgentKit equivalent
SWE-bench	LLM code repair	N/A (tests the LLM, not the framework)
ToolBench	Tool calling	tool_search dimension
AgentBench	Agent systems	All dimensions

Results by Dimension

9. LLM Reasoning [LLM]

Metric	Value
Accuracy	60.0% ± 0.0%
95% CI	[23.1%, 88.2%]
Precision	0.0%
Recall	0.0%
F1	0.0%
Latency p50	35309.32 ms
Latency p95	41704.39 ms
Latency p99	42044.76 ms
Consistency	100.0%
Total / Pass / Fail	5 / 3 / 2
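
The 95% CI row can be checked by hand. The reported bounds match a Wilson score interval for 3 passes out of 5 cases at z = 1.96; whether AgentKit actually computes the interval this way is an assumption, and the function name `wilson_interval` is hypothetical. A minimal sketch:

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    z2 = z * z
    denom = 1 + z2 / total
    # Center of the interval is pulled toward 0.5 for small samples.
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return center - half, center + half


# Pass and total counts taken from the "Total / Pass / Fail" row: 5 / 3 / 2.
low, high = wilson_interval(3, 5)
print(f"[{low:.1%}, {high:.1%}]")  # prints [23.1%, 88.2%]
```

The width of the interval (about 65 percentage points) reflects the sample size of five cases, so the 60.0% accuracy figure should be read as a rough estimate.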

Breakdown by Category

Category	Cases	Passed	Accuracy
intent_understanding	1	0	0.0%
tool_selection	1	1	100.0%
multi_step	1	1	100.0%
code_generation	1	0	0.0%
error_recovery	1	1	100.0%
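
As an illustration of how a per-category breakdown like this one can be derived from raw case results, the sketch below groups (category, passed) pairs and computes accuracy for each group. The record layout and the helper name `accuracy_by_group` are assumptions made for the example, not AgentKit's actual data model; the pass/fail values are transcribed from the table.

```python
from collections import defaultdict

# (category, passed) pairs transcribed from the category table in this report.
results = [
    ("intent_understanding", False),
    ("tool_selection", True),
    ("multi_step", True),
    ("code_generation", False),
    ("error_recovery", True),
]


def accuracy_by_group(records):
    """Return {group: (passed, total, accuracy)} for (group, passed) records."""
    counts = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for group, passed in records:
        counts[group][0] += int(passed)
        counts[group][1] += 1
    return {g: (p, t, p / t) for g, (p, t) in counts.items()}


by_category = accuracy_by_group(results)
for name, (passed, total, acc) in by_category.items():
    print(f"{name}\t{total}\t{passed}\t{acc:.1%}")
```

The same helper would work for the difficulty breakdown if each record carried a difficulty label instead of a category.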

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	1	0	0.0%
medium	2	1	50.0%
hard	2	2	100.0%

Failed Case Analysis

Case ID	Category	Difficulty	Expected	Actual	Root cause
llm-001	intent_understanding	easy	react	timeout	timeout
llm-004	code_generation	medium	react	timeout	timeout

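Summary of Issues and Recommendations

llm_reasoning: accuracy of 60.0% is below the 90% threshold. Both failed cases (llm-001, llm-004) timed out, so review them first and optimize.
llm_reasoning: P95 latency of 41704.39 ms is high; performance optimization is recommended.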
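
Findings of this kind can be produced by simple threshold checks over the dimension metrics. The sketch below is one possible shape for such a check. The 90% accuracy threshold comes from the report; the latency threshold (`max_p95_ms = 10_000.0`) and the function name `build_findings` are hypothetical, since the report does not state what latency counts as "high".

```python
def build_findings(
    dimension: str,
    accuracy: float,
    p95_ms: float,
    min_accuracy: float = 0.90,      # threshold stated in the report
    max_p95_ms: float = 10_000.0,    # assumed value, not from the report
) -> list[str]:
    """Return human-readable findings for metrics that miss their thresholds."""
    findings = []
    if accuracy < min_accuracy:
        findings.append(
            f"{dimension}: accuracy of {accuracy:.1%} is below {min_accuracy:.0%}"
        )
    if p95_ms > max_p95_ms:
        findings.append(
            f"{dimension}: P95 latency of {p95_ms:.2f} ms exceeds {max_p95_ms:.0f} ms"
        )
    return findings


# Values taken from the llm_reasoning metrics in this report.
findings = build_findings("llm_reasoning", accuracy=0.60, p95_ms=41704.39)
for line in findings:
    print(line)
```

Keeping the thresholds as parameters makes it possible to tune them per dimension, which matters here because LLM-backed cases will naturally have higher latency than pure framework checks.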