AgentKit Capability Benchmark Report

Test Summary

Time: 2026-06-17T04:52:53.863927+00:00
Version: 0.1.0
Mode: all
Runs: 1
Overall accuracy: 95.2% ± 0.0%
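
The overall figure is consistent with pooling pass counts across all nine dimensions (60 of 63 cases) rather than averaging the per-dimension accuracies. The sketch below checks both readings; the counts are copied from the "Total / Pass / Fail" rows of this report, while the names `RESULTS`, `pooled_accuracy`, and `macro_accuracy` are illustrative and not part of AgentKit.

```python
# (total, passed) per dimension, copied from the "Total / Pass / Fail" rows of this report.
RESULTS = {
    "preprocessing": (15, 15),
    "overfitting": (5, 5),
    "efficiency": (5, 5),
    "tool_search": (10, 10),
    "event_model": (6, 6),
    "spec_management": (7, 7),
    "verification": (5, 5),
    "llm_reasoning": (5, 3),
    "gui_integration": (5, 4),
}


def pooled_accuracy(results):
    """Return passed / total over all dimensions (micro-average)."""
    total = sum(t for t, _ in results.values())
    passed = sum(p for _, p in results.values())
    return passed / total


def macro_accuracy(results):
    """Return the unweighted mean of the per-dimension accuracies."""
    return sum(p / t for t, p in results.values()) / len(results)


print(f"pooled: {pooled_accuracy(RESULTS):.1%}")  # 60 / 63 -> 95.2%
print(f"macro:  {macro_accuracy(RESULTS):.1%}")   # 8.4 / 9 -> 93.3%
```

Only the pooled value reproduces the reported 95.2%, so larger dimensions such as preprocessing carry more weight in the headline number than the five-case LLM and GUI dimensions.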

Comparison with Industry Benchmarks

Benchmark	What it tests	AgentKit counterpart
SWE-bench	LLM code repair	N/A (tests the LLM, not the framework)
ToolBench	Tool calling	tool_search dimension
AgentBench	Agent systems	All dimensions

Results by Dimension

1. Preprocessing Accuracy [Mock]

Metric	Value
Accuracy	100.0% ± 0.0%
95% CI	[79.6%, 100.0%]
Precision	100.0%
Recall	100.0%
F1	100.0%
Latency p50	0.01ms
Latency p95	0.06ms
Latency p99	0.11ms
Consistency	100.0%
Total / Pass / Fail	15 / 15 / 0
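
The 95% CI bounds reported throughout this document agree, to within 0.1 percentage point, with a Wilson score interval at z = 1.96. The report does not name the method, so this is an inference from the numbers rather than a statement about AgentKit's implementation. A minimal sketch, with `wilson_interval` as an illustrative name:

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Return the (lower, upper) Wilson score interval for a pass rate."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, center - half), min(1.0, center + half)


# Pass/total pairs that occur in this report.
for passed, total in [(15, 15), (5, 5), (3, 5), (4, 5)]:
    lo, hi = wilson_interval(passed, total)
    print(f"{passed}/{total}: [{lo:.3f}, {hi:.3f}]")
```

With only 5 to 15 cases per dimension the intervals are wide: even a perfect 5 / 5 run leaves a lower bound near 56%, so a 100% accuracy on a small dimension is weaker evidence than it looks.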

Breakdown by Category

Category	Cases	Passed	Accuracy
greeting	4	4	100.0%
tool_query	5	5	100.0%
skill_prefix	3	3	100.0%
complex	3	3	100.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	5	5	100.0%
medium	7	7	100.0%
hard	3	3	100.0%

2. Overfitting Detection [Mock]

Metric	Value
Accuracy	100.0% ± 0.0%
95% CI	[56.5%, 100.0%]
Precision	100.0%
Recall	100.0%
F1	100.0%
Latency p50	0.03ms
Latency p95	0.06ms
Latency p99	0.06ms
Consistency	100.0%
Total / Pass / Fail	5 / 5 / 0

Breakdown by Category

Category	Cases	Passed	Accuracy
ip_check	1	1	100.0%
search	1	1	100.0%
greeting	1	1	100.0%
tool_use	1	1	100.0%
complex	1	1	100.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
medium	3	3	100.0%
easy	1	1	100.0%
hard	1	1	100.0%

3. Efficiency [Mock]

Metric	Value
Accuracy	100.0% ± 0.0%
95% CI	[56.5%, 100.0%]
Precision	0.0%
Recall	0.0%
F1	0.0%
Latency p50	0.33ms
Latency p95	0.62ms
Latency p99	0.66ms
Consistency	100.0%
Total / Pass / Fail	5 / 5 / 0

Breakdown by Category

Category	Cases	Passed	Accuracy
preprocess_latency	3	3	100.0%
tool_search_latency	2	2	100.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	2	2	100.0%
medium	3	3	100.0%

4. Tool Search [Mock]

Metric	Value
Accuracy	100.0% ± 0.0%
95% CI	[72.2%, 100.0%]
Precision	83.3%
Recall	83.3%
F1	83.3%
Latency p50	0.02ms
Latency p95	0.03ms
Latency p99	0.03ms
Consistency	100.0%
Total / Pass / Fail	10 / 10 / 0

Breakdown by Category

Category	Cases	Passed	Accuracy
exact_match	5	5	100.0%
fuzzy_match	2	2	100.0%
no_match	2	2	100.0%
top_k	1	1	100.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	7	7	100.0%
medium	3	3	100.0%

5. Event Model [Mock]

Metric	Value
Accuracy	100.0% ± 0.0%
95% CI	[61.0%, 100.0%]
Precision	0.0%
Recall	0.0%
F1	0.0%
Latency p50	0.06ms
Latency p95	16.00ms
Latency p99	20.24ms
Consistency	100.0%
Total / Pass / Fail	6 / 6 / 0

Breakdown by Category

Category	Cases	Passed	Accuracy
sq_lifecycle	3	3	100.0%
eq_lifecycle	3	3	100.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	6	6	100.0%

6. Spec Management [Mock]

Metric	Value
Accuracy	100.0% ± 0.0%
95% CI	[64.6%, 100.0%]
Precision	0.0%
Recall	0.0%
F1	0.0%
Latency p50	1.38ms
Latency p95	3.46ms
Latency p99	4.01ms
Consistency	100.0%
Total / Pass / Fail	7 / 7 / 0

Breakdown by Category

Category	Cases	Passed	Accuracy
crud	5	5	100.0%
edge	2	2	100.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	6	6	100.0%
medium	1	1	100.0%

7. Verification Loop [Mock]

Metric	Value
Accuracy	100.0% ± 0.0%
95% CI	[56.5%, 100.0%]
Precision	0.0%
Recall	0.0%
F1	0.0%
Latency p50	22.00ms
Latency p95	411.57ms
Latency p99	487.06ms
Consistency	100.0%
Total / Pass / Fail	5 / 5 / 0

Breakdown by Category

Category	Cases	Passed	Accuracy
basic	2	2	100.0%
retry	1	1	100.0%
timeout	1	1	100.0%
multi	1	1	100.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	2	2	100.0%
medium	3	3	100.0%

8. LLM Reasoning [LLM]

Metric	Value
Accuracy	60.0% ± 0.0%
95% CI	[23.1%, 88.2%]
Precision	0.0%
Recall	0.0%
F1	0.0%
Latency p50	25149.49ms
Latency p95	30001.17ms
Latency p99	30001.23ms
Consistency	100.0%
Total / Pass / Fail	5 / 3 / 2

Breakdown by Category

Category	Cases	Passed	Accuracy
intent_understanding	1	1	100.0%
tool_selection	1	1	100.0%
multi_step	1	0	0.0%
code_generation	1	1	100.0%
error_recovery	1	0	0.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	1	1	100.0%
medium	2	2	100.0%
hard	2	0	0.0%

Failed Case Analysis

Case ID	Category	Difficulty	Expected	Actual	Root cause
llm-003	multi_step	hard	react	timeout	timeout
llm-005	error_recovery	hard	react	timeout	timeout
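
Both failures above report `timeout` as the actual result, and the p95 and p99 latencies for this dimension sit just over 30000 ms. That pattern suggests a per-case limit of roughly 30 s, although the report does not state one. The sketch below shows one way a harness might convert an overrun into a `timeout` result instead of an exception; `run_with_timeout`, `slow_case`, and `fast_case` are hypothetical names, not AgentKit APIs, and the limit is shortened so the example finishes quickly.

```python
import asyncio


async def run_with_timeout(case_coro, timeout_s):
    """Run one benchmark case; report 'timeout' instead of raising."""
    try:
        actual = await asyncio.wait_for(case_coro, timeout=timeout_s)
        return {"actual": actual, "root_cause": None}
    except asyncio.TimeoutError:
        # wait_for cancels the pending case before raising.
        return {"actual": "timeout", "root_cause": "timeout"}


async def slow_case():
    # Stand-in for an LLM call that does not answer within the limit.
    await asyncio.sleep(1.0)
    return "react"


async def fast_case():
    # Stand-in for an LLM call that answers immediately.
    return "react"


print(asyncio.run(run_with_timeout(slow_case(), timeout_s=0.05)))
print(asyncio.run(run_with_timeout(fast_case(), timeout_s=0.05)))
```

Under this reading, a failed case says more about response time on hard prompts than about answer quality, so raising the limit or streaming partial results would be worth testing before changing prompts.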

9. GUI Integration [GUI]

Metric	Value
Accuracy	80.0% ± 0.0%
95% CI	[37.5%, 96.4%]
Precision	80.0%
Recall	80.0%
F1	80.0%
Latency p50	0.00ms
Latency p95	0.00ms
Latency p99	0.00ms
Consistency	100.0%
Total / Pass / Fail	5 / 4 / 1

Breakdown by Category

Category	Cases	Passed	Accuracy
service_startup	1	1	100.0%
api_availability	2	2	100.0%
websocket	1	0	0.0%
frontend	1	1	100.0%

Breakdown by Difficulty

Difficulty	Cases	Passed	Accuracy
easy	2	2	100.0%
medium	2	2	100.0%
hard	1	0	0.0%

Failed Case Analysis

Case ID	Category	Difficulty	Expected	Actual	Root cause
gui-004	websocket	hard	connected	failed	gui_failure

Baseline Comparison

Dimension	Baseline accuracy	Current accuracy	Change
preprocessing	100.0%	100.0%	—
overfitting	100.0%	100.0%	—
efficiency	100.0%	100.0%	—
tool_search	100.0%	100.0%	—
event_model	100.0%	100.0%	—
spec_management	100.0%	100.0%	—
verification	100.0%	100.0%	—
llm_reasoning	0.0%	60.0%	↑
gui_integration	0.0%	80.0%	↑

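Issues and Recommendations

verification: P95 latency of 411.57ms is high; profile the slow cases and optimize performance.
llm_reasoning: accuracy of 60.0% is below 90%; review the failed cases (llm-003 and llm-005, both timeouts) and optimize.
llm_reasoning: P95 latency of 30001.17ms is high; optimize performance.
gui_integration: accuracy of 80.0% is below 90%; review the failed case (gui-004, WebSocket connection failed) and optimize.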
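
The findings above look like the output of two threshold rules: accuracy below 90%, and P95 latency above some cutoff. The 90% floor is stated in the report. The latency cutoff is not, and can only be bracketed: 16.00ms (event_model) was not flagged while 411.57ms (verification) was. A minimal sketch under those assumptions, using a hypothetical 100 ms cutoff and illustrative names:

```python
ACCURACY_FLOOR = 0.90          # stated in the report ("below 90%")
P95_LATENCY_LIMIT_MS = 100.0   # hypothetical; the report does not give the cutoff


def findings(dimension, accuracy, p95_ms):
    """Return the list of issues raised for one dimension."""
    notes = []
    if accuracy < ACCURACY_FLOOR:
        notes.append(f"{dimension}: accuracy {accuracy:.1%} is below {ACCURACY_FLOOR:.0%}")
    if p95_ms > P95_LATENCY_LIMIT_MS:
        notes.append(f"{dimension}: P95 latency {p95_ms:.2f}ms is high")
    return notes


# (accuracy, P95 latency in ms) copied from the dimension tables in this report.
SAMPLE = {
    "event_model": (1.0, 16.00),
    "verification": (1.0, 411.57),
    "llm_reasoning": (0.6, 30001.17),
    "gui_integration": (0.8, 0.00),
}

for name, (acc, p95) in SAMPLE.items():
    for note in findings(name, acc, p95):
        print(note)
```

With this cutoff the sketch raises the same four findings in the same order as the list above. A single latency limit shared by mock and LLM dimensions is a simplification; per-mode limits would likely be more informative.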