refactor: standardize benchmark with industry methodology (P/R/F1, multi-run, baseline)
parent d361177cc7
commit 1fbfd9d132
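@@ -36,7 +36,9 @@ prompt:
   identity: "You are the AgentKit capability regression-testing assistant. You run the capability tests for each dimension and generate evaluation reports."
   instructions: |
     ## Responsibilities
-    Run AgentKit capability regression tests as requested by the user and generate a comprehensive evaluation report.
+    Run AgentKit capability regression tests as requested by the user and generate a standardized evaluation report.
+    Follow industry benchmark methodology (in the style of SWE-bench / AgentBench / ToolBench)
+    and report the full metric set: Accuracy / Precision / Recall / F1 / Latency / Consistency.
 
     ## Available commands
 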
@@ -44,13 +46,14 @@ prompt:
     ```bash
     python3 -m agentkit.cli.main benchmark --report --verbose
     ```
-    Runs all 51 test cases across 7 dimensions and generates JSON + TXT reports.
+    Runs all 53 standardized test cases across 7 dimensions and generates JSON + Markdown reports.
+    By default, runs 3 times and reports mean ± standard deviation, with a 95% Wilson confidence interval.
 
     ### Quick run
     ```bash
     python3 -m agentkit.cli.main benchmark --fast --report
     ```
-    Runs the core cases (about 23); suited to quick checks during development.
+    Runs the core cases (about 22); suited to quick checks during development.
 
     ### Single-dimension run
     ```bash
@@ -58,16 +61,42 @@ prompt:
     ```
     Available dimensions: preprocessing, overfitting, efficiency, tool_search, event_model, spec_management, verification
 
+    ### Multiple runs with averaging (--runs)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --runs 5 --report
+    ```
+    Sets the number of runs (default 3) and computes accuracy_mean ± accuracy_std and the 95% confidence interval.
+    Suited to stability evaluation and regression detection.
+
+    ### Baseline comparison (--baseline)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --baseline --report
+    ```
+    The first run creates the baseline (baseline.json) automatically; later runs are compared against it and show ↑/↓ trends.
+    Suited to CI/CD regression monitoring.
+
+    ### Markdown report (default)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --report --format markdown
+    ```
+    Generates a human-readable Markdown report with metric tables, failed-case analysis, and improvement suggestions.
+
     ### HTML report
     ```bash
     python3 -m agentkit.cli.main benchmark --report --format html
     ```
 
+    ### JSON report
+    ```bash
+    python3 -m agentkit.cli.main benchmark --report --format json
+    ```
+    Generates only the JSON report; suited to machine parsing and CI integration.
+
     ### Comprehensive pytest run
     ```bash
-    python3 -m pytest tests/e2e/test_capability_comprehensive.py -v
+    python3 -m pytest tests/e2e/test_capability_comprehensive.py -v -m e2e_capability
     ```
-    Runs 60 tests (8 dimensions) and generates comprehensive_report.
+    Runs 64 tests (10 dimensions, including the standard benchmark framework integration tests) and generates comprehensive_report.
 
     ### Specifying the output directory
     ```bash
@@ -75,24 +104,37 @@ prompt:
     ```
 
     ## Test dimensions
+    Every dimension reports the following standardized metrics:
+    - **Accuracy**: pass rate
+    - **Precision**: macro-averaged, multi-class
+    - **Recall**: macro-averaged, multi-class
+    - **F1**: harmonic mean of Precision and Recall
+    - **Latency p50/p95/p99**: latency percentiles in milliseconds
+    - **Consistency**: stability on paraphrased inputs (overfitting detection)
+    - **95% CI**: Wilson confidence interval (with multiple runs)
+
+    Dimension list:
     1. **preprocessing**: preprocessing accuracy (greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT)
-    2. **overfitting**: overfitting detection (consistency across different phrasings of the same intent)
-    3. **efficiency**: execution efficiency (preprocessing latency < 50ms, tool search latency < 10ms)
-    4. **tool_search**: tool search accuracy (BM25 relevance ranking)
+    2. **overfitting**: overfitting detection (consistency across different phrasings of the same intent; Consistency metric)
+    3. **efficiency**: execution efficiency (preprocessing latency < 50ms, tool search latency < 10ms; Latency metric)
+    4. **tool_search**: tool search accuracy (BM25 relevance ranking; P/R/F1 metrics)
     5. **event_model**: event model integrity (SQ/EQ dual-queue lifecycle)
     6. **spec_management**: spec management (CRUD operations)
     7. **verification**: verification loop (verify/retry behavior)
 
     ## Report locations
-    - CLI reports: `test-results/benchmark/benchmark_report.{json,txt,html}`
+    - CLI reports: `test-results/benchmark/benchmark_report.{json,md,html}`
+    - Baseline file: `test-results/benchmark/baseline.json` (created when --baseline is used)
     - pytest reports: `test-results/e2e/comprehensive_report.{json,txt}`
 
     ## Output requirements
     1. Run the test command
-    2. Read the generated report files
-    3. Show the user a summary table of the results
-    4. If any cases failed, analyze the cause and suggest improvements
-    5. Compare against historical reports (if any) and show the trend
+    2. Read the generated report files (JSON + Markdown)
+    3. Show the user a summary table of the results, including Accuracy / P / R / F1 / Latency for each dimension
+    4. If any cases failed, analyze the root cause (wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded)
+    5. Compare against the baseline report (when --baseline is used) and show the ↑/↓ accuracy trend for each dimension
+    6. Watch the key metrics: flag a performance problem when P95 latency > 100ms, and an overfitting risk when Consistency < 100%
+    7. Give targeted improvement suggestions based on the metric data, not on subjective judgment
 
 llm:
   model: "default"
File diff suppressed because it is too large (3 files)
@@ -0,0 +1,246 @@
+# AgentKit Capability Benchmark Report
+
+## Test Summary
+- Time: 2026-06-17T04:00:50.738066+00:00
+- Version: 0.1.0
+- Runs: 3
+- Overall accuracy: 100.0% ± 0.0%
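
The summary reports overall accuracy as a mean ± standard deviation over 3 runs. A minimal sketch of that aggregation follows, assuming each run yields one accuracy value and that the sample standard deviation is used (the source does not say whether it is the sample or population form). `summarize_runs` is a hypothetical helper, not part of AgentKit.

```python
import statistics


def summarize_runs(accuracies: list[float]) -> tuple[float, float]:
    """Return (mean, standard deviation) of per-run accuracy values.

    Assumption: sample standard deviation (n - 1 denominator); a single
    run has no spread, so its deviation is reported as 0.0.
    """
    if not accuracies:
        raise ValueError("at least one run is required")
    mean = statistics.fmean(accuracies)
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std


if __name__ == "__main__":
    mean, std = summarize_runs([1.0, 1.0, 1.0])
    print(f"{mean:.1%} ± {std:.1%}")
```

Three identical runs give a deviation of zero, which is consistent with the "± 0.0%" figures throughout this report.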
+
+## Comparison with Industry Benchmarks
+
+| Benchmark | What it tests | AgentKit counterpart |
+|---|---|---|
+| SWE-bench | LLM code repair | — (tests the LLM, not the framework) |
+| ToolBench | Tool calling | tool_search dimension |
+| AgentBench | Agent systems | All dimensions |
+
+## Dimension Results
+
+### 1. Preprocessing Accuracy
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [79.6%, 100.0%] |
+| Precision | 100.0% |
+| Recall | 100.0% |
+| F1 | 100.0% |
+| Latency p50 | 0.01ms |
+| Latency p95 | 0.03ms |
+| Latency p99 | 0.06ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 15 / 15 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| greeting | 4 | 4 | 100.0% |
+| tool_query | 5 | 5 | 100.0% |
+| skill_prefix | 3 | 3 | 100.0% |
+| complex | 3 | 3 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 5 | 5 | 100.0% |
+| medium | 7 | 7 | 100.0% |
+| hard | 3 | 3 | 100.0% |
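
The 95% CI rows in this report are described as Wilson confidence intervals. The sketch below implements the standard Wilson score interval for a pass count; `wilson_interval` is a hypothetical name, and treating the interval as computed from a single run's pass/total counts is an assumption. Under that assumption, the lower bounds it produces for 15/15, 10/10, 7/7, 6/6 and 5/5 agree with the report's 79.6%, 72.2%, 64.6%, 61.0% and 56.5% to within rounding.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.959964) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a pass rate.

    z defaults to the two-sided 95% normal quantile.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p = passed / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at p = 0 or p = 1.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for n in (15, 10, 7, 6, 5):
        low, high = wilson_interval(n, n)
        print(f"{n}/{n}: lower={low:.4f} upper={high:.4f}")
```

Unlike the normal-approximation interval, the Wilson interval does not collapse to zero width at a 100% pass rate, which is why a perfect score on only 5 cases still carries a wide interval.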
+
+### 2. Overfitting Detection
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [56.5%, 100.0%] |
+| Precision | 100.0% |
+| Recall | 100.0% |
+| F1 | 100.0% |
+| Latency p50 | 0.04ms |
+| Latency p95 | 0.06ms |
+| Latency p99 | 0.07ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 5 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| ip_check | 1 | 1 | 100.0% |
+| search | 1 | 1 | 100.0% |
+| greeting | 1 | 1 | 100.0% |
+| tool_use | 1 | 1 | 100.0% |
+| complex | 1 | 1 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| medium | 3 | 3 | 100.0% |
+| easy | 1 | 1 | 100.0% |
+| hard | 1 | 1 | 100.0% |
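
The Consistency metric is described as stability across different phrasings of the same intent. The source does not give its exact formula, so the sketch below is one plausible reading: the fraction of cases whose paraphrases all receive the same label as the original input. `consistency` and `toy_classify` are hypothetical stand-ins, not AgentKit functions.

```python
from typing import Callable, Sequence


def consistency(
    cases: Sequence[tuple[str, Sequence[str]]],
    classify: Callable[[str], str],
) -> float:
    """Fraction of cases whose paraphrases all match the original's label.

    Each case is (original_input, paraphrases). An empty case list is
    treated as fully consistent (assumption).
    """
    if not cases:
        return 1.0
    stable = 0
    for original, paraphrases in cases:
        expected = classify(original)
        if all(classify(p) == expected for p in paraphrases):
            stable += 1
    return stable / len(cases)


def toy_classify(text: str) -> str:
    """Deliberately brittle keyword rule, used only to demonstrate a miss."""
    return "direct_chat" if text.lower().startswith(("hi", "hello")) else "react"


if __name__ == "__main__":
    demo_cases = [
        ("hello", ["hi there", "hello!"]),
        ("search the docs", ["look up the docs", "hi, search the docs"]),
    ]
    print(consistency(demo_cases, toy_classify))
```

In the demo, the second case is unstable because one paraphrase starts with a greeting word and flips the label, which is the kind of surface-pattern overfitting this dimension is meant to expose.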
+
+### 3. Efficiency
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [56.5%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 0.40ms |
+| Latency p95 | 0.77ms |
+| Latency p99 | 0.82ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 5 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| preprocess_latency | 3 | 3 | 100.0% |
+| tool_search_latency | 2 | 2 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 2 | 2 | 100.0% |
+| medium | 3 | 3 | 100.0% |
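
The Latency p50/p95/p99 rows are percentiles over per-case timings in milliseconds. A minimal sketch using the nearest-rank method is shown below; which percentile definition AgentKit actually uses (nearest-rank, linear interpolation, or another) is not stated in the source, and `percentile` and `latency_summary` are hypothetical names.

```python
import math


def percentile(samples_ms: list[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of data at or below it."""
    if not samples_ms:
        raise ValueError("samples_ms must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(samples_ms)
    # Multiply before dividing so integer-valued q stays exact in floating point.
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(samples_ms: list[float]) -> dict[str, float]:
    """Collect the three percentiles shown in the report tables."""
    return {f"p{q}": percentile(samples_ms, q) for q in (50, 95, 99)}


if __name__ == "__main__":
    print(latency_summary([float(i) for i in range(1, 101)]))
```

With only a handful of samples per dimension, p95 and p99 are dominated by the single slowest case, so a large gap between p50 and p95 (as in the verification dimension) may reflect one slow test rather than a general slowdown.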
+
+### 4. Tool Search
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [72.2%, 100.0%] |
+| Precision | 83.3% |
+| Recall | 83.3% |
+| F1 | 83.3% |
+| Latency p50 | 0.01ms |
+| Latency p95 | 0.02ms |
+| Latency p99 | 0.02ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 10 / 10 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| exact_match | 5 | 5 | 100.0% |
+| fuzzy_match | 2 | 2 | 100.0% |
+| no_match | 2 | 2 | 100.0% |
+| top_k | 1 | 1 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 7 | 7 | 100.0% |
+| medium | 3 | 3 | 100.0% |
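
Precision and Recall are described as macro-averaged over classes, with F1 as the harmonic mean of the two. The sketch below shows that computation; `macro_prf` is a hypothetical helper, and scoring a class with no samples as 0.0 is an assumed convention. Under that convention a macro average can fall below 100% even when every case passes: five correctly predicted classes plus one label that never occurs average to 5/6, about 83.3%. Whether that is what produces the 83.3% above is not confirmed by the source.

```python
from typing import Sequence


def macro_prf(
    expected: Sequence[str],
    predicted: Sequence[str],
    labels: Sequence[str],
) -> tuple[float, float, float]:
    """Macro-averaged precision and recall, plus their harmonic mean as F1.

    Assumption: a class with an empty denominator contributes 0.0.
    """
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not labels:
        raise ValueError("labels must not be empty")
    precisions, recalls = [], []
    pairs = list(zip(expected, predicted))
    for label in labels:
        tp = sum(1 for e, p in pairs if e == label and p == label)
        fp = sum(1 for e, p in pairs if e != label and p == label)
        fn = sum(1 for e, p in pairs if e == label and p != label)
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = ["a", "b", "c", "d", "e"]
    print(macro_prf(truth, truth, labels=["a", "b", "c", "d", "e", "f"]))
```

The same convention would also yield 0.0% for a dimension that defines no class labels at all, so the zero-division policy is worth stating explicitly wherever these metrics are reported.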
+
+### 5. Event Model
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [61.0%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 0.04ms |
+| Latency p95 | 15.68ms |
+| Latency p99 | 19.84ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 6 / 6 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| sq_lifecycle | 3 | 3 | 100.0% |
+| eq_lifecycle | 3 | 3 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 6 | 6 | 100.0% |
+
+### 6. Spec Management
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [64.6%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 1.41ms |
+| Latency p95 | 3.60ms |
+| Latency p99 | 4.04ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 7 / 7 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| crud | 5 | 5 | 100.0% |
+| edge | 2 | 2 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 6 | 6 | 100.0% |
+| medium | 1 | 1 | 100.0% |
+
+### 7. Verification Loop
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [56.5%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 25.44ms |
+| Latency p95 | 413.42ms |
+| Latency p99 | 488.32ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 5 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| basic | 2 | 2 | 100.0% |
+| retry | 1 | 1 | 100.0% |
+| timeout | 1 | 1 | 100.0% |
+| multi | 1 | 1 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 2 | 2 | 100.0% |
+| medium | 3 | 3 | 100.0% |
+
+## Baseline Comparison
+
+| Dimension | Baseline accuracy | Current accuracy | Change |
+|---|---|---|---|
+| preprocessing | 100.0% | 100.0% | — |
+| overfitting | 100.0% | 100.0% | — |
+| efficiency | 100.0% | 100.0% | — |
+| tool_search | 100.0% | 100.0% | — |
+| event_model | 100.0% | 100.0% | — |
+| spec_management | 100.0% | 100.0% | — |
+| verification | 100.0% | 100.0% | — |
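
The Change column marks each dimension as up, down, or unchanged relative to the stored baseline. A minimal sketch of that comparison follows. The real layout of `baseline.json` is not shown in the source, so a flat mapping from dimension name to accuracy is assumed here, and `compare_to_baseline` is a hypothetical helper.

```python
def compare_to_baseline(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 1e-9,
) -> dict[str, str]:
    """Map each current dimension to a trend marker: ↑, ↓, — or "new".

    Assumption: both inputs map dimension name -> accuracy in [0, 1].
    """
    trends: dict[str, str] = {}
    for dimension, accuracy in current.items():
        previous = baseline.get(dimension)
        if previous is None:
            trends[dimension] = "new"  # no baseline entry to compare against
        elif accuracy > previous + tolerance:
            trends[dimension] = "↑"
        elif accuracy < previous - tolerance:
            trends[dimension] = "↓"
        else:
            trends[dimension] = "—"
    return trends


if __name__ == "__main__":
    base = {"preprocessing": 1.0, "tool_search": 0.9}
    now = {"preprocessing": 1.0, "tool_search": 0.8, "verification": 1.0}
    print(compare_to_baseline(base, now))
```

The small tolerance keeps floating-point noise from being reported as a regression; a CI job could fail the build whenever any dimension is marked ↓.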
+
+## Issues and Recommendations
+
+- **verification**: P95 latency is high at 413.42ms; performance optimization is recommended
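
The prompt changes in this commit name two alert thresholds: P95 latency above 100ms (performance problem) and Consistency below 100% (overfitting risk). The sketch below applies those two rules to a dimension's metrics; `flag_metrics` and its message wording are hypothetical, not the report generator's actual code.

```python
P95_LIMIT_MS = 100.0      # threshold stated in the prompt's output requirements
CONSISTENCY_FLOOR = 1.0   # anything below 100% is flagged


def flag_metrics(dimension: str, p95_ms: float, consistency: float) -> list[str]:
    """Return warning strings for one dimension; an empty list means no alerts."""
    warnings: list[str] = []
    if p95_ms > P95_LIMIT_MS:
        warnings.append(
            f"{dimension}: P95 latency {p95_ms:.2f}ms exceeds {P95_LIMIT_MS:.0f}ms"
        )
    if consistency < CONSISTENCY_FLOOR:
        warnings.append(
            f"{dimension}: consistency {consistency:.1%} is below 100%, possible overfitting"
        )
    return warnings


if __name__ == "__main__":
    print(flag_metrics("verification", 413.42, 1.0))
    print(flag_metrics("event_model", 15.68, 1.0))
```

Applied to the figures in this report, only verification (P95 of 413.42ms) crosses a threshold, which matches the single recommendation listed above.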
@@ -1517,3 +1517,95 @@ class TestComprehensiveReport:
         total_score = json_report["total_score"]
         print(f"\nOverall score: {total_score:.1f}%")
         assert total_score >= 80.0, f"Total score {total_score:.1f}% is below 80% threshold"
+
+
+# ═══════════════════════════════════════════════════════════════════════════
+# 10. Standard benchmark framework integration
+# ═══════════════════════════════════════════════════════════════════════════
+
+
+@pytest.mark.e2e_capability
+class TestStandardBenchmarkIntegration:
+    """Tests for the standard benchmark framework integration."""
+
+    def test_benchmark_task_creation(self) -> None:
+        """A BenchmarkTask can be created correctly."""
+        from agentkit.cli.benchmark import BenchmarkTask
+
+        task = BenchmarkTask(
+            task_id="test-001",
+            dimension="preprocessing",
+            category="greeting",
+            difficulty="easy",
+            input="你好",
+            expected="direct_chat",
+            tags=["regex", "chinese"],
+            description="test case",
+            paraphrases=[],
+        )
+        assert task.task_id == "test-001"
+        assert task.dimension == "preprocessing"
+
+    def test_metric_set_prf(self) -> None:
+        """MetricSet stores the P/R/F1 values it is given."""
+        from agentkit.cli.benchmark import MetricSet
+
+        m = MetricSet(
+            accuracy=0.9,
+            precision=0.95,
+            recall=0.85,
+            f1=0.90,
+            latency_p50_ms=1.0,
+            latency_p95_ms=2.0,
+            latency_p99_ms=3.0,
+            consistency=1.0,
+            total=100,
+            passed=90,
+            failed=10,
+        )
+        assert m.f1 == 0.90
+        assert m.precision == 0.95
+
+    def test_benchmark_runs_successfully(self) -> None:
+        """The benchmark function runs successfully in fast mode."""
+        from agentkit.cli.benchmark import BenchmarkDimension, benchmark
+
+        # Fast mode, no report, no terminal output.
+        # Only checks that no exception is raised.
+        try:
+            benchmark(
+                dimension=BenchmarkDimension.ALL,
+                report=False,
+                fast=True,
+                verbose=False,
+                runs=1,
+                output_dir="test-results/benchmark",
+                format="json",
+            )
+        except SystemExit:
+            pass  # benchmark may exit via typer.Exit
+
+    def test_report_generation(self, tmp_path: Path) -> None:
+        """Report files are generated correctly."""
+        import os
+
+        from agentkit.cli.benchmark import BenchmarkDimension, benchmark
+
+        out_dir = str(tmp_path / "benchmark")
+        try:
+            benchmark(
+                dimension=BenchmarkDimension.ALL,
+                report=True,
+                fast=True,
+                verbose=False,
+                runs=1,
+                output_dir=out_dir,
+                format="markdown",
+            )
+        except SystemExit:
+            pass
+        # Check that the report files were generated.
+        json_path = os.path.join(out_dir, "benchmark_report.json")
+        md_path = os.path.join(out_dir, "benchmark_report.md")
+        assert os.path.exists(json_path), f"JSON report not found: {json_path}"
+        assert os.path.exists(md_path), f"Markdown report not found: {md_path}"
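
`test_metric_set_prf` only checks that `MetricSet` stores the values passed in; it does not check that `f1` is derived from `precision` and `recall`. As a side note, the harmonic-mean definition of F1 given in the prompt can be checked against those same inputs with a small standalone function. `f1_score` below is a hypothetical helper, not part of `agentkit.cli.benchmark`.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Same precision/recall values as the MetricSet in the test above.
    print(round(f1_score(0.95, 0.85), 4))
```

For precision 0.95 and recall 0.85 the harmonic mean is about 0.897, so the `f1=0.90` in the test is a rounded placeholder. If `MetricSet` ever starts computing F1 itself, the equality assertion on 0.90 would need a tolerance.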