refactor: standardize benchmark with industry methodology (P/R/F1, multi-run, baseline)

2026-06-17 12:01:34 +08:00 · 2026-06-17 12:01:34 +08:00 · 1fbfd9d132
parent d361177cc7
commit 1fbfd9d132
6 changed files with 5054 additions and 1126 deletions
--- a/configs/skills/benchmark_runner.yaml
+++ b/configs/skills/benchmark_runner.yaml
@@ -36,7 +36,9 @@ prompt:
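  identity: "You are the AgentKit capability benchmark assistant, responsible for running the capability tests for each dimension and generating evaluation reports."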
  instructions: |
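    ## Responsibilities
-    Run the AgentKit capability benchmark as requested by the user and generate a comprehensive evaluation report.
+    Run the AgentKit capability benchmark as requested by the user and generate a standardized evaluation report.
+    Follow industry benchmark methodology (in the style of SWE-bench / AgentBench / ToolBench) and
+    report the full metric set: Accuracy / Precision / Recall / F1 / Latency / Consistency.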

    ## Available commands

@@ -44,13 +46,14 @@ prompt:
    ```bash
    python3 -m agentkit.cli.main benchmark --report --verbose
    ```
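-    Runs all 51 test cases across the 7 dimensions and generates JSON + TXT reports.
+    Runs all 53 standardized test cases across the 7 dimensions and generates JSON + Markdown reports.
+    By default it runs 3 times and reports mean ± standard deviation with a 95% Wilson confidence interval.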

    ### Fast benchmark
    ```bash
    python3 -m agentkit.cli.main benchmark --fast --report
    ```
-    Runs the core cases (about 23); suited to quick checks during development.
+    Runs the core cases (about 22); suited to quick checks during development.

    ### Single-dimension benchmark
    ```bash
@@ -58,16 +61,42 @@ prompt:
    ```
    Available dimensions: preprocessing, overfitting, efficiency, tool_search, event_model, spec_management, verification

+    ### Multiple runs with averaging (--runs)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --runs 5 --report
+    ```
+    Sets the number of runs (default 3) and computes accuracy_mean ± accuracy_std plus the 95% confidence interval.
+    Suited to stability evaluation and regression detection.
+
+    ### Baseline comparison (--baseline)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --baseline --report
+    ```
+    The first run creates the baseline (baseline.json) automatically; later runs are compared against it and show the ↑/↓ trend.
+    Suited to CI/CD regression monitoring.
+
+    ### Markdown report (default)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --report --format markdown
+    ```
+    Generates a human-readable Markdown report with metric tables, failed-case analysis, and improvement suggestions.
+
    ### HTML report
    ```bash
    python3 -m agentkit.cli.main benchmark --report --format html
    ```

+    ### JSON report
+    ```bash
+    python3 -m agentkit.cli.main benchmark --report --format json
+    ```
+    Generates only the JSON report; suited to machine parsing and CI integration.
+
    ### Comprehensive pytest benchmark
    ```bash
-    python3 -m pytest tests/e2e/test_capability_comprehensive.py -v
+    python3 -m pytest tests/e2e/test_capability_comprehensive.py -v -m e2e_capability
    ```
-    Runs 60 tests (8 dimensions) and generates comprehensive_report.
+    Runs 64 tests (10 dimensions, including the standard Benchmark framework integration tests) and generates comprehensive_report.

    ### Specifying the output directory
    ```bash
@@ -75,24 +104,37 @@ prompt:
    ```

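    ## Test dimensions
+    Every dimension reports the following standardized metrics:
+    - **Accuracy**: fraction of cases passed (pass rate)
+    - **Precision**: macro-averaged (multi-class)
+    - **Recall**: macro-averaged (multi-class)
+    - **F1**: harmonic mean of Precision and Recall
+    - **Latency p50/p95/p99**: latency percentiles in milliseconds
+    - **Consistency**: stability under paraphrased input (overfitting detection)
+    - **95% CI**: Wilson confidence interval (for multiple runs)
+
+    Dimensions:
    1. **preprocessing**: preprocessing accuracy (greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT)
-    2. **overfitting**: overfitting detection (consistency across different phrasings of the same intent)
-    3. **efficiency**: execution efficiency (preprocessing latency < 50ms, tool-search latency < 10ms)
-    4. **tool_search**: tool-search accuracy (BM25 relevance ranking)
+    2. **overfitting**: overfitting detection (consistency across different phrasings of the same intent; Consistency metric)
+    3. **efficiency**: execution efficiency (preprocessing latency < 50ms, tool-search latency < 10ms; Latency metric)
+    4. **tool_search**: tool-search accuracy (BM25 relevance ranking; P/R/F1 metrics)
    5. **event_model**: event-model integrity (SQ/EQ dual-queue lifecycle)
    6. **spec_management**: Spec management (CRUD operations)
    7. **verification**: verification loop (verify/retry behavior)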

    ## Report locations
-    - CLI reports: `test-results/benchmark/benchmark_report.{json,txt,html}`
+    - CLI reports: `test-results/benchmark/benchmark_report.{json,md,html}`
+    - Baseline file: `test-results/benchmark/baseline.json` (created when --baseline is used)
    - pytest reports: `test-results/e2e/comprehensive_report.{json,txt}`

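    ## Output requirements
    1. Run the test command
-    2. Read the generated report files
-    3. Show the user a summary table of the results
-    4. If any cases fail, analyze the cause and suggest improvements
-    5. Compare against historical reports (if any) and show the trend
+    2. Read the generated report files (JSON + Markdown)
+    3. Show the user a summary table of the results with each dimension's Accuracy / P / R / F1 / Latency
+    4. If any cases fail, analyze the root cause (wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded)
+    5. Compare against the baseline report (when --baseline is used) and show the ↑/↓ change in each dimension's accuracy
+    6. Watch the key metrics: flag a performance problem when P95 latency > 100ms and an overfitting risk when Consistency < 100%
+    7. Give targeted improvement suggestions based on the metric data rather than subjective judgment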

 llm:
  model: "default"
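
Reviewer note on output requirement 6: the prompt asks for a performance warning when P95 latency exceeds 100ms and an overfitting warning when Consistency drops below 100%. The sketch below shows one way those two checks could be expressed. It is only an illustration: `percentile` and `review_flags` are hypothetical names, the nearest-rank percentile method is an assumption (the diff does not show which method `benchmark.py` uses), and Consistency is assumed to be a fraction in [0, 1].

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile: the smallest sample with at least q% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Rank is 1-based; clamp to 1 so that q = 0 returns the minimum.
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def review_flags(latencies_ms, consistency):
    """Apply the two thresholds named in the prompt (assumed units: ms and a 0..1 fraction)."""
    flags = []
    if percentile(latencies_ms, 95) > 100.0:
        flags.append("performance")
    if consistency < 1.0:
        flags.append("overfitting_risk")
    return flags


if __name__ == "__main__":
    print(review_flags([10, 20, 30, 40, 500], consistency=1.0))
```

With only a handful of cases per dimension, a nearest-rank P95 is simply the slowest sample, so a single slow case is enough to trigger the warning.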
--- a/src/agentkit/cli/benchmark.py
+++ b/src/agentkit/cli/benchmark.py
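
Reviewer note: the `benchmark.py` hunks are not shown in this view, so the actual metric code cannot be checked here. As a rough sketch of what the prompt describes (macro-averaged Precision and Recall, with F1 as their harmonic mean), the computation could look like the following. `macro_prf` is a hypothetical helper, not a function from the commit, and the mode labels in the usage line are illustrative.

```python
from collections import Counter


def macro_prf(expected, predicted):
    """Macro-averaged precision and recall over all labels, and F1 as their harmonic mean."""
    labels = sorted(set(expected) | set(predicted))
    if not labels:
        return 0.0, 0.0, 0.0
    tp, fp, fn = Counter(), Counter(), Counter()
    for want, got in zip(expected, predicted):
        if want == got:
            tp[want] += 1
        else:
            fp[got] += 1   # predicted label was wrong
            fn[want] += 1  # expected label was missed
    precisions, recalls = [], []
    for label in labels:
        p_den = tp[label] + fp[label]
        r_den = tp[label] + fn[label]
        # A label with no predictions (or no expected cases) contributes 0.0.
        precisions.append(tp[label] / p_den if p_den else 0.0)
        recalls.append(tp[label] / r_den if r_den else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    want = ["direct_chat", "direct_chat", "react", "react"]
    got = ["direct_chat", "react", "react", "react"]
    print(macro_prf(want, got))
```

Note that taking F1 from the macro-averaged Precision and Recall, as the prompt text describes, is not the same as averaging per-class F1 scores; the two conventions can give different numbers.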
--- a/test-results/benchmark/baseline.json
+++ b/test-results/benchmark/baseline.json
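
Reviewer note: the contents of `baseline.json` are not shown in this view. The prompt says later runs are compared against the baseline with ↑/↓ markers, which could be sketched as below. This assumes, purely for illustration, that both the baseline and the current results reduce to a flat mapping from dimension name to accuracy; the real file schema may differ, and `compare_to_baseline` is a hypothetical name. The sketch uses `=` for "no change".

```python
def compare_to_baseline(baseline, current, tolerance=1e-9):
    """Return (dimension, baseline_acc, current_acc, trend) rows, sorted by dimension."""
    rows = []
    for dimension in sorted(current):
        now = current[dimension]
        before = baseline.get(dimension)
        if before is None:
            trend = "new"  # dimension missing from the baseline
        elif now > before + tolerance:
            trend = "↑"
        elif now < before - tolerance:
            trend = "↓"
        else:
            trend = "="
        rows.append((dimension, before, now, trend))
    return rows


if __name__ == "__main__":
    baseline = {"preprocessing": 1.0, "tool_search": 0.9}
    current = {"preprocessing": 1.0, "tool_search": 0.8, "verification": 1.0}
    for row in compare_to_baseline(baseline, current):
        print(row)
```

In a CI job, any `↓` row could be turned into a non-zero exit code so that a regression fails the build.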
--- a/test-results/benchmark/benchmark_report.json
+++ b/test-results/benchmark/benchmark_report.json
--- a/test-results/benchmark/benchmark_report.md
+++ b/test-results/benchmark/benchmark_report.md
@@ -0,0 +1,246 @@
+# AgentKit Capability Benchmark Report
+
+## Summary
+- Time: 2026-06-17T04:00:50.738066+00:00
+- Version: 0.1.0
+- Runs: 3
+- Overall accuracy: 100.0% ± 0.0%
+
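+## Comparison with industry benchmarks
+
+| Benchmark | What it tests | AgentKit counterpart |
+|---|---|---|
+| SWE-bench | LLM code repair | N/A (tests the LLM, not the framework) |
+| ToolBench | Tool invocation | tool_search dimension |
+| AgentBench | Agent systems | All dimensions |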
+
+## Results by dimension
+
+### 1. Preprocessing Accuracy
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [79.6%, 100.0%] |
+| Precision | 100.0% |
+| Recall | 100.0% |
+| F1 | 100.0% |
+| Latency p50 | 0.01ms |
+| Latency p95 | 0.03ms |
+| Latency p99 | 0.06ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 15 / 15 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| greeting | 4 | 4 | 100.0% |
+| tool_query | 5 | 5 | 100.0% |
+| skill_prefix | 3 | 3 | 100.0% |
+| complex | 3 | 3 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 5 | 5 | 100.0% |
+| medium | 7 | 7 | 100.0% |
+| hard | 3 | 3 | 100.0% |
+
+### 2. Overfitting Detection
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [56.5%, 100.0%] |
+| Precision | 100.0% |
+| Recall | 100.0% |
+| F1 | 100.0% |
+| Latency p50 | 0.04ms |
+| Latency p95 | 0.06ms |
+| Latency p99 | 0.07ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 5 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| ip_check | 1 | 1 | 100.0% |
+| search | 1 | 1 | 100.0% |
+| greeting | 1 | 1 | 100.0% |
+| tool_use | 1 | 1 | 100.0% |
+| complex | 1 | 1 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| medium | 3 | 3 | 100.0% |
+| easy | 1 | 1 | 100.0% |
+| hard | 1 | 1 | 100.0% |
+
+### 3. Efficiency
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [56.5%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 0.40ms |
+| Latency p95 | 0.77ms |
+| Latency p99 | 0.82ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 5 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| preprocess_latency | 3 | 3 | 100.0% |
+| tool_search_latency | 2 | 2 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 2 | 2 | 100.0% |
+| medium | 3 | 3 | 100.0% |
+
+### 4. Tool Search
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [72.2%, 100.0%] |
+| Precision | 83.3% |
+| Recall | 83.3% |
+| F1 | 83.3% |
+| Latency p50 | 0.01ms |
+| Latency p95 | 0.02ms |
+| Latency p99 | 0.02ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 10 / 10 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| exact_match | 5 | 5 | 100.0% |
+| fuzzy_match | 2 | 2 | 100.0% |
+| no_match | 2 | 2 | 100.0% |
+| top_k | 1 | 1 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 7 | 7 | 100.0% |
+| medium | 3 | 3 | 100.0% |
+
+### 5. Event Model
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [61.0%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 0.04ms |
+| Latency p95 | 15.68ms |
+| Latency p99 | 19.84ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 6 / 6 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| sq_lifecycle | 3 | 3 | 100.0% |
+| eq_lifecycle | 3 | 3 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 6 | 6 | 100.0% |
+
+### 6. Spec Management
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [64.6%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 1.41ms |
+| Latency p95 | 3.60ms |
+| Latency p99 | 4.04ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 7 / 7 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| crud | 5 | 5 | 100.0% |
+| edge | 2 | 2 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 6 | 6 | 100.0% |
+| medium | 1 | 1 | 100.0% |
+
+### 7. Verification Loop
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [56.5%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 25.44ms |
+| Latency p95 | 413.42ms |
+| Latency p99 | 488.32ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 5 / 0 |
+
+#### By category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| basic | 2 | 2 | 100.0% |
+| retry | 1 | 1 | 100.0% |
+| timeout | 1 | 1 | 100.0% |
+| multi | 1 | 1 | 100.0% |
+
+#### By difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 2 | 2 | 100.0% |
+| medium | 3 | 3 | 100.0% |
+
+## Baseline comparison
+
+| Dimension | Baseline accuracy | Current accuracy | Change |
+|---|---|---|---|
+| preprocessing | 100.0% | 100.0% | — |
+| overfitting | 100.0% | 100.0% | — |
+| efficiency | 100.0% | 100.0% | — |
+| tool_search | 100.0% | 100.0% | — |
+| event_model | 100.0% | 100.0% | — |
+| spec_management | 100.0% | 100.0% | — |
+| verification | 100.0% | 100.0% | — |
+
+## Issues and recommendations
+
+- **verification**: P95 latency of 413.42ms is high; consider optimizing performance
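
Reviewer note on the 95% CI rows in this report: they are consistent with a Wilson score interval computed over the per-dimension case count. The sketch below is a standalone implementation of the standard Wilson formula with z = 1.96, written for illustration; `wilson_interval` is a hypothetical name and this is not necessarily the code in `benchmark.py`.

```python
import math


def wilson_interval(passed, total, z=1.96):
    """Wilson score interval for a binomial proportion; returns (lower, upper) as fractions."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    z2 = z * z
    denom = 1 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    for n in (15, 10, 7, 6, 5):
        low, high = wilson_interval(n, n)
        print(f"{n}/{n}: [{low:.4f}, {high:.4f}]")
```

For 15 of 15 passes this gives a lower bound of about 0.796, in line with the [79.6%, 100.0%] row for preprocessing; the other rows agree to within one rounding step. The wide intervals are a reminder that a 100% pass rate on 5 to 15 cases still leaves considerable uncertainty about the underlying accuracy.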
--- a/tests/e2e/test_capability_comprehensive.py
+++ b/tests/e2e/test_capability_comprehensive.py
@@ -1517,3 +1517,95 @@ class TestComprehensiveReport:
        total_score = json_report["total_score"]
        print(f"\n总体评分: {total_score:.1f}%")
        assert total_score >= 80.0, f"Total score {total_score:.1f}% is below 80% threshold"
+
+
+# ═══════════════════════════════════════════════════════════════════════════
+# 10. Standard Benchmark framework integration
+# ═══════════════════════════════════════════════════════════════════════════
+
+
+@pytest.mark.e2e_capability
+class TestStandardBenchmarkIntegration:
+    """Tests for the standard Benchmark framework integration."""
+
+    def test_benchmark_task_creation(self) -> None:
+        """BenchmarkTask can be constructed correctly."""
+        from agentkit.cli.benchmark import BenchmarkTask
+
+        task = BenchmarkTask(
+            task_id="test-001",
+            dimension="preprocessing",
+            category="greeting",
+            difficulty="easy",
+            input="你好",
+            expected="direct_chat",
+            tags=["regex", "chinese"],
+            description="测试用例",
+            paraphrases=[],
+        )
+        assert task.task_id == "test-001"
+        assert task.dimension == "preprocessing"
+
+    def test_metric_set_prf(self) -> None:
+        """Test the MetricSet P/R/F1 values."""
+        from agentkit.cli.benchmark import MetricSet
+
+        m = MetricSet(
+            accuracy=0.9,
+            precision=0.95,
+            recall=0.85,
+            f1=0.90,
+            latency_p50_ms=1.0,
+            latency_p95_ms=2.0,
+            latency_p99_ms=3.0,
+            consistency=1.0,
+            total=100,
+            passed=90,
+            failed=10,
+        )
+        assert m.f1 == 0.90
+        assert m.precision == 0.95
+
+    def test_benchmark_runs_successfully(self) -> None:
+        """The benchmark function runs successfully in fast mode."""
+        from agentkit.cli.benchmark import BenchmarkDimension, benchmark
+
+        # Fast mode: no report files and no terminal output.
+        # Only checks that no exception is raised.
+        try:
+            benchmark(
+                dimension=BenchmarkDimension.ALL,
+                report=False,
+                fast=True,
+                verbose=False,
+                runs=1,
+                output_dir="test-results/benchmark",
+                format="json",
+            )
+        except SystemExit:
+            pass  # benchmark may exit via typer.Exit
+
+    def test_report_generation(self, tmp_path: Path) -> None:
+        """Report files are generated correctly."""
+        import os
+
+        from agentkit.cli.benchmark import BenchmarkDimension, benchmark
+
+        out_dir = str(tmp_path / "benchmark")
+        try:
+            benchmark(
+                dimension=BenchmarkDimension.ALL,
+                report=True,
+                fast=True,
+                verbose=False,
+                runs=1,
+                output_dir=out_dir,
+                format="markdown",
+            )
+        except SystemExit:
+            pass
+        # Verify that the report files were generated
+        json_path = os.path.join(out_dir, "benchmark_report.json")
+        md_path = os.path.join(out_dir, "benchmark_report.md")
+        assert os.path.exists(json_path), f"JSON report not found: {json_path}"
+        assert os.path.exists(md_path), f"Markdown report not found: {md_path}"