diff --git a/configs/skills/benchmark_runner.yaml b/configs/skills/benchmark_runner.yaml
index f3805df..159ccbf 100644
--- a/configs/skills/benchmark_runner.yaml
+++ b/configs/skills/benchmark_runner.yaml
@@ -36,7 +36,9 @@ prompt:
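   identity: "You are the AgentKit capability backtest assistant, responsible for running the capability tests for each dimension and generating evaluation reports."
   instructions: |
     ## Responsibilities
-    Run AgentKit capability backtests as requested by the user and generate a comprehensive evaluation report.
+    Run AgentKit capability backtests as requested by the user and generate a standardized evaluation report.
+    Follow the methodology of industry benchmarks (in the style of SWE-bench / AgentBench / ToolBench)
+    and report the full metric set: Accuracy / Precision / Recall / F1 / Latency / Consistency.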
 
     ## Available commands
 
@@ -44,13 +46,14 @@ prompt:
     ```bash
     python3 -m agentkit.cli.main benchmark --report --verbose
     ```
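-    Runs all 51 test cases across the 7 dimensions and generates JSON + TXT reports.
+    Runs all 53 standardized test cases across the 7 dimensions and generates JSON + Markdown reports.
+    By default, runs 3 times and reports mean ± standard deviation with a 95% Wilson confidence interval.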
 
     ### Quick backtest
     ```bash
     python3 -m agentkit.cli.main benchmark --fast --report
     ```
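-    Runs the core cases (about 23); suitable for quick validation during development.
+    Runs the core cases (about 22); suitable for quick validation during development.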
 
     ### Single-dimension backtest
     ```bash
@@ -58,16 +61,42 @@ prompt:
     ```
     Available dimensions: preprocessing, overfitting, efficiency, tool_search, event_model, spec_management, verification
 
+    ### Averaging over multiple runs (--runs)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --runs 5 --report
+    ```
+    Sets the number of runs (default 3) and computes accuracy_mean ± accuracy_std with a 95% confidence interval.
+    Useful for stability assessment and regression detection.
+
+    ### Baseline comparison (--baseline)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --baseline --report
+    ```
+    The first run creates the baseline (baseline.json) automatically; later runs are compared against it and show ↑/↓ trends.
+    Useful for CI/CD regression monitoring.
+
+    ### Markdown report (default)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --report --format markdown
+    ```
+    Generates a human-readable Markdown report with metric tables, failed-case analysis, and improvement suggestions.
+
     ### HTML report
     ```bash
     python3 -m agentkit.cli.main benchmark --report --format html
     ```
 
+    ### JSON report
+    ```bash
+    python3 -m agentkit.cli.main benchmark --report --format json
+    ```
+    Generates only the JSON report; suitable for machine parsing and CI integration.
+
     ### Comprehensive backtest with pytest
     ```bash
-    python3 -m pytest tests/e2e/test_capability_comprehensive.py -v
+    python3 -m pytest tests/e2e/test_capability_comprehensive.py -v -m e2e_capability
     ```
-    Runs 60 tests (8 dimensions) and generates comprehensive_report.
+    Runs 64 tests (10 dimensions, including integration tests for the standard benchmark framework) and generates comprehensive_report.
 
     ### Specifying the output directory
     ```bash
@@ -75,24 +104,37 @@ prompt:
     ```
 
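     ## Test dimensions
+    Each dimension reports the following standardized metrics:
+    - **Accuracy**: fraction of cases that pass (pass rate)
+    - **Precision**: macro-averaged precision (multi-class)
+    - **Recall**: macro-averaged recall (multi-class)
+    - **F1**: harmonic mean of Precision and Recall
+    - **Latency p50/p95/p99**: latency percentiles (milliseconds)
+    - **Consistency**: routing stability across paraphrased inputs (overfitting detection)
+    - **95% CI**: Wilson confidence interval (when running multiple times)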
+
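+    Dimension list:
     1. **preprocessing**: preprocessing accuracy (greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT)
-    2. **overfitting**: overfitting detection, i.e. consistency across different phrasings of the same intent
-    3. **efficiency**: execution efficiency, with preprocessing latency < 50ms and tool search latency < 10ms
-    4. **tool_search**: tool search accuracy, based on BM25 relevance ranking
+    2. **overfitting**: overfitting detection, i.e. consistency across different phrasings of the same intent (Consistency metric)
+    3. **efficiency**: execution efficiency, with preprocessing latency <= 50ms and tool search latency <= 10ms (Latency metric)
+    4. **tool_search**: tool search accuracy, based on BM25 relevance ranking (P/R/F1 metrics)
     5. **event_model**: event model integrity (SQ/EQ dual-queue lifecycle)
     6. **spec_management**: spec management (CRUD operations)
     7. **verification**: verification loop (verify/retry behavior)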
 
     ## Report locations
-    - CLI reports: `test-results/benchmark/benchmark_report.{json,txt,html}`
+    - CLI reports: `test-results/benchmark/benchmark_report.{json,md,html}`
+    - Baseline file: `test-results/benchmark/baseline.json` (created when --baseline is used)
     - pytest reports: `test-results/e2e/comprehensive_report.{json,txt}`
 
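     ## Output requirements
     1. Run the test command
-    2. Read the generated report files
-    3. Show the user a summary table of the results
-    4. If any cases fail, analyze the cause and suggest improvements
-    5. Compare against historical reports (if any) and show the trend
+    2. Read the generated report files (JSON + Markdown)
+    3. Show the user a summary table of the results, including Accuracy / P / R / F1 / Latency for each dimension
+    4. If any cases fail, analyze the root cause (wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded)
+    5. Compare against the baseline report (when --baseline is used) and show the ↑/↓ trend in accuracy for each dimension
+    6. Watch the key metrics: flag a performance problem when P95 latency > 100ms, and flag overfitting risk when Consistency < 100%
+    7. Give targeted improvement suggestions grounded in the metric data rather than subjective judgment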
 
 llm:
   model: "default"
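Reviewer note (illustrative, not part of the patch): the prompt above refers to a 95% Wilson confidence interval and to macro-averaged Precision / Recall. The sketch below shows one way these could be computed, assuming per-case expected and actual labels are available as plain strings. The names `wilson_interval` and `macro_prf` are hypothetical and are not taken from this patch. Computing F1 as the harmonic mean of the macro-averaged precision and recall is an assumption based on the wording of the metric list; averaging per-class F1 scores is another common convention and gives different numbers.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if total == 0:
        return (0.0, 0.0)
    p_hat = successes / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p_hat + z2 / (2 * total)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / total + z2 / (4 * total * total)
    )
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return (max(0.0, center - half_width), min(1.0, center + half_width))


def macro_prf(expected: list[str], actual: list[str]) -> tuple[float, float, float]:
    """Macro-averaged precision and recall, plus F1 derived from those two averages."""
    labels = sorted(set(expected) | set(actual))
    if not labels:
        return (0.0, 0.0, 0.0)
    precisions: list[float] = []
    recalls: list[float] = []
    for label in labels:
        pairs = list(zip(expected, actual))
        tp = sum(1 for e, a in pairs if e == label and a == label)
        fp = sum(1 for e, a in pairs if e != label and a == label)
        fn = sum(1 for e, a in pairs if e == label and a != label)
        # A class with no predictions (or no true instances) contributes 0.0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    precision = sum(precisions) / len(labels)
    recall = sum(recalls) / len(labels)
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return (precision, recall, f1)
```

With 53 of 53 cases passing, the Wilson lower bound stays below 1.0, which is the main reason to prefer it over a normal-approximation interval for small task sets with pass rates near 100%.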
diff --git a/src/agentkit/cli/benchmark.py b/src/agentkit/cli/benchmark.py
index 45e7dd7..b52e257 100644
--- a/src/agentkit/cli/benchmark.py
+++ b/src/agentkit/cli/benchmark.py
@@ -1,4 +1,12 @@
-"""Benchmark CLI command — run capability backtests and generate reports.
+"""Benchmark CLI command — standardized capability benchmarking.
+
+Follows a benchmark methodology in the style of SWE-bench / AgentBench / ToolBench:
+- Standardized TaskSet with dimension/category/difficulty metadata
+- Full metrics: Accuracy / Precision / Recall / F1 / Latency p50,p95,p99 / Consistency
+- Multiple runs with mean ± std and 95% Wilson confidence interval
+- Failure root-cause classification (wrong_mode / wrong_tool / timeout / exception / ...)
+- Markdown + JSON + HTML report generation
+- Baseline comparison (↑/↓)
 
 Tests core AgentKit components directly (no pytest subprocess, no real LLM):
 - preprocessing: RequestPreprocessor routing accuracy
@@ -11,24 +19,30 @@ Tests core AgentKit components directly (no pytest subprocess, no real LLM):
 
 Usage:
     agentkit benchmark                          # run all dimensions
-    agentkit benchmark --dimension preprocessing
-    agentkit benchmark --report                 # JSON + TXT report
-    agentkit benchmark --report --format html   # + HTML report
-    agentkit benchmark --output-dir ./my-results
+    agentkit benchmark -d preprocessing         # single dimension
+    agentkit benchmark --report                 # generate reports
     agentkit benchmark --fast                   # core cases only
     agentkit benchmark --verbose                # detailed output
+    agentkit benchmark --format html            # HTML format
+    agentkit benchmark -o ./results             # output directory
+    agentkit benchmark --runs 3                 # multiple runs (default 3)
+    agentkit benchmark --baseline               # compare with baseline
+    agentkit benchmark --format markdown        # Markdown report (default)
 """
 
 from __future__ import annotations
 
 import asyncio
 import json
+import math
+import re
 import time
+from collections.abc import Awaitable, Callable
 from dataclasses import asdict, dataclass, field
 from datetime import datetime, timezone
 from enum import Enum
 from pathlib import Path
-from typing import Any
+from typing import TYPE_CHECKING
 
 import typer
 from rich.console import Console
@@ -42,6 +56,10 @@ from rich.progress import (
 )
 from rich.table import Table
 
+if TYPE_CHECKING:
+    from agentkit.chat.request_preprocessor import RequestPreprocessor
+    from agentkit.tools.search import ToolSearchIndex
+
 console = Console()
 
 _DEFAULT_OUTPUT_DIR = "test-results/benchmark"
@@ -61,20 +79,88 @@ class BenchmarkDimension(str, Enum):
 
 
 # ---------------------------------------------------------------------------
-# Result data structures
+# Data structures
 # ---------------------------------------------------------------------------
 
 
 @dataclass
-class TestCaseResult:
-    """Single test case result."""
+class BenchmarkTask:
+    """Standardized benchmark task definition.
 
-    case_id: str
+    Attributes:
+        task_id: Unique identifier (e.g. "prep-001").
+        dimension: Test dimension (preprocessing/overfitting/...).
+        category: Sub-category (greeting/tool_query/skill_prefix/...).
+        difficulty: easy / medium / hard.
+        input: Test input string.
+        expected: Expected output (execution mode, tool name, "passed", or threshold).
+        tags: Tag list for filtering (e.g. "regex", "bm25", "fallback").
+        description: Human-readable description.
+        paraphrases: Paraphrase list for overfitting detection.
+    """
+
+    task_id: str
+    dimension: str
+    category: str
+    difficulty: str
+    input: str
+    expected: str
+    tags: list[str]
+    description: str
+    paraphrases: list[str] = field(default_factory=list)
+
+
+@dataclass
+class ExecutionResult:
+    """Raw execution result from a single task invocation."""
+
+    actual: str
+    passed: bool
+    duration_ms: float
+    detail: str = ""
+    consistency: float = 1.0
+
+
+@dataclass
+class CaseResult:
+    """A single test case result with metadata."""
+
+    task_id: str
+    dimension: str
+    category: str
+    difficulty: str
     passed: bool
     expected: str
     actual: str
     duration_ms: float
+    root_cause: str = "none"
     detail: str = ""
+    consistency: float = 1.0
+
+
+@dataclass
+class MetricSet:
+    """Aggregated metrics for a group of cases.
+
+    Includes Accuracy / Precision / Recall / F1, latency percentiles,
+    consistency (overfitting), and multi-run statistics with 95% CI.
+    """
+
+    accuracy: float
+    precision: float
+    recall: float
+    f1: float
+    latency_p50_ms: float
+    latency_p95_ms: float
+    latency_p99_ms: float
+    consistency: float
+    total: int
+    passed: int
+    failed: int
+    accuracy_mean: float = 0.0
+    accuracy_std: float = 0.0
+    ci_lower: float = 0.0
+    ci_upper: float = 0.0
 
 
 @dataclass
@@ -82,40 +168,605 @@ class DimensionResult:
     """Aggregated result for one dimension."""
 
     dimension: str
-    total: int = 0
-    passed: int = 0
-    failed: int = 0
-    details: list[TestCaseResult] = field(default_factory=list)
+    metrics: MetricSet
+    cases: list[CaseResult]
+    by_category: dict[str, MetricSet]
+    by_difficulty: dict[str, MetricSet]
 
-    @property
-    def score(self) -> float:
-        return self.passed / self.total if self.total > 0 else 0.0
 
-    def add(self, case: TestCaseResult) -> None:
-        self.total += 1
-        if case.passed:
-            self.passed += 1
-        else:
-            self.failed += 1
-        self.details.append(case)
+@dataclass
+class BenchmarkContext:
+    """Shared context for benchmark execution."""
 
-    def to_dict(self) -> dict[str, Any]:
-        return {
-            "score": round(self.score, 4),
-            "total": self.total,
-            "passed": self.passed,
-            "failed": self.failed,
-            "details": [asdict(d) for d in self.details],
-        }
+    preprocessor: RequestPreprocessor
+    search_index: ToolSearchIndex
+    tmp_dir: Path
 
 
 # ---------------------------------------------------------------------------
-# Helpers — mock objects
+# Standardized TaskSet
 # ---------------------------------------------------------------------------
 
 
-def _make_mock_skill_registry():
-    """Build a SkillRegistry with a couple of mock skills for preprocessing tests."""
+TASK_SET: list[BenchmarkTask] = [
+    # === Preprocessing (15 tasks) ===
+    BenchmarkTask(
+        "prep-001",
+        "preprocessing",
+        "greeting",
+        "easy",
+        "你好",
+        "direct_chat",
+        ["regex", "chinese"],
+        "Chinese greeting should route to DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-002",
+        "preprocessing",
+        "greeting",
+        "easy",
+        "hello",
+        "direct_chat",
+        ["regex", "english"],
+        "English greeting should route to DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-003",
+        "preprocessing",
+        "greeting",
+        "easy",
+        "谢谢",
+        "direct_chat",
+        ["regex", "chitchat"],
+        "Thank-you phrase should route to DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-004",
+        "preprocessing",
+        "greeting",
+        "easy",
+        "你是谁",
+        "direct_chat",
+        ["regex", "identity"],
+        "Identity question should route to DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-005",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "搜索golang教程",
+        "react",
+        ["search", "default"],
+        "Search request should route to REACT",
+    ),
+    BenchmarkTask(
+        "prep-006",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "执行ls命令",
+        "react",
+        ["shell", "default"],
+        "Shell execution request should route to REACT",
+    ),
+    BenchmarkTask(
+        "prep-007",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "翻译hello为中文",
+        "react",
+        ["translate", "default"],
+        "Translation request should route to REACT",
+    ),
+    BenchmarkTask(
+        "prep-008",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "什么是机器学习",
+        "react",
+        ["knowledge", "default"],
+        "Knowledge query should route to REACT",
+    ),
+    BenchmarkTask(
+        "prep-009",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "帮我分析数据",
+        "react",
+        ["analysis", "default"],
+        "Analysis request should route to REACT",
+    ),
+    BenchmarkTask(
+        "prep-010",
+        "preprocessing",
+        "skill_prefix",
+        "medium",
+        "@skill:react_agent 查看ip",
+        "skill_react",
+        ["skill", "react"],
+        "Valid skill prefix should route to SKILL_REACT",
+    ),
+    BenchmarkTask(
+        "prep-011",
+        "preprocessing",
+        "skill_prefix",
+        "medium",
+        "@skill:chat_only 你好",
+        "direct_chat",
+        ["skill", "direct"],
+        "Skill prefix for a direct-mode skill should route to DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-012",
+        "preprocessing",
+        "skill_prefix",
+        "hard",
+        "@skill:nonexistent 做点什么",
+        "react",
+        ["skill", "fallback"],
+        "Invalid skill prefix should fall back to REACT",
+    ),
+    BenchmarkTask(
+        "prep-013",
+        "preprocessing",
+        "complex",
+        "hard",
+        "帮我分析这个数据并生成报告",
+        "react",
+        ["multi_step"],
+        "Complex multi-step task should route to REACT",
+    ),
+    BenchmarkTask(
+        "prep-014",
+        "preprocessing",
+        "complex",
+        "easy",
+        "随便聊聊",
+        "react",
+        ["chitchat", "default"],
+        "Chitchat that matches no pattern should fall back to REACT",
+    ),
+    BenchmarkTask(
+        "prep-015",
+        "preprocessing",
+        "complex",
+        "hard",
+        "请帮我完成以下任务：1. 查询天气 2. 生成报告",
+        "react",
+        ["multi_step"],
+        "Multi-step task should route to REACT",
+    ),
+    # === Overfitting (5 groups) ===
+    BenchmarkTask(
+        "over-001",
+        "overfitting",
+        "ip_check",
+        "medium",
+        "查下ip",
+        "react",
+        ["colloquial"],
+        "Paraphrase consistency for IP lookup",
+        paraphrases=["查下ip", "查看当前ip", "获取ip地址", "看下ip", "帮我查一下ip"],
+    ),
+    BenchmarkTask(
+        "over-002",
+        "overfitting",
+        "search",
+        "medium",
+        "搜索golang教程",
+        "react",
+        ["search"],
+        "Paraphrase consistency for search",
+        paraphrases=["搜索golang教程", "搜一下golang教程", "找下golang学习资料"],
+    ),
+    BenchmarkTask(
+        "over-003",
+        "overfitting",
+        "greeting",
+        "easy",
+        "你好",
+        "direct_chat",
+        ["greeting"],
+        "Paraphrase consistency for greetings",
+        paraphrases=["你好", "hello", "hi", "嗨", "哈喽"],
+    ),
+    BenchmarkTask(
+        "over-004",
+        "overfitting",
+        "tool_use",
+        "medium",
+        "执行ls命令",
+        "react",
+        ["shell"],
+        "Paraphrase consistency for tool use",
+        paraphrases=["执行ls命令", "运行ls", "跑一下ls"],
+    ),
+    BenchmarkTask(
+        "over-005",
+        "overfitting",
+        "complex",
+        "hard",
+        "帮我分析数据",
+        "react",
+        ["analysis"],
+        "Paraphrase consistency for complex tasks",
+        paraphrases=["帮我分析数据", "分析一下数据", "看看这些数据"],
+    ),
+    # === Efficiency (5 tasks) ===
+    BenchmarkTask(
+        "eff-001",
+        "efficiency",
+        "preprocess_latency",
+        "easy",
+        "你好",
+        "<=50ms",
+        ["greeting", "preprocess"],
+        "Greeting preprocessing latency <= 50ms",
+    ),
+    BenchmarkTask(
+        "eff-002",
+        "efficiency",
+        "preprocess_latency",
+        "medium",
+        "查下ip",
+        "<=50ms",
+        ["react", "preprocess"],
+        "REACT preprocessing latency <= 50ms",
+    ),
+    BenchmarkTask(
+        "eff-003",
+        "efficiency",
+        "preprocess_latency",
+        "medium",
+        "@skill:react_agent test",
+        "<=50ms",
+        ["skill", "preprocess"],
+        "Skill-prefix preprocessing latency <= 50ms",
+    ),
+    BenchmarkTask(
+        "eff-004",
+        "efficiency",
+        "tool_search_latency",
+        "medium",
+        "read file",
+        "<=10ms",
+        ["tool_search", "bm25"],
+        "Tool search latency <= 10ms",
+    ),
+    BenchmarkTask(
+        "eff-005",
+        "efficiency",
+        "tool_search_latency",
+        "easy",
+        "",
+        "<=5ms",
+        ["tool_search", "empty"],
+        "Empty-query tool search latency <= 5ms",
+    ),
+    # === Tool Search (10 tasks) ===
+    BenchmarkTask(
+        "ts-001",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "read file",
+        "read_file",
+        ["bm25", "exact"],
+        "Exact match for read_file",
+    ),
+    BenchmarkTask(
+        "ts-002",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "write file content",
+        "write_file",
+        ["bm25", "exact"],
+        "Exact match for write_file",
+    ),
+    BenchmarkTask(
+        "ts-003",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "search web information",
+        "web_search",
+        ["bm25", "exact"],
+        "Exact match for web_search",
+    ),
+    BenchmarkTask(
+        "ts-004",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "execute shell command",
+        "shell_exec",
+        ["bm25", "exact"],
+        "Exact match for shell_exec",
+    ),
+    BenchmarkTask(
+        "ts-005",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "send http request url",
+        "http_request",
+        ["bm25", "exact"],
+        "Exact match for http_request",
+    ),
+    BenchmarkTask(
+        "ts-006",
+        "tool_search",
+        "fuzzy_match",
+        "medium",
+        "io file",
+        "read_file",
+        ["bm25", "fuzzy", "tag"],
+        "Fuzzy tag match for 'io file'",
+    ),
+    BenchmarkTask(
+        "ts-007",
+        "tool_search",
+        "fuzzy_match",
+        "medium",
+        "search query engine",
+        "web_search",
+        ["bm25", "fuzzy", "multi"],
+        "Fuzzy match on multiple keywords",
+    ),
+    BenchmarkTask(
+        "ts-008",
+        "tool_search",
+        "no_match",
+        "easy",
+        "",
+        "__none__",
+        ["bm25", "empty"],
+        "Empty query should return no results",
+    ),
+    BenchmarkTask(
+        "ts-009",
+        "tool_search",
+        "no_match",
+        "easy",
+        "zzzznonexistent",
+        "__none__",
+        ["bm25", "no_match"],
+        "Query with no match should return no results",
+    ),
+    BenchmarkTask(
+        "ts-010",
+        "tool_search",
+        "top_k",
+        "medium",
+        "file",
+        "read_file",
+        ["bm25", "top_k"],
+        "top_k=1 limits the number of results",
+    ),
+    # === Event Model (6 tasks) ===
+    BenchmarkTask(
+        "ev-001",
+        "event_model",
+        "sq_lifecycle",
+        "easy",
+        "submit+drain",
+        "passed",
+        ["sq", "submit"],
+        "SQ submit and drain",
+    ),
+    BenchmarkTask(
+        "ev-002",
+        "event_model",
+        "sq_lifecycle",
+        "easy",
+        "cancel",
+        "passed",
+        ["sq", "cancel"],
+        "SQ task cancellation",
+    ),
+    BenchmarkTask(
+        "ev-003",
+        "event_model",
+        "sq_lifecycle",
+        "easy",
+        "close",
+        "passed",
+        ["sq", "close"],
+        "SQ rejects submissions after close",
+    ),
+    BenchmarkTask(
+        "ev-004",
+        "event_model",
+        "eq_lifecycle",
+        "easy",
+        "emit+replay",
+        "passed",
+        ["eq", "replay"],
+        "EQ emit and replay",
+    ),
+    BenchmarkTask(
+        "ev-005",
+        "event_model",
+        "eq_lifecycle",
+        "easy",
+        "close",
+        "passed",
+        ["eq", "close"],
+        "EQ exits on the close sentinel",
+    ),
+    BenchmarkTask(
+        "ev-006",
+        "event_model",
+        "eq_lifecycle",
+        "easy",
+        "subscriber_count",
+        "passed",
+        ["eq", "count"],
+        "EQ initial subscriber count",
+    ),
+    # === Spec Management (7 tasks) ===
+    BenchmarkTask(
+        "sm-001",
+        "spec_management",
+        "crud",
+        "easy",
+        "create",
+        "passed",
+        ["create"],
+        "Spec creation",
+    ),
+    BenchmarkTask(
+        "sm-002",
+        "spec_management",
+        "crud",
+        "easy",
+        "get",
+        "passed",
+        ["read"],
+        "Spec read",
+    ),
+    BenchmarkTask(
+        "sm-003",
+        "spec_management",
+        "crud",
+        "easy",
+        "update",
+        "passed",
+        ["update"],
+        "Spec update",
+    ),
+    BenchmarkTask(
+        "sm-004",
+        "spec_management",
+        "crud",
+        "easy",
+        "delete",
+        "passed",
+        ["delete"],
+        "Spec deletion",
+    ),
+    BenchmarkTask(
+        "sm-005",
+        "spec_management",
+        "crud",
+        "easy",
+        "list",
+        "passed",
+        ["list"],
+        "Spec listing",
+    ),
+    BenchmarkTask(
+        "sm-006",
+        "spec_management",
+        "edge",
+        "medium",
+        "confirm",
+        "passed",
+        ["confirm"],
+        "Spec confirmation",
+    ),
+    BenchmarkTask(
+        "sm-007",
+        "spec_management",
+        "edge",
+        "easy",
+        "missing",
+        "passed",
+        ["missing"],
+        "Missing spec returns None",
+    ),
+    # === Verification (5 tasks) ===
+    BenchmarkTask(
+        "vf-001",
+        "verification",
+        "basic",
+        "easy",
+        "pass",
+        "passed",
+        ["pass"],
+        "Verification of a passing command",
+    ),
+    BenchmarkTask(
+        "vf-002",
+        "verification",
+        "basic",
+        "easy",
+        "fail",
+        "passed",
+        ["fail"],
+        "Verification of a failing command",
+    ),
+    BenchmarkTask(
+        "vf-003",
+        "verification",
+        "retry",
+        "medium",
+        "fix_callback",
+        "passed",
+        ["retry", "callback"],
+        "Retry with fix callback",
+    ),
+    BenchmarkTask(
+        "vf-004",
+        "verification",
+        "timeout",
+        "medium",
+        "timeout",
+        "passed",
+        ["timeout"],
+        "Timeout detection",
+    ),
+    BenchmarkTask(
+        "vf-005",
+        "verification",
+        "multi",
+        "medium",
+        "multi_command",
+        "passed",
+        ["multi"],
+        "Multi-command verification",
+    ),
+]
+
+
+_FAST_CORE_IDS: set[str] = {
+    "prep-001",
+    "prep-005",
+    "prep-010",
+    "prep-012",
+    "over-001",
+    "over-003",
+    "eff-001",
+    "eff-004",
+    "ts-001",
+    "ts-003",
+    "ts-008",
+    "ts-010",
+    "ev-001",
+    "ev-004",
+    "ev-005",
+    "sm-001",
+    "sm-002",
+    "sm-006",
+    "sm-004",
+    "vf-001",
+    "vf-002",
+    "vf-003",
+}
+
+
+# ---------------------------------------------------------------------------
+# Mock helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_mock_skill_registry() -> object:
+    """Build a SkillRegistry with mock skills for preprocessing tests."""
     from agentkit.skills.base import Skill, SkillConfig
     from agentkit.skills.registry import SkillRegistry
 
@@ -142,7 +793,7 @@ def _make_mock_skill_registry():
     return registry
 
 
-def _make_mock_tools():
+def _make_mock_tools() -> list[object]:
     """Build a list of mock Tool instances for tool_search tests."""
     from agentkit.tools.base import Tool
 
@@ -151,7 +802,7 @@ def _make_mock_tools():
             self,
             name: str,
             description: str,
-            input_schema: dict[str, Any] | None = None,
+            input_schema: dict[str, object] | None = None,
             tags: list[str] | None = None,
         ):
             super().__init__(
@@ -161,7 +812,7 @@ def _make_mock_tools():
                 tags=tags or [],
             )
 
-        async def execute(self, **kwargs) -> dict:
+        async def execute(self, **kwargs: object) -> dict[str, object]:
             return {"status": "ok"}
 
     return [
@@ -224,144 +875,8 @@ def _make_mock_tools():
     ]
 
 
-# ---------------------------------------------------------------------------
-# Dimension test runners
-# ---------------------------------------------------------------------------
-
-
-async def _run_preprocessing(fast: bool, verbose: bool) -> DimensionResult:
-    """Test RequestPreprocessor routing accuracy."""
-    from agentkit.chat.request_preprocessor import RequestPreprocessor
-
-    registry = _make_mock_skill_registry()
-    preprocessor = RequestPreprocessor(skill_registry=registry)
-
-    cases: list[dict[str, str]] = [
-        {"id": "greeting_cn", "input": "你好", "expected": "direct_chat"},
-        {"id": "greeting_en", "input": "hello", "expected": "direct_chat"},
-        {"id": "chitchat_thanks", "input": "谢谢", "expected": "direct_chat"},
-        {"id": "identity_who", "input": "你是谁", "expected": "direct_chat"},
-        {"id": "colloquial_ip_1", "input": "查下ip", "expected": "react"},
-        {"id": "colloquial_ip_2", "input": "查看当前ip", "expected": "react"},
-        {"id": "tool_search", "input": "搜索golang教程", "expected": "react"},
-        {"id": "tool_shell", "input": "执行ls命令", "expected": "react"},
-        {"id": "translation", "input": "翻译hello为中文", "expected": "react"},
-        {"id": "knowledge", "input": "什么是机器学习", "expected": "react"},
-        {"id": "skill_prefix_react", "input": "@skill:react_agent 查看ip", "expected": "skill_react"},
-        {"id": "skill_prefix_direct", "input": "@skill:chat_only 你好", "expected": "skill_react"},
-        {"id": "skill_not_found", "input": "@skill:nonexistent 做点什么", "expected": "react"},
-        {"id": "complex_analysis", "input": "帮我分析一下这个数据并生成报告", "expected": "react"},
-        {"id": "empty_fallback", "input": "随便聊聊", "expected": "react"},
-    ]
-
-    if fast:
-        # Core cases only: greetings, tool queries, skill prefix
-        fast_ids = {
-            "greeting_cn",
-            "colloquial_ip_1",
-            "tool_search",
-            "skill_prefix_react",
-            "skill_not_found",
-        }
-        cases = [c for c in cases if c["id"] in fast_ids]
-
-    result = DimensionResult(dimension="preprocessing")
-
-    for case in cases:
-        start = time.perf_counter()
-        routing = await preprocessor.preprocess(content=case["input"])
-        elapsed_ms = (time.perf_counter() - start) * 1000
-
-        actual = routing.execution_mode.value
-        passed = actual == case["expected"]
-
-        result.add(
-            TestCaseResult(
-                case_id=case["id"],
-                passed=passed,
-                expected=case["expected"],
-                actual=actual,
-                duration_ms=round(elapsed_ms, 2),
-                detail=f"input={case['input']!r} method={routing.match_method}",
-            )
-        )
-
-        if verbose and not passed:
-            console.print(
-                f"  [red]✗[/red] {case['id']}: expected={case['expected']} "
-                f"actual={actual} ({routing.match_method})"
-            )
-        elif verbose:
-            console.print(f"  [green]✓[/green] {case['id']}: {actual} ({elapsed_ms:.1f}ms)")
-
-    return result
-
-
-async def _run_overfitting(fast: bool, verbose: bool) -> DimensionResult:
-    """Test routing consistency across paraphrases (overfitting detection).
-
-    Same intent expressed differently should route to the same execution mode.
-    """
-    from agentkit.chat.request_preprocessor import RequestPreprocessor
-
-    registry = _make_mock_skill_registry()
-    preprocessor = RequestPreprocessor(skill_registry=registry)
-
-    paraphrase_groups: list[dict[str, Any]] = [
-        {
-            "id": "ip_check_variants",
-            "paraphrases": ["查下ip", "查看当前ip", "获取ip地址", "看下ip", "帮我查一下ip"],
-            "expected": "react",
-        },
-        {
-            "id": "search_variants",
-            "paraphrases": ["搜索golang教程", "搜一下golang教程", "找下golang学习资料"],
-            "expected": "react",
-        },
-        {
-            "id": "greeting_variants",
-            "paraphrases": ["你好", "hello", "hi", "嗨", "哈喽"],
-            "expected": "direct_chat",
-        },
-    ]
-
-    if fast:
-        paraphrase_groups = paraphrase_groups[:2]
-
-    result = DimensionResult(dimension="overfitting")
-
-    for group in paraphrase_groups:
-        modes: list[str] = []
-        for text in group["paraphrases"]:
-            routing = await preprocessor.preprocess(content=text)
-            modes.append(routing.execution_mode.value)
-
-        # All paraphrases should produce the same mode
-        unique_modes = set(modes)
-        consistent = len(unique_modes) == 1
-        expected_mode = group["expected"]
-        correct = consistent and modes[0] == expected_mode if modes else False
-
-        result.add(
-            TestCaseResult(
-                case_id=group["id"],
-                passed=correct,
-                expected=expected_mode,
-                actual=",".join(modes),
-                duration_ms=0.0,
-                detail=f"paraphrases={len(group['paraphrases'])} consistent={consistent}",
-            )
-        )
-
-        if verbose:
-            status = "[green]✓[/green]" if correct else "[red]✗[/red]"
-            console.print(f"  {status} {group['id']}: modes={modes}")
-
-    return result
-
-
-async def _run_efficiency(fast: bool, verbose: bool) -> DimensionResult:
-    """Test component execution efficiency (timing bounds)."""
+def _make_context(tmp_dir: Path) -> BenchmarkContext:
+    """Create a benchmark context with mock components."""
     from agentkit.chat.request_preprocessor import RequestPreprocessor
     from agentkit.tools.search import ToolSearchIndex
 
@@ -370,744 +885,1051 @@ async def _run_efficiency(fast: bool, verbose: bool) -> DimensionResult:
     tools = _make_mock_tools()
     search_index = ToolSearchIndex(tools)
 
-    # Thresholds in milliseconds (generous — these are pure-Python ops)
-    thresholds: list[dict[str, Any]] = [
-        {
-            "id": "preprocess_greeting",
-            "func": lambda: preprocessor.preprocess(content="你好"),
-            "max_ms": 50.0,
-            "iterations": 100,
-        },
-        {
-            "id": "preprocess_react",
-            "func": lambda: preprocessor.preprocess(content="查下ip"),
-            "max_ms": 50.0,
-            "iterations": 100,
-        },
-        {
-            "id": "preprocess_skill_prefix",
-            "func": lambda: preprocessor.preprocess(content="@skill:react_agent test"),
-            "max_ms": 50.0,
-            "iterations": 100,
-        },
-        {
-            "id": "tool_search_query",
-            "func": None,  # handled specially (sync)
-            "max_ms": 10.0,
-            "iterations": 200,
-        },
-        {
-            "id": "tool_search_empty",
-            "func": None,
-            "max_ms": 5.0,
-            "iterations": 200,
-        },
-    ]
-
-    if fast:
-        thresholds = [t for t in thresholds if t["id"] in {
-            "preprocess_greeting", "tool_search_query"
-        }]
-
-    result = DimensionResult(dimension="efficiency")
-
-    for spec in thresholds:
-        start = time.perf_counter()
-        if spec["func"] is not None:
-            for _ in range(spec["iterations"]):
-                await spec["func"]()
-        else:
-            query = "read file" if "query" in spec["id"] else ""
-            for _ in range(spec["iterations"]):
-                search_index.search(query, top_k=5)
-        total_ms = (time.perf_counter() - start) * 1000
-        avg_ms = total_ms / spec["iterations"]
-
-        passed = avg_ms <= spec["max_ms"]
-        result.add(
-            TestCaseResult(
-                case_id=spec["id"],
-                passed=passed,
-                expected=f"<= {spec['max_ms']}ms/call",
-                actual=f"{avg_ms:.3f}ms/call",
-                duration_ms=round(total_ms, 2),
-                detail=f"iterations={spec['iterations']}",
-            )
-        )
-
-        if verbose:
-            status = "[green]✓[/green]" if passed else "[red]✗[/red]"
-            console.print(
-                f"  {status} {spec['id']}: {avg_ms:.3f}ms/call "
-                f"(threshold {spec['max_ms']}ms)"
-            )
-
-    return result
+    return BenchmarkContext(
+        preprocessor=preprocessor,
+        search_index=search_index,
+        tmp_dir=tmp_dir,
+    )
 
 
-async def _run_tool_search(fast: bool, verbose: bool) -> DimensionResult:
-    """Test ToolSearchIndex BM25 relevance ranking."""
-    from agentkit.tools.search import ToolSearchIndex
+# ---------------------------------------------------------------------------
+# Utility functions
+# ---------------------------------------------------------------------------
 
-    tools = _make_mock_tools()
-    index = ToolSearchIndex(tools)
 
-    cases: list[dict[str, Any]] = [
-        {"id": "read_file_query", "query": "read file", "expected_top": "read_file"},
-        {"id": "write_file_query", "query": "write file content", "expected_top": "write_file"},
-        {"id": "web_search_query", "query": "search web information", "expected_top": "web_search"},
-        {"id": "shell_exec_query", "query": "execute shell command", "expected_top": "shell_exec"},
-        {"id": "http_request_query", "query": "send http request url", "expected_top": "http_request"},
-        {"id": "file_tag_query", "query": "io file", "expected_top": "read_file"},
-        {"id": "empty_query", "query": "", "expected_top": "__none__"},
-        {"id": "no_match_query", "query": "zzzznonexistent", "expected_top": "__none__"},
-        {"id": "top_k_limit", "query": "file", "expected_top": "read_file", "top_k": 1},
-        {"id": "multi_token_query", "query": "search query engine", "expected_top": "web_search"},
-    ]
+def _wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
+    """Compute the Wilson score interval for a proportion (95% when z=1.96)."""
+    if total == 0:
+        return (0.0, 0.0)
+    p = successes / total
+    denom = 1.0 + z * z / total
+    center = (p + z * z / (2 * total)) / denom
+    spread = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
+    return (max(0.0, center - spread), min(1.0, center + spread))
 
-    if fast:
-        fast_ids = {"read_file_query", "web_search_query", "empty_query", "top_k_limit"}
-        cases = [c for c in cases if c["id"] in fast_ids]
 
-    result = DimensionResult(dimension="tool_search")
+def _percentile(sorted_values: list[float], p: float) -> float:
+    """Compute the p-th percentile (0-100) of an ascending list via linear interpolation."""
+    if not sorted_values:
+        return 0.0
+    if len(sorted_values) == 1:
+        return sorted_values[0]
+    k = (len(sorted_values) - 1) * p / 100.0
+    f = math.floor(k)
+    c = math.ceil(k)
+    if f == c:
+        return sorted_values[int(k)]
+    d0 = sorted_values[int(f)] * (c - k)
+    d1 = sorted_values[int(c)] * (k - f)
+    return d0 + d1
 
+
+def _std(values: list[float]) -> float:
+    """Compute population standard deviation."""
+    if len(values) < 2:
+        return 0.0
+    mean = sum(values) / len(values)
+    variance = sum((v - mean) ** 2 for v in values) / len(values)
+    return math.sqrt(variance)
+
+
+def _parse_threshold(expected: str) -> float:
+    """Parse a threshold such as '<=50ms' into 50.0; return inf if it does not match."""
+    match = re.match(r"<=\s*([\d.]+)\s*ms", expected)
+    if match:
+        return float(match.group(1))
+    return float("inf")
+
+
+# ---------------------------------------------------------------------------
+# Metrics computation
+# ---------------------------------------------------------------------------
+
+
+def _compute_metrics(
+    cases: list[CaseResult],
+    accuracies: list[float] | None = None,
+) -> MetricSet:
+    """Compute full metric set from a list of cases."""
+    total = len(cases)
+    passed = sum(1 for c in cases if c.passed)
+    failed = total - passed
+    accuracy = passed / total if total > 0 else 0.0
+
+    # Multi-class macro-averaged Precision / Recall / F1
+    expected_classes: set[str] = {c.expected for c in cases}
+    precisions: list[float] = []
+    recalls: list[float] = []
+    f1s: list[float] = []
+    for cls in expected_classes:
+        tp = sum(1 for c in cases if c.expected == cls and c.actual == cls)
+        fp = sum(1 for c in cases if c.expected != cls and c.actual == cls)
+        fn = sum(1 for c in cases if c.expected == cls and c.actual != cls)
+        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
+        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
+        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
+        precisions.append(p)
+        recalls.append(r)
+        f1s.append(f1)
+
+    precision = sum(precisions) / len(precisions) if precisions else 0.0
+    recall = sum(recalls) / len(recalls) if recalls else 0.0
+    f1 = sum(f1s) / len(f1s) if f1s else 0.0
+
+    # Latency percentiles
+    latencies = sorted(c.duration_ms for c in cases)
+    p50 = _percentile(latencies, 50)
+    p95 = _percentile(latencies, 95)
+    p99 = _percentile(latencies, 99)
+
+    # Consistency (overfitting detection)
+    consistency = sum(c.consistency for c in cases) / total if total > 0 else 0.0
+
+    # Multi-run statistics
+    if accuracies:
+        accuracy_mean = sum(accuracies) / len(accuracies)
+        accuracy_std = _std(accuracies)
+    else:
+        accuracy_mean = accuracy
+        accuracy_std = 0.0
+
+    # Wilson 95% CI
+    ci_lower, ci_upper = _wilson_interval(passed, total)
+
+    return MetricSet(
+        accuracy=round(accuracy, 4),
+        precision=round(precision, 4),
+        recall=round(recall, 4),
+        f1=round(f1, 4),
+        latency_p50_ms=round(p50, 4),
+        latency_p95_ms=round(p95, 4),
+        latency_p99_ms=round(p99, 4),
+        consistency=round(consistency, 4),
+        total=total,
+        passed=passed,
+        failed=failed,
+        accuracy_mean=round(accuracy_mean, 4),
+        accuracy_std=round(accuracy_std, 4),
+        ci_lower=round(ci_lower, 4),
+        ci_upper=round(ci_upper, 4),
+    )
+
+
+def _aggregate_by(cases: list[CaseResult], key: str) -> dict[str, MetricSet]:
+    """Aggregate cases by a field name (category or difficulty)."""
+    groups: dict[str, list[CaseResult]] = {}
     for case in cases:
-        start = time.perf_counter()
-        top_k = case.get("top_k", 5)
-        found = index.search(case["query"], top_k=top_k)
-        elapsed_ms = (time.perf_counter() - start) * 1000
+        k = getattr(case, key)
+        groups.setdefault(k, []).append(case)
+    return {k: _compute_metrics(v) for k, v in groups.items()}
 
-        if case["expected_top"] == "__none__":
-            passed = len(found) == 0
-            actual = "[]" if passed else found[0].name
-        else:
-            actual = found[0].name if found else "__empty__"
-            passed = actual == case["expected_top"]
 
-        result.add(
-            TestCaseResult(
-                case_id=case["id"],
-                passed=passed,
-                expected=case["expected_top"],
-                actual=actual,
-                duration_ms=round(elapsed_ms, 2),
-                detail=f"query={case['query']!r} top_k={top_k} results={len(found)}",
-            )
+def _classify_root_cause(task: BenchmarkTask, result: ExecutionResult) -> str:
+    """Classify the root cause of a failure."""
+    if result.passed:
+        return "none"
+    detail_lower = result.detail.lower()
+    actual_lower = result.actual.lower()
+    if "__exception__" in result.actual or "exception" in detail_lower:
+        return "exception"
+    if "timeout" in detail_lower or "timed out" in actual_lower:
+        return "timeout"
+    if task.dimension == "preprocessing":
+        return "wrong_mode"
+    if task.dimension == "tool_search":
+        return "wrong_tool"
+    if task.dimension == "overfitting":
+        return "inconsistent"
+    if task.dimension == "efficiency":
+        return "latency_exceeded"
+    return "assertion"
+
+
+# ---------------------------------------------------------------------------
+# Task executors
+# ---------------------------------------------------------------------------
+
+
+async def _exec_preprocessing(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute preprocessing benchmark task."""
+    preprocessor: RequestPreprocessor = ctx.preprocessor  # type: ignore[assignment]
+    start = time.perf_counter()
+    routing = await preprocessor.preprocess(content=task.input)
+    elapsed = (time.perf_counter() - start) * 1000
+    actual = routing.execution_mode.value
+    passed = actual == task.expected
+    return ExecutionResult(
+        actual=actual,
+        passed=passed,
+        duration_ms=round(elapsed, 4),
+        detail=f"input={task.input!r} method={routing.match_method}",
+    )
+
+
+async def _exec_overfitting(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute overfitting benchmark task (paraphrase consistency)."""
+    preprocessor: RequestPreprocessor = ctx.preprocessor  # type: ignore[assignment]
+    start = time.perf_counter()
+    modes: list[str] = []
+    for text in task.paraphrases:
+        routing = await preprocessor.preprocess(content=text)
+        modes.append(routing.execution_mode.value)
+    elapsed = (time.perf_counter() - start) * 1000
+
+    unique_modes = set(modes)
+    consistent = len(unique_modes) == 1
+    actual = modes[0] if consistent else "inconsistent"
+    passed = consistent and actual == task.expected
+
+    return ExecutionResult(
+        actual=actual,
+        passed=passed,
+        duration_ms=round(elapsed, 4),
+        detail=f"paraphrases={len(task.paraphrases)} modes={modes}",
+        consistency=1.0 if consistent else 0.0,
+    )
+
+
+async def _exec_efficiency(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute efficiency benchmark task (latency threshold)."""
+    threshold = _parse_threshold(task.expected)
+    iterations = 100
+
+    preprocessor: RequestPreprocessor = ctx.preprocessor  # type: ignore[assignment]
+    search_index: ToolSearchIndex = ctx.search_index  # type: ignore[assignment]
+
+    start = time.perf_counter()
+    if task.category == "preprocess_latency":
+        for _ in range(iterations):
+            await preprocessor.preprocess(content=task.input)
+    elif task.category == "tool_search_latency":
+        for _ in range(iterations):
+            search_index.search(task.input, top_k=5)
+    else:
+        return ExecutionResult(
+            actual="unknown_category",
+            passed=False,
+            duration_ms=0.0,
+            detail=f"Unknown efficiency category: {task.category}",
         )
+    total_ms = (time.perf_counter() - start) * 1000
+    avg_ms = total_ms / iterations
 
-        if verbose:
-            status = "[green]✓[/green]" if passed else "[red]✗[/red]"
-            console.print(f"  {status} {case['id']}: top={actual} ({elapsed_ms:.2f}ms)")
-
-    return result
+    passed = avg_ms <= threshold
+    return ExecutionResult(
+        actual=f"{avg_ms:.3f}ms",
+        passed=passed,
+        duration_ms=round(total_ms, 2),
+        detail=f"iterations={iterations} avg={avg_ms:.3f}ms threshold={threshold}ms",
+    )
 
 
-async def _run_event_model(fast: bool, verbose: bool) -> DimensionResult:
-    """Test SubmissionQueue / EventQueue lifecycle."""
+async def _exec_tool_search(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute tool search benchmark task."""
+    search_index: ToolSearchIndex = ctx.search_index  # type: ignore[assignment]
+    top_k = 1 if "top_k" in task.tags else 5
+
+    start = time.perf_counter()
+    found = search_index.search(task.input, top_k=top_k)
+    elapsed = (time.perf_counter() - start) * 1000
+
+    if task.expected == "__none__":
+        passed = len(found) == 0
+        actual = "[]" if passed else found[0].name
+    else:
+        actual = found[0].name if found else "__empty__"
+        passed = actual == task.expected
+
+    return ExecutionResult(
+        actual=actual,
+        passed=passed,
+        duration_ms=round(elapsed, 4),
+        detail=f"query={task.input!r} top_k={top_k} results={len(found)}",
+    )
+
+
+async def _exec_event_model(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute event model benchmark task."""
     from agentkit.core.event_queue import EventQueue, SubmissionQueue
     from agentkit.core.protocol import Event
 
-    result = DimensionResult(dimension="event_model")
-
-    # --- SubmissionQueue tests ---
-    sq = SubmissionQueue()
-
-    # Test 1: submit and drain
     start = time.perf_counter()
-    task_id = await sq.submit("hello", "session-1")
-    drained: list[str] = []
-    async for submission in sq.drain():
-        drained.append(submission.content)
-        break  # only drain one to avoid blocking
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = task_id != "" and drained == ["hello"]
-    result.add(
-        TestCaseResult(
-            case_id="sq_submit_drain",
+
+    if task.task_id == "ev-001":  # SQ submit + drain
+        sq = SubmissionQueue()
+        task_id = await sq.submit("hello", "session-1")
+        drained: list[str] = []
+        async for sub in sq.drain():
+            drained.append(sub.content)
+            break
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = task_id != "" and drained == ["hello"]
+        return ExecutionResult(
+            actual=f"drained={drained}",
             passed=passed,
-            expected="task_id + drained=['hello']",
-            actual=f"task_id={task_id[:8]}... drained={drained}",
-            duration_ms=round(elapsed_ms, 2),
+            duration_ms=round(elapsed, 4),
+            detail=f"task_id={task_id[:8]}...",
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} sq_submit_drain")
 
-    # Test 2: cancel
-    start = time.perf_counter()
-    cancel_id = await sq.submit("to-cancel", "session-2")
-    cancelled = await sq.cancel(cancel_id)
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = cancelled and sq._submissions[cancel_id].cancelled
-    result.add(
-        TestCaseResult(
-            case_id="sq_cancel",
-            passed=passed,
-            expected="cancelled=True",
+    if task.task_id == "ev-002":  # SQ cancel
+        sq = SubmissionQueue()
+        cancel_id = await sq.submit("to-cancel", "session-2")
+        cancelled = await sq.cancel(cancel_id)
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = bool(cancelled and sq._submissions[cancel_id].cancelled)
+        return ExecutionResult(
             actual=f"cancelled={cancelled}",
-            duration_ms=round(elapsed_ms, 2),
-        )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} sq_cancel")
-
-    # Test 3: close blocks new submissions
-    start = time.perf_counter()
-    sq2 = SubmissionQueue()
-    sq2.close()
-    raised = False
-    try:
-        await sq2.submit("after-close", "session-3")
-    except RuntimeError:
-        raised = True
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = raised and sq2.is_closed
-    result.add(
-        TestCaseResult(
-            case_id="sq_close_blocks",
             passed=passed,
-            expected="RuntimeError on submit after close",
-            actual=f"raised={raised} closed={sq2.is_closed}",
-            duration_ms=round(elapsed_ms, 2),
+            duration_ms=round(elapsed, 4),
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} sq_close_blocks")
 
-    # --- EventQueue tests ---
-    eq = EventQueue(buffer_size=10)
-
-    # Test 4: emit and subscribe with replay
-    start = time.perf_counter()
-    test_event = Event(
-        event_type="test_event",
-        task_id="task-1",
-        session_id="session-1",
-        data={"msg": "hello"},
-        timestamp=datetime.now(timezone.utc).isoformat(),
-    )
-    await eq.emit(test_event)
-
-    received: list[Event] = []
-    # Subscribe and collect one event (replay)
-    async for event in eq.subscribe():
-        received.append(event)
-        break
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = len(received) == 1 and received[0].event_type == "test_event"
-    result.add(
-        TestCaseResult(
-            case_id="eq_emit_subscribe_replay",
+    if task.task_id == "ev-003":  # SQ close blocks
+        sq = SubmissionQueue()
+        sq.close()
+        raised = False
+        try:
+            await sq.submit("after-close", "session-3")
+        except RuntimeError:
+            raised = True
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = bool(raised and sq.is_closed)
+        return ExecutionResult(
+            actual=f"raised={raised} closed={sq.is_closed}",
             passed=passed,
-            expected="1 event replayed",
-            actual=f"{len(received)} events",
-            duration_ms=round(elapsed_ms, 2),
+            duration_ms=round(elapsed, 4),
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} eq_emit_subscribe_replay")
 
-    # Test 5: close sends sentinel
-    start = time.perf_counter()
-    eq2 = EventQueue()
-
-    async def _consume_all() -> list[Event]:
-        events: list[Event] = []
-        async for ev in eq2.subscribe():
-            events.append(ev)
-        return events
-
-    # Start consumer, emit, then close
-    consumer_task = asyncio.create_task(_consume_all())
-    await asyncio.sleep(0.01)  # let subscriber register
-    await eq2.emit(test_event)
-    await asyncio.sleep(0.01)
-    eq2.close()
-    events = await asyncio.wait_for(consumer_task, timeout=2.0)
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = len(events) >= 1 and eq2.is_closed
-    result.add(
-        TestCaseResult(
-            case_id="eq_close_sentinel",
+    if task.task_id == "ev-004":  # EQ emit + replay
+        eq = EventQueue(buffer_size=10)
+        test_event = Event(
+            event_type="test_event",
+            task_id="task-1",
+            session_id="session-1",
+            data={"msg": "hello"},
+            timestamp=datetime.now(timezone.utc).isoformat(),
+        )
+        await eq.emit(test_event)
+        received: list[Event] = []
+        async for event in eq.subscribe():
+            received.append(event)
+            break
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = len(received) == 1 and received[0].event_type == "test_event"
+        return ExecutionResult(
+            actual=f"received={len(received)}",
             passed=passed,
-            expected="subscriber exits on close",
-            actual=f"{len(events)} events, closed={eq2.is_closed}",
-            duration_ms=round(elapsed_ms, 2),
+            duration_ms=round(elapsed, 4),
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} eq_close_sentinel")
 
-    # Test 6: subscriber count
-    start = time.perf_counter()
-    eq3 = EventQueue()
-    initial_count = eq3.subscriber_count
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = initial_count == 0
-    result.add(
-        TestCaseResult(
-            case_id="eq_subscriber_count",
+    if task.task_id == "ev-005":  # EQ close sentinel
+        eq = EventQueue()
+
+        async def _consume_all() -> list[Event]:
+            events: list[Event] = []
+            async for ev in eq.subscribe():
+                events.append(ev)
+            return events
+
+        consumer_task = asyncio.create_task(_consume_all())
+        await asyncio.sleep(0.01)
+        test_event = Event(
+            event_type="test_event",
+            task_id="task-1",
+            session_id="session-1",
+            data={"msg": "hello"},
+            timestamp=datetime.now(timezone.utc).isoformat(),
+        )
+        await eq.emit(test_event)
+        await asyncio.sleep(0.01)
+        eq.close()
+        events = await asyncio.wait_for(consumer_task, timeout=2.0)
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = len(events) >= 1 and eq.is_closed
+        return ExecutionResult(
+            actual=f"events={len(events)} closed={eq.is_closed}",
             passed=passed,
-            expected="0 subscribers initially",
-            actual=f"{initial_count} subscribers",
-            duration_ms=round(elapsed_ms, 2),
+            duration_ms=round(elapsed, 4),
         )
+
+    if task.task_id == "ev-006":  # EQ subscriber count
+        eq = EventQueue()
+        count = eq.subscriber_count
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = count == 0
+        return ExecutionResult(
+            actual=f"subscribers={count}",
+            passed=passed,
+            duration_ms=round(elapsed, 4),
+        )
+
+    return ExecutionResult(
+        actual="unknown_task",
+        passed=False,
+        duration_ms=0.0,
+        detail=f"Unknown event_model task: {task.task_id}",
     )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} eq_subscriber_count")
-
-    if fast:
-        # Keep only core cases in fast mode
-        core_ids = {"sq_submit_drain", "eq_emit_subscribe_replay", "eq_close_sentinel"}
-        result.details = [d for d in result.details if d.case_id in core_ids]
-        result.total = len(result.details)
-        result.passed = sum(1 for d in result.details if d.passed)
-        result.failed = result.total - result.passed
-
-    return result
 
 
-async def _run_spec_management(fast: bool, verbose: bool, tmp_dir: Path) -> DimensionResult:
-    """Test SpecManager CRUD operations."""
+async def _exec_spec_management(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute spec management benchmark task (each task is self-contained)."""
     from agentkit.core.spec_manager import Spec, SpecManager, SpecStep
 
-    specs_dir = str(tmp_dir / "specs")
+    specs_dir = str(ctx.tmp_dir / "specs" / task.task_id)
     manager = SpecManager(specs_dir=specs_dir)
 
-    result = DimensionResult(dimension="spec_management")
-
-    # Test 1: create
     start = time.perf_counter()
-    spec = Spec(
-        spec_id="spec-001",
-        goal="Test goal",
-        steps=[
-            SpecStep(step_id="s1", name="step1", description="first step"),
-            SpecStep(step_id="s2", name="step2", description="second step", dependencies=["s1"]),
-        ],
-    )
-    path = manager.create(spec)
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = path.exists()
-    result.add(
-        TestCaseResult(
-            case_id="spec_create",
-            passed=passed,
-            expected="file exists on disk",
-            actual=f"exists={path.exists()}",
-            duration_ms=round(elapsed_ms, 2),
+
+    if task.task_id == "sm-001":  # create
+        spec = Spec(
+            spec_id="test-spec",
+            goal="Test goal",
+            steps=[SpecStep(step_id="s1", name="step1", description="first step")],
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_create")
-
-    # Test 2: get
-    start = time.perf_counter()
-    loaded = manager.get("spec-001")
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = loaded is not None and loaded.spec_id == "spec-001" and len(loaded.steps) == 2
-    result.add(
-        TestCaseResult(
-            case_id="spec_get",
+        path = manager.create(spec)
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = path.exists()
+        return ExecutionResult(
+            actual=f"exists={passed}",
             passed=passed,
-            expected="spec with 2 steps",
+            duration_ms=round(elapsed, 4),
+            detail=f"path={path}",
+        )
+
+    if task.task_id == "sm-002":  # get
+        spec = Spec(
+            spec_id="test-spec",
+            goal="Test goal",
+            steps=[
+                SpecStep(step_id="s1", name="step1", description="first step"),
+                SpecStep(step_id="s2", name="step2", description="second step"),
+            ],
+        )
+        manager.create(spec)
+        loaded = manager.get("test-spec")
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = loaded is not None and loaded.spec_id == "test-spec" and len(loaded.steps) == 2
+        return ExecutionResult(
             actual=f"steps={len(loaded.steps) if loaded else 0}",
-            duration_ms=round(elapsed_ms, 2),
-        )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_get")
-
-    # Test 3: update
-    start = time.perf_counter()
-    updated = manager.update("spec-001", goal="Updated goal")
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = updated is not None and updated.goal == "Updated goal"
-    result.add(
-        TestCaseResult(
-            case_id="spec_update",
             passed=passed,
-            expected="goal='Updated goal'",
+            duration_ms=round(elapsed, 4),
+        )
+
+    if task.task_id == "sm-003":  # update
+        spec = Spec(spec_id="test-spec", goal="Original goal")
+        manager.create(spec)
+        updated = manager.update("test-spec", goal="Updated goal")
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = updated is not None and updated.goal == "Updated goal"
+        return ExecutionResult(
             actual=f"goal={updated.goal if updated else None}",
-            duration_ms=round(elapsed_ms, 2),
-        )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_update")
-
-    # Test 4: confirm
-    start = time.perf_counter()
-    confirmed = manager.confirm("spec-001")
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = (
-        confirmed is not None
-        and confirmed.status == "confirmed"
-        and confirmed.confirmed_at is not None
-        and all(s.status == "confirmed" for s in confirmed.steps)
-    )
-    result.add(
-        TestCaseResult(
-            case_id="spec_confirm",
             passed=passed,
-            expected="status=confirmed, all steps confirmed",
+            duration_ms=round(elapsed, 4),
+        )
+
+    if task.task_id == "sm-004":  # delete
+        spec = Spec(spec_id="test-spec", goal="To be deleted")
+        manager.create(spec)
+        deleted = manager.delete("test-spec")
+        remaining = manager.list_specs()
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = bool(deleted and len(remaining) == 0)
+        return ExecutionResult(
+            actual=f"deleted={deleted} remaining={len(remaining)}",
+            passed=passed,
+            duration_ms=round(elapsed, 4),
+        )
+
+    if task.task_id == "sm-005":  # list
+        manager.create(Spec(spec_id="spec-a", goal="Goal A"))
+        manager.create(Spec(spec_id="spec-b", goal="Goal B"))
+        specs = manager.list_specs()
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = len(specs) == 2
+        return ExecutionResult(
+            actual=f"count={len(specs)}",
+            passed=passed,
+            duration_ms=round(elapsed, 4),
+        )
+
+    if task.task_id == "sm-006":  # confirm
+        spec = Spec(
+            spec_id="test-spec",
+            goal="Test goal",
+            steps=[SpecStep(step_id="s1", name="step1", description="first step")],
+        )
+        manager.create(spec)
+        confirmed = manager.confirm("test-spec")
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = bool(
+            confirmed is not None
+            and confirmed.status == "confirmed"
+            and confirmed.confirmed_at is not None
+            and all(s.status == "confirmed" for s in confirmed.steps)
+        )
+        return ExecutionResult(
             actual=f"status={confirmed.status if confirmed else None}",
-            duration_ms=round(elapsed_ms, 2),
-        )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_confirm")
-
-    # Test 5: list
-    start = time.perf_counter()
-    # Create a second spec for listing
-    spec2 = Spec(spec_id="spec-002", goal="Second goal")
-    manager.create(spec2)
-    specs = manager.list_specs()
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = len(specs) == 2
-    result.add(
-        TestCaseResult(
-            case_id="spec_list",
             passed=passed,
-            expected="2 specs",
-            actual=f"{len(specs)} specs",
-            duration_ms=round(elapsed_ms, 2),
+            duration_ms=round(elapsed, 4),
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_list")
 
-    # Test 6: delete
-    start = time.perf_counter()
-    deleted = manager.delete("spec-002")
-    remaining = manager.list_specs()
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = deleted and len(remaining) == 1
-    result.add(
-        TestCaseResult(
-            case_id="spec_delete",
+    if task.task_id == "sm-007":  # get missing
+        missing = manager.get("nonexistent")
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = missing is None
+        return ExecutionResult(
+            actual=f"result={missing}",
             passed=passed,
-            expected="deleted, 1 remaining",
-            actual=f"deleted={deleted}, remaining={len(remaining)}",
-            duration_ms=round(elapsed_ms, 2),
+            duration_ms=round(elapsed, 4),
         )
+
+    return ExecutionResult(
+        actual="unknown_task",
+        passed=False,
+        duration_ms=0.0,
+        detail=f"Unknown spec_management task: {task.task_id}",
     )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_delete")
-
-    # Test 7: get nonexistent
-    start = time.perf_counter()
-    missing = manager.get("nonexistent")
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = missing is None
-    result.add(
-        TestCaseResult(
-            case_id="spec_get_missing",
-            passed=passed,
-            expected="None",
-            actual=f"{missing}",
-            duration_ms=round(elapsed_ms, 2),
-        )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_get_missing")
-
-    if fast:
-        core_ids = {"spec_create", "spec_get", "spec_confirm", "spec_delete"}
-        result.details = [d for d in result.details if d.case_id in core_ids]
-        result.total = len(result.details)
-        result.passed = sum(1 for d in result.details if d.passed)
-        result.failed = result.total - result.passed
-
-    return result
 
 
-async def _run_verification(fast: bool, verbose: bool, tmp_dir: Path) -> DimensionResult:
-    """Test VerificationLoop execute/retry behavior."""
+async def _exec_verification(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute verification benchmark task."""
     from agentkit.core.verification_loop import VerificationLoop
 
-    result = DimensionResult(dimension="verification")
-
-    # Test 1: passing command
+    working_dir = str(ctx.tmp_dir)
     start = time.perf_counter()
-    loop_pass = VerificationLoop(
-        commands=["true"],
-        max_retries=0,
-        working_dir=str(tmp_dir),
-        timeout=5.0,
-    )
-    res = await loop_pass.verify()
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = res.passed and res.attempts == 1
-    result.add(
-        TestCaseResult(
-            case_id="verify_pass",
-            passed=passed,
-            expected="passed=True, attempts=1",
-            actual=f"passed={res.passed}, attempts={res.attempts}",
-            duration_ms=round(elapsed_ms, 2),
+
+    if task.task_id == "vf-001":  # single passing command, no retries
+        loop = VerificationLoop(
+            commands=["true"], max_retries=0, working_dir=working_dir, timeout=5.0
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} verify_pass")
-
-    # Test 2: failing command
-    start = time.perf_counter()
-    loop_fail = VerificationLoop(
-        commands=["false"],
-        max_retries=0,
-        working_dir=str(tmp_dir),
-        timeout=5.0,
-    )
-    res = await loop_fail.verify()
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = not res.passed and len(res.errors) > 0
-    result.add(
-        TestCaseResult(
-            case_id="verify_fail",
+        res = await loop.verify()
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = bool(res.passed and res.attempts == 1)
+        return ExecutionResult(
+            actual=f"passed={res.passed} attempts={res.attempts}",
             passed=passed,
-            expected="passed=False, has errors",
-            actual=f"passed={res.passed}, errors={len(res.errors)}",
-            duration_ms=round(elapsed_ms, 2),
+            duration_ms=round(elapsed, 4),
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} verify_fail")
 
-    # Test 3: retry with fix callback
-    start = time.perf_counter()
-    call_count = 0
-
-    async def _fix_callback(errors: list[str], output: str) -> None:
-        nonlocal call_count
-        call_count += 1
-
-    # Use a command that always fails to test retry logic
-    loop_retry = VerificationLoop(
-        commands=["false"],
-        max_retries=2,
-        working_dir=str(tmp_dir),
-        timeout=5.0,
-    )
-    res = await loop_retry.verify_and_retry(fix_callback=_fix_callback)
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = not res.passed and res.attempts == 3 and call_count == 2
-    result.add(
-        TestCaseResult(
-            case_id="verify_retry",
-            passed=passed,
-            expected="attempts=3, fix_callback called 2x",
-            actual=f"attempts={res.attempts}, callbacks={call_count}",
-            duration_ms=round(elapsed_ms, 2),
+    if task.task_id == "vf-002":  # single failing command, no retries
+        loop = VerificationLoop(
+            commands=["false"], max_retries=0, working_dir=working_dir, timeout=5.0
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} verify_retry")
-
-    # Test 4: timeout
-    start = time.perf_counter()
-    loop_timeout = VerificationLoop(
-        commands=["sleep 10"],
-        max_retries=0,
-        working_dir=str(tmp_dir),
-        timeout=0.5,
-    )
-    res = await loop_timeout.verify()
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = not res.passed and any("timed out" in e.lower() for e in res.errors)
-    result.add(
-        TestCaseResult(
-            case_id="verify_timeout",
+        res = await loop.verify()
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = bool(not res.passed and len(res.errors) > 0)
+        return ExecutionResult(
+            actual=f"passed={res.passed} errors={len(res.errors)}",
             passed=passed,
-            expected="timeout error",
-            actual=f"passed={res.passed}, errors={len(res.errors)}",
-            duration_ms=round(elapsed_ms, 2),
+            duration_ms=round(elapsed, 4),
         )
-    )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} verify_timeout")
 
-    # Test 5: multiple commands (one passes, one fails)
-    start = time.perf_counter()
-    loop_multi = VerificationLoop(
-        commands=["true", "false"],
-        max_retries=0,
-        working_dir=str(tmp_dir),
-        timeout=5.0,
-    )
-    res = await loop_multi.verify()
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = not res.passed and "false" in res.test_output
-    result.add(
-        TestCaseResult(
-            case_id="verify_multi_command",
+    if task.task_id == "vf-003":  # retry with fix_callback
+        call_count = 0
+
+        async def _fix_callback(errors: list[str], output: str) -> None:
+            nonlocal call_count
+            call_count += 1
+
+        loop = VerificationLoop(
+            commands=["false"], max_retries=2, working_dir=working_dir, timeout=5.0
+        )
+        res = await loop.verify_and_retry(fix_callback=_fix_callback)
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = bool(not res.passed and res.attempts == 3 and call_count == 2)
+        return ExecutionResult(
+            actual=f"attempts={res.attempts} callbacks={call_count}",
             passed=passed,
-            expected="overall fail, output has both commands",
+            duration_ms=round(elapsed, 4),
+        )
+
+    if task.task_id == "vf-004":  # timeout
+        loop = VerificationLoop(
+            commands=["sleep 10"], max_retries=0, working_dir=working_dir, timeout=0.5
+        )
+        res = await loop.verify()
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = bool(not res.passed and any("timed out" in e.lower() for e in res.errors))
+        return ExecutionResult(
+            actual=f"passed={res.passed} errors={len(res.errors)}",
+            passed=passed,
+            duration_ms=round(elapsed, 4),
+            detail=f"errors={res.errors[:1]}",
+        )
+
+    if task.task_id == "vf-005":  # multiple commands: one passes, one fails
+        loop = VerificationLoop(
+            commands=["true", "false"], max_retries=0, working_dir=working_dir, timeout=5.0
+        )
+        res = await loop.verify()
+        elapsed = (time.perf_counter() - start) * 1000
+        passed = bool(not res.passed and "false" in res.test_output)
+        return ExecutionResult(
             actual=f"passed={res.passed}",
-            duration_ms=round(elapsed_ms, 2),
+            passed=passed,
+            duration_ms=round(elapsed, 4),
         )
+
+    return ExecutionResult(
+        actual="unknown_task",
+        passed=False,
+        duration_ms=0.0,
+        detail=f"Unknown verification task: {task.task_id}",
     )
-    if verbose:
-        console.print(f"  {'[green]✓[/green]' if passed else '[red]✗[/red]'} verify_multi_command")
 
+
+_EXECUTORS: dict[
+    str,
+    Callable[[BenchmarkTask, BenchmarkContext], Awaitable[ExecutionResult]],
+] = {
+    "preprocessing": _exec_preprocessing,
+    "overfitting": _exec_overfitting,
+    "efficiency": _exec_efficiency,
+    "tool_search": _exec_tool_search,
+    "event_model": _exec_event_model,
+    "spec_management": _exec_spec_management,
+    "verification": _exec_verification,
+}
+
+
+async def _execute_task(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute a single benchmark task via the dimension dispatcher."""
+    executor = _EXECUTORS.get(task.dimension)
+    if executor is None:
+        return ExecutionResult(
+            actual="unknown_dimension",
+            passed=False,
+            duration_ms=0.0,
+            detail=f"Unknown dimension: {task.dimension}",
+        )
+    return await executor(task, ctx)
+
+
+async def _execute_task_safely(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute a task, converting any raised exception into a failed ExecutionResult."""
+    try:
+        return await _execute_task(task, ctx)
+    except Exception as e:
+        return ExecutionResult(
+            actual="__exception__",
+            passed=False,
+            duration_ms=0.0,
+            detail=f"Exception: {type(e).__name__}: {e}",
+            consistency=0.0,
+        )
+
+
+# ---------------------------------------------------------------------------
+# Dimension runner
+# ---------------------------------------------------------------------------
+
+
+async def _run_dimension(
+    dimension: str,
+    runs: int,
+    fast: bool,
+    verbose: bool,
+    ctx: BenchmarkContext,
+) -> DimensionResult:
+    """Run a dimension's tasks `runs` times; cases are reported from the final run only."""
+    tasks = [t for t in TASK_SET if t.dimension == dimension]
     if fast:
-        core_ids = {"verify_pass", "verify_fail", "verify_retry"}
-        result.details = [d for d in result.details if d.case_id in core_ids]
-        result.total = len(result.details)
-        result.passed = sum(1 for d in result.details if d.passed)
-        result.failed = result.total - result.passed
+        tasks = [t for t in tasks if t.task_id in _FAST_CORE_IDS]
 
-    return result
+    all_runs_cases: list[list[CaseResult]] = []
+    accuracies: list[float] = []
+
+    for run_idx in range(runs):
+        run_ctx = BenchmarkContext(
+            preprocessor=ctx.preprocessor,
+            search_index=ctx.search_index,
+            tmp_dir=ctx.tmp_dir / f"run-{run_idx}",
+        )
+        run_ctx.tmp_dir.mkdir(parents=True, exist_ok=True)
+
+        cases: list[CaseResult] = []
+        for task in tasks:
+            result = await _execute_task_safely(task, run_ctx)
+            root_cause = _classify_root_cause(task, result)
+            case = CaseResult(
+                task_id=task.task_id,
+                dimension=task.dimension,
+                category=task.category,
+                difficulty=task.difficulty,
+                passed=result.passed,
+                expected=task.expected,
+                actual=result.actual,
+                duration_ms=result.duration_ms,
+                root_cause=root_cause,
+                detail=result.detail,
+                consistency=result.consistency,
+            )
+            cases.append(case)
+
+            if verbose:
+                status = "[green]✓[/green]" if case.passed else "[red]✗[/red]"
+                console.print(
+                    f"  {status} {task.task_id}: {result.actual} ({result.duration_ms:.2f}ms)"
+                )
+
+        all_runs_cases.append(cases)
+        passed_count = sum(1 for c in cases if c.passed)
+        accuracies.append(passed_count / len(cases) if cases else 0.0)
+
+    final_cases = all_runs_cases[-1] if all_runs_cases else []
+    metrics = _compute_metrics(final_cases, accuracies if runs > 1 else None)
+    by_category = _aggregate_by(final_cases, "category")
+    by_difficulty = _aggregate_by(final_cases, "difficulty")
+
+    return DimensionResult(
+        dimension=dimension,
+        metrics=metrics,
+        cases=final_cases,
+        by_category=by_category,
+        by_difficulty=by_difficulty,
+    )
 
 
 # ---------------------------------------------------------------------------
-# Report generation
+# Report generators
 # ---------------------------------------------------------------------------
 
 
+def _dimension_to_dict(dim_result: DimensionResult) -> dict[str, object]:
+    """Convert a DimensionResult to a serializable dict."""
+    return {
+        "metrics": asdict(dim_result.metrics),
+        "by_category": {k: asdict(v) for k, v in dim_result.by_category.items()},
+        "by_difficulty": {k: asdict(v) for k, v in dim_result.by_difficulty.items()},
+        "cases": [asdict(c) for c in dim_result.cases],
+    }
+
+
 def _generate_json_report(
-    report_data: dict[str, Any],
+    report_data: dict[str, object],
     output_path: Path,
 ) -> None:
+    """Generate JSON report."""
     output_path.parent.mkdir(parents=True, exist_ok=True)
     output_path.write_text(
-        json.dumps(report_data, indent=2, ensure_ascii=False),
+        json.dumps(report_data, indent=2, ensure_ascii=False, default=str),
         encoding="utf-8",
     )
 
 
-def _generate_txt_report(
-    report_data: dict[str, Any],
+def _md_table(headers: list[str], rows: list[list[str]]) -> str:
+    """Generate a Markdown table."""
+    lines = ["| " + " | ".join(headers) + " |"]
+    lines.append("|" + "|".join("---" for _ in headers) + "|")
+    for row in rows:
+        lines.append("| " + " | ".join(cell.replace("|", "\\|") for cell in row) + " |")
+    return "\n".join(lines)
+
+
+def _generate_markdown_report(
+    report_data: dict[str, object],
     output_path: Path,
 ) -> None:
+    """Generate human-readable Markdown report."""
     output_path.parent.mkdir(parents=True, exist_ok=True)
 
+    timestamp = str(report_data.get("timestamp", ""))
+    version = str(report_data.get("version", ""))
+    runs = int(report_data.get("runs", 1))
+    overall = float(report_data.get("overall_accuracy", 0.0))
+    overall_mean = float(report_data.get("overall_accuracy_mean", overall))
+    overall_std = float(report_data.get("overall_accuracy_std", 0.0))
+
     lines: list[str] = []
-    lines.append("=" * 70)
-    lines.append("AgentKit Benchmark Report")
-    lines.append("=" * 70)
-    lines.append(f"Timestamp:      {report_data['timestamp']}")
-    lines.append(f"Version:        {report_data['version']}")
-    lines.append(f"Overall Score:  {report_data['overall_score']:.1%}")
-    lines.append(f"Summary:        {report_data['summary']}")
+    lines.append("# AgentKit 能力基准测试报告")
+    lines.append("")
+    lines.append("## 测试概要")
+    lines.append(f"- 时间: {timestamp}")
+    lines.append(f"- 版本: {version}")
+    lines.append(f"- 运行次数: {runs}")
+    lines.append(f"- 总体准确率: {overall_mean:.1%} ± {overall_std:.1%}")
     lines.append("")
 
-    lines.append("-" * 70)
-    lines.append(f"{'Dimension':<20} {'Total':>6} {'Pass':>6} {'Fail':>6} {'Score':>8}")
-    lines.append("-" * 70)
-
-    total_all = 0
-    pass_all = 0
-    fail_all = 0
-
-    for dim_name, dim_data in report_data["dimensions"].items():
-        total = dim_data["total"]
-        passed = dim_data["passed"]
-        failed = dim_data["failed"]
-        score = dim_data["score"]
-        lines.append(
-            f"{dim_name:<20} {total:>6} {passed:>6} {failed:>6} {score:>7.1%}"
-        )
-        total_all += total
-        pass_all += passed
-        fail_all += failed
-
-    lines.append("-" * 70)
-    overall = pass_all / total_all if total_all > 0 else 0.0
+    # Industry benchmark comparison
+    lines.append("## 与行业 Benchmark 对比")
+    lines.append("")
     lines.append(
-        f"{'OVERALL':<20} {total_all:>6} {pass_all:>6} {fail_all:>6} {overall:>7.1%}"
+        _md_table(
+            ["Benchmark", "测试对象", "AgentKit 对应"],
+            [
+                ["SWE-bench", "LLM 代码修复", "— (测 LLM 非框架)"],
+                ["ToolBench", "工具调用", "tool_search 维度"],
+                ["AgentBench", "Agent 系统", "全部维度"],
+            ],
+        )
     )
-    lines.append("=" * 70)
     lines.append("")
 
-    # Detailed failures
-    has_failures = False
-    for dim_name, dim_data in report_data["dimensions"].items():
-        failures = [d for d in dim_data["details"] if not d["passed"]]
-        if failures:
-            if not has_failures:
-                lines.append("Failed Cases:")
-                lines.append("-" * 70)
-                has_failures = True
-            for f in failures:
-                lines.append(f"  [{dim_name}] {f['case_id']}")
-                lines.append(f"    expected: {f['expected']}")
-                lines.append(f"    actual:   {f['actual']}")
-                if f.get("detail"):
-                    lines.append(f"    detail:   {f['detail']}")
+    # Dimension results
+    dimensions = report_data.get("dimensions", {})
+    if not isinstance(dimensions, dict):
+        dimensions = {}
+
+    dim_titles = {
+        "preprocessing": "1. 预处理准确度 (Preprocessing Accuracy)",
+        "overfitting": "2. 过拟合检测 (Overfitting Detection)",
+        "efficiency": "3. 效率测试 (Efficiency)",
+        "tool_search": "4. 工具搜索 (Tool Search)",
+        "event_model": "5. 事件模型 (Event Model)",
+        "spec_management": "6. 规格管理 (Spec Management)",
+        "verification": "7. 验证循环 (Verification Loop)",
+    }
+
+    lines.append("## 维度结果")
+    lines.append("")
+
+    for dim_name, title in dim_titles.items():
+        dim_data = dimensions.get(dim_name)
+        if not isinstance(dim_data, dict):
+            continue
+        metrics = dim_data.get("metrics", {})
+        if not isinstance(metrics, dict):
+            metrics = {}
+
+        lines.append(f"### {title}")
+        lines.append("")
+
+        acc = float(metrics.get("accuracy", 0.0))
+        acc_mean = float(metrics.get("accuracy_mean", acc))
+        acc_std = float(metrics.get("accuracy_std", 0.0))
+        precision = float(metrics.get("precision", 0.0))
+        recall = float(metrics.get("recall", 0.0))
+        f1 = float(metrics.get("f1", 0.0))
+        p50 = float(metrics.get("latency_p50_ms", 0.0))
+        p95 = float(metrics.get("latency_p95_ms", 0.0))
+        p99 = float(metrics.get("latency_p99_ms", 0.0))
+        consistency = float(metrics.get("consistency", 0.0))
+        total = int(metrics.get("total", 0))
+        passed = int(metrics.get("passed", 0))
+        failed = int(metrics.get("failed", 0))
+        ci_lower = float(metrics.get("ci_lower", 0.0))
+        ci_upper = float(metrics.get("ci_upper", 0.0))
+
+        lines.append(
+            _md_table(
+                ["指标", "值"],
+                [
+                    ["Accuracy", f"{acc_mean:.1%} ± {acc_std:.1%}"],
+                    ["95% CI", f"[{ci_lower:.1%}, {ci_upper:.1%}]"],
+                    ["Precision", f"{precision:.1%}"],
+                    ["Recall", f"{recall:.1%}"],
+                    ["F1", f"{f1:.1%}"],
+                    ["Latency p50", f"{p50:.2f}ms"],
+                    ["Latency p95", f"{p95:.2f}ms"],
+                    ["Latency p99", f"{p99:.2f}ms"],
+                    ["Consistency", f"{consistency:.1%}"],
+                    ["Total / Pass / Fail", f"{total} / {passed} / {failed}"],
+                ],
+            )
+        )
+        lines.append("")
+
+        # By category
+        by_category = dim_data.get("by_category", {})
+        if isinstance(by_category, dict) and by_category:
+            lines.append("#### 按类别分布")
+            lines.append("")
+            cat_rows: list[list[str]] = []
+            for cat_name, cat_metrics in by_category.items():
+                if not isinstance(cat_metrics, dict):
+                    continue
+                cat_total = int(cat_metrics.get("total", 0))
+                cat_passed = int(cat_metrics.get("passed", 0))
+                cat_acc = float(cat_metrics.get("accuracy", 0.0))
+                cat_rows.append(
+                    [
+                        str(cat_name),
+                        str(cat_total),
+                        str(cat_passed),
+                        f"{cat_acc:.1%}",
+                    ]
+                )
+            lines.append(_md_table(["类别", "用例数", "通过", "准确率"], cat_rows))
+            lines.append("")
+
+        # By difficulty
+        by_difficulty = dim_data.get("by_difficulty", {})
+        if isinstance(by_difficulty, dict) and by_difficulty:
+            lines.append("#### 按难度分布")
+            lines.append("")
+            diff_rows: list[list[str]] = []
+            for diff_name, diff_metrics in by_difficulty.items():
+                if not isinstance(diff_metrics, dict):
+                    continue
+                diff_total = int(diff_metrics.get("total", 0))
+                diff_passed = int(diff_metrics.get("passed", 0))
+                diff_acc = float(diff_metrics.get("accuracy", 0.0))
+                diff_rows.append(
+                    [
+                        str(diff_name),
+                        str(diff_total),
+                        str(diff_passed),
+                        f"{diff_acc:.1%}",
+                    ]
+                )
+            lines.append(_md_table(["难度", "用例数", "通过", "准确率"], diff_rows))
+            lines.append("")
+
+        # Failure analysis
+        cases = dim_data.get("cases", [])
+        if isinstance(cases, list):
+            failures = [c for c in cases if isinstance(c, dict) and not c.get("passed", True)]
+            if failures:
+                lines.append("#### 失败用例分析")
+                lines.append("")
+                fail_rows: list[list[str]] = []
+                for f in failures:
+                    fail_rows.append(
+                        [
+                            str(f.get("task_id", "")),
+                            str(f.get("category", "")),
+                            str(f.get("difficulty", "")),
+                            str(f.get("expected", "")),
+                            str(f.get("actual", "")),
+                            str(f.get("root_cause", "")),
+                        ]
+                    )
+                lines.append(
+                    _md_table(
+                        ["用例 ID", "类别", "难度", "期望", "实际", "根因"],
+                        fail_rows,
+                    )
+                )
                 lines.append("")
 
-    if not has_failures:
-        lines.append("All tests passed — no failures to report.")
+    # Baseline comparison
+    baseline_comparison = report_data.get("baseline_comparison")
+    if isinstance(baseline_comparison, dict):
+        lines.append("## 基线对比")
         lines.append("")
+        status = baseline_comparison.get("status", "")
+        if status == "first_run":
+            lines.append("> 首次运行，已自动创建基线。")
+            lines.append("")
+        else:
+            dim_comparisons = baseline_comparison.get("dimensions", {})
+            if isinstance(dim_comparisons, dict) and dim_comparisons:
+                bl_rows: list[list[str]] = []
+                for dim_name, cmp_data in dim_comparisons.items():
+                    if not isinstance(cmp_data, dict):
+                        continue
+                    bl_acc = float(cmp_data.get("baseline_accuracy", 0.0))
+                    cur_acc = float(cmp_data.get("current_accuracy", 0.0))
+                    direction = str(cmp_data.get("direction", "—"))
+                    bl_rows.append(
+                        [
+                            str(dim_name),
+                            f"{bl_acc:.1%}",
+                            f"{cur_acc:.1%}",
+                            direction,
+                        ]
+                    )
+                lines.append(
+                    _md_table(
+                        ["维度", "基线准确率", "当前准确率", "变化"],
+                        bl_rows,
+                    )
+                )
+                lines.append("")
+
+    # Improvement suggestions
+    lines.append("## 问题总结与改进建议")
+    lines.append("")
+    suggestions = _generate_suggestions(dimensions)
+    for s in suggestions:
+        lines.append(s)
+    lines.append("")
 
     output_path.write_text("\n".join(lines), encoding="utf-8")
 
 
+def _generate_suggestions(dimensions: dict[str, object]) -> list[str]:
+    """Generate improvement suggestions based on results."""
+    suggestions: list[str] = []
+    if not isinstance(dimensions, dict):
+        return ["- 无可用的维度数据，无法生成改进建议。"]
+
+    for dim_name, dim_data in dimensions.items():
+        if not isinstance(dim_data, dict):
+            continue
+        metrics = dim_data.get("metrics", {})
+        if not isinstance(metrics, dict):
+            continue
+        acc = float(metrics.get("accuracy", 1.0))
+        p95 = float(metrics.get("latency_p95_ms", 0.0))
+        consistency = float(metrics.get("consistency", 1.0))
+
+        if acc < 0.9:
+            suggestions.append(
+                f"- **{dim_name}**: 准确率 {acc:.1%} 低于 90%，建议检查失败用例并优化"
+            )
+        if p95 > 100:
+            suggestions.append(f"- **{dim_name}**: P95 延迟 {p95:.2f}ms 较高，建议优化性能")
+        if dim_name == "overfitting" and consistency < 1.0:
+            suggestions.append(
+                f"- **overfitting**: 一致性 {consistency:.1%} 低于 100%，存在过拟合风险"
+            )
+
+    if not suggestions:
+        suggestions.append("- 所有维度表现良好，无需特别改进。")
+    return suggestions
+
+
 def _generate_html_report(
-    report_data: dict[str, Any],
+    report_data: dict[str, object],
     output_path: Path,
 ) -> None:
+    """Generate HTML report."""
     output_path.parent.mkdir(parents=True, exist_ok=True)
 
+    dimensions = report_data.get("dimensions", {})
+    if not isinstance(dimensions, dict):
+        dimensions = {}
+
     rows_html: list[str] = []
     total_all = 0
     pass_all = 0
     fail_all = 0
 
-    for dim_name, dim_data in report_data["dimensions"].items():
-        total = dim_data["total"]
-        passed = dim_data["passed"]
-        failed = dim_data["failed"]
-        score = dim_data["score"]
+    for dim_name, dim_data in dimensions.items():
+        if not isinstance(dim_data, dict):
+            continue
+        metrics = dim_data.get("metrics", {})
+        if not isinstance(metrics, dict):
+            metrics = {}
+        total = int(metrics.get("total", 0))
+        passed = int(metrics.get("passed", 0))
+        failed = int(metrics.get("failed", 0))
+        acc = float(metrics.get("accuracy", 0.0))
         total_all += total
         pass_all += passed
         fail_all += failed
 
-        score_class = "score-good" if score >= 0.9 else "score-warn" if score >= 0.7 else "score-bad"
+        acc_class = "good" if acc >= 0.9 else "warn" if acc >= 0.7 else "bad"
         rows_html.append(
             f"<tr>"
             f"<td>{dim_name}</td>"
             f"<td class='num'>{total}</td>"
             f"<td class='num pass'>{passed}</td>"
             f"<td class='num fail'>{failed}</td>"
-            f"<td class='num {score_class}'>{score:.1%}</td>"
+            f"<td class='num {acc_class}'>{acc:.1%}</td>"
+            f"<td class='num'>{float(metrics.get('precision', 0)):.1%}</td>"
+            f"<td class='num'>{float(metrics.get('recall', 0)):.1%}</td>"
+            f"<td class='num'>{float(metrics.get('f1', 0)):.1%}</td>"
+            f"<td class='num'>{float(metrics.get('latency_p50_ms', 0)):.2f}ms</td>"
             f"</tr>"
         )
 
     overall = pass_all / total_all if total_all > 0 else 0.0
-    overall_class = (
-        "score-good" if overall >= 0.9 else "score-warn" if overall >= 0.7 else "score-bad"
-    )
-    rows_html.append(
-        f"<tr class='overall-row'>"
-        f"<td><strong>OVERALL</strong></td>"
-        f"<td class='num'><strong>{total_all}</strong></td>"
-        f"<td class='num pass'><strong>{pass_all}</strong></td>"
-        f"<td class='num fail'><strong>{fail_all}</strong></td>"
-        f"<td class='num {overall_class}'><strong>{overall:.1%}</strong></td>"
-        f"</tr>"
-    )
+    overall_class = "good" if overall >= 0.9 else "warn" if overall >= 0.7 else "bad"
 
-    # Failure details
-    failure_html: list[str] = []
-    for dim_name, dim_data in report_data["dimensions"].items():
-        failures = [d for d in dim_data["details"] if not d["passed"]]
-        for f in failures:
-            failure_html.append(
-                f"<div class='failure'>"
-                f"<span class='dim'>[{dim_name}]</span> "
-                f"<span class='case'>{f['case_id']}</span>"
-                f"<div class='detail'>expected: {f['expected']}</div>"
-                f"<div class='detail'>actual: {f['actual']}</div>"
-                f"</div>"
-            )
-
-    failures_section = (
-        "<h2>Failed Cases</h2>" + "".join(failure_html)
-        if failure_html
-        else "<p class='all-pass'>All tests passed.</p>"
-    )
+    timestamp = str(report_data.get("timestamp", ""))
+    version = str(report_data.get("version", ""))
+    runs = int(report_data.get("runs", 1))
 
     html = f"""<!DOCTYPE html>
 <html lang="en">
@@ -1124,33 +1946,26 @@ def _generate_html_report(
   td.num {{ text-align: right; font-family: monospace; }}
   td.pass {{ color: #2e7d32; }}
   td.fail {{ color: #c62828; }}
-  .score-good {{ color: #2e7d32; font-weight: bold; }}
-  .score-warn {{ color: #e65100; font-weight: bold; }}
-  .score-bad {{ color: #c62828; font-weight: bold; }}
-  .overall-row {{ background-color: #f5f5f5; }}
-  .failure {{ margin: 0.5em 0; padding: 0.5em; background: #fff3e0; border-left: 3px solid #ff9800; }}
-  .failure .dim {{ color: #e65100; font-weight: bold; }}
-  .failure .case {{ font-family: monospace; }}
-  .failure .detail {{ font-size: 0.85em; color: #555; margin-left: 1em; }}
-  .all-pass {{ color: #2e7d32; font-weight: bold; }}
+  .good {{ color: #2e7d32; font-weight: bold; }}
+  .warn {{ color: #e65100; font-weight: bold; }}
+  .bad {{ color: #c62828; font-weight: bold; }}
 </style>
 </head>
 <body>
 <h1>AgentKit Benchmark Report</h1>
 <div class="meta">
-  <p>Timestamp: {report_data['timestamp']}</p>
-  <p>Version: {report_data['version']}</p>
-  <p>Overall Score: <strong>{overall:.1%}</strong></p>
-  <p>Summary: {report_data['summary']}</p>
+  <p>Timestamp: {timestamp}</p>
+  <p>Version: {version}</p>
+  <p>Runs: {runs}</p>
+  <p>Overall Accuracy: <strong class="{overall_class}">{overall:.1%}</strong></p>
 </div>
 <h2>Dimension Results</h2>
 <table>
-<thead><tr><th>Dimension</th><th>Total</th><th>Pass</th><th>Fail</th><th>Score</th></tr></thead>
+<thead><tr><th>Dimension</th><th>Total</th><th>Pass</th><th>Fail</th><th>Acc</th><th>P</th><th>R</th><th>F1</th><th>p50</th></tr></thead>
 <tbody>
 {"".join(rows_html)}
 </tbody>
 </table>
-{failures_section}
 </body>
 </html>"""
 
@@ -1158,42 +1973,112 @@ def _generate_html_report(
 
 
 # ---------------------------------------------------------------------------
-# Main command
+# Baseline management
 # ---------------------------------------------------------------------------
 
 
-def _get_version() -> str:
+def _load_baseline(output_dir: Path) -> dict[str, object] | None:
+    """Load baseline.json; return None if missing, unreadable, or not a JSON object."""
+    baseline_path = output_dir / "baseline.json"
+    if not baseline_path.exists():
+        return None
     try:
-        from importlib.metadata import version as get_version
-
-        return get_version("fischer-agentkit")
+        data = json.loads(baseline_path.read_text(encoding="utf-8"))
+        if isinstance(data, dict):
+            return data
     except Exception:
-        return "0.1.0 (dev)"
+        pass
+    return None
+
+
+def _save_baseline(report_data: dict[str, object], output_dir: Path) -> None:
+    """Save current report as baseline."""
+    baseline_path = output_dir / "baseline.json"
+    baseline_path.write_text(
+        json.dumps(report_data, indent=2, ensure_ascii=False, default=str),
+        encoding="utf-8",
+    )
+
+
+def _compare_with_baseline(
+    current: dict[str, object],
+    baseline: dict[str, object],
+) -> dict[str, object]:
+    """Compare per-dimension accuracy with the baseline (±0.001 dead band)."""
+    comparison: dict[str, object] = {"status": "compared", "dimensions": {}}
+    current_dims = current.get("dimensions", {})
+    baseline_dims = baseline.get("dimensions", {})
+    if not isinstance(current_dims, dict) or not isinstance(baseline_dims, dict):
+        return comparison
+
+    dim_comparison: dict[str, object] = {}
+    for dim_name, dim_data in current_dims.items():
+        if not isinstance(dim_data, dict):
+            continue
+        baseline_dim = baseline_dims.get(dim_name, {})
+        if not isinstance(baseline_dim, dict):
+            baseline_dim = {}
+
+        current_metrics = dim_data.get("metrics", {})
+        baseline_metrics = baseline_dim.get("metrics", {})
+        if not isinstance(current_metrics, dict):
+            current_metrics = {}
+        if not isinstance(baseline_metrics, dict):
+            baseline_metrics = {}
+
+        current_acc = float(current_metrics.get("accuracy", 0.0))
+        baseline_acc = float(baseline_metrics.get("accuracy", 0.0))
+        change = current_acc - baseline_acc
+
+        dim_comparison[dim_name] = {
+            "baseline_accuracy": round(baseline_acc, 4),
+            "current_accuracy": round(current_acc, 4),
+            "change": round(change, 4),
+            "direction": "↑" if change > 0.001 else "↓" if change < -0.001 else "—",
+        }
+
+    comparison["dimensions"] = dim_comparison
+    return comparison
+
+
+# ---------------------------------------------------------------------------
+# Terminal display
+# ---------------------------------------------------------------------------
 
 
 def _build_summary_table(results: dict[str, DimensionResult]) -> Table:
+    """Build Rich summary table with full metrics."""
     table = Table(title="AgentKit Benchmark Results", show_lines=True)
     table.add_column("Dimension", style="cyan", no_wrap=True)
-    table.add_column("Total", justify="right", style="white")
+    table.add_column("Total", justify="right")
     table.add_column("Pass", justify="right", style="green")
     table.add_column("Fail", justify="right", style="red")
-    table.add_column("Score", justify="right", style="magenta")
+    table.add_column("Acc", justify="right", style="magenta")
+    table.add_column("P", justify="right")
+    table.add_column("R", justify="right")
+    table.add_column("F1", justify="right")
+    table.add_column("p50", justify="right")
 
     total_all = 0
     pass_all = 0
     fail_all = 0
 
     for dim_name, dim_result in results.items():
+        m = dim_result.metrics
         table.add_row(
             dim_name,
-            str(dim_result.total),
-            str(dim_result.passed),
-            str(dim_result.failed),
-            f"{dim_result.score:.1%}",
+            str(m.total),
+            str(m.passed),
+            str(m.failed),
+            f"{m.accuracy_mean:.1%}±{m.accuracy_std:.1%}",
+            f"{m.precision:.1%}" if m.precision > 0 else "—",
+            f"{m.recall:.1%}" if m.recall > 0 else "—",
+            f"{m.f1:.1%}" if m.f1 > 0 else "—",
+            f"{m.latency_p50_ms:.2f}ms",
         )
-        total_all += dim_result.total
-        pass_all += dim_result.passed
-        fail_all += dim_result.failed
+        total_all += m.total
+        pass_all += m.passed
+        fail_all += m.failed
 
     overall = pass_all / total_all if total_all > 0 else 0.0
     table.add_row(
@@ -1202,11 +2087,30 @@ def _build_summary_table(results: dict[str, DimensionResult]) -> Table:
         f"[bold green]{pass_all}[/bold green]",
         f"[bold red]{fail_all}[/bold red]",
         f"[bold magenta]{overall:.1%}[/bold magenta]",
+        "—",
+        "—",
+        "—",
+        "—",
     )
 
     return table
 
 
+# ---------------------------------------------------------------------------
+# Main command
+# ---------------------------------------------------------------------------
+
+
+def _get_version() -> str:
+    """Get package version."""
+    try:
+        from importlib.metadata import version as get_version
+
+        return get_version("fischer-agentkit")
+    except Exception:
+        return "0.1.0 (dev)"
+
+
 def benchmark(
     dimension: BenchmarkDimension = typer.Option(
         BenchmarkDimension.ALL,
@@ -1214,12 +2118,12 @@ def benchmark(
         "-d",
         help="Benchmark dimension to run (default: all)",
     ),
-    report: bool = typer.Option(False, "--report", help="Generate JSON + TXT report files"),
+    report: bool = typer.Option(False, "--report", help="Generate report files"),
     format: str = typer.Option(
-        "json",
+        "markdown",
         "--format",
         "-f",
-        help="Report format: json, txt, or html (use with --report)",
+        help="Report format: markdown (default), json, or html; txt maps to markdown",
     ),
     output_dir: str = typer.Option(
         _DEFAULT_OUTPUT_DIR,
@@ -1229,24 +2133,35 @@ def benchmark(
     ),
     fast: bool = typer.Option(False, "--fast", help="Run only core test cases"),
     verbose: bool = typer.Option(False, "--verbose", "-v", help="Show detailed output"),
+    runs: int = typer.Option(3, "--runs", help="Number of runs for averaging (default: 3)"),
+    baseline: bool = typer.Option(False, "--baseline", help="Compare with or create the baseline"),
 ):
-    """Run AgentKit capability benchmarks and generate reports.
+    """Run AgentKit capability benchmarks with standardized metrics.
 
     Tests core components directly (no LLM, no pytest subprocess):
     preprocessing, overfitting, efficiency, tool_search, event_model,
     spec_management, verification.
+
+    Produces Accuracy / Precision / Recall / F1 / Latency / Consistency
+    metrics with multi-run averaging and 95% confidence intervals.
     """
     import tempfile
 
-    # Normalize dimension to enum (Typer may pass string)
+    # Normalize dimension (Typer may pass string)
     if isinstance(dimension, str):
         dimension = BenchmarkDimension(dimension)
 
+    # Normalize format
+    fmt = format.lower()
+    if fmt == "txt":
+        fmt = "markdown"
+
     console.print()
     console.print(
         Panel.fit(
             "[bold cyan]AgentKit Benchmark[/bold cyan]\n"
             f"Dimension: [yellow]{dimension.value}[/yellow]  "
+            f"Runs: [yellow]{runs}[/yellow]  "
             f"Fast: [yellow]{fast}[/yellow]  "
             f"Verbose: [yellow]{verbose}[/yellow]",
             border_style="cyan",
@@ -1268,21 +2183,11 @@ def benchmark(
     else:
         dims_to_run = [dimension]
 
-    # Map dimension enum to runner functions
-    runner_map: dict[BenchmarkDimension, Any] = {
-        BenchmarkDimension.PREPROCESSING: _run_preprocessing,
-        BenchmarkDimension.OVERFITTING: _run_overfitting,
-        BenchmarkDimension.EFFICIENCY: _run_efficiency,
-        BenchmarkDimension.TOOL_SEARCH: _run_tool_search,
-        BenchmarkDimension.EVENT_MODEL: _run_event_model,
-        BenchmarkDimension.SPEC_MANAGEMENT: _run_spec_management,
-        BenchmarkDimension.VERIFICATION: _run_verification,
-    }
-
     results: dict[str, DimensionResult] = {}
 
     with tempfile.TemporaryDirectory(prefix="agentkit-benchmark-") as tmp:
         tmp_path = Path(tmp)
+        ctx = _make_context(tmp_path)
 
         with Progress(
             SpinnerColumn(),
@@ -1292,17 +2197,8 @@ def benchmark(
             console=console,
         ) as progress:
             for dim in dims_to_run:
-                task = progress.add_task(
-                    f"Running {dim.value}...", total=None
-                )
-                runner = runner_map[dim]
-
-                # spec_management and verification need tmp_path
-                if dim in (BenchmarkDimension.SPEC_MANAGEMENT, BenchmarkDimension.VERIFICATION):
-                    dim_result = asyncio.run(runner(fast, verbose, tmp_path))
-                else:
-                    dim_result = asyncio.run(runner(fast, verbose))
-
+                task = progress.add_task(f"Running {dim.value}...", total=None)
+                dim_result = asyncio.run(_run_dimension(dim.value, runs, fast, verbose, ctx))
                 results[dim.value] = dim_result
                 progress.update(task, completed=True, total=1)
 
@@ -1313,9 +2209,9 @@ def benchmark(
     console.print()
 
     # Compute overall
-    total_all = sum(r.total for r in results.values())
-    pass_all = sum(r.passed for r in results.values())
-    fail_all = sum(r.failed for r in results.values())
+    total_all = sum(r.metrics.total for r in results.values())
+    pass_all = sum(r.metrics.passed for r in results.values())
+    fail_all = sum(r.metrics.failed for r in results.values())
     overall_score = pass_all / total_all if total_all > 0 else 0.0
 
     if fail_all == 0:
@@ -1338,26 +2234,59 @@ def benchmark(
         timestamp = datetime.now(timezone.utc).isoformat()
         version = _get_version()
 
-        report_data: dict[str, Any] = {
+        # Compute overall multi-run stats across every dimension. Filtering on
+        # accuracy_std > 0 would drop stable dimensions and bias the mean.
+        all_accuracies: list[float] = [
+            dim_result.metrics.accuracy_mean for dim_result in results.values()
+        ]
+
+        # Single-run reports fall back to the pooled pass rate with zero spread.
+        overall_mean = overall_score
+        overall_std = 0.0
+        if runs > 1 and all_accuracies:
+            # Macro average of the per-dimension means.
+            overall_mean = sum(all_accuracies) / len(all_accuracies)
+            # _std needs at least two values to measure spread.
+            overall_std = _std(all_accuracies) if len(all_accuracies) > 1 else 0.0
+
+        report_data: dict[str, object] = {
             "timestamp": timestamp,
             "version": version,
-            "dimensions": {name: r.to_dict() for name, r in results.items()},
-            "overall_score": round(overall_score, 4),
+            "runs": runs,
+            "fast": fast,
+            "overall_accuracy": round(overall_score, 4),
+            "overall_accuracy_mean": round(overall_mean, 4),
+            "overall_accuracy_std": round(overall_std, 4),
             "summary": summary,
+            "dimensions": {name: _dimension_to_dict(r) for name, r in results.items()},
         }
 
+        # Baseline comparison
+        if baseline:
+            baseline_data = _load_baseline(out_path)
+            if baseline_data is None:
+                _save_baseline(report_data, out_path)
+                report_data["baseline_comparison"] = {
+                    "status": "first_run",
+                    "message": "Baseline created from current run",
+                }
+                console.print("[green]Baseline created:[/green] baseline.json")
+            else:
+                comparison = _compare_with_baseline(report_data, baseline_data)
+                report_data["baseline_comparison"] = comparison
+                console.print("[green]Baseline comparison:[/green] completed")
+
         # Always generate JSON
         json_path = out_path / "benchmark_report.json"
         _generate_json_report(report_data, json_path)
         console.print(f"[green]JSON report:[/green] {json_path}")
 
-        # Always generate TXT
-        txt_path = out_path / "benchmark_report.txt"
-        _generate_txt_report(report_data, txt_path)
-        console.print(f"[green]TXT report:[/green] {txt_path}")
-
-        # Generate HTML if requested
-        if format.lower() == "html":
+        # Generate format-specific report
+        if fmt == "markdown":
+            md_path = out_path / "benchmark_report.md"
+            _generate_markdown_report(report_data, md_path)
+            console.print(f"[green]Markdown report:[/green] {md_path}")
+        elif fmt == "html":
             html_path = out_path / "benchmark_report.html"
             _generate_html_report(report_data, html_path)
             console.print(f"[green]HTML report:[/green] {html_path}")
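Note on the `ci_lower` / `ci_upper` fields in the baseline that follows: the code that computes them is not part of this excerpt, so the following is only a minimal sketch of a 95% Wilson score interval. The function name `wilson_interval` and the constant `z = 1.96` are assumptions for illustration, not names taken from the module.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for a pass rate.

    Hypothetical helper: the name and the default z are assumptions.
    """
    if total <= 0:
        # No observations, so no interval can be estimated.
        return (0.0, 0.0)
    p = passed / total
    z2 = z * z
    denom = 1.0 + z2 / total
    # Centre of the interval, pulled toward 0.5 for small samples.
    center = (p + z2 / (2 * total)) / denom
    # Half-width of the interval.
    half = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return (max(0.0, center - half), min(1.0, center + half))


for n in (15, 10, 5, 3, 1):
    lower, upper = wilson_interval(n, n)
    print(n, round(lower, 4), round(upper, 4))
```

With every case passing, the lower bound reduces to `n / (n + z**2)`. For totals of 15, 10, 5, 3 and 1 this gives 0.7961, 0.7225, 0.5655, 0.4385 and 0.2065, which agree with the `ci_lower` values recorded in `baseline.json`. The wide intervals for the small categories are a reminder that a 100% pass rate on a handful of cases is weak evidence.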
diff --git a/test-results/benchmark/baseline.json b/test-results/benchmark/baseline.json
new file mode 100644
index 0000000..e026a91
--- /dev/null
+++ b/test-results/benchmark/baseline.json
@@ -0,0 +1,1522 @@
+{
+  "timestamp": "2026-06-17T03:54:43.123142+00:00",
+  "version": "0.1.0",
+  "runs": 1,
+  "fast": false,
+  "overall_accuracy": 1.0,
+  "overall_accuracy_mean": 1.0,
+  "overall_accuracy_std": 0.0,
+  "summary": "All 53 tests passed across 7 dimensions.",
+  "dimensions": {
+    "preprocessing": {
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "f1": 1.0,
+        "latency_p50_ms": 0.016,
+        "latency_p95_ms": 0.4208,
+        "latency_p99_ms": 1.1294,
+        "consistency": 1.0,
+        "total": 15,
+        "passed": 15,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.7961,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "greeting": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0196,
+          "latency_p95_ms": 0.0241,
+          "latency_p99_ms": 0.0243,
+          "consistency": 1.0,
+          "total": 4,
+          "passed": 4,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5101,
+          "ci_upper": 1.0
+        },
+        "tool_query": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0153,
+          "latency_p95_ms": 0.0162,
+          "latency_p99_ms": 0.0164,
+          "consistency": 1.0,
+          "total": 5,
+          "passed": 5,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5655,
+          "ci_upper": 1.0
+        },
+        "skill_prefix": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0412,
+          "latency_p95_ms": 1.1801,
+          "latency_p99_ms": 1.2813,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        },
+        "complex": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0147,
+          "latency_p95_ms": 0.0148,
+          "latency_p99_ms": 0.0148,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.017,
+          "latency_p95_ms": 0.0239,
+          "latency_p99_ms": 0.0243,
+          "consistency": 1.0,
+          "total": 5,
+          "passed": 5,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5655,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0156,
+          "latency_p95_ms": 0.0367,
+          "latency_p99_ms": 0.0403,
+          "consistency": 1.0,
+          "total": 7,
+          "passed": 7,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.6457,
+          "ci_upper": 1.0
+        },
+        "hard": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0147,
+          "latency_p95_ms": 1.1774,
+          "latency_p99_ms": 1.2808,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
+        {
+          "task_id": "prep-001",
+          "dimension": "preprocessing",
+          "category": "greeting",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "direct_chat",
+          "actual": "direct_chat",
+          "duration_ms": 0.0221,
+          "root_cause": "none",
+          "detail": "input='你好' method=regex_direct",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-002",
+          "dimension": "preprocessing",
+          "category": "greeting",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "direct_chat",
+          "actual": "direct_chat",
+          "duration_ms": 0.0244,
+          "root_cause": "none",
+          "detail": "input='hello' method=regex_direct",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-003",
+          "dimension": "preprocessing",
+          "category": "greeting",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "direct_chat",
+          "actual": "direct_chat",
+          "duration_ms": 0.017,
+          "root_cause": "none",
+          "detail": "input='谢谢' method=regex_direct",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-004",
+          "dimension": "preprocessing",
+          "category": "greeting",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "direct_chat",
+          "actual": "direct_chat",
+          "duration_ms": 0.016,
+          "root_cause": "none",
+          "detail": "input='你是谁' method=regex_direct",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-005",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0164,
+          "root_cause": "none",
+          "detail": "input='搜索golang教程' method=default_react",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-006",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0156,
+          "root_cause": "none",
+          "detail": "input='执行ls命令' method=default_react",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-007",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0153,
+          "root_cause": "none",
+          "detail": "input='翻译hello为中文' method=default_react",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-008",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.014,
+          "root_cause": "none",
+          "detail": "input='什么是机器学习' method=default_react",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-009",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0148,
+          "root_cause": "none",
+          "detail": "input='帮我分析数据' method=default_react",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-010",
+          "dimension": "preprocessing",
+          "category": "skill_prefix",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "skill_react",
+          "actual": "skill_react",
+          "duration_ms": 0.0412,
+          "root_cause": "none",
+          "detail": "input='@skill:react_agent 查看ip' method=skill_prefix",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-011",
+          "dimension": "preprocessing",
+          "category": "skill_prefix",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "direct_chat",
+          "actual": "direct_chat",
+          "duration_ms": 0.0262,
+          "root_cause": "none",
+          "detail": "input='@skill:chat_only 你好' method=skill_prefix",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-012",
+          "dimension": "preprocessing",
+          "category": "skill_prefix",
+          "difficulty": "hard",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 1.3066,
+          "root_cause": "none",
+          "detail": "input='@skill:nonexistent 做点什么' method=skill_not_found_fallback",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-013",
+          "dimension": "preprocessing",
+          "category": "complex",
+          "difficulty": "hard",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0147,
+          "root_cause": "none",
+          "detail": "input='帮我分析这个数据并生成报告' method=default_react",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-014",
+          "dimension": "preprocessing",
+          "category": "complex",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0148,
+          "root_cause": "none",
+          "detail": "input='随便聊聊' method=default_react",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-015",
+          "dimension": "preprocessing",
+          "category": "complex",
+          "difficulty": "hard",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0132,
+          "root_cause": "none",
+          "detail": "input='请帮我完成以下任务：1. 查询天气 2. 生成报告' method=default_react",
+          "consistency": 1.0
+        }
+      ]
+    },
+    "overfitting": {
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "f1": 1.0,
+        "latency_p50_ms": 0.0295,
+        "latency_p95_ms": 0.0396,
+        "latency_p99_ms": 0.0401,
+        "consistency": 1.0,
+        "total": 5,
+        "passed": 5,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.5655,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "ip_check": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0402,
+          "latency_p95_ms": 0.0402,
+          "latency_p99_ms": 0.0402,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "search": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0282,
+          "latency_p95_ms": 0.0282,
+          "latency_p99_ms": 0.0282,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "greeting": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0373,
+          "latency_p95_ms": 0.0373,
+          "latency_p99_ms": 0.0373,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "tool_use": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0295,
+          "latency_p95_ms": 0.0295,
+          "latency_p99_ms": 0.0295,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "complex": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0249,
+          "latency_p95_ms": 0.0249,
+          "latency_p99_ms": 0.0249,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0295,
+          "latency_p95_ms": 0.0391,
+          "latency_p99_ms": 0.04,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        },
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0373,
+          "latency_p95_ms": 0.0373,
+          "latency_p99_ms": 0.0373,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "hard": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0249,
+          "latency_p95_ms": 0.0249,
+          "latency_p99_ms": 0.0249,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
+        {
+          "task_id": "over-001",
+          "dimension": "overfitting",
+          "category": "ip_check",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0402,
+          "root_cause": "none",
+          "detail": "paraphrases=5 modes=['react', 'react', 'react', 'react', 'react']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "over-002",
+          "dimension": "overfitting",
+          "category": "search",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0282,
+          "root_cause": "none",
+          "detail": "paraphrases=3 modes=['react', 'react', 'react']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "over-003",
+          "dimension": "overfitting",
+          "category": "greeting",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "direct_chat",
+          "actual": "direct_chat",
+          "duration_ms": 0.0373,
+          "root_cause": "none",
+          "detail": "paraphrases=5 modes=['direct_chat', 'direct_chat', 'direct_chat', 'direct_chat', 'direct_chat']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "over-004",
+          "dimension": "overfitting",
+          "category": "tool_use",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0295,
+          "root_cause": "none",
+          "detail": "paraphrases=3 modes=['react', 'react', 'react']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "over-005",
+          "dimension": "overfitting",
+          "category": "complex",
+          "difficulty": "hard",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0249,
+          "root_cause": "none",
+          "detail": "paraphrases=3 modes=['react', 'react', 'react']",
+          "consistency": 1.0
+        }
+      ]
+    },
+    "efficiency": {
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "latency_p50_ms": 0.33,
+        "latency_p95_ms": 0.602,
+        "latency_p99_ms": 0.6404,
+        "consistency": 1.0,
+        "total": 5,
+        "passed": 5,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.5655,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "preprocess_latency": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.33,
+          "latency_p95_ms": 0.402,
+          "latency_p99_ms": 0.4084,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        },
+        "tool_search_latency": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.345,
+          "latency_p95_ms": 0.6195,
+          "latency_p99_ms": 0.6439,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.16,
+          "latency_p95_ms": 0.268,
+          "latency_p99_ms": 0.2776,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.41,
+          "latency_p95_ms": 0.626,
+          "latency_p99_ms": 0.6452,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
+        {
+          "task_id": "eff-001",
+          "dimension": "efficiency",
+          "category": "preprocess_latency",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "<=50ms",
+          "actual": "0.003ms",
+          "duration_ms": 0.28,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.003ms threshold=50.0ms",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "eff-002",
+          "dimension": "efficiency",
+          "category": "preprocess_latency",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "<=50ms",
+          "actual": "0.003ms",
+          "duration_ms": 0.33,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.003ms threshold=50.0ms",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "eff-003",
+          "dimension": "efficiency",
+          "category": "preprocess_latency",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "<=50ms",
+          "actual": "0.004ms",
+          "duration_ms": 0.41,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.004ms threshold=50.0ms",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "eff-004",
+          "dimension": "efficiency",
+          "category": "tool_search_latency",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "<=10ms",
+          "actual": "0.006ms",
+          "duration_ms": 0.65,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.006ms threshold=10.0ms",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "eff-005",
+          "dimension": "efficiency",
+          "category": "tool_search_latency",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "<=5ms",
+          "actual": "0.000ms",
+          "duration_ms": 0.04,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.000ms threshold=5.0ms",
+          "consistency": 1.0
+        }
+      ]
+    },
+    "tool_search": {
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.8333,
+        "recall": 0.8333,
+        "f1": 0.8333,
+        "latency_p50_ms": 0.0229,
+        "latency_p95_ms": 0.0415,
+        "latency_p99_ms": 0.0518,
+        "consistency": 1.0,
+        "total": 10,
+        "passed": 10,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.7225,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "exact_match": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0234,
+          "latency_p95_ms": 0.0487,
+          "latency_p99_ms": 0.0533,
+          "consistency": 1.0,
+          "total": 5,
+          "passed": 5,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5655,
+          "ci_upper": 1.0
+        },
+        "fuzzy_match": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0224,
+          "latency_p95_ms": 0.0228,
+          "latency_p99_ms": 0.0228,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "no_match": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.0089,
+          "latency_p95_ms": 0.0141,
+          "latency_p99_ms": 0.0146,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "top_k": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0184,
+          "latency_p95_ms": 0.0184,
+          "latency_p99_ms": 0.0184,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.8333,
+          "recall": 0.8333,
+          "f1": 0.8333,
+          "latency_p50_ms": 0.0231,
+          "latency_p95_ms": 0.0458,
+          "latency_p99_ms": 0.0527,
+          "consistency": 1.0,
+          "total": 7,
+          "passed": 7,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.6457,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0219,
+          "latency_p95_ms": 0.0227,
+          "latency_p99_ms": 0.0228,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
+        {
+          "task_id": "ts-001",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "read_file",
+          "actual": "read_file",
+          "duration_ms": 0.023,
+          "root_cause": "none",
+          "detail": "query='read file' top_k=5 results=2",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-002",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "write_file",
+          "actual": "write_file",
+          "duration_ms": 0.0544,
+          "root_cause": "none",
+          "detail": "query='write file content' top_k=5 results=2",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-003",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "web_search",
+          "actual": "web_search",
+          "duration_ms": 0.0258,
+          "root_cause": "none",
+          "detail": "query='search web information' top_k=5 results=2",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-004",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "shell_exec",
+          "actual": "shell_exec",
+          "duration_ms": 0.0234,
+          "root_cause": "none",
+          "detail": "query='execute shell command' top_k=5 results=1",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-005",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "http_request",
+          "actual": "http_request",
+          "duration_ms": 0.0231,
+          "root_cause": "none",
+          "detail": "query='send http request url' top_k=5 results=1",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-006",
+          "dimension": "tool_search",
+          "category": "fuzzy_match",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "read_file",
+          "actual": "read_file",
+          "duration_ms": 0.0228,
+          "root_cause": "none",
+          "detail": "query='io file' top_k=5 results=2",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-007",
+          "dimension": "tool_search",
+          "category": "fuzzy_match",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "web_search",
+          "actual": "web_search",
+          "duration_ms": 0.0219,
+          "root_cause": "none",
+          "detail": "query='search query engine' top_k=5 results=1",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-008",
+          "dimension": "tool_search",
+          "category": "no_match",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "__none__",
+          "actual": "[]",
+          "duration_ms": 0.003,
+          "root_cause": "none",
+          "detail": "query='' top_k=5 results=0",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-009",
+          "dimension": "tool_search",
+          "category": "no_match",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "__none__",
+          "actual": "[]",
+          "duration_ms": 0.0147,
+          "root_cause": "none",
+          "detail": "query='zzzznonexistent' top_k=5 results=0",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-010",
+          "dimension": "tool_search",
+          "category": "top_k",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "read_file",
+          "actual": "read_file",
+          "duration_ms": 0.0184,
+          "root_cause": "none",
+          "detail": "query='file' top_k=1 results=1",
+          "consistency": 1.0
+        }
+      ]
+    },
+    "event_model": {
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "latency_p50_ms": 0.0894,
+        "latency_p95_ms": 16.7933,
+        "latency_p99_ms": 20.5773,
+        "consistency": 1.0,
+        "total": 6,
+        "passed": 6,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.6097,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "sq_lifecycle": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.0671,
+          "latency_p95_ms": 0.1071,
+          "latency_p99_ms": 0.1107,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        },
+        "eq_lifecycle": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 2.6035,
+          "latency_p95_ms": 19.6313,
+          "latency_p99_ms": 21.1449,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.0894,
+          "latency_p95_ms": 16.7933,
+          "latency_p99_ms": 20.5773,
+          "consistency": 1.0,
+          "total": 6,
+          "passed": 6,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.6097,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
+        {
+          "task_id": "ev-001",
+          "dimension": "event_model",
+          "category": "sq_lifecycle",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "drained=['hello']",
+          "duration_ms": 0.1116,
+          "root_cause": "none",
+          "detail": "task_id=5c4be886...",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ev-002",
+          "dimension": "event_model",
+          "category": "sq_lifecycle",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "cancelled=True",
+          "duration_ms": 0.0671,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ev-003",
+          "dimension": "event_model",
+          "category": "sq_lifecycle",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "raised=True closed=True",
+          "duration_ms": 0.0143,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ev-004",
+          "dimension": "event_model",
+          "category": "eq_lifecycle",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "received=1",
+          "duration_ms": 2.6035,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ev-005",
+          "dimension": "event_model",
+          "category": "eq_lifecycle",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "events=1 closed=True",
+          "duration_ms": 21.5233,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ev-006",
+          "dimension": "event_model",
+          "category": "eq_lifecycle",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "subscribers=0",
+          "duration_ms": 0.008,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        }
+      ]
+    },
+    "spec_management": {
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "latency_p50_ms": 1.4329,
+        "latency_p95_ms": 2.75,
+        "latency_p99_ms": 3.1046,
+        "consistency": 1.0,
+        "total": 7,
+        "passed": 7,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.6457,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "crud": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 1.4329,
+          "latency_p95_ms": 2.8609,
+          "latency_p99_ms": 3.1268,
+          "consistency": 1.0,
+          "total": 5,
+          "passed": 5,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5655,
+          "ci_upper": 1.0
+        },
+        "edge": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.8834,
+          "latency_p95_ms": 1.6324,
+          "latency_p99_ms": 1.699,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 1.3287,
+          "latency_p95_ms": 2.7777,
+          "latency_p99_ms": 3.1102,
+          "consistency": 1.0,
+          "total": 6,
+          "passed": 6,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.6097,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 1.7156,
+          "latency_p95_ms": 1.7156,
+          "latency_p99_ms": 1.7156,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
+        {
+          "task_id": "sm-001",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "exists=True",
+          "duration_ms": 1.4329,
+          "root_cause": "none",
+          "detail": "path=/var/folders/6b/ljk5bdq50yxcsth24frf05200000gn/T/agentkit-benchmark-dzm9kg48/run-0/specs/sm-001/test-spec.yaml",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "sm-002",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "steps=2",
+          "duration_ms": 1.2244,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "sm-003",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "goal=Updated goal",
+          "duration_ms": 1.5311,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "sm-004",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "deleted=True remaining=0",
+          "duration_ms": 1.1484,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "sm-005",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "count=2",
+          "duration_ms": 3.1933,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "sm-006",
+          "dimension": "spec_management",
+          "category": "edge",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "passed",
+          "actual": "status=confirmed",
+          "duration_ms": 1.7156,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "sm-007",
+          "dimension": "spec_management",
+          "category": "edge",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "result=None",
+          "duration_ms": 0.0512,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        }
+      ]
+    },
+    "verification": {
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "latency_p50_ms": 24.8909,
+        "latency_p95_ms": 411.9118,
+        "latency_p99_ms": 487.0974,
+        "consistency": 1.0,
+        "total": 5,
+        "passed": 5,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.5655,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "basic": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 11.7309,
+          "latency_p95_ms": 11.9356,
+          "latency_p99_ms": 11.9538,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "retry": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 35.984,
+          "latency_p95_ms": 35.984,
+          "latency_p99_ms": 35.984,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "timeout": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 505.8938,
+          "latency_p95_ms": 505.8938,
+          "latency_p99_ms": 505.8938,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "multi": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 24.8909,
+          "latency_p95_ms": 24.8909,
+          "latency_p99_ms": 24.8909,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 11.7309,
+          "latency_p95_ms": 11.9356,
+          "latency_p99_ms": 11.9538,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 35.984,
+          "latency_p95_ms": 458.9028,
+          "latency_p99_ms": 496.4956,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
+        {
+          "task_id": "vf-001",
+          "dimension": "verification",
+          "category": "basic",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "passed=True attempts=1",
+          "duration_ms": 11.5036,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "vf-002",
+          "dimension": "verification",
+          "category": "basic",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "passed=False errors=1",
+          "duration_ms": 11.9583,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "vf-003",
+          "dimension": "verification",
+          "category": "retry",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "passed",
+          "actual": "attempts=3 callbacks=2",
+          "duration_ms": 35.984,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "vf-004",
+          "dimension": "verification",
+          "category": "timeout",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "passed",
+          "actual": "passed=False errors=1",
+          "duration_ms": 505.8938,
+          "root_cause": "none",
+          "detail": "errors=['Command timed out after 0.5s: sleep 10']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "vf-005",
+          "dimension": "verification",
+          "category": "multi",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "passed",
+          "actual": "passed=False",
+          "duration_ms": 24.8909,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        }
+      ]
+    }
+  }
+}
\ No newline at end of file
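Review note: the `ci_lower` / `ci_upper` fields in these reports appear to be the 95% Wilson score interval computed from `passed` and `total`. The sketch below reproduces the reported lower bounds (for example 0.3424 for 2/2 and 0.7225 for 10/10). It is an illustration under assumptions, not the AgentKit implementation: the function name `wilson_interval` is hypothetical, and z = 1.96 is assumed for the 95% level.

```python
import math
from typing import Tuple


def wilson_interval(passed: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (hypothetical helper, z=1.96 assumed)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    z2 = z * z
    denom = 1.0 + z2 / total
    # Center of the interval is pulled toward 0.5 for small samples.
    center = (p + z2 / (2.0 * total)) / denom
    half_width = z * math.sqrt(p * (1.0 - p) / total + z2 / (4.0 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Sample sizes taken from the report entries above, all with passed == total.
    for n in (1, 2, 3, 10, 15):
        lower, upper = wilson_interval(n, n)
        print(f"total={n:2d} ci_lower={lower:.4f} ci_upper={upper:.4f}")
```

With every case passing, the lower bound depends only on `total`, which is why small categories such as `top_k` (1 case) report a wide interval despite an accuracy of 1.0.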
diff --git a/test-results/benchmark/benchmark_report.json b/test-results/benchmark/benchmark_report.json
index c63b01b..a38ea17 100644
--- a/test-results/benchmark/benchmark_report.json
+++ b/test-results/benchmark/benchmark_report.json
@@ -1,472 +1,1569 @@
 {
-  "timestamp": "2026-06-17T03:26:25.072956+00:00",
+  "timestamp": "2026-06-17T04:00:50.738066+00:00",
   "version": "0.1.0",
+  "runs": 3,
+  "fast": false,
+  "overall_accuracy": 1.0,
+  "overall_accuracy_mean": 1.0,
+  "overall_accuracy_std": 0.0,
+  "summary": "All 53 tests passed across 7 dimensions.",
   "dimensions": {
     "preprocessing": {
-      "score": 0.9333,
-      "total": 15,
-      "passed": 14,
-      "failed": 1,
-      "details": [
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "f1": 1.0,
+        "latency_p50_ms": 0.006,
+        "latency_p95_ms": 0.0295,
+        "latency_p99_ms": 0.0569,
+        "consistency": 1.0,
+        "total": 15,
+        "passed": 15,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.7961,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "greeting": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0069,
+          "latency_p95_ms": 0.0111,
+          "latency_p99_ms": 0.0117,
+          "consistency": 1.0,
+          "total": 4,
+          "passed": 4,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5101,
+          "ci_upper": 1.0
+        },
+        "tool_query": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0051,
+          "latency_p95_ms": 0.0052,
+          "latency_p99_ms": 0.0052,
+          "consistency": 1.0,
+          "total": 5,
+          "passed": 5,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5655,
+          "ci_upper": 1.0
+        },
+        "skill_prefix": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0149,
+          "latency_p95_ms": 0.0588,
+          "latency_p99_ms": 0.0627,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        },
+        "complex": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0056,
+          "latency_p95_ms": 0.0074,
+          "latency_p99_ms": 0.0076,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0066,
+          "latency_p95_ms": 0.0109,
+          "latency_p99_ms": 0.0116,
+          "consistency": 1.0,
+          "total": 5,
+          "passed": 5,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5655,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0051,
+          "latency_p95_ms": 0.0132,
+          "latency_p99_ms": 0.0146,
+          "consistency": 1.0,
+          "total": 7,
+          "passed": 7,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.6457,
+          "ci_upper": 1.0
+        },
+        "hard": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0076,
+          "latency_p95_ms": 0.0581,
+          "latency_p99_ms": 0.0626,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
         {
-          "case_id": "greeting_cn",
+          "task_id": "prep-001",
+          "dimension": "preprocessing",
+          "category": "greeting",
+          "difficulty": "easy",
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.03,
-          "detail": "input='你好' method=regex_direct"
+          "duration_ms": 0.0118,
+          "root_cause": "none",
+          "detail": "input='你好' method=regex_direct",
+          "consistency": 1.0
         },
         {
-          "case_id": "greeting_en",
+          "task_id": "prep-002",
+          "dimension": "preprocessing",
+          "category": "greeting",
+          "difficulty": "easy",
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.02,
-          "detail": "input='hello' method=regex_direct"
+          "duration_ms": 0.0071,
+          "root_cause": "none",
+          "detail": "input='hello' method=regex_direct",
+          "consistency": 1.0
         },
         {
-          "case_id": "chitchat_thanks",
+          "task_id": "prep-003",
+          "dimension": "preprocessing",
+          "category": "greeting",
+          "difficulty": "easy",
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.01,
-          "detail": "input='谢谢' method=regex_direct"
+          "duration_ms": 0.0066,
+          "root_cause": "none",
+          "detail": "input='谢谢' method=regex_direct",
+          "consistency": 1.0
         },
         {
-          "case_id": "identity_who",
+          "task_id": "prep-004",
+          "dimension": "preprocessing",
+          "category": "greeting",
+          "difficulty": "easy",
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.02,
-          "detail": "input='你是谁' method=regex_direct"
+          "duration_ms": 0.006,
+          "root_cause": "none",
+          "detail": "input='你是谁' method=regex_direct",
+          "consistency": 1.0
         },
         {
-          "case_id": "colloquial_ip_1",
+          "task_id": "prep-005",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.02,
-          "detail": "input='查下ip' method=default_react"
+          "duration_ms": 0.0052,
+          "root_cause": "none",
+          "detail": "input='搜索golang教程' method=default_react",
+          "consistency": 1.0
         },
         {
-          "case_id": "colloquial_ip_2",
+          "task_id": "prep-006",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.01,
-          "detail": "input='查看当前ip' method=default_react"
+          "duration_ms": 0.0046,
+          "root_cause": "none",
+          "detail": "input='执行ls命令' method=default_react",
+          "consistency": 1.0
         },
         {
-          "case_id": "tool_search",
+          "task_id": "prep-007",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.01,
-          "detail": "input='搜索golang教程' method=default_react"
+          "duration_ms": 0.0051,
+          "root_cause": "none",
+          "detail": "input='翻译hello为中文' method=default_react",
+          "consistency": 1.0
         },
         {
-          "case_id": "tool_shell",
+          "task_id": "prep-008",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.01,
-          "detail": "input='执行ls命令' method=default_react"
+          "duration_ms": 0.0051,
+          "root_cause": "none",
+          "detail": "input='什么是机器学习' method=default_react",
+          "consistency": 1.0
         },
         {
-          "case_id": "translation",
+          "task_id": "prep-009",
+          "dimension": "preprocessing",
+          "category": "tool_query",
+          "difficulty": "medium",
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.01,
-          "detail": "input='翻译hello为中文' method=default_react"
+          "duration_ms": 0.0047,
+          "root_cause": "none",
+          "detail": "input='帮我分析数据' method=default_react",
+          "consistency": 1.0
         },
         {
-          "case_id": "knowledge",
-          "passed": true,
-          "expected": "react",
-          "actual": "react",
-          "duration_ms": 0.01,
-          "detail": "input='什么是机器学习' method=default_react"
-        },
-        {
-          "case_id": "skill_prefix_react",
+          "task_id": "prep-010",
+          "dimension": "preprocessing",
+          "category": "skill_prefix",
+          "difficulty": "medium",
           "passed": true,
           "expected": "skill_react",
           "actual": "skill_react",
-          "duration_ms": 0.03,
-          "detail": "input='@skill:react_agent 查看ip' method=skill_prefix"
+          "duration_ms": 0.0149,
+          "root_cause": "none",
+          "detail": "input='@skill:react_agent 查看ip' method=skill_prefix",
+          "consistency": 1.0
         },
         {
-          "case_id": "skill_prefix_direct",
-          "passed": false,
-          "expected": "skill_react",
+          "task_id": "prep-011",
+          "dimension": "preprocessing",
+          "category": "skill_prefix",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.02,
-          "detail": "input='@skill:chat_only 你好' method=skill_prefix"
+          "duration_ms": 0.0092,
+          "root_cause": "none",
+          "detail": "input='@skill:chat_only 你好' method=skill_prefix",
+          "consistency": 1.0
         },
         {
-          "case_id": "skill_not_found",
+          "task_id": "prep-012",
+          "dimension": "preprocessing",
+          "category": "skill_prefix",
+          "difficulty": "hard",
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.13,
-          "detail": "input='@skill:nonexistent 做点什么' method=skill_not_found_fallback"
+          "duration_ms": 0.0637,
+          "root_cause": "none",
+          "detail": "input='@skill:nonexistent 做点什么' method=skill_not_found_fallback",
+          "consistency": 1.0
         },
         {
-          "case_id": "complex_analysis",
+          "task_id": "prep-013",
+          "dimension": "preprocessing",
+          "category": "complex",
+          "difficulty": "hard",
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.01,
-          "detail": "input='帮我分析一下这个数据并生成报告' method=default_react"
+          "duration_ms": 0.0076,
+          "root_cause": "none",
+          "detail": "input='帮我分析这个数据并生成报告' method=default_react",
+          "consistency": 1.0
         },
         {
-          "case_id": "empty_fallback",
+          "task_id": "prep-014",
+          "dimension": "preprocessing",
+          "category": "complex",
+          "difficulty": "easy",
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.01,
-          "detail": "input='随便聊聊' method=default_react"
+          "duration_ms": 0.0056,
+          "root_cause": "none",
+          "detail": "input='随便聊聊' method=default_react",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "prep-015",
+          "dimension": "preprocessing",
+          "category": "complex",
+          "difficulty": "hard",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0047,
+          "root_cause": "none",
+          "detail": "input='请帮我完成以下任务：1. 查询天气 2. 生成报告' method=default_react",
+          "consistency": 1.0
         }
       ]
     },
     "overfitting": {
-      "score": 1.0,
-      "total": 3,
-      "passed": 3,
-      "failed": 0,
-      "details": [
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 1.0,
+        "recall": 1.0,
+        "f1": 1.0,
+        "latency_p50_ms": 0.0426,
+        "latency_p95_ms": 0.0644,
+        "latency_p99_ms": 0.0675,
+        "consistency": 1.0,
+        "total": 5,
+        "passed": 5,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.5655,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "ip_check": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0426,
+          "latency_p95_ms": 0.0426,
+          "latency_p99_ms": 0.0426,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "search": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0309,
+          "latency_p95_ms": 0.0309,
+          "latency_p99_ms": 0.0309,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "greeting": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.049,
+          "latency_p95_ms": 0.049,
+          "latency_p99_ms": 0.049,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "tool_use": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0252,
+          "latency_p95_ms": 0.0252,
+          "latency_p99_ms": 0.0252,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "complex": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0683,
+          "latency_p95_ms": 0.0683,
+          "latency_p99_ms": 0.0683,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0309,
+          "latency_p95_ms": 0.0414,
+          "latency_p99_ms": 0.0424,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        },
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.049,
+          "latency_p95_ms": 0.049,
+          "latency_p99_ms": 0.049,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "hard": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0683,
+          "latency_p95_ms": 0.0683,
+          "latency_p99_ms": 0.0683,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
         {
-          "case_id": "ip_check_variants",
+          "task_id": "over-001",
+          "dimension": "overfitting",
+          "category": "ip_check",
+          "difficulty": "medium",
           "passed": true,
           "expected": "react",
-          "actual": "react,react,react,react,react",
-          "duration_ms": 0.0,
-          "detail": "paraphrases=5 consistent=True"
+          "actual": "react",
+          "duration_ms": 0.0426,
+          "root_cause": "none",
+          "detail": "paraphrases=5 modes=['react', 'react', 'react', 'react', 'react']",
+          "consistency": 1.0
         },
         {
-          "case_id": "search_variants",
+          "task_id": "over-002",
+          "dimension": "overfitting",
+          "category": "search",
+          "difficulty": "medium",
           "passed": true,
           "expected": "react",
-          "actual": "react,react,react",
-          "duration_ms": 0.0,
-          "detail": "paraphrases=3 consistent=True"
+          "actual": "react",
+          "duration_ms": 0.0309,
+          "root_cause": "none",
+          "detail": "paraphrases=3 modes=['react', 'react', 'react']",
+          "consistency": 1.0
         },
         {
-          "case_id": "greeting_variants",
+          "task_id": "over-003",
+          "dimension": "overfitting",
+          "category": "greeting",
+          "difficulty": "easy",
           "passed": true,
           "expected": "direct_chat",
-          "actual": "direct_chat,direct_chat,direct_chat,direct_chat,direct_chat",
-          "duration_ms": 0.0,
-          "detail": "paraphrases=5 consistent=True"
+          "actual": "direct_chat",
+          "duration_ms": 0.049,
+          "root_cause": "none",
+          "detail": "paraphrases=5 modes=['direct_chat', 'direct_chat', 'direct_chat', 'direct_chat', 'direct_chat']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "over-004",
+          "dimension": "overfitting",
+          "category": "tool_use",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0252,
+          "root_cause": "none",
+          "detail": "paraphrases=3 modes=['react', 'react', 'react']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "over-005",
+          "dimension": "overfitting",
+          "category": "complex",
+          "difficulty": "hard",
+          "passed": true,
+          "expected": "react",
+          "actual": "react",
+          "duration_ms": 0.0683,
+          "root_cause": "none",
+          "detail": "paraphrases=3 modes=['react', 'react', 'react']",
+          "consistency": 1.0
         }
       ]
     },
     "efficiency": {
-      "score": 1.0,
-      "total": 5,
-      "passed": 5,
-      "failed": 0,
-      "details": [
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "latency_p50_ms": 0.4,
+        "latency_p95_ms": 0.768,
+        "latency_p99_ms": 0.8176,
+        "consistency": 1.0,
+        "total": 5,
+        "passed": 5,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.5655,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "preprocess_latency": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.4,
+          "latency_p95_ms": 0.508,
+          "latency_p99_ms": 0.5176,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        },
+        "tool_search_latency": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.44,
+          "latency_p95_ms": 0.791,
+          "latency_p99_ms": 0.8222,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.2,
+          "latency_p95_ms": 0.335,
+          "latency_p99_ms": 0.347,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.52,
+          "latency_p95_ms": 0.799,
+          "latency_p99_ms": 0.8238,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
         {
-          "case_id": "preprocess_greeting",
+          "task_id": "eff-001",
+          "dimension": "efficiency",
+          "category": "preprocess_latency",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "<= 50.0ms/call",
-          "actual": "0.004ms/call",
-          "duration_ms": 0.44,
-          "detail": "iterations=100"
+          "expected": "<=50ms",
+          "actual": "0.004ms",
+          "duration_ms": 0.35,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.004ms threshold=50.0ms",
+          "consistency": 1.0
         },
         {
-          "case_id": "preprocess_react",
+          "task_id": "eff-002",
+          "dimension": "efficiency",
+          "category": "preprocess_latency",
+          "difficulty": "medium",
           "passed": true,
-          "expected": "<= 50.0ms/call",
-          "actual": "0.004ms/call",
-          "duration_ms": 0.38,
-          "detail": "iterations=100"
+          "expected": "<=50ms",
+          "actual": "0.004ms",
+          "duration_ms": 0.4,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.004ms threshold=50.0ms",
+          "consistency": 1.0
         },
         {
-          "case_id": "preprocess_skill_prefix",
+          "task_id": "eff-003",
+          "dimension": "efficiency",
+          "category": "preprocess_latency",
+          "difficulty": "medium",
           "passed": true,
-          "expected": "<= 50.0ms/call",
-          "actual": "0.005ms/call",
-          "duration_ms": 0.51,
-          "detail": "iterations=100"
+          "expected": "<=50ms",
+          "actual": "0.005ms",
+          "duration_ms": 0.52,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.005ms threshold=50.0ms",
+          "consistency": 1.0
         },
         {
-          "case_id": "tool_search_query",
+          "task_id": "eff-004",
+          "dimension": "efficiency",
+          "category": "tool_search_latency",
+          "difficulty": "medium",
           "passed": true,
-          "expected": "<= 10.0ms/call",
-          "actual": "0.008ms/call",
-          "duration_ms": 1.69,
-          "detail": "iterations=200"
+          "expected": "<=10ms",
+          "actual": "0.008ms",
+          "duration_ms": 0.83,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.008ms threshold=10.0ms",
+          "consistency": 1.0
         },
         {
-          "case_id": "tool_search_empty",
+          "task_id": "eff-005",
+          "dimension": "efficiency",
+          "category": "tool_search_latency",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "<= 5.0ms/call",
-          "actual": "0.000ms/call",
-          "duration_ms": 0.08,
-          "detail": "iterations=200"
+          "expected": "<=5ms",
+          "actual": "0.000ms",
+          "duration_ms": 0.05,
+          "root_cause": "none",
+          "detail": "iterations=100 avg=0.000ms threshold=5.0ms",
+          "consistency": 1.0
         }
       ]
     },
     "tool_search": {
-      "score": 1.0,
-      "total": 10,
-      "passed": 10,
-      "failed": 0,
-      "details": [
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.8333,
+        "recall": 0.8333,
+        "f1": 0.8333,
+        "latency_p50_ms": 0.0112,
+        "latency_p95_ms": 0.0153,
+        "latency_p99_ms": 0.0163,
+        "consistency": 1.0,
+        "total": 10,
+        "passed": 10,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.7225,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "exact_match": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0124,
+          "latency_p95_ms": 0.016,
+          "latency_p99_ms": 0.0165,
+          "consistency": 1.0,
+          "total": 5,
+          "passed": 5,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5655,
+          "ci_upper": 1.0
+        },
+        "fuzzy_match": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0108,
+          "latency_p95_ms": 0.0111,
+          "latency_p99_ms": 0.0111,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "no_match": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.0044,
+          "latency_p95_ms": 0.0071,
+          "latency_p99_ms": 0.0073,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "top_k": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0091,
+          "latency_p95_ms": 0.0091,
+          "latency_p99_ms": 0.0091,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.8333,
+          "recall": 0.8333,
+          "f1": 0.8333,
+          "latency_p50_ms": 0.0124,
+          "latency_p95_ms": 0.0158,
+          "latency_p99_ms": 0.0164,
+          "consistency": 1.0,
+          "total": 7,
+          "passed": 7,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.6457,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0105,
+          "latency_p95_ms": 0.011,
+          "latency_p99_ms": 0.0111,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
         {
-          "case_id": "read_file_query",
+          "task_id": "ts-001",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
           "passed": true,
           "expected": "read_file",
           "actual": "read_file",
-          "duration_ms": 0.02,
-          "detail": "query='read file' top_k=5 results=2"
+          "duration_ms": 0.0166,
+          "root_cause": "none",
+          "detail": "query='read file' top_k=5 results=2",
+          "consistency": 1.0
         },
         {
-          "case_id": "write_file_query",
+          "task_id": "ts-002",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
           "passed": true,
           "expected": "write_file",
           "actual": "write_file",
-          "duration_ms": 0.02,
-          "detail": "query='write file content' top_k=5 results=2"
+          "duration_ms": 0.0138,
+          "root_cause": "none",
+          "detail": "query='write file content' top_k=5 results=2",
+          "consistency": 1.0
         },
         {
-          "case_id": "web_search_query",
+          "task_id": "ts-003",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
           "passed": true,
           "expected": "web_search",
           "actual": "web_search",
-          "duration_ms": 0.02,
-          "detail": "query='search web information' top_k=5 results=2"
+          "duration_ms": 0.0124,
+          "root_cause": "none",
+          "detail": "query='search web information' top_k=5 results=2",
+          "consistency": 1.0
         },
         {
-          "case_id": "shell_exec_query",
+          "task_id": "ts-004",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
           "passed": true,
           "expected": "shell_exec",
           "actual": "shell_exec",
-          "duration_ms": 0.02,
-          "detail": "query='execute shell command' top_k=5 results=1"
+          "duration_ms": 0.0113,
+          "root_cause": "none",
+          "detail": "query='execute shell command' top_k=5 results=1",
+          "consistency": 1.0
         },
         {
-          "case_id": "http_request_query",
+          "task_id": "ts-005",
+          "dimension": "tool_search",
+          "category": "exact_match",
+          "difficulty": "easy",
           "passed": true,
           "expected": "http_request",
           "actual": "http_request",
-          "duration_ms": 0.03,
-          "detail": "query='send http request url' top_k=5 results=1"
+          "duration_ms": 0.0124,
+          "root_cause": "none",
+          "detail": "query='send http request url' top_k=5 results=1",
+          "consistency": 1.0
         },
         {
-          "case_id": "file_tag_query",
+          "task_id": "ts-006",
+          "dimension": "tool_search",
+          "category": "fuzzy_match",
+          "difficulty": "medium",
           "passed": true,
           "expected": "read_file",
           "actual": "read_file",
-          "duration_ms": 0.02,
-          "detail": "query='io file' top_k=5 results=2"
+          "duration_ms": 0.0105,
+          "root_cause": "none",
+          "detail": "query='io file' top_k=5 results=2",
+          "consistency": 1.0
         },
         {
-          "case_id": "empty_query",
-          "passed": true,
-          "expected": "__none__",
-          "actual": "[]",
-          "duration_ms": 0.0,
-          "detail": "query='' top_k=5 results=0"
-        },
-        {
-          "case_id": "no_match_query",
-          "passed": true,
-          "expected": "__none__",
-          "actual": "[]",
-          "duration_ms": 0.01,
-          "detail": "query='zzzznonexistent' top_k=5 results=0"
-        },
-        {
-          "case_id": "top_k_limit",
-          "passed": true,
-          "expected": "read_file",
-          "actual": "read_file",
-          "duration_ms": 0.02,
-          "detail": "query='file' top_k=1 results=1"
-        },
-        {
-          "case_id": "multi_token_query",
+          "task_id": "ts-007",
+          "dimension": "tool_search",
+          "category": "fuzzy_match",
+          "difficulty": "medium",
           "passed": true,
           "expected": "web_search",
           "actual": "web_search",
-          "duration_ms": 0.03,
-          "detail": "query='search query engine' top_k=5 results=1"
+          "duration_ms": 0.0111,
+          "root_cause": "none",
+          "detail": "query='search query engine' top_k=5 results=1",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-008",
+          "dimension": "tool_search",
+          "category": "no_match",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "__none__",
+          "actual": "[]",
+          "duration_ms": 0.0015,
+          "root_cause": "none",
+          "detail": "query='' top_k=5 results=0",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-009",
+          "dimension": "tool_search",
+          "category": "no_match",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "__none__",
+          "actual": "[]",
+          "duration_ms": 0.0074,
+          "root_cause": "none",
+          "detail": "query='zzzznonexistent' top_k=5 results=0",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "ts-010",
+          "dimension": "tool_search",
+          "category": "top_k",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "read_file",
+          "actual": "read_file",
+          "duration_ms": 0.0091,
+          "root_cause": "none",
+          "detail": "query='file' top_k=1 results=1",
+          "consistency": 1.0
         }
       ]
     },
     "event_model": {
-      "score": 1.0,
-      "total": 6,
-      "passed": 6,
-      "failed": 0,
-      "details": [
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "latency_p50_ms": 0.0409,
+        "latency_p95_ms": 15.6839,
+        "latency_p99_ms": 19.8446,
+        "consistency": 1.0,
+        "total": 6,
+        "passed": 6,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.6097,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "sq_lifecycle": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.038,
+          "latency_p95_ms": 0.0773,
+          "latency_p99_ms": 0.0808,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        },
+        "eq_lifecycle": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.0438,
+          "latency_p95_ms": 18.8006,
+          "latency_p99_ms": 20.4679,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.0409,
+          "latency_p95_ms": 15.6839,
+          "latency_p99_ms": 19.8446,
+          "consistency": 1.0,
+          "total": 6,
+          "passed": 6,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.6097,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
         {
-          "case_id": "sq_submit_drain",
+          "task_id": "ev-001",
+          "dimension": "event_model",
+          "category": "sq_lifecycle",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "task_id + drained=['hello']",
-          "actual": "task_id=571839fb... drained=['hello']",
-          "duration_ms": 0.1,
-          "detail": ""
+          "expected": "passed",
+          "actual": "drained=['hello']",
+          "duration_ms": 0.0817,
+          "root_cause": "none",
+          "detail": "task_id=b0a1c409...",
+          "consistency": 1.0
         },
         {
-          "case_id": "sq_cancel",
+          "task_id": "ev-002",
+          "dimension": "event_model",
+          "category": "sq_lifecycle",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "cancelled=True",
+          "expected": "passed",
           "actual": "cancelled=True",
-          "duration_ms": 0.04,
-          "detail": ""
+          "duration_ms": 0.038,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "sq_close_blocks",
+          "task_id": "ev-003",
+          "dimension": "event_model",
+          "category": "sq_lifecycle",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "RuntimeError on submit after close",
+          "expected": "passed",
           "actual": "raised=True closed=True",
-          "duration_ms": 0.02,
-          "detail": ""
+          "duration_ms": 0.0091,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "eq_emit_subscribe_replay",
+          "task_id": "ev-004",
+          "dimension": "event_model",
+          "category": "eq_lifecycle",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "1 event replayed",
-          "actual": "1 events",
-          "duration_ms": 0.07,
-          "detail": ""
+          "expected": "passed",
+          "actual": "received=1",
+          "duration_ms": 0.0438,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "eq_close_sentinel",
+          "task_id": "ev-005",
+          "dimension": "event_model",
+          "category": "eq_lifecycle",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "subscriber exits on close",
-          "actual": "1 events, closed=True",
-          "duration_ms": 21.59,
-          "detail": ""
+          "expected": "passed",
+          "actual": "events=1 closed=True",
+          "duration_ms": 20.8847,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "eq_subscriber_count",
+          "task_id": "ev-006",
+          "dimension": "event_model",
+          "category": "eq_lifecycle",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "0 subscribers initially",
-          "actual": "0 subscribers",
-          "duration_ms": 0.01,
-          "detail": ""
+          "expected": "passed",
+          "actual": "subscribers=0",
+          "duration_ms": 0.0045,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         }
       ]
     },
     "spec_management": {
-      "score": 1.0,
-      "total": 7,
-      "passed": 7,
-      "failed": 0,
-      "details": [
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "latency_p50_ms": 1.414,
+        "latency_p95_ms": 3.5951,
+        "latency_p99_ms": 4.0383,
+        "consistency": 1.0,
+        "total": 7,
+        "passed": 7,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.6457,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "crud": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 1.414,
+          "latency_p95_ms": 3.6332,
+          "latency_p99_ms": 4.0459,
+          "consistency": 1.0,
+          "total": 5,
+          "passed": 5,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.5655,
+          "ci_upper": 1.0
+        },
+        "edge": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 1.1783,
+          "latency_p95_ms": 2.1899,
+          "latency_p99_ms": 2.2798,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 1.3787,
+          "latency_p95_ms": 3.5042,
+          "latency_p99_ms": 4.0201,
+          "consistency": 1.0,
+          "total": 6,
+          "passed": 6,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.6097,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 2.3023,
+          "latency_p95_ms": 2.3023,
+          "latency_p99_ms": 2.3023,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
         {
-          "case_id": "spec_create",
+          "task_id": "sm-001",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "file exists on disk",
+          "expected": "passed",
           "actual": "exists=True",
-          "duration_ms": 2.24,
-          "detail": ""
+          "duration_ms": 1.414,
+          "root_cause": "none",
+          "detail": "path=/var/folders/6b/ljk5bdq50yxcsth24frf05200000gn/T/agentkit-benchmark-pz2hpb1l/run-2/specs/sm-001/test-spec.yaml",
+          "consistency": 1.0
         },
         {
-          "case_id": "spec_get",
+          "task_id": "sm-002",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "spec with 2 steps",
+          "expected": "passed",
           "actual": "steps=2",
-          "duration_ms": 0.0,
-          "detail": ""
+          "duration_ms": 1.3435,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "spec_update",
+          "task_id": "sm-003",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "goal='Updated goal'",
+          "expected": "passed",
           "actual": "goal=Updated goal",
-          "duration_ms": 1.75,
-          "detail": ""
+          "duration_ms": 1.5695,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "spec_confirm",
+          "task_id": "sm-004",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "status=confirmed, all steps confirmed",
+          "expected": "passed",
+          "actual": "deleted=True remaining=0",
+          "duration_ms": 1.1556,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "sm-005",
+          "dimension": "spec_management",
+          "category": "crud",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "passed",
+          "actual": "count=2",
+          "duration_ms": 4.1491,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "sm-006",
+          "dimension": "spec_management",
+          "category": "edge",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "passed",
           "actual": "status=confirmed",
-          "duration_ms": 1.86,
-          "detail": ""
+          "duration_ms": 2.3023,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "spec_list",
+          "task_id": "sm-007",
+          "dimension": "spec_management",
+          "category": "edge",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "2 specs",
-          "actual": "2 specs",
-          "duration_ms": 4.92,
-          "detail": ""
-        },
-        {
-          "case_id": "spec_delete",
-          "passed": true,
-          "expected": "deleted, 1 remaining",
-          "actual": "deleted=True, remaining=1",
-          "duration_ms": 1.94,
-          "detail": ""
-        },
-        {
-          "case_id": "spec_get_missing",
-          "passed": true,
-          "expected": "None",
-          "actual": "None",
-          "duration_ms": 0.06,
-          "detail": ""
+          "expected": "passed",
+          "actual": "result=None",
+          "duration_ms": 0.0544,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         }
       ]
     },
     "verification": {
-      "score": 1.0,
-      "total": 5,
-      "passed": 5,
-      "failed": 0,
-      "details": [
+      "metrics": {
+        "accuracy": 1.0,
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "latency_p50_ms": 25.4393,
+        "latency_p95_ms": 413.4245,
+        "latency_p99_ms": 488.3185,
+        "consistency": 1.0,
+        "total": 5,
+        "passed": 5,
+        "failed": 0,
+        "accuracy_mean": 1.0,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.5655,
+        "ci_upper": 1.0
+      },
+      "by_category": {
+        "basic": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 12.9474,
+          "latency_p95_ms": 13.0775,
+          "latency_p99_ms": 13.0891,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "retry": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 38.9547,
+          "latency_p95_ms": 38.9547,
+          "latency_p99_ms": 38.9547,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "timeout": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 507.042,
+          "latency_p95_ms": 507.042,
+          "latency_p99_ms": 507.042,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "multi": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 25.4393,
+          "latency_p95_ms": 25.4393,
+          "latency_p99_ms": 25.4393,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 12.9474,
+          "latency_p95_ms": 13.0775,
+          "latency_p99_ms": 13.0891,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 38.9547,
+          "latency_p95_ms": 460.2333,
+          "latency_p99_ms": 497.6803,
+          "consistency": 1.0,
+          "total": 3,
+          "passed": 3,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.4385,
+          "ci_upper": 1.0
+        }
+      },
+      "cases": [
         {
-          "case_id": "verify_pass",
+          "task_id": "vf-001",
+          "dimension": "verification",
+          "category": "basic",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "passed=True, attempts=1",
-          "actual": "passed=True, attempts=1",
-          "duration_ms": 11.82,
-          "detail": ""
+          "expected": "passed",
+          "actual": "passed=True attempts=1",
+          "duration_ms": 13.092,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "verify_fail",
+          "task_id": "vf-002",
+          "dimension": "verification",
+          "category": "basic",
+          "difficulty": "easy",
           "passed": true,
-          "expected": "passed=False, has errors",
-          "actual": "passed=False, errors=1",
-          "duration_ms": 9.8,
-          "detail": ""
+          "expected": "passed",
+          "actual": "passed=False errors=1",
+          "duration_ms": 12.8029,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "verify_retry",
+          "task_id": "vf-003",
+          "dimension": "verification",
+          "category": "retry",
+          "difficulty": "medium",
           "passed": true,
-          "expected": "attempts=3, fix_callback called 2x",
-          "actual": "attempts=3, callbacks=2",
-          "duration_ms": 33.87,
-          "detail": ""
+          "expected": "passed",
+          "actual": "attempts=3 callbacks=2",
+          "duration_ms": 38.9547,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         },
         {
-          "case_id": "verify_timeout",
+          "task_id": "vf-004",
+          "dimension": "verification",
+          "category": "timeout",
+          "difficulty": "medium",
           "passed": true,
-          "expected": "timeout error",
-          "actual": "passed=False, errors=1",
-          "duration_ms": 506.8,
-          "detail": ""
+          "expected": "passed",
+          "actual": "passed=False errors=1",
+          "duration_ms": 507.042,
+          "root_cause": "none",
+          "detail": "errors=['Command timed out after 0.5s: sleep 10']",
+          "consistency": 1.0
         },
         {
-          "case_id": "verify_multi_command",
+          "task_id": "vf-005",
+          "dimension": "verification",
+          "category": "multi",
+          "difficulty": "medium",
           "passed": true,
-          "expected": "overall fail, output has both commands",
+          "expected": "passed",
           "actual": "passed=False",
-          "duration_ms": 23.12,
-          "detail": ""
+          "duration_ms": 25.4393,
+          "root_cause": "none",
+          "detail": "",
+          "consistency": 1.0
         }
       ]
     }
   },
-  "overall_score": 0.9804,
-  "summary": "50/51 tests passed (1 failed) across 7 dimensions."
+  "baseline_comparison": {
+    "status": "compared",
+    "dimensions": {
+      "preprocessing": {
+        "baseline_accuracy": 1.0,
+        "current_accuracy": 1.0,
+        "change": 0.0,
+        "direction": "—"
+      },
+      "overfitting": {
+        "baseline_accuracy": 1.0,
+        "current_accuracy": 1.0,
+        "change": 0.0,
+        "direction": "—"
+      },
+      "efficiency": {
+        "baseline_accuracy": 1.0,
+        "current_accuracy": 1.0,
+        "change": 0.0,
+        "direction": "—"
+      },
+      "tool_search": {
+        "baseline_accuracy": 1.0,
+        "current_accuracy": 1.0,
+        "change": 0.0,
+        "direction": "—"
+      },
+      "event_model": {
+        "baseline_accuracy": 1.0,
+        "current_accuracy": 1.0,
+        "change": 0.0,
+        "direction": "—"
+      },
+      "spec_management": {
+        "baseline_accuracy": 1.0,
+        "current_accuracy": 1.0,
+        "change": 0.0,
+        "direction": "—"
+      },
+      "verification": {
+        "baseline_accuracy": 1.0,
+        "current_accuracy": 1.0,
+        "change": 0.0,
+        "direction": "—"
+      }
+    }
+  }
 }
\ No newline at end of file
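
Reviewer note: the `ci_lower` values in the JSON above match a 95% Wilson score interval computed per group from `passed` and `total` (7/7 gives 0.6457, 5/5 gives 0.5655, 1/1 gives 0.2065). The sketch below is an independent reconstruction for checking those numbers, not the code in `agentkit.cli.benchmark`; the function name `wilson_interval` and the choice of z = 1.96 are assumptions.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (hypothetical helper, z=1.96 assumed)."""
    if total <= 0:
        return (0.0, 0.0)
    p = passed / total
    z2 = z * z
    denom = 1.0 + z2 / total
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return (max(0.0, center - half_width), min(1.0, center + half_width))


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 6, 7):
        lower, upper = wilson_interval(n, n)
        print(f"{n}/{n}: [{lower:.4f}, {upper:.4f}]")
```

With only 1 to 7 cases per group, the lower bound stays far below 100% even when every case passes, so these intervals mostly reflect sample size rather than quality.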
diff --git a/test-results/benchmark/benchmark_report.md b/test-results/benchmark/benchmark_report.md
new file mode 100644
index 0000000..87c6399
--- /dev/null
+++ b/test-results/benchmark/benchmark_report.md
@@ -0,0 +1,246 @@
+# AgentKit Capability Benchmark Report
+
+## Test Summary
+- Time: 2026-06-17T04:00:50.738066+00:00
+- Version: 0.1.0
+- Runs: 3
+- Overall accuracy: 100.0% ± 0.0%
+
+## Comparison with Industry Benchmarks
+
+| Benchmark | What it tests | AgentKit counterpart |
+|---|---|---|
+| SWE-bench | LLM code repair | N/A (tests the LLM, not the framework) |
+| ToolBench | Tool invocation | tool_search dimension |
+| AgentBench | Agent systems | All dimensions |
+
+## Results by Dimension
+
+### 1. Preprocessing Accuracy
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [79.6%, 100.0%] |
+| Precision | 100.0% |
+| Recall | 100.0% |
+| F1 | 100.0% |
+| Latency p50 | 0.01ms |
+| Latency p95 | 0.03ms |
+| Latency p99 | 0.06ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 15 / 15 / 0 |
+
+#### By Category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| greeting | 4 | 4 | 100.0% |
+| tool_query | 5 | 5 | 100.0% |
+| skill_prefix | 3 | 3 | 100.0% |
+| complex | 3 | 3 | 100.0% |
+
+#### By Difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 5 | 5 | 100.0% |
+| medium | 7 | 7 | 100.0% |
+| hard | 3 | 3 | 100.0% |
+
+### 2. Overfitting Detection
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [56.5%, 100.0%] |
+| Precision | 100.0% |
+| Recall | 100.0% |
+| F1 | 100.0% |
+| Latency p50 | 0.04ms |
+| Latency p95 | 0.06ms |
+| Latency p99 | 0.07ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 5 / 0 |
+
+#### By Category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| ip_check | 1 | 1 | 100.0% |
+| search | 1 | 1 | 100.0% |
+| greeting | 1 | 1 | 100.0% |
+| tool_use | 1 | 1 | 100.0% |
+| complex | 1 | 1 | 100.0% |
+
+#### By Difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| medium | 3 | 3 | 100.0% |
+| easy | 1 | 1 | 100.0% |
+| hard | 1 | 1 | 100.0% |
+
+### 3. Efficiency
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [56.5%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 0.40ms |
+| Latency p95 | 0.77ms |
+| Latency p99 | 0.82ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 5 / 0 |
+
+#### By Category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| preprocess_latency | 3 | 3 | 100.0% |
+| tool_search_latency | 2 | 2 | 100.0% |
+
+#### By Difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 2 | 2 | 100.0% |
+| medium | 3 | 3 | 100.0% |
+
+### 4. Tool Search
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [72.2%, 100.0%] |
+| Precision | 83.3% |
+| Recall | 83.3% |
+| F1 | 83.3% |
+| Latency p50 | 0.01ms |
+| Latency p95 | 0.02ms |
+| Latency p99 | 0.02ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 10 / 10 / 0 |
+
+#### By Category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| exact_match | 5 | 5 | 100.0% |
+| fuzzy_match | 2 | 2 | 100.0% |
+| no_match | 2 | 2 | 100.0% |
+| top_k | 1 | 1 | 100.0% |
+
+#### By Difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 7 | 7 | 100.0% |
+| medium | 3 | 3 | 100.0% |
+
+### 5. Event Model
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [61.0%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 0.04ms |
+| Latency p95 | 15.68ms |
+| Latency p99 | 19.84ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 6 / 6 / 0 |
+
+#### By Category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| sq_lifecycle | 3 | 3 | 100.0% |
+| eq_lifecycle | 3 | 3 | 100.0% |
+
+#### By Difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 6 | 6 | 100.0% |
+
+### 6. Spec Management
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [64.6%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 1.41ms |
+| Latency p95 | 3.60ms |
+| Latency p99 | 4.04ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 7 / 7 / 0 |
+
+#### By Category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| crud | 5 | 5 | 100.0% |
+| edge | 2 | 2 | 100.0% |
+
+#### By Difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 6 | 6 | 100.0% |
+| medium | 1 | 1 | 100.0% |
+
+### 7. Verification Loop
+
+| Metric | Value |
+|---|---|
+| Accuracy | 100.0% ± 0.0% |
+| 95% CI | [56.5%, 100.0%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 25.44ms |
+| Latency p95 | 413.42ms |
+| Latency p99 | 488.32ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 5 / 0 |
+
+#### By Category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| basic | 2 | 2 | 100.0% |
+| retry | 1 | 1 | 100.0% |
+| timeout | 1 | 1 | 100.0% |
+| multi | 1 | 1 | 100.0% |
+
+#### By Difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 2 | 2 | 100.0% |
+| medium | 3 | 3 | 100.0% |
+
+## Baseline Comparison
+
+| Dimension | Baseline accuracy | Current accuracy | Change |
+|---|---|---|---|
+| preprocessing | 100.0% | 100.0% | — |
+| overfitting | 100.0% | 100.0% | — |
+| efficiency | 100.0% | 100.0% | — |
+| tool_search | 100.0% | 100.0% | — |
+| event_model | 100.0% | 100.0% | — |
+| spec_management | 100.0% | 100.0% | — |
+| verification | 100.0% | 100.0% | — |
+
+## Issues and Recommendations
+
+- **verification**: P95 latency of 413.42ms is high; consider optimizing performance
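
Reviewer note: the verification latency figures in this report can be reproduced from the five per-case `duration_ms` values in the JSON report using linear interpolation between order statistics. The sketch below is a reconstruction under that assumption, and `percentile` is a hypothetical helper, not the function used by the benchmark. It shows that the P95 value is driven almost entirely by `vf-004`, the case that waits for a 0.5s command timeout on purpose.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Percentile with linear interpolation between closest ranks (assumed method)."""
    if not values:
        raise ValueError("percentile() requires at least one value")
    ordered = sorted(values)
    position = (q / 100.0) * (len(ordered) - 1)
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


# duration_ms for vf-001 .. vf-005, copied from the JSON report in this diff
verification_ms = [13.092, 12.8029, 38.9547, 507.042, 25.4393]

if __name__ == "__main__":
    for q in (50, 95, 99):
        print(f"p{q}: {percentile(verification_ms, q):.2f}ms")
    # Same percentiles with the intentional timeout case (vf-004) excluded
    without_timeout = [v for v in verification_ms if v != 507.042]
    print(f"p95 without vf-004: {percentile(without_timeout, 95):.2f}ms")
```

Because the timeout case is expected to take about 500ms, a latency recommendation for this dimension is more useful if such cases are excluded or given their own threshold.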
diff --git a/tests/e2e/test_capability_comprehensive.py b/tests/e2e/test_capability_comprehensive.py
index 672fb58..ff1c0b7 100644
--- a/tests/e2e/test_capability_comprehensive.py
+++ b/tests/e2e/test_capability_comprehensive.py
@@ -1517,3 +1517,95 @@ class TestComprehensiveReport:
         total_score = json_report["total_score"]
         print(f"\n总体评分: {total_score:.1f}%")
         assert total_score >= 80.0, f"Total score {total_score:.1f}% is below 80% threshold"
+
+
+# ═══════════════════════════════════════════════════════════════════════════
+# 10. Standard Benchmark Framework Integration
+# ═══════════════════════════════════════════════════════════════════════════
+
+
+@pytest.mark.e2e_capability
+class TestStandardBenchmarkIntegration:
+    """Tests for the standard benchmark framework integration."""
+
+    def test_benchmark_task_creation(self) -> None:
+        """BenchmarkTask can be constructed with all of its fields."""
+        from agentkit.cli.benchmark import BenchmarkTask
+
+        task = BenchmarkTask(
+            task_id="test-001",
+            dimension="preprocessing",
+            category="greeting",
+            difficulty="easy",
+            input="你好",
+            expected="direct_chat",
+            tags=["regex", "chinese"],
+            description="测试用例",
+            paraphrases=[],
+        )
+        assert task.task_id == "test-001"
+        assert task.dimension == "preprocessing"
+
+    def test_metric_set_prf(self) -> None:
+        """MetricSet stores the precision/recall/F1 values it is given."""
+        from agentkit.cli.benchmark import MetricSet
+
+        m = MetricSet(
+            accuracy=0.9,
+            precision=0.95,
+            recall=0.85,
+            f1=0.90,
+            latency_p50_ms=1.0,
+            latency_p95_ms=2.0,
+            latency_p99_ms=3.0,
+            consistency=1.0,
+            total=100,
+            passed=90,
+            failed=10,
+        )
+        assert m.f1 == 0.90
+        assert m.precision == 0.95
+
+    def test_benchmark_runs_successfully(self) -> None:
+        """benchmark() runs to completion in fast mode."""
+        from agentkit.cli.benchmark import BenchmarkDimension, benchmark
+
+        # Fast mode, no report generation, no terminal output.
+        # This only verifies that no exception is raised.
+        try:
+            benchmark(
+                dimension=BenchmarkDimension.ALL,
+                report=False,
+                fast=True,
+                verbose=False,
+                runs=1,
+                output_dir="test-results/benchmark",
+                format="json",
+            )
+        except SystemExit:
+            pass  # benchmark may exit via typer.Exit
+
+    def test_report_generation(self, tmp_path: Path) -> None:
+        """Report files are written to the output directory."""
+        import os
+
+        from agentkit.cli.benchmark import BenchmarkDimension, benchmark
+
+        out_dir = str(tmp_path / "benchmark")
+        try:
+            benchmark(
+                dimension=BenchmarkDimension.ALL,
+                report=True,
+                fast=True,
+                verbose=False,
+                runs=1,
+                output_dir=out_dir,
+                format="markdown",
+            )
+        except SystemExit:
+            pass
+        # Verify that the report files were generated
+        json_path = os.path.join(out_dir, "benchmark_report.json")
+        md_path = os.path.join(out_dir, "benchmark_report.md")
+        assert os.path.exists(json_path), f"JSON report not found: {json_path}"
+        assert os.path.exists(md_path), f"Markdown report not found: {md_path}"
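
Reviewer note: `test_metric_set_prf` only checks that `MetricSet` returns the values passed to it, so it does not exercise any F1 computation. For reference, the harmonic-mean definition gives about 0.897 for precision 0.95 and recall 0.85, slightly below the 0.90 used in the fixture. The sketch below uses a standalone hypothetical helper, `f1_score`, and makes no assumption about how `MetricSet` derives its fields.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values taken from the MetricSet fixture in test_metric_set_prf
    print(round(f1_score(0.95, 0.85), 4))
```

A test that compares a computed F1 against this definition, within a small tolerance, would cover the calculation that the docstring mentions.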