diff --git a/configs/skills/benchmark_runner.yaml b/configs/skills/benchmark_runner.yaml
index f3805df..159ccbf 100644
--- a/configs/skills/benchmark_runner.yaml
+++ b/configs/skills/benchmark_runner.yaml
@@ -36,7 +36,9 @@ prompt:
   identity: "你是 AgentKit 能力回测助手,负责运行各维度能力测试并生成评估报告。"
   instructions: |
     ## 职责
-    根据用户需求运行 AgentKit 能力回测,生成综合评估报告。
+    根据用户需求运行 AgentKit 能力回测,生成标准化评估报告。
+    采用行业 Benchmark 方法论(SWE-bench / AgentBench / ToolBench 风格),
+    提供 Accuracy / Precision / Recall / F1 / Latency / Consistency 等完整指标。
 
     ## 可用命令
 
@@ -44,13 +46,14 @@ prompt:
     ```bash
     python3 -m agentkit.cli.main benchmark --report --verbose
     ```
-    运行所有 7 个维度共 51 个测试用例,生成 JSON + TXT 报告。
+    运行所有 7 个维度共 53 个标准化测试用例,生成 JSON + Markdown 报告。
+    默认运行 3 次取均值 ± 标准差,附带 95% Wilson 置信区间。
 
     ### 快速回测
     ```bash
     python3 -m agentkit.cli.main benchmark --fast --report
     ```
-    运行核心用例(约 23 个),适合开发时快速验证。
+    运行核心用例(约 22 个),适合开发时快速验证。
 
     ### 单维度回测
     ```bash
@@ -58,16 +61,42 @@ prompt:
     ```
     可选维度:preprocessing, overfitting, efficiency, tool_search, event_model, spec_management, verification
 
+    ### 多次运行取均值(--runs)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --runs 5 --report
+    ```
+    指定运行次数(默认 3),计算 accuracy_mean ± accuracy_std 和 95% 置信区间。
+    适用于稳定性评估和回归检测。
+
+    ### 基线对比(--baseline)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --baseline --report
+    ```
+    首次运行自动创建基线(baseline.json),后续运行与基线对比,显示 ↑/↓ 变化趋势。
+    适用于 CI/CD 回归监控。
+
+    ### Markdown 报告(默认)
+    ```bash
+    python3 -m agentkit.cli.main benchmark --report --format markdown
+    ```
+    生成人类可读的 Markdown 报告,包含指标表格、失败用例分析、改进建议。
+
     ### HTML 报告
     ```bash
     python3 -m agentkit.cli.main benchmark --report --format html
     ```
 
+    ### JSON 报告
+    ```bash
+    python3 -m agentkit.cli.main benchmark --report --format json
+    ```
+    仅生成 JSON 报告,适合机器解析和 CI 集成。
+
     ### pytest 综合回测
     ```bash
-    python3 -m pytest tests/e2e/test_capability_comprehensive.py -v
+    python3 -m pytest tests/e2e/test_capability_comprehensive.py -v -m e2e_capability
     ```
-    运行 60 个测试(8 维度),生成 comprehensive_report。
+    运行 64 个测试(10 维度,含标准 Benchmark 框架集成测试),生成 comprehensive_report。
 
     ### 指定输出目录
     ```bash
@@ -75,24 +104,37 @@ prompt:
     ```
 
     ## 测试维度说明
+    每个维度均提供以下标准化指标:
+    - **Accuracy** — 准确率(通过率)
+    - **Precision** — 精确率(macro-averaged,多分类)
+    - **Recall** — 召回率(macro-averaged,多分类)
+    - **F1** — F1 分数(Precision 与 Recall 的调和平均)
+    - **Latency p50/p95/p99** — 延迟分位数(毫秒)
+    - **Consistency** — 一致性(过拟合检测,改写输入的稳定性)
+    - **95% CI** — Wilson 置信区间(多次运行时)
+
+    维度清单:
     1. **preprocessing** — 预处理准确度:greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT
-    2. **overfitting** — 过拟合检测:同一意图不同表达的一致性
-    3. **efficiency** — 执行效率:预处理延迟 < 50ms, 工具搜索延迟 < 10ms
-    4. **tool_search** — 工具搜索准确度:BM25 相关性排序
+    2. **overfitting** — 过拟合检测:同一意图不同表达的一致性(Consistency 指标)
+    3. **efficiency** — 执行效率:预处理延迟 < 50ms, 工具搜索延迟 < 10ms(Latency 指标)
+    4. **tool_search** — 工具搜索准确度:BM25 相关性排序(P/R/F1 指标)
     5. **event_model** — 事件模型完整性:SQ/EQ 双队列生命周期
     6. **spec_management** — Spec 管理:CRUD 操作
     7. **verification** — 验证循环:verify/retry 行为
 
     ## 报告位置
-    - CLI 报告:`test-results/benchmark/benchmark_report.{json,txt,html}`
+    - CLI 报告:`test-results/benchmark/benchmark_report.{json,md,html}`
+    - 基线文件:`test-results/benchmark/baseline.json`(使用 --baseline 时生成)
     - pytest 报告:`test-results/e2e/comprehensive_report.{json,txt}`
 
     ## 输出要求
     1. 运行测试命令
-    2. 读取生成的报告文件
-    3. 向用户展示结果摘要表格
-    4. 如有失败用例,分析原因并给出改进建议
-    5. 对比历史报告(如存在),展示趋势变化
+    2. 读取生成的报告文件(JSON + Markdown)
+    3. 向用户展示结果摘要表格,包含各维度的 Accuracy / P / R / F1 / Latency
+    4. 如有失败用例,分析根因(wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded)
+    5. 对比基线报告(如使用 --baseline),展示各维度准确率的 ↑/↓ 变化趋势
+    6. 关注关键指标:P95 延迟 > 100ms 需提示性能问题,Consistency < 100% 需提示过拟合风险
+    7. 给出针对性改进建议,基于指标数据而非主观判断
 
 llm:
   model: "default"
diff --git a/src/agentkit/cli/benchmark.py b/src/agentkit/cli/benchmark.py
index 45e7dd7..b52e257 100644
--- a/src/agentkit/cli/benchmark.py
+++ b/src/agentkit/cli/benchmark.py
@@ -1,4 +1,12 @@
-"""Benchmark CLI command — run capability backtests and generate reports.
+"""Benchmark CLI command — standardized capability benchmarking.
+
+Implements industry-standard benchmark methodology (SWE-bench / AgentBench / ToolBench):
+- Standardized TaskSet with dimension/category/difficulty metadata
+- Full metrics: Accuracy / Precision / Recall / F1 / Latency p50,p95,p99 / Consistency
+- Multiple runs with mean ± std and 95% Wilson confidence interval
+- Failure root-cause classification (wrong_mode / wrong_tool / timeout / exception / ...)
+- Markdown + JSON + HTML report generation
+- Baseline comparison (↑/↓)
 
 Tests core AgentKit components directly (no pytest subprocess, no real LLM):
 - preprocessing: RequestPreprocessor routing accuracy
@@ -11,24 +19,30 @@ Tests core AgentKit components directly (no pytest subprocess, no real LLM):
 
 Usage:
     agentkit benchmark # run all dimensions
-    agentkit benchmark --dimension preprocessing
-    agentkit benchmark --report # JSON + TXT report
-    agentkit benchmark --report --format html # + HTML report
-    agentkit benchmark --output-dir ./my-results
+    agentkit benchmark -d preprocessing # single dimension
+    agentkit benchmark --report # generate reports
     agentkit benchmark --fast # core cases only
     agentkit benchmark --verbose # detailed output
+    agentkit benchmark --format html # HTML format
+    agentkit benchmark -o ./results # output directory
+    agentkit benchmark --runs 3 # multiple runs (default 3)
+    agentkit benchmark --baseline # compare with baseline
+    agentkit benchmark --format markdown # Markdown report (default)
 """
 
 from __future__ import annotations
 
 import asyncio
 import json
+import math
+import re
 import time
+from collections.abc import Awaitable, Callable
 from dataclasses import asdict, dataclass, field
 from datetime import datetime, timezone
 from enum import Enum
 from pathlib import Path
-from typing import Any
+from typing import TYPE_CHECKING
 
 import typer
 from rich.console import Console
@@ -42,6 +56,10 @@ from rich.progress import (
 )
 from rich.table import Table
 
+if TYPE_CHECKING:
+    from agentkit.chat.request_preprocessor import RequestPreprocessor
+    from agentkit.tools.search import ToolSearchIndex
+
 console = Console()
 
 _DEFAULT_OUTPUT_DIR = "test-results/benchmark"
@@ -61,20 +79,88 @@ class BenchmarkDimension(str, Enum):
 
 
 # ---------------------------------------------------------------------------
-# Result data structures
+# Data structures
 # ---------------------------------------------------------------------------
 
 
 @dataclass
-class TestCaseResult:
-    """Single test case result."""
+class BenchmarkTask:
+    """Standardized benchmark task definition.
 
-    case_id: str
+    Attributes:
+        task_id: Unique identifier (e.g. "prep-001").
+        dimension: Test dimension (preprocessing/overfitting/...).
+        category: Sub-category (greeting/tool_query/skill_prefix/...).
+        difficulty: easy / medium / hard.
+        input: Test input string.
+        expected: Expected output (execution mode, tool name, "passed", or threshold).
+        tags: Tag list for filtering (e.g. "regex", "bm25", "fallback").
+        description: Human-readable description.
+        paraphrases: Paraphrase list for overfitting detection.
+    """
+
+    task_id: str
+    dimension: str
+    category: str
+    difficulty: str
+    input: str
+    expected: str
+    tags: list[str]
+    description: str
+    paraphrases: list[str] = field(default_factory=list)
+
+
+@dataclass
+class ExecutionResult:
+    """Raw execution result from a single task invocation."""
+
+    actual: str
+    passed: bool
+    duration_ms: float
+    detail: str = ""
+    consistency: float = 1.0
+
+
+@dataclass
+class CaseResult:
+    """A single test case result with metadata."""
+
+    task_id: str
+    dimension: str
+    category: str
+    difficulty: str
     passed: bool
     expected: str
     actual: str
     duration_ms: float
+    root_cause: str = "none"
     detail: str = ""
+    consistency: float = 1.0
+
+
+@dataclass
+class MetricSet:
+    """Aggregated metrics for a group of cases.
+
+    Includes Accuracy / Precision / Recall / F1, latency percentiles,
+    consistency (overfitting), and multi-run statistics with 95% CI.
+    """
+
+    accuracy: float
+    precision: float
+    recall: float
+    f1: float
+    latency_p50_ms: float
+    latency_p95_ms: float
+    latency_p99_ms: float
+    consistency: float
+    total: int
+    passed: int
+    failed: int
+    accuracy_mean: float = 0.0
+    accuracy_std: float = 0.0
+    ci_lower: float = 0.0
+    ci_upper: float = 0.0
 
 
 @dataclass
@@ -82,40 +168,605 @@ class DimensionResult:
     """Aggregated result for one dimension."""
 
     dimension: str
-    total: int = 0
-    passed: int = 0
-    failed: int = 0
-    details: list[TestCaseResult] = field(default_factory=list)
+    metrics: MetricSet
+    cases: list[CaseResult]
+    by_category: dict[str, MetricSet]
+    by_difficulty: dict[str, MetricSet]
 
-    @property
-    def score(self) -> float:
-        return self.passed / self.total if self.total > 0 else 0.0
 
-    def add(self, case: TestCaseResult) -> None:
-        self.total += 1
-        if case.passed:
-            self.passed += 1
-        else:
-            self.failed += 1
-        self.details.append(case)
+@dataclass
+class BenchmarkContext:
+    """Shared context for benchmark execution."""
 
-    def to_dict(self) -> dict[str, Any]:
-        return {
-            "score": round(self.score, 4),
-            "total": self.total,
-            "passed": self.passed,
-            "failed": self.failed,
-            "details": [asdict(d) for d in self.details],
-        }
+    preprocessor: object  # RequestPreprocessor
+    search_index: object  # ToolSearchIndex
+    tmp_dir: Path
 
 
 # ---------------------------------------------------------------------------
-# Helpers — mock objects
+# Standardized TaskSet
 # ---------------------------------------------------------------------------
 
 
-def _make_mock_skill_registry():
-    """Build a SkillRegistry with a couple of mock skills for preprocessing tests."""
+TASK_SET: list[BenchmarkTask] = [
+    # === Preprocessing (15 tasks) ===
+    BenchmarkTask(
+        "prep-001",
+        "preprocessing",
+        "greeting",
+        "easy",
+        "你好",
+        "direct_chat",
+        ["regex", "chinese"],
+        "中文问候应路由到 DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-002",
+        "preprocessing",
+        "greeting",
+        "easy",
+        "hello",
+        "direct_chat",
+        ["regex", "english"],
+        "英文问候应路由到 DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-003",
+        "preprocessing",
+        "greeting",
+        "easy",
+        "谢谢",
+        "direct_chat",
+        ["regex", "chitchat"],
+        "感谢语应路由到 DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-004",
+        "preprocessing",
+        "greeting",
+        "easy",
+        "你是谁",
+        "direct_chat",
+        ["regex", "identity"],
+        "身份询问应路由到 DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-005",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "搜索golang教程",
+        "react",
+        ["search", "default"],
+        "搜索类请求应路由到 REACT",
+    ),
+    BenchmarkTask(
+        "prep-006",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "执行ls命令",
+        "react",
+        ["shell", "default"],
+        "Shell 执行类请求应路由到 REACT",
+    ),
+    BenchmarkTask(
+        "prep-007",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "翻译hello为中文",
+        "react",
+        ["translate", "default"],
+        "翻译类请求应路由到 REACT",
+    ),
+    BenchmarkTask(
+        "prep-008",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "什么是机器学习",
+        "react",
+        ["knowledge", "default"],
+        "知识查询类请求应路由到 REACT",
+    ),
+    BenchmarkTask(
+        "prep-009",
+        "preprocessing",
+        "tool_query",
+        "medium",
+        "帮我分析数据",
+        "react",
+        ["analysis", "default"],
+        "分析类请求应路由到 REACT",
+    ),
+    BenchmarkTask(
+        "prep-010",
+        "preprocessing",
+        "skill_prefix",
+        "medium",
+        "@skill:react_agent 查看ip",
+        "skill_react",
+        ["skill", "react"],
+        "有效 skill 前缀应路由到 SKILL_REACT",
+    ),
+    BenchmarkTask(
+        "prep-011",
+        "preprocessing",
+        "skill_prefix",
+        "medium",
+        "@skill:chat_only 你好",
+        "direct_chat",
+        ["skill", "direct"],
+        "direct 模式 skill 前缀应路由到 DIRECT_CHAT",
+    ),
+    BenchmarkTask(
+        "prep-012",
+        "preprocessing",
+        "skill_prefix",
+        "hard",
+        "@skill:nonexistent 做点什么",
+        "react",
+        ["skill", "fallback"],
+        "无效 skill 前缀应回退到 REACT",
+    ),
+    BenchmarkTask(
+        "prep-013",
+        "preprocessing",
+        "complex",
+        "hard",
+        "帮我分析这个数据并生成报告",
+        "react",
+        ["multi_step"],
+        "多步骤复杂任务应路由到 REACT",
+    ),
+    BenchmarkTask(
+        "prep-014",
+        "preprocessing",
+        "complex",
+        "easy",
+        "随便聊聊",
+        "react",
+        ["chitchat", "default"],
+        "非匹配闲聊应回退到 REACT",
+    ),
+    BenchmarkTask(
+        "prep-015",
+        "preprocessing",
+        "complex",
+        "hard",
+        "请帮我完成以下任务:1. 查询天气 2. 生成报告",
+        "react",
+        ["multi_step"],
+        "多步骤任务应路由到 REACT",
+    ),
+    # === Overfitting (5 groups) ===
+    BenchmarkTask(
+        "over-001",
+        "overfitting",
+        "ip_check",
+        "medium",
+        "查下ip",
+        "react",
+        ["colloquial"],
+        "IP 查询改写一致性",
+        paraphrases=["查下ip", "查看当前ip", "获取ip地址", "看下ip", "帮我查一下ip"],
+    ),
+    BenchmarkTask(
+        "over-002",
+        "overfitting",
+        "search",
+        "medium",
+        "搜索golang教程",
+        "react",
+        ["search"],
+        "搜索改写一致性",
+        paraphrases=["搜索golang教程", "搜一下golang教程", "找下golang学习资料"],
+    ),
+    BenchmarkTask(
+        "over-003",
+        "overfitting",
+        "greeting",
+        "easy",
+        "你好",
+        "direct_chat",
+        ["greeting"],
+        "问候改写一致性",
+        paraphrases=["你好", "hello", "hi", "嗨", "哈喽"],
+    ),
+    BenchmarkTask(
+        "over-004",
+        "overfitting",
+        "tool_use",
+        "medium",
+        "执行ls命令",
+        "react",
+        ["shell"],
+        "工具使用改写一致性",
+        paraphrases=["执行ls命令", "运行ls", "跑一下ls"],
+    ),
+    BenchmarkTask(
+        "over-005",
+        "overfitting",
+        "complex",
+        "hard",
+        "帮我分析数据",
+        "react",
+        ["analysis"],
+        "复杂任务改写一致性",
+        paraphrases=["帮我分析数据", "分析一下数据", "看看这些数据"],
+    ),
+    # === Efficiency (5 tasks) ===
+    BenchmarkTask(
+        "eff-001",
+        "efficiency",
+        "preprocess_latency",
+        "easy",
+        "你好",
+        "<=50ms",
+        ["greeting", "preprocess"],
+        "问候预处理延迟 < 50ms",
+    ),
+    BenchmarkTask(
+        "eff-002",
+        "efficiency",
+        "preprocess_latency",
+        "medium",
+        "查下ip",
+        "<=50ms",
+        ["react", "preprocess"],
+        "REACT 预处理延迟 < 50ms",
+    ),
+    BenchmarkTask(
+        "eff-003",
+        "efficiency",
+        "preprocess_latency",
+        "medium",
+        "@skill:react_agent test",
+        "<=50ms",
+        ["skill", "preprocess"],
+        "Skill 前缀预处理延迟 < 50ms",
+    ),
+    BenchmarkTask(
+        "eff-004",
+        "efficiency",
+        "tool_search_latency",
+        "medium",
+        "read file",
+        "<=10ms",
+        ["tool_search", "bm25"],
+        "工具搜索延迟 < 10ms",
+    ),
+    BenchmarkTask(
+        "eff-005",
+        "efficiency",
+        "tool_search_latency",
+        "easy",
+        "",
+        "<=5ms",
+        ["tool_search", "empty"],
+        "空查询工具搜索延迟 < 5ms",
+    ),
+    # === Tool Search (10 tasks) ===
+    BenchmarkTask(
+        "ts-001",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "read file",
+        "read_file",
+        ["bm25", "exact"],
+        "精确匹配 read_file",
+    ),
+    BenchmarkTask(
+        "ts-002",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "write file content",
+        "write_file",
+        ["bm25", "exact"],
+        "精确匹配 write_file",
+    ),
+    BenchmarkTask(
+        "ts-003",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "search web information",
+        "web_search",
+        ["bm25", "exact"],
+        "精确匹配 web_search",
+    ),
+    BenchmarkTask(
+        "ts-004",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "execute shell command",
+        "shell_exec",
+        ["bm25", "exact"],
+        "精确匹配 shell_exec",
+    ),
+    BenchmarkTask(
+        "ts-005",
+        "tool_search",
+        "exact_match",
+        "easy",
+        "send http request url",
+        "http_request",
+        ["bm25", "exact"],
+        "精确匹配 http_request",
+    ),
+    BenchmarkTask(
+        "ts-006",
+        "tool_search",
+        "fuzzy_match",
+        "medium",
+        "io file",
+        "read_file",
+        ["bm25", "fuzzy", "tag"],
+        "标签模糊匹配 io file",
+    ),
+    BenchmarkTask(
+        "ts-007",
+        "tool_search",
+        "fuzzy_match",
+        "medium",
+        "search query engine",
+        "web_search",
+        ["bm25", "fuzzy", "multi"],
+        "多关键词模糊匹配",
+    ),
+    BenchmarkTask(
+        "ts-008",
+        "tool_search",
+        "no_match",
+        "easy",
+        "",
+        "__none__",
+        ["bm25", "empty"],
+        "空查询应返回空结果",
+    ),
+    BenchmarkTask(
+        "ts-009",
+        "tool_search",
+        "no_match",
+        "easy",
+        "zzzznonexistent",
+        "__none__",
+        ["bm25", "no_match"],
+        "无匹配查询应返回空结果",
+    ),
+    BenchmarkTask(
+        "ts-010",
+        "tool_search",
+        "top_k",
+        "medium",
+        "file",
+        "read_file",
+        ["bm25", "top_k"],
+        "top_k=1 限制返回数",
+    ),
+    # === Event Model (6 tasks) ===
+    BenchmarkTask(
+        "ev-001",
+        "event_model",
+        "sq_lifecycle",
+        "easy",
+        "submit+drain",
+        "passed",
+        ["sq", "submit"],
+        "SQ 提交并消费",
+    ),
+    BenchmarkTask(
+        "ev-002",
+        "event_model",
+        "sq_lifecycle",
+        "easy",
+        "cancel",
+        "passed",
+        ["sq", "cancel"],
+        "SQ 取消任务",
+    ),
+    BenchmarkTask(
+        "ev-003",
+        "event_model",
+        "sq_lifecycle",
+        "easy",
+        "close",
+        "passed",
+        ["sq", "close"],
+        "SQ 关闭后拒绝提交",
+    ),
+    BenchmarkTask(
+        "ev-004",
+        "event_model",
+        "eq_lifecycle",
+        "easy",
+        "emit+replay",
+        "passed",
+        ["eq", "replay"],
+        "EQ 发射并回放",
+    ),
+    BenchmarkTask(
+        "ev-005",
+        "event_model",
+        "eq_lifecycle",
+        "easy",
+        "close",
+        "passed",
+        ["eq", "close"],
+        "EQ 关闭哨兵退出",
+    ),
+    BenchmarkTask(
+        "ev-006",
+        "event_model",
+        "eq_lifecycle",
+        "easy",
+        "subscriber_count",
+        "passed",
+        ["eq", "count"],
+        "EQ 初始订阅者计数",
+    ),
+    # === Spec Management (7 tasks) ===
+    BenchmarkTask(
+        "sm-001",
+        "spec_management",
+        "crud",
+        "easy",
+        "create",
+        "passed",
+        ["create"],
+        "Spec 创建",
+    ),
+    BenchmarkTask(
+        "sm-002",
+        "spec_management",
+        "crud",
+        "easy",
+        "get",
+        "passed",
+        ["read"],
+        "Spec 读取",
+    ),
+    BenchmarkTask(
+        "sm-003",
+        "spec_management",
+        "crud",
+        "easy",
+        "update",
+        "passed",
+        ["update"],
+        "Spec 更新",
+    ),
+    BenchmarkTask(
+        "sm-004",
+        "spec_management",
+        "crud",
+        "easy",
+        "delete",
+        "passed",
+        ["delete"],
+        "Spec 删除",
+    ),
+    BenchmarkTask(
+        "sm-005",
+        "spec_management",
+        "crud",
+        "easy",
+        "list",
+        "passed",
+        ["list"],
+        "Spec 列表",
+    ),
+    BenchmarkTask(
+        "sm-006",
+        "spec_management",
+        "edge",
+        "medium",
+        "confirm",
+        "passed",
+        ["confirm"],
+        "Spec 确认",
+    ),
+    BenchmarkTask(
+        "sm-007",
+        "spec_management",
+        "edge",
+        "easy",
+        "missing",
+        "passed",
+        ["missing"],
+        "Spec 不存在返回 None",
+    ),
+    # === Verification (5 tasks) ===
+    BenchmarkTask(
+        "vf-001",
+        "verification",
+        "basic",
+        "easy",
+        "pass",
+        "passed",
+        ["pass"],
+        "验证通过命令",
+    ),
+    BenchmarkTask(
+        "vf-002",
+        "verification",
+        "basic",
+        "easy",
+        "fail",
+        "passed",
+        ["fail"],
+        "验证失败命令",
+    ),
+    BenchmarkTask(
+        "vf-003",
+        "verification",
+        "retry",
+        "medium",
+        "fix_callback",
+        "passed",
+        ["retry", "callback"],
+        "重试与修复回调",
+    ),
+    BenchmarkTask(
+        "vf-004",
+        "verification",
+        "timeout",
+        "medium",
+        "timeout",
+        "passed",
+        ["timeout"],
+        "超时检测",
+    ),
+    BenchmarkTask(
+        "vf-005",
+        "verification",
+        "multi",
+        "medium",
+        "multi_command",
+        "passed",
+        ["multi"],
+        "多命令验证",
+    ),
+]
+
+
+_FAST_CORE_IDS: set[str] = {
+    "prep-001",
+    "prep-005",
+    "prep-010",
+    "prep-012",
+    "over-001",
+    "over-003",
+    "eff-001",
+    "eff-004",
+    "ts-001",
+    "ts-003",
+    "ts-008",
+    "ts-010",
+    "ev-001",
+    "ev-004",
+    "ev-005",
+    "sm-001",
+    "sm-002",
+    "sm-006",
+    "sm-004",
+    "vf-001",
+    "vf-002",
+    "vf-003",
+}
+
+
+# ---------------------------------------------------------------------------
+# Mock helpers
+# ---------------------------------------------------------------------------
+
+
+def _make_mock_skill_registry() -> object:
+    """Build a SkillRegistry with mock skills for preprocessing tests."""
     from agentkit.skills.base import Skill, SkillConfig
     from agentkit.skills.registry import SkillRegistry
 
@@ -142,7 +793,7 @@ def _make_mock_skill_registry():
     return registry
 
 
-def _make_mock_tools():
+def _make_mock_tools() -> list[object]:
     """Build a list of mock Tool instances for tool_search tests."""
     from agentkit.tools.base import Tool
 
@@ -151,7 +802,7 @@ def _make_mock_tools():
             self,
             name: str,
             description: str,
-            input_schema: dict[str, Any] | None = None,
+            input_schema: dict[str, object] | None = None,
             tags: list[str] | None = None,
         ):
             super().__init__(
@@ -161,7 +812,7 @@ def _make_mock_tools():
                 tags=tags or [],
             )
 
-        async def execute(self, **kwargs) -> dict:
+        async def execute(self, **kwargs: object) -> dict[str, object]:
             return {"status": "ok"}
 
     return [
@@ -224,144 +875,8 @@ def _make_mock_tools():
     ]
 
 
-# ---------------------------------------------------------------------------
-# Dimension test runners
-# ---------------------------------------------------------------------------
-
-
-async def _run_preprocessing(fast: bool, verbose: bool) -> DimensionResult:
-    """Test RequestPreprocessor routing accuracy."""
-    from agentkit.chat.request_preprocessor import RequestPreprocessor
-
-    registry = _make_mock_skill_registry()
-    preprocessor = RequestPreprocessor(skill_registry=registry)
-
-    cases: list[dict[str, str]] = [
-        {"id": "greeting_cn", "input": "你好", "expected": "direct_chat"},
-        {"id": "greeting_en", "input": "hello", "expected": "direct_chat"},
-        {"id": "chitchat_thanks", "input": "谢谢", "expected": "direct_chat"},
-        {"id": "identity_who", "input": "你是谁", "expected": "direct_chat"},
-        {"id": "colloquial_ip_1", "input": "查下ip", "expected": "react"},
-        {"id": "colloquial_ip_2", "input": "查看当前ip", "expected": "react"},
-        {"id": "tool_search", "input": "搜索golang教程", "expected": "react"},
-        {"id": "tool_shell", "input": "执行ls命令", "expected": "react"},
-        {"id": "translation", "input": "翻译hello为中文", "expected": "react"},
-        {"id": "knowledge", "input": "什么是机器学习", "expected": "react"},
-        {"id": "skill_prefix_react", "input": "@skill:react_agent 查看ip", "expected": "skill_react"},
-        {"id": "skill_prefix_direct", "input": "@skill:chat_only 你好", "expected": "skill_react"},
-        {"id": "skill_not_found", "input": "@skill:nonexistent 做点什么", "expected": "react"},
-        {"id": "complex_analysis", "input": "帮我分析一下这个数据并生成报告", "expected": "react"},
-        {"id": "empty_fallback", "input": "随便聊聊", "expected": "react"},
-    ]
-
-    if fast:
-        # Core cases only: greetings, tool queries, skill prefix
-        fast_ids = {
-            "greeting_cn",
-            "colloquial_ip_1",
-            "tool_search",
-            "skill_prefix_react",
-            "skill_not_found",
-        }
-        cases = [c for c in cases if c["id"] in fast_ids]
-
-    result = DimensionResult(dimension="preprocessing")
-
-    for case in cases:
-        start = time.perf_counter()
-        routing = await preprocessor.preprocess(content=case["input"])
-        elapsed_ms = (time.perf_counter() - start) * 1000
-
-        actual = routing.execution_mode.value
-        passed = actual == case["expected"]
-
-        result.add(
-            TestCaseResult(
-                case_id=case["id"],
-                passed=passed,
-                expected=case["expected"],
-                actual=actual,
-                duration_ms=round(elapsed_ms, 2),
-                detail=f"input={case['input']!r} method={routing.match_method}",
-            )
-        )
-
-        if verbose and not passed:
-            console.print(
-                f" [red]✗[/red] {case['id']}: expected={case['expected']} "
-                f"actual={actual} ({routing.match_method})"
-            )
-        elif verbose:
-            console.print(f" [green]✓[/green] {case['id']}: {actual} ({elapsed_ms:.1f}ms)")
-
-    return result
-
-
-async def _run_overfitting(fast: bool, verbose: bool) -> DimensionResult:
-    """Test routing consistency across paraphrases (overfitting detection).
-
-    Same intent expressed differently should route to the same execution mode.
-    """
-    from agentkit.chat.request_preprocessor import RequestPreprocessor
-
-    registry = _make_mock_skill_registry()
-    preprocessor = RequestPreprocessor(skill_registry=registry)
-
-    paraphrase_groups: list[dict[str, Any]] = [
-        {
-            "id": "ip_check_variants",
-            "paraphrases": ["查下ip", "查看当前ip", "获取ip地址", "看下ip", "帮我查一下ip"],
-            "expected": "react",
-        },
-        {
-            "id": "search_variants",
-            "paraphrases": ["搜索golang教程", "搜一下golang教程", "找下golang学习资料"],
-            "expected": "react",
-        },
-        {
-            "id": "greeting_variants",
-            "paraphrases": ["你好", "hello", "hi", "嗨", "哈喽"],
-            "expected": "direct_chat",
-        },
-    ]
-
-    if fast:
-        paraphrase_groups = paraphrase_groups[:2]
-
-    result = DimensionResult(dimension="overfitting")
-
-    for group in paraphrase_groups:
-        modes: list[str] = []
-        for text in group["paraphrases"]:
-            routing = await preprocessor.preprocess(content=text)
-            modes.append(routing.execution_mode.value)
-
-        # All paraphrases should produce the same mode
-        unique_modes = set(modes)
-        consistent = len(unique_modes) == 1
-        expected_mode = group["expected"]
-        correct = consistent and modes[0] == expected_mode if modes else False
-
-        result.add(
-            TestCaseResult(
-                case_id=group["id"],
-                passed=correct,
-                expected=expected_mode,
-                actual=",".join(modes),
-                duration_ms=0.0,
-                detail=f"paraphrases={len(group['paraphrases'])} consistent={consistent}",
-            )
-        )
-
-        if verbose:
-            status = "[green]✓[/green]" if correct else "[red]✗[/red]"
-            console.print(f" {status} {group['id']}: modes={modes}")
-
-    return result
-
-
-async def _run_efficiency(fast: bool, verbose: bool) -> DimensionResult:
-    """Test component execution efficiency (timing bounds)."""
+def _make_context(tmp_dir: Path) -> BenchmarkContext:
+    """Create a benchmark context with mock components."""
     from agentkit.chat.request_preprocessor import RequestPreprocessor
     from agentkit.tools.search import ToolSearchIndex
 
@@ -370,744 +885,1051 @@ async def _run_efficiency(fast: bool, verbose: bool) -> DimensionResult:
     tools = _make_mock_tools()
     search_index = ToolSearchIndex(tools)
 
-    # Thresholds in milliseconds (generous — these are pure-Python ops)
-    thresholds: list[dict[str, Any]] = [
-        {
-            "id": "preprocess_greeting",
-            "func": lambda: preprocessor.preprocess(content="你好"),
-            "max_ms": 50.0,
-            "iterations": 100,
-        },
-        {
-            "id": "preprocess_react",
-            "func": lambda: preprocessor.preprocess(content="查下ip"),
-            "max_ms": 50.0,
-            "iterations": 100,
-        },
-        {
-            "id": "preprocess_skill_prefix",
-            "func": lambda: preprocessor.preprocess(content="@skill:react_agent test"),
-            "max_ms": 50.0,
-            "iterations": 100,
-        },
-        {
-            "id": "tool_search_query",
-            "func": None,  # handled specially (sync)
-            "max_ms": 10.0,
-            "iterations": 200,
-        },
-        {
-            "id": "tool_search_empty",
-            "func": None,
-            "max_ms": 5.0,
-            "iterations": 200,
-        },
-    ]
-
-    if fast:
-        thresholds = [t for t in thresholds if t["id"] in {
-            "preprocess_greeting", "tool_search_query"
-        }]
-
-    result = DimensionResult(dimension="efficiency")
-
-    for spec in thresholds:
-        start = time.perf_counter()
-        if spec["func"] is not None:
-            for _ in range(spec["iterations"]):
-                await spec["func"]()
-        else:
-            query = "read file" if "query" in spec["id"] else ""
-            for _ in range(spec["iterations"]):
-                search_index.search(query, top_k=5)
-        total_ms = (time.perf_counter() - start) * 1000
-        avg_ms = total_ms / spec["iterations"]
-
-        passed = avg_ms <= spec["max_ms"]
-        result.add(
-            TestCaseResult(
-                case_id=spec["id"],
-                passed=passed,
-                expected=f"<= {spec['max_ms']}ms/call",
-                actual=f"{avg_ms:.3f}ms/call",
-                duration_ms=round(total_ms, 2),
-                detail=f"iterations={spec['iterations']}",
-            )
-        )
-
-        if verbose:
-            status = "[green]✓[/green]" if passed else "[red]✗[/red]"
-            console.print(
-                f" {status} {spec['id']}: {avg_ms:.3f}ms/call "
-                f"(threshold {spec['max_ms']}ms)"
-            )
-
-    return result
+    return BenchmarkContext(
+        preprocessor=preprocessor,
+        search_index=search_index,
+        tmp_dir=tmp_dir,
+    )
 
 
-async def _run_tool_search(fast: bool, verbose: bool) -> DimensionResult:
-    """Test ToolSearchIndex BM25 relevance ranking."""
-    from agentkit.tools.search import ToolSearchIndex
+# ---------------------------------------------------------------------------
+# Utility functions
+# ---------------------------------------------------------------------------
 
-    tools = _make_mock_tools()
-    index = ToolSearchIndex(tools)
 
-    cases: list[dict[str, Any]] = [
-        {"id": "read_file_query", "query": "read file", "expected_top": "read_file"},
-        {"id": "write_file_query", "query": "write file content", "expected_top": "write_file"},
-        {"id": "web_search_query", "query": "search web information", "expected_top": "web_search"},
-        {"id": "shell_exec_query", "query": "execute shell command", "expected_top": "shell_exec"},
-        {"id": "http_request_query", "query": "send http request url", "expected_top": "http_request"},
-        {"id": "file_tag_query", "query": "io file", "expected_top": "read_file"},
-        {"id": "empty_query", "query": "", "expected_top": "__none__"},
-        {"id": "no_match_query", "query": "zzzznonexistent", "expected_top": "__none__"},
-        {"id": "top_k_limit", "query": "file", "expected_top": "read_file", "top_k": 1},
-        {"id": "multi_token_query", "query": "search query engine", "expected_top": "web_search"},
-    ]
+def _wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
+    """Compute 95% Wilson confidence interval for a proportion."""
+    if total == 0:
+        return (0.0, 0.0)
+    p = successes / total
+    denom = 1.0 + z * z / total
+    center = (p + z * z / (2 * total)) / denom
+    spread = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
+    return (max(0.0, center - spread), min(1.0, center + spread))
 
-    if fast:
-        fast_ids = {"read_file_query", "web_search_query", "empty_query", "top_k_limit"}
-        cases = [c for c in cases if c["id"] in fast_ids]
 
-    result = DimensionResult(dimension="tool_search")
+def _percentile(sorted_values: list[float], p: float) -> float:
+    """Compute percentile from a sorted list."""
+    if not sorted_values:
+        return 0.0
+    if len(sorted_values) == 1:
+        return sorted_values[0]
+    k = (len(sorted_values) - 1) * p / 100.0
+    f = math.floor(k)
+    c = math.ceil(k)
+    if f == c:
+        return sorted_values[int(k)]
+    d0 = sorted_values[int(f)] * (c - k)
+    d1 = sorted_values[int(c)] * (k - f)
+    return d0 + d1
+
 
+def _std(values: list[float]) -> float:
+    """Compute population standard deviation."""
+    if len(values) < 2:
+        return 0.0
+    mean = sum(values) / len(values)
+    variance = sum((v - mean) ** 2 for v in values) / len(values)
+    return math.sqrt(variance)
+
+
+def _parse_threshold(expected: str) -> float:
+    """Parse threshold from string like '<=50ms' -> 50.0."""
+    match = re.match(r"<=\s*([\d.]+)\s*ms", expected)
+    if match:
+        return float(match.group(1))
+    return float("inf")
+
+
+# ---------------------------------------------------------------------------
+# Metrics computation
+# ---------------------------------------------------------------------------
+
+
+def _compute_metrics(
+    cases: list[CaseResult],
+    accuracies: list[float] | None = None,
+) -> MetricSet:
+    """Compute full metric set from a list of cases."""
+    total = len(cases)
+    passed = sum(1 for c in cases if c.passed)
+    failed = total - passed
+    accuracy = passed / total if total > 0 else 0.0
+
+    # Multi-class macro-averaged Precision / Recall / F1
+    expected_classes: set[str] = {c.expected for c in cases}
+    precisions: list[float] = []
+    recalls: list[float] = []
+    f1s: list[float] = []
+    for cls in expected_classes:
+        tp = sum(1 for c in cases if c.expected == cls and c.actual == cls)
+        fp = sum(1 for c in cases if c.expected != cls and c.actual == cls)
+        fn = sum(1 for c in cases if c.expected == cls and c.actual != cls)
+        p = tp / (tp + fp) if (tp + fp) > 0 else 0.0
+        r = tp / (tp + fn) if (tp + fn) > 0 else 0.0
+        f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
+        precisions.append(p)
+        recalls.append(r)
+        f1s.append(f1)
+
+    precision = sum(precisions) / len(precisions) if precisions else 0.0
+    recall = sum(recalls) / len(recalls) if recalls else 0.0
+    f1 = sum(f1s) / len(f1s) if f1s else 0.0
+
+    # Latency percentiles
+    latencies = sorted(c.duration_ms for c in cases)
+    p50 = _percentile(latencies, 50)
+    p95 = _percentile(latencies, 95)
+    p99 = _percentile(latencies, 99)
+
+    # Consistency (overfitting detection)
+    consistency = sum(c.consistency for c in cases) / total if total > 0 else 0.0
+
+    # Multi-run statistics
+    if accuracies and len(accuracies) > 0:
+        accuracy_mean = sum(accuracies) / len(accuracies)
+        accuracy_std = _std(accuracies)
+    else:
+        accuracy_mean = accuracy
+        accuracy_std = 0.0
+
+    # Wilson 95% CI
+    ci_lower, ci_upper = _wilson_interval(passed, total)
+
+    return MetricSet(
+        accuracy=round(accuracy, 4),
+        precision=round(precision, 4),
+        recall=round(recall, 4),
+        f1=round(f1, 4),
+        latency_p50_ms=round(p50, 4),
+        latency_p95_ms=round(p95, 4),
+        latency_p99_ms=round(p99, 4),
+        consistency=round(consistency, 4),
+        total=total,
+        passed=passed,
+        failed=failed,
+        accuracy_mean=round(accuracy_mean, 4),
+        accuracy_std=round(accuracy_std, 4),
+        ci_lower=round(ci_lower, 4),
+        ci_upper=round(ci_upper, 4),
+    )
+
+
+def _aggregate_by(cases: list[CaseResult], key: str) -> dict[str, MetricSet]:
+    """Aggregate cases by a field name (category or difficulty)."""
+    groups: dict[str, list[CaseResult]] = {}
     for case in cases:
-        start = time.perf_counter()
-        top_k = case.get("top_k", 5)
-        found = index.search(case["query"], top_k=top_k)
-        elapsed_ms = (time.perf_counter() - start) * 1000
+        k = getattr(case, key)
+        groups.setdefault(k, []).append(case)
+    return {k: _compute_metrics(v) for k, v in groups.items()}
 
-        if case["expected_top"] == "__none__":
-            passed = len(found) == 0
-            actual = "[]" if passed else found[0].name
-        else:
-            actual = found[0].name if found else "__empty__"
-            passed = actual == case["expected_top"]
 
-        result.add(
-            TestCaseResult(
-                case_id=case["id"],
-                passed=passed,
-                expected=case["expected_top"],
-                actual=actual,
-                duration_ms=round(elapsed_ms, 2),
-                detail=f"query={case['query']!r} top_k={top_k} results={len(found)}",
-            )
+def _classify_root_cause(task: BenchmarkTask, result: ExecutionResult) -> str:
+    """Classify the root cause of a failure."""
+    if result.passed:
+        return "none"
+    detail_lower = result.detail.lower()
+    actual_lower = result.actual.lower()
+    if "__exception__" in result.actual or "exception" in detail_lower:
+        return "exception"
+    if "timeout" in detail_lower or "timed out" in actual_lower:
+        return "timeout"
+    if task.dimension == "preprocessing":
+        return "wrong_mode"
+    if task.dimension == "tool_search":
+        return "wrong_tool"
+    if task.dimension == "overfitting":
+        return "inconsistent"
+    if task.dimension == "efficiency":
+        return "latency_exceeded"
+    return "assertion"
+
+
+# ---------------------------------------------------------------------------
+# Task executors
+# ---------------------------------------------------------------------------
+
+
+async def _exec_preprocessing(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute preprocessing benchmark task."""
+    preprocessor: RequestPreprocessor = ctx.preprocessor  # type: ignore[assignment]
+    start = time.perf_counter()
+    routing = await preprocessor.preprocess(content=task.input)
+    elapsed = (time.perf_counter() - start) * 1000
+    actual = routing.execution_mode.value
+    passed = actual == task.expected
+    return ExecutionResult(
+        actual=actual,
+        passed=passed,
+        duration_ms=round(elapsed, 4),
+        detail=f"input={task.input!r} method={routing.match_method}",
+    )
+
+
+async def _exec_overfitting(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute overfitting benchmark task (paraphrase consistency)."""
+    preprocessor: RequestPreprocessor = ctx.preprocessor  # type: ignore[assignment]
+    start = time.perf_counter()
+    modes: list[str] = []
+    for text in task.paraphrases:
+        routing = await preprocessor.preprocess(content=text)
+        modes.append(routing.execution_mode.value)
+    elapsed = (time.perf_counter() - start) * 1000
+
+    unique_modes = set(modes)
+    consistent = len(unique_modes) == 1
+    actual = modes[0] if consistent else "inconsistent"
+    passed = consistent and actual == task.expected
+
+    return ExecutionResult(
+        actual=actual,
+        passed=passed,
+        duration_ms=round(elapsed, 4),
+        detail=f"paraphrases={len(task.paraphrases)} modes={modes}",
+        consistency=1.0 if consistent else 0.0,
+    )
+
+
+async def _exec_efficiency(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute efficiency benchmark task (latency threshold)."""
+    threshold = _parse_threshold(task.expected)
+    iterations = 100
+
+    preprocessor: RequestPreprocessor = ctx.preprocessor  # type: ignore[assignment]
+    search_index: ToolSearchIndex = ctx.search_index  # type: ignore[assignment]
+
+    start = time.perf_counter()
+    if task.category == "preprocess_latency":
+        for _ in range(iterations):
+            await preprocessor.preprocess(content=task.input)
+    elif task.category == "tool_search_latency":
+        for _ in range(iterations):
+            search_index.search(task.input, top_k=5)
+    else:
+        return ExecutionResult(
+            actual="unknown_category",
+            passed=False,
+            duration_ms=0.0,
+            detail=f"Unknown efficiency category: {task.category}",
         )
+    total_ms = (time.perf_counter() - start) * 1000
+    avg_ms = total_ms / iterations
 
-        if verbose:
-            status = "[green]✓[/green]" if passed else "[red]✗[/red]"
-            console.print(f" {status} {case['id']}: top={actual} ({elapsed_ms:.2f}ms)")
-
-    return result
+    passed = avg_ms <= threshold
+    return ExecutionResult(
+        actual=f"{avg_ms:.3f}ms",
+        passed=passed,
+        duration_ms=round(total_ms, 2),
+        detail=f"iterations={iterations} avg={avg_ms:.3f}ms threshold={threshold}ms",
+    )
 
 
-async def _run_event_model(fast: bool, verbose: bool) -> DimensionResult:
-    """Test SubmissionQueue / EventQueue lifecycle."""
+async def _exec_tool_search(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute tool search benchmark task."""
+    search_index: ToolSearchIndex = ctx.search_index  # type: ignore[assignment]
+    top_k = 1 if "top_k" in task.tags else 5
+
+    start = time.perf_counter()
+    found = search_index.search(task.input, top_k=top_k)
+    elapsed = (time.perf_counter() - start) * 1000
+
+    if task.expected == "__none__":
+        passed = len(found) == 0
+        actual = "[]" if passed else (found[0].name if found else "[]")
+    else:
+        actual = found[0].name if found else "__empty__"
+        passed = actual == task.expected
+
+    return ExecutionResult(
+        actual=actual,
+        passed=passed,
+        duration_ms=round(elapsed, 4),
+        detail=f"query={task.input!r} top_k={top_k} results={len(found)}",
+    )
+
+
+async def _exec_event_model(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult:
+    """Execute event model benchmark task."""
     from agentkit.core.event_queue import EventQueue, SubmissionQueue
     from agentkit.core.protocol import Event
 
-    result = DimensionResult(dimension="event_model")
-
-    # --- SubmissionQueue tests ---
-    sq = SubmissionQueue()
-
-    # Test 1: submit and drain
     start = time.perf_counter()
-    task_id = await sq.submit("hello", "session-1")
-    drained: list[str] = []
-    async for submission in sq.drain():
-        drained.append(submission.content)
-        break  # only drain one to avoid blocking
-    elapsed_ms = (time.perf_counter() - start) * 1000
-    passed = task_id != "" and drained == ["hello"]
-    result.add(
-        TestCaseResult(
-            case_id="sq_submit_drain",
+
+    if task.task_id == "ev-001":  # SQ submit + drain
+        sq = SubmissionQueue()
+        task_id = await sq.submit("hello", "session-1")
+        drained: list[str] = []
+        async for sub in sq.drain():
+
drained.append(sub.content) + break + elapsed = (time.perf_counter() - start) * 1000 + passed = task_id != "" and drained == ["hello"] + return ExecutionResult( + actual=f"drained={drained}", passed=passed, - expected="task_id + drained=['hello']", - actual=f"task_id={task_id[:8]}... drained={drained}", - duration_ms=round(elapsed_ms, 2), + duration_ms=round(elapsed, 4), + detail=f"task_id={task_id[:8]}...", ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} sq_submit_drain") - # Test 2: cancel - start = time.perf_counter() - cancel_id = await sq.submit("to-cancel", "session-2") - cancelled = await sq.cancel(cancel_id) - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = cancelled and sq._submissions[cancel_id].cancelled - result.add( - TestCaseResult( - case_id="sq_cancel", - passed=passed, - expected="cancelled=True", + if task.task_id == "ev-002": # SQ cancel + sq = SubmissionQueue() + cancel_id = await sq.submit("to-cancel", "session-2") + cancelled = await sq.cancel(cancel_id) + elapsed = (time.perf_counter() - start) * 1000 + passed = bool(cancelled and sq._submissions[cancel_id].cancelled) + return ExecutionResult( actual=f"cancelled={cancelled}", - duration_ms=round(elapsed_ms, 2), - ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} sq_cancel") - - # Test 3: close blocks new submissions - start = time.perf_counter() - sq2 = SubmissionQueue() - sq2.close() - raised = False - try: - await sq2.submit("after-close", "session-3") - except RuntimeError: - raised = True - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = raised and sq2.is_closed - result.add( - TestCaseResult( - case_id="sq_close_blocks", passed=passed, - expected="RuntimeError on submit after close", - actual=f"raised={raised} closed={sq2.is_closed}", - duration_ms=round(elapsed_ms, 2), + duration_ms=round(elapsed, 4), ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else 
'[red]✗[/red]'} sq_close_blocks") - # --- EventQueue tests --- - eq = EventQueue(buffer_size=10) - - # Test 4: emit and subscribe with replay - start = time.perf_counter() - test_event = Event( - event_type="test_event", - task_id="task-1", - session_id="session-1", - data={"msg": "hello"}, - timestamp=datetime.now(timezone.utc).isoformat(), - ) - await eq.emit(test_event) - - received: list[Event] = [] - # Subscribe and collect one event (replay) - async for event in eq.subscribe(): - received.append(event) - break - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = len(received) == 1 and received[0].event_type == "test_event" - result.add( - TestCaseResult( - case_id="eq_emit_subscribe_replay", + if task.task_id == "ev-003": # SQ close blocks + sq = SubmissionQueue() + sq.close() + raised = False + try: + await sq.submit("after-close", "session-3") + except RuntimeError: + raised = True + elapsed = (time.perf_counter() - start) * 1000 + passed = raised and sq.is_closed + return ExecutionResult( + actual=f"raised={raised} closed={sq.is_closed}", passed=passed, - expected="1 event replayed", - actual=f"{len(received)} events", - duration_ms=round(elapsed_ms, 2), + duration_ms=round(elapsed, 4), ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} eq_emit_subscribe_replay") - # Test 5: close sends sentinel - start = time.perf_counter() - eq2 = EventQueue() - - async def _consume_all() -> list[Event]: - events: list[Event] = [] - async for ev in eq2.subscribe(): - events.append(ev) - return events - - # Start consumer, emit, then close - consumer_task = asyncio.create_task(_consume_all()) - await asyncio.sleep(0.01) # let subscriber register - await eq2.emit(test_event) - await asyncio.sleep(0.01) - eq2.close() - events = await asyncio.wait_for(consumer_task, timeout=2.0) - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = len(events) >= 1 and eq2.is_closed - result.add( - TestCaseResult( - 
case_id="eq_close_sentinel", + if task.task_id == "ev-004": # EQ emit + replay + eq = EventQueue(buffer_size=10) + test_event = Event( + event_type="test_event", + task_id="task-1", + session_id="session-1", + data={"msg": "hello"}, + timestamp=datetime.now(timezone.utc).isoformat(), + ) + await eq.emit(test_event) + received: list[Event] = [] + async for event in eq.subscribe(): + received.append(event) + break + elapsed = (time.perf_counter() - start) * 1000 + passed = len(received) == 1 and received[0].event_type == "test_event" + return ExecutionResult( + actual=f"received={len(received)}", passed=passed, - expected="subscriber exits on close", - actual=f"{len(events)} events, closed={eq2.is_closed}", - duration_ms=round(elapsed_ms, 2), + duration_ms=round(elapsed, 4), ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} eq_close_sentinel") - # Test 6: subscriber count - start = time.perf_counter() - eq3 = EventQueue() - initial_count = eq3.subscriber_count - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = initial_count == 0 - result.add( - TestCaseResult( - case_id="eq_subscriber_count", + if task.task_id == "ev-005": # EQ close sentinel + eq = EventQueue() + + async def _consume_all() -> list[Event]: + events: list[Event] = [] + async for ev in eq.subscribe(): + events.append(ev) + return events + + consumer_task = asyncio.create_task(_consume_all()) + await asyncio.sleep(0.01) + test_event = Event( + event_type="test_event", + task_id="task-1", + session_id="session-1", + data={"msg": "hello"}, + timestamp=datetime.now(timezone.utc).isoformat(), + ) + await eq.emit(test_event) + await asyncio.sleep(0.01) + eq.close() + events = await asyncio.wait_for(consumer_task, timeout=2.0) + elapsed = (time.perf_counter() - start) * 1000 + passed = len(events) >= 1 and eq.is_closed + return ExecutionResult( + actual=f"events={len(events)} closed={eq.is_closed}", passed=passed, - expected="0 subscribers initially", - 
actual=f"{initial_count} subscribers", - duration_ms=round(elapsed_ms, 2), + duration_ms=round(elapsed, 4), ) + + if task.task_id == "ev-006": # EQ subscriber count + eq = EventQueue() + count = eq.subscriber_count + elapsed = (time.perf_counter() - start) * 1000 + passed = count == 0 + return ExecutionResult( + actual=f"subscribers={count}", + passed=passed, + duration_ms=round(elapsed, 4), + ) + + return ExecutionResult( + actual="unknown_task", + passed=False, + duration_ms=0.0, + detail=f"Unknown event_model task: {task.task_id}", ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} eq_subscriber_count") - - if fast: - # Keep only core cases in fast mode - core_ids = {"sq_submit_drain", "eq_emit_subscribe_replay", "eq_close_sentinel"} - result.details = [d for d in result.details if d.case_id in core_ids] - result.total = len(result.details) - result.passed = sum(1 for d in result.details if d.passed) - result.failed = result.total - result.passed - - return result -async def _run_spec_management(fast: bool, verbose: bool, tmp_dir: Path) -> DimensionResult: - """Test SpecManager CRUD operations.""" +async def _exec_spec_management(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult: + """Execute spec management benchmark task (each task is self-contained).""" from agentkit.core.spec_manager import Spec, SpecManager, SpecStep - specs_dir = str(tmp_dir / "specs") + specs_dir = str(ctx.tmp_dir / "specs" / task.task_id) manager = SpecManager(specs_dir=specs_dir) - result = DimensionResult(dimension="spec_management") - - # Test 1: create start = time.perf_counter() - spec = Spec( - spec_id="spec-001", - goal="Test goal", - steps=[ - SpecStep(step_id="s1", name="step1", description="first step"), - SpecStep(step_id="s2", name="step2", description="second step", dependencies=["s1"]), - ], - ) - path = manager.create(spec) - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = path.exists() - result.add( - 
TestCaseResult( - case_id="spec_create", - passed=passed, - expected="file exists on disk", - actual=f"exists={path.exists()}", - duration_ms=round(elapsed_ms, 2), + + if task.task_id == "sm-001": # create + spec = Spec( + spec_id="test-spec", + goal="Test goal", + steps=[SpecStep(step_id="s1", name="step1", description="first step")], ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_create") - - # Test 2: get - start = time.perf_counter() - loaded = manager.get("spec-001") - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = loaded is not None and loaded.spec_id == "spec-001" and len(loaded.steps) == 2 - result.add( - TestCaseResult( - case_id="spec_get", + path = manager.create(spec) + elapsed = (time.perf_counter() - start) * 1000 + passed = path.exists() + return ExecutionResult( + actual=f"exists={passed}", passed=passed, - expected="spec with 2 steps", + duration_ms=round(elapsed, 4), + detail=f"path={path}", + ) + + if task.task_id == "sm-002": # get + spec = Spec( + spec_id="test-spec", + goal="Test goal", + steps=[ + SpecStep(step_id="s1", name="step1", description="first step"), + SpecStep(step_id="s2", name="step2", description="second step"), + ], + ) + manager.create(spec) + loaded = manager.get("test-spec") + elapsed = (time.perf_counter() - start) * 1000 + passed = loaded is not None and loaded.spec_id == "test-spec" and len(loaded.steps) == 2 + return ExecutionResult( actual=f"steps={len(loaded.steps) if loaded else 0}", - duration_ms=round(elapsed_ms, 2), - ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_get") - - # Test 3: update - start = time.perf_counter() - updated = manager.update("spec-001", goal="Updated goal") - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = updated is not None and updated.goal == "Updated goal" - result.add( - TestCaseResult( - case_id="spec_update", passed=passed, - expected="goal='Updated goal'", + 
duration_ms=round(elapsed, 4), + ) + + if task.task_id == "sm-003": # update + spec = Spec(spec_id="test-spec", goal="Original goal") + manager.create(spec) + updated = manager.update("test-spec", goal="Updated goal") + elapsed = (time.perf_counter() - start) * 1000 + passed = updated is not None and updated.goal == "Updated goal" + return ExecutionResult( actual=f"goal={updated.goal if updated else None}", - duration_ms=round(elapsed_ms, 2), - ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_update") - - # Test 4: confirm - start = time.perf_counter() - confirmed = manager.confirm("spec-001") - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = ( - confirmed is not None - and confirmed.status == "confirmed" - and confirmed.confirmed_at is not None - and all(s.status == "confirmed" for s in confirmed.steps) - ) - result.add( - TestCaseResult( - case_id="spec_confirm", passed=passed, - expected="status=confirmed, all steps confirmed", + duration_ms=round(elapsed, 4), + ) + + if task.task_id == "sm-004": # delete + spec = Spec(spec_id="test-spec", goal="To be deleted") + manager.create(spec) + deleted = manager.delete("test-spec") + remaining = manager.list_specs() + elapsed = (time.perf_counter() - start) * 1000 + passed = bool(deleted and len(remaining) == 0) + return ExecutionResult( + actual=f"deleted={deleted} remaining={len(remaining)}", + passed=passed, + duration_ms=round(elapsed, 4), + ) + + if task.task_id == "sm-005": # list + manager.create(Spec(spec_id="spec-a", goal="Goal A")) + manager.create(Spec(spec_id="spec-b", goal="Goal B")) + specs = manager.list_specs() + elapsed = (time.perf_counter() - start) * 1000 + passed = len(specs) == 2 + return ExecutionResult( + actual=f"count={len(specs)}", + passed=passed, + duration_ms=round(elapsed, 4), + ) + + if task.task_id == "sm-006": # confirm + spec = Spec( + spec_id="test-spec", + goal="Test goal", + steps=[SpecStep(step_id="s1", name="step1", 
description="first step")], + ) + manager.create(spec) + confirmed = manager.confirm("test-spec") + elapsed = (time.perf_counter() - start) * 1000 + passed = bool( + confirmed is not None + and confirmed.status == "confirmed" + and confirmed.confirmed_at is not None + and all(s.status == "confirmed" for s in confirmed.steps) + ) + return ExecutionResult( actual=f"status={confirmed.status if confirmed else None}", - duration_ms=round(elapsed_ms, 2), - ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_confirm") - - # Test 5: list - start = time.perf_counter() - # Create a second spec for listing - spec2 = Spec(spec_id="spec-002", goal="Second goal") - manager.create(spec2) - specs = manager.list_specs() - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = len(specs) == 2 - result.add( - TestCaseResult( - case_id="spec_list", passed=passed, - expected="2 specs", - actual=f"{len(specs)} specs", - duration_ms=round(elapsed_ms, 2), + duration_ms=round(elapsed, 4), ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_list") - # Test 6: delete - start = time.perf_counter() - deleted = manager.delete("spec-002") - remaining = manager.list_specs() - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = deleted and len(remaining) == 1 - result.add( - TestCaseResult( - case_id="spec_delete", + if task.task_id == "sm-007": # get missing + missing = manager.get("nonexistent") + elapsed = (time.perf_counter() - start) * 1000 + passed = missing is None + return ExecutionResult( + actual=f"result={missing}", passed=passed, - expected="deleted, 1 remaining", - actual=f"deleted={deleted}, remaining={len(remaining)}", - duration_ms=round(elapsed_ms, 2), + duration_ms=round(elapsed, 4), ) + + return ExecutionResult( + actual="unknown_task", + passed=False, + duration_ms=0.0, + detail=f"Unknown spec_management task: {task.task_id}", ) - if verbose: - console.print(f" 
{'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_delete") - - # Test 7: get nonexistent - start = time.perf_counter() - missing = manager.get("nonexistent") - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = missing is None - result.add( - TestCaseResult( - case_id="spec_get_missing", - passed=passed, - expected="None", - actual=f"{missing}", - duration_ms=round(elapsed_ms, 2), - ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} spec_get_missing") - - if fast: - core_ids = {"spec_create", "spec_get", "spec_confirm", "spec_delete"} - result.details = [d for d in result.details if d.case_id in core_ids] - result.total = len(result.details) - result.passed = sum(1 for d in result.details if d.passed) - result.failed = result.total - result.passed - - return result -async def _run_verification(fast: bool, verbose: bool, tmp_dir: Path) -> DimensionResult: - """Test VerificationLoop execute/retry behavior.""" +async def _exec_verification(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult: + """Execute verification benchmark task.""" from agentkit.core.verification_loop import VerificationLoop - result = DimensionResult(dimension="verification") - - # Test 1: passing command + working_dir = str(ctx.tmp_dir) start = time.perf_counter() - loop_pass = VerificationLoop( - commands=["true"], - max_retries=0, - working_dir=str(tmp_dir), - timeout=5.0, - ) - res = await loop_pass.verify() - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = res.passed and res.attempts == 1 - result.add( - TestCaseResult( - case_id="verify_pass", - passed=passed, - expected="passed=True, attempts=1", - actual=f"passed={res.passed}, attempts={res.attempts}", - duration_ms=round(elapsed_ms, 2), + + if task.task_id == "vf-001": # pass + loop = VerificationLoop( + commands=["true"], max_retries=0, working_dir=working_dir, timeout=5.0 ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else 
'[red]✗[/red]'} verify_pass") - - # Test 2: failing command - start = time.perf_counter() - loop_fail = VerificationLoop( - commands=["false"], - max_retries=0, - working_dir=str(tmp_dir), - timeout=5.0, - ) - res = await loop_fail.verify() - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = not res.passed and len(res.errors) > 0 - result.add( - TestCaseResult( - case_id="verify_fail", + res = await loop.verify() + elapsed = (time.perf_counter() - start) * 1000 + passed = bool(res.passed and res.attempts == 1) + return ExecutionResult( + actual=f"passed={res.passed} attempts={res.attempts}", passed=passed, - expected="passed=False, has errors", - actual=f"passed={res.passed}, errors={len(res.errors)}", - duration_ms=round(elapsed_ms, 2), + duration_ms=round(elapsed, 4), ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} verify_fail") - # Test 3: retry with fix callback - start = time.perf_counter() - call_count = 0 - - async def _fix_callback(errors: list[str], output: str) -> None: - nonlocal call_count - call_count += 1 - - # Use a command that always fails to test retry logic - loop_retry = VerificationLoop( - commands=["false"], - max_retries=2, - working_dir=str(tmp_dir), - timeout=5.0, - ) - res = await loop_retry.verify_and_retry(fix_callback=_fix_callback) - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = not res.passed and res.attempts == 3 and call_count == 2 - result.add( - TestCaseResult( - case_id="verify_retry", - passed=passed, - expected="attempts=3, fix_callback called 2x", - actual=f"attempts={res.attempts}, callbacks={call_count}", - duration_ms=round(elapsed_ms, 2), + if task.task_id == "vf-002": # fail + loop = VerificationLoop( + commands=["false"], max_retries=0, working_dir=working_dir, timeout=5.0 ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} verify_retry") - - # Test 4: timeout - start = time.perf_counter() - loop_timeout = 
VerificationLoop( - commands=["sleep 10"], - max_retries=0, - working_dir=str(tmp_dir), - timeout=0.5, - ) - res = await loop_timeout.verify() - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = not res.passed and any("timed out" in e.lower() for e in res.errors) - result.add( - TestCaseResult( - case_id="verify_timeout", + res = await loop.verify() + elapsed = (time.perf_counter() - start) * 1000 + passed = bool(not res.passed and len(res.errors) > 0) + return ExecutionResult( + actual=f"passed={res.passed} errors={len(res.errors)}", passed=passed, - expected="timeout error", - actual=f"passed={res.passed}, errors={len(res.errors)}", - duration_ms=round(elapsed_ms, 2), + duration_ms=round(elapsed, 4), ) - ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} verify_timeout") - # Test 5: multiple commands (one passes, one fails) - start = time.perf_counter() - loop_multi = VerificationLoop( - commands=["true", "false"], - max_retries=0, - working_dir=str(tmp_dir), - timeout=5.0, - ) - res = await loop_multi.verify() - elapsed_ms = (time.perf_counter() - start) * 1000 - passed = not res.passed and "false" in res.test_output - result.add( - TestCaseResult( - case_id="verify_multi_command", + if task.task_id == "vf-003": # retry with fix_callback + call_count = 0 + + async def _fix_callback(errors: list[str], output: str) -> None: + nonlocal call_count + call_count += 1 + + loop = VerificationLoop( + commands=["false"], max_retries=2, working_dir=working_dir, timeout=5.0 + ) + res = await loop.verify_and_retry(fix_callback=_fix_callback) + elapsed = (time.perf_counter() - start) * 1000 + passed = bool(not res.passed and res.attempts == 3 and call_count == 2) + return ExecutionResult( + actual=f"attempts={res.attempts} callbacks={call_count}", passed=passed, - expected="overall fail, output has both commands", + duration_ms=round(elapsed, 4), + ) + + if task.task_id == "vf-004": # timeout + loop = VerificationLoop( + 
commands=["sleep 10"], max_retries=0, working_dir=working_dir, timeout=0.5 + ) + res = await loop.verify() + elapsed = (time.perf_counter() - start) * 1000 + passed = bool(not res.passed and any("timed out" in e.lower() for e in res.errors)) + return ExecutionResult( + actual=f"passed={res.passed} errors={len(res.errors)}", + passed=passed, + duration_ms=round(elapsed, 4), + detail=f"errors={res.errors[:1]}", + ) + + if task.task_id == "vf-005": # multi command + loop = VerificationLoop( + commands=["true", "false"], max_retries=0, working_dir=working_dir, timeout=5.0 + ) + res = await loop.verify() + elapsed = (time.perf_counter() - start) * 1000 + passed = bool(not res.passed and "false" in res.test_output) + return ExecutionResult( actual=f"passed={res.passed}", - duration_ms=round(elapsed_ms, 2), + passed=passed, + duration_ms=round(elapsed, 4), ) + + return ExecutionResult( + actual="unknown_task", + passed=False, + duration_ms=0.0, + detail=f"Unknown verification task: {task.task_id}", ) - if verbose: - console.print(f" {'[green]✓[/green]' if passed else '[red]✗[/red]'} verify_multi_command") + +_EXECUTORS: dict[ + str, + Callable[[BenchmarkTask, BenchmarkContext], Awaitable[ExecutionResult]], +] = { + "preprocessing": _exec_preprocessing, + "overfitting": _exec_overfitting, + "efficiency": _exec_efficiency, + "tool_search": _exec_tool_search, + "event_model": _exec_event_model, + "spec_management": _exec_spec_management, + "verification": _exec_verification, +} + + +async def _execute_task(task: BenchmarkTask, ctx: BenchmarkContext) -> ExecutionResult: + """Execute a single benchmark task via the dimension dispatcher.""" + executor = _EXECUTORS.get(task.dimension) + if executor is None: + return ExecutionResult( + actual="unknown_dimension", + passed=False, + duration_ms=0.0, + detail=f"Unknown dimension: {task.dimension}", + ) + return await executor(task, ctx) + + +async def _execute_task_safely(task: BenchmarkTask, ctx: BenchmarkContext) -> 
ExecutionResult: + """Execute a task with exception handling.""" + try: + return await _execute_task(task, ctx) + except Exception as e: + return ExecutionResult( + actual="__exception__", + passed=False, + duration_ms=0.0, + detail=f"Exception: {type(e).__name__}: {e}", + consistency=0.0, + ) + + +# --------------------------------------------------------------------------- +# Dimension runner +# --------------------------------------------------------------------------- + + +async def _run_dimension( + dimension: str, + runs: int, + fast: bool, + verbose: bool, + ctx: BenchmarkContext, +) -> DimensionResult: + """Run all tasks for a dimension, optionally multiple times.""" + tasks = [t for t in TASK_SET if t.dimension == dimension] if fast: - core_ids = {"verify_pass", "verify_fail", "verify_retry"} - result.details = [d for d in result.details if d.case_id in core_ids] - result.total = len(result.details) - result.passed = sum(1 for d in result.details if d.passed) - result.failed = result.total - result.passed + tasks = [t for t in tasks if t.task_id in _FAST_CORE_IDS] - return result + all_runs_cases: list[list[CaseResult]] = [] + accuracies: list[float] = [] + + for run_idx in range(runs): + run_ctx = BenchmarkContext( + preprocessor=ctx.preprocessor, + search_index=ctx.search_index, + tmp_dir=ctx.tmp_dir / f"run-{run_idx}", + ) + run_ctx.tmp_dir.mkdir(parents=True, exist_ok=True) + + cases: list[CaseResult] = [] + for task in tasks: + result = await _execute_task_safely(task, run_ctx) + root_cause = _classify_root_cause(task, result) + case = CaseResult( + task_id=task.task_id, + dimension=task.dimension, + category=task.category, + difficulty=task.difficulty, + passed=result.passed, + expected=task.expected, + actual=result.actual, + duration_ms=result.duration_ms, + root_cause=root_cause, + detail=result.detail, + consistency=result.consistency, + ) + cases.append(case) + + if verbose: + status = "[green]✓[/green]" if case.passed else "[red]✗[/red]" + 
console.print( + f" {status} {task.task_id}: {result.actual} ({result.duration_ms:.2f}ms)" + ) + + all_runs_cases.append(cases) + passed_count = sum(1 for c in cases if c.passed) + accuracies.append(passed_count / len(cases) if cases else 0.0) + + final_cases = all_runs_cases[-1] if all_runs_cases else [] + metrics = _compute_metrics(final_cases, accuracies if runs > 1 else None) + by_category = _aggregate_by(final_cases, "category") + by_difficulty = _aggregate_by(final_cases, "difficulty") + + return DimensionResult( + dimension=dimension, + metrics=metrics, + cases=final_cases, + by_category=by_category, + by_difficulty=by_difficulty, + ) # --------------------------------------------------------------------------- -# Report generation +# Report generators # --------------------------------------------------------------------------- +def _dimension_to_dict(dim_result: DimensionResult) -> dict[str, object]: + """Convert a DimensionResult to a serializable dict.""" + return { + "metrics": asdict(dim_result.metrics), + "by_category": {k: asdict(v) for k, v in dim_result.by_category.items()}, + "by_difficulty": {k: asdict(v) for k, v in dim_result.by_difficulty.items()}, + "cases": [asdict(c) for c in dim_result.cases], + } + + def _generate_json_report( - report_data: dict[str, Any], + report_data: dict[str, object], output_path: Path, ) -> None: + """Generate JSON report.""" output_path.parent.mkdir(parents=True, exist_ok=True) output_path.write_text( - json.dumps(report_data, indent=2, ensure_ascii=False), + json.dumps(report_data, indent=2, ensure_ascii=False, default=str), encoding="utf-8", ) -def _generate_txt_report( - report_data: dict[str, Any], +def _md_table(headers: list[str], rows: list[list[str]]) -> str: + """Generate a Markdown table.""" + lines = ["| " + " | ".join(headers) + " |"] + lines.append("|" + "|".join("---" for _ in headers) + "|") + for row in rows: + lines.append("| " + " | ".join(row) + " |") + return "\n".join(lines) + + +def 
_generate_markdown_report(
+    report_data: dict[str, object],
     output_path: Path,
 ) -> None:
+    """Generate human-readable Markdown report."""
     output_path.parent.mkdir(parents=True, exist_ok=True)
+    timestamp = str(report_data.get("timestamp", ""))
+    version = str(report_data.get("version", ""))
+    runs = int(report_data.get("runs", 1))
+    overall = float(report_data.get("overall_accuracy", 0.0))
+    overall_mean = float(report_data.get("overall_accuracy_mean", overall))
+    overall_std = float(report_data.get("overall_accuracy_std", 0.0))
+
     lines: list[str] = []
-    lines.append("=" * 70)
-    lines.append("AgentKit Benchmark Report")
-    lines.append("=" * 70)
-    lines.append(f"Timestamp: {report_data['timestamp']}")
-    lines.append(f"Version: {report_data['version']}")
-    lines.append(f"Overall Score: {report_data['overall_score']:.1%}")
-    lines.append(f"Summary: {report_data['summary']}")
+    lines.append("# AgentKit 能力基准测试报告")
+    lines.append("")
+    lines.append("## 测试概要")
+    lines.append(f"- 时间: {timestamp}")
+    lines.append(f"- 版本: {version}")
+    lines.append(f"- 运行次数: {runs}")
+    lines.append(f"- 总体准确率: {overall_mean:.1%} ± {overall_std:.1%}")
     lines.append("")
-    lines.append("-" * 70)
-    lines.append(f"{'Dimension':<20} {'Total':>6} {'Pass':>6} {'Fail':>6} {'Score':>8}")
-    lines.append("-" * 70)
-
-    total_all = 0
-    pass_all = 0
-    fail_all = 0
-
-    for dim_name, dim_data in report_data["dimensions"].items():
-        total = dim_data["total"]
-        passed = dim_data["passed"]
-        failed = dim_data["failed"]
-        score = dim_data["score"]
-        lines.append(
-            f"{dim_name:<20} {total:>6} {passed:>6} {failed:>6} {score:>7.1%}"
-        )
-        total_all += total
-        pass_all += passed
-        fail_all += failed
-
-    lines.append("-" * 70)
-    overall = pass_all / total_all if total_all > 0 else 0.0
+    # Industry benchmark comparison
+    lines.append("## 与行业 Benchmark 对比")
+    lines.append("")
     lines.append(
-        f"{'OVERALL':<20} {total_all:>6} {pass_all:>6} {fail_all:>6} {overall:>7.1%}"
+        _md_table(
+            ["Benchmark", "测试对象", "AgentKit 对应"],
+            [
+                ["SWE-bench", "LLM 代码修复", "— (测 LLM 非框架)"],
+                ["ToolBench", "工具调用", "tool_search 维度"],
+                ["AgentBench", "Agent 系统", "全部维度"],
+            ],
+        )
     )
-    lines.append("=" * 70)
     lines.append("")
-    # Detailed failures
-    has_failures = False
-    for dim_name, dim_data in report_data["dimensions"].items():
-        failures = [d for d in dim_data["details"] if not d["passed"]]
-        if failures:
-            if not has_failures:
-                lines.append("Failed Cases:")
-                lines.append("-" * 70)
-                has_failures = True
-            for f in failures:
-                lines.append(f" [{dim_name}] {f['case_id']}")
-                lines.append(f" expected: {f['expected']}")
-                lines.append(f" actual: {f['actual']}")
-                if f.get("detail"):
-                    lines.append(f" detail: {f['detail']}")
+    # Dimension results
+    dimensions = report_data.get("dimensions", {})
+    if not isinstance(dimensions, dict):
+        dimensions = {}
+
+    dim_titles = {
+        "preprocessing": "1. 预处理准确度 (Preprocessing Accuracy)",
+        "overfitting": "2. 过拟合检测 (Overfitting Detection)",
+        "efficiency": "3. 效率测试 (Efficiency)",
+        "tool_search": "4. 工具搜索 (Tool Search)",
+        "event_model": "5. 事件模型 (Event Model)",
+        "spec_management": "6. 规格管理 (Spec Management)",
+        "verification": "7. 验证循环 (Verification Loop)",
+    }
+
+    lines.append("## 维度结果")
+    lines.append("")
+
+    for dim_name, title in dim_titles.items():
+        dim_data = dimensions.get(dim_name)
+        if not isinstance(dim_data, dict):
+            continue
+        metrics = dim_data.get("metrics", {})
+        if not isinstance(metrics, dict):
+            metrics = {}
+
+        lines.append(f"### {title}")
+        lines.append("")
+
+        acc = float(metrics.get("accuracy", 0.0))
+        acc_mean = float(metrics.get("accuracy_mean", acc))
+        acc_std = float(metrics.get("accuracy_std", 0.0))
+        precision = float(metrics.get("precision", 0.0))
+        recall = float(metrics.get("recall", 0.0))
+        f1 = float(metrics.get("f1", 0.0))
+        p50 = float(metrics.get("latency_p50_ms", 0.0))
+        p95 = float(metrics.get("latency_p95_ms", 0.0))
+        p99 = float(metrics.get("latency_p99_ms", 0.0))
+        consistency = float(metrics.get("consistency", 0.0))
+        total = int(metrics.get("total", 0))
+        passed = int(metrics.get("passed", 0))
+        failed = int(metrics.get("failed", 0))
+        ci_lower = float(metrics.get("ci_lower", 0.0))
+        ci_upper = float(metrics.get("ci_upper", 0.0))
+
+        lines.append(
+            _md_table(
+                ["指标", "值"],
+                [
+                    ["Accuracy", f"{acc_mean:.1%} ± {acc_std:.1%}"],
+                    ["95% CI", f"[{ci_lower:.1%}, {ci_upper:.1%}]"],
+                    ["Precision", f"{precision:.1%}"],
+                    ["Recall", f"{recall:.1%}"],
+                    ["F1", f"{f1:.1%}"],
+                    ["Latency p50", f"{p50:.2f}ms"],
+                    ["Latency p95", f"{p95:.2f}ms"],
+                    ["Latency p99", f"{p99:.2f}ms"],
+                    ["Consistency", f"{consistency:.1%}"],
+                    ["Total / Pass / Fail", f"{total} / {passed} / {failed}"],
+                ],
+            )
+        )
+        lines.append("")
+
+        # By category
+        by_category = dim_data.get("by_category", {})
+        if isinstance(by_category, dict) and by_category:
+            lines.append("#### 按类别分布")
+            lines.append("")
+            cat_rows: list[list[str]] = []
+            for cat_name, cat_metrics in by_category.items():
+                if not isinstance(cat_metrics, dict):
+                    continue
+                cat_total = int(cat_metrics.get("total", 0))
+                cat_passed = int(cat_metrics.get("passed", 0))
+                cat_acc = float(cat_metrics.get("accuracy", 0.0))
+                cat_rows.append(
+                    [
+                        str(cat_name),
+                        str(cat_total),
+                        str(cat_passed),
+                        f"{cat_acc:.1%}",
+                    ]
+                )
+            lines.append(_md_table(["类别", "用例数", "通过", "准确率"], cat_rows))
+            lines.append("")
+
+        # By difficulty
+        by_difficulty = dim_data.get("by_difficulty", {})
+        if isinstance(by_difficulty, dict) and by_difficulty:
+            lines.append("#### 按难度分布")
+            lines.append("")
+            diff_rows: list[list[str]] = []
+            for diff_name, diff_metrics in by_difficulty.items():
+                if not isinstance(diff_metrics, dict):
+                    continue
+                diff_total = int(diff_metrics.get("total", 0))
+                diff_passed = int(diff_metrics.get("passed", 0))
+                diff_acc = float(diff_metrics.get("accuracy", 0.0))
+                diff_rows.append(
+                    [
+                        str(diff_name),
+                        str(diff_total),
+                        str(diff_passed),
+                        f"{diff_acc:.1%}",
+                    ]
+                )
+            lines.append(_md_table(["难度", "用例数", "通过", "准确率"], diff_rows))
+            lines.append("")
+
+        # Failure analysis
+        cases = dim_data.get("cases", [])
+        if isinstance(cases, list):
+            failures = [c for c in cases if isinstance(c, dict) and not c.get("passed", True)]
+            if failures:
+                lines.append("#### 失败用例分析")
+                lines.append("")
+                fail_rows: list[list[str]] = []
+                for f in failures:
+                    fail_rows.append(
+                        [
+                            str(f.get("task_id", "")),
+                            str(f.get("category", "")),
+                            str(f.get("difficulty", "")),
+                            str(f.get("expected", "")),
+                            str(f.get("actual", "")),
+                            str(f.get("root_cause", "")),
+                        ]
+                    )
+                lines.append(
+                    _md_table(
+                        ["用例 ID", "类别", "难度", "期望", "实际", "根因"],
+                        fail_rows,
+                    )
+                )
                 lines.append("")
-    if not has_failures:
-        lines.append("All tests passed — no failures to report.")
+    # Baseline comparison
+    baseline_comparison = report_data.get("baseline_comparison")
+    if isinstance(baseline_comparison, dict):
+        lines.append("## 基线对比")
         lines.append("")
+        status = baseline_comparison.get("status", "")
+        if status == "first_run":
+            lines.append("> 首次运行,已自动创建基线。")
+            lines.append("")
+        else:
+            dim_comparisons = baseline_comparison.get("dimensions", {})
+            if isinstance(dim_comparisons, dict) and dim_comparisons:
+                bl_rows: list[list[str]] = []
+                for dim_name, cmp_data in dim_comparisons.items():
+                    if not isinstance(cmp_data, dict):
+                        continue
+                    bl_acc = float(cmp_data.get("baseline_accuracy", 0.0))
+                    cur_acc = float(cmp_data.get("current_accuracy", 0.0))
+                    direction = str(cmp_data.get("direction", "—"))
+                    bl_rows.append(
+                        [
+                            str(dim_name),
+                            f"{bl_acc:.1%}",
+                            f"{cur_acc:.1%}",
+                            direction,
+                        ]
+                    )
+                lines.append(
+                    _md_table(
+                        ["维度", "基线准确率", "当前准确率", "变化"],
+                        bl_rows,
+                    )
+                )
+                lines.append("")
+
+    # Improvement suggestions
+    lines.append("## 问题总结与改进建议")
+    lines.append("")
+    suggestions = _generate_suggestions(dimensions)
+    for s in suggestions:
+        lines.append(s)
+    lines.append("")
     output_path.write_text("\n".join(lines), encoding="utf-8")
+def _generate_suggestions(dimensions: dict[str, object]) -> list[str]:
+    """Generate improvement suggestions based on results."""
+    suggestions: list[str] = []
+    if not isinstance(dimensions, dict):
+        return ["- 所有维度表现良好。"]
+
+    for dim_name, dim_data in dimensions.items():
+        if not isinstance(dim_data, dict):
+            continue
+        metrics = dim_data.get("metrics", {})
+        if not isinstance(metrics, dict):
+            continue
+        acc = float(metrics.get("accuracy", 1.0))
+        p95 = float(metrics.get("latency_p95_ms", 0.0))
+        consistency = float(metrics.get("consistency", 1.0))
+
+        if acc < 0.9:
+            suggestions.append(
+                f"- **{dim_name}**: 准确率 {acc:.1%} 低于 90%,建议检查失败用例并优化"
+            )
+        if p95 > 100:
+            suggestions.append(f"- **{dim_name}**: P95 延迟 {p95:.2f}ms 较高,建议优化性能")
+        if dim_name == "overfitting" and consistency < 1.0:
+            suggestions.append(
+                f"- **overfitting**: 一致性 {consistency:.1%} 低于 100%,存在过拟合风险"
+            )
+
+    if not suggestions:
+        suggestions.append("- 所有维度表现良好,无需特别改进。")
+    return suggestions
+
+
 def _generate_html_report(
-    report_data: dict[str, Any],
+    report_data: dict[str, object],
     output_path: Path,
 ) -> None:
+    """Generate HTML report."""
     output_path.parent.mkdir(parents=True, exist_ok=True)
+    dimensions = report_data.get("dimensions", {})
+    if not isinstance(dimensions, dict):
+        dimensions = {}
+
     rows_html: list[str] = []
     total_all = 0
     pass_all = 0
     fail_all = 0
-    for dim_name, dim_data in report_data["dimensions"].items():
-        total = dim_data["total"]
-        passed = dim_data["passed"]
-        failed = dim_data["failed"]
-        score = dim_data["score"]
+    for dim_name, dim_data in dimensions.items():
+        if not isinstance(dim_data, dict):
+            continue
+        metrics = dim_data.get("metrics", {})
+        if not isinstance(metrics, dict):
+            metrics = {}
+        total = int(metrics.get("total", 0))
+        passed = int(metrics.get("passed", 0))
+        failed = int(metrics.get("failed", 0))
+        acc = float(metrics.get("accuracy", 0.0))
         total_all += total
         pass_all += passed
         fail_all += failed
-        score_class = "score-good" if score >= 0.9 else "score-warn" if score >= 0.7 else "score-bad"
+        acc_class = "good" if acc >= 0.9 else "warn" if acc >= 0.7 else "bad"
         rows_html.append(
             f"
All tests passed.
"
-        )
+    timestamp = str(report_data.get("timestamp", ""))
+    version = str(report_data.get("version", ""))
+    runs = int(report_data.get("runs", 1))
     html = f"""
@@ -1124,33 +1946,26 @@ def _generate_html_report(
     td.num {{ text-align: right; font-family: monospace; }}
     td.pass {{ color: #2e7d32; }}
     td.fail {{ color: #c62828; }}
-    .score-good {{ color: #2e7d32; font-weight: bold; }}
-    .score-warn {{ color: #e65100; font-weight: bold; }}
-    .score-bad {{ color: #c62828; font-weight: bold; }}
-    .overall-row {{ background-color: #f5f5f5; }}
-    .failure {{ margin: 0.5em 0; padding: 0.5em; background: #fff3e0; border-left: 3px solid #ff9800; }}
-    .failure .dim {{ color: #e65100; font-weight: bold; }}
-    .failure .case {{ font-family: monospace; }}
-    .failure .detail {{ font-size: 0.85em; color: #555; margin-left: 1em; }}
-    .all-pass {{ color: #2e7d32; font-weight: bold; }}
+    .good {{ color: #2e7d32; font-weight: bold; }}
+    .warn {{ color: #e65100; font-weight: bold; }}
+    .bad {{ color: #c62828; font-weight: bold; }}
| Dimension | Total | Pass | Fail | Score | ||||
|---|---|---|---|---|---|---|---|---|
| Dimension | Total | Pass | Fail | Acc | P | R | F1 | p50 |