diff --git a/configs/skills/benchmark_runner.yaml b/configs/skills/benchmark_runner.yaml
index 159ccbf..cff4e92 100644
--- a/configs/skills/benchmark_runner.yaml
+++ b/configs/skills/benchmark_runner.yaml
@@ -40,55 +40,92 @@ prompt:
     采用行业 Benchmark 方法论（SWE-bench / AgentBench / ToolBench 风格），
     提供 Accuracy / Precision / Recall / F1 / Latency / Consistency 等完整指标。
 
-    ## 可用命令
+    ## 测试模式（--mode）
+    支持三种测试模式，可组合使用：
+
+    ### Mock 模式（默认，快速、无 LLM 依赖）
+    ```bash
+    python3 -m agentkit.cli.main benchmark --mode mock --report --verbose
+    ```
+    全部使用 Mock 数据，7 个维度 53 个用例，适合 CI/CD 快速回归。
+
+    ### LLM 模式（使用真实 LLM）
+    ```bash
+    python3 -m agentkit.cli.main benchmark --mode llm --report --verbose
+    ```
+    从 agentkit.yaml 加载真实 LLM 配置，测试 LLM 推理能力：
+    - 意图理解：LLM 是否正确识别用户意图
+    - 工具选择：LLM 是否选择正确工具
+    - 多步推理：LLM 是否能分解复杂任务
+    - 代码生成：LLM 是否能生成可执行代码
+    - 错误恢复：LLM 是否能给出修复建议
+    需要 agentkit.yaml 中配置了有效的 LLM API key。
+
+    ### GUI 模式（启动真实服务器测试端到端）
+    ```bash
+    python3 -m agentkit.cli.main benchmark --mode gui --report --verbose
+    ```
+    自动启动 agentkit gui 服务器，测试：
+    - 服务启动：agentkit gui --port XXXX 能否成功启动
+    - API 可用性：/api/v1/health, /api/v1/skills, /api/v1/chat
+    - WebSocket 连接：ws://localhost:XXXX/api/v1/ws/{session}
+    - 前端资源：HTML/JS/CSS 是否可访问
+    测试完成后自动关闭服务器。
+
+    ### 全部模式（Mock + LLM + GUI）
+    ```bash
+    python3 -m agentkit.cli.main benchmark --mode all --report --verbose
+    ```
+    运行所有 9 个维度共 63 个测试用例，最全面的评估。
 
     ### 完整回测（推荐）
     ```bash
-    python3 -m agentkit.cli.main benchmark --report --verbose
+    python3 -m agentkit.cli.main benchmark --mode all --report --verbose
     ```
-    运行所有 7 个维度共 53 个标准化测试用例，生成 JSON + Markdown 报告。
+    运行所有 9 个维度（7 Mock + 1 LLM + 1 GUI）共 63 个测试用例。
     默认运行 3 次取均值 ± 标准差，附带 95% Wilson 置信区间。
 
     ### 快速回测
     ```bash
-    python3 -m agentkit.cli.main benchmark --fast --report
+    python3 -m agentkit.cli.main benchmark --mode mock --fast --report
     ```
-    运行核心用例（约 22 个），适合开发时快速验证。
+    运行 Mock 模式核心用例（约 22 个），适合开发时快速验证。
 
     ### 单维度回测
     ```bash
     python3 -m agentkit.cli.main benchmark --dimension <dim> --verbose
     ```
-    可选维度：preprocessing, overfitting, efficiency, tool_search, event_model, spec_management, verification
+    可选维度：preprocessing, overfitting, efficiency, tool_search, event_model,
+    spec_management, verification, llm_reasoning, gui_integration
 
     ### 多次运行取均值（--runs）
     ```bash
-    python3 -m agentkit.cli.main benchmark --runs 5 --report
+    python3 -m agentkit.cli.main benchmark --mode all --runs 3 --report
     ```
     指定运行次数（默认 3），计算 accuracy_mean ± accuracy_std 和 95% 置信区间。
     适用于稳定性评估和回归检测。
 
     ### 基线对比（--baseline）
     ```bash
-    python3 -m agentkit.cli.main benchmark --baseline --report
+    python3 -m agentkit.cli.main benchmark --mode all --baseline --report
     ```
     首次运行自动创建基线（baseline.json），后续运行与基线对比，显示 ↑/↓ 变化趋势。
     适用于 CI/CD 回归监控。
 
     ### Markdown 报告（默认）
     ```bash
-    python3 -m agentkit.cli.main benchmark --report --format markdown
+    python3 -m agentkit.cli.main benchmark --mode all --report --format markdown
     ```
     生成人类可读的 Markdown 报告，包含指标表格、失败用例分析、改进建议。
 
     ### HTML 报告
     ```bash
-    python3 -m agentkit.cli.main benchmark --report --format html
+    python3 -m agentkit.cli.main benchmark --mode all --report --format html
     ```
 
     ### JSON 报告
     ```bash
-    python3 -m agentkit.cli.main benchmark --report --format json
+    python3 -m agentkit.cli.main benchmark --mode all --report --format json
     ```
     仅生成 JSON 报告，适合机器解析和 CI 集成。
 
@@ -96,11 +133,11 @@ prompt:
     ```bash
     python3 -m pytest tests/e2e/test_capability_comprehensive.py -v -m e2e_capability
     ```
-    运行 64 个测试（10 维度，含标准 Benchmark 框架集成测试），生成 comprehensive_report。
+    运行 64 个测试（含标准 Benchmark 框架集成测试），生成 comprehensive_report。
 
     ### 指定输出目录
     ```bash
-    python3 -m agentkit.cli.main benchmark --report -o ./my-results
+    python3 -m agentkit.cli.main benchmark --mode all --report -o ./my-results
     ```
 
     ## 测试维度说明
@@ -113,14 +150,24 @@ prompt:
     - **Consistency** — 一致性（过拟合检测，改写输入的稳定性）
     - **95% CI** — Wilson 置信区间（多次运行时）
 
-    维度清单：
-    1. **preprocessing** — 预处理准确度：greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT
-    2. **overfitting** — 过拟合检测：同一意图不同表达的一致性（Consistency 指标）
-    3. **efficiency** — 执行效率：预处理延迟 < 50ms, 工具搜索延迟 < 10ms（Latency 指标）
-    4. **tool_search** — 工具搜索准确度：BM25 相关性排序（P/R/F1 指标）
-    5. **event_model** — 事件模型完整性：SQ/EQ 双队列生命周期
-    6. **spec_management** — Spec 管理：CRUD 操作
-    7. **verification** — 验证循环：verify/retry 行为
+    维度清单（9 个维度，按模式分组）：
+
+    **Mock 模式（7 维度，53 用例）**：
+    1. **preprocessing** [Mock] — 预处理准确度：greeting→DIRECT_CHAT, tool→REACT, @skill→SKILL_REACT
+    2. **overfitting** [Mock] — 过拟合检测：同一意图不同表达的一致性
+    3. **efficiency** [Mock] — 执行效率：预处理延迟 < 50ms, 工具搜索延迟 < 10ms
+    4. **tool_search** [Mock] — 工具搜索准确度：BM25 相关性排序
+    5. **event_model** [Mock] — 事件模型完整性：SQ/EQ 双队列生命周期
+    6. **spec_management** [Mock] — Spec 管理：CRUD 操作
+    7. **verification** [Mock] — 验证循环：verify/retry 行为
+
+    **LLM 模式（1 维度，5 用例）**：
+    8. **llm_reasoning** [LLM] — LLM 推理能力：意图理解/工具选择/多步推理/代码生成/错误恢复
+       使用真实 LLM 调用，记录 Token 使用量和响应延迟。
+
+    **GUI 模式（1 维度，5 用例）**：
+    9. **gui_integration** [GUI] — GUI 集成测试：服务启动/API 可用性/WebSocket/前端资源
+       自动启动 agentkit gui 服务器，测试完成后自动清理。
 
     ## 报告位置
     - CLI 报告：`test-results/benchmark/benchmark_report.{json,md,html}`
@@ -131,10 +178,15 @@ prompt:
     1. 运行测试命令
     2. 读取生成的报告文件（JSON + Markdown）
     3. 向用户展示结果摘要表格，包含各维度的 Accuracy / P / R / F1 / Latency
-    4. 如有失败用例，分析根因（wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded）
-    5. 对比基线报告（如使用 --baseline），展示各维度准确率的 ↑/↓ 变化趋势
-    6. 关注关键指标：P95 延迟 > 100ms 需提示性能问题，Consistency < 100% 需提示过拟合风险
-    7. 给出针对性改进建议，基于指标数据而非主观判断
+    4. 标注每个维度使用的模式（[Mock] / [LLM] / [GUI]）
+    5. 如有失败用例，分析根因（wrong_mode / wrong_tool / timeout / exception / inconsistent / latency_exceeded / gui_failure）
+    6. 对比基线报告（如使用 --baseline），展示各维度准确率的 ↑/↓ 变化趋势
+    7. 关注关键指标：
+       - P95 延迟 > 100ms 需提示性能问题
+       - Consistency < 100% 需提示过拟合风险
+       - LLM 维度 timeout 需提示模型响应慢或超时阈值需调整
+       - GUI 维度失败需提示服务器配置或端口问题
+    8. 给出针对性改进建议，基于指标数据而非主观判断
 
 llm:
   model: "default"
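The prompt above reports a 95% Wilson confidence interval alongside the mean accuracy over multiple runs. The diff does not show how that interval is computed, so the following is only a minimal sketch of the textbook Wilson score interval. The function name `wilson_interval` and the idea of pooling passed/total counts across runs are assumptions for illustration, not the project's actual implementation.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for passed/total.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    Hypothetical helper: not taken from agentkit.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")

    p = passed / total
    z2 = z * z
    denom = 1.0 + z2 / total
    # Centre of the interval is pulled toward 0.5 for small samples.
    center = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    # Example: 50 of the 53 mock cases passed in a single run.
    low, high = wilson_interval(50, 53)
    print(f"accuracy={50 / 53:.3f}, 95% CI=[{low:.3f}, {high:.3f}]")
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] and remains informative when accuracy is close to 100%, which is the usual situation for a regression suite.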
diff --git a/src/agentkit/cli/benchmark.py b/src/agentkit/cli/benchmark.py
index b52e257..f56a2ca 100644
--- a/src/agentkit/cli/benchmark.py
+++ b/src/agentkit/cli/benchmark.py
@@ -8,26 +8,36 @@ Implements industry-standard benchmark methodology (SWE-bench / AgentBench / Too
 - Markdown + JSON + HTML report generation
 - Baseline comparison (↑/↓)
 
-Tests core AgentKit components directly (no pytest subprocess, no real LLM):
-- preprocessing: RequestPreprocessor routing accuracy
-- overfitting: routing consistency across paraphrases
-- efficiency: component execution timing
-- tool_search: ToolSearchIndex BM25 relevance
-- event_model: SubmissionQueue / EventQueue lifecycle
-- spec_management: SpecManager CRUD operations
-- verification: VerificationLoop execute/retry behavior
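+Execution modes via --mode (three modes, plus `all` to run them together):
+- mock: everything mocked (default; fast, no LLM dependency)
+- llm: calls a real LLM (requires LLM settings in agentkit.yaml)
+- gui: starts a real GUI server and tests it end to end
+- all: runs every mode (Mock + LLM + GUI)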
+
+Tests core AgentKit components:
+- preprocessing: RequestPreprocessor routing accuracy [Mock]
+- overfitting: routing consistency across paraphrases [Mock]
+- efficiency: component execution timing [Mock]
+- tool_search: ToolSearchIndex BM25 relevance [Mock]
+- event_model: SubmissionQueue / EventQueue lifecycle [Mock]
+- spec_management: SpecManager CRUD operations [Mock]
+- verification: VerificationLoop execute/retry behavior [Mock]
+- llm_reasoning: Real LLM intent/tool/multi-step/code/error [LLM]
+- gui_integration: agentkit gui end-to-end (API/WS/frontend) [GUI]
 
 Usage:
-    agentkit benchmark                          # run all dimensions
+    agentkit benchmark                          # run all mock dimensions
+    agentkit benchmark --mode mock              # explicit mock mode (default)
+    agentkit benchmark --mode llm --report      # LLM mode with report
+    agentkit benchmark --mode gui --report      # GUI mode with report
+    agentkit benchmark --mode all --report      # all modes
     agentkit benchmark -d preprocessing         # single dimension
-    agentkit benchmark --report                 # generate reports
     agentkit benchmark --fast                   # core cases only
     agentkit benchmark --verbose                # detailed output
     agentkit benchmark --format html            # HTML format
     agentkit benchmark -o ./results             # output directory
     agentkit benchmark --runs 3                 # multiple runs (default 3)
     agentkit benchmark --baseline               # compare with baseline
-    agentkit benchmark --format markdown        # Markdown report (default)
 """
 
 from __future__ import annotations
@@ -75,9 +85,38 @@ class BenchmarkDimension(str, Enum):
     EVENT_MODEL = "event_model"
     SPEC_MANAGEMENT = "spec_management"
     VERIFICATION = "verification"
+    LLM_REASONING = "llm_reasoning"
+    GUI_INTEGRATION = "gui_integration"
     ALL = "all"
 
 
+class BenchmarkMode(str, Enum):
+    """Benchmark execution mode.
+
+    MOCK: everything mocked (fast, no LLM dependency)
+    LLM: calls a real LLM (requires agentkit.yaml)
+    GUI: starts a real GUI server and tests against it
+    ALL: runs every mode (Mock + LLM + GUI)
+    """
+
+    MOCK = "mock"
+    LLM = "llm"
+    GUI = "gui"
+    ALL = "all"
+
+
+# Mock dimensions (no LLM dependency)
+_MOCK_DIMENSIONS: list[BenchmarkDimension] = [
+    BenchmarkDimension.PREPROCESSING,
+    BenchmarkDimension.OVERFITTING,
+    BenchmarkDimension.EFFICIENCY,
+    BenchmarkDimension.TOOL_SEARCH,
+    BenchmarkDimension.EVENT_MODEL,
+    BenchmarkDimension.SPEC_MANAGEMENT,
+    BenchmarkDimension.VERIFICATION,
+]
+
+
 # ---------------------------------------------------------------------------
 # Data structures
 # ---------------------------------------------------------------------------
@@ -108,6 +147,7 @@ class BenchmarkTask:
     tags: list[str]
     description: str
     paraphrases: list[str] = field(default_factory=list)
+    expected_keywords: list[str] = field(default_factory=list)
 
 
 @dataclass
@@ -188,576 +228,195 @@ class BenchmarkContext:
 # ---------------------------------------------------------------------------
 
 
+# fmt: off
 TASK_SET: list[BenchmarkTask] = [
     # === Preprocessing (15 tasks) ===
-    BenchmarkTask(
-        "prep-001",
-        "preprocessing",
-        "greeting",
-        "easy",
-        "你好",
-        "direct_chat",
-        ["regex", "chinese"],
-        "中文问候应路由到 DIRECT_CHAT",
-    ),
-    BenchmarkTask(
-        "prep-002",
-        "preprocessing",
-        "greeting",
-        "easy",
-        "hello",
-        "direct_chat",
-        ["regex", "english"],
-        "英文问候应路由到 DIRECT_CHAT",
-    ),
-    BenchmarkTask(
-        "prep-003",
-        "preprocessing",
-        "greeting",
-        "easy",
-        "谢谢",
-        "direct_chat",
-        ["regex", "chitchat"],
-        "感谢语应路由到 DIRECT_CHAT",
-    ),
-    BenchmarkTask(
-        "prep-004",
-        "preprocessing",
-        "greeting",
-        "easy",
-        "你是谁",
-        "direct_chat",
-        ["regex", "identity"],
-        "身份询问应路由到 DIRECT_CHAT",
-    ),
-    BenchmarkTask(
-        "prep-005",
-        "preprocessing",
-        "tool_query",
-        "medium",
-        "搜索golang教程",
-        "react",
-        ["search", "default"],
-        "搜索类请求应路由到 REACT",
-    ),
-    BenchmarkTask(
-        "prep-006",
-        "preprocessing",
-        "tool_query",
-        "medium",
-        "执行ls命令",
-        "react",
-        ["shell", "default"],
-        "Shell 执行类请求应路由到 REACT",
-    ),
-    BenchmarkTask(
-        "prep-007",
-        "preprocessing",
-        "tool_query",
-        "medium",
-        "翻译hello为中文",
-        "react",
-        ["translate", "default"],
-        "翻译类请求应路由到 REACT",
-    ),
-    BenchmarkTask(
-        "prep-008",
-        "preprocessing",
-        "tool_query",
-        "medium",
-        "什么是机器学习",
-        "react",
-        ["knowledge", "default"],
-        "知识查询类请求应路由到 REACT",
-    ),
-    BenchmarkTask(
-        "prep-009",
-        "preprocessing",
-        "tool_query",
-        "medium",
-        "帮我分析数据",
-        "react",
-        ["analysis", "default"],
-        "分析类请求应路由到 REACT",
-    ),
-    BenchmarkTask(
-        "prep-010",
-        "preprocessing",
-        "skill_prefix",
-        "medium",
-        "@skill:react_agent 查看ip",
-        "skill_react",
-        ["skill", "react"],
-        "有效 skill 前缀应路由到 SKILL_REACT",
-    ),
-    BenchmarkTask(
-        "prep-011",
-        "preprocessing",
-        "skill_prefix",
-        "medium",
-        "@skill:chat_only 你好",
-        "direct_chat",
-        ["skill", "direct"],
-        "direct 模式 skill 前缀应路由到 DIRECT_CHAT",
-    ),
-    BenchmarkTask(
-        "prep-012",
-        "preprocessing",
-        "skill_prefix",
-        "hard",
-        "@skill:nonexistent 做点什么",
-        "react",
-        ["skill", "fallback"],
-        "无效 skill 前缀应回退到 REACT",
-    ),
-    BenchmarkTask(
-        "prep-013",
-        "preprocessing",
-        "complex",
-        "hard",
-        "帮我分析这个数据并生成报告",
-        "react",
-        ["multi_step"],
-        "多步骤复杂任务应路由到 REACT",
-    ),
-    BenchmarkTask(
-        "prep-014",
-        "preprocessing",
-        "complex",
-        "easy",
-        "随便聊聊",
-        "react",
-        ["chitchat", "default"],
-        "非匹配闲聊应回退到 REACT",
-    ),
-    BenchmarkTask(
-        "prep-015",
-        "preprocessing",
-        "complex",
-        "hard",
+    BenchmarkTask("prep-001", "preprocessing", "greeting", "easy", "你好",
+        "direct_chat", ["regex", "chinese"], "中文问候应路由到 DIRECT_CHAT"),
+    BenchmarkTask("prep-002", "preprocessing", "greeting", "easy", "hello",
+        "direct_chat", ["regex", "english"], "英文问候应路由到 DIRECT_CHAT"),
+    BenchmarkTask("prep-003", "preprocessing", "greeting", "easy", "谢谢",
+        "direct_chat", ["regex", "chitchat"], "感谢语应路由到 DIRECT_CHAT"),
+    BenchmarkTask("prep-004", "preprocessing", "greeting", "easy", "你是谁",
+        "direct_chat", ["regex", "identity"], "身份询问应路由到 DIRECT_CHAT"),
+    BenchmarkTask("prep-005", "preprocessing", "tool_query", "medium", "搜索golang教程",
+        "react", ["search", "default"], "搜索类请求应路由到 REACT"),
+    BenchmarkTask("prep-006", "preprocessing", "tool_query", "medium", "执行ls命令",
+        "react", ["shell", "default"], "Shell 执行类请求应路由到 REACT"),
+    BenchmarkTask("prep-007", "preprocessing", "tool_query", "medium", "翻译hello为中文",
+        "react", ["translate", "default"], "翻译类请求应路由到 REACT"),
+    BenchmarkTask("prep-008", "preprocessing", "tool_query", "medium", "什么是机器学习",
+        "react", ["knowledge", "default"], "知识查询类请求应路由到 REACT"),
+    BenchmarkTask("prep-009", "preprocessing", "tool_query", "medium", "帮我分析数据",
+        "react", ["analysis", "default"], "分析类请求应路由到 REACT"),
+    BenchmarkTask("prep-010", "preprocessing", "skill_prefix", "medium", "@skill:react_agent 查看ip",
+        "skill_react", ["skill", "react"], "有效 skill 前缀应路由到 SKILL_REACT"),
+    BenchmarkTask("prep-011", "preprocessing", "skill_prefix", "medium", "@skill:chat_only 你好",
+        "direct_chat", ["skill", "direct"], "direct 模式 skill 前缀应路由到 DIRECT_CHAT"),
+    BenchmarkTask("prep-012", "preprocessing", "skill_prefix", "hard", "@skill:nonexistent 做点什么",
+        "react", ["skill", "fallback"], "无效 skill 前缀应回退到 REACT"),
+    BenchmarkTask("prep-013", "preprocessing", "complex", "hard", "帮我分析这个数据并生成报告",
+        "react", ["multi_step"], "多步骤复杂任务应路由到 REACT"),
+    BenchmarkTask("prep-014", "preprocessing", "complex", "easy", "随便聊聊",
+        "react", ["chitchat", "default"], "非匹配闲聊应回退到 REACT"),
+    BenchmarkTask("prep-015", "preprocessing", "complex", "hard",
         "请帮我完成以下任务：1. 查询天气 2. 生成报告",
-        "react",
-        ["multi_step"],
-        "多步骤任务应路由到 REACT",
-    ),
+        "react", ["multi_step"], "多步骤任务应路由到 REACT"),
     # === Overfitting (5 groups) ===
-    BenchmarkTask(
-        "over-001",
-        "overfitting",
-        "ip_check",
-        "medium",
-        "查下ip",
-        "react",
-        ["colloquial"],
-        "IP 查询改写一致性",
-        paraphrases=["查下ip", "查看当前ip", "获取ip地址", "看下ip", "帮我查一下ip"],
-    ),
-    BenchmarkTask(
-        "over-002",
-        "overfitting",
-        "search",
-        "medium",
-        "搜索golang教程",
-        "react",
-        ["search"],
-        "搜索改写一致性",
-        paraphrases=["搜索golang教程", "搜一下golang教程", "找下golang学习资料"],
-    ),
-    BenchmarkTask(
-        "over-003",
-        "overfitting",
-        "greeting",
-        "easy",
-        "你好",
-        "direct_chat",
-        ["greeting"],
-        "问候改写一致性",
-        paraphrases=["你好", "hello", "hi", "嗨", "哈喽"],
-    ),
-    BenchmarkTask(
-        "over-004",
-        "overfitting",
-        "tool_use",
-        "medium",
-        "执行ls命令",
-        "react",
-        ["shell"],
-        "工具使用改写一致性",
-        paraphrases=["执行ls命令", "运行ls", "跑一下ls"],
-    ),
-    BenchmarkTask(
-        "over-005",
-        "overfitting",
-        "complex",
-        "hard",
-        "帮我分析数据",
-        "react",
-        ["analysis"],
-        "复杂任务改写一致性",
-        paraphrases=["帮我分析数据", "分析一下数据", "看看这些数据"],
-    ),
+    BenchmarkTask("over-001", "overfitting", "ip_check", "medium", "查下ip",
+        "react", ["colloquial"], "IP 查询改写一致性",
+        paraphrases=["查下ip", "查看当前ip", "获取ip地址", "看下ip", "帮我查一下ip"]),
+    BenchmarkTask("over-002", "overfitting", "search", "medium", "搜索golang教程",
+        "react", ["search"], "搜索改写一致性",
+        paraphrases=["搜索golang教程", "搜一下golang教程", "找下golang学习资料"]),
+    BenchmarkTask("over-003", "overfitting", "greeting", "easy", "你好",
+        "direct_chat", ["greeting"], "问候改写一致性",
+        paraphrases=["你好", "hello", "hi", "嗨", "哈喽"]),
+    BenchmarkTask("over-004", "overfitting", "tool_use", "medium", "执行ls命令",
+        "react", ["shell"], "工具使用改写一致性",
+        paraphrases=["执行ls命令", "运行ls", "跑一下ls"]),
+    BenchmarkTask("over-005", "overfitting", "complex", "hard", "帮我分析数据",
+        "react", ["analysis"], "复杂任务改写一致性",
+        paraphrases=["帮我分析数据", "分析一下数据", "看看这些数据"]),
     # === Efficiency (5 tasks) ===
-    BenchmarkTask(
-        "eff-001",
-        "efficiency",
-        "preprocess_latency",
-        "easy",
-        "你好",
-        "<=50ms",
-        ["greeting", "preprocess"],
-        "问候预处理延迟 < 50ms",
-    ),
-    BenchmarkTask(
-        "eff-002",
-        "efficiency",
-        "preprocess_latency",
-        "medium",
-        "查下ip",
-        "<=50ms",
-        ["react", "preprocess"],
-        "REACT 预处理延迟 < 50ms",
-    ),
-    BenchmarkTask(
-        "eff-003",
-        "efficiency",
-        "preprocess_latency",
-        "medium",
-        "@skill:react_agent test",
-        "<=50ms",
-        ["skill", "preprocess"],
-        "Skill 前缀预处理延迟 < 50ms",
-    ),
-    BenchmarkTask(
-        "eff-004",
-        "efficiency",
-        "tool_search_latency",
-        "medium",
-        "read file",
-        "<=10ms",
-        ["tool_search", "bm25"],
-        "工具搜索延迟 < 10ms",
-    ),
-    BenchmarkTask(
-        "eff-005",
-        "efficiency",
-        "tool_search_latency",
-        "easy",
-        "",
-        "<=5ms",
-        ["tool_search", "empty"],
-        "空查询工具搜索延迟 < 5ms",
-    ),
+    BenchmarkTask("eff-001", "efficiency", "preprocess_latency", "easy", "你好",
+        "<=50ms", ["greeting", "preprocess"], "问候预处理延迟 < 50ms"),
+    BenchmarkTask("eff-002", "efficiency", "preprocess_latency", "medium", "查下ip",
+        "<=50ms", ["react", "preprocess"], "REACT 预处理延迟 < 50ms"),
+    BenchmarkTask("eff-003", "efficiency", "preprocess_latency", "medium", "@skill:react_agent test",
+        "<=50ms", ["skill", "preprocess"], "Skill 前缀预处理延迟 < 50ms"),
+    BenchmarkTask("eff-004", "efficiency", "tool_search_latency", "medium", "read file",
+        "<=10ms", ["tool_search", "bm25"], "工具搜索延迟 < 10ms"),
+    BenchmarkTask("eff-005", "efficiency", "tool_search_latency", "easy", "",
+        "<=5ms", ["tool_search", "empty"], "空查询工具搜索延迟 < 5ms"),
     # === Tool Search (10 tasks) ===
-    BenchmarkTask(
-        "ts-001",
-        "tool_search",
-        "exact_match",
-        "easy",
-        "read file",
-        "read_file",
-        ["bm25", "exact"],
-        "精确匹配 read_file",
-    ),
-    BenchmarkTask(
-        "ts-002",
-        "tool_search",
-        "exact_match",
-        "easy",
-        "write file content",
-        "write_file",
-        ["bm25", "exact"],
-        "精确匹配 write_file",
-    ),
-    BenchmarkTask(
-        "ts-003",
-        "tool_search",
-        "exact_match",
-        "easy",
-        "search web information",
-        "web_search",
-        ["bm25", "exact"],
-        "精确匹配 web_search",
-    ),
-    BenchmarkTask(
-        "ts-004",
-        "tool_search",
-        "exact_match",
-        "easy",
-        "execute shell command",
-        "shell_exec",
-        ["bm25", "exact"],
-        "精确匹配 shell_exec",
-    ),
-    BenchmarkTask(
-        "ts-005",
-        "tool_search",
-        "exact_match",
-        "easy",
-        "send http request url",
-        "http_request",
-        ["bm25", "exact"],
-        "精确匹配 http_request",
-    ),
-    BenchmarkTask(
-        "ts-006",
-        "tool_search",
-        "fuzzy_match",
-        "medium",
-        "io file",
-        "read_file",
-        ["bm25", "fuzzy", "tag"],
-        "标签模糊匹配 io file",
-    ),
-    BenchmarkTask(
-        "ts-007",
-        "tool_search",
-        "fuzzy_match",
-        "medium",
-        "search query engine",
-        "web_search",
-        ["bm25", "fuzzy", "multi"],
-        "多关键词模糊匹配",
-    ),
-    BenchmarkTask(
-        "ts-008",
-        "tool_search",
-        "no_match",
-        "easy",
-        "",
-        "__none__",
-        ["bm25", "empty"],
-        "空查询应返回空结果",
-    ),
-    BenchmarkTask(
-        "ts-009",
-        "tool_search",
-        "no_match",
-        "easy",
-        "zzzznonexistent",
-        "__none__",
-        ["bm25", "no_match"],
-        "无匹配查询应返回空结果",
-    ),
-    BenchmarkTask(
-        "ts-010",
-        "tool_search",
-        "top_k",
-        "medium",
-        "file",
-        "read_file",
-        ["bm25", "top_k"],
-        "top_k=1 限制返回数",
-    ),
+    BenchmarkTask("ts-001", "tool_search", "exact_match", "easy", "read file",
+        "read_file", ["bm25", "exact"], "精确匹配 read_file"),
+    BenchmarkTask("ts-002", "tool_search", "exact_match", "easy", "write file content",
+        "write_file", ["bm25", "exact"], "精确匹配 write_file"),
+    BenchmarkTask("ts-003", "tool_search", "exact_match", "easy", "search web information",
+        "web_search", ["bm25", "exact"], "精确匹配 web_search"),
+    BenchmarkTask("ts-004", "tool_search", "exact_match", "easy", "execute shell command",
+        "shell_exec", ["bm25", "exact"], "精确匹配 shell_exec"),
+    BenchmarkTask("ts-005", "tool_search", "exact_match", "easy", "send http request url",
+        "http_request", ["bm25", "exact"], "精确匹配 http_request"),
+    BenchmarkTask("ts-006", "tool_search", "fuzzy_match", "medium", "io file",
+        "read_file", ["bm25", "fuzzy", "tag"], "标签模糊匹配 io file"),
+    BenchmarkTask("ts-007", "tool_search", "fuzzy_match", "medium", "search query engine",
+        "web_search", ["bm25", "fuzzy", "multi"], "多关键词模糊匹配"),
+    BenchmarkTask("ts-008", "tool_search", "no_match", "easy", "",
+        "__none__", ["bm25", "empty"], "空查询应返回空结果"),
+    BenchmarkTask("ts-009", "tool_search", "no_match", "easy", "zzzznonexistent",
+        "__none__", ["bm25", "no_match"], "无匹配查询应返回空结果"),
+    BenchmarkTask("ts-010", "tool_search", "top_k", "medium", "file",
+        "read_file", ["bm25", "top_k"], "top_k=1 限制返回数"),
     # === Event Model (6 tasks) ===
-    BenchmarkTask(
-        "ev-001",
-        "event_model",
-        "sq_lifecycle",
-        "easy",
-        "submit+drain",
-        "passed",
-        ["sq", "submit"],
-        "SQ 提交并消费",
-    ),
-    BenchmarkTask(
-        "ev-002",
-        "event_model",
-        "sq_lifecycle",
-        "easy",
-        "cancel",
-        "passed",
-        ["sq", "cancel"],
-        "SQ 取消任务",
-    ),
-    BenchmarkTask(
-        "ev-003",
-        "event_model",
-        "sq_lifecycle",
-        "easy",
-        "close",
-        "passed",
-        ["sq", "close"],
-        "SQ 关闭后拒绝提交",
-    ),
-    BenchmarkTask(
-        "ev-004",
-        "event_model",
-        "eq_lifecycle",
-        "easy",
-        "emit+replay",
-        "passed",
-        ["eq", "replay"],
-        "EQ 发射并回放",
-    ),
-    BenchmarkTask(
-        "ev-005",
-        "event_model",
-        "eq_lifecycle",
-        "easy",
-        "close",
-        "passed",
-        ["eq", "close"],
-        "EQ 关闭哨兵退出",
-    ),
-    BenchmarkTask(
-        "ev-006",
-        "event_model",
-        "eq_lifecycle",
-        "easy",
-        "subscriber_count",
-        "passed",
-        ["eq", "count"],
-        "EQ 初始订阅者计数",
-    ),
+    BenchmarkTask("ev-001", "event_model", "sq_lifecycle", "easy", "submit+drain",
+        "passed", ["sq", "submit"], "SQ 提交并消费"),
+    BenchmarkTask("ev-002", "event_model", "sq_lifecycle", "easy", "cancel",
+        "passed", ["sq", "cancel"], "SQ 取消任务"),
+    BenchmarkTask("ev-003", "event_model", "sq_lifecycle", "easy", "close",
+        "passed", ["sq", "close"], "SQ 关闭后拒绝提交"),
+    BenchmarkTask("ev-004", "event_model", "eq_lifecycle", "easy", "emit+replay",
+        "passed", ["eq", "replay"], "EQ 发射并回放"),
+    BenchmarkTask("ev-005", "event_model", "eq_lifecycle", "easy", "close",
+        "passed", ["eq", "close"], "EQ 关闭哨兵退出"),
+    BenchmarkTask("ev-006", "event_model", "eq_lifecycle", "easy", "subscriber_count",
+        "passed", ["eq", "count"], "EQ 初始订阅者计数"),
     # === Spec Management (7 tasks) ===
-    BenchmarkTask(
-        "sm-001",
-        "spec_management",
-        "crud",
-        "easy",
-        "create",
-        "passed",
-        ["create"],
-        "Spec 创建",
-    ),
-    BenchmarkTask(
-        "sm-002",
-        "spec_management",
-        "crud",
-        "easy",
-        "get",
-        "passed",
-        ["read"],
-        "Spec 读取",
-    ),
-    BenchmarkTask(
-        "sm-003",
-        "spec_management",
-        "crud",
-        "easy",
-        "update",
-        "passed",
-        ["update"],
-        "Spec 更新",
-    ),
-    BenchmarkTask(
-        "sm-004",
-        "spec_management",
-        "crud",
-        "easy",
-        "delete",
-        "passed",
-        ["delete"],
-        "Spec 删除",
-    ),
-    BenchmarkTask(
-        "sm-005",
-        "spec_management",
-        "crud",
-        "easy",
-        "list",
-        "passed",
-        ["list"],
-        "Spec 列表",
-    ),
-    BenchmarkTask(
-        "sm-006",
-        "spec_management",
-        "edge",
-        "medium",
-        "confirm",
-        "passed",
-        ["confirm"],
-        "Spec 确认",
-    ),
-    BenchmarkTask(
-        "sm-007",
-        "spec_management",
-        "edge",
-        "easy",
-        "missing",
-        "passed",
-        ["missing"],
-        "Spec 不存在返回 None",
-    ),
+    BenchmarkTask("sm-001", "spec_management", "crud", "easy", "create",
+        "passed", ["create"], "Spec 创建"),
+    BenchmarkTask("sm-002", "spec_management", "crud", "easy", "get",
+        "passed", ["read"], "Spec 读取"),
+    BenchmarkTask("sm-003", "spec_management", "crud", "easy", "update",
+        "passed", ["update"], "Spec 更新"),
+    BenchmarkTask("sm-004", "spec_management", "crud", "easy", "delete",
+        "passed", ["delete"], "Spec 删除"),
+    BenchmarkTask("sm-005", "spec_management", "crud", "easy", "list",
+        "passed", ["list"], "Spec 列表"),
+    BenchmarkTask("sm-006", "spec_management", "edge", "medium", "confirm",
+        "passed", ["confirm"], "Spec 确认"),
+    BenchmarkTask("sm-007", "spec_management", "edge", "easy", "missing",
+        "passed", ["missing"], "Spec 不存在返回 None"),
     # === Verification (5 tasks) ===
-    BenchmarkTask(
-        "vf-001",
-        "verification",
-        "basic",
-        "easy",
-        "pass",
-        "passed",
-        ["pass"],
-        "验证通过命令",
-    ),
-    BenchmarkTask(
-        "vf-002",
-        "verification",
-        "basic",
-        "easy",
-        "fail",
-        "passed",
-        ["fail"],
-        "验证失败命令",
-    ),
-    BenchmarkTask(
-        "vf-003",
-        "verification",
-        "retry",
-        "medium",
-        "fix_callback",
-        "passed",
-        ["retry", "callback"],
-        "重试与修复回调",
-    ),
-    BenchmarkTask(
-        "vf-004",
-        "verification",
-        "timeout",
-        "medium",
-        "timeout",
-        "passed",
-        ["timeout"],
-        "超时检测",
-    ),
-    BenchmarkTask(
-        "vf-005",
-        "verification",
-        "multi",
-        "medium",
-        "multi_command",
-        "passed",
-        ["multi"],
-        "多命令验证",
-    ),
+    BenchmarkTask("vf-001", "verification", "basic", "easy", "pass",
+        "passed", ["pass"], "验证通过命令"),
+    BenchmarkTask("vf-002", "verification", "basic", "easy", "fail",
+        "passed", ["fail"], "验证失败命令"),
+    BenchmarkTask("vf-003", "verification", "retry", "medium", "fix_callback",
+        "passed", ["retry", "callback"], "重试与修复回调"),
+    BenchmarkTask("vf-004", "verification", "timeout", "medium", "timeout",
+        "passed", ["timeout"], "超时检测"),
+    BenchmarkTask("vf-005", "verification", "multi", "medium", "multi_command",
+        "passed", ["multi"], "多命令验证"),
 ]
+# fmt: on
 
 
+# fmt: off
 _FAST_CORE_IDS: set[str] = {
-    "prep-001",
-    "prep-005",
-    "prep-010",
-    "prep-012",
-    "over-001",
-    "over-003",
-    "eff-001",
-    "eff-004",
-    "ts-001",
-    "ts-003",
-    "ts-008",
-    "ts-010",
-    "ev-001",
-    "ev-004",
-    "ev-005",
-    "sm-001",
-    "sm-002",
-    "sm-006",
-    "sm-004",
-    "vf-001",
-    "vf-002",
-    "vf-003",
+    "prep-001", "prep-005", "prep-010", "prep-012", "over-001", "over-003",
+    "eff-001", "eff-004", "ts-001", "ts-003", "ts-008", "ts-010",
+    "ev-001", "ev-004", "ev-005", "sm-001", "sm-002", "sm-006", "sm-004",
+    "vf-001", "vf-002", "vf-003", "llm-001", "llm-003", "gui-001", "gui-002", "gui-004",
 }
+# fmt: on
+
+
+# ---------------------------------------------------------------------------
+# LLM Reasoning tasks (require real LLM via agentkit.yaml)
+# ---------------------------------------------------------------------------
+
+
+# fmt: off
+LLM_REASONING_TASKS: list[BenchmarkTask] = [
+    BenchmarkTask("llm-001", "llm_reasoning", "intent_understanding", "easy",
+        "帮我查看当前服务器的IP地址", "react", ["intent", "tool_use"],
+        "LLM 应识别需要使用工具查看 IP",
+        expected_keywords=["ip", "地址", "ifconfig", "hostname", "网络"]),
+    BenchmarkTask("llm-002", "llm_reasoning", "tool_selection", "medium",
+        "搜索最新的 AI Agent 论文", "react", ["tool_selection", "web_search"],
+        "LLM 应选择 web_search 工具",
+        expected_keywords=["search", "搜索", "web", "论文", "paper", "agent"]),
+    BenchmarkTask("llm-003", "llm_reasoning", "multi_step", "hard",
+        "分析这段代码的性能问题并给出优化建议：def fib(n): return fib(n-1)+fib(n-2) if n>1 else n",
+        "react", ["multi_step", "code_analysis"], "LLM 应分析代码并给出优化建议",
+        expected_keywords=["fib", "递归", "优化", "缓存", "memo", "迭代", "动态规划", "性能"]),
+    BenchmarkTask("llm-004", "llm_reasoning", "code_generation", "medium",
+        "写一个 Python 函数来计算斐波那契数列", "react", ["code_gen"],
+        "LLM 应生成可执行的 Python 代码",
+        expected_keywords=["def", "fib", "return", "python"]),
+    BenchmarkTask("llm-005", "llm_reasoning", "error_recovery", "hard",
+        "这个报错怎么解决：ModuleNotFoundError: No module named 'agentkit'",
+        "react", ["error_recovery"], "LLM 应给出 pip install 建议",
+        expected_keywords=["pip", "install", "agentkit", "安装", "模块"]),
+]
+# fmt: on
+
+
+# ---------------------------------------------------------------------------
+# GUI Integration tasks (require starting real agentkit gui server)
+# ---------------------------------------------------------------------------
+
+
+# fmt: off
+GUI_INTEGRATION_TASKS: list[BenchmarkTask] = [
+    BenchmarkTask("gui-001", "gui_integration", "service_startup", "easy",
+        "agentkit gui --port {port}", "started", ["startup", "subprocess"],
+        "GUI 服务应成功启动并响应健康检查"),
+    BenchmarkTask("gui-002", "gui_integration", "api_availability", "medium",
+        "GET /api/v1/health, GET /api/v1/skills", "200", ["api", "http"],
+        "核心 API 端点应返回 200"),
+    BenchmarkTask("gui-003", "gui_integration", "api_availability", "medium",
+        "POST /api/v1/chat", "reachable", ["api", "chat"],
+        "Chat API 端点应可达（不要求成功，要求响应）"),
+    BenchmarkTask("gui-004", "gui_integration", "websocket", "hard",
+        "ws://localhost:{port}/api/v1/ws/{session}", "connected",
+        ["websocket", "realtime"], "WebSocket 端点应能建立连接并交换 ping/pong"),
+    BenchmarkTask("gui-005", "gui_integration", "frontend", "easy",
+        "GET /", "html", ["frontend", "static"], "前端首页应返回 HTML 内容"),
+]
+# fmt: on
 
 
 # ---------------------------------------------------------------------------
@@ -892,6 +551,468 @@ def _make_context(tmp_dir: Path) -> BenchmarkContext:
     )
 
 
+# ---------------------------------------------------------------------------
+# Real component builder (loads from agentkit.yaml for LLM mode)
+# ---------------------------------------------------------------------------
+
+
+def _find_config_path() -> str | None:
+    """Find agentkit.yaml ($AGENTKIT_CONFIG, then cwd, then ~/.agentkit/)."""
+    import os as _os
+
+    candidates = [
+        _os.environ.get("AGENTKIT_CONFIG", ""),
+        str(Path.cwd() / "agentkit.yaml"),
+        str(Path.home() / ".agentkit" / "agentkit.yaml"),
+    ]
+    for path in candidates:
+        if path and Path(path).is_file():
+            return path
+    return None
+
+
+def _build_real_components() -> tuple[object, object, object] | None:
+    """Build real components from agentkit.yaml for LLM mode.
+
+    Returns (preprocessor, skill_registry, llm_gateway) or None if config
+    is missing or no LLM provider is available.
+    """
+    import os as _os
+
+    from agentkit.chat.request_preprocessor import RequestPreprocessor
+    from agentkit.server.app import _build_llm_gateway, _build_skill_registry
+    from agentkit.server.config import load_config_with_dotenv
+
+    config_path = _find_config_path()
+    if not config_path:
+        console.print("[yellow]No agentkit.yaml found — skipping LLM mode.[/yellow]")
+        return None
+
+    server_config = load_config_with_dotenv(config_path)
+
+    # Fallback: inject DASHSCOPE_API_KEY from env if providers lack keys
+    if not server_config.has_llm_provider():
+        dashscope_key = _os.environ.get("DASHSCOPE_API_KEY", "")
+        if dashscope_key:
+            for _name, pconf in server_config.llm_config.providers.items():
+                if not pconf.api_key:
+                    pconf.api_key = dashscope_key
+                    if not pconf.base_url:
+                        if dashscope_key.startswith("sk-sp-"):
+                            pconf.base_url = "https://coding.dashscope.aliyuncs.com/v1"
+                        else:
+                            pconf.base_url = "https://dashscope.aliyuncs.com/compatible-mode/v1"
+                    break
+
+    if not server_config.has_llm_provider():
+        console.print("[yellow]No LLM provider with valid API key — skipping LLM mode.[/yellow]")
+        return None
+
+    skill_registry = _build_skill_registry(server_config)
+    preprocessor = RequestPreprocessor(skill_registry=skill_registry)
+    llm_gateway = _build_llm_gateway(server_config)
+    return preprocessor, skill_registry, llm_gateway
+
+
+# ---------------------------------------------------------------------------
+# LLM Reasoning dimension executor
+# ---------------------------------------------------------------------------
+
+
+async def _execute_llm_reasoning_task(
+    task: BenchmarkTask,
+    preprocessor: object,
+    llm_gateway: object,
+) -> ExecutionResult:
+    """Execute a single LLM reasoning task.
+
+    Steps:
+    1. Call RequestPreprocessor.preprocess() to get execution mode.
+    2. If REACT, call LLMGateway.chat() (30s timeout); otherwise compare the mode to task.expected.
+    3. Check LLM response for expected keywords.
+    4. Record latency and token usage.
+    """
+    start = time.perf_counter()
+
+    # Step 1: preprocess to get execution mode
+    routing = await preprocessor.preprocess(content=task.input)  # type: ignore[attr-defined]
+    actual_mode = routing.execution_mode.value
+
+    # Step 2: if REACT, call LLM and check keywords
+    if actual_mode == "react":
+        try:
+            response = await asyncio.wait_for(
+                llm_gateway.chat(  # type: ignore[attr-defined]
+                    messages=[{"role": "user", "content": task.input}],
+                    model="default",
+                    agent_name="benchmark",
+                    max_tokens=512,
+                ),
+                timeout=30.0,
+            )
+            content = (response.content or "").lower()
+            tokens = response.usage.total_tokens if response.usage else 0
+
+            # Step 3: check expected keywords
+            if task.expected_keywords:
+                passed = any(kw.lower() in content for kw in task.expected_keywords)
+            else:
+                passed = bool(content.strip())
+
+            elapsed = (time.perf_counter() - start) * 1000
+            return ExecutionResult(
+                actual=f"mode=react tokens={tokens} len={len(content)}",
+                passed=passed,
+                duration_ms=round(elapsed, 4),
+                detail=f"mode={actual_mode} keywords={task.expected_keywords}",
+            )
+        except asyncio.TimeoutError:  # alias of builtin TimeoutError only on Python 3.11+
+            elapsed = (time.perf_counter() - start) * 1000
+            return ExecutionResult(
+                actual="timeout",
+                passed=False,
+                duration_ms=round(elapsed, 4),
+                detail="LLM call timed out after 30s",
+            )
+        except Exception as e:
+            elapsed = (time.perf_counter() - start) * 1000
+            return ExecutionResult(
+                actual=f"error:{type(e).__name__}",
+                passed=False,
+                duration_ms=round(elapsed, 4),
+                detail=f"LLM error: {e}",
+            )
+    else:
+        # Non-REACT mode: check if matches expected
+        passed = actual_mode == task.expected
+        elapsed = (time.perf_counter() - start) * 1000
+        return ExecutionResult(
+            actual=f"mode={actual_mode}",
+            passed=passed,
+            duration_ms=round(elapsed, 4),
+            detail=f"Expected {task.expected}, got {actual_mode}",
+        )
+
+
+async def _run_llm_reasoning(
+    runs: int,
+    fast: bool,
+    verbose: bool,
+    preprocessor: object,
+    llm_gateway: object,
+) -> DimensionResult:
+    """Run LLM reasoning benchmark dimension with real LLM calls."""
+    tasks = list(LLM_REASONING_TASKS)
+    if fast:
+        tasks = [t for t in tasks if t.task_id in _FAST_CORE_IDS]
+
+    all_runs_cases: list[list[CaseResult]] = []
+    accuracies: list[float] = []
+
+    for _run_idx in range(runs):
+        cases: list[CaseResult] = []
+        for task in tasks:
+            try:
+                result = await _execute_llm_reasoning_task(task, preprocessor, llm_gateway)
+            except Exception as e:
+                result = ExecutionResult(
+                    actual=f"__exception__:{type(e).__name__}",
+                    passed=False,
+                    duration_ms=0.0,
+                    detail=str(e),
+                )
+            root_cause = "none" if result.passed else _classify_llm_root_cause(result)
+            case = CaseResult(
+                task_id=task.task_id,
+                dimension=task.dimension,
+                category=task.category,
+                difficulty=task.difficulty,
+                passed=result.passed,
+                expected=task.expected,
+                actual=result.actual,
+                duration_ms=result.duration_ms,
+                root_cause=root_cause,
+                detail=result.detail,
+                consistency=result.consistency,
+            )
+            cases.append(case)
+            if verbose:
+                status = "[green]✓[/green]" if case.passed else "[red]✗[/red]"
+                console.print(
+                    f"  {status} {task.task_id}: {result.actual} ({result.duration_ms:.2f}ms)"
+                )
+        all_runs_cases.append(cases)
+        passed_count = sum(1 for c in cases if c.passed)
+        accuracies.append(passed_count / len(cases) if cases else 0.0)
+
+    final_cases = all_runs_cases[-1] if all_runs_cases else []
+    metrics = _compute_metrics(final_cases, accuracies if runs > 1 else None)
+    return DimensionResult(
+        dimension="llm_reasoning",
+        metrics=metrics,
+        cases=final_cases,
+        by_category=_aggregate_by(final_cases, "category"),
+        by_difficulty=_aggregate_by(final_cases, "difficulty"),
+    )
+
+
+def _classify_llm_root_cause(result: ExecutionResult) -> str:
+    """Classify root cause for LLM reasoning failures."""
+    if "timeout" in result.actual:
+        return "timeout"
+    if "error" in result.actual or "__exception__" in result.actual:
+        return "exception"
+    if "mode=" in result.actual and "react" not in result.actual:
+        return "wrong_mode"
+    return "keyword_miss"
+
+
+# ---------------------------------------------------------------------------
+# GUI Integration dimension executor
+# ---------------------------------------------------------------------------
+
+
+def _find_free_port() -> int:
+    """Find a free TCP port for the GUI server."""
+    import socket
+
+    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
+        s.bind(("127.0.0.1", 0))  # probe the same interface the GUI server binds to
+        return int(s.getsockname()[1])
+
+
+async def _wait_for_server(base_url: str, timeout_s: float = 30.0) -> bool:
+    """Poll health endpoint until server is ready or timeout."""
+    import httpx
+
+    deadline = time.perf_counter() + timeout_s
+    while time.perf_counter() < deadline:
+        try:
+            async with httpx.AsyncClient(timeout=2.0) as client:
+                if (await client.get(f"{base_url}/api/v1/health")).status_code == 200:
+                    return True
+        except Exception:
+            pass  # server not accepting connections yet
+        await asyncio.sleep(0.5)  # back off after every attempt, including non-200 replies
+    return False
+
+
+async def _run_gui_integration(
+    runs: int,
+    fast: bool,
+    verbose: bool,
+) -> DimensionResult:
+    """Run GUI integration benchmark by starting a real agentkit gui server."""
+    import os as _os
+    import subprocess
+    import sys
+
+    import httpx
+
+    tasks = list(GUI_INTEGRATION_TASKS)
+    if fast:
+        tasks = [t for t in tasks if t.task_id in _FAST_CORE_IDS]
+
+    def _case(
+        tid: str, cat: str, diff: str, actual: str, expected: str, passed: bool, detail: str
+    ) -> CaseResult:
+        return CaseResult(
+            tid,
+            "gui_integration",
+            cat,
+            diff,
+            passed,
+            expected,
+            actual,
+            0.0,
+            "none" if passed else "gui_failure",
+            detail,
+        )
+
+    def _log(tid: str, passed: bool, label: str) -> None:
+        if verbose:
+            status = "[green]✓[/green]" if passed else "[red]✗[/red]"
+            console.print(f"  {status} {tid}: {label}")
+
+    all_runs_cases: list[list[CaseResult]] = []
+    accuracies: list[float] = []
+
+    for _ in range(runs):
+        cases: list[CaseResult] = []
+        port = _find_free_port()
+        base_url = f"http://127.0.0.1:{port}"
+        proc = subprocess.Popen(
+            [
+                sys.executable,
+                "-m",
+                "agentkit",
+                "gui",
+                "--port",
+                str(port),
+                "--no-open",
+                "--host",
+                "127.0.0.1",
+            ],
+            stdout=subprocess.DEVNULL,
+            stderr=subprocess.DEVNULL,
+            env={**_os.environ, "AGENTKIT_GUI_MODE": "1"},
+        )
+        try:
+            # gui-001: service startup
+            startup_pass = await _wait_for_server(base_url, timeout_s=30.0)
+            cases.append(
+                _case(
+                    "gui-001",
+                    "service_startup",
+                    "easy",
+                    "started" if startup_pass else "failed",
+                    "started",
+                    startup_pass,
+                    f"port={port} pid={proc.pid}",
+                )
+            )
+            _log("gui-001", startup_pass, f"port={port}")
+
+            if not startup_pass:
+                for task in tasks[1:]:
+                    cases.append(
+                        _case(
+                            task.task_id,
+                            task.category,
+                            task.difficulty,
+                            "skipped",
+                            task.expected,
+                            False,
+                            "server not started",
+                        )
+                    )
+                all_runs_cases.append(cases)
+                accuracies.append(0.0)
+                continue
+
+            # gui-002: API availability (health + skills)
+            api_pass = False
+            api_detail = "N/A"
+            try:
+                async with httpx.AsyncClient(timeout=5.0) as client:
+                    h_resp = await client.get(f"{base_url}/api/v1/health")
+                    s_resp = await client.get(f"{base_url}/api/v1/skills")
+                    api_pass = h_resp.status_code == 200 and s_resp.status_code == 200
+                    api_detail = f"health={h_resp.status_code} skills={s_resp.status_code}"
+            except Exception as e:
+                api_detail = f"error: {e}"
+            cases.append(
+                _case(
+                    "gui-002",
+                    "api_availability",
+                    "medium",
+                    "200" if api_pass else "error",
+                    "200",
+                    api_pass,
+                    api_detail,
+                )
+            )
+            _log("gui-002", api_pass, "health+skills")
+
+            # gui-003: chat API reachability
+            chat_pass = False
+            chat_detail = "N/A"
+            try:
+                async with httpx.AsyncClient(timeout=5.0) as client:
+                    c_resp = await client.post(
+                        f"{base_url}/api/v1/chat",
+                        json={"message": "ping", "session_id": "bench-test"},
+                    )
+                    chat_pass = c_resp.status_code < 500
+                    chat_detail = f"status={c_resp.status_code}"
+            except Exception as e:
+                chat_detail = f"error: {e}"
+            cases.append(
+                _case(
+                    "gui-003",
+                    "api_availability",
+                    "medium",
+                    "reachable" if chat_pass else "unreachable",
+                    "reachable",
+                    chat_pass,
+                    chat_detail,
+                )
+            )
+            _log("gui-003", chat_pass, "chat API")
+
+            # gui-004: WebSocket connection
+            ws_pass = False
+            ws_detail = "N/A"
+            try:
+                import websockets
+
+                ws_url = f"ws://127.0.0.1:{port}/api/v1/ws/bench-session"
+                async with websockets.connect(ws_url, open_timeout=5.0) as ws:
+                    await ws.send('{"type": "ping"}')
+                    msg = await asyncio.wait_for(ws.recv(), timeout=5.0)
+                    ws_pass = "pong" in str(msg).lower() or "error" in str(msg).lower()
+                    ws_detail = f"msg={str(msg)[:50]}"
+            except Exception as e:
+                ws_detail = f"error: {e}"
+            cases.append(
+                _case(
+                    "gui-004",
+                    "websocket",
+                    "hard",
+                    "connected" if ws_pass else "failed",
+                    "connected",
+                    ws_pass,
+                    ws_detail,
+                )
+            )
+            _log("gui-004", ws_pass, "websocket")
+
+            # gui-005: frontend resources
+            fe_pass = False
+            fe_detail = "N/A"
+            try:
+                async with httpx.AsyncClient(timeout=5.0) as client:
+                    r_resp = await client.get(f"{base_url}/")
+                    fe_pass = r_resp.status_code == 200 and "<html" in r_resp.text.lower()
+                    fe_detail = f"status={r_resp.status_code} len={len(r_resp.text)}"
+            except Exception as e:
+                fe_detail = f"error: {e}"
+            cases.append(
+                _case(
+                    "gui-005",
+                    "frontend",
+                    "easy",
+                    "html" if fe_pass else "missing",
+                    "html",
+                    fe_pass,
+                    fe_detail,
+                )
+            )
+            _log("gui-005", fe_pass, "frontend")
+
+        finally:
+            proc.terminate()
+            try:
+                proc.wait(timeout=5.0)
+            except subprocess.TimeoutExpired:
+                proc.kill()
+                proc.wait(timeout=2.0)
+
+        all_runs_cases.append(cases)
+        passed_count = sum(1 for c in cases if c.passed)
+        accuracies.append(passed_count / len(cases) if cases else 0.0)
+
+    final_cases = all_runs_cases[-1] if all_runs_cases else []
+    metrics = _compute_metrics(final_cases, accuracies if runs > 1 else None)
+    return DimensionResult(
+        dimension="gui_integration",
+        metrics=metrics,
+        cases=final_cases,
+        by_category=_aggregate_by(final_cases, "category"),
+        by_difficulty=_aggregate_by(final_cases, "difficulty"),
+    )
+
+
 # ---------------------------------------------------------------------------
 # Utility functions
 # ---------------------------------------------------------------------------
@@ -1634,6 +1755,7 @@ def _generate_markdown_report(
 
     timestamp = str(report_data.get("timestamp", ""))
     version = str(report_data.get("version", ""))
+    mode = str(report_data.get("mode", "mock"))
     runs = int(report_data.get("runs", 1))
     overall = float(report_data.get("overall_accuracy", 0.0))
     overall_mean = float(report_data.get("overall_accuracy_mean", overall))
@@ -1645,6 +1767,7 @@ def _generate_markdown_report(
     lines.append("## 测试概要")
     lines.append(f"- 时间: {timestamp}")
     lines.append(f"- 版本: {version}")
+    lines.append(f"- 模式: {mode}")
     lines.append(f"- 运行次数: {runs}")
     lines.append(f"- 总体准确率: {overall_mean:.1%} ± {overall_std:.1%}")
     lines.append("")
@@ -1670,13 +1793,15 @@ def _generate_markdown_report(
         dimensions = {}
 
     dim_titles = {
-        "preprocessing": "1. 预处理准确度 (Preprocessing Accuracy)",
-        "overfitting": "2. 过拟合检测 (Overfitting Detection)",
-        "efficiency": "3. 效率测试 (Efficiency)",
-        "tool_search": "4. 工具搜索 (Tool Search)",
-        "event_model": "5. 事件模型 (Event Model)",
-        "spec_management": "6. 规格管理 (Spec Management)",
-        "verification": "7. 验证循环 (Verification Loop)",
+        "preprocessing": "1. 预处理准确度 (Preprocessing Accuracy) [Mock]",
+        "overfitting": "2. 过拟合检测 (Overfitting Detection) [Mock]",
+        "efficiency": "3. 效率测试 (Efficiency) [Mock]",
+        "tool_search": "4. 工具搜索 (Tool Search) [Mock]",
+        "event_model": "5. 事件模型 (Event Model) [Mock]",
+        "spec_management": "6. 规格管理 (Spec Management) [Mock]",
+        "verification": "7. 验证循环 (Verification Loop) [Mock]",
+        "llm_reasoning": "8. LLM 推理能力 (LLM Reasoning) [LLM]",
+        "gui_integration": "9. GUI 集成测试 (GUI Integration) [GUI]",
     }
 
     lines.append("## 维度结果")
@@ -1929,6 +2054,7 @@ def _generate_html_report(
 
     timestamp = str(report_data.get("timestamp", ""))
     version = str(report_data.get("version", ""))
+    mode = str(report_data.get("mode", "mock"))
     runs = int(report_data.get("runs", 1))
 
     html = f"""<!DOCTYPE html>
@@ -1956,6 +2082,7 @@ def _generate_html_report(
 <div class="meta">
   <p>Timestamp: {timestamp}</p>
   <p>Version: {version}</p>
+  <p>Mode: {mode}</p>
   <p>Runs: {runs}</p>
   <p>Overall Accuracy: <strong class="{overall_class}">{overall:.1%}</strong></p>
 </div>
@@ -2118,6 +2245,11 @@ def benchmark(
         "-d",
         help="Benchmark dimension to run (default: all)",
     ),
+    mode: BenchmarkMode = typer.Option(
+        BenchmarkMode.MOCK,
+        "--mode",
+        help="Execution mode: mock (default), llm, gui, or all",
+    ),
     report: bool = typer.Option(False, "--report", help="Generate report files"),
     format: str = typer.Option(
         "markdown",
@@ -2138,18 +2270,22 @@ def benchmark(
 ):
     """Run AgentKit capability benchmarks with standardized metrics.
 
-    Tests core components directly (no LLM, no pytest subprocess):
-    preprocessing, overfitting, efficiency, tool_search, event_model,
-    spec_management, verification.
+    Supports four values for --mode:
+    - mock: run everything against mocks (default; fast, no LLM dependency)
+    - llm: use a real LLM (requires agentkit.yaml configuration)
+    - gui: start a real GUI server and test end to end
+    - all: run every mode (Mock + LLM + GUI)
 
     Produces Accuracy / Precision / Recall / F1 / Latency / Consistency
     metrics with multi-run averaging and 95% confidence intervals.
     """
     import tempfile
 
-    # Normalize dimension (Typer may pass string)
+    # Normalize enums (Typer may pass strings)
     if isinstance(dimension, str):
         dimension = BenchmarkDimension(dimension)
+    if isinstance(mode, str):
+        mode = BenchmarkMode(mode)
 
     # Normalize format
     fmt = format.lower()
@@ -2160,6 +2296,7 @@ def benchmark(
     console.print(
         Panel.fit(
             "[bold cyan]AgentKit Benchmark[/bold cyan]\n"
+            f"Mode: [yellow]{mode.value}[/yellow]  "
             f"Dimension: [yellow]{dimension.value}[/yellow]  "
             f"Runs: [yellow]{runs}[/yellow]  "
             f"Fast: [yellow]{fast}[/yellow]  "
@@ -2169,26 +2306,82 @@ def benchmark(
     )
     console.print()
 
-    # Determine which dimensions to run
-    if dimension == BenchmarkDimension.ALL:
-        dims_to_run = [
-            BenchmarkDimension.PREPROCESSING,
-            BenchmarkDimension.OVERFITTING,
-            BenchmarkDimension.EFFICIENCY,
-            BenchmarkDimension.TOOL_SEARCH,
-            BenchmarkDimension.EVENT_MODEL,
-            BenchmarkDimension.SPEC_MANAGEMENT,
-            BenchmarkDimension.VERIFICATION,
-        ]
-    else:
-        dims_to_run = [dimension]
+    # Determine which dimensions to run based on mode and dimension filter
+    mock_dims: list[BenchmarkDimension] = []
+    run_llm = False
+    run_gui = False
+
+    if mode == BenchmarkMode.MOCK:
+        if dimension == BenchmarkDimension.ALL:
+            mock_dims = list(_MOCK_DIMENSIONS)
+        elif dimension in _MOCK_DIMENSIONS:
+            mock_dims = [dimension]
+    elif mode == BenchmarkMode.LLM:
+        if dimension in (BenchmarkDimension.ALL, BenchmarkDimension.LLM_REASONING):
+            run_llm = True
+    elif mode == BenchmarkMode.GUI:
+        if dimension in (BenchmarkDimension.ALL, BenchmarkDimension.GUI_INTEGRATION):
+            run_gui = True
+    elif mode == BenchmarkMode.ALL:
+        if dimension == BenchmarkDimension.ALL:
+            mock_dims = list(_MOCK_DIMENSIONS)
+            run_llm = True
+            run_gui = True
+        elif dimension in _MOCK_DIMENSIONS:
+            mock_dims = [dimension]
+        elif dimension == BenchmarkDimension.LLM_REASONING:
+            run_llm = True
+        elif dimension == BenchmarkDimension.GUI_INTEGRATION:
+            run_gui = True
 
     results: dict[str, DimensionResult] = {}
 
-    with tempfile.TemporaryDirectory(prefix="agentkit-benchmark-") as tmp:
-        tmp_path = Path(tmp)
-        ctx = _make_context(tmp_path)
+    # --- Mock dimensions ---
+    if mock_dims:
+        with tempfile.TemporaryDirectory(prefix="agentkit-benchmark-") as tmp:
+            tmp_path = Path(tmp)
+            ctx = _make_context(tmp_path)
 
+            with Progress(
+                SpinnerColumn(),
+                TextColumn("[progress.description]{task.description}"),
+                BarColumn(),
+                TaskProgressColumn(),
+                console=console,
+            ) as progress:
+                for dim in mock_dims:
+                    task = progress.add_task(f"Running [mock] {dim.value}...", total=None)
+                    dim_result = asyncio.run(_run_dimension(dim.value, runs, fast, verbose, ctx))
+                    results[dim.value] = dim_result
+                    progress.update(task, completed=True, total=1)
+
+    # --- LLM reasoning dimension ---
+    if run_llm:
+        console.print("[cyan]Loading real components for LLM mode...[/cyan]")
+        components = _build_real_components()
+        if components is None:
+            console.print(
+                "[yellow]⚠ LLM mode skipped — no valid agentkit.yaml or API key.[/yellow]"
+            )
+        else:
+            preprocessor, _skill_registry, llm_gateway = components
+            with Progress(
+                SpinnerColumn(),
+                TextColumn("[progress.description]{task.description}"),
+                BarColumn(),
+                TaskProgressColumn(),
+                console=console,
+            ) as progress:
+                task = progress.add_task("Running [llm] llm_reasoning...", total=None)
+                dim_result = asyncio.run(
+                    _run_llm_reasoning(runs, fast, verbose, preprocessor, llm_gateway)
+                )
+                results["llm_reasoning"] = dim_result
+                progress.update(task, completed=True, total=1)
+
+    # --- GUI integration dimension ---
+    if run_gui:
+        console.print("[cyan]Starting GUI integration tests...[/cyan]")
         with Progress(
             SpinnerColumn(),
             TextColumn("[progress.description]{task.description}"),
@@ -2196,11 +2389,14 @@ def benchmark(
             TaskProgressColumn(),
             console=console,
         ) as progress:
-            for dim in dims_to_run:
-                task = progress.add_task(f"Running {dim.value}...", total=None)
-                dim_result = asyncio.run(_run_dimension(dim.value, runs, fast, verbose, ctx))
-                results[dim.value] = dim_result
-                progress.update(task, completed=True, total=1)
+            task = progress.add_task("Running [gui] gui_integration...", total=None)
+            dim_result = asyncio.run(_run_gui_integration(runs, fast, verbose))
+            results["gui_integration"] = dim_result
+            progress.update(task, completed=True, total=1)
+
+    if not results:
+        console.print("[yellow]⚠ No dimensions were run.[/yellow]")
+        return
 
     # Display summary table
     console.print()
@@ -2252,6 +2448,7 @@ def benchmark(
         report_data: dict[str, object] = {
             "timestamp": timestamp,
             "version": version,
+            "mode": mode.value,
             "runs": runs,
             "fast": fast,
             "overall_accuracy": round(overall_score, 4),
diff --git a/test-results/benchmark/benchmark_report.json b/test-results/benchmark/benchmark_report.json
index a38ea17..48bc2f3 100644
--- a/test-results/benchmark/benchmark_report.json
+++ b/test-results/benchmark/benchmark_report.json
@@ -1,12 +1,13 @@
 {
-  "timestamp": "2026-06-17T04:00:50.738066+00:00",
+  "timestamp": "2026-06-17T04:52:53.863927+00:00",
   "version": "0.1.0",
-  "runs": 3,
+  "mode": "all",
+  "runs": 1,
   "fast": false,
-  "overall_accuracy": 1.0,
-  "overall_accuracy_mean": 1.0,
+  "overall_accuracy": 0.9524,
+  "overall_accuracy_mean": 0.9524,
   "overall_accuracy_std": 0.0,
-  "summary": "All 53 tests passed across 7 dimensions.",
+  "summary": "60/63 tests passed (3 failed) across 9 dimensions.",
   "dimensions": {
     "preprocessing": {
       "metrics": {
@@ -14,9 +15,9 @@
         "precision": 1.0,
         "recall": 1.0,
         "f1": 1.0,
-        "latency_p50_ms": 0.006,
-        "latency_p95_ms": 0.0295,
-        "latency_p99_ms": 0.0569,
+        "latency_p50_ms": 0.0128,
+        "latency_p95_ms": 0.057,
+        "latency_p99_ms": 0.1086,
         "consistency": 1.0,
         "total": 15,
         "passed": 15,
@@ -32,9 +33,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0069,
-          "latency_p95_ms": 0.0111,
-          "latency_p99_ms": 0.0117,
+          "latency_p50_ms": 0.0133,
+          "latency_p95_ms": 0.026,
+          "latency_p99_ms": 0.0275,
           "consistency": 1.0,
           "total": 4,
           "passed": 4,
@@ -49,9 +50,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0051,
-          "latency_p95_ms": 0.0052,
-          "latency_p99_ms": 0.0052,
+          "latency_p50_ms": 0.0115,
+          "latency_p95_ms": 0.0166,
+          "latency_p99_ms": 0.0172,
           "consistency": 1.0,
           "total": 5,
           "passed": 5,
@@ -66,9 +67,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0149,
-          "latency_p95_ms": 0.0588,
-          "latency_p99_ms": 0.0627,
+          "latency_p50_ms": 0.0294,
+          "latency_p95_ms": 0.1123,
+          "latency_p99_ms": 0.1197,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -83,9 +84,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0056,
-          "latency_p95_ms": 0.0074,
-          "latency_p99_ms": 0.0076,
+          "latency_p50_ms": 0.0101,
+          "latency_p95_ms": 0.0125,
+          "latency_p99_ms": 0.0127,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -102,9 +103,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0066,
-          "latency_p95_ms": 0.0109,
-          "latency_p99_ms": 0.0116,
+          "latency_p50_ms": 0.0115,
+          "latency_p95_ms": 0.0253,
+          "latency_p99_ms": 0.0274,
           "consistency": 1.0,
           "total": 5,
           "passed": 5,
@@ -119,9 +120,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0051,
-          "latency_p95_ms": 0.0132,
-          "latency_p99_ms": 0.0146,
+          "latency_p50_ms": 0.0136,
+          "latency_p95_ms": 0.0263,
+          "latency_p99_ms": 0.0288,
           "consistency": 1.0,
           "total": 7,
           "passed": 7,
@@ -136,9 +137,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0076,
-          "latency_p95_ms": 0.0581,
-          "latency_p99_ms": 0.0626,
+          "latency_p50_ms": 0.0128,
+          "latency_p95_ms": 0.1106,
+          "latency_p99_ms": 0.1193,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -158,7 +159,7 @@
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.0118,
+          "duration_ms": 0.0279,
           "root_cause": "none",
           "detail": "input='你好' method=regex_direct",
           "consistency": 1.0
@@ -171,7 +172,7 @@
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.0071,
+          "duration_ms": 0.0151,
           "root_cause": "none",
           "detail": "input='hello' method=regex_direct",
           "consistency": 1.0
@@ -184,7 +185,7 @@
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.0066,
+          "duration_ms": 0.0111,
           "root_cause": "none",
           "detail": "input='谢谢' method=regex_direct",
           "consistency": 1.0
@@ -197,7 +198,7 @@
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.006,
+          "duration_ms": 0.0115,
           "root_cause": "none",
           "detail": "input='你是谁' method=regex_direct",
           "consistency": 1.0
@@ -210,7 +211,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0052,
+          "duration_ms": 0.0136,
           "root_cause": "none",
           "detail": "input='搜索golang教程' method=default_react",
           "consistency": 1.0
@@ -223,7 +224,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0046,
+          "duration_ms": 0.0115,
           "root_cause": "none",
           "detail": "input='执行ls命令' method=default_react",
           "consistency": 1.0
@@ -236,7 +237,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0051,
+          "duration_ms": 0.0174,
           "root_cause": "none",
           "detail": "input='翻译hello为中文' method=default_react",
           "consistency": 1.0
@@ -249,7 +250,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0051,
+          "duration_ms": 0.0113,
           "root_cause": "none",
           "detail": "input='什么是机器学习' method=default_react",
           "consistency": 1.0
@@ -262,7 +263,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0047,
+          "duration_ms": 0.0109,
           "root_cause": "none",
           "detail": "input='帮我分析数据' method=default_react",
           "consistency": 1.0
@@ -275,7 +276,7 @@
           "passed": true,
           "expected": "skill_react",
           "actual": "skill_react",
-          "duration_ms": 0.0149,
+          "duration_ms": 0.0294,
           "root_cause": "none",
           "detail": "input='@skill:react_agent 查看ip' method=skill_prefix",
           "consistency": 1.0
@@ -288,7 +289,7 @@
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.0092,
+          "duration_ms": 0.0191,
           "root_cause": "none",
           "detail": "input='@skill:chat_only 你好' method=skill_prefix",
           "consistency": 1.0
@@ -301,7 +302,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0637,
+          "duration_ms": 0.1215,
           "root_cause": "none",
           "detail": "input='@skill:nonexistent 做点什么' method=skill_not_found_fallback",
           "consistency": 1.0
@@ -314,7 +315,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0076,
+          "duration_ms": 0.0101,
           "root_cause": "none",
           "detail": "input='帮我分析这个数据并生成报告' method=default_react",
           "consistency": 1.0
@@ -327,7 +328,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0056,
+          "duration_ms": 0.0099,
           "root_cause": "none",
           "detail": "input='随便聊聊' method=default_react",
           "consistency": 1.0
@@ -340,7 +341,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0047,
+          "duration_ms": 0.0128,
           "root_cause": "none",
           "detail": "input='请帮我完成以下任务：1. 查询天气 2. 生成报告' method=default_react",
           "consistency": 1.0
@@ -353,9 +354,9 @@
         "precision": 1.0,
         "recall": 1.0,
         "f1": 1.0,
-        "latency_p50_ms": 0.0426,
-        "latency_p95_ms": 0.0644,
-        "latency_p99_ms": 0.0675,
+        "latency_p50_ms": 0.025,
+        "latency_p95_ms": 0.0557,
+        "latency_p99_ms": 0.0596,
         "consistency": 1.0,
         "total": 5,
         "passed": 5,
@@ -371,9 +372,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0426,
-          "latency_p95_ms": 0.0426,
-          "latency_p99_ms": 0.0426,
+          "latency_p50_ms": 0.0362,
+          "latency_p95_ms": 0.0362,
+          "latency_p99_ms": 0.0362,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -388,9 +389,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0309,
-          "latency_p95_ms": 0.0309,
-          "latency_p99_ms": 0.0309,
+          "latency_p50_ms": 0.0243,
+          "latency_p95_ms": 0.0243,
+          "latency_p99_ms": 0.0243,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -405,9 +406,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.049,
-          "latency_p95_ms": 0.049,
-          "latency_p99_ms": 0.049,
+          "latency_p50_ms": 0.0606,
+          "latency_p95_ms": 0.0606,
+          "latency_p99_ms": 0.0606,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -422,9 +423,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0252,
-          "latency_p95_ms": 0.0252,
-          "latency_p99_ms": 0.0252,
+          "latency_p50_ms": 0.0233,
+          "latency_p95_ms": 0.0233,
+          "latency_p99_ms": 0.0233,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -439,9 +440,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0683,
-          "latency_p95_ms": 0.0683,
-          "latency_p99_ms": 0.0683,
+          "latency_p50_ms": 0.025,
+          "latency_p95_ms": 0.025,
+          "latency_p99_ms": 0.025,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -458,9 +459,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0309,
-          "latency_p95_ms": 0.0414,
-          "latency_p99_ms": 0.0424,
+          "latency_p50_ms": 0.0243,
+          "latency_p95_ms": 0.035,
+          "latency_p99_ms": 0.036,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -475,9 +476,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.049,
-          "latency_p95_ms": 0.049,
-          "latency_p99_ms": 0.049,
+          "latency_p50_ms": 0.0606,
+          "latency_p95_ms": 0.0606,
+          "latency_p99_ms": 0.0606,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -492,9 +493,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0683,
-          "latency_p95_ms": 0.0683,
-          "latency_p99_ms": 0.0683,
+          "latency_p50_ms": 0.025,
+          "latency_p95_ms": 0.025,
+          "latency_p99_ms": 0.025,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -514,7 +515,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0426,
+          "duration_ms": 0.0362,
           "root_cause": "none",
           "detail": "paraphrases=5 modes=['react', 'react', 'react', 'react', 'react']",
           "consistency": 1.0
@@ -527,7 +528,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0309,
+          "duration_ms": 0.0243,
           "root_cause": "none",
           "detail": "paraphrases=3 modes=['react', 'react', 'react']",
           "consistency": 1.0
@@ -540,7 +541,7 @@
           "passed": true,
           "expected": "direct_chat",
           "actual": "direct_chat",
-          "duration_ms": 0.049,
+          "duration_ms": 0.0606,
           "root_cause": "none",
           "detail": "paraphrases=5 modes=['direct_chat', 'direct_chat', 'direct_chat', 'direct_chat', 'direct_chat']",
           "consistency": 1.0
@@ -553,7 +554,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0252,
+          "duration_ms": 0.0233,
           "root_cause": "none",
           "detail": "paraphrases=3 modes=['react', 'react', 'react']",
           "consistency": 1.0
@@ -566,7 +567,7 @@
           "passed": true,
           "expected": "react",
           "actual": "react",
-          "duration_ms": 0.0683,
+          "duration_ms": 0.025,
           "root_cause": "none",
           "detail": "paraphrases=3 modes=['react', 'react', 'react']",
           "consistency": 1.0
@@ -579,9 +580,9 @@
         "precision": 0.0,
         "recall": 0.0,
         "f1": 0.0,
-        "latency_p50_ms": 0.4,
-        "latency_p95_ms": 0.768,
-        "latency_p99_ms": 0.8176,
+        "latency_p50_ms": 0.33,
+        "latency_p95_ms": 0.622,
+        "latency_p99_ms": 0.6604,
         "consistency": 1.0,
         "total": 5,
         "passed": 5,
@@ -597,9 +598,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 0.4,
-          "latency_p95_ms": 0.508,
-          "latency_p99_ms": 0.5176,
+          "latency_p50_ms": 0.33,
+          "latency_p95_ms": 0.42,
+          "latency_p99_ms": 0.428,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -614,9 +615,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 0.44,
-          "latency_p95_ms": 0.791,
-          "latency_p99_ms": 0.8222,
+          "latency_p50_ms": 0.355,
+          "latency_p95_ms": 0.6385,
+          "latency_p99_ms": 0.6637,
           "consistency": 1.0,
           "total": 2,
           "passed": 2,
@@ -633,9 +634,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 0.2,
-          "latency_p95_ms": 0.335,
-          "latency_p99_ms": 0.347,
+          "latency_p50_ms": 0.165,
+          "latency_p95_ms": 0.2775,
+          "latency_p99_ms": 0.2875,
           "consistency": 1.0,
           "total": 2,
           "passed": 2,
@@ -650,9 +651,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 0.52,
-          "latency_p95_ms": 0.799,
-          "latency_p99_ms": 0.8238,
+          "latency_p50_ms": 0.43,
+          "latency_p95_ms": 0.646,
+          "latency_p99_ms": 0.6652,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -671,10 +672,10 @@
           "difficulty": "easy",
           "passed": true,
           "expected": "<=50ms",
-          "actual": "0.004ms",
-          "duration_ms": 0.35,
+          "actual": "0.003ms",
+          "duration_ms": 0.29,
           "root_cause": "none",
-          "detail": "iterations=100 avg=0.004ms threshold=50.0ms",
+          "detail": "iterations=100 avg=0.003ms threshold=50.0ms",
           "consistency": 1.0
         },
         {
@@ -684,10 +685,10 @@
           "difficulty": "medium",
           "passed": true,
           "expected": "<=50ms",
-          "actual": "0.004ms",
-          "duration_ms": 0.4,
+          "actual": "0.003ms",
+          "duration_ms": 0.33,
           "root_cause": "none",
-          "detail": "iterations=100 avg=0.004ms threshold=50.0ms",
+          "detail": "iterations=100 avg=0.003ms threshold=50.0ms",
           "consistency": 1.0
         },
         {
@@ -697,10 +698,10 @@
           "difficulty": "medium",
           "passed": true,
           "expected": "<=50ms",
-          "actual": "0.005ms",
-          "duration_ms": 0.52,
+          "actual": "0.004ms",
+          "duration_ms": 0.43,
           "root_cause": "none",
-          "detail": "iterations=100 avg=0.005ms threshold=50.0ms",
+          "detail": "iterations=100 avg=0.004ms threshold=50.0ms",
           "consistency": 1.0
         },
         {
@@ -710,10 +711,10 @@
           "difficulty": "medium",
           "passed": true,
           "expected": "<=10ms",
-          "actual": "0.008ms",
-          "duration_ms": 0.83,
+          "actual": "0.007ms",
+          "duration_ms": 0.67,
           "root_cause": "none",
-          "detail": "iterations=100 avg=0.008ms threshold=10.0ms",
+          "detail": "iterations=100 avg=0.007ms threshold=10.0ms",
           "consistency": 1.0
         },
         {
@@ -724,7 +725,7 @@
           "passed": true,
           "expected": "<=5ms",
           "actual": "0.000ms",
-          "duration_ms": 0.05,
+          "duration_ms": 0.04,
           "root_cause": "none",
           "detail": "iterations=100 avg=0.000ms threshold=5.0ms",
           "consistency": 1.0
@@ -737,9 +738,9 @@
         "precision": 0.8333,
         "recall": 0.8333,
         "f1": 0.8333,
-        "latency_p50_ms": 0.0112,
-        "latency_p95_ms": 0.0153,
-        "latency_p99_ms": 0.0163,
+        "latency_p50_ms": 0.0192,
+        "latency_p95_ms": 0.0278,
+        "latency_p99_ms": 0.0326,
         "consistency": 1.0,
         "total": 10,
         "passed": 10,
@@ -755,9 +756,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0124,
-          "latency_p95_ms": 0.016,
-          "latency_p99_ms": 0.0165,
+          "latency_p50_ms": 0.0199,
+          "latency_p95_ms": 0.0203,
+          "latency_p99_ms": 0.0204,
           "consistency": 1.0,
           "total": 5,
           "passed": 5,
@@ -772,9 +773,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0108,
-          "latency_p95_ms": 0.0111,
-          "latency_p99_ms": 0.0111,
+          "latency_p50_ms": 0.0264,
+          "latency_p95_ms": 0.0331,
+          "latency_p99_ms": 0.0337,
           "consistency": 1.0,
           "total": 2,
           "passed": 2,
@@ -789,9 +790,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 0.0044,
-          "latency_p95_ms": 0.0071,
-          "latency_p99_ms": 0.0073,
+          "latency_p50_ms": 0.0118,
+          "latency_p95_ms": 0.0122,
+          "latency_p99_ms": 0.0123,
           "consistency": 1.0,
           "total": 2,
           "passed": 2,
@@ -806,9 +807,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0091,
-          "latency_p95_ms": 0.0091,
-          "latency_p99_ms": 0.0091,
+          "latency_p50_ms": 0.016,
+          "latency_p95_ms": 0.016,
+          "latency_p99_ms": 0.016,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -825,9 +826,9 @@
           "precision": 0.8333,
           "recall": 0.8333,
           "f1": 0.8333,
-          "latency_p50_ms": 0.0124,
-          "latency_p95_ms": 0.0158,
-          "latency_p99_ms": 0.0164,
+          "latency_p50_ms": 0.0194,
+          "latency_p95_ms": 0.0203,
+          "latency_p99_ms": 0.0204,
           "consistency": 1.0,
           "total": 7,
           "passed": 7,
@@ -842,9 +843,9 @@
           "precision": 1.0,
           "recall": 1.0,
           "f1": 1.0,
-          "latency_p50_ms": 0.0105,
-          "latency_p95_ms": 0.011,
-          "latency_p99_ms": 0.0111,
+          "latency_p50_ms": 0.019,
+          "latency_p95_ms": 0.0323,
+          "latency_p99_ms": 0.0335,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -864,7 +865,7 @@
           "passed": true,
           "expected": "read_file",
           "actual": "read_file",
-          "duration_ms": 0.0166,
+          "duration_ms": 0.0199,
           "root_cause": "none",
           "detail": "query='read file' top_k=5 results=2",
           "consistency": 1.0
@@ -877,7 +878,7 @@
           "passed": true,
           "expected": "write_file",
           "actual": "write_file",
-          "duration_ms": 0.0138,
+          "duration_ms": 0.0204,
           "root_cause": "none",
           "detail": "query='write file content' top_k=5 results=2",
           "consistency": 1.0
@@ -890,7 +891,7 @@
           "passed": true,
           "expected": "web_search",
           "actual": "web_search",
-          "duration_ms": 0.0124,
+          "duration_ms": 0.02,
           "root_cause": "none",
           "detail": "query='search web information' top_k=5 results=2",
           "consistency": 1.0
@@ -903,7 +904,7 @@
           "passed": true,
           "expected": "shell_exec",
           "actual": "shell_exec",
-          "duration_ms": 0.0113,
+          "duration_ms": 0.018,
           "root_cause": "none",
           "detail": "query='execute shell command' top_k=5 results=1",
           "consistency": 1.0
@@ -916,7 +917,7 @@
           "passed": true,
           "expected": "http_request",
           "actual": "http_request",
-          "duration_ms": 0.0124,
+          "duration_ms": 0.0194,
           "root_cause": "none",
           "detail": "query='send http request url' top_k=5 results=1",
           "consistency": 1.0
@@ -929,7 +930,7 @@
           "passed": true,
           "expected": "read_file",
           "actual": "read_file",
-          "duration_ms": 0.0105,
+          "duration_ms": 0.0338,
           "root_cause": "none",
           "detail": "query='io file' top_k=5 results=2",
           "consistency": 1.0
@@ -942,7 +943,7 @@
           "passed": true,
           "expected": "web_search",
           "actual": "web_search",
-          "duration_ms": 0.0111,
+          "duration_ms": 0.019,
           "root_cause": "none",
           "detail": "query='search query engine' top_k=5 results=1",
           "consistency": 1.0
@@ -955,7 +956,7 @@
           "passed": true,
           "expected": "__none__",
           "actual": "[]",
-          "duration_ms": 0.0015,
+          "duration_ms": 0.0112,
           "root_cause": "none",
           "detail": "query='' top_k=5 results=0",
           "consistency": 1.0
@@ -968,7 +969,7 @@
           "passed": true,
           "expected": "__none__",
           "actual": "[]",
-          "duration_ms": 0.0074,
+          "duration_ms": 0.0123,
           "root_cause": "none",
           "detail": "query='zzzznonexistent' top_k=5 results=0",
           "consistency": 1.0
@@ -981,7 +982,7 @@
           "passed": true,
           "expected": "read_file",
           "actual": "read_file",
-          "duration_ms": 0.0091,
+          "duration_ms": 0.016,
           "root_cause": "none",
           "detail": "query='file' top_k=1 results=1",
           "consistency": 1.0
@@ -994,9 +995,9 @@
         "precision": 0.0,
         "recall": 0.0,
         "f1": 0.0,
-        "latency_p50_ms": 0.0409,
-        "latency_p95_ms": 15.6839,
-        "latency_p99_ms": 19.8446,
+        "latency_p50_ms": 0.057,
+        "latency_p95_ms": 15.9984,
+        "latency_p99_ms": 20.2369,
         "consistency": 1.0,
         "total": 6,
         "passed": 6,
@@ -1012,9 +1013,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 0.038,
-          "latency_p95_ms": 0.0773,
-          "latency_p99_ms": 0.0808,
+          "latency_p50_ms": 0.046,
+          "latency_p95_ms": 0.0982,
+          "latency_p99_ms": 0.1028,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -1029,9 +1030,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 0.0438,
-          "latency_p95_ms": 18.8006,
-          "latency_p99_ms": 20.4679,
+          "latency_p50_ms": 0.0681,
+          "latency_p95_ms": 19.1737,
+          "latency_p99_ms": 20.8719,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -1048,9 +1049,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 0.0409,
-          "latency_p95_ms": 15.6839,
-          "latency_p99_ms": 19.8446,
+          "latency_p50_ms": 0.057,
+          "latency_p95_ms": 15.9984,
+          "latency_p99_ms": 20.2369,
           "consistency": 1.0,
           "total": 6,
           "passed": 6,
@@ -1070,9 +1071,9 @@
           "passed": true,
           "expected": "passed",
           "actual": "drained=['hello']",
-          "duration_ms": 0.0817,
+          "duration_ms": 0.104,
           "root_cause": "none",
-          "detail": "task_id=b0a1c409...",
+          "detail": "task_id=09dccea9...",
           "consistency": 1.0
         },
         {
@@ -1083,7 +1084,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "cancelled=True",
-          "duration_ms": 0.038,
+          "duration_ms": 0.046,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1096,7 +1097,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "raised=True closed=True",
-          "duration_ms": 0.0091,
+          "duration_ms": 0.0115,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1109,7 +1110,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "received=1",
-          "duration_ms": 0.0438,
+          "duration_ms": 0.0681,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1122,7 +1123,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "events=1 closed=True",
-          "duration_ms": 20.8847,
+          "duration_ms": 21.2965,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1135,7 +1136,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "subscribers=0",
-          "duration_ms": 0.0045,
+          "duration_ms": 0.007,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1148,9 +1149,9 @@
         "precision": 0.0,
         "recall": 0.0,
         "f1": 0.0,
-        "latency_p50_ms": 1.414,
-        "latency_p95_ms": 3.5951,
-        "latency_p99_ms": 4.0383,
+        "latency_p50_ms": 1.3834,
+        "latency_p95_ms": 3.4578,
+        "latency_p99_ms": 4.0077,
         "consistency": 1.0,
         "total": 7,
         "passed": 7,
@@ -1166,9 +1167,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 1.414,
-          "latency_p95_ms": 3.6332,
-          "latency_p99_ms": 4.0459,
+          "latency_p50_ms": 1.3834,
+          "latency_p95_ms": 3.6044,
+          "latency_p99_ms": 4.037,
           "consistency": 1.0,
           "total": 5,
           "passed": 5,
@@ -1183,9 +1184,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 1.1783,
-          "latency_p95_ms": 2.1899,
-          "latency_p99_ms": 2.2798,
+          "latency_p50_ms": 0.9497,
+          "latency_p95_ms": 1.7635,
+          "latency_p99_ms": 1.8358,
           "consistency": 1.0,
           "total": 2,
           "passed": 2,
@@ -1202,9 +1203,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 1.3787,
-          "latency_p95_ms": 3.5042,
-          "latency_p99_ms": 4.0201,
+          "latency_p50_ms": 1.3659,
+          "latency_p95_ms": 3.4693,
+          "latency_p99_ms": 4.01,
           "consistency": 1.0,
           "total": 6,
           "passed": 6,
@@ -1219,9 +1220,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 2.3023,
-          "latency_p95_ms": 2.3023,
-          "latency_p99_ms": 2.3023,
+          "latency_p50_ms": 1.8539,
+          "latency_p95_ms": 1.8539,
+          "latency_p99_ms": 1.8539,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -1241,9 +1242,9 @@
           "passed": true,
           "expected": "passed",
           "actual": "exists=True",
-          "duration_ms": 1.414,
+          "duration_ms": 1.3484,
           "root_cause": "none",
-          "detail": "path=/var/folders/6b/ljk5bdq50yxcsth24frf05200000gn/T/agentkit-benchmark-pz2hpb1l/run-2/specs/sm-001/test-spec.yaml",
+          "detail": "path=/var/folders/6b/ljk5bdq50yxcsth24frf05200000gn/T/agentkit-benchmark-wll_nqgl/run-0/specs/sm-001/test-spec.yaml",
           "consistency": 1.0
         },
         {
@@ -1254,7 +1255,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "steps=2",
-          "duration_ms": 1.3435,
+          "duration_ms": 1.3834,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1267,7 +1268,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "goal=Updated goal",
-          "duration_ms": 1.5695,
+          "duration_ms": 1.4414,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1280,7 +1281,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "deleted=True remaining=0",
-          "duration_ms": 1.1556,
+          "duration_ms": 1.0766,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1293,7 +1294,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "count=2",
-          "duration_ms": 4.1491,
+          "duration_ms": 4.1452,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1306,7 +1307,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "status=confirmed",
-          "duration_ms": 2.3023,
+          "duration_ms": 1.8539,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1319,7 +1320,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "result=None",
-          "duration_ms": 0.0544,
+          "duration_ms": 0.0454,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1332,9 +1333,9 @@
         "precision": 0.0,
         "recall": 0.0,
         "f1": 0.0,
-        "latency_p50_ms": 25.4393,
-        "latency_p95_ms": 413.4245,
-        "latency_p99_ms": 488.3185,
+        "latency_p50_ms": 22.0041,
+        "latency_p95_ms": 411.5705,
+        "latency_p99_ms": 487.0649,
         "consistency": 1.0,
         "total": 5,
         "passed": 5,
@@ -1350,9 +1351,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 12.9474,
-          "latency_p95_ms": 13.0775,
-          "latency_p99_ms": 13.0891,
+          "latency_p50_ms": 11.4916,
+          "latency_p95_ms": 11.8303,
+          "latency_p99_ms": 11.8604,
           "consistency": 1.0,
           "total": 2,
           "passed": 2,
@@ -1367,9 +1368,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 38.9547,
-          "latency_p95_ms": 38.9547,
-          "latency_p99_ms": 38.9547,
+          "latency_p50_ms": 34.0985,
+          "latency_p95_ms": 34.0985,
+          "latency_p99_ms": 34.0985,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -1384,9 +1385,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 507.042,
-          "latency_p95_ms": 507.042,
-          "latency_p99_ms": 507.042,
+          "latency_p50_ms": 505.9385,
+          "latency_p95_ms": 505.9385,
+          "latency_p99_ms": 505.9385,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -1401,9 +1402,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 25.4393,
-          "latency_p95_ms": 25.4393,
-          "latency_p99_ms": 25.4393,
+          "latency_p50_ms": 22.0041,
+          "latency_p95_ms": 22.0041,
+          "latency_p99_ms": 22.0041,
           "consistency": 1.0,
           "total": 1,
           "passed": 1,
@@ -1420,9 +1421,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 12.9474,
-          "latency_p95_ms": 13.0775,
-          "latency_p99_ms": 13.0891,
+          "latency_p50_ms": 11.4916,
+          "latency_p95_ms": 11.8303,
+          "latency_p99_ms": 11.8604,
           "consistency": 1.0,
           "total": 2,
           "passed": 2,
@@ -1437,9 +1438,9 @@
           "precision": 0.0,
           "recall": 0.0,
           "f1": 0.0,
-          "latency_p50_ms": 38.9547,
-          "latency_p95_ms": 460.2333,
-          "latency_p99_ms": 497.6803,
+          "latency_p50_ms": 34.0985,
+          "latency_p95_ms": 458.7545,
+          "latency_p99_ms": 496.5017,
           "consistency": 1.0,
           "total": 3,
           "passed": 3,
@@ -1459,7 +1460,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "passed=True attempts=1",
-          "duration_ms": 13.092,
+          "duration_ms": 11.8679,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1472,7 +1473,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "passed=False errors=1",
-          "duration_ms": 12.8029,
+          "duration_ms": 11.1154,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1485,7 +1486,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "attempts=3 callbacks=2",
-          "duration_ms": 38.9547,
+          "duration_ms": 34.0985,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
@@ -1498,7 +1499,7 @@
           "passed": true,
           "expected": "passed",
           "actual": "passed=False errors=1",
-          "duration_ms": 507.042,
+          "duration_ms": 505.9385,
           "root_cause": "none",
           "detail": "errors=['Command timed out after 0.5s: sleep 10']",
           "consistency": 1.0
@@ -1511,12 +1512,447 @@
           "passed": true,
           "expected": "passed",
           "actual": "passed=False",
-          "duration_ms": 25.4393,
+          "duration_ms": 22.0041,
           "root_cause": "none",
           "detail": "",
           "consistency": 1.0
         }
       ]
+    },
+    "llm_reasoning": {
+      "metrics": {
+        "accuracy": 0.6,
+        "precision": 0.0,
+        "recall": 0.0,
+        "f1": 0.0,
+        "latency_p50_ms": 25149.4865,
+        "latency_p95_ms": 30001.1677,
+        "latency_p99_ms": 30001.2291,
+        "consistency": 1.0,
+        "total": 5,
+        "passed": 3,
+        "failed": 2,
+        "accuracy_mean": 0.6,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.2307,
+        "ci_upper": 0.8824
+      },
+      "by_category": {
+        "intent_understanding": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 21288.4177,
+          "latency_p95_ms": 21288.4177,
+          "latency_p99_ms": 21288.4177,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "tool_selection": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 5894.9682,
+          "latency_p95_ms": 5894.9682,
+          "latency_p99_ms": 5894.9682,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "multi_step": {
+          "accuracy": 0.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 30000.8609,
+          "latency_p95_ms": 30000.8609,
+          "latency_p99_ms": 30000.8609,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 0,
+          "failed": 1,
+          "accuracy_mean": 0.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.0,
+          "ci_upper": 0.7935
+        },
+        "code_generation": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 25149.4865,
+          "latency_p95_ms": 25149.4865,
+          "latency_p99_ms": 25149.4865,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "error_recovery": {
+          "accuracy": 0.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 30001.2444,
+          "latency_p95_ms": 30001.2444,
+          "latency_p99_ms": 30001.2444,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 0,
+          "failed": 1,
+          "accuracy_mean": 0.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.0,
+          "ci_upper": 0.7935
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 21288.4177,
+          "latency_p95_ms": 21288.4177,
+          "latency_p99_ms": 21288.4177,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 15522.2273,
+          "latency_p95_ms": 24186.7606,
+          "latency_p99_ms": 24956.9413,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "hard": {
+          "accuracy": 0.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 30001.0526,
+          "latency_p95_ms": 30001.2252,
+          "latency_p99_ms": 30001.2406,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 0,
+          "failed": 2,
+          "accuracy_mean": 0.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.0,
+          "ci_upper": 0.6576
+        }
+      },
+      "cases": [
+        {
+          "task_id": "llm-001",
+          "dimension": "llm_reasoning",
+          "category": "intent_understanding",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "react",
+          "actual": "mode=react tokens=1116 len=974",
+          "duration_ms": 21288.4177,
+          "root_cause": "none",
+          "detail": "mode=react keywords=['ip', '地址', 'ifconfig', 'hostname', '网络']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "llm-002",
+          "dimension": "llm_reasoning",
+          "category": "tool_selection",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "mode=react tokens=205 len=87",
+          "duration_ms": 5894.9682,
+          "root_cause": "none",
+          "detail": "mode=react keywords=['search', '搜索', 'web', '论文', 'paper', 'agent']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "llm-003",
+          "dimension": "llm_reasoning",
+          "category": "multi_step",
+          "difficulty": "hard",
+          "passed": false,
+          "expected": "react",
+          "actual": "timeout",
+          "duration_ms": 30000.8609,
+          "root_cause": "timeout",
+          "detail": "LLM call timed out after 30s",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "llm-004",
+          "dimension": "llm_reasoning",
+          "category": "code_generation",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "react",
+          "actual": "mode=react tokens=1359 len=1001",
+          "duration_ms": 25149.4865,
+          "root_cause": "none",
+          "detail": "mode=react keywords=['def', 'fib', 'return', 'python']",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "llm-005",
+          "dimension": "llm_reasoning",
+          "category": "error_recovery",
+          "difficulty": "hard",
+          "passed": false,
+          "expected": "react",
+          "actual": "timeout",
+          "duration_ms": 30001.2444,
+          "root_cause": "timeout",
+          "detail": "LLM call timed out after 30s",
+          "consistency": 1.0
+        }
+      ]
+    },
+    "gui_integration": {
+      "metrics": {
+        "accuracy": 0.8,
+        "precision": 0.8,
+        "recall": 0.8,
+        "f1": 0.8,
+        "latency_p50_ms": 0.0,
+        "latency_p95_ms": 0.0,
+        "latency_p99_ms": 0.0,
+        "consistency": 1.0,
+        "total": 5,
+        "passed": 4,
+        "failed": 1,
+        "accuracy_mean": 0.8,
+        "accuracy_std": 0.0,
+        "ci_lower": 0.3755,
+        "ci_upper": 0.9638
+      },
+      "by_category": {
+        "service_startup": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0,
+          "latency_p95_ms": 0.0,
+          "latency_p99_ms": 0.0,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        },
+        "api_availability": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0,
+          "latency_p95_ms": 0.0,
+          "latency_p99_ms": 0.0,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "websocket": {
+          "accuracy": 0.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.0,
+          "latency_p95_ms": 0.0,
+          "latency_p99_ms": 0.0,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 0,
+          "failed": 1,
+          "accuracy_mean": 0.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.0,
+          "ci_upper": 0.7935
+        },
+        "frontend": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0,
+          "latency_p95_ms": 0.0,
+          "latency_p99_ms": 0.0,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 1,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.2065,
+          "ci_upper": 1.0
+        }
+      },
+      "by_difficulty": {
+        "easy": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0,
+          "latency_p95_ms": 0.0,
+          "latency_p99_ms": 0.0,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "medium": {
+          "accuracy": 1.0,
+          "precision": 1.0,
+          "recall": 1.0,
+          "f1": 1.0,
+          "latency_p50_ms": 0.0,
+          "latency_p95_ms": 0.0,
+          "latency_p99_ms": 0.0,
+          "consistency": 1.0,
+          "total": 2,
+          "passed": 2,
+          "failed": 0,
+          "accuracy_mean": 1.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.3424,
+          "ci_upper": 1.0
+        },
+        "hard": {
+          "accuracy": 0.0,
+          "precision": 0.0,
+          "recall": 0.0,
+          "f1": 0.0,
+          "latency_p50_ms": 0.0,
+          "latency_p95_ms": 0.0,
+          "latency_p99_ms": 0.0,
+          "consistency": 1.0,
+          "total": 1,
+          "passed": 0,
+          "failed": 1,
+          "accuracy_mean": 0.0,
+          "accuracy_std": 0.0,
+          "ci_lower": 0.0,
+          "ci_upper": 0.7935
+        }
+      },
+      "cases": [
+        {
+          "task_id": "gui-001",
+          "dimension": "gui_integration",
+          "category": "service_startup",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "started",
+          "actual": "started",
+          "duration_ms": 0.0,
+          "root_cause": "none",
+          "detail": "port=64767 pid=20993",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "gui-002",
+          "dimension": "gui_integration",
+          "category": "api_availability",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "200",
+          "actual": "200",
+          "duration_ms": 0.0,
+          "root_cause": "none",
+          "detail": "health=200 skills=200",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "gui-003",
+          "dimension": "gui_integration",
+          "category": "api_availability",
+          "difficulty": "medium",
+          "passed": true,
+          "expected": "reachable",
+          "actual": "reachable",
+          "duration_ms": 0.0,
+          "root_cause": "none",
+          "detail": "status=405",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "gui-004",
+          "dimension": "gui_integration",
+          "category": "websocket",
+          "difficulty": "hard",
+          "passed": false,
+          "expected": "connected",
+          "actual": "failed",
+          "duration_ms": 0.0,
+          "root_cause": "gui_failure",
+          "detail": "error: server rejected WebSocket connection: HTTP 403",
+          "consistency": 1.0
+        },
+        {
+          "task_id": "gui-005",
+          "dimension": "gui_integration",
+          "category": "frontend",
+          "difficulty": "easy",
+          "passed": true,
+          "expected": "html",
+          "actual": "html",
+          "duration_ms": 0.0,
+          "root_cause": "none",
+          "detail": "status=200 len=465",
+          "consistency": 1.0
+        }
+      ]
     }
   },
   "baseline_comparison": {
@@ -1563,6 +1999,18 @@
         "current_accuracy": 1.0,
         "change": 0.0,
         "direction": "—"
+      },
+      "llm_reasoning": {
+        "baseline_accuracy": 0.0,
+        "current_accuracy": 0.6,
+        "change": 0.6,
+        "direction": "↑"
+      },
+      "gui_integration": {
+        "baseline_accuracy": 0.0,
+        "current_accuracy": 0.8,
+        "change": 0.8,
+        "direction": "↑"
       }
     }
   }
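Review note: the `ci_lower` / `ci_upper` fields added above are consistent with a 95% Wilson score interval computed from `passed` and `total`. The sketch below is a minimal reimplementation for sanity-checking those numbers. The function name `wilson_interval` and the choice of `z = 1.96` are assumptions for illustration, not the benchmark's actual code.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 gives roughly 95%)."""
    if total == 0:
        return (0.0, 0.0)
    p = passed / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return (max(0.0, center - half), min(1.0, center + half))


# llm_reasoning: 3 of 5 passed; gui_integration: 4 of 5 passed.
for name, passed, total in [("llm_reasoning", 3, 5), ("gui_integration", 4, 5)]:
    lo, hi = wilson_interval(passed, total)
    print(f"{name}: [{lo:.4f}, {hi:.4f}]")
```

With only 1 to 5 cases per bucket the intervals are very wide (a single passing case still yields a lower bound near 0.21), so the per-category accuracies should be read as rough indicators.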
diff --git a/test-results/benchmark/benchmark_report.md b/test-results/benchmark/benchmark_report.md
index 87c6399..a8dde39 100644
--- a/test-results/benchmark/benchmark_report.md
+++ b/test-results/benchmark/benchmark_report.md
@@ -1,10 +1,11 @@
 # AgentKit Capability Benchmark Report
 
 ## Test Summary
-- Time: 2026-06-17T04:00:50.738066+00:00
+- Time: 2026-06-17T04:52:53.863927+00:00
 - Version: 0.1.0
-- Runs: 3
-- Overall accuracy: 100.0% ± 0.0%
+- Mode: all
+- Runs: 1
+- Overall accuracy: 95.2% ± 0.0%
 
 ## Comparison with Industry Benchmarks
 
@@ -16,7 +17,7 @@
 
 ## Results by Dimension
 
-### 1. Preprocessing Accuracy
+### 1. Preprocessing Accuracy [Mock]
 
 | Metric | Value |
 |---|---|
@@ -26,8 +27,8 @@
 | Recall | 100.0% |
 | F1 | 100.0% |
 | Latency p50 | 0.01ms |
-| Latency p95 | 0.03ms |
-| Latency p99 | 0.06ms |
+| Latency p95 | 0.06ms |
+| Latency p99 | 0.11ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 15 / 15 / 0 |
 
@@ -48,7 +49,7 @@
 | medium | 7 | 7 | 100.0% |
 | hard | 3 | 3 | 100.0% |
 
-### 2. Overfitting Detection
+### 2. Overfitting Detection [Mock]
 
 | Metric | Value |
 |---|---|
@@ -57,9 +58,9 @@
 | Precision | 100.0% |
 | Recall | 100.0% |
 | F1 | 100.0% |
-| Latency p50 | 0.04ms |
+| Latency p50 | 0.03ms |
 | Latency p95 | 0.06ms |
-| Latency p99 | 0.07ms |
+| Latency p99 | 0.06ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 5 / 0 |
 
@@ -81,7 +82,7 @@
 | easy | 1 | 1 | 100.0% |
 | hard | 1 | 1 | 100.0% |
 
-### 3. Efficiency
+### 3. Efficiency [Mock]
 
 | Metric | Value |
 |---|---|
@@ -90,9 +91,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 0.40ms |
-| Latency p95 | 0.77ms |
-| Latency p99 | 0.82ms |
+| Latency p50 | 0.33ms |
+| Latency p95 | 0.62ms |
+| Latency p99 | 0.66ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 5 / 0 |
 
@@ -110,7 +111,7 @@
 | easy | 2 | 2 | 100.0% |
 | medium | 3 | 3 | 100.0% |
 
-### 4. Tool Search
+### 4. Tool Search [Mock]
 
 | Metric | Value |
 |---|---|
@@ -119,9 +120,9 @@
 | Precision | 83.3% |
 | Recall | 83.3% |
 | F1 | 83.3% |
-| Latency p50 | 0.01ms |
-| Latency p95 | 0.02ms |
-| Latency p99 | 0.02ms |
+| Latency p50 | 0.02ms |
+| Latency p95 | 0.03ms |
+| Latency p99 | 0.03ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 10 / 10 / 0 |
 
@@ -141,7 +142,7 @@
 | easy | 7 | 7 | 100.0% |
 | medium | 3 | 3 | 100.0% |
 
-### 5. Event Model
+### 5. Event Model [Mock]
 
 | Metric | Value |
 |---|---|
@@ -150,9 +151,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 0.04ms |
-| Latency p95 | 15.68ms |
-| Latency p99 | 19.84ms |
+| Latency p50 | 0.06ms |
+| Latency p95 | 16.00ms |
+| Latency p99 | 20.24ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 6 / 6 / 0 |
 
@@ -169,7 +170,7 @@
 |---|---|---|---|
 | easy | 6 | 6 | 100.0% |
 
-### 6. Spec Management
+### 6. Spec Management [Mock]
 
 | Metric | Value |
 |---|---|
@@ -178,9 +179,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 1.41ms |
-| Latency p95 | 3.60ms |
-| Latency p99 | 4.04ms |
+| Latency p50 | 1.38ms |
+| Latency p95 | 3.46ms |
+| Latency p99 | 4.01ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 7 / 7 / 0 |
 
@@ -198,7 +199,7 @@
 | easy | 6 | 6 | 100.0% |
 | medium | 1 | 1 | 100.0% |
 
-### 7. Verification Loop
+### 7. Verification Loop [Mock]
 
 | Metric | Value |
 |---|---|
@@ -207,9 +208,9 @@
 | Precision | 0.0% |
 | Recall | 0.0% |
 | F1 | 0.0% |
-| Latency p50 | 25.44ms |
-| Latency p95 | 413.42ms |
-| Latency p99 | 488.32ms |
+| Latency p50 | 22.00ms |
+| Latency p95 | 411.57ms |
+| Latency p99 | 487.06ms |
 | Consistency | 100.0% |
 | Total / Pass / Fail | 5 / 5 / 0 |
 
@@ -229,6 +230,84 @@
 | easy | 2 | 2 | 100.0% |
 | medium | 3 | 3 | 100.0% |
 
+### 8. LLM Reasoning [LLM]
+
+| Metric | Value |
+|---|---|
+| Accuracy | 60.0% ± 0.0% |
+| 95% CI | [23.1%, 88.2%] |
+| Precision | 0.0% |
+| Recall | 0.0% |
+| F1 | 0.0% |
+| Latency p50 | 25149.49ms |
+| Latency p95 | 30001.17ms |
+| Latency p99 | 30001.23ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 3 / 2 |
+
+#### By Category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| intent_understanding | 1 | 1 | 100.0% |
+| tool_selection | 1 | 1 | 100.0% |
+| multi_step | 1 | 0 | 0.0% |
+| code_generation | 1 | 1 | 100.0% |
+| error_recovery | 1 | 0 | 0.0% |
+
+#### By Difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 1 | 1 | 100.0% |
+| medium | 2 | 2 | 100.0% |
+| hard | 2 | 0 | 0.0% |
+
+#### Failed Case Analysis
+
+| Case ID | Category | Difficulty | Expected | Actual | Root cause |
+|---|---|---|---|---|---|
+| llm-003 | multi_step | hard | react | timeout | timeout |
+| llm-005 | error_recovery | hard | react | timeout | timeout |
+
+### 9. GUI Integration [GUI]
+
+| Metric | Value |
+|---|---|
+| Accuracy | 80.0% ± 0.0% |
+| 95% CI | [37.5%, 96.4%] |
+| Precision | 80.0% |
+| Recall | 80.0% |
+| F1 | 80.0% |
+| Latency p50 | 0.00ms |
+| Latency p95 | 0.00ms |
+| Latency p99 | 0.00ms |
+| Consistency | 100.0% |
+| Total / Pass / Fail | 5 / 4 / 1 |
+
+#### By Category
+
+| Category | Cases | Passed | Accuracy |
+|---|---|---|---|
+| service_startup | 1 | 1 | 100.0% |
+| api_availability | 2 | 2 | 100.0% |
+| websocket | 1 | 0 | 0.0% |
+| frontend | 1 | 1 | 100.0% |
+
+#### By Difficulty
+
+| Difficulty | Cases | Passed | Accuracy |
+|---|---|---|---|
+| easy | 2 | 2 | 100.0% |
+| medium | 2 | 2 | 100.0% |
+| hard | 1 | 0 | 0.0% |
+
+#### Failed Case Analysis
+
+| Case ID | Category | Difficulty | Expected | Actual | Root cause |
+|---|---|---|---|---|---|
+| gui-004 | websocket | hard | connected | failed | gui_failure |
+
 ## Baseline Comparison
 
 | Dimension | Baseline accuracy | Current accuracy | Change |
@@ -240,7 +319,12 @@
 | event_model | 100.0% | 100.0% | — |
 | spec_management | 100.0% | 100.0% | — |
 | verification | 100.0% | 100.0% | — |
+| llm_reasoning | 0.0% | 60.0% | ↑ |
+| gui_integration | 0.0% | 80.0% | ↑ |
 
 ## Issues and Recommendations
 
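-- **verification**: P95 latency of 413.42ms is high; consider optimizing performance
+- **verification**: P95 latency of 411.57ms is high; consider optimizing performance
+- **llm_reasoning**: accuracy of 60.0% is below 90%; review the failed cases and optimize
+- **llm_reasoning**: P95 latency of 30001.17ms is high; consider optimizing performance
+- **gui_integration**: accuracy of 80.0% is below 90%; review the failed cases and optimize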
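Review note: the Latency p50 / p95 / p99 rows in this report match percentiles computed with linear interpolation between the two nearest sorted samples (the same convention NumPy uses by default). The sketch below reproduces that calculation from the per-case `duration_ms` values. The helper name `percentile` is hypothetical, and whether the benchmark uses exactly this method is an assumption inferred from the numbers.

```python
def percentile(values: list[float], q: float) -> float:
    """q-th percentile (0 to 100) using linear interpolation between ranks."""
    if not values:
        return 0.0
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    frac = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * frac


# duration_ms of llm-001 .. llm-005 as recorded in the JSON results.
llm_durations = [21288.4177, 5894.9682, 30000.8609, 25149.4865, 30001.2444]
for q in (50, 95, 99):
    print(f"p{q}: {percentile(llm_durations, q):.2f}ms")
```

Because llm-003 and llm-005 both hit the 30s timeout, the p95 and p99 values for `llm_reasoning` mostly reflect the timeout cap, not the model's response time.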
diff --git a/test-results/benchmark/benchmark_report.txt b/test-results/benchmark/benchmark_report.txt
index 7b8c1f0..53131f4 100644
--- a/test-results/benchmark/benchmark_report.txt
+++ b/test-results/benchmark/benchmark_report.txt
@@ -1,7 +1,7 @@
 ======================================================================
 AgentKit Benchmark Report
 ======================================================================
-Timestamp:      2026-06-17T03:26:25.072956+00:00
+Timestamp:      2026-06-17T03:31:00.118497+00:00
 Version:        0.1.0
 Overall Score:  98.0%
 Summary:        50/51 tests passed (1 failed) across 7 dimensions.