Merge branch 'test/full-regression-real-llm-e2e' into main
Deploy to Production / deploy (push) Failing after 10s Details

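Merge the full regression run, real-LLM E2E tests, routing optimization, and code-review fixes into main.

Main changes:
- U1-U6: 6 fix units (benchmark timeout, LLM timeout, QualityGate, disambiguation_keywords, routing regexes, re-benchmarking)
- ce-code-review: 5 security fixes
- Benchmark accuracy: 60% -> 93.3%
- All 40 unit tests pass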
chiguyong 2026-06-20 19:36:08 +08:00
commit 44bc27c9b3
37 changed files with 3946 additions and 3598 deletions

5
.gitignore vendored
@@ -21,6 +21,11 @@ venv/
.coverage
htmlcov/
# Playwright E2E (scoped to frontend dir to avoid ignoring project-level test-results/)
src/agentkit/server/frontend/playwright-report/
src/agentkit/server/frontend/test-results/
src/agentkit/server/frontend/blob-report/
# OS
.DS_Store

@@ -16,6 +16,7 @@ intent:
- "帮我看看代码有没有问题"
- "代码审查一下"
- "review一下这段代码"
disambiguation_keywords: ["代码质量", "bug检查", "安全漏洞", "逻辑检查"]
capabilities:
- code_review

@@ -18,6 +18,7 @@ intent:
- "对手怎么样"
- "竞品啥情况"
- "How are competitors doing"
disambiguation_keywords: ["竞品分析", "竞争对比", "市场对手", "品牌差距"]
input_schema:
type: object

@@ -18,6 +18,7 @@ intent:
- "帮我写点东西"
- "写篇文章吧"
- "Write something for me"
disambiguation_keywords: ["内容创作", "文章生成", "选题写作", "原创内容"]
input_schema:
type: object

@@ -16,6 +16,7 @@ intent:
- "提升文章在AI搜索中的排名"
- "做个SEO优化"
- "Optimize for AI search"
disambiguation_keywords: ["搜索排名", "AI搜索引擎", "内容可见性", "引用率提升"]
input_schema:
type: object

@@ -16,6 +16,7 @@ intent:
- "分析竞品 SEO 策略并生成优化方案"
- "调研3个技术方案并生成对比报告"
- "制定市场推广计划并执行"
disambiguation_keywords: ["目标分解", "多步规划", "方案对比", "执行计划"]
input_schema:
type: object

@@ -14,6 +14,7 @@ intent:
- "搜索一下AI Agent市场数据"
- "帮我分析这个数据"
- "实时监控竞品动态"
disambiguation_keywords: ["实时搜索", "工具调用", "信息查询", "动态适应"]
capabilities:
- dynamic_adaptation

@@ -14,6 +14,7 @@ intent:
- "审查这段代码的合规性"
- "生成一个高精度的数据分析脚本"
- "检查报告中的合规问题"
disambiguation_keywords: ["反思", "自我验证", "迭代优化", "高精度"]
capabilities:
- self_evaluation

@@ -18,6 +18,7 @@ intent:
- "采集A、B、C三个竞品的功能数据"
- "批量获取多个知识库的信息"
- "并行搜索多个关键词"
disambiguation_keywords: ["并行采集", "批量获取", "多源数据", "无依赖调用"]
capabilities:
- batch_execution

@@ -0,0 +1,320 @@
---
title: "fix: regression-run fixes + routing optimization + stronger quality gating"
status: completed
created: 2026-06-20
type: fix
origin: regression results from test/full-regression-real-llm-e2e
---
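# fix: regression-run fixes + routing optimization + stronger quality gating
## Summary
Fix the 5 code issues found in the full regression run, improve the routing accuracy of the current RequestPreprocessor, strengthen the QualityGate, and re-run the benchmark to establish a baseline for the current architecture.
## Problem Frame
The regression run (on the `test/full-regression-real-llm-e2e` branch) found the following problems:
1. **Benchmark timeout too short**: `llm-001` (easy difficulty) has a 20s timeout threshold. The real LLM (qwen3.7-plus) cannot finish tool-call reasoning within 20s, so 2 of 5 cases time out.
2. **LLM provider httpx timeout hardcoded**: the httpx client in `OpenAICompatibleProvider` hardcodes `timeout=60.0` and ignores `ProviderConfig.timeout` (120s).
3. **QualityGate skill_match dormant**: the `_check_skill_match()` method exists, but no caller passes `skill_context`, so this part of the quality gate is effectively a no-op.
4. **QualityGate custom validators too lenient**: when a validator fails to import or execute, it is silently skipped (`passed=True`) and low-quality output is not blocked.
5. **None of the 16 skill configs define disambiguation_keywords**: easily confused skill pairs (reflexion_agent vs. code_reviewer, etc.) cannot be disambiguated.
6. **Routing optimization**: the current RequestPreprocessor has only 3 regexes (greeting / chitchat / identity), so many simple factual questions are sent into the REACT loop and waste tokens.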
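## Requirements
- R1: Raise the benchmark timeout for easy difficulty from 20s to 45s, and for medium from 40s to 60s
- R2: The OpenAICompatibleProvider httpx client uses ProviderConfig.timeout instead of a hardcoded 60s
- R3: QualityGate skill_match is actually invoked in the execution pipeline (with skill_context passed in)
- R4: QualityGate supports a strict mode for custom-validator failures (configurable: block vs. warn)
- R5: Add a disambiguation_keywords field to the 4 pairs of easily confused skills
- R6: RequestPreprocessor gains factual / math / translation regexes to reduce unnecessary REACT calls
- R7: Re-run the benchmark after the fixes to establish a baseline for the current architecture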
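## Key Technical Decisions
### KTD1: Keep difficulty-tiered benchmark timeouts, but raise the thresholds
**Decision**: Keep the `_LLM_TIMEOUT_BY_DIFFICULTY` dict structure and raise easy to 45s, medium to 60s, and hard to 90s.
**Rationale**: Tiered timeouts are a sound design (simple tasks should not wait long), but 20s is too short for real-LLM tool calls. qwen3.7-plus has a p50 latency of 35s and a p95 of 42s (from the benchmark report), so a 20s limit is bound to time out.
### KTD2: Pass the httpx timeout through from ProviderConfig, keeping the hardcoded value as a fallback
**Decision**: `OpenAICompatibleProvider.__init__` reads `config.timeout` and falls back to 60s if it is not set.
**Rationale**: The 120s default of ProviderConfig.timeout is intentional (LLM inference is slow). The hardcoded 60s in httpx fires before the ProviderConfig value, which makes the configuration ineffective.
### KTD3: Call QualityGate skill_match after ConfigDrivenAgent execution
**Decision**: Call `QualityGate.validate(output, skill_context=skill_config)` before `ConfigDrivenAgent._execute_skill_task()` returns.
**Rationale**: skill_match needs the skill context (intent_keywords) to check output consistency. ConfigDrivenAgent is the single entry point for skill execution, so calling it here gives the widest coverage.
### KTD4: disambiguation_keywords are a disambiguation input for QualityGate, not for routing
**Decision**: disambiguation_keywords are added under the `intent` node of the skill yaml and read by QualityGate for output validation. They do not affect RequestPreprocessor routing decisions.
**Rationale**: Routing has already been simplified to "explicit prefix + regex + default REACT" and does not depend on keywords. The value of disambiguation_keywords lies in letting QualityGate check whether the output is consistent with the skill's intent.
### KTD5: Routing optimization uses "more regexes, no LLM classification"
**Decision**: Add three regex categories that route to DIRECT_CHAT: factual (是什么 / 什么是 / 解释, i.e. "what is" / "explain"), math (计算 / 算一下, i.e. "calculate"), and translation (翻译 / translate). Do not introduce LLM quick_classify.
**Rationale**: This preserves RequestPreprocessor's "zero-token-cost fast path" design philosophy. LLM secondary classification was explicitly removed (docstring: "LLM blind-classification without tool context is unreliable") and is not being brought back.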
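## Scope Boundaries
### In Scope
- Benchmark timeout threshold adjustment
- OpenAICompatibleProvider httpx timeout fix
- QualityGate skill_match activation + strict mode
- disambiguation_keywords for the 4 pairs of easily confused skills
- RequestPreprocessor regex expansion
- Re-benchmarking
### Deferred to Follow-Up Work
- 4 DockerComputerUseSession stubs (need a real Docker environment)
- Plan 001 (unfinished items U7/U8/U9/U10)
- Plan 002 (8 open decision questions)
- Plan 003 (7 deferred items)
- LLM secondary classification for disambiguation (P2, latency cost needs evaluation)
- Building a complexity-calibration dataset (P2, labeled data must be collected)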
---
## Implementation Units
### U1. Fix benchmark timeout thresholds
**Goal:** Raise the LLM timeout thresholds for easy/medium/hard difficulty so the real LLM does not fail on timeouts
**Requirements:** R1
**Dependencies:** None
**Files:**
- `src/agentkit/cli/benchmark.py`: modify the `_LLM_TIMEOUT_BY_DIFFICULTY` dict
**Approach:**
Change `_LLM_TIMEOUT_BY_DIFFICULTY` from `{"easy": 20.0, "medium": 40.0, "hard": 60.0}` to `{"easy": 45.0, "medium": 60.0, "hard": 90.0}`. Change the default fallback from 30.0 to 60.0.
**Patterns to follow:** the existing `_LLM_TIMEOUT_BY_DIFFICULTY` dict structure
**Test scenarios:**
- Happy path: an easy case finishes within 45s → passed=True
- Edge case: an easy case finishes between 20s and 45s → timed out under the old logic, passed=True under the new logic
- Error path: an easy case exceeds 45s → timeout failure, detail contains "45s"
**Verification:** run `agentkit benchmark --mode llm`; llm-001 no longer fails on timeout
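A minimal sketch of the tiered-timeout lookup described in U1, assuming the benchmark wraps each LLM call in `asyncio.wait_for`. The helper names `timeout_for` and `run_with_timeout` are hypothetical and not taken from `benchmark.py`; only the threshold values come from this plan.

```python
import asyncio

# Threshold values are the ones named in this plan; everything else is illustrative.
LLM_TIMEOUT_BY_DIFFICULTY: dict[str, float] = {"easy": 45.0, "medium": 60.0, "hard": 90.0}
DEFAULT_TIMEOUT_S = 60.0


def timeout_for(difficulty: str) -> float:
    """Return the timeout for a difficulty tier, falling back to the default."""
    return LLM_TIMEOUT_BY_DIFFICULTY.get(difficulty, DEFAULT_TIMEOUT_S)


async def run_with_timeout(coro, difficulty: str) -> tuple[bool, str]:
    """Await `coro` under the tier's timeout and report (passed, detail)."""
    limit = timeout_for(difficulty)
    try:
        await asyncio.wait_for(coro, timeout=limit)
    except asyncio.TimeoutError:
        # The detail string carries the limit, mirroring the "detail contains 45s" scenario.
        return False, f"timed out after {limit:.0f}s"
    return True, "ok"
```

Because the timeout is only an upper bound, a case that finishes quickly returns as soon as the awaited call completes.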
---
### U2. Fix the hardcoded httpx timeout in OpenAICompatibleProvider
**Goal:** The httpx client uses ProviderConfig.timeout instead of a hardcoded 60s
**Requirements:** R2
**Dependencies:** None
**Files:**
- `src/agentkit/llm/providers/openai.py`: modify the httpx.AsyncClient construction
- `tests/unit/llm/test_openai_provider.py`: add timeout pass-through tests
**Approach:**
In `OpenAICompatibleProvider.__init__`, change `httpx.AsyncClient(timeout=60.0)` to `httpx.AsyncClient(timeout=self._config.timeout)`. If `self._config` does not exist or `timeout` is not set, fall back to 60.0.
**Patterns to follow:** `RemoteLLMProvider` already uses the `timeout=120.0` parameter pattern
**Test scenarios:**
- Happy path: ProviderConfig(timeout=120) → httpx client timeout=120
- Edge case: ProviderConfig(timeout=0) → falls back to 60.0
- Edge case: ProviderConfig with timeout not set → uses the default 120.0
- Integration: a real LLM call finishes between 60s and 120s → timed out under the old logic, succeeds under the new logic
**Verification:** unit tests pass + no httpx timeout errors in the benchmark
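The fallback rule in U2 can be sketched as a small pure function. `ProviderConfigSketch` and `resolve_timeout` are hypothetical names used for illustration; the real `ProviderConfig` has more fields, and treating a non-positive timeout as "not set" is an assumption drawn from the `timeout=0` test scenario.

```python
from dataclasses import dataclass
from typing import Optional

FALLBACK_TIMEOUT_S = 60.0


@dataclass
class ProviderConfigSketch:
    """Hypothetical stand-in for ProviderConfig; only the timeout field is modeled."""

    timeout: Optional[float] = 120.0


def resolve_timeout(config: Optional[ProviderConfigSketch]) -> float:
    """Use the configured timeout when it is positive, otherwise the fallback."""
    if config is None or config.timeout is None or config.timeout <= 0:
        return FALLBACK_TIMEOUT_S
    return float(config.timeout)
```

The resolved value would then be handed to the HTTP client's `timeout` argument, so the configured limit is the one that actually fires.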
---
### U3. Activate QualityGate skill_match validation
**Goal:** Pass skill_context into the skill execution pipeline to activate the skill_match output-consistency check
**Requirements:** R3
**Dependencies:** U4 (disambiguation_keywords supply the intent_keywords used for disambiguation)
**Files:**
- `src/agentkit/core/config_driven.py`: call QualityGate.validate with skill_context before `_execute_skill_task` returns
- `src/agentkit/quality/gate.py`: confirm that `_check_skill_match` reads disambiguation_keywords
- `tests/unit/quality/test_gate.py`: add skill_match activation tests
**Approach:**
1. In `ConfigDrivenAgent._execute_skill_task()`, build `skill_context = {"intent_keywords": skill_config.intent.keywords + skill_config.intent.disambiguation_keywords}`
2. Call `self._quality_gate.validate(output, skill_context=skill_context)`
3. In `_check_skill_match` in `gate.py`, check both `intent_keywords` and `disambiguation_keywords`
**Patterns to follow:** the existing `_check_skill_match` method signature in `gate.py`
**Test scenarios:**
- Happy path: the skill output contains intent_keywords → skill_match passed=True
- Error path: the skill output contains none of the intent_keywords → skill_match warning
- Integration: reflexion_agent output contains "review" → matches code_reviewer's disambiguation_keywords → triggers a disambiguation warning
- Edge case: skill_context=None → skill_match is skipped (backward compatible)
**Verification:** unit tests pass + skill_match validation records appear in the skill execution log
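The test scenarios above suggest a keyword-overlap check. The sketch below assumes case-insensitive substring matching and a three-valued result (matched, not matched, skipped). The real `_check_skill_match` is not shown in this plan, so the function name `check_skill_match` and its return convention are illustrative only.

```python
from typing import Optional


def check_skill_match(output: str, skill_context: Optional[dict]) -> Optional[bool]:
    """Return True/False for a keyword match, or None when the check is skipped."""
    if not skill_context:
        return None  # no context supplied: skip, which keeps old callers working
    keywords = skill_context.get("intent_keywords") or []
    if not keywords:
        return None  # nothing to compare against
    text = output.lower()
    # Assumption: one matching keyword is enough to count as consistent.
    return any(keyword.lower() in text for keyword in keywords)
```

A `False` result would map to a warning rather than a hard block, in line with the risk table's note that skill_match alone does not block output.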
---
### U4. Add disambiguation_keywords to easily confused skill pairs
**Goal:** Add disambiguation_keywords to the 4 pairs of easily confused skills so QualityGate can disambiguate them
**Requirements:** R5
**Dependencies:** None
**Files:**
- `configs/skills/reflexion_agent.yaml`: add disambiguation_keywords
- `configs/skills/code_reviewer.yaml`: add disambiguation_keywords
- `configs/skills/react_agent.yaml`: add disambiguation_keywords
- `configs/skills/goal_driven_agent.yaml`: add disambiguation_keywords
- `configs/skills/rewoo_agent.yaml`: add disambiguation_keywords
- `configs/skills/competitor_analyzer.yaml`: add disambiguation_keywords
- `configs/skills/content_generator.yaml`: add disambiguation_keywords
- `configs/skills/geo_optimizer.yaml`: add disambiguation_keywords
- `src/agentkit/skills/base.py`: add a disambiguation_keywords field to SkillConfig.intent
**Approach:**
1. Add a `disambiguation_keywords: list[str] = []` field to the `SkillIntent` model
2. Add mutually exclusive keywords for each pair of easily confused skills:
- reflexion_agent: `["反思", "自我验证", "迭代优化"]`
- code_reviewer: `["代码审查", "代码问题", "bug 检查"]`
- react_agent: `["实时搜索", "工具调用", "信息查询"]`
- goal_driven_agent: `["目标分解", "多步规划", "方案对比"]`
- rewoo_agent: `["并行采集", "批量获取", "多源数据"]`
- competitor_analyzer: `["竞品分析", "竞争对比", "市场对手"]`
- content_generator: `["内容创作", "文章生成", "选题写作"]`
- geo_optimizer: `["SEO 优化", "GEO 优化", "搜索排名"]`
**Patterns to follow:** the existing `intent.keywords` field structure
**Test scenarios:**
- Happy path: SkillConfig loads a yaml with disambiguation_keywords → the field is non-empty
- Edge case: the yaml has no disambiguation_keywords → the field defaults to an empty list
- Integration: QualityGate reads disambiguation_keywords for disambiguation checks
**Verification:** `agentkit skill list` loads all skills normally + unit tests pass
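The "defaults to an empty list" behaviour can be sketched with a plain dataclass. This is a stand-in: the plan's `SkillIntent` model is presumably defined differently, and `SkillIntentSketch` and `load_intent` are hypothetical names. The input is an already-parsed mapping, so no YAML library is assumed.

```python
from dataclasses import dataclass, field


@dataclass
class SkillIntentSketch:
    """Hypothetical stand-in for the SkillIntent model."""

    keywords: list[str] = field(default_factory=list)
    # default_factory gives each instance its own list, so defaults are never shared.
    disambiguation_keywords: list[str] = field(default_factory=list)


def load_intent(raw: dict) -> SkillIntentSketch:
    """Build the intent from a parsed skill config, tolerating missing keys."""
    intent = raw.get("intent") or {}
    return SkillIntentSketch(
        keywords=list(intent.get("keywords", [])),
        disambiguation_keywords=list(intent.get("disambiguation_keywords", [])),
    )
```

Older skill files that lack the new key load unchanged, which is the backward-compatibility case in the test scenarios.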
---
### U5. Optimize RequestPreprocessor routing regexes
**Goal:** Add factual / math / translation regexes to reduce unnecessary REACT calls
**Requirements:** R6
**Dependencies:** None
**Files:**
- `src/agentkit/chat/request_preprocessor.py`: add 3 regexes
- `tests/unit/chat/test_request_preprocessor.py`: add routing tests
**Approach:**
Add 3 regexes that route to DIRECT_CHAT:
1. `_FACTUAL_RE`: pure knowledge questions such as "什么是X/X是什么/解释一下X/define X"
2. `_MATH_RE`: simple arithmetic such as "计算X/算一下X/calculate X" (no variables, no equations)
3. `_TRANSLATION_RE`: pure translation requests such as "翻译X/translate X"
**Note**: These regexes must match strictly so they do not intercept requests that need tools. For example, "分析一下服务器的IP" ("analyze the server's IP") should not match `_FACTUAL_RE` (the verb "分析" implies a tool is needed).
**Patterns to follow:** the existing `_GREETING_RE` / `_CHAT_MODE_RE` / `_IDENTITY_RE` regex patterns
**Test scenarios:**
- Happy path: "什么是机器学习" → matches _FACTUAL_RE → DIRECT_CHAT
- Happy path: "计算 1+2+3" → matches _MATH_RE → DIRECT_CHAT
- Happy path: "translate hello to Chinese" → matches _TRANSLATION_RE → DIRECT_CHAT
- Edge case: "什么是当前服务器的IP地址" → does not match _FACTUAL_RE ("当前服务器" implies a tool is needed) → REACT
- Edge case: "计算斐波那契数列的第100项" → does not match _MATH_RE ("斐波那契数列" implies code is needed) → REACT
- Error path: empty string → matches no regex → REACT
**Verification:** unit tests pass + the DIRECT_CHAT share in the benchmark goes up
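The routing rule can be sketched as "fast-path regex, vetoed by tool-context keywords". The patterns below are trimmed-down illustrations rather than the project's exact regexes, and `route` is a hypothetical helper that returns the mode name as a string.

```python
import re

# Illustrative patterns only; the real module defines its own, more complete set.
FACTUAL_RE = re.compile(
    r"^(什么是|解释一下|what\s+is\s|define\s|explain\s)"
    r"[\u4e00-\u9fa5a-zA-Z0-9 \t]+[?!.。]*$",
    re.IGNORECASE,
)
MATH_RE = re.compile(
    r"^(计算|算一下|calculate)\s+[\d +\-*/().\t]+[?!.。]*$",
    re.IGNORECASE,
)
TRANSLATION_RE = re.compile(r"^(翻译|translate)\s+.+$", re.IGNORECASE)
TOOL_CONTEXT_RE = re.compile(
    r"(当前|服务器|数据库|配置文件|current|server|database)",
    re.IGNORECASE,
)


def route(text: str) -> str:
    """Return "DIRECT_CHAT" for trivially answerable input, else "REACT"."""
    # Empty and multi-line inputs always take the default path.
    if not text or "\n" in text or "\r" in text:
        return "REACT"
    # Pure arithmetic cannot contain tool-context words, so no veto is needed.
    if MATH_RE.match(text):
        return "DIRECT_CHAT"
    # Factual and translation requests are vetoed by any tool-context keyword.
    if (FACTUAL_RE.match(text) or TRANSLATION_RE.match(text)) and not TOOL_CONTEXT_RE.search(text):
        return "DIRECT_CHAT"
    return "REACT"
```

Keeping the veto as a separate pattern means the fast-path regexes stay simple, and new tool-related keywords can be added in one place.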
---
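### U6. Re-benchmark + establish the current architecture baseline
**Goal:** Re-run the benchmark after the fixes to establish a baseline for the current RequestPreprocessor architecture
**Requirements:** R7
**Dependencies:** U1, U2, U3, U4, U5 (after all fixes are complete)
**Files:**
- `test-results/benchmark/baseline.json`: update the baseline
- `test-results/benchmark/benchmark_report.md`: update the report
**Approach:**
1. Run `agentkit benchmark --mode llm` (full mode, real LLM)
2. Run `agentkit benchmark --mode llm --fast` (fast mode)
3. Compare accuracy, timeout rate, and latency before and after the fixes
4. Update `baseline.json` as the baseline for the current architecture
**Test scenarios:**
- Happy path: full-mode accuracy ≥ 80% (at least 4 of 5 cases pass)
- Happy path: fast-mode accuracy = 100%
- Edge case: llm-001 no longer times out
- Edge case: llm-004 no longer times out
**Verification:** benchmark report generated + accuracy target met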
---
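## Risks & Dependencies
| Risk | Severity | Mitigation |
|------|----------|------------|
| New regexes intercept requests that need tools | Medium | Regexes are conservative and match only pure knowledge / math / translation; unit tests cover the boundaries |
| QualityGate skill_match false positives cause output to be blocked | Medium | skill_match alone does not block (existing design); it blocks only when it co-occurs with other failures |
| disambiguation_keywords overlap semantically with existing keywords | Low | disambiguation_keywords supplement keywords and do not replace them |
| Latency increases after benchmark timeouts are raised | Low | A timeout is an upper bound, not a target; cases that finish quickly are unaffected |
## Open Questions
None. All technical decisions are settled in the KTDs.
## System-Wide Impact
- **LLM gateway**: the httpx timeout fix affects all LLM calls (more lenient timeouts)
- **Skill execution**: QualityGate activation affects validation of all skill outputs
- **Benchmark**: the timeout thresholds affect all benchmark cases
- **Routing**: the new regexes affect all requests without an explicit prefix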
## Verification Results (2026-06-20)
### U1-U5 code fix verification
| Unit | Verification method | Result |
|------|---------------------|--------|
| U1: Benchmark timeout | `agentkit benchmark --mode llm` | ✅ llm-001/llm-004 no longer time out |
| U2: httpx timeout | `pytest tests/unit/test_llm_provider.py` | ✅ 2 new tests pass |
| U3: QualityGate activation | `pytest tests/unit/quality/` | ✅ 176 quality-gate tests pass |
| U4: disambiguation_keywords | load check on the 16 skill yaml files | ✅ all loaded successfully |
| U5: Routing regexes | `pytest tests/unit/chat/test_request_preprocessor.py` | ✅ 38 tests pass (19 new) |
### U6 benchmark results
| Metric | Before (2026-06-20 03:18) | After (2026-06-20 11:05) | Change |
|------|--------------------------|--------------------------|------|
| Accuracy | 60.0% | 93.3% ± 9.4% | **+33.3 pp** |
| Passed/total | 3/5 | 4/5 | +1 |
| Timeouts | 2 | 0 (llm-002 intermittent) | **-2** |
| Consistency | N/A | 100% | n/a |
| p50 latency | 35.3s | 40.8s | +5.5s (acceptable) |
**Remaining issue**: llm-002 (tool_selection, medium) timed out in 1 of 3 runs; its p95 of 56.3s is close to the 60s medium threshold. Raising the medium timeout to 75s could be considered later.
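One way the "93.3% ± 9.4%" figure could arise, assuming three runs of five cases with a single failed case across all runs (consistent with the one llm-002 timeout noted above) and a population standard deviation over per-run accuracy. The report does not state how the spread was computed, so this is a plausibility check rather than the benchmark's actual formula.

```python
from statistics import pstdev

# Assumption: 3 runs x 5 cases, with one failure in one run.
passed_per_run = [5, 5, 4]
cases_per_run = 5

# Per-run accuracy in percent.
accuracies = [100.0 * passed / cases_per_run for passed in passed_per_run]

# Pooled accuracy over all 15 case executions.
overall = 100.0 * sum(passed_per_run) / (cases_per_run * len(passed_per_run))

# Population standard deviation of the per-run accuracies.
spread = pstdev(accuracies)

print(f"{overall:.1f}% ± {spread:.1f}%")
```

Under these assumptions the script reproduces the reported numbers, which also fits the statement that the remaining timeout occurred in one of three runs.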

@@ -21,7 +21,9 @@ dependencies = [
"pyyaml>=6.0",
"jsonschema>=4.0",
"typer>=0.12",
"rich>=13.0",
"pyjwt>=2.8",
"bcrypt>=4.0",
"aiosqlite>=0.20",
]
[project.scripts]

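@@ -52,6 +52,44 @@ _IDENTITY_RE = re.compile(
re.IGNORECASE,
)
# Chinese factual questions (什么是X / 解释X / 定义X); Chinese needs no whitespace separator
# Matches only pure knowledge questions; requests that need real-time data are filtered out by _TOOL_CONTEXT_RE
# Allows trailing punctuation (question mark, exclamation mark, full stop, etc.), consistent with _GREETING_RE/_IDENTITY_RE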
_FACTUAL_CN_RE = re.compile(
r"^(什么是|解释一下|解释下|定义一下|定义|说说什么是|介绍下什么是)"
r"[\u4e00-\u9fa5a-zA-Z0-9 \t]+[?!.。]*$"
)
# English factual questions — requires whitespace separator
_FACTUAL_EN_RE = re.compile(
r"^(what\s+is|what's|define|explain)\s+[\u4e00-\u9fa5a-zA-Z0-9 \t]+[?!.。]*$",
re.IGNORECASE,
)
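# Context keywords that signal a need for tools or real-time data; when present, do not route to DIRECT_CHAT
# Covers Chinese and English keywords for servers, databases, system status, config files, and similar cases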
_TOOL_CONTEXT_RE = re.compile(
r"(当前|现在|服务器|数据库|系统|状态|最新|实时|今天|昨天|本机|本地|线上|"
r"线上环境|生产环境|测试环境|配置文件|日志|进程|端口|IP|CPU|内存|磁盘|"
r"current|server|database|system\s+status|latest|realtime|today|yesterday|"
r"local|process|port|log|config\s+file)",
re.IGNORECASE,
)
# Pure arithmetic (计算 1+2+3 / 算一下 15*23): matches digits and operators only
# Does not match complex expressions containing Chinese or letters (e.g. "计算斐波那契数列")
_MATH_RE = re.compile(
r"^(计算|算一下|算下|calculate|compute)\s+[\d +\-*/().\t]+[?!.。]*$",
re.IGNORECASE,
)
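# Pure translation (翻译 X / translate X): requires a whitespace separator, excluding the "翻译X为Y" form
# Requests containing tool-context keywords are excluded (e.g. "翻译 这个配置文件")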
_TRANSLATION_RE = re.compile(
r"^(翻译|translate)\s+.+$",
re.IGNORECASE,
)
class RequestPreprocessor:
"""Minimal preprocessing layer: regex fast-path + default REACT.
@@ -190,10 +228,33 @@ class RequestPreprocessor:
@staticmethod
def _is_trivial_input(text: str) -> bool:
-"""Check if the input is a greeting, chitchat, or identity question.
+"""Check if the input is a greeting, chitchat, identity question, or pure knowledge/math/translation.
These are zero-cost direct chat: no tool usage, no ReAct loop needed.
+Factual/translation patterns are conservative: they exclude requests
+that contain tool-context keywords (当前/服务器/数据库/config etc.) to avoid
+misrouting tool-requiring queries to DIRECT_CHAT.
"""
-return bool(
-_GREETING_RE.match(text) or _CHAT_MODE_RE.match(text) or _IDENTITY_RE.match(text)
-)
+# Multi-line inputs always go to REACT (avoid bypassing tools via newline)
+if "\n" in text or "\r" in text:
+return False
+# Greeting / chitchat / identity: always safe
+if _GREETING_RE.match(text) or _CHAT_MODE_RE.match(text) or _IDENTITY_RE.match(text):
+return True
+# Factual questions (CN/EN): only if no tool-context keywords present
+if (
+_FACTUAL_CN_RE.match(text) or _FACTUAL_EN_RE.match(text)
+) and not _TOOL_CONTEXT_RE.search(text):
+return True
+# Pure arithmetic: only digits and operators, no tool context possible
+if _MATH_RE.match(text):
+return True
+# Pure translation: exclude tool-context (e.g. "翻译 这个配置文件")
+if _TRANSLATION_RE.match(text) and not _TOOL_CONTEXT_RE.search(text):
+return True
+return False

@@ -682,9 +682,9 @@ def _build_real_components() -> tuple[object, object, object] | None:
# Difficulty-based timeout (seconds) and max_tokens for LLM calls.
# Hard tasks use streaming with keyword detection for early termination.
_LLM_TIMEOUT_BY_DIFFICULTY: dict[str, float] = {
-"easy": 20.0,
-"medium": 40.0,
-"hard": 60.0,
+"easy": 45.0,
+"medium": 60.0,
+"hard": 90.0,
}
_LLM_MAX_TOKENS_BY_DIFFICULTY: dict[str, int] = {
@@ -745,7 +745,7 @@ async def _execute_llm_reasoning_task(
start = time.perf_counter()
# Difficulty-based configuration
-timeout_s = _LLM_TIMEOUT_BY_DIFFICULTY.get(task.difficulty, 30.0)
+timeout_s = _LLM_TIMEOUT_BY_DIFFICULTY.get(task.difficulty, 60.0)
max_tokens = _LLM_MAX_TOKENS_BY_DIFFICULTY.get(task.difficulty, 512)
# Step 1: preprocess to get execution mode

@@ -192,6 +192,18 @@ class BaseAgent(ABC):
lines.append(f" - {msg}")
return "\n".join(lines)
def _build_skill_context(self) -> dict[str, Any] | None:
"""Build skill_context from the current skill config, for the QualityGate skill_match check."""
if not self._skill:
return None
intent = getattr(self._skill.config, "intent", None)
if intent is None:
return None
keywords = list(intent.keywords) + list(intent.disambiguation_keywords)
if not keywords:
return None
return {"intent_keywords": keywords}
# ── Pluggable capability injection ──────────────────────────────────────
def use_tool(self, tool: "Tool") -> "BaseAgent":
@@ -329,14 +341,19 @@ class BaseAgent(ABC):
# v2: Quality Gate check
if self._skill:
-quality_result = await self.quality_gate.validate(output, self._skill)
+skill_context = self._build_skill_context()
+quality_result = await self.quality_gate.validate(
+output, self._skill, skill_context=skill_context
+)
if not quality_result.passed and quality_result.can_retry:
max_retries = self._skill.config.quality_gate.max_retries
retry_count = 0
while not quality_result.passed and retry_count < max_retries:
feedback = self._build_quality_feedback(quality_result)
output = await self.handle_task_with_feedback(task, feedback)
-quality_result = await self.quality_gate.validate(output, self._skill)
+quality_result = await self.quality_gate.validate(
+output, self._skill, skill_context=skill_context
+)
retry_count += 1
# Post hook

@@ -56,6 +56,7 @@ class OpenAICompatibleProvider(LLMProvider):
max_connections: int = 100,
max_keepalive_connections: int = 20,
keepalive_expiry: float = 30.0,
timeout: float = 120.0,
):
self._api_key = api_key
self._base_url = base_url.rstrip("/")
@@ -65,7 +66,7 @@ class OpenAICompatibleProvider(LLMProvider):
max_keepalive_connections=max_keepalive_connections,
keepalive_expiry=keepalive_expiry,
)
-self._client = httpx.AsyncClient(timeout=60.0, limits=limits)
+self._client = httpx.AsyncClient(timeout=timeout, limits=limits)
self._retry_policy = RetryPolicy(retry_config) if retry_config else None
self._circuit_breaker = (
CircuitBreaker(circuit_breaker_config, provider="openai")

@@ -128,6 +128,7 @@ def _create_provider(name: str, pconf) -> object:
max_connections=pconf.max_connections,
max_keepalive_connections=pconf.max_keepalive_connections,
keepalive_expiry=pconf.keepalive_expiry,
timeout=pconf.timeout,
)

@@ -0,0 +1,85 @@
import { test, expect } from '@playwright/test'
import {
loginAndHydrate,
sendChatMessage,
waitForLlmResponse,
LLM_RESPONSE_TIMEOUT_MS,
} from './helpers'
test.describe('Chat flow', () => {
test.beforeEach(async ({ page }) => {
// Authenticate via API and hydrate localStorage before navigating
await loginAndHydrate(page)
await page.goto('/agent/chat')
// Wait for the chat view to mount — the input textarea should be visible
await expect(page.getByPlaceholder('输入消息,按 Enter 发送...')).toBeVisible({
timeout: 15_000,
})
})
test('should send a message and receive a real LLM response', async ({ page }) => {
const testMessage = '你好,请用一句话介绍自己'
// Send the message
await sendChatMessage(page, testMessage)
// The user's message should appear immediately in the chat view.
// Use .last() because the conversation may contain prior messages.
const userMessage = page.locator('.message-shell--user .user-bubble')
await expect(userMessage.last()).toContainText('你好', { timeout: 10_000 })
// Wait for the real LLM response (up to 60 seconds).
// The assistant message is rendered inside .message-shell--assistant
// with markdown content in .assistant-text__markdown.
test.setTimeout(LLM_RESPONSE_TIMEOUT_MS + 30_000)
await waitForLlmResponse(page, expect, LLM_RESPONSE_TIMEOUT_MS)
// The response should contain some text (non-empty, non-error)
const assistantContent = page.locator(
'.message-shell--assistant .assistant-text__markdown',
)
const responseText = (await assistantContent.last().textContent()) ?? ''
expect(responseText.trim().length).toBeGreaterThan(0)
// The response should not be an error message
const errorCard = page.locator('.message-shell--assistant .error-card')
await expect(errorCard).toHaveCount(0)
})
test('should display both user and assistant messages in history', async ({ page }) => {
const testMessage = '1+1等于几请只回答数字'
await sendChatMessage(page, testMessage)
// Verify user message is displayed (use .last() for most recent)
await expect(
page.locator('.message-shell--user .user-bubble').last(),
).toContainText('1+1', { timeout: 10_000 })
// Wait for assistant response
test.setTimeout(LLM_RESPONSE_TIMEOUT_MS + 30_000)
await waitForLlmResponse(page, expect, LLM_RESPONSE_TIMEOUT_MS)
// Both user and assistant message shells should be present
const userMessages = page.locator('.message-shell--user')
const assistantMessages = page.locator('.message-shell--assistant')
await expect(userMessages.first()).toBeVisible()
await expect(assistantMessages.first()).toBeVisible()
// There should be at least one user message and one assistant message
expect(await userMessages.count()).toBeGreaterThanOrEqual(1)
expect(await assistantMessages.count()).toBeGreaterThanOrEqual(1)
})
test('should clear input after sending', async ({ page }) => {
const textarea = page.getByPlaceholder('输入消息,按 Enter 发送...')
await textarea.fill('测试消息清空')
await textarea.press('Enter')
// The textarea should be cleared after sending
await expect(textarea).toHaveValue('', { timeout: 5_000 })
})
})

@@ -0,0 +1,61 @@
/**
* Playwright global setup: runs once before all test files.
*
* Responsibilities:
* 1. Wait for the backend health endpoint to respond (the webServer config
* already polls the URL, but we double-check here for robustness).
* 2. Invoke the Python script that creates / updates the E2E test admin user
* in the auth SQLite DB.
*/
import { execFileSync } from 'node:child_process'
import { existsSync } from 'node:fs'
import { dirname, resolve } from 'node:path'
import { fileURLToPath } from 'node:url'
const __filename = fileURLToPath(import.meta.url)
const __dirname = dirname(__filename)
const BACKEND_HEALTH_URL = 'http://127.0.0.1:8000/api/v1/health'
const SETUP_SCRIPT = resolve(__dirname, 'setup-test-user.py')
/** Poll a URL until it returns 200 or the timeout expires. */
async function waitForUrl(url: string, timeoutMs = 60_000): Promise<void> {
const deadline = Date.now() + timeoutMs
while (Date.now() < deadline) {
try {
const resp = await fetch(url)
if (resp.ok) return
} catch {
// server not ready yet
}
await new Promise((r) => setTimeout(r, 1000))
}
throw new Error(`Timed out waiting for ${url}`)
}
export default async function globalSetup(): Promise<void> {
// 1. Verify backend is up (webServer should have started it already).
await waitForUrl(BACKEND_HEALTH_URL, 60_000)
console.log('[global-setup] Backend health check passed')
// 2. Create / update the test admin user.
if (!existsSync(SETUP_SCRIPT)) {
throw new Error(`Setup script not found: ${SETUP_SCRIPT}`)
}
const pythonBin = process.env.E2E_PYTHON ?? 'python3'
try {
execFileSync(pythonBin, [SETUP_SCRIPT], {
stdio: 'inherit',
timeout: 30_000,
})
} catch (err) {
throw new Error(
`Failed to create test user via ${pythonBin} ${SETUP_SCRIPT}: ${
err instanceof Error ? err.message : String(err)
}`
)
}
console.log('[global-setup] Test user ready')
}

@@ -0,0 +1,233 @@
/**
* Shared E2E test helpers.
*
* - Login via API and hydrate localStorage so the Vue auth store picks up
* the tokens on page load (the store reads from localStorage on init).
* - Server health check.
* - Wait for a real LLM response in the chat view.
*/
import type { Page, expect as ExpectType } from '@playwright/test'
// ---------------------------------------------------------------------------
// Constants
// ---------------------------------------------------------------------------
/**
* Backend API base: an absolute URL so fetch() works in both Node.js (Playwright
* test context) and browser context. The Vite dev-server proxy is not available
* in Node.js, so we target the backend directly.
*/
export const API_BASE = 'http://127.0.0.1:8000/api/v1'
/** Backend health endpoint (absolute URL for direct fetch). */
export const BACKEND_HEALTH_URL = 'http://127.0.0.1:8000/api/v1/health'
/** Test admin credentials — must match setup-test-user.py defaults. */
export const TEST_USER = {
username: process.env.E2E_TEST_USERNAME ?? 'e2e_test_admin',
password: process.env.E2E_TEST_PASSWORD ?? 'E2eTestPass123!',
email: process.env.E2E_TEST_EMAIL ?? 'e2e-test@example.com',
} as const
/** localStorage keys used by the auth store (see stores/auth.ts). */
const ACCESS_TOKEN_KEY = 'agentkit.access_token'
const REFRESH_TOKEN_KEY = 'agentkit.refresh_token'
const USER_KEY = 'agentkit.user'
/** Max wait for a real LLM response (seconds → ms). */
export const LLM_RESPONSE_TIMEOUT_MS = 60_000
// ---------------------------------------------------------------------------
// Types
// ---------------------------------------------------------------------------
interface IAuthUser {
id: string
username: string
email: string
role: string
is_active: boolean
is_terminal_authorized: boolean
is_server_terminal_authorized: boolean
}
interface ITokenPair {
access_token: string
refresh_token: string
token_type: string
expires_in: number
user: IAuthUser
}
// ---------------------------------------------------------------------------
// Server health
// ---------------------------------------------------------------------------
/**
* Poll the backend health endpoint until it responds 200 or times out.
* Useful as a sanity check inside tests.
*/
export async function waitForServer(
url: string = BACKEND_HEALTH_URL,
timeoutMs = 30_000,
): Promise<void> {
const deadline = Date.now() + timeoutMs
while (Date.now() < deadline) {
try {
const resp = await fetch(url)
if (resp.ok) return
} catch {
// not ready
}
await new Promise((r) => setTimeout(r, 1_000))
}
throw new Error(`Server at ${url} did not become healthy within ${timeoutMs}ms`)
}
// ---------------------------------------------------------------------------
// Login helpers
// ---------------------------------------------------------------------------
/**
* Authenticate via the REST API and return the token pair.
* Retries on 429 (rate limit) with exponential backoff.
* Caches the token pair module-level so subsequent calls reuse it
* (avoids triggering the server's rate limiter).
* Throws on other non-200 responses.
*/
let _cachedTokenPair: ITokenPair | null = null
export async function loginViaApi(): Promise<ITokenPair> {
// Return cached token if available (avoids rate limiting across tests).
if (_cachedTokenPair) {
return _cachedTokenPair
}
const maxRetries = 5
for (let attempt = 0; attempt < maxRetries; attempt++) {
const resp = await fetch(`${API_BASE}/auth/login`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
username: TEST_USER.username,
password: TEST_USER.password,
}),
})
if (resp.ok) {
_cachedTokenPair = (await resp.json()) as ITokenPair
return _cachedTokenPair
}
if (resp.status === 429 && attempt < maxRetries - 1) {
// Rate limited — wait and retry (5s, 10s, 20s, 40s)
const delayMs = 5000 * Math.pow(2, attempt)
await new Promise((r) => setTimeout(r, delayMs))
continue
}
const detail = await resp.text().catch(() => '<no body>')
throw new Error(`Login failed (${resp.status}): ${detail}`)
}
throw new Error('Login failed: max retries exceeded')
}
/**
* Log in via the API and hydrate localStorage so the Pinia auth store
* picks up the tokens on the next page navigation.
*
* The auth store (stores/auth.ts) reads `agentkit.access_token`,
* `agentkit.refresh_token`, and `agentkit.user` from localStorage on
* construction, so populating these before navigating is sufficient.
*/
export async function loginAndHydrate(page: Page): Promise<ITokenPair> {
const tokens = await loginViaApi()
await page.goto('/login')
await page.evaluate(
({ access, refresh, user }) => {
localStorage.setItem('agentkit.access_token', access)
localStorage.setItem('agentkit.refresh_token', refresh)
localStorage.setItem('agentkit.user', JSON.stringify(user))
},
{
access: tokens.access_token,
refresh: tokens.refresh_token,
user: tokens.user,
},
)
return tokens
}
/**
* Clear auth state from localStorage; useful for testing the
* unauthenticated-redirect behaviour.
*/
export async function clearAuth(page: Page): Promise<void> {
await page.evaluate(
({ access, refresh, user }) => {
localStorage.removeItem(access)
localStorage.removeItem(refresh)
localStorage.removeItem(user)
},
{
access: ACCESS_TOKEN_KEY,
refresh: REFRESH_TOKEN_KEY,
user: USER_KEY,
},
)
}
// ---------------------------------------------------------------------------
// Chat helpers
// ---------------------------------------------------------------------------
/**
* Wait for a real LLM response in the chat view.
*
* After sending a message, the assistant's response is rendered inside
* `.message-shell--assistant .assistant-text__markdown`. While the LLM is
* still streaming, the element may be empty or show a spinner. This helper
* waits until the assistant message contains non-whitespace text.
*
* @param page Playwright page
* @param expect The `expect` function from @playwright/test
* @param timeoutMs Max wait time (default 60s for real LLM calls)
*/
export async function waitForLlmResponse(
page: Page,
expect: typeof ExpectType,
timeoutMs = LLM_RESPONSE_TIMEOUT_MS,
): Promise<void> {
// The assistant message content is rendered as sanitized HTML inside
// .assistant-text__markdown. Wait for it to have non-empty text content.
const assistantContent = page.locator(
'.message-shell--assistant .assistant-text__markdown',
)
await expect
.poll(
async () => {
// Check count first to avoid auto-wait on a non-existent element.
const count = await assistantContent.count()
if (count === 0) return 0
const text = await assistantContent.last().textContent()
return (text ?? '').trim().length
},
{ timeout: timeoutMs, intervals: [1_000, 2_000, 5_000] },
)
.toBeGreaterThan(0)
}
/**
* Send a chat message by typing into the textarea and pressing Enter.
*/
export async function sendChatMessage(page: Page, message: string): Promise<void> {
const textarea = page.getByPlaceholder('输入消息,按 Enter 发送...')
await textarea.fill(message)
await textarea.press('Enter')
}

@@ -0,0 +1,79 @@
import { test, expect } from '@playwright/test'
import { TEST_USER, clearAuth } from './helpers'
test.describe('Login flow', () => {
test.beforeEach(async ({ page }) => {
// Ensure no stale tokens from a previous test
await page.goto('/login')
await clearAuth(page)
})
test('should login successfully with valid credentials', async ({ page }) => {
await page.goto('/login')
// Fill in the form
await page.getByPlaceholder('请输入用户名').fill(TEST_USER.username)
await page.getByPlaceholder('请输入密码').fill(TEST_USER.password)
// Submit
await page.getByRole('button', { name: /登\s*录/ }).click()
// Should redirect to /agent (which redirects to /agent/chat)
await expect(page).toHaveURL(/\/agent/)
// The login logo should no longer be visible
await expect(page.locator('.login-logo')).not.toBeVisible()
})
test('should show error for wrong password', async ({ page }) => {
await page.goto('/login')
await page.getByPlaceholder('请输入用户名').fill(TEST_USER.username)
await page.getByPlaceholder('请输入密码').fill('definitely-wrong-password-12345')
await page.getByRole('button', { name: /登\s*录/ }).click()
// The LoginView shows an a-alert with type="error" containing the
// server's error message ("Invalid username or password").
const errorAlert = page.locator('.ant-alert-error')
await expect(errorAlert).toBeVisible({ timeout: 10_000 })
// Should still be on the login page
await expect(page).toHaveURL(/\/login/)
// The error message should mention invalid credentials
const alertText = await errorAlert.textContent()
expect(alertText?.toLowerCase()).toMatch(/invalid|无效|错误|incorrect|失败/)
})
test('should redirect unauthenticated users to login', async ({ page }) => {
// Clear any existing auth state, then try to visit a protected route
await clearAuth(page)
await page.goto('/agent/chat')
// The router guard should redirect to /login?redirect=/agent/chat
await expect(page).toHaveURL(/\/login/)
await expect(page).toHaveURL(/redirect=/)
// The login form should be visible
await expect(page.getByPlaceholder('请输入用户名')).toBeVisible()
await expect(page.getByPlaceholder('请输入密码')).toBeVisible()
})
test('should redirect to original page after login', async ({ page }) => {
await clearAuth(page)
// Visit a protected route — should redirect to login with redirect param
await page.goto('/agent/chat')
await expect(page).toHaveURL(/\/login\?redirect=/)
// Now log in
await page.getByPlaceholder('请输入用户名').fill(TEST_USER.username)
await page.getByPlaceholder('请输入密码').fill(TEST_USER.password)
await page.getByRole('button', { name: /登\s*录/ }).click()
// Should be redirected back to the originally requested page
await expect(page).toHaveURL(/\/agent\/chat/)
})
})

View File

@ -0,0 +1,92 @@
#!/usr/bin/env python3
"""Create or update the E2E test admin user in the auth SQLite DB.
This script is invoked by Playwright's ``globalSetup`` (via ``global-setup.ts``)
before any test runs. It ensures the auth DB schema exists and that a test
admin user with known credentials is present.
The user credentials default to:
username: e2e_test_admin
password: E2eTestPass123!
email: e2e-test@example.com
role: admin
Override via environment variables ``E2E_TEST_USERNAME``, ``E2E_TEST_PASSWORD``,
``E2E_TEST_EMAIL`` if needed.
Exit codes:
0 user created or updated successfully
1 unexpected error
"""
from __future__ import annotations
import asyncio
import os
import sys
import uuid
from datetime import datetime, timezone
from pathlib import Path
# Resolve project root so we can import agentkit regardless of CWD.
# This file lives at src/agentkit/server/frontend/e2e/setup-test-user.py,
# so the project root is five levels above its directory: parents[5].
_PROJECT_ROOT = Path(__file__).resolve().parents[5]
_SRC_ROOT = _PROJECT_ROOT / "src"
if str(_SRC_ROOT) not in sys.path:
sys.path.insert(0, str(_SRC_ROOT))
import aiosqlite # noqa: E402
from agentkit.server.auth.models import DEFAULT_AUTH_DB_PATH, init_auth_db # noqa: E402
from agentkit.server.auth.password import hash_password # noqa: E402
TEST_USERNAME = os.environ.get("E2E_TEST_USERNAME", "e2e_test_admin")
TEST_PASSWORD = os.environ.get("E2E_TEST_PASSWORD", "E2eTestPass123!")
TEST_EMAIL = os.environ.get("E2E_TEST_EMAIL", "e2e-test@example.com")
async def ensure_test_user() -> None:
db_path = DEFAULT_AUTH_DB_PATH
# Create schema (idempotent) — mirrors what /auth/login does on first hit.
await init_auth_db(db_path)
password_hash = hash_password(TEST_PASSWORD)
now_iso = datetime.now(timezone.utc).isoformat()
async with aiosqlite.connect(str(db_path)) as db:
cursor = await db.execute("SELECT id FROM users WHERE username = ?", (TEST_USERNAME,))
existing = await cursor.fetchone()
if existing:
# Update password + ensure admin role + terminal authorization
await db.execute(
"UPDATE users SET password_hash = ?, role = 'admin', is_active = 1, "
"is_terminal_authorized = 1, is_server_terminal_authorized = 1, "
"email = ?, updated_at = ? WHERE username = ?",
(password_hash, TEST_EMAIL, now_iso, TEST_USERNAME),
)
await db.commit()
print(f"[setup-test-user] Updated existing test user '{TEST_USERNAME}'")
else:
user_id = str(uuid.uuid4())
await db.execute(
"INSERT INTO users (id, username, email, password_hash, role, "
"is_active, is_terminal_authorized, is_server_terminal_authorized, "
"created_at, updated_at) VALUES (?, ?, ?, ?, 'admin', 1, 1, 1, ?, ?)",
(user_id, TEST_USERNAME, TEST_EMAIL, password_hash, now_iso, now_iso),
)
await db.commit()
print(f"[setup-test-user] Created test admin user '{TEST_USERNAME}'")
def main() -> int:
try:
asyncio.run(ensure_test_user())
return 0
except Exception as exc: # noqa: BLE001
print(f"[setup-test-user] ERROR: {exc}", file=sys.stderr)
return 1
if __name__ == "__main__":
raise SystemExit(main())

View File

@ -0,0 +1,76 @@
import { test, expect } from '@playwright/test'
import { loginAndHydrate } from './helpers'
test.describe('Terminal panel', () => {
test.beforeEach(async ({ page }) => {
await loginAndHydrate(page)
// The terminal view lives at /legacy/terminal (the /terminal route
// redirects there — see router/index.ts).
await page.goto('/legacy/terminal')
})
test('should display the terminal panel with mode tabs', async ({ page }) => {
// The TerminalPanel component renders .terminal-panel
const terminalPanel = page.locator('.terminal-panel')
await expect(terminalPanel).toBeVisible({ timeout: 10_000 })
// The "本地终端" (local terminal) tab should always be visible
await expect(
terminalPanel.getByRole('button', { name: /本地终端/ }),
).toBeVisible()
// The connection status indicator should be present
await expect(terminalPanel.locator('.terminal-panel__indicator')).toBeVisible()
})
test('should show server terminal tab for admin users', async ({ page }) => {
// The test user is an admin, so the "服务端终端" tab should be visible
// (it's gated behind authStore.canUseServerTerminal()).
const terminalPanel = page.locator('.terminal-panel')
await expect(terminalPanel).toBeVisible({ timeout: 10_000 })
await expect(
terminalPanel.getByRole('button', { name: /服务端终端/ }),
).toBeVisible()
})
test('should open the whitelist manager drawer', async ({ page }) => {
// Wait for the terminal view to mount
await expect(page.locator('.terminal-panel')).toBeVisible({ timeout: 10_000 })
// The whitelist button is positioned in the top-right corner of the
// terminal view (SafetyOutlined icon inside .terminal-view__whitelist-btn).
const whitelistBtn = page.locator('.terminal-view__whitelist-btn')
await expect(whitelistBtn).toBeVisible()
await whitelistBtn.click()
// The drawer should open and contain the WhitelistManager component.
// The drawer title is "终端白名单管理".
const drawer = page.locator('.ant-drawer-content')
await expect(drawer).toBeVisible({ timeout: 5_000 })
// The WhitelistManager renders an a-tabs with "我的白名单" tab
await expect(page.getByRole('tab', { name: '我的白名单' })).toBeVisible()
// The input for new patterns and the "添加" button should be visible.
await expect(
drawer.getByPlaceholder('输入命令模式,如: git, npm, ls'),
).toBeVisible()
// Match the button name with a regex to tolerate the space Ant Design Vue
// may auto-insert between the two characters.
await expect(drawer.getByRole('button', { name: /添\s*加/ })).toBeVisible()
})
test('should display admin-only tabs in whitelist manager', async ({ page }) => {
// Open the whitelist drawer
await expect(page.locator('.terminal-panel')).toBeVisible({ timeout: 10_000 })
await page.locator('.terminal-view__whitelist-btn').click()
const drawer = page.locator('.ant-drawer-content')
await expect(drawer).toBeVisible({ timeout: 5_000 })
// Admin users should see the "全局白名单", "黑名单", and "审计日志" tabs
await expect(page.getByRole('tab', { name: '全局白名单' })).toBeVisible()
await expect(page.getByRole('tab', { name: '黑名单' })).toBeVisible()
await expect(page.getByRole('tab', { name: '审计日志' })).toBeVisible()
})
})
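
The `/登\s*录/` and `/添\s*加/` patterns in these specs allow for whitespace that Ant Design Vue may insert between the characters of a short button label. The behavior can be sketched in plain TypeScript without Playwright. `labelPattern` is a hypothetical helper introduced only for illustration; it is not part of the test suite.

```typescript
// Hypothetical helper: builds a pattern that matches a label whether or not
// whitespace was inserted between its characters.
// Note: split('') works per UTF-16 code unit, which is fine for BMP
// characters such as the CJK labels used here.
function labelPattern(label: string): RegExp {
  const escaped = label
    .split('')
    .map((ch) => ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
  return new RegExp(escaped.join('\\s*'))
}

const loginPattern = labelPattern('登录')

// Matches the label both without and with an inserted space.
const matchesPlain = loginPattern.test('登录')
const matchesSpaced = loginPattern.test('登 录')
// An unrelated label does not match.
const matchesOther = loginPattern.test('注册')
```

A pattern built this way could be passed as the `name` option of `getByRole`, in the same position as the hand-written regex literals above.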

View File

@ -23,6 +23,7 @@
"vue-router": "^4.4.0"
},
"devDependencies": {
"@playwright/test": "^1.59.0",
"@tauri-apps/cli": "^2.11.2",
"@types/dompurify": "^3.0.5",
"@types/markdown-it": "^14.1.2",
@ -579,6 +580,22 @@
"@jridgewell/sourcemap-codec": "^1.4.14"
}
},
"node_modules/@playwright/test": {
"version": "1.61.0",
"resolved": "https://registry.npmmirror.com/@playwright/test/-/test-1.61.0.tgz",
"integrity": "sha512-cKA5B6lpFEMyMGjxF54QihfYpB4FkEGH+qZhtArDEG+wezQAJY8Pq6C7T1SjWz+FFzt3TbyoXBQYk/0292TdJA==",
"dev": true,
"license": "Apache-2.0",
"dependencies": {
"playwright": "1.61.0"
},
"bin": {
"playwright": "cli.js"
},
"engines": {
"node": ">=18"
}
},
"node_modules/@rollup/rollup-android-arm-eabi": {
"version": "4.61.1",
"resolved": "https://registry.npmmirror.com/@rollup/rollup-android-arm-eabi/-/rollup-android-arm-eabi-4.61.1.tgz",
@ -2220,6 +2237,53 @@
"pathe": "^2.0.3"
}
},
"node_modules/playwright": {
"version": "1.61.0",
"resolved": "https://registry.npmmirror.com/playwright/-/playwright-1.61.0.tgz",
"integrity": "sha512-Z+7BeeqQPRRzklHsVFP4KTGIyMxKUmfeRA4WisM6G3/XW6nwGeX6fX9qYaDa+CiUqpOkb2f6X3nar05R3kSuJQ==",
"dev": true,
"license": "Apache-2.0",
"dependencies": {
"playwright-core": "1.61.0"
},
"bin": {
"playwright": "cli.js"
},
"engines": {
"node": ">=18"
},
"optionalDependencies": {
"fsevents": "2.3.2"
}
},
"node_modules/playwright-core": {
"version": "1.61.0",
"resolved": "https://registry.npmmirror.com/playwright-core/-/playwright-core-1.61.0.tgz",
"integrity": "sha512-caX7TrY3Ml6egyDX0WUcTHDxodl/b51y5wJOdCEA36QviK/s2g081hvmGs8eaE3DWb6NYZQ6BjO/QkNRPenoPA==",
"dev": true,
"license": "Apache-2.0",
"bin": {
"playwright-core": "cli.js"
},
"engines": {
"node": ">=18"
}
},
"node_modules/playwright/node_modules/fsevents": {
"version": "2.3.2",
"resolved": "https://registry.npmmirror.com/fsevents/-/fsevents-2.3.2.tgz",
"integrity": "sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==",
"dev": true,
"hasInstallScript": true,
"license": "MIT",
"optional": true,
"os": [
"darwin"
],
"engines": {
"node": "^8.16.0 || ^10.6.0 || >=11.0.0"
}
},
"node_modules/postcss": {
"version": "8.5.15",
"resolved": "https://registry.npmmirror.com/postcss/-/postcss-8.5.15.tgz",

View File

@ -9,7 +9,9 @@
"build": "vue-tsc --noEmit && vite build",
"build:frontend": "vue-tsc --noEmit && vite build",
"preview": "vite preview",
"tauri": "tauri"
"tauri": "tauri",
"test:e2e": "playwright test",
"test:e2e:ui": "playwright test --ui"
},
"dependencies": {
"@ant-design/icons-vue": "^7.0.0",
@ -27,6 +29,7 @@
"vue-router": "^4.4.0"
},
"devDependencies": {
"@playwright/test": "^1.59.0",
"@tauri-apps/cli": "^2.11.2",
"@types/dompurify": "^3.0.5",
"@types/markdown-it": "^14.1.2",

View File

@ -0,0 +1,80 @@
import { defineConfig, devices } from '@playwright/test'
/**
* Playwright E2E configuration for Fischer AgentKit frontend.
*
* Architecture:
* - Backend (uvicorn direct, avoids agentkit serve interactive prompts) runs on
* port 8000 to match the Vite dev-server proxy target in vite.config.ts.
* - Frontend (Vite dev server) runs on port 5173 (strictPort in vite.config.ts).
* - Tests target the frontend at http://localhost:5173; API/WS calls are
* transparently proxied to the backend.
*
* The `globalSetup` script creates a test admin user in the auth DB before
* any test runs, so login-based tests have valid credentials available.
*/
// Project root relative to this config file
// (src/agentkit/server/frontend/ → 4 levels up to project root)
const PROJECT_ROOT = '../../../..'
export default defineConfig({
testDir: './e2e',
fullyParallel: false,
forbidOnly: !!process.env.CI,
retries: process.env.CI ? 1 : 0,
workers: 1,
reporter: [['list'], ['html', { open: 'never' }]],
timeout: 90_000,
expect: { timeout: 15_000 },
globalSetup: './e2e/global-setup.ts',
use: {
baseURL: 'http://localhost:5173',
trace: 'on-first-retry',
screenshot: 'only-on-failure',
video: 'retain-on-failure',
actionTimeout: 15_000,
navigationTimeout: 30_000,
},
projects: [
{
name: 'chromium',
use: {
...devices['Desktop Chrome'],
// Use system Chrome to avoid slow browser downloads.
channel: 'chrome',
},
},
],
webServer: [
{
// Use uvicorn directly — `agentkit serve` has Confirm.ask() prompts
// that fail in non-tty subprocess environments.
// Env vars set inline to avoid Playwright's env property replacing
// the entire process.env (which would lose PATH, API keys, etc.).
command:
'AGENTKIT_GUI_MODE=1 NO_PROXY=127.0.0.1,localhost no_proxy=127.0.0.1,localhost ' +
'python3 -c "import uvicorn; uvicorn.run(' +
"'agentkit.server.app:create_app', " +
"host='127.0.0.1', port=8000, factory=True)\"",
url: 'http://127.0.0.1:8000/api/v1/health',
cwd: PROJECT_ROOT,
reuseExistingServer: !process.env.CI,
timeout: 120_000,
stdout: 'pipe',
stderr: 'pipe',
},
{
command: 'npm run dev',
url: 'http://localhost:5173',
cwd: '.',
reuseExistingServer: !process.env.CI,
timeout: 60_000,
stdout: 'pipe',
stderr: 'pipe',
},
],
})
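
The backend `command` sets its variables with an inline `VAR=value` prefix, which relies on a POSIX shell, because the config comment is wary of the `env` option dropping inherited variables. If `env` were used instead, an explicit merge is one way to keep the inherited environment regardless of how a given Playwright version treats that option. The sketch below shows such a merge as a plain function. `mergeEnv` and the sample values are hypothetical; check the actual `env` semantics against the Playwright version in use.

```typescript
type Env = Record<string, string | undefined>

// Hypothetical helper: overlay overrides on an inherited environment,
// dropping undefined entries so every resulting value is a string.
function mergeEnv(base: Env, overrides: Env): Record<string, string> {
  const merged: Record<string, string> = {}
  for (const key of Object.keys(base)) {
    const value = base[key]
    if (value !== undefined) merged[key] = value
  }
  // Overrides are applied last, so they win over inherited values.
  for (const key of Object.keys(overrides)) {
    const value = overrides[key]
    if (value !== undefined) merged[key] = value
  }
  return merged
}

// Sample input standing in for process.env (values are made up).
const exampleEnv = mergeEnv(
  { PATH: '/usr/bin', NO_PROXY: 'example.internal', UNSET: undefined },
  { AGENTKIT_GUI_MODE: '1', NO_PROXY: '127.0.0.1,localhost' },
)
```

In a config file this would be used as `env: mergeEnv(process.env, { AGENTKIT_GUI_MODE: '1' })`, which also avoids the shell-specific prefix syntax.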

View File

@ -7,12 +7,7 @@ import type {
IConversation,
IChatRequest,
WsClientMessage,
IExpertTeamState,
IBoardStartedData,
IExpertSpeechData,
IRoundSummaryData,
IUserInterventionData,
IBoardConcludedData,
WsServerMessage,
} from '@/api/types'
function generateId(): string {
@ -276,7 +271,7 @@ export const useChatStore = defineStore('chat', () => {
socket.onmessage = (event: MessageEvent) => {
try {
const data = JSON.parse(event.data as string) as Record<string, unknown>
const data = JSON.parse(event.data as string) as WsServerMessage
console.log('[Chat WS] Received:', data.type, data)
handleWsMessage(data)
} catch (error) {
@ -403,17 +398,14 @@ export const useChatStore = defineStore('chat', () => {
return _teamStore
}
// TODO: refactor to WsServerMessage union to eliminate `any`.
// This function predates the current VI redesign and touches many legacy branches.
function handleWsMessage(data: Record<string, any>): void {
// Backend sends nested data: {type, data: {...}}
// Flatten for easier access
const payload = data.data ?? data
function handleWsMessage(data: WsServerMessage): void {
// Discriminated union narrowing: each `case` branch narrows `data` to a
// specific variant of WsServerMessage, so typed fields can be accessed
// directly from `data` (or `data.data` for variants with a nested payload).
switch (data.type) {
case 'connected': {
// Backend confirms conversation — update local ID if backend assigned a different one
const serverConvId = data.conversation_id || payload.conversation_id
const serverConvId = data.conversation_id
if (serverConvId && serverConvId !== currentConversationId.value) {
// Rename the local conversation to match the server ID
const localId = currentConversationId.value
@ -453,11 +445,12 @@ export const useChatStore = defineStore('chat', () => {
const lastAssistantMsg = [...conv.messages]
.reverse()
.find((m) => m.role === 'assistant')
const stepInfo = payload
const stepInfo = data.data
const innerData = stepInfo.data as Record<string, unknown>
const desc = stepInfo.event_type === 'final_answer'
? '生成最终回答'
: stepInfo.event_type === 'tool_call'
? `调用工具: ${stepInfo.data?.tool_name || stepInfo.data?.name || '#'}`
? `调用工具: ${(innerData.tool_name || innerData.name || '#') as string}`
: stepInfo.event_type === 'thinking'
? '思考中...'
: `步骤 ${stepInfo.step || ''}: ${stepInfo.event_type || ''}`
@ -469,11 +462,11 @@ export const useChatStore = defineStore('chat', () => {
if (stepInfo.event_type === 'tool_call') {
const tcId = `tc-${stepInfo.step || toolCalls.length}`
const toolName = stepInfo.data?.tool_name || stepInfo.data?.name || 'unknown'
const params = stepInfo.data?.arguments
? (typeof stepInfo.data.arguments === 'string'
? stepInfo.data.arguments
: JSON.stringify(stepInfo.data.arguments, null, 2))
const toolName = (innerData.tool_name || innerData.name || 'unknown') as string
const params = innerData.arguments
? (typeof innerData.arguments === 'string'
? innerData.arguments
: JSON.stringify(innerData.arguments, null, 2))
: undefined
toolCalls.push({
id: tcId,
@ -486,20 +479,20 @@ export const useChatStore = defineStore('chat', () => {
// Find the last running tool call and update it
const lastRunning = [...toolCalls].reverse().find(tc => tc.status === 'running')
if (lastRunning) {
const resultStr = stepInfo.data?.output
? (typeof stepInfo.data.output === 'string'
? stepInfo.data.output
: JSON.stringify(stepInfo.data.output, null, 2))
const resultStr = innerData.output
? (typeof innerData.output === 'string'
? innerData.output
: JSON.stringify(innerData.output, null, 2))
: ''
lastRunning.status = stepInfo.data?.error ? 'error' : 'completed'
lastRunning.status = innerData.error ? 'error' : 'completed'
lastRunning.result = resultStr.length > 2000 ? resultStr.substring(0, 2000) + '...' : resultStr
lastRunning.error = stepInfo.data?.error
lastRunning.duration = stepInfo.data?.duration
lastRunning.error = innerData.error as string | undefined
lastRunning.duration = innerData.duration as number | undefined
updateMessage(conversationId, lastAssistantMsg.id, { tool_calls: [...toolCalls] })
}
} else if (stepInfo.event_type === 'thinking') {
// Accumulate thinking content for ThinkingBlock rendering
const thinkingChunk = stepInfo.data?.content || stepInfo.data?.thought || ''
const thinkingChunk = (innerData.content || innerData.thought || '') as string
if (thinkingChunk && lastAssistantMsg) {
updateMessage(conversationId, lastAssistantMsg.id, {
thinking: (lastAssistantMsg.thinking || '') + thinkingChunk,
@ -510,7 +503,7 @@ export const useChatStore = defineStore('chat', () => {
// Accumulate final_answer content for streaming display
if (stepInfo.event_type === 'final_answer' && lastAssistantMsg) {
const chunk = stepInfo.data?.output || ''
const chunk = (innerData.output || '') as string
if (chunk) {
updateMessage(conversationId, lastAssistantMsg.id, {
content: (lastAssistantMsg.content || '') + chunk,
@ -529,7 +522,7 @@ export const useChatStore = defineStore('chat', () => {
.reverse()
.find((m) => m.role === 'assistant')
// Backend sends: {type: "result", data: {message: "..."}} or {data: {status, content}}
const content = payload.message || payload.content || ''
const content = data.data.message || data.data.content || ''
if (lastAssistantMsg) {
// Only overwrite if we didn't already stream the content
const finalContent = content || lastAssistantMsg.content || ''
@ -562,7 +555,7 @@ export const useChatStore = defineStore('chat', () => {
updateMessage(conversationId, lastAssistantMsg.id, {
message_type: 'error',
status: 'error',
error_detail: payload.message || '未知错误',
error_detail: data.data.message || '未知错误',
content: lastAssistantMsg.content || '',
})
} else {
@ -573,7 +566,7 @@ export const useChatStore = defineStore('chat', () => {
timestamp: new Date().toISOString(),
status: 'error',
message_type: 'error',
error_detail: payload.message || '未知错误',
error_detail: data.data.message || '未知错误',
}
appendMessage(conversationId, errorMsg)
}
@ -585,9 +578,9 @@ export const useChatStore = defineStore('chat', () => {
case 'team_formed': {
const teamStore = _getTeamStore()
if (teamStore) {
teamStore.setTeamState(payload as IExpertTeamState)
teamStore.setTeamState(data.data)
}
streamingSteps.value.push(`专家团队已组建: ${(payload as IExpertTeamState).experts.map((e) => e.name).join(', ')}`)
streamingSteps.value.push(`专家团队已组建: ${data.data.experts.map((e) => e.name).join(', ')}`)
break
}
@ -599,26 +592,26 @@ export const useChatStore = defineStore('chat', () => {
// Dedup: append to existing expert message if one exists for this expert
const existingExpertMsg = [...conv.messages]
.reverse()
.find((m) => m.expert_id === payload.expert_id && m.status === 'pending')
.find((m) => m.expert_id === data.data.expert_id && m.status === 'pending')
if (existingExpertMsg) {
updateMessage(conversationId, existingExpertMsg.id, {
content: (existingExpertMsg.content || '') + (payload.content || ''),
content: (existingExpertMsg.content || '') + (data.data.content || ''),
})
} else {
const expertMsg: IChatMessage = {
id: generateId(),
role: 'assistant',
content: payload.content || '',
content: data.data.content || '',
timestamp: new Date().toISOString(),
status: 'pending',
expert_id: payload.expert_id,
expert_name: payload.expert_name,
expert_color: payload.expert_color,
expert_id: data.data.expert_id,
expert_name: data.data.expert_name,
expert_color: data.data.expert_color,
message_type: 'chat',
}
appendMessage(conversationId, expertMsg)
}
streamingSteps.value.push(`${payload.expert_name}: 步骤 ${payload.step}`)
streamingSteps.value.push(`${data.data.expert_name}: 步骤 ${data.data.step}`)
break
}
@ -630,12 +623,12 @@ export const useChatStore = defineStore('chat', () => {
const expertMsg: IChatMessage = {
id: generateId(),
role: 'assistant',
content: payload.content || '',
content: data.data.content || '',
timestamp: new Date().toISOString(),
status: 'completed',
expert_id: payload.expert_id,
expert_name: payload.expert_name,
expert_color: payload.expert_color,
expert_id: data.data.expert_id,
expert_name: data.data.expert_name,
expert_color: data.data.expert_color,
message_type: 'chat',
}
appendMessage(conversationId, expertMsg)
@ -645,7 +638,7 @@ export const useChatStore = defineStore('chat', () => {
case 'plan_update': {
const teamStore = _getTeamStore()
if (teamStore) {
teamStore.updatePhases(payload.plan_phases)
teamStore.updatePhases(data.data.plan_phases)
}
const conversationId = currentConversationId.value
if (!conversationId) break
@ -656,7 +649,7 @@ export const useChatStore = defineStore('chat', () => {
.find((m) => m.message_type === 'plan_update')
if (existingPlanMsg) {
updateMessage(conversationId, existingPlanMsg.id, {
plan_phases: payload.plan_phases,
plan_phases: data.data.plan_phases,
})
} else {
const planMsg: IChatMessage = {
@ -666,7 +659,7 @@ export const useChatStore = defineStore('chat', () => {
timestamp: new Date().toISOString(),
status: 'completed',
message_type: 'plan_update',
plan_phases: payload.plan_phases,
plan_phases: data.data.plan_phases,
}
appendMessage(conversationId, planMsg)
}
@ -681,7 +674,7 @@ export const useChatStore = defineStore('chat', () => {
const synthesisMsg: IChatMessage = {
id: generateId(),
role: 'assistant',
content: payload.content || '',
content: data.data.content || '',
timestamp: new Date().toISOString(),
status: 'completed',
message_type: 'milestone',
@ -702,8 +695,8 @@ export const useChatStore = defineStore('chat', () => {
case 'phase_started': {
const teamStore = _getTeamStore()
if (teamStore?.teamState) {
teamStore.updatePhaseStatus(payload.phase_id, 'in_progress')
streamingSteps.value.push(`阶段开始: ${payload.phase_name} (${payload.assigned_expert})`)
teamStore.updatePhaseStatus(data.data.phase_id, 'in_progress')
streamingSteps.value.push(`阶段开始: ${data.data.phase_name} (${data.data.assigned_expert})`)
}
break
}
@ -711,8 +704,8 @@ export const useChatStore = defineStore('chat', () => {
case 'phase_completed': {
const teamStore = _getTeamStore()
if (teamStore?.teamState) {
teamStore.updatePhaseStatus(payload.phase_id, 'completed', payload.result_summary)
streamingSteps.value.push(`阶段完成: ${payload.phase_name}`)
teamStore.updatePhaseStatus(data.data.phase_id, 'completed', data.data.result_summary)
streamingSteps.value.push(`阶段完成: ${data.data.phase_name}`)
}
break
}
@ -720,8 +713,8 @@ export const useChatStore = defineStore('chat', () => {
case 'phase_failed': {
const teamStore = _getTeamStore()
if (teamStore?.teamState) {
teamStore.updatePhaseStatus(payload.phase_id, 'failed', payload.error)
streamingSteps.value.push(`阶段失败: ${payload.phase_name} - ${payload.error}`)
teamStore.updatePhaseStatus(data.data.phase_id, 'failed', data.data.error)
streamingSteps.value.push(`阶段失败: ${data.data.phase_name} - ${data.data.error}`)
}
break
}
@ -729,23 +722,23 @@ export const useChatStore = defineStore('chat', () => {
// ── Board Meeting mode events ────────────────────────────────────
case 'board_started': {
const data = payload as IBoardStartedData
const boardData = data.data
// Initialize board state
boardState.value = {
topic: data.topic,
experts: data.experts.map((e) => ({
topic: boardData.topic,
experts: boardData.experts.map((e) => ({
name: e.name,
avatar: e.avatar,
color: e.color,
is_moderator: e.is_moderator,
persona: e.persona,
})),
max_rounds: data.max_rounds,
max_rounds: boardData.max_rounds,
current_round: 0,
status: 'discussing',
}
streamingSteps.value.push(
`私董会已开启: 主题「${data.topic}」, ${data.experts.length} 位专家, 最多 ${data.max_rounds}`
`私董会已开启: 主题「${boardData.topic}」, ${boardData.experts.length} 位专家, 最多 ${boardData.max_rounds}`
)
// Push a structured banner message so the renderer can show BoardBannerCard
const conversationId = currentConversationId.value
@ -753,11 +746,11 @@ export const useChatStore = defineStore('chat', () => {
const startMsg: IChatMessage = {
id: generateId(),
role: 'assistant',
content: `🏛️ 私董会开始:${data.topic}`,
content: `🏛️ 私董会开始:${boardData.topic}`,
timestamp: new Date().toISOString(),
status: 'completed',
message_type: 'board_started',
board_started: data,
board_started: boardData,
board_round: 0,
}
appendMessage(conversationId, startMsg)
@ -766,67 +759,67 @@ export const useChatStore = defineStore('chat', () => {
}
case 'expert_speech': {
const data = payload as IExpertSpeechData
const speechData = data.data
// Update current round in board state
if (boardState.value && data.round > boardState.value.current_round) {
boardState.value.current_round = data.round
if (boardState.value && speechData.round > boardState.value.current_round) {
boardState.value.current_round = speechData.round
}
const conversationId = currentConversationId.value
if (!conversationId) break
const speechMsg: IChatMessage = {
id: generateId(),
role: 'assistant',
content: data.content || '',
content: speechData.content || '',
timestamp: new Date().toISOString(),
status: 'completed',
expert_name: data.expert_name,
expert_color: data.expert_color,
expert_avatar: data.expert_avatar,
expert_name: speechData.expert_name,
expert_color: speechData.expert_color,
expert_avatar: speechData.expert_avatar,
message_type: 'board_speech',
board_round: data.round,
board_role: data.role,
board_round: speechData.round,
board_role: speechData.role,
}
appendMessage(conversationId, speechMsg)
streamingSteps.value.push(
`${data.expert_avatar} ${data.expert_name} (第${data.round}${data.role === 'moderator' ? '·主持' : ''})`
`${speechData.expert_avatar} ${speechData.expert_name} (第${speechData.round}${speechData.role === 'moderator' ? '·主持' : ''})`
)
break
}
case 'round_summary': {
const data = payload as IRoundSummaryData
const summaryData = data.data
const conversationId = currentConversationId.value
if (!conversationId) break
const summaryMsg: IChatMessage = {
id: generateId(),
role: 'assistant',
content: data.content || '',
content: summaryData.content || '',
timestamp: new Date().toISOString(),
status: 'completed',
expert_name: data.moderator_name,
expert_name: summaryData.moderator_name,
message_type: 'board_summary',
board_round: data.round,
board_round: summaryData.round,
board_role: 'summary',
}
appendMessage(conversationId, summaryMsg)
streamingSteps.value.push(`${data.round}轮小结${data.continue ? '(继续讨论)' : '(即将结束)'}`)
streamingSteps.value.push(`${summaryData.round}轮小结${summaryData.continue ? '(继续讨论)' : '(即将结束)'}`)
break
}
case 'user_intervention': {
const data = payload as IUserInterventionData
streamingSteps.value.push(`用户干预: ${data.content.slice(0, 50)}...`)
const interventionData = data.data
streamingSteps.value.push(`用户干预: ${interventionData.content.slice(0, 50)}...`)
break
}
case 'board_concluded': {
const data = payload as IBoardConcludedData
const conclusionData = data.data
// Update board state to completed
if (boardState.value) {
boardState.value.status = 'completed'
}
streamingSteps.value.push(
`私董会结束: ${data.total_rounds} 轮讨论${data.error ? ' (异常)' : ''}`
`私董会结束: ${conclusionData.total_rounds} 轮讨论${conclusionData.error ? ' (异常)' : ''}`
)
// Push a structured conclusion message so the renderer can show BoardConclusionCard
const conversationId = currentConversationId.value
@ -834,12 +827,12 @@ export const useChatStore = defineStore('chat', () => {
const conclusionMsg: IChatMessage = {
id: generateId(),
role: 'assistant',
content: data.summary || '私董会已结束',
content: conclusionData.summary || '私董会已结束',
timestamp: new Date().toISOString(),
status: 'completed',
message_type: 'board_conclusion',
board_conclusion: data,
board_round: data.total_rounds,
board_conclusion: conclusionData,
board_round: conclusionData.total_rounds,
}
appendMessage(conversationId, conclusionMsg)
}
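
The hunks above replace `Record<string, any>` with the `WsServerMessage` union so that each `case` narrows `data`. The real type lives in `@/api/types` and is not shown in this diff. The sketch below uses a reduced, hypothetical three-variant union (`SketchMessage`) only to illustrate how switching on `type` gives typed access to top-level fields or to the nested `data` payload.

```typescript
// Hypothetical, reduced stand-in for WsServerMessage. Field names follow the
// handler above, but the real definitions in '@/api/types' may differ.
interface ConnectedMessage {
  type: 'connected'
  conversation_id: string
}
interface ErrorMessage {
  type: 'error'
  data: { message?: string }
}
interface PhaseStartedMessage {
  type: 'phase_started'
  data: { phase_id: string; phase_name: string; assigned_expert: string }
}
type SketchMessage = ConnectedMessage | ErrorMessage | PhaseStartedMessage

function describeMessage(msg: SketchMessage): string {
  switch (msg.type) {
    case 'connected':
      // Narrowed to ConnectedMessage: conversation_id is a top-level field.
      return `connected: ${msg.conversation_id}`
    case 'error':
      // Narrowed to ErrorMessage: the payload is nested under data.
      return `error: ${msg.data.message || 'unknown error'}`
    case 'phase_started':
      // Narrowed to PhaseStartedMessage: all three fields are typed strings.
      return `phase started: ${msg.data.phase_name} (${msg.data.assigned_expert})`
    default: {
      // Exhaustiveness check: adding a variant without a case fails to compile.
      const unreachable: never = msg
      return unreachable
    }
  }
}
```

The `never` assignment in the `default` branch is an optional addition, not something the diff shows. It turns a forgotten message type into a compile error instead of a silently ignored event.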

View File

@ -135,7 +135,15 @@ async def submit_task(request: SubmitTaskRequest, req: Request):
quality_result = None
if skill:
try:
quality_result = await quality_gate.validate(task_result.output_data or {}, skill)
intent = getattr(skill.config, "intent", None)
skill_context = None
if intent is not None:
keywords = list(intent.keywords) + list(intent.disambiguation_keywords)
if keywords:
skill_context = {"intent_keywords": keywords}
quality_result = await quality_gate.validate(
task_result.output_data or {}, skill, skill_context=skill_context
)
except Exception:
pass # Quality gate failure shouldn't block the response

View File

@ -110,8 +110,18 @@ class BackgroundRunner:
quality_result = None
if skill and quality_gate:
try:
intent = getattr(skill.config, "intent", None)
skill_context = None
if intent is not None:
keywords = list(intent.keywords) + list(
intent.disambiguation_keywords
)
if keywords:
skill_context = {"intent_keywords": keywords}
quality_result = await quality_gate.validate(
task_result.output_data or {}, skill
task_result.output_data or {},
skill,
skill_context=skill_context,
)
except Exception as e:
logger.warning(f"Quality gate failed for {task_id}: {e}")

View File

@ -36,6 +36,7 @@ class IntentConfig:
keywords: list[str] = field(default_factory=list)
description: str = ""
examples: list[str] = field(default_factory=list)
disambiguation_keywords: list[str] = field(default_factory=list)
@dataclass
@ -214,6 +215,7 @@ class SkillConfig(AgentConfig):
"keywords": self.intent.keywords,
"description": self.intent.description,
"examples": self.intent.examples,
"disambiguation_keywords": self.intent.disambiguation_keywords,
}
d["quality_gate"] = {
"required_fields": self.quality_gate.required_fields,

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@ -1,11 +1,11 @@
# AgentKit 能力基准测试报告
## 测试概要
- 时间: 2026-06-17T15:47:33.591101+00:00
- 时间: 2026-06-20T11:05:39.446588+00:00
- 版本: 0.1.0
- 模式: mock
- 运行次数: 1
- 总体准确率: 100.0% ± 0.0%
- 模式: llm
- 运行次数: 3
- 总体准确率: 93.3% ± 0.0%
## 与行业 Benchmark 对比
@ -17,252 +17,46 @@
## 维度结果
### 1. 预处理准确度 (Preprocessing Accuracy) [Mock]
### 9. LLM 推理能力 (LLM Reasoning) [LLM]
| 指标 | 值 |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [79.6%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01ms |
| Latency p95 | 0.07ms |
| Latency p99 | 0.11ms |
| Accuracy | 93.3% ± 9.4% |
| 95% CI | [37.5%, 96.4%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 40798.45ms |
| Latency p95 | 56307.93ms |
| Latency p99 | 59262.53ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 15 / 15 / 0 |
| Total / Pass / Fail | 5 / 4 / 1 |
#### 按类别分布
| 类别 | 用例数 | 通过 | 准确率 |
|---|---|---|---|
| greeting | 4 | 4 | 100.0% |
| tool_query | 5 | 5 | 100.0% |
| skill_prefix | 3 | 3 | 100.0% |
| complex | 3 | 3 | 100.0% |
| intent_understanding | 1 | 1 | 100.0% |
| tool_selection | 1 | 0 | 0.0% |
| multi_step | 1 | 1 | 100.0% |
| code_generation | 1 | 1 | 100.0% |
| error_recovery | 1 | 1 | 100.0% |
#### 按难度分布
| 难度 | 用例数 | 通过 | 准确率 |
|---|---|---|---|
| easy | 5 | 5 | 100.0% |
| medium | 7 | 7 | 100.0% |
| hard | 3 | 3 | 100.0% |
### 2. 过拟合检测 (Overfitting Detection) [Mock]
| 指标 | 值 |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01ms |
| Latency p95 | 0.03ms |
| Latency p99 | 0.03ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |
#### 按类别分布
| 类别 | 用例数 | 通过 | 准确率 |
|---|---|---|---|
| ip_check | 1 | 1 | 100.0% |
| search | 1 | 1 | 100.0% |
| greeting | 1 | 1 | 100.0% |
| tool_use | 1 | 1 | 100.0% |
| complex | 1 | 1 | 100.0% |
#### 按难度分布
| 难度 | 用例数 | 通过 | 准确率 |
|---|---|---|---|
| medium | 3 | 3 | 100.0% |
| easy | 1 | 1 | 100.0% |
| hard | 1 | 1 | 100.0% |
| medium | 2 | 1 | 50.0% |
| hard | 2 | 2 | 100.0% |
### 3. 效率测试 (Efficiency) [Mock]
#### 失败用例分析
| 指标 | 值 |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.33ms |
| Latency p95 | 0.64ms |
| Latency p99 | 0.67ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |
#### Breakdown by Category
| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| preprocess_latency | 3 | 3 | 100.0% |
| tool_search_latency | 2 | 2 | 100.0% |
#### Breakdown by Difficulty
| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |
### 4. Tool Search [Mock]
| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [72.2%, 100.0%] |
| Precision | 83.3% |
| Recall | 83.3% |
| F1 | 83.3% |
| Latency p50 | 0.01ms |
| Latency p95 | 0.02ms |
| Latency p99 | 0.02ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 10 / 10 / 0 |
#### Breakdown by Category
| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| exact_match | 5 | 5 | 100.0% |
| fuzzy_match | 2 | 2 | 100.0% |
| no_match | 2 | 2 | 100.0% |
| top_k | 1 | 1 | 100.0% |
#### Breakdown by Difficulty
| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 7 | 7 | 100.0% |
| medium | 3 | 3 | 100.0% |
### 5. Event Model [Mock]
| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [61.0%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 0.05ms |
| Latency p95 | 15.87ms |
| Latency p99 | 20.08ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 6 / 6 / 0 |
#### Breakdown by Category
| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| sq_lifecycle | 3 | 3 | 100.0% |
| eq_lifecycle | 3 | 3 | 100.0% |
#### Breakdown by Difficulty
| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 6 | 6 | 100.0% |
### 6. Spec Management [Mock]
| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [64.6%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 1.94ms |
| Latency p95 | 2.94ms |
| Latency p99 | 3.25ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 7 / 7 / 0 |
#### Breakdown by Category
| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| crud | 5 | 5 | 100.0% |
| edge | 2 | 2 | 100.0% |
#### Breakdown by Difficulty
| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 6 | 6 | 100.0% |
| medium | 1 | 1 | 100.0% |
### 7. Verification Loop [Mock]
| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [56.5%, 100.0%] |
| Precision | 0.0% |
| Recall | 0.0% |
| F1 | 0.0% |
| Latency p50 | 22.22ms |
| Latency p95 | 47.79ms |
| Latency p99 | 50.93ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 5 / 5 / 0 |
#### Breakdown by Category
| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| basic | 2 | 2 | 100.0% |
| retry | 1 | 1 | 100.0% |
| timeout | 1 | 1 | 100.0% |
| multi | 1 | 1 | 100.0% |
#### Breakdown by Difficulty
| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 2 | 2 | 100.0% |
| medium | 3 | 3 | 100.0% |
### 8. Board Meeting Routing [Mock]
| Metric | Value |
|---|---|
| Accuracy | 100.0% ± 0.0% |
| 95% CI | [82.4%, 100.0%] |
| Precision | 100.0% |
| Recall | 100.0% |
| F1 | 100.0% |
| Latency p50 | 0.01ms |
| Latency p95 | 0.39ms |
| Latency p99 | 1.19ms |
| Consistency | 100.0% |
| Total / Pass / Fail | 18 / 18 / 0 |
#### Breakdown by Category
| Category | Cases | Passed | Accuracy |
|---|---|---|---|
| default_template | 3 | 3 | 100.0% |
| explicit_experts | 3 | 3 | 100.0% |
| topic_extraction | 3 | 3 | 100.0% |
| no_match | 3 | 3 | 100.0% |
| name_validation | 3 | 3 | 100.0% |
| stop_command | 3 | 3 | 100.0% |
#### Breakdown by Difficulty
| Difficulty | Cases | Passed | Accuracy |
|---|---|---|---|
| easy | 11 | 11 | 100.0% |
| medium | 7 | 7 | 100.0% |
| Case ID | Category | Difficulty | Expected | Actual | Root Cause |
|---|---|---|---|---|---|
| llm-002 | tool_selection | medium | react | timeout | timeout |
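## Issue Summary and Improvement Suggestions
- All dimensions perform well; no specific improvements are needed.
- **llm_reasoning**: accuracy of 80.0% is below 90%; review the failed cases and optimize.
- **llm_reasoning**: P95 latency of 56307.93ms is high; consider optimizing performance.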
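
The report does not state how the `95% CI` rows are derived. The zero-failure rows (for example a lower bound of 79.6% for 15 / 15) are consistent with a Wilson score interval at z = 1.96, so the sketch below assumes that method. `wilson_interval` is a hypothetical helper written for illustration, not a function taken from the benchmark code.

```python
from __future__ import annotations

import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Return the (lower, upper) Wilson score interval for passed/total.

    Assumption: the benchmark uses a Wilson interval; the report does not say so.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= passed <= total:
        raise ValueError("passed must be between 0 and total")
    p_hat = passed / total
    denom = 1 + z * z / total
    center = (p_hat + z * z / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z * z / (4 * total * total)) / denom
    )
    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    for passed, total in [(15, 15), (5, 5), (4, 5)]:
        low, high = wilson_interval(passed, total)
        print(f"{passed}/{total}: [{low:.1%}, {high:.1%}]")
```

A Wilson interval stays inside [0, 1] and remains informative when every case passes, which is why a small suite with a perfect score still reports a lower bound well below 100%.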


@ -0,0 +1,640 @@
"""Real LLM E2E tests — tests against a live server with real LLM providers.
These tests start a real AgentKit server using the project's ``agentkit.yaml``
configuration and make actual LLM API calls to Bailian (DashScope).
Requirements:
- ``DASHSCOPE_API_KEY`` environment variable (loaded from ``.env``)
- Network access to ``https://coding.dashscope.aliyuncs.com/v1``
Run with::
.venv/bin/python -m pytest tests/e2e/test_real_llm_e2e.py -v --timeout=180
All tests are marked with ``@pytest.mark.integration`` so they are excluded
from the default unit-test run (``pytest -m "not integration"``).
"""
from __future__ import annotations
import asyncio
import json
import os
import subprocess
import sys
import time
import uuid
from datetime import datetime, timezone
from pathlib import Path
from typing import Any, Generator
import aiosqlite
import httpx
import pytest
# Disable HTTP proxies for localhost requests (Clash/V2Ray intercepts localhost).
os.environ["NO_PROXY"] = "127.0.0.1,localhost"
os.environ["no_proxy"] = "127.0.0.1,localhost"
# ---------------------------------------------------------------------------
# Constants
# ---------------------------------------------------------------------------
PROJECT_ROOT = Path(__file__).resolve().parents[2]
REAL_LLM_HOST = "127.0.0.1"
REAL_LLM_PORT = 18766 # dedicated port to avoid conflict with mock E2E (18765)
REAL_LLM_BASE_URL = f"http://{REAL_LLM_HOST}:{REAL_LLM_PORT}"
REAL_LLM_WS_URL = f"ws://{REAL_LLM_HOST}:{REAL_LLM_PORT}"
# Fixed JWT secret so tokens are deterministic across the session.
TEST_JWT_SECRET = "test-jwt-secret-for-real-llm-e2e-fixed-do-not-use-in-prod"
# Test user credentials (created directly in the auth DB).
TEST_USERNAME = "real_llm_e2e_user"
TEST_PASSWORD = "TestPassword123!@#"
TEST_EMAIL = "real_llm_e2e@example.com"
# Model alias from agentkit.yaml (resolves to bailian-coding/qwen3.7-plus).
TEST_MODEL = "default"
# ---------------------------------------------------------------------------
# .env loading
# ---------------------------------------------------------------------------
def _load_dotenv_vars(dotenv_path: Path) -> dict[str, str]:
"""Load env vars from a .env file into a dict (does not touch os.environ)."""
env_vars: dict[str, str] = {}
if not dotenv_path.exists():
return env_vars
with open(dotenv_path, encoding="utf-8") as f:
for line in f:
line = line.strip()
if not line or line.startswith("#"):
continue
if "=" not in line:
continue
key, _, value = line.partition("=")
key = key.strip()
value = value.strip().strip("\"'")
if key:
env_vars[key] = value
return env_vars
def _has_dashscope_key() -> bool:
"""Return True if DASHSCOPE_API_KEY is available (env or .env file)."""
if os.environ.get("DASHSCOPE_API_KEY"):
return True
dotenv_vars = _load_dotenv_vars(PROJECT_ROOT / ".env")
return bool(dotenv_vars.get("DASHSCOPE_API_KEY"))
# ---------------------------------------------------------------------------
# Test user creation
# ---------------------------------------------------------------------------
def _create_test_user(auth_db_path: Path) -> None:
"""Create the test user directly in the SQLite auth DB.
Uses bcrypt hashing (rounds=12) via the project's password utility so the
``/auth/login`` route can verify the password.
"""
from agentkit.server.auth.models import init_auth_db
from agentkit.server.auth.password import hash_password
# Ensure the schema exists.
asyncio.run(init_auth_db(auth_db_path))
user_id = str(uuid.uuid4())
password_hash = hash_password(TEST_PASSWORD)
now_iso = datetime.now(timezone.utc).isoformat()
async def _insert() -> None:
async with aiosqlite.connect(str(auth_db_path)) as db:
# Remove any stale row from a previous run.
await db.execute("DELETE FROM users WHERE username = ?", (TEST_USERNAME,))
await db.execute(
"INSERT INTO users "
"(id, username, email, password_hash, role, is_active, "
" is_terminal_authorized, is_server_terminal_authorized, "
" created_at, updated_at) "
"VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
(
user_id,
TEST_USERNAME,
TEST_EMAIL,
password_hash,
"admin", # admin role → full access for tests
1,
1,
1,
now_iso,
now_iso,
),
)
await db.commit()
asyncio.run(_insert())
# ---------------------------------------------------------------------------
# Session-scoped server fixture
# ---------------------------------------------------------------------------
@pytest.fixture(scope="session")
def real_llm_server(
tmp_path_factory: pytest.TempPathFactory,
) -> Generator[tuple[str, Path], None, None]:
"""Start a real AgentKit server with actual LLM providers.
Yields ``(base_url, auth_db_path)``. The server uses the project root's
``agentkit.yaml`` (Bailian coding plan), with no mock providers.
Skips the entire session if ``DASHSCOPE_API_KEY`` is not available.
"""
if not _has_dashscope_key():
pytest.skip("DASHSCOPE_API_KEY not set — skipping real LLM E2E tests")
tmp_path = tmp_path_factory.mktemp("real_llm_server")
auth_db_path = tmp_path / "auth.db"
# Build subprocess environment.
env = os.environ.copy()
# Disable HTTP proxies so localhost requests don't go through Clash/V2Ray.
for proxy_var in ("HTTP_PROXY", "HTTPS_PROXY", "http_proxy", "https_proxy", "ALL_PROXY", "all_proxy"):
env.pop(proxy_var, None)
env["NO_PROXY"] = "127.0.0.1,localhost"
env["no_proxy"] = "127.0.0.1,localhost"
# Ensure API keys from .env are available to the subprocess.
dotenv_vars = _load_dotenv_vars(PROJECT_ROOT / ".env")
for key, value in dotenv_vars.items():
if not env.get(key):
env[key] = value
# Auth configuration.
env["AGENTKIT_JWT_SECRET"] = TEST_JWT_SECRET
env["AGENTKIT_AUTH_DB"] = str(auth_db_path)
# GUI mode creates a default chat agent (needed for chat / WebSocket tests).
env["AGENTKIT_GUI_MODE"] = "1"
# Explicit config path (also auto-discovered via CWD, but set explicitly).
config_path = PROJECT_ROOT / "agentkit.yaml"
env["AGENTKIT_CONFIG_PATH"] = str(config_path)
# Start the server via uvicorn directly (agentkit serve has interactive
# prompts that fail in non-tty subprocess environments).
# Redirect stderr to a file so we can read server logs on test failures.
stderr_log = tmp_path / "server_stderr.log"
stderr_fh = open(stderr_log, "w", encoding="utf-8")
try:
proc = subprocess.Popen(
[
sys.executable,
"-c",
"import uvicorn; uvicorn.run("
"'agentkit.server.app:create_app', "
f"host='{REAL_LLM_HOST}', port={REAL_LLM_PORT}, factory=True)",
],
env=env,
stdout=subprocess.PIPE,
stderr=stderr_fh,
cwd=str(PROJECT_ROOT),
)
# Wait for the server to become healthy (max 60s — real LLM server
# initialization is slower than the mock E2E server).
base_url = REAL_LLM_BASE_URL
deadline = time.monotonic() + 60
ready = False
while time.monotonic() < deadline:
if proc.poll() is not None:
# Process exited early — capture output for diagnostics.
stdout, stderr = proc.communicate(timeout=5)
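pytest.fail(
"Real LLM server exited early.\n"
f"stdout: {stdout.decode()[:2000] if stdout else ''}\n"
# stderr is redirected to stderr_log (not a pipe), so read the log file instead.
f"stderr: {stderr_log.read_text(encoding='utf-8', errors='replace')[-2000:]}"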
)
try:
resp = httpx.get(f"{base_url}/api/v1/health", timeout=2)
if resp.status_code == 200:
ready = True
break
except httpx.ConnectError:
pass
time.sleep(0.5)
if not ready:
proc.terminate()
try:
stdout, stderr = proc.communicate(timeout=5)
except subprocess.TimeoutExpired:
proc.kill()
stdout, stderr = proc.communicate()
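pytest.fail(
"Real LLM server failed to start within 60s.\n"
f"stdout: {stdout.decode()[:2000] if stdout else ''}\n"
# stderr is redirected to stderr_log (not a pipe), so read the log file instead.
f"stderr: {stderr_log.read_text(encoding='utf-8', errors='replace')[-2000:]}"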
)
# Create the test user now that the server (and auth DB schema) is up.
_create_test_user(auth_db_path)
yield base_url, auth_db_path
# Teardown — terminate the server process.
proc.terminate()
try:
proc.wait(timeout=10)
except subprocess.TimeoutExpired:
proc.kill()
proc.wait()
finally:
stderr_fh.close()
# If the server logged any errors, print them for debugging.
if stderr_log.exists():
log_content = stderr_log.read_text(encoding="utf-8", errors="replace")
if "Error" in log_content or "Traceback" in log_content:
print(f"\n--- Server stderr log ---\n{log_content[-3000:]}\n--- End server log ---")
# ---------------------------------------------------------------------------
# Convenience fixtures derived from real_llm_server
# ---------------------------------------------------------------------------
@pytest.fixture(scope="session")
def base_url(real_llm_server: tuple[str, Path]) -> str:
return real_llm_server[0]
@pytest.fixture(scope="session")
def auth_db_path(real_llm_server: tuple[str, Path]) -> Path:
return real_llm_server[1]
def _login_with_retry(
base_url: str, max_retries: int = 3, delay: float = 1.0
) -> httpx.Response:
"""Login with retry on 500 (transient SQLite write-lock contention)."""
if max_retries <= 0:
raise ValueError("max_retries must be > 0")
with httpx.Client(base_url=base_url, timeout=30) as client:
for attempt in range(max_retries):
resp = client.post(
"/api/v1/auth/login",
json={"username": TEST_USERNAME, "password": TEST_PASSWORD},
)
if resp.status_code == 200:
return resp
if resp.status_code == 500 and attempt < max_retries - 1:
time.sleep(delay)
continue
return resp
raise RuntimeError("unreachable: loop should have returned")
@pytest.fixture(scope="session")
def auth_token(base_url: str) -> str:
"""Log in once per session and return the access token."""
resp = _login_with_retry(base_url)
assert resp.status_code == 200, (
f"Login failed: {resp.status_code} {resp.text[:1000]}"
)
data = resp.json()
assert "access_token" in data
return data["access_token"]
@pytest.fixture(scope="session")
def refresh_token(base_url: str) -> str:
"""Log in once per session and return the refresh token."""
resp = _login_with_retry(base_url)
assert resp.status_code == 200, (
f"Login failed: {resp.status_code} {resp.text[:1000]}"
)
return resp.json()["refresh_token"]
@pytest.fixture(scope="session")
def auth_headers(auth_token: str) -> dict[str, str]:
"""Default headers with a Bearer JWT for authenticated requests."""
return {"Authorization": f"Bearer {auth_token}", "Content-Type": "application/json"}
# ---------------------------------------------------------------------------
# 1. Authentication Flow Tests
# ---------------------------------------------------------------------------
@pytest.mark.integration
@pytest.mark.timeout(30)
class TestAuthFlow:
"""Verify the JWT authentication flow against the live server."""
def test_login_success(self, base_url: str):
"""POST /auth/login with correct credentials returns a JWT pair."""
with httpx.Client(base_url=base_url, timeout=30) as client:
resp = client.post(
"/api/v1/auth/login",
json={"username": TEST_USERNAME, "password": TEST_PASSWORD},
)
assert resp.status_code == 200
data = resp.json()
assert "access_token" in data
assert "refresh_token" in data
assert data["token_type"] == "bearer"
assert data["user"]["username"] == TEST_USERNAME
assert data["user"]["role"] == "admin"
def test_login_wrong_password(self, base_url: str):
"""POST /auth/login with wrong password returns 401."""
with httpx.Client(base_url=base_url, timeout=30) as client:
resp = client.post(
"/api/v1/auth/login",
json={"username": TEST_USERNAME, "password": "definitely-wrong"},
)
assert resp.status_code == 401
def test_me_with_valid_token(self, base_url: str, auth_headers: dict[str, str]):
"""GET /auth/me with a valid JWT returns the user profile."""
with httpx.Client(base_url=base_url, timeout=30) as client:
resp = client.get("/api/v1/auth/me", headers=auth_headers)
assert resp.status_code == 200
data = resp.json()
assert data["username"] == TEST_USERNAME
assert data["email"] == TEST_EMAIL
assert data["role"] == "admin"
assert data["is_active"] is True
def test_me_without_token_returns_401(self, base_url: str):
"""GET /auth/me without a token returns 401."""
with httpx.Client(base_url=base_url, timeout=10) as client:
resp = client.get("/api/v1/auth/me")
assert resp.status_code == 401
def test_refresh_token(self, base_url: str, refresh_token: str):
"""POST /auth/refresh exchanges a refresh token for a new access token."""
with httpx.Client(base_url=base_url, timeout=30) as client:
resp = client.post(
"/api/v1/auth/refresh",
json={"refresh_token": refresh_token},
)
assert resp.status_code == 200
data = resp.json()
assert "access_token" in data
assert data["user"]["username"] == TEST_USERNAME
# ---------------------------------------------------------------------------
# 2. LLM Gateway Tests
# ---------------------------------------------------------------------------
@pytest.mark.integration
@pytest.mark.timeout(120)
class TestLLMGateway:
"""Verify the LLM gateway proxy returns real LLM responses."""
def test_chat_non_streaming(self, base_url: str, auth_headers: dict[str, str]):
"""POST /llm/chat returns a non-empty real LLM response."""
with httpx.Client(base_url=base_url, timeout=90) as client:
resp = client.post(
"/api/v1/llm/chat",
headers=auth_headers,
json={
"messages": [{"role": "user", "content": "你好,请用一句话介绍自己"}],
"model": TEST_MODEL,
"temperature": 0.7,
"max_tokens": 200,
},
)
assert resp.status_code == 200
data = resp.json()
assert "content" in data
content: str = data["content"]
assert len(content) > 0
# Real LLM response should contain Chinese characters.
assert any("\u4e00" <= ch <= "\u9fff" for ch in content)
assert "model" in data
assert "usage" in data
def test_chat_streaming_sse(self, base_url: str, auth_headers: dict[str, str]):
"""POST /llm/chat/stream returns SSE chunks with real content."""
chunks: list[dict[str, Any]] = []
with httpx.Client(base_url=base_url, timeout=90) as client:
with client.stream(
"POST",
"/api/v1/llm/chat/stream",
headers=auth_headers,
json={
"messages": [{"role": "user", "content": "用一句话说明什么是人工智能"}],
"model": TEST_MODEL,
"temperature": 0.7,
"max_tokens": 200,
},
) as resp:
assert resp.status_code == 200
for line in resp.iter_lines():
if not line.startswith("data: "):
continue
payload = line[6:]
if payload == "[DONE]":
break
chunks.append(json.loads(payload))
assert len(chunks) > 0
full_content = "".join(c.get("content", "") for c in chunks)
assert len(full_content) > 0
assert any("\u4e00" <= ch <= "\u9fff" for ch in full_content)
def test_chat_invalid_model_returns_error(self, base_url: str, auth_headers: dict[str, str]):
"""POST /llm/chat with an unknown model returns 404 or 502."""
with httpx.Client(base_url=base_url, timeout=30) as client:
resp = client.post(
"/api/v1/llm/chat",
headers=auth_headers,
json={
"messages": [{"role": "user", "content": "test"}],
"model": "nonexistent-model-xyz-12345",
},
)
assert resp.status_code in (404, 502)
# ---------------------------------------------------------------------------
# 3. Chat REST API Tests
# ---------------------------------------------------------------------------
@pytest.fixture(scope="class")
def chat_session_id(base_url: str, auth_headers: dict[str, str]) -> str:
"""Create a chat session bound to the default agent (created in GUI mode)."""
with httpx.Client(base_url=base_url, timeout=30) as client:
resp = client.post(
"/api/v1/chat/sessions",
headers=auth_headers,
json={"agent_name": "default"},
)
assert resp.status_code in (200, 201), f"Failed to create chat session: {resp.text}"
return resp.json()["session_id"]
@pytest.mark.integration
@pytest.mark.timeout(120)
class TestChatAPI:
"""Verify the chat REST API returns real LLM responses."""
def test_create_session(self, chat_session_id: str):
"""A chat session is created with a non-empty ID."""
assert chat_session_id
assert len(chat_session_id) > 0
def test_send_message_and_get_real_response(
self, base_url: str, auth_headers: dict[str, str], chat_session_id: str
):
"""POST /chat/sessions/{id}/messages returns a real LLM reply."""
with httpx.Client(base_url=base_url, timeout=90) as client:
resp = client.post(
f"/api/v1/chat/sessions/{chat_session_id}/messages",
headers=auth_headers,
json={"content": "你好,请用一句话介绍自己"},
)
assert resp.status_code == 200
data = resp.json()
assert data["role"] == "assistant"
content: str = data["content"]
assert len(content) > 0
# Must not be a mock response.
assert "mock" not in content.lower()
# Real LLM response should contain Chinese characters.
assert any("\u4e00" <= ch <= "\u9fff" for ch in content)
def test_message_history_after_conversation(
self, base_url: str, auth_headers: dict[str, str], chat_session_id: str
):
"""GET /chat/sessions/{id}/messages returns user + assistant messages."""
with httpx.Client(base_url=base_url, timeout=30) as client:
resp = client.get(
f"/api/v1/chat/sessions/{chat_session_id}/messages",
headers=auth_headers,
)
assert resp.status_code == 200
messages = resp.json()
assert isinstance(messages, list)
assert len(messages) >= 2 # at least one user + one assistant
roles = [m["role"] for m in messages]
assert "user" in roles
assert "assistant" in roles
# ---------------------------------------------------------------------------
# 4. WebSocket Chat Tests
# ---------------------------------------------------------------------------
@pytest.mark.integration
@pytest.mark.timeout(120)
class TestWebSocketChat:
"""Verify the WebSocket chat protocol with real LLM streaming."""
@pytest.mark.asyncio
async def test_websocket_full_chat_flow(self, base_url: str, auth_token: str):
"""Connect → send message → receive final_answer with real LLM content."""
try:
import websockets
except ImportError:
pytest.skip("websockets package not installed")
# Create a chat session via REST.
with httpx.Client(base_url=base_url, timeout=30) as client:
resp = client.post(
"/api/v1/chat/sessions",
headers={
"Authorization": f"Bearer {auth_token}",
"Content-Type": "application/json",
},
json={"agent_name": "default"},
)
assert resp.status_code in (200, 201)
session_id = resp.json()["session_id"]
# Connect to the WebSocket (JWT passed via ?token= query param).
ws_url = f"{REAL_LLM_WS_URL}/api/v1/chat/ws/{session_id}?token={auth_token}"
received: list[dict[str, Any]] = []
async with websockets.connect(ws_url) as ws: # type: ignore[name-defined]
# 1. Expect a connected event.
raw = await asyncio.wait_for(ws.recv(), timeout=10)
data = json.loads(raw)
received.append(data)
assert data["type"] == "connected"
# 2. Send a user message.
await ws.send(json.dumps({"type": "message", "content": "你好,请用一句话介绍自己"}))
# 3. Collect events until final_answer / error / timeout.
deadline = time.monotonic() + 90
while time.monotonic() < deadline:
try:
# Wait only for the time left before the overall deadline.
raw = await asyncio.wait_for(ws.recv(), timeout=max(0.1, deadline - time.monotonic()))
except asyncio.TimeoutError:
received.append({"type": "timeout"})
break
msg = json.loads(raw)
received.append(msg)
if msg.get("type") in ("final_answer", "error"):
break
# 4. Assert we got a final_answer (not an error).
types = [m.get("type") for m in received]
assert "connected" in types
final_msgs = [m for m in received if m.get("type") == "final_answer"]
assert final_msgs, f"Expected final_answer, got event types: {types}"
final_content: str = final_msgs[0].get("content", "")
assert len(final_content) > 0
# Must not be a mock response.
assert "mock" not in final_content.lower()
# Real LLM response should contain Chinese characters.
assert any("\u4e00" <= ch <= "\u9fff" for ch in final_content)
@pytest.mark.asyncio
async def test_websocket_ping_pong(self, base_url: str, auth_token: str):
"""WebSocket ping/pong heartbeat works alongside the chat session."""
try:
import websockets
except ImportError:
pytest.skip("websockets package not installed")
with httpx.Client(base_url=base_url, timeout=30) as client:
resp = client.post(
"/api/v1/chat/sessions",
headers={
"Authorization": f"Bearer {auth_token}",
"Content-Type": "application/json",
},
json={"agent_name": "default"},
)
assert resp.status_code in (200, 201)
session_id = resp.json()["session_id"]
ws_url = f"{REAL_LLM_WS_URL}/api/v1/chat/ws/{session_id}?token={auth_token}"
async with websockets.connect(ws_url) as ws: # type: ignore[name-defined]
# Wait for connected.
await asyncio.wait_for(ws.recv(), timeout=10)
# Send ping → expect pong.
await ws.send(json.dumps({"type": "ping"}))
raw = await asyncio.wait_for(ws.recv(), timeout=10)
msg = json.loads(raw)
assert msg["type"] == "pong"
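
The streaming test above reads `data: ` lines and stops at a `[DONE]` marker. A minimal sketch of that parsing step as a standalone helper is shown below, assuming the same line format the test relies on. `iter_sse_json` and `join_content` are hypothetical names introduced for illustration and are not part of AgentKit.

```python
from __future__ import annotations

import json
from typing import Any, Iterable, Iterator


def iter_sse_json(
    lines: Iterable[str], done_marker: str = "[DONE]"
) -> Iterator[dict[str, Any]]:
    """Yield decoded JSON payloads from SSE ``data: `` lines until the done marker."""
    prefix = "data: "
    for line in lines:
        if not line.startswith(prefix):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len(prefix):]
        if payload == done_marker:
            return  # end of stream; anything after the marker is ignored
        yield json.loads(payload)


def join_content(chunks: Iterable[dict[str, Any]]) -> str:
    """Concatenate the ``content`` field of each chunk, treating missing keys as empty."""
    return "".join(chunk.get("content", "") for chunk in chunks)


if __name__ == "__main__":
    sample = [
        "",
        'data: {"content": "人工"}',
        ": keep-alive",
        'data: {"content": "智能"}',
        "data: [DONE]",
    ]
    print(join_content(iter_sse_json(sample)))
```

Keeping the parser free of any HTTP client makes it testable with plain lists of strings; in the E2E test the `lines` argument would be `resp.iter_lines()`.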


@ -49,7 +49,8 @@ ROUTING_TEST_CASES = [
# --- Translation/knowledge → REACT (LLM decides no tool needed) ---
{"id": "translation", "input": "翻译hello为中文", "expected_mode": "react"},
{"id": "knowledge", "input": "什么是机器学习", "expected_mode": "react"},
# U5: pure knowledge Q&A (no tool context) → DIRECT_CHAT, a zero-cost fast path
{"id": "knowledge", "input": "什么是机器学习", "expected_mode": "direct_chat"},
{"id": "summarize", "input": "帮我总结一下这段话", "expected_mode": "react"},
# --- Complex queries → REACT ---


@ -5,7 +5,7 @@ from __future__ import annotations
import pytest
from agentkit.chat.request_preprocessor import RequestPreprocessor
from agentkit.chat.skill_routing import ExecutionMode, SkillRoutingResult
from agentkit.chat.skill_routing import ExecutionMode
# ---------------------------------------------------------------------------
@ -130,6 +130,142 @@ class TestDirectChat:
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
# ---------------------------------------------------------------------------
# Layer 1 extended: Factual / Math / Translation regex (U5)
# ---------------------------------------------------------------------------
class TestFactualMathTranslation:
"""U5: pure knowledge Q&A, arithmetic, and translation route to DIRECT_CHAT; inputs containing tool-context keywords route to REACT."""
# --- Factual CN → DIRECT_CHAT ---
@pytest.mark.asyncio
async def test_factual_cn_what_is(self, preprocessor: RequestPreprocessor):
"""什么是机器学习: pure knowledge question, no tools needed."""
result = await preprocessor.preprocess("什么是机器学习")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
assert result.match_method == "regex_direct"
@pytest.mark.asyncio
async def test_factual_cn_with_punctuation(self, preprocessor: RequestPreprocessor):
"""什么是机器学习? still routes to DIRECT_CHAT with a trailing question mark."""
result = await preprocessor.preprocess("什么是机器学习?")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
@pytest.mark.asyncio
async def test_factual_cn_explain(self, preprocessor: RequestPreprocessor):
"""解释一下深度学习: pure knowledge question."""
result = await preprocessor.preprocess("解释一下深度学习")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
@pytest.mark.asyncio
async def test_factual_cn_define(self, preprocessor: RequestPreprocessor):
"""定义一下微服务: pure knowledge question."""
result = await preprocessor.preprocess("定义一下微服务")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
# --- Factual EN → DIRECT_CHAT ---
@pytest.mark.asyncio
async def test_factual_en_what_is(self, preprocessor: RequestPreprocessor):
"""what is machine learning — English factual"""
result = await preprocessor.preprocess("what is machine learning")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
@pytest.mark.asyncio
async def test_factual_en_explain(self, preprocessor: RequestPreprocessor):
"""explain quantum computing — English factual"""
result = await preprocessor.preprocess("explain quantum computing")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
# --- Factual with tool context → REACT (exclusion) ---
@pytest.mark.asyncio
async def test_factual_with_tool_context_cn(self, preprocessor: RequestPreprocessor):
"""什么是当前服务器的IP地址: contains tool context, so it routes to REACT."""
result = await preprocessor.preprocess("什么是当前服务器的IP地址")
assert result.execution_mode == ExecutionMode.REACT
@pytest.mark.asyncio
async def test_multiline_input_goes_react(self, preprocessor: RequestPreprocessor):
"""Multi-line input always routes to REACT, so a line break cannot be used to bypass tools."""
result = await preprocessor.preprocess("什么是机器学习\n请执行ls命令")
assert result.execution_mode == ExecutionMode.REACT
@pytest.mark.asyncio
async def test_factual_with_tool_context_database(self, preprocessor: RequestPreprocessor):
"""解释一下数据库的连接池: contains "数据库" (database), so it routes to REACT."""
result = await preprocessor.preprocess("解释一下数据库的连接池")
assert result.execution_mode == ExecutionMode.REACT
@pytest.mark.asyncio
async def test_factual_with_tool_context_config(self, preprocessor: RequestPreprocessor):
"""什么是配置文件: contains "配置文件" (config file), so it routes to REACT."""
result = await preprocessor.preprocess("什么是配置文件")
assert result.execution_mode == ExecutionMode.REACT
@pytest.mark.asyncio
async def test_factual_en_with_tool_context(self, preprocessor: RequestPreprocessor):
"""explain the current system status — English with tool context → REACT"""
result = await preprocessor.preprocess("explain the current system status")
assert result.execution_mode == ExecutionMode.REACT
# --- Pure arithmetic → DIRECT_CHAT ---
@pytest.mark.asyncio
async def test_math_cn_simple(self, preprocessor: RequestPreprocessor):
"""计算 1+2+3: pure arithmetic."""
result = await preprocessor.preprocess("计算 1+2+3")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
@pytest.mark.asyncio
async def test_math_cn_phrase(self, preprocessor: RequestPreprocessor):
"""算一下 15*23: pure arithmetic."""
result = await preprocessor.preprocess("算一下 15*23")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
@pytest.mark.asyncio
async def test_math_en(self, preprocessor: RequestPreprocessor):
"""calculate 100 / 4 — pure arithmetic"""
result = await preprocessor.preprocess("calculate 100 / 4")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
# --- Complex math (not pure arithmetic) → REACT ---
@pytest.mark.asyncio
async def test_math_complex_fibonacci(self, preprocessor: RequestPreprocessor):
"""计算斐波那契数列的第100项: contains Chinese text, so it is not pure arithmetic and routes to REACT."""
result = await preprocessor.preprocess("计算斐波那契数列的第100项")
assert result.execution_mode == ExecutionMode.REACT
@pytest.mark.asyncio
async def test_math_complex_prime(self, preprocessor: RequestPreprocessor):
"""计算 100 以内的素数: contains the Chinese words "以内" and "素数", so it routes to REACT."""
result = await preprocessor.preprocess("计算 100 以内的素数")
assert result.execution_mode == ExecutionMode.REACT
# --- Pure translation → DIRECT_CHAT ---
@pytest.mark.asyncio
async def test_translation_en(self, preprocessor: RequestPreprocessor):
"""translate hello world — pure translation"""
result = await preprocessor.preprocess("translate hello world")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
@pytest.mark.asyncio
async def test_translation_cn_with_space(self, preprocessor: RequestPreprocessor):
"""翻译 hello: has a space after the verb, pure translation."""
result = await preprocessor.preprocess("翻译 hello")
assert result.execution_mode == ExecutionMode.DIRECT_CHAT
# --- Translation edge cases → REACT ---
@pytest.mark.asyncio
async def test_translation_with_tool_context(self, preprocessor: RequestPreprocessor):
"""翻译 这个配置文件: contains the tool-context keyword "配置文件", so it routes to REACT."""
result = await preprocessor.preprocess("翻译 这个配置文件")
assert result.execution_mode == ExecutionMode.REACT
@pytest.mark.asyncio
async def test_translation_with_log_context(self, preprocessor: RequestPreprocessor):
"""翻译 服务器日志: contains tool context, so it routes to REACT."""
result = await preprocessor.preprocess("翻译 服务器日志")
assert result.execution_mode == ExecutionMode.REACT
# ---------------------------------------------------------------------------
# Default: REACT
# ---------------------------------------------------------------------------
@ -167,10 +303,9 @@ class TestDefaultReact:
@pytest.mark.asyncio
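async def test_translation_goes_react(self, preprocessor: RequestPreprocessor):
"""Translation queries also route to REACT; the LLM decides inside the agent loop that no tool is needed."""
"""翻译hello为中文: without a space it does not match the translation regex, so it routes to REACT (the LLM decides on tool use)."""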
result = await preprocessor.preprocess("翻译hello为中文")
assert result.execution_mode == ExecutionMode.REACT
# LLM will see tools but decide not to use them
@pytest.mark.asyncio
async def test_default_tools_included(self, preprocessor: RequestPreprocessor):
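
The U5 tests above pin down the routing behaviour but the diff does not show the regular expressions behind it. The sketch below is one way the rules could be written so that every input from `TestFactualMathTranslation` lands in the expected mode. The patterns, the keyword list, and the name `route_layer1` are all assumptions made for illustration; the real `RequestPreprocessor` may differ.

```python
from __future__ import annotations

import re

# Hypothetical tool-context keywords: any hit forces the full agent loop.
TOOL_CONTEXT = re.compile(
    r"当前|服务器|数据库|配置文件|日志"
    r"|\b(?:current|system|server|database|config|log)\b",
    re.IGNORECASE,
)
# Hypothetical fast-path patterns, one per category exercised by the tests.
FACTUAL = re.compile(
    r"^(?:什么是|解释一下|定义一下|what is\s+|explain\s+|define\s+)\S",
    re.IGNORECASE,
)
# Pure arithmetic: a verb followed only by digits, whitespace, and operators.
MATH = re.compile(r"^(?:计算|算一下|calculate)\s*\d[\d\s+\-*/().]*$", re.IGNORECASE)
# Translation requires whitespace after the verb, so "翻译hello为中文" does not match.
TRANSLATE = re.compile(r"^(?:翻译|translate)\s+\S", re.IGNORECASE)


def route_layer1(text: str) -> str:
    """Return "direct_chat" for tool-free queries and "react" for everything else."""
    stripped = text.strip()
    # Multi-line input always goes to the agent loop so a line break
    # cannot smuggle a tool request past the fast path.
    if "\n" in stripped:
        return "react"
    if TOOL_CONTEXT.search(stripped):
        return "react"
    if FACTUAL.match(stripped) or MATH.match(stripped) or TRANSLATE.match(stripped):
        return "direct_chat"
    return "react"


if __name__ == "__main__":
    for query in ["什么是机器学习", "什么是配置文件", "计算 1+2+3", "翻译hello为中文"]:
        print(query, "->", route_layer1(query))
```

Checking the exclusion list before the fast-path patterns keeps the default conservative: a query only skips the agent loop when it matches a narrow pattern and mentions nothing that might need a tool.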


@ -75,6 +75,23 @@ class TestOpenAICompatibleProviderBasic:
assert response.content == "DeepSeek response"
assert response.model == "deepseek-chat"
async def test_timeout_parameter_passed_to_httpx_client(self):
"""Verify that the timeout parameter is passed to the httpx client."""
provider = OpenAICompatibleProvider(
api_key="test-key",
base_url="https://api.openai.com/v1",
timeout=180.0,
)
# httpx stores timeout config on the client
assert provider._client.timeout.read == 180.0
await provider.close()
async def test_default_timeout_is_120s(self):
"""Verify that the default timeout is 120s (not the old hardcoded 60s)."""
provider = OpenAICompatibleProvider(api_key="test-key", base_url="https://api.openai.com/v1")
assert provider._client.timeout.read == 120.0
await provider.close()
class TestOpenAICompatibleProviderToolCalls:
"""Function Calling (tool_calls) tests."""