fischer-agentkit

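chiguyong 2e404cf1a0 test: full regression run + real-LLM E2E + capability benchmark + bug fixes		2026-06-20 18:22:10 +08:00

## Test results

### Backend E2E (real LLM, real server): 13/13 passed
- tests/e2e/test_real_llm_e2e.py: authentication flow, LLM gateway, Chat API, WebSocket
- Uses a real LLM through the Bailian coding plan (qwen3.7-plus), no mocks
- Fixed intermittent 500 errors caused by SQLite write-lock contention (retry mechanism in _login_with_retry)

### Frontend E2E (Playwright + real LLM): 11/11 passed
- login.spec.ts (4): login flow, form validation, token storage
- chat.spec.ts (3): real LLM conversation, message rendering
- terminal.spec.ts (4): terminal panel, whitelist management
- Uses the system Chrome (channel: 'chrome') to avoid downloading a browser

### Benchmark capability evaluation (real LLM)
- full mode: 60% accuracy (5 cases: 3 passed, 2 timed out)
- fast mode: 100% accuracy
- Failed cases: llm-001 (intent_understanding) and llm-004 (code_generation), both timeouts

### Unit tests
- 174 new tests pass
- 28 pre-existing failures (not introduced by this architecture change)

## Code fixes

### chat.ts: remove the any-type TODO (line 406)
- handleWsMessage parameter changed from Record<string, any> to the WsServerMessage union type
- Uses discriminated-union narrowing so each case branch accesses typed fields directly
- Removed the generic payload variable and the unused type imports
- vue-tsc --noEmit reports zero errors

### Infrastructure fixes
- playwright.config.ts: fixed the PROJECT_ROOT path (4 levels up, not 2)
- playwright.config.ts: replaced agentkit serve with uvicorn.run() (avoids the interactive prompt when no tty is attached)
- helpers.ts: API_BASE changed to an absolute URL (Node.js fetch does not support relative URLs)
- helpers.ts: clearAuth fixes a page.evaluate context issue (Node constants are now passed into the browser)
- helpers.ts: loginViaApi adds retry on 429 rate limiting plus token caching
- login.spec.ts / terminal.spec.ts: fixed selector mismatches caused by Ant Design Vue autoInsertSpace
- chat.spec.ts: .first() changed to .last() to avoid picking up historical messages
- setup-test-user.py: .local email changed to .com (EmailStr rejects the .local TLD)
- .gitignore: Playwright artifact paths scoped to the frontend directory

### Dependencies
- pyproject.toml: added pyjwt, bcrypt, aiosqlite
- package.json: added @playwright/test

## Unfinished plan checklist (verification results)

### Plan 001 (chat main area VI rework): active
- U7: the three subcomponents SkillsTab/SystemTab/KnowledgeTab are not implemented
- U8: polishing of the Preview sample scenarios is not finished
- U9: final VI adaptation of BoardMeetingModal is not finished
- U10: quality gate and backend regression tests are not finished

### Plan 002 (enterprise C/S architecture): under design review
- 8 open decision questions remain unresolved (target customer / deployment location / terminal form factor, etc.)
- P2/P3/P4 modules deferred

### Plan 003 (enterprise C/S evolution): completed
- 7 Deferred items (Web admin console / skill marketplace / SSO / code indexing / multi-tenancy, etc.)

### Code stubs
- DockerComputerUseSession: the 4 methods start/stop/screenshot/execute_action are stubs (they require real Docker + VNC + the Anthropic Computer Use API and are a future feature)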
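
The commit message attributes the intermittent 500 errors to SQLite write-lock contention and says they were handled with a retry in _login_with_retry. The real helper is not shown here, so the following is only a minimal sketch of that kind of retry, assuming the login call returns an HTTP status code. The names `login_with_retry`, `attempt_login`, `max_attempts`, and `base_delay` are hypothetical and not taken from the repository.

```python
import time
from typing import Callable


def login_with_retry(
    attempt_login: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry a login call while it returns a 5xx status (hypothetical helper).

    attempt_login is assumed to perform one login request and return the
    HTTP status code. The delay doubles after each failed attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 0
    for attempt in range(max_attempts):
        status = attempt_login()
        if status < 500:
            # Success or a client error: retrying would not help.
            return status
        if attempt < max_attempts - 1:
            # Back off before the next try so the competing writer can finish.
            sleep(base_delay * (2 ** attempt))
    return status
```

Injecting `sleep` keeps the sketch testable without real waiting. A retry like this only masks the contention in tests; it does not remove the underlying lock competition.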
..
baseline.json	refactor: standardize benchmark with industry methodology (P/R/F1, multi-run, baseline)	2026-06-17 12:01:34 +08:00
benchmark_report.html	feat: comprehensive capability benchmark and agentkit benchmark CLI	2026-06-17 11:28:09 +08:00
benchmark_report.json	test: full regression run + real-LLM E2E + capability benchmark + bug fixes	2026-06-20 18:22:10 +08:00
benchmark_report.md	test: full regression run + real-LLM E2E + capability benchmark + bug fixes	2026-06-20 18:22:10 +08:00
benchmark_report.txt	feat: add LLM and GUI benchmark modes with real agent testing	2026-06-17 12:55:19 +08:00
benchmark_report_cn.md	docs: add detailed Chinese benchmark report with industry comparison	2026-06-17 11:34:56 +08:00
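
The baseline.json commit above mentions standardizing the benchmark on P/R/F1 with multiple runs. The repository's scoring code is not shown in this listing, so the sketch below only illustrates the standard definitions of precision, recall, and F1, averaged over runs. The function names and the `(tp, fp, fn)` input shape are assumptions made for illustration.

```python
from statistics import mean
from typing import Iterable, Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Standard precision, recall, and F1 from raw counts.

    Each metric falls back to 0.0 when its denominator is zero.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


def average_over_runs(
    runs: Iterable[Tuple[int, int, int]],
) -> Tuple[float, float, float]:
    """Average each metric across runs given as (tp, fp, fn) tuples."""
    scores = [precision_recall_f1(tp, fp, fn) for tp, fp, fn in runs]
    if not scores:
        raise ValueError("at least one run is required")
    precisions, recalls, f1s = zip(*scores)
    return mean(precisions), mean(recalls), mean(f1s)
```

Averaging per-run scores, rather than pooling the counts, weights every run equally regardless of how many cases it contained. Whether the actual benchmark does this is not stated in the listing.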