feat: complex-task-quality-loop (R1-R12) #22

Merged
fischer merged 13 commits from feat/complex-task-quality-loop into main 2026-07-05 22:31:22 +08:00
3 changed files with 653 additions and 0 deletions
Showing only changes of commit 7c900ce280

3
.gitignore vendored

@ -61,3 +61,6 @@ data/
# Local temp files
tmp_*.html
/delete_old_cluster.sh
# Git worktrees (local-only, isolated workspaces)
.worktrees/


@ -0,0 +1,244 @@
---
date: 2026-07-02
topic: complex-task-quality-loop
---
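# Complex Task Quality Loop: verify → reflect → evolve
## Summary
Following the main line of "get a single task right + learn from failure", this work wires agentkit's existing but disconnected verification / reflexion / evolution mechanisms into a closed loop: a complex task (PLAN_EXEC/TEAM_COLLAB) runs verify once it finishes; if verify fails, reflexion reflects and retries; when the task ends, evolution is triggered automatically to record pitfalls and optimize the prompt. As foundations, it adds a structured editing tool and a "keep working until done" bias.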
---
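## Problem Frame
agentkit "simply cannot meet expectations" on complex tasks. Failure modes include not running at all, going wrong after a few steps, and flatly claiming it lacks the capability. The root cause is not missing mechanisms but mechanisms that are "declared but not connected":
- `verification_enabled` defaults to `False` (`src/agentkit/core/react.py:165`), so the VERIFICATION phase does not enforce tests
- `write_file` is listed in `_DEFAULT_CORE_TOOLS` but has no implementation class (`src/agentkit/core/react.py:150-156`), so LLM calls to it fail
- reflexion only acts as a last resort in the Recovery layer of `_fallback_chain.py`, not in the main flow
- evolution can only be triggered manually (`/api/v1/evolution/trigger`) and does not run automatically after a task
- REWOO/REFLEXION/TEAM_COLLAB fall back to REACT (`src/agentkit/server/routes/chat.py:1336`); the "raises not yet supported" behavior described in AGENTS.md is outdated
- `max_steps=10` is a hard cap with no "keep working until done" bias; reaching it returns a partial result
These symptoms fall into three classes: (1) **not connected**: reflexion is fallback-only, evolution is manual-only, and REWOO/REFLEXION/TEAM_COLLAB fall back to REACT; (2) **bug**: `write_file` has no implementation class; (3) **missing feature**: no keep-working bias and a hard `max_steps=10` cap. Approach B directly addresses only the first class (assembling the loop); the other two are incidental fixes.
To the user this feels like systemic failure: no retry mechanism and no self-evolution. Point fixes do not solve it; the independent parts need to be assembled into a closed loop.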
---
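## Key Decisions
- **Choose the closed-loop main line (Approach B) over wiring items up one by one (A) or a brand-new Task Runtime (C).** A does not form a closed loop, so the experience stays fragmented; C is too heavy and wastes the existing plan_exec foundation; B reuses existing mechanisms to assemble the loop.
- **Reflexion runs only for complex tasks.** PLAN_EXEC/TEAM_COLLAB enable in-task reflection; DIRECT_CHAT/REACT do not, so simple tasks are not slowed down.
- **The step budget changes from a single max_steps to phase quotas.** Ten steps shared across the think/verify/reflect phases are not enough; each phase needs its own quota.
- **Evolution is triggered automatically rather than manually.** It is triggered when a task ends, whether it succeeded or failed: failures always run, successes run at a sample rate.
### Resolved Questions (formerly OQ1-OQ4, resolved through research during ce-doc-review)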
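- **RQ1 (formerly OQ1): step phase quotas are think=7 / verify=2 / reflect=1 (total budget 10, backward compatible with the current max_steps=10).** Rationale: 1 step = one Think→Act→Observe cycle (1 LLM call + N parallel tool calls); verify currently consumes 1 extra step through `max_reinjections=1`; reflexion's evaluate/reflect LLM calls consume tokens, not steps. The three quotas are counted independently and are not shared: when think is exhausted, verify is forced; when verify is exhausted, the best result is returned; when reflect is exhausted, no further reflection happens. **Budget justification:** keeping the total at 10 is a backward-compatibility constraint. The problem described in Problem Frame, that max_steps=10 is insufficient, is addressed by reallocating the budget into phase quotas: RQ1 makes the extra step that verify currently consumes through `max_reinjections=1` explicit as the verify=2 quota, freeing a think=7 budget for continuous reasoning (previously think and verify shared the 10 steps, so verify consumption squeezed think). If planning validation finds that 10 steps are still not enough, complex tasks can opt in to a higher total budget (backward compatible: when unset, behavior is the same as today).
- **RQ2 (formerly OQ2): the evolution sample rate for successful tasks is 0.1 (compromise: 0.15).** Rationale: the default `RuleBasedReflector` makes 0 LLM calls and only generates suggestions when `outcome=="failure" and quality<0.3`, so successful tasks produce almost no suggestions and the evolution flow exits early; with `BootstrapPromptOptimizer.min_examples=3`, 10% sampling needs about 30 successful tasks to collect enough optimization examples. Add `EvolutionConfig.success_sample_rate: float = 0.1` and gate on `random.random() < rate` at the `on_task_complete` entry point, mirroring the `audit_sample_rate` pattern at `alignment.py:115`. Failed tasks keep 100% reflection. **Known limitations:** (1) The default `RuleBasedReflector` only generates suggestions when `outcome=='failure'`, so under the default reflector the success-sampling path exits early 100% of the time. Success sampling requires either upgrading to a reflector implementation that can produce a learning signal on successful tasks, or removing the success-sampling path and keeping only the failure path. (2) The 0.1 rate assumes about 30 successful tasks are enough to reach `min_examples=3`; actual activation time depends on task throughput. `success_sample_rate` is configurable (`EvolutionConfig.success_sample_rate: float`) and should be calibrated against observed throughput.
- **RQ3 (formerly OQ3): maximum reflexion retries are 2 on the main path; the Recovery layer stays at 1.** Rationale: the main path currently hard-codes `max_reflections=3` (`config_driven.py:1047,835`) and cannot be configured; the Recovery layer uses `max_reflections=1` (`_fallback_chain.py:118`). Changing the main path to 2 creates a gradient (the first retry is the most effective, the second is a last resort) and makes the value readable from configuration. The number of reflexion attempts is limited independently by the `max_reflections=2` setting and does not consume step quota; the think quota (7) is shared by the ReAct loops of all attempts; the evaluate/reflect LLM calls count toward tokens, not step quota.
- **RQ4 (formerly OQ4): add a new `spec_review_request` event instead of reusing `confirmation_request`.** Rationale: (1) the frontend connects to `portal.py` (`/api/v1/portal/ws`), but the confirmation protocol is implemented only in `chat.py` and portal.py has no confirmation support at all, so there is very little to reuse; (2) `SpecManager.confirm` is a synchronous data-layer method invoked only through the REST API (`/specs/{spec_id}/confirm`) and is not wired into the chat flow; (3) `plan_exec_engine.py:277` executes immediately after generating the Spec, with no pause point; (4) the semantics differ widely: tool confirmation approves or rejects a single command (5 min timeout), whereas Spec confirmation reviews a complete plan (confirm/reject/edit), triggers replanning on rejection, and needs a `parked` status plus resume-on-return. Add `spec_review_request` (carrying spec_id/goal/steps) and `spec_review_reply` (carrying decision), and add a `spec_review_handler` parameter to PlanExecEngine.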
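
The sampling gate described in RQ2 can be sketched roughly as follows. This is a minimal illustration, not the agentkit implementation: only `success_sample_rate` and the `random.random() < rate` comparison come from the text above, while the stand-in `EvolutionConfig` class and the helpers `should_run_evolution` and `expected_successes_needed` are hypothetical names introduced here.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class EvolutionConfig:
    # `success_sample_rate` is the field named in RQ2; this stand-in class
    # is illustrative and omits every other setting.
    success_sample_rate: float = 0.1


def should_run_evolution(
    outcome: str,
    config: EvolutionConfig,
    rng: random.Random | None = None,
) -> bool:
    """Return True when evolution should run for a finished task."""
    if outcome != "success":
        # Failures (and any other non-success outcome) always run.
        return True
    # Successes are sampled: random() is in [0, 1), so rate=0.0 never fires.
    draw = rng.random() if rng is not None else random.random()
    return draw < config.success_sample_rate


def expected_successes_needed(min_examples: int, rate: float) -> float:
    """Average number of successful tasks needed to sample `min_examples`."""
    return min_examples / rate
```

With `min_examples=3` and a rate of 0.1, `expected_successes_needed` gives the "about 30 successful tasks" figure quoted in RQ2; passing a seeded `random.Random` makes the gate deterministic in tests.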
---
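## Requirements
### Foundations (all tasks benefit)
- R1. Fix the `write_file` placeholder and provide a structured file editing tool (`str_replace_editor` semantics: create / str_replace / insert_at_line), replacing the use of shell `sed`/`cat` to modify files.
**Security requirement:** every path parameter must be resolved and then prefix-checked so that it stays within the workspace root, and symlink escapes must be rejected; align with the existing 6-layer terminal security paradigm.
- R2. Change the default of `verification_enabled` to `True`.
- R3. The VERIFICATION phase must run the project's tests (pytest / ruff) rather than merely allow-listing shell commands.
### Closed-loop main line (complex tasks)
- R4. Upgrade reflexion from a fallback of last resort to the main-flow reflection loop for complex tasks (PLAN_EXEC/TEAM_COLLAB): verify fails → reflect → retry.
- R5. Trigger evolution automatically when a task ends (whether it succeeded or failed): Reflector records + PitfallDetector detects + PromptOptimizer optimizes.
**Quality gate:** apply a confidence threshold before a pitfall is stored (low-confidence pitfalls are discarded or marked observe-only); PromptOptimizer must pass a consumption gate before consuming pitfalls (for example, sample count ≥ min_examples and confidence meets the threshold); in observe-only mode pitfalls are recorded but not fed to the optimizer, so that noise does not degrade the prompt.
- R6. Evolution trigger thresholds: failures always run; successes run at a sample rate.
**Integrity/authorization:** before evolution artifacts (pitfalls / optimized prompts) are shared across tasks they must be labeled with an actor (which agent / expert produced them); the trust boundary for cross-task sharing is defined in planning (shared within the same workspace by default; cross-workspace sharing requires explicit opt-in).
### Capability wiring
- R7. TEAM_COLLAB no longer falls back to REACT and is wired into its real execution mode (REWOO/REFLEXION-as-mode is moved to Deferred; see Scope Boundaries for the reasons).
- R8. Wire the Spec confirmation gate into the chat flow: after the Spec is first generated, pause via the new `spec_review_request` event and wait for human confirmation; execute only after confirmation (`spec_review_reply`).
### Bias and budget
- R10. Enable the "keep working until done" bias for complex tasks: until the step budget is reached, do not give up on a single verify failure; enter reflection and retry automatically.
- R11. Change the step budget to phase quotas (think / verify / reflect), replacing the single `max_steps`.
- R12. Pitfall retrieval/injection: during task planning, retrieve historical pitfalls from the PitfallDetector store by goal/skill similarity and inject them into the prompt context.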
---
## Key Flows
- F1. Complex task quality loop
- **Trigger:** a PLAN_EXEC / TEAM_COLLAB task executes
- **Actors:** ReActEngine, VerificationLoop, ReflexionEngine, evolution module
- **Steps:** task executes → verify → fails → reflexion reflects → retry (bounded by phase quotas) → task ends → evolution auto-triggers (failures always / successes sampled)
- **Covered by:** R2, R3, R4, R5, R6, R10, R11
- F2. Spec confirmation gate
- **Trigger:** PLAN_EXEC generates a Spec
- **Actors:** SpecManager, chat handler, user
- **Steps:** Spec generated → pause → user decides (confirm / reject) → execute after confirmation / replan after rejection
- **Covered by:** R8
---
## Visualizations
```mermaid
flowchart TB
A[Complex task starts] --> B[Execute: think/act/observe]
B --> C{verify}
C -->|Pass| D[Mark completed]
C -->|Fail| E{Phase quota?}
E -->|Remaining| F[reflexion reflects]
F --> B
E -->|Exhausted| G[Mark failed]
D --> H[evolution auto trigger]
G --> H
H --> I[Reflector records]
I --> J[PitfallDetector detects]
J --> K[PromptOptimizer optimizes]
```
---
## Acceptance Examples
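- AE1. Complex task reflects and retries after verify fails
- **Covers R2, R4, R10.**
- **Given:** a PLAN_EXEC task finishes executing and verify runs pytest, which fails
- **When:** reflexion is triggered, reflects on the error, and generates a corrected approach
- **Then:** the task retries within its phase quotas; if it still fails, the task is marked failed and evolution is triggered
- AE2. Simple tasks do not run reflexion
- **Covers R4.**
- **Given:** a simple task runs in DIRECT_CHAT mode
- **When:** the task completes
- **Then:** the reflexion loop is not triggered and the result is returned directly
- AE3. Evolution records automatically after a task fails
- **Covers R5, R6.**
- **Given:** a complex task ultimately fails (verify fails and retries are exhausted)
- **When:** the task ends
- **Then:** evolution triggers automatically; Reflector records the failure cause and PitfallDetector detects patterns
- AE4. Spec confirmation gate
- **Covers R8.**
- **Given:** PLAN_EXEC generates a Spec
- **When:** the Spec is generated for the first time
- **Then:** execution pauses to wait for user confirmation and continues only after the user confirms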
---
## Success Criteria
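- The "stops half-finished" behavior on complex tasks disappears: a failed verify triggers automatic reflection and retry instead of returning a partial result directly.
- Complex task results are trustworthy: a task is marked completed only when verify passes.
- Failures leave a record: every failure triggers an evolution record, and pitfalls are not repeated.
- Simple tasks are unaffected: DIRECT_CHAT / REACT do not run reflexion, and responses are not slowed down.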
---
## Scope Boundaries
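### Deferred for later
- Tiered sandbox (read-only / workspace-write / danger): P2 priority. For this work the minimum sandbox tier (workspace-write, no network) is pulled into scope as a prerequisite for R3/R10; full tiering is left for later.
- REWOO/REFLEXION-as-mode (wired in as standalone execution modes): deferred after R7 was narrowed to TEAM_COLLAB only. Reasons: they currently serve no goal (RV10) and are conceptually conflated with reflexion-as-retry-mechanism (RV20).
- R9 coding_harness (Worker-Verifier adversarial) wired into the PLAN_EXEC DELIVERY phase: deferred. Reasons: R3+R4 already meet the goal (RV11); the mapping from the 4-stage pipeline to the single-stage PLAN_EXEC phase is undefined (RV12); and R8/R9 have no independent success criteria (RV13). **Trust boundary:** coding_harness executes untrusted code, which must run inside a sandbox, so it depends on the minimum sandbox tier as a prerequisite.
- Model-driven autonomous compaction: the existing threshold scheme is good enough.
- Three-level nested loop (submission / handler / turn): the benefit does not justify the cost.
- Human-readable markdown output for Specs: this work uses the existing YAML Spec + confirmation gate; the markdown form is left for later.
### Outside this product's identity
- Tool minimalism (cutting down to Bash + apply_patch): agentkit is heading toward skills / expert teams, and its 25 tools are a business need.
- A brand-new Task Runtime concept: the plan_exec foundation already exists, so no new concept is introduced.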
---
## Dependencies / Assumptions
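- The evolution module (Reflector / PitfallDetector / PromptOptimizer / ABTester) is already implemented; this work only wires it in.
- ReflexionEngine is already implemented; this work upgrades its role in the main flow.
- VerificationLoop is already implemented; this work changes its default and adds forced constraints.
- SpecManager.confirm is already implemented (REST API); this work adds the `spec_review_request`/`spec_review_reply` events to wire it into the chat flow.
- coding_harness.yaml is already configured; this work wires it into the DELIVERY phase.
- Assumption: the step quota redesign does not break existing DIRECT_CHAT / REACT semantics.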
---
## Outstanding Questions
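### Resolved (see Key Decisions → Resolved Questions)
- OQ1-OQ4 were resolved through research during ce-doc-review; the decisions are recorded in `Resolved Questions` (RQ1-RQ4).
### New (found during the ce-doc-review research phase)
- **OQ5.** DIRECT_CHAT mode (`chat.py:1245`) calls `llm_gateway.chat()` directly and bypasses BaseAgent. Should an evolution trigger be added for DIRECT_CHAT, or do we accept that "DIRECT_CHAT does not evolve" (simple tasks have low evolution value)?
- **OQ6.** `execute_stream()` (`config_driven.py:686`) bypasses the `on_task_complete`/`on_task_failed` hooks, so R5's automatic trigger does not take effect on the streaming path. Should the hooks be invoked at the end of `execute_stream`, or should the trigger become asynchronous fire-and-forget (`asyncio.create_task`) so that it does not block the streaming response?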
### From 2026-07-02 review
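The following come from ce-doc-review (5 personas: coherence / feasibility / product-lens / scope-guardian / adversarial): 17 actionable findings + 5 FYI observations, all left for planning to handle.
**P1: must be resolved in planning (blocks implementation)**
- RV1. R11's phase quotas are a hidden prerequisite of R4/R10/AE1, but their values are deferred to OQ1 (product-lens, 100). The F1 loop cannot be verified end to end until OQ1 is resolved. **Action:** planning must first set v0 values for the phase quotas, or descope R4/R10 until R11 is decided.
- RV2. R2's global `verification_enabled=True` conflicts with the performance goal for simple tasks (scope-guardian, 100). Non-code REACT tasks (translation/research) would run pytest/ruff at final-answer. **Action:** restrict R2 to `default True for PLAN_EXEC/TEAM_COLLAB; DIRECT_CHAT/REACT stay False`.
- RV3. Sandbox is deferred to P2, but R3+R10 increase execution of untrusted code (product-lens, 75). The security posture regresses relative to the current "early stop". **Action:** pull the minimum sandbox tier (workspace-write, no network) into this scope as a prerequisite for R3/R10, or add a workspace-bounded constraint on reflexion retries.
- RV4. R4 assumes ReflexionEngine can drive PLAN_EXEC, but phase_policy support is missing (adversarial, 75). `reflexion.py:88-92` instantiates a vanilla ReActEngine with no phase_policy. **Action:** add a dependency note to R4: either ReflexionEngine is refactored to forward phase_policy, or verify→reflect→retry is implemented inside ReActEngine (which already holds phase_policy).
- RV5. R11's assumption that it "does not break DIRECT_CHAT/REACT semantics" is load-bearing and unverified (adversarial, 75). DIRECT_CHAT/REACT use the same ReActEngine (max_steps=10) with no verify/reflect phases. **Action:** state an explicit compatibility contract in R11: `max_steps is kept as the total budget; phase quotas are an opt-in parameter for complex tasks; when unset, behavior is the same as today`.
- RV6. Bundling the R1-R3 bug fixes with the closed-loop architecture delays immediate value (product-lens, 75). **Action:** consider splitting R1/R2/R3 into a ship-first slice with independent acceptance, with R4-R11 as a second phase.
- RV7. The "pitfalls are not repeated" goal is only half served: pitfalls are recorded but never retrieved (product-lens, 75). R5/R6 cover recording only, with no retrieval/injection. **Action:** add a pitfall retrieval/injection requirement, or descope the "not repeated" clause.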
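**P2: should be resolved in planning (affects correctness)**
- RV8. R3 forces pytest/ruff but has no requirement for handling projects with no tests or non-Python projects (product-lens+adversarial, 100). Non-code tasks would fail incorrectly or pass vacuously. **Action:** restrict R3 to coding tasks; for non-coding tasks, the Spec declares the verification commands.
- RV9. The mermaid diagram gates reflexion behind the quota check, which contradicts F1/AE1/Summary (coherence, 75). **Action:** reorder the mermaid diagram: verify fails → reflexion → quota decision.
- RV10. R7 pulls in the REWOO/REFLEXION modes without tying them to any goal (scope-guardian, 75). Standalone REFLEXION is redundant with R4; REWOO serves no goal. **Action:** narrow R7 to TEAM_COLLAB only.
- RV11. The R9 adversarial harness overlaps with R3 but has no goal-level rationale (scope-guardian, 75). R3+R4 already meet the goal. **Action:** remove R9 from these requirements or justify it separately.
- RV12. R9 maps a 4-stage pipeline onto a single-stage PLAN_EXEC phase with no mapping defined (adversarial, 75). **Action:** replace R9 with a concrete integration contract, or defer it to planning.
- RV13. R8/R9 are orphan requirements with no success criteria (product-lens+scope-guardian, 100). **Action:** add success criteria for R8/R9, or move R9 to Deferred.
- RV14. R5/R6 trigger evolution automatically with no quality gate on its output (product-lens+adversarial, 100). Noisy pitfalls fed to PromptOptimizer would degrade the prompt. **Action:** add a pitfall confidence threshold + a PromptOptimizer consumption gate + an observe-only mode.
- RV15. R4/R10 ignore that ReActEngine already implements verify→reinject→retry (adversarial, 75). `react.py:1278-1308` already has reinjection, gated only by max_reinjections=1. **Action:** add a decision note to R4 explaining why reflexion was chosen over raising max_reinjections.
- RV16. R8's Spec confirmation gate assumes the user is synchronously available and has no asynchronous fallback (adversarial, 75). With the existing 5-minute timeout, a long PLAN_EXEC task times out as soon as the user steps away. **Action:** add a timeout policy + resume-on-return to R8 (parked, not failed).
- RV17. The success criteria could all pass while "stops half-finished" persists (adversarial, 75). The budget values are deferred to OQ1. **Action:** add a quantitative success criterion: at least one green run on a reference task.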
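**FYI observations (no decision needed; planning should just be aware)**
- RV18. R10 uses the term "step budget", but R11 explicitly replaces it (coherence, 50). Inconsistent terminology.
- RV19. R8's Spec gate adds friction to every PLAN_EXEC run; the shift in positioning is not acknowledged (product-lens, 50).
- RV20. R7/R4 conflate REFLEXION-as-mode with reflexion-as-retry-mechanism (adversarial, 50).
- RV21. The MVP path (R1+R2+R3 only) was not evaluated before committing to Approach B (adversarial, 50).
- RV22. R10's "keep working" conflicts with ReActEngine's loop detector (threshold=2) (adversarial, 50). Retrying the same fix would trip loop detection and abort.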
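**Residual concerns (new signals, not restatements of findings)**
- R7 TEAM_COLLAB may overlap with the ExpertTeamRouter path (feasibility)
- Whether ReWOOAgent/ReflexionEngine expose a streaming interface compatible with the chat WebSocket (feasibility)
- SpecManager.confirm's synchronous signature vs. an asynchronous handshake: is a new awaitable gate needed? (feasibility)
- "keep working" + phase quotas may burn tokens without converging (product-lens)
- Buy-vs-build was not considered for str_replace_editor/coding_harness; OpenHands/Aider offer mature alternatives (product-lens)
- Runtime correctness of the evolution module is unverified; files existing does not mean it works end to end (adversarial)
- ReflexionEngine defaults (quality_threshold=0.7, max_reflections=3) are inherited without justification (adversarial)
- PLAN_EXEC and TEAM_COLLAB have different integration surfaces; the latter uses TeamOrchestrator+SharedWorkspace (adversarial)
- Whether the evolution module has ever run end to end on real failures (adversarial)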
---
## Sources / Research
- Six-dimension architecture survey (with line numbers): `src/agentkit/core/react.py`, `src/agentkit/core/verification_loop.py`, `src/agentkit/core/phase.py`, `src/agentkit/core/spec_manager.py`, `src/agentkit/core/plan_exec_engine.py`, `src/agentkit/server/_fallback_chain.py`, `src/agentkit/evolution/`, `src/agentkit/server/routes/chat.py`
- Where AGENTS.md and the code disagree: at `src/agentkit/server/routes/chat.py:1336`, REWOO/REFLEXION/TEAM_COLLAB actually fall back to REACT rather than "raising not yet supported".
- `write_file` placeholder: `_DEFAULT_CORE_TOOLS` at `src/agentkit/core/react.py:150-156` contains `write_file`, but there is no implementation class.
- Industry references: Codex agent loop (single-threaded ReAct + forced verify), Qoder Quest (Spec → Code → Verify loop + automatic evolution), Trae SOLO Spec mode (confirmation gate).


@ -0,0 +1,406 @@
---
title: "feat: Complex task quality loop (verify → reflect → evolve)"
type: feat
date: 2026-07-03
origin: docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md
---
# Complex Task Quality Loop (verify → reflect → evolve)
## Summary
Assemble agentkit's declared-but-disconnected verification, reflexion, and evolution mechanisms into a unified quality loop for complex tasks (PLAN_EXEC/TEAM_COLLAB). Tasks run → verify → if fail, reflexion reflect→retry → on completion, auto-trigger evolution (record pitfall + optimize prompt). Foundational fixes: structured file editing tool, verification defaults, step budget phases, minimum sandbox, Spec review gate. The loop replaces the current "early stop on failure" behavior with "keep working until done, then learn from the outcome."
---
## Problem Frame
agentkit fails on complex tasks because its quality mechanisms are declared but not connected:
- `verification_enabled` defaults to `False` (`src/agentkit/core/react.py:171`) — VERIFICATION phase doesn't enforce tests
- `write_file` listed in `_DEFAULT_CORE_TOOLS` (`src/agentkit/core/react.py:156-162`) but has no implementation class — LLM calls fail
- reflexion only runs in `_fallback_chain.py` Recovery layer, not in the main execution flow
- evolution only triggers manually via `/api/v1/evolution/trigger` — no auto-trigger after tasks
- TEAM_COLLAB falls back to REACT (`src/agentkit/server/routes/chat.py:1336`) instead of running the real orchestrator
- `max_steps=10` hard cap with no "keep working until done" bias — tasks stop at the first verify failure
- `execute_stream()` (`src/agentkit/core/config_driven.py:686`) bypasses `on_task_complete`/`on_task_failed` hooks — R5's auto-evolution would silently no-op on the WebSocket streaming path (the primary user-facing path)
The result is systemic failure: no retry mechanism, no self-evolution. Single-point fixes don't solve this — the independent parts must be assembled into a closed loop.
(See origin: `docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md`)
---
## Requirements
Requirements are grouped by concern. Each carries its origin R-ID for traceability.
### Foundations (all tasks benefit)
- **R1.** Provide a structured file editing tool (`str_replace_editor` with `create` / `str_replace` / `insert_at_line` / `view` commands), replacing the broken `write_file` placeholder. All path parameters must resolve and prefix-check against workspace root, rejecting symlink escape; align with the existing 6-layer terminal security paradigm.
- **R2.** `verification_enabled` defaults to `True` for PLAN_EXEC/TEAM_COLLAB; DIRECT_CHAT/REACT stay `False` (per RV2 — global True would force pytest/ruff on non-code REACT tasks like translation/research).
- **R3.** VERIFICATION phase forces project tests (pytest/ruff) for coding tasks; non-coding tasks use Spec-declared verification commands (per RV8 — forcing pytest on non-Python projects causes false failures).
### Closed loop (complex tasks)
- **R4.** Reflexion upgraded from fallback-only to main-flow retry for PLAN_EXEC/TEAM_COLLAB: verify fails → reflect → retry. Implemented by extending ReActEngine's existing reinjection loop, not by driving PLAN_EXEC through ReflexionEngine (per RV4, RV15, RV20 — ReflexionEngine doesn't forward `phase_policy`, and reflexion-as-mode is conceptually distinct from reflexion-as-retry).
- **R5.** Auto-trigger evolution on task completion (success or failure): Reflector records + PitfallDetector detects + PromptOptimizer optimizes. Quality gate: pitfall confidence threshold before ingestion; PromptOptimizer consumption gate (sample count ≥ `min_examples` and confidence meets the threshold); observe-only mode records without feeding optimizer to avoid noise-driven prompt degradation (per RV14).
- **R6.** Evolution trigger thresholds: failure always runs; success runs at sample rate 0.1 (per RQ2). Integrity/auth: evolution artifacts (pitfalls, optimized prompts) carry actor marking (which agent/expert produced them); cross-workspace sharing defaults off, requires explicit opt-in (per RV14 trust boundary).
### Capability wiring
- **R7.** TEAM_COLLAB does not fall back to REACT — surface failure to user instead of silent degradation. (REWOO/REFLEXION-as-mode deferred per RV10, RV20.)
- **R8.** Spec review gate: first Spec generation emits `spec_review_request` event, suspends execution pending user confirmation (`spec_review_reply`). Confirmation → execute; rejection → replan; timeout → `parked` status (not `failed`) with resume-on-return (per RV16 — 5-min timeout is too short for long tasks).
### Bias and budget
- **R10.** "Keep working until done" bias for complex tasks: don't abandon on first verify failure, auto-enter reflexion retry within remaining step budget.
- **R11.** Step budget split into phase quotas (think=7 / verify=2 / reflect=1 per RQ1), replacing single `max_steps=10`. Quotas are opt-in for PLAN_EXEC/TEAM_COLLAB; `max_steps=10` preserved as total budget for backward compatibility (per RV5 — DIRECT_CHAT/REACT must keep current semantics).
- **R12.** Pitfall retrieval/injection: at task planning, retrieve historical pitfalls by goal/skill similarity from PitfallDetector store and inject into prompt context (per RV7: the current system only records and never retrieves, so the "pitfalls are not repeated" goal is half-served).
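
The phase quotas in R11 could be tracked along the following lines. This is a hedged sketch rather than the planned `ReActEngine` change: the quota values and the exhaustion behavior come from RQ1, while the `PhaseBudget` class, its method names, and the action strings are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field

# Quota values from RQ1; they sum to the existing max_steps=10.
DEFAULT_PHASE_BUDGETS = {"think": 7, "verify": 2, "reflect": 1}

# What happens when a phase runs out (labels are illustrative).
EXHAUSTION_ACTIONS = {
    "think": "force_verify",
    "verify": "return_best_result",
    "reflect": "stop_reflecting",
}


@dataclass
class PhaseBudget:
    """Independent per-phase step counters (hypothetical helper)."""

    quotas: dict[str, int] = field(
        default_factory=lambda: dict(DEFAULT_PHASE_BUDGETS)
    )
    spent: dict[str, int] = field(default_factory=dict)

    def remaining(self, phase: str) -> int:
        return self.quotas.get(phase, 0) - self.spent.get(phase, 0)

    def consume(self, phase: str) -> bool:
        """Spend one step in `phase`; return False if already exhausted."""
        if self.remaining(phase) <= 0:
            return False
        self.spent[phase] = self.spent.get(phase, 0) + 1
        return True

    def on_exhausted(self, phase: str) -> str:
        return EXHAUSTION_ACTIONS[phase]
```

Because `spent` is plain data, a resumed task could rebuild it from persisted phase state instead of starting from zero, which is the property the plan asks of the budget counters.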
---
## Key Technical Decisions
- **KTD-1. Verification canonical path is engine-internal at final-answer (`src/agentkit/core/react.py:1303-1376`), not `RunTestsTool`.** `RunTestsTool` (`src/agentkit/tools/builtin.py:16`) remains for agent-initiated mid-task verification. The engine-internal path runs automatically at the final-answer gate. This avoids double-verify and keeps the agent's manual tool distinct from the engine's automatic gate.
- **KTD-2. Reflexion-as-retry is implemented by extending ReActEngine's reinjection loop, not by driving PLAN_EXEC through ReflexionEngine.** ReflexionEngine (`src/agentkit/core/reflexion.py:88-92`) constructs a vanilla ReActEngine without forwarding `phase_policy` — refactoring it to drive PLAN_EXEC would be large and conceptually conflates reflexion-as-mode with reflexion-as-retry. Instead, extend the existing reinjection loop (which already holds `phase_policy`) to call a reflect step after `max_reinjections` exhausts. ReflexionEngine stays as the standalone engine for the deferred REFLEXION-as-mode.
- **KTD-3. Evolution triggering is a system lifecycle concern, not an agent capability.** The fix is hook-wiring (connecting `on_task_complete`/`on_task_failed` to the streaming path), not exposing evolution as an agent-callable tool. Agents produce the work; the system evolves from the outcome.
- **KTD-4. `execute_stream()` must invoke `on_task_complete`/`on_task_failed` to maintain lifecycle parity with `execute()`.** This is the single most load-bearing architectural fix — without it, R5/R6 are no-ops on the WebSocket streaming path (the primary user-facing path). Use fire-and-forget `asyncio.create_task` with backpressure cap (`max_concurrent * 2`) and shutdown drain per the portal-platform-security-reliability-fixes learning. Evolution errors must not fail the stream.
- **KTD-5. Spec review uses new `spec_review_request`/`spec_review_reply` events + `parked` Spec status.** `confirmation_request` is not reused (per RQ4 — different timeout semantics, different lifecycle, portal.py has no confirmation wiring). Events must follow terminal-event symmetry (open milestone → close on every path: confirm/reject/timeout/cancel) with stable `spec_review_id = f"{plan_id}:spec_review"` per the streaming-event-contract-residuals learning. Default timeout 30 min, configurable; on timeout → `parked` not `failed`.
- **KTD-6. `str_replace_editor` symlink defense uses `Path.resolve()` + `Path.relative_to(resolved_workspace_root)`, not `str.startswith()`.** `startswith` admits path-prefix collisions (`/workspace_root_evil/...`). Pattern mirrors the SSRF hop-revalidation approach from the bitable-companion-service security learning. Filesystem ops wrapped in `asyncio.to_thread` to avoid blocking the event loop.
- **KTD-7. Phase-budget counters are checkpoint-reconstructable from restored plan phase statuses.** On resume, `think`/`verify`/`reflect` spent counts derive from persisted phase state, not reset to zero (per long-horizon-reliability-code-review-fixes learning P2 #8/#11 — resume is full state reconstruction).
- **KTD-8. Reflexion-gave-up status is `"gave_up_after_reflections"`, not `"success"`.** When `max_reflections` exhausts without verify pass, the status propagates to `TaskResult` and evolution's `outcome` field. Evolution's `RuleBasedReflector` treats this as failure for reflection purposes. Without this, evolution silently skips reflection on reflexion-gave-up tasks (per agent-native planning finding OQ-D).
- **KTD-9. `ReActEngine.reset()` called between reflexion retry attempts.** Without reset, the loop detector (`_loop_threshold=2`) misfires on retry because `_loop_window` state leaks across attempts (per long-horizon-reliability-code-review-fixes learning P2 #9, RV22).
- **KTD-10. DIRECT_CHAT does not trigger evolution (explicit non-goal).** DIRECT_CHAT bypasses BaseAgent entirely (`src/agentkit/server/routes/chat.py:1245` calls `llm_gateway.chat()` directly). Wiring evolution would require fabricating TaskMessage/TaskResult. Simple Q&A tasks have low evolution value. Documented as non-goal, not a gap to fix later.
---
## High-Level Technical Design
### Quality loop flow
```mermaid
flowchart TB
A[Complex task starts] --> B[Execute: think/act/observe]
B --> C{Verify at final-answer}
C -->|Pass| D[Mark completed]
C -->|Fail| E{Reflect quota remaining?}
E -->|Yes| F[Call reset then reflect]
F --> G[Generate improvement]
G --> B
E -->|No| H[Mark gave_up_after_reflections]
D --> I[Trigger evolution: fire-and-forget]
H --> I
I --> J{Failure?}
J -->|Yes| K[Reflector + PitfallDetector: 100%]
J -->|No| L[Sample at 0.1 rate]
K --> M[Quality gate: confidence threshold]
L --> M
M --> N{Observe-only?}
N -->|Yes| O[Record only]
N -->|No| P[PromptOptimizer: consume gated]
```
### execute_stream hook wiring
```mermaid
sequenceDiagram
participant WS as WebSocket (chat.py)
participant CDA as ConfigDrivenAgent
participant ES as execute_stream()
participant Hooks as on_task_complete/failed
participant EVO as evolve_after_task()
WS->>CDA: execute_stream(task)
CDA->>ES: yield ReActEvent
ES-->>WS: token / final_answer (streaming)
Note over ES: finally block (new)
ES->>Hooks: invoke with TaskResult
Hooks->>EVO: asyncio.create_task (fire-and-forget)
Note over EVO: backpressure cap + shutdown drain
EVO-->>EVO: Reflector → PitfallDetector → PromptOptimizer
```
### Spec review gate lifecycle
```mermaid
stateDiagram-v2
[*] --> PLANNING
PLANNING --> SPEC_GENERATED
SPEC_GENERATED --> SPEC_REVIEW_PENDING: emit spec_review_request
SPEC_REVIEW_PENDING --> EXECUTING: spec_review_reply (confirm)
SPEC_REVIEW_PENDING --> PLANNING: spec_review_reply (reject)
SPEC_REVIEW_PENDING --> PARKED: timeout (30min)
PARKED --> EXECUTING: resume on return
EXECUTING --> [*]
```
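
The lifecycle in the diagram above can be expressed as a small transition table. This is a minimal sketch under stated assumptions: the state names and the `f"{plan_id}:spec_review"` id format come from this plan, while the `SpecState` enum, the event labels, and the `advance` helper are hypothetical.

```python
from enum import Enum


class SpecState(Enum):
    PLANNING = "planning"
    SPEC_GENERATED = "spec_generated"
    SPEC_REVIEW_PENDING = "spec_review_pending"
    EXECUTING = "executing"
    PARKED = "parked"


# (current state, event) -> next state, mirroring the state diagram.
# Event labels are illustrative, not the wire-level event names.
TRANSITIONS = {
    (SpecState.PLANNING, "spec_generated"): SpecState.SPEC_GENERATED,
    (SpecState.SPEC_GENERATED, "review_requested"): SpecState.SPEC_REVIEW_PENDING,
    (SpecState.SPEC_REVIEW_PENDING, "confirm"): SpecState.EXECUTING,
    (SpecState.SPEC_REVIEW_PENDING, "reject"): SpecState.PLANNING,
    (SpecState.SPEC_REVIEW_PENDING, "timeout"): SpecState.PARKED,
    (SpecState.PARKED, "resume"): SpecState.EXECUTING,
}


def advance(state: SpecState, event: str) -> SpecState:
    """Apply one event; unknown (state, event) pairs are rejected."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state.name} on {event!r}") from None


def spec_review_id(plan_id: str) -> str:
    """Stable id so every open review event can be closed on any path."""
    return f"{plan_id}:spec_review"
```

Keeping the table explicit makes it easy to check that a timeout lands in `PARKED` rather than a failed state, and that no path leaves `SPEC_REVIEW_PENDING` without a matching event.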
---
## Implementation Units
### U1. str_replace_editor tool + remove write_file bug
- **Goal:** Provide a working structured file editing tool with workspace-root security; remove the broken `write_file` placeholder.
- **Requirements:** R1
- **Dependencies:** None
- **Files:**
- Create: `src/agentkit/tools/str_replace_editor.py` (new tool class)
- Modify: `src/agentkit/core/react.py` (remove `write_file` from `_DEFAULT_CORE_TOOLS` at line 156-162, add `str_replace_editor`)
- Modify: `src/agentkit/tools/__init__.py` (register new tool)
- Test: `tests/unit/test_str_replace_editor.py`
- **Approach:** Implement `str_replace_editor` with four commands: `create` (write new file), `str_replace` (exact-match anchor replace), `insert_at_line` (insert at line number), `view` (read with line numbers — needed because `str_replace` requires exact anchors). Path validation: `Path.resolve()` + `Path.relative_to(resolved_workspace_root)`; reject `..`, absolute paths, symlink escape. Wrap filesystem ops in `asyncio.to_thread`. Mirror `ReadFileTool` (`src/agentkit/tools/file_read.py:26`) for Tool base class structure and error handling. Align with 6-layer terminal security paradigm (`src/agentkit/server/auth/terminal_security.py`).
- **Patterns to follow:** `src/agentkit/tools/file_read.py:26` (ReadFileTool — Tool base class, execute schema, `_error()` helper), `src/agentkit/server/auth/terminal_security.py` (layered security, `_SHELL_OPERATORS` pattern)
- **Test scenarios:**
- **Happy path:** `create` writes new file; `view` returns content with line numbers; `str_replace` replaces exact anchor; `insert_at_line` inserts at specified line
- **Edge cases:** Empty file create; `str_replace` with multiple matches (error: anchor not unique); `insert_at_line` at line 0 / beyond EOF; `view` with line range
- **Error and failure paths:** Path traversal `../../etc/passwd` rejected; symlink escape rejected; absolute path `/etc/passwd` rejected; `str_replace` anchor not found (error); file outside workspace root rejected
- **Integration:** Tool registered in `_DEFAULT_CORE_TOOLS` appears in LLM system prompt; LLM can call it and receive structured result
- **Verification:** `write_file` no longer in `_DEFAULT_CORE_TOOLS`; `str_replace_editor` appears in tool descriptions; path traversal tests pass; `ruff check` clean.
### U2. execute_stream hook wiring (OQ6 fix)
- **Goal:** Wire `on_task_complete`/`on_task_failed` hooks into the streaming path so R5/R6 evolution triggers on WebSocket-routed tasks.
- **Requirements:** R5 (precondition), R6 (precondition)
- **Dependencies:** None
- **Files:**
- Modify: `src/agentkit/core/config_driven.py` (`execute_stream()` at line 686 — add hook invocation in `finally` block)
- Modify: `src/agentkit/core/plan_exec_engine.py` (`execute_stream()` at line 175 — add hook invocation)
- Modify: `src/agentkit/core/reflexion.py` (`execute_stream()` at line 330 — add hook invocation)
- Modify: `src/agentkit/server/routes/portal.py` (verify all 3 `execute_stream` call sites at lines 580, 701, 1001 propagate hooks)
- Test: `tests/unit/test_execute_stream_hooks.py`
- **Approach:** Extract a `_trigger_evolution_hooks(task, result)` helper from the sync `handle_task()` path (lines 473, 493). Call it from `execute_stream()`'s `finally` block. Use `asyncio.create_task()` (fire-and-forget) to avoid blocking the streaming return. Apply backpressure: cap pending evolution tasks at `max_concurrent * 2`, drop + log + increment counter on exceed. Drain pending tasks on app shutdown via `asyncio.gather(*tasks, return_exceptions=True)`. Evolution errors are caught and logged — they must not fail the stream. Follow the `CancellationToken` registration pattern (register in `try`, pop in `finally`) per the streaming-event-contract-residuals learning.
- **Patterns to follow:** `src/agentkit/core/config_driven.py:473,493` (sync hook invocation), `src/agentkit/core/config_driven.py:686` (CancellationToken try/finally pattern), portal-platform-security-reliability-fixes learning (backpressure cap + shutdown drain)
- **Test scenarios:**
- **Happy path:** `execute_stream` completion fires `on_task_complete` with correct TaskResult; `execute_stream` failure fires `on_task_failed`
- **Edge cases:** Stream cancelled mid-flight — hooks still fire with cancelled status; evolution task error does not propagate to stream; backpressure cap reached — drop + log + counter increment
- **Integration:** Same task via REST `execute()` and WebSocket `execute_stream()` produces equivalent evolution log entries (parity test); all 3 portal.py call sites propagate hooks
- **Verification:** Evolution fires after `execute_stream` completes on both success and failure paths; P95 streaming latency increases by less than 50 ms (evolution is fire-and-forget); shutdown drains pending evolution tasks.
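
The fire-and-forget dispatch with a backpressure cap and shutdown drain described in U2 could be sketched as below. Only `asyncio.create_task`, `asyncio.gather(..., return_exceptions=True)`, and the `max_concurrent * 2` cap come from the plan; the `EvolutionDispatcher` class and its method names are hypothetical.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class EvolutionDispatcher:
    """Schedules evolution hooks without blocking the stream (sketch)."""

    def __init__(self, max_concurrent: int = 4) -> None:
        self._cap = max_concurrent * 2
        # Holding references keeps pending tasks from being garbage
        # collected before they finish.
        self._pending: set[asyncio.Task] = set()
        self.dropped = 0

    def submit(self, hook, *args) -> bool:
        """Schedule `hook(*args)`; drop, log, and count when over the cap."""
        if len(self._pending) >= self._cap:
            self.dropped += 1
            logger.warning("evolution backlog full; dropping hook")
            return False
        task = asyncio.create_task(self._guarded(hook, *args))
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)
        return True

    async def _guarded(self, hook, *args) -> None:
        try:
            await hook(*args)
        except Exception:
            # Evolution errors are logged and must not fail the stream.
            logger.exception("evolution hook failed")

    async def drain(self) -> None:
        """Await everything still pending, for use at app shutdown."""
        if self._pending:
            await asyncio.gather(*self._pending, return_exceptions=True)
```

`submit` takes the hook function and its arguments rather than a coroutine object, so a dropped hook never leaves an un-awaited coroutine behind.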
### U3. Verification defaults + forced pytest/ruff + minimum sandbox
- **Goal:** Enable verification by default for complex tasks; force pytest/ruff for coding tasks; establish minimum sandbox as security prerequisite.
- **Requirements:** R2, R3, RV3 (sandbox prerequisite)
- **Dependencies:** U1 (str_replace_editor provides safe editing within sandbox)
- **Files:**
- Modify: `src/agentkit/core/react.py` (thread `verification_enabled` parameter through PLAN_EXEC/TEAM_COLLAB construction, default True for those modes)
- Modify: `src/agentkit/core/phase.py` (`default_policy()` at line 139 — VERIFICATION phase forces pytest/ruff for coding tasks)
- Modify: `src/agentkit/core/plan_exec_engine.py` (pass `verification_enabled=True` when constructing ReActEngine for PLAN_EXEC)
- Modify: `src/agentkit/experts/orchestrator.py` (pass `verification_enabled=True` for TEAM_COLLAB)
- Create: `src/agentkit/core/sandbox.py` (minimum sandbox enforcement: workspace-write, no network)
- Test: `tests/unit/test_verification_defaults.py`, `tests/unit/test_sandbox.py`
- **Approach:** R2: `verification_enabled` defaults True only for PLAN_EXEC/TEAM_COLLAB; DIRECT_CHAT/REACT stay False (per RV2). Thread the parameter through `PlanExecEngine` and `TeamOrchestrator` construction, not as a global default change. R3: In `default_policy()` VERIFICATION phase, add coding-task detection (check for `pyproject.toml` or `.py` files in workspace) — force `pytest -x -q` and `ruff check` for coding tasks; non-coding tasks use Spec-declared verification commands. RV3: Create `sandbox.py` with workspace-root enforcement (reuse U1's path validation) and network blocking (disable `httpx`/`requests`/`socket` for tool calls during VERIFICATION). Sandbox is the minimum layer; full tiering (read-only/workspace-write/danger) deferred.
- **Patterns to follow:** `src/agentkit/core/phase.py:139` (`default_policy` — PhasePolicy construction), `src/agentkit/tools/advance_phase.py:20` (forced-transition pattern for VERIFICATION→DELIVERY)
- **Test scenarios:**
- **Happy path:** PLAN_EXEC task with `pyproject.toml` runs pytest+ruff in VERIFICATION; TEAM_COLLAB task verifies by default; non-coding task uses Spec-declared command
- **Edge cases:** Workspace with no `pyproject.toml` — skip pytest, use Spec command; empty workspace — verification passes (no tests to run); ruff finds issues — reinject as verify failure
- **Error and failure paths:** pytest fails — reinject error per `max_reinjections`; sandbox blocks network call — structured error returned to LLM; path traversal attempt in verification command — rejected
- **Integration:** Sandbox enforcement applies to all tool calls during VERIFICATION phase; coding-task detection correctly identifies Python vs non-Python workspaces
- **Verification:** PLAN_EXEC/TEAM_COLLAB verify by default; DIRECT_CHAT/REACT do not verify; coding tasks force pytest/ruff; non-coding tasks use Spec commands; sandbox blocks network during VERIFICATION.
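
The coding-task detection in U3 could be as simple as the following sketch. The `pyproject.toml` / `.py` heuristic and the two forced commands come from the Approach above; the function names `is_coding_workspace` and `verification_commands` are hypothetical, and the recursive search is one possible reading of "`.py` files in workspace".

```python
from pathlib import Path

# Commands forced for coding tasks, as listed in the U3 approach.
CODING_COMMANDS = ["pytest -x -q", "ruff check"]


def is_coding_workspace(workspace: Path) -> bool:
    """Heuristic: a pyproject.toml or any .py file marks a coding task."""
    if (workspace / "pyproject.toml").is_file():
        return True
    # rglob() is lazy, so any() stops at the first match.
    return any(workspace.rglob("*.py"))


def verification_commands(workspace: Path, spec_commands: list[str]) -> list[str]:
    """Pick the commands the VERIFICATION phase should run."""
    if is_coding_workspace(workspace):
        return list(CODING_COMMANDS)
    # Non-coding tasks fall back to whatever the Spec declares; an empty
    # list means there is nothing to run.
    return list(spec_commands)
```

An empty workspace with no Spec-declared commands yields an empty list, which matches the edge case where verification passes because there are no tests to run.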
### U4. Step budget phases + keep working bias
- **Goal:** Split `max_steps` into phase quotas (think/verify/reflect); add "keep working until done" bias for complex tasks.
- **Requirements:** R11, R10
- **Dependencies:** U3 (verify quota needs verification defaults)
- **Files:**
- Modify: `src/agentkit/core/react.py` (`__init__` at line 167: add `phase_budgets` parameter; `_execute_loop()` at line 561: enforce per-phase quotas; loop detector at lines 220-221: raise threshold or exempt reflexion retries)
- Modify: `src/agentkit/core/phase.py` (`PhasePolicy` at line 59 — add `step_budget` field)
- Modify: `src/agentkit/core/plan_exec_engine.py` (pass `phase_budgets={"think": 7, "verify": 2, "reflect": 1}` for PLAN_EXEC)
- Test: `tests/unit/test_step_budget.py`
- **Approach:** R11: Add `phase_budgets: dict[str, int] | None = None` to ReActEngine. When set, enforce per-phase quotas: think exhausted → force verify; verify exhausted → return best result; reflect exhausted → no more reflection. When None, behavior is the same as today (`max_steps=10` total budget). Quotas are opt-in for PLAN_EXEC/TEAM_COLLAB. Budget counters are checkpoint-reconstructable: on resume, spent counts are derived from the restored plan phase statuses (KTD-7). R10: "Keep working until done" is implemented via the reflect quota: a verify failure does not abandon the task; it enters a reflexion retry within the remaining reflect quota. Loop detector threshold is raised from 2 to 3 for keep-working mode (per RV22: threshold=2 produces false positives on retry). `ReActEngine.reset()` is called between retry attempts (KTD-9).
- **Patterns to follow:** `src/agentkit/core/phase.py:59` (`PhasePolicy.auto_advance_after_steps` — existing per-phase step limit pattern), `src/agentkit/core/react.py:220-221` (loop detector — `_loop_window`, `_loop_threshold`)
- **Test scenarios:**
- **Happy path:** PLAN_EXEC with `phase_budgets={"think":7,"verify":2,"reflect":1}` — think stops at 7, verify runs, reflect runs at most 1; without `phase_budgets` — behavior unchanged (`max_steps=10`)
- **Edge cases:** Think quota exhausted mid-tool-call — finish current step, then force verify; reflect quota 0 — no reflection, return best result; resume after checkpoint — budget counters reconstructed from phase statuses
- **Error and failure paths:** Loop detector threshold 3 — 2 similar retries don't abort, 3 do; `reset()` between reflexion attempts — `_loop_window` cleared
- **Integration:** Phase budgets enforced in `_execute_loop()`; checkpoint save/restore preserves budget state; DIRECT_CHAT/REACT unaffected (no `phase_budgets` set)
- **Verification:** Phase quotas enforced; backward compatibility (no `phase_budgets` = current behavior); loop detector doesn't false-positive on reflexion retry; budget state survives checkpoint/resume.
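The quota transitions in R11 can be sketched as a small tracker. This is an illustration only: `PhaseBudgetTracker` and the action strings it returns are hypothetical, and the real enforcement is planned to live inside `_execute_loop()`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class PhaseBudgetTracker:
    """Hypothetical helper: counts steps spent per phase against a quota."""

    budgets: dict[str, int]
    # Passing a restored `spent` dict models checkpoint reconstruction.
    spent: dict[str, int] = field(default_factory=dict)

    def remaining(self, phase: str) -> int:
        return max(self.budgets.get(phase, 0) - self.spent.get(phase, 0), 0)

    def consume(self, phase: str) -> None:
        self.spent[phase] = self.spent.get(phase, 0) + 1

    def next_action(self, phase: str) -> str:
        """Map quota exhaustion to the transitions described above."""
        if self.remaining(phase) > 0:
            return "continue"
        if phase == "think":
            return "force_verify"
        if phase == "verify":
            return "return_best_result"
        return "stop_reflecting"
```

On resume, the tracker could be rebuilt as `PhaseBudgetTracker(budgets, spent=restored_counts)`, where `restored_counts` is derived from the restored plan phase statuses.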
### U5. Reflexion in main flow
- **Goal:** Upgrade reflexion from fallback-only to main-flow retry: verify fails → reflect → retry.
- **Requirements:** R4
- **Dependencies:** U3 (verification), U4 (reflect quota)
- **Files:**
- Modify: `src/agentkit/core/react.py` (reinjection loop at lines 1303-1376 — after `max_reinjections` exhausts, call reflect step before returning final)
- Modify: `src/agentkit/core/config_driven.py` (parameterize `max_reflections=2` at lines 835, 1047 — currently hardcoded 3; make configurable)
- Test: `tests/unit/test_reflexion_main_flow.py`
- **Approach:** Extend the existing reinjection loop (`src/agentkit/core/react.py:1303-1376`) — when verify fails and `max_reinjections` is exhausted, if reflect quota remains: call `reset()` (KTD-9), generate reflection text (mirror `ReflexionEngine._reflect()` at `src/agentkit/core/reflexion.py:639`), inject reflection into context, retry the loop. Parameterize `max_reflections` (RQ3: 2 for main path, 1 for Recovery layer — currently hardcoded 3 at `config_driven.py:835,1047`). When `max_reflections` exhausts without verify pass, return status `"gave_up_after_reflections"` (KTD-8 — not `"success"`, so evolution treats it as failure). ReflexionEngine stays as standalone for REFLEXION-as-mode (deferred); Recovery layer escalates to human, not re-reflex (avoid double-reflexion).
- **Patterns to follow:** `src/agentkit/core/react.py:1303-1376` (existing reinjection loop — extend, don't replace), `src/agentkit/core/reflexion.py:639` (reflect step — mirror the LLM call shape), `src/agentkit/server/_fallback_chain.py:118` (Recovery `max_retries=1` — keep distinct from main path)
- **Test scenarios:**
- **Happy path:** Covers AE1 — verify fails → reflect → retry within reflect quota; retry passes verify → mark completed
- **Edge cases:** `max_reflections=2` — 2 retry attempts, then `"gave_up_after_reflections"`; `reset()` between attempts clears loop window; reflect quota 0 — no retry, return best result
- **Error and failure paths:** Reflect LLM call fails — skip reflection, retry with existing context; all retries fail — status `"gave_up_after_reflections"` propagates to TaskResult and evolution
- **Integration:** DIRECT_CHAT/REACT unaffected (no reflect quota); Recovery layer (`_fallback_chain.py`) still uses `max_reflections=1` — no double-reflexion; evolution's `RuleBasedReflector` treats `"gave_up_after_reflections"` as failure
- **Verification:** Verify-fail → reflect → retry fires; `max_reflections=2` configurable; `"gave_up_after_reflections"` status propagates; no double-reflexion with Recovery layer; DIRECT_CHAT unaffected.
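The verify → reflect → retry control flow can be sketched with plain callables. This is a simplified, synchronous sketch under assumptions: `run_with_reflection` and its `attempt`, `verify`, and `reflect` parameters are hypothetical stand-ins for the engine loop, the verification step, and the reflection LLM call. The `"success"` status string is assumed from the KTD-8 wording, and the `reset()` call between attempts is omitted.

```python
from __future__ import annotations

from typing import Callable


def run_with_reflection(
    attempt: Callable[[list[str]], str],
    verify: Callable[[str], bool],
    reflect: Callable[[str], str],
    max_reflections: int = 2,
) -> tuple[str, str]:
    """Return (status, last_output) after at most `max_reflections` retries."""
    reflections: list[str] = []
    output = attempt(reflections)
    if verify(output):
        return "success", output
    for _ in range(max_reflections):
        try:
            reflections.append(reflect(output))
        except Exception:
            # Reflect call failed: skip reflection, retry with existing context.
            pass
        output = attempt(reflections)
        if verify(output):
            return "success", output
    # Not "success", so downstream evolution can treat it as a failure.
    return "gave_up_after_reflections", output
```

With `max_reflections=0` the function makes a single attempt and returns its result, which matches the "reflect quota 0" edge case.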
### U6. Auto evolution trigger + quality gate
- **Goal:** Auto-trigger evolution on task completion with quality gates and actor marking.
- **Requirements:** R5, R6
- **Dependencies:** U2 (execute_stream hooks), U5 (quality signal from reflexion)
- **Files:**
- Modify: `src/agentkit/evolution/lifecycle.py` (`evolve_after_task()` at line 131 — add success sample rate gate, quality threshold, actor marking)
- Modify: `src/agentkit/evolution/pitfall_detector.py` (add confidence threshold before ingestion)
- Create: `src/agentkit/evolution/config.py` (`EvolutionConfig` with `success_sample_rate: float = 0.1`, `min_confidence: float = 0.5`, `observe_only: bool = True`)
- Modify: `src/agentkit/evolution/prompt_optimizer.py` (consumption gate: sample count ≥ `min_examples` and confidence meets the threshold)
- Test: `tests/unit/test_evolution_auto_trigger.py`
- **Approach:** R5: `EvolutionConfig.success_sample_rate=0.1` gates success-path evolution at `evolve_after_task()` entry using `random.random() < rate` (mirror `alignment.py:115` `audit_sample_rate` pattern). Failure path always runs (100%). Quality gate: pitfall confidence threshold before ingestion (`min_confidence=0.5`; low-confidence pitfalls are discarded or marked observe-only); PromptOptimizer consumption gate (sample count ≥ `min_examples=3` and confidence meets the threshold); observe-only mode (`observe_only=True` initially: records without feeding the optimizer, to avoid noise-driven prompt degradation per RV14). R6: Actor marking on all evolution artifacts (pitfalls, optimized prompts) records which agent/expert produced them. Cross-workspace sharing defaults to off and requires explicit opt-in; same-workspace sharing defaults to on. Trust boundary: evolution products are agent-produced and must be validated before entering the shared store (they are not trusted merely because an agent produced them). Known limitation (per RQ2): the default `RuleBasedReflector` only generates suggestions on `outcome=='failure'`, so the success sampling path may always exit early under the default reflector; success sampling takes effect once the reflector is upgraded or a success-path learning signal is available.
- **Patterns to follow:** `src/agentkit/evolution/lifecycle.py:131` (`evolve_after_task` — extend, don't replace), `src/agentkit/evolution/pitfall_detector.py:103` (`check_pitfalls` — Jaccard similarity pattern), portal-platform-security-reliability-fixes learning (per-namespace rejection, backpressure, trust-boundary validation)
- **Test scenarios:**
- **Happy path:** Covers AE3 — task fails → evolution fires (100%) → Reflector records → PitfallDetector detects; task succeeds → evolution fires at 0.1 rate
- **Edge cases:** Observe-only mode — pitfalls recorded but not fed to optimizer; backpressure cap reached — evolution task dropped + logged; low-confidence pitfall — discarded or marked observe-only
- **Error and failure paths:** Evolution task error: caught, logged, does not fail the stream; PromptOptimizer sample count < 3: optimization skipped
- **Integration:** Evolution fires via U2's `execute_stream` hooks; actor marking present on all artifacts; cross-workspace sharing rejected without opt-in; `"gave_up_after_reflections"` status triggers failure-path evolution
- **Verification:** Failure tasks always trigger evolution; success tasks trigger at 0.1 rate; observe-only mode records without mutating prompts; actor marking present; cross-workspace sharing gated.
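The sampling and confidence gates can be sketched as two small functions. The `EvolutionConfig` fields mirror the ones named in the plan; `should_evolve` and `pitfall_disposition` are hypothetical helper names, and discarding (rather than marking observe-only) low-confidence pitfalls is one of the two options the plan leaves open.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class EvolutionConfig:
    success_sample_rate: float = 0.1
    min_confidence: float = 0.5
    observe_only: bool = True


def should_evolve(outcome: str, config: EvolutionConfig,
                  rng: random.Random | None = None) -> bool:
    """Failure-like outcomes always run; successes are sampled."""
    if outcome != "success":
        return True  # includes "gave_up_after_reflections"
    draw = (rng or random).random()
    return draw < config.success_sample_rate


def pitfall_disposition(confidence: float, config: EvolutionConfig) -> str:
    """Return 'discard', 'observe', or 'ingest' for a detected pitfall."""
    if confidence < config.min_confidence:
        return "discard"  # assumption: drop instead of marking observe-only
    return "observe" if config.observe_only else "ingest"
```

Passing a seeded `random.Random` as `rng` makes the success-path sampling deterministic in tests.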
### U7. Pitfall retrieval/injection
- **Goal:** Retrieve historical pitfalls by goal/skill similarity at task planning and inject into prompt context.
- **Requirements:** R12
- **Dependencies:** U6 (evolution store with pitfalls)
- **Files:**
- Modify: `src/agentkit/evolution/pitfall_detector.py` (`check_pitfalls()` at line 103 — extend to accept goal text, use semantic similarity not just `task_type` filter)
- Modify: `src/agentkit/core/react.py` (system prompt construction — inject pitfall warnings section)
- Modify: `src/agentkit/core/plan_exec_engine.py` (at planning phase, call pitfall retrieval and inject into Spec context)
- Test: `tests/unit/test_pitfall_injection.py`
- **Approach:** Extend `PitfallDetector.check_pitfalls()` to accept goal text and use `experience_store.search` with semantic similarity (not just `task_type` Jaccard filter). Wire `experience_store` to agent runtime as app-state singleton (KTD per OQ-E — instantiated at startup, shared across tasks). At PLAN_EXEC planning phase, retrieve top-K pitfalls (K=3) by goal/skill similarity, inject as "Historical pitfalls to avoid" section in system prompt. Gate by `WarningLevel.HIGH` only (avoid noise). Pitfall injection appears in agent's first LLM call. PitfallDetector currently only used in `evolution_dashboard.py:549` (read-only) — this is the first runtime integration.
- **Patterns to follow:** `src/agentkit/evolution/pitfall_detector.py:103` (`check_pitfalls` — extend signature, don't break existing callers), `src/agentkit/memory/semantic.py` (semantic retrieval pattern if applicable)
- **Test scenarios:**
- **Happy path:** Task with similar goal to past failure → top-3 pitfalls injected into system prompt → pitfalls appear in agent's first LLM call
- **Edge cases:** No pitfalls in store → empty section, no injection; all pitfalls low severity → none injected (gate by HIGH); pitfall store has 100+ entries → only top-3 by similarity retrieved (no N+1)
- **Error and failure paths:** `experience_store` unavailable → skip injection, log warning; similarity search times out → skip injection, continue task
- **Integration:** PitfallDetector app-state singleton accessible from PLAN_EXEC planning; existing `evolution_dashboard.py` caller still works (backward compatible)
- **Verification:** Pitfalls injected at planning phase appear in system prompt; similarity retrieval works on goal text; HIGH-severity gate filters noise; existing dashboard caller unaffected.
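The retrieve-gate-inject step can be sketched as follows. To stay self-contained, the sketch uses token-level Jaccard overlap as a stand-in for the semantic search the plan calls for. `Pitfall`, `jaccard`, and `build_pitfall_section` are hypothetical names, and the string `level` field stands in for `WarningLevel`.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Pitfall:
    description: str
    level: str  # "HIGH", "MEDIUM", or "LOW"; stand-in for WarningLevel


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a placeholder for semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def build_pitfall_section(goal: str, pitfalls: list[Pitfall], k: int = 3) -> str:
    """Return a prompt section with the top-k HIGH pitfalls, or '' if none."""
    high = [p for p in pitfalls if p.level == "HIGH"]
    scored = [(jaccard(goal, p.description), p) for p in high]
    relevant = [s for s in scored if s[0] > 0]
    ranked = sorted(relevant, key=lambda s: s[0], reverse=True)[:k]
    if not ranked:
        return ""  # nothing to inject
    lines = ["## Historical pitfalls to avoid"]
    lines += [f"- {p.description}" for _, p in ranked]
    return "\n".join(lines)
```

An empty return value means the caller leaves the system prompt unchanged, which covers the empty-store and all-low-severity edge cases.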
### U8. Spec review gate
- **Goal:** Pause PLAN_EXEC after first Spec generation for user review; resume on confirmation, replan on rejection.
- **Requirements:** R8
- **Dependencies:** U5 (reflexion retry for post-review execution)
- **Files:**
- Modify: `src/agentkit/core/plan_exec_engine.py` (at lines 269-277: after Spec generation, emit `spec_review_request`, suspend on pending future)
- Modify: `src/agentkit/core/spec_manager.py` (add `parked` status, `resume()` method)
- Modify: `src/agentkit/server/routes/chat.py` (add `spec_review_request`/`spec_review_reply` to `_VALID_TEAM_EVENT_TYPES` at line 144; add handler for `spec_review_reply`)
- Modify: `src/agentkit/server/routes/portal.py` (add event forwarding for spec review events)
- Test: `tests/unit/test_spec_review_gate.py`
- **Approach:** At `plan_exec_engine.py:269-277` (currently generates Spec and immediately executes), insert: emit `spec_review_request` event (carrying `spec_id`, `goal`, `steps`, `spec_review_id = f"{plan_id}:spec_review"`), suspend on pending `asyncio.Future`. On `spec_review_reply` (confirm/reject/timeout): confirm → resume execution; reject → replan (call `GoalPlanner` again with rejection feedback); timeout (30 min default, configurable) → set Spec status `parked` (not `failed`), allow resume-on-return. Add `spec_review_request`/`spec_review_reply` to `_VALID_TEAM_EVENT_TYPES` (per streaming-event-whitelist learning — without this, events silently no-op with 200 response). Follow terminal-event symmetry (open milestone → close on every path). Mirror CancellationToken pattern (register pending future, pop in finally). RQ4 confirmed: new events, not reuse `confirmation_request` (different timeout semantics, different lifecycle, portal.py has no confirmation wiring).
- **Patterns to follow:** `src/agentkit/core/config_driven.py:686` (CancellationToken try/finally — register/pop pattern), `src/agentkit/server/routes/chat.py:144` (`_VALID_TEAM_EVENT_TYPES` — add new events), `src/agentkit/server/routes/chat.py:1365-1377` (confirmation pattern — reference, not reuse), streaming-event-contract-residuals learning (terminal-event symmetry, stable identifier)
- **Test scenarios:**
- **Happy path:** Covers AE4 — PLAN_EXEC generates Spec → `spec_review_request` emitted → execution suspends → user confirms → `spec_review_reply` → execution resumes
- **Edge cases:** User rejects → replan with feedback → new Spec generated → review again; timeout (30min) → Spec status `parked` (not `failed`) → resume on return; stream cancelled during review → future cancelled, no deadlock
- **Error and failure paths:** `spec_review_reply` with invalid `spec_review_id` → error response; future resolution error → execution fails gracefully; event not in whitelist → test asserts it IS in whitelist (silent failure prevention)
- **Integration:** Events forwarded by portal.py; frontend receives `spec_review_request` and can render review UI; `parked` Spec survives page reload
- **Verification:** Spec review round-trip works (request → suspend → reply → resume); rejection triggers replan; timeout → parked not failed; events in whitelist (no silent no-op).
### U9. TEAM_COLLAB no fall-back to REACT
- **Goal:** TEAM_COLLAB surfaces failure to user instead of silently falling back to REACT.
- **Requirements:** R7
- **Dependencies:** None (routing change only)
- **Files:**
- Modify: `src/agentkit/server/routes/chat.py` (at lines 1336-1344: change TEAM_COLLAB branch to reject fall-back, surface failure)
- Modify: `AGENTS.md` (update to reflect actual behavior: remove the "抛 not yet supported" claim, i.e. "raises not yet supported", and document TEAM_COLLAB routing)
- Test: `tests/unit/test_team_collab_routing.py`
- **Approach:** At `chat.py:1336-1344` (currently falls back to REACT with a warning for TEAM_COLLAB), change the TEAM_COLLAB branch to route to TeamOrchestrator+SharedWorkspace (real wiring) or, if the orchestrator is unavailable, surface the failure to the user (no silent fall-back). Update AGENTS.md to remove the stale "抛 not yet supported" claim (i.e. "raises not yet supported") for REWOO/REFLEXION/TEAM_COLLAB; document that TEAM_COLLAB routes to TeamOrchestrator and that REWOO/REFLEXION-as-mode are deferred (not "unsupported"). This is a routing change, not a full TEAM_COLLAB implementation: the orchestrator already exists (`src/agentkit/experts/orchestrator.py:45`).
- **Patterns to follow:** `src/agentkit/server/routes/chat.py:758-808` (PLAN_EXEC routing — mutual exclusivity with fallback chain, KTD5 pattern)
- **Test scenarios:**
- **Happy path:** `@team` prefix → routes to TeamOrchestrator (not REACT fall-back); TeamOrchestrator executes phases
- **Edge cases:** TeamOrchestrator unavailable → error surfaced to user (not silent REACT); team template not found → error with template list
- **Error and failure paths:** All phases fail → failure surfaced to user (not fall-back to single agent per existing `_fallback_to_single_agent` — that's orchestrator-internal, acceptable)
- **Integration:** AGENTS.md updated; REWOO/REFLEXION-as-mode still fall back (deferred, not in scope)
- **Verification:** TEAM_COLLAB routes to TeamOrchestrator; no silent REACT fall-back; AGENTS.md reflects actual behavior.
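The routing rule can be sketched as a pure function. This is illustrative only: `resolve_engine`, `TeamCollabUnavailable`, and the string labels are hypothetical and do not reflect the actual branch structure in `chat.py`.

```python
class TeamCollabUnavailable(RuntimeError):
    """Hypothetical error surfaced to the user instead of a silent fall-back."""


def resolve_engine(mode: str, orchestrator_available: bool) -> str:
    """Pick the engine label for a requested execution mode."""
    if mode == "TEAM_COLLAB":
        if not orchestrator_available:
            raise TeamCollabUnavailable(
                "TEAM_COLLAB requested but TeamOrchestrator is unavailable"
            )
        return "TEAM_ORCHESTRATOR"
    if mode in {"REWOO", "REFLEXION"}:
        return "REACT"  # deferred modes still fall back, per the plan
    return mode
```

Raising instead of returning `"REACT"` for TEAM_COLLAB is the whole behavior change: the caller has to handle the failure explicitly and report it.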
---
## Scope Boundaries
### Deferred for later
- **Full sandbox tiering** (read-only / workspace-write / danger) — P2 priority; only minimum sandbox (workspace-write, no network) pulled into scope as R3/R10 prerequisite (per RV3).
- **REWOO/REFLEXION-as-mode** (as independent execution modes) — deferred per RV10 (no target service for REWOO, conceptually distinct from reflexion-as-retry per RV20); R7 narrowed to TEAM_COLLAB only.
- **R9 coding_harness** (Worker-Verifier adversarial harness) — deferred per RV11 (R3+R4 already satisfy the goal), RV12 (4-stage pipeline to single-stage PLAN_EXEC phase mapping undefined), RV13 (no independent success criteria). Trust boundary: coding_harness executing untrusted code requires sandbox — depends on full sandbox tiering.
- **Model autonomous compaction** — existing threshold approach works.
- **Three-tier nested loop** (submission / handler / turn) — cost exceeds benefit.
- **Spec output as human-readable markdown**: current YAML Spec + review gate works; conversion to markdown is deferred.
- **Full TEAM_COLLAB real wiring** (beyond routing) — U9 handles routing only; deeper orchestrator integration (debate rounds, review gates, divergence detection) is existing functionality that may need tuning but is not in scope for the quality loop.
### Outside this product's identity
- **Tool minimalism** (cut to Bash + apply_patch) — agentkit goes the skill/expert-team direction; 25 tools are business need.
- **New Task Runtime concept** — existing plan_exec foundation suffices; no new concept introduced.
### Deferred to Follow-Up Work
- **DIRECT_CHAT evolution wiring** — explicitly non-goal (KTD-10); if future simple-task learning becomes valuable, would require fabricating TaskMessage/TaskResult.
- **Success-path reflector upgrade** — current `RuleBasedReflector` only generates suggestions on failure; success sampling (RQ2) activates fully when a success-capable reflector is implemented.
- **Loop detector semantic upgrade** — current hash-based detector raised to threshold 3 for keep-working mode; semantic detection (detect truly identical retries vs similar-but-different) is a future upgrade.
---
## System-Wide Impact
- **Streaming path behavior change (U2):** All WebSocket-routed tasks now trigger evolution hooks. Fire-and-forget with backpressure ensures no latency regression. Evolution errors are isolated — they cannot fail the stream.
- **Verification default change (U3):** PLAN_EXEC/TEAM_COLLAB now verify by default. Tasks that previously "succeeded" without verification may now fail verification. This is the intended behavior change — surfaces real failures that were hidden.
- **Step budget change (U4):** PLAN_EXEC/TEAM_COLLAB get phase quotas; DIRECT_CHAT/REACT keep `max_steps=10` total. Backward compatible — no `phase_budgets` means current behavior.
- **Evolution artifacts now persist cross-task (U6):** Without actor marking and workspace-scoped sharing, a poisoned pitfall from one workspace could degrade prompts in another. Trust boundary enforcement is load-bearing.
- **Reflexion retry changes loop behavior (U5):** "Keep working until done" expands blast radius. Minimum sandbox (U3) is the security countermeasure. Loop detector threshold raised to 3 to avoid false positives on retry.
- **Spec review adds friction to PLAN_EXEC (U8):** Every PLAN_EXEC now pauses for review. This is intentional (per R8) — catches bad plans before execution. Timeout → parked (not failed) respects long-task user availability.
- **TEAM_COLLAB no longer silently degrades (U9):** Users who relied on TEAM_COLLAB falling back to REACT will see explicit failures instead. This is the intended behavior — silent degradation was a bug.
---
## Risks & Dependencies
- **R5 streaming hook bypass (OQ6) — HIGHEST RISK.** Without U2, R5/R6 are no-ops on the primary user-facing path. U2 is the load-bearing precondition. Mitigation: U2 ships first; parity test (REST vs WebSocket evolution log) is the regression guard.
- **R4 double-reflexion with Recovery layer.** Main-flow reflexion (U5) + Recovery-layer reflexion (`_fallback_chain.py:118`) could double-reflect. Mitigation: Recovery escalates to human, not re-reflex. Documented in KTD-2.
- **RV22 loop detector conflict with R10.** "Keep working" retries similar fixes, triggering loop detection (threshold=2). Mitigation: threshold raised to 3 for keep-working mode (U4); `reset()` between attempts (KTD-9).
- **R1 str_replace exact-match fragility.** Without `view` command, agents emit `str_replace` with stale anchors and fail. Mitigation: `view` command included in U1.
- **R8 spec review deadlock.** User leaves → task hangs. Mitigation: 30-min timeout → `parked` not `failed`; resume-on-return.
- **Evolution noise degrades prompts (RV14).** Low-quality pitfalls fed to optimizer regress prompts. Mitigation: confidence threshold + observe-only mode (U6, initially `observe_only=True`).
- **Evolution module runtime correctness unverified.** No prior learnings exist for evolution/reflexion/verification/spec_manager modules (coverage gap from learnings research). Mitigation: budget for first-principles verification; characterization tests before changes.
- **Streaming event whitelist silent failure.** New events not in `_VALID_TEAM_EVENT_TYPES` silently no-op. Mitigation: U8 explicitly adds events to whitelist; test asserts presence.
- **Async generator safety.** All new `async def` with `yield` must use `return; yield` pattern before early return (project rule). Applies to U2 (hook helper), U5 (reflexion streaming), U8 (spec review suspension).
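One reading of the `return; yield` rule is sketched below: an `async def` is only an async generator if its body contains a `yield`, so a function that exits early on every path needs an unreachable `yield` to remain usable with `async for`. The function names are hypothetical.

```python
import asyncio
from typing import AsyncIterator


async def stream_events(enabled: bool) -> AsyncIterator[str]:
    """Early return is safe here because the body also contains yields."""
    if not enabled:
        return  # generator finishes without producing any event
    yield "started"
    yield "done"


async def noop_hook() -> AsyncIterator[str]:
    """A hook with nothing to emit.

    The unreachable yield keeps this an async generator, so callers can
    still iterate it with `async for` instead of hitting a TypeError.
    """
    return
    yield


async def collect(gen: AsyncIterator[str]) -> list[str]:
    return [event async for event in gen]
```

Without the trailing `yield`, `noop_hook()` would return a coroutine, and `async for` over it would fail at runtime.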
Dependencies:
- evolution module (Reflector/PitfallDetector/PromptOptimizer/ABTester) already implemented — U6/U7 do integration only
- ReflexionEngine already implemented — U5 extends ReActEngine, doesn't refactor ReflexionEngine
- VerificationLoop already implemented — U3 changes defaults and policy, not core logic
- SpecManager.confirm already implemented (REST) — U8 adds chat flow integration
- TeamOrchestrator already implemented — U9 is routing change, not orchestrator implementation
- Assume: step quota redesign doesn't break DIRECT_CHAT/REACT semantics (enforced by opt-in `phase_budgets` parameter)
---
## Acceptance Examples
- **AE1. Complex task verify-fail → reflexion retry.** Covers R2, R4, R10. Given: PLAN_EXEC task completes, verify runs pytest and fails. When: reflexion triggers, reflects on error, generates fix. Then: retries within reflect quota; if still fails, marks `"gave_up_after_reflections"` and triggers evolution.
- **AE2. Simple task skips reflexion.** Covers R4. Given: DIRECT_CHAT mode executes a simple task. When: task completes. Then: no reflexion retry loop; the result is returned directly.
- **AE3. Task failure auto-triggers evolution.** Covers R5, R6. Given: complex task fails (verify fails, reflexion exhausted). When: task ends. Then: evolution auto-triggers, Reflector records failure, PitfallDetector detects patterns.
- **AE4. Spec review gate.** Covers R8. Given: PLAN_EXEC generates Spec. When: Spec first generated. Then: execution suspends, `spec_review_request` emitted; user confirms → execution resumes; user rejects → replan; timeout → `parked`.
---
## Sources / Research
- **Origin document:** `docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md` (R1-R12, RQ1-RQ4, OQ5-OQ6, RV1-RV22)
- **Repo research:** Confirmed all brainstorm findings with file:line references; mapped 12 requirements to integration points; identified 3 AGENTS.md contradictions; recommended 6-phase implementation order.
- **Institutional learnings (5 relevant docs in `docs/solutions/`):**
- `integration-issues/streaming-event-contract-residuals.md`: `execute_stream` registration pattern (resolves OQ6), terminal-event symmetry (shapes R8), stable identifier convention
- `logic-errors/long-horizon-reliability-code-review-fixes.md`: `reset()` between retry attempts (RV22 mitigation), checkpoint-reconstructable counters (KTD-7), cross-module format contracts
- `runtime-errors/streaming-event-whitelist-and-accumulation.md`: `_VALID_TEAM_EVENT_TYPES` whitelist (R8 events), ReAct Streaming Contract (R4 streaming)
- `architecture-patterns/bitable-companion-service-security-reliability-patterns.md` — SSRF hop-revalidation → symlink defense (KTD-6), IDOR 404-before-403 (R6 trust boundary), `asyncio.to_thread` (R1)
- `security-issues/portal-platform-security-reliability-fixes.md` — backpressure cap + shutdown drain (KTD-4), per-namespace rejection (R6), trust-boundary validation
- **Coverage gap:** No prior learnings exist for evolution/reflexion/verification/spec_manager modules — budget for first-principles verification.
- **Agent-native planning assessment:** Confirmed agentkit is agent-native (Required applicability); classified domain actions (Now/Later/Never); identified execute_stream hook wiring as single most load-bearing architectural issue; suggested 11 implementation units (refined to 9 in this plan); proposed 5 KTDs (expanded to 10 in this plan).
- **Industry benchmarks (from brainstorm):** Codex agent loop (single-thread ReAct + forced verify), Qoder Quest (Spec → Code → Verify loop + auto evolution), Trae SOLO Spec mode (confirmation gate).