---
title: "feat: Complex task quality loop (verify → reflect → evolve)"
type: feat
date: 2026-07-03
origin: docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md
---

# Complex Task Quality Loop (verify → reflect → evolve)

## Summary

Assemble agentkit's declared-but-disconnected verification, reflexion, and evolution mechanisms into a unified quality loop for complex tasks (PLAN_EXEC/TEAM_COLLAB). A task runs and is verified; if verification fails, reflexion reflects and retries; on completion, evolution triggers automatically (record pitfalls, optimize prompts). Foundational fixes: a structured file editing tool, verification defaults, step budget phases, a minimum sandbox, and a Spec review gate. The loop replaces the current "early stop on failure" behavior with "keep working until done, then learn from the outcome."

---

## Problem Frame

agentkit fails on complex tasks because its quality mechanisms are declared but not connected:

- `verification_enabled` defaults to `False` (`src/agentkit/core/react.py:171`) — VERIFICATION phase doesn't enforce tests
- `write_file` listed in `_DEFAULT_CORE_TOOLS` (`src/agentkit/core/react.py:156-162`) but has no implementation class — LLM calls fail
- reflexion only runs in `_fallback_chain.py` Recovery layer, not in the main execution flow
- evolution only triggers manually via `/api/v1/evolution/trigger` — no auto-trigger after tasks
- TEAM_COLLAB falls back to REACT (`src/agentkit/server/routes/chat.py:1336`) instead of running the real orchestrator
- `max_steps=10` hard cap with no "keep working until done" bias — tasks stop at the first verify failure
- `execute_stream()` (`src/agentkit/core/config_driven.py:686`) bypasses `on_task_complete`/`on_task_failed` hooks — R5's auto-evolution would silently no-op on the WebSocket streaming path (the primary user-facing path)

The result is systemic failure: no retry mechanism, no self-evolution. Single-point fixes don't solve this — the independent parts must be assembled into a closed loop.

(See origin: `docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md`)

---

## Requirements

Requirements are grouped by concern. Each carries its origin R-ID for traceability.

### Foundations (all tasks benefit)

- **R1.** Provide a structured file editing tool (`str_replace_editor` with `create` / `str_replace` / `insert_at_line` / `view` commands), replacing the broken `write_file` placeholder. All path parameters must resolve and prefix-check against workspace root, rejecting symlink escape; align with the existing 6-layer terminal security paradigm.
- **R2.** `verification_enabled` defaults to `True` for PLAN_EXEC/TEAM_COLLAB; DIRECT_CHAT/REACT stay `False` (per RV2 — global True would force pytest/ruff on non-code REACT tasks like translation/research).
- **R3.** VERIFICATION phase forces project tests (pytest/ruff) for coding tasks; non-coding tasks use Spec-declared verification commands (per RV8 — forcing pytest on non-Python projects causes false failures).

### Closed loop (complex tasks)

- **R4.** Reflexion upgraded from fallback-only to main-flow retry for PLAN_EXEC/TEAM_COLLAB: verify fails → reflect → retry. Implemented by extending ReActEngine's existing reinjection loop, not by driving PLAN_EXEC through ReflexionEngine (per RV4, RV15, RV20 — ReflexionEngine doesn't forward `phase_policy`, and reflexion-as-mode is conceptually distinct from reflexion-as-retry).
- **R5.** Auto-trigger evolution on task completion (success or failure): Reflector records + PitfallDetector detects + PromptOptimizer optimizes. Quality gate: pitfall confidence threshold before ingestion; PromptOptimizer consumption gate (sample count ≥ `min_examples` and confidence meets the threshold); observe-only mode records without feeding the optimizer to avoid noise-driven prompt degradation (per RV14).
- **R6.** Evolution trigger thresholds: failure always runs; success runs at sample rate 0.1 (per RQ2). Integrity/auth: evolution artifacts (pitfalls, optimized prompts) carry actor marking (which agent/expert produced them); cross-workspace sharing defaults off, requires explicit opt-in (per RV14 trust boundary).

### Capability wiring

- **R7.** TEAM_COLLAB does not fall back to REACT — surface failure to user instead of silent degradation. (REWOO/REFLEXION-as-mode deferred per RV10, RV20.)
- **R8.** Spec review gate: first Spec generation emits `spec_review_request` event, suspends execution pending user confirmation (`spec_review_reply`). Confirmation → execute; rejection → replan; timeout → `parked` status (not `failed`) with resume-on-return (per RV16 — 5-min timeout is too short for long tasks).

### Bias and budget

- **R10.** "Keep working until done" bias for complex tasks: don't abandon on first verify failure, auto-enter reflexion retry within remaining step budget.
- **R11.** Step budget split into phase quotas (think=7 / verify=2 / reflect=1 per RQ1), replacing single `max_steps=10`. Quotas are opt-in for PLAN_EXEC/TEAM_COLLAB; `max_steps=10` preserved as total budget for backward compatibility (per RV5 — DIRECT_CHAT/REACT must keep current semantics).
- **R12.** Pitfall retrieval/injection: at task planning, retrieve historical pitfalls by goal/skill similarity from the PitfallDetector store and inject them into prompt context (per RV7: the current system only records pitfalls and never retrieves them, so the "don't repeat pitfalls" goal is only half-served).

---

## Key Technical Decisions

- **KTD-1. Verification canonical path is engine-internal at final-answer (`src/agentkit/core/react.py:1303-1376`), not `RunTestsTool`.** `RunTestsTool` (`src/agentkit/tools/builtin.py:16`) remains for agent-initiated mid-task verification. The engine-internal path runs automatically at the final-answer gate. This avoids double-verify and keeps the agent's manual tool distinct from the engine's automatic gate.

- **KTD-2. Reflexion-as-retry is implemented by extending ReActEngine's reinjection loop, not by driving PLAN_EXEC through ReflexionEngine.** ReflexionEngine (`src/agentkit/core/reflexion.py:88-92`) constructs a vanilla ReActEngine without forwarding `phase_policy` — refactoring it to drive PLAN_EXEC would be large and conceptually conflates reflexion-as-mode with reflexion-as-retry. Instead, extend the existing reinjection loop (which already holds `phase_policy`) to call a reflect step after `max_reinjections` exhausts. ReflexionEngine stays as the standalone engine for the deferred REFLEXION-as-mode.

- **KTD-3. Evolution triggering is a system lifecycle concern, not an agent capability.** The fix is hook-wiring (connecting `on_task_complete`/`on_task_failed` to the streaming path), not exposing evolution as an agent-callable tool. Agents produce the work; the system evolves from the outcome.

- **KTD-4. `execute_stream()` must invoke `on_task_complete`/`on_task_failed` to maintain lifecycle parity with `execute()`.** This is the single most load-bearing architectural fix — without it, R5/R6 are no-ops on the WebSocket streaming path (the primary user-facing path). Use fire-and-forget `asyncio.create_task` with backpressure cap (`max_concurrent * 2`) and shutdown drain per the portal-platform-security-reliability-fixes learning. Evolution errors must not fail the stream.

- **KTD-5. Spec review uses new `spec_review_request`/`spec_review_reply` events + `parked` Spec status.** `confirmation_request` is not reused (per RQ4 — different timeout semantics, different lifecycle, portal.py has no confirmation wiring). Events must follow terminal-event symmetry (open milestone → close on every path: confirm/reject/timeout/cancel) with stable `spec_review_id = f"{plan_id}:spec_review"` per the streaming-event-contract-residuals learning. Default timeout 30 min, configurable; on timeout → `parked` not `failed`.

- **KTD-6. `str_replace_editor` symlink defense uses `Path.resolve()` + `Path.relative_to(resolved_workspace_root)`, not `str.startswith()`.** `startswith` admits path-prefix collisions (`/workspace_root_evil/...`). Pattern mirrors the SSRF hop-revalidation approach from the bitable-companion-service security learning. Filesystem ops wrapped in `asyncio.to_thread` to avoid blocking the event loop.

- **KTD-7. Phase-budget counters are checkpoint-reconstructable from restored plan phase statuses.** On resume, `think`/`verify`/`reflect` spent counts derive from persisted phase state, not reset to zero (per long-horizon-reliability-code-review-fixes learning P2 #8/#11 — resume is full state reconstruction).

- **KTD-8. Reflexion-gave-up status is `"gave_up_after_reflections"`, not `"success"`.** When `max_reflections` exhausts without verify pass, the status propagates to `TaskResult` and evolution's `outcome` field. Evolution's `RuleBasedReflector` treats this as failure for reflection purposes. Without this, evolution silently skips reflection on reflexion-gave-up tasks (per agent-native planning finding OQ-D).

- **KTD-9. `ReActEngine.reset()` called between reflexion retry attempts.** Without reset, the loop detector (`_loop_threshold=2`) misfires on retry because `_loop_window` state leaks across attempts (per long-horizon-reliability-code-review-fixes learning P2 #9, RV22).

- **KTD-10. DIRECT_CHAT does not trigger evolution (explicit non-goal).** DIRECT_CHAT bypasses BaseAgent entirely (`src/agentkit/server/routes/chat.py:1245` calls `llm_gateway.chat()` directly). Wiring evolution would require fabricating TaskMessage/TaskResult. Simple Q&A tasks have low evolution value. Documented as non-goal, not a gap to fix later.
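
The state leak described in KTD-9 can be illustrated with a toy detector. This is a minimal sketch under stated assumptions: the `LoopDetector` class, its exact-match counting, and the window size are hypothetical stand-ins, since the real `_loop_window` / `_loop_threshold` logic in `react.py` is not shown in this plan.

```python
from collections import deque


class LoopDetector:
    """Hypothetical stand-in for the engine's loop detector."""

    def __init__(self, threshold: int = 2, window: int = 8) -> None:
        self.threshold = threshold
        self._window: deque[str] = deque(maxlen=window)

    def observe(self, action: str) -> bool:
        """Record an action; return True once it has occurred `threshold` times."""
        self._window.append(action)
        return sum(1 for seen in self._window if seen == action) >= self.threshold

    def reset(self) -> None:
        """Clear the window so a new attempt starts with no history."""
        self._window.clear()


detector = LoopDetector(threshold=2)
first_attempt = detector.observe("run pytest")     # first sighting, no loop
leaked_retry = detector.observe("run pytest")      # retry without reset: flagged
detector.reset()
clean_retry = detector.observe("run pytest")       # retry after reset: not flagged
```

Under these assumptions, a legitimate retry that repeats the previous attempt's first action is flagged as a loop unless the window is cleared first.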

---

## High-Level Technical Design

### Quality loop flow

```mermaid
flowchart TB
    A[Complex task starts] --> B[Execute: think/act/observe]
    B --> C{Verify at final-answer}
    C -->|Pass| D[Mark completed]
    C -->|Fail| E{Reflect quota remaining?}
    E -->|Yes| F[Call reset then reflect]
    F --> G[Generate improvement]
    G --> B
    E -->|No| H[Mark gave_up_after_reflections]
    D --> I[Trigger evolution: fire-and-forget]
    H --> I
    I --> J{Failure?}
    J -->|Yes| K[Reflector + PitfallDetector: 100%]
    J -->|No| L[Sample at 0.1 rate]
    K --> M[Quality gate: confidence threshold]
    L --> M
    M --> N{Observe-only?}
    N -->|Yes| O[Record only]
    N -->|No| P[PromptOptimizer: consume gated]
```

### execute_stream hook wiring

```mermaid
sequenceDiagram
    participant WS as WebSocket (chat.py)
    participant CDA as ConfigDrivenAgent
    participant ES as execute_stream()
    participant Hooks as on_task_complete/failed
    participant EVO as evolve_after_task()

    WS->>CDA: execute_stream(task)
    CDA->>ES: yield ReActEvent
    ES-->>WS: token / final_answer (streaming)
    Note over ES: finally block (new)
    ES->>Hooks: invoke with TaskResult
    Hooks->>EVO: asyncio.create_task (fire-and-forget)
    Note over EVO: backpressure cap + shutdown drain
    EVO-->>EVO: Reflector → PitfallDetector → PromptOptimizer
```

### Spec review gate lifecycle

```mermaid
stateDiagram-v2
    [*] --> PLANNING
    PLANNING --> SPEC_GENERATED
    SPEC_GENERATED --> SPEC_REVIEW_PENDING: emit spec_review_request
    SPEC_REVIEW_PENDING --> EXECUTING: spec_review_reply (confirm)
    SPEC_REVIEW_PENDING --> PLANNING: spec_review_reply (reject)
    SPEC_REVIEW_PENDING --> PARKED: timeout (30min)
    PARKED --> EXECUTING: resume on return
    EXECUTING --> [*]
```

---

## Implementation Units

### U1. str_replace_editor tool + remove write_file bug

- **Goal:** Provide a working structured file editing tool with workspace-root security; remove the broken `write_file` placeholder.
- **Requirements:** R1
- **Dependencies:** None
- **Files:**
  - Create: `src/agentkit/tools/str_replace_editor.py` (new tool class)
  - Modify: `src/agentkit/core/react.py` (remove `write_file` from `_DEFAULT_CORE_TOOLS` at lines 156-162, add `str_replace_editor`)
  - Modify: `src/agentkit/tools/__init__.py` (register new tool)
  - Test: `tests/unit/test_str_replace_editor.py`
- **Approach:** Implement `str_replace_editor` with four commands: `create` (write new file), `str_replace` (exact-match anchor replace), `insert_at_line` (insert at line number), `view` (read with line numbers — needed because `str_replace` requires exact anchors). Path validation: `Path.resolve()` + `Path.relative_to(resolved_workspace_root)`; reject `..`, absolute paths, symlink escape. Wrap filesystem ops in `asyncio.to_thread`. Mirror `ReadFileTool` (`src/agentkit/tools/file_read.py:26`) for Tool base class structure and error handling. Align with 6-layer terminal security paradigm (`src/agentkit/server/auth/terminal_security.py`).
- **Patterns to follow:** `src/agentkit/tools/file_read.py:26` (ReadFileTool — Tool base class, execute schema, `_error()` helper), `src/agentkit/server/auth/terminal_security.py` (layered security, `_SHELL_OPERATORS` pattern)
- **Test scenarios:**
  - **Happy path:** `create` writes new file; `view` returns content with line numbers; `str_replace` replaces exact anchor; `insert_at_line` inserts at specified line
  - **Edge cases:** Empty file create; `str_replace` with multiple matches (error: anchor not unique); `insert_at_line` at line 0 / beyond EOF; `view` with line range
  - **Error and failure paths:** Path traversal `../../etc/passwd` rejected; symlink escape rejected; absolute path `/etc/passwd` rejected; `str_replace` anchor not found (error); file outside workspace root rejected
  - **Integration:** Tool registered in `_DEFAULT_CORE_TOOLS` appears in LLM system prompt; LLM can call it and receive structured result
- **Verification:** `write_file` no longer in `_DEFAULT_CORE_TOOLS`; `str_replace_editor` appears in tool descriptions; path traversal tests pass; `ruff check` clean.
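
The path validation described in the Approach can be sketched as follows. This is an illustrative sketch, not the tool's implementation: `resolve_in_workspace` and `PathEscapeError` are hypothetical names, and the real tool would additionally wrap filesystem calls in `asyncio.to_thread`.

```python
from pathlib import Path


class PathEscapeError(ValueError):
    """Raised when a requested path resolves outside the workspace root."""


def resolve_in_workspace(workspace_root: Path, user_path: str) -> Path:
    """Return the resolved absolute path, or raise if it escapes the root."""
    if Path(user_path).is_absolute():
        raise PathEscapeError(f"absolute paths are not allowed: {user_path}")
    root = workspace_root.resolve()
    # resolve() follows symlinks and collapses "..", so the containment
    # check below sees the real target rather than the spelled path.
    candidate = (root / user_path).resolve()
    try:
        # relative_to() compares whole path components, so a sibling such as
        # "<root>_evil" is rejected where str.startswith() would admit it.
        candidate.relative_to(root)
    except ValueError as exc:
        raise PathEscapeError(f"path escapes workspace root: {user_path}") from exc
    return candidate
```

Because the check runs on the resolved path, a symlink inside the workspace that points outside it is rejected the same way as a literal `..` traversal.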

### U2. execute_stream hook wiring (OQ6 fix)

- **Goal:** Wire `on_task_complete`/`on_task_failed` hooks into the streaming path so R5/R6 evolution triggers on WebSocket-routed tasks.
- **Requirements:** R5 (precondition), R6 (precondition)
- **Dependencies:** None
- **Files:**
  - Modify: `src/agentkit/core/config_driven.py` (`execute_stream()` at line 686 — add hook invocation in `finally` block)
  - Modify: `src/agentkit/core/plan_exec_engine.py` (`execute_stream()` at line 175 — add hook invocation)
  - Modify: `src/agentkit/core/reflexion.py` (`execute_stream()` at line 330 — add hook invocation)
  - Modify: `src/agentkit/server/routes/portal.py` (verify all 3 `execute_stream` call sites at lines 580, 701, 1001 propagate hooks)
  - Test: `tests/unit/test_execute_stream_hooks.py`
- **Approach:** Extract a `_trigger_evolution_hooks(task, result)` helper from the sync `handle_task()` path (lines 473, 493). Call it from `execute_stream()`'s `finally` block. Use `asyncio.create_task()` (fire-and-forget) to avoid blocking the streaming return. Apply backpressure: cap pending evolution tasks at `max_concurrent * 2`, drop + log + increment counter on exceed. Drain pending tasks on app shutdown via `asyncio.gather(*tasks, return_exceptions=True)`. Evolution errors are caught and logged — they must not fail the stream. Follow the `CancellationToken` registration pattern (register in `try`, pop in `finally`) per the streaming-event-contract-residuals learning.
- **Patterns to follow:** `src/agentkit/core/config_driven.py:473,493` (sync hook invocation), `src/agentkit/core/config_driven.py:686` (CancellationToken try/finally pattern), portal-platform-security-reliability-fixes learning (backpressure cap + shutdown drain)
- **Test scenarios:**
  - **Happy path:** `execute_stream` completion fires `on_task_complete` with correct TaskResult; `execute_stream` failure fires `on_task_failed`
  - **Edge cases:** Stream cancelled mid-flight — hooks still fire with cancelled status; evolution task error does not propagate to stream; backpressure cap reached — drop + log + counter increment
  - **Integration:** Same task via REST `execute()` and WebSocket `execute_stream()` produces equivalent evolution log entries (parity test); all 3 portal.py call sites propagate hooks
- **Verification:** Evolution fires after `execute_stream` completes on both success and failure paths; P95 streaming latency increases by less than 50 ms (evolution is fire-and-forget); shutdown drains pending evolution tasks.
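
The fire-and-forget behavior with a backpressure cap and shutdown drain can be sketched with plain `asyncio`. The `BackgroundTaskPool` class and its `dropped` counter are hypothetical names for illustration; only `asyncio.create_task` and `asyncio.gather(..., return_exceptions=True)` come from the plan itself.

```python
import asyncio
import logging
from collections.abc import Coroutine
from typing import Any

logger = logging.getLogger(__name__)


class BackgroundTaskPool:
    """Fire-and-forget task pool with a pending cap and a shutdown drain."""

    def __init__(self, max_pending: int) -> None:
        self._max_pending = max_pending
        self._tasks: set[asyncio.Task[Any]] = set()
        self.dropped = 0

    def submit(self, coro: Coroutine[Any, Any, Any]) -> bool:
        """Schedule coro; drop it and return False when the cap is reached."""
        if len(self._tasks) >= self._max_pending:
            coro.close()  # avoids a "coroutine was never awaited" warning
            self.dropped += 1
            logger.warning("evolution task dropped: %d pending", len(self._tasks))
            return False
        task = asyncio.create_task(self._guard(coro))
        self._tasks.add(task)  # keep a strong reference until the task is done
        task.add_done_callback(self._tasks.discard)
        return True

    @staticmethod
    async def _guard(coro: Coroutine[Any, Any, Any]) -> None:
        """Run coro and log any error so it never reaches the caller."""
        try:
            await coro
        except Exception:
            logger.exception("evolution task failed")

    async def drain(self) -> None:
        """Wait for all pending tasks; intended for app shutdown."""
        await asyncio.gather(*self._tasks, return_exceptions=True)
```

In this sketch the streaming path would call `submit(...)` from its `finally` block with `max_pending` set to `max_concurrent * 2`, and the app shutdown handler would await `drain()`.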

### U3. Verification defaults + forced pytest/ruff + minimum sandbox

- **Goal:** Enable verification by default for complex tasks; force pytest/ruff for coding tasks; establish minimum sandbox as security prerequisite.
- **Requirements:** R2, R3, RV3 (sandbox prerequisite)
- **Dependencies:** U1 (str_replace_editor provides safe editing within sandbox)
- **Files:**
  - Modify: `src/agentkit/core/react.py` (thread `verification_enabled` parameter through PLAN_EXEC/TEAM_COLLAB construction, default True for those modes)
  - Modify: `src/agentkit/core/phase.py` (`default_policy()` at line 139 — VERIFICATION phase forces pytest/ruff for coding tasks)
  - Modify: `src/agentkit/core/plan_exec_engine.py` (pass `verification_enabled=True` when constructing ReActEngine for PLAN_EXEC)
  - Modify: `src/agentkit/experts/orchestrator.py` (pass `verification_enabled=True` for TEAM_COLLAB)
  - Create: `src/agentkit/core/sandbox.py` (minimum sandbox enforcement: workspace-write, no network)
  - Test: `tests/unit/test_verification_defaults.py`, `tests/unit/test_sandbox.py`
- **Approach:** R2: `verification_enabled` defaults True only for PLAN_EXEC/TEAM_COLLAB; DIRECT_CHAT/REACT stay False (per RV2). Thread the parameter through `PlanExecEngine` and `TeamOrchestrator` construction, not as a global default change. R3: In `default_policy()` VERIFICATION phase, add coding-task detection (check for `pyproject.toml` or `.py` files in workspace) — force `pytest -x -q` and `ruff check` for coding tasks; non-coding tasks use Spec-declared verification commands. RV3: Create `sandbox.py` with workspace-root enforcement (reuse U1's path validation) and network blocking (disable `httpx`/`requests`/`socket` for tool calls during VERIFICATION). Sandbox is the minimum layer; full tiering (read-only/workspace-write/danger) deferred.
- **Patterns to follow:** `src/agentkit/core/phase.py:139` (`default_policy` — PhasePolicy construction), `src/agentkit/tools/advance_phase.py:20` (forced-transition pattern for VERIFICATION→DELIVERY)
- **Test scenarios:**
  - **Happy path:** PLAN_EXEC task with `pyproject.toml` runs pytest+ruff in VERIFICATION; TEAM_COLLAB task verifies by default; non-coding task uses Spec-declared command
  - **Edge cases:** Workspace with no `pyproject.toml` — skip pytest, use Spec command; empty workspace — verification passes (no tests to run); ruff finds issues — reinject as verify failure
  - **Error and failure paths:** pytest fails — reinject error per `max_reinjections`; sandbox blocks network call — structured error returned to LLM; path traversal attempt in verification command — rejected
  - **Integration:** Sandbox enforcement applies to all tool calls during VERIFICATION phase; coding-task detection correctly identifies Python vs non-Python workspaces
- **Verification:** PLAN_EXEC/TEAM_COLLAB verify by default; DIRECT_CHAT/REACT do not verify; coding tasks force pytest/ruff; non-coding tasks use Spec commands; sandbox blocks network during VERIFICATION.
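
The per-mode default and the coding-task detection can be sketched together. The function names and the plain-string mode values are assumptions for illustration; the detection rule (`pyproject.toml` or any `.py` file) and the two forced commands are taken from the Approach above.

```python
from pathlib import Path

# Modes that verify by default (R2); every other mode stays off.
_VERIFY_BY_DEFAULT = frozenset({"PLAN_EXEC", "TEAM_COLLAB"})


def verification_default(mode: str) -> bool:
    """Return the default value of verification_enabled for an execution mode."""
    return mode in _VERIFY_BY_DEFAULT


def select_verification_commands(workspace: Path, spec_commands: list[str]) -> list[str]:
    """Force pytest/ruff for Python workspaces, else use Spec-declared commands."""
    has_pyproject = (workspace / "pyproject.toml").is_file()
    has_py_files = next(workspace.rglob("*.py"), None) is not None
    if has_pyproject or has_py_files:
        return ["pytest -x -q", "ruff check"]
    return list(spec_commands)
```

A workspace with neither marker falls through to whatever the Spec declares, which may be an empty list.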

### U4. Step budget phases + keep working bias

- **Goal:** Split `max_steps` into phase quotas (think/verify/reflect); add "keep working until done" bias for complex tasks.
- **Requirements:** R11, R10
- **Dependencies:** U3 (verify quota needs verification defaults)
- **Files:**
  - Modify: `src/agentkit/core/react.py` (`__init__` at line 167: add `phase_budgets` parameter; `_execute_loop()` at line 561: enforce per-phase quotas; loop detector at lines 220-221: raise threshold or exempt reflexion retries)
  - Modify: `src/agentkit/core/phase.py` (`PhasePolicy` at line 59 — add `step_budget` field)
  - Modify: `src/agentkit/core/plan_exec_engine.py` (pass `phase_budgets={"think": 7, "verify": 2, "reflect": 1}` for PLAN_EXEC)
  - Test: `tests/unit/test_step_budget.py`
- **Approach:** R11: Add `phase_budgets: dict[str, int] | None = None` to ReActEngine. When set, enforce per-phase quotas: think exhausted → force verify; verify exhausted → return best result; reflect exhausted → no more reflection. When `None`, behavior is unchanged (`max_steps=10` total budget). Quotas are opt-in for PLAN_EXEC/TEAM_COLLAB. Budget counters are checkpoint-reconstructable: on resume, spent counts derive from restored plan phase statuses (KTD-7). R10: "Keep working until done" is implemented via the reflect quota: a verify failure does not abandon the task; it enters reflexion retry within the remaining reflect quota. Loop detector threshold is raised from 2 to 3 for keep-working mode (per RV22: threshold=2 produces false positives on retry). `ReActEngine.reset()` is called between retry attempts (KTD-9).
- **Patterns to follow:** `src/agentkit/core/phase.py:59` (`PhasePolicy.auto_advance_after_steps` — existing per-phase step limit pattern), `src/agentkit/core/react.py:220-221` (loop detector — `_loop_window`, `_loop_threshold`)
- **Test scenarios:**
  - **Happy path:** PLAN_EXEC with `phase_budgets={"think":7,"verify":2,"reflect":1}` — think stops at 7, verify runs, reflect runs at most 1; without `phase_budgets` — behavior unchanged (`max_steps=10`)
  - **Edge cases:** Think quota exhausted mid-tool-call — finish current step, then force verify; reflect quota 0 — no reflection, return best result; resume after checkpoint — budget counters reconstructed from phase statuses
  - **Error and failure paths:** Loop detector threshold 3 — 2 similar retries don't abort, 3 do; `reset()` between reflexion attempts — `_loop_window` cleared
  - **Integration:** Phase budgets enforced in `_execute_loop()`; checkpoint save/restore preserves budget state; DIRECT_CHAT/REACT unaffected (no `phase_budgets` set)
- **Verification:** Phase quotas enforced; backward compatibility (no `phase_budgets` = current behavior); loop detector doesn't false-positive on reflexion retry; budget state survives checkpoint/resume.
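
A per-phase budget with checkpoint reconstruction can be sketched as a small dataclass. `PhaseBudget` is a hypothetical name, and representing the restored phase statuses as a list of phase names (one per recorded step) is an assumption made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class PhaseBudget:
    """Tracks steps spent per phase against fixed quotas."""

    quotas: dict[str, int]
    spent: dict[str, int] = field(default_factory=dict)

    def remaining(self, phase: str) -> int:
        """Steps left for a phase; unknown phases have no budget."""
        return self.quotas.get(phase, 0) - self.spent.get(phase, 0)

    def consume(self, phase: str) -> bool:
        """Spend one step; return False when the phase quota is exhausted."""
        if self.remaining(phase) <= 0:
            return False
        self.spent[phase] = self.spent.get(phase, 0) + 1
        return True

    @classmethod
    def from_checkpoint(cls, quotas: dict[str, int], step_phases: list[str]) -> "PhaseBudget":
        """Rebuild spent counts from persisted steps instead of resetting to zero."""
        budget = cls(quotas=dict(quotas))
        for phase in step_phases:
            budget.spent[phase] = budget.spent.get(phase, 0) + 1
        return budget
```

With the quotas from RQ1, a resumed task that had already recorded two think steps and one verify step would continue with five think steps, one verify step, and one reflect step.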

### U5. Reflexion in main flow

- **Goal:** Upgrade reflexion from fallback-only to main-flow retry: verify fails → reflect → retry.
- **Requirements:** R4
- **Dependencies:** U3 (verification), U4 (reflect quota)
- **Files:**
  - Modify: `src/agentkit/core/react.py` (reinjection loop at lines 1303-1376 — after `max_reinjections` exhausts, call reflect step before returning final)
  - Modify: `src/agentkit/core/config_driven.py` (parameterize `max_reflections=2` at lines 835, 1047 — currently hardcoded 3; make configurable)
  - Test: `tests/unit/test_reflexion_main_flow.py`
- **Approach:** Extend the existing reinjection loop (`src/agentkit/core/react.py:1303-1376`): when verify fails and `max_reinjections` is exhausted, and reflect quota remains, call `reset()` (KTD-9), generate reflection text (mirror `ReflexionEngine._reflect()` at `src/agentkit/core/reflexion.py:639`), inject the reflection into context, and retry the loop. Parameterize `max_reflections` (RQ3: 2 for the main path, 1 for the Recovery layer; currently hardcoded to 3 at `config_driven.py:835,1047`). When `max_reflections` exhausts without a verify pass, return status `"gave_up_after_reflections"` (KTD-8: not `"success"`, so evolution treats it as failure). ReflexionEngine stays standalone for REFLEXION-as-mode (deferred); the Recovery layer escalates to a human rather than reflecting again (avoids double-reflexion).
- **Patterns to follow:** `src/agentkit/core/react.py:1303-1376` (existing reinjection loop — extend, don't replace), `src/agentkit/core/reflexion.py:639` (reflect step — mirror the LLM call shape), `src/agentkit/server/_fallback_chain.py:118` (Recovery `max_retries=1` — keep distinct from main path)
- **Test scenarios:**
  - **Happy path:** Covers AE1 — verify fails → reflect → retry within reflect quota; retry passes verify → mark completed
  - **Edge cases:** `max_reflections=2` — 2 retry attempts, then `"gave_up_after_reflections"`; `reset()` between attempts clears loop window; reflect quota 0 — no retry, return best result
  - **Error and failure paths:** Reflect LLM call fails — skip reflection, retry with existing context; all retries fail — status `"gave_up_after_reflections"` propagates to TaskResult and evolution
  - **Integration:** DIRECT_CHAT/REACT unaffected (no reflect quota); Recovery layer (`_fallback_chain.py`) still uses `max_reflections=1` — no double-reflexion; evolution's `RuleBasedReflector` treats `"gave_up_after_reflections"` as failure
- **Verification:** Verify-fail → reflect → retry fires; `max_reflections=2` configurable; `"gave_up_after_reflections"` status propagates; no double-reflexion with Recovery layer; DIRECT_CHAT unaffected.
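
The control flow of verify → reflect → retry can be sketched independently of the engine. This is a synchronous toy under stated assumptions: `run_with_reflection` and its callable parameters are hypothetical, whereas the real loop lives inside the async reinjection path in `react.py`.

```python
from collections.abc import Callable


def run_with_reflection(
    attempt: Callable[[list[str]], str],
    verify: Callable[[str], bool],
    reflect: Callable[[str], str],
    reset: Callable[[], None],
    max_reflections: int = 2,
) -> tuple[str, str]:
    """Return (status, answer) after at most max_reflections retries."""
    reflections: list[str] = []
    answer = attempt(reflections)
    for _ in range(max_reflections):
        if verify(answer):
            return "success", answer
        reset()  # clear per-attempt state such as the loop window (KTD-9)
        reflections.append(reflect(answer))
        answer = attempt(reflections)  # retry with the reflections in context
    if verify(answer):
        return "success", answer
    # Exhausted without a verify pass: report a distinct status (KTD-8).
    return "gave_up_after_reflections", answer
```

Returning a distinct status rather than `"success"` is what lets the evolution layer route the outcome down the failure path.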

### U6. Auto evolution trigger + quality gate

- **Goal:** Auto-trigger evolution on task completion with quality gates and actor marking.
- **Requirements:** R5, R6
- **Dependencies:** U2 (execute_stream hooks), U5 (quality signal from reflexion)
- **Files:**
  - Modify: `src/agentkit/evolution/lifecycle.py` (`evolve_after_task()` at line 131 — add success sample rate gate, quality threshold, actor marking)
  - Modify: `src/agentkit/evolution/pitfall_detector.py` (add confidence threshold before ingestion)
  - Create: `src/agentkit/evolution/config.py` (`EvolutionConfig` with `success_sample_rate: float = 0.1`, `min_confidence: float = 0.5`, `observe_only: bool = True`)
  - Modify: `src/agentkit/evolution/prompt_optimizer.py` (consumption gate: sample count ≥ `min_examples` and confidence meets the threshold)
  - Test: `tests/unit/test_evolution_auto_trigger.py`
- **Approach:** R5: `EvolutionConfig.success_sample_rate=0.1` gates success-path evolution at `evolve_after_task()` entry using `random.random() < rate` (mirror the `audit_sample_rate` pattern at `alignment.py:115`). The failure path always runs (100%). Quality gate: pitfall confidence threshold before ingestion (`min_confidence=0.5`; low-confidence pitfalls are discarded or marked observe-only); PromptOptimizer consumption gate (sample count ≥ `min_examples=3` and confidence meets the threshold); observe-only mode (`observe_only=True` initially, which records without feeding the optimizer to avoid noise-driven prompt degradation, per RV14). R6: Actor marking on all evolution artifacts (pitfalls, optimized prompts) records which agent/expert produced them. Same-workspace sharing defaults on; cross-workspace sharing defaults off and requires explicit opt-in. Trust boundary: evolution products are agent-produced and must be validated before entering the shared store (they are not trusted merely because an agent produced them). Known limitation (per RQ2): the default `RuleBasedReflector` only generates suggestions on `outcome=='failure'`, so under the default reflector the success sampling path may always exit early; success sampling becomes effective once the reflector is upgraded or a success-path learning signal is available.
- **Patterns to follow:** `src/agentkit/evolution/lifecycle.py:131` (`evolve_after_task` — extend, don't replace), `src/agentkit/evolution/pitfall_detector.py:103` (`check_pitfalls` — Jaccard similarity pattern), portal-platform-security-reliability-fixes learning (per-namespace rejection, backpressure, trust-boundary validation)
- **Test scenarios:**
  - **Happy path:** Covers AE3 — task fails → evolution fires (100%) → Reflector records → PitfallDetector detects; task succeeds → evolution fires at 0.1 rate
  - **Edge cases:** Observe-only mode — pitfalls recorded but not fed to optimizer; backpressure cap reached — evolution task dropped + logged; low-confidence pitfall — discarded or marked observe-only
  - **Error and failure paths:** Evolution task error — caught, logged, does not fail the stream; PromptOptimizer sample count < 3 — skip optimization
  - **Integration:** Evolution fires via U2's `execute_stream` hooks; actor marking present on all artifacts; cross-workspace sharing rejected without opt-in; `"gave_up_after_reflections"` status triggers failure-path evolution
- **Verification:** Failure tasks always trigger evolution; success tasks trigger at 0.1 rate; observe-only mode records without mutating prompts; actor marking present; cross-workspace sharing gated.
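
The trigger and consumption gates can be sketched as two pure functions. Assumptions for this example: `should_evolve` and `may_feed_optimizer` are hypothetical helper names, `min_examples` is placed on `EvolutionConfig` for convenience (the plan lists it only as a PromptOptimizer parameter), and "confidence meets the threshold" is read as "at least `min_examples` samples at or above `min_confidence`".

```python
import random
from collections.abc import Callable
from dataclasses import dataclass


@dataclass(frozen=True)
class EvolutionConfig:
    success_sample_rate: float = 0.1
    min_confidence: float = 0.5
    min_examples: int = 3  # assumption: kept on the config for this sketch
    observe_only: bool = True


def should_evolve(
    outcome: str,
    config: EvolutionConfig,
    rng: Callable[[], float] = random.random,
) -> bool:
    """Non-success outcomes always evolve; successes are sampled."""
    if outcome != "success":
        return True  # includes "gave_up_after_reflections"
    return rng() < config.success_sample_rate


def may_feed_optimizer(confidences: list[float], config: EvolutionConfig) -> bool:
    """Gate PromptOptimizer consumption on mode, sample count, and confidence."""
    if config.observe_only:
        return False  # record only, never mutate prompts
    accepted = [c for c in confidences if c >= config.min_confidence]
    return len(accepted) >= config.min_examples
```

Passing `rng` as a parameter keeps the sampling decision deterministic in tests.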

### U7. Pitfall retrieval/injection

- **Goal:** Retrieve historical pitfalls by goal/skill similarity at task planning and inject into prompt context.
- **Requirements:** R12
- **Dependencies:** U6 (evolution store with pitfalls)
- **Files:**
  - Modify: `src/agentkit/evolution/pitfall_detector.py` (`check_pitfalls()` at line 103 — extend to accept goal text, use semantic similarity not just `task_type` filter)
  - Modify: `src/agentkit/core/react.py` (system prompt construction — inject pitfall warnings section)
  - Modify: `src/agentkit/core/plan_exec_engine.py` (at planning phase, call pitfall retrieval and inject into Spec context)
  - Test: `tests/unit/test_pitfall_injection.py`
- **Approach:** Extend `PitfallDetector.check_pitfalls()` to accept goal text and use `experience_store.search` with semantic similarity (not just `task_type` Jaccard filter). Wire `experience_store` to agent runtime as app-state singleton (KTD per OQ-E — instantiated at startup, shared across tasks). At PLAN_EXEC planning phase, retrieve top-K pitfalls (K=3) by goal/skill similarity, inject as "Historical pitfalls to avoid" section in system prompt. Gate by `WarningLevel.HIGH` only (avoid noise). Pitfall injection appears in agent's first LLM call. PitfallDetector currently only used in `evolution_dashboard.py:549` (read-only) — this is the first runtime integration.
- **Patterns to follow:** `src/agentkit/evolution/pitfall_detector.py:103` (`check_pitfalls` — extend signature, don't break existing callers), `src/agentkit/memory/semantic.py` (semantic retrieval pattern if applicable)
- **Test scenarios:**
  - **Happy path:** Task with similar goal to past failure → top-3 pitfalls injected into system prompt → pitfalls appear in agent's first LLM call
  - **Edge cases:** No pitfalls in store → empty section, no injection; all pitfalls low severity → none injected (gate by HIGH); pitfall store has 100+ entries → only top-3 by similarity retrieved (no N+1)
  - **Error and failure paths:** `experience_store` unavailable → skip injection, log warning; similarity search times out → skip injection, continue task
  - **Integration:** PitfallDetector app-state singleton accessible from PLAN_EXEC planning; existing `evolution_dashboard.py` caller still works (backward compatible)
- **Verification:** Pitfalls injected at planning phase appear in system prompt; similarity retrieval works on goal text; HIGH-severity gate filters noise; existing dashboard caller unaffected.
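
Top-K retrieval with a HIGH-only gate can be sketched as below. Token-level Jaccard similarity is used here only as a self-contained stand-in for the semantic search the plan calls for; the `Pitfall` record, the string severity levels, and the `min_similarity` cutoff are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pitfall:
    goal: str    # goal text of the task that produced the pitfall
    advice: str  # warning to inject into the prompt
    level: str   # "HIGH", "MEDIUM", or "LOW" (stand-in for WarningLevel)


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def retrieve_pitfalls(
    goal: str, store: list[Pitfall], k: int = 3, min_similarity: float = 0.2
) -> list[Pitfall]:
    """Return up to k HIGH-level pitfalls, most similar goal first."""
    scored = [(jaccard(goal, p.goal), p) for p in store if p.level == "HIGH"]
    scored = [(score, p) for score, p in scored if score >= min_similarity]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def render_section(pitfalls: list[Pitfall]) -> str:
    """Render the prompt section; empty string means no injection."""
    if not pitfalls:
        return ""
    return "\n".join(["Historical pitfalls to avoid:"] + [f"- {p.advice}" for p in pitfalls])
```

Returning an empty string when nothing qualifies lets the caller skip the section entirely instead of emitting an empty heading.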

### U8. Spec review gate

- **Goal:** Pause PLAN_EXEC after first Spec generation for user review; resume on confirmation, replan on rejection.
- **Requirements:** R8
- **Dependencies:** U5 (reflexion retry for post-review execution)
- **Files:**
  - Modify: `src/agentkit/core/plan_exec_engine.py` (at lines 269-277: after Spec generation, emit `spec_review_request`, suspend on pending future)
  - Modify: `src/agentkit/core/spec_manager.py` (add `parked` status, `resume()` method)
  - Modify: `src/agentkit/server/routes/chat.py` (add `spec_review_request`/`spec_review_reply` to `_VALID_TEAM_EVENT_TYPES` at line 144; add handler for `spec_review_reply`)
  - Modify: `src/agentkit/server/routes/portal.py` (add event forwarding for spec review events)
  - Test: `tests/unit/test_spec_review_gate.py`
- **Approach:** At `plan_exec_engine.py:269-277` (currently generates the Spec and immediately executes), insert: emit a `spec_review_request` event (carrying `spec_id`, `goal`, `steps`, `spec_review_id = f"{plan_id}:spec_review"`), then suspend on a pending `asyncio.Future`. The wait ends one of three ways: `spec_review_reply` confirm → resume execution; `spec_review_reply` reject → replan (call `GoalPlanner` again with the rejection feedback); timeout (30 min default, configurable) → set Spec status `parked` (not `failed`) and allow resume-on-return. Add `spec_review_request`/`spec_review_reply` to `_VALID_TEAM_EVENT_TYPES` (per the streaming-event-whitelist learning: without this, events silently no-op with a 200 response). Follow terminal-event symmetry (open milestone → close on every path). Mirror the CancellationToken pattern (register the pending future, pop in `finally`). RQ4 confirmed: use new events rather than reusing `confirmation_request` (different timeout semantics, different lifecycle, portal.py has no confirmation wiring).
- **Patterns to follow:** `src/agentkit/core/config_driven.py:686` (CancellationToken try/finally — register/pop pattern), `src/agentkit/server/routes/chat.py:144` (`_VALID_TEAM_EVENT_TYPES` — add new events), `src/agentkit/server/routes/chat.py:1365-1377` (confirmation pattern — reference, not reuse), streaming-event-contract-residuals learning (terminal-event symmetry, stable identifier)
- **Test scenarios:**
  - **Happy path:** Covers AE4 — PLAN_EXEC generates Spec → `spec_review_request` emitted → execution suspends → user confirms → `spec_review_reply` → execution resumes
  - **Edge cases:** User rejects → replan with feedback → new Spec generated → review again; timeout (30min) → Spec status `parked` (not `failed`) → resume on return; stream cancelled during review → future cancelled, no deadlock
  - **Error and failure paths:** `spec_review_reply` with invalid `spec_review_id` → error response; future resolution error → execution fails gracefully; event not in whitelist → test asserts it IS in whitelist (silent failure prevention)
  - **Integration:** Events forwarded by portal.py; frontend receives `spec_review_request` and can render review UI; `parked` Spec survives page reload
- **Verification:** Spec review round-trip works (request → suspend → reply → resume); rejection triggers replan; timeout → parked not failed; events in whitelist (no silent no-op).
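
The suspend/resume mechanics described in U8 could look roughly like the sketch below. It is a minimal illustration under stated assumptions, not the engine's actual code: `SpecReviewGate`, `ReviewOutcome`, `wait_for_review`, and `reply` are hypothetical names. Only the `f"{plan_id}:spec_review"` identifier, the 30-minute default, the register/pop-in-`finally` shape, and the parked-on-timeout behavior come from this plan.

```python
import asyncio
from enum import Enum


class ReviewOutcome(str, Enum):
    CONFIRM = "confirm"
    REJECT = "reject"
    PARKED = "parked"  # timeout: the Spec is parked, not failed


class SpecReviewGate:
    """Hypothetical helper that tracks pending review futures by stable ID."""

    def __init__(self) -> None:
        self._pending = {}  # spec_review_id -> asyncio.Future

    async def wait_for_review(self, plan_id: str, timeout_s: float = 30 * 60) -> ReviewOutcome:
        review_id = f"{plan_id}:spec_review"
        future = asyncio.get_running_loop().create_future()
        self._pending[review_id] = future  # register before suspending
        try:
            decision = await asyncio.wait_for(future, timeout=timeout_s)
            return ReviewOutcome(decision)
        except asyncio.TimeoutError:
            return ReviewOutcome.PARKED
        finally:
            # Pop on every path (confirm, reject, timeout, cancellation).
            self._pending.pop(review_id, None)

    def reply(self, review_id: str, decision: str) -> bool:
        """Resolve a pending review; False means unknown or already resolved."""
        future = self._pending.get(review_id)
        if future is None or future.done():
            return False
        future.set_result(decision)
        return True
```

In this sketch, `reply` returning `False` for an unknown or already-resolved `spec_review_id` is the point where a handler would produce the error response listed in the test scenarios.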

### U9. TEAM_COLLAB no fall-back to REACT

- **Goal:** TEAM_COLLAB surfaces failure to user instead of silently falling back to REACT.
- **Requirements:** R7
- **Dependencies:** None (routing change only)
- **Files:**
  - Modify: `src/agentkit/server/routes/chat.py` (at lines 1336-1344: change the TEAM_COLLAB branch to stop falling back and surface the failure)
  - Modify: `AGENTS.md` (update to reflect actual behavior: remove the stale "抛 not yet supported" ("raises not yet supported") claim and document TEAM_COLLAB routing)
  - Test: `tests/unit/test_team_collab_routing.py`
- **Approach:** At `chat.py:1336-1344` (currently falls back to REACT with a warning for TEAM_COLLAB), change the TEAM_COLLAB branch to route to TeamOrchestrator+SharedWorkspace (real wiring) or, if the orchestrator is unavailable, surface the failure to the user (no silent fall-back). Update AGENTS.md to remove the stale "抛 not yet supported" ("raises not yet supported") claim for REWOO/REFLEXION/TEAM_COLLAB, and document that TEAM_COLLAB routes to TeamOrchestrator while REWOO/REFLEXION-as-mode are deferred (not "unsupported"). This is a routing change, not a full TEAM_COLLAB implementation; the orchestrator already exists (`src/agentkit/experts/orchestrator.py:45`).
- **Patterns to follow:** `src/agentkit/server/routes/chat.py:758-808` (PLAN_EXEC routing — mutual exclusivity with fallback chain, KTD5 pattern)
- **Test scenarios:**
  - **Happy path:** `@team` prefix → routes to TeamOrchestrator (not REACT fall-back); TeamOrchestrator executes phases
  - **Edge cases:** TeamOrchestrator unavailable → error surfaced to user (not silent REACT); team template not found → error with template list
  - **Error and failure paths:** All phases fail → failure surfaced to user with no route-level fall-back to REACT (the orchestrator's existing internal `_fallback_to_single_agent` is acceptable)
  - **Integration:** AGENTS.md updated; REWOO/REFLEXION-as-mode still fall back (deferred, not in scope)
- **Verification:** TEAM_COLLAB routes to TeamOrchestrator; no silent REACT fall-back; AGENTS.md reflects actual behavior.
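
The routing decision in U9 can be pictured with the sketch below. It is an illustration only: `TeamRunner`, `TeamCollabUnavailable`, and `route_team_collab` are hypothetical stand-ins, and the real TeamOrchestrator interface in `src/agentkit/experts/orchestrator.py` may look different. What it shows is the intended shape: either the orchestrator runs, or an explicit error reaches the user, with no REACT branch in between.

```python
from typing import Optional, Protocol


class TeamRunner(Protocol):
    """Hypothetical minimal view of the orchestrator used by the router."""

    def run(self, goal: str) -> str: ...


class TeamCollabUnavailable(RuntimeError):
    """Surfaced to the user instead of silently falling back to REACT."""


def route_team_collab(
    goal: str,
    orchestrator: Optional[TeamRunner],
    templates: set[str],
    template: str,
) -> str:
    if orchestrator is None:
        raise TeamCollabUnavailable("TEAM_COLLAB orchestrator is unavailable")
    if template not in templates:
        # Include the known templates so the user can correct the request.
        raise TeamCollabUnavailable(
            f"Unknown team template {template!r}; available: {sorted(templates)}"
        )
    return orchestrator.run(goal)
```

Raising a dedicated exception type, rather than returning a sentinel, makes it harder for a caller to ignore the failure and quietly take another path.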

---

## Scope Boundaries

### Deferred for later

- **Full sandbox tiering** (read-only / workspace-write / danger) — P2 priority; only minimum sandbox (workspace-write, no network) pulled into scope as R3/R10 prerequisite (per RV3).
- **REWOO/REFLEXION-as-mode** (as independent execution modes) — deferred per RV10 (no target service for REWOO, conceptually distinct from reflexion-as-retry per RV20); R7 narrowed to TEAM_COLLAB only.
- **R9 coding_harness** (Worker-Verifier adversarial harness) — deferred per RV11 (R3+R4 already satisfy the goal), RV12 (4-stage pipeline to single-stage PLAN_EXEC phase mapping undefined), RV13 (no independent success criteria). Trust boundary: coding_harness executing untrusted code requires sandbox — depends on full sandbox tiering.
- **Model autonomous compaction** — existing threshold approach works.
- **Three-tier nested loop** (submission / handler / turn) — cost exceeds benefit.
- **Spec output as human-readable markdown** — current YAML Spec + review gate works; conversion to markdown deferred.
- **Full TEAM_COLLAB real wiring** (beyond routing) — U9 handles routing only; deeper orchestrator integration (debate rounds, review gates, divergence detection) is existing functionality that may need tuning but is not in scope for the quality loop.

### Outside this product's identity

- **Tool minimalism** (cut to Bash + apply_patch) — agentkit takes the skill/expert-team direction; the 25 tools are a business need.
- **New Task Runtime concept** — existing plan_exec foundation suffices; no new concept introduced.

### Deferred to follow-up work

- **DIRECT_CHAT evolution wiring** — explicitly a non-goal (KTD-10); if simple-task learning becomes valuable in the future, it would require fabricating TaskMessage/TaskResult.
- **Success-path reflector upgrade** — current `RuleBasedReflector` only generates suggestions on failure; success sampling (RQ2) activates fully when a success-capable reflector is implemented.
- **Loop detector semantic upgrade** — current hash-based detector raised to threshold 3 for keep-working mode; semantic detection (detect truly identical retries vs similar-but-different) is a future upgrade.
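
For context on the last item, a hash-based detector of the kind described can be sketched as follows. This is an assumption-laden illustration, not the project's detector: `HashLoopDetector`, `record`, and the choice to key on a `(tool, args)` pair are hypothetical. The threshold of 3 and the `reset()` between attempts are taken from this plan.

```python
import hashlib
from collections import Counter


class HashLoopDetector:
    """Flags a loop once an identical action has been seen `threshold` times."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._seen = Counter()

    def record(self, tool: str, args: str) -> bool:
        # Exact-match hashing: two similar-but-different retries get different
        # keys, which is the limitation a semantic detector would address.
        key = hashlib.sha256(f"{tool}\x00{args}".encode("utf-8")).hexdigest()
        self._seen[key] += 1
        return self._seen[key] >= self.threshold

    def reset(self) -> None:
        """Clear history, for example between reflexion retry attempts."""
        self._seen.clear()
```

Because the key is an exact hash, changing a single character in `args` resets the count for that action, which is why this approach cannot tell a genuinely new fix from a trivially varied retry.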

---

## System-Wide Impact

- **Streaming path behavior change (U2):** All WebSocket-routed tasks now trigger evolution hooks. Fire-and-forget with backpressure ensures no latency regression. Evolution errors are isolated — they cannot fail the stream.
- **Verification default change (U3):** PLAN_EXEC/TEAM_COLLAB now verify by default. Tasks that previously "succeeded" without verification may now fail verification. This is the intended behavior change — surfaces real failures that were hidden.
- **Step budget change (U4):** PLAN_EXEC/TEAM_COLLAB get phase quotas; DIRECT_CHAT/REACT keep `max_steps=10` total. Backward compatible — no `phase_budgets` means current behavior.
- **Evolution artifacts now persist cross-task (U6):** Without actor marking and workspace-scoped sharing, a poisoned pitfall from one workspace could degrade prompts in another. Trust boundary enforcement is load-bearing.
- **Reflexion retry changes loop behavior (U5):** "Keep working until done" expands the blast radius. Minimum sandbox (U3) is the security countermeasure. The loop detector threshold is raised to 3 to avoid false positives on retry.
- **Spec review adds friction to PLAN_EXEC (U8):** Every PLAN_EXEC now pauses for review. This is intentional (per R8) — catches bad plans before execution. Timeout → parked (not failed) respects long-task user availability.
- **TEAM_COLLAB no longer silently degrades (U9):** Users who relied on TEAM_COLLAB falling back to REACT will see explicit failures instead. This is the intended behavior — silent degradation was a bug.
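
The fire-and-forget behavior noted for U2 (bounded backlog, isolated errors, drain on shutdown) could be sketched as below. `HookDispatcher`, `dispatch`, `drain`, and the cap of 32 are hypothetical; the plan only states that hooks run fire-and-forget with backpressure and that evolution errors cannot fail the stream.

```python
import asyncio
import logging

logger = logging.getLogger("evolution_hooks")


class HookDispatcher:
    """Runs async hooks in the background with a cap on in-flight work."""

    def __init__(self, max_pending: int = 32) -> None:
        self._max_pending = max_pending
        self._tasks = set()  # strong references keep tasks alive until done
        self.dropped = 0

    def dispatch(self, hook, *args) -> bool:
        """Schedule hook(*args) without awaiting it; False if the cap is hit."""
        if len(self._tasks) >= self._max_pending:
            self.dropped += 1  # backpressure: shed load instead of queueing
            return False
        task = asyncio.create_task(self._run_isolated(hook, *args))
        self._tasks.add(task)
        task.add_done_callback(self._tasks.discard)
        return True

    async def _run_isolated(self, hook, *args) -> None:
        try:
            await hook(*args)
        except Exception:
            # Log and swallow so a hook failure never reaches the stream.
            logger.exception("evolution hook failed; stream unaffected")

    async def drain(self) -> None:
        """Wait for in-flight hooks, for example during shutdown."""
        if self._tasks:
            await asyncio.gather(*list(self._tasks))
```

Dropping work at the cap keeps the streaming path's latency bounded; whether dropped hooks should instead be queued or retried is a design choice this sketch does not settle.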

---

## Risks & Dependencies

- **R5 streaming hook bypass (OQ6) — HIGHEST RISK.** Without U2, R5/R6 are no-ops on the primary user-facing path. U2 is the load-bearing precondition. Mitigation: U2 ships first; parity test (REST vs WebSocket evolution log) is the regression guard.
- **R4 double-reflexion with Recovery layer.** Main-flow reflexion (U5) plus Recovery-layer reflexion (`_fallback_chain.py:118`) could reflect twice on the same failure. Mitigation: Recovery escalates to a human instead of reflecting again. Documented in KTD-2.
- **RV22 loop detector conflict with R10.** "Keep working" retries similar fixes, triggering loop detection (threshold=2). Mitigation: threshold raised to 3 for keep-working mode (U4); `reset()` between attempts (KTD-9).
- **R1 str_replace exact-match fragility.** Without `view` command, agents emit `str_replace` with stale anchors and fail. Mitigation: `view` command included in U1.
- **R8 spec review deadlock.** User leaves → task hangs. Mitigation: 30-min timeout → `parked` not `failed`; resume-on-return.
- **Evolution noise degrades prompts (RV14).** Low-quality pitfalls fed to optimizer regress prompts. Mitigation: confidence threshold + observe-only mode (U6, initially `observe_only=True`).
- **Evolution module runtime correctness unverified.** No prior learnings exist for evolution/reflexion/verification/spec_manager modules (coverage gap from learnings research). Mitigation: budget for first-principles verification; characterization tests before changes.
- **Streaming event whitelist silent failure.** New events not in `_VALID_TEAM_EVENT_TYPES` silently no-op. Mitigation: U8 explicitly adds events to whitelist; test asserts presence.
- **Async generator safety.** All new `async def` with `yield` must use `return; yield` pattern before early return (project rule). Applies to U2 (hook helper), U5 (reflexion streaming), U8 (spec review suspension).

Dependencies:
- evolution module (Reflector/PitfallDetector/PromptOptimizer/ABTester) already implemented — U6/U7 do integration only
- ReflexionEngine already implemented — U5 extends ReActEngine, doesn't refactor ReflexionEngine
- VerificationLoop already implemented — U3 changes defaults and policy, not core logic
- SpecManager.confirm already implemented (REST) — U8 adds chat flow integration
- TeamOrchestrator already implemented — U9 is routing change, not orchestrator implementation
- Assumption: the step quota redesign doesn't break DIRECT_CHAT/REACT semantics (enforced by the opt-in `phase_budgets` parameter)
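
The async generator safety rule listed among the risks is easiest to see in a small example. The sketch below assumes the project rule refers to the standard Python idiom in which an unreachable `yield` after `return` keeps a function an async generator; the function names are hypothetical.

```python
import asyncio
import inspect


async def no_events():
    """Emits nothing, yet is still an async generator."""
    return
    yield  # never reached; its presence makes this an async generator function


async def broken_no_events():
    """No yield anywhere, so this is a plain coroutine function.

    Calling it returns a coroutine, and `async for` over that raises TypeError.
    """
    return


async def collect(stream):
    """Gather every item from an async iterator into a list."""
    return [item async for item in stream]


if __name__ == "__main__":
    print(inspect.isasyncgenfunction(no_events))         # True
    print(inspect.isasyncgenfunction(broken_no_events))  # False
    print(asyncio.run(collect(no_events())))             # []
```

The practical consequence is that callers can always write `async for event in stream()` without first checking whether a given code path happened to yield.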

---

## Acceptance Examples

- **AE1. Complex task verify-fail → reflexion retry.** Covers R2, R4, R10. Given: PLAN_EXEC task completes, verify runs pytest and fails. When: reflexion triggers, reflects on the error, and generates a fix. Then: the task retries within the reflect quota; if it still fails, it is marked `"gave_up_after_reflections"` and evolution is triggered.
- **AE2. Simple task skips reflexion.** Covers R4. Given: DIRECT_CHAT mode executes a simple task. When: the task completes. Then: no reflexion retry loop runs; the result is returned directly.
- **AE3. Task failure auto-triggers evolution.** Covers R5, R6. Given: complex task fails (verify fails, reflexion exhausted). When: task ends. Then: evolution auto-triggers, Reflector records failure, PitfallDetector detects patterns.
- **AE4. Spec review gate.** Covers R8. Given: PLAN_EXEC generates Spec. When: Spec first generated. Then: execution suspends, `spec_review_request` emitted; user confirms → execution resumes; user rejects → replan; timeout → `parked`.
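
AE1 and AE3 together describe the control flow of the loop, which can be sketched as below. This is a simplified, synchronous illustration rather than the engine's implementation: `run_quality_loop`, `LoopResult`, the callback signatures, the `"verified"` status, and the default quota of 2 are all hypothetical. Only the `"gave_up_after_reflections"` marker and the run → verify → reflect → retry → evolve ordering come from this plan.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopResult:
    status: str
    attempts: int
    reflections: list = field(default_factory=list)


def run_quality_loop(
    execute: Callable[[Optional[str]], str],
    verify: Callable[[str], tuple],
    reflect: Callable[[str], str],
    on_complete: Callable[[LoopResult], None],
    reflect_quota: int = 2,
) -> LoopResult:
    reflections = []
    feedback = None
    attempts = 0
    while True:
        attempts += 1
        output = execute(feedback)      # run (or re-run) the task
        ok, error = verify(output)      # e.g. run the test suite
        if ok:
            result = LoopResult("verified", attempts, reflections)
            break
        if len(reflections) >= reflect_quota:
            result = LoopResult("gave_up_after_reflections", attempts, reflections)
            break
        feedback = reflect(error)       # turn the failure into guidance
        reflections.append(feedback)
    on_complete(result)                 # evolution hook fires on every outcome
    return result
```

With a quota of 2, a task that never passes verification is executed three times (the initial attempt plus two reflection-driven retries) before it is marked as given up.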

---

## Sources / Research

- **Origin document:** `docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md` (R1-R12, RQ1-RQ4, OQ5-OQ6, RV1-RV22)
- **Repo research:** Confirmed all brainstorm findings with file:line references; mapped 12 requirements to integration points; identified 3 AGENTS.md contradictions; recommended 6-phase implementation order.
- **Institutional learnings (5 relevant docs in `docs/solutions/`):**
  - `integration-issues/streaming-event-contract-residuals.md` — `execute_stream` registration pattern (resolves OQ6), terminal-event symmetry (shapes R8), stable identifier convention
  - `logic-errors/long-horizon-reliability-code-review-fixes.md` — `reset()` between retry attempts (RV22 mitigation), checkpoint-reconstructable counters (KTD-7), cross-module format contracts
  - `runtime-errors/streaming-event-whitelist-and-accumulation.md` — `_VALID_TEAM_EVENT_TYPES` whitelist (R8 events), ReAct Streaming Contract (R4 streaming)
  - `architecture-patterns/bitable-companion-service-security-reliability-patterns.md` — SSRF hop-revalidation → symlink defense (KTD-6), IDOR 404-before-403 (R6 trust boundary), `asyncio.to_thread` (R1)
  - `security-issues/portal-platform-security-reliability-fixes.md` — backpressure cap + shutdown drain (KTD-4), per-namespace rejection (R6), trust-boundary validation
- **Coverage gap:** No prior learnings exist for evolution/reflexion/verification/spec_manager modules — budget for first-principles verification.
- **Agent-native planning assessment:** Confirmed agentkit is agent-native (Required applicability); classified domain actions (Now/Later/Never); identified `execute_stream` hook wiring as the single most load-bearing architectural issue; suggested 11 implementation units (refined to 9 in this plan); proposed 5 KTDs (expanded to 10 in this plan).
- **Industry benchmarks (from brainstorm):** Codex agent loop (single-thread ReAct + forced verify), Qoder Quest (Spec → Code → Verify loop + auto evolution), Trae SOLO Spec mode (confirmation gate).