title: feat: Complex task quality loop (verify → reflect → evolve)
type: feat
date: 2026-07-03
origin: docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md

Complex Task Quality Loop (verify → reflect → evolve)

Summary

Assemble agentkit's declared-but-disconnected verification, reflexion, and evolution mechanisms into a unified quality loop for complex tasks (PLAN_EXEC/TEAM_COLLAB). A task runs and is verified; if verification fails, reflexion reflects and retries; on completion, evolution triggers automatically (record pitfalls and optimize prompts). Foundational fixes: a structured file editing tool, verification defaults, step budget phases, a minimum sandbox, and a Spec review gate. The loop replaces the current "early stop on failure" behavior with "keep working until done, then learn from the outcome."


Problem Frame

agentkit fails on complex tasks because its quality mechanisms are declared but not connected:

  • verification_enabled defaults to False (src/agentkit/core/react.py:171) — VERIFICATION phase doesn't enforce tests
  • write_file listed in _DEFAULT_CORE_TOOLS (src/agentkit/core/react.py:156-162) but has no implementation class — LLM calls fail
  • reflexion only runs in _fallback_chain.py Recovery layer, not in the main execution flow
  • evolution only triggers manually via /api/v1/evolution/trigger — no auto-trigger after tasks
  • TEAM_COLLAB falls back to REACT (src/agentkit/server/routes/chat.py:1336) instead of running the real orchestrator
  • max_steps=10 hard cap with no "keep working until done" bias — tasks stop at the first verify failure
  • execute_stream() (src/agentkit/core/config_driven.py:686) bypasses on_task_complete/on_task_failed hooks — R5's auto-evolution would silently no-op on the WebSocket streaming path (the primary user-facing path)

The result is systemic failure: no retry mechanism, no self-evolution. Single-point fixes don't solve this — the independent parts must be assembled into a closed loop.

(See origin: docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md)


Requirements

Requirements are grouped by concern. Each carries its origin R-ID for traceability.

Foundations (all tasks benefit)

  • R1. Provide a structured file editing tool (str_replace_editor with create / str_replace / insert_at_line / view commands), replacing the broken write_file placeholder. All path parameters must resolve and prefix-check against workspace root, rejecting symlink escape; align with the existing 6-layer terminal security paradigm.
  • R2. verification_enabled defaults to True for PLAN_EXEC/TEAM_COLLAB; DIRECT_CHAT/REACT stay False (per RV2 — global True would force pytest/ruff on non-code REACT tasks like translation/research).
  • R3. VERIFICATION phase forces project tests (pytest/ruff) for coding tasks; non-coding tasks use Spec-declared verification commands (per RV8 — forcing pytest on non-Python projects causes false failures).

Closed loop (complex tasks)

  • R4. Reflexion upgraded from fallback-only to main-flow retry for PLAN_EXEC/TEAM_COLLAB: verify fails → reflect → retry. Implemented by extending ReActEngine's existing reinjection loop, not by driving PLAN_EXEC through ReflexionEngine (per RV4, RV15, RV20 — ReflexionEngine doesn't forward phase_policy, and reflexion-as-mode is conceptually distinct from reflexion-as-retry).
  • R5. Auto-trigger evolution on task completion (success or failure): Reflector records + PitfallDetector detects + PromptOptimizer optimizes. Quality gate: pitfall confidence threshold before ingestion; PromptOptimizer consumption gate (sample count ≥ min_examples and confidence meets the threshold); observe-only mode records without feeding the optimizer, to avoid noise-driven prompt degradation (per RV14).
  • R6. Evolution trigger thresholds: failure always runs; success runs at sample rate 0.1 (per RQ2). Integrity/auth: evolution artifacts (pitfalls, optimized prompts) carry actor marking (which agent/expert produced them); cross-workspace sharing defaults off, requires explicit opt-in (per RV14 trust boundary).

Capability wiring

  • R7. TEAM_COLLAB does not fall back to REACT — surface failure to user instead of silent degradation. (REWOO/REFLEXION-as-mode deferred per RV10, RV20.)
  • R8. Spec review gate: first Spec generation emits spec_review_request event, suspends execution pending user confirmation (spec_review_reply). Confirmation → execute; rejection → replan; timeout → parked status (not failed) with resume-on-return (per RV16 — 5-min timeout is too short for long tasks).

Bias and budget

  • R10. "Keep working until done" bias for complex tasks: don't abandon on first verify failure, auto-enter reflexion retry within remaining step budget.
  • R11. Step budget split into phase quotas (think=7 / verify=2 / reflect=1 per RQ1), replacing single max_steps=10. Quotas are opt-in for PLAN_EXEC/TEAM_COLLAB; max_steps=10 preserved as total budget for backward compatibility (per RV5 — DIRECT_CHAT/REACT must keep current semantics).
  • R12. Pitfall retrieval/injection: at task planning, retrieve historical pitfalls by goal/skill similarity from the PitfallDetector store and inject them into prompt context (per RV7: the current system only records and never retrieves, so the "do not repeat pitfalls" goal is only half-served).

Key Technical Decisions

  • KTD-1. Verification canonical path is engine-internal at final-answer (src/agentkit/core/react.py:1303-1376), not RunTestsTool. RunTestsTool (src/agentkit/tools/builtin.py:16) remains for agent-initiated mid-task verification. The engine-internal path runs automatically at the final-answer gate. This avoids double-verify and keeps the agent's manual tool distinct from the engine's automatic gate.

  • KTD-2. Reflexion-as-retry is implemented by extending ReActEngine's reinjection loop, not by driving PLAN_EXEC through ReflexionEngine. ReflexionEngine (src/agentkit/core/reflexion.py:88-92) constructs a vanilla ReActEngine without forwarding phase_policy — refactoring it to drive PLAN_EXEC would be large and conceptually conflates reflexion-as-mode with reflexion-as-retry. Instead, extend the existing reinjection loop (which already holds phase_policy) to call a reflect step after max_reinjections exhausts. ReflexionEngine stays as the standalone engine for the deferred REFLEXION-as-mode.

  • KTD-3. Evolution triggering is a system lifecycle concern, not an agent capability. The fix is hook-wiring (connecting on_task_complete/on_task_failed to the streaming path), not exposing evolution as an agent-callable tool. Agents produce the work; the system evolves from the outcome.

  • KTD-4. execute_stream() must invoke on_task_complete/on_task_failed to maintain lifecycle parity with execute(). This is the single most load-bearing architectural fix — without it, R5/R6 are no-ops on the WebSocket streaming path (the primary user-facing path). Use fire-and-forget asyncio.create_task with backpressure cap (max_concurrent * 2) and shutdown drain per the portal-platform-security-reliability-fixes learning. Evolution errors must not fail the stream.

  • KTD-5. Spec review uses new spec_review_request/spec_review_reply events + parked Spec status. confirmation_request is not reused (per RQ4 — different timeout semantics, different lifecycle, portal.py has no confirmation wiring). Events must follow terminal-event symmetry (open milestone → close on every path: confirm/reject/timeout/cancel) with stable spec_review_id = f"{plan_id}:spec_review" per the streaming-event-contract-residuals learning. Default timeout 30 min, configurable; on timeout → parked not failed.

  • KTD-6. str_replace_editor symlink defense uses Path.resolve() + Path.relative_to(resolved_workspace_root), not str.startswith(). startswith admits path-prefix collisions (/workspace_root_evil/...). Pattern mirrors the SSRF hop-revalidation approach from the bitable-companion-service security learning. Filesystem ops wrapped in asyncio.to_thread to avoid blocking the event loop.

  • KTD-7. Phase-budget counters are checkpoint-reconstructable from restored plan phase statuses. On resume, think/verify/reflect spent counts derive from persisted phase state, not reset to zero (per long-horizon-reliability-code-review-fixes learning P2 #8/#11 — resume is full state reconstruction).

  • KTD-8. Reflexion-gave-up status is "gave_up_after_reflections", not "success". When max_reflections exhausts without verify pass, the status propagates to TaskResult and evolution's outcome field. Evolution's RuleBasedReflector treats this as failure for reflection purposes. Without this, evolution silently skips reflection on reflexion-gave-up tasks (per agent-native planning finding OQ-D).

  • KTD-9. ReActEngine.reset() called between reflexion retry attempts. Without reset, the loop detector (_loop_threshold=2) misfires on retry because _loop_window state leaks across attempts (per long-horizon-reliability-code-review-fixes learning P2 #9, RV22).

  • KTD-10. DIRECT_CHAT does not trigger evolution (explicit non-goal). DIRECT_CHAT bypasses BaseAgent entirely (src/agentkit/server/routes/chat.py:1245 calls llm_gateway.chat() directly). Wiring evolution would require fabricating TaskMessage/TaskResult. Simple Q&A tasks have low evolution value. Documented as non-goal, not a gap to fix later.
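
The containment check described in KTD-6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool's actual code: `resolve_in_workspace` and `PathEscapeError` are hypothetical names, and the `asyncio.to_thread` wrapping is omitted for brevity.

```python
from pathlib import Path


class PathEscapeError(ValueError):
    """Hypothetical error type for paths that leave the workspace."""


def resolve_in_workspace(workspace_root: Path, user_path: str) -> Path:
    """Return the resolved path if it stays inside workspace_root."""
    root = workspace_root.resolve()
    candidate = Path(user_path)
    if candidate.is_absolute():
        raise PathEscapeError(f"absolute path not allowed: {user_path}")
    # resolve() collapses ".." segments and follows symlinks that exist.
    resolved = (root / candidate).resolve()
    try:
        # relative_to() compares whole path components, so a sibling
        # directory such as "<root>_evil" is not treated as inside root.
        resolved.relative_to(root)
    except ValueError:
        raise PathEscapeError(f"path escapes workspace: {user_path}") from None
    return resolved
```

The design point is that `relative_to` compares path components while `str.startswith` compares characters: `"/tmp/ws_evil/x".startswith("/tmp/ws")` is true even though the path is outside `/tmp/ws`.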


High-Level Technical Design

Quality loop flow

flowchart TB
    A[Complex task starts] --> B[Execute: think/act/observe]
    B --> C{Verify at final-answer}
    C -->|Pass| D[Mark completed]
    C -->|Fail| E{Reflect quota remaining?}
    E -->|Yes| F[Call reset then reflect]
    F --> G[Generate improvement]
    G --> B
    E -->|No| H[Mark gave_up_after_reflections]
    D --> I[Trigger evolution: fire-and-forget]
    H --> I
    I --> J{Failure?}
    J -->|Yes| K[Reflector + PitfallDetector: 100%]
    J -->|No| L[Sample at 0.1 rate]
    K --> M[Quality gate: confidence threshold]
    L --> M
    M --> N{Observe-only?}
    N -->|Yes| O[Record only]
    N -->|No| P[PromptOptimizer: consume gated]
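
The "Call reset then reflect" node exists because of KTD-9. The sketch below shows the failure mode with a stand-in detector. The plan only names `_loop_window` and `_loop_threshold`; the `LoopDetector` class and its matching rule (count identical actions in a sliding window) are assumptions made for illustration.

```python
from collections import deque


class LoopDetector:
    """Stand-in loop detector; the real matching rule may differ."""

    def __init__(self, threshold: int = 2, window: int = 8) -> None:
        self._loop_threshold = threshold
        self._loop_window: deque = deque(maxlen=window)

    def observe(self, action: str) -> bool:
        """Record an action; return True once it repeats `threshold` times."""
        self._loop_window.append(action)
        return self._loop_window.count(action) >= self._loop_threshold

    def reset(self) -> None:
        """Clear state so a retry attempt starts with an empty window."""
        self._loop_window.clear()


# Without reset, the first action of the retry looks like a loop.
leaky = LoopDetector(threshold=2)
leaky.observe("run_tests")               # attempt 1
misfire = leaky.observe("run_tests")     # attempt 2 starts, state leaked

# With reset between attempts, the same action is not flagged.
clean = LoopDetector(threshold=2)
clean.observe("run_tests")
clean.reset()
no_misfire = clean.observe("run_tests")
```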

execute_stream hook wiring

sequenceDiagram
    participant WS as WebSocket (chat.py)
    participant CDA as ConfigDrivenAgent
    participant ES as execute_stream()
    participant Hooks as on_task_complete/failed
    participant EVO as evolve_after_task()

    WS->>CDA: execute_stream(task)
    CDA->>ES: yield ReActEvent
    ES-->>WS: token / final_answer (streaming)
    Note over ES: finally block (new)
    ES->>Hooks: invoke with TaskResult
    Hooks->>EVO: asyncio.create_task (fire-and-forget)
    Note over EVO: backpressure cap + shutdown drain
    EVO-->>EVO: Reflector → PitfallDetector → PromptOptimizer
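
The "backpressure cap + shutdown drain" note can be sketched like this. It is a hedged illustration only: `EvolutionTaskPool` is a hypothetical helper, and the cap of `max_concurrent * 2` is taken from KTD-4.

```python
import asyncio
import logging
from typing import Awaitable, Callable

logger = logging.getLogger("evolution")


class EvolutionTaskPool:
    """Hypothetical bounded fire-and-forget scheduler with a drain step."""

    def __init__(self, max_concurrent: int) -> None:
        self._cap = max_concurrent * 2  # backpressure cap from KTD-4
        self._pending: set = set()
        self.dropped = 0  # incremented whenever a job is rejected

    def submit(self, job: Callable[[], Awaitable[None]]) -> bool:
        """Schedule job without awaiting it; return False if it was dropped."""
        if len(self._pending) >= self._cap:
            self.dropped += 1
            logger.warning("evolution backlog full, dropping job")
            return False
        task = asyncio.create_task(self._run(job))
        self._pending.add(task)  # hold a strong reference until the task ends
        task.add_done_callback(self._pending.discard)
        return True

    async def _run(self, job: Callable[[], Awaitable[None]]) -> None:
        try:
            await job()
        except Exception:
            # Evolution errors are logged and never reach the stream.
            logger.exception("evolution job failed")

    async def drain(self) -> None:
        """Wait for all pending jobs, for use during app shutdown."""
        if self._pending:
            await asyncio.gather(*self._pending, return_exceptions=True)
```

Holding each task in `_pending` matters beyond counting: the event loop keeps only weak references to tasks, so an unreferenced fire-and-forget task can be garbage-collected before it finishes.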

Spec review gate lifecycle

stateDiagram-v2
    [*] --> PLANNING
    PLANNING --> SPEC_GENERATED
    SPEC_GENERATED --> SPEC_REVIEW_PENDING: emit spec_review_request
    SPEC_REVIEW_PENDING --> EXECUTING: spec_review_reply (confirm)
    SPEC_REVIEW_PENDING --> PLANNING: spec_review_reply (reject)
    SPEC_REVIEW_PENDING --> PARKED: timeout (30min)
    PARKED --> EXECUTING: resume on return
    EXECUTING --> [*]

Implementation Units

U1. str_replace_editor tool + remove write_file bug

  • Goal: Provide a working structured file editing tool with workspace-root security; remove the broken write_file placeholder.
  • Requirements: R1
  • Dependencies: None
  • Files:
    • Create: src/agentkit/tools/str_replace_editor.py (new tool class)
    • Modify: src/agentkit/core/react.py (remove write_file from _DEFAULT_CORE_TOOLS at line 156-162, add str_replace_editor)
    • Modify: src/agentkit/tools/__init__.py (register new tool)
    • Test: tests/unit/test_str_replace_editor.py
  • Approach: Implement str_replace_editor with four commands: create (write new file), str_replace (exact-match anchor replace), insert_at_line (insert at line number), view (read with line numbers — needed because str_replace requires exact anchors). Path validation: Path.resolve() + Path.relative_to(resolved_workspace_root); reject .., absolute paths, symlink escape. Wrap filesystem ops in asyncio.to_thread. Mirror ReadFileTool (src/agentkit/tools/file_read.py:26) for Tool base class structure and error handling. Align with 6-layer terminal security paradigm (src/agentkit/server/auth/terminal_security.py).
  • Patterns to follow: src/agentkit/tools/file_read.py:26 (ReadFileTool — Tool base class, execute schema, _error() helper), src/agentkit/server/auth/terminal_security.py (layered security, _SHELL_OPERATORS pattern)
  • Test scenarios:
    • Happy path: create writes new file; view returns content with line numbers; str_replace replaces exact anchor; insert_at_line inserts at specified line
    • Edge cases: Empty file create; str_replace with multiple matches (error: anchor not unique); insert_at_line at line 0 / beyond EOF; view with line range
    • Error and failure paths: Path traversal ../../etc/passwd rejected; symlink escape rejected; absolute path /etc/passwd rejected; str_replace anchor not found (error); file outside workspace root rejected
    • Integration: Tool registered in _DEFAULT_CORE_TOOLS appears in LLM system prompt; LLM can call it and receive structured result
  • Verification: write_file no longer in _DEFAULT_CORE_TOOLS; str_replace_editor appears in tool descriptions; path traversal tests pass; ruff check clean.
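
The text-level behavior of the `str_replace`, `insert_at_line`, and `view` commands might look like the sketch below. The function signatures, error messages, and line-number format are assumptions; file I/O, path validation, and the Tool base class are left out.

```python
from typing import Optional


def str_replace(text: str, old: str, new: str) -> str:
    """Replace exactly one occurrence of old; fail if absent or ambiguous."""
    count = text.count(old) if old else 0
    if count == 0:
        raise ValueError("anchor not found")
    if count > 1:
        raise ValueError(f"anchor not unique: {count} matches")
    return text.replace(old, new, 1)


def insert_at_line(text: str, line_no: int, new_text: str) -> str:
    """Insert new_text after line line_no (0 inserts before the first line)."""
    lines = text.splitlines(keepends=True)
    if not 0 <= line_no <= len(lines):
        raise ValueError(f"line {line_no} out of range 0..{len(lines)}")
    if not new_text.endswith("\n"):
        new_text += "\n"
    if line_no == len(lines) and lines and not lines[-1].endswith("\n"):
        lines[-1] += "\n"  # appending after an unterminated last line
    lines.insert(line_no, new_text)
    return "".join(lines)


def view(text: str, start: int = 1, end: Optional[int] = None) -> str:
    """Return lines start..end (1-based, inclusive) prefixed with numbers."""
    lines = text.splitlines()
    last = len(lines) if end is None else min(end, len(lines))
    first = max(start, 1)
    return "\n".join(f"{n:>4}  {lines[n - 1]}" for n in range(first, last + 1))
```

Rejecting a non-unique anchor, rather than replacing the first match, is what makes `view` necessary: the caller needs to see enough surrounding text to pick an anchor that occurs once.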

U2. execute_stream hook wiring (OQ6 fix)

  • Goal: Wire on_task_complete/on_task_failed hooks into the streaming path so R5/R6 evolution triggers on WebSocket-routed tasks.
  • Requirements: R5 (precondition), R6 (precondition)
  • Dependencies: None
  • Files:
    • Modify: src/agentkit/core/config_driven.py (execute_stream() at line 686 — add hook invocation in finally block)
    • Modify: src/agentkit/core/plan_exec_engine.py (execute_stream() at line 175 — add hook invocation)
    • Modify: src/agentkit/core/reflexion.py (execute_stream() at line 330 — add hook invocation)
    • Modify: src/agentkit/server/routes/portal.py (verify all 3 execute_stream call sites at lines 580, 701, 1001 propagate hooks)
    • Test: tests/unit/test_execute_stream_hooks.py
  • Approach: Extract a _trigger_evolution_hooks(task, result) helper from the sync handle_task() path (lines 473, 493). Call it from execute_stream()'s finally block. Use asyncio.create_task() (fire-and-forget) to avoid blocking the streaming return. Apply backpressure: cap pending evolution tasks at max_concurrent * 2, drop + log + increment counter on exceed. Drain pending tasks on app shutdown via asyncio.gather(*tasks, return_exceptions=True). Evolution errors are caught and logged — they must not fail the stream. Follow the CancellationToken registration pattern (register in try, pop in finally) per the streaming-event-contract-residuals learning.
  • Patterns to follow: src/agentkit/core/config_driven.py:473,493 (sync hook invocation), src/agentkit/core/config_driven.py:686 (CancellationToken try/finally pattern), portal-platform-security-reliability-fixes learning (backpressure cap + shutdown drain)
  • Test scenarios:
    • Happy path: execute_stream completion fires on_task_complete with correct TaskResult; execute_stream failure fires on_task_failed
    • Edge cases: Stream cancelled mid-flight — hooks still fire with cancelled status; evolution task error does not propagate to stream; backpressure cap reached — drop + log + counter increment
    • Integration: Same task via REST execute() and WebSocket execute_stream() produces equivalent evolution log entries (parity test); all 3 portal.py call sites propagate hooks
  • Verification: Evolution fires after execute_stream completes on both success and failure paths; P95 streaming latency increases by less than 50 ms (evolution is fire-and-forget); shutdown drains pending evolution tasks.
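
The lifecycle-parity idea in U2 can be sketched with an async generator whose `finally` block always reports an outcome. The `TaskResult` stand-in, the `on_done` callback, and the status strings are assumptions for illustration; in the plan the callback would schedule evolution as a background task rather than run inline.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Callable, List


@dataclass
class TaskResult:
    """Hypothetical stand-in for agentkit's TaskResult."""

    status: str
    output: str = ""


async def execute_stream(
    steps: List[str], on_done: Callable[[TaskResult], None]
) -> AsyncIterator[str]:
    """Yield chunks and report exactly one TaskResult on every exit path."""
    # Default covers the case where the consumer closes the stream early.
    result = TaskResult(status="cancelled")
    chunks: List[str] = []
    try:
        for step in steps:
            await asyncio.sleep(0)  # placeholder for real async work
            if step == "!fail":
                raise RuntimeError("step failed")
            chunks.append(step)
            yield step
        result = TaskResult(status="success", output="".join(chunks))
    except Exception as exc:
        result = TaskResult(status="failed", output=str(exc))
        raise
    finally:
        # Runs on success, failure, and early close (aclose / cancellation).
        on_done(result)
```

An early close arrives as `GeneratorExit`, which is not an `Exception` subclass, so the `except` branch is skipped and the default "cancelled" result is reported.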

U3. Verification defaults + forced pytest/ruff + minimum sandbox

  • Goal: Enable verification by default for complex tasks; force pytest/ruff for coding tasks; establish minimum sandbox as security prerequisite.
  • Requirements: R2, R3, RV3 (sandbox prerequisite)
  • Dependencies: U1 (str_replace_editor provides safe editing within sandbox)
  • Files:
    • Modify: src/agentkit/core/react.py (thread verification_enabled parameter through PLAN_EXEC/TEAM_COLLAB construction, default True for those modes)
    • Modify: src/agentkit/core/phase.py (default_policy() at line 139 — VERIFICATION phase forces pytest/ruff for coding tasks)
    • Modify: src/agentkit/core/plan_exec_engine.py (pass verification_enabled=True when constructing ReActEngine for PLAN_EXEC)
    • Modify: src/agentkit/experts/orchestrator.py (pass verification_enabled=True for TEAM_COLLAB)
    • Create: src/agentkit/core/sandbox.py (minimum sandbox enforcement: workspace-write, no network)
    • Test: tests/unit/test_verification_defaults.py, tests/unit/test_sandbox.py
  • Approach: R2: verification_enabled defaults True only for PLAN_EXEC/TEAM_COLLAB; DIRECT_CHAT/REACT stay False (per RV2). Thread the parameter through PlanExecEngine and TeamOrchestrator construction, not as a global default change. R3: In default_policy() VERIFICATION phase, add coding-task detection (check for pyproject.toml or .py files in workspace) — force pytest -x -q and ruff check for coding tasks; non-coding tasks use Spec-declared verification commands. RV3: Create sandbox.py with workspace-root enforcement (reuse U1's path validation) and network blocking (disable httpx/requests/socket for tool calls during VERIFICATION). Sandbox is the minimum layer; full tiering (read-only/workspace-write/danger) deferred.
  • Patterns to follow: src/agentkit/core/phase.py:139 (default_policy — PhasePolicy construction), src/agentkit/tools/advance_phase.py:20 (forced-transition pattern for VERIFICATION→DELIVERY)
  • Test scenarios:
    • Happy path: PLAN_EXEC task with pyproject.toml runs pytest+ruff in VERIFICATION; TEAM_COLLAB task verifies by default; non-coding task uses Spec-declared command
    • Edge cases: Workspace with no pyproject.toml — skip pytest, use Spec command; empty workspace — verification passes (no tests to run); ruff finds issues — reinject as verify failure
    • Error and failure paths: pytest fails — reinject error per max_reinjections; sandbox blocks network call — structured error returned to LLM; path traversal attempt in verification command — rejected
    • Integration: Sandbox enforcement applies to all tool calls during VERIFICATION phase; coding-task detection correctly identifies Python vs non-Python workspaces
  • Verification: PLAN_EXEC/TEAM_COLLAB verify by default; DIRECT_CHAT/REACT do not verify; coding tasks force pytest/ruff; non-coding tasks use Spec commands; sandbox blocks network during VERIFICATION.
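
The coding-task detection in U3 could be sketched as below. The function names are hypothetical, and the recursive `.py` search is one possible reading of "check for pyproject.toml or .py files in workspace".

```python
from pathlib import Path
from typing import List


def is_python_workspace(root: Path) -> bool:
    """Treat the workspace as a coding task if it looks like a Python project."""
    if (root / "pyproject.toml").is_file():
        return True
    # Stop at the first .py file instead of listing the whole tree.
    return next(root.rglob("*.py"), None) is not None


def verification_commands(root: Path, spec_commands: List[str]) -> List[str]:
    """Force pytest/ruff for Python workspaces, else use Spec-declared commands."""
    if is_python_workspace(root):
        return ["pytest -x -q", "ruff check"]
    return list(spec_commands)
```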

U4. Step budget phases + keep working bias

  • Goal: Split max_steps into phase quotas (think/verify/reflect); add "keep working until done" bias for complex tasks.
  • Requirements: R11, R10
  • Dependencies: U3 (verify quota needs verification defaults)
  • Files:
    • Modify: src/agentkit/core/react.py (__init__ at line 167 — add phase_budgets parameter; _execute_loop() at line 561 — enforce per-phase quotas; loop detector at line 220-221 — raise threshold or exempt reflexion retries)
    • Modify: src/agentkit/core/phase.py (PhasePolicy at line 59 — add step_budget field)
    • Modify: src/agentkit/core/plan_exec_engine.py (pass phase_budgets={"think": 7, "verify": 2, "reflect": 1} for PLAN_EXEC)
    • Test: tests/unit/test_step_budget.py
  • Approach: R11: Add phase_budgets: dict[str, int] | None = None to ReActEngine. When set, enforce per-phase quotas: think exhausted → force verify; verify exhausted → return best result; reflect exhausted → no more reflection. When None, behavior is the same as today (max_steps=10 total budget). Quotas are opt-in for PLAN_EXEC/TEAM_COLLAB. Budget counters are checkpoint-reconstructable: on resume, spent counts are derived from restored plan phase statuses (KTD-7). R10: "Keep working until done" is implemented via the reflect quota: a verify failure does not abandon the task; it enters a reflexion retry within the remaining reflect quota. Loop detector threshold raised from 2 to 3 for keep-working mode (per RV22: threshold=2 false-positives on retry). ReActEngine.reset() is called between retry attempts (KTD-9).
  • Patterns to follow: src/agentkit/core/phase.py:59 (PhasePolicy.auto_advance_after_steps — existing per-phase step limit pattern), src/agentkit/core/react.py:220-221 (loop detector — _loop_window, _loop_threshold)
  • Test scenarios:
    • Happy path: PLAN_EXEC with phase_budgets={"think":7,"verify":2,"reflect":1} — think stops at 7, verify runs, reflect runs at most 1; without phase_budgets — behavior unchanged (max_steps=10)
    • Edge cases: Think quota exhausted mid-tool-call — finish current step, then force verify; reflect quota 0 — no reflection, return best result; resume after checkpoint — budget counters reconstructed from phase statuses
    • Error and failure paths: Loop detector threshold 3 — 2 similar retries don't abort, 3 do; reset() between reflexion attempts — _loop_window cleared
    • Integration: Phase budgets enforced in _execute_loop(); checkpoint save/restore preserves budget state; DIRECT_CHAT/REACT unaffected (no phase_budgets set)
  • Verification: Phase quotas enforced; backward compatibility (no phase_budgets = current behavior); loop detector doesn't false-positive on reflexion retry; budget state survives checkpoint/resume.
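
A minimal sketch of the quota rules in U4, including reconstruction after a checkpoint resume (KTD-7). `PhaseBudget`, `next_phase`, and the shape of the persisted phase history (a flat list of finished phase names) are all assumptions; the actual engine state is richer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class PhaseBudget:
    """Hypothetical per-phase quota tracker (not agentkit's actual class)."""

    quotas: Dict[str, int]
    spent: Dict[str, int] = field(default_factory=dict)

    @classmethod
    def restore(cls, quotas: Dict[str, int], finished: List[str]) -> "PhaseBudget":
        """Rebuild spent counts from persisted phase history, not from zero."""
        budget = cls(dict(quotas))
        for phase in finished:
            budget.spent[phase] = budget.spent.get(phase, 0) + 1
        return budget

    def remaining(self, phase: str) -> int:
        return max(self.quotas.get(phase, 0) - self.spent.get(phase, 0), 0)

    def consume(self, phase: str) -> bool:
        """Spend one step of the phase quota; False if none is left."""
        if self.remaining(phase) == 0:
            return False
        self.spent[phase] = self.spent.get(phase, 0) + 1
        return True


def next_phase(budget: PhaseBudget, last_verify_passed: Optional[bool]) -> str:
    """Pick the next phase; last_verify_passed is None before any verify."""
    if last_verify_passed is True:
        return "done"
    if last_verify_passed is False:
        # Keep-working bias: reflect and retry while both quotas allow it.
        if budget.remaining("reflect") > 0 and budget.remaining("verify") > 0:
            return "reflect"
        return "return_best"
    if budget.remaining("think") > 0:
        return "think"
    return "verify" if budget.remaining("verify") > 0 else "return_best"
```

With `{"think": 7, "verify": 2, "reflect": 1}` the quotas sum to 10, which matches the existing `max_steps=10` total.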

U5. Reflexion in main flow

  • Goal: Upgrade reflexion from fallback-only to main-flow retry: verify fails → reflect → retry.
  • Requirements: R4
  • Dependencies: U3 (verification), U4 (reflect quota)
  • Files:
    • Modify: src/agentkit/core/react.py (reinjection loop at lines 1303-1376 — after max_reinjections exhausts, call reflect step before returning final)
    • Modify: src/agentkit/core/config_driven.py (parameterize max_reflections=2 at lines 835, 1047 — currently hardcoded 3; make configurable)
    • Test: tests/unit/test_reflexion_main_flow.py
  • Approach: Extend the existing reinjection loop (src/agentkit/core/react.py:1303-1376): when verify fails and max_reinjections is exhausted, if reflect quota remains, call reset() (KTD-9), generate reflection text (mirror ReflexionEngine._reflect() at src/agentkit/core/reflexion.py:639), inject the reflection into context, and retry the loop. Parameterize max_reflections (RQ3: 2 for main path, 1 for Recovery layer; currently hardcoded 3 at config_driven.py:835,1047). When max_reflections exhausts without a verify pass, return status "gave_up_after_reflections" (KTD-8: not "success", so evolution treats it as failure). ReflexionEngine stays as the standalone engine for REFLEXION-as-mode (deferred); the Recovery layer escalates to a human rather than reflecting again (avoids double-reflexion).
  • Patterns to follow: src/agentkit/core/react.py:1303-1376 (existing reinjection loop — extend, don't replace), src/agentkit/core/reflexion.py:639 (reflect step — mirror the LLM call shape), src/agentkit/server/_fallback_chain.py:118 (Recovery max_retries=1 — keep distinct from main path)
  • Test scenarios:
    • Happy path: Covers AE1 — verify fails → reflect → retry within reflect quota; retry passes verify → mark completed
    • Edge cases: max_reflections=2 — 2 retry attempts, then "gave_up_after_reflections"; reset() between attempts clears loop window; reflect quota 0 — no retry, return best result
    • Error and failure paths: Reflect LLM call fails — skip reflection, retry with existing context; all retries fail — status "gave_up_after_reflections" propagates to TaskResult and evolution
    • Integration: DIRECT_CHAT/REACT unaffected (no reflect quota); Recovery layer (_fallback_chain.py) still uses max_reflections=1 — no double-reflexion; evolution's RuleBasedReflector treats "gave_up_after_reflections" as failure
  • Verification: Verify-fail → reflect → retry fires; max_reflections=2 configurable; "gave_up_after_reflections" status propagates; no double-reflexion with Recovery layer; DIRECT_CHAT unaffected.
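
The control flow of U5 can be sketched as a small synchronous loop. The real engine is asynchronous and works on LLM context; here `attempt`, `reflect`, and `reset` are hypothetical callables, and only the retry and status logic is shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """Outcome of one execute-and-verify pass (hypothetical shape)."""

    passed: bool
    detail: str = ""


def run_with_reflection(
    attempt: Callable[[List[str]], Attempt],
    reflect: Callable[[str], str],
    reset: Callable[[], None],
    max_reflections: int = 2,
) -> Tuple[str, List[str]]:
    """Retry after verify failure, reflecting before each retry."""
    reflections: List[str] = []
    outcome = attempt(reflections)
    for _ in range(max_reflections):
        if outcome.passed:
            break
        reset()  # clear loop-detector state between attempts (KTD-9)
        try:
            reflections.append(reflect(outcome.detail))
        except Exception:
            pass  # reflection failed: retry with the existing context
        outcome = attempt(reflections)
    # KTD-8: exhausting reflections is reported distinctly, never as success.
    status = "completed" if outcome.passed else "gave_up_after_reflections"
    return status, reflections
```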

U6. Auto evolution trigger + quality gate

  • Goal: Auto-trigger evolution on task completion with quality gates and actor marking.
  • Requirements: R5, R6
  • Dependencies: U2 (execute_stream hooks), U5 (quality signal from reflexion)
  • Files:
    • Modify: src/agentkit/evolution/lifecycle.py (evolve_after_task() at line 131 — add success sample rate gate, quality threshold, actor marking)
    • Modify: src/agentkit/evolution/pitfall_detector.py (add confidence threshold before ingestion)
    • Create: src/agentkit/evolution/config.py (EvolutionConfig with success_sample_rate: float = 0.1, min_confidence: float = 0.5, observe_only: bool = True)
    • Modify: src/agentkit/evolution/prompt_optimizer.py (consumption gate: sample count ≥ min_examples and confidence meets the threshold)
    • Test: tests/unit/test_evolution_auto_trigger.py
  • Approach: R5: EvolutionConfig.success_sample_rate=0.1 gates success-path evolution at evolve_after_task() entry using random.random() < rate (mirror alignment.py:115 audit_sample_rate pattern). Failure path always runs (100%). Quality gate: pitfall confidence threshold before ingestion (min_confidence=0.5; low-confidence pitfalls are discarded or marked observe-only); PromptOptimizer consumption gate (sample count ≥ min_examples=3 and confidence meets the threshold); observe-only mode (observe_only=True initially; records without feeding the optimizer, to avoid noise-driven prompt degradation per RV14). R6: Actor marking on all evolution artifacts (pitfalls, optimized prompts): which agent/expert produced them. Cross-workspace sharing defaults off; same-workspace sharing defaults on; cross-workspace requires explicit opt-in. Trust boundary: evolution products are agent-produced and must be validated before entering the shared store (they are not trusted merely because an agent produced them). Known limitation (per RQ2): the default RuleBasedReflector only generates suggestions on outcome=='failure', so the success sampling path may always exit early under the default reflector; success sampling activates when the reflector is upgraded or a success-path learning signal is available.
  • Patterns to follow: src/agentkit/evolution/lifecycle.py:131 (evolve_after_task — extend, don't replace), src/agentkit/evolution/pitfall_detector.py:103 (check_pitfalls — Jaccard similarity pattern), portal-platform-security-reliability-fixes learning (per-namespace rejection, backpressure, trust-boundary validation)
  • Test scenarios:
    • Happy path: Covers AE3 — task fails → evolution fires (100%) → Reflector records → PitfallDetector detects; task succeeds → evolution fires at 0.1 rate
    • Edge cases: Observe-only mode — pitfalls recorded but not fed to optimizer; backpressure cap reached — evolution task dropped + logged; low-confidence pitfall — discarded or marked observe-only
    • Error and failure paths: Evolution task error — caught, logged, does not fail the stream; PromptOptimizer sample count < 3 — skip optimization
    • Integration: Evolution fires via U2's execute_stream hooks; actor marking present on all artifacts; cross-workspace sharing rejected without opt-in; "gave_up_after_reflections" status triggers failure-path evolution
  • Verification: Failure tasks always trigger evolution; success tasks trigger at 0.1 rate; observe-only mode records without mutating prompts; actor marking present; cross-workspace sharing gated.
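
The gates in U6 can be sketched as three small predicates. The three `EvolutionConfig` fields and their defaults come from the plan; placing `min_examples` on the same config, reusing `min_confidence` as the optimizer's confidence threshold, and the function names are assumptions.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvolutionConfig:
    success_sample_rate: float = 0.1
    min_confidence: float = 0.5
    observe_only: bool = True
    min_examples: int = 3  # assumption: the plan does not say where this lives


# KTD-8: a task that gave up after reflections counts as a failure.
FAILURE_OUTCOMES = {"failure", "gave_up_after_reflections"}


def should_evolve(
    outcome: str, cfg: EvolutionConfig, rng: Callable[[], float] = random.random
) -> bool:
    """Failures always evolve; successes are sampled at the configured rate."""
    if outcome in FAILURE_OUTCOMES:
        return True
    return rng() < cfg.success_sample_rate


def accept_pitfall(confidence: float, cfg: EvolutionConfig) -> bool:
    """Ingestion gate: drop pitfalls below the confidence threshold."""
    return confidence >= cfg.min_confidence


def may_optimize(sample_count: int, confidence: float, cfg: EvolutionConfig) -> bool:
    """Consumption gate: observe-only mode never feeds the optimizer."""
    if cfg.observe_only:
        return False
    return sample_count >= cfg.min_examples and confidence >= cfg.min_confidence
```

Passing `rng` as a parameter keeps the sampling decision testable without patching the `random` module.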

U7. Pitfall retrieval/injection

  • Goal: Retrieve historical pitfalls by goal/skill similarity at task planning and inject into prompt context.
  • Requirements: R12
  • Dependencies: U6 (evolution store with pitfalls)
  • Files:
    • Modify: src/agentkit/evolution/pitfall_detector.py (check_pitfalls() at line 103 — extend to accept goal text, use semantic similarity not just task_type filter)
    • Modify: src/agentkit/core/react.py (system prompt construction — inject pitfall warnings section)
    • Modify: src/agentkit/core/plan_exec_engine.py (at planning phase, call pitfall retrieval and inject into Spec context)
    • Test: tests/unit/test_pitfall_injection.py
  • Approach: Extend PitfallDetector.check_pitfalls() to accept goal text and use experience_store.search with semantic similarity (not just task_type Jaccard filter). Wire experience_store to agent runtime as app-state singleton (KTD per OQ-E — instantiated at startup, shared across tasks). At PLAN_EXEC planning phase, retrieve top-K pitfalls (K=3) by goal/skill similarity, inject as "Historical pitfalls to avoid" section in system prompt. Gate by WarningLevel.HIGH only (avoid noise). Pitfall injection appears in agent's first LLM call. PitfallDetector currently only used in evolution_dashboard.py:549 (read-only) — this is the first runtime integration.
  • Patterns to follow: src/agentkit/evolution/pitfall_detector.py:103 (check_pitfalls — extend signature, don't break existing callers), src/agentkit/memory/semantic.py (semantic retrieval pattern if applicable)
  • Test scenarios:
    • Happy path: Task with similar goal to past failure → top-3 pitfalls injected into system prompt → pitfalls appear in agent's first LLM call
    • Edge cases: No pitfalls in store → empty section, no injection; all pitfalls low severity → none injected (gate by HIGH); pitfall store has 100+ entries → only top-3 by similarity retrieved (no N+1)
    • Error and failure paths: experience_store unavailable → skip injection, log warning; similarity search times out → skip injection, continue task
    • Integration: PitfallDetector app-state singleton accessible from PLAN_EXEC planning; existing evolution_dashboard.py caller still works (backward compatible)
  • Verification: Pitfalls injected at planning phase appear in system prompt; similarity retrieval works on goal text; HIGH-severity gate filters noise; existing dashboard caller unaffected.
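
The retrieval-and-injection step in U7 might look like the sketch below. The plan calls for semantic similarity through experience_store; word-level Jaccard similarity is swapped in here only to keep the example self-contained. The `Pitfall` shape and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Pitfall:
    """Hypothetical pitfall record; level mirrors WarningLevel names."""

    description: str
    level: str


def jaccard(a: str, b: str) -> float:
    """Word-set overlap, a stand-in for the planned semantic similarity."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa or not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def top_pitfalls(goal: str, store: List[Pitfall], k: int = 3) -> List[Pitfall]:
    """Return up to k HIGH-level pitfalls most similar to the goal."""
    scored = [(jaccard(goal, p.description), p) for p in store if p.level == "HIGH"]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for score, p in scored[:k] if score > 0]


def render_section(pitfalls: List[Pitfall]) -> str:
    """Build the prompt section; empty string means nothing is injected."""
    if not pitfalls:
        return ""
    lines = ["Historical pitfalls to avoid:"]
    lines += [f"- {p.description}" for p in pitfalls]
    return "\n".join(lines)
```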

U8. Spec review gate

  • Goal: Pause PLAN_EXEC after first Spec generation for user review; resume on confirmation, replan on rejection.
  • Requirements: R8
  • Dependencies: U5 (reflexion retry for post-review execution)
  • Files:
    • Modify: src/agentkit/core/plan_exec_engine.py (at line 269-277 — after Spec generation, emit spec_review_request, suspend on pending future)
    • Modify: src/agentkit/core/spec_manager.py (add parked status, resume() method)
    • Modify: src/agentkit/server/routes/chat.py (add spec_review_request/spec_review_reply to _VALID_TEAM_EVENT_TYPES at line 144; add handler for spec_review_reply)
    • Modify: src/agentkit/server/routes/portal.py (add event forwarding for spec review events)
    • Test: tests/unit/test_spec_review_gate.py
  • Approach: At plan_exec_engine.py:269-277 (currently generates Spec and immediately executes), insert: emit spec_review_request event (carrying spec_id, goal, steps, spec_review_id = f"{plan_id}:spec_review"), suspend on pending asyncio.Future. On spec_review_reply (confirm/reject/timeout): confirm → resume execution; reject → replan (call GoalPlanner again with rejection feedback); timeout (30 min default, configurable) → set Spec status parked (not failed), allow resume-on-return. Add spec_review_request/spec_review_reply to _VALID_TEAM_EVENT_TYPES (per streaming-event-whitelist learning — without this, events silently no-op with 200 response). Follow terminal-event symmetry (open milestone → close on every path). Mirror CancellationToken pattern (register pending future, pop in finally). RQ4 confirmed: new events, not reuse confirmation_request (different timeout semantics, different lifecycle, portal.py has no confirmation wiring).
  • Patterns to follow: src/agentkit/core/config_driven.py:686 (CancellationToken try/finally — register/pop pattern), src/agentkit/server/routes/chat.py:144 (_VALID_TEAM_EVENT_TYPES — add new events), src/agentkit/server/routes/chat.py:1365-1377 (confirmation pattern — reference, not reuse), streaming-event-contract-residuals learning (terminal-event symmetry, stable identifier)
  • Test scenarios:
    • Happy path: Covers AE4 — PLAN_EXEC generates Spec → spec_review_request emitted → execution suspends → user confirms → spec_review_reply → execution resumes
    • Edge cases: User rejects → replan with feedback → new Spec generated → review again; timeout (30min) → Spec status parked (not failed) → resume on return; stream cancelled during review → future cancelled, no deadlock
    • Error and failure paths: spec_review_reply with invalid spec_review_id → error response; future resolution error → execution fails gracefully; event not in whitelist → test asserts it IS in whitelist (silent failure prevention)
    • Integration: Events forwarded by portal.py; frontend receives spec_review_request and can render review UI; parked Spec survives page reload
  • Verification: Spec review round-trip works (request → suspend → reply → resume); rejection triggers replan; timeout → parked not failed; events in whitelist (no silent no-op).

U9. TEAM_COLLAB no fall-back to REACT

  • Goal: TEAM_COLLAB surfaces failure to user instead of silently falling back to REACT.
  • Requirements: R7
  • Dependencies: None (routing change only)
  • Files:
    • Modify: src/agentkit/server/routes/chat.py (at line 1336-1344 — change TEAM_COLLAB branch to reject fall-back, surface failure)
    • Modify: AGENTS.md (update to reflect actual behavior — remove the "抛 not yet supported" ("throws not yet supported") claim, document TEAM_COLLAB routing)
    • Test: tests/unit/test_team_collab_routing.py
  • Approach: At chat.py:1336-1344 (currently falls back to REACT with warning for TEAM_COLLAB), change the TEAM_COLLAB branch to: route to TeamOrchestrator+SharedWorkspace (real wiring), or if orchestrator unavailable, surface failure to user (not silent fall-back). Update AGENTS.md to remove the stale "抛 not yet supported" ("throws not yet supported") claim for REWOO/REFLEXION/TEAM_COLLAB — document that TEAM_COLLAB routes to TeamOrchestrator, REWOO/REFLEXION-as-mode are deferred (not "unsupported"). This is a routing change, not full TEAM_COLLAB implementation — the orchestrator already exists (src/agentkit/experts/orchestrator.py:45).
  • Patterns to follow: src/agentkit/server/routes/chat.py:758-808 (PLAN_EXEC routing — mutual exclusivity with fallback chain, KTD5 pattern)
  • Test scenarios:
    • Happy path: @team prefix → routes to TeamOrchestrator (not REACT fall-back); TeamOrchestrator executes phases
    • Edge cases: TeamOrchestrator unavailable → error surfaced to user (not silent REACT); team template not found → error with template list
    • Error and failure paths: All phases fail → failure surfaced to user, with no route-level fall-back to REACT (the orchestrator's existing _fallback_to_single_agent is internal to the orchestrator and remains acceptable)
    • Integration: AGENTS.md updated; REWOO/REFLEXION-as-mode still fall back (deferred, not in scope)
  • Verification: TEAM_COLLAB routes to TeamOrchestrator; no silent REACT fall-back; AGENTS.md reflects actual behavior.
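
The routing rule in U9 (route to the orchestrator, otherwise fail loudly) might look like the sketch below. Everything here is illustrative: `TeamCollabUnavailableError`, `route_team_collab`, and the `orchestrator.run(...)` call are hypothetical stand-ins, since the plan does not show the TeamOrchestrator interface.

```python
from typing import Any, Optional


class TeamCollabUnavailableError(RuntimeError):
    """Hypothetical error surfaced to the user instead of a REACT fall-back."""


async def route_team_collab(
    task: Any,
    orchestrator: Optional[Any],
    templates: dict[str, Any],
    template_name: str,
) -> Any:
    """Route a TEAM_COLLAB task, raising instead of silently degrading."""
    if orchestrator is None:
        raise TeamCollabUnavailableError("TEAM_COLLAB orchestrator is unavailable")
    if template_name not in templates:
        # Include the known templates so the user can correct the request.
        available = ", ".join(sorted(templates)) or "(none)"
        raise TeamCollabUnavailableError(
            f"team template {template_name!r} not found; available: {available}"
        )
    # Assumed entry point; the real orchestrator method may differ.
    return await orchestrator.run(task, templates[template_name])
```

The point of the sketch is that there is no `else` branch that reaches the REACT engine: every non-orchestrator outcome is an explicit error the chat route can surface.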

Scope Boundaries

Deferred for later

  • Full sandbox tiering (read-only / workspace-write / danger) — P2 priority; only minimum sandbox (workspace-write, no network) pulled into scope as R3/R10 prerequisite (per RV3).
  • REWOO/REFLEXION-as-mode (as independent execution modes) — deferred per RV10 (no target service for REWOO, conceptually distinct from reflexion-as-retry per RV20); R7 narrowed to TEAM_COLLAB only.
  • R9 coding_harness (Worker-Verifier adversarial harness) — deferred per RV11 (R3+R4 already satisfy the goal), RV12 (4-stage pipeline to single-stage PLAN_EXEC phase mapping undefined), RV13 (no independent success criteria). Trust boundary: coding_harness executing untrusted code requires sandbox — depends on full sandbox tiering.
  • Model autonomous compaction — existing threshold approach works.
  • Three-tier nested loop (submission / handler / turn) — cost exceeds benefit.
  • Spec output as human-readable markdown — current YAML Spec + review gate works; conversion to markdown deferred.
  • Full TEAM_COLLAB real wiring (beyond routing) — U9 handles routing only; deeper orchestrator integration (debate rounds, review gates, divergence detection) is existing functionality that may need tuning but is not in scope for the quality loop.

Outside this product's identity

  • Tool minimalism (cut to Bash + apply_patch) — agentkit goes the skill/expert-team direction; 25 tools are business need.
  • New Task Runtime concept — existing plan_exec foundation suffices; no new concept introduced.

Deferred to Follow-Up Work

  • DIRECT_CHAT evolution wiring — explicitly non-goal (KTD-10); if future simple-task learning becomes valuable, would require fabricating TaskMessage/TaskResult.
  • Success-path reflector upgrade — current RuleBasedReflector only generates suggestions on failure; success sampling (RQ2) activates fully when a success-capable reflector is implemented.
  • Loop detector semantic upgrade — current hash-based detector raised to threshold 3 for keep-working mode; semantic detection (detect truly identical retries vs similar-but-different) is a future upgrade.

System-Wide Impact

  • Streaming path behavior change (U2): All WebSocket-routed tasks now trigger evolution hooks. Fire-and-forget with backpressure ensures no latency regression. Evolution errors are isolated — they cannot fail the stream.
  • Verification default change (U3): PLAN_EXEC/TEAM_COLLAB now verify by default. Tasks that previously "succeeded" without verification may now fail verification. This is the intended behavior change — surfaces real failures that were hidden.
  • Step budget change (U4): PLAN_EXEC/TEAM_COLLAB get phase quotas; DIRECT_CHAT/REACT keep max_steps=10 total. Backward compatible — no phase_budgets means current behavior.
  • Evolution artifacts now persist cross-task (U6): Without actor marking and workspace-scoped sharing, a poisoned pitfall from one workspace could degrade prompts in another. Trust boundary enforcement is load-bearing.
  • Reflexion retry changes loop behavior (U5): "Keep working until done" expands blast radius. Minimum sandbox (U3) is the security countermeasure. Loop detector threshold raised to 3 to avoid false-positive on retry.
  • Spec review adds friction to PLAN_EXEC (U8): Every PLAN_EXEC now pauses for review. This is intentional (per R8) — catches bad plans before execution. Timeout → parked (not failed) respects long-task user availability.
  • TEAM_COLLAB no longer silently degrades (U9): Users who relied on TEAM_COLLAB falling back to REACT will see explicit failures instead. This is the intended behavior — silent degradation was a bug.
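
To make the U2 claim concrete ("fire-and-forget with backpressure", errors isolated from the stream), here is one possible shape for the dispatch helper. It is a sketch under assumptions: the class name `EvolutionDispatcher`, the cap of 8, and the drop-when-full policy are hypothetical choices, not taken from the plan.

```python
import asyncio
import logging

logger = logging.getLogger(__name__)


class EvolutionDispatcher:
    """Hypothetical helper: bounded fire-and-forget for evolution hooks."""

    def __init__(self, max_pending: int = 8) -> None:
        self._max_pending = max_pending
        self._tasks: set[asyncio.Task] = set()
        self.dropped = 0

    def submit(self, hook, *args) -> bool:
        """Schedule hook(*args) without awaiting it; drop when at capacity."""
        if len(self._tasks) >= self._max_pending:
            self.dropped += 1  # backpressure: shed load instead of queueing
            return False
        task = asyncio.create_task(self._guarded(hook, *args))
        self._tasks.add(task)  # keep a strong reference until the task is done
        task.add_done_callback(self._tasks.discard)
        return True

    async def _guarded(self, hook, *args) -> None:
        try:
            await hook(*args)
        except Exception:
            # Isolation: a failing hook is logged and never reaches the stream.
            logger.exception("evolution hook failed")

    async def drain(self) -> None:
        """Wait for in-flight hooks, e.g. during shutdown."""
        if self._tasks:
            await asyncio.gather(*list(self._tasks))
```

The streaming path would call `submit(...)` after its terminal event and continue immediately, so the hook adds no latency; `drain()` corresponds to the shutdown-drain idea referenced for KTD-4.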

Risks & Dependencies

  • R5 streaming hook bypass (OQ6) — HIGHEST RISK. Without U2, R5/R6 are no-ops on the primary user-facing path. U2 is the load-bearing precondition. Mitigation: U2 ships first; parity test (REST vs WebSocket evolution log) is the regression guard.
  • R4 double-reflexion with Recovery layer. Main-flow reflexion (U5) + Recovery-layer reflexion (_fallback_chain.py:118) could double-reflect. Mitigation: Recovery escalates to a human instead of reflecting again. Documented in KTD-2.
  • RV22 loop detector conflict with R10. "Keep working" retries similar fixes, triggering loop detection (threshold=2). Mitigation: threshold raised to 3 for keep-working mode (U4); reset() between attempts (KTD-9).
  • R1 str_replace exact-match fragility. Without view command, agents emit str_replace with stale anchors and fail. Mitigation: view command included in U1.
  • R8 spec review deadlock. User leaves → task hangs. Mitigation: 30-min timeout → parked not failed; resume-on-return.
  • Evolution noise degrades prompts (RV14). Low-quality pitfalls fed to optimizer regress prompts. Mitigation: confidence threshold + observe-only mode (U6, initially observe_only=True).
  • Evolution module runtime correctness unverified. No prior learnings exist for evolution/reflexion/verification/spec_manager modules (coverage gap from learnings research). Mitigation: budget for first-principles verification; characterization tests before changes.
  • Streaming event whitelist silent failure. New events not in _VALID_TEAM_EVENT_TYPES silently no-op. Mitigation: U8 explicitly adds events to whitelist; test asserts presence.
  • Async generator safety. All new async def with yield must use return; yield pattern before early return (project rule). Applies to U2 (hook helper), U5 (reflexion streaming), U8 (spec review suspension).
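
The RV22 mitigation (threshold raised to 3, reset() between attempts) is easier to reason about with a toy detector in hand. The sketch below is not the project's loop detector: `HashLoopDetector` and its `observe` method are hypothetical, and hashing the tool name plus arguments is only one plausible reading of "hash-based".

```python
import hashlib
import json
from collections import Counter


class HashLoopDetector:
    """Hypothetical hash-based detector: flags an action repeated `threshold` times."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._counts: Counter = Counter()

    def observe(self, tool: str, args: dict) -> bool:
        """Record one tool call; return True once it has repeated threshold times."""
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        self._counts[key] += 1
        return self._counts[key] >= self.threshold

    def reset(self) -> None:
        """Clear history so each reflexion attempt starts with a clean slate."""
        self._counts.clear()
```

With threshold 3, two identical calls inside one attempt pass, and calling `reset()` before the next attempt keeps a legitimate retry of the same command from being counted against the previous attempt.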

Dependencies:

  • evolution module (Reflector/PitfallDetector/PromptOptimizer/ABTester) already implemented — U6/U7 do integration only
  • ReflexionEngine already implemented — U5 extends ReActEngine, doesn't refactor ReflexionEngine
  • VerificationLoop already implemented — U3 changes defaults and policy, not core logic
  • SpecManager.confirm already implemented (REST) — U8 adds chat flow integration
  • TeamOrchestrator already implemented — U9 is routing change, not orchestrator implementation
  • Assume: step quota redesign doesn't break DIRECT_CHAT/REACT semantics (enforced by opt-in phase_budgets parameter)
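
The backward-compatibility assumption above rests on `phase_budgets` being opt-in. A minimal sketch of that rule follows; the function `remaining_steps` and the phase names used in the usage note are hypothetical, while `max_steps=10` and the "no phase_budgets means current behavior" rule come from the plan.

```python
from typing import Optional


def remaining_steps(
    phase: str,
    used_in_phase: int,
    used_total: int,
    max_steps: int = 10,
    phase_budgets: Optional[dict[str, int]] = None,
) -> int:
    """Return how many steps may still run in the given phase."""
    if phase_budgets is None:
        # Legacy path (DIRECT_CHAT/REACT): one shared cap across the whole task.
        return max(max_steps - used_total, 0)
    # Opt-in path (PLAN_EXEC/TEAM_COLLAB): each phase draws from its own quota.
    return max(phase_budgets.get(phase, 0) - used_in_phase, 0)
```

For example, with `phase_budgets={"execute": 10, "verify": 3, "reflect": 4}` an exhausted execute phase still leaves the verify and reflect quotas intact, whereas a caller that passes nothing keeps the single 10-step cap.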

Acceptance Examples

  • AE1. Complex task verify-fail → reflexion retry. Covers R2, R4, R10. Given: PLAN_EXEC task completes, verify runs pytest and fails. When: reflexion triggers, reflects on error, generates fix. Then: retries within reflect quota; if still fails, marks "gave_up_after_reflections" and triggers evolution.
  • AE2. Simple task does not trigger reflexion. Covers R4. Given: DIRECT_CHAT mode executes simple task. When: task completes. Then: no reflexion retry loop, direct return.
  • AE3. Task failure auto-triggers evolution. Covers R5, R6. Given: complex task fails (verify fails, reflexion exhausted). When: task ends. Then: evolution auto-triggers, Reflector records failure, PitfallDetector detects patterns.
  • AE4. Spec review gate. Covers R8. Given: PLAN_EXEC generates Spec. When: Spec first generated. Then: execution suspends, spec_review_request emitted; user confirms → execution resumes; user rejects → replan; timeout → parked.
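
AE1 and AE3 together describe a small control loop: execute, verify, reflect within a quota, then hand the outcome to evolution. The sketch below shows that control flow only. `run_quality_loop`, `LoopOutcome`, and the callback signatures are hypothetical; the status string "gave_up_after_reflections" is the one named in AE1.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class LoopOutcome:
    status: str  # "verified" or "gave_up_after_reflections"
    attempts: int
    reflections: list[str] = field(default_factory=list)


def run_quality_loop(
    execute: Callable[[list[str]], str],
    verify: Callable[[str], tuple[bool, str]],
    reflect: Callable[[str], str],
    on_complete: Callable[[LoopOutcome], None],
    reflect_quota: int = 2,
) -> LoopOutcome:
    """Run execute -> verify, reflecting and retrying until the quota is spent."""
    reflections: list[str] = []
    attempts = 0
    while True:
        attempts += 1
        result = execute(reflections)  # earlier reflections inform the retry
        ok, error = verify(result)
        if ok:
            outcome = LoopOutcome("verified", attempts, reflections)
            break
        if len(reflections) >= reflect_quota:
            outcome = LoopOutcome("gave_up_after_reflections", attempts, reflections)
            break
        reflections.append(reflect(error))
    on_complete(outcome)  # evolution trigger fires on success and on give-up
    return outcome
```

With a quota of 2 the task runs at most three times (one initial attempt plus two reflexion retries), and `on_complete` is called exactly once on either exit, which is the property AE3 depends on.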

Sources / Research

  • Origin document: docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md (R1-R12, RQ1-RQ4, OQ5-OQ6, RV1-RV22)
  • Repo research: Confirmed all brainstorm findings with file:line references; mapped 12 requirements to integration points; identified 3 AGENTS.md contradictions; recommended 6-phase implementation order.
  • Institutional learnings (5 relevant docs in docs/solutions/):
    • integration-issues/streaming-event-contract-residuals.md — execute_stream registration pattern (resolves OQ6), terminal-event symmetry (shapes R8), stable identifier convention
    • logic-errors/long-horizon-reliability-code-review-fixes.md — reset() between retry attempts (RV22 mitigation), checkpoint-reconstructable counters (KTD-7), cross-module format contracts
    • runtime-errors/streaming-event-whitelist-and-accumulation.md — _VALID_TEAM_EVENT_TYPES whitelist (R8 events), ReAct Streaming Contract (R4 streaming)
    • architecture-patterns/bitable-companion-service-security-reliability-patterns.md — SSRF hop-revalidation → symlink defense (KTD-6), IDOR 404-before-403 (R6 trust boundary), asyncio.to_thread (R1)
    • security-issues/portal-platform-security-reliability-fixes.md — backpressure cap + shutdown drain (KTD-4), per-namespace rejection (R6), trust-boundary validation
  • Coverage gap: No prior learnings exist for evolution/reflexion/verification/spec_manager modules — budget for first-principles verification.
  • Agent-native planning assessment: Confirmed agentkit is agent-native (Required applicability); classified domain actions (Now/Later/Never); identified execute_stream hook wiring as single most load-bearing architectural issue; suggested 11 implementation units (refined to 9 in this plan); proposed 5 KTDs (expanded to 10 in this plan).
  • Industry benchmarks (from brainstorm): Codex agent loop (single-thread ReAct + forced verify), Qoder Quest (Spec → Code → Verify loop + auto evolution), Trae SOLO Spec mode (confirmation gate).