| title | type | date | origin |
|---|---|---|---|
| feat: Complex task quality loop (verify → reflect → evolve) | feat | 2026-07-03 | docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md |
Complex Task Quality Loop (verify → reflect → evolve)
Summary
Assemble agentkit's declared-but-disconnected verification, reflexion, and evolution mechanisms into a unified quality loop for complex tasks (PLAN_EXEC/TEAM_COLLAB). Tasks run → verify → if fail, reflexion reflect→retry → on completion, auto-trigger evolution (record pitfall + optimize prompt). Foundational fixes: structured file editing tool, verification defaults, step budget phases, minimum sandbox, Spec review gate. The loop replaces the current "early stop on failure" behavior with "keep working until done, then learn from the outcome."
Problem Frame
agentkit fails on complex tasks because its quality mechanisms are declared but not connected:
- `verification_enabled` defaults to `False` (`src/agentkit/core/react.py:171`), so the VERIFICATION phase doesn't enforce tests
- `write_file` is listed in `_DEFAULT_CORE_TOOLS` (`src/agentkit/core/react.py:156-162`) but has no implementation class, so LLM calls to it fail
- reflexion only runs in the `_fallback_chain.py` Recovery layer, not in the main execution flow
- evolution only triggers manually via `/api/v1/evolution/trigger`; there is no auto-trigger after tasks
- TEAM_COLLAB falls back to REACT (`src/agentkit/server/routes/chat.py:1336`) instead of running the real orchestrator
- `max_steps=10` is a hard cap with no "keep working until done" bias, so tasks stop at the first verify failure
- `execute_stream()` (`src/agentkit/core/config_driven.py:686`) bypasses the `on_task_complete`/`on_task_failed` hooks, so R5's auto-evolution would silently no-op on the WebSocket streaming path (the primary user-facing path)
The result is systemic failure: no retry mechanism, no self-evolution. Single-point fixes don't solve this — the independent parts must be assembled into a closed loop.
(See origin: docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md)
Requirements
Requirements are grouped by concern. Each carries its origin R-ID for traceability.
Foundations (all tasks benefit)
- R1. Provide a structured file editing tool (`str_replace_editor` with `create`/`str_replace`/`insert_at_line`/`view` commands), replacing the broken `write_file` placeholder. All path parameters must resolve and prefix-check against the workspace root, rejecting symlink escape; align with the existing 6-layer terminal security paradigm.
- R2. `verification_enabled` defaults to `True` for PLAN_EXEC/TEAM_COLLAB; DIRECT_CHAT/REACT stay `False` (per RV2: a global `True` would force pytest/ruff on non-code REACT tasks like translation/research).
- R3. The VERIFICATION phase forces project tests (pytest/ruff) for coding tasks; non-coding tasks use Spec-declared verification commands (per RV8: forcing pytest on non-Python projects causes false failures).
Closed loop (complex tasks)
- R4. Reflexion is upgraded from fallback-only to main-flow retry for PLAN_EXEC/TEAM_COLLAB: verify fails → reflect → retry. This is implemented by extending ReActEngine's existing reinjection loop, not by driving PLAN_EXEC through ReflexionEngine (per RV4, RV15, RV20: ReflexionEngine doesn't forward `phase_policy`, and reflexion-as-mode is conceptually distinct from reflexion-as-retry).
- R5. Auto-trigger evolution on task completion (success or failure): Reflector records + PitfallDetector detects + PromptOptimizer optimizes. Quality gate: pitfall confidence threshold before ingestion; PromptOptimizer consumption gate (sample count ≥ `min_examples` and confidence meets the threshold); observe-only mode records without feeding the optimizer, to avoid noise-driven prompt degradation (per RV14).
- R6. Evolution trigger thresholds: failure always runs; success runs at sample rate 0.1 (per RQ2). Integrity/auth: evolution artifacts (pitfalls, optimized prompts) carry actor marking (which agent/expert produced them); cross-workspace sharing defaults off and requires explicit opt-in (per RV14 trust boundary).
Capability wiring
- R7. TEAM_COLLAB does not fall back to REACT — surface failure to user instead of silent degradation. (REWOO/REFLEXION-as-mode deferred per RV10, RV20.)
- R8. Spec review gate: first Spec generation emits a `spec_review_request` event and suspends execution pending user confirmation (`spec_review_reply`). Confirmation → execute; rejection → replan; timeout → `parked` status (not `failed`) with resume-on-return (per RV16: a 5-min timeout is too short for long tasks).
Bias and budget
- R10. "Keep working until done" bias for complex tasks: don't abandon on first verify failure, auto-enter reflexion retry within remaining step budget.
- R11. Step budget split into phase quotas (think=7 / verify=2 / reflect=1 per RQ1), replacing the single `max_steps=10`. Quotas are opt-in for PLAN_EXEC/TEAM_COLLAB; `max_steps=10` is preserved as the total budget for backward compatibility (per RV5: DIRECT_CHAT/REACT must keep current semantics).
- R12. Pitfall retrieval/injection: at task planning, retrieve historical pitfalls by goal/skill similarity from the PitfallDetector store and inject them into the prompt context (per RV7: the current system only records and never retrieves, so the "don't repeat a pitfall" goal is only half-served).
Key Technical Decisions
- KTD-1. The canonical verification path is engine-internal at final-answer (`src/agentkit/core/react.py:1303-1376`), not `RunTestsTool`. `RunTestsTool` (`src/agentkit/tools/builtin.py:16`) remains for agent-initiated mid-task verification. The engine-internal path runs automatically at the final-answer gate. This avoids double-verify and keeps the agent's manual tool distinct from the engine's automatic gate.
- KTD-2. Reflexion-as-retry is implemented by extending ReActEngine's reinjection loop, not by driving PLAN_EXEC through ReflexionEngine. ReflexionEngine (`src/agentkit/core/reflexion.py:88-92`) constructs a vanilla ReActEngine without forwarding `phase_policy`; refactoring it to drive PLAN_EXEC would be large and would conflate reflexion-as-mode with reflexion-as-retry. Instead, extend the existing reinjection loop (which already holds `phase_policy`) to call a reflect step after `max_reinjections` exhausts. ReflexionEngine stays as the standalone engine for the deferred REFLEXION-as-mode.
- KTD-3. Evolution triggering is a system lifecycle concern, not an agent capability. The fix is hook-wiring (connecting `on_task_complete`/`on_task_failed` to the streaming path), not exposing evolution as an agent-callable tool. Agents produce the work; the system evolves from the outcome.
- KTD-4. `execute_stream()` must invoke `on_task_complete`/`on_task_failed` to maintain lifecycle parity with `execute()`. This is the single most load-bearing architectural fix: without it, R5/R6 are no-ops on the WebSocket streaming path (the primary user-facing path). Use fire-and-forget `asyncio.create_task` with a backpressure cap (`max_concurrent * 2`) and shutdown drain per the portal-platform-security-reliability-fixes learning. Evolution errors must not fail the stream.
- KTD-5. Spec review uses new `spec_review_request`/`spec_review_reply` events plus a `parked` Spec status. `confirmation_request` is not reused (per RQ4: different timeout semantics, different lifecycle, and portal.py has no confirmation wiring). Events must follow terminal-event symmetry (open milestone → close on every path: confirm/reject/timeout/cancel) with a stable `spec_review_id = f"{plan_id}:spec_review"` per the streaming-event-contract-residuals learning. Default timeout is 30 min, configurable; on timeout → `parked`, not `failed`.
- KTD-6. `str_replace_editor` symlink defense uses `Path.resolve()` + `Path.relative_to(resolved_workspace_root)`, not `str.startswith()`. `startswith` admits path-prefix collisions (`/workspace_root_evil/...`). The pattern mirrors the SSRF hop-revalidation approach from the bitable-companion-service security learning. Filesystem ops are wrapped in `asyncio.to_thread` to avoid blocking the event loop.
- KTD-7. Phase-budget counters are checkpoint-reconstructable from restored plan phase statuses. On resume, `think`/`verify`/`reflect` spent counts derive from persisted phase state and are not reset to zero (per long-horizon-reliability-code-review-fixes learning P2 #8/#11: resume is full state reconstruction).
- KTD-8. Reflexion-gave-up status is `"gave_up_after_reflections"`, not `"success"`. When `max_reflections` exhausts without a verify pass, the status propagates to `TaskResult` and evolution's `outcome` field. Evolution's `RuleBasedReflector` treats this as failure for reflection purposes. Without this, evolution silently skips reflection on reflexion-gave-up tasks (per agent-native planning finding OQ-D).
- KTD-9. `ReActEngine.reset()` is called between reflexion retry attempts. Without reset, the loop detector (`_loop_threshold=2`) misfires on retry because `_loop_window` state leaks across attempts (per long-horizon-reliability-code-review-fixes learning P2 #9, RV22).
- KTD-10. DIRECT_CHAT does not trigger evolution (explicit non-goal). DIRECT_CHAT bypasses BaseAgent entirely (`src/agentkit/server/routes/chat.py:1245` calls `llm_gateway.chat()` directly). Wiring evolution would require fabricating TaskMessage/TaskResult. Simple Q&A tasks have low evolution value. Documented as a non-goal, not a gap to fix later.
High-Level Technical Design
Quality loop flow
flowchart TB
A[Complex task starts] --> B[Execute: think/act/observe]
B --> C{Verify at final-answer}
C -->|Pass| D[Mark completed]
C -->|Fail| E{Reflect quota remaining?}
E -->|Yes| F[Call reset then reflect]
F --> G[Generate improvement]
G --> B
E -->|No| H[Mark gave_up_after_reflections]
D --> I[Trigger evolution: fire-and-forget]
H --> I
I --> J{Failure?}
J -->|Yes| K[Reflector + PitfallDetector: 100%]
J -->|No| L[Sample at 0.1 rate]
K --> M[Quality gate: confidence threshold]
L --> M
M --> N{Observe-only?}
N -->|Yes| O[Record only]
N -->|No| P[PromptOptimizer: consume gated]
execute_stream hook wiring
sequenceDiagram
participant WS as WebSocket (chat.py)
participant CDA as ConfigDrivenAgent
participant ES as execute_stream()
participant Hooks as on_task_complete/failed
participant EVO as evolve_after_task()
WS->>CDA: execute_stream(task)
CDA->>ES: yield ReActEvent
ES-->>WS: token / final_answer (streaming)
Note over ES: finally block (new)
ES->>Hooks: invoke with TaskResult
Hooks->>EVO: asyncio.create_task (fire-and-forget)
Note over EVO: backpressure cap + shutdown drain
EVO-->>EVO: Reflector → PitfallDetector → PromptOptimizer
Spec review gate lifecycle
stateDiagram-v2
[*] --> PLANNING
PLANNING --> SPEC_GENERATED
SPEC_GENERATED --> SPEC_REVIEW_PENDING: emit spec_review_request
SPEC_REVIEW_PENDING --> EXECUTING: spec_review_reply (confirm)
SPEC_REVIEW_PENDING --> PLANNING: spec_review_reply (reject)
SPEC_REVIEW_PENDING --> PARKED: timeout (30min)
PARKED --> EXECUTING: resume on return
EXECUTING --> [*]
Implementation Units
U1. str_replace_editor tool + remove write_file bug
- Goal: Provide a working structured file editing tool with workspace-root security; remove the broken `write_file` placeholder.
- Requirements: R1
- Dependencies: None
- Files:
  - Create: `src/agentkit/tools/str_replace_editor.py` (new tool class)
  - Modify: `src/agentkit/core/react.py` (remove `write_file` from `_DEFAULT_CORE_TOOLS` at lines 156-162, add `str_replace_editor`)
  - Modify: `src/agentkit/tools/__init__.py` (register new tool)
  - Test: `tests/unit/test_str_replace_editor.py`
- Approach: Implement `str_replace_editor` with four commands: `create` (write new file), `str_replace` (exact-match anchor replace), `insert_at_line` (insert at line number), `view` (read with line numbers; needed because `str_replace` requires exact anchors). Path validation: `Path.resolve()` + `Path.relative_to(resolved_workspace_root)`; reject `..`, absolute paths, and symlink escape. Wrap filesystem ops in `asyncio.to_thread`. Mirror `ReadFileTool` (`src/agentkit/tools/file_read.py:26`) for Tool base class structure and error handling. Align with the 6-layer terminal security paradigm (`src/agentkit/server/auth/terminal_security.py`).
- Patterns to follow: `src/agentkit/tools/file_read.py:26` (ReadFileTool: Tool base class, execute schema, `_error()` helper), `src/agentkit/server/auth/terminal_security.py` (layered security, `_SHELL_OPERATORS` pattern)
- Test scenarios:
  - Happy path: `create` writes new file; `view` returns content with line numbers; `str_replace` replaces exact anchor; `insert_at_line` inserts at specified line
  - Edge cases: Empty file create; `str_replace` with multiple matches (error: anchor not unique); `insert_at_line` at line 0 / beyond EOF; `view` with line range
  - Error and failure paths: Path traversal `../../etc/passwd` rejected; symlink escape rejected; absolute path `/etc/passwd` rejected; `str_replace` anchor not found (error); file outside workspace root rejected
  - Integration: Tool registered in `_DEFAULT_CORE_TOOLS` appears in LLM system prompt; LLM can call it and receive structured result
- Verification: `write_file` no longer in `_DEFAULT_CORE_TOOLS`; `str_replace_editor` appears in tool descriptions; path traversal tests pass; `ruff check` clean.
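The path validation described for U1 can be sketched roughly as follows. This is an illustration only, not the tool's actual code: `PathEscapeError`, `resolve_in_workspace`, and `read_text_in_workspace` are hypothetical names, and the real tool would return a structured error rather than raise.

```python
import asyncio
from pathlib import Path


class PathEscapeError(ValueError):
    """Hypothetical error: the requested path resolves outside the workspace."""


def resolve_in_workspace(workspace_root: Path, user_path: str) -> Path:
    """Resolve user_path under workspace_root, rejecting any escape."""
    root = workspace_root.resolve()
    candidate = Path(user_path)
    if candidate.is_absolute():
        raise PathEscapeError(f"absolute paths are not allowed: {user_path}")
    # resolve() collapses ".." and follows symlinks, so the containment check
    # below sees the real target rather than the path as spelled.
    resolved = (root / candidate).resolve()
    try:
        # relative_to() compares whole path components, so a sibling such as
        # "/ws_evil" is not mistaken for a child of "/ws" (the str.startswith trap).
        resolved.relative_to(root)
    except ValueError as exc:
        raise PathEscapeError(f"path escapes workspace: {user_path}") from exc
    return resolved


async def read_text_in_workspace(workspace_root: Path, user_path: str) -> str:
    """Validate the path, then read off the event loop via asyncio.to_thread."""
    path = resolve_in_workspace(workspace_root, user_path)
    return await asyncio.to_thread(path.read_text, encoding="utf-8")
```

Resolving before checking is what catches the symlink case: a link inside the workspace that points outside resolves to its outside target and fails `relative_to`.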
U2. execute_stream hook wiring (OQ6 fix)
- Goal: Wire `on_task_complete`/`on_task_failed` hooks into the streaming path so R5/R6 evolution triggers on WebSocket-routed tasks.
- Requirements: R5 (precondition), R6 (precondition)
- Dependencies: None
- Files:
  - Modify: `src/agentkit/core/config_driven.py` (`execute_stream()` at line 686: add hook invocation in `finally` block)
  - Modify: `src/agentkit/core/plan_exec_engine.py` (`execute_stream()` at line 175: add hook invocation)
  - Modify: `src/agentkit/core/reflexion.py` (`execute_stream()` at line 330: add hook invocation)
  - Modify: `src/agentkit/server/routes/portal.py` (verify all 3 `execute_stream` call sites at lines 580, 701, 1001 propagate hooks)
  - Test: `tests/unit/test_execute_stream_hooks.py`
- Approach: Extract a `_trigger_evolution_hooks(task, result)` helper from the sync `handle_task()` path (lines 473, 493). Call it from `execute_stream()`'s `finally` block. Use `asyncio.create_task()` (fire-and-forget) to avoid blocking the streaming return. Apply backpressure: cap pending evolution tasks at `max_concurrent * 2`; on exceed, drop, log, and increment a counter. Drain pending tasks on app shutdown via `asyncio.gather(*tasks, return_exceptions=True)`. Evolution errors are caught and logged; they must not fail the stream. Follow the `CancellationToken` registration pattern (register in `try`, pop in `finally`) per the streaming-event-contract-residuals learning.
- Patterns to follow: `src/agentkit/core/config_driven.py:473,493` (sync hook invocation), `src/agentkit/core/config_driven.py:686` (CancellationToken try/finally pattern), portal-platform-security-reliability-fixes learning (backpressure cap + shutdown drain)
- Test scenarios:
  - Happy path: `execute_stream` completion fires `on_task_complete` with correct TaskResult; `execute_stream` failure fires `on_task_failed`
  - Edge cases: Stream cancelled mid-flight: hooks still fire with cancelled status; evolution task error does not propagate to stream; backpressure cap reached: drop + log + counter increment
  - Integration: Same task via REST `execute()` and WebSocket `execute_stream()` produces equivalent evolution log entries (parity test); all 3 portal.py call sites propagate hooks
- Verification: Evolution fires after `execute_stream` completes on both success and failure paths; streaming latency P95 < +50ms (evolution is fire-and-forget); shutdown drains pending evolution tasks.
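A minimal sketch of the fire-and-forget scheduling with a backpressure cap and shutdown drain that U2 describes. `EvolutionDispatcher` and its methods are hypothetical names used for illustration; the plan itself only specifies the cap (`max_concurrent * 2`), the drop/log/counter behavior, and the `asyncio.gather(..., return_exceptions=True)` drain.

```python
from __future__ import annotations

import asyncio
import logging
from collections.abc import Awaitable, Callable

logger = logging.getLogger(__name__)


class EvolutionDispatcher:
    """Hypothetical helper: bounded fire-and-forget scheduling of hooks."""

    def __init__(self, max_concurrent: int) -> None:
        self._cap = max_concurrent * 2
        self._pending: set[asyncio.Task[None]] = set()
        self.dropped = 0  # counter incremented whenever the cap is hit

    def submit(self, hook: Callable[[], Awaitable[None]]) -> bool:
        """Schedule hook without awaiting it; return False if it was dropped."""
        if len(self._pending) >= self._cap:
            self.dropped += 1
            logger.warning("evolution backlog full; dropping hook")
            return False
        task = asyncio.create_task(self._run(hook))
        # Keep a strong reference until the task finishes, then forget it.
        self._pending.add(task)
        task.add_done_callback(self._pending.discard)
        return True

    async def _run(self, hook: Callable[[], Awaitable[None]]) -> None:
        try:
            await hook()
        except Exception:
            # Evolution errors are logged and must never reach the stream.
            logger.exception("evolution hook failed")

    async def drain(self) -> None:
        """Wait for all pending hooks; intended for app shutdown."""
        if self._pending:
            await asyncio.gather(*self._pending, return_exceptions=True)
```

In this sketch `execute_stream()` would call `submit(...)` from its `finally` block and return immediately, while the application's shutdown handler awaits `drain()`.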
U3. Verification defaults + forced pytest/ruff + minimum sandbox
- Goal: Enable verification by default for complex tasks; force pytest/ruff for coding tasks; establish minimum sandbox as security prerequisite.
- Requirements: R2, R3, RV3 (sandbox prerequisite)
- Dependencies: U1 (str_replace_editor provides safe editing within sandbox)
- Files:
  - Modify: `src/agentkit/core/react.py` (thread `verification_enabled` parameter through PLAN_EXEC/TEAM_COLLAB construction, default True for those modes)
  - Modify: `src/agentkit/core/phase.py` (`default_policy()` at line 139: VERIFICATION phase forces pytest/ruff for coding tasks)
  - Modify: `src/agentkit/core/plan_exec_engine.py` (pass `verification_enabled=True` when constructing ReActEngine for PLAN_EXEC)
  - Modify: `src/agentkit/experts/orchestrator.py` (pass `verification_enabled=True` for TEAM_COLLAB)
  - Create: `src/agentkit/core/sandbox.py` (minimum sandbox enforcement: workspace-write, no network)
  - Test: `tests/unit/test_verification_defaults.py`, `tests/unit/test_sandbox.py`
- Approach: R2: `verification_enabled` defaults True only for PLAN_EXEC/TEAM_COLLAB; DIRECT_CHAT/REACT stay False (per RV2). Thread the parameter through `PlanExecEngine` and `TeamOrchestrator` construction, not as a global default change. R3: In the `default_policy()` VERIFICATION phase, add coding-task detection (check for `pyproject.toml` or `.py` files in the workspace); force `pytest -x -q` and `ruff check` for coding tasks; non-coding tasks use Spec-declared verification commands. RV3: Create `sandbox.py` with workspace-root enforcement (reuse U1's path validation) and network blocking (disable `httpx`/`requests`/`socket` for tool calls during VERIFICATION). Sandbox is the minimum layer; full tiering (read-only/workspace-write/danger) is deferred.
- Patterns to follow: `src/agentkit/core/phase.py:139` (`default_policy`: PhasePolicy construction), `src/agentkit/tools/advance_phase.py:20` (forced-transition pattern for VERIFICATION→DELIVERY)
- Test scenarios:
  - Happy path: PLAN_EXEC task with `pyproject.toml` runs pytest+ruff in VERIFICATION; TEAM_COLLAB task verifies by default; non-coding task uses Spec-declared command
  - Edge cases: Workspace with no `pyproject.toml`: skip pytest, use Spec command; empty workspace: verification passes (no tests to run); ruff finds issues: reinject as verify failure
  - Error and failure paths: pytest fails: reinject error per `max_reinjections`; sandbox blocks network call: structured error returned to LLM; path traversal attempt in verification command: rejected
  - Integration: Sandbox enforcement applies to all tool calls during VERIFICATION phase; coding-task detection correctly identifies Python vs non-Python workspaces
- Verification: PLAN_EXEC/TEAM_COLLAB verify by default; DIRECT_CHAT/REACT do not verify; coding tasks force pytest/ruff; non-coding tasks use Spec commands; sandbox blocks network during VERIFICATION.
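The per-mode default and the coding-task detection from U3 might look like the sketch below. All function and constant names here are hypothetical; only the mode names, the detection signals (`pyproject.toml` or `.py` files), and the forced commands come from the plan.

```python
from pathlib import Path

# Modes that verify by default (R2); DIRECT_CHAT and REACT stay off.
VERIFYING_MODES = frozenset({"PLAN_EXEC", "TEAM_COLLAB"})

# Commands forced for coding tasks (R3).
PYTHON_CHECKS = (("pytest", "-x", "-q"), ("ruff", "check"))


def verification_enabled_for(mode: str) -> bool:
    """Per-mode default instead of a single global flag."""
    return mode in VERIFYING_MODES


def is_python_workspace(workspace: Path) -> bool:
    """Coding-task detection: a pyproject.toml or any .py file."""
    if (workspace / "pyproject.toml").is_file():
        return True
    return next(workspace.rglob("*.py"), None) is not None


def select_verification_commands(
    workspace: Path, spec_commands: list[list[str]]
) -> list[list[str]]:
    """Force pytest/ruff for Python workspaces, else use Spec-declared commands."""
    if is_python_workspace(workspace):
        return [list(cmd) for cmd in PYTHON_CHECKS]
    # An empty list means there is nothing to run, so verification passes.
    return [list(cmd) for cmd in spec_commands]
```

Returning argument lists rather than shell strings keeps the commands compatible with `subprocess.run` without `shell=True`, which fits the sandbox goal of not passing untrusted text through a shell.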
U4. Step budget phases + keep working bias
- Goal: Split `max_steps` into phase quotas (think/verify/reflect); add "keep working until done" bias for complex tasks.
- Requirements: R11, R10
- Dependencies: U3 (verify quota needs verification defaults)
- Files:
  - Modify: `src/agentkit/core/react.py` (`__init__` at line 167: add `phase_budgets` parameter; `_execute_loop()` at line 561: enforce per-phase quotas; loop detector at lines 220-221: raise threshold or exempt reflexion retries)
  - Modify: `src/agentkit/core/phase.py` (`PhasePolicy` at line 59: add `step_budget` field)
  - Modify: `src/agentkit/core/plan_exec_engine.py` (pass `phase_budgets={"think": 7, "verify": 2, "reflect": 1}` for PLAN_EXEC)
  - Test: `tests/unit/test_step_budget.py`
- Approach: R11: Add `phase_budgets: dict[str, int] | None = None` to ReActEngine. When set, enforce per-phase quotas: think exhausted → force verify; verify exhausted → return best result; reflect exhausted → no more reflection. When None, behavior is the same as today (`max_steps=10` total budget). Quotas are opt-in for PLAN_EXEC/TEAM_COLLAB. Budget counters are checkpoint-reconstructable: derive spent counts from restored plan phase statuses on resume (KTD-7). R10: "Keep working until done" is implemented via the reflect quota: a verify failure doesn't abandon the task, it enters reflexion retry within the remaining reflect quota. Loop detector threshold is raised from 2 to 3 for keep-working mode (per RV22: threshold=2 false-positives on retry). `ReActEngine.reset()` is called between retry attempts (KTD-9).
- Patterns to follow: `src/agentkit/core/phase.py:59` (`PhasePolicy.auto_advance_after_steps`: existing per-phase step limit pattern), `src/agentkit/core/react.py:220-221` (loop detector: `_loop_window`, `_loop_threshold`)
- Test scenarios:
  - Happy path: PLAN_EXEC with `phase_budgets={"think":7,"verify":2,"reflect":1}`: think stops at 7, verify runs, reflect runs at most 1; without `phase_budgets`: behavior unchanged (`max_steps=10`)
  - Edge cases: Think quota exhausted mid-tool-call: finish current step, then force verify; reflect quota 0: no reflection, return best result; resume after checkpoint: budget counters reconstructed from phase statuses
  - Error and failure paths: Loop detector threshold 3: 2 similar retries don't abort, 3 do; `reset()` between reflexion attempts: `_loop_window` cleared
  - Integration: Phase budgets enforced in `_execute_loop()`; checkpoint save/restore preserves budget state; DIRECT_CHAT/REACT unaffected (no `phase_budgets` set)
- Verification: Phase quotas enforced; backward compatibility (no `phase_budgets` = current behavior); loop detector doesn't false-positive on reflexion retry; budget state survives checkpoint/resume.
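One way to picture the phase quotas and their reconstruction on resume is the sketch below. `PhaseBudget` is a hypothetical class, and representing the persisted phase state as a flat list of already-executed phase names is an assumption made for the example; the plan only says spent counts derive from restored plan phase statuses.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class PhaseBudget:
    """Hypothetical per-phase step quota tracker (think/verify/reflect)."""

    quotas: dict[str, int]
    spent: Counter = field(default_factory=Counter)

    def remaining(self, phase: str) -> int:
        return self.quotas.get(phase, 0) - self.spent[phase]

    def try_spend(self, phase: str) -> bool:
        """Consume one step of the phase quota; False means it is exhausted."""
        if self.remaining(phase) <= 0:
            return False
        self.spent[phase] += 1
        return True

    @classmethod
    def from_checkpoint(
        cls, quotas: dict[str, int], executed_phases: list[str]
    ) -> "PhaseBudget":
        """Rebuild spent counts from persisted state instead of resetting to zero."""
        return cls(quotas=dict(quotas), spent=Counter(executed_phases))


budget = PhaseBudget({"think": 7, "verify": 2, "reflect": 1})
while budget.try_spend("think"):
    pass  # the engine would run one think/act/observe step here
# The think quota is now exhausted, so the loop would force verification next.
```

Because the counters are derived rather than stored separately, a resumed task cannot regain quota it already spent before the checkpoint.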
U5. Reflexion in main flow
- Goal: Upgrade reflexion from fallback-only to main-flow retry: verify fails → reflect → retry.
- Requirements: R4
- Dependencies: U3 (verification), U4 (reflect quota)
- Files:
  - Modify: `src/agentkit/core/react.py` (reinjection loop at lines 1303-1376: after `max_reinjections` exhausts, call reflect step before returning final)
  - Modify: `src/agentkit/core/config_driven.py` (parameterize `max_reflections=2` at lines 835, 1047; currently hardcoded 3; make configurable)
  - Test: `tests/unit/test_reflexion_main_flow.py`
- Approach: Extend the existing reinjection loop (`src/agentkit/core/react.py:1303-1376`): when verify fails and `max_reinjections` is exhausted, if reflect quota remains, call `reset()` (KTD-9), generate reflection text (mirror `ReflexionEngine._reflect()` at `src/agentkit/core/reflexion.py:639`), inject the reflection into context, and retry the loop. Parameterize `max_reflections` (RQ3: 2 for main path, 1 for Recovery layer; currently hardcoded 3 at `config_driven.py:835,1047`). When `max_reflections` exhausts without a verify pass, return status `"gave_up_after_reflections"` (KTD-8: not `"success"`, so evolution treats it as failure). ReflexionEngine stays as standalone for REFLEXION-as-mode (deferred); the Recovery layer escalates to a human rather than reflecting again (avoids double-reflexion).
- Patterns to follow: `src/agentkit/core/react.py:1303-1376` (existing reinjection loop: extend, don't replace), `src/agentkit/core/reflexion.py:639` (reflect step: mirror the LLM call shape), `src/agentkit/server/_fallback_chain.py:118` (Recovery `max_retries=1`: keep distinct from main path)
- Test scenarios:
  - Happy path: Covers AE1: verify fails → reflect → retry within reflect quota; retry passes verify → mark completed
  - Edge cases: `max_reflections=2`: 2 retry attempts, then `"gave_up_after_reflections"`; `reset()` between attempts clears loop window; reflect quota 0: no retry, return best result
  - Error and failure paths: Reflect LLM call fails: skip reflection, retry with existing context; all retries fail: status `"gave_up_after_reflections"` propagates to TaskResult and evolution
  - Integration: DIRECT_CHAT/REACT unaffected (no reflect quota); Recovery layer (`_fallback_chain.py`) still uses `max_reflections=1`, so no double-reflexion; evolution's `RuleBasedReflector` treats `"gave_up_after_reflections"` as failure
- Verification: Verify-fail → reflect → retry fires; `max_reflections=2` configurable; `"gave_up_after_reflections"` status propagates; no double-reflexion with Recovery layer; DIRECT_CHAT unaffected.
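The control flow of reflexion-as-retry can be sketched with plain callables standing in for the engine pieces. `run_with_reflection` and its four callback parameters are hypothetical; the real change lives inside ReActEngine's reinjection loop and is asynchronous.

```python
from typing import Callable


def run_with_reflection(
    attempt: Callable[[list[str]], str],
    verify: Callable[[str], bool],
    reflect: Callable[[str], str],
    reset: Callable[[], None],
    max_reflections: int = 2,
) -> tuple[str, str]:
    """Run once, then retry up to max_reflections times after failed verifies."""
    reflections: list[str] = []
    answer = attempt(reflections)
    for _ in range(max_reflections):
        if verify(answer):
            return "success", answer
        reset()  # clear loop-detector state so the retry is not flagged (KTD-9)
        try:
            reflections.append(reflect(answer))
        except Exception:
            # Reflect call failed: skip the reflection, retry with existing context.
            pass
        answer = attempt(reflections)
    if verify(answer):
        return "success", answer
    # Distinct status so evolution treats the outcome as a failure (KTD-8).
    return "gave_up_after_reflections", answer
```

With `max_reflections=2` this makes one initial attempt plus at most two retries, matching the edge case listed above.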
U6. Auto evolution trigger + quality gate
- Goal: Auto-trigger evolution on task completion with quality gates and actor marking.
- Requirements: R5, R6
- Dependencies: U2 (execute_stream hooks), U5 (quality signal from reflexion)
- Files:
  - Modify: `src/agentkit/evolution/lifecycle.py` (`evolve_after_task()` at line 131: add success sample rate gate, quality threshold, actor marking)
  - Modify: `src/agentkit/evolution/pitfall_detector.py` (add confidence threshold before ingestion)
  - Create: `src/agentkit/evolution/config.py` (`EvolutionConfig` with `success_sample_rate: float = 0.1`, `min_confidence: float = 0.5`, `observe_only: bool = True`)
  - Modify: `src/agentkit/evolution/prompt_optimizer.py` (consumption gate: sample count ≥ `min_examples` and confidence meets the threshold)
  - Test: `tests/unit/test_evolution_auto_trigger.py`
- Approach: R5: `EvolutionConfig.success_sample_rate=0.1` gates success-path evolution at `evolve_after_task()` entry using `random.random() < rate` (mirror the `alignment.py:115` `audit_sample_rate` pattern). The failure path always runs (100%). Quality gate: pitfall confidence threshold before ingestion (`min_confidence=0.5`; low-confidence pitfalls are discarded or marked observe-only); PromptOptimizer consumption gate (sample count ≥ `min_examples=3` and confidence meets the threshold); observe-only mode (`observe_only=True` initially: records without feeding the optimizer, to avoid noise-driven prompt degradation per RV14). R6: Actor marking on all evolution artifacts (pitfalls, optimized prompts): which agent/expert produced them. Cross-workspace sharing defaults off; same-workspace sharing defaults on; cross-workspace requires explicit opt-in. Trust boundary: evolution products are agent-produced and must be validated before entering the shared store (they are not trusted merely because an agent produced them). Known limitation (per RQ2): the default `RuleBasedReflector` only generates suggestions on `outcome=='failure'`, so the success sampling path may early-exit 100% of the time under the default reflector; success sampling activates when the reflector is upgraded or a success-path learning signal is available.
- Patterns to follow: `src/agentkit/evolution/lifecycle.py:131` (`evolve_after_task`: extend, don't replace), `src/agentkit/evolution/pitfall_detector.py:103` (`check_pitfalls`: Jaccard similarity pattern), portal-platform-security-reliability-fixes learning (per-namespace rejection, backpressure, trust-boundary validation)
- Test scenarios:
  - Happy path: Covers AE3: task fails → evolution fires (100%) → Reflector records → PitfallDetector detects; task succeeds → evolution fires at 0.1 rate
  - Edge cases: Observe-only mode: pitfalls recorded but not fed to optimizer; backpressure cap reached: evolution task dropped + logged; low-confidence pitfall: discarded or marked observe-only
  - Error and failure paths: Evolution task error: caught, logged, does not fail the stream; PromptOptimizer sample count < 3: skip optimization
  - Integration: Evolution fires via U2's `execute_stream` hooks; actor marking present on all artifacts; cross-workspace sharing rejected without opt-in; `"gave_up_after_reflections"` status triggers failure-path evolution
- Verification: Failure tasks always trigger evolution; success tasks trigger at 0.1 rate; observe-only mode records without mutating prompts; actor marking present; cross-workspace sharing gated.
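The three gates in U6 can be sketched as small pure functions. The field defaults follow the plan, but the function names are hypothetical, `min_examples` is placed on the config for convenience, and reusing `min_confidence` as the optimizer's confidence bar is an assumption (the plan does not state that threshold).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EvolutionConfig:
    success_sample_rate: float = 0.1
    min_confidence: float = 0.5
    min_examples: int = 3  # assumption: kept here rather than on the optimizer
    observe_only: bool = True


def should_evolve(outcome: str, config: EvolutionConfig, rng: random.Random) -> bool:
    """Anything other than success (failure, gave_up_after_reflections) always runs."""
    if outcome != "success":
        return True
    return rng.random() < config.success_sample_rate


def should_ingest_pitfall(confidence: float, config: EvolutionConfig) -> bool:
    """Quality gate applied before a pitfall enters the store."""
    return confidence >= config.min_confidence


def may_optimize_prompt(
    sample_count: int, confidence: float, config: EvolutionConfig
) -> bool:
    """Consumption gate: observe-only mode never feeds the optimizer."""
    if config.observe_only:
        return False
    return sample_count >= config.min_examples and confidence >= config.min_confidence
```

Passing the random source in explicitly keeps the sampling gate deterministic under test, for example by seeding `random.Random` or by setting the rate to 0.0 or 1.0.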
U7. Pitfall retrieval/injection
- Goal: Retrieve historical pitfalls by goal/skill similarity at task planning and inject into prompt context.
- Requirements: R12
- Dependencies: U6 (evolution store with pitfalls)
- Files:
  - Modify: `src/agentkit/evolution/pitfall_detector.py` (`check_pitfalls()` at line 103: extend to accept goal text, use semantic similarity not just the `task_type` filter)
  - Modify: `src/agentkit/core/react.py` (system prompt construction: inject pitfall warnings section)
  - Modify: `src/agentkit/core/plan_exec_engine.py` (at planning phase, call pitfall retrieval and inject into Spec context)
  - Test: `tests/unit/test_pitfall_injection.py`
- Approach: Extend `PitfallDetector.check_pitfalls()` to accept goal text and use `experience_store.search` with semantic similarity (not just the `task_type` Jaccard filter). Wire `experience_store` to the agent runtime as an app-state singleton (KTD per OQ-E: instantiated at startup, shared across tasks). At the PLAN_EXEC planning phase, retrieve top-K pitfalls (K=3) by goal/skill similarity and inject them as a "Historical pitfalls to avoid" section in the system prompt. Gate by `WarningLevel.HIGH` only (avoid noise). Pitfall injection appears in the agent's first LLM call. PitfallDetector is currently only used in `evolution_dashboard.py:549` (read-only); this is the first runtime integration.
- Patterns to follow: `src/agentkit/evolution/pitfall_detector.py:103` (`check_pitfalls`: extend signature, don't break existing callers), `src/agentkit/memory/semantic.py` (semantic retrieval pattern if applicable)
- Test scenarios:
  - Happy path: Task with similar goal to past failure → top-3 pitfalls injected into system prompt → pitfalls appear in agent's first LLM call
  - Edge cases: No pitfalls in store → empty section, no injection; all pitfalls low severity → none injected (gate by HIGH); pitfall store has 100+ entries → only top-3 by similarity retrieved (no N+1)
  - Error and failure paths: `experience_store` unavailable → skip injection, log warning; similarity search times out → skip injection, continue task
  - Integration: PitfallDetector app-state singleton accessible from PLAN_EXEC planning; existing `evolution_dashboard.py` caller still works (backward compatible)
- Verification: Pitfalls injected at planning phase appear in system prompt; similarity retrieval works on goal text; HIGH-severity gate filters noise; existing dashboard caller unaffected.
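A rough sketch of top-K retrieval and prompt injection as described in U7. Token-level Jaccard overlap is used here only as a stand-in for the semantic search the plan calls for, and `Pitfall`, `retrieve_pitfalls`, and `render_pitfall_section` are hypothetical names; the real store and `WarningLevel` enum are not shown in this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pitfall:
    description: str
    level: str  # stand-in for WarningLevel: "HIGH", "MEDIUM" or "LOW"


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a placeholder for real semantic similarity."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def retrieve_pitfalls(goal: str, store: list[Pitfall], k: int = 3) -> list[Pitfall]:
    """Return the k most similar HIGH-level pitfalls with non-zero similarity."""
    scored = [(jaccard(goal, p.description), p) for p in store if p.level == "HIGH"]
    scored = [(score, p) for score, p in scored if score > 0.0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [p for _, p in scored[:k]]


def render_pitfall_section(pitfalls: list[Pitfall]) -> str:
    """Build the prompt section; empty string means nothing is injected."""
    if not pitfalls:
        return ""
    lines = ["Historical pitfalls to avoid:"]
    lines.extend(f"- {p.description}" for p in pitfalls)
    return "\n".join(lines)
```

Scoring the whole store in one pass and slicing afterwards keeps retrieval to a single query regardless of store size, which is the property the "no N+1" edge case asks for.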
U8. Spec review gate
- Goal: Pause PLAN_EXEC after first Spec generation for user review; resume on confirmation, replan on rejection.
- Requirements: R8
- Dependencies: U5 (reflexion retry for post-review execution)
- Files:
  - Modify: `src/agentkit/core/plan_exec_engine.py` (at lines 269-277: after Spec generation, emit `spec_review_request`, suspend on pending future)
  - Modify: `src/agentkit/core/spec_manager.py` (add `parked` status, `resume()` method)
  - Modify: `src/agentkit/server/routes/chat.py` (add `spec_review_request`/`spec_review_reply` to `_VALID_TEAM_EVENT_TYPES` at line 144; add handler for `spec_review_reply`)
  - Modify: `src/agentkit/server/routes/portal.py` (add event forwarding for spec review events)
  - Test: `tests/unit/test_spec_review_gate.py`
- Approach: At `plan_exec_engine.py:269-277` (currently generates the Spec and immediately executes), insert: emit a `spec_review_request` event (carrying `spec_id`, `goal`, `steps`, `spec_review_id = f"{plan_id}:spec_review"`), then suspend on a pending `asyncio.Future`. On `spec_review_reply` (confirm/reject/timeout): confirm → resume execution; reject → replan (call `GoalPlanner` again with rejection feedback); timeout (30 min default, configurable) → set Spec status `parked` (not `failed`), allow resume-on-return. Add `spec_review_request`/`spec_review_reply` to `_VALID_TEAM_EVENT_TYPES` (per streaming-event-whitelist learning: without this, events silently no-op with a 200 response). Follow terminal-event symmetry (open milestone → close on every path). Mirror the CancellationToken pattern (register pending future, pop in finally). RQ4 confirmed: use new events rather than reusing `confirmation_request` (different timeout semantics, different lifecycle, portal.py has no confirmation wiring).
- Patterns to follow: `src/agentkit/core/config_driven.py:686` (CancellationToken try/finally: register/pop pattern), `src/agentkit/server/routes/chat.py:144` (`_VALID_TEAM_EVENT_TYPES`: add new events), `src/agentkit/server/routes/chat.py:1365-1377` (confirmation pattern: reference, not reuse), streaming-event-contract-residuals learning (terminal-event symmetry, stable identifier)
- Test scenarios:
  - Happy path: Covers AE4: PLAN_EXEC generates Spec → `spec_review_request` emitted → execution suspends → user confirms → `spec_review_reply` → execution resumes
  - Edge cases: User rejects → replan with feedback → new Spec generated → review again; timeout (30min) → Spec status `parked` (not `failed`) → resume on return; stream cancelled during review → future cancelled, no deadlock
  - Error and failure paths: `spec_review_reply` with invalid `spec_review_id` → error response; future resolution error → execution fails gracefully; event not in whitelist → test asserts it IS in whitelist (silent failure prevention)
  - Integration: Events forwarded by portal.py; frontend receives `spec_review_request` and can render review UI; `parked` Spec survives page reload
- Verification: Spec review round-trip works (request → suspend → reply → resume); rejection triggers replan; timeout → parked not failed; events in whitelist (no silent no-op).
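The suspend-and-resume mechanics of the review gate can be sketched with a pending `asyncio.Future` keyed by the stable `spec_review_id`. `SpecReviewGate` and the returned state strings are hypothetical; event emission, replanning, and persistence of the `parked` status are left out.

```python
from __future__ import annotations

import asyncio


class SpecReviewGate:
    """Hypothetical registry of pending spec reviews keyed by spec_review_id."""

    def __init__(self, timeout_s: float = 30 * 60) -> None:
        self._pending: dict[str, asyncio.Future[str]] = {}
        self._timeout_s = timeout_s

    async def request(self, plan_id: str) -> str:
        """Suspend until a reply arrives; return the next state name."""
        review_id = f"{plan_id}:spec_review"
        future: asyncio.Future[str] = asyncio.get_running_loop().create_future()
        self._pending[review_id] = future
        try:
            decision = await asyncio.wait_for(future, timeout=self._timeout_s)
        except asyncio.TimeoutError:
            return "parked"  # timeout parks the Spec; it is not a failure
        finally:
            # Pop on every path (confirm, reject, timeout, cancel) so nothing leaks.
            self._pending.pop(review_id, None)
        return "executing" if decision == "confirm" else "planning"

    def reply(self, review_id: str, decision: str) -> bool:
        """Resolve a pending review; False signals an unknown or stale id."""
        future = self._pending.get(review_id)
        if future is None or future.done():
            return False
        future.set_result(decision)
        return True
```

A `False` return from `reply` corresponds to the "invalid `spec_review_id` → error response" scenario: the handler can answer with an error instead of silently accepting the event.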
U9. TEAM_COLLAB no fall-back to REACT
- Goal: TEAM_COLLAB surfaces failure to user instead of silently falling back to REACT.
- Requirements: R7
- Dependencies: None (routing change only)
- Files:
  - Modify: `src/agentkit/server/routes/chat.py` (at lines 1336-1344: change TEAM_COLLAB branch to reject fall-back, surface failure)
  - Modify: `AGENTS.md` (update to reflect actual behavior: remove the "抛 not yet supported" ("throws not yet supported") claim, document TEAM_COLLAB routing)
  - Test: `tests/unit/test_team_collab_routing.py`
- Approach: At `chat.py:1336-1344` (currently falls back to REACT with a warning for TEAM_COLLAB), change the TEAM_COLLAB branch to: route to TeamOrchestrator+SharedWorkspace (real wiring), or if the orchestrator is unavailable, surface the failure to the user (not silent fall-back). Update AGENTS.md to remove the stale "抛 not yet supported" ("throws not yet supported") claim for REWOO/REFLEXION/TEAM_COLLAB, and document that TEAM_COLLAB routes to TeamOrchestrator while REWOO/REFLEXION-as-mode are deferred (not "unsupported"). This is a routing change, not a full TEAM_COLLAB implementation: the orchestrator already exists (`src/agentkit/experts/orchestrator.py:45`).
- Patterns to follow: `src/agentkit/server/routes/chat.py:758-808` (PLAN_EXEC routing: mutual exclusivity with fallback chain, KTD5 pattern)
- Test scenarios:
  - Happy path: `@team` prefix → routes to TeamOrchestrator (not REACT fall-back); TeamOrchestrator executes phases
  - Edge cases: TeamOrchestrator unavailable → error surfaced to user (not silent REACT); team template not found → error with template list
  - Error and failure paths: All phases fail → failure surfaced to user, with no routing-level fall-back to REACT (the existing `_fallback_to_single_agent` is orchestrator-internal and acceptable)
  - Integration: AGENTS.md updated; REWOO/REFLEXION-as-mode still fall back (deferred, not in scope)
- Verification: TEAM_COLLAB routes to TeamOrchestrator; no silent REACT fall-back; AGENTS.md reflects actual behavior.
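The routing rule for U9 can be illustrated with a small sketch. Everything here is an assumption made for illustration: `route_team_collab`, `RouteResult`, and the synchronous `run` method are hypothetical, and the real branch in `chat.py` is presumably async and streams events. The point shown is only that each failure path returns an explicit error instead of re-routing to REACT.

```python
from dataclasses import dataclass


@dataclass
class RouteResult:
    ok: bool
    handler: str  # which component handled the task, or "none"
    detail: str   # result text on success, user-facing error otherwise


def route_team_collab(task, orchestrator, templates, template) -> RouteResult:
    """Hypothetical TEAM_COLLAB branch: surface failures, never fall back."""
    if orchestrator is None:
        return RouteResult(False, "none", "TEAM_COLLAB unavailable: orchestrator not configured")
    if template not in templates:
        available = ", ".join(sorted(templates))
        return RouteResult(False, "none", f"unknown team template {template!r}; available: {available}")
    try:
        return RouteResult(True, "team_orchestrator", orchestrator.run(task))
    except Exception as exc:
        # The failure is reported to the user; there is no REACT retry here.
        return RouteResult(False, "team_orchestrator", f"TEAM_COLLAB failed: {exc}")
```

A unit test in the spirit of `tests/unit/test_team_collab_routing.py` could assert that `handler` is never a REACT handler on any of these paths.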
Scope Boundaries
Deferred for later
- Full sandbox tiering (read-only / workspace-write / danger) — P2 priority; only minimum sandbox (workspace-write, no network) pulled into scope as R3/R10 prerequisite (per RV3).
- REWOO/REFLEXION-as-mode (as independent execution modes) — deferred per RV10 (no target service for REWOO, conceptually distinct from reflexion-as-retry per RV20); R7 narrowed to TEAM_COLLAB only.
- R9 coding_harness (Worker-Verifier adversarial harness) — deferred per RV11 (R3+R4 already satisfy the goal), RV12 (4-stage pipeline to single-stage PLAN_EXEC phase mapping undefined), RV13 (no independent success criteria). Trust boundary: coding_harness executing untrusted code requires sandbox — depends on full sandbox tiering.
- Model autonomous compaction — existing threshold approach works.
- Three-tier nested loop (submission / handler / turn) — cost exceeds benefit.
- Spec output as human-readable markdown: current YAML Spec + review gate works; conversion to markdown deferred.
- Full TEAM_COLLAB real wiring (beyond routing) — U9 handles routing only; deeper orchestrator integration (debate rounds, review gates, divergence detection) is existing functionality that may need tuning but is not in scope for the quality loop.
Outside this product's identity
- Tool minimalism (cut to Bash + apply_patch) — agentkit goes the skill/expert-team direction; 25 tools are business need.
- New Task Runtime concept — existing plan_exec foundation suffices; no new concept introduced.
Deferred to Follow-Up Work
- DIRECT_CHAT evolution wiring — explicitly non-goal (KTD-10); if future simple-task learning becomes valuable, would require fabricating TaskMessage/TaskResult.
- Success-path reflector upgrade: current `RuleBasedReflector` only generates suggestions on failure; success sampling (RQ2) activates fully when a success-capable reflector is implemented.
- Loop detector semantic upgrade: current hash-based detector raised to threshold 3 for keep-working mode; semantic detection (distinguishing truly identical retries from similar-but-different ones) is a future upgrade.
System-Wide Impact
- Streaming path behavior change (U2): All WebSocket-routed tasks now trigger evolution hooks. Fire-and-forget with backpressure ensures no latency regression. Evolution errors are isolated — they cannot fail the stream.
- Verification default change (U3): PLAN_EXEC/TEAM_COLLAB now verify by default. Tasks that previously "succeeded" without verification may now fail verification. This is the intended behavior change — surfaces real failures that were hidden.
- Step budget change (U4): PLAN_EXEC/TEAM_COLLAB get phase quotas; DIRECT_CHAT/REACT keep `max_steps=10` total. Backward compatible: no `phase_budgets` means current behavior.
- Evolution artifacts now persist cross-task (U6): Without actor marking and workspace-scoped sharing, a poisoned pitfall from one workspace could degrade prompts in another. Trust boundary enforcement is load-bearing.
- Reflexion retry changes loop behavior (U5): "Keep working until done" expands blast radius. Minimum sandbox (U3) is the security countermeasure. Loop detector threshold raised to 3 to avoid false-positive on retry.
- Spec review adds friction to PLAN_EXEC (U8): Every PLAN_EXEC now pauses for review. This is intentional (per R8) — catches bad plans before execution. Timeout → parked (not failed) respects long-task user availability.
- TEAM_COLLAB no longer silently degrades (U9): Users who relied on TEAM_COLLAB falling back to REACT will see explicit failures instead. This is the intended behavior — silent degradation was a bug.
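The "fire-and-forget with backpressure" behavior named for U2 might look like the sketch below. `HookDispatcher`, `max_pending`, `fire`, and `drain` are hypothetical names, and the cap of 32 is an arbitrary placeholder. The sketch shows three properties from the list above: the stream never awaits a hook, hook exceptions are contained, and pending work is capped and can be drained at shutdown.

```python
import asyncio
import logging

logger = logging.getLogger("evolution_hooks")


class HookDispatcher:
    """Hypothetical bounded fire-and-forget dispatcher for evolution hooks."""

    def __init__(self, max_pending: int = 32) -> None:
        self._max_pending = max_pending
        self._tasks: set = set()
        self.dropped = 0

    def fire(self, hook, *args) -> bool:
        """Schedule hook(*args) without awaiting it; drop when at capacity."""
        if len(self._tasks) >= self._max_pending:
            self.dropped += 1  # backpressure: shed load instead of queueing
            return False
        task = asyncio.create_task(self._guarded(hook, *args))
        self._tasks.add(task)  # keep a strong reference until the task is done
        task.add_done_callback(self._tasks.discard)
        return True

    async def _guarded(self, hook, *args) -> None:
        try:
            await hook(*args)
        except Exception:
            # Isolation: a failing hook is logged and never reaches the stream.
            logger.exception("evolution hook failed; stream unaffected")

    async def drain(self) -> None:
        """Wait for in-flight hooks, for example during shutdown."""
        if self._tasks:
            await asyncio.gather(*list(self._tasks))
```

Dropping on overflow is one possible policy; a deployment that cannot lose evolution records would need a durable queue instead.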
Risks & Dependencies
- R5 streaming hook bypass (OQ6) — HIGHEST RISK. Without U2, R5/R6 are no-ops on the primary user-facing path. U2 is the load-bearing precondition. Mitigation: U2 ships first; parity test (REST vs WebSocket evolution log) is the regression guard.
- R4 double-reflexion with Recovery layer. Main-flow reflexion (U5) + Recovery-layer reflexion (`_fallback_chain.py:118`) could double-reflect. Mitigation: Recovery escalates to human, not re-reflex. Documented in KTD-2.
- RV22 loop detector conflict with R10. "Keep working" retries similar fixes, triggering loop detection (threshold=2). Mitigation: threshold raised to 3 for keep-working mode (U4); `reset()` between attempts (KTD-9).
- R1 str_replace exact-match fragility. Without a `view` command, agents emit `str_replace` with stale anchors and fail. Mitigation: `view` command included in U1.
- R8 spec review deadlock. User leaves → task hangs. Mitigation: 30-min timeout → `parked` not `failed`; resume-on-return.
- Evolution noise degrades prompts (RV14). Low-quality pitfalls fed to the optimizer regress prompts. Mitigation: confidence threshold + observe-only mode (U6, initially `observe_only=True`).
- Evolution module runtime correctness unverified. No prior learnings exist for evolution/reflexion/verification/spec_manager modules (coverage gap from learnings research). Mitigation: budget for first-principles verification; characterization tests before changes.
- Streaming event whitelist silent failure. New events not in `_VALID_TEAM_EVENT_TYPES` silently no-op. Mitigation: U8 explicitly adds events to whitelist; test asserts presence.
- Async generator safety. All new `async def` with `yield` must use the `return; yield` pattern before early return (project rule). Applies to U2 (hook helper), U5 (reflexion streaming), U8 (spec review suspension).
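The RV22 mitigation (threshold 3 plus `reset()` between attempts) can be illustrated with a minimal hash-based detector. `HashLoopDetector` and `observe` are hypothetical names and the hashing scheme is an assumption; the plan only states that the existing detector is hash-based and exposes `reset()`.

```python
import hashlib
from collections import Counter


class HashLoopDetector:
    """Hypothetical detector: flags an action once it repeats `threshold` times."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self._counts: Counter = Counter()

    def observe(self, tool: str, args: str) -> bool:
        """Record one tool call; return True when a loop is detected."""
        digest = hashlib.sha256(f"{tool}:{args}".encode()).hexdigest()
        self._counts[digest] += 1
        return self._counts[digest] >= self.threshold

    def reset(self) -> None:
        """Clear history between reflexion attempts so retries start clean."""
        self._counts.clear()
```

With threshold 3, a single repeat of the same call inside one attempt is tolerated, and calling `reset()` before each retry keeps a legitimately repeated fix from being counted against the previous attempt.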
Dependencies:
- evolution module (Reflector/PitfallDetector/PromptOptimizer/ABTester) already implemented — U6/U7 do integration only
- ReflexionEngine already implemented — U5 extends ReActEngine, doesn't refactor ReflexionEngine
- VerificationLoop already implemented — U3 changes defaults and policy, not core logic
- SpecManager.confirm already implemented (REST) — U8 adds chat flow integration
- TeamOrchestrator already implemented — U9 is routing change, not orchestrator implementation
- Assume: step quota redesign doesn't break DIRECT_CHAT/REACT semantics (enforced by opt-in `phase_budgets` parameter)
Acceptance Examples
- AE1. Complex task verify-fail → reflexion retry. Covers R2, R4, R10. Given: PLAN_EXEC task completes, verify runs pytest and fails. When: reflexion triggers, reflects on error, generates fix. Then: retries within reflect quota; if it still fails, marks `"gave_up_after_reflections"` and triggers evolution.
- AE2. Simple task doesn't reflexion. Covers R4. Given: DIRECT_CHAT mode executes simple task. When: task completes. Then: no reflexion retry loop, direct return.
- AE3. Task failure auto-triggers evolution. Covers R5, R6. Given: complex task fails (verify fails, reflexion exhausted). When: task ends. Then: evolution auto-triggers, Reflector records failure, PitfallDetector detects patterns.
- AE4. Spec review gate. Covers R8. Given: PLAN_EXEC generates Spec. When: Spec first generated. Then: execution suspends, `spec_review_request` emitted; user confirms → execution resumes; user rejects → replan; timeout → `parked`.
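AE1 and AE3 together describe a bounded verify → reflect → retry loop that always ends by triggering evolution. The sketch below shows that control flow under stated assumptions: `run_quality_loop`, `LoopOutcome`, and the four callables are hypothetical stand-ins for the real engines, and only the `"gave_up_after_reflections"` marker is taken from the plan.

```python
from dataclasses import dataclass, field


@dataclass
class LoopOutcome:
    status: str    # "success" or "gave_up_after_reflections"
    attempts: int  # number of execute+verify rounds
    reflections: list = field(default_factory=list)


def run_quality_loop(execute, verify, reflect, on_complete, reflect_quota=2):
    """Hypothetical loop: retry with reflection until verified or quota is spent."""
    feedback = None
    reflections = []
    attempts = 0
    while True:
        attempts += 1
        result = execute(feedback)
        ok, error = verify(result)
        if ok:
            outcome = LoopOutcome("success", attempts, reflections)
            break
        if len(reflections) >= reflect_quota:
            outcome = LoopOutcome("gave_up_after_reflections", attempts, reflections)
            break
        feedback = reflect(result, error)  # turn the failure into guidance
        reflections.append(feedback)
    on_complete(outcome)  # evolution hook fires on success and on give-up
    return outcome
```

With `reflect_quota=2` the loop runs at most three attempts. DIRECT_CHAT (AE2) would simply not enter this loop.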
Sources / Research
- Origin document: `docs/brainstorms/2026-07-02-complex-task-quality-loop-requirements.md` (R1-R12, RQ1-RQ4, OQ5-OQ6, RV1-RV22)
- Repo research: Confirmed all brainstorm findings with file:line references; mapped 12 requirements to integration points; identified 3 AGENTS.md contradictions; recommended 6-phase implementation order.
- Institutional learnings (5 relevant docs in `docs/solutions/`):
  - `integration-issues/streaming-event-contract-residuals.md`: `execute_stream` registration pattern (resolves OQ6), terminal-event symmetry (shapes R8), stable identifier convention
  - `logic-errors/long-horizon-reliability-code-review-fixes.md`: `reset()` between retry attempts (RV22 mitigation), checkpoint-reconstructable counters (KTD-7), cross-module format contracts
  - `runtime-errors/streaming-event-whitelist-and-accumulation.md`: `_VALID_TEAM_EVENT_TYPES` whitelist (R8 events), ReAct Streaming Contract (R4 streaming)
  - `architecture-patterns/bitable-companion-service-security-reliability-patterns.md`: SSRF hop-revalidation → symlink defense (KTD-6), IDOR 404-before-403 (R6 trust boundary), `asyncio.to_thread` (R1)
  - `security-issues/portal-platform-security-reliability-fixes.md`: backpressure cap + shutdown drain (KTD-4), per-namespace rejection (R6), trust-boundary validation
- Coverage gap: No prior learnings exist for evolution/reflexion/verification/spec_manager modules — budget for first-principles verification.
- Agent-native planning assessment: Confirmed agentkit is agent-native (Required applicability); classified domain actions (Now/Later/Never); identified execute_stream hook wiring as single most load-bearing architectural issue; suggested 11 implementation units (refined to 9 in this plan); proposed 5 KTDs (expanded to 10 in this plan).
- Industry benchmarks (from brainstorm): Codex agent loop (single-thread ReAct + forced verify), Qoder Quest (Spec → Code → Verify loop + auto evolution), Trae SOLO Spec mode (confirmation gate).