feat: complex-task-quality-loop (R1-R12) #22

fischer · 2026-07-03T22:56:03+08:00

fischer commented

2026-07-03 22:56:03 +08:00

Summary

Implements the verify-reflect-evolve quality loop for PLAN_EXEC / TEAM_COLLAB tasks (requirements R1-R12). Nine implementation units deliver: a workspace-write file editor with minimum sandbox, evolution hooks wired into the streaming path, verification defaults, step-budget phases, reflexion-in-main-flow, auto-trigger + quality gate + actor marking, pitfall retrieval/injection at planning phase, a spec-review gate that pauses PLAN_EXEC for user approval, and explicit TEAM_COLLAB failure instead of silent REACT fall-back.

Implementation Units

Unit	Requirement	Commit	Description
U1	R1	`2932ee5`	str_replace_editor tool with workspace-root security
U2	OQ6	`dd25915`	wire evolution hooks into execute_stream path
U3	R2/R3/RV3	`b841896`	verification defaults for PLAN_EXEC/TEAM_COLLAB + minimum sandbox
U4	R11/R10	`4255cb3`	step budget phases + keep-working bias
U5	R4	`1d09faf`	reflexion in main flow: verify fail -> reflect -> retry
U6	R5/R6	`91a61f9`	auto-trigger + quality gate + actor marking
U7	R12	`a763396`	pitfall retrieval/injection at planning phase
U8	R8	`786f921`	spec review gate: pause PLAN_EXEC for user review
U9	R7	`120892e`	TEAM_COLLAB surfaces failure instead of silent REACT fall-back
review-fix	F1/F2/F4	`ffb7a51`	wire pitfall_detector/spec_review to PlanExecEngine + fix restore_budget_state reset order
compat	-	`f1f2e72`	add pitfall_warnings param for ReActEngine interface compat
docs	-	`7c900ce`	plan + requirements documents

Key Technical Decisions

KTD-5: PitfallDetector is an app-state singleton instantiated at startup (constructor injection, not module-global)
KTD-7: Budget counters survive reset() when restored from checkpoint (_state_restored flag guard)
Interface compatibility: PlanExecEngine.execute/execute_stream signatures match ReActEngine exactly (pitfall_warnings param added for compat)
Async generator safety: All async generators follow the return;yield pattern per project rules
R4: Reflexion extends ReActEngine's existing reinjection loop rather than refactoring ReflexionEngine

Testing

222 tests for all modified modules pass (including 63 new tests across 4 new test files)
Ruff clean on all modified source files
New test files: test_budget_restore.py (4), test_pitfall_injection.py (28), test_spec_review_gate.py (20), test_team_collab_routing.py (11)
Pre-existing environment issues (not caused by this PR): tests/unit/llm/test_cache.py fails due to litellm not installed in Python 3.14 env; tests/unit/server/test_evolution_dashboard.py requires PostgreSQL

Known Residuals (non-blocking)

F3 (P2): Portal WebSocket path does not wire evolution hooks yet (documented gap; portal uses a separate execution path)
F5 (P3): SpecManager spec_id is not sanitized (defensive hardening deferred)
F6 (P3): Wilson interval for quality gate confidence score (advisory; current heuristic works)

Post-Deploy Monitoring

After merge, monitor:

Evolution trigger rate for PLAN_EXEC/TEAM_COLLAB tasks (should be non-zero; verify via /api/v1/evolution/experiences)
Spec review gate timeouts (watch for parked specs accumulating at /api/v1/specs?status=parked)
Pitfall injection hit rate (check logs for 'pitfall_warnings_injected' entries)
TEAM_COLLAB error messages reaching users without @team prefix (should see error events, not silent REACT fall-back)
Streaming path task-completion hooks firing (verify evolution entries created for streamed tasks)

Rollback: revert the merge commit; no schema migrations required.

## Summary Implements the verify-reflect-evolve quality loop for PLAN_EXEC / TEAM_COLLAB tasks (requirements R1-R12). Nine implementation units deliver: a workspace-write file editor with minimum sandbox, evolution hooks wired into the streaming path, verification defaults, step-budget phases, reflexion-in-main-flow, auto-trigger + quality gate + actor marking, pitfall retrieval/injection at planning phase, a spec-review gate that pauses PLAN_EXEC for user approval, and explicit TEAM_COLLAB failure instead of silent REACT fall-back. ## Implementation Units | Unit | Requirement | Commit | Description | |------|------------|--------|-------------| | U1 | R1 | 2932ee5 | str_replace_editor tool with workspace-root security | | U2 | OQ6 | dd25915 | wire evolution hooks into execute_stream path | | U3 | R2/R3/RV3 | b841896 | verification defaults for PLAN_EXEC/TEAM_COLLAB + minimum sandbox | | U4 | R11/R10 | 4255cb3 | step budget phases + keep-working bias | | U5 | R4 | 1d09faf | reflexion in main flow: verify fail -> reflect -> retry | | U6 | R5/R6 | 91a61f9 | auto-trigger + quality gate + actor marking | | U7 | R12 | a763396 | pitfall retrieval/injection at planning phase | | U8 | R8 | 786f921 | spec review gate: pause PLAN_EXEC for user review | | U9 | R7 | 120892e | TEAM_COLLAB surfaces failure instead of silent REACT fall-back | | review-fix | F1/F2/F4 | ffb7a51 | wire pitfall_detector/spec_review to PlanExecEngine + fix restore_budget_state reset order | | compat | - | f1f2e72 | add pitfall_warnings param for ReActEngine interface compat | | docs | - | 7c900ce | plan + requirements documents | ## Key Technical Decisions - **KTD-5**: PitfallDetector is an app-state singleton instantiated at startup (constructor injection, not module-global) - **KTD-7**: Budget counters survive reset() when restored from checkpoint (_state_restored flag guard) - **Interface compatibility**: PlanExecEngine.execute/execute_stream signatures match ReActEngine exactly (pitfall_warnings param added for compat) - **Async generator safety**: All async generators follow the return;yield pattern per project rules - **R4**: Reflexion extends ReActEngine's existing reinjection loop rather than refactoring ReflexionEngine ## Testing - 222 tests for all modified modules pass (including 63 new tests across 4 new test files) - Ruff clean on all modified source files - New test files: test_budget_restore.py (4), test_pitfall_injection.py (28), test_spec_review_gate.py (20), test_team_collab_routing.py (11) - Pre-existing environment issues (not caused by this PR): tests/unit/llm/test_cache.py fails due to litellm not installed in Python 3.14 env; tests/unit/server/test_evolution_dashboard.py requires PostgreSQL ## Known Residuals (non-blocking) - **F3 (P2)**: Portal WebSocket path does not wire evolution hooks yet (documented gap; portal uses a separate execution path) - **F5 (P3)**: SpecManager spec_id is not sanitized (defensive hardening deferred) - **F6 (P3)**: Wilson interval for quality gate confidence score (advisory; current heuristic works) ## Post-Deploy Monitoring After merge, monitor: 1. Evolution trigger rate for PLAN_EXEC/TEAM_COLLAB tasks (should be non-zero; verify via /api/v1/evolution/experiences) 2. Spec review gate timeouts (watch for parked specs accumulating at /api/v1/specs?status=parked) 3. Pitfall injection hit rate (check logs for 'pitfall_warnings_injected' entries) 4. TEAM_COLLAB error messages reaching users without @team prefix (should see error events, not silent REACT fall-back) 5. Streaming path task-completion hooks firing (verify evolution entries created for streamed tasks) Rollback: revert the merge commit; no schema migrations required.

fischer added 12 commits 2026-07-03 22:56:04 +08:00

2932ee51ed feat(tools): add str_replace_editor tool with workspace-root security (U1, R1)

Replaces the broken write_file placeholder (no real implementation, only
_FakeTool stubs in cli/benchmark.py) with a structured editor offering four
commands: create, str_replace, insert_at_line, view.

Security model (file-system analog of the 6-layer terminal security paradigm,
reject-by-default + prefix match):
  1. Reject absolute paths (force relative interpretation vs workspace root).
  2. Reject any .. path component (path traversal).
  3. Path.resolve() follows symlinks, then relative_to(workspace_root)
     rejects symlink escape and residual traversal.

Data-loss guard: create refuses to overwrite existing files. str_replace
requires a unique anchor (0 or >1 matches error). insert_at_line is 1-based
(0 = prepend, > EOF = append). All FS I/O wrapped in asyncio.to_thread.

Registers str_replace_editor in _DEFAULT_CORE_TOOLS (replacing write_file)
so its full description is always injected into the LLM prompt. Updates
test_tool_search.py which used write_file as a sample core tool.

Tests: 34 cases in test_str_replace_editor.py cover happy path, edge cases
(empty file, multi-match, insert at 0/beyond EOF, view range), error paths
(overwrite refusal, anchor not found, path traversal, absolute path, symlink
escape, unknown command, missing args), and integration contract (in
_DEFAULT_CORE_TOOLS, exported from agentkit.tools, schema enum, prompt
injection via _build_tool_use_prompt).

Verification: ruff check clean; targeted regression suite 412 passed
(the single failure in test_calendar_tool.py is a pre-existing date-sensitive
bug in an untouched file, today 2026-07-03 Friday makes the next-Wednesday
assertion fail).

dd259153fa feat(core): wire evolution hooks into execute_stream path (U2, OQ6 fix)

ConfigDrivenAgent.execute_stream() now fires on_task_complete/on_task_failed
evolution hooks in its finally block, achieving lifecycle parity with the
sync execute() path. This fixes the OQ6 gap where WebSocket-routed streaming
tasks bypassed evolution entirely.

Implementation:
- Module-level backpressure manager (_schedule_evolution / drain_pending_evolution_tasks)
  with cap = max(2, max_concurrency * 2), drop + log + counter on exceed, and
  shutdown drain via asyncio.gather(return_exceptions=True).
- _trigger_evolution_hooks / _evolve_safe methods on ConfigDrivenAgent: fire-and-forget
  via asyncio.create_task, evolution errors swallowed (never fail the stream).
- execute_stream finally block distinguishes cancelled (CancelledError /
  TaskCancelledError -> CANCELLED), failed (Exception -> FAILED), completed
  (final_answer received -> COMPLETED), and early-close (no completion, no
  error -> CANCELLED "stream closed before completion").
- app.py shutdown drains pending evolution tasks.
- plan_exec_engine.py / reflexion.py: doc comments noting hooks fire at the
  ConfigDrivenAgent layer (single chokepoint, no double-fire).
- portal.py: verification comments at 3 execute_stream call sites (these call
  react_engine.execute_stream directly, bypassing ConfigDrivenAgent - known gap
  tracked separately).

Tests (8 new in test_execute_stream_hooks.py):
- Happy path: success fires COMPLETED, failure fires FAILED.
- Edge cases: cancellation fires CANCELLED, early aclose fires CANCELLED,
  evolution error suppressed, backpressure cap drops + counts.
- Parity: REST on_task_complete vs execute_stream both fire COMPLETED.
- Disabled: _evolution_enabled=False fires no hooks.

b8418968c2 feat(core): verification defaults for PLAN_EXEC/TEAM_COLLAB + minimum sandbox (U3, R2/R3/RV3)

4255cb33ba feat(core): step budget phases + keep working bias (U4, R11/R10)

1d09fafec9 feat(core): reflexion in main flow - verify fail → reflect → retry (U5, R4)

91a61f9b49 feat(evolution): auto-trigger + quality gate + actor marking (U6, R5/R6)

U6 of the complex task quality loop plan.

R5 (auto evolution trigger + quality gate):
- EvolutionConfig (Pydantic v2): success_sample_rate=0.1, min_confidence=0.5,
  min_examples=3, observe_only=True, cross_workspace_sharing=False
- Success path gated by success_sample_rate; failure path always runs (100%)
- Observe-only mode records reflections without feeding optimizer (RV14:
  avoids noise-driven prompt degradation during initial rollout)
- PromptOptimizer.can_optimize() consumption gate: sample count >= min_examples
  AND mean quality >= min_confidence
- PitfallDetector confidence threshold: low-confidence warnings marked
  observe-only; confidence = failure_rate * min(1.0, total/3) linear ramp
  (ponytail: upgrade to Wilson interval)

R6 (actor marking + cross-workspace sharing):
- All evolution artifacts (EvolutionLogEntry, Module, PitfallWarning) carry
  actor field; defaults to result.agent_name
- can_share_artifact(): same-workspace always allowed; cross-workspace requires
  explicit opt-in via EvolutionConfig.cross_workspace_sharing=True

KTD-8: gave_up_after_reflections treated as failure path (triggers 100%
evolution) even when stream wrapper marks status as COMPLETED. Detection via
output_data.trace_outcome or error_message substring (ponytail: heuristic;
upgrade path is a dedicated TaskResult.trace_outcome field).

Backward compat: all gates conditional on auto_evolution_config is not None;
existing EvolutionMixin usage without config preserves prior behavior.

Tests: tests/unit/test_evolution_auto_trigger.py (37 tests) covers R5/R6
scenarios - sample rate gate, observe-only, consumption gate, pitfall
confidence, actor marking, cross-workspace sharing, gave_up_after_reflections,
error handling, fire-and-forget, backpressure cap, AE3 happy path.

a763396011 feat(evolution): pitfall retrieval/injection at planning phase (U7, R12)

786f921c5e feat(core): spec review gate - pause PLAN_EXEC for user review (U8, R8)

Add a spec review gate to PlanExecEngine that pauses execution after the
first Spec is generated, awaiting the user's confirm/reject decision.
On approval execution continues; on rejection the engine replans (capped
at 2 replans); on 30-min timeout the Spec is parked (not failed) so the
user can resume later.

- spec_manager: add parked status + park()/resume() methods
- plan_exec_engine: add spec_review_handler param, wire gate into both
  execute_stream and _execute_loop with replan cap, emit
  spec_review_request/spec_review_reply events, handle timeout to park
- chat.py: whitelist new events, add spec_review_reply WS handler,
  wire _spec_review_handler closure (30-min timeout), cleanup on disconnect
- portal.py: persist spec_review_id/decision/feedback for page reload
- tests: 20 unit tests covering happy path, rejection/replan, timeout,
  cancellation, backward compat, handler errors, park/resume round-trips

120892e305 feat(chat): TEAM_COLLAB surfaces failure instead of silent REACT fall-back (U9, R7)

- chat.py: TEAM_COLLAB execution_mode sends error + returns (no REACT fall-back)
- REWOO/REFLEXION-as-mode keep deferred fall-back (RV10)
- AGENTS.md: update stale "not yet supported" claim
- Known gap: portal.py REST path still falls back (out of U9 scope)

f1f2e72cad fix(plan_exec): add pitfall_warnings param for ReActEngine interface compat

ffb7a51d77 fix(review): wire pitfall_detector/spec_review to PlanExecEngine + fix restore_budget_state reset order

Test / backend-test (pull_request) Has been cancelled Details

Test / frontend-unit (pull_request) Has been cancelled Details

Test / api-e2e (pull_request) Has been cancelled Details

Test / frontend-e2e (pull_request) Has been cancelled Details

7c900ce280 docs: add complex-task-quality-loop plan and requirements documents

Adds the brainstorm requirements and implementation plan that drove the
9-unit quality-loop feature (R1-R12). Also gitignores local worktree
directories.

fischer added 1 commit 2026-07-05 22:31:00 +08:00

Test / backend-test (pull_request) Waiting to run Details

Test / frontend-unit (pull_request) Waiting to run Details

Test / api-e2e (pull_request) Waiting to run Details

Test / frontend-e2e (pull_request) Waiting to run Details

e5e76697a9 fix(review): resolve 11 P1 blockers from ce-code-review

P1#1  config_driven: propagate trace_outcome into output_data so
      lifecycle._is_failure_path() detects non-success outcomes
P1#2  portal: route through ConfigDrivenAgent.execute_stream (not
      react_engine.execute_stream directly) so evolution hooks fire
      and trace_outcome propagates; add pre-built messages support in
      _build_llm_messages
P1#3  sandbox: make network_block reentrant via module-level reference
      counter + threading.Lock - concurrent VERIFICATION phases no
      longer permanently block all new connections
P1#4  chat: replace dead isinstance(_PlanExecEngine) check with
      hasattr(_spec_review_handler) to wire the spec review gate
P1#5  plan_exec_engine: complete max_reflections threading chain
      (PlanExecEngine + ReActStepExecutor constructors)
P1#6  plan_exec_engine: enforce phase budgets (max_steps from
      phase_budgets, not hardcoded 5)
P1#7  plan_exec_engine: use current plan (not stale plan var) in
      aggregation after replan
P1#8  plan_exec_engine: map failure to failed status (not success)
P1#9  app: add drain timeout for pending evolution tasks on shutdown
P1#10 portal: handle spec_review_reply in WS handler
P1#11 chat: persist spec_review_request/reply/timeout to conversation
      store so reload can reconstruct gate state

Tests: 116 related tests pass; 26 pre-existing failures unchanged
(stash-verified). ruff lint clean.

fischer merged commit 8633f60831 into main

2026-07-05 22:31:22 +08:00

fischer referenced this issue from a commit

2026-07-05 22:31:23 +08:00

feat: complex-task-quality-loop (R1-R12) — 11 P1 blockers fixed (#22)

Sign in to join this conversation.

No reviewers

No Label

No Milestone

No project

No Assignees

1 Participants

Notifications

Due Date

The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: fischer/fischer-agentkit#22