feat: complex-task-quality-loop (R1-R12) #22

Merged

fischer merged 13 commits from feat/complex-task-quality-loop into main

2026-07-05 22:31:22 +08:00

Author	SHA1	Message	Date
chiguyong	e5e76697a9	fix(review): resolve 11 P1 blockers from ce-code-review Test / backend-test (pull_request) Waiting to run Details Test / frontend-unit (pull_request) Waiting to run Details Test / api-e2e (pull_request) Waiting to run Details Test / frontend-e2e (pull_request) Waiting to run Details P1#1 config_driven: propagate trace_outcome into output_data so lifecycle._is_failure_path() detects non-success outcomes P1#2 portal: route through ConfigDrivenAgent.execute_stream (not react_engine.execute_stream directly) so evolution hooks fire and trace_outcome propagates; add pre-built messages support in _build_llm_messages P1#3 sandbox: make network_block reentrant via module-level reference counter + threading.Lock - concurrent VERIFICATION phases no longer permanently block all new connections P1#4 chat: replace dead isinstance(_PlanExecEngine) check with hasattr(_spec_review_handler) to wire the spec review gate P1#5 plan_exec_engine: complete max_reflections threading chain (PlanExecEngine + ReActStepExecutor constructors) P1#6 plan_exec_engine: enforce phase budgets (max_steps from phase_budgets, not hardcoded 5) P1#7 plan_exec_engine: use current plan (not stale plan var) in aggregation after replan P1#8 plan_exec_engine: map failure to failed status (not success) P1#9 app: add drain timeout for pending evolution tasks on shutdown P1#10 portal: handle spec_review_reply in WS handler P1#11 chat: persist spec_review_request/reply/timeout to conversation store so reload can reconstruct gate state Tests: 116 related tests pass; 26 pre-existing failures unchanged (stash-verified). ruff lint clean.	2026-07-04 01:10:01 +08:00
chiguyong	7c900ce280	docs: add complex-task-quality-loop plan and requirements documents Test / backend-test (pull_request) Has been cancelled Details Test / frontend-unit (pull_request) Has been cancelled Details Test / api-e2e (pull_request) Has been cancelled Details Test / frontend-e2e (pull_request) Has been cancelled Details Adds the brainstorm requirements and implementation plan that drove the 9-unit quality-loop feature (R1-R12). Also gitignores local worktree directories.	2026-07-03 22:54:11 +08:00
chiguyong	ffb7a51d77	fix(review): wire pitfall_detector/spec_review to PlanExecEngine + fix restore_budget_state reset order	2026-07-03 22:05:51 +08:00
chiguyong	f1f2e72cad	fix(plan_exec): add pitfall_warnings param for ReActEngine interface compat	2026-07-03 21:39:27 +08:00
chiguyong	120892e305	feat(chat): TEAM_COLLAB surfaces failure instead of silent REACT fall-back (U9, R7) - chat.py: TEAM_COLLAB execution_mode sends error + returns (no REACT fall-back) - REWOO/REFLEXION-as-mode keep deferred fall-back (RV10) - AGENTS.md: update stale "not yet supported" claim - Known gap: portal.py REST path still falls back (out of U9 scope)	2026-07-03 15:47:45 +08:00
chiguyong	786f921c5e	feat(core): spec review gate - pause PLAN_EXEC for user review (U8, R8) Add a spec review gate to PlanExecEngine that pauses execution after the first Spec is generated, awaiting the user's confirm/reject decision. On approval execution continues; on rejection the engine replans (capped at 2 replans); on 30-min timeout the Spec is parked (not failed) so the user can resume later. - spec_manager: add parked status + park()/resume() methods - plan_exec_engine: add spec_review_handler param, wire gate into both execute_stream and _execute_loop with replan cap, emit spec_review_request/spec_review_reply events, handle timeout to park - chat.py: whitelist new events, add spec_review_reply WS handler, wire _spec_review_handler closure (30-min timeout), cleanup on disconnect - portal.py: persist spec_review_id/decision/feedback for page reload - tests: 20 unit tests covering happy path, rejection/replan, timeout, cancellation, backward compat, handler errors, park/resume round-trips	2026-07-03 15:20:38 +08:00
chiguyong	a763396011	feat(evolution): pitfall retrieval/injection at planning phase (U7, R12)	2026-07-03 14:27:48 +08:00
chiguyong	91a61f9b49	feat(evolution): auto-trigger + quality gate + actor marking (U6, R5/R6) U6 of the complex task quality loop plan. R5 (auto evolution trigger + quality gate): - EvolutionConfig (Pydantic v2): success_sample_rate=0.1, min_confidence=0.5, min_examples=3, observe_only=True, cross_workspace_sharing=False - Success path gated by success_sample_rate; failure path always runs (100%) - Observe-only mode records reflections without feeding optimizer (RV14: avoids noise-driven prompt degradation during initial rollout) - PromptOptimizer.can_optimize() consumption gate: sample count >= min_examples AND mean quality >= min_confidence - PitfallDetector confidence threshold: low-confidence warnings marked observe-only; confidence = failure_rate * min(1.0, total/3) linear ramp (ponytail: upgrade to Wilson interval) R6 (actor marking + cross-workspace sharing): - All evolution artifacts (EvolutionLogEntry, Module, PitfallWarning) carry actor field; defaults to result.agent_name - can_share_artifact(): same-workspace always allowed; cross-workspace requires explicit opt-in via EvolutionConfig.cross_workspace_sharing=True KTD-8: gave_up_after_reflections treated as failure path (triggers 100% evolution) even when stream wrapper marks status as COMPLETED. Detection via output_data.trace_outcome or error_message substring (ponytail: heuristic; upgrade path is a dedicated TaskResult.trace_outcome field). Backward compat: all gates conditional on auto_evolution_config is not None; existing EvolutionMixin usage without config preserves prior behavior. Tests: tests/unit/test_evolution_auto_trigger.py (37 tests) covers R5/R6 scenarios - sample rate gate, observe-only, consumption gate, pitfall confidence, actor marking, cross-workspace sharing, gave_up_after_reflections, error handling, fire-and-forget, backpressure cap, AE3 happy path.	2026-07-03 13:54:37 +08:00
chiguyong	1d09fafec9	feat(core): reflexion in main flow - verify fail → reflect → retry (U5, R4)	2026-07-03 13:29:54 +08:00
chiguyong	4255cb33ba	feat(core): step budget phases + keep working bias (U4, R11/R10)	2026-07-03 13:10:28 +08:00
chiguyong	b8418968c2	feat(core): verification defaults for PLAN_EXEC/TEAM_COLLAB + minimum sandbox (U3, R2/R3/RV3)	2026-07-03 12:32:22 +08:00
chiguyong	dd259153fa	feat(core): wire evolution hooks into execute_stream path (U2, OQ6 fix) ConfigDrivenAgent.execute_stream() now fires on_task_complete/on_task_failed evolution hooks in its finally block, achieving lifecycle parity with the sync execute() path. This fixes the OQ6 gap where WebSocket-routed streaming tasks bypassed evolution entirely. Implementation: - Module-level backpressure manager (_schedule_evolution / drain_pending_evolution_tasks) with cap = max(2, max_concurrency * 2), drop + log + counter on exceed, and shutdown drain via asyncio.gather(return_exceptions=True). - _trigger_evolution_hooks / _evolve_safe methods on ConfigDrivenAgent: fire-and-forget via asyncio.create_task, evolution errors swallowed (never fail the stream). - execute_stream finally block distinguishes cancelled (CancelledError / TaskCancelledError -> CANCELLED), failed (Exception -> FAILED), completed (final_answer received -> COMPLETED), and early-close (no completion, no error -> CANCELLED "stream closed before completion"). - app.py shutdown drains pending evolution tasks. - plan_exec_engine.py / reflexion.py: doc comments noting hooks fire at the ConfigDrivenAgent layer (single chokepoint, no double-fire). - portal.py: verification comments at 3 execute_stream call sites (these call react_engine.execute_stream directly, bypassing ConfigDrivenAgent - known gap tracked separately). Tests (8 new in test_execute_stream_hooks.py): - Happy path: success fires COMPLETED, failure fires FAILED. - Edge cases: cancellation fires CANCELLED, early aclose fires CANCELLED, evolution error suppressed, backpressure cap drops + counts. - Parity: REST on_task_complete vs execute_stream both fire COMPLETED. - Disabled: _evolution_enabled=False fires no hooks.	2026-07-03 12:16:02 +08:00
chiguyong	2932ee51ed	feat(tools): add str_replace_editor tool with workspace-root security (U1, R1) Replaces the broken write_file placeholder (no real implementation, only _FakeTool stubs in cli/benchmark.py) with a structured editor offering four commands: create, str_replace, insert_at_line, view. Security model (file-system analog of the 6-layer terminal security paradigm, reject-by-default + prefix match): 1. Reject absolute paths (force relative interpretation vs workspace root). 2. Reject any .. path component (path traversal). 3. Path.resolve() follows symlinks, then relative_to(workspace_root) rejects symlink escape and residual traversal. Data-loss guard: create refuses to overwrite existing files. str_replace requires a unique anchor (0 or >1 matches error). insert_at_line is 1-based (0 = prepend, > EOF = append). All FS I/O wrapped in asyncio.to_thread. Registers str_replace_editor in _DEFAULT_CORE_TOOLS (replacing write_file) so its full description is always injected into the LLM prompt. Updates test_tool_search.py which used write_file as a sample core tool. Tests: 34 cases in test_str_replace_editor.py cover happy path, edge cases (empty file, multi-match, insert at 0/beyond EOF, view range), error paths (overwrite refusal, anchor not found, path traversal, absolute path, symlink escape, unknown command, missing args), and integration contract (in _DEFAULT_CORE_TOOLS, exported from agentkit.tools, schema enum, prompt injection via _build_tool_use_prompt). Verification: ruff check clean; targeted regression suite 412 passed (the single failure in test_calendar_tool.py is a pre-existing date-sensitive bug in an untouched file, today 2026-07-03 Friday makes the next-Wednesday assertion fail).	2026-07-03 11:42:59 +08:00