feat: complex-task-quality-loop (R1-R12) #22
Summary
Implements the verify-reflect-evolve quality loop for PLAN_EXEC / TEAM_COLLAB tasks (requirements R1-R12). Nine implementation units deliver: a workspace-write file editor with minimum sandbox, evolution hooks wired into the streaming path, verification defaults, step-budget phases, reflexion-in-main-flow, auto-trigger + quality gate + actor marking, pitfall retrieval/injection at planning phase, a spec-review gate that pauses PLAN_EXEC for user approval, and explicit TEAM_COLLAB failure instead of silent REACT fall-back.
Implementation Units
2932ee5dd25915b8418964255cb31d09faf91a61f9a763396786f921120892effb7a51f1f2e727c900ce
Key Technical Decisions
Testing
Known Residuals (non-blocking)
Post-Deploy Monitoring
After merge, monitor:
Rollback: revert the merge commit; no schema migrations required.
Replaces the broken write_file placeholder (no real implementation, only _FakeTool stubs in cli/benchmark.py) with a structured editor offering four commands: create, str_replace, insert_at_line, view.

Security model (file-system analog of the 6-layer terminal security paradigm, reject-by-default + prefix match):
1. Reject absolute paths (force relative interpretation vs workspace root).
2. Reject any .. path component (path traversal).
3. Path.resolve() follows symlinks, then relative_to(workspace_root) rejects symlink escape and residual traversal.

Data-loss guard: create refuses to overwrite existing files. str_replace requires a unique anchor (0 or >1 matches is an error). insert_at_line is 1-based (0 = prepend, > EOF = append). All FS I/O is wrapped in asyncio.to_thread.

Registers str_replace_editor in _DEFAULT_CORE_TOOLS (replacing write_file) so its full description is always injected into the LLM prompt. Updates test_tool_search.py, which used write_file as a sample core tool.

Tests: 34 cases in test_str_replace_editor.py cover the happy path, edge cases (empty file, multi-match, insert at 0/beyond EOF, view range), error paths (overwrite refusal, anchor not found, path traversal, absolute path, symlink escape, unknown command, missing args), and the integration contract (in _DEFAULT_CORE_TOOLS, exported from agentkit.tools, schema enum, prompt injection via _build_tool_use_prompt).
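For reviewers, the three path checks and the unique-anchor rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names resolve_in_workspace, replace_unique, and SandboxError are hypothetical, and the real tool additionally wraps its I/O in asyncio.to_thread.

```python
from pathlib import Path


class SandboxError(ValueError):
    """Hypothetical error type for paths that escape the workspace."""


def resolve_in_workspace(workspace_root: Path, user_path: str) -> Path:
    """Return the resolved target path, or raise SandboxError (reject by default)."""
    candidate = Path(user_path)

    # Check 1: absolute paths are rejected outright.
    if candidate.is_absolute():
        raise SandboxError(f"absolute path not allowed: {user_path}")

    # Check 2: any ".." component is rejected before touching the file system.
    if ".." in candidate.parts:
        raise SandboxError(f"path traversal not allowed: {user_path}")

    # Check 3: resolve symlinks, then require the result to sit under the root.
    root = workspace_root.resolve()
    resolved = (root / candidate).resolve()
    try:
        resolved.relative_to(root)
    except ValueError:
        raise SandboxError(f"path escapes workspace: {user_path}") from None
    return resolved


def replace_unique(text: str, old: str, new: str) -> str:
    """Replace `old` with `new` only if `old` occurs exactly once in `text`."""
    if not old:
        raise ValueError("anchor must be non-empty")
    count = text.count(old)
    if count != 1:
        raise ValueError(f"anchor must match exactly once, found {count}")
    return text.replace(old, new, 1)
```

Running the lexical checks before Path.resolve() keeps the cheap rejections free of file-system access; the relative_to() check then catches anything that only becomes an escape after symlinks are followed.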
Verification: ruff check clean; targeted regression suite 412 passed (the single failure in test_calendar_tool.py is a pre-existing date-sensitive bug in an untouched file: today, Friday 2026-07-03, makes the next-Wednesday assertion fail).

P1#1 config_driven: propagate trace_outcome into output_data so lifecycle._is_failure_path() detects non-success outcomes
P1#2 portal: route through ConfigDrivenAgent.execute_stream (not react_engine.execute_stream directly) so evolution hooks fire and trace_outcome propagates; add pre-built messages support in _build_llm_messages
P1#3 sandbox: make network_block reentrant via module-level reference counter + threading.Lock - concurrent VERIFICATION phases no longer permanently block all new connections
P1#4 chat: replace dead isinstance(_PlanExecEngine) check with hasattr(_spec_review_handler) to wire the spec review gate
P1#5 plan_exec_engine: complete max_reflections threading chain (PlanExecEngine + ReActStepExecutor constructors)
P1#6 plan_exec_engine: enforce phase budgets (max_steps from phase_budgets, not hardcoded 5)
P1#7 plan_exec_engine: use current plan (not stale plan var) in aggregation after replan
P1#8 plan_exec_engine: map failure to failed status (not success)
P1#9 app: add drain timeout for pending evolution tasks on shutdown
P1#10 portal: handle spec_review_reply in WS handler
P1#11 chat: persist spec_review_request/reply/timeout to conversation store so reload can reconstruct gate state

Tests: 116 related tests pass; 26 pre-existing failures unchanged (stash-verified). ruff lint clean.