P1#1 config_driven: propagate trace_outcome into output_data so
lifecycle._is_failure_path() detects non-success outcomes
P1#2 portal: route through ConfigDrivenAgent.execute_stream (not
react_engine.execute_stream directly) so evolution hooks fire
and trace_outcome propagates; add pre-built messages support in
_build_llm_messages
P1#3 sandbox: make network_block reentrant via module-level reference
counter + threading.Lock - concurrent VERIFICATION phases no
longer permanently block all new connections
P1#4 chat: replace dead isinstance(_PlanExecEngine) check with
hasattr(_spec_review_handler) to wire the spec review gate
P1#5 plan_exec_engine: complete max_reflections threading chain
(PlanExecEngine + ReActStepExecutor constructors)
P1#6 plan_exec_engine: enforce phase budgets (max_steps from
phase_budgets, not hardcoded 5)
P1#7 plan_exec_engine: use current plan (not stale plan var) in
aggregation after replan
P1#8 plan_exec_engine: map failure to failed status (not success)
P1#9 app: add drain timeout for pending evolution tasks on shutdown
P1#10 portal: handle spec_review_reply in WS handler
P1#11 chat: persist spec_review_request/reply/timeout to conversation
store so reload can reconstruct gate state
Tests: 116 related tests pass; 26 pre-existing failures unchanged
(stash-verified). ruff lint clean.
Adds the brainstorm requirements and implementation plan that drove the
9-unit quality-loop feature (R1-R12). Also gitignores local worktree
directories.
Add a spec review gate to PlanExecEngine that pauses execution after the
first Spec is generated, awaiting the user's confirm/reject decision.
On approval execution continues; on rejection the engine replans (capped
at 2 replans); on 30-min timeout the Spec is parked (not failed) so the
user can resume later.
- spec_manager: add parked status + park()/resume() methods
- plan_exec_engine: add spec_review_handler param, wire gate into both
execute_stream and _execute_loop with replan cap, emit
spec_review_request/spec_review_reply events, handle timeout to park
- chat.py: whitelist new events, add spec_review_reply WS handler,
wire _spec_review_handler closure (30-min timeout), cleanup on disconnect
- portal.py: persist spec_review_id/decision/feedback for page reload
- tests: 20 unit tests covering happy path, rejection/replan, timeout,
cancellation, backward compat, handler errors, park/resume round-trips
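
The gate flow described above can be sketched roughly as follows. This is a simplified illustration, not the PlanExecEngine implementation: run_spec_gate, make_spec, review_handler and the status strings are hypothetical names, and the behaviour once the replan cap is exhausted (returning "rejected") is an assumption, since the commit message does not state it.

```python
import asyncio

MAX_REPLANS = 2              # cap stated in the commit message
REVIEW_TIMEOUT_S = 30 * 60   # 30-minute timeout stated in the commit message


async def run_spec_gate(make_spec, review_handler, timeout_s=REVIEW_TIMEOUT_S):
    """Return (status, spec); status is 'approved', 'parked' or 'rejected'.

    make_spec(feedback) builds a Spec (feedback is None on the first pass).
    review_handler(spec) is awaited and returns (decision, feedback).
    """
    feedback = None
    replans = 0
    while True:
        spec = make_spec(feedback)
        try:
            decision, feedback = await asyncio.wait_for(
                review_handler(spec), timeout_s
            )
        except asyncio.TimeoutError:
            # Timeout parks the Spec instead of failing it, so it can resume.
            return "parked", spec
        if decision == "confirm":
            return "approved", spec
        if replans >= MAX_REPLANS:
            # Assumption: what happens past the cap is not in the source.
            return "rejected", spec
        replans += 1
```

With a handler that always rejects, this sketch produces three Specs in total: the initial one plus two replans.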
Replaces the broken write_file placeholder (no real implementation, only
_FakeTool stubs in cli/benchmark.py) with a structured editor offering four
commands: create, str_replace, insert_at_line, view.
Security model (file-system analog of the 6-layer terminal security paradigm,
reject-by-default + prefix match):
1. Reject absolute paths (force relative interpretation vs workspace root).
2. Reject any .. path component (path traversal).
3. Path.resolve() follows symlinks, then relative_to(workspace_root)
rejects symlink escape and residual traversal.
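
The three checks above can be sketched as a single helper. This is a minimal illustration under assumptions: resolve_in_workspace is a hypothetical name, and raising ValueError is an assumed error-reporting style, not necessarily what str_replace_editor does.

```python
from pathlib import Path


def resolve_in_workspace(workspace_root: Path, user_path: str) -> Path:
    """Return the resolved path, or raise ValueError if it is not allowed."""
    candidate = Path(user_path)
    # 1. Reject absolute paths so everything is relative to the workspace.
    if candidate.is_absolute():
        raise ValueError("absolute paths are not allowed")
    # 2. Reject any '..' component outright.
    if ".." in candidate.parts:
        raise ValueError("'..' components are not allowed")
    # 3. Resolve symlinks, then require the result to stay under the root.
    root = workspace_root.resolve()
    resolved = (root / candidate).resolve()
    try:
        resolved.relative_to(root)
    except ValueError:
        raise ValueError("path escapes the workspace") from None
    return resolved
```

Resolving the root as well as the candidate matters when the workspace directory is itself reached through a symlink; otherwise relative_to would compare a resolved path against an unresolved one.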
Data-loss guard: create refuses to overwrite existing files. str_replace
requires a unique anchor (zero matches or more than one match is an error).
insert_at_line is 1-based (0 prepends; a line number past EOF appends). All
FS I/O is wrapped in asyncio.to_thread.
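
A rough sketch of the unique-anchor and insert semantics, operating on strings only. The function bodies are illustrative, not the tool's code; in particular, reading "1-based with 0 = prepend" as "insert after line N" is an assumption, as are the ValueError errors and the normalization of trailing newlines.

```python
def str_replace(text: str, old: str, new: str) -> str:
    """Replace old with new only if old occurs exactly once in text."""
    if not old:
        raise ValueError("anchor must not be empty")
    count = text.count(old)
    if count != 1:
        raise ValueError(f"anchor must match exactly once, found {count}")
    return text.replace(old, new, 1)


def insert_at_line(text: str, line: int, new_text: str) -> str:
    """Insert new_text after 1-based line `line`; 0 prepends, past EOF appends."""
    if line < 0:
        raise ValueError("line must be >= 0")
    lines = text.splitlines(keepends=True)
    index = min(line, len(lines))  # clamp anything past EOF to an append
    if not new_text.endswith("\n"):
        new_text += "\n"
    # If the preceding line lacks a newline, add one so lines do not fuse.
    if index > 0 and not lines[index - 1].endswith("\n"):
        lines[index - 1] += "\n"
    lines.insert(index, new_text)
    return "".join(lines)
```

Refusing ambiguous anchors trades convenience for safety: the caller has to widen the anchor until it is unique, rather than risk an edit landing on the wrong occurrence.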
Registers str_replace_editor in _DEFAULT_CORE_TOOLS (replacing write_file)
so its full description is always injected into the LLM prompt. Updates
test_tool_search.py which used write_file as a sample core tool.
Tests: 34 cases in test_str_replace_editor.py cover happy path, edge cases
(empty file, multi-match, insert at 0/beyond EOF, view range), error paths
(overwrite refusal, anchor not found, path traversal, absolute path, symlink
escape, unknown command, missing args), and integration contract (in
_DEFAULT_CORE_TOOLS, exported from agentkit.tools, schema enum, prompt
injection via _build_tool_use_prompt).
Verification: ruff check clean; targeted regression suite 412 passed. The
single failure, in test_calendar_tool.py, is a pre-existing date-sensitive
bug in an untouched file: when run on Friday 2026-07-03, the next-Wednesday
assertion fails.