32 KiB

Raw Blame History

title	date	type	status	origin	execution
feat: Agent Wave 4 PLAN_EXEC hardening (REST wiring + frontend events + bash filter upgrade + e2e tests)	2026-06-30	feat	draft	docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md (Wave 3 deferred items + Wave 3 code review residual risks)	code

Wave 4 PLAN_EXEC Hardening — REST Symmetry + Frontend Visibility + Filter Upgrade + E2E Coverage

Summary

Wave 3 closed G5 (function-level sharding) and G6 (SOLO four-stage state machine) at the WebSocket path only. Three concrete gaps remain before PLAN_EXEC is production-ready:

REST asymmetry — POST /api/v1/chat/{session_id}/send_message with execution_mode="plan_exec" returns HTTP 501 (chat.py:590-595). WebSocket has a real handler; REST does not.
No frontend visibility — phase_changed events are emitted on the WS socket (chat.py:1295-1309) but frontend/src/api/types.ts:135 WsServerMessage has no phase_changed/phase_violation branch. phase_violation is not emitted to the client at all (only injected back to the LLM as a tool error at react.py:2196-2203). The Vue UI cannot show what phase the agent is in or surface policy violations.
Bash filter ceiling — core/phase.py:56-60 ponytail comment names the ceiling explicitly: regex misses :>file, dd of=file. Upgrade path = reuse ShellTool._is_dangerous() (shell.py:466) at enforcement time.

Wave 4 closes all three and ships an E2E integration test that exercises the full PLAN_EXEC path (LLM → phase transition → tool dispatch → WS event). It does not migrate to tree-sitter (KTD1 upgrade path remains for Wave 5+) or add phase persistence across session resume (U7 checkpoint scope, separate concern).

Problem Frame

Wave 3 ships PLAN_EXEC behind a feature flag (agentkit.yaml plan_exec.enabled, default commented = opt-in). To turn it on in production, three conditions must hold:

Symmetric entry points: callers using REST (CLI agentkit task submit, external integrations) and callers using WebSocket (Vue chat UI) must both be able to invoke PLAN_EXEC. Today only WS works.
Observable behavior: when the agent transitions phases or rejects a tool call, the user must see it. Today the WS event is emitted but the frontend silently drops it; violations are invisible to the user.
Hardened safety boundary: the bash filter is the only thing preventing a Planning-phase LLM from running rm -rf. The regex is conservative by the author's own admission (:>file slips through). Production use requires the same safety guarantee ShellTool already provides.

Wave 3's "Out of Scope (Deferred to Follow-Up Work)" section explicitly lists "REST send_message PLAN_EXEC wiring" and "Tool-filter UI in the frontend" — Wave 4 executes those deferrals. The bash filter upgrade was surfaced by Wave 3 code review (ponytail ceiling) rather than the brainstorm.

Requirements

Carried forward from Wave 3 plan's deferred items + Wave 3 code review residuals:

R28: REST POST /api/v1/chat/{session_id}/send_message with execution_mode="plan_exec" invokes the same PLAN_EXEC handler logic as the WebSocket path, returning a ChatResult with phase events recorded (mirrors WS event sequence).
R29: REST PLAN_EXEC bypasses the Wave 2 fallback chain (mutually exclusive with phase policy — chat.py:1093-1095 documents the constraint).
R30: WebSocket emits a phase_violation event to the client when ReActEngine._check_phase_permission blocks a tool call (currently only returned to the LLM).
R31: Frontend WsServerMessage union (types.ts:135) includes phase_changed and phase_violation cases; handleWsMessage (chat.ts:800) dispatches both to a phase state slice.
R32: A compact phase indicator renders the current phase + violation toasts in the Vue chat view.
R33: PhasePolicy.bash_command_filter accepts a Callable[[str], bool] callback in addition to re.Pattern; default policy wires ShellTool._is_dangerous so :>file and dd of=file are blocked.
R34: An E2E integration test exercises the full PLAN_EXEC path through a scripted LLM mock: planning → advance_phase → building → write_file → verification → delivery, asserting WS events and tool dispatches.

Cross-cutting:

R26 (inherited): all configuration via agentkit.yaml plan_exec section, parsed by ServerConfig.from_dict.
R27 (inherited): each unit ships a minimal self-check (ponytail rule).

Key Technical Decisions

KTD1: Extract `_build_phase_engine` helper shared by WS and REST

Decision: Refactor chat.py:1093-1153 (the WS PLAN_EXEC engine construction block) into a private _build_phase_engine(server_config, agent, tools, ...) -> tuple[ReActEngine, list[Tool]] helper. Both _execute_plan_exec_ws and the new _execute_plan_exec_rest call it.

Rationale:

The WS block currently inlines policy construction + engine instantiation + AdvancePhaseTool registration. REST needs the same assembly.
Single source of truth for "how to build a phase-enforced engine" prevents drift between WS and REST paths.
Helper is private (underscore prefix) — not a public API; test access goes through the routes.

KTD2: REST PLAN_EXEC returns non-streaming `ChatResult`; SSE streaming is deferred

Decision: _execute_plan_exec_rest() returns a regular SendMessageResponse (matching the existing REST send_message shape). The phase_changed/phase_violation events are captured into a phase_events: list field on the response payload.

Rationale:

Existing REST send_message is non-streaming (chat.py:580). Streaming REST (SSE) is a separate concern owned by the /api/v1/llm/chat gateway route (llm_gateway.py), not chat.py.
First version ships parity with existing REST shape; SSE streaming for PLAN_EXEC is a follow-up if users request it.
The phase events list lets REST clients render phase progression after-the-fact (CLI agentkit task submit shows them in terminal output).

KTD3: `phase_violation` event emitted to WS alongside LLM injection

Decision: In react.py:_execute_loop, when _check_phase_permission blocks a tool call, the existing structured error is injected to the LLM conversation (unchanged), AND a phase_violation event is emitted through the engine's event stream. chat.py WS handler forwards it to the client.

Rationale:

Wave 3 returns the violation only to the LLM (gives the model a chance to self-correct by calling advance_phase). That stays.
Adding the WS event gives the user visibility into "the LLM tried to call write_file in Planning, was rejected, and will retry" — without this, the UI shows the LLM thinking silently which looks like a hang.
Event payload: {"type": "phase_violation", "data": {"tool": "write_file", "phase": "planning", "hint": "call advance_phase"}}.

KTD4: `bash_command_filter` accepts `Callable[[str], bool] | re.Pattern | None`

Decision: Change PhasePolicy.bash_command_filter field type from dict[PhaseState, re.Pattern | None] to dict[PhaseState, Callable[[str], bool] | re.Pattern | None]. is_bash_command_allowed detects callable vs pattern at call time. default_policy() injects ShellTool._is_dangerous as the callable for PLANNING/VERIFICATION.

Rationale:

ShellTool._is_dangerous (shell.py:466) is already battle-tested against _DANGEROUS_BINARIES, _DANGEROUS_BINARY_FLAGS, _DANGEROUS_ARG_PATTERNS, shell-chain operators, and pipe operators. Reusing it eliminates the regex ceiling the ponytail comment named.
The re.Pattern form stays for backward compat (config-supplied regex patterns still work).
PhaseState enum and PhasePolicy API stay stable; only the field type widens.

Alternative considered: Move the filter to ShellTool itself (gateway-level). Rejected because phase enforcement is per-step in ReActEngine, not per-shell-call — different lifecycle.

KTD5: Phase indicator UI is compact, optional, and degrades gracefully

Decision: Add a PhaseIndicator.vue component (badge + progress dots for 4 phases + transient toast for phase_violation). Mount it in the chat view header only when the current session has execution_mode="plan_exec"; otherwise render nothing.

Rationale:

Most chat sessions are REACT/SKILL_REACT — phase indicator is noise for them. Conditionally render only for PLAN_EXEC sessions.
Compact form (badge + dots) avoids competing with the existing PlanVisualization.vue (team mode, different concept — don't unify them).
Toast pattern matches existing useMessage from ant-design-vue used elsewhere in the frontend.

Scope Boundaries

In Scope

Refactor WS PLAN_EXEC engine construction into _build_phase_engine shared helper.
New _execute_plan_exec_rest for REST send_message; remove the 501 at chat.py:590-595.
Emit phase_violation event from ReActEngine._execute_loop through the WS handler.
Frontend WsServerMessage union extension + handleWsMessage cases + new PhaseIndicator.vue.
PhasePolicy.bash_command_filter type widening + default_policy() wiring to ShellTool._is_dangerous.
E2E integration test with scripted LLM mock covering full PLAN_EXEC lifecycle.

Out of Scope (Deferred to Follow-Up Work)

SSE streaming for REST PLAN_EXEC (KTD2 — non-streaming first; SSE follow-up if requested).
tree-sitter integration for symbol extraction (Wave 3 KTD1 upgrade path; Wave 5+ candidate).
Phase persistence across session resume (depends on U7 checkpoint deeper changes).
Phase-aware prompt engineering (per-phase system prompt templates — prompt-engineering concern, not code).
Phase rollback on Building → Planning regression (UX/prompt concern; Wave 2 G9 file-level rollback already handles file state).
config_sync.py exposure of plan_exec to frontend (frontend reads phase events from WS, not config — config exposure only needed if the UI wants to render phase whitelists, which is out of scope).
Recovery/Emergency layer integration with PLAN_EXEC (mutually exclusive by design — chat.py:1093-1095 documents this; integrating would require ReflexionEngine to understand phase state, separate Wave).

Outside This Product's Identity

Replacing the existing ReAct loop with LangGraph (inherited from brainstorm).
Disc-based file system à la DeerFlow (inherited).
Docker sandbox (inherited; only command-level safety via bash_command_filter).

Implementation Units

U1. Bash filter upgrade — reuse `ShellTool._is_dangerous()` (G6 hardening)

Goal: Widen PhasePolicy.bash_command_filter to accept Callable[[str], bool] callbacks and wire ShellTool._is_dangerous as the default filter for PLANNING/VERIFICATION phases. Eliminate the ponytail ceiling at core/phase.py:56-60.

Requirements: R33, R27.

Dependencies: none.

Files:

src/agentkit/core/phase.py (modify — widen field type; inject ShellTool._is_dangerous in default_policy(); update is_bash_command_allowed to handle callable).
tests/unit/test_phase_policy.py (modify — add cases for :>file, dd of=file, callable vs pattern).

Approach:

Field type: bash_command_filter: dict[PhaseState, Callable[[str], bool] | re.Pattern | None].
is_bash_command_allowed(command, phase):
- filter = self.bash_command_filter.get(phase)
- if filter is None: return True
- if callable(filter): return not filter(command)
- if isinstance(filter, re.Pattern): return not filter.search(command)
default_policy() replaces _DEFAULT_BASH_FILTER regex with ShellTool._is_dangerous method reference (bound method, callable).
Keep _DEFAULT_BASH_FILTER regex as a module constant for tests and config-supplied patterns; default_policy() no longer uses it.
Remove the ponytail comment at core/phase.py:56-60 (ceiling is closed).

Execution note: characterization-first — test that default_policy().is_bash_command_allowed("rm -rf /", PLANNING) still returns False (preserves Wave 3 behavior) before adding new edge-case coverage.

Patterns to follow:

src/agentkit/core/phase.py:default_policy() (Wave 3 — same factory pattern).
src/agentkit/tools/shell.py:_is_dangerous (Wave 3 — already the canonical safety check).

Test scenarios (covers R33):

Characterization (Wave 3 preserved):
- default_policy().is_bash_command_allowed("rm -rf /", PLANNING) → False.
- default_policy().is_bash_command_allowed("ls -la", PLANNING) → True.
- default_policy().is_bash_command_allowed("git status", PLANNING) → True.
Happy paths (new ceiling closed):
- :>file in PLANNING → False (was True before — ShellTool._is_dangerous catches redirect-to-empty).
- dd of=/dev/sda in PLANNING → False (was True before — caught by _DANGEROUS_BINARIES).
- echo hello > /tmp/x in PLANNING → False (was True before — ShellTool catches > redirect).
Edge cases:
- re.Pattern form still works when supplied via config (whitelist_override-adjacent — config-supplied regex pattern is honored).
- callable form takes precedence over re.Pattern when both somehow present (defensive — shouldn't happen).
Error paths:
- Empty command in PLANNING → True (ShellTool separately rejects empty commands at execution time; filter only gates dangerous patterns).
- None filter for BUILDING → True (no restriction).

Verification:

python3 -m pytest tests/unit/test_phase_policy.py -q passes.
ruff check src/agentkit/core/phase.py clean.
Ponytail comment at core/phase.py:56-60 is removed (ceiling closed, not just documented).

U2. Emit `phase_violation` WS event from `ReActEngine`

Goal: When _check_phase_permission blocks a tool call, emit a phase_violation event through the engine's event stream so chat.py WS handler can forward it to the client. Today the violation is only injected back to the LLM (react.py:2196-2203), invisible to the user.

Requirements: R30.

Dependencies: none (independent of U1 — violation emission doesn't depend on filter implementation).

Files:

src/agentkit/core/react.py (modify — emit event alongside the existing LLM injection).
src/agentkit/server/routes/chat.py (modify — forward phase_violation events from execute_stream to the WS client).
tests/unit/test_react_phase_enforcement.py (modify — assert event emission).
tests/unit/test_chat_plan_exec_ws.py (modify — assert WS client receives phase_violation event).

Approach:

ReActEngine.execute_stream already yields events for tool_call/tool_result/thinking/token. Add a new event type phase_violation yielded before the structured error is injected to the LLM conversation.
Event payload: {"type": "phase_violation", "data": {"tool": "<tool_name>", "phase": "<current_phase>", "hint": "call advance_phase"}}.
chat.py WS handler (around chat.py:1218 async for event in react_engine.execute_stream(...)) adds an elif event["type"] == "phase_violation": branch that websocket.send_json the event to the client.
Existing LLM-injection path is unchanged — the LLM still gets the structured error to react to.

Execution note: characterization-first — assert that phase_policy=None (no enforcement) yields zero phase_violation events (preserves Wave 3 behavior) before adding the positive-path test.

Patterns to follow:

src/agentkit/core/react.py existing event emission (e.g., tool_call event emission pattern).
src/agentkit/server/routes/chat.py:1295-1309 phase_changed event forwarding (same shape).

Test scenarios (covers R30):

Characterization (no policy):
- ReActEngine(phase_policy=None) executing a full loop yields zero phase_violation events.
Happy paths:
- PLANNING phase, LLM calls write_file → engine yields phase_violation event with tool="write_file", phase="planning", hint="call advance_phase".
- WS handler forwards phase_violation to client connection (assert websocket.send_json called with {"type": "phase_violation", ...}).
- LLM still receives the structured error in conversation (regression — Wave 3 behavior preserved).
Edge cases:
- Multiple violations in a row (LLM retries same tool) → multiple phase_violation events emitted (one per attempt).
- Violation followed by advance_phase followed by same tool now allowed → exactly one phase_violation event, then a tool_call event.
Error paths:
- Phase policy construction failure → existing 500 error path, no phase_violation emitted (engine not constructed).
Integration scenarios:
- Full WS path: client connects, sends PLAN_EXEC request, LLM mock emits write_file in PLANNING → client receives phase_violation event before any tool_call event.

Verification:

python3 -m pytest tests/unit/test_react_phase_enforcement.py tests/unit/test_chat_plan_exec_ws.py -q passes.
ruff check src/agentkit/core/react.py src/agentkit/server/routes/chat.py clean.

U3. Refactor `_build_phase_engine` helper + REST PLAN_EXEC wiring

Goal: Extract the WS PLAN_EXEC engine construction (chat.py:1093-1153) into a private _build_phase_engine(server_config, agent, tools, ...) -> tuple[ReActEngine, list[Tool]] helper. Add _execute_plan_exec_rest() for REST send_message; replace the 501 at chat.py:590-595.

Requirements: R28, R29, R26.

Dependencies: U1 (uses the hardened default_policy()).

Files:

src/agentkit/server/routes/chat.py (modify — extract helper; add REST handler; remove 501).
tests/unit/test_chat_plan_exec_ws.py (modify — add REST PLAN_EXEC test cases).
tests/unit/test_chat_rest_plan_exec.py (new — REST-specific coverage).

Approach:

New private _build_phase_engine(server_config, agent, tools, system_prompt, model) -> tuple[ReActEngine, list[Tool]]:
1. Read server_config.plan_exec (default {}).
2. If enabled is False, return (None, tools) (caller falls back to REACT).
3. Build PhasePolicy via policy_from_config; on failure or None, fall back to default_policy().
4. Construct ReActEngine(..., phase_policy=policy).
5. Register AdvancePhaseTool bound to the engine; return (engine, tools + [advance_phase]).
WS path: _execute_plan_exec_ws calls _build_phase_engine; if engine is None, falls back to REACT (existing behavior at chat.py:1101-1107).
REST path: _execute_plan_exec_rest(request, session_id, ...):
1. Calls _build_phase_engine.
2. If engine is None, delegates to execute_with_fallback_chain (REST keeps fallback chain for non-PLAN_EXEC).
3. Otherwise calls engine.execute(...) (non-streaming, single-shot — matches existing REST send_message shape).
4. Collects phase_changed/phase_violation events into a phase_events: list[dict] field on the response payload.
5. Returns SendMessageResponse extended with optional phase_events field.
Replace chat.py:590-595 with a branch: if routing.execution_mode == PLAN_EXEC, call _execute_plan_exec_rest; else continue with existing fallback chain.
SendMessageResponse model gains an optional phase_events: list[dict] | None = None field (default None keeps backward compat for non-PLAN_EXEC responses).

Execution note: characterization-first — assert that REST send_message with execution_mode="react" (or None) still goes through execute_with_fallback_chain (Wave 2 behavior unchanged) before adding PLAN_EXEC branch.

Patterns to follow:

src/agentkit/server/routes/chat.py:1093-1153 (existing WS PLAN_EXEC block — the code being extracted).
src/agentkit/server/_fallback_chain.py:execute_with_fallback_chain (Wave 2 — REST non-PLAN_EXEC path stays here).

Test scenarios (covers R28, R29):

Characterization (REST non-PLAN_EXEC preserved):
- REST send_message with execution_mode="react" → calls execute_with_fallback_chain (Wave 2 path unchanged).
- REST send_message with execution_mode=None → defaults to REACT, fallback chain applies.
Happy paths:
- REST send_message with execution_mode="plan_exec" → returns 200 (not 501).
- Response includes phase_events: list with at least one phase_changed entry when the engine transitions.
- REST with empty plan_exec config → uses default_policy() (KTD5 default whitelist).
Edge cases:
- REST with plan_exec.enabled=False → falls back to REACT, response has phase_events=None.
- REST with bad plan_exec config (invalid phase name) → 500 with error message naming the bad value.
- REST PLAN_EXEC with phase violation → phase_events includes a phase_violation entry.
Error paths:
- REST PLAN_EXEC when session is closed → 400 (existing path, no change).
- REST PLAN_EXEC with non-existent session → 404 (existing path).
Integration scenarios:
- REST PLAN_EXEC bypasses fallback chain: assert execute_with_fallback_chain is NOT called when execution_mode="plan_exec" (mutual exclusion per R29).

Verification:

python3 -m pytest tests/unit/test_chat_plan_exec_ws.py tests/unit/test_chat_rest_plan_exec.py -q passes.
ruff check src/agentkit/server/routes/chat.py clean.
The 501 at chat.py:590-595 is removed.

U4. Frontend phase event pipeline + `PhaseIndicator.vue`

Goal: Extend WsServerMessage union with phase_changed and phase_violation event types; add handleWsMessage cases that update a phase state slice; add a compact PhaseIndicator.vue component mounted only for PLAN_EXEC sessions.

Requirements: R31, R32.

Dependencies: U2 (frontend renders phase_violation events emitted by backend).

Files:

src/agentkit/server/frontend/src/api/types.ts (modify — extend WsServerMessage union).
src/agentkit/server/frontend/src/stores/chat.ts (modify — add phase state slice; add cases in handleWsMessage).
src/agentkit/server/frontend/src/components/PhaseIndicator.vue (new — badge + dots + toast).
src/agentkit/server/frontend/src/views/AgentChatView.vue (modify — mount PhaseIndicator conditionally).
src/agentkit/server/frontend/tests/unit/PhaseIndicator.spec.ts (new — component test).
src/agentkit/server/frontend/src/api/types.ts (verify — PlanExecutionMode type already covers "plan_exec").

Approach:

WsServerMessage union gains two branches: { type: "phase_changed"; data: { phase: string; previous: string } } and { type: "phase_violation"; data: { tool: string; phase: string; hint: string } }.
chat.ts Pinia store gains: currentPhase: Ref<string | null>, phaseViolations: Ref<PhaseViolation[]>, isPlanExec: ComputedRef<boolean> (derived from session's execution_mode).
handleWsMessage adds case "phase_changed": currentPhase.value = data.phase; and case "phase_violation": phaseViolations.value.push(data); (capped at last 5 to bound memory).
PhaseIndicator.vue:
- 4 dots representing PLANNING/BUILDING/VERIFICATION/DELIVERY; current phase highlighted.
- On phase_violation, show an ant-design-vue message.warning(...) toast with the violation hint.
- Renders nothing when !isPlanExec (graceful degradation).
AgentChatView.vue mounts <PhaseIndicator /> in the chat header slot, conditional on chatStore.isPlanExec.

Execution note: characterization-first — assert that handleWsMessage with data.type="token" (existing) still updates message content unchanged, before adding new cases.

Patterns to follow:

src/agentkit/server/frontend/src/stores/chat.ts:1325-1391 (existing team event handling — phase_started/phase_completed cases shape the new cases).
src/agentkit/server/frontend/src/components/PlanVisualization.vue (existing team mode component — different domain but same "compact badge + state" pattern).

Test scenarios (covers R31, R32):

Characterization (existing events preserved):
- handleWsMessage({type: "token", data: ...}) still appends to message content.
- handleWsMessage({type: "team_formed", ...}) still routes to team store.
Happy paths:
- handleWsMessage({type: "phase_changed", data: {phase: "building", previous: "planning"}}) → currentPhase.value === "building".
- handleWsMessage({type: "phase_violation", data: {tool: "write_file", phase: "planning", hint: "..."}}) → phaseViolations.value length increases by 1.
- PhaseIndicator.vue with currentPhase="building" → renders 4 dots with the 2nd highlighted.
Edge cases:
- PhaseIndicator.vue with isPlanExec=false → renders nothing (returns null or empty <template>).
- phaseViolations capped at 5 entries (6th violation pushes oldest out).
- phase_changed event with previous="" (initial transition) → no error, currentPhase updates.
Integration scenarios:
- Full mount: <PhaseIndicator /> mounted in AgentChatView.vue with isPlanExec=true and currentPhase="planning" → renders correctly; message.warning toast appears when phase_violation received.

Verification:

cd src/agentkit/server/frontend && npm run typecheck clean.
npm run test:unit -- PhaseIndicator passes.
npm run lint clean.

U5. E2E integration test for full PLAN_EXEC lifecycle

Goal: A single E2E test that exercises the full PLAN_EXEC path through a scripted LLM mock: PLANNING (search) → advance_phase → BUILDING (write_file) → advance_phase → VERIFICATION (shell with pytest) → advance_phase → DELIVERY (final answer). Asserts WS events sequence, phase transitions, tool dispatches, and phase_violation rejection when LLM attempts out-of-phase tool.

Requirements: R34, R27.

Dependencies: U1, U2, U3, U4 (all backend pieces must be in place).

Files:

tests/integration/test_plan_exec_e2e.py (new).

Approach:

Mock LLM gateway: returns scripted responses in sequence (deterministic, no real API call):
1. search tool call (PLANNING-allowed) → tool dispatched.
2. advance_phase tool call → phase_changed event emitted.
3. write_file tool call (BUILDING-allowed) → tool dispatched.
4. advance_phase tool call → phase_changed event emitted.
5. shell tool call with pytest (VERIFICATION-allowed) → tool dispatched.
6. advance_phase tool call → phase_changed event emitted.
7. Final answer text.
Negative path: insert an out-of-phase write_file call in step 1 (PLANNING) → assert phase_violation event emitted, tool NOT dispatched, LLM receives structured error.
Test asserts:
- WS event sequence includes exactly 3 phase_changed events (planning→building, building→verification, verification→delivery).
- Exactly 1 phase_violation event (in the negative path).
- Tool dispatch count matches allowed tool calls.
- Final final_answer event received.

Execution note: This is a characterization test for the wired-up system, not a unit test. Mock the LLM gateway at the LLMGateway boundary; use real ReActEngine, real PhasePolicy, real WS handler (or a WS test client).

Patterns to follow:

tests/unit/test_chat_plan_exec_ws.py (Wave 3 — same WS test client pattern).
tests/integration/test_api_coverage.py (existing — integration test patterns in the repo).

Test scenarios (covers R34):

Happy path (full lifecycle):
- Scripted LLM completes all 4 phases in order → 3 phase_changed events, 3 advance_phase tool calls dispatched, allowed tools dispatched in each phase, final_answer event received.
Negative path (violation then recovery):
- LLM attempts write_file in PLANNING → phase_violation event emitted, write_file NOT dispatched (assert write_file.execute call count is 0 at this point), LLM receives structured error in conversation.
- LLM then calls advance_phase → transitions to BUILDING, write_file now dispatched successfully.
Edge cases:
- plan_exec.enabled=False config → test asserts path falls back to REACT (no phase events emitted).
- LLM never calls advance_phase and auto_advance_after_steps=2 → phase auto-advances after 2 steps (asserts safety net).
Error paths:
- LLM raises (LLM call fails) → existing error event emitted; phase state unchanged.

Verification:

python3 -m pytest tests/integration/test_plan_exec_e2e.py -v passes.
Test runs without real LLM API call (mocked).

Risks & Dependencies

Risks

REST non-streaming shape mismatch (medium): REST clients expecting SSE for PLAN_EXEC will not get streaming phase events; they get a list after-the-fact. Mitigation: KTD2 documents this as intentional first version; SSE follow-up tracked as deferred.
Frontend state slice bloat (low): Adding currentPhase + phaseViolations to the chat store adds reactive state. Mitigation: phaseViolations capped at 5; currentPhase is a single string. Negligible memory.
ShellTool._is_dangerous import cycle (low): core/phase.py importing from tools/shell.py could create a cycle if shell.py imports from core/. Mitigation: verify import direction at implementation time; if cycle, lift _is_dangerous to a shared tools/_safety.py module (one-function extraction).
E2E test flakiness from mock sequencing (low): Scripted LLM mock must match exact sequence. Mitigation: index-based mock (call N returns response N) rather than state-based; deterministic.
Backward compat for re.Pattern config (low): Existing config-supplied regex patterns must still work after the type widening. Mitigation: KTD4 preserves re.Pattern branch in is_bash_command_allowed; characterization test in U1.

Dependencies

Wave 3 (PR #6 merged) — PhasePolicy, PhaseState, default_policy(), AdvancePhaseTool, WS PLAN_EXEC handler all in place.
Wave 2 (PR #5 merged) — execute_with_fallback_chain for REST non-PLAN_EXEC path.
No external library dependencies.

System-Wide Impact

REST send_message callers: gain PLAN_EXEC support; existing REACT/SKILL_REACT callers unchanged (fallback chain preserved).
WebSocket clients: gain phase_violation event type; existing event types unchanged. Vue frontend renders the new events; other WS clients (CLI) ignore them silently.
agentkit.yaml: no new config section (Wave 3 plan_exec section is reused; Wave 4 only changes how the policy is constructed internally).
PhasePolicy API: bash_command_filter field type widens (re.Pattern | None → Callable[[str], bool] | re.Pattern | None); is_bash_command_allowed signature unchanged. Backward compatible.
Frontend chat store: gains currentPhase + phaseViolations reactive state; existing state unchanged.

Sources & Research

Origin brainstorm: docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md (Wave 3 deferred items + KTD6/KTD7).
Wave 3 plan: docs/plans/2026-06-29-004-feat-agent-wave3-strategic-plan.md (Out of Scope section enumerates the deferrals Wave 4 executes).
Wave 3 code review: /tmp/compound-engineering/ce-code-review/20260630-015548-c44a5245/ (ponytail ceiling at core/phase.py:56-60 flagged by correctness+reliability reviewers).
Codebase research (2026-06-30): frontend/src/stores/chat.ts:800, frontend/src/api/types.ts:135, src/agentkit/server/routes/chat.py:580,1093-1153, src/agentkit/tools/shell.py:466, src/agentkit/core/react.py:2196-2203, src/agentkit/server/_fallback_chain.py:90.

Deferred to Implementation

Exact SendMessageResponse schema for phase_events: list[dict] — design above gives the shape; implementer finalizes field names based on existing response models.
PhaseIndicator.vue visual design (dot vs pill vs progress bar) — implementer picks based on existing Ant Design Vue component inventory.
Mock LLM response sequence length in U5 — implementer sizes based on whether the test asserts every step or samples key transitions.
Whether _build_phase_engine returns None or raises on opt-out (enabled=False) — design above returns None and caller falls back; implementer may switch to explicit enum return if cleaner.

32 KiB Raw Blame History

Wave 4 PLAN_EXEC Hardening — REST Symmetry + Frontend Visibility + Filter Upgrade + E2E Coverage

Summary

Problem Frame

Requirements

Key Technical Decisions

KTD1: Extract _build_phase_engine helper shared by WS and REST

KTD2: REST PLAN_EXEC returns non-streaming ChatResult; SSE streaming is deferred

KTD3: phase_violation event emitted to WS alongside LLM injection

KTD4: bash_command_filter accepts Callable[[str], bool] | re.Pattern | None

KTD5: Phase indicator UI is compact, optional, and degrades gracefully

Scope Boundaries

In Scope

Out of Scope (Deferred to Follow-Up Work)

Outside This Product's Identity

Implementation Units

U1. Bash filter upgrade — reuse ShellTool._is_dangerous() (G6 hardening)

U2. Emit phase_violation WS event from ReActEngine

U3. Refactor _build_phase_engine helper + REST PLAN_EXEC wiring

U4. Frontend phase event pipeline + PhaseIndicator.vue

U5. E2E integration test for full PLAN_EXEC lifecycle

Risks & Dependencies

Risks

Dependencies

System-Wide Impact

Sources & Research

Deferred to Implementation

32 KiB

Raw Blame History

KTD1: Extract `_build_phase_engine` helper shared by WS and REST

KTD2: REST PLAN_EXEC returns non-streaming `ChatResult`; SSE streaming is deferred

KTD3: `phase_violation` event emitted to WS alongside LLM injection

KTD4: `bash_command_filter` accepts `Callable[[str], bool] | re.Pattern | None`

U1. Bash filter upgrade — reuse `ShellTool._is_dangerous()` (G6 hardening)

U2. Emit `phase_violation` WS event from `ReActEngine`

U3. Refactor `_build_phase_engine` helper + REST PLAN_EXEC wiring

U4. Frontend phase event pipeline + `PhaseIndicator.vue`