| title | date | type | status | origin | execution |
|---|---|---|---|---|---|
| feat: Agent Wave 4 PLAN_EXEC hardening (REST wiring + frontend events + bash filter upgrade + e2e tests) | 2026-06-30 | feat | draft | docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md (Wave 3 deferred items + Wave 3 code review residual risks) | code |
Wave 4 PLAN_EXEC Hardening — REST Symmetry + Frontend Visibility + Filter Upgrade + E2E Coverage
Summary
Wave 3 closed G5 (function-level sharding) and G6 (SOLO four-stage state machine) at the WebSocket path only. Three concrete gaps remain before PLAN_EXEC is production-ready:
- REST asymmetry: POST /api/v1/chat/{session_id}/send_message with execution_mode="plan_exec" returns HTTP 501 (chat.py:590-595). WebSocket has a real handler; REST does not.
- No frontend visibility: phase_changed events are emitted on the WebSocket (chat.py:1295-1309), but the WsServerMessage union at frontend/src/api/types.ts:135 has no phase_changed/phase_violation branch. phase_violation is not emitted to the client at all (only injected back to the LLM as a tool error at react.py:2196-2203). The Vue UI cannot show what phase the agent is in or surface policy violations.
- Bash filter ceiling: the ponytail comment at core/phase.py:56-60 names the ceiling explicitly: the regex misses :>file and dd of=file. Upgrade path = reuse ShellTool._is_dangerous() (shell.py:466) at enforcement time.
Wave 4 closes all three and ships an E2E integration test that exercises the full PLAN_EXEC path (LLM → phase transition → tool dispatch → WS event). It does not migrate to tree-sitter (the Wave 3 KTD1 upgrade path, which remains for Wave 5+) or add phase persistence across session resume (U7 checkpoint scope, a separate concern).
Problem Frame
Wave 3 ships PLAN_EXEC behind a feature flag (agentkit.yaml plan_exec.enabled, commented out by default, so opt-in). To turn it on in production, three conditions must hold:
- Symmetric entry points: callers using REST (CLI agentkit task submit, external integrations) and callers using WebSocket (Vue chat UI) must both be able to invoke PLAN_EXEC. Today only WS works.
- Observable behavior: when the agent transitions phases or rejects a tool call, the user must see it. Today the WS event is emitted but the frontend silently drops it; violations are invisible to the user.
- Hardened safety boundary: the bash filter is the only thing preventing a Planning-phase LLM from running rm -rf. The regex is incomplete by the author's own admission (:>file slips through). Production use requires the same safety guarantee ShellTool already provides.
Wave 3's "Out of Scope (Deferred to Follow-Up Work)" section explicitly lists "REST send_message PLAN_EXEC wiring" and "Tool-filter UI in the frontend" — Wave 4 executes those deferrals. The bash filter upgrade was surfaced by Wave 3 code review (ponytail ceiling) rather than the brainstorm.
Requirements
Carried forward from Wave 3 plan's deferred items + Wave 3 code review residuals:
- R28: REST POST /api/v1/chat/{session_id}/send_message with execution_mode="plan_exec" invokes the same PLAN_EXEC handler logic as the WebSocket path, returning a ChatResult with phase events recorded (mirrors WS event sequence).
- R29: REST PLAN_EXEC bypasses the Wave 2 fallback chain (mutually exclusive with phase policy; chat.py:1093-1095 documents the constraint).
- R30: WebSocket emits a phase_violation event to the client when ReActEngine._check_phase_permission blocks a tool call (currently only returned to the LLM).
- R31: Frontend WsServerMessage union (types.ts:135) includes phase_changed and phase_violation cases; handleWsMessage (chat.ts:800) dispatches both to a phase state slice.
- R32: A compact phase indicator renders the current phase + violation toasts in the Vue chat view.
- R33: PhasePolicy.bash_command_filter accepts a Callable[[str], bool] callback in addition to re.Pattern; default policy wires ShellTool._is_dangerous so :>file and dd of=file are blocked.
- R34: An E2E integration test exercises the full PLAN_EXEC path through a scripted LLM mock: planning → advance_phase → building → write_file → verification → delivery, asserting WS events and tool dispatches.
Cross-cutting:
- R26 (inherited): all configuration via agentkit.yaml plan_exec section, parsed by ServerConfig.from_dict.
- R27 (inherited): each unit ships a minimal self-check (ponytail rule).
Key Technical Decisions
KTD1: Extract _build_phase_engine helper shared by WS and REST
Decision: Refactor chat.py:1093-1153 (the WS PLAN_EXEC engine construction block) into a private _build_phase_engine(server_config, agent, tools, ...) -> tuple[ReActEngine, list[Tool]] helper. Both _execute_plan_exec_ws and the new _execute_plan_exec_rest call it.
Rationale:
- The WS block currently inlines policy construction + engine instantiation + AdvancePhaseTool registration. REST needs the same assembly.
- Single source of truth for "how to build a phase-enforced engine" prevents drift between WS and REST paths.
- Helper is private (underscore prefix) — not a public API; test access goes through the routes.
KTD2: REST PLAN_EXEC returns non-streaming ChatResult; SSE streaming is deferred
Decision: _execute_plan_exec_rest() returns a regular SendMessageResponse (matching the existing REST send_message shape). The phase_changed/phase_violation events are captured into a phase_events: list field on the response payload.
Rationale:
- Existing REST send_message is non-streaming (chat.py:580). Streaming REST (SSE) is a separate concern owned by the /api/v1/llm/chat gateway route (llm_gateway.py), not chat.py.
- First version ships parity with existing REST shape; SSE streaming for PLAN_EXEC is a follow-up if users request it.
- The phase events list lets REST clients render phase progression after the fact (CLI agentkit task submit shows them in terminal output).
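A minimal sketch of the response shape KTD2 describes, assuming a dataclass stand-in. SendMessageResponseSketch, its session_id and content fields, and render_phase_progress are hypothetical names for illustration; the real SendMessageResponse model and its field names are left to the implementer. The event payload shapes follow the ones given in KTD3 and U4.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class SendMessageResponseSketch:
    """Hypothetical stand-in for the real SendMessageResponse model."""

    session_id: str
    content: str
    # None for non-PLAN_EXEC responses, so existing clients see no new data.
    phase_events: list[dict[str, Any]] | None = None


def render_phase_progress(response: SendMessageResponseSketch) -> list[str]:
    """Format recorded phase events the way a CLI could print them after the run."""
    lines: list[str] = []
    for event in response.phase_events or []:
        data = event["data"]
        if event["type"] == "phase_changed":
            lines.append(f"phase: {data['previous']} -> {data['phase']}")
        elif event["type"] == "phase_violation":
            lines.append(f"blocked: {data['tool']} in {data['phase']} ({data['hint']})")
    return lines
```

Because phase_events defaults to None, a REACT response built without it renders no phase lines at all, which is the backward-compatible behavior the decision asks for.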
KTD3: phase_violation event emitted to WS alongside LLM injection
Decision: In react.py:_execute_loop, when _check_phase_permission blocks a tool call, the existing structured error is injected to the LLM conversation (unchanged), AND a phase_violation event is emitted through the engine's event stream. chat.py WS handler forwards it to the client.
Rationale:
- Wave 3 returns the violation only to the LLM (gives the model a chance to self-correct by calling advance_phase). That stays.
- Adding the WS event gives the user visibility into "the LLM tried to call write_file in Planning, was rejected, and will retry". Without this, the UI shows the LLM thinking silently, which looks like a hang.
- Event payload: {"type": "phase_violation", "data": {"tool": "write_file", "phase": "planning", "hint": "call advance_phase"}}.
KTD4: bash_command_filter accepts Callable[[str], bool] | re.Pattern | None
Decision: Change PhasePolicy.bash_command_filter field type from dict[PhaseState, re.Pattern | None] to dict[PhaseState, Callable[[str], bool] | re.Pattern | None]. is_bash_command_allowed detects callable vs pattern at call time. default_policy() injects ShellTool._is_dangerous as the callable for PLANNING/VERIFICATION.
Rationale:
- ShellTool._is_dangerous (shell.py:466) is already battle-tested against _DANGEROUS_BINARIES, _DANGEROUS_BINARY_FLAGS, _DANGEROUS_ARG_PATTERNS, shell-chain operators, and pipe operators. Reusing it eliminates the regex ceiling the ponytail comment named.
- The re.Pattern form stays for backward compat (config-supplied regex patterns still work).
- PhaseState enum and PhasePolicy API stay stable; only the field type widens.
Alternative considered: Move the filter to ShellTool itself (gateway-level). Rejected because phase enforcement is per-step in ReActEngine, not per-shell-call — different lifecycle.
KTD5: Phase indicator UI is compact, optional, and degrades gracefully
Decision: Add a PhaseIndicator.vue component (badge + progress dots for 4 phases + transient toast for phase_violation). Mount it in the chat view header only when the current session has execution_mode="plan_exec"; otherwise render nothing.
Rationale:
- Most chat sessions are REACT/SKILL_REACT — phase indicator is noise for them. Conditionally render only for PLAN_EXEC sessions.
- Compact form (badge + dots) avoids competing with the existing PlanVisualization.vue (team mode, different concept; don't unify them).
- Toast pattern matches existing useMessage from ant-design-vue used elsewhere in the frontend.
Scope Boundaries
In Scope
- Refactor WS PLAN_EXEC engine construction into _build_phase_engine shared helper.
- New _execute_plan_exec_rest for REST send_message; remove the 501 at chat.py:590-595.
- Emit phase_violation event from ReActEngine._execute_loop through the WS handler.
- Frontend WsServerMessage union extension + handleWsMessage cases + new PhaseIndicator.vue.
- PhasePolicy.bash_command_filter type widening + default_policy() wiring to ShellTool._is_dangerous.
- E2E integration test with scripted LLM mock covering full PLAN_EXEC lifecycle.
Out of Scope (Deferred to Follow-Up Work)
- SSE streaming for REST PLAN_EXEC (KTD2 — non-streaming first; SSE follow-up if requested).
- tree-sitter integration for symbol extraction (Wave 3 KTD1 upgrade path; Wave 5+ candidate).
- Phase persistence across session resume (depends on U7 checkpoint deeper changes).
- Phase-aware prompt engineering (per-phase system prompt templates; a prompt-engineering concern, not code).
- Phase rollback on Building → Planning regression (UX/prompt concern; Wave 2 G9 file-level rollback already handles file state).
- config_sync.py exposure of plan_exec to frontend (frontend reads phase events from WS, not config; config exposure is only needed if the UI wants to render phase whitelists, which is out of scope).
- Recovery/Emergency layer integration with PLAN_EXEC (mutually exclusive by design; chat.py:1093-1095 documents this. Integrating would require ReflexionEngine to understand phase state, which is a separate Wave).
Outside This Product's Identity
- Replacing the existing ReAct loop with LangGraph (inherited from brainstorm).
- Disc-based file system à la DeerFlow (inherited).
- Docker sandbox (inherited; only command-level safety via bash_command_filter).
Implementation Units
U1. Bash filter upgrade — reuse ShellTool._is_dangerous() (G6 hardening)
Goal: Widen PhasePolicy.bash_command_filter to accept Callable[[str], bool] callbacks and wire ShellTool._is_dangerous as the default filter for PLANNING/VERIFICATION phases. Eliminate the ponytail ceiling at core/phase.py:56-60.
Requirements: R33, R27.
Dependencies: none.
Files:
- src/agentkit/core/phase.py (modify: widen field type; inject ShellTool._is_dangerous in default_policy(); update is_bash_command_allowed to handle callable).
- tests/unit/test_phase_policy.py (modify: add cases for :>file, dd of=file, callable vs pattern).
Approach:
- Field type: bash_command_filter: dict[PhaseState, Callable[[str], bool] | re.Pattern | None].
- is_bash_command_allowed(command, phase):
  - filter = self.bash_command_filter.get(phase)
  - if filter is None: return True
  - if callable(filter): return not filter(command)
  - if isinstance(filter, re.Pattern): return not filter.search(command)
- default_policy() replaces _DEFAULT_BASH_FILTER regex with ShellTool._is_dangerous method reference (bound method, callable).
- Keep _DEFAULT_BASH_FILTER regex as a module constant for tests and config-supplied patterns; default_policy() no longer uses it.
- Remove the ponytail comment at core/phase.py:56-60 (ceiling is closed).
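The dispatch described in the approach can be sketched as follows. This is an illustration under stated assumptions, not the real core/phase.py: PhasePolicySketch and default_policy_sketch are hypothetical names, and is_dangerous_stub is a deliberately crude stand-in for ShellTool._is_dangerous, whose actual rules are not reproduced here.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Union


class PhaseState(Enum):
    PLANNING = "planning"
    BUILDING = "building"
    VERIFICATION = "verification"
    DELIVERY = "delivery"


# Widened filter type: a predicate, a compiled regex, or no filter at all.
BashFilter = Union[Callable[[str], bool], re.Pattern, None]


def is_dangerous_stub(command: str) -> bool:
    """Hypothetical stand-in for ShellTool._is_dangerous (assumption: the real check is richer)."""
    return bool(re.search(r"(^|\s)(rm|dd)\s|>", command))


@dataclass
class PhasePolicySketch:
    bash_command_filter: dict[PhaseState, BashFilter] = field(default_factory=dict)

    def is_bash_command_allowed(self, command: str, phase: PhaseState) -> bool:
        flt = self.bash_command_filter.get(phase)  # named flt to avoid shadowing filter()
        if flt is None:
            return True  # no restriction for this phase
        if callable(flt):
            return not flt(command)  # predicate returns True when dangerous
        if isinstance(flt, re.Pattern):
            return flt.search(command) is None  # a regex match means blocked
        raise TypeError(f"unsupported filter type: {type(flt)!r}")


def default_policy_sketch() -> PhasePolicySketch:
    """Wire the predicate for the two restricted phases; BUILDING stays unfiltered."""
    return PhasePolicySketch(
        {
            PhaseState.PLANNING: is_dangerous_stub,
            PhaseState.VERIFICATION: is_dangerous_stub,
        }
    )
```

Compiled re.Pattern objects are not callable, so the order of the two type checks does not change the result; the sketch keeps the order listed in the approach.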
Execution note: characterization-first — test that default_policy().is_bash_command_allowed("rm -rf /", PLANNING) still returns False (preserves Wave 3 behavior) before adding new edge-case coverage.
Patterns to follow:
- src/agentkit/core/phase.py:default_policy() (Wave 3: same factory pattern).
- src/agentkit/tools/shell.py:_is_dangerous (Wave 3: already the canonical safety check).
Test scenarios (covers R33):
- Characterization (Wave 3 preserved):
  - default_policy().is_bash_command_allowed("rm -rf /", PLANNING) → False.
  - default_policy().is_bash_command_allowed("ls -la", PLANNING) → True.
  - default_policy().is_bash_command_allowed("git status", PLANNING) → True.
- Happy paths (new ceiling closed):
  - :>file in PLANNING → False (was True before; ShellTool._is_dangerous catches redirect-to-empty).
  - dd of=/dev/sda in PLANNING → False (was True before; caught by _DANGEROUS_BINARIES).
  - echo hello > /tmp/x in PLANNING → False (was True before; ShellTool catches > redirect).
- Edge cases:
  - re.Pattern form still works when supplied via config (whitelist_override-adjacent; config-supplied regex pattern is honored).
  - callable form takes precedence over re.Pattern when both are somehow present (defensive; shouldn't happen).
- Error paths:
  - Empty command in PLANNING → True (ShellTool separately rejects empty commands at execution time; filter only gates dangerous patterns).
  - None filter for BUILDING → True (no restriction).
Verification:
- python3 -m pytest tests/unit/test_phase_policy.py -q passes.
- ruff check src/agentkit/core/phase.py clean.
- Ponytail comment at core/phase.py:56-60 is removed (ceiling closed, not just documented).
U2. Emit phase_violation WS event from ReActEngine
Goal: When _check_phase_permission blocks a tool call, emit a phase_violation event through the engine's event stream so chat.py WS handler can forward it to the client. Today the violation is only injected back to the LLM (react.py:2196-2203), invisible to the user.
Requirements: R30.
Dependencies: none (independent of U1 — violation emission doesn't depend on filter implementation).
Files:
- src/agentkit/core/react.py (modify: emit event alongside the existing LLM injection).
- src/agentkit/server/routes/chat.py (modify: forward phase_violation events from execute_stream to the WS client).
- tests/unit/test_react_phase_enforcement.py (modify: assert event emission).
- tests/unit/test_chat_plan_exec_ws.py (modify: assert WS client receives phase_violation event).
Approach:
- ReActEngine.execute_stream already yields events for tool_call/tool_result/thinking/token. Add a new event type phase_violation, yielded before the structured error is injected into the LLM conversation.
- Event payload: {"type": "phase_violation", "data": {"tool": "<tool_name>", "phase": "<current_phase>", "hint": "call advance_phase"}}.
- chat.py WS handler (around chat.py:1218, async for event in react_engine.execute_stream(...)) adds an elif event["type"] == "phase_violation": branch that sends the event to the client via websocket.send_json.
- Existing LLM-injection path is unchanged; the LLM still gets the structured error to react to.
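The emit-then-inject ordering and the forwarding branch can be sketched with an async generator. Everything here is a simplified assumption: execute_stream_sketch, PLANNING_ALLOWED, FakeWebSocket, and forward_events are hypothetical stand-ins, and the real engine decides permissions through _check_phase_permission rather than a hard-coded set.

```python
from __future__ import annotations

import asyncio
from typing import Any, AsyncIterator

# Assumed whitelist for illustration only; the real one comes from PhasePolicy.
PLANNING_ALLOWED = {"search", "advance_phase"}


async def execute_stream_sketch(
    tool_calls: list[str], phase: str, conversation: list[dict[str, Any]]
) -> AsyncIterator[dict[str, Any]]:
    """Yield engine events; a blocked call yields phase_violation, then injects the LLM error."""
    for tool in tool_calls:
        if tool not in PLANNING_ALLOWED:
            yield {
                "type": "phase_violation",
                "data": {"tool": tool, "phase": phase, "hint": "call advance_phase"},
            }
            # Unchanged Wave 3 behavior: the model still receives a structured error.
            conversation.append({"role": "tool", "error": f"{tool} not allowed in {phase}"})
            continue
        yield {"type": "tool_call", "data": {"tool": tool}}


class FakeWebSocket:
    """Test double that records what would have been sent to the client."""

    def __init__(self) -> None:
        self.sent: list[dict[str, Any]] = []

    async def send_json(self, payload: dict[str, Any]) -> None:
        self.sent.append(payload)


async def forward_events(websocket: FakeWebSocket, events: AsyncIterator[dict[str, Any]]) -> None:
    """Mirror the handler loop: the new elif forwards violations to the client."""
    async for event in events:
        if event["type"] == "tool_call":
            await websocket.send_json(event)
        elif event["type"] == "phase_violation":
            await websocket.send_json(event)
```

Driving forward_events with asyncio.run and a scripted list such as ["write_file", "search"] lets a test check both the order of forwarded events and that the conversation still gained the structured error.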
Execution note: characterization-first — assert that phase_policy=None (no enforcement) yields zero phase_violation events (preserves Wave 3 behavior) before adding the positive-path test.
Patterns to follow:
- src/agentkit/core/react.py existing event emission (e.g., tool_call event emission pattern).
- src/agentkit/server/routes/chat.py:1295-1309 phase_changed event forwarding (same shape).
Test scenarios (covers R30):
- Characterization (no policy):
  - ReActEngine(phase_policy=None) executing a full loop yields zero phase_violation events.
- Happy paths:
  - PLANNING phase, LLM calls write_file → engine yields phase_violation event with tool="write_file", phase="planning", hint="call advance_phase".
  - WS handler forwards phase_violation to client connection (assert websocket.send_json called with {"type": "phase_violation", ...}).
  - LLM still receives the structured error in conversation (regression: Wave 3 behavior preserved).
- Edge cases:
  - Multiple violations in a row (LLM retries same tool) → multiple phase_violation events emitted (one per attempt).
  - Violation followed by advance_phase followed by same tool now allowed → exactly one phase_violation event, then a tool_call event.
- Error paths:
  - Phase policy construction failure → existing 500 error path, no phase_violation emitted (engine not constructed).
- Integration scenarios:
  - Full WS path: client connects, sends PLAN_EXEC request, LLM mock emits write_file in PLANNING → client receives phase_violation event before any tool_call event.
Verification:
- python3 -m pytest tests/unit/test_react_phase_enforcement.py tests/unit/test_chat_plan_exec_ws.py -q passes.
- ruff check src/agentkit/core/react.py src/agentkit/server/routes/chat.py clean.
U3. Refactor _build_phase_engine helper + REST PLAN_EXEC wiring
Goal: Extract the WS PLAN_EXEC engine construction (chat.py:1093-1153) into a private _build_phase_engine(server_config, agent, tools, ...) -> tuple[ReActEngine, list[Tool]] helper. Add _execute_plan_exec_rest() for REST send_message; replace the 501 at chat.py:590-595.
Requirements: R28, R29, R26.
Dependencies: U1 (uses the hardened default_policy()).
Files:
- src/agentkit/server/routes/chat.py (modify: extract helper; add REST handler; remove 501).
- tests/unit/test_chat_plan_exec_ws.py (modify: add REST PLAN_EXEC test cases).
- tests/unit/test_chat_rest_plan_exec.py (new: REST-specific coverage).
Approach:
- New private _build_phase_engine(server_config, agent, tools, system_prompt, model) -> tuple[ReActEngine, list[Tool]]:
  - Read server_config.plan_exec (default {}).
  - If enabled is False, return (None, tools) (caller falls back to REACT).
  - Build PhasePolicy via policy_from_config; on failure or None, fall back to default_policy().
  - Construct ReActEngine(..., phase_policy=policy).
  - Register AdvancePhaseTool bound to the engine; return (engine, tools + [advance_phase]).
- WS path: _execute_plan_exec_ws calls _build_phase_engine; if engine is None, falls back to REACT (existing behavior at chat.py:1101-1107).
- REST path: _execute_plan_exec_rest(request, session_id, ...):
  - Calls _build_phase_engine.
  - If engine is None, delegates to execute_with_fallback_chain (REST keeps fallback chain for non-PLAN_EXEC).
  - Otherwise calls engine.execute(...) (non-streaming, single-shot; matches existing REST send_message shape).
  - Collects phase_changed/phase_violation events into a phase_events: list[dict] field on the response payload.
  - Returns SendMessageResponse extended with optional phase_events field.
- Replace chat.py:590-595 with a branch: if routing.execution_mode == PLAN_EXEC, call _execute_plan_exec_rest; else continue with existing fallback chain.
- SendMessageResponse model gains an optional phase_events: list[dict] | None = None field (default None keeps backward compat for non-PLAN_EXEC responses).
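The helper's control flow can be sketched as below. This is a hedged illustration, not the extracted code: EngineSketch and AdvancePhaseToolSketch are hypothetical stand-ins for ReActEngine and AdvancePhaseTool, the policy factories are passed in as parameters so the block is self-contained, and the agent, system_prompt, and model arguments are omitted.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class EngineSketch:
    """Hypothetical stand-in for ReActEngine; it only carries the policy."""

    phase_policy: Any


@dataclass
class AdvancePhaseToolSketch:
    """Hypothetical stand-in for AdvancePhaseTool, bound to one engine."""

    engine: EngineSketch
    name: str = "advance_phase"


def build_phase_engine_sketch(
    plan_exec: Optional[dict[str, Any]],
    tools: list[Any],
    policy_from_config: Callable[[dict[str, Any]], Any],
    default_policy: Callable[[], Any],
) -> tuple[Optional[EngineSketch], list[Any]]:
    config = plan_exec or {}  # a missing section behaves like {}
    if config.get("enabled") is False:
        return None, tools  # opt-out: caller falls back to REACT
    try:
        policy = policy_from_config(config)
    except Exception:  # broad on purpose in this sketch: any failure means default
        policy = None
    if policy is None:
        policy = default_policy()
    engine = EngineSketch(phase_policy=policy)
    # Return a new list so the caller's tool list is not mutated.
    return engine, tools + [AdvancePhaseToolSketch(engine)]
```

The sketch follows the "on failure or None, fall back to default_policy()" rule literally. The edge-case list for this unit also expects a 500 for an invalid phase name, so the real helper would need to distinguish that case, which this sketch does not attempt.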
Execution note: characterization-first — assert that REST send_message with execution_mode="react" (or None) still goes through execute_with_fallback_chain (Wave 2 behavior unchanged) before adding PLAN_EXEC branch.
Patterns to follow:
- src/agentkit/server/routes/chat.py:1093-1153 (existing WS PLAN_EXEC block: the code being extracted).
- src/agentkit/server/_fallback_chain.py:execute_with_fallback_chain (Wave 2: REST non-PLAN_EXEC path stays here).
Test scenarios (covers R28, R29):
- Characterization (REST non-PLAN_EXEC preserved):
  - REST send_message with execution_mode="react" → calls execute_with_fallback_chain (Wave 2 path unchanged).
  - REST send_message with execution_mode=None → defaults to REACT, fallback chain applies.
- Happy paths:
  - REST send_message with execution_mode="plan_exec" → returns 200 (not 501).
  - Response includes phase_events: list with at least one phase_changed entry when the engine transitions.
  - REST with empty plan_exec config → uses default_policy() (KTD5 default whitelist).
- Edge cases:
  - REST with plan_exec.enabled=False → falls back to REACT, response has phase_events=None.
  - REST with bad plan_exec config (invalid phase name) → 500 with error message naming the bad value.
  - REST PLAN_EXEC with phase violation → phase_events includes a phase_violation entry.
- Error paths:
  - REST PLAN_EXEC when session is closed → 400 (existing path, no change).
  - REST PLAN_EXEC with non-existent session → 404 (existing path).
- Integration scenarios:
  - REST PLAN_EXEC bypasses fallback chain: assert execute_with_fallback_chain is NOT called when execution_mode="plan_exec" (mutual exclusion per R29).
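The mutual-exclusion assertion in the integration scenario could look roughly like this with unittest.mock. route_send_message is a hypothetical dispatcher standing in for the branch that replaces the 501; the real test would patch execute_with_fallback_chain in the route module and go through the HTTP client instead.

```python
from unittest.mock import MagicMock


def route_send_message(execution_mode, plan_exec_handler, fallback_chain):
    """Hypothetical dispatcher mirroring the branch that replaces the 501."""
    if execution_mode == "plan_exec":
        return plan_exec_handler()
    return fallback_chain()


def check_plan_exec_bypasses_fallback():
    """PLAN_EXEC requests must never reach the Wave 2 fallback chain (R29)."""
    plan_exec_handler = MagicMock(return_value={"phase_events": []})
    fallback_chain = MagicMock(return_value={"phase_events": None})

    result = route_send_message("plan_exec", plan_exec_handler, fallback_chain)

    plan_exec_handler.assert_called_once()
    fallback_chain.assert_not_called()
    return result


def check_react_still_uses_fallback():
    """Characterization: non-PLAN_EXEC modes keep the Wave 2 path."""
    plan_exec_handler = MagicMock()
    fallback_chain = MagicMock(return_value={"phase_events": None})

    result = route_send_message("react", plan_exec_handler, fallback_chain)

    fallback_chain.assert_called_once()
    plan_exec_handler.assert_not_called()
    return result
```

Writing the characterization check first, as the execution note suggests, pins the Wave 2 behavior before the new branch is added.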
Verification:
- python3 -m pytest tests/unit/test_chat_plan_exec_ws.py tests/unit/test_chat_rest_plan_exec.py -q passes.
- ruff check src/agentkit/server/routes/chat.py clean.
- The 501 at chat.py:590-595 is removed.
U4. Frontend phase event pipeline + PhaseIndicator.vue
Goal: Extend WsServerMessage union with phase_changed and phase_violation event types; add handleWsMessage cases that update a phase state slice; add a compact PhaseIndicator.vue component mounted only for PLAN_EXEC sessions.
Requirements: R31, R32.
Dependencies: U2 (frontend renders phase_violation events emitted by backend).
Files:
- src/agentkit/server/frontend/src/api/types.ts (modify: extend WsServerMessage union).
- src/agentkit/server/frontend/src/stores/chat.ts (modify: add phase state slice; add cases in handleWsMessage).
- src/agentkit/server/frontend/src/components/PhaseIndicator.vue (new: badge + dots + toast).
- src/agentkit/server/frontend/src/views/AgentChatView.vue (modify: mount PhaseIndicator conditionally).
- src/agentkit/server/frontend/tests/unit/PhaseIndicator.spec.ts (new: component test).
- src/agentkit/server/frontend/src/api/types.ts (verify: PlanExecutionMode type already covers "plan_exec").
Approach:
- WsServerMessage union gains two branches: { type: "phase_changed"; data: { phase: string; previous: string } } and { type: "phase_violation"; data: { tool: string; phase: string; hint: string } }.
- chat.ts Pinia store gains: currentPhase: Ref<string | null>, phaseViolations: Ref<PhaseViolation[]>, isPlanExec: ComputedRef<boolean> (derived from session's execution_mode).
- handleWsMessage adds case "phase_changed": currentPhase.value = data.phase; and case "phase_violation": phaseViolations.value.push(data); (capped at last 5 to bound memory).
- PhaseIndicator.vue:
  - 4 dots representing PLANNING/BUILDING/VERIFICATION/DELIVERY; current phase highlighted.
  - On phase_violation, show an ant-design-vue message.warning(...) toast with the violation hint.
  - Renders nothing when !isPlanExec (graceful degradation).
- AgentChatView.vue mounts <PhaseIndicator /> in the chat header slot, conditional on chatStore.isPlanExec.
Execution note: characterization-first — assert that handleWsMessage with data.type="token" (existing) still updates message content unchanged, before adding new cases.
Patterns to follow:
- src/agentkit/server/frontend/src/stores/chat.ts:1325-1391 (existing team event handling: phase_started/phase_completed cases shape the new cases).
- src/agentkit/server/frontend/src/components/PlanVisualization.vue (existing team mode component: different domain but same "compact badge + state" pattern).
Test scenarios (covers R31, R32):
- Characterization (existing events preserved):
  - handleWsMessage({type: "token", data: ...}) still appends to message content.
  - handleWsMessage({type: "team_formed", ...}) still routes to team store.
- Happy paths:
  - handleWsMessage({type: "phase_changed", data: {phase: "building", previous: "planning"}}) → currentPhase.value === "building".
  - handleWsMessage({type: "phase_violation", data: {tool: "write_file", phase: "planning", hint: "..."}}) → phaseViolations.value length increases by 1.
  - PhaseIndicator.vue with currentPhase="building" → renders 4 dots with the 2nd highlighted.
- Edge cases:
  - PhaseIndicator.vue with isPlanExec=false → renders nothing (returns null or empty <template>).
  - phaseViolations capped at 5 entries (6th violation pushes oldest out).
  - phase_changed event with previous="" (initial transition) → no error, currentPhase updates.
- Integration scenarios:
  - Full mount: <PhaseIndicator /> mounted in AgentChatView.vue with isPlanExec=true and currentPhase="planning" → renders correctly; message.warning toast appears when phase_violation received.
Verification:
- cd src/agentkit/server/frontend && npm run typecheck clean.
- npm run test:unit -- PhaseIndicator passes.
- npm run lint clean.
U5. E2E integration test for full PLAN_EXEC lifecycle
Goal: A single E2E test that exercises the full PLAN_EXEC path through a scripted LLM mock: PLANNING (search) → advance_phase → BUILDING (write_file) → advance_phase → VERIFICATION (shell with pytest) → advance_phase → DELIVERY (final answer). It asserts the WS event sequence, phase transitions, tool dispatches, and phase_violation rejection when the LLM attempts an out-of-phase tool call.
Requirements: R34, R27.
Dependencies: U1, U2, U3, U4 (all backend pieces must be in place).
Files:
- tests/integration/test_plan_exec_e2e.py (new).
Approach:
- Mock LLM gateway: returns scripted responses in sequence (deterministic, no real API call):
  1. search tool call (PLANNING-allowed) → tool dispatched.
  2. advance_phase tool call → phase_changed event emitted.
  3. write_file tool call (BUILDING-allowed) → tool dispatched.
  4. advance_phase tool call → phase_changed event emitted.
  5. shell tool call with pytest (VERIFICATION-allowed) → tool dispatched.
  6. advance_phase tool call → phase_changed event emitted.
  7. Final answer text.
- Negative path: insert an out-of-phase write_file call in step 1 (PLANNING) → assert phase_violation event emitted, tool NOT dispatched, LLM receives structured error.
- Test asserts:
  - WS event sequence includes exactly 3 phase_changed events (planning→building, building→verification, verification→delivery).
  - Exactly 1 phase_violation event (in the negative path).
  - Tool dispatch count matches allowed tool calls.
  - Final final_answer event received.
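An index-based scripted mock for the sequence above might be sketched like this. ScriptedLLM, its complete method, tool_call, and the response dict shape are all hypothetical; the real mock would implement whatever interface LLMGateway exposes.

```python
from __future__ import annotations

from typing import Any


class ScriptedLLM:
    """Index-based mock: call N returns response N, regardless of the prompt."""

    def __init__(self, script: list[dict[str, Any]]) -> None:
        self._script = script
        self.calls = 0

    def complete(self, messages: list[dict[str, Any]]) -> dict[str, Any]:
        # Fail loudly if the engine asks for more turns than the script provides.
        if self.calls >= len(self._script):
            raise AssertionError(
                f"LLM called {self.calls + 1} times; script has {len(self._script)} steps"
            )
        response = self._script[self.calls]
        self.calls += 1
        return response


def tool_call(name: str) -> dict[str, Any]:
    """Assumed response shape for a tool-calling turn."""
    return {"tool": name}


# The seven scripted turns of the happy path, in order.
HAPPY_PATH_SCRIPT = [
    tool_call("search"),
    tool_call("advance_phase"),
    tool_call("write_file"),
    tool_call("advance_phase"),
    tool_call("shell"),
    tool_call("advance_phase"),
    {"final_answer": "done"},
]

# Negative path: an out-of-phase write_file is inserted at the start of PLANNING.
NEGATIVE_PATH_SCRIPT = [tool_call("write_file")] + HAPPY_PATH_SCRIPT
```

Keying responses on the call index instead of on conversation state keeps the mock deterministic, and the explicit AssertionError turns an unexpected extra LLM turn into a clear test failure instead of a silent hang.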
Execution note: This is a characterization test for the wired-up system, not a unit test. Mock the LLM gateway at the LLMGateway boundary; use real ReActEngine, real PhasePolicy, real WS handler (or a WS test client).
Patterns to follow:
- tests/unit/test_chat_plan_exec_ws.py (Wave 3: same WS test client pattern).
- tests/integration/test_api_coverage.py (existing: integration test patterns in the repo).
Test scenarios (covers R34):
- Happy path (full lifecycle):
  - Scripted LLM completes all 4 phases in order → 3 phase_changed events, 3 advance_phase tool calls dispatched, allowed tools dispatched in each phase, final_answer event received.
- Negative path (violation then recovery):
  - LLM attempts write_file in PLANNING → phase_violation event emitted, write_file NOT dispatched (assert write_file.execute call count is 0 at this point), LLM receives structured error in conversation.
  - LLM then calls advance_phase → transitions to BUILDING, write_file now dispatched successfully.
- Edge cases:
  - plan_exec.enabled=False config → test asserts path falls back to REACT (no phase events emitted).
  - LLM never calls advance_phase and auto_advance_after_steps=2 → phase auto-advances after 2 steps (asserts safety net).
- Error paths:
  - LLM raises (LLM call fails) → existing error event emitted; phase state unchanged.
Verification:
- python3 -m pytest tests/integration/test_plan_exec_e2e.py -v passes.
- Test runs without real LLM API call (mocked).
Risks & Dependencies
Risks
- REST non-streaming shape mismatch (medium): REST clients expecting SSE for PLAN_EXEC will not get streaming phase events; they get a list after-the-fact. Mitigation: KTD2 documents this as intentional first version; SSE follow-up tracked as deferred.
- Frontend state slice bloat (low): Adding currentPhase + phaseViolations to the chat store adds reactive state. Mitigation: phaseViolations capped at 5; currentPhase is a single string. Negligible memory.
- ShellTool._is_dangerous import cycle (low): core/phase.py importing from tools/shell.py could create a cycle if shell.py imports from core/. Mitigation: verify import direction at implementation time; if there is a cycle, lift _is_dangerous to a shared tools/_safety.py module (one-function extraction).
- E2E test flakiness from mock sequencing (low): Scripted LLM mock must match exact sequence. Mitigation: index-based mock (call N returns response N) rather than state-based; deterministic.
- Backward compat for re.Pattern config (low): Existing config-supplied regex patterns must still work after the type widening. Mitigation: KTD4 preserves re.Pattern branch in is_bash_command_allowed; characterization test in U1.
Dependencies
- Wave 3 (PR #6 merged): PhasePolicy, PhaseState, default_policy(), AdvancePhaseTool, WS PLAN_EXEC handler all in place.
- Wave 2 (PR #5 merged): execute_with_fallback_chain for REST non-PLAN_EXEC path.
- No external library dependencies.
System-Wide Impact
- REST send_message callers: gain PLAN_EXEC support; existing REACT/SKILL_REACT callers unchanged (fallback chain preserved).
- WebSocket clients: gain phase_violation event type; existing event types unchanged. Vue frontend renders the new events; other WS clients (CLI) ignore them silently.
- agentkit.yaml: no new config section (Wave 3 plan_exec section is reused; Wave 4 only changes how the policy is constructed internally).
- PhasePolicy API: bash_command_filter field type widens (re.Pattern | None → Callable[[str], bool] | re.Pattern | None); is_bash_command_allowed signature unchanged. Backward compatible.
- Frontend chat store: gains currentPhase + phaseViolations reactive state; existing state unchanged.
Sources & Research
- Origin brainstorm: docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md (Wave 3 deferred items + KTD6/KTD7).
- Wave 3 plan: docs/plans/2026-06-29-004-feat-agent-wave3-strategic-plan.md (Out of Scope section enumerates the deferrals Wave 4 executes).
- Wave 3 code review: /tmp/compound-engineering/ce-code-review/20260630-015548-c44a5245/ (ponytail ceiling at core/phase.py:56-60 flagged by correctness+reliability reviewers).
- Codebase research (2026-06-30): frontend/src/stores/chat.ts:800, frontend/src/api/types.ts:135, src/agentkit/server/routes/chat.py:580,1093-1153, src/agentkit/tools/shell.py:466, src/agentkit/core/react.py:2196-2203, src/agentkit/server/_fallback_chain.py:90.
Deferred to Implementation
- Exact SendMessageResponse schema for phase_events: list[dict]: the design above gives the shape; the implementer finalizes field names based on existing response models.
- PhaseIndicator.vue visual design (dot vs pill vs progress bar): the implementer picks based on the existing Ant Design Vue component inventory.
- Mock LLM response sequence length in U5: the implementer sizes it based on whether the test asserts every step or samples key transitions.
- Whether _build_phase_engine returns None or raises on opt-out (enabled=False): the design above returns None and the caller falls back; the implementer may switch to an explicit enum return if cleaner.