feat(agent): Wave 4 PLAN_EXEC Hardening (U1-U5) #7
17
CONCEPTS.md
17
CONCEPTS.md
|
|
@ -59,3 +59,20 @@ The auth store's cold-start state machine with values `valid` / `invalid` / `err
|
|||
|
||||
### Service Broadcast Callback
|
||||
The convention for pushing backend state changes to the user's open frontend tabs in real time without coupling domain services to the WebSocket transport. A service accepts an optional async callback at construction; after a successful mutation it best-effort invokes the callback with a typed message envelope. Delivery failure is logged but never rolls back the mutation — the persisted state is the source of truth, the broadcast is a latency optimization. The callback is wired at the composition root (app lifespan) to the portal's per-user fan-out primitive, so the service stays layer-pure. The same callback shape is shared by CRUD services, reminder dispatchers, and sync providers, giving all real-time updates a single exit point.
|
||||
|
||||
## PLAN_EXEC
|
||||
|
||||
### Phase State Machine
|
||||
The four-phase lifecycle (`PLANNING → BUILDING → VERIFICATION → DELIVERY`) that constrains which tools an agent may call at each step of a PLAN_EXEC run. Each phase has a tool whitelist (e.g., PLANNING allows `search`/`read_file`; BUILDING allows `write_file`); `WILDCARD = "*"` means all tools allowed (used by DELIVERY by default). Transitions are LLM-driven via `AdvancePhaseTool` — the LLM decides when to advance, there is no implicit timer. `auto_advance_after_steps` opts into a step-count trigger as an alternative to explicit `advance_phase` calls. State lives in `ReActEngine.current_phase`; `PhaseState` is an enum with `from_string()` parser.
|
||||
|
||||
### PhasePolicy
|
||||
The dataclass holding the per-phase whitelist (`dict[PhaseState, frozenset[str]]`) and bash command filter (`dict[PhaseState, Callable[[str], bool] | re.Pattern | None]`). Constructed via `default_policy()` (hardcoded KTD5 defaults) or `policy_from_config(plan_exec_cfg)` (YAML-driven). `policy_from_config` returns `None` for empty/disabled config — signaling "opt out, fall back to REACT" (not "use defaults"). `is_bash_command_allowed` accepts either a `Callable` (returns True if dangerous; `ShellTool._is_dangerous` is the default) or a legacy `re.Pattern` (matches dangerous substrings). The `Callable` path closes the regex ceiling — regex missed `:>file`, `dd of=file`, and unknown binaries.
|
||||
|
||||
### Phase Violation
|
||||
The structured event emitted when `_check_phase_permission` blocks a tool call in PLAN_EXEC mode. Two emission paths: (1) the violation is re-injected into the LLM as a tool result so the loop continues (the LLM can switch tools or call `advance_phase`); (2) a `phase_violation` WS event is forwarded to the client for UI feedback (PhaseIndicator component). Violations carry `current_phase`, `tool`, `violation_kind` (`tool_not_allowed` / `bash_command_blocked`), `message`, and `command_preview`. The engine accumulates violations in `_phase_violations` and drains them via `_drain_phase_violations` after each tool dispatch — drains are the caller's responsibility at three sites in `execute_stream`.
|
||||
|
||||
### AdvancePhaseTool
|
||||
The state-transition tool that moves the engine between phases. Calls `engine.advance_phase()` which transitions `current_phase` to the next enum value (PLANNING→BUILDING→VERIFICATION→DELIVERY; DELIVERY is terminal). Returns `{"previous_phase", "current_phase", "message"}`. Bypasses `_check_phase_permission` (always permitted) — phase advancement is not subject to the whitelist it enforces. Known limitation: the default `_loop_threshold=2` fires on the 2nd identical `advance_phase({})` call because all transitions produce the same argument hash — needs exemption from loop detection.
|
||||
|
||||
### _build_phase_engine
|
||||
The chat.py helper that consolidates PLAN_EXEC engine construction for both WS and REST paths. Returns `(engine, tools_with_advance_phase, error_message)`. Returns `(None, None, None)` when `execution_mode != PLAN_EXEC` or `plan_exec.enabled=False` (both signal "fall back to REACT"). Returns `(None, None, error_message)` on policy construction failure. The helper ensures WS and REST paths share identical PhasePolicy + AdvancePhaseTool wiring — previously the REST path returned 501 because it had no construction logic.
|
||||
|
|
|
|||
|
|
@ -0,0 +1,428 @@
|
|||
---
|
||||
title: "feat: Agent Wave 4 PLAN_EXEC hardening (REST wiring + frontend events + bash filter upgrade + e2e tests)"
|
||||
date: 2026-06-30
|
||||
type: feat
|
||||
status: draft
|
||||
origin: docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md (Wave 3 deferred items + Wave 3 code review residual risks)
|
||||
execution: code
|
||||
---
|
||||
|
||||
# Wave 4 PLAN_EXEC Hardening — REST Symmetry + Frontend Visibility + Filter Upgrade + E2E Coverage
|
||||
|
||||
## Summary
|
||||
|
||||
Wave 3 closed G5 (function-level sharding) and G6 (SOLO four-stage state machine) at the **WebSocket path only**. Three concrete gaps remain before PLAN_EXEC is production-ready:
|
||||
|
||||
1. **REST asymmetry** — `POST /api/v1/chat/{session_id}/send_message` with `execution_mode="plan_exec"` returns HTTP 501 (chat.py:590-595). WebSocket has a real handler; REST does not.
|
||||
2. **No frontend visibility** — `phase_changed` events are emitted on the WS socket (chat.py:1295-1309) but `frontend/src/api/types.ts:135` `WsServerMessage` has no `phase_changed`/`phase_violation` branch. `phase_violation` is not emitted to the client at all (only injected back to the LLM as a tool error at react.py:2196-2203). The Vue UI cannot show what phase the agent is in or surface policy violations.
|
||||
3. **Bash filter ceiling** — `core/phase.py:56-60` ponytail comment names the ceiling explicitly: regex misses `:>file`, `dd of=file`. Upgrade path = reuse `ShellTool._is_dangerous()` (shell.py:466) at enforcement time.
|
||||
|
||||
Wave 4 closes all three and ships an E2E integration test that exercises the full PLAN_EXEC path (LLM → phase transition → tool dispatch → WS event). It does **not** migrate to tree-sitter (KTD1 upgrade path remains for Wave 5+) or add phase persistence across session resume (U7 checkpoint scope, separate concern).
|
||||
|
||||
## Problem Frame
|
||||
|
||||
Wave 3 ships PLAN_EXEC behind a feature flag (`agentkit.yaml` `plan_exec.enabled`, default commented = opt-in). To turn it on in production, three conditions must hold:
|
||||
|
||||
- **Symmetric entry points**: callers using REST (CLI `agentkit task submit`, external integrations) and callers using WebSocket (Vue chat UI) must both be able to invoke PLAN_EXEC. Today only WS works.
|
||||
- **Observable behavior**: when the agent transitions phases or rejects a tool call, the user must see it. Today the WS event is emitted but the frontend silently drops it; violations are invisible to the user.
|
||||
- **Hardened safety boundary**: the bash filter is the only thing preventing a Planning-phase LLM from running `rm -rf`. The regex is conservative by the author's own admission (`:>file` slips through). Production use requires the same safety guarantee `ShellTool` already provides.
|
||||
|
||||
Wave 3's "Out of Scope (Deferred to Follow-Up Work)" section explicitly lists "REST `send_message` PLAN_EXEC wiring" and "Tool-filter UI in the frontend" — Wave 4 executes those deferrals. The bash filter upgrade was surfaced by Wave 3 code review (ponytail ceiling) rather than the brainstorm.
|
||||
|
||||
## Requirements
|
||||
|
||||
Carried forward from Wave 3 plan's deferred items + Wave 3 code review residuals:
|
||||
|
||||
- **R28**: REST `POST /api/v1/chat/{session_id}/send_message` with `execution_mode="plan_exec"` invokes the same PLAN_EXEC handler logic as the WebSocket path, returning a `ChatResult` with phase events recorded (mirrors WS event sequence).
|
||||
- **R29**: REST PLAN_EXEC bypasses the Wave 2 fallback chain (mutually exclusive with phase policy — chat.py:1093-1095 documents the constraint).
|
||||
- **R30**: WebSocket emits a `phase_violation` event to the client when `ReActEngine._check_phase_permission` blocks a tool call (currently only returned to the LLM).
|
||||
- **R31**: Frontend `WsServerMessage` union (types.ts:135) includes `phase_changed` and `phase_violation` cases; `handleWsMessage` (chat.ts:800) dispatches both to a phase state slice.
|
||||
- **R32**: A compact phase indicator renders the current phase + violation toasts in the Vue chat view.
|
||||
- **R33**: `PhasePolicy.bash_command_filter` accepts a `Callable[[str], bool]` callback in addition to `re.Pattern`; default policy wires `ShellTool._is_dangerous` so `:>file` and `dd of=file` are blocked.
|
||||
- **R34**: An E2E integration test exercises the full PLAN_EXEC path through a scripted LLM mock: planning → `advance_phase` → building → `write_file` → verification → delivery, asserting WS events and tool dispatches.
|
||||
|
||||
Cross-cutting:
|
||||
|
||||
- **R26** (inherited): all configuration via `agentkit.yaml` `plan_exec` section, parsed by `ServerConfig.from_dict`.
|
||||
- **R27** (inherited): each unit ships a minimal self-check (ponytail rule).
|
||||
|
||||
## Key Technical Decisions
|
||||
|
||||
### KTD1: Extract `_build_phase_engine` helper shared by WS and REST
|
||||
|
||||
**Decision**: Refactor `chat.py:1093-1153` (the WS PLAN_EXEC engine construction block) into a private `_build_phase_engine(server_config, agent, tools, ...) -> tuple[ReActEngine, list[Tool]]` helper. Both `_execute_plan_exec_ws` and the new `_execute_plan_exec_rest` call it.
|
||||
|
||||
**Rationale**:
|
||||
- The WS block currently inlines policy construction + engine instantiation + AdvancePhaseTool registration. REST needs the same assembly.
|
||||
- Single source of truth for "how to build a phase-enforced engine" prevents drift between WS and REST paths.
|
||||
- Helper is private (underscore prefix) — not a public API; test access goes through the routes.
|
||||
|
||||
### KTD2: REST PLAN_EXEC returns non-streaming `ChatResult`; SSE streaming is deferred
|
||||
|
||||
**Decision**: `_execute_plan_exec_rest()` returns a regular `SendMessageResponse` (matching the existing REST send_message shape). The `phase_changed`/`phase_violation` events are captured into a `phase_events: list` field on the response payload.
|
||||
|
||||
**Rationale**:
|
||||
- Existing REST `send_message` is non-streaming (chat.py:580). Streaming REST (SSE) is a separate concern owned by the `/api/v1/llm/chat` gateway route (llm_gateway.py), not chat.py.
|
||||
- First version ships parity with existing REST shape; SSE streaming for PLAN_EXEC is a follow-up if users request it.
|
||||
- The phase events list lets REST clients render phase progression after-the-fact (CLI `agentkit task submit` shows them in terminal output).
|
||||
|
||||
### KTD3: `phase_violation` event emitted to WS alongside LLM injection
|
||||
|
||||
**Decision**: In `react.py:_execute_loop`, when `_check_phase_permission` blocks a tool call, the existing structured error is injected to the LLM conversation (unchanged), AND a `phase_violation` event is emitted through the engine's event stream. `chat.py` WS handler forwards it to the client.
|
||||
|
||||
**Rationale**:
|
||||
- Wave 3 returns the violation only to the LLM (gives the model a chance to self-correct by calling `advance_phase`). That stays.
|
||||
- Adding the WS event gives the user visibility into "the LLM tried to call `write_file` in Planning, was rejected, and will retry" — without this, the UI shows the LLM thinking silently which looks like a hang.
|
||||
- Event payload: `{"type": "phase_violation", "data": {"tool": "write_file", "phase": "planning", "hint": "call advance_phase"}}`.
|
||||
|
||||
### KTD4: `bash_command_filter` accepts `Callable[[str], bool] | re.Pattern | None`
|
||||
|
||||
**Decision**: Change `PhasePolicy.bash_command_filter` field type from `dict[PhaseState, re.Pattern | None]` to `dict[PhaseState, Callable[[str], bool] | re.Pattern | None]`. `is_bash_command_allowed` detects callable vs pattern at call time. `default_policy()` injects `ShellTool._is_dangerous` as the callable for PLANNING/VERIFICATION.
|
||||
|
||||
**Rationale**:
|
||||
- `ShellTool._is_dangerous` (shell.py:466) is already battle-tested against `_DANGEROUS_BINARIES`, `_DANGEROUS_BINARY_FLAGS`, `_DANGEROUS_ARG_PATTERNS`, shell-chain operators, and pipe operators. Reusing it eliminates the regex ceiling the ponytail comment named.
|
||||
- The `re.Pattern` form stays for backward compat (config-supplied regex patterns still work).
|
||||
- `PhaseState` enum and `PhasePolicy` API stay stable; only the field type widens.
|
||||
|
||||
**Alternative considered**: Move the filter to `ShellTool` itself (gateway-level). Rejected because phase enforcement is per-step in ReActEngine, not per-shell-call — different lifecycle.
|
||||
|
||||
### KTD5: Phase indicator UI is compact, optional, and degrades gracefully
|
||||
|
||||
**Decision**: Add a `PhaseIndicator.vue` component (badge + progress dots for 4 phases + transient toast for `phase_violation`). Mount it in the chat view header only when the current session has `execution_mode="plan_exec"`; otherwise render nothing.
|
||||
|
||||
**Rationale**:
|
||||
- Most chat sessions are REACT/SKILL_REACT — phase indicator is noise for them. Conditionally render only for PLAN_EXEC sessions.
|
||||
- Compact form (badge + dots) avoids competing with the existing `PlanVisualization.vue` (team mode, different concept — don't unify them).
|
||||
- Toast pattern matches existing `useMessage` from `ant-design-vue` used elsewhere in the frontend.
|
||||
|
||||
## Scope Boundaries
|
||||
|
||||
### In Scope
|
||||
|
||||
- Refactor WS PLAN_EXEC engine construction into `_build_phase_engine` shared helper.
|
||||
- New `_execute_plan_exec_rest` for REST send_message; remove the 501 at chat.py:590-595.
|
||||
- Emit `phase_violation` event from `ReActEngine._execute_loop` through the WS handler.
|
||||
- Frontend `WsServerMessage` union extension + `handleWsMessage` cases + new `PhaseIndicator.vue`.
|
||||
- `PhasePolicy.bash_command_filter` type widening + `default_policy()` wiring to `ShellTool._is_dangerous`.
|
||||
- E2E integration test with scripted LLM mock covering full PLAN_EXEC lifecycle.
|
||||
|
||||
### Out of Scope (Deferred to Follow-Up Work)
|
||||
|
||||
- SSE streaming for REST PLAN_EXEC (KTD2 — non-streaming first; SSE follow-up if requested).
|
||||
- `tree-sitter` integration for symbol extraction (Wave 3 KTD1 upgrade path; Wave 5+ candidate).
|
||||
- Phase persistence across session resume (depends on U7 checkpoint deeper changes).
|
||||
- Phase-aware prompt engineering (per-phase system prompt templates — prompt-engineering concern, not code).
|
||||
- Phase rollback on `Building → Planning` regression (UX/prompt concern; Wave 2 G9 file-level rollback already handles file state).
|
||||
- `config_sync.py` exposure of `plan_exec` to frontend (frontend reads phase events from WS, not config — config exposure only needed if the UI wants to render phase whitelists, which is out of scope).
|
||||
- Recovery/Emergency layer integration with PLAN_EXEC (mutually exclusive by design — chat.py:1093-1095 documents this; integrating would require ReflexionEngine to understand phase state, separate Wave).
|
||||
|
||||
### Outside This Product's Identity
|
||||
|
||||
- Replacing the existing ReAct loop with LangGraph (inherited from brainstorm).
|
||||
- Disc-based file system à la DeerFlow (inherited).
|
||||
- Docker sandbox (inherited; only command-level safety via `bash_command_filter`).
|
||||
|
||||
## Implementation Units
|
||||
|
||||
### U1. Bash filter upgrade — reuse `ShellTool._is_dangerous()` (G6 hardening)
|
||||
|
||||
**Goal**: Widen `PhasePolicy.bash_command_filter` to accept `Callable[[str], bool]` callbacks and wire `ShellTool._is_dangerous` as the default filter for PLANNING/VERIFICATION phases. Eliminate the ponytail ceiling at `core/phase.py:56-60`.
|
||||
|
||||
**Requirements**: R33, R27.
|
||||
|
||||
**Dependencies**: none.
|
||||
|
||||
**Files**:
|
||||
- `src/agentkit/core/phase.py` (modify — widen field type; inject `ShellTool._is_dangerous` in `default_policy()`; update `is_bash_command_allowed` to handle callable).
|
||||
- `tests/unit/test_phase_policy.py` (modify — add cases for `:>file`, `dd of=file`, callable vs pattern).
|
||||
|
||||
**Approach**:
|
||||
- Field type: `bash_command_filter: dict[PhaseState, Callable[[str], bool] | re.Pattern | None]`.
|
||||
- `is_bash_command_allowed(command, phase)`:
|
||||
- `filter = self.bash_command_filter.get(phase)`
|
||||
- `if filter is None: return True`
|
||||
- `if callable(filter): return not filter(command)`
|
||||
- `if isinstance(filter, re.Pattern): return not filter.search(command)`
|
||||
- `default_policy()` replaces `_DEFAULT_BASH_FILTER` regex with `ShellTool._is_dangerous` method reference (bound method, callable).
|
||||
- Keep `_DEFAULT_BASH_FILTER` regex as a module constant for tests and config-supplied patterns; `default_policy()` no longer uses it.
|
||||
- Remove the ponytail comment at `core/phase.py:56-60` (ceiling is closed).
|
||||
|
||||
**Execution note**: characterization-first — test that `default_policy().is_bash_command_allowed("rm -rf /", PLANNING)` still returns False (preserves Wave 3 behavior) before adding new edge-case coverage.
|
||||
|
||||
**Patterns to follow**:
|
||||
- `src/agentkit/core/phase.py:default_policy()` (Wave 3 — same factory pattern).
|
||||
- `src/agentkit/tools/shell.py:_is_dangerous` (Wave 3 — already the canonical safety check).
|
||||
|
||||
**Test scenarios** (covers R33):
|
||||
- **Characterization (Wave 3 preserved)**:
|
||||
- `default_policy().is_bash_command_allowed("rm -rf /", PLANNING)` → False.
|
||||
- `default_policy().is_bash_command_allowed("ls -la", PLANNING)` → True.
|
||||
- `default_policy().is_bash_command_allowed("git status", PLANNING)` → True.
|
||||
- **Happy paths (new ceiling closed)**:
|
||||
- `:>file` in PLANNING → False (was True before — `ShellTool._is_dangerous` catches redirect-to-empty).
|
||||
- `dd of=/dev/sda` in PLANNING → False (was True before — caught by `_DANGEROUS_BINARIES`).
|
||||
- `echo hello > /tmp/x` in PLANNING → False (was True before — `ShellTool` catches `>` redirect).
|
||||
- **Edge cases**:
|
||||
- `re.Pattern` form still works when supplied via config (`whitelist_override`-adjacent — config-supplied regex pattern is honored).
|
||||
- `callable` form takes precedence over `re.Pattern` when both somehow present (defensive — shouldn't happen).
|
||||
- **Error paths**:
|
||||
- Empty command in PLANNING → True (ShellTool separately rejects empty commands at execution time; filter only gates dangerous patterns).
|
||||
- None filter for BUILDING → True (no restriction).
|
||||
|
||||
**Verification**:
|
||||
- `python3 -m pytest tests/unit/test_phase_policy.py -q` passes.
|
||||
- `ruff check src/agentkit/core/phase.py` clean.
|
||||
- Ponytail comment at `core/phase.py:56-60` is removed (ceiling closed, not just documented).
|
||||
|
||||
---
|
||||
|
||||
### U2. Emit `phase_violation` WS event from `ReActEngine`
|
||||
|
||||
**Goal**: When `_check_phase_permission` blocks a tool call, emit a `phase_violation` event through the engine's event stream so `chat.py` WS handler can forward it to the client. Today the violation is only injected back to the LLM (react.py:2196-2203), invisible to the user.
|
||||
|
||||
**Requirements**: R30.
|
||||
|
||||
**Dependencies**: none (independent of U1 — violation emission doesn't depend on filter implementation).
|
||||
|
||||
**Files**:
|
||||
- `src/agentkit/core/react.py` (modify — emit event alongside the existing LLM injection).
|
||||
- `src/agentkit/server/routes/chat.py` (modify — forward `phase_violation` events from `execute_stream` to the WS client).
|
||||
- `tests/unit/test_react_phase_enforcement.py` (modify — assert event emission).
|
||||
- `tests/unit/test_chat_plan_exec_ws.py` (modify — assert WS client receives `phase_violation` event).
|
||||
|
||||
**Approach**:
|
||||
- `ReActEngine.execute_stream` already yields events for `tool_call`/`tool_result`/`thinking`/`token`. Add a new event type `phase_violation` yielded before the structured error is injected to the LLM conversation.
|
||||
- Event payload: `{"type": "phase_violation", "data": {"tool": "<tool_name>", "phase": "<current_phase>", "hint": "call advance_phase"}}`.
|
||||
- `chat.py` WS handler (around chat.py:1218 `async for event in react_engine.execute_stream(...)`) adds an `elif event["type"] == "phase_violation":` branch that `websocket.send_json` the event to the client.
|
||||
- Existing LLM-injection path is unchanged — the LLM still gets the structured error to react to.
|
||||
|
||||
**Execution note**: characterization-first — assert that `phase_policy=None` (no enforcement) yields zero `phase_violation` events (preserves Wave 3 behavior) before adding the positive-path test.
|
||||
|
||||
**Patterns to follow**:
|
||||
- `src/agentkit/core/react.py` existing event emission (e.g., `tool_call` event emission pattern).
|
||||
- `src/agentkit/server/routes/chat.py:1295-1309` `phase_changed` event forwarding (same shape).
|
||||
|
||||
**Test scenarios** (covers R30):
|
||||
- **Characterization (no policy)**:
|
||||
- `ReActEngine(phase_policy=None)` executing a full loop yields zero `phase_violation` events.
|
||||
- **Happy paths**:
|
||||
- PLANNING phase, LLM calls `write_file` → engine yields `phase_violation` event with `tool="write_file"`, `phase="planning"`, `hint="call advance_phase"`.
|
||||
- WS handler forwards `phase_violation` to client connection (assert `websocket.send_json` called with `{"type": "phase_violation", ...}`).
|
||||
- LLM still receives the structured error in conversation (regression — Wave 3 behavior preserved).
|
||||
- **Edge cases**:
|
||||
- Multiple violations in a row (LLM retries same tool) → multiple `phase_violation` events emitted (one per attempt).
|
||||
- Violation followed by `advance_phase` followed by same tool now allowed → exactly one `phase_violation` event, then a `tool_call` event.
|
||||
- **Error paths**:
|
||||
- Phase policy construction failure → existing 500 error path, no `phase_violation` emitted (engine not constructed).
|
||||
- **Integration scenarios**:
|
||||
- Full WS path: client connects, sends PLAN_EXEC request, LLM mock emits `write_file` in PLANNING → client receives `phase_violation` event before any `tool_call` event.
|
||||
|
||||
**Verification**:
|
||||
- `python3 -m pytest tests/unit/test_react_phase_enforcement.py tests/unit/test_chat_plan_exec_ws.py -q` passes.
|
||||
- `ruff check src/agentkit/core/react.py src/agentkit/server/routes/chat.py` clean.
|
||||
|
||||
---
|
||||
|
||||
### U3. Refactor `_build_phase_engine` helper + REST PLAN_EXEC wiring
|
||||
|
||||
**Goal**: Extract the WS PLAN_EXEC engine construction (chat.py:1093-1153) into a private `_build_phase_engine(server_config, agent, tools, ...) -> tuple[ReActEngine, list[Tool]]` helper. Add `_execute_plan_exec_rest()` for REST `send_message`; replace the 501 at chat.py:590-595.
|
||||
|
||||
**Requirements**: R28, R29, R26.
|
||||
|
||||
**Dependencies**: U1 (uses the hardened `default_policy()`).
|
||||
|
||||
**Files**:
|
||||
- `src/agentkit/server/routes/chat.py` (modify — extract helper; add REST handler; remove 501).
|
||||
- `tests/unit/test_chat_plan_exec_ws.py` (modify — add REST PLAN_EXEC test cases).
|
||||
- `tests/unit/test_chat_rest_plan_exec.py` (new — REST-specific coverage).
|
||||
|
||||
**Approach**:
|
||||
- New private `_build_phase_engine(server_config, agent, tools, system_prompt, model) -> tuple[ReActEngine, list[Tool]]`:
|
||||
1. Read `server_config.plan_exec` (default `{}`).
|
||||
2. If `enabled is False`, return `(None, tools)` (caller falls back to REACT).
|
||||
3. Build `PhasePolicy` via `policy_from_config`; on failure or None, fall back to `default_policy()`.
|
||||
4. Construct `ReActEngine(..., phase_policy=policy)`.
|
||||
5. Register `AdvancePhaseTool` bound to the engine; return `(engine, tools + [advance_phase])`.
|
||||
- WS path: `_execute_plan_exec_ws` calls `_build_phase_engine`; if engine is None, falls back to REACT (existing behavior at chat.py:1101-1107).
|
||||
- REST path: `_execute_plan_exec_rest(request, session_id, ...)`:
|
||||
1. Calls `_build_phase_engine`.
|
||||
2. If engine is None, delegates to `execute_with_fallback_chain` (REST keeps fallback chain for non-PLAN_EXEC).
|
||||
3. Otherwise calls `engine.execute(...)` (non-streaming, single-shot — matches existing REST send_message shape).
|
||||
4. Collects `phase_changed`/`phase_violation` events into a `phase_events: list[dict]` field on the response payload.
|
||||
5. Returns `SendMessageResponse` extended with optional `phase_events` field.
|
||||
- Replace chat.py:590-595 with a branch: if `routing.execution_mode == PLAN_EXEC`, call `_execute_plan_exec_rest`; else continue with existing fallback chain.
|
||||
- `SendMessageResponse` model gains an optional `phase_events: list[dict] | None = None` field (default None keeps backward compat for non-PLAN_EXEC responses).
|
||||
|
||||
**Execution note**: characterization-first — assert that REST send_message with `execution_mode="react"` (or None) still goes through `execute_with_fallback_chain` (Wave 2 behavior unchanged) before adding PLAN_EXEC branch.
|
||||
|
||||
**Patterns to follow**:
|
||||
- `src/agentkit/server/routes/chat.py:1093-1153` (existing WS PLAN_EXEC block — the code being extracted).
|
||||
- `src/agentkit/server/_fallback_chain.py:execute_with_fallback_chain` (Wave 2 — REST non-PLAN_EXEC path stays here).
|
||||
|
||||
**Test scenarios** (covers R28, R29):
|
||||
- **Characterization (REST non-PLAN_EXEC preserved)**:
|
||||
- REST `send_message` with `execution_mode="react"` → calls `execute_with_fallback_chain` (Wave 2 path unchanged).
|
||||
- REST `send_message` with `execution_mode=None` → defaults to REACT, fallback chain applies.
|
||||
- **Happy paths**:
|
||||
- REST `send_message` with `execution_mode="plan_exec"` → returns 200 (not 501).
|
||||
- Response includes `phase_events: list` with at least one `phase_changed` entry when the engine transitions.
|
||||
- REST with empty `plan_exec` config → uses `default_policy()` (KTD5 default whitelist).
|
||||
- **Edge cases**:
|
||||
- REST with `plan_exec.enabled=False` → falls back to REACT, response has `phase_events=None`.
|
||||
- REST with bad `plan_exec` config (invalid phase name) → 500 with error message naming the bad value.
|
||||
- REST PLAN_EXEC with phase violation → `phase_events` includes a `phase_violation` entry.
|
||||
- **Error paths**:
|
||||
- REST PLAN_EXEC when session is closed → 400 (existing path, no change).
|
||||
- REST PLAN_EXEC with non-existent session → 404 (existing path).
|
||||
- **Integration scenarios**:
|
||||
- REST PLAN_EXEC bypasses fallback chain: assert `execute_with_fallback_chain` is NOT called when `execution_mode="plan_exec"` (mutual exclusion per R29).
|
||||
|
||||
**Verification**:
|
||||
- `python3 -m pytest tests/unit/test_chat_plan_exec_ws.py tests/unit/test_chat_rest_plan_exec.py -q` passes.
|
||||
- `ruff check src/agentkit/server/routes/chat.py` clean.
|
||||
- The 501 at chat.py:590-595 is removed.
|
||||
|
||||
---
|
||||
|
||||
### U4. Frontend phase event pipeline + `PhaseIndicator.vue`
|
||||
|
||||
**Goal**: Extend `WsServerMessage` union with `phase_changed` and `phase_violation` event types; add `handleWsMessage` cases that update a phase state slice; add a compact `PhaseIndicator.vue` component mounted only for PLAN_EXEC sessions.
|
||||
|
||||
**Requirements**: R31, R32.
|
||||
|
||||
**Dependencies**: U2 (frontend renders `phase_violation` events emitted by backend).
|
||||
|
||||
**Files**:
|
||||
- `src/agentkit/server/frontend/src/api/types.ts` (modify — extend `WsServerMessage` union).
|
||||
- `src/agentkit/server/frontend/src/stores/chat.ts` (modify — add `phase` state slice; add cases in `handleWsMessage`).
|
||||
- `src/agentkit/server/frontend/src/components/PhaseIndicator.vue` (new — badge + dots + toast).
|
||||
- `src/agentkit/server/frontend/src/views/AgentChatView.vue` (modify — mount `PhaseIndicator` conditionally).
|
||||
- `src/agentkit/server/frontend/tests/unit/PhaseIndicator.spec.ts` (new — component test).
|
||||
- `src/agentkit/server/frontend/src/api/types.ts` (verify — `PlanExecutionMode` type already covers `"plan_exec"`).
|
||||
|
||||
**Approach**:
|
||||
- `WsServerMessage` union gains two branches: `{ type: "phase_changed"; data: { phase: string; previous: string } }` and `{ type: "phase_violation"; data: { tool: string; phase: string; hint: string } }`.
|
||||
- `chat.ts` Pinia store gains: `currentPhase: Ref<string | null>`, `phaseViolations: Ref<PhaseViolation[]>`, `isPlanExec: ComputedRef<boolean>` (derived from session's `execution_mode`).
|
||||
- `handleWsMessage` adds `case "phase_changed": currentPhase.value = data.phase;` and `case "phase_violation": phaseViolations.value.push(data);` (capped at last 5 to bound memory).
|
||||
- `PhaseIndicator.vue`:
|
||||
- 4 dots representing PLANNING/BUILDING/VERIFICATION/DELIVERY; current phase highlighted.
|
||||
- On `phase_violation`, show an `ant-design-vue` `message.warning(...)` toast with the violation hint.
|
||||
- Renders nothing when `!isPlanExec` (graceful degradation).
|
||||
- `AgentChatView.vue` mounts `<PhaseIndicator />` in the chat header slot, conditional on `chatStore.isPlanExec`.
|
||||
|
||||
**Execution note**: characterization-first — assert that `handleWsMessage` with `data.type="token"` (existing) still updates message content unchanged, before adding new cases.
|
||||
|
||||
**Patterns to follow**:
|
||||
- `src/agentkit/server/frontend/src/stores/chat.ts:1325-1391` (existing team event handling — `phase_started`/`phase_completed` cases shape the new cases).
|
||||
- `src/agentkit/server/frontend/src/components/PlanVisualization.vue` (existing team mode component — different domain but same "compact badge + state" pattern).
|
||||
|
||||
**Test scenarios** (covers R31, R32):
|
||||
- **Characterization (existing events preserved)**:
|
||||
- `handleWsMessage({type: "token", data: ...})` still appends to message content.
|
||||
- `handleWsMessage({type: "team_formed", ...})` still routes to team store.
|
||||
- **Happy paths**:
|
||||
- `handleWsMessage({type: "phase_changed", data: {phase: "building", previous: "planning"}})` → `currentPhase.value === "building"`.
|
||||
- `handleWsMessage({type: "phase_violation", data: {tool: "write_file", phase: "planning", hint: "..."}})` → `phaseViolations.value` length increases by 1.
|
||||
- `PhaseIndicator.vue` with `currentPhase="building"` → renders 4 dots with the 2nd highlighted.
|
||||
- **Edge cases**:
|
||||
- `PhaseIndicator.vue` with `isPlanExec=false` → renders nothing (returns `null` or empty `<template>`).
|
||||
- `phaseViolations` capped at 5 entries (6th violation pushes oldest out).
|
||||
- `phase_changed` event with `previous=""` (initial transition) → no error, `currentPhase` updates.
|
||||
- **Integration scenarios**:
|
||||
- Full mount: `<PhaseIndicator />` mounted in `AgentChatView.vue` with `isPlanExec=true` and `currentPhase="planning"` → renders correctly; `message.warning` toast appears when `phase_violation` received.
|
||||
|
||||
**Verification**:
|
||||
- `cd src/agentkit/server/frontend && npm run typecheck` clean.
|
||||
- `npm run test:unit -- PhaseIndicator` passes.
|
||||
- `npm run lint` clean.
|
||||
|
||||
---
|
||||
|
||||
### U5. E2E integration test for full PLAN_EXEC lifecycle
|
||||
|
||||
**Goal**: A single E2E test that exercises the full PLAN_EXEC path through a scripted LLM mock: PLANNING (search) → `advance_phase` → BUILDING (`write_file`) → `advance_phase` → VERIFICATION (`shell` with `pytest`) → `advance_phase` → DELIVERY (final answer). Asserts WS events sequence, phase transitions, tool dispatches, and `phase_violation` rejection when LLM attempts out-of-phase tool.
|
||||
|
||||
**Requirements**: R34, R27.
|
||||
|
||||
**Dependencies**: U1, U2, U3, U4 (all backend pieces must be in place).
|
||||
|
||||
**Files**:
|
||||
- `tests/integration/test_plan_exec_e2e.py` (new).
|
||||
|
||||
**Approach**:
|
||||
- Mock LLM gateway: returns scripted responses in sequence (deterministic, no real API call):
|
||||
1. `search` tool call (PLANNING-allowed) → tool dispatched.
|
||||
2. `advance_phase` tool call → `phase_changed` event emitted.
|
||||
3. `write_file` tool call (BUILDING-allowed) → tool dispatched.
|
||||
4. `advance_phase` tool call → `phase_changed` event emitted.
|
||||
5. `shell` tool call with `pytest` (VERIFICATION-allowed) → tool dispatched.
|
||||
6. `advance_phase` tool call → `phase_changed` event emitted.
|
||||
7. Final answer text.
|
||||
- Negative path: insert an out-of-phase `write_file` call in step 1 (PLANNING) → assert `phase_violation` event emitted, tool NOT dispatched, LLM receives structured error.
|
||||
- Test asserts:
|
||||
- WS event sequence includes exactly 3 `phase_changed` events (planning→building, building→verification, verification→delivery).
|
||||
- Exactly 1 `phase_violation` event (in the negative path).
|
||||
- Tool dispatch count matches allowed tool calls.
|
||||
- Final `final_answer` event received.
|
||||
|
||||
**Execution note**: This is a characterization test for the wired-up system, not a unit test. Mock the LLM gateway at the `LLMGateway` boundary; use real `ReActEngine`, real `PhasePolicy`, real WS handler (or a WS test client).
|
||||
|
||||
**Patterns to follow**:
|
||||
- `tests/unit/test_chat_plan_exec_ws.py` (Wave 3 — same WS test client pattern).
|
||||
- `tests/integration/test_api_coverage.py` (existing — integration test patterns in the repo).
|
||||
|
||||
**Test scenarios** (covers R34):
|
||||
- **Happy path (full lifecycle)**:
|
||||
- Scripted LLM completes all 4 phases in order → 3 `phase_changed` events, 3 `advance_phase` tool calls dispatched, allowed tools dispatched in each phase, `final_answer` event received.
|
||||
- **Negative path (violation then recovery)**:
|
||||
- LLM attempts `write_file` in PLANNING → `phase_violation` event emitted, `write_file` NOT dispatched (assert `write_file.execute` call count is 0 at this point), LLM receives structured error in conversation.
|
||||
- LLM then calls `advance_phase` → transitions to BUILDING, `write_file` now dispatched successfully.
|
||||
- **Edge cases**:
|
||||
- `plan_exec.enabled=False` config → test asserts path falls back to REACT (no phase events emitted).
|
||||
- LLM never calls `advance_phase` and `auto_advance_after_steps=2` → phase auto-advances after 2 steps (asserts safety net).
|
||||
- **Error paths**:
|
||||
- LLM raises (LLM call fails) → existing error event emitted; phase state unchanged.
|
||||
|
||||
**Verification**:
|
||||
- `python3 -m pytest tests/integration/test_plan_exec_e2e.py -v` passes.
|
||||
- Test runs without real LLM API call (mocked).
|
||||
|
||||
---
|
||||
|
||||
## Risks & Dependencies
|
||||
|
||||
### Risks
|
||||
|
||||
1. **REST non-streaming shape mismatch (medium)**: REST clients expecting SSE for PLAN_EXEC will not get streaming phase events; they get a list after-the-fact. Mitigation: KTD2 documents this as intentional first version; SSE follow-up tracked as deferred.
|
||||
2. **Frontend state slice bloat (low)**: Adding `currentPhase` + `phaseViolations` to the chat store adds reactive state. Mitigation: `phaseViolations` capped at 5; `currentPhase` is a single string. Negligible memory.
|
||||
3. **`ShellTool._is_dangerous` import cycle (low)**: `core/phase.py` importing from `tools/shell.py` could create a cycle if `shell.py` imports from `core/`. Mitigation: verify import direction at implementation time; if cycle, lift `_is_dangerous` to a shared `tools/_safety.py` module (one-function extraction).
|
||||
4. **E2E test flakiness from mock sequencing (low)**: Scripted LLM mock must match exact sequence. Mitigation: index-based mock (call N returns response N) rather than state-based; deterministic.
|
||||
5. **Backward compat for `re.Pattern` config (low)**: Existing config-supplied regex patterns must still work after the type widening. Mitigation: KTD4 preserves `re.Pattern` branch in `is_bash_command_allowed`; characterization test in U1.
|
||||
|
||||
### Dependencies
|
||||
|
||||
- Wave 3 (PR #6 merged) — `PhasePolicy`, `PhaseState`, `default_policy()`, `AdvancePhaseTool`, WS PLAN_EXEC handler all in place.
|
||||
- Wave 2 (PR #5 merged) — `execute_with_fallback_chain` for REST non-PLAN_EXEC path.
|
||||
- No external library dependencies.
|
||||
|
||||
## System-Wide Impact
|
||||
|
||||
- **REST `send_message` callers**: gain PLAN_EXEC support; existing REACT/SKILL_REACT callers unchanged (fallback chain preserved).
|
||||
- **WebSocket clients**: gain `phase_violation` event type; existing event types unchanged. Vue frontend renders the new events; other WS clients (CLI) ignore them silently.
|
||||
- **`agentkit.yaml`**: no new config section (Wave 3 `plan_exec` section is reused; Wave 4 only changes how the policy is constructed internally).
|
||||
- **`PhasePolicy` API**: `bash_command_filter` field type widens (`re.Pattern | None` → `Callable[[str], bool] | re.Pattern | None`); `is_bash_command_allowed` signature unchanged. Backward compatible.
|
||||
- **Frontend chat store**: gains `currentPhase` + `phaseViolations` reactive state; existing state unchanged.
|
||||
|
||||
## Sources & Research
|
||||
|
||||
- Origin brainstorm: `docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md` (Wave 3 deferred items + KTD6/KTD7).
|
||||
- Wave 3 plan: `docs/plans/2026-06-29-004-feat-agent-wave3-strategic-plan.md` (Out of Scope section enumerates the deferrals Wave 4 executes).
|
||||
- Wave 3 code review: `/tmp/compound-engineering/ce-code-review/20260630-015548-c44a5245/` (ponytail ceiling at `core/phase.py:56-60` flagged by correctness+reliability reviewers).
|
||||
- Codebase research (2026-06-30): `frontend/src/stores/chat.ts:800`, `frontend/src/api/types.ts:135`, `src/agentkit/server/routes/chat.py:580,1093-1153`, `src/agentkit/tools/shell.py:466`, `src/agentkit/core/react.py:2196-2203`, `src/agentkit/server/_fallback_chain.py:90`.
|
||||
|
||||
## Deferred to Implementation
|
||||
|
||||
- Exact `SendMessageResponse` schema for `phase_events: list[dict]` — design above gives the shape; implementer finalizes field names based on existing response models.
|
||||
- `PhaseIndicator.vue` visual design (dot vs pill vs progress bar) — implementer picks based on existing Ant Design Vue component inventory.
|
||||
- Mock LLM response sequence length in U5 — implementer sizes based on whether the test asserts every step or samples key transitions.
|
||||
- Whether `_build_phase_engine` returns `None` or raises on opt-out (`enabled=False`) — design above returns None and caller falls back; implementer may switch to explicit enum return if cleaner.
|
||||
Loading…
Reference in New Issue