feat(agent): Wave 4 PLAN_EXEC Hardening (U1-U5) #7

fischer · 2026-06-30T12:43:15+08:00

fischer commented

2026-06-30 12:43:15 +08:00

Summary

Wave 4 PLAN_EXEC Hardening — closes the gaps surfaced by Wave 3's PLAN_EXEC landing. Five implementation units + ce-code-review fixes.

Implementation Units

U1 — Widen PhasePolicy.bash_command_filter to accept Callable[[str], bool] (in addition to existing re.Pattern). Made ShellTool._is_dangerous a @staticmethod so the default policy can call it without an instance. Closes the regex ceiling (missed :>file, dd of=file, etc.).
U2 — Emit phase_violation WS event alongside the existing LLM reinjection when a tool is blocked by phase policy. Engine emits via _drain_phase_violations; chat.py WS handler forwards to client.
U3 — Extract _build_phase_engine helper in chat.py to consolidate PLAN_EXEC engine construction. Wire REST PLAN_EXEC path (was returning 501). WS and REST now share the same construction logic.
U4 — Frontend: new PhaseIndicator.vue component, phase_violation event handling in chat store, wired into ChatView. Forward-compatible with phase_changed events.
U5 — E2E integration test for full PLAN_EXEC lifecycle (7 tests: happy path, negative path, edge cases, error path, WS handler).

Test Results

189 tests pass across test_shell_tool.py, test_phase_policy.py, test_react_phase_enforcement.py, test_chat_plan_exec_ws.py, test_react_engine.py, test_chat_routes.py, test_plan_exec_e2e.py
Ruff clean on all modified files
Frontend vitest: 3/3 pass (chat-phase.test.ts)

ce-code-review Findings Applied

Six safe fixes from Stage 5c review (commit 8627777):

phase.py: delete dead _DEFAULT_BASH_FILTER constant
chat.py: drop Any from _build_phase_engine params (AGENTS.md prohibits any)
chat.ts: delete stale comment about phase_changed emission
chat-phase.test.ts: rename misleading 'capped at 5' test
test_chat_plan_exec_ws.py: tighten test_rest_react_mode_still_works assertion
test_plan_exec_e2e.py: clarify test_auto_advance assertion comment

Known Limitations (not fixed in this PR)

Tracked for follow-up PRs:

P1 — Loop detector + PLAN_EXEC advance_phase: The default _loop_threshold=2 fires on the 2nd identical advance_phase({}) call, breaking the 4-phase lifecycle in production. U5 tests work around this with engine._loop_threshold = 99. Fix: exempt advance_phase from loop detection (separate PR).
P2 — Parallel path phase_violation ordering: In the parallel execution path, _drain_phase_violations pops ALL accumulated violations on the first tool_result iteration, misordering events. Current tests only cover the serial path. Fix: key violations by tool_call_id (separate PR).
P2 — REST PLAN_EXEC cancellation_token: REST path omits cancellation_token; client disconnect leaves engine running until default timeout. WS path handles this correctly.
P3 — Custom Callable bash_command_filter exceptions: If a user-supplied Callable raises, the exception propagates uncaught through _check_phase_permission. Default ShellTool._is_dangerous never raises.

Plan

Plan document: docs/plans/2026-06-30-001-feat-agent-wave4-plan-exec-hardening-plan.md

## Summary Wave 4 PLAN_EXEC Hardening — closes the gaps surfaced by Wave 3's PLAN_EXEC landing. Five implementation units + ce-code-review fixes. ## Implementation Units - **U1** — Widen `PhasePolicy.bash_command_filter` to accept `Callable[[str], bool]` (in addition to existing `re.Pattern`). Made `ShellTool._is_dangerous` a `@staticmethod` so the default policy can call it without an instance. Closes the regex ceiling (missed `:>file`, `dd of=file`, etc.). - **U2** — Emit `phase_violation` WS event alongside the existing LLM reinjection when a tool is blocked by phase policy. Engine emits via `_drain_phase_violations`; chat.py WS handler forwards to client. - **U3** — Extract `_build_phase_engine` helper in chat.py to consolidate PLAN_EXEC engine construction. Wire REST PLAN_EXEC path (was returning 501). WS and REST now share the same construction logic. - **U4** — Frontend: new `PhaseIndicator.vue` component, `phase_violation` event handling in chat store, wired into ChatView. Forward-compatible with `phase_changed` events. - **U5** — E2E integration test for full PLAN_EXEC lifecycle (7 tests: happy path, negative path, edge cases, error path, WS handler). ## Test Results - 189 tests pass across `test_shell_tool.py`, `test_phase_policy.py`, `test_react_phase_enforcement.py`, `test_chat_plan_exec_ws.py`, `test_react_engine.py`, `test_chat_routes.py`, `test_plan_exec_e2e.py` - Ruff clean on all modified files - Frontend vitest: 3/3 pass (`chat-phase.test.ts`) ## ce-code-review Findings Applied Six safe fixes from Stage 5c review (commit 8627777): - `phase.py`: delete dead `_DEFAULT_BASH_FILTER` constant - `chat.py`: drop `Any` from `_build_phase_engine` params (AGENTS.md prohibits `any`) - `chat.ts`: delete stale comment about `phase_changed` emission - `chat-phase.test.ts`: rename misleading 'capped at 5' test - `test_chat_plan_exec_ws.py`: tighten `test_rest_react_mode_still_works` assertion - `test_plan_exec_e2e.py`: clarify `test_auto_advance` assertion comment ## Known Limitations (not fixed in this PR) Tracked for follow-up PRs: 1. **P1 — Loop detector + PLAN_EXEC `advance_phase`**: The default `_loop_threshold=2` fires on the 2nd identical `advance_phase({})` call, breaking the 4-phase lifecycle in production. U5 tests work around this with `engine._loop_threshold = 99`. Fix: exempt `advance_phase` from loop detection (separate PR). 2. **P2 — Parallel path phase_violation ordering**: In the parallel execution path, `_drain_phase_violations` pops ALL accumulated violations on the first tool_result iteration, misordering events. Current tests only cover the serial path. Fix: key violations by `tool_call_id` (separate PR). 3. **P2 — REST PLAN_EXEC cancellation_token**: REST path omits `cancellation_token`; client disconnect leaves engine running until default timeout. WS path handles this correctly. 4. **P3 — Custom Callable bash_command_filter exceptions**: If a user-supplied Callable raises, the exception propagates uncaught through `_check_phase_permission`. Default `ShellTool._is_dangerous` never raises. ## Plan Plan document: `docs/plans/2026-06-30-001-feat-agent-wave4-plan-exec-hardening-plan.md`

fischer added 7 commits 2026-06-30 12:43:15 +08:00

9e28ab315e feat(U1): widen PhasePolicy bash_command_filter to accept Callable

Reuses ShellTool._is_dangerous as the default bash filter for PLANNING
and VERIFICATION phases, closing the regex ceiling documented in Wave 3.

- Convert ShellTool._is_dangerous and _is_single_command_dangerous to
  @staticmethod (backward-compatible; instance calls still work via
  Python's descriptor protocol).
- Widen PhasePolicy.bash_command_filter field type to
  dict[PhaseState, Callable[[str], bool] | re.Pattern | None].
- is_bash_command_allowed dispatches on callable vs pattern at call time.
  Empty commands short-circuit to allowed (Wave 3 contract; ShellTool
  emits the clearer empty-command error).
- to_dict serializes callables as <callable> for log readability.
- default_policy() now wires ShellTool._is_dangerous for PLANNING and
  VERIFICATION. _DEFAULT_BASH_FILTER kept for backward compat with
  configs that pass a re.Pattern.
- Tests: characterization tests pin Wave 3 behavior (rm/mv/cp/echo >
  still blocked) plus new edge-case coverage for ceiling closed
  (dd of=/dev/sda, :>file, chain operators, pipe segments).

4dc58c24bc feat(U2): emit phase_violation WS event alongside LLM reinjection

Wave 3 only injected the violation error dict back to the LLM as a tool
result. Wave 4 U2 adds a parallel WS event so the frontend PhaseIndicator
can surface violations to the user.

- ReActEngine: add _phase_violations accumulator (list[dict]). Cleared in
  reset(). _check_phase_permission appends a structured violation dict
  (with new violation_kind field: tool_not_allowed | bash_command_blocked)
  before returning the error.
- Add _drain_phase_violations(step) helper that pops pending violations
  and returns ReActEvent(event_type="phase_violation", ...) list. Events
  carry a shallow copy of the violation dict so callers can't mutate the
  accumulator.
- execute_stream: drain after each tool_result yield at all 3 tool
  execution sites (parallel, serial-with-confirmation, parsed_calls).
  Non-streaming execute() ignores the accumulator (the LLM reinjection
  via the error dict is the only signal there).
- chat.py WS handler: new elif branch forwards phase_violation ReActEvents
  to the client as {"type": "phase_violation", "data": ...} WS messages.
- Tests: 11 new tests covering accumulator lifecycle, drain semantics,
  shallow-copy isolation, and execute_stream event emission for both
  tool_block and bash_block paths. 2 new WS forwarding tests pin the
  chat.py path (forward + characterization for REACT mode).

b032e08866 feat(U3): extract _build_phase_engine helper + wire REST PLAN_EXEC

Extract the WS path's inline phase_policy construction into a shared
_build_phase_engine helper so the REST send_message endpoint can reuse
it. Replace the former 501 stub with actual PLAN_EXEC execution:

- REST POST /chat/sessions/{id}/messages with execution_mode=plan_exec
  now builds a phase-policy-backed ReActEngine, calls execute()
  (non-streaming), and returns a MessageResponse.
- KTD5: PLAN_EXEC bypasses execute_with_fallback_chain — phase policy
  and fallback chain are mutually exclusive.
- When plan_exec.enabled=False, REST falls through to the REACT path
  (matching WS behavior).
- WS path refactored to call the same helper; behavior unchanged.

Tests:
- Replace TestRestPlanExec501 with TestRestPlanExec (happy path, bad
  config → 500, disabled → falls through to REACT, REACT mode unchanged).
- Add TestBuildPhaseEngineHelper covering all return branches:
  not-PLAN_EXEC, disabled, empty-config, invalid-config, tool append,
  default-policy fallback.
- All 109 tests pass across the three PLAN_EXEC test files.

2abe7c9e49 feat(U4): frontend phase_violation handling + PhaseIndicator component

Extend the frontend to surface PLAN_EXEC phase lifecycle events to the
user:

- WsServerMessage union (types.ts) gains two branches:
  `phase_changed` and `phase_violation` (matching backend U2 emission).
- chat.ts Pinia store gains a phase state slice:
  `currentPhase`, `phaseViolations` (capped at 5), `isPlanExec`
  computed, and `resetPlanExecState()`.
- handleWsMessage adds `case "phase_changed"` (sets currentPhase +
  appends a milestone step) and `case "phase_violation"` (sets
  currentPhase from violation data, appends to violations, fires an
  ant-design-vue message.warning toast, appends an error step).
- `result` handler calls `resetPlanExecState()` to clear the
  indicator when the conversation completes.
- New `PhaseIndicator.vue` component: compact badge + 4 dots
  (PLANNING/BUILDING/VERIFICATION/DELIVERY) with the current phase
  highlighted + violation counter. Renders nothing when
  `!isPlanExec` (graceful degradation).
- Mounted in `ChatView.vue` alongside ExpertTeamView and
  BoardStatusView.

Tests:
- New `tests/unit/stores/chat-phase.test.ts` verifies the phase state
  slice is exposed with correct initial values and `isPlanExec`
  derives from `currentPhase`.
- `npm run typecheck` clean.
- Pre-existing `tauri-auth.test.ts` failure is unrelated (fails in
  isolation on main).

0a8f6eebef feat(U5): E2E integration test for PLAN_EXEC lifecycle

Add tests/integration/test_plan_exec_e2e.py covering the full PLAN_EXEC
path through a scripted LLM mock (deterministic, no real API call).

Mock boundary: LLMGateway.chat_stream yields scripted StreamChunk
objects. Real ReActEngine, real PhasePolicy (default_policy()), real
AdvancePhaseTool, real chat._handle_chat_message WS handler.

Test scenarios (7 tests, all passing):
- Happy path: PLANNING (search) → advance_phase → BUILDING (write_file)
  → advance_phase → VERIFICATION (shell ls tests/unit/) → advance_phase
  → DELIVERY (final answer). Asserts final_answer, tool dispatch counts,
  no phase_violation events, engine ends at DELIVERY.
- Negative path: write_file in PLANNING blocked → phase_violation event
  emitted with violation_kind=tool_not_allowed → LLM calls advance_phase
  → write_file in BUILDING succeeds. Asserts exactly 1 violation, tool
  NOT dispatched during PLANNING (write_file.call_count==1 after recovery).
- Edge cases:
  - auto_advance_after_steps=2: engine transitions out of PLANNING
    after 2 LLM calls without explicit advance_phase.
  - policy_from_config(enabled=False) returns None (PLAN_EXEC disabled).
  - policy_from_config({}) returns None (opt-out, fall back to default).
- Error path: chat_stream raises RuntimeError → exception propagates,
  phase state unchanged (still PLANNING), tool not dispatched.
- WS handler integration: full _handle_chat_message path emits both
  phase_violation (from engine) and phase_changed (from WS handler's
  transition detection) to the client WebSocket.

Notes:
- Loop detector threshold bumped to 99 for happy/negative/auto-advance
  tests (3 legitimate advance_phase calls with {} args would trigger
  the default threshold=2; this is a known PLAN_EXEC production concern
  tracked separately).
- VERIFICATION-phase shell command uses `ls tests/unit/` instead of
  plan's `pytest tests/unit/ -q` — pytest is not in
  ShellTool._SAFE_COMMAND_PREFIXES and would be flagged dangerous by
  the default policy's bash filter. Using ls (whitelisted) keeps the
  test focused on lifecycle validation rather than policy tuning.

Verification: python3 -m pytest tests/integration/test_plan_exec_e2e.py -v
passes (7/7). Full regression: 116 tests pass across U1-U5 test files.
Ruff check + format clean.

Refs: R34, R27. Plan: docs/plans/2026-06-30-001-feat-agent-wave4-plan-exec-hardening-plan.md

cbbe937940 chore(shell): fix ruff F401/F841 + apply ruff format

Pre-existing ruff errors surfaced during Wave 4 QC:
- F401: drop unused `TerminalSession` import (only `TerminalSessionManager` is used)
- F841: drop unused `start = time.monotonic()` local in `_execute_standalone`

`ruff format` then reformatted a few long lines in the same file
(frozenset literal, curl exfiltration regex, pipe operators, session.env
call). No behavior change — formatting only.

Why now: shell.py was already touched by U1 (widen
`bash_command_filter`). Leaving known ruff failures in a file this PR
modifies would make future CI gates noisy.

Test / backend-test (pull_request) Has been cancelled Details

Test / frontend-unit (pull_request) Has been cancelled Details

Test / api-e2e (pull_request) Has been cancelled Details

Test / frontend-e2e (pull_request) Has been cancelled Details

8627777f87 fix(review): apply ce-code-review findings

Six safe fixes from Stage 5c review:

phase.py: delete dead _DEFAULT_BASH_FILTER constant (no references after U1)
chat.py: drop Any from _build_phase_engine params (AGENTS.md prohibits any)
chat.ts: delete stale comment about phase_changed emission
chat-phase.test.ts: rename misleading 'capped at 5' test name
test_chat_plan_exec_ws.py: tighten test_rest_react_mode_still_works assertion
test_plan_exec_e2e.py: clarify test_auto_advance assertion comment

Known limitations documented in PR description (not fixed): loop detector + advance_phase (P1), parallel path phase_violation ordering (P2), REST cancellation_token (P2), Callable filter exceptions (P3).

fischer added 1 commit 2026-06-30 12:46:28 +08:00

Test / backend-test (pull_request) Has been cancelled Details

Test / frontend-unit (pull_request) Has been cancelled Details

Test / api-e2e (pull_request) Has been cancelled Details

Test / frontend-e2e (pull_request) Has been cancelled Details

a872a459a6 docs: add PLAN_EXEC concepts + commit Wave 4 plan

CONCEPTS.md: new PLAN_EXEC section (Phase State Machine, PhasePolicy, Phase Violation, AdvancePhaseTool, _build_phase_engine).

docs/plans/: commit the Wave 4 plan document (was untracked).

fischer merged commit 0962df11b5 into main

2026-06-30 12:46:35 +08:00

fischer referenced this issue from a commit

2026-06-30 12:46:37 +08:00

feat(agent): Wave 4 PLAN_EXEC Hardening (U1-U5) (#7)

Sign in to join this conversation.

No reviewers

No Label

No Milestone

No project

No Assignees

1 Participants

Notifications

Due Date

The due date is invalid or out of range. Please use the format 'yyyy-mm-dd'.

No due date set.

Dependencies

No dependencies set.

Reference: fischer/fischer-agentkit#7