Add tests/integration/test_plan_exec_e2e.py covering the full PLAN_EXEC
path through a scripted LLM mock (deterministic, no real API call).
Mock boundary: LLMGateway.chat_stream yields scripted StreamChunk
objects. Real ReActEngine, real PhasePolicy (default_policy()), real
AdvancePhaseTool, real chat._handle_chat_message WS handler.
Test scenarios (7 tests, all passing):
- Happy path: PLANNING (search) → advance_phase → BUILDING (write_file)
→ advance_phase → VERIFICATION (shell ls tests/unit/) → advance_phase
→ DELIVERY (final answer). Asserts final_answer, tool dispatch counts,
no phase_violation events, engine ends at DELIVERY.
- Negative path: write_file in PLANNING blocked → phase_violation event
emitted with violation_kind=tool_not_allowed → LLM calls advance_phase
→ write_file in BUILDING succeeds. Asserts exactly 1 violation, tool
NOT dispatched during PLANNING (write_file.call_count==1 after recovery).
- Edge cases:
- auto_advance_after_steps=2: engine transitions out of PLANNING
after 2 LLM calls without explicit advance_phase.
- policy_from_config(enabled=False) returns None (PLAN_EXEC disabled).
- policy_from_config({}) returns None (opt-out, fall back to default).
- Error path: chat_stream raises RuntimeError → exception propagates,
phase state unchanged (still PLANNING), tool not dispatched.
- WS handler integration: full _handle_chat_message path emits both
phase_violation (from engine) and phase_changed (from WS handler's
transition detection) to the client WebSocket.
Notes:
- Loop detector threshold bumped to 99 for happy/negative/auto-advance
tests (3 legitimate advance_phase calls with {} args would trigger
the default threshold=2; this is a known PLAN_EXEC production concern
tracked separately).
- VERIFICATION-phase shell command uses `ls tests/unit/` instead of
plan's `pytest tests/unit/ -q` — pytest is not in
ShellTool._SAFE_COMMAND_PREFIXES and would be flagged dangerous by
the default policy's bash filter. Using ls (whitelisted) keeps the
test focused on lifecycle validation rather than policy tuning.
Verification: python3 -m pytest tests/integration/test_plan_exec_e2e.py -v
passes (7/7). Full regression: 116 tests pass across U1-U5 test files.
Ruff check + format clean.
Refs: R34, R27. Plan: docs/plans/2026-06-30-001-feat-agent-wave4-plan-exec-hardening-plan.md