feat(agent): Wave 4 PLAN_EXEC Hardening (U1-U5) #7

Merged
fischer merged 8 commits from feat/agent-wave4-plan-exec-hardening into main 2026-06-30 12:46:35 +08:00

8 Commits

Author SHA1 Message Date
chiguyong a872a459a6 docs: add PLAN_EXEC concepts + commit Wave 4 plan
Test / backend-test (pull_request) Has been cancelled Details
Test / frontend-unit (pull_request) Has been cancelled Details
Test / api-e2e (pull_request) Has been cancelled Details
Test / frontend-e2e (pull_request) Has been cancelled Details
CONCEPTS.md: new PLAN_EXEC section (Phase State Machine, PhasePolicy, Phase Violation, AdvancePhaseTool, _build_phase_engine).

docs/plans/: commit the Wave 4 plan document (was untracked).
2026-06-30 12:46:24 +08:00
chiguyong 8627777f87 fix(review): apply ce-code-review findings
Test / backend-test (pull_request) Has been cancelled Details
Test / frontend-unit (pull_request) Has been cancelled Details
Test / api-e2e (pull_request) Has been cancelled Details
Test / frontend-e2e (pull_request) Has been cancelled Details
Six safe fixes from Stage 5c review:

phase.py: delete dead _DEFAULT_BASH_FILTER constant (no references after U1)
chat.py: drop Any from _build_phase_engine params (AGENTS.md prohibits any)
chat.ts: delete stale comment about phase_changed emission
chat-phase.test.ts: rename misleading 'capped at 5' test name
test_chat_plan_exec_ws.py: tighten test_rest_react_mode_still_works assertion
test_plan_exec_e2e.py: clarify test_auto_advance assertion comment

Known limitations documented in PR description (not fixed): loop detector + advance_phase (P1), parallel path phase_violation ordering (P2), REST cancellation_token (P2), Callable filter exceptions (P3).
2026-06-30 12:42:15 +08:00
chiguyong cbbe937940 chore(shell): fix ruff F401/F841 + apply ruff format
Pre-existing ruff errors surfaced during Wave 4 QC:
- F401: drop unused `TerminalSession` import (only `TerminalSessionManager` is used)
- F841: drop unused `start = time.monotonic()` local in `_execute_standalone`

`ruff format` then reformatted a few long lines in the same file
(frozenset literal, curl exfiltration regex, pipe operators, session.env
call). No behavior change — formatting only.

Why now: shell.py was already touched by U1 (widen
`bash_command_filter`). Leaving known ruff failures in a file this PR
modifies would make future CI gates noisy.
2026-06-30 11:52:51 +08:00
chiguyong 0a8f6eebef feat(U5): E2E integration test for PLAN_EXEC lifecycle
Add tests/integration/test_plan_exec_e2e.py covering the full PLAN_EXEC
path through a scripted LLM mock (deterministic, no real API call).

Mock boundary: LLMGateway.chat_stream yields scripted StreamChunk
objects. Real ReActEngine, real PhasePolicy (default_policy()), real
AdvancePhaseTool, real chat._handle_chat_message WS handler.

Test scenarios (7 tests, all passing):
- Happy path: PLANNING (search) → advance_phase → BUILDING (write_file)
  → advance_phase → VERIFICATION (shell ls tests/unit/) → advance_phase
  → DELIVERY (final answer). Asserts final_answer, tool dispatch counts,
  no phase_violation events, engine ends at DELIVERY.
- Negative path: write_file in PLANNING blocked → phase_violation event
  emitted with violation_kind=tool_not_allowed → LLM calls advance_phase
  → write_file in BUILDING succeeds. Asserts exactly 1 violation, tool
  NOT dispatched during PLANNING (write_file.call_count==1 after recovery).
- Edge cases:
  - auto_advance_after_steps=2: engine transitions out of PLANNING
    after 2 LLM calls without explicit advance_phase.
  - policy_from_config(enabled=False) returns None (PLAN_EXEC disabled).
  - policy_from_config({}) returns None (opt-out, fall back to default).
- Error path: chat_stream raises RuntimeError → exception propagates,
  phase state unchanged (still PLANNING), tool not dispatched.
- WS handler integration: full _handle_chat_message path emits both
  phase_violation (from engine) and phase_changed (from WS handler's
  transition detection) to the client WebSocket.

Notes:
- Loop detector threshold bumped to 99 for happy/negative/auto-advance
  tests (3 legitimate advance_phase calls with {} args would trigger
  the default threshold=2; this is a known PLAN_EXEC production concern
  tracked separately).
- VERIFICATION-phase shell command uses `ls tests/unit/` instead of
  plan's `pytest tests/unit/ -q` — pytest is not in
  ShellTool._SAFE_COMMAND_PREFIXES and would be flagged dangerous by
  the default policy's bash filter. Using ls (whitelisted) keeps the
  test focused on lifecycle validation rather than policy tuning.

Verification: python3 -m pytest tests/integration/test_plan_exec_e2e.py -v
passes (7/7). Full regression: 116 tests pass across U1-U5 test files.
Ruff check + format clean.

Refs: R34, R27. Plan: docs/plans/2026-06-30-001-feat-agent-wave4-plan-exec-hardening-plan.md
2026-06-30 11:36:02 +08:00
chiguyong 2abe7c9e49 feat(U4): frontend phase_violation handling + PhaseIndicator component
Extend the frontend to surface PLAN_EXEC phase lifecycle events to the
user:

- WsServerMessage union (types.ts) gains two branches:
  `phase_changed` and `phase_violation` (matching backend U2 emission).
- chat.ts Pinia store gains a phase state slice:
  `currentPhase`, `phaseViolations` (capped at 5), `isPlanExec`
  computed, and `resetPlanExecState()`.
- handleWsMessage adds `case "phase_changed"` (sets currentPhase +
  appends a milestone step) and `case "phase_violation"` (sets
  currentPhase from violation data, appends to violations, fires an
  ant-design-vue message.warning toast, appends an error step).
- `result` handler calls `resetPlanExecState()` to clear the
  indicator when the conversation completes.
- New `PhaseIndicator.vue` component: compact badge + 4 dots
  (PLANNING/BUILDING/VERIFICATION/DELIVERY) with the current phase
  highlighted + violation counter. Renders nothing when
  `!isPlanExec` (graceful degradation).
- Mounted in `ChatView.vue` alongside ExpertTeamView and
  BoardStatusView.

Tests:
- New `tests/unit/stores/chat-phase.test.ts` verifies the phase state
  slice is exposed with correct initial values and `isPlanExec`
  derives from `currentPhase`.
- `npm run typecheck` clean.
- Pre-existing `tauri-auth.test.ts` failure is unrelated (fails in
  isolation on main).
2026-06-30 11:11:03 +08:00
chiguyong b032e08866 feat(U3): extract _build_phase_engine helper + wire REST PLAN_EXEC
Extract the WS path's inline phase_policy construction into a shared
_build_phase_engine helper so the REST send_message endpoint can reuse
it. Replace the former 501 stub with actual PLAN_EXEC execution:

- REST POST /chat/sessions/{id}/messages with execution_mode=plan_exec
  now builds a phase-policy-backed ReActEngine, calls execute()
  (non-streaming), and returns a MessageResponse.
- KTD5: PLAN_EXEC bypasses execute_with_fallback_chain — phase policy
  and fallback chain are mutually exclusive.
- When plan_exec.enabled=False, REST falls through to the REACT path
  (matching WS behavior).
- WS path refactored to call the same helper; behavior unchanged.

Tests:
- Replace TestRestPlanExec501 with TestRestPlanExec (happy path, bad
  config → 500, disabled → falls through to REACT, REACT mode unchanged).
- Add TestBuildPhaseEngineHelper covering all return branches:
  not-PLAN_EXEC, disabled, empty-config, invalid-config, tool append,
  default-policy fallback.
- All 109 tests pass across the three PLAN_EXEC test files.
2026-06-30 10:59:43 +08:00
chiguyong 4dc58c24bc feat(U2): emit phase_violation WS event alongside LLM reinjection
Wave 3 only injected the violation error dict back to the LLM as a tool
result. Wave 4 U2 adds a parallel WS event so the frontend PhaseIndicator
can surface violations to the user.

- ReActEngine: add _phase_violations accumulator (list[dict]). Cleared in
  reset(). _check_phase_permission appends a structured violation dict
  (with new violation_kind field: tool_not_allowed | bash_command_blocked)
  before returning the error.
- Add _drain_phase_violations(step) helper that pops pending violations
  and returns ReActEvent(event_type="phase_violation", ...) list. Events
  carry a shallow copy of the violation dict so callers can't mutate the
  accumulator.
- execute_stream: drain after each tool_result yield at all 3 tool
  execution sites (parallel, serial-with-confirmation, parsed_calls).
  Non-streaming execute() ignores the accumulator (the LLM reinjection
  via the error dict is the only signal there).
- chat.py WS handler: new elif branch forwards phase_violation ReActEvents
  to the client as {"type": "phase_violation", "data": ...} WS messages.
- Tests: 11 new tests covering accumulator lifecycle, drain semantics,
  shallow-copy isolation, and execute_stream event emission for both
  tool_block and bash_block paths. 2 new WS forwarding tests pin the
  chat.py path (forward + characterization for REACT mode).
2026-06-30 10:48:35 +08:00
chiguyong 9e28ab315e feat(U1): widen PhasePolicy bash_command_filter to accept Callable
Reuses ShellTool._is_dangerous as the default bash filter for PLANNING
and VERIFICATION phases, closing the regex ceiling documented in Wave 3.

- Convert ShellTool._is_dangerous and _is_single_command_dangerous to
  @staticmethod (backward-compatible; instance calls still work via
  Python's descriptor protocol).
- Widen PhasePolicy.bash_command_filter field type to
  dict[PhaseState, Callable[[str], bool] | re.Pattern | None].
- is_bash_command_allowed dispatches on callable vs pattern at call time.
  Empty commands short-circuit to allowed (Wave 3 contract; ShellTool
  emits the clearer empty-command error).
- to_dict serializes callables as <callable> for log readability.
- default_policy() now wires ShellTool._is_dangerous for PLANNING and
  VERIFICATION. _DEFAULT_BASH_FILTER kept for backward compat with
  configs that pass a re.Pattern.
- Tests: characterization tests pin Wave 3 behavior (rm/mv/cp/echo >
  still blocked) plus new edge-case coverage for ceiling closed
  (dd of=/dev/sda, :>file, chain operators, pipe segments).
2026-06-30 10:39:44 +08:00