Test / backend-test (pull_request) Has been cancelled
Test / frontend-unit (pull_request) Has been cancelled
Test / api-e2e (pull_request) Has been cancelled
Test / frontend-e2e (pull_request) Has been cancelled
Replace emoji across codebase: YAML avatars -> first char, frontend banners -> Ant Design Vue components, CLI status -> OK/FAIL/WARN labels, terminal -> [WARN]/[OK]/[PENDING], Bitable DB default -> table, App.vue font cleanup, test fixtures -> first char letters. shell.avatar type upgraded to string | Component.
Addresses 4 actionable findings (1 P1 + 3 P2) from ce-code-review of
feat/ui-ue-enhancement (PR #13), now merged to main (8066e0b).
P1 — expert_step payload alignment (_phase_executor.py)
The thinking/tool_call/tool_result event payloads were missing the
fields the frontend WsServerMessage contract requires
(expert_name/expert_color/content/step). Frontend code consuming these
events silently degraded. Now all expert_step broadcasts carry the
full contract; tool_call/tool_result keep step_data for the raw payload.
P2 #1 — execute_stream CancellationToken registration (config_driven.py)
execute_stream() bypassed BaseAgent.execute() and never registered a
CancellationToken, so cancel_task() could not cooperatively cancel a
streaming task. Now registers the token and cleans it up in finally.
P2 #2 — team_synthesis orphan milestone cleanup (orchestrator.py)
If synthesis streaming was interrupted (cancel/exception), no terminal
team_synthesis event was emitted, leaving the frontend streaming
milestone spinning forever. Now an inner try/except emits a terminal
team_synthesis with status=cancelled|error before re-raising, so the
frontend can finalize the milestone. The success path also carries
the synthesis_id.
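A rough sketch of the "emit a terminal event, then re-raise" shape described above. The function name `stream_synthesis`, the `emit` callback, the event dict layout, and the `completed` success status are assumptions made for illustration; only the event names, the `cancelled`/`error` statuses, and the `{plan.id}:synthesis` id format come from the commit message.

```python
import asyncio
from typing import Any, AsyncIterator, Awaitable, Callable, Dict, List

Event = Dict[str, Any]


async def stream_synthesis(
    chunks: AsyncIterator[str],
    emit: Callable[[Event], Awaitable[None]],
    synthesis_id: str,
) -> str:
    parts: List[str] = []
    try:
        async for chunk in chunks:
            parts.append(chunk)
            await emit({"type": "team_synthesis_chunk",
                        "synthesis_id": synthesis_id, "content": chunk})
    except asyncio.CancelledError:
        # CancelledError is a BaseException, so it needs its own clause.
        await emit({"type": "team_synthesis",
                    "synthesis_id": synthesis_id, "status": "cancelled"})
        raise
    except Exception as exc:
        await emit({"type": "team_synthesis",
                    "synthesis_id": synthesis_id, "status": "error",
                    "error": str(exc)})
        raise
    text = "".join(parts)
    # "completed" is an assumed status name for the success path.
    await emit({"type": "team_synthesis", "synthesis_id": synthesis_id,
                "status": "completed", "content": text})
    return text


async def failing_chunks() -> AsyncIterator[str]:
    yield "partial"
    raise RuntimeError("provider down")


async def demo() -> List[Event]:
    events: List[Event] = []

    async def emit(event: Event) -> None:
        events.append(event)

    try:
        await stream_synthesis(failing_chunks(), emit, "plan-1:synthesis")
    except RuntimeError:
        pass  # the error still propagates to the caller
    return events


events = asyncio.run(demo())
```

In this sketch the interrupted run produces one chunk event followed by a terminal `team_synthesis` event with `status` set to `error`, so a consumer always sees a closing event for the milestone.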
P2 #3 — synthesis_id dedup (orchestrator.py + types.ts + chatStream.ts)
Without an identifier, the frontend could not precisely match a
team_synthesis terminal event to its streaming milestone (especially
across retries/concurrent teams). The backend now injects a stable
synthesis_id (`{plan.id}:synthesis`) into both team_synthesis_chunk
and team_synthesis events; the frontend uses it for exact milestone
matching and treats error/cancelled status as terminal.
Test updates
- Updated test_thinking_events_forwarded_as_expert_step to assert the
new payload contract (expert_id/name/color/content/step).
- Added test_tool_call_events_forwarded_as_expert_step covering
tool_call/tool_result payload shape (content=tool_name summary +
step_data=raw payload).
Verification
- ruff check: clean
- pytest tests/unit/experts/test_phase_executor_streaming.py: 14/14
- npm run typecheck: clean
- vitest: 126/127 (1 unrelated baseline failure in tauri-auth.test.ts)
Residuals doc: docs/residual-review-findings/feat-ui-ue-enhancement.md
Captures the ReAct streaming contract bug + WS event whitelist governance
from PR #13's review fixes. Three intertwined runtime issues documented:
1. P0: final_answer double-accumulated token content (logic_error)
2. P0: _VALID_TEAM_EVENT_TYPES whitelist missing 3 new streaming event types
3. P2: except (RuntimeError, TimeoutError, ConnectionError) too narrow for
LLMProviderError/ConfigValidationError in async generator
Adds ReAct Streaming Contract entry to CONCEPTS.md — defines the protocol
execute_stream() yields (token events with incremental content, then one
final_answer event with the concatenated full text). Consumers must pick
one accumulation strategy; mixing both doubles the output.
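The contract can be illustrated with a small sketch. The `ReActEvent` dataclass below is a simplified stand-in, the stream is a plain synchronous generator for brevity (the commit message does not say whether the real `execute_stream()` is async, though the async-generator wording above suggests it is), and the helper names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class ReActEvent:
    """Simplified stand-in for the real event type."""
    event_type: str
    content: str


def scripted_stream() -> Iterator[ReActEvent]:
    # Incremental token events, then one final_answer with the full text.
    yield ReActEvent("token", "Hel")
    yield ReActEvent("token", "lo")
    yield ReActEvent("final_answer", "Hello")


def collect_from_tokens(events: Iterable[ReActEvent]) -> str:
    """Strategy A: concatenate token events, ignore final_answer."""
    return "".join(e.content for e in events if e.event_type == "token")


def collect_from_final(events: Iterable[ReActEvent]) -> str:
    """Strategy B: ignore tokens, take the final_answer text as-is."""
    for event in events:
        if event.event_type == "final_answer":
            return event.content
    return ""


def collect_mixed_buggy(events: Iterable[ReActEvent]) -> str:
    """The bug: appending final_answer on top of accumulated tokens."""
    return "".join(
        e.content for e in events if e.event_type in {"token", "final_answer"}
    )


from_tokens = collect_from_tokens(scripted_stream())
from_final = collect_from_final(scripted_stream())
mixed = collect_mixed_buggy(scripted_stream())
```

Both single-strategy consumers recover `Hello`, while the mixed consumer produces `HelloHello`, which is the double accumulation listed as the first P0 above.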
Record the strategies established during PR #8-#11 (1214+ tech debt
governance) for `Any` replacement priority, `except Exception` classification,
framework boundary preservation, and intentional-design retention.
Six safe fixes from Stage 5c review:
phase.py: delete dead _DEFAULT_BASH_FILTER constant (no references after U1)
chat.py: drop Any from _build_phase_engine params (AGENTS.md prohibits any)
chat.ts: delete stale comment about phase_changed emission
chat-phase.test.ts: rename misleading 'capped at 5' test name
test_chat_plan_exec_ws.py: tighten test_rest_react_mode_still_works assertion
test_plan_exec_e2e.py: clarify test_auto_advance assertion comment
Known limitations documented in PR description (not fixed): loop detector + advance_phase (P1), parallel path phase_violation ordering (P2), REST cancellation_token (P2), Callable filter exceptions (P3).
Pre-existing ruff errors surfaced during Wave 4 QC:
- F401: drop unused `TerminalSession` import (only `TerminalSessionManager` is used)
- F841: drop unused `start = time.monotonic()` local in `_execute_standalone`
`ruff format` then reformatted a few long lines in the same file
(frozenset literal, curl exfiltration regex, pipe operators, session.env
call). No behavior change — formatting only.
Why now: shell.py was already touched by U1 (widen
`bash_command_filter`). Leaving known ruff failures in a file this PR
modifies would make future CI gates noisy.
Add tests/integration/test_plan_exec_e2e.py covering the full PLAN_EXEC
path through a scripted LLM mock (deterministic, no real API call).
Mock boundary: LLMGateway.chat_stream yields scripted StreamChunk
objects. Real ReActEngine, real PhasePolicy (default_policy()), real
AdvancePhaseTool, real chat._handle_chat_message WS handler.
Test scenarios (7 tests, all passing):
- Happy path: PLANNING (search) → advance_phase → BUILDING (write_file)
→ advance_phase → VERIFICATION (shell ls tests/unit/) → advance_phase
→ DELIVERY (final answer). Asserts final_answer, tool dispatch counts,
no phase_violation events, engine ends at DELIVERY.
- Negative path: write_file in PLANNING blocked → phase_violation event
emitted with violation_kind=tool_not_allowed → LLM calls advance_phase
→ write_file in BUILDING succeeds. Asserts exactly 1 violation, tool
NOT dispatched during PLANNING (write_file.call_count==1 after recovery).
- Edge cases:
- auto_advance_after_steps=2: engine transitions out of PLANNING
after 2 LLM calls without explicit advance_phase.
- policy_from_config(enabled=False) returns None (PLAN_EXEC disabled).
- policy_from_config({}) returns None (opt-out, fall back to default).
- Error path: chat_stream raises RuntimeError → exception propagates,
phase state unchanged (still PLANNING), tool not dispatched.
- WS handler integration: full _handle_chat_message path emits both
phase_violation (from engine) and phase_changed (from WS handler's
transition detection) to the client WebSocket.
Notes:
- Loop detector threshold bumped to 99 for happy/negative/auto-advance
tests (3 legitimate advance_phase calls with {} args would trigger
the default threshold=2; this is a known PLAN_EXEC production concern
tracked separately).
- VERIFICATION-phase shell command uses `ls tests/unit/` instead of
plan's `pytest tests/unit/ -q` — pytest is not in
ShellTool._SAFE_COMMAND_PREFIXES and would be flagged dangerous by
the default policy's bash filter. Using ls (whitelisted) keeps the
test focused on lifecycle validation rather than policy tuning.
Verification: python3 -m pytest tests/integration/test_plan_exec_e2e.py -v
passes (7/7). Full regression: 116 tests pass across U1-U5 test files.
Ruff check + format clean.
Refs: R34, R27. Plan: docs/plans/2026-06-30-001-feat-agent-wave4-plan-exec-hardening-plan.md
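The scripted-mock idea (one pre-written sequence of chunks per LLM call, no network access) might look roughly like this. The `ScriptedGateway` class and the single-field `StreamChunk` shape are hypothetical; the real `LLMGateway.chat_stream` signature and `StreamChunk` fields are not shown in this log.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List


@dataclass
class StreamChunk:
    """Hypothetical minimal chunk; the real type likely carries more fields."""
    content: str


class ScriptedGateway:
    """Deterministic stand-in: each chat_stream call replays the next turn."""

    def __init__(self, script: List[List[str]]) -> None:
        self._script = list(script)
        self.call_count = 0

    async def chat_stream(self, messages: List[dict]) -> AsyncIterator[StreamChunk]:
        if self.call_count >= len(self._script):
            raise RuntimeError("script exhausted")
        turn = self._script[self.call_count]
        self.call_count += 1
        for piece in turn:
            yield StreamChunk(piece)


async def collect(gateway: ScriptedGateway) -> str:
    return "".join([chunk.content async for chunk in gateway.chat_stream([])])


async def main() -> List[str]:
    gateway = ScriptedGateway([["pl", "an"], ["done"]])
    first = await collect(gateway)
    second = await collect(gateway)
    return [first, second, str(gateway.call_count)]


replayed = asyncio.run(main())
```

Counting calls on the mock is what makes assertions like "transitions after 2 LLM calls" checkable without a real provider.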
Extend the frontend to surface PLAN_EXEC phase lifecycle events to the
user:
- WsServerMessage union (types.ts) gains two branches:
`phase_changed` and `phase_violation` (matching backend U2 emission).
- chat.ts Pinia store gains a phase state slice:
`currentPhase`, `phaseViolations` (capped at 5), `isPlanExec`
computed, and `resetPlanExecState()`.
- handleWsMessage adds `case "phase_changed"` (sets currentPhase +
appends a milestone step) and `case "phase_violation"` (sets
currentPhase from violation data, appends to violations, fires an
ant-design-vue message.warning toast, appends an error step).
- `result` handler calls `resetPlanExecState()` to clear the
indicator when the conversation completes.
- New `PhaseIndicator.vue` component: compact badge + 4 dots
(PLANNING/BUILDING/VERIFICATION/DELIVERY) with the current phase
highlighted + violation counter. Renders nothing when
`!isPlanExec` (graceful degradation).
- Mounted in `ChatView.vue` alongside ExpertTeamView and
BoardStatusView.
Tests:
- New `tests/unit/stores/chat-phase.test.ts` verifies the phase state
slice is exposed with correct initial values and `isPlanExec`
derives from `currentPhase`.
- `npm run typecheck` clean.
- Pre-existing `tauri-auth.test.ts` failure is unrelated (fails in
isolation on main).
Extract the WS path's inline phase_policy construction into a shared
_build_phase_engine helper so the REST send_message endpoint can reuse
it. Replace the former 501 stub with actual PLAN_EXEC execution:
- REST POST /chat/sessions/{id}/messages with execution_mode=plan_exec
now builds a phase-policy-backed ReActEngine, calls execute()
(non-streaming), and returns a MessageResponse.
- KTD5: PLAN_EXEC bypasses execute_with_fallback_chain — phase policy
and fallback chain are mutually exclusive.
- When plan_exec.enabled=False, REST falls through to the REACT path
(matching WS behavior).
- WS path refactored to call the same helper; behavior unchanged.
Tests:
- Replace TestRestPlanExec501 with TestRestPlanExec (happy path, bad
config → 500, disabled → falls through to REACT, REACT mode unchanged).
- Add TestBuildPhaseEngineHelper covering all return branches:
not-PLAN_EXEC, disabled, empty-config, invalid-config, tool append,
default-policy fallback.
- All 109 tests pass across the three PLAN_EXEC test files.
Wave 3 only injected the violation error dict back to the LLM as a tool
result. Wave 4 U2 adds a parallel WS event so the frontend PhaseIndicator
can surface violations to the user.
- ReActEngine: add _phase_violations accumulator (list[dict]). Cleared in
reset(). _check_phase_permission appends a structured violation dict
(with new violation_kind field: tool_not_allowed | bash_command_blocked)
before returning the error.
- Add _drain_phase_violations(step) helper that pops pending violations
and returns ReActEvent(event_type="phase_violation", ...) list. Events
carry a shallow copy of the violation dict so callers can't mutate the
accumulator.
- execute_stream: drain after each tool_result yield at all 3 tool
execution sites (parallel, serial-with-confirmation, parsed_calls).
Non-streaming execute() ignores the accumulator (the LLM reinjection
via the error dict is the only signal there).
- chat.py WS handler: new elif branch forwards phase_violation ReActEvents
to the client as {"type": "phase_violation", "data": ...} WS messages.
- Tests: 11 new tests covering accumulator lifecycle, drain semantics,
shallow-copy isolation, and execute_stream event emission for both
tool_block and bash_block paths. 2 new WS forwarding tests pin the
chat.py path (forward + characterization for REACT mode).
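The accumulate-then-drain behaviour, including the shallow-copy isolation, can be sketched as below. `EngineSketch`, `record_violation`, and the `phase`/`tool` dict keys are hypothetical; `_phase_violations`, `_drain_phase_violations`, `reset`, `violation_kind`, and the `phase_violation` event type are taken from the description above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ReActEvent:
    """Simplified stand-in for the real event type."""
    event_type: str
    step: int
    data: Dict[str, Any]


class EngineSketch:
    def __init__(self) -> None:
        self._phase_violations: List[Dict[str, Any]] = []

    def record_violation(self, phase: str, tool: str, violation_kind: str) -> Dict[str, Any]:
        """Append a structured violation (keys other than violation_kind are assumed)."""
        violation = {"phase": phase, "tool": tool, "violation_kind": violation_kind}
        self._phase_violations.append(violation)
        return violation

    def _drain_phase_violations(self, step: int) -> List[ReActEvent]:
        """Pop all pending violations and wrap shallow copies in events."""
        pending, self._phase_violations = self._phase_violations, []
        return [ReActEvent("phase_violation", step, dict(v)) for v in pending]

    def reset(self) -> None:
        self._phase_violations.clear()


engine = EngineSketch()
original = engine.record_violation("PLANNING", "write_file", "tool_not_allowed")
events = engine._drain_phase_violations(step=3)
events[0].data["tool"] = "tampered"  # mutating the event must not touch the source dict
second_drain = engine._drain_phase_violations(step=4)
```

Draining empties the accumulator, so a second drain with no new violations yields nothing, and editing an event's `data` leaves the originally recorded dict unchanged.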
Reuses ShellTool._is_dangerous as the default bash filter for PLANNING
and VERIFICATION phases, closing the regex ceiling documented in Wave 3.
- Convert ShellTool._is_dangerous and _is_single_command_dangerous to
@staticmethod (backward-compatible; instance calls still work via
Python's descriptor protocol).
- Widen PhasePolicy.bash_command_filter field type to
dict[PhaseState, Callable[[str], bool] | re.Pattern | None].
- is_bash_command_allowed dispatches on callable vs pattern at call time.
Empty commands short-circuit to allowed (Wave 3 contract; ShellTool
emits the clearer empty-command error).
- to_dict serializes callables as <callable> for log readability.
- default_policy() now wires ShellTool._is_dangerous for PLANNING and
VERIFICATION. _DEFAULT_BASH_FILTER kept for backward compat with
configs that pass a re.Pattern.
- Tests: characterization tests pin Wave 3 behavior (rm/mv/cp/echo >
still blocked) plus new edge-case coverage confirming the regex ceiling
is closed (dd of=/dev/sda, :>file, chain operators, pipe segments).
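A minimal sketch of dispatching on callable versus pattern at call time. Several points are assumptions: the callable returns True for a dangerous command (mirroring the name `_is_dangerous`), a regex match means "blocked", a `None` filter means "no restriction", and the toy `is_dangerous` function stands in for the real `ShellTool._is_dangerous`, whose logic is not shown here.

```python
import re
from enum import Enum
from typing import Callable, Dict, Optional, Union


class PhaseState(Enum):
    PLANNING = "planning"
    BUILDING = "building"
    VERIFICATION = "verification"
    DELIVERY = "delivery"


BashFilter = Optional[Union[Callable[[str], bool], re.Pattern]]


def is_dangerous(command: str) -> bool:
    """Toy stand-in for ShellTool._is_dangerous (assumed: True means dangerous)."""
    return command.split()[0] in {"rm", "mv", "dd"}


def is_bash_command_allowed(
    filters: Dict[PhaseState, BashFilter], phase: PhaseState, command: str
) -> bool:
    if not command.strip():
        return True  # empty commands short-circuit to allowed
    flt = filters.get(phase)
    if flt is None:
        return True  # assumed: no filter configured means no restriction
    if isinstance(flt, re.Pattern):
        return flt.search(command) is None  # assumed: a match means blocked
    return not flt(command)  # callable path


FILTERS: Dict[PhaseState, BashFilter] = {
    PhaseState.PLANNING: is_dangerous,
    PhaseState.VERIFICATION: re.compile(r"^\s*(rm|mv|cp)\b"),
    PhaseState.BUILDING: None,
}

planning_ls = is_bash_command_allowed(FILTERS, PhaseState.PLANNING, "ls -la")
planning_rm = is_bash_command_allowed(FILTERS, PhaseState.PLANNING, "rm -rf /tmp/x")
verification_cp = is_bash_command_allowed(FILTERS, PhaseState.VERIFICATION, "cp a b")
building_rm = is_bash_command_allowed(FILTERS, PhaseState.BUILDING, "rm x")
planning_empty = is_bash_command_allowed(FILTERS, PhaseState.PLANNING, "   ")
```

Checking `isinstance(flt, re.Pattern)` before calling keeps existing configs that pass a compiled regex working alongside the new callable filters.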
Code review fixes for Wave 1:
- W1: ServerConfig.from_dict now wires prompt_cache/streaming/verification sections
from YAML to constructor (previously these params existed but were never read)
- W3: Tool._validate_input filters _-prefixed kwargs (e.g. _skip_dangerous_check)
before jsonschema.validate, preventing additionalProperties:false schemas from
rejecting internal control parameters
- N3: ReActResult.status docstring now lists "empty_fallback" and "verify_failed"
Added test test_internal_kwargs_underscore_prefixed_skipped_by_validation for W3.
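The W3 fix can be illustrated with a small sketch. To stay dependency-free it replaces `jsonschema.validate` with a hand-written check that covers only the `additionalProperties: false` rule; the function names and the example schema are hypothetical.

```python
from typing import Any, Dict


def strip_internal_kwargs(kwargs: Dict[str, Any]) -> Dict[str, Any]:
    """Drop _-prefixed control parameters before schema validation."""
    return {k: v for k, v in kwargs.items() if not k.startswith("_")}


def check_no_additional_properties(schema: Dict[str, Any], payload: Dict[str, Any]) -> None:
    """Minimal stand-in for the additionalProperties:false part of validation."""
    if schema.get("additionalProperties", True) is False:
        extra = set(payload) - set(schema.get("properties", {}))
        if extra:
            raise ValueError(f"unexpected properties: {sorted(extra)}")


def validate_input(schema: Dict[str, Any], **kwargs: Any) -> Dict[str, Any]:
    public = strip_internal_kwargs(kwargs)
    check_no_additional_properties(schema, public)
    return public


SCHEMA = {
    "type": "object",
    "properties": {"command": {"type": "string"}},
    "additionalProperties": False,
}

accepted = validate_input(SCHEMA, command="ls", _skip_dangerous_check=True)

try:
    validate_input(SCHEMA, command="ls", verbose=True)
    rejected = False
except ValueError:
    rejected = True
```

Internal flags such as `_skip_dangerous_check` pass through untouched by the schema, while a genuinely unknown public argument is still rejected.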