docs(plan): Wave 2 medium-coupling plan (G4/G7/G9)

2026-06-29 22:30:51 +08:00 · 2026-06-29 22:30:51 +08:00 · 88bfe71d30
parent 78ed93fc81
commit 88bfe71d30
1 changed files with 481 additions and 0 deletions
--- a/docs/plans/2026-06-29-003-feat-agent-wave2-medium-coupling-plan.md
+++ b/docs/plans/2026-06-29-003-feat-agent-wave2-medium-coupling-plan.md
@ -0,0 +1,481 @@
+---
+date: 2026-06-29
+type: feat
+origin: docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md
+---
+
+# Wave 2 — Auxiliary LLM Routing, Three-Tier Fallback, Atomic Subtask Rollback
+
+## Summary
+
+Wave 2 of the advanced-agent gap optimization brainstorm. Three medium-coupling gaps: G4 routes summary calls through a cheaper auxiliary LLM (falling back to main on failure), G7 introduces a three-tier fallback chain (main → Recovery via existing `ReflexionEngine` → Emergency with structured errors), G9 binds atomic subtask rollback to `PlanPhase` (opt-in via `rollback_command`, coordinated with the existing U7 `PipelineCheckpoint`).
+
+---
+
+## Problem Frame
+
+Wave 1 shipped self-contained quick wins (G1/G2/G3/G8). Wave 2 addresses the medium-coupling gaps that touch multiple layers and resolve two deferred-to-planning decisions surfaced in the brainstorm: G7 Emergency layer rule template shape, and G9 `rollback_command` default behavior. The constraints these gaps must respect come from `docs/solutions/logic-errors/long-horizon-reliability-code-review-fixes.md` — new fields must preserve existing contracts, dynamic plan mutations must persist immediately, and `PipelineCheckpoint` is in-memory dict + Redis fallback (not a DB row lock).
+
+---
+
+## Requirements
+
+Carried from `docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md` § Wave 2 (R13-R21):
+
+- R13. `ContextCompressor` summary task routes to auxiliary model (cheap, e.g. Gemini Flash / Doubao lite), not main model.
+- R14. `auxiliary_model` is configurable, separated from main `model`.
+- R15. Summary quality does not degrade: auxiliary failure falls back to main model (not to simple_summary).
+- R16. Main agent failure triggers Recovery layer (reuse `ReflexionEngine` Evaluate→Reflect→Retry).
+- R17. Recovery failure triggers Emergency layer (rule-based fallback, structured error + suggestion).
+- R18. Fallback chain is configurable (per-layer max retries, enable/disable Recovery/Emergency).
+- R19. `PlanPhase` (`experts/plan.py:147`) gains optional `validation_command` and `rollback_command` fields.
+- R20. Phase failure auto-executes `rollback_command` when configured (default `git checkout` pattern); coordinated with U7 checkpoint.
+- R21. `checkpoint.save` runs only after rollback validation passes (avoid persisting failed state).
+
+Cross-cutting: R26 (configuration via `agentkit.yaml`), R27 (each gap ships a minimal self-check test, per ponytail rule).
+
+---
+
+## Key Technical Decisions
+
+KTD1. **Auxiliary routing is compressor-scoped, not gateway-level.** `LLMGateway` already has a per-model fallback chain (`gateway.py:170-227`), but it semantics are "main fails → try backup model", not "route by task type". G4 adds an `auxiliary_model` parameter to `ContextCompressor.__init__`; `_summarize` tries auxiliary first, falls back to main on failure. This avoids touching the gateway's existing fallback semantics and keeps the change to one file plus config wiring.
+
+KTD2. **Emergency layer adds a parallel `error_struct` field, not replacing `error_message`.** `TaskResult.error_message: str | None` is serialized to API responses (`protocol.py:132-145`) — changing its type breaks frontend contracts. New `error_struct: dict | None` field carries `{error_code, message, suggestions, retryable}`. `error_message` remains the human-readable string (now derived from `error_struct.message` when present). Backward compatibility preserved.
+
+KTD3. **Emergency rule classification by exception type, not by status field.** `TaskTimeoutError` → `timeout`, `LoopDetectedError` → `loop_detected`, `LLMProviderError` → `llm_failure`, `TaskCancelledError` → `cancelled` (not Emergency-eligible, propagates), generic `Exception` → `internal_error`. Each rule maps to a fixed message + suggestion list (no LLM-driven suggestions — pure rule-based, per brainstorm).
+
+KTD4. **Recovery layer reuses `ReflexionEngine` with main model, not auxiliary.** `ReflexionEngine.execute()` already supports `evaluate_model`/`reflect_model` parameters (`reflexion.py:94-111`), defaulting to the main model. Recovery triggers on main agent failure; using the main model for evaluate/reflect maximizes diagnostic reliability (the same model that failed is best positioned to reflect on its own failure). Auxiliary model is reserved for G4's cost-sensitive compression path.
+
+KTD5. **Three-tier chain wires at `chat.py:613` (WebSocket main path), not at `ReActEngine.execute()`.** ReAct has 5 call sites (`chat.py:613`, `rewoo.py:1204`, `cli/chat.py:338`, `reflexion.py:216`, `config_driven.py:695`). Wrapping at ReAct would force all 5 sites through Recovery/Emergency, including ReflexionEngine itself (creating a recursive loop). Wrapping at `chat.py:613` covers the primary user-facing path; CLI and ReWOO are out of scope (deferred to follow-up).
+
+KTD6. **Rollback is opt-in via `rollback_command` field; no implicit default.** Brainstorm KTD5 mentioned "default `git checkout`" but auto-executing `git checkout` on every phase failure changes existing behavior (Finding 1: "新字段默认值须保持既有契约"). Resolution: `rollback_command: str | None = None`. When unset, no rollback executes (preserves existing contract). When set, the configured command runs after validation failure, before checkpoint save. `git checkout <files>` is the canonical pattern documented in YAML examples.
+
+KTD7. **Rollback executes via `RollbackExecutor`, not via `ShellTool`.** `ShellTool._is_dangerous` would trigger `confirm_callback` for `git checkout` (not in `_SAFE_COMMAND_PREFIXES`, not in `_DANGEROUS_BINARY_FLAGS`). Orchestrator-internal execution bypasses ShellTool — uses `asyncio.create_subprocess_shell` with `cwd`/`timeout`/`proc.kill()` pattern from `verification_loop.py:67-103`. Audit logging added (not a whitelist concern; terminal_whitelist only governs `/api/v1/terminal/server` route per `terminal_server.py:227`).
+
+KTD8. **G9 checkpoint ordering: validation → rollback → checkpoint save.** Currently `orchestrator.py:265` saves checkpoint unconditionally after phase finalizes (success or failure). R21 requires checkpoint save only after rollback validation passes. New ordering: phase fails → mark FAILED → mark dependents FAILED → execute `validation_command` (if set) → if validation fails, execute `rollback_command` (if set) → save checkpoint (only if rollback validation passed or no rollback configured).
+
+---
+
+## High-Level Technical Design
+
+### Three-tier fallback chain (G7)
+
+```mermaid
+stateDiagram-v2
+    [*] --> Main: chat request
+    Main --> MainSuccess: success
+    Main --> Recovery: failure (exception or empty_fallback)
+    Recovery --> RecoverySuccess: ReflexionEngine retry succeeds
+    Recovery --> Emergency: max_reflections exhausted
+    Emergency --> [*]: emit error_struct, terminal
+    MainSuccess --> [*]: normal response
+    RecoverySuccess --> [*]: recovered response
+```
+
+### G9 phase failure + rollback sequence
+
+```mermaid
+sequenceDiagram
+    participant O as TeamOrchestrator
+    participant P as PlanPhase
+    participant RE as RollbackExecutor
+    participant CP as PipelineCheckpoint
+
+    O->>P: execute phase
+    P-->>O: failure (exception)
+    O->>O: mark FAILED + dependents FAILED
+    alt validation_command set
+        O->>RE: run validation_command
+        RE-->>O: ValidationResult
+        alt validation fails AND rollback_command set
+            O->>RE: run rollback_command
+            RE-->>O: RollbackResult
+            alt rollback validation passes
+                O->>CP: save checkpoint
+            else rollback validation fails
+                O->>O: log rollback_failed, skip checkpoint
+            end
+        else no rollback_command
+            O->>CP: save checkpoint (no rollback)
+        end
+    else no validation_command
+        O->>CP: save checkpoint (no validation)
+    end
+```
+
+### Configuration shape (YAML)
+
+```yaml
+# G4: Auxiliary LLM for compression
+llm:
+  auxiliary_model: fast    # alias resolved via existing model_aliases mechanism
+
+# G7: Three-tier fallback chain
+fallback_chain:
+  enabled: true
+  recovery:
+    enabled: true
+    max_retries: 1          # ReflexionEngine max_reflections override
+  emergency:
+    enabled: true
+
+# G9: Rollback configuration (per-phase opt-in via PlanPhase.rollback_command)
+rollback:
+  default_timeout: 30.0    # RollbackExecutor subprocess timeout
+```
+
+---
+
+## Implementation Units
+
+### U1. G4 — Auxiliary LLM Routing in ContextCompressor
+
+**Goal:** Route `_summarize` LLM calls through `auxiliary_model` when configured; fall back to main model on auxiliary failure (not to `_simple_summary`).
+
+**Requirements:** R13, R14, R15, R26, R27
+
+**Dependencies:** none (self-contained)
+
+**Files:**
+- Modify: `src/agentkit/core/compressor.py` (add `auxiliary_model` param + routing in `_summarize`)
+- Modify: `src/agentkit/llm/config.py` (add `auxiliary_model: str | None` field to `LLMConfig`)
+- Modify: `src/agentkit/server/config.py` (`_build_llm_config` reads `auxiliary_model` from llm section)
+- Modify: `agentkit.yaml` (document `llm.auxiliary_model: fast` in llm section)
+- Create: `tests/unit/test_compressor_auxiliary.py`
+
+**Approach:** `ContextCompressor.__init__` gains `auxiliary_model: str | None = None` param (after existing `model` param). `_summarize` (compressor.py:123-158) restructuring:
+
+1. If `auxiliary_model` is set and differs from `model`, try `auxiliary_model` first via `self._llm_gateway.chat(model=self._auxiliary_model, ...)`.
+2. **Truthiness check (Finding 4 anti-pattern):** treat empty `response.content` (None or whitespace-only) as failure, not success. On failure, log and fall through to main model.
+3. If auxiliary succeeds with non-empty content, return it.
+4. If auxiliary fails (exception OR empty content), retry with main `self._model`.
+5. Existing `except Exception → _simple_summary` block remains as the final degradation tier.
+
+`LLMConfig.auxiliary_model` reads from `data.get("auxiliary_model")` in `_build_llm_config`. Agentkit.yaml already declares `fast: bailian-coding/qwen-turbo` alias — this is the canonical auxiliary target.
+
+**Execution note:** characterization-first. Capture current `_summarize` behavior (single model, exception → simple_summary) with `auxiliary_model=None` tests, then add `auxiliary_model="fast"` tests for new routing behavior.
+
+**Patterns to follow:**
+- `LLMGateway._get_models_to_try` (`gateway.py:170-227`) — existing fallback chain pattern
+- Wave 1's `ServerConfig.from_dict` extension template (prompt_cache/streaming/verification sections, `config.py:239-273`)
+
+**Test scenarios:**
+- Happy path: `auxiliary_model="fast"` set, auxiliary returns non-empty content → result is auxiliary content, main model not called
+- Empty content fallback: auxiliary returns `content=""` (or `None`) → main model called, main content returned (covers Finding 4 anti-pattern)
+- Auxiliary exception: auxiliary raises `LLMProviderError` → main model called, main content returned
+- Both fail: auxiliary raises, main raises → `_simple_summary` returned (existing degradation preserved)
+- Characterization: `auxiliary_model=None` → behavior matches current code (single model call)
+- Config wiring: `LLMConfig.from_dict` reads `auxiliary_model` field from dict; `ServerConfig._build_llm_config` passes it through
+- Audit: auxiliary call uses `agent_name="compressor"`, `task_type="summarization"` (preserved for usage tracking)
+
+**Verification:** All test scenarios pass; existing `tests/unit/test_compressor*.py` (if any) still pass with `auxiliary_model=None` default; ruff clean.
+
+---
+
+### U2. G7 — Emergency Layer Rule Template + TaskResult Extension
+
+**Goal:** Add rule-based Emergency classifier with structured error output. No wiring yet — just the infrastructure (classifier + data structure + `fallback.py` extension).
+
+**Requirements:** R17, R18, R26, R27
+
+**Dependencies:** none (foundation for U3)
+
+**Files:**
+- Modify: `src/agentkit/core/fallback.py` (add `EmergencyRules` class + `EmergencyError` dataclass, preserve existing 3 constants)
+- Modify: `src/agentkit/core/protocol.py` (add `error_struct: dict | None = None` field to `TaskResult`, update `to_dict`)
+- Create: `tests/unit/test_emergency_rules.py`
+
+**Approach:**
+
+`EmergencyError` dataclass (in `fallback.py`):
+```
+@dataclass
+class EmergencyError:
+    error_code: str          # "timeout"|"loop_detected"|"llm_failure"|"internal_error"
+    message: str             # human-readable Chinese message (mirrors EMPTY_LLM_RESPONSE style)
+    suggestions: list[str]   # actionable user-facing suggestions
+    retryable: bool          # whether user retry might succeed
+    original_error: str      # str(exc) for traceability
+
+    def to_dict(self) -> dict: ...
+    def to_error_message(self) -> str: ...   # formatted "message\n建议：1) ... 2) ..."
+```
+
+`EmergencyRules` class (rule-based classifier, no LLM):
+```
+class EmergencyRules:
+    @staticmethod
+    def classify(exc: Exception, config: dict | None = None) -> EmergencyError:
+        # Match by exception type (TaskTimeoutError, LoopDetectedError, LLMProviderError, etc.)
+        # config allows per-rule customization (suggestion overrides, retryable overrides)
+```
+
+Rule mapping (initial set, expandable via config):
+- `TaskTimeoutError` → `error_code="timeout"`, `retryable=True`, suggestions: ["稍后重试", "简化任务范围"]
+- `LoopDetectedError` → `error_code="loop_detected"`, `retryable=True`, suggestions: ["拆分任务", "检查工具参数"]
+- `LLMProviderError` → `error_code="llm_failure"`, `retryable=True`, suggestions: ["稍后重试", "切换模型"]
+- `TaskCancelledError` → not classified (propagates as-is, Emergency not triggered)
+- Generic `Exception` → `error_code="internal_error"`, `retryable=False`, suggestions: ["联系管理员"]
+
+`TaskResult` extension:
+```
+@dataclass
+class TaskResult:
+    # ... existing fields ...
+    error_message: str | None    # unchanged
+    error_struct: dict | None = None   # NEW: serialized EmergencyError.to_dict()
+```
+
+`to_dict` includes `error_struct` when set; `error_message` continues to hold the human-readable string.
+
+**Patterns to follow:**
+- Existing `EMPTY_LLM_RESPONSE` / `MAX_STEPS_REACHED` style in `fallback.py` (Chinese, "建议：..." format)
+- `VerificationResult.errors: list[str]` field pattern from `verification_loop.py:18-24`
+- `TaskResult.to_dict` pattern at `protocol.py:132-145`
+
+**Test scenarios:**
+- Happy path: `classify(TaskTimeoutError(...))` returns `EmergencyError(error_code="timeout", retryable=True)`
+- Each exception type maps to correct `error_code`
+- `TaskCancelledError` is NOT classified (caller must check before invoking `classify`)
+- `to_dict()` produces all 5 fields
+- `to_error_message()` formats suggestions as "建议：1) ... 2) ..."
+- `EmergencyRules.classify(Exception("unknown"))` → `error_code="internal_error"`, `retryable=False`
+- Config override: custom suggestion list for `timeout` rule via config dict
+- `TaskResult` with `error_struct` set: `to_dict()` includes both `error_message` and `error_struct`
+- `TaskResult` with `error_struct=None` (default): `to_dict()` matches current behavior (backward compat)
+
+**Verification:** All scenarios pass; `fallback.py` existing 3 constants unchanged; `TaskResult.to_dict` for tasks without `error_struct` matches pre-change output byte-for-byte.
+
+---
+
+### U3. G7 — Three-Tier Fallback Chain Wiring
+
+**Goal:** Wire main → Recovery (ReflexionEngine) → Emergency (EmergencyRules) at `chat.py:613`. Composes U2's infrastructure with existing `ReflexionEngine`.
+
+**Requirements:** R16, R18, R26
+
+**Dependencies:** U2 (EmergencyRules + TaskResult.error_struct)
+
+**Files:**
+- Modify: `src/agentkit/server/routes/chat.py` (wrap main agent call at L613 with three-tier chain)
+- Modify: `src/agentkit/server/config.py` (add `fallback_chain` section to `from_dict`)
+- Modify: `agentkit.yaml` (document `fallback_chain:` section)
+- Create: `tests/unit/test_fallback_chain.py`
+
+**Approach:**
+
+New helper module or inline function in `chat.py`:
+```
+async def execute_with_fallback_chain(
+    agent: ConfigDrivenAgent,
+    task: Task,
+    config: dict,    # fallback_chain config section
+    llm_gateway: LLMGateway,
+) -> TaskResult:
+    # Tier 1: Main
+    try:
+        result = await agent.execute(task)
+        if result.status == "success" or result.status == "completed":
+            return result
+        # Treat non-success as soft failure → trigger Recovery
+        raise AgentExecutionError(result.error_message or "main agent did not succeed")
+    except (TaskTimeoutError, LoopDetectedError, LLMProviderError, AgentExecutionError) as exc:
+        if not config.get("recovery", {}).get("enabled", True):
+            return _to_emergency(exc, task)
+
+        # Tier 2: Recovery (ReflexionEngine)
+        try:
+            reflexion_engine = ReflexionEngine(
+                llm_gateway=llm_gateway,
+                max_reflections=config.get("recovery", {}).get("max_retries", 1),
+                # ... other params from agent's existing reflexion config
+            )
+            recovery_result = await reflexion_engine.execute(
+                messages=task.messages,    # rebuild from task
+                tools=agent.get_tools(),
+                model=task.model,
+                agent_name=agent.name,
+                task_id=task.task_id,
+            )
+            if recovery_result.status == "success":
+                return _recovery_to_task_result(recovery_result, task)
+        except Exception as recovery_exc:
+            logger.warning(f"Recovery layer failed: {recovery_exc}")
+
+        # Tier 3: Emergency
+        return _to_emergency(exc, task)
+```
+
+`_to_emergency(exc, task)` constructs `EmergencyError` via `EmergencyRules.classify(exc, config)`, then returns `TaskResult(status=FAILED, error_message=emergency.to_error_message(), error_struct=emergency.to_dict(), ...)`.
+
+Wiring at `chat.py:613`: replace direct `agent.execute(task)` call with `execute_with_fallback_chain(...)`. Recovery config from `server_config.fallback_chain` (new `ServerConfig.fallback_chain: dict` field, mirroring Wave 1 pattern).
+
+**Recovery layer scope (KTD5):** Only `chat.py:613` is wrapped. CLI (`cli/chat.py:338`), ReWOO (`rewoo.py:1204`), ReflexionEngine's internal ReAct call, and `config_driven.py:695` are NOT wrapped (would create recursive loop or unwanted coupling). These remain on the direct-execute path. Documented in `## Scope Boundaries`.
+
+**Patterns to follow:**
+- `config_driven.py:836` — existing `ReflexionEngine` instantiation pattern (constructor params, execute call)
+- Wave 1's `verification` config section in `ServerConfig.from_dict` (`config.py:240-273`)
+
+**Test scenarios:**
+- Happy path: main agent succeeds → no Recovery, no Emergency triggered; `error_struct=None`
+- Main fails (timeout) → Recovery triggered → Recovery succeeds → `error_struct=None`, output from ReflexionEngine
+- Main fails (timeout) → Recovery triggered → Recovery fails (max_reflections exhausted) → Emergency triggered → `error_struct` populated with `error_code="timeout"`, `retryable=True`
+- Main fails (LoopDetectedError) → Emergency `error_code="loop_detected"`
+- Main fails (LLMProviderError) → Emergency `error_code="llm_failure"`
+- Main fails (TaskCancelledError) → propagates as-is (NOT routed to Emergency)
+- Main fails (generic Exception) → Emergency `error_code="internal_error"`, `retryable=False`
+- Config: `fallback_chain.recovery.enabled=false` → skip Recovery, go directly to Emergency
+- Config: `fallback_chain.emergency.enabled=false` → re-raise original exception (no Emergency)
+- Integration: full chain on real `ConfigDrivenAgent` with mocked LLM (mock main raises, mock ReflexionEngine succeeds)
+
+**Verification:** All scenarios pass; existing chat WebSocket tests still pass; ruff clean.
+
+---
+
+### U4. G9 — PlanPhase Rollback Fields + RollbackExecutor + TeamOrchestrator Integration
+
+**Goal:** Add `validation_command`/`rollback_command` optional fields to `PlanPhase`; execute rollback on phase failure (opt-in); coordinate with U7 checkpoint ordering per R21.
+
+**Requirements:** R19, R20, R21, R26, R27
+
+**Dependencies:** none (uses existing U7 `PipelineCheckpoint` and `VerificationLoop` patterns)
+
+**Files:**
+- Modify: `src/agentkit/experts/plan.py` (`PlanPhase` dataclass + `to_dict`/`from_dict` symmetry)
+- Modify: `src/agentkit/experts/orchestrator.py` (insert rollback execution between phase failure and checkpoint save)
+- Create: `src/agentkit/orchestrator/rollback.py` (`RollbackExecutor` class)
+- Modify: `src/agentkit/server/config.py` (add `rollback` section to `from_dict`)
+- Modify: `agentkit.yaml` (document `rollback:` section)
+- Create: `tests/unit/test_phase_rollback.py`
+
+**Approach:**
+
+**PlanPhase extension (`plan.py:147-233`):** Add two optional fields with `None` default (preserves existing contract per Finding 1):
+```
+validation_command: str | None = None
+rollback_command: str | None = None
+```
+Update `to_dict()` to include both fields (only when not None, to keep dict shape minimal). Update `from_dict()` to read both fields.
+
+**`RollbackExecutor` class (new file `orchestrator/rollback.py`):** Mirrors `VerificationLoop` pattern (`verification_loop.py:67-103`):
+```
+class RollbackExecutor:
+    def __init__(self, working_dir: str | None = None, timeout: float = 30.0): ...
+
+    async def execute(self, command: str) -> RollbackResult:
+        # asyncio.create_subprocess_shell with cwd/timeout/proc.kill()
+        # Returns RollbackResult(passed, exit_code, stdout, stderr, command)
+
+    async def validate(self, command: str) -> RollbackResult:
+        # Same as execute but returns passed=False on non-zero exit
+```
+
+`RollbackResult` dataclass: `{passed: bool, exit_code: int, stdout: str, stderr: str, command: str}`.
+
+**TeamOrchestrator integration (`orchestrator.py:246-270`):** New ordering per KTD8:
+
+```
+# Existing: phase failure detected, mark FAILED + dependents FAILED
+# (lines 247-261 unchanged)
+
+# NEW: rollback phase (only if validation_command and rollback_command configured)
+should_save_checkpoint = True
+if ph.validation_command and ph.rollback_command:
+    validator = RollbackExecutor(working_dir=self._workspace_root, timeout=...)
+    validation_result = await validator.validate(ph.validation_command)
+    if not validation_result.passed:
+        rollback_result = await validator.execute(ph.rollback_command)
+        if not rollback_result.passed:
+            # Rollback failed → don't save checkpoint (R21)
+            should_save_checkpoint = False
+            logger.error(f"Rollback failed for phase {ph.id}: {rollback_result.stderr}")
+            # Emit phase_rollback_failed event
+            await self._broadcast_event("phase_rollback_failed", {...})
+
+# Existing: checkpoint save (conditional now)
+if should_save_checkpoint and self._checkpoint is not None:
+    try:
+        await self._checkpoint.save(plan.id, ph, plan.status.value)
+    except Exception as e:
+        logger.warning(...)
+```
+
+**Events emitted (new):** `phase_rollback_started`, `phase_rollback_completed`, `phase_rollback_failed`. Existing `phase_failed` event unchanged (emitted before rollback).
+
+**Config:** `rollback.default_timeout: float = 30.0` in `ServerConfig.rollback` dict (Wave 1 pattern).
+
+**Security boundary (KTD7):** Rollback subprocess does NOT go through `ShellTool` (avoids `confirm_callback` for `git checkout`). Audit log emitted via `_broadcast_event`. `terminal_whitelist` is not consulted (only governs `/api/v1/terminal/server` route, per `terminal_server.py:227`).
+
+**Execution note:** characterization-first. Test current behavior (`rollback_command=None` → no rollback, checkpoint saved) before adding rollback behavior.
+
+**Patterns to follow:**
+- `VerificationLoop` subprocess execution pattern (`verification_loop.py:67-103`) — `asyncio.create_subprocess_shell` + `cwd` + `timeout` + `proc.kill()`
+- `PlanPhase.to_dict`/`from_dict` symmetric serialization pattern at `plan.py:186-233`
+- `TeamOrchestrator._execute_phase` event broadcasting pattern (`orchestrator.py:253-260`)
+- Wave 1's `ServerConfig.from_dict` extension template
+
+**Test scenarios:**
+- Characterization: `PlanPhase()` with no `validation_command`/`rollback_command` → `to_dict()` output matches pre-change shape (no new keys); `from_dict({})` produces phase with both fields as None
+- Serialization: phase with `rollback_command="git checkout foo.py"` → `to_dict()` includes key; `from_dict(to_dict())` round-trips
+- RollbackExecutor happy path: `execute("git status")` returns `passed=True`, `exit_code=0`
+- RollbackExecutor timeout: `execute("sleep 10", timeout=0.1)` returns `passed=False`, exit_code=-1 (or similar)
+- RollbackExecutor failure: `execute("false")` returns `passed=False`, `exit_code=1`
+- Integration — opt-in default: phase fails, `rollback_command=None` → no RollbackExecutor call, checkpoint saved (existing behavior)
+- Integration — rollback configured, validation passes (returns 0): rollback NOT executed, checkpoint saved
+- Integration — rollback configured, validation fails, rollback succeeds (returns 0): rollback executed, checkpoint saved
+- Integration — rollback configured, validation fails, rollback fails (returns 1): rollback executed, checkpoint NOT saved (R21), `phase_rollback_failed` event emitted
+- Integration — real git repo fixture: phase writes file via `git checkout`, rollback `git checkout foo.py` restores file; assert file content matches pre-phase state
+
+**Verification:** All scenarios pass; existing `tests/unit/test_pipeline_state.py` and `tests/unit/test_team_orchestrator*.py` (if any) still pass; ruff clean.
+
+---
+
+## Scope Boundaries
+
+### Deferred to Follow-Up Work
+
+- **Recovery wiring at non-chat call sites** (`cli/chat.py:338`, `rewoo.py:1204`, `config_driven.py:695`). KTD5 limits Wave 2 to the primary chat WebSocket path. Other entry points can adopt the same wrapper in a follow-up.
+- **Recovery layer streaming events.** Wave 2's chain returns final `TaskResult`; SSE events for "recovery_started"/"recovery_completed" would require deeper WebSocket protocol changes.
+- **Patch-level rollback** (`git apply -R <patch>`). KTD6 + KTD7 scope Wave 2 to `git checkout <files>` pattern. Patch-level requires extending `CheckpointData` schema to track patches — Wave 3 candidate.
+- **DB-backed atomic claim for checkpoint.** Finding 3 noted `PipelineCheckpoint` is in-memory dict + Redis fallback, not a PostgreSQL `FOR UPDATE SKIP LOCKED` pattern. Multi-process atomicity is out of scope; single-process `asyncio.Lock` (already present at `orchestrator.py:53`) is sufficient.
+
+### Outside this product's identity (carried from brainstorm)
+
+- Wave 3 (G5/G6) — tree-sitter integration, SOLO four-stage state machine. Strategic direction locked in brainstorm KTD6/KTD7; implementation design deferred to Wave 3 plan.
+- Node-level checkpoint (ReAct single-step). Stage-level (U7) satisfies core need.
+- DeerFlow-style disk filesystem. Redis is the persistence layer.
+- Full LangGraph migration. Self-built architecture stays.
+
+---
+
+## Risks & Dependencies
+
+- **Risk: Auxiliary model availability.** If `auxiliary_model` alias is not configured in `agentkit.yaml`, `LLMGateway` raises `ModelNotFoundError`. Mitigation: `ContextCompressor` catches this in the auxiliary try-block and falls through to main model (R15 fallback path). Documented in YAML comments.
+- **Risk: Recovery layer recursive loop.** `ReflexionEngine.execute()` internally calls `ReActEngine.execute()` (5th call site). If Recovery itself fails, it must NOT trigger another Recovery — KTD5 wires only at `chat.py:613`, so ReflexionEngine's internal ReAct call bypasses the chain. Recursive loop is structurally impossible.
+- **Risk: `git checkout` destructive scope.** Misconfigured `rollback_command` (e.g., `git checkout .`) could wipe unrelated changes. Mitigation: rollback is opt-in per-phase (KTD6); audit log via `phase_rollback_started`/`phase_rollback_failed` events; documented YAML examples use file-scoped commands (`git checkout <specific_files>`).
+- **Risk: `TaskResult.error_struct` field addition breaks pickle/serialization.** `TaskResult` is a dataclass with `to_dict`; new optional field with `None` default is backward-compatible for `to_dict` consumers. Pickle deserialization of pre-change data still works (new field defaults to None).
+- **Dependency: Wave 1 merged.** `ServerConfig.from_dict` extension template (prompt_cache/streaming/verification sections) is established and tested. Wave 2's 3 new sections (auxiliary_model, fallback_chain, rollback) reuse this exact pattern.
+- **Dependency: U7 PipelineCheckpoint already shipped.** `orchestrator/checkpoint.py:56` exists with `save(plan_id, phase, plan_status)` API; U4 only adjusts the calling order, not the API.
+
+---
+
+## Sources / Research
+
+- Brainstorm: `docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md` (Wave 2 section, R13-R21, KTD4/KTD5)
+- Wave 1 plan: `docs/plans/2026-06-29-002-feat-agent-wave1-quick-wins-plan.md` (ServerConfig.from_dict extension template established)
+- Finding 1 (contract preservation rule): `docs/solutions/logic-errors/long-horizon-reliability-code-review-fixes.md` — "新字段默认值须保持既有契约" drives KTD6's opt-in default for `rollback_command`
+- Finding 2 (retry-storm defense): `docs/solutions/security-issues/portal-platform-security-reliability-fixes.md` — Emergency layer's terminal `error_code` pattern (don't propagate unhandled exceptions)
+- Finding 3 (atomic claim pattern): `docs/solutions/architecture-patterns/bitable-companion-service-security-reliability-patterns.md` — `SKIP LOCKED` not reusable; `PipelineCheckpoint` is in-memory dict + Redis
+- Finding 4 (empty-response anti-pattern): `docs/solutions/ui-bugs/tauri-reload-loses-session.md` — auxiliary LLM must check truthy return value, not just absence of exception; drives U1's empty-content fallback
+- Code locations verified during planning:
+  - `src/agentkit/core/compressor.py:35-44,123-158` — `_summarize` injection point for G4
+  - `src/agentkit/llm/gateway.py:170-227` — existing per-model fallback chain (semantics differ from G4's task-type routing)
+  - `src/agentkit/llm/config.py:198-257` — `LLMConfig` dataclass and `from_dict`
+  - `src/agentkit/core/reflexion.py:68-111` — `ReflexionEngine.__init__` and `execute()` signatures (already supports `evaluate_model`/`reflect_model`)
+  - `src/agentkit/core/fallback.py` (full file, 19 lines) — 3 existing constants; `EmergencyRules` adds alongside
+  - `src/agentkit/core/protocol.py:118-145` — `TaskResult` dataclass and `to_dict`
+  - `src/agentkit/experts/plan.py:147-233` — `PlanPhase` dataclass, `to_dict`/`from_dict`
+  - `src/agentkit/experts/orchestrator.py:246-270` — phase failure capture + checkpoint save site (U4 integration point)
+  - `src/agentkit/orchestrator/checkpoint.py:56` — `PipelineCheckpoint` (U7, no API change needed)
+  - `src/agentkit/core/verification_loop.py:67-103` — subprocess execution pattern for `RollbackExecutor`
+  - `src/agentkit/tools/shell.py:168-174,526-534` — `_DANGEROUS_BINARY_FLAGS` (git checkout not listed), `_is_dangerous` logic (KTD7 rationale)