31 KiB
| date | type | origin |
|---|---|---|
| 2026-06-29 | feat | docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md |
Wave 2 — Auxiliary LLM Routing, Three-Tier Fallback, Atomic Subtask Rollback
Summary
Wave 2 of the advanced-agent gap optimization brainstorm. Three medium-coupling gaps: G4 routes summary calls through a cheaper auxiliary LLM (falling back to main on failure), G7 introduces a three-tier fallback chain (main → Recovery via existing ReflexionEngine → Emergency with structured errors), G9 binds atomic subtask rollback to PlanPhase (opt-in via rollback_command, coordinated with the existing U7 PipelineCheckpoint).
Problem Frame
Wave 1 shipped self-contained quick wins (G1/G2/G3/G8). Wave 2 addresses the medium-coupling gaps that touch multiple layers and resolve two deferred-to-planning decisions surfaced in the brainstorm: G7 Emergency layer rule template shape, and G9 rollback_command default behavior. The constraints these gaps must respect come from docs/solutions/logic-errors/long-horizon-reliability-code-review-fixes.md — new fields must preserve existing contracts, dynamic plan mutations must persist immediately, and PipelineCheckpoint is in-memory dict + Redis fallback (not a DB row lock).
Requirements
Carried from docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md § Wave 2 (R13-R21):
- R13.
ContextCompressorsummary task routes to auxiliary model (cheap, e.g. Gemini Flash / Doubao lite), not main model. - R14.
auxiliary_modelis configurable, separated from mainmodel. - R15. Summary quality does not degrade: auxiliary failure falls back to main model (not to simple_summary).
- R16. Main agent failure triggers Recovery layer (reuse
ReflexionEngineEvaluate→Reflect→Retry). - R17. Recovery failure triggers Emergency layer (rule-based fallback, structured error + suggestion).
- R18. Fallback chain is configurable (per-layer max retries, enable/disable Recovery/Emergency).
- R19.
PlanPhase(experts/plan.py:147) gains optionalvalidation_commandandrollback_commandfields. - R20. Phase failure auto-executes
rollback_commandwhen configured (defaultgit checkoutpattern); coordinated with U7 checkpoint. - R21.
checkpoint.saveruns only after rollback validation passes (avoid persisting failed state).
Cross-cutting: R26 (configuration via agentkit.yaml), R27 (each gap ships a minimal self-check test, per ponytail rule).
Key Technical Decisions
KTD1. Auxiliary routing is compressor-scoped, not gateway-level. LLMGateway already has a per-model fallback chain (gateway.py:170-227), but it semantics are "main fails → try backup model", not "route by task type". G4 adds an auxiliary_model parameter to ContextCompressor.__init__; _summarize tries auxiliary first, falls back to main on failure. This avoids touching the gateway's existing fallback semantics and keeps the change to one file plus config wiring.
KTD2. Emergency layer adds a parallel error_struct field, not replacing error_message. TaskResult.error_message: str | None is serialized to API responses (protocol.py:132-145) — changing its type breaks frontend contracts. New error_struct: dict | None field carries {error_code, message, suggestions, retryable}. error_message remains the human-readable string (now derived from error_struct.message when present). Backward compatibility preserved.
KTD3. Emergency rule classification by exception type, not by status field. TaskTimeoutError → timeout, LoopDetectedError → loop_detected, LLMProviderError → llm_failure, TaskCancelledError → cancelled (not Emergency-eligible, propagates), generic Exception → internal_error. Each rule maps to a fixed message + suggestion list (no LLM-driven suggestions — pure rule-based, per brainstorm).
KTD4. Recovery layer reuses ReflexionEngine with main model, not auxiliary. ReflexionEngine.execute() already supports evaluate_model/reflect_model parameters (reflexion.py:94-111), defaulting to the main model. Recovery triggers on main agent failure; using the main model for evaluate/reflect maximizes diagnostic reliability (the same model that failed is best positioned to reflect on its own failure). Auxiliary model is reserved for G4's cost-sensitive compression path.
KTD5. Three-tier chain wires at chat.py:613 (WebSocket main path), not at ReActEngine.execute(). ReAct has 5 call sites (chat.py:613, rewoo.py:1204, cli/chat.py:338, reflexion.py:216, config_driven.py:695). Wrapping at ReAct would force all 5 sites through Recovery/Emergency, including ReflexionEngine itself (creating a recursive loop). Wrapping at chat.py:613 covers the primary user-facing path; CLI and ReWOO are out of scope (deferred to follow-up).
KTD6. Rollback is opt-in via rollback_command field; no implicit default. Brainstorm KTD5 mentioned "default git checkout" but auto-executing git checkout on every phase failure changes existing behavior (Finding 1: "新字段默认值须保持既有契约"). Resolution: rollback_command: str | None = None. When unset, no rollback executes (preserves existing contract). When set, the configured command runs after validation failure, before checkpoint save. git checkout <files> is the canonical pattern documented in YAML examples.
KTD7. Rollback executes via RollbackExecutor, not via ShellTool. ShellTool._is_dangerous would trigger confirm_callback for git checkout (not in _SAFE_COMMAND_PREFIXES, not in _DANGEROUS_BINARY_FLAGS). Orchestrator-internal execution bypasses ShellTool — uses asyncio.create_subprocess_shell with cwd/timeout/proc.kill() pattern from verification_loop.py:67-103. Audit logging added (not a whitelist concern; terminal_whitelist only governs /api/v1/terminal/server route per terminal_server.py:227).
KTD8. G9 checkpoint ordering: validation → rollback → checkpoint save. Currently orchestrator.py:265 saves checkpoint unconditionally after phase finalizes (success or failure). R21 requires checkpoint save only after rollback validation passes. New ordering: phase fails → mark FAILED → mark dependents FAILED → execute validation_command (if set) → if validation fails, execute rollback_command (if set) → save checkpoint (only if rollback validation passed or no rollback configured).
High-Level Technical Design
Three-tier fallback chain (G7)
stateDiagram-v2
[*] --> Main: chat request
Main --> MainSuccess: success
Main --> Recovery: failure (exception or empty_fallback)
Recovery --> RecoverySuccess: ReflexionEngine retry succeeds
Recovery --> Emergency: max_reflections exhausted
Emergency --> [*]: emit error_struct, terminal
MainSuccess --> [*]: normal response
RecoverySuccess --> [*]: recovered response
G9 phase failure + rollback sequence
sequenceDiagram
participant O as TeamOrchestrator
participant P as PlanPhase
participant RE as RollbackExecutor
participant CP as PipelineCheckpoint
O->>P: execute phase
P-->>O: failure (exception)
O->>O: mark FAILED + dependents FAILED
alt validation_command set
O->>RE: run validation_command
RE-->>O: ValidationResult
alt validation fails AND rollback_command set
O->>RE: run rollback_command
RE-->>O: RollbackResult
alt rollback validation passes
O->>CP: save checkpoint
else rollback validation fails
O->>O: log rollback_failed, skip checkpoint
end
else no rollback_command
O->>CP: save checkpoint (no rollback)
end
else no validation_command
O->>CP: save checkpoint (no validation)
end
Configuration shape (YAML)
# G4: Auxiliary LLM for compression
llm:
auxiliary_model: fast # alias resolved via existing model_aliases mechanism
# G7: Three-tier fallback chain
fallback_chain:
enabled: true
recovery:
enabled: true
max_retries: 1 # ReflexionEngine max_reflections override
emergency:
enabled: true
# G9: Rollback configuration (per-phase opt-in via PlanPhase.rollback_command)
rollback:
default_timeout: 30.0 # RollbackExecutor subprocess timeout
Implementation Units
U1. G4 — Auxiliary LLM Routing in ContextCompressor
Goal: Route _summarize LLM calls through auxiliary_model when configured; fall back to main model on auxiliary failure (not to _simple_summary).
Requirements: R13, R14, R15, R26, R27
Dependencies: none (self-contained)
Files:
- Modify:
src/agentkit/core/compressor.py(addauxiliary_modelparam + routing in_summarize) - Modify:
src/agentkit/llm/config.py(addauxiliary_model: str | Nonefield toLLMConfig) - Modify:
src/agentkit/server/config.py(_build_llm_configreadsauxiliary_modelfrom llm section) - Modify:
agentkit.yaml(documentllm.auxiliary_model: fastin llm section) - Create:
tests/unit/test_compressor_auxiliary.py
Approach: ContextCompressor.__init__ gains auxiliary_model: str | None = None param (after existing model param). _summarize (compressor.py:123-158) restructuring:
- If
auxiliary_modelis set and differs frommodel, tryauxiliary_modelfirst viaself._llm_gateway.chat(model=self._auxiliary_model, ...). - Truthiness check (Finding 4 anti-pattern): treat empty
response.content(None or whitespace-only) as failure, not success. On failure, log and fall through to main model. - If auxiliary succeeds with non-empty content, return it.
- If auxiliary fails (exception OR empty content), retry with main
self._model. - Existing
except Exception → _simple_summaryblock remains as the final degradation tier.
LLMConfig.auxiliary_model reads from data.get("auxiliary_model") in _build_llm_config. Agentkit.yaml already declares fast: bailian-coding/qwen-turbo alias — this is the canonical auxiliary target.
Execution note: characterization-first. Capture current _summarize behavior (single model, exception → simple_summary) with auxiliary_model=None tests, then add auxiliary_model="fast" tests for new routing behavior.
Patterns to follow:
LLMGateway._get_models_to_try(gateway.py:170-227) — existing fallback chain pattern- Wave 1's
ServerConfig.from_dictextension template (prompt_cache/streaming/verification sections,config.py:239-273)
Test scenarios:
- Happy path:
auxiliary_model="fast"set, auxiliary returns non-empty content → result is auxiliary content, main model not called - Empty content fallback: auxiliary returns
content=""(orNone) → main model called, main content returned (covers Finding 4 anti-pattern) - Auxiliary exception: auxiliary raises
LLMProviderError→ main model called, main content returned - Both fail: auxiliary raises, main raises →
_simple_summaryreturned (existing degradation preserved) - Characterization:
auxiliary_model=None→ behavior matches current code (single model call) - Config wiring:
LLMConfig.from_dictreadsauxiliary_modelfield from dict;ServerConfig._build_llm_configpasses it through - Audit: auxiliary call uses
agent_name="compressor",task_type="summarization"(preserved for usage tracking)
Verification: All test scenarios pass; existing tests/unit/test_compressor*.py (if any) still pass with auxiliary_model=None default; ruff clean.
U2. G7 — Emergency Layer Rule Template + TaskResult Extension
Goal: Add rule-based Emergency classifier with structured error output. No wiring yet — just the infrastructure (classifier + data structure + fallback.py extension).
Requirements: R17, R18, R26, R27
Dependencies: none (foundation for U3)
Files:
- Modify:
src/agentkit/core/fallback.py(addEmergencyRulesclass +EmergencyErrordataclass, preserve existing 3 constants) - Modify:
src/agentkit/core/protocol.py(adderror_struct: dict | None = Nonefield toTaskResult, updateto_dict) - Create:
tests/unit/test_emergency_rules.py
Approach:
EmergencyError dataclass (in fallback.py):
@dataclass
class EmergencyError:
error_code: str # "timeout"|"loop_detected"|"llm_failure"|"internal_error"
message: str # human-readable Chinese message (mirrors EMPTY_LLM_RESPONSE style)
suggestions: list[str] # actionable user-facing suggestions
retryable: bool # whether user retry might succeed
original_error: str # str(exc) for traceability
def to_dict(self) -> dict: ...
def to_error_message(self) -> str: ... # formatted "message\n建议:1) ... 2) ..."
EmergencyRules class (rule-based classifier, no LLM):
class EmergencyRules:
@staticmethod
def classify(exc: Exception, config: dict | None = None) -> EmergencyError:
# Match by exception type (TaskTimeoutError, LoopDetectedError, LLMProviderError, etc.)
# config allows per-rule customization (suggestion overrides, retryable overrides)
Rule mapping (initial set, expandable via config):
TaskTimeoutError→error_code="timeout",retryable=True, suggestions: ["稍后重试", "简化任务范围"]LoopDetectedError→error_code="loop_detected",retryable=True, suggestions: ["拆分任务", "检查工具参数"]LLMProviderError→error_code="llm_failure",retryable=True, suggestions: ["稍后重试", "切换模型"]TaskCancelledError→ not classified (propagates as-is, Emergency not triggered)- Generic
Exception→error_code="internal_error",retryable=False, suggestions: ["联系管理员"]
TaskResult extension:
@dataclass
class TaskResult:
# ... existing fields ...
error_message: str | None # unchanged
error_struct: dict | None = None # NEW: serialized EmergencyError.to_dict()
to_dict includes error_struct when set; error_message continues to hold the human-readable string.
Patterns to follow:
- Existing
EMPTY_LLM_RESPONSE/MAX_STEPS_REACHEDstyle infallback.py(Chinese, "建议:..." format) VerificationResult.errors: list[str]field pattern fromverification_loop.py:18-24TaskResult.to_dictpattern atprotocol.py:132-145
Test scenarios:
- Happy path:
classify(TaskTimeoutError(...))returnsEmergencyError(error_code="timeout", retryable=True) - Each exception type maps to correct
error_code TaskCancelledErroris NOT classified (caller must check before invokingclassify)to_dict()produces all 5 fieldsto_error_message()formats suggestions as "建议:1) ... 2) ..."EmergencyRules.classify(Exception("unknown"))→error_code="internal_error",retryable=False- Config override: custom suggestion list for
timeoutrule via config dict TaskResultwitherror_structset:to_dict()includes botherror_messageanderror_structTaskResultwitherror_struct=None(default):to_dict()matches current behavior (backward compat)
Verification: All scenarios pass; fallback.py existing 3 constants unchanged; TaskResult.to_dict for tasks without error_struct matches pre-change output byte-for-byte.
U3. G7 — Three-Tier Fallback Chain Wiring
Goal: Wire main → Recovery (ReflexionEngine) → Emergency (EmergencyRules) at chat.py:613. Composes U2's infrastructure with existing ReflexionEngine.
Requirements: R16, R18, R26
Dependencies: U2 (EmergencyRules + TaskResult.error_struct)
Files:
- Modify:
src/agentkit/server/routes/chat.py(wrap main agent call at L613 with three-tier chain) - Modify:
src/agentkit/server/config.py(addfallback_chainsection tofrom_dict) - Modify:
agentkit.yaml(documentfallback_chain:section) - Create:
tests/unit/test_fallback_chain.py
Approach:
New helper module or inline function in chat.py:
async def execute_with_fallback_chain(
agent: ConfigDrivenAgent,
task: Task,
config: dict, # fallback_chain config section
llm_gateway: LLMGateway,
) -> TaskResult:
# Tier 1: Main
try:
result = await agent.execute(task)
if result.status == "success" or result.status == "completed":
return result
# Treat non-success as soft failure → trigger Recovery
raise AgentExecutionError(result.error_message or "main agent did not succeed")
except (TaskTimeoutError, LoopDetectedError, LLMProviderError, AgentExecutionError) as exc:
if not config.get("recovery", {}).get("enabled", True):
return _to_emergency(exc, task)
# Tier 2: Recovery (ReflexionEngine)
try:
reflexion_engine = ReflexionEngine(
llm_gateway=llm_gateway,
max_reflections=config.get("recovery", {}).get("max_retries", 1),
# ... other params from agent's existing reflexion config
)
recovery_result = await reflexion_engine.execute(
messages=task.messages, # rebuild from task
tools=agent.get_tools(),
model=task.model,
agent_name=agent.name,
task_id=task.task_id,
)
if recovery_result.status == "success":
return _recovery_to_task_result(recovery_result, task)
except Exception as recovery_exc:
logger.warning(f"Recovery layer failed: {recovery_exc}")
# Tier 3: Emergency
return _to_emergency(exc, task)
_to_emergency(exc, task) constructs EmergencyError via EmergencyRules.classify(exc, config), then returns TaskResult(status=FAILED, error_message=emergency.to_error_message(), error_struct=emergency.to_dict(), ...).
Wiring at chat.py:613: replace direct agent.execute(task) call with execute_with_fallback_chain(...). Recovery config from server_config.fallback_chain (new ServerConfig.fallback_chain: dict field, mirroring Wave 1 pattern).
Recovery layer scope (KTD5): Only chat.py:613 is wrapped. CLI (cli/chat.py:338), ReWOO (rewoo.py:1204), ReflexionEngine's internal ReAct call, and config_driven.py:695 are NOT wrapped (would create recursive loop or unwanted coupling). These remain on the direct-execute path. Documented in ## Scope Boundaries.
Patterns to follow:
config_driven.py:836— existingReflexionEngineinstantiation pattern (constructor params, execute call)- Wave 1's
verificationconfig section inServerConfig.from_dict(config.py:240-273)
Test scenarios:
- Happy path: main agent succeeds → no Recovery, no Emergency triggered;
error_struct=None - Main fails (timeout) → Recovery triggered → Recovery succeeds →
error_struct=None, output from ReflexionEngine - Main fails (timeout) → Recovery triggered → Recovery fails (max_reflections exhausted) → Emergency triggered →
error_structpopulated witherror_code="timeout",retryable=True - Main fails (LoopDetectedError) → Emergency
error_code="loop_detected" - Main fails (LLMProviderError) → Emergency
error_code="llm_failure" - Main fails (TaskCancelledError) → propagates as-is (NOT routed to Emergency)
- Main fails (generic Exception) → Emergency
error_code="internal_error",retryable=False - Config:
fallback_chain.recovery.enabled=false→ skip Recovery, go directly to Emergency - Config:
fallback_chain.emergency.enabled=false→ re-raise original exception (no Emergency) - Integration: full chain on real
ConfigDrivenAgentwith mocked LLM (mock main raises, mock ReflexionEngine succeeds)
Verification: All scenarios pass; existing chat WebSocket tests still pass; ruff clean.
U4. G9 — PlanPhase Rollback Fields + RollbackExecutor + TeamOrchestrator Integration
Goal: Add validation_command/rollback_command optional fields to PlanPhase; execute rollback on phase failure (opt-in); coordinate with U7 checkpoint ordering per R21.
Requirements: R19, R20, R21, R26, R27
Dependencies: none (uses existing U7 PipelineCheckpoint and VerificationLoop patterns)
Files:
- Modify:
src/agentkit/experts/plan.py(PlanPhasedataclass +to_dict/from_dictsymmetry) - Modify:
src/agentkit/experts/orchestrator.py(insert rollback execution between phase failure and checkpoint save) - Create:
src/agentkit/orchestrator/rollback.py(RollbackExecutorclass) - Modify:
src/agentkit/server/config.py(addrollbacksection tofrom_dict) - Modify:
agentkit.yaml(documentrollback:section) - Create:
tests/unit/test_phase_rollback.py
Approach:
PlanPhase extension (plan.py:147-233): Add two optional fields with None default (preserves existing contract per Finding 1):
validation_command: str | None = None
rollback_command: str | None = None
Update to_dict() to include both fields (only when not None, to keep dict shape minimal). Update from_dict() to read both fields.
RollbackExecutor class (new file orchestrator/rollback.py): Mirrors VerificationLoop pattern (verification_loop.py:67-103):
class RollbackExecutor:
def __init__(self, working_dir: str | None = None, timeout: float = 30.0): ...
async def execute(self, command: str) -> RollbackResult:
# asyncio.create_subprocess_shell with cwd/timeout/proc.kill()
# Returns RollbackResult(passed, exit_code, stdout, stderr, command)
async def validate(self, command: str) -> RollbackResult:
# Same as execute but returns passed=False on non-zero exit
RollbackResult dataclass: {passed: bool, exit_code: int, stdout: str, stderr: str, command: str}.
TeamOrchestrator integration (orchestrator.py:246-270): New ordering per KTD8:
# Existing: phase failure detected, mark FAILED + dependents FAILED
# (lines 247-261 unchanged)
# NEW: rollback phase (only if validation_command and rollback_command configured)
should_save_checkpoint = True
if ph.validation_command and ph.rollback_command:
validator = RollbackExecutor(working_dir=self._workspace_root, timeout=...)
validation_result = await validator.validate(ph.validation_command)
if not validation_result.passed:
rollback_result = await validator.execute(ph.rollback_command)
if not rollback_result.passed:
# Rollback failed → don't save checkpoint (R21)
should_save_checkpoint = False
logger.error(f"Rollback failed for phase {ph.id}: {rollback_result.stderr}")
# Emit phase_rollback_failed event
await self._broadcast_event("phase_rollback_failed", {...})
# Existing: checkpoint save (conditional now)
if should_save_checkpoint and self._checkpoint is not None:
try:
await self._checkpoint.save(plan.id, ph, plan.status.value)
except Exception as e:
logger.warning(...)
Events emitted (new): phase_rollback_started, phase_rollback_completed, phase_rollback_failed. Existing phase_failed event unchanged (emitted before rollback).
Config: rollback.default_timeout: float = 30.0 in ServerConfig.rollback dict (Wave 1 pattern).
Security boundary (KTD7): Rollback subprocess does NOT go through ShellTool (avoids confirm_callback for git checkout). Audit log emitted via _broadcast_event. terminal_whitelist is not consulted (only governs /api/v1/terminal/server route, per terminal_server.py:227).
Execution note: characterization-first. Test current behavior (rollback_command=None → no rollback, checkpoint saved) before adding rollback behavior.
Patterns to follow:
VerificationLoopsubprocess execution pattern (verification_loop.py:67-103) —asyncio.create_subprocess_shell+cwd+timeout+proc.kill()PlanPhase.to_dict/from_dictsymmetric serialization pattern atplan.py:186-233TeamOrchestrator._execute_phaseevent broadcasting pattern (orchestrator.py:253-260)- Wave 1's
ServerConfig.from_dictextension template
Test scenarios:
- Characterization:
PlanPhase()with novalidation_command/rollback_command→to_dict()output matches pre-change shape (no new keys);from_dict({})produces phase with both fields as None - Serialization: phase with
rollback_command="git checkout foo.py"→to_dict()includes key;from_dict(to_dict())round-trips - RollbackExecutor happy path:
execute("git status")returnspassed=True,exit_code=0 - RollbackExecutor timeout:
execute("sleep 10", timeout=0.1)returnspassed=False, exit_code=-1 (or similar) - RollbackExecutor failure:
execute("false")returnspassed=False,exit_code=1 - Integration — opt-in default: phase fails,
rollback_command=None→ no RollbackExecutor call, checkpoint saved (existing behavior) - Integration — rollback configured, validation passes (returns 0): rollback NOT executed, checkpoint saved
- Integration — rollback configured, validation fails, rollback succeeds (returns 0): rollback executed, checkpoint saved
- Integration — rollback configured, validation fails, rollback fails (returns 1): rollback executed, checkpoint NOT saved (R21),
phase_rollback_failedevent emitted - Integration — real git repo fixture: phase writes file via
git checkout, rollbackgit checkout foo.pyrestores file; assert file content matches pre-phase state
Verification: All scenarios pass; existing tests/unit/test_pipeline_state.py and tests/unit/test_team_orchestrator*.py (if any) still pass; ruff clean.
Scope Boundaries
Deferred to Follow-Up Work
- Recovery wiring at non-chat call sites (
cli/chat.py:338,rewoo.py:1204,config_driven.py:695). KTD5 limits Wave 2 to the primary chat WebSocket path. Other entry points can adopt the same wrapper in a follow-up. - Recovery layer streaming events. Wave 2's chain returns final
TaskResult; SSE events for "recovery_started"/"recovery_completed" would require deeper WebSocket protocol changes. - Patch-level rollback (
git apply -R <patch>). KTD6 + KTD7 scope Wave 2 togit checkout <files>pattern. Patch-level requires extendingCheckpointDataschema to track patches — Wave 3 candidate. - DB-backed atomic claim for checkpoint. Finding 3 noted
PipelineCheckpointis in-memory dict + Redis fallback, not a PostgreSQLFOR UPDATE SKIP LOCKEDpattern. Multi-process atomicity is out of scope; single-processasyncio.Lock(already present atorchestrator.py:53) is sufficient.
Outside this product's identity (carried from brainstorm)
- Wave 3 (G5/G6) — tree-sitter integration, SOLO four-stage state machine. Strategic direction locked in brainstorm KTD6/KTD7; implementation design deferred to Wave 3 plan.
- Node-level checkpoint (ReAct single-step). Stage-level (U7) satisfies core need.
- DeerFlow-style disk filesystem. Redis is the persistence layer.
- Full LangGraph migration. Self-built architecture stays.
Risks & Dependencies
- Risk: Auxiliary model availability. If
auxiliary_modelalias is not configured inagentkit.yaml,LLMGatewayraisesModelNotFoundError. Mitigation:ContextCompressorcatches this in the auxiliary try-block and falls through to main model (R15 fallback path). Documented in YAML comments. - Risk: Recovery layer recursive loop.
ReflexionEngine.execute()internally callsReActEngine.execute()(5th call site). If Recovery itself fails, it must NOT trigger another Recovery — KTD5 wires only atchat.py:613, so ReflexionEngine's internal ReAct call bypasses the chain. Recursive loop is structurally impossible. - Risk:
git checkoutdestructive scope. Misconfiguredrollback_command(e.g.,git checkout .) could wipe unrelated changes. Mitigation: rollback is opt-in per-phase (KTD6); audit log viaphase_rollback_started/phase_rollback_failedevents; documented YAML examples use file-scoped commands (git checkout <specific_files>). - Risk:
TaskResult.error_structfield addition breaks pickle/serialization.TaskResultis a dataclass withto_dict; new optional field withNonedefault is backward-compatible forto_dictconsumers. Pickle deserialization of pre-change data still works (new field defaults to None). - Dependency: Wave 1 merged.
ServerConfig.from_dictextension template (prompt_cache/streaming/verification sections) is established and tested. Wave 2's 3 new sections (auxiliary_model, fallback_chain, rollback) reuse this exact pattern. - Dependency: U7 PipelineCheckpoint already shipped.
orchestrator/checkpoint.py:56exists withsave(plan_id, phase, plan_status)API; U4 only adjusts the calling order, not the API.
Sources / Research
- Brainstorm:
docs/brainstorms/2026-06-29-advanced-agent-gap-optimization-requirements.md(Wave 2 section, R13-R21, KTD4/KTD5) - Wave 1 plan:
docs/plans/2026-06-29-002-feat-agent-wave1-quick-wins-plan.md(ServerConfig.from_dict extension template established) - Finding 1 (contract preservation rule):
docs/solutions/logic-errors/long-horizon-reliability-code-review-fixes.md— "新字段默认值须保持既有契约" drives KTD6's opt-in default forrollback_command - Finding 2 (retry-storm defense):
docs/solutions/security-issues/portal-platform-security-reliability-fixes.md— Emergency layer's terminalerror_codepattern (don't propagate unhandled exceptions) - Finding 3 (atomic claim pattern):
docs/solutions/architecture-patterns/bitable-companion-service-security-reliability-patterns.md—SKIP LOCKEDnot reusable;PipelineCheckpointis in-memory dict + Redis - Finding 4 (empty-response anti-pattern):
docs/solutions/ui-bugs/tauri-reload-loses-session.md— auxiliary LLM must check truthy return value, not just absence of exception; drives U1's empty-content fallback - Code locations verified during planning:
src/agentkit/core/compressor.py:35-44,123-158—_summarizeinjection point for G4src/agentkit/llm/gateway.py:170-227— existing per-model fallback chain (semantics differ from G4's task-type routing)src/agentkit/llm/config.py:198-257—LLMConfigdataclass andfrom_dictsrc/agentkit/core/reflexion.py:68-111—ReflexionEngine.__init__andexecute()signatures (already supportsevaluate_model/reflect_model)src/agentkit/core/fallback.py(full file, 19 lines) — 3 existing constants;EmergencyRulesadds alongsidesrc/agentkit/core/protocol.py:118-145—TaskResultdataclass andto_dictsrc/agentkit/experts/plan.py:147-233—PlanPhasedataclass,to_dict/from_dictsrc/agentkit/experts/orchestrator.py:246-270— phase failure capture + checkpoint save site (U4 integration point)src/agentkit/orchestrator/checkpoint.py:56—PipelineCheckpoint(U7, no API change needed)src/agentkit/core/verification_loop.py:67-103— subprocess execution pattern forRollbackExecutorsrc/agentkit/tools/shell.py:168-174,526-534—_DANGEROUS_BINARY_FLAGS(git checkout not listed),_is_dangerouslogic (KTD7 rationale)