9.3 KiB
| title | status | created | plan_type | execution_posture |
|---|---|---|---|---|
| fix: AgentKit P0 Code Review Fixes | completed | 2026-06-07 | fix | TDD |
Summary
Fix 4 P0 issues and 1 import defect identified in the Phase 6+7 code review, unblocking merge to main. All units follow TDD: write failing tests first, then implement the fix.
Problem Frame
Code review of the feat/agentkit-phase7-headroom branch revealed 4 P0 defects that must be fixed before merge:
- CCR cache unbounded growth —
_ccr_cache: dict[str, str]grows without limit;ccr_ttlconfig is declared but never enforced - CCR hash collision —
sha256(...).hexdigest()[:16]truncates to 64 bits; collisions silently overwrite cached originals - OTel span leak —
_span_cm.__enter__()withouttry/finally; exception between enter and exit leaks the span - StdioTransport notification queue —
receive_response()raisesTransportErrorwhen queue is empty, inconsistent withSSETransportwhich awaits
Plus 1 import defect: mcp/__init__.py lists MCPServer and MCPClient in __all__ but never imports them.
Requirements
- R1: CCR cache must enforce capacity limit and TTL eviction
- R2: CCR hash must detect collisions and reject silent overwrites
- R3: OTel span lifecycle must use
try/finallyto guarantee cleanup - R4:
StdioTransport.receive_response()must await empty queue (consistent with SSETransport) - R5:
mcp/__init__.pymust import and exportMCPServerandMCPClient
Key Technical Decisions
KTD-1: CCR cache eviction strategy
Decision: Use collections.OrderedDict as an LRU with a configurable max_entries (default 1000). On insert, move to end (most-recent). When capacity exceeded, evict oldest (least-recent). TTL enforced by storing (content, timestamp) tuples and evicting expired entries on access.
Rationale: OrderedDict is stdlib, zero-dependency, and provides O(1) move-to-end/pop-oldest. No need for functools.lru_cache (wrong abstraction — we need per-instance, not per-function caching) or external deps like cachetools.
KTD-2: Hash collision handling
Decision: Use full SHA-256 hex digest (64 chars) instead of truncated 16-char prefix. On _store_ccr, if hash already exists and content differs, log a warning and skip caching (return None).
Rationale: Full SHA-256 makes collisions astronomically improbable (~2^-256). The collision check is a safety net for the extremely unlikely case. Truncating to 64 bits (16 hex chars) was the root cause — birthday paradox gives ~50% collision at ~2^32 entries.
KTD-3: OTel span lifecycle pattern
Decision: Replace __enter__/__exit__ manual calls with with start_span(...) as span: context manager. Guard with if _OTEL_AVAILABLE to avoid no-op span overhead.
Rationale: Context manager guarantees __exit__ on exception. The current pattern leaks on any exception between __enter__ and __exit__.
KTD-4: StdioTransport receive_response await behavior
Decision: When _notifications queue is empty, await the queue with the transport's configured timeout (same pattern as SSETransport). Raise TransportError only on timeout or disconnect.
Rationale: Consistency with SSETransport.receive_response(), which awaits _response_queue.get() with timeout. The current behavior of raising immediately breaks polling consumers that expect to wait.
Implementation Units
U1. CCR Cache: LRU + TTL + Collision Detection
Goal: Fix unbounded growth and hash collision in HeadroomCompressor._ccr_cache.
Requirements: R1, R2
Dependencies: None
Files:
src/agentkit/core/headroom_compressor.py— modifytests/unit/test_headroom_compressor.py— modify
Approach:
- Replace
_ccr_cache: dict[str, str]with_ccr_cache: OrderedDict[str, tuple[str, float]]storing(content, insert_time) - Add
_max_entriesconfig (default 1000); on insert, if at capacity, pop oldest item - On
_store_ccr, use full SHA-256 hex digest; if hash exists and content differs, log warning and returnNone - On
retrieve, check TTL before returning; evict expired entries - Add
_evict_expired()helper called on each store/retrieve
Execution note: TDD — write failing tests for each behavior first.
Test scenarios:
- Happy path: Store and retrieve content by full hash
- LRU eviction: Store
max_entries + 1items; verify oldest evicted - TTL expiry: Store with
ccr_ttl=1, wait >1s, retrieve returns not-found - Collision detection: Manually inject a hash with different content;
_store_ccrreturnsNoneand logs warning - No collision on same content: Store identical content twice; second store returns same hash (idempotent)
- Evict expired on access: Store with short TTL, wait, then store another item; expired entry cleaned during eviction sweep
- Default max_entries: Verify default is 1000
- Custom max_entries: Verify custom config respected
Verification: All new tests pass; existing CCR tests still pass with updated hash length.
U2. OTel Span Lifecycle Fix
Goal: Ensure OTel span is always properly closed, even on exceptions.
Requirements: R3
Dependencies: None
Files:
src/agentkit/core/react.py— modifytests/unit/test_react_compression.py— modify
Approach:
- Replace
_span_cm = start_span(...); _span_cm.__enter__(); ...; _span_cm.__exit__(...)withwith start_span(...) as _span:wrapped around the entire_execute_loopbody - Move
_exec_startand span attribute setting inside thewithblock - Guard with
if _OTEL_AVAILABLEto skip span creation when OTel is not installed - Ensure
agent_duration_histogramrecording happens inside thewithblock
Execution note: TDD — write a failing test that verifies span cleanup on exception first.
Test scenarios:
- Happy path: Span attributes set and span closed on successful execution
- Exception path: LLM gateway raises exception; span is still properly closed (attributes set,
__exit__called) - Cancellation path:
TaskCancelledErrorraised; span closed with outcome="cancelled" - No OTel available: When
_OTEL_AVAILABLE=False, execution proceeds without span overhead - Span attribute values: Verify
agent.total_steps,agent.total_tokens,agent.outcome,agent.duration_msare set correctly
Verification: All new tests pass; existing ReAct tests still pass.
U3. StdioTransport receive_response Await Fix
Goal: Make StdioTransport.receive_response() await empty notification queue, consistent with SSETransport.
Requirements: R4
Dependencies: None
Files:
src/agentkit/mcp/transport.py— modifytests/unit/test_mcp_transport.py— modify
Approach:
- Replace
if not self._notifications.empty(): return self._notifications.get_nowait()+raise TransportError(...)withawait asyncio.wait_for(self._notifications.get(), timeout=self._timeout) - Catch
asyncio.TimeoutErrorand raiseTransportError("Timeout waiting for notification")(matching SSETransport pattern) - Keep the
is_connectedguard at the top
Execution note: TDD — write failing test for await behavior first.
Test scenarios:
- Happy path: Notification available immediately; returned without waiting
- Await path: Queue empty;
receive_response()awaits until notification arrives - Timeout path: Queue empty; timeout expires; raises
TransportErrorwith "Timeout" message - Not connected: Raises
TransportErrorwith "not connected" message - Consistency with SSE: Same await+timeout pattern as
SSETransport.receive_response()
Verification: All new tests pass; existing transport tests still pass.
U4. MCP init.py Import Fix
Goal: Add missing MCPServer and MCPClient imports to mcp/__init__.py.
Requirements: R5
Dependencies: None
Files:
src/agentkit/mcp/__init__.py— modify
Approach:
- Add
from agentkit.mcp.server import MCPServerandfrom agentkit.mcp.client import MCPClientimports - Verify
__all__already lists both names (it does)
Test scenarios:
- Import test:
from agentkit.mcp import MCPServer, MCPClientsucceeds - All exports test: All names in
__all__are importable
Verification: python -c "from agentkit.mcp import MCPServer, MCPClient" succeeds.
Scope Boundaries
In Scope
- 4 P0 fixes + 1 import fix as described above
- Test coverage for all fixes
Deferred to Follow-Up Work
- P1: Redis degradation recovery in
pipeline_state.py - P1: Sync
urllib.request→ async inbaidu_search.pyandschema_tools.py - P1: Type annotation mismatch (
ContextCompressor→CompressionStrategy) inreact.py - P1: Config hot-reload race condition in
app.py - P2:
_request_idnon-atomic increment in transport classes - P3:
_should_compresshardcoded 8000 token threshold
Risks & Mitigations
| Risk | Mitigation |
|---|---|
| Full SHA-256 hash increases CCR marker length in compressed output | Acceptable: 64 chars vs 16 chars is negligible in tool output context |
OrderedDict LRU is not thread-safe |
HeadroomCompressor is used within async single-threaded context; no concurrent access |
with start_span() changes span scoping in _execute_loop |
Span now covers the entire loop body including error paths — strictly better |