fischer-agentkit/docs/plans/2026-06-07-014-fix-agentkit...

9.3 KiB

title status created plan_type execution_posture
fix: AgentKit P0 Code Review Fixes completed 2026-06-07 fix TDD

Summary

Fix 4 P0 issues and 1 import defect identified in the Phase 6+7 code review, unblocking merge to main. All units follow TDD: write failing tests first, then implement the fix.

Problem Frame

Code review of the feat/agentkit-phase7-headroom branch revealed 4 P0 defects that must be fixed before merge:

  1. CCR cache unbounded growth_ccr_cache: dict[str, str] grows without limit; ccr_ttl config is declared but never enforced
  2. CCR hash collisionsha256(...).hexdigest()[:16] truncates to 64 bits; collisions silently overwrite cached originals
  3. OTel span leak_span_cm.__enter__() without try/finally; exception between enter and exit leaks the span
  4. StdioTransport notification queuereceive_response() raises TransportError when queue is empty, inconsistent with SSETransport which awaits

Plus 1 import defect: mcp/__init__.py lists MCPServer and MCPClient in __all__ but never imports them.

Requirements

  • R1: CCR cache must enforce capacity limit and TTL eviction
  • R2: CCR hash must detect collisions and reject silent overwrites
  • R3: OTel span lifecycle must use try/finally to guarantee cleanup
  • R4: StdioTransport.receive_response() must await empty queue (consistent with SSETransport)
  • R5: mcp/__init__.py must import and export MCPServer and MCPClient

Key Technical Decisions

KTD-1: CCR cache eviction strategy

Decision: Use collections.OrderedDict as an LRU with a configurable max_entries (default 1000). On insert, move to end (most-recent). When capacity exceeded, evict oldest (least-recent). TTL enforced by storing (content, timestamp) tuples and evicting expired entries on access.

Rationale: OrderedDict is stdlib, zero-dependency, and provides O(1) move-to-end/pop-oldest. No need for functools.lru_cache (wrong abstraction — we need per-instance, not per-function caching) or external deps like cachetools.

KTD-2: Hash collision handling

Decision: Use full SHA-256 hex digest (64 chars) instead of truncated 16-char prefix. On _store_ccr, if hash already exists and content differs, log a warning and skip caching (return None).

Rationale: Full SHA-256 makes collisions astronomically improbable (~2^-256). The collision check is a safety net for the extremely unlikely case. Truncating to 64 bits (16 hex chars) was the root cause — birthday paradox gives ~50% collision at ~2^32 entries.

KTD-3: OTel span lifecycle pattern

Decision: Replace __enter__/__exit__ manual calls with with start_span(...) as span: context manager. Guard with if _OTEL_AVAILABLE to avoid no-op span overhead.

Rationale: Context manager guarantees __exit__ on exception. The current pattern leaks on any exception between __enter__ and __exit__.

KTD-4: StdioTransport receive_response await behavior

Decision: When _notifications queue is empty, await the queue with the transport's configured timeout (same pattern as SSETransport). Raise TransportError only on timeout or disconnect.

Rationale: Consistency with SSETransport.receive_response(), which awaits _response_queue.get() with timeout. The current behavior of raising immediately breaks polling consumers that expect to wait.


Implementation Units

U1. CCR Cache: LRU + TTL + Collision Detection

Goal: Fix unbounded growth and hash collision in HeadroomCompressor._ccr_cache.

Requirements: R1, R2

Dependencies: None

Files:

  • src/agentkit/core/headroom_compressor.py — modify
  • tests/unit/test_headroom_compressor.py — modify

Approach:

  1. Replace _ccr_cache: dict[str, str] with _ccr_cache: OrderedDict[str, tuple[str, float]] storing (content, insert_time)
  2. Add _max_entries config (default 1000); on insert, if at capacity, pop oldest item
  3. On _store_ccr, use full SHA-256 hex digest; if hash exists and content differs, log warning and return None
  4. On retrieve, check TTL before returning; evict expired entries
  5. Add _evict_expired() helper called on each store/retrieve

Execution note: TDD — write failing tests for each behavior first.

Test scenarios:

  • Happy path: Store and retrieve content by full hash
  • LRU eviction: Store max_entries + 1 items; verify oldest evicted
  • TTL expiry: Store with ccr_ttl=1, wait >1s, retrieve returns not-found
  • Collision detection: Manually inject a hash with different content; _store_ccr returns None and logs warning
  • No collision on same content: Store identical content twice; second store returns same hash (idempotent)
  • Evict expired on access: Store with short TTL, wait, then store another item; expired entry cleaned during eviction sweep
  • Default max_entries: Verify default is 1000
  • Custom max_entries: Verify custom config respected

Verification: All new tests pass; existing CCR tests still pass with updated hash length.


U2. OTel Span Lifecycle Fix

Goal: Ensure OTel span is always properly closed, even on exceptions.

Requirements: R3

Dependencies: None

Files:

  • src/agentkit/core/react.py — modify
  • tests/unit/test_react_compression.py — modify

Approach:

  1. Replace _span_cm = start_span(...); _span_cm.__enter__(); ...; _span_cm.__exit__(...) with with start_span(...) as _span: wrapped around the entire _execute_loop body
  2. Move _exec_start and span attribute setting inside the with block
  3. Guard with if _OTEL_AVAILABLE to skip span creation when OTel is not installed
  4. Ensure agent_duration_histogram recording happens inside the with block

Execution note: TDD — write a failing test that verifies span cleanup on exception first.

Test scenarios:

  • Happy path: Span attributes set and span closed on successful execution
  • Exception path: LLM gateway raises exception; span is still properly closed (attributes set, __exit__ called)
  • Cancellation path: TaskCancelledError raised; span closed with outcome="cancelled"
  • No OTel available: When _OTEL_AVAILABLE=False, execution proceeds without span overhead
  • Span attribute values: Verify agent.total_steps, agent.total_tokens, agent.outcome, agent.duration_ms are set correctly

Verification: All new tests pass; existing ReAct tests still pass.


U3. StdioTransport receive_response Await Fix

Goal: Make StdioTransport.receive_response() await empty notification queue, consistent with SSETransport.

Requirements: R4

Dependencies: None

Files:

  • src/agentkit/mcp/transport.py — modify
  • tests/unit/test_mcp_transport.py — modify

Approach:

  1. Replace if not self._notifications.empty(): return self._notifications.get_nowait() + raise TransportError(...) with await asyncio.wait_for(self._notifications.get(), timeout=self._timeout)
  2. Catch asyncio.TimeoutError and raise TransportError("Timeout waiting for notification") (matching SSETransport pattern)
  3. Keep the is_connected guard at the top

Execution note: TDD — write failing test for await behavior first.

Test scenarios:

  • Happy path: Notification available immediately; returned without waiting
  • Await path: Queue empty; receive_response() awaits until notification arrives
  • Timeout path: Queue empty; timeout expires; raises TransportError with "Timeout" message
  • Not connected: Raises TransportError with "not connected" message
  • Consistency with SSE: Same await+timeout pattern as SSETransport.receive_response()

Verification: All new tests pass; existing transport tests still pass.


U4. MCP init.py Import Fix

Goal: Add missing MCPServer and MCPClient imports to mcp/__init__.py.

Requirements: R5

Dependencies: None

Files:

  • src/agentkit/mcp/__init__.py — modify

Approach:

  1. Add from agentkit.mcp.server import MCPServer and from agentkit.mcp.client import MCPClient imports
  2. Verify __all__ already lists both names (it does)

Test scenarios:

  • Import test: from agentkit.mcp import MCPServer, MCPClient succeeds
  • All exports test: All names in __all__ are importable

Verification: python -c "from agentkit.mcp import MCPServer, MCPClient" succeeds.


Scope Boundaries

In Scope

  • 4 P0 fixes + 1 import fix as described above
  • Test coverage for all fixes

Deferred to Follow-Up Work

  • P1: Redis degradation recovery in pipeline_state.py
  • P1: Sync urllib.request → async in baidu_search.py and schema_tools.py
  • P1: Type annotation mismatch (ContextCompressorCompressionStrategy) in react.py
  • P1: Config hot-reload race condition in app.py
  • P2: _request_id non-atomic increment in transport classes
  • P3: _should_compress hardcoded 8000 token threshold

Risks & Mitigations

Risk Mitigation
Full SHA-256 hash increases CCR marker length in compressed output Acceptable: 64 chars vs 16 chars is negligible in tool output context
OrderedDict LRU is not thread-safe HeadroomCompressor is used within async single-threaded context; no concurrent access
with start_span() changes span scoping in _execute_loop Span now covers the entire loop body including error paths — strictly better