--- title: "fix: AgentKit P0 Code Review Fixes" status: completed created: 2026-06-07 plan_type: fix execution_posture: TDD --- ## Summary Fix 4 P0 issues and 1 import defect identified in the Phase 6+7 code review, unblocking merge to main. All units follow TDD: write failing tests first, then implement the fix. ## Problem Frame Code review of the `feat/agentkit-phase7-headroom` branch revealed 4 P0 defects that must be fixed before merge: 1. **CCR cache unbounded growth** — `_ccr_cache: dict[str, str]` grows without limit; `ccr_ttl` config is declared but never enforced 2. **CCR hash collision** — `sha256(...).hexdigest()[:16]` truncates to 64 bits; collisions silently overwrite cached originals 3. **OTel span leak** — `_span_cm.__enter__()` without `try/finally`; exception between enter and exit leaks the span 4. **StdioTransport notification queue** — `receive_response()` raises `TransportError` when queue is empty, inconsistent with `SSETransport` which awaits Plus 1 import defect: `mcp/__init__.py` lists `MCPServer` and `MCPClient` in `__all__` but never imports them. ## Requirements - R1: CCR cache must enforce capacity limit and TTL eviction - R2: CCR hash must detect collisions and reject silent overwrites - R3: OTel span lifecycle must use `try/finally` to guarantee cleanup - R4: `StdioTransport.receive_response()` must await empty queue (consistent with SSETransport) - R5: `mcp/__init__.py` must import and export `MCPServer` and `MCPClient` ## Key Technical Decisions ### KTD-1: CCR cache eviction strategy **Decision:** Use `collections.OrderedDict` as an LRU with a configurable `max_entries` (default 1000). On insert, move to end (most-recent). When capacity exceeded, evict oldest (least-recent). TTL enforced by storing `(content, timestamp)` tuples and evicting expired entries on access. **Rationale:** `OrderedDict` is stdlib, zero-dependency, and provides O(1) move-to-end/pop-oldest. No need for `functools.lru_cache` (wrong abstraction — we need per-instance, not per-function caching) or external deps like `cachetools`. ### KTD-2: Hash collision handling **Decision:** Use full SHA-256 hex digest (64 chars) instead of truncated 16-char prefix. On `_store_ccr`, if hash already exists and content differs, log a warning and skip caching (return `None`). **Rationale:** Full SHA-256 makes collisions astronomically improbable (~2^-256). The collision check is a safety net for the extremely unlikely case. Truncating to 64 bits (16 hex chars) was the root cause — birthday paradox gives ~50% collision at ~2^32 entries. ### KTD-3: OTel span lifecycle pattern **Decision:** Replace `__enter__`/`__exit__` manual calls with `with start_span(...) as span:` context manager. Guard with `if _OTEL_AVAILABLE` to avoid no-op span overhead. **Rationale:** Context manager guarantees `__exit__` on exception. The current pattern leaks on any exception between `__enter__` and `__exit__`. ### KTD-4: StdioTransport receive_response await behavior **Decision:** When `_notifications` queue is empty, `await` the queue with the transport's configured timeout (same pattern as `SSETransport`). Raise `TransportError` only on timeout or disconnect. **Rationale:** Consistency with `SSETransport.receive_response()`, which awaits `_response_queue.get()` with timeout. The current behavior of raising immediately breaks polling consumers that expect to wait. --- ## Implementation Units ### U1. CCR Cache: LRU + TTL + Collision Detection **Goal:** Fix unbounded growth and hash collision in `HeadroomCompressor._ccr_cache`. **Requirements:** R1, R2 **Dependencies:** None **Files:** - `src/agentkit/core/headroom_compressor.py` — modify - `tests/unit/test_headroom_compressor.py` — modify **Approach:** 1. Replace `_ccr_cache: dict[str, str]` with `_ccr_cache: OrderedDict[str, tuple[str, float]]` storing `(content, insert_time)` 2. Add `_max_entries` config (default 1000); on insert, if at capacity, pop oldest item 3. On `_store_ccr`, use full SHA-256 hex digest; if hash exists and content differs, log warning and return `None` 4. On `retrieve`, check TTL before returning; evict expired entries 5. Add `_evict_expired()` helper called on each store/retrieve **Execution note:** TDD — write failing tests for each behavior first. **Test scenarios:** - **Happy path:** Store and retrieve content by full hash - **LRU eviction:** Store `max_entries + 1` items; verify oldest evicted - **TTL expiry:** Store with `ccr_ttl=1`, wait >1s, retrieve returns not-found - **Collision detection:** Manually inject a hash with different content; `_store_ccr` returns `None` and logs warning - **No collision on same content:** Store identical content twice; second store returns same hash (idempotent) - **Evict expired on access:** Store with short TTL, wait, then store another item; expired entry cleaned during eviction sweep - **Default max_entries:** Verify default is 1000 - **Custom max_entries:** Verify custom config respected **Verification:** All new tests pass; existing CCR tests still pass with updated hash length. --- ### U2. OTel Span Lifecycle Fix **Goal:** Ensure OTel span is always properly closed, even on exceptions. **Requirements:** R3 **Dependencies:** None **Files:** - `src/agentkit/core/react.py` — modify - `tests/unit/test_react_compression.py` — modify **Approach:** 1. Replace `_span_cm = start_span(...); _span_cm.__enter__(); ...; _span_cm.__exit__(...)` with `with start_span(...) as _span:` wrapped around the entire `_execute_loop` body 2. Move `_exec_start` and span attribute setting inside the `with` block 3. Guard with `if _OTEL_AVAILABLE` to skip span creation when OTel is not installed 4. Ensure `agent_duration_histogram` recording happens inside the `with` block **Execution note:** TDD — write a failing test that verifies span cleanup on exception first. **Test scenarios:** - **Happy path:** Span attributes set and span closed on successful execution - **Exception path:** LLM gateway raises exception; span is still properly closed (attributes set, `__exit__` called) - **Cancellation path:** `TaskCancelledError` raised; span closed with outcome="cancelled" - **No OTel available:** When `_OTEL_AVAILABLE=False`, execution proceeds without span overhead - **Span attribute values:** Verify `agent.total_steps`, `agent.total_tokens`, `agent.outcome`, `agent.duration_ms` are set correctly **Verification:** All new tests pass; existing ReAct tests still pass. --- ### U3. StdioTransport receive_response Await Fix **Goal:** Make `StdioTransport.receive_response()` await empty notification queue, consistent with `SSETransport`. **Requirements:** R4 **Dependencies:** None **Files:** - `src/agentkit/mcp/transport.py` — modify - `tests/unit/test_mcp_transport.py` — modify **Approach:** 1. Replace `if not self._notifications.empty(): return self._notifications.get_nowait()` + `raise TransportError(...)` with `await asyncio.wait_for(self._notifications.get(), timeout=self._timeout)` 2. Catch `asyncio.TimeoutError` and raise `TransportError("Timeout waiting for notification")` (matching SSETransport pattern) 3. Keep the `is_connected` guard at the top **Execution note:** TDD — write failing test for await behavior first. **Test scenarios:** - **Happy path:** Notification available immediately; returned without waiting - **Await path:** Queue empty; `receive_response()` awaits until notification arrives - **Timeout path:** Queue empty; timeout expires; raises `TransportError` with "Timeout" message - **Not connected:** Raises `TransportError` with "not connected" message - **Consistency with SSE:** Same await+timeout pattern as `SSETransport.receive_response()` **Verification:** All new tests pass; existing transport tests still pass. --- ### U4. MCP __init__.py Import Fix **Goal:** Add missing `MCPServer` and `MCPClient` imports to `mcp/__init__.py`. **Requirements:** R5 **Dependencies:** None **Files:** - `src/agentkit/mcp/__init__.py` — modify **Approach:** 1. Add `from agentkit.mcp.server import MCPServer` and `from agentkit.mcp.client import MCPClient` imports 2. Verify `__all__` already lists both names (it does) **Test scenarios:** - **Import test:** `from agentkit.mcp import MCPServer, MCPClient` succeeds - **All exports test:** All names in `__all__` are importable **Verification:** `python -c "from agentkit.mcp import MCPServer, MCPClient"` succeeds. --- ## Scope Boundaries ### In Scope - 4 P0 fixes + 1 import fix as described above - Test coverage for all fixes ### Deferred to Follow-Up Work - P1: Redis degradation recovery in `pipeline_state.py` - P1: Sync `urllib.request` → async in `baidu_search.py` and `schema_tools.py` - P1: Type annotation mismatch (`ContextCompressor` → `CompressionStrategy`) in `react.py` - P1: Config hot-reload race condition in `app.py` - P2: `_request_id` non-atomic increment in transport classes - P3: `_should_compress` hardcoded 8000 token threshold ## Risks & Mitigations | Risk | Mitigation | |------|-----------| | Full SHA-256 hash increases CCR marker length in compressed output | Acceptable: 64 chars vs 16 chars is negligible in tool output context | | `OrderedDict` LRU is not thread-safe | HeadroomCompressor is used within async single-threaded context; no concurrent access | | `with start_span()` changes span scoping in `_execute_loop` | Span now covers the entire loop body including error paths — strictly better |