202 lines
9.3 KiB
Markdown
202 lines
9.3 KiB
Markdown
---
|
|
title: "fix: AgentKit P0 Code Review Fixes"
|
|
status: completed
|
|
created: 2026-06-07
|
|
plan_type: fix
|
|
execution_posture: TDD
|
|
---
|
|
|
|
## Summary
|
|
|
|
Fix 4 P0 issues and 1 import defect identified in the Phase 6+7 code review, unblocking merge to main. All units follow TDD: write failing tests first, then implement the fix.
|
|
|
|
## Problem Frame
|
|
|
|
Code review of the `feat/agentkit-phase7-headroom` branch revealed 4 P0 defects that must be fixed before merge:
|
|
|
|
1. **CCR cache unbounded growth** — `_ccr_cache: dict[str, str]` grows without limit; `ccr_ttl` config is declared but never enforced
|
|
2. **CCR hash collision** — `sha256(...).hexdigest()[:16]` truncates to 64 bits; collisions silently overwrite cached originals
|
|
3. **OTel span leak** — `_span_cm.__enter__()` without `try/finally`; exception between enter and exit leaks the span
|
|
4. **StdioTransport notification queue** — `receive_response()` raises `TransportError` when queue is empty, inconsistent with `SSETransport` which awaits
|
|
|
|
Plus 1 import defect: `mcp/__init__.py` lists `MCPServer` and `MCPClient` in `__all__` but never imports them.
|
|
|
|
## Requirements
|
|
|
|
- R1: CCR cache must enforce capacity limit and TTL eviction
|
|
- R2: CCR hash must detect collisions and reject silent overwrites
|
|
- R3: OTel span lifecycle must use `try/finally` to guarantee cleanup
|
|
- R4: `StdioTransport.receive_response()` must await empty queue (consistent with SSETransport)
|
|
- R5: `mcp/__init__.py` must import and export `MCPServer` and `MCPClient`
|
|
|
|
## Key Technical Decisions
|
|
|
|
### KTD-1: CCR cache eviction strategy
|
|
|
|
**Decision:** Use `collections.OrderedDict` as an LRU with a configurable `max_entries` (default 1000). On insert, move to end (most-recent). When capacity exceeded, evict oldest (least-recent). TTL enforced by storing `(content, timestamp)` tuples and evicting expired entries on access.
|
|
|
|
**Rationale:** `OrderedDict` is stdlib, zero-dependency, and provides O(1) move-to-end/pop-oldest. No need for `functools.lru_cache` (wrong abstraction — we need per-instance, not per-function caching) or external deps like `cachetools`.
|
|
|
|
### KTD-2: Hash collision handling
|
|
|
|
**Decision:** Use full SHA-256 hex digest (64 chars) instead of truncated 16-char prefix. On `_store_ccr`, if hash already exists and content differs, log a warning and skip caching (return `None`).
|
|
|
|
**Rationale:** Full SHA-256 makes collisions astronomically improbable (~2^-256). The collision check is a safety net for the extremely unlikely case. Truncating to 64 bits (16 hex chars) was the root cause — birthday paradox gives ~50% collision at ~2^32 entries.
|
|
|
|
### KTD-3: OTel span lifecycle pattern
|
|
|
|
**Decision:** Replace `__enter__`/`__exit__` manual calls with `with start_span(...) as span:` context manager. Guard with `if _OTEL_AVAILABLE` to avoid no-op span overhead.
|
|
|
|
**Rationale:** Context manager guarantees `__exit__` on exception. The current pattern leaks on any exception between `__enter__` and `__exit__`.
|
|
|
|
### KTD-4: StdioTransport receive_response await behavior
|
|
|
|
**Decision:** When `_notifications` queue is empty, `await` the queue with the transport's configured timeout (same pattern as `SSETransport`). Raise `TransportError` only on timeout or disconnect.
|
|
|
|
**Rationale:** Consistency with `SSETransport.receive_response()`, which awaits `_response_queue.get()` with timeout. The current behavior of raising immediately breaks polling consumers that expect to wait.
|
|
|
|
---
|
|
|
|
## Implementation Units
|
|
|
|
### U1. CCR Cache: LRU + TTL + Collision Detection
|
|
|
|
**Goal:** Fix unbounded growth and hash collision in `HeadroomCompressor._ccr_cache`.
|
|
|
|
**Requirements:** R1, R2
|
|
|
|
**Dependencies:** None
|
|
|
|
**Files:**
|
|
- `src/agentkit/core/headroom_compressor.py` — modify
|
|
- `tests/unit/test_headroom_compressor.py` — modify
|
|
|
|
**Approach:**
|
|
1. Replace `_ccr_cache: dict[str, str]` with `_ccr_cache: OrderedDict[str, tuple[str, float]]` storing `(content, insert_time)`
|
|
2. Add `_max_entries` config (default 1000); on insert, if at capacity, pop oldest item
|
|
3. On `_store_ccr`, use full SHA-256 hex digest; if hash exists and content differs, log warning and return `None`
|
|
4. On `retrieve`, check TTL before returning; evict expired entries
|
|
5. Add `_evict_expired()` helper called on each store/retrieve
|
|
|
|
**Execution note:** TDD — write failing tests for each behavior first.
|
|
|
|
**Test scenarios:**
|
|
- **Happy path:** Store and retrieve content by full hash
|
|
- **LRU eviction:** Store `max_entries + 1` items; verify oldest evicted
|
|
- **TTL expiry:** Store with `ccr_ttl=1`, wait >1s, retrieve returns not-found
|
|
- **Collision detection:** Manually inject a hash with different content; `_store_ccr` returns `None` and logs warning
|
|
- **No collision on same content:** Store identical content twice; second store returns same hash (idempotent)
|
|
- **Evict expired on access:** Store with short TTL, wait, then store another item; expired entry cleaned during eviction sweep
|
|
- **Default max_entries:** Verify default is 1000
|
|
- **Custom max_entries:** Verify custom config respected
|
|
|
|
**Verification:** All new tests pass; existing CCR tests still pass with updated hash length.
|
|
|
|
---
|
|
|
|
### U2. OTel Span Lifecycle Fix
|
|
|
|
**Goal:** Ensure OTel span is always properly closed, even on exceptions.
|
|
|
|
**Requirements:** R3
|
|
|
|
**Dependencies:** None
|
|
|
|
**Files:**
|
|
- `src/agentkit/core/react.py` — modify
|
|
- `tests/unit/test_react_compression.py` — modify
|
|
|
|
**Approach:**
|
|
1. Replace `_span_cm = start_span(...); _span_cm.__enter__(); ...; _span_cm.__exit__(...)` with `with start_span(...) as _span:` wrapped around the entire `_execute_loop` body
|
|
2. Move `_exec_start` and span attribute setting inside the `with` block
|
|
3. Guard with `if _OTEL_AVAILABLE` to skip span creation when OTel is not installed
|
|
4. Ensure `agent_duration_histogram` recording happens inside the `with` block
|
|
|
|
**Execution note:** TDD — write a failing test that verifies span cleanup on exception first.
|
|
|
|
**Test scenarios:**
|
|
- **Happy path:** Span attributes set and span closed on successful execution
|
|
- **Exception path:** LLM gateway raises exception; span is still properly closed (attributes set, `__exit__` called)
|
|
- **Cancellation path:** `TaskCancelledError` raised; span closed with outcome="cancelled"
|
|
- **No OTel available:** When `_OTEL_AVAILABLE=False`, execution proceeds without span overhead
|
|
- **Span attribute values:** Verify `agent.total_steps`, `agent.total_tokens`, `agent.outcome`, `agent.duration_ms` are set correctly
|
|
|
|
**Verification:** All new tests pass; existing ReAct tests still pass.
|
|
|
|
---
|
|
|
|
### U3. StdioTransport receive_response Await Fix
|
|
|
|
**Goal:** Make `StdioTransport.receive_response()` await empty notification queue, consistent with `SSETransport`.
|
|
|
|
**Requirements:** R4
|
|
|
|
**Dependencies:** None
|
|
|
|
**Files:**
|
|
- `src/agentkit/mcp/transport.py` — modify
|
|
- `tests/unit/test_mcp_transport.py` — modify
|
|
|
|
**Approach:**
|
|
1. Replace `if not self._notifications.empty(): return self._notifications.get_nowait()` + `raise TransportError(...)` with `await asyncio.wait_for(self._notifications.get(), timeout=self._timeout)`
|
|
2. Catch `asyncio.TimeoutError` and raise `TransportError("Timeout waiting for notification")` (matching SSETransport pattern)
|
|
3. Keep the `is_connected` guard at the top
|
|
|
|
**Execution note:** TDD — write failing test for await behavior first.
|
|
|
|
**Test scenarios:**
|
|
- **Happy path:** Notification available immediately; returned without waiting
|
|
- **Await path:** Queue empty; `receive_response()` awaits until notification arrives
|
|
- **Timeout path:** Queue empty; timeout expires; raises `TransportError` with "Timeout" message
|
|
- **Not connected:** Raises `TransportError` with "not connected" message
|
|
- **Consistency with SSE:** Same await+timeout pattern as `SSETransport.receive_response()`
|
|
|
|
**Verification:** All new tests pass; existing transport tests still pass.
|
|
|
|
---
|
|
|
|
### U4. MCP __init__.py Import Fix
|
|
|
|
**Goal:** Add missing `MCPServer` and `MCPClient` imports to `mcp/__init__.py`.
|
|
|
|
**Requirements:** R5
|
|
|
|
**Dependencies:** None
|
|
|
|
**Files:**
|
|
- `src/agentkit/mcp/__init__.py` — modify
|
|
|
|
**Approach:**
|
|
1. Add `from agentkit.mcp.server import MCPServer` and `from agentkit.mcp.client import MCPClient` imports
|
|
2. Verify `__all__` already lists both names (it does)
|
|
|
|
**Test scenarios:**
|
|
- **Import test:** `from agentkit.mcp import MCPServer, MCPClient` succeeds
|
|
- **All exports test:** All names in `__all__` are importable
|
|
|
|
**Verification:** `python -c "from agentkit.mcp import MCPServer, MCPClient"` succeeds.
|
|
|
|
---
|
|
|
|
## Scope Boundaries
|
|
|
|
### In Scope
|
|
- 4 P0 fixes + 1 import fix as described above
|
|
- Test coverage for all fixes
|
|
|
|
### Deferred to Follow-Up Work
|
|
- P1: Redis degradation recovery in `pipeline_state.py`
|
|
- P1: Sync `urllib.request` → async in `baidu_search.py` and `schema_tools.py`
|
|
- P1: Type annotation mismatch (`ContextCompressor` → `CompressionStrategy`) in `react.py`
|
|
- P1: Config hot-reload race condition in `app.py`
|
|
- P2: `_request_id` non-atomic increment in transport classes
|
|
- P3: `_should_compress` hardcoded 8000 token threshold
|
|
|
|
## Risks & Mitigations
|
|
|
|
| Risk | Mitigation |
|
|
|------|-----------|
|
|
| Full SHA-256 hash increases CCR marker length in compressed output | Acceptable: 64 chars vs 16 chars is negligible in tool output context |
|
|
| `OrderedDict` LRU is not thread-safe | HeadroomCompressor is used within async single-threaded context; no concurrent access |
|
|
| `with start_span()` changes span scoping in `_execute_loop` | Span now covers the entire loop body including error paths — strictly better |
|