fischer-agentkit/docs/plans/2026-06-14-002-u1-llm-cache...

26 KiB

U1 Architecture Design: LLM Cache Core

Status: APPROVED — Design reviewed, embedding model set to bge-m3 for Chinese-first Date: 2026-06-14 Unit: U1 of P0 Production Hardening Plan


1. Design Goals

  1. Transparent caching: LLMGateway.chat() callers cannot distinguish cached vs. uncached responses
  2. Dual-match strategy: Exact-match (hash) for deterministic hits + Semantic-match (embedding) for paraphrased hits
  3. Backend pluggability: InMemoryLLMCache for dev, RedisLLMCache for production, via factory
  4. Chinese-first embedding: Default embedding model optimized for Chinese+English mixed text, with configurable fallback

2. Component Architecture

┌──────────────────────────────────────────────────────┐
│                    LLMGateway.chat()                  │
│                                                      │
│  1. Build LLMRequest                                │
│  2. ┌─ Cache Check ─────────────────────────────┐   │
│     │  generate_cache_key(req)                    │   │
│     │  cache.get(key) ──→ CacheResult             │   │
│     │    ├─ HIT (exact)  → return cached response │   │
│     │    └─ MISS → semantic_search(query_emb)     │   │
│     │         ├─ HIT (semantic) → return response │   │
│     │         └─ MISS → call provider              │   │
│     └─────────────────────────────────────────────┘   │
│  3. Call provider → LLMResponse                      │
│  4. cache.put(key, response, query_embedding)         │
│  5. Record usage                                     │
└──────────────────────────────────────────────────────┘

File Structure

src/agentkit/llm/
├── cache.py          # NEW: LLMCache Protocol + InMemoryLLMCache + RedisLLMCache + CacheResult
├── cache_key.py      # NEW: generate_cache_key(), hash helpers
├── gateway.py        # MODIFIED in U2: inject cache check
├── config.py         # MODIFIED in U2: add CacheConfig
└── ...

3. Data Model Design

3.1 CacheKey

Reasoning: The cache key must capture ALL inputs that deterministically affect LLM output. Missing any component leads to false cache hits (wrong response returned).

Component Why Included Hash Method
model Different models produce different outputs UTF-8 encode → SHA-256
system_prompt Changes behavior fundamentally SHA-256 of full text
messages Core conversation context SHA-256 of JSON-serialized messages
temperature Affects randomness; only 0.0 is deterministic Float string representation
tools Available tools affect tool_call generation SHA-256 of JSON-serialized tools list
tool_choice "auto" vs "none" changes behavior UTF-8 encode → SHA-256

Key formula:

key = SHA256(
    SHA256(model) +
    SHA256(system_prompt) +
    SHA256(json(messages, sort_keys=True)) +
    SHA256(str(temperature)) +
    SHA256(json(tools, sort_keys=True)) +
    SHA256(tool_choice)
)

Design Decision — Why not include max_tokens? max_tokens is a truncation limit, not a semantic input. A response cached with max_tokens=2000 is still valid when requested with max_tokens=4000 (the response was simply shorter). However, the reverse is unsafe — a response generated with max_tokens=4000 might be longer than a max_tokens=2000 request expects. Decision: Include max_tokens in the key to be safe. The cost of a few extra cache misses is negligible compared to returning a response that violates the caller's token limit.

Revised key formula:

key = SHA256(
    SHA256(model) +
    SHA256(system_prompt) +
    SHA256(json(messages, sort_keys=True)) +
    SHA256(f"{temperature:.2f}") +
    SHA256(json(tools, sort_keys=True)) +
    SHA256(tool_choice) +
    SHA256(str(max_tokens))
)

3.2 CacheEntry

@dataclass
class CacheEntry:
    """A cached LLM response with metadata."""
    response: LLMResponse          # The cached response
    query_embedding: list[float]   # Embedding of last user message (for semantic match)
    created_at: float              # time.monotonic() when cached
    hit_count: int                 # Number of cache hits

3.3 CacheResult

@dataclass
class CacheResult:
    """Result of a cache lookup."""
    hit: bool                      # Whether a cache hit occurred
    response: LLMResponse | None   # The cached response (None on miss)
    match_type: str                # "exact" | "semantic" | "" (miss)

4. Protocol Design

4.1 LLMCache Protocol

class LLMCache(Protocol):
    """LLM response cache interface."""

    async def get(self, key: str) -> CacheResult:
        """Look up a cached response by exact key, then semantic search."""
        ...

    async def put(self, key: str, response: LLMResponse, query_embedding: list[float] | None = None) -> None:
        """Store a response in the cache."""
        ...

    async def invalidate(self, pattern: str | None = None) -> int:
        """Invalidate cache entries. If pattern is None, invalidate all. Returns count of invalidated entries."""
        ...

    async def stats(self) -> dict[str, int]:
        """Return cache statistics: {total_entries, total_hits, total_misses}."""
        ...

Reasoning for async Protocol: All methods are async because RedisLLMCache uses redis.asyncio. Making the Protocol async ensures both backends share the same interface without sync/async bridging.

Why get() does both exact + semantic? The caller (LLMGateway) shouldn't need to know about the two-tier lookup. It calls cache.get(key) and gets a CacheResult with match_type indicating how the hit occurred. This encapsulation keeps the integration point simple.

4.2 Semantic Search Design

Critical Question: Should semantic search be inside get() or a separate method?

Analysis:

  • Option A: get(key) does exact match first, then semantic search on miss. Single call, simple integration.
  • Option B: Separate semantic_search(embedding) method. More flexible, but requires caller to manage two calls.

Decision: Option A. The semantic search needs the query_embedding, which must be computed before calling get(). But embedding computation is expensive (~100ms). We don't want to compute embeddings on every cache miss — only when semantic caching is enabled and temperature == 0.

Revised design:

class LLMCache(Protocol):
    async def get(self, key: str) -> CacheResult:
        """Exact-match lookup only."""
        ...

    async def semantic_search(self, query_embedding: list[float], threshold: float = 0.92) -> CacheResult:
        """Semantic similarity search across all cached entries."""
        ...

    async def put(self, key: str, response: LLMResponse, query_embedding: list[float] | None = None) -> None:
        """Store response with optional embedding for semantic matching."""
        ...

Integration flow in LLMGateway (U2):

# 1. Exact match
result = await cache.get(key)
if result.hit:
    return result.response

# 2. Semantic match (only for temperature == 0)
if request.temperature == 0 and query_embedding is not None:
    result = await cache.semantic_search(query_embedding)
    if result.hit:
        return result.response

# 3. Call provider
response = await provider.chat(request)
await cache.put(key, response, query_embedding)

This gives the gateway explicit control over when to attempt semantic search, avoiding unnecessary embedding computation.


5. InMemoryLLMCache Implementation Design

5.1 Data Structure

class InMemoryLLMCache:
    def __init__(self, max_entries: int = 10000, exact_ttl: int = 3600, semantic_ttl: int = 86400, similarity_threshold: float = 0.92):
        self._max_entries = max_entries
        self._exact_ttl = exact_ttl
        self._semantic_ttl = semantic_ttl
        self._similarity_threshold = similarity_threshold

        # Exact cache: key → CacheEntry
        self._cache: OrderedDict[str, CacheEntry] = OrderedDict()

        # Semantic index: key → query_embedding (parallel to _cache)
        self._embeddings: dict[str, list[float]] = {}

        # Stats
        self._hits = 0
        self._misses = 0

5.2 Key Operations

get(key):

  1. Look up key in _cache
  2. If found and not expired (check created_at + exact_ttl > now): increment hit_count, move to end (LRU), return CacheResult(hit=True, match_type="exact")
  3. If expired: delete from _cache and _embeddings
  4. Return CacheResult(hit=False)

semantic_search(query_embedding, threshold):

  1. If _embeddings is empty: return miss
  2. For each (key, emb) in _embeddings: a. Check if entry is still valid (created_at + semantic_ttl > now) b. If expired: skip (lazy cleanup) c. Compute cosine_similarity(query_embedding, emb) d. Track best match
  3. If best similarity >= threshold: return CacheResult(hit=True, match_type="semantic")
  4. Return miss

Performance: O(n) scan over all embeddings. With <10000 entries and 1536-dim vectors, this takes <10ms using numpy. Acceptable for now. If scale becomes an issue, switch to FAISS or pgvector (deferred).

put(key, response, query_embedding):

  1. Create CacheEntry(response, query_embedding or [], now, 0)
  2. If key exists: update, move to end
  3. If new and at capacity: evict LRU (popitem(last=False))
  4. Store embedding in _embeddings[key] if provided

invalidate(pattern):

  1. If pattern is None: clear all
  2. If pattern: iterate keys, match against pattern, delete matching entries

5.3 LRU Eviction Strategy

Follow EmbeddingCache pattern: OrderedDict with move_to_end() on access, popitem(last=False) on eviction. This is O(1) for both access and eviction.

Why not size-based eviction? LLM responses vary widely in size (100 bytes to 10KB). Entry-count-based eviction is simpler and more predictable. With max_entries=10000 and average response ~1KB, memory usage is ~10MB — acceptable.


6. RedisLLMCache Implementation Design

6.1 Key Schema

agentkit:llm_cache:{sha256_hex}          → JSON(CacheEntry) with TTL
agentkit:llm_cache_emb:{sha256_hex}      → JSON(list[float]) with TTL

Why two keys instead of one?

  • Semantic search needs to iterate all embeddings without downloading full response bodies
  • Embedding keys are small (~12KB for 1536-dim float list) vs. response keys (variable, potentially large with tool_calls)
  • Different TTLs: exact cache may have shorter TTL than semantic cache

Alternative considered: Single key with embedded embedding. Rejected because KEYS agentkit:llm_cache:* + GET for each key to extract embedding would download all response bodies for semantic search, which is wasteful.

6.2 Key Operations

get(key):

  1. GET agentkit:llm_cache:{key} → deserialize CacheEntry
  2. If found: INCR agentkit:llm_cache_hits:{key} (optional, for stats), return hit
  3. Return miss

semantic_search(query_embedding, threshold):

  1. KEYS agentkit:llm_cache_emb:* → get all embedding keys
  2. MGET all embedding keys → deserialize embeddings
  3. Compute cosine similarity for each
  4. If best >= threshold: GET agentkit:llm_cache:{best_key} → return hit
  5. Return miss

Performance concern: KEYS is O(N) and blocks Redis. For production with >1000 cached entries, this is unacceptable.

Mitigation: Use SCAN instead of KEYS for iteration. Store a Redis Set agentkit:llm_cache_index containing all active cache keys. On put(), SADD agentkit:llm_cache_index {key}. On invalidate(), SREM. For semantic search, SMEMBERS agentkit:llm_cache_indexMGET embeddings.

Revised key schema:

agentkit:llm_cache:{sha256_hex}          → JSON(CacheEntry) with TTL
agentkit:llm_cache_emb:{sha256_hex}      → JSON(list[float]) with TTL
agentkit:llm_cache_index                 → SET of active cache keys (no TTL, managed manually)

put(key, response, query_embedding):

  1. Pipeline: SET agentkit:llm_cache:{key} → JSON(CacheEntry) EX exact_ttl
  2. If embedding provided: SET agentkit:llm_cache_emb:{key} → JSON(embedding) EX semantic_ttl
  3. SADD agentkit:llm_cache_index {key}

invalidate(pattern):

  1. If pattern is None: SMEMBERS agentkit:llm_cache_index → pipeline DEL all keys → DEL index
  2. If pattern: SMEMBERS → filter by pattern → pipeline DEL matching keys → SREM from index

6.3 Lazy Redis Initialization

Follow RedisSessionStore._get_redis() pattern:

class RedisLLMCache:
    def __init__(self, redis_url: str = "redis://localhost:6379", ...):
        self._redis_url = redis_url
        self._redis: aioredis.Redis | None = None

    async def _get_redis(self) -> aioredis.Redis:
        if self._redis is None:
            import redis.asyncio as aioredis
            self._redis = aioredis.from_url(self._redis_url, decode_responses=True)
        return self._redis

6.4 Connection Error Handling

async def get(self, key: str) -> CacheResult:
    try:
        redis = await self._get_redis()
        data = await redis.get(f"agentkit:llm_cache:{key}")
        ...
    except (redis.ConnectionError, redis.TimeoutError) as e:
        logger.warning(f"Redis cache unavailable, returning miss: {e}")
        return CacheResult(hit=False)

Design Decision: On Redis failure, return cache miss (not error). The cache is a performance optimization, not a correctness requirement. Failing open is the correct behavior.


7. Factory Function

def create_llm_cache(
    backend: str = "auto",
    redis_url: str = "redis://localhost:6379",
    max_entries: int = 10000,
    exact_ttl: int = 3600,
    semantic_ttl: int = 86400,
    similarity_threshold: float = 0.92,
) -> LLMCache:
    """Create an LLM cache backend.

    Args:
        backend: "auto" (try Redis, fallback to memory), "redis", "memory"
        ...
    """
    if backend in ("auto", "redis"):
        try:
            import redis.asyncio as aioredis
            return RedisLLMCache(redis_url=redis_url, ...)
        except ImportError:
            logger.warning("redis package not available, falling back to in-memory cache")
            return InMemoryLLMCache(...)
    return InMemoryLLMCache(...)

Follows existing pattern: create_session_store(), create_evolution_store().


8. CacheKey Generation Design

8.1 Module: cache_key.py

import hashlib
import json

def generate_cache_key(
    model: str,
    messages: list[dict[str, str]],
    temperature: float,
    tools: list[dict] | None = None,
    tool_choice: str = "auto",
    max_tokens: int = 2000,
    system_prompt: str | None = None,
) -> str:
    """Generate a deterministic SHA-256 cache key from LLM request parameters."""
    components = [
        _hash_str(model),
        _hash_str(system_prompt or _extract_system_prompt(messages)),
        _hash_json(messages),
        _hash_str(f"{temperature:.2f}"),
        _hash_json(tools),
        _hash_str(tool_choice),
        _hash_str(str(max_tokens)),
    ]
    combined = "".join(components)
    return hashlib.sha256(combined.encode()).hexdigest()

def _extract_system_prompt(messages: list[dict]) -> str:
    """Extract system prompt from messages list."""
    for msg in messages:
        if msg.get("role") == "system":
            return msg.get("content", "")
    return ""

def _hash_str(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def _hash_json(obj) -> str:
    if obj is None:
        return hashlib.sha256(b"null").hexdigest()
    return hashlib.sha256(json.dumps(obj, sort_keys=True, ensure_ascii=False).encode()).hexdigest()

8.2 Why Separate system_prompt Parameter?

The messages list already contains the system prompt. But in AgentKit, the system prompt is injected separately from the user's messages (via MemoryStore.build_system_prompt()). The gateway receives messages that already include the system prompt. So system_prompt is extracted from messages[0] when role == "system".

No separate parameter needed_extract_system_prompt() handles extraction. This avoids requiring callers to pass system_prompt separately.


9. Semantic Match: Temperature Gate

Rule: Semantic matching is ONLY attempted when temperature == 0.0.

Reasoning:

  • At temperature > 0, LLM outputs are non-deterministic. Two semantically similar requests may produce different outputs.
  • Caching a temperature=0.7 response and returning it for a semantically similar query is misleading — the user expects randomness.
  • At temperature=0.0, outputs are deterministic (within provider guarantees), so semantic matching is safe.

Implementation: The gateway checks temperature before calling semantic_search(). The cache itself does not enforce this — it's a policy decision made by the caller.


10. Serialization Design

10.1 LLMResponse Serialization

LLMResponse contains content: str, model: str, usage: TokenUsage, tool_calls: list[ToolCall].

For InMemoryLLMCache: No serialization needed — store Python objects directly.

For RedisLLMCache: Serialize to JSON.

def _serialize_response(response: LLMResponse) -> dict:
    return {
        "content": response.content,
        "model": response.model,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
        },
        "tool_calls": [
            {"id": tc.id, "name": tc.name, "arguments": tc.arguments}
            for tc in response.tool_calls
        ],
        "latency_ms": response.latency_ms,
    }

def _deserialize_response(data: dict) -> LLMResponse:
    return LLMResponse(
        content=data["content"],
        model=data["model"],
        usage=TokenUsage(**data["usage"]),
        tool_calls=[ToolCall(**tc) for tc in data.get("tool_calls", [])],
        latency_ms=data.get("latency_ms", 0.0),
    )

10.2 Embedding Serialization

Embeddings are list[float] with 1536 dimensions. JSON serialization produces ~12KB per embedding.

Alternative: Binary serialization (struct.pack) would reduce to ~6KB but adds complexity. JSON is sufficient for now.


11. Edge Cases & Failure Modes

Edge Case Behavior Rationale
Response with tool_calls Cached normally Tool call responses are deterministic at temperature=0
Empty response (content="") Cached normally Empty responses are valid (e.g., tool-only responses)
Very large response (>100KB) Cached, but counted as single entry Size-based eviction deferred; entry-count is sufficient
Concurrent put() for same key Last write wins No data corruption risk; both writes are valid responses
Redis SET fails Log warning, cache miss on next read Fail open, never block LLM calls
Embedding API fails during put() Store response without embedding Exact-match still works; semantic match degraded
Embedding API fails during semantic_search() Return cache miss Don't block on embedding failures
invalidate() while get() in progress Possible stale read Acceptable for cache; eventual consistency

12. Test Strategy

12.1 Unit Tests (tests/unit/llm/test_cache.py)

Using pytest + pytest-asyncio:

  1. test_exact_match_hit: Same key → cache hit, match_type="exact"
  2. test_exact_match_miss: Different key → cache miss
  3. test_semantic_match_hit: Paraphrased query with similarity > 0.92 → hit, match_type="semantic"
  4. test_semantic_match_miss: Unrelated query with similarity < 0.6 → miss
  5. test_semantic_match_boundary: Similarity exactly at threshold → hit
  6. test_ttl_expiry_exact: Entry expires after exact_ttl → miss
  7. test_ttl_expiry_semantic: Entry expires after semantic_ttl → miss
  8. test_lru_eviction: Add max_entries + 1 → oldest evicted
  9. test_invalidate_all: invalidate() clears all entries
  10. test_invalidate_pattern: invalidate("prefix:*") clears matching entries
  11. test_cache_stats: stats() returns correct counts
  12. test_tool_calls_cached: Response with tool_calls cached and restored correctly
  13. test_concurrent_puts: Two concurrent puts for same key → no error
  14. test_redis_fallback: Redis import fails → InMemoryLLMCache returned
  15. test_cache_key_deterministic: Same inputs → same key
  16. test_cache_key_different_model: Different model → different key
  17. test_cache_key_different_temperature: Different temperature → different key

12.2 Mock Embedder for Testing

Use MockEmbedder from src/agentkit/memory/embedder.py. Since MockEmbedder generates deterministic embeddings based on text hash, semantically similar text will produce similar embeddings (same hash prefix → similar vector). This is sufficient for testing the similarity threshold logic.

Limitation: MockEmbedder doesn't produce truly semantically meaningful embeddings. For testing semantic matching behavior, we'll manually construct embeddings with known cosine similarities.

def _make_embedding(base: list[float], noise: float = 0.0) -> list[float]:
    """Create a unit vector with optional noise for similarity testing."""
    vec = [x + noise for x in base]
    magnitude = sum(x**2 for x in vec) ** 0.5
    return [x / magnitude for x in vec] if magnitude > 0 else vec

13. Dependency Analysis

13.1 Internal Dependencies

Dependency Usage Risk
agentkit.llm.protocol.LLMResponse Cache entry data type Stable, no change needed
agentkit.llm.protocol.TokenUsage Part of LLMResponse Stable
agentkit.llm.protocol.ToolCall Part of LLMResponse Stable
agentkit.memory.embedder.Embedder Embedding computation for semantic match Injected, not imported directly
agentkit.utils.vector_math.compute_cosine_similarity Similarity computation Stable utility

13.2 External Dependencies

Dependency Usage Required?
redis.asyncio RedisLLMCache backend Optional (only for "redis" backend)
numpy Fast cosine similarity Optional (pure-python fallback exists)

14. Implementation Sequence

Within U1, the implementation order is:

  1. cache_key.py — No dependencies, pure functions, easy to test
  2. cache.pyCacheResult, CacheEntry, LLMCache Protocol, InMemoryLLMCache
  3. cache.pyRedisLLMCache, create_llm_cache() factory
  4. test_cache.py — All unit tests

This order allows incremental testing: cache_key tests first, then InMemoryLLMCache tests, then RedisLLMCache tests.


15. Open Design Questions

  1. Should semantic_search() return the best match or all matches above threshold?

    • Current decision: Best match only. The gateway needs one response, not a ranked list. If we need ranked results later, we can add a search() method.
  2. Should the cache store the original messages alongside the response?

    • Current decision: No. The key already deterministically represents the messages. Storing them again wastes memory. If we need message-level debugging, we can add it later.
  3. Should RedisLLMCache use Redis Hash instead of individual keys?

    • Current decision: Individual keys with SET index. Hash would allow HGETALL for all entries, but makes TTL per-entry impossible (Redis Hash fields don't support individual TTLs). Individual keys with a SET index is the standard pattern.
  4. What embedding model to use for semantic cache?

    • Decision: Default to bge-m3 (BAAI/bge-m3 via Xinference or TEI endpoint) for Chinese+English mixed text. bge-m3 supports:
      • Multi-lingual (102 languages, strong Chinese)
      • Multi-granularity (dense + sparse + ColBERT)
      • Multi-function (retrieval + classification + similarity)
      • 1024-dim dense vectors (vs. 1536 for OpenAI)
    • Fallback to text-embedding-3-small when only OpenAI API is available.
    • The embedder is injected via constructor, so the model choice is a configuration concern, not a code concern.
    • Config example:
      llm:
        cache:
          embedding:
            provider: "xinference"    # "xinference" | "openai" | "local"
            model: "bge-m3"           # model name at provider
            base_url: "http://localhost:9997/v1"
      

16. Argumentation Summary

Design Choice Alternatives Considered Why This Choice
SHA-256 hash key UUID, MD5, composite string key SHA-256 is collision-resistant, deterministic, fixed-length; MD5 has known collisions; UUID is non-deterministic
OrderedDict LRU heapq, custom doubly-linked-list OrderedDict is Python-idiomatic, O(1) access+eviction, matches EmbeddingCache pattern
Separate get() + semantic_search() Combined get() with auto-semantic Explicit control avoids unnecessary embedding computation; caller decides when to attempt semantic match
Redis SET index for semantic search KEYS pattern scan, Redis Hash KEYS blocks Redis; Hash doesn't support per-field TTL; SET index is standard pattern
Fail-open on Redis error Raise exception, return None Cache is optimization, not correctness; failing open ensures LLM calls always work
Temperature gate for semantic match Always attempt semantic match temperature>0 outputs are non-deterministic; semantic match would return misleading cached responses
JSON serialization for Redis MessagePack, Pickle, Protobuf JSON is human-readable, debuggable, no extra dependencies; sufficient for <10KB entries
bge-m3 default embedding text-embedding-3-small, multilingual-e5 bge-m3 is SOTA for Chinese+English mixed text; 1024-dim saves 33% memory vs OpenAI 1536-dim; OpenAI-compatible API via Xinference/TEI