26 KiB

Raw Blame History

U1 Architecture Design: LLM Cache Core

Status: APPROVED — Design reviewed, embedding model set to bge-m3 for Chinese-first Date: 2026-06-14 Unit: U1 of P0 Production Hardening Plan

1. Design Goals

Transparent caching: LLMGateway.chat() callers cannot distinguish cached vs. uncached responses
Dual-match strategy: Exact-match (hash) for deterministic hits + Semantic-match (embedding) for paraphrased hits
Backend pluggability: InMemoryLLMCache for dev, RedisLLMCache for production, via factory
Chinese-first embedding: Default embedding model optimized for Chinese+English mixed text, with configurable fallback

2. Component Architecture

┌──────────────────────────────────────────────────────┐
│                    LLMGateway.chat()                  │
│                                                      │
│  1. Build LLMRequest                                │
│  2. ┌─ Cache Check ─────────────────────────────┐   │
│     │  generate_cache_key(req)                    │   │
│     │  cache.get(key) ──→ CacheResult             │   │
│     │    ├─ HIT (exact)  → return cached response │   │
│     │    └─ MISS → semantic_search(query_emb)     │   │
│     │         ├─ HIT (semantic) → return response │   │
│     │         └─ MISS → call provider              │   │
│     └─────────────────────────────────────────────┘   │
│  3. Call provider → LLMResponse                      │
│  4. cache.put(key, response, query_embedding)         │
│  5. Record usage                                     │
└──────────────────────────────────────────────────────┘

File Structure

src/agentkit/llm/
├── cache.py          # NEW: LLMCache Protocol + InMemoryLLMCache + RedisLLMCache + CacheResult
├── cache_key.py      # NEW: generate_cache_key(), hash helpers
├── gateway.py        # MODIFIED in U2: inject cache check
├── config.py         # MODIFIED in U2: add CacheConfig
└── ...

3. Data Model Design

3.1 CacheKey

Reasoning: The cache key must capture ALL inputs that deterministically affect LLM output. Missing any component leads to false cache hits (wrong response returned).

Component	Why Included	Hash Method
`model`	Different models produce different outputs	UTF-8 encode → SHA-256
`system_prompt`	Changes behavior fundamentally	SHA-256 of full text
`messages`	Core conversation context	SHA-256 of JSON-serialized messages
`temperature`	Affects randomness; only 0.0 is deterministic	Float string representation
`tools`	Available tools affect tool_call generation	SHA-256 of JSON-serialized tools list
`tool_choice`	"auto" vs "none" changes behavior	UTF-8 encode → SHA-256

Key formula:

key = SHA256(
    SHA256(model) +
    SHA256(system_prompt) +
    SHA256(json(messages, sort_keys=True)) +
    SHA256(str(temperature)) +
    SHA256(json(tools, sort_keys=True)) +
    SHA256(tool_choice)
)

Design Decision — Why not include max_tokens? max_tokens is a truncation limit, not a semantic input. A response cached with max_tokens=2000 is still valid when requested with max_tokens=4000 (the response was simply shorter). However, the reverse is unsafe — a response generated with max_tokens=4000 might be longer than a max_tokens=2000 request expects. Decision: Include max_tokens in the key to be safe. The cost of a few extra cache misses is negligible compared to returning a response that violates the caller's token limit.

Revised key formula:

key = SHA256(
    SHA256(model) +
    SHA256(system_prompt) +
    SHA256(json(messages, sort_keys=True)) +
    SHA256(f"{temperature:.2f}") +
    SHA256(json(tools, sort_keys=True)) +
    SHA256(tool_choice) +
    SHA256(str(max_tokens))
)

3.2 CacheEntry

@dataclass
class CacheEntry:
    """A cached LLM response with metadata."""
    response: LLMResponse          # The cached response
    query_embedding: list[float]   # Embedding of last user message (for semantic match)
    created_at: float              # time.monotonic() when cached
    hit_count: int                 # Number of cache hits

3.3 CacheResult

@dataclass
class CacheResult:
    """Result of a cache lookup."""
    hit: bool                      # Whether a cache hit occurred
    response: LLMResponse | None   # The cached response (None on miss)
    match_type: str                # "exact" | "semantic" | "" (miss)

4. Protocol Design

4.1 LLMCache Protocol

class LLMCache(Protocol):
    """LLM response cache interface."""

    async def get(self, key: str) -> CacheResult:
        """Look up a cached response by exact key, then semantic search."""
        ...

    async def put(self, key: str, response: LLMResponse, query_embedding: list[float] | None = None) -> None:
        """Store a response in the cache."""
        ...

    async def invalidate(self, pattern: str | None = None) -> int:
        """Invalidate cache entries. If pattern is None, invalidate all. Returns count of invalidated entries."""
        ...

    async def stats(self) -> dict[str, int]:
        """Return cache statistics: {total_entries, total_hits, total_misses}."""
        ...

Reasoning for async Protocol: All methods are async because RedisLLMCache uses redis.asyncio. Making the Protocol async ensures both backends share the same interface without sync/async bridging.

Why get() does both exact + semantic? The caller (LLMGateway) shouldn't need to know about the two-tier lookup. It calls cache.get(key) and gets a CacheResult with match_type indicating how the hit occurred. This encapsulation keeps the integration point simple.

4.2 Semantic Search Design

Critical Question: Should semantic search be inside get() or a separate method?

Analysis:

Option A: get(key) does exact match first, then semantic search on miss. Single call, simple integration.
Option B: Separate semantic_search(embedding) method. More flexible, but requires caller to manage two calls.

Decision: Option A. The semantic search needs the query_embedding, which must be computed before calling get(). But embedding computation is expensive (~100ms). We don't want to compute embeddings on every cache miss — only when semantic caching is enabled and temperature == 0.

Revised design:

class LLMCache(Protocol):
    async def get(self, key: str) -> CacheResult:
        """Exact-match lookup only."""
        ...

    async def semantic_search(self, query_embedding: list[float], threshold: float = 0.92) -> CacheResult:
        """Semantic similarity search across all cached entries."""
        ...

    async def put(self, key: str, response: LLMResponse, query_embedding: list[float] | None = None) -> None:
        """Store response with optional embedding for semantic matching."""
        ...

Integration flow in LLMGateway (U2):

# 1. Exact match
result = await cache.get(key)
if result.hit:
    return result.response

# 2. Semantic match (only for temperature == 0)
if request.temperature == 0 and query_embedding is not None:
    result = await cache.semantic_search(query_embedding)
    if result.hit:
        return result.response

# 3. Call provider
response = await provider.chat(request)
await cache.put(key, response, query_embedding)

This gives the gateway explicit control over when to attempt semantic search, avoiding unnecessary embedding computation.

5. InMemoryLLMCache Implementation Design

5.1 Data Structure

class InMemoryLLMCache:
    def __init__(self, max_entries: int = 10000, exact_ttl: int = 3600, semantic_ttl: int = 86400, similarity_threshold: float = 0.92):
        self._max_entries = max_entries
        self._exact_ttl = exact_ttl
        self._semantic_ttl = semantic_ttl
        self._similarity_threshold = similarity_threshold

        # Exact cache: key → CacheEntry
        self._cache: OrderedDict[str, CacheEntry] = OrderedDict()

        # Semantic index: key → query_embedding (parallel to _cache)
        self._embeddings: dict[str, list[float]] = {}

        # Stats
        self._hits = 0
        self._misses = 0

5.2 Key Operations

get(key):

Look up key in _cache
If found and not expired (check created_at + exact_ttl > now): increment hit_count, move to end (LRU), return CacheResult(hit=True, match_type="exact")
If expired: delete from _cache and _embeddings
Return CacheResult(hit=False)

semantic_search(query_embedding, threshold):

If _embeddings is empty: return miss
For each (key, emb) in _embeddings: a. Check if entry is still valid (created_at + semantic_ttl > now) b. If expired: skip (lazy cleanup) c. Compute cosine_similarity(query_embedding, emb) d. Track best match
If best similarity >= threshold: return CacheResult(hit=True, match_type="semantic")
Return miss

Performance: O(n) scan over all embeddings. With <10000 entries and 1536-dim vectors, this takes <10ms using numpy. Acceptable for now. If scale becomes an issue, switch to FAISS or pgvector (deferred).

put(key, response, query_embedding):

Create CacheEntry(response, query_embedding or [], now, 0)
If key exists: update, move to end
If new and at capacity: evict LRU (popitem(last=False))
Store embedding in _embeddings[key] if provided

invalidate(pattern):

If pattern is None: clear all
If pattern: iterate keys, match against pattern, delete matching entries

5.3 LRU Eviction Strategy

Follow EmbeddingCache pattern: OrderedDict with move_to_end() on access, popitem(last=False) on eviction. This is O(1) for both access and eviction.

Why not size-based eviction? LLM responses vary widely in size (100 bytes to 10KB). Entry-count-based eviction is simpler and more predictable. With max_entries=10000 and average response ~1KB, memory usage is ~10MB — acceptable.

6. RedisLLMCache Implementation Design

6.1 Key Schema

agentkit:llm_cache:{sha256_hex}          → JSON(CacheEntry) with TTL
agentkit:llm_cache_emb:{sha256_hex}      → JSON(list[float]) with TTL

Why two keys instead of one?

Semantic search needs to iterate all embeddings without downloading full response bodies
Embedding keys are small (~12KB for 1536-dim float list) vs. response keys (variable, potentially large with tool_calls)
Different TTLs: exact cache may have shorter TTL than semantic cache

Alternative considered: Single key with embedded embedding. Rejected because KEYS agentkit:llm_cache:* + GET for each key to extract embedding would download all response bodies for semantic search, which is wasteful.

6.2 Key Operations

get(key):

GET agentkit:llm_cache:{key} → deserialize CacheEntry
If found: INCR agentkit:llm_cache_hits:{key} (optional, for stats), return hit
Return miss

semantic_search(query_embedding, threshold):

KEYS agentkit:llm_cache_emb:* → get all embedding keys
MGET all embedding keys → deserialize embeddings
Compute cosine similarity for each
If best >= threshold: GET agentkit:llm_cache:{best_key} → return hit
Return miss

Performance concern: KEYS is O(N) and blocks Redis. For production with >1000 cached entries, this is unacceptable.

Mitigation: Use SCAN instead of KEYS for iteration. Store a Redis Set agentkit:llm_cache_index containing all active cache keys. On put(), SADD agentkit:llm_cache_index {key}. On invalidate(), SREM. For semantic search, SMEMBERS agentkit:llm_cache_index → MGET embeddings.

Revised key schema:

agentkit:llm_cache:{sha256_hex}          → JSON(CacheEntry) with TTL
agentkit:llm_cache_emb:{sha256_hex}      → JSON(list[float]) with TTL
agentkit:llm_cache_index                 → SET of active cache keys (no TTL, managed manually)

put(key, response, query_embedding):

Pipeline: SET agentkit:llm_cache:{key} → JSON(CacheEntry) EX exact_ttl
If embedding provided: SET agentkit:llm_cache_emb:{key} → JSON(embedding) EX semantic_ttl
SADD agentkit:llm_cache_index {key}

invalidate(pattern):

If pattern is None: SMEMBERS agentkit:llm_cache_index → pipeline DEL all keys → DEL index
If pattern: SMEMBERS → filter by pattern → pipeline DEL matching keys → SREM from index

6.3 Lazy Redis Initialization

Follow RedisSessionStore._get_redis() pattern:

class RedisLLMCache:
    def __init__(self, redis_url: str = "redis://localhost:6379", ...):
        self._redis_url = redis_url
        self._redis: aioredis.Redis | None = None

    async def _get_redis(self) -> aioredis.Redis:
        if self._redis is None:
            import redis.asyncio as aioredis
            self._redis = aioredis.from_url(self._redis_url, decode_responses=True)
        return self._redis

6.4 Connection Error Handling

async def get(self, key: str) -> CacheResult:
    try:
        redis = await self._get_redis()
        data = await redis.get(f"agentkit:llm_cache:{key}")
        ...
    except (redis.ConnectionError, redis.TimeoutError) as e:
        logger.warning(f"Redis cache unavailable, returning miss: {e}")
        return CacheResult(hit=False)

Design Decision: On Redis failure, return cache miss (not error). The cache is a performance optimization, not a correctness requirement. Failing open is the correct behavior.

7. Factory Function

def create_llm_cache(
    backend: str = "auto",
    redis_url: str = "redis://localhost:6379",
    max_entries: int = 10000,
    exact_ttl: int = 3600,
    semantic_ttl: int = 86400,
    similarity_threshold: float = 0.92,
) -> LLMCache:
    """Create an LLM cache backend.

    Args:
        backend: "auto" (try Redis, fallback to memory), "redis", "memory"
        ...
    """
    if backend in ("auto", "redis"):
        try:
            import redis.asyncio as aioredis
            return RedisLLMCache(redis_url=redis_url, ...)
        except ImportError:
            logger.warning("redis package not available, falling back to in-memory cache")
            return InMemoryLLMCache(...)
    return InMemoryLLMCache(...)

Follows existing pattern: create_session_store(), create_evolution_store().

8. CacheKey Generation Design

8.1 Module: `cache_key.py`

import hashlib
import json

def generate_cache_key(
    model: str,
    messages: list[dict[str, str]],
    temperature: float,
    tools: list[dict] | None = None,
    tool_choice: str = "auto",
    max_tokens: int = 2000,
    system_prompt: str | None = None,
) -> str:
    """Generate a deterministic SHA-256 cache key from LLM request parameters."""
    components = [
        _hash_str(model),
        _hash_str(system_prompt or _extract_system_prompt(messages)),
        _hash_json(messages),
        _hash_str(f"{temperature:.2f}"),
        _hash_json(tools),
        _hash_str(tool_choice),
        _hash_str(str(max_tokens)),
    ]
    combined = "".join(components)
    return hashlib.sha256(combined.encode()).hexdigest()

def _extract_system_prompt(messages: list[dict]) -> str:
    """Extract system prompt from messages list."""
    for msg in messages:
        if msg.get("role") == "system":
            return msg.get("content", "")
    return ""

def _hash_str(s: str) -> str:
    return hashlib.sha256(s.encode()).hexdigest()

def _hash_json(obj) -> str:
    if obj is None:
        return hashlib.sha256(b"null").hexdigest()
    return hashlib.sha256(json.dumps(obj, sort_keys=True, ensure_ascii=False).encode()).hexdigest()

8.2 Why Separate `system_prompt` Parameter?

The messages list already contains the system prompt. But in AgentKit, the system prompt is injected separately from the user's messages (via MemoryStore.build_system_prompt()). The gateway receives messages that already include the system prompt. So system_prompt is extracted from messages[0] when role == "system".

No separate parameter needed — _extract_system_prompt() handles extraction. This avoids requiring callers to pass system_prompt separately.

9. Semantic Match: Temperature Gate

Rule: Semantic matching is ONLY attempted when temperature == 0.0.

Reasoning:

At temperature > 0, LLM outputs are non-deterministic. Two semantically similar requests may produce different outputs.
Caching a temperature=0.7 response and returning it for a semantically similar query is misleading — the user expects randomness.
At temperature=0.0, outputs are deterministic (within provider guarantees), so semantic matching is safe.

Implementation: The gateway checks temperature before calling semantic_search(). The cache itself does not enforce this — it's a policy decision made by the caller.

10. Serialization Design

10.1 LLMResponse Serialization

LLMResponse contains content: str, model: str, usage: TokenUsage, tool_calls: list[ToolCall].

For InMemoryLLMCache: No serialization needed — store Python objects directly.

For RedisLLMCache: Serialize to JSON.

def _serialize_response(response: LLMResponse) -> dict:
    return {
        "content": response.content,
        "model": response.model,
        "usage": {
            "prompt_tokens": response.usage.prompt_tokens,
            "completion_tokens": response.usage.completion_tokens,
        },
        "tool_calls": [
            {"id": tc.id, "name": tc.name, "arguments": tc.arguments}
            for tc in response.tool_calls
        ],
        "latency_ms": response.latency_ms,
    }

def _deserialize_response(data: dict) -> LLMResponse:
    return LLMResponse(
        content=data["content"],
        model=data["model"],
        usage=TokenUsage(**data["usage"]),
        tool_calls=[ToolCall(**tc) for tc in data.get("tool_calls", [])],
        latency_ms=data.get("latency_ms", 0.0),
    )

10.2 Embedding Serialization

Embeddings are list[float] with 1536 dimensions. JSON serialization produces ~12KB per embedding.

Alternative: Binary serialization (struct.pack) would reduce to ~6KB but adds complexity. JSON is sufficient for now.

11. Edge Cases & Failure Modes

Edge Case	Behavior	Rationale
Response with `tool_calls`	Cached normally	Tool call responses are deterministic at temperature=0
Empty response (`content=""`)	Cached normally	Empty responses are valid (e.g., tool-only responses)
Very large response (>100KB)	Cached, but counted as single entry	Size-based eviction deferred; entry-count is sufficient
Concurrent `put()` for same key	Last write wins	No data corruption risk; both writes are valid responses
Redis `SET` fails	Log warning, cache miss on next read	Fail open, never block LLM calls
Embedding API fails during `put()`	Store response without embedding	Exact-match still works; semantic match degraded
Embedding API fails during `semantic_search()`	Return cache miss	Don't block on embedding failures
`invalidate()` while `get()` in progress	Possible stale read	Acceptable for cache; eventual consistency

12. Test Strategy

12.1 Unit Tests (`tests/unit/llm/test_cache.py`)

Using pytest + pytest-asyncio:

test_exact_match_hit: Same key → cache hit, match_type="exact"
test_exact_match_miss: Different key → cache miss
test_semantic_match_hit: Paraphrased query with similarity > 0.92 → hit, match_type="semantic"
test_semantic_match_miss: Unrelated query with similarity < 0.6 → miss
test_semantic_match_boundary: Similarity exactly at threshold → hit
test_ttl_expiry_exact: Entry expires after exact_ttl → miss
test_ttl_expiry_semantic: Entry expires after semantic_ttl → miss
test_lru_eviction: Add max_entries + 1 → oldest evicted
test_invalidate_all: invalidate() clears all entries
test_invalidate_pattern: invalidate("prefix:*") clears matching entries
test_cache_stats: stats() returns correct counts
test_tool_calls_cached: Response with tool_calls cached and restored correctly
test_concurrent_puts: Two concurrent puts for same key → no error
test_redis_fallback: Redis import fails → InMemoryLLMCache returned
test_cache_key_deterministic: Same inputs → same key
test_cache_key_different_model: Different model → different key
test_cache_key_different_temperature: Different temperature → different key

12.2 Mock Embedder for Testing

Use MockEmbedder from src/agentkit/memory/embedder.py. Since MockEmbedder generates deterministic embeddings based on text hash, semantically similar text will produce similar embeddings (same hash prefix → similar vector). This is sufficient for testing the similarity threshold logic.

Limitation: MockEmbedder doesn't produce truly semantically meaningful embeddings. For testing semantic matching behavior, we'll manually construct embeddings with known cosine similarities.

def _make_embedding(base: list[float], noise: float = 0.0) -> list[float]:
    """Create a unit vector with optional noise for similarity testing."""
    vec = [x + noise for x in base]
    magnitude = sum(x**2 for x in vec) ** 0.5
    return [x / magnitude for x in vec] if magnitude > 0 else vec

13. Dependency Analysis

13.1 Internal Dependencies

Dependency	Usage	Risk
`agentkit.llm.protocol.LLMResponse`	Cache entry data type	Stable, no change needed
`agentkit.llm.protocol.TokenUsage`	Part of LLMResponse	Stable
`agentkit.llm.protocol.ToolCall`	Part of LLMResponse	Stable
`agentkit.memory.embedder.Embedder`	Embedding computation for semantic match	Injected, not imported directly
`agentkit.utils.vector_math.compute_cosine_similarity`	Similarity computation	Stable utility

13.2 External Dependencies

Dependency	Usage	Required?
`redis.asyncio`	RedisLLMCache backend	Optional (only for "redis" backend)
`numpy`	Fast cosine similarity	Optional (pure-python fallback exists)

14. Implementation Sequence

Within U1, the implementation order is:

cache_key.py — No dependencies, pure functions, easy to test
cache.py — CacheResult, CacheEntry, LLMCache Protocol, InMemoryLLMCache
cache.py — RedisLLMCache, create_llm_cache() factory
test_cache.py — All unit tests

This order allows incremental testing: cache_key tests first, then InMemoryLLMCache tests, then RedisLLMCache tests.

15. Open Design Questions

Should semantic_search() return the best match or all matches above threshold?
- Current decision: Best match only. The gateway needs one response, not a ranked list. If we need ranked results later, we can add a search() method.
Should the cache store the original messages alongside the response?
- Current decision: No. The key already deterministically represents the messages. Storing them again wastes memory. If we need message-level debugging, we can add it later.
Should RedisLLMCache use Redis Hash instead of individual keys?
- Current decision: Individual keys with SET index. Hash would allow HGETALL for all entries, but makes TTL per-entry impossible (Redis Hash fields don't support individual TTLs). Individual keys with a SET index is the standard pattern.
What embedding model to use for semantic cache?
- Decision: Default to bge-m3 (BAAI/bge-m3 via Xinference or TEI endpoint) for Chinese+English mixed text. bge-m3 supports:
  - Multi-lingual (102 languages, strong Chinese)
  - Multi-granularity (dense + sparse + ColBERT)
  - Multi-function (retrieval + classification + similarity)
  - 1024-dim dense vectors (vs. 1536 for OpenAI)
- Fallback to text-embedding-3-small when only OpenAI API is available.
- The embedder is injected via constructor, so the model choice is a configuration concern, not a code concern.
- Config example:
```
llm:
  cache:
    embedding:
      provider: "xinference"    # "xinference" | "openai" | "local"
      model: "bge-m3"           # model name at provider
      base_url: "http://localhost:9997/v1"
```

16. Argumentation Summary

Design Choice	Alternatives Considered	Why This Choice
SHA-256 hash key	UUID, MD5, composite string key	SHA-256 is collision-resistant, deterministic, fixed-length; MD5 has known collisions; UUID is non-deterministic
OrderedDict LRU	heapq, custom doubly-linked-list	OrderedDict is Python-idiomatic, O(1) access+eviction, matches EmbeddingCache pattern
Separate `get()` + `semantic_search()`	Combined `get()` with auto-semantic	Explicit control avoids unnecessary embedding computation; caller decides when to attempt semantic match
Redis SET index for semantic search	KEYS pattern scan, Redis Hash	KEYS blocks Redis; Hash doesn't support per-field TTL; SET index is standard pattern
Fail-open on Redis error	Raise exception, return None	Cache is optimization, not correctness; failing open ensures LLM calls always work
Temperature gate for semantic match	Always attempt semantic match	temperature>0 outputs are non-deterministic; semantic match would return misleading cached responses
JSON serialization for Redis	MessagePack, Pickle, Protobuf	JSON is human-readable, debuggable, no extra dependencies; sufficient for <10KB entries
bge-m3 default embedding	text-embedding-3-small, multilingual-e5	bge-m3 is SOTA for Chinese+English mixed text; 1024-dim saves 33% memory vs OpenAI 1536-dim; OpenAI-compatible API via Xinference/TEI

26 KiB Raw Blame History