26 KiB
U1 Architecture Design: LLM Cache Core
Status: APPROVED — Design reviewed, embedding model set to bge-m3 for Chinese-first Date: 2026-06-14 Unit: U1 of P0 Production Hardening Plan
1. Design Goals
- Transparent caching:
LLMGateway.chat()callers cannot distinguish cached vs. uncached responses - Dual-match strategy: Exact-match (hash) for deterministic hits + Semantic-match (embedding) for paraphrased hits
- Backend pluggability:
InMemoryLLMCachefor dev,RedisLLMCachefor production, via factory - Chinese-first embedding: Default embedding model optimized for Chinese+English mixed text, with configurable fallback
2. Component Architecture
┌──────────────────────────────────────────────────────┐
│ LLMGateway.chat() │
│ │
│ 1. Build LLMRequest │
│ 2. ┌─ Cache Check ─────────────────────────────┐ │
│ │ generate_cache_key(req) │ │
│ │ cache.get(key) ──→ CacheResult │ │
│ │ ├─ HIT (exact) → return cached response │ │
│ │ └─ MISS → semantic_search(query_emb) │ │
│ │ ├─ HIT (semantic) → return response │ │
│ │ └─ MISS → call provider │ │
│ └─────────────────────────────────────────────┘ │
│ 3. Call provider → LLMResponse │
│ 4. cache.put(key, response, query_embedding) │
│ 5. Record usage │
└──────────────────────────────────────────────────────┘
File Structure
src/agentkit/llm/
├── cache.py # NEW: LLMCache Protocol + InMemoryLLMCache + RedisLLMCache + CacheResult
├── cache_key.py # NEW: generate_cache_key(), hash helpers
├── gateway.py # MODIFIED in U2: inject cache check
├── config.py # MODIFIED in U2: add CacheConfig
└── ...
3. Data Model Design
3.1 CacheKey
Reasoning: The cache key must capture ALL inputs that deterministically affect LLM output. Missing any component leads to false cache hits (wrong response returned).
| Component | Why Included | Hash Method |
|---|---|---|
model |
Different models produce different outputs | UTF-8 encode → SHA-256 |
system_prompt |
Changes behavior fundamentally | SHA-256 of full text |
messages |
Core conversation context | SHA-256 of JSON-serialized messages |
temperature |
Affects randomness; only 0.0 is deterministic | Float string representation |
tools |
Available tools affect tool_call generation | SHA-256 of JSON-serialized tools list |
tool_choice |
"auto" vs "none" changes behavior | UTF-8 encode → SHA-256 |
Key formula:
key = SHA256(
SHA256(model) +
SHA256(system_prompt) +
SHA256(json(messages, sort_keys=True)) +
SHA256(str(temperature)) +
SHA256(json(tools, sort_keys=True)) +
SHA256(tool_choice)
)
Design Decision — Why not include max_tokens?
max_tokens is a truncation limit, not a semantic input. A response cached with max_tokens=2000 is still valid when requested with max_tokens=4000 (the response was simply shorter). However, the reverse is unsafe — a response generated with max_tokens=4000 might be longer than a max_tokens=2000 request expects. Decision: Include max_tokens in the key to be safe. The cost of a few extra cache misses is negligible compared to returning a response that violates the caller's token limit.
Revised key formula:
key = SHA256(
SHA256(model) +
SHA256(system_prompt) +
SHA256(json(messages, sort_keys=True)) +
SHA256(f"{temperature:.2f}") +
SHA256(json(tools, sort_keys=True)) +
SHA256(tool_choice) +
SHA256(str(max_tokens))
)
3.2 CacheEntry
@dataclass
class CacheEntry:
"""A cached LLM response with metadata."""
response: LLMResponse # The cached response
query_embedding: list[float] # Embedding of last user message (for semantic match)
created_at: float # time.monotonic() when cached
hit_count: int # Number of cache hits
3.3 CacheResult
@dataclass
class CacheResult:
"""Result of a cache lookup."""
hit: bool # Whether a cache hit occurred
response: LLMResponse | None # The cached response (None on miss)
match_type: str # "exact" | "semantic" | "" (miss)
4. Protocol Design
4.1 LLMCache Protocol
class LLMCache(Protocol):
"""LLM response cache interface."""
async def get(self, key: str) -> CacheResult:
"""Look up a cached response by exact key, then semantic search."""
...
async def put(self, key: str, response: LLMResponse, query_embedding: list[float] | None = None) -> None:
"""Store a response in the cache."""
...
async def invalidate(self, pattern: str | None = None) -> int:
"""Invalidate cache entries. If pattern is None, invalidate all. Returns count of invalidated entries."""
...
async def stats(self) -> dict[str, int]:
"""Return cache statistics: {total_entries, total_hits, total_misses}."""
...
Reasoning for async Protocol: All methods are async because RedisLLMCache uses redis.asyncio. Making the Protocol async ensures both backends share the same interface without sync/async bridging.
Why get() does both exact + semantic? The caller (LLMGateway) shouldn't need to know about the two-tier lookup. It calls cache.get(key) and gets a CacheResult with match_type indicating how the hit occurred. This encapsulation keeps the integration point simple.
4.2 Semantic Search Design
Critical Question: Should semantic search be inside get() or a separate method?
Analysis:
- Option A:
get(key)does exact match first, then semantic search on miss. Single call, simple integration. - Option B: Separate
semantic_search(embedding)method. More flexible, but requires caller to manage two calls.
Decision: Option A. The semantic search needs the query_embedding, which must be computed before calling get(). But embedding computation is expensive (~100ms). We don't want to compute embeddings on every cache miss — only when semantic caching is enabled and temperature == 0.
Revised design:
class LLMCache(Protocol):
async def get(self, key: str) -> CacheResult:
"""Exact-match lookup only."""
...
async def semantic_search(self, query_embedding: list[float], threshold: float = 0.92) -> CacheResult:
"""Semantic similarity search across all cached entries."""
...
async def put(self, key: str, response: LLMResponse, query_embedding: list[float] | None = None) -> None:
"""Store response with optional embedding for semantic matching."""
...
Integration flow in LLMGateway (U2):
# 1. Exact match
result = await cache.get(key)
if result.hit:
return result.response
# 2. Semantic match (only for temperature == 0)
if request.temperature == 0 and query_embedding is not None:
result = await cache.semantic_search(query_embedding)
if result.hit:
return result.response
# 3. Call provider
response = await provider.chat(request)
await cache.put(key, response, query_embedding)
This gives the gateway explicit control over when to attempt semantic search, avoiding unnecessary embedding computation.
5. InMemoryLLMCache Implementation Design
5.1 Data Structure
class InMemoryLLMCache:
def __init__(self, max_entries: int = 10000, exact_ttl: int = 3600, semantic_ttl: int = 86400, similarity_threshold: float = 0.92):
self._max_entries = max_entries
self._exact_ttl = exact_ttl
self._semantic_ttl = semantic_ttl
self._similarity_threshold = similarity_threshold
# Exact cache: key → CacheEntry
self._cache: OrderedDict[str, CacheEntry] = OrderedDict()
# Semantic index: key → query_embedding (parallel to _cache)
self._embeddings: dict[str, list[float]] = {}
# Stats
self._hits = 0
self._misses = 0
5.2 Key Operations
get(key):
- Look up
keyin_cache - If found and not expired (check
created_at + exact_ttl > now): incrementhit_count, move to end (LRU), returnCacheResult(hit=True, match_type="exact") - If expired: delete from
_cacheand_embeddings - Return
CacheResult(hit=False)
semantic_search(query_embedding, threshold):
- If
_embeddingsis empty: return miss - For each
(key, emb)in_embeddings: a. Check if entry is still valid (created_at + semantic_ttl > now) b. If expired: skip (lazy cleanup) c. Computecosine_similarity(query_embedding, emb)d. Track best match - If best similarity >= threshold: return
CacheResult(hit=True, match_type="semantic") - Return miss
Performance: O(n) scan over all embeddings. With <10000 entries and 1536-dim vectors, this takes <10ms using numpy. Acceptable for now. If scale becomes an issue, switch to FAISS or pgvector (deferred).
put(key, response, query_embedding):
- Create
CacheEntry(response, query_embedding or [], now, 0) - If key exists: update, move to end
- If new and at capacity: evict LRU (popitem(last=False))
- Store embedding in
_embeddings[key]if provided
invalidate(pattern):
- If pattern is None: clear all
- If pattern: iterate keys, match against pattern, delete matching entries
5.3 LRU Eviction Strategy
Follow EmbeddingCache pattern: OrderedDict with move_to_end() on access, popitem(last=False) on eviction. This is O(1) for both access and eviction.
Why not size-based eviction? LLM responses vary widely in size (100 bytes to 10KB). Entry-count-based eviction is simpler and more predictable. With max_entries=10000 and average response ~1KB, memory usage is ~10MB — acceptable.
6. RedisLLMCache Implementation Design
6.1 Key Schema
agentkit:llm_cache:{sha256_hex} → JSON(CacheEntry) with TTL
agentkit:llm_cache_emb:{sha256_hex} → JSON(list[float]) with TTL
Why two keys instead of one?
- Semantic search needs to iterate all embeddings without downloading full response bodies
- Embedding keys are small (~12KB for 1536-dim float list) vs. response keys (variable, potentially large with tool_calls)
- Different TTLs: exact cache may have shorter TTL than semantic cache
Alternative considered: Single key with embedded embedding. Rejected because KEYS agentkit:llm_cache:* + GET for each key to extract embedding would download all response bodies for semantic search, which is wasteful.
6.2 Key Operations
get(key):
GET agentkit:llm_cache:{key}→ deserialize CacheEntry- If found:
INCR agentkit:llm_cache_hits:{key}(optional, for stats), return hit - Return miss
semantic_search(query_embedding, threshold):
KEYS agentkit:llm_cache_emb:*→ get all embedding keysMGETall embedding keys → deserialize embeddings- Compute cosine similarity for each
- If best >= threshold:
GET agentkit:llm_cache:{best_key}→ return hit - Return miss
Performance concern: KEYS is O(N) and blocks Redis. For production with >1000 cached entries, this is unacceptable.
Mitigation: Use SCAN instead of KEYS for iteration. Store a Redis Set agentkit:llm_cache_index containing all active cache keys. On put(), SADD agentkit:llm_cache_index {key}. On invalidate(), SREM. For semantic search, SMEMBERS agentkit:llm_cache_index → MGET embeddings.
Revised key schema:
agentkit:llm_cache:{sha256_hex} → JSON(CacheEntry) with TTL
agentkit:llm_cache_emb:{sha256_hex} → JSON(list[float]) with TTL
agentkit:llm_cache_index → SET of active cache keys (no TTL, managed manually)
put(key, response, query_embedding):
- Pipeline:
SET agentkit:llm_cache:{key} → JSON(CacheEntry) EX exact_ttl - If embedding provided:
SET agentkit:llm_cache_emb:{key} → JSON(embedding) EX semantic_ttl SADD agentkit:llm_cache_index {key}
invalidate(pattern):
- If pattern is None:
SMEMBERS agentkit:llm_cache_index→ pipeline DEL all keys → DEL index - If pattern:
SMEMBERS→ filter by pattern → pipeline DEL matching keys → SREM from index
6.3 Lazy Redis Initialization
Follow RedisSessionStore._get_redis() pattern:
class RedisLLMCache:
def __init__(self, redis_url: str = "redis://localhost:6379", ...):
self._redis_url = redis_url
self._redis: aioredis.Redis | None = None
async def _get_redis(self) -> aioredis.Redis:
if self._redis is None:
import redis.asyncio as aioredis
self._redis = aioredis.from_url(self._redis_url, decode_responses=True)
return self._redis
6.4 Connection Error Handling
async def get(self, key: str) -> CacheResult:
try:
redis = await self._get_redis()
data = await redis.get(f"agentkit:llm_cache:{key}")
...
except (redis.ConnectionError, redis.TimeoutError) as e:
logger.warning(f"Redis cache unavailable, returning miss: {e}")
return CacheResult(hit=False)
Design Decision: On Redis failure, return cache miss (not error). The cache is a performance optimization, not a correctness requirement. Failing open is the correct behavior.
7. Factory Function
def create_llm_cache(
backend: str = "auto",
redis_url: str = "redis://localhost:6379",
max_entries: int = 10000,
exact_ttl: int = 3600,
semantic_ttl: int = 86400,
similarity_threshold: float = 0.92,
) -> LLMCache:
"""Create an LLM cache backend.
Args:
backend: "auto" (try Redis, fallback to memory), "redis", "memory"
...
"""
if backend in ("auto", "redis"):
try:
import redis.asyncio as aioredis
return RedisLLMCache(redis_url=redis_url, ...)
except ImportError:
logger.warning("redis package not available, falling back to in-memory cache")
return InMemoryLLMCache(...)
return InMemoryLLMCache(...)
Follows existing pattern: create_session_store(), create_evolution_store().
8. CacheKey Generation Design
8.1 Module: cache_key.py
import hashlib
import json
def generate_cache_key(
model: str,
messages: list[dict[str, str]],
temperature: float,
tools: list[dict] | None = None,
tool_choice: str = "auto",
max_tokens: int = 2000,
system_prompt: str | None = None,
) -> str:
"""Generate a deterministic SHA-256 cache key from LLM request parameters."""
components = [
_hash_str(model),
_hash_str(system_prompt or _extract_system_prompt(messages)),
_hash_json(messages),
_hash_str(f"{temperature:.2f}"),
_hash_json(tools),
_hash_str(tool_choice),
_hash_str(str(max_tokens)),
]
combined = "".join(components)
return hashlib.sha256(combined.encode()).hexdigest()
def _extract_system_prompt(messages: list[dict]) -> str:
"""Extract system prompt from messages list."""
for msg in messages:
if msg.get("role") == "system":
return msg.get("content", "")
return ""
def _hash_str(s: str) -> str:
return hashlib.sha256(s.encode()).hexdigest()
def _hash_json(obj) -> str:
if obj is None:
return hashlib.sha256(b"null").hexdigest()
return hashlib.sha256(json.dumps(obj, sort_keys=True, ensure_ascii=False).encode()).hexdigest()
8.2 Why Separate system_prompt Parameter?
The messages list already contains the system prompt. But in AgentKit, the system prompt is injected separately from the user's messages (via MemoryStore.build_system_prompt()). The gateway receives messages that already include the system prompt. So system_prompt is extracted from messages[0] when role == "system".
No separate parameter needed — _extract_system_prompt() handles extraction. This avoids requiring callers to pass system_prompt separately.
9. Semantic Match: Temperature Gate
Rule: Semantic matching is ONLY attempted when temperature == 0.0.
Reasoning:
- At
temperature > 0, LLM outputs are non-deterministic. Two semantically similar requests may produce different outputs. - Caching a
temperature=0.7response and returning it for a semantically similar query is misleading — the user expects randomness. - At
temperature=0.0, outputs are deterministic (within provider guarantees), so semantic matching is safe.
Implementation: The gateway checks temperature before calling semantic_search(). The cache itself does not enforce this — it's a policy decision made by the caller.
10. Serialization Design
10.1 LLMResponse Serialization
LLMResponse contains content: str, model: str, usage: TokenUsage, tool_calls: list[ToolCall].
For InMemoryLLMCache: No serialization needed — store Python objects directly.
For RedisLLMCache: Serialize to JSON.
def _serialize_response(response: LLMResponse) -> dict:
return {
"content": response.content,
"model": response.model,
"usage": {
"prompt_tokens": response.usage.prompt_tokens,
"completion_tokens": response.usage.completion_tokens,
},
"tool_calls": [
{"id": tc.id, "name": tc.name, "arguments": tc.arguments}
for tc in response.tool_calls
],
"latency_ms": response.latency_ms,
}
def _deserialize_response(data: dict) -> LLMResponse:
return LLMResponse(
content=data["content"],
model=data["model"],
usage=TokenUsage(**data["usage"]),
tool_calls=[ToolCall(**tc) for tc in data.get("tool_calls", [])],
latency_ms=data.get("latency_ms", 0.0),
)
10.2 Embedding Serialization
Embeddings are list[float] with 1536 dimensions. JSON serialization produces ~12KB per embedding.
Alternative: Binary serialization (struct.pack) would reduce to ~6KB but adds complexity. JSON is sufficient for now.
11. Edge Cases & Failure Modes
| Edge Case | Behavior | Rationale |
|---|---|---|
Response with tool_calls |
Cached normally | Tool call responses are deterministic at temperature=0 |
Empty response (content="") |
Cached normally | Empty responses are valid (e.g., tool-only responses) |
| Very large response (>100KB) | Cached, but counted as single entry | Size-based eviction deferred; entry-count is sufficient |
Concurrent put() for same key |
Last write wins | No data corruption risk; both writes are valid responses |
Redis SET fails |
Log warning, cache miss on next read | Fail open, never block LLM calls |
Embedding API fails during put() |
Store response without embedding | Exact-match still works; semantic match degraded |
Embedding API fails during semantic_search() |
Return cache miss | Don't block on embedding failures |
invalidate() while get() in progress |
Possible stale read | Acceptable for cache; eventual consistency |
12. Test Strategy
12.1 Unit Tests (tests/unit/llm/test_cache.py)
Using pytest + pytest-asyncio:
- test_exact_match_hit: Same key → cache hit,
match_type="exact" - test_exact_match_miss: Different key → cache miss
- test_semantic_match_hit: Paraphrased query with similarity > 0.92 → hit,
match_type="semantic" - test_semantic_match_miss: Unrelated query with similarity < 0.6 → miss
- test_semantic_match_boundary: Similarity exactly at threshold → hit
- test_ttl_expiry_exact: Entry expires after exact_ttl → miss
- test_ttl_expiry_semantic: Entry expires after semantic_ttl → miss
- test_lru_eviction: Add max_entries + 1 → oldest evicted
- test_invalidate_all:
invalidate()clears all entries - test_invalidate_pattern:
invalidate("prefix:*")clears matching entries - test_cache_stats:
stats()returns correct counts - test_tool_calls_cached: Response with tool_calls cached and restored correctly
- test_concurrent_puts: Two concurrent puts for same key → no error
- test_redis_fallback: Redis import fails → InMemoryLLMCache returned
- test_cache_key_deterministic: Same inputs → same key
- test_cache_key_different_model: Different model → different key
- test_cache_key_different_temperature: Different temperature → different key
12.2 Mock Embedder for Testing
Use MockEmbedder from src/agentkit/memory/embedder.py. Since MockEmbedder generates deterministic embeddings based on text hash, semantically similar text will produce similar embeddings (same hash prefix → similar vector). This is sufficient for testing the similarity threshold logic.
Limitation: MockEmbedder doesn't produce truly semantically meaningful embeddings. For testing semantic matching behavior, we'll manually construct embeddings with known cosine similarities.
def _make_embedding(base: list[float], noise: float = 0.0) -> list[float]:
"""Create a unit vector with optional noise for similarity testing."""
vec = [x + noise for x in base]
magnitude = sum(x**2 for x in vec) ** 0.5
return [x / magnitude for x in vec] if magnitude > 0 else vec
13. Dependency Analysis
13.1 Internal Dependencies
| Dependency | Usage | Risk |
|---|---|---|
agentkit.llm.protocol.LLMResponse |
Cache entry data type | Stable, no change needed |
agentkit.llm.protocol.TokenUsage |
Part of LLMResponse | Stable |
agentkit.llm.protocol.ToolCall |
Part of LLMResponse | Stable |
agentkit.memory.embedder.Embedder |
Embedding computation for semantic match | Injected, not imported directly |
agentkit.utils.vector_math.compute_cosine_similarity |
Similarity computation | Stable utility |
13.2 External Dependencies
| Dependency | Usage | Required? |
|---|---|---|
redis.asyncio |
RedisLLMCache backend | Optional (only for "redis" backend) |
numpy |
Fast cosine similarity | Optional (pure-python fallback exists) |
14. Implementation Sequence
Within U1, the implementation order is:
cache_key.py— No dependencies, pure functions, easy to testcache.py—CacheResult,CacheEntry,LLMCacheProtocol,InMemoryLLMCachecache.py—RedisLLMCache,create_llm_cache()factorytest_cache.py— All unit tests
This order allows incremental testing: cache_key tests first, then InMemoryLLMCache tests, then RedisLLMCache tests.
15. Open Design Questions
-
Should
semantic_search()return the best match or all matches above threshold?- Current decision: Best match only. The gateway needs one response, not a ranked list. If we need ranked results later, we can add a
search()method.
- Current decision: Best match only. The gateway needs one response, not a ranked list. If we need ranked results later, we can add a
-
Should the cache store the original
messagesalongside the response?- Current decision: No. The key already deterministically represents the messages. Storing them again wastes memory. If we need message-level debugging, we can add it later.
-
Should
RedisLLMCacheuse Redis Hash instead of individual keys?- Current decision: Individual keys with SET index. Hash would allow
HGETALLfor all entries, but makes TTL per-entry impossible (Redis Hash fields don't support individual TTLs). Individual keys with a SET index is the standard pattern.
- Current decision: Individual keys with SET index. Hash would allow
-
What embedding model to use for semantic cache?
- Decision: Default to
bge-m3(BAAI/bge-m3 via Xinference or TEI endpoint) for Chinese+English mixed text.bge-m3supports:- Multi-lingual (102 languages, strong Chinese)
- Multi-granularity (dense + sparse + ColBERT)
- Multi-function (retrieval + classification + similarity)
- 1024-dim dense vectors (vs. 1536 for OpenAI)
- Fallback to
text-embedding-3-smallwhen only OpenAI API is available. - The embedder is injected via constructor, so the model choice is a configuration concern, not a code concern.
- Config example:
llm: cache: embedding: provider: "xinference" # "xinference" | "openai" | "local" model: "bge-m3" # model name at provider base_url: "http://localhost:9997/v1"
- Decision: Default to
16. Argumentation Summary
| Design Choice | Alternatives Considered | Why This Choice |
|---|---|---|
| SHA-256 hash key | UUID, MD5, composite string key | SHA-256 is collision-resistant, deterministic, fixed-length; MD5 has known collisions; UUID is non-deterministic |
| OrderedDict LRU | heapq, custom doubly-linked-list | OrderedDict is Python-idiomatic, O(1) access+eviction, matches EmbeddingCache pattern |
Separate get() + semantic_search() |
Combined get() with auto-semantic |
Explicit control avoids unnecessary embedding computation; caller decides when to attempt semantic match |
| Redis SET index for semantic search | KEYS pattern scan, Redis Hash | KEYS blocks Redis; Hash doesn't support per-field TTL; SET index is standard pattern |
| Fail-open on Redis error | Raise exception, return None | Cache is optimization, not correctness; failing open ensures LLM calls always work |
| Temperature gate for semantic match | Always attempt semantic match | temperature>0 outputs are non-deterministic; semantic match would return misleading cached responses |
| JSON serialization for Redis | MessagePack, Pickle, Protobuf | JSON is human-readable, debuggable, no extra dependencies; sufficient for <10KB entries |
| bge-m3 default embedding | text-embedding-3-small, multilingual-e5 | bge-m3 is SOTA for Chinese+English mixed text; 1024-dim saves 33% memory vs OpenAI 1536-dim; OpenAI-compatible API via Xinference/TEI |