617 lines
26 KiB
Markdown
617 lines
26 KiB
Markdown
# U1 Architecture Design: LLM Cache Core
|
|
|
|
> Status: APPROVED — Design reviewed, embedding model set to bge-m3 for Chinese-first
|
|
> Date: 2026-06-14
|
|
> Unit: U1 of P0 Production Hardening Plan
|
|
|
|
---
|
|
|
|
## 1. Design Goals
|
|
|
|
1. **Transparent caching**: `LLMGateway.chat()` callers cannot distinguish cached vs. uncached responses
|
|
2. **Dual-match strategy**: Exact-match (hash) for deterministic hits + Semantic-match (embedding) for paraphrased hits
|
|
3. **Backend pluggability**: `InMemoryLLMCache` for dev, `RedisLLMCache` for production, via factory
|
|
4. **Chinese-first embedding**: Default embedding model optimized for Chinese+English mixed text, with configurable fallback
|
|
|
|
---
|
|
|
|
## 2. Component Architecture
|
|
|
|
```
|
|
┌──────────────────────────────────────────────────────┐
|
|
│ LLMGateway.chat() │
|
|
│ │
|
|
│ 1. Build LLMRequest │
|
|
│ 2. ┌─ Cache Check ─────────────────────────────┐ │
|
|
│ │ generate_cache_key(req) │ │
|
|
│ │ cache.get(key) ──→ CacheResult │ │
|
|
│ │ ├─ HIT (exact) → return cached response │ │
|
|
│ │ └─ MISS → semantic_search(query_emb) │ │
|
|
│ │ ├─ HIT (semantic) → return response │ │
|
|
│ │ └─ MISS → call provider │ │
|
|
│ └─────────────────────────────────────────────┘ │
|
|
│ 3. Call provider → LLMResponse │
|
|
│ 4. cache.put(key, response, query_embedding) │
|
|
│ 5. Record usage │
|
|
└──────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
### File Structure
|
|
|
|
```
|
|
src/agentkit/llm/
|
|
├── cache.py # NEW: LLMCache Protocol + InMemoryLLMCache + RedisLLMCache + CacheResult
|
|
├── cache_key.py # NEW: generate_cache_key(), hash helpers
|
|
├── gateway.py # MODIFIED in U2: inject cache check
|
|
├── config.py # MODIFIED in U2: add CacheConfig
|
|
└── ...
|
|
```
|
|
|
|
---
|
|
|
|
## 3. Data Model Design
|
|
|
|
### 3.1 CacheKey
|
|
|
|
**Reasoning**: The cache key must capture ALL inputs that deterministically affect LLM output. Missing any component leads to false cache hits (wrong response returned).
|
|
|
|
| Component | Why Included | Hash Method |
|
|
|-----------|-------------|-------------|
|
|
| `model` | Different models produce different outputs | UTF-8 encode → SHA-256 |
|
|
| `system_prompt` | Changes behavior fundamentally | SHA-256 of full text |
|
|
| `messages` | Core conversation context | SHA-256 of JSON-serialized messages |
|
|
| `temperature` | Affects randomness; only 0.0 is deterministic | Float string representation |
|
|
| `tools` | Available tools affect tool_call generation | SHA-256 of JSON-serialized tools list |
|
|
| `tool_choice` | "auto" vs "none" changes behavior | UTF-8 encode → SHA-256 |
|
|
|
|
**Key formula**:
|
|
```python
|
|
key = SHA256(
|
|
SHA256(model) +
|
|
SHA256(system_prompt) +
|
|
SHA256(json(messages, sort_keys=True)) +
|
|
SHA256(str(temperature)) +
|
|
SHA256(json(tools, sort_keys=True)) +
|
|
SHA256(tool_choice)
|
|
)
|
|
```
|
|
|
|
**Design Decision — Why not include `max_tokens`?**
|
|
`max_tokens` is a truncation limit, not a semantic input. A response cached with `max_tokens=2000` is still valid when requested with `max_tokens=4000` (the response was simply shorter). However, the reverse is unsafe — a response generated with `max_tokens=4000` might be longer than a `max_tokens=2000` request expects. **Decision**: Include `max_tokens` in the key to be safe. The cost of a few extra cache misses is negligible compared to returning a response that violates the caller's token limit.
|
|
|
|
**Revised key formula**:
|
|
```python
|
|
key = SHA256(
|
|
SHA256(model) +
|
|
SHA256(system_prompt) +
|
|
SHA256(json(messages, sort_keys=True)) +
|
|
SHA256(f"{temperature:.2f}") +
|
|
SHA256(json(tools, sort_keys=True)) +
|
|
SHA256(tool_choice) +
|
|
SHA256(str(max_tokens))
|
|
)
|
|
```
|
|
|
|
### 3.2 CacheEntry
|
|
|
|
```python
|
|
@dataclass
|
|
class CacheEntry:
|
|
"""A cached LLM response with metadata."""
|
|
response: LLMResponse # The cached response
|
|
query_embedding: list[float] # Embedding of last user message (for semantic match)
|
|
created_at: float # time.monotonic() when cached
|
|
hit_count: int # Number of cache hits
|
|
```
|
|
|
|
### 3.3 CacheResult
|
|
|
|
```python
|
|
@dataclass
|
|
class CacheResult:
|
|
"""Result of a cache lookup."""
|
|
hit: bool # Whether a cache hit occurred
|
|
response: LLMResponse | None # The cached response (None on miss)
|
|
match_type: str # "exact" | "semantic" | "" (miss)
|
|
```
|
|
|
|
---
|
|
|
|
## 4. Protocol Design
|
|
|
|
### 4.1 LLMCache Protocol
|
|
|
|
```python
|
|
class LLMCache(Protocol):
|
|
"""LLM response cache interface."""
|
|
|
|
async def get(self, key: str) -> CacheResult:
|
|
"""Look up a cached response by exact key, then semantic search."""
|
|
...
|
|
|
|
async def put(self, key: str, response: LLMResponse, query_embedding: list[float] | None = None) -> None:
|
|
"""Store a response in the cache."""
|
|
...
|
|
|
|
async def invalidate(self, pattern: str | None = None) -> int:
|
|
"""Invalidate cache entries. If pattern is None, invalidate all. Returns count of invalidated entries."""
|
|
...
|
|
|
|
async def stats(self) -> dict[str, int]:
|
|
"""Return cache statistics: {total_entries, total_hits, total_misses}."""
|
|
...
|
|
```
|
|
|
|
**Reasoning for async Protocol**: All methods are async because `RedisLLMCache` uses `redis.asyncio`. Making the Protocol async ensures both backends share the same interface without sync/async bridging.
|
|
|
|
**Why `get()` does both exact + semantic?** The caller (LLMGateway) shouldn't need to know about the two-tier lookup. It calls `cache.get(key)` and gets a `CacheResult` with `match_type` indicating how the hit occurred. This encapsulation keeps the integration point simple.
|
|
|
|
### 4.2 Semantic Search Design
|
|
|
|
**Critical Question**: Should semantic search be inside `get()` or a separate method?
|
|
|
|
**Analysis**:
|
|
- **Option A**: `get(key)` does exact match first, then semantic search on miss. Single call, simple integration.
|
|
- **Option B**: Separate `semantic_search(embedding)` method. More flexible, but requires caller to manage two calls.
|
|
|
|
**Decision**: Option A. The semantic search needs the `query_embedding`, which must be computed before calling `get()`. But embedding computation is expensive (~100ms). We don't want to compute embeddings on every cache miss — only when semantic caching is enabled and temperature == 0.
|
|
|
|
**Revised design**:
|
|
|
|
```python
|
|
class LLMCache(Protocol):
|
|
async def get(self, key: str) -> CacheResult:
|
|
"""Exact-match lookup only."""
|
|
...
|
|
|
|
async def semantic_search(self, query_embedding: list[float], threshold: float = 0.92) -> CacheResult:
|
|
"""Semantic similarity search across all cached entries."""
|
|
...
|
|
|
|
async def put(self, key: str, response: LLMResponse, query_embedding: list[float] | None = None) -> None:
|
|
"""Store response with optional embedding for semantic matching."""
|
|
...
|
|
```
|
|
|
|
**Integration flow in LLMGateway (U2)**:
|
|
```python
|
|
# 1. Exact match
|
|
result = await cache.get(key)
|
|
if result.hit:
|
|
return result.response
|
|
|
|
# 2. Semantic match (only for temperature == 0)
|
|
if request.temperature == 0 and query_embedding is not None:
|
|
result = await cache.semantic_search(query_embedding)
|
|
if result.hit:
|
|
return result.response
|
|
|
|
# 3. Call provider
|
|
response = await provider.chat(request)
|
|
await cache.put(key, response, query_embedding)
|
|
```
|
|
|
|
This gives the gateway explicit control over when to attempt semantic search, avoiding unnecessary embedding computation.
|
|
|
|
---
|
|
|
|
## 5. InMemoryLLMCache Implementation Design
|
|
|
|
### 5.1 Data Structure
|
|
|
|
```python
|
|
class InMemoryLLMCache:
|
|
def __init__(self, max_entries: int = 10000, exact_ttl: int = 3600, semantic_ttl: int = 86400, similarity_threshold: float = 0.92):
|
|
self._max_entries = max_entries
|
|
self._exact_ttl = exact_ttl
|
|
self._semantic_ttl = semantic_ttl
|
|
self._similarity_threshold = similarity_threshold
|
|
|
|
# Exact cache: key → CacheEntry
|
|
self._cache: OrderedDict[str, CacheEntry] = OrderedDict()
|
|
|
|
# Semantic index: key → query_embedding (parallel to _cache)
|
|
self._embeddings: dict[str, list[float]] = {}
|
|
|
|
# Stats
|
|
self._hits = 0
|
|
self._misses = 0
|
|
```
|
|
|
|
### 5.2 Key Operations
|
|
|
|
**`get(key)`**:
|
|
1. Look up `key` in `_cache`
|
|
2. If found and not expired (check `created_at + exact_ttl > now`): increment `hit_count`, move to end (LRU), return `CacheResult(hit=True, match_type="exact")`
|
|
3. If expired: delete from `_cache` and `_embeddings`
|
|
4. Return `CacheResult(hit=False)`
|
|
|
|
**`semantic_search(query_embedding, threshold)`**:
|
|
1. If `_embeddings` is empty: return miss
|
|
2. For each `(key, emb)` in `_embeddings`:
|
|
a. Check if entry is still valid (`created_at + semantic_ttl > now`)
|
|
b. If expired: skip (lazy cleanup)
|
|
c. Compute `cosine_similarity(query_embedding, emb)`
|
|
d. Track best match
|
|
3. If best similarity >= threshold: return `CacheResult(hit=True, match_type="semantic")`
|
|
4. Return miss
|
|
|
|
**Performance**: O(n) scan over all embeddings. With <10000 entries and 1536-dim vectors, this takes <10ms using numpy. Acceptable for now. If scale becomes an issue, switch to FAISS or pgvector (deferred).
|
|
|
|
**`put(key, response, query_embedding)`**:
|
|
1. Create `CacheEntry(response, query_embedding or [], now, 0)`
|
|
2. If key exists: update, move to end
|
|
3. If new and at capacity: evict LRU (popitem(last=False))
|
|
4. Store embedding in `_embeddings[key]` if provided
|
|
|
|
**`invalidate(pattern)`**:
|
|
1. If pattern is None: clear all
|
|
2. If pattern: iterate keys, match against pattern, delete matching entries
|
|
|
|
### 5.3 LRU Eviction Strategy
|
|
|
|
Follow `EmbeddingCache` pattern: `OrderedDict` with `move_to_end()` on access, `popitem(last=False)` on eviction. This is O(1) for both access and eviction.
|
|
|
|
**Why not size-based eviction?** LLM responses vary widely in size (100 bytes to 10KB). Entry-count-based eviction is simpler and more predictable. With `max_entries=10000` and average response ~1KB, memory usage is ~10MB — acceptable.
|
|
|
|
---
|
|
|
|
## 6. RedisLLMCache Implementation Design
|
|
|
|
### 6.1 Key Schema
|
|
|
|
```
|
|
agentkit:llm_cache:{sha256_hex} → JSON(CacheEntry) with TTL
|
|
agentkit:llm_cache_emb:{sha256_hex} → JSON(list[float]) with TTL
|
|
```
|
|
|
|
**Why two keys instead of one?**
|
|
- Semantic search needs to iterate all embeddings without downloading full response bodies
|
|
- Embedding keys are small (~12KB for 1536-dim float list) vs. response keys (variable, potentially large with tool_calls)
|
|
- Different TTLs: exact cache may have shorter TTL than semantic cache
|
|
|
|
**Alternative considered**: Single key with embedded embedding. Rejected because `KEYS agentkit:llm_cache:*` + `GET` for each key to extract embedding would download all response bodies for semantic search, which is wasteful.
|
|
|
|
### 6.2 Key Operations
|
|
|
|
**`get(key)`**:
|
|
1. `GET agentkit:llm_cache:{key}` → deserialize CacheEntry
|
|
2. If found: `INCR agentkit:llm_cache_hits:{key}` (optional, for stats), return hit
|
|
3. Return miss
|
|
|
|
**`semantic_search(query_embedding, threshold)`**:
|
|
1. `KEYS agentkit:llm_cache_emb:*` → get all embedding keys
|
|
2. `MGET` all embedding keys → deserialize embeddings
|
|
3. Compute cosine similarity for each
|
|
4. If best >= threshold: `GET agentkit:llm_cache:{best_key}` → return hit
|
|
5. Return miss
|
|
|
|
**Performance concern**: `KEYS` is O(N) and blocks Redis. For production with >1000 cached entries, this is unacceptable.
|
|
|
|
**Mitigation**: Use `SCAN` instead of `KEYS` for iteration. Store a Redis Set `agentkit:llm_cache_index` containing all active cache keys. On `put()`, `SADD agentkit:llm_cache_index {key}`. On `invalidate()`, `SREM`. For semantic search, `SMEMBERS agentkit:llm_cache_index` → `MGET` embeddings.
|
|
|
|
**Revised key schema**:
|
|
```
|
|
agentkit:llm_cache:{sha256_hex} → JSON(CacheEntry) with TTL
|
|
agentkit:llm_cache_emb:{sha256_hex} → JSON(list[float]) with TTL
|
|
agentkit:llm_cache_index → SET of active cache keys (no TTL, managed manually)
|
|
```
|
|
|
|
**`put(key, response, query_embedding)`**:
|
|
1. Pipeline: `SET agentkit:llm_cache:{key} → JSON(CacheEntry) EX exact_ttl`
|
|
2. If embedding provided: `SET agentkit:llm_cache_emb:{key} → JSON(embedding) EX semantic_ttl`
|
|
3. `SADD agentkit:llm_cache_index {key}`
|
|
|
|
**`invalidate(pattern)`**:
|
|
1. If pattern is None: `SMEMBERS agentkit:llm_cache_index` → pipeline DEL all keys → DEL index
|
|
2. If pattern: `SMEMBERS` → filter by pattern → pipeline DEL matching keys → SREM from index
|
|
|
|
### 6.3 Lazy Redis Initialization
|
|
|
|
Follow `RedisSessionStore._get_redis()` pattern:
|
|
|
|
```python
|
|
class RedisLLMCache:
|
|
def __init__(self, redis_url: str = "redis://localhost:6379", ...):
|
|
self._redis_url = redis_url
|
|
self._redis: aioredis.Redis | None = None
|
|
|
|
async def _get_redis(self) -> aioredis.Redis:
|
|
if self._redis is None:
|
|
import redis.asyncio as aioredis
|
|
self._redis = aioredis.from_url(self._redis_url, decode_responses=True)
|
|
return self._redis
|
|
```
|
|
|
|
### 6.4 Connection Error Handling
|
|
|
|
```python
|
|
async def get(self, key: str) -> CacheResult:
|
|
try:
|
|
redis = await self._get_redis()
|
|
data = await redis.get(f"agentkit:llm_cache:{key}")
|
|
...
|
|
except (redis.ConnectionError, redis.TimeoutError) as e:
|
|
logger.warning(f"Redis cache unavailable, returning miss: {e}")
|
|
return CacheResult(hit=False)
|
|
```
|
|
|
|
**Design Decision**: On Redis failure, return cache miss (not error). The cache is a performance optimization, not a correctness requirement. Failing open is the correct behavior.
|
|
|
|
---
|
|
|
|
## 7. Factory Function
|
|
|
|
```python
|
|
def create_llm_cache(
|
|
backend: str = "auto",
|
|
redis_url: str = "redis://localhost:6379",
|
|
max_entries: int = 10000,
|
|
exact_ttl: int = 3600,
|
|
semantic_ttl: int = 86400,
|
|
similarity_threshold: float = 0.92,
|
|
) -> LLMCache:
|
|
"""Create an LLM cache backend.
|
|
|
|
Args:
|
|
backend: "auto" (try Redis, fallback to memory), "redis", "memory"
|
|
...
|
|
"""
|
|
if backend in ("auto", "redis"):
|
|
try:
|
|
import redis.asyncio as aioredis
|
|
return RedisLLMCache(redis_url=redis_url, ...)
|
|
except ImportError:
|
|
logger.warning("redis package not available, falling back to in-memory cache")
|
|
return InMemoryLLMCache(...)
|
|
return InMemoryLLMCache(...)
|
|
```
|
|
|
|
**Follows existing pattern**: `create_session_store()`, `create_evolution_store()`.
|
|
|
|
---
|
|
|
|
## 8. CacheKey Generation Design
|
|
|
|
### 8.1 Module: `cache_key.py`
|
|
|
|
```python
|
|
import hashlib
|
|
import json
|
|
|
|
def generate_cache_key(
|
|
model: str,
|
|
messages: list[dict[str, str]],
|
|
temperature: float,
|
|
tools: list[dict] | None = None,
|
|
tool_choice: str = "auto",
|
|
max_tokens: int = 2000,
|
|
system_prompt: str | None = None,
|
|
) -> str:
|
|
"""Generate a deterministic SHA-256 cache key from LLM request parameters."""
|
|
components = [
|
|
_hash_str(model),
|
|
_hash_str(system_prompt or _extract_system_prompt(messages)),
|
|
_hash_json(messages),
|
|
_hash_str(f"{temperature:.2f}"),
|
|
_hash_json(tools),
|
|
_hash_str(tool_choice),
|
|
_hash_str(str(max_tokens)),
|
|
]
|
|
combined = "".join(components)
|
|
return hashlib.sha256(combined.encode()).hexdigest()
|
|
|
|
def _extract_system_prompt(messages: list[dict]) -> str:
|
|
"""Extract system prompt from messages list."""
|
|
for msg in messages:
|
|
if msg.get("role") == "system":
|
|
return msg.get("content", "")
|
|
return ""
|
|
|
|
def _hash_str(s: str) -> str:
|
|
return hashlib.sha256(s.encode()).hexdigest()
|
|
|
|
def _hash_json(obj) -> str:
|
|
if obj is None:
|
|
return hashlib.sha256(b"null").hexdigest()
|
|
return hashlib.sha256(json.dumps(obj, sort_keys=True, ensure_ascii=False).encode()).hexdigest()
|
|
```
|
|
|
|
### 8.2 Why Separate `system_prompt` Parameter?
|
|
|
|
The `messages` list already contains the system prompt. But in AgentKit, the system prompt is injected separately from the user's messages (via `MemoryStore.build_system_prompt()`). The gateway receives `messages` that already include the system prompt. So `system_prompt` is extracted from `messages[0]` when `role == "system"`.
|
|
|
|
**No separate parameter needed** — `_extract_system_prompt()` handles extraction. This avoids requiring callers to pass system_prompt separately.
|
|
|
|
---
|
|
|
|
## 9. Semantic Match: Temperature Gate
|
|
|
|
**Rule**: Semantic matching is ONLY attempted when `temperature == 0.0`.
|
|
|
|
**Reasoning**:
|
|
- At `temperature > 0`, LLM outputs are non-deterministic. Two semantically similar requests may produce different outputs.
|
|
- Caching a `temperature=0.7` response and returning it for a semantically similar query is misleading — the user expects randomness.
|
|
- At `temperature=0.0`, outputs are deterministic (within provider guarantees), so semantic matching is safe.
|
|
|
|
**Implementation**: The gateway checks `temperature` before calling `semantic_search()`. The cache itself does not enforce this — it's a policy decision made by the caller.
|
|
|
|
---
|
|
|
|
## 10. Serialization Design
|
|
|
|
### 10.1 LLMResponse Serialization
|
|
|
|
`LLMResponse` contains `content: str`, `model: str`, `usage: TokenUsage`, `tool_calls: list[ToolCall]`.
|
|
|
|
**For InMemoryLLMCache**: No serialization needed — store Python objects directly.
|
|
|
|
**For RedisLLMCache**: Serialize to JSON.
|
|
|
|
```python
|
|
def _serialize_response(response: LLMResponse) -> dict:
|
|
return {
|
|
"content": response.content,
|
|
"model": response.model,
|
|
"usage": {
|
|
"prompt_tokens": response.usage.prompt_tokens,
|
|
"completion_tokens": response.usage.completion_tokens,
|
|
},
|
|
"tool_calls": [
|
|
{"id": tc.id, "name": tc.name, "arguments": tc.arguments}
|
|
for tc in response.tool_calls
|
|
],
|
|
"latency_ms": response.latency_ms,
|
|
}
|
|
|
|
def _deserialize_response(data: dict) -> LLMResponse:
|
|
return LLMResponse(
|
|
content=data["content"],
|
|
model=data["model"],
|
|
usage=TokenUsage(**data["usage"]),
|
|
tool_calls=[ToolCall(**tc) for tc in data.get("tool_calls", [])],
|
|
latency_ms=data.get("latency_ms", 0.0),
|
|
)
|
|
```
|
|
|
|
### 10.2 Embedding Serialization
|
|
|
|
Embeddings are `list[float]` with 1536 dimensions. JSON serialization produces ~12KB per embedding.
|
|
|
|
**Alternative**: Binary serialization (struct.pack) would reduce to ~6KB but adds complexity. JSON is sufficient for now.
|
|
|
|
---
|
|
|
|
## 11. Edge Cases & Failure Modes
|
|
|
|
| Edge Case | Behavior | Rationale |
|
|
|-----------|----------|-----------|
|
|
| Response with `tool_calls` | Cached normally | Tool call responses are deterministic at temperature=0 |
|
|
| Empty response (`content=""`) | Cached normally | Empty responses are valid (e.g., tool-only responses) |
|
|
| Very large response (>100KB) | Cached, but counted as single entry | Size-based eviction deferred; entry-count is sufficient |
|
|
| Concurrent `put()` for same key | Last write wins | No data corruption risk; both writes are valid responses |
|
|
| Redis `SET` fails | Log warning, cache miss on next read | Fail open, never block LLM calls |
|
|
| Embedding API fails during `put()` | Store response without embedding | Exact-match still works; semantic match degraded |
|
|
| Embedding API fails during `semantic_search()` | Return cache miss | Don't block on embedding failures |
|
|
| `invalidate()` while `get()` in progress | Possible stale read | Acceptable for cache; eventual consistency |
|
|
|
|
---
|
|
|
|
## 12. Test Strategy
|
|
|
|
### 12.1 Unit Tests (`tests/unit/llm/test_cache.py`)
|
|
|
|
Using `pytest` + `pytest-asyncio`:
|
|
|
|
1. **test_exact_match_hit**: Same key → cache hit, `match_type="exact"`
|
|
2. **test_exact_match_miss**: Different key → cache miss
|
|
3. **test_semantic_match_hit**: Paraphrased query with similarity > 0.92 → hit, `match_type="semantic"`
|
|
4. **test_semantic_match_miss**: Unrelated query with similarity < 0.6 → miss
|
|
5. **test_semantic_match_boundary**: Similarity exactly at threshold → hit
|
|
6. **test_ttl_expiry_exact**: Entry expires after exact_ttl → miss
|
|
7. **test_ttl_expiry_semantic**: Entry expires after semantic_ttl → miss
|
|
8. **test_lru_eviction**: Add max_entries + 1 → oldest evicted
|
|
9. **test_invalidate_all**: `invalidate()` clears all entries
|
|
10. **test_invalidate_pattern**: `invalidate("prefix:*")` clears matching entries
|
|
11. **test_cache_stats**: `stats()` returns correct counts
|
|
12. **test_tool_calls_cached**: Response with tool_calls cached and restored correctly
|
|
13. **test_concurrent_puts**: Two concurrent puts for same key → no error
|
|
14. **test_redis_fallback**: Redis import fails → InMemoryLLMCache returned
|
|
15. **test_cache_key_deterministic**: Same inputs → same key
|
|
16. **test_cache_key_different_model**: Different model → different key
|
|
17. **test_cache_key_different_temperature**: Different temperature → different key
|
|
|
|
### 12.2 Mock Embedder for Testing
|
|
|
|
Use `MockEmbedder` from `src/agentkit/memory/embedder.py`. Since `MockEmbedder` generates deterministic embeddings based on text hash, semantically similar text will produce similar embeddings (same hash prefix → similar vector). This is sufficient for testing the similarity threshold logic.
|
|
|
|
**Limitation**: `MockEmbedder` doesn't produce truly semantically meaningful embeddings. For testing semantic matching behavior, we'll manually construct embeddings with known cosine similarities.
|
|
|
|
```python
|
|
def _make_embedding(base: list[float], noise: float = 0.0) -> list[float]:
|
|
"""Create a unit vector with optional noise for similarity testing."""
|
|
vec = [x + noise for x in base]
|
|
magnitude = sum(x**2 for x in vec) ** 0.5
|
|
return [x / magnitude for x in vec] if magnitude > 0 else vec
|
|
```
|
|
|
|
---
|
|
|
|
## 13. Dependency Analysis
|
|
|
|
### 13.1 Internal Dependencies
|
|
|
|
| Dependency | Usage | Risk |
|
|
|-----------|-------|------|
|
|
| `agentkit.llm.protocol.LLMResponse` | Cache entry data type | Stable, no change needed |
|
|
| `agentkit.llm.protocol.TokenUsage` | Part of LLMResponse | Stable |
|
|
| `agentkit.llm.protocol.ToolCall` | Part of LLMResponse | Stable |
|
|
| `agentkit.memory.embedder.Embedder` | Embedding computation for semantic match | Injected, not imported directly |
|
|
| `agentkit.utils.vector_math.compute_cosine_similarity` | Similarity computation | Stable utility |
|
|
|
|
### 13.2 External Dependencies
|
|
|
|
| Dependency | Usage | Required? |
|
|
|-----------|-------|-----------|
|
|
| `redis.asyncio` | RedisLLMCache backend | Optional (only for "redis" backend) |
|
|
| `numpy` | Fast cosine similarity | Optional (pure-python fallback exists) |
|
|
|
|
---
|
|
|
|
## 14. Implementation Sequence
|
|
|
|
Within U1, the implementation order is:
|
|
|
|
1. **`cache_key.py`** — No dependencies, pure functions, easy to test
|
|
2. **`cache.py`** — `CacheResult`, `CacheEntry`, `LLMCache` Protocol, `InMemoryLLMCache`
|
|
3. **`cache.py`** — `RedisLLMCache`, `create_llm_cache()` factory
|
|
4. **`test_cache.py`** — All unit tests
|
|
|
|
This order allows incremental testing: cache_key tests first, then InMemoryLLMCache tests, then RedisLLMCache tests.
|
|
|
|
---
|
|
|
|
## 15. Open Design Questions
|
|
|
|
1. **Should `semantic_search()` return the best match or all matches above threshold?**
|
|
- **Current decision**: Best match only. The gateway needs one response, not a ranked list. If we need ranked results later, we can add a `search()` method.
|
|
|
|
2. **Should the cache store the original `messages` alongside the response?**
|
|
- **Current decision**: No. The key already deterministically represents the messages. Storing them again wastes memory. If we need message-level debugging, we can add it later.
|
|
|
|
3. **Should `RedisLLMCache` use Redis Hash instead of individual keys?**
|
|
- **Current decision**: Individual keys with SET index. Hash would allow `HGETALL` for all entries, but makes TTL per-entry impossible (Redis Hash fields don't support individual TTLs). Individual keys with a SET index is the standard pattern.
|
|
|
|
4. **What embedding model to use for semantic cache?**
|
|
- **Decision**: Default to `bge-m3` (BAAI/bge-m3 via Xinference or TEI endpoint) for Chinese+English mixed text. `bge-m3` supports:
|
|
- Multi-lingual (102 languages, strong Chinese)
|
|
- Multi-granularity (dense + sparse + ColBERT)
|
|
- Multi-function (retrieval + classification + similarity)
|
|
- 1024-dim dense vectors (vs. 1536 for OpenAI)
|
|
- Fallback to `text-embedding-3-small` when only OpenAI API is available.
|
|
- The embedder is injected via constructor, so the model choice is a configuration concern, not a code concern.
|
|
- **Config example**:
|
|
```yaml
|
|
llm:
|
|
cache:
|
|
embedding:
|
|
provider: "xinference" # "xinference" | "openai" | "local"
|
|
model: "bge-m3" # model name at provider
|
|
base_url: "http://localhost:9997/v1"
|
|
```
|
|
|
|
---
|
|
|
|
## 16. Argumentation Summary
|
|
|
|
| Design Choice | Alternatives Considered | Why This Choice |
|
|
|--------------|------------------------|----------------|
|
|
| SHA-256 hash key | UUID, MD5, composite string key | SHA-256 is collision-resistant, deterministic, fixed-length; MD5 has known collisions; UUID is non-deterministic |
|
|
| OrderedDict LRU | heapq, custom doubly-linked-list | OrderedDict is Python-idiomatic, O(1) access+eviction, matches EmbeddingCache pattern |
|
|
| Separate `get()` + `semantic_search()` | Combined `get()` with auto-semantic | Explicit control avoids unnecessary embedding computation; caller decides when to attempt semantic match |
|
|
| Redis SET index for semantic search | KEYS pattern scan, Redis Hash | KEYS blocks Redis; Hash doesn't support per-field TTL; SET index is standard pattern |
|
|
| Fail-open on Redis error | Raise exception, return None | Cache is optimization, not correctness; failing open ensures LLM calls always work |
|
|
| Temperature gate for semantic match | Always attempt semantic match | temperature>0 outputs are non-deterministic; semantic match would return misleading cached responses |
|
|
| JSON serialization for Redis | MessagePack, Pickle, Protobuf | JSON is human-readable, debuggable, no extra dependencies; sufficient for <10KB entries |
|
|
| bge-m3 default embedding | text-embedding-3-small, multilingual-e5 | bge-m3 is SOTA for Chinese+English mixed text; 1024-dim saves 33% memory vs OpenAI 1536-dim; OpenAI-compatible API via Xinference/TEI |
|