U2 Architecture Design: LLM Cache Integration

Status: APPROVED (design follows the U1 architecture; minimal integration surface)
Date: 2026-06-14
Unit: U2 of P0 Production Hardening Plan

1. Design Goals

Transparent injection: Cache check happens inside LLMGateway.chat() without changing the public API
Usage tracking on cache hits: Cached requests record 0 cost to maintain usage query integrity
Opt-in: Cache is disabled by default and enabled only through explicit configuration
Stream exclusion: chat_stream() is NOT cached in this iteration

2. Integration Point Analysis

Current `LLMGateway.chat()` flow (gateway.py:34-121):

1. _resolve_model_alias(model)                    → resolved_model
2. Check providers exist
3. Start telemetry span
4. _get_models_to_try(resolved_model)             → models_to_try
5. For each model:
   a. _resolve_model(model_name)                  → (provider, actual_model)
   b. Build LLMRequest
   c. provider.chat(req)                          → response
   d. Break on success
6. Calculate cost
7. Record usage
8. Record telemetry
9. Return response

Cache insertion points:

Cache CHECK (before step 5): After alias resolution and fallback-list construction (steps 1-4), before any provider call.

Reason: All request normalization (alias resolution, model fallback list) has completed.
The resolved_model is known at this point, so the cache key is deterministic. The actual_model is not resolved until step 5a and is deliberately not part of the key.

Cache WRITE (after step 5d): After successful response, before usage tracking.

Reason: Response is validated (no exception thrown). Usage tracking needs to happen regardless of cache hit/miss.

Cache HIT usage tracking (steps 6-7): On a cache hit, record usage with cost=0.

3. Modified Flow

async def chat(self, messages, model, agent_name="", task_type="", tools=None, tool_choice="auto", **kwargs):
    resolved_model = self._resolve_model_alias(model)

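    # ... provider check, telemetry span setup (records start = time.monotonic()) ...
    # ... models_to_try = self._get_models_to_try(resolved_model) ...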

    # ── Cache check (NEW) ──
    cache_key = None
    query_embedding = None
    if self._cache is not None:
        from agentkit.llm.cache_key import generate_cache_key
        cache_key = generate_cache_key(
            model=resolved_model,
            messages=messages,
            temperature=kwargs.get("temperature", 0.7),
            tools=tools,
            tool_choice=tool_choice,
            max_tokens=kwargs.get("max_tokens", 2000),
        )
        result = await self._cache.get(cache_key)
        if result.hit:
            # Record usage with 0 cost
            latency_ms = (time.monotonic() - start) * 1000
            self._usage_tracker.record(
                agent_name=agent_name,
                model=result.response.model,
                usage=result.response.usage,
                cost=0.0,
                latency_ms=latency_ms,
            )
            return result.response

        # Semantic match (only for temperature == 0)
        temperature = kwargs.get("temperature", 0.7)
        if temperature == 0 and self._embedder is not None:
            try:
                last_user_msg = next(
                    (m["content"] for m in reversed(messages) if m.get("role") == "user"),
                    "",
                )
                if last_user_msg:
                    query_embedding = await self._embedder.embed(last_user_msg)
                    result = await self._cache.semantic_search(query_embedding)
                    if result.hit:
                        latency_ms = (time.monotonic() - start) * 1000
                        self._usage_tracker.record(
                            agent_name=agent_name,
                            model=result.response.model,
                            usage=result.response.usage,
                            cost=0.0,
                            latency_ms=latency_ms,
                        )
                        return result.response
            except Exception as e:
                logger.warning(f"Semantic cache search failed: {e}")

    # ── Normal provider call ──
    for model_name in models_to_try:
        ...  # existing fallback loop; sets `response` on the first success

    # ── Cache write (NEW) ──
    if self._cache is not None and cache_key is not None:
        try:
            await self._cache.put(cache_key, response, query_embedding)
        except Exception as e:
            logger.warning(f"Cache write failed: {e}")

    # ... existing usage tracking, telemetry ...
    return response
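
The flow above imports generate_cache_key from agentkit.llm.cache_key without showing its body. The following is a minimal sketch of one way such a function could work, assuming the key is a SHA-256 hash over a canonical JSON encoding of the request. The "llm:exact:" prefix and the float normalization of temperature are illustrative assumptions, not confirmed behavior of the real module.

```python
import hashlib
import json
from typing import Any, Optional


def generate_cache_key(
    model: str,
    messages: list[dict[str, Any]],
    temperature: float = 0.7,
    tools: Optional[list[dict[str, Any]]] = None,
    tool_choice: Any = "auto",
    max_tokens: int = 2000,
) -> str:
    """Hypothetical sketch: hash a canonical JSON form of the request."""
    payload = {
        "model": model,
        "messages": messages,
        # Normalize so that temperature=0 and temperature=0.0 give the same key.
        "temperature": float(temperature),
        "tools": tools or [],
        "tool_choice": tool_choice,
        "max_tokens": max_tokens,
    }
    # sort_keys removes any dependence on dict insertion order;
    # compact separators remove whitespace differences.
    canonical = json.dumps(
        payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return "llm:exact:" + digest  # assumed key prefix
```

Without the float() normalization, json.dumps would encode 0 as "0" and 0.0 as "0.0", and two otherwise identical requests would miss each other in the cache.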

4. CacheConfig Design

@dataclass
class CacheConfig:
    """LLM Cache configuration."""
    enabled: bool = False
    backend: str = "auto"           # "auto" | "redis" | "memory"
    redis_url: str = "redis://localhost:6379"
    exact_ttl: int = 3600
    semantic_ttl: int = 86400
    similarity_threshold: float = 0.92
    max_entries: int = 10000
    # Embedding config for semantic cache
    embedding_provider: str = "openai"   # "openai" | "xinference" | "local"
    embedding_model: str = "bge-m3"      # model name at provider
    embedding_base_url: str | None = None
    embedding_api_key: str | None = None

Nesting: CacheConfig is nested under LLMConfig.cache.

@dataclass
class LLMConfig:
    providers: dict[str, ProviderConfig] = field(default_factory=dict)
    model_aliases: dict[str, str] = field(default_factory=dict)
    fallbacks: dict[str, list[str]] = field(default_factory=dict)
    cache: CacheConfig | None = None  # NEW
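
The test strategy refers to LLMConfig.from_dict({"cache": {"enabled": True}}), but the loader itself is not shown. Below is a minimal sketch of how the nested section might be parsed, using trimmed stand-ins for both dataclasses. The rejection of unknown keys is an assumption added for illustration; the real loader may behave differently.

```python
from dataclasses import dataclass, field, fields
from typing import Any, Optional


@dataclass
class CacheConfig:
    # Trimmed copy of the fields defined above, for illustration only.
    enabled: bool = False
    backend: str = "auto"
    exact_ttl: int = 3600
    similarity_threshold: float = 0.92


@dataclass
class LLMConfig:
    model_aliases: dict[str, str] = field(default_factory=dict)
    cache: Optional[CacheConfig] = None

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "LLMConfig":
        """Assumed loader: build CacheConfig only if a 'cache' section exists."""
        cache = None
        cache_data = data.get("cache")
        if cache_data is not None:
            known = {f.name for f in fields(CacheConfig)}
            unknown = set(cache_data) - known
            if unknown:
                # Fail loudly on typos such as "enabeld" (illustrative choice).
                raise ValueError(f"Unknown cache options: {sorted(unknown)}")
            cache = CacheConfig(**cache_data)
        return cls(model_aliases=dict(data.get("model_aliases", {})), cache=cache)
```

With this shape, an absent "cache" section leaves LLMConfig.cache as None, which matches the opt-in default described in the design goals.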

5. LLMGateway Constructor Change

class LLMGateway:
    def __init__(self, config: LLMConfig | None = None):
        self._providers: dict[str, LLMProvider] = {}
        self._usage_tracker = UsageTracker()
        self._config = config or LLMConfig()

        # Cache (NEW)
        self._cache: LLMCache | None = None
        self._embedder: Embedder | None = None
        if self._config.cache and self._config.cache.enabled:
            from agentkit.llm.cache import create_llm_cache
            self._cache = create_llm_cache(
                backend=self._config.cache.backend,
                redis_url=self._config.cache.redis_url,
                max_entries=self._config.cache.max_entries,
                exact_ttl=self._config.cache.exact_ttl,
                semantic_ttl=self._config.cache.semantic_ttl,
                similarity_threshold=self._config.cache.similarity_threshold,
            )
            # Embedder for semantic cache
            self._embedder = self._create_embedder(self._config.cache)

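Design Decision: Cache and embedder are created in __init__, not lazily, so cache configuration errors surface at startup rather than at the first request. Embedder creation is the exception: _create_embedder logs a warning and returns None on failure, which disables semantic matching without blocking startup.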
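
To make the max_entries and exact_ttl parameters concrete, here is a minimal sketch of what a "memory" backend could look like, assuming LRU eviction and a per-entry expiry. The class name MemoryExactCache, the synchronous API, and the injectable clock are all hypothetical; the real backend returned by create_llm_cache is async and returns a result object with a hit flag.

```python
import time
from collections import OrderedDict
from typing import Any, Callable, Optional


class MemoryExactCache:
    """Hypothetical in-memory backend: LRU eviction plus per-entry TTL."""

    def __init__(
        self,
        max_entries: int = 10000,
        exact_ttl: int = 3600,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._max_entries = max_entries
        self._ttl = exact_ttl
        self._clock = clock  # injectable so tests can control time
        self._entries: "OrderedDict[str, tuple]" = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            # Expired entries are dropped lazily on read.
            del self._entries[key]
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key: str, value: Any) -> None:
        self._entries[key] = (self._clock() + self._ttl, value)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used
```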

6. Embedder Factory Method

def _create_embedder(self, cache_config: CacheConfig) -> Embedder | None:
    """Create embedder for semantic cache based on config."""
    try:
        if cache_config.embedding_provider == "openai":
            from agentkit.memory.embedder import OpenAIEmbedder
            return OpenAIEmbedder(
                api_key=cache_config.embedding_api_key,
                model=cache_config.embedding_model,
                base_url=cache_config.embedding_base_url,
            )
        elif cache_config.embedding_provider in ("xinference", "local"):
            # Xinference/TEI uses OpenAI-compatible API
            from agentkit.memory.embedder import OpenAIEmbedder
            return OpenAIEmbedder(
                api_key=cache_config.embedding_api_key or "not-needed",
                model=cache_config.embedding_model,
                base_url=cache_config.embedding_base_url or "http://localhost:9997/v1",
            )
    except Exception as e:
        logger.warning(f"Failed to create embedder for semantic cache: {e}")
    return None

Design Decision: Use OpenAIEmbedder for all providers since Xinference and TEI expose OpenAI-compatible /embeddings endpoints. No need for a separate XinferenceEmbedder class.
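
The embedding produced by the embedder is what semantic_search compares against stored entries, with similarity_threshold (default 0.92) deciding whether a match counts as a hit. The sketch below assumes cosine similarity and a linear scan over stored entries; both are assumptions for illustration, and the function names are hypothetical.

```python
import math
from typing import Any, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embedding dimensions differ")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(
    query: Sequence[float],
    entries: Sequence[tuple],  # (embedding, response) pairs
    similarity_threshold: float = 0.92,
) -> Optional[Any]:
    """Return the best-matching response, or None if below the threshold."""
    best_score, best_response = -1.0, None
    for embedding, response in entries:
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= similarity_threshold else None
```

A linear scan is only reasonable for small caches; a Redis backend would more plausibly delegate this to a vector index.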

7. Stream Handling

chat_stream() is NOT cached in this iteration. Document as known limitation.

Reasoning:

Streaming requires collecting all chunks before caching, adding latency
Chunk collection adds complexity (error handling mid-stream, partial responses)
Most cacheable requests (temperature=0, simple queries) don't need streaming
Streaming is typically used for long-form generation where caching is less beneficial

8. Edge Cases

| Edge Case | Behavior |
|---|---|
| Cache disabled (default) | No cache check, no performance impact |
| Cache enabled, first request | Cache miss, provider called, response cached |
| Cache hit with tool_calls | Return cached response including tool_calls |
| Embedder fails during semantic search | Log warning, return miss, proceed to provider |
| Cache write fails | Log warning, response still returned to caller |
| Fallback model used | Cache key uses `resolved_model`, not `actual_model`; same query hits cache regardless of which fallback responded |

Fallback model cache key issue: When model A fails and fallback model B responds, the cache key is based on resolved_model (the alias), not actual_model (B). This means a subsequent request for the same alias will get a cache hit even if model A is back online. This is correct behavior — the user asked for the alias, not a specific model.

However, if the user explicitly specifies model B (not an alias), the cache key will be different. This is also correct — different model = different cache entry.

9. Test Strategy

Integration Tests (`tests/unit/test_gateway_cache.py`)

test_cache_disabled: Requests pass through to provider normally
test_cache_enabled_first_request: Cache miss, provider called, response cached
test_cache_enabled_second_request: Cache hit, provider NOT called
test_cache_hit_usage_tracking: Usage record has 0 cost, correct token counts
test_cache_miss_fallback: Primary model fails, fallback response cached
test_config_from_dict: LLMConfig.from_dict({"cache": {"enabled": True}}) works
test_semantic_cache_hit: temperature=0, semantically similar query hits cache
test_semantic_cache_skipped_for_nonzero_temp: temperature>0 skips semantic search
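
As an illustration of the shape these tests could take, the sketch below exercises the test_cache_enabled_second_request and test_cache_hit_usage_tracking scenarios against a stand-in gateway. TinyGateway, CountingProvider, and the 0.01 placeholder cost are hypothetical test doubles, not the real LLMGateway or UsageTracker; the actual tests would construct LLMGateway with a CacheConfig and a mocked provider.

```python
import asyncio
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeResponse:
    model: str
    content: str


class CountingProvider:
    """Test double that counts how often the provider is actually called."""

    def __init__(self) -> None:
        self.calls = 0

    async def chat(self, prompt: str) -> FakeResponse:
        self.calls += 1
        return FakeResponse(model="fake-model", content=f"echo: {prompt}")


class TinyGateway:
    """Stand-in for LLMGateway: exact cache check, provider call, cache write."""

    def __init__(self, provider: CountingProvider, cache_enabled: bool) -> None:
        self._provider = provider
        self._cache: Optional[dict] = {} if cache_enabled else None
        self.costs: list = []  # one recorded cost per request

    async def chat(self, prompt: str) -> FakeResponse:
        if self._cache is not None and prompt in self._cache:
            self.costs.append(0.0)  # a hit is still recorded, at zero cost
            return self._cache[prompt]
        response = await self._provider.chat(prompt)
        if self._cache is not None:
            self._cache[prompt] = response
        self.costs.append(0.01)  # arbitrary placeholder cost for a provider call
        return response


async def run_second_request_scenario(cache_enabled: bool):
    """Send the same request twice and report what happened."""
    provider = CountingProvider()
    gateway = TinyGateway(provider, cache_enabled)
    first = await gateway.chat("hello")
    second = await gateway.chat("hello")
    return provider.calls, gateway.costs, first == second
```

The scenario can be driven with asyncio.run(run_second_request_scenario(True)); with the cache enabled the provider should be called once, and with it disabled twice.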

10. Argumentation Summary

| Design Choice | Alternatives Considered | Why This Choice |
|---|---|---|
| Cache check after LLMRequest construction | Before construction | Request normalization must complete first; key depends on resolved model |
| Cache write before usage tracking | After usage tracking | Response must be cached before tracking so cache-hit tracking uses same response |
| OpenAIEmbedder for all providers | Separate XinferenceEmbedder | Xinference/TEI use OpenAI-compatible API; no need for separate class |
| No stream caching | Collect chunks then cache | Adds latency and complexity; most cacheable requests don't need streaming |
| Cache key uses resolved_model alias | Uses actual_model | User requests alias, not specific model; cache should be model-agnostic within alias |
