U2 Architecture Design: LLM Cache Integration

Status: APPROVED (design follows the U1 architecture; minimal integration surface)
Date: 2026-06-14
Unit: U2 of P0 Production Hardening Plan


1. Design Goals

  1. Transparent injection: Cache check happens inside LLMGateway.chat() without changing the public API
  2. Usage tracking on cache hits: Cache hits are recorded with 0 cost so that usage queries still count every request
  3. Opt-in: The cache is disabled by default and must be enabled explicitly in configuration
  4. Stream exclusion: chat_stream() is NOT cached in this iteration

2. Integration Point Analysis

Current LLMGateway.chat() flow (gateway.py:34-121):

1. _resolve_model_alias(model)                    → resolved_model
2. Check providers exist
3. Start telemetry span
4. _get_models_to_try(resolved_model)             → models_to_try
5. For each model:
   a. _resolve_model(model_name)                  → (provider, actual_model)
   b. Build LLMRequest
   c. provider.chat(req)                          → response
   d. Break on success
6. Calculate cost
7. Record usage
8. Record telemetry
9. Return response

Cache insertion points:

Cache CHECK (before step 5): After the model alias is resolved and the fallback list is built, before any provider call.

  • Reason: All request normalization (alias resolution, model fallback list) has completed.
  • The resolved_model is known at this point, so the cache key is deterministic. The actual_model is only resolved inside the loop (step 5a) and is not part of the key.

Cache WRITE (after step 5d): After successful response, before usage tracking.

  • Reason: Response is validated (no exception thrown). Usage tracking needs to happen regardless of cache hit/miss.

Cache HIT usage tracking (steps 6-7): On a cache hit, skip cost calculation, record usage with cost=0, and return the cached response.


3. Modified Flow

async def chat(self, messages, model, agent_name="", task_type="", tools=None, tool_choice="auto", **kwargs):
    resolved_model = self._resolve_model_alias(model)

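    # ... provider check, telemetry span setup, models_to_try ...
    # `start` (a time.monotonic() timestamp) is assumed to be set in the elided setup above.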

    # ── Cache check (NEW) ──
    cache_key = None
    query_embedding = None
    if self._cache is not None:
        from agentkit.llm.cache_key import generate_cache_key
        # Read temperature once; it feeds both the key and the semantic-match gate.
        temperature = kwargs.get("temperature", 0.7)
        cache_key = generate_cache_key(
            model=resolved_model,
            messages=messages,
            temperature=temperature,
            tools=tools,
            tool_choice=tool_choice,
            max_tokens=kwargs.get("max_tokens", 2000),
        )
        result = await self._cache.get(cache_key)
        if result.hit:
            # Exact hit: record usage with 0 cost and skip the provider
            latency_ms = (time.monotonic() - start) * 1000
            self._usage_tracker.record(
                agent_name=agent_name,
                model=result.response.model,
                usage=result.response.usage,
                cost=0.0,
                latency_ms=latency_ms,
            )
            return result.response

        # Semantic match (only for temperature == 0)
        if temperature == 0 and self._embedder is not None:
            try:
                # Embed only the most recent user message
                last_user_msg = next(
                    (m["content"] for m in reversed(messages) if m.get("role") == "user"),
                    "",
                )
                if last_user_msg:
                    query_embedding = await self._embedder.embed(last_user_msg)
                    result = await self._cache.semantic_search(query_embedding)
                    if result.hit:
                        latency_ms = (time.monotonic() - start) * 1000
                        self._usage_tracker.record(
                            agent_name=agent_name,
                            model=result.response.model,
                            usage=result.response.usage,
                            cost=0.0,
                            latency_ms=latency_ms,
                        )
                        return result.response
            except Exception as e:
                # A failed semantic lookup is treated as a miss
                logger.warning(f"Semantic cache search failed: {e}")
    # ── Normal provider call ──
    for model_name in models_to_try:
        # ... existing fallback loop (sets `response` on success) ...
        ...

    # ── Cache write (NEW) ──
    if self._cache is not None and cache_key is not None:
        try:
            await self._cache.put(cache_key, response, query_embedding)
        except Exception as e:
            logger.warning(f"Cache write failed: {e}")

    # ... existing usage tracking, telemetry ...
    return response
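
The flow above can be exercised end to end with a toy stand-in. This is a sketch under stated assumptions: ToyGateway, Response, the simplified string key, and the made-up cost value are all hypothetical and only mirror the order of operations (check, provider call, write, usage tracking), not the real LLMGateway.

```python
import asyncio
import time
from dataclasses import dataclass, field


@dataclass
class Response:
    model: str
    content: str
    usage: dict[str, int]


@dataclass
class ToyGateway:
    """Hypothetical stand-in for LLMGateway: exact-match cache only."""
    cache: dict[str, Response] = field(default_factory=dict)
    usage_log: list[dict] = field(default_factory=list)
    provider_calls: int = 0

    async def _call_provider(self, model: str, prompt: str) -> Response:
        self.provider_calls += 1
        return Response(model=model, content=f"echo: {prompt}",
                        usage={"prompt_tokens": 3, "completion_tokens": 5})

    def _record(self, response: Response, cost: float, start: float) -> None:
        self.usage_log.append({
            "model": response.model,
            "usage": response.usage,
            "cost": cost,
            "latency_ms": (time.monotonic() - start) * 1000,
        })

    async def chat(self, prompt: str, model: str = "toy-model") -> Response:
        start = time.monotonic()
        key = f"{model}:{prompt}"  # simplified key, for illustration only
        cached = self.cache.get(key)
        if cached is not None:
            # Cache hit: keep the token counts, record zero cost, skip the provider.
            self._record(cached, cost=0.0, start=start)
            return cached
        response = await self._call_provider(model, prompt)
        self.cache[key] = response                        # write before usage tracking
        self._record(response, cost=0.001, start=start)   # made-up cost value
        return response


async def demo() -> ToyGateway:
    gw = ToyGateway()
    await gw.chat("hello")  # miss: provider is called
    await gw.chat("hello")  # hit: provider is skipped
    return gw


gateway = asyncio.run(demo())
```

After the two calls, the provider has been invoked once while the usage log holds two records, the second with cost 0.0 and the same token counts as the first.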

4. CacheConfig Design

from dataclasses import dataclass, field

@dataclass
class CacheConfig:
    """LLM cache configuration."""
    enabled: bool = False
    backend: str = "auto"           # "auto" | "redis" | "memory"
    redis_url: str = "redis://localhost:6379"
    exact_ttl: int = 3600           # seconds (1 hour)
    semantic_ttl: int = 86400       # seconds (24 hours)
    similarity_threshold: float = 0.92
    max_entries: int = 10000
    # Embedding config for semantic cache
    embedding_provider: str = "openai"   # "openai" | "xinference" | "local"
    embedding_model: str = "bge-m3"      # model name at provider
    embedding_base_url: str | None = None
    embedding_api_key: str | None = None
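
One way to read similarity_threshold is as a cutoff on the similarity between the query embedding and a cached embedding. The sketch below assumes cosine similarity and an inclusive (>=) comparison; neither is confirmed by this document, and both helper names are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_semantic_hit(query: list[float], cached: list[float], threshold: float = 0.92) -> bool:
    """Assumed rule: a cached entry matches when similarity reaches the threshold."""
    return cosine_similarity(query, cached) >= threshold


near = is_semantic_hit([1.0, 0.0], [0.9, 0.1])  # similarity is about 0.994
far = is_semantic_hit([1.0, 0.0], [0.6, 0.8])   # similarity is exactly 0.6
```

Raising the threshold trades hit rate for a lower risk of returning an answer to a different question.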

Nesting: CacheConfig is nested under LLMConfig.cache.

@dataclass
class LLMConfig:
    providers: dict[str, ProviderConfig] = field(default_factory=dict)
    model_aliases: dict[str, str] = field(default_factory=dict)
    fallbacks: dict[str, list[str]] = field(default_factory=dict)
    cache: CacheConfig | None = None  # NEW

5. LLMGateway Constructor Change

class LLMGateway:
    def __init__(self, config: LLMConfig | None = None):
        self._providers: dict[str, LLMProvider] = {}
        self._usage_tracker = UsageTracker()
        self._config = config or LLMConfig()

        # Cache (NEW)
        self._cache: LLMCache | None = None
        self._embedder: Embedder | None = None
        if self._config.cache and self._config.cache.enabled:
            from agentkit.llm.cache import create_llm_cache
            self._cache = create_llm_cache(
                backend=self._config.cache.backend,
                redis_url=self._config.cache.redis_url,
                max_entries=self._config.cache.max_entries,
                exact_ttl=self._config.cache.exact_ttl,
                semantic_ttl=self._config.cache.semantic_ttl,
                similarity_threshold=self._config.cache.similarity_threshold,
            )
            # Embedder for semantic cache
            self._embedder = self._create_embedder(self._config.cache)

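Design Decision: Cache and embedder are created in __init__, not lazily, so cache configuration errors surface at startup rather than at the first request. Embedder creation is the exception: _create_embedder() logs a warning and returns None on failure, which disables semantic matching but leaves exact-match caching active.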


6. Embedder Factory Method

def _create_embedder(self, cache_config: CacheConfig) -> Embedder | None:
    """Create the embedder for the semantic cache; return None if unavailable."""
    provider = cache_config.embedding_provider
    try:
        # Xinference/TEI use an OpenAI-compatible API, so one class covers all providers.
        from agentkit.memory.embedder import OpenAIEmbedder

        if provider == "openai":
            return OpenAIEmbedder(
                api_key=cache_config.embedding_api_key,
                model=cache_config.embedding_model,
                base_url=cache_config.embedding_base_url,
            )
        if provider in ("xinference", "local"):
            return OpenAIEmbedder(
                api_key=cache_config.embedding_api_key or "not-needed",
                model=cache_config.embedding_model,
                base_url=cache_config.embedding_base_url or "http://localhost:9997/v1",
            )
        # Unrecognized provider: semantic matching stays disabled.
        logger.warning(f"Unknown embedding provider for semantic cache: {provider}")
    except Exception as e:
        logger.warning(f"Failed to create embedder for semantic cache: {e}")
    return None

Design Decision: Use OpenAIEmbedder for all providers since Xinference and TEI expose OpenAI-compatible /embeddings endpoints. No need for a separate XinferenceEmbedder class.
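
To illustrate why one client class can serve all three providers, here is a sketch of the request an OpenAI-compatible embeddings endpoint expects. build_embedding_request is a hypothetical helper, not part of OpenAIEmbedder, and the request is only constructed here, never sent.

```python
import json
import urllib.request


def build_embedding_request(
    base_url: str,
    model: str,
    text: str,
    api_key: str = "not-needed",
) -> urllib.request.Request:
    """Build (but do not send) a POST to an OpenAI-compatible /embeddings endpoint."""
    body = json.dumps({"model": model, "input": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


# Same helper, two backends: only base_url and api_key differ.
local_req = build_embedding_request("http://localhost:9997/v1", "bge-m3", "hello")
hosted_req = build_embedding_request("https://api.openai.com/v1/", "bge-m3", "hello", api_key="sk-placeholder")
```

In the OpenAI response format the vector is returned under data[0].embedding, so a compatible server needs no provider-specific parsing on the client side.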


7. Stream Handling

chat_stream() is NOT cached in this iteration. This is documented as a known limitation.

Reasoning:

  • Streaming requires collecting all chunks before caching, adding latency
  • Chunk collection adds complexity (error handling mid-stream, partial responses)
  • Most cacheable requests (temperature=0, simple queries) don't need streaming
  • Streaming is typically used for long-form generation where caching is less beneficial
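
For reference, a sketch of what chunk collection would involve if stream caching were added later. It is not part of this iteration; cache_after_stream and the dict-based cache are hypothetical, and a consumer that stops reading early is not handled.

```python
import asyncio
from collections.abc import AsyncIterator


async def cache_after_stream(
    chunks: AsyncIterator[str],
    cache: dict[str, str],
    key: str,
) -> AsyncIterator[str]:
    """Yield chunks unchanged; store the joined text only if the stream ends cleanly."""
    collected: list[str] = []
    async for chunk in chunks:
        collected.append(chunk)
        yield chunk
    # Reached only when the source is exhausted without raising.
    cache[key] = "".join(collected)


async def good_stream() -> AsyncIterator[str]:
    for part in ("Hel", "lo"):
        yield part


async def broken_stream() -> AsyncIterator[str]:
    yield "Hel"
    raise ConnectionError("provider dropped the stream")


async def demo() -> dict[str, str]:
    cache: dict[str, str] = {}
    async for _ in cache_after_stream(good_stream(), cache, "ok"):
        pass
    try:
        async for _ in cache_after_stream(broken_stream(), cache, "bad"):
            pass
    except ConnectionError:
        pass  # the partial response must not be cached
    return cache


stream_cache = asyncio.run(demo())
```

Even this small wrapper has to decide what to do with mid-stream errors and partial output, which is the complexity the list above refers to.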

8. Edge Cases

| Edge Case | Behavior |
|---|---|
| Cache disabled (default) | No cache check, no performance impact |
| Cache enabled, first request | Cache miss, provider called, response cached |
| Cache hit with tool_calls | Return cached response including tool_calls |
| Embedder fails during semantic search | Log warning, return miss, proceed to provider |
| Cache write fails | Log warning, response still returned to caller |
| Fallback model used | Cache key uses resolved_model, not actual_model; the same query hits the cache regardless of which fallback responded |

Fallback model cache key issue: When model A fails and fallback model B responds, the cache key is based on resolved_model (the model the alias resolved to, A), not actual_model (B). A subsequent request for the same alias therefore gets a cache hit even after model A is back online. This is the intended behavior: the user asked for the alias, not for a specific model.

However, if the user explicitly specifies model B (not an alias), the cache key will be different. This is also correct — different model = different cache entry.
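
The two cases can be made concrete with a small sketch. The alias table, model names, and key_for helper are hypothetical; the real key also covers messages, temperature, tools, and max_tokens.

```python
import hashlib

ALIASES = {"fast": "model-a"}  # hypothetical alias table


def resolve_alias(model: str) -> str:
    """Return the aliased model, or the name itself if it is not an alias."""
    return ALIASES.get(model, model)


def key_for(requested_model: str, prompt: str) -> str:
    """Simplified key: depends on the resolved model, never on the fallback used."""
    resolved = resolve_alias(requested_model)
    return hashlib.sha256(f"{resolved}\n{prompt}".encode("utf-8")).hexdigest()


cache: dict[str, str] = {}

# model-a is down, so model-b answers, but the entry is stored under model-a's key.
cache[key_for("fast", "ping")] = "answer produced by model-b"

alias_hit = key_for("fast", "ping") in cache        # same alias: hit
explicit_b_hit = key_for("model-b", "ping") in cache  # explicit model B: miss
```

A request that names model-b directly never sees the entry written on behalf of the alias, even though model-b produced it.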


9. Test Strategy

Integration Tests (tests/unit/test_gateway_cache.py)

  1. test_cache_disabled: Requests pass through to provider normally
  2. test_cache_enabled_first_request: Cache miss, provider called, response cached
  3. test_cache_enabled_second_request: Cache hit, provider NOT called
  4. test_cache_hit_usage_tracking: Usage record has 0 cost, correct token counts
  5. test_cache_miss_fallback: Primary model fails, fallback response cached
  6. test_config_from_dict: LLMConfig.from_dict({"cache": {"enabled": True}}) works
  7. test_semantic_cache_hit: temperature=0, semantically similar query hits cache
  8. test_semantic_cache_skipped_for_nonzero_temp: temperature>0 skips semantic search
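
A sketch of how test 3 might assert that the provider is not called on a hit, using AsyncMock from the standard library. CachingCaller is a hypothetical stand-in for the gateway's cache-then-provider path; the real test would construct LLMGateway with a cache-enabled LLMConfig.

```python
import asyncio
from unittest.mock import AsyncMock


class CachingCaller:
    """Hypothetical stand-in for the gateway's cache-then-provider path."""

    def __init__(self, provider) -> None:
        self._provider = provider
        self._cache: dict[str, str] = {}

    async def chat(self, prompt: str) -> str:
        if prompt in self._cache:
            return self._cache[prompt]
        response = await self._provider.chat(prompt)
        self._cache[prompt] = response
        return response


def test_cache_enabled_second_request() -> None:
    provider = AsyncMock()
    provider.chat.return_value = "pong"
    caller = CachingCaller(provider)

    async def scenario() -> tuple[str, str]:
        first = await caller.chat("ping")   # miss: goes to the provider
        second = await caller.chat("ping")  # hit: served from the cache
        return first, second

    first, second = asyncio.run(scenario())
    assert first == second == "pong"
    # Exactly one awaited call proves the second request never reached the provider.
    provider.chat.assert_awaited_once_with("ping")
```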

10. Argumentation Summary

| Design Choice | Alternatives Considered | Why This Choice |
|---|---|---|
| Cache check after alias resolution, before the provider loop | Before alias resolution | Request normalization must complete first; key depends on resolved model |
| Cache write before usage tracking | After usage tracking | The response is cached before tracking so that later cache hits are tracked against the same response |
| OpenAIEmbedder for all providers | Separate XinferenceEmbedder | Xinference/TEI use OpenAI-compatible API; no need for separate class |
| No stream caching | Collect chunks then cache | Adds latency and complexity; most cacheable requests don't need streaming |
| Cache key uses resolved_model | Uses actual_model | User requests alias, not specific model; cache should be model-agnostic within alias |