
title	status	created_at	type	origin	depth
feat: P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence	active	2026-06-14	feat	Industry research and project review (2026-06-14)	deep

P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence

Summary

Industry benchmarking and a project audit identified three P0 gaps: (1) LLM response caching, expected to reduce token cost by 30-50%; (2) embedding-based semantic routing to improve intent matching quality at zero LLM cost; (3) critical state persistence for UsageTracker, EvolutionStore, and CascadeDetector, so that state survives restarts and multi-instance deployment becomes possible. Each unit requires detailed architecture design and code reasoning before implementation: design first, code second.

Problem Frame

AgentKit has strong differentiation in self-evolution, quality management, and multi-paradigm engines, but three production-critical gaps prevent enterprise deployment:

Every LLM request hits the provider — no caching. Identical or similar requests waste tokens and money. Competitors like Dify have built-in caching.
Routing relies on keyword matching and LLM classification — no semantic understanding. Embedding-based routing is industry standard (Agentic RAG trend) and AgentKit already has embedding infrastructure but doesn't use it for routing.
Critical state lives in memory — UsageTracker, CascadeDetector, and EvolutionStore lose data on restart. Multi-instance deployment is impossible without shared state.

These gaps are P0 because they directly impact cost (caching), quality (routing accuracy), and reliability (state persistence) — the three pillars of production readiness.

Requirements

R1. LLM cache must support exact-match (hash-based) and semantic-match (embedding-based) cache hits
R2. LLM cache must integrate transparently into LLMGateway.chat() without changing the public API
R3. LLM cache must record usage on cache hits (0 cost) to maintain usage tracking integrity
R4. Semantic routing must be inserted between Layer 1 and Layer 2 in CostAwareRouter
R5. Semantic routing must use existing OpenAIEmbedder and compute_cosine_similarity() infrastructure
R6. Semantic routing must pre-compute skill embeddings at registration time, not at query time
R7. UsageTracker must persist records to Redis with O(1) write and efficient aggregation
R8. CascadeDetector must persist state to Redis using atomic INCR operations
R9. EvolutionStore must support both PostgreSQL and SQLite backends with a unified interface
R10. All three features must degrade gracefully when Redis/PG is unavailable (fallback to in-memory)
R11. Each unit must have detailed architecture design and code reasoning documented before implementation begins

Key Technical Decisions

KTD-1. Cache key design: Use SHA256(model + system_prompt_hash + messages_content_hash + temperature + tools_hash) for exact match. For semantic match, embed the last user message and compare against cached embeddings using cosine similarity > threshold. Rationale: exact match is fast and deterministic; semantic match catches paraphrased requests. Both are needed because exact match alone misses too many hits, and semantic match alone is too slow for every request.
KTD-2. Cache storage backend: Implement LLMCache as a Protocol with InMemoryLLMCache and RedisLLMCache backends. In-memory uses OrderedDict with LRU eviction (following EmbeddingCache pattern). Redis uses agentkit:llm_cache:{hash} keys with TTL. Rationale: follows existing factory pattern (create_message_bus, create_session_store); in-memory for dev/single-instance, Redis for production.
KTD-3. Semantic routing insertion point: Insert as Layer 1.5 between HeuristicClassifier and _classify_merged(). When Layer 1 returns medium complexity (0.3-0.7), try semantic routing first. If similarity > 0.85, return skill match directly (skip LLM). If similarity 0.6-0.85, pass skill hint to Layer 2 LLM (reduces LLM classification tokens). If < 0.6, proceed to Layer 2 unchanged. Rationale: this placement maximizes cost savings by avoiding LLM calls when semantic match is confident, while preserving the existing fallback chain.
KTD-4. Skill embedding source text: Embed f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}" for each skill. Cache embeddings in a dict keyed by skill name, re-embed on skill registration/update. Rationale: combines all semantic signals; description alone misses keyword intent; keywords alone misses semantic meaning.
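KTD-5. UsageTracker persistence strategy: Use one Redis Hash per date for time-series data. Key pattern: agentkit:usage:{date}, with one numeric field per metric, {agent}:{model}:{metric}, covering tokens, cost, latency_ms, and count. Write via HINCRBYFLOAT (HINCRBY for integer counters) for atomic increment; these commands operate on plain numeric field values, so the counters cannot be packed into a single JSON value. Query via HGETALL + client-side aggregation. Rationale: O(1) write, acceptable query performance, natural TTL by date, follows Redis patterns in project.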
KTD-6. CascadeDetector persistence strategy: Use Redis atomic operations. Key pattern: agentkit:cascade:{session_id}:interactions (INCR + TTL) and agentkit:cascade:{session_id}:depth (SET/GET + TTL). Rationale: INCR is atomic, no race conditions across instances; TTL prevents memory leaks; matches session lifecycle.
KTD-7. EvolutionStore interface unification: Extend the base EvolutionStore Protocol to include skill_version and ab_test methods. Make PersistentEvolutionStore (SQLite) implement the unified Protocol. Add a new PostgreSQLEvolutionStore that uses async SQLAlchemy like the existing EvolutionStore but with the full unified interface. Rationale: current split (sync SQLite vs async PG) creates maintenance burden; unified Protocol enables backend-agnostic usage.
KTD-8. Graceful degradation pattern: All three features use the same pattern — try preferred backend, catch connection error, log warning, fall back to in-memory. Controlled by cache.backend, usage_store.backend, cascade_store.backend config values ("auto" | "redis" | "memory"). "auto" tries Redis, falls back to memory. Rationale: production needs persistence, but dev/testing shouldn't require Redis.
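The fallback pattern in KTD-8 can be sketched as a small factory. This is a minimal illustration, not the project's implementation: `KeyValueStore`, `InMemoryStore`, `create_store`, and the `redis_factory` parameter are hypothetical names, and the builtin `ConnectionError` stands in for whatever error the real Redis client raises. Treating an explicit "redis" setting the same as "auto" on failure is an assumption derived from R10.

```python
import logging
from typing import Callable, Dict, Optional, Protocol

logger = logging.getLogger("agentkit.backend")


class KeyValueStore(Protocol):
    """Hypothetical minimal store interface, used only for this sketch."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Process-local fallback; its contents are lost on restart."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def create_store(
    backend: str, redis_factory: Callable[[], KeyValueStore]
) -> KeyValueStore:
    """Pick a backend for "auto" | "redis" | "memory" with in-memory fallback."""
    if backend == "memory":
        return InMemoryStore()
    if backend not in ("auto", "redis"):
        raise ValueError(f"unknown backend: {backend!r}")
    try:
        # redis_factory is expected to connect eagerly and raise on failure.
        return redis_factory()
    except ConnectionError as exc:
        # Assumption: a real Redis client raises its own connection error
        # type, which would be caught here instead.
        logger.warning("Redis unavailable (%s); falling back to memory", exc)
        return InMemoryStore()
```

Passing the Redis constructor in as a callable keeps the factory testable without a running Redis instance.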

High-Level Technical Design

LLM Cache Flow

flowchart TB
    A[LLMGateway.chat] --> B{Cache enabled?}
    B -->|no| F[Call Provider]
    B -->|yes| C[Generate exact key]
    C --> D{Exact match?}
    D -->|hit| E[Return cached response]
    D -->|miss| G[Generate embedding of last user msg]
    G --> H{Semantic match? similarity > 0.92}
    H -->|hit| E
    H -->|miss| F
    F --> I[Write to cache]
    I --> J[Record usage]
    E --> K[Record usage with 0 cost]
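The lookup order in the flowchart above (exact key, then semantic match, then provider) might look roughly like the following sketch. `CachedGateway` and its fields are hypothetical stand-ins rather than the real LLMGateway, the local `cosine_similarity` substitutes for `compute_cosine_similarity()`, and the embedder and provider are injected callables. It assumes semantic matching is attempted only when temperature is 0.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Local stand-in for compute_cosine_similarity()."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class CachedGateway:
    embed: Callable[[str], List[float]]
    call_provider: Callable[[str], Tuple[str, float]]  # -> (response, cost)
    threshold: float = 0.92
    # key -> (response, embedding or None when cached at temperature > 0)
    entries: Dict[str, Tuple[str, Optional[List[float]]]] = field(default_factory=dict)
    usage: List[dict] = field(default_factory=list)

    def chat(self, key: str, last_user_msg: str, temperature: float = 0.0) -> str:
        # 1. Exact match: no embedding call needed.
        hit = self.entries.get(key)
        if hit is not None:
            return self._record_hit(hit[0])
        # 2. Semantic match, only for deterministic requests.
        embedding: Optional[List[float]] = None
        if temperature == 0:
            embedding = self.embed(last_user_msg)
            scored = [
                (cosine_similarity(embedding, emb), resp)
                for resp, emb in self.entries.values()
                if emb is not None
            ]
            if scored:
                best_sim, best_resp = max(scored, key=lambda pair: pair[0])
                if best_sim > self.threshold:
                    return self._record_hit(best_resp)
        # 3. Miss: call the provider, write to cache, then record usage.
        response, cost = self.call_provider(last_user_msg)
        self.entries[key] = (response, embedding)
        self.usage.append({"cost": cost, "cached": False})
        return response

    def _record_hit(self, response: str) -> str:
        # Cache hits are still recorded, with zero cost.
        self.usage.append({"cost": 0.0, "cached": True})
        return response
```

The linear scan over cached embeddings mirrors the in-memory plan; it omits TTL and LRU eviction to keep the control flow visible.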

Semantic Routing Flow

flowchart TB
    A[CostAwareRouter.route] --> B[Layer 0: Regex rules]
    B -->|matched| Z[Return DIRECT_CHAT]
    B -->|unmatched| C[Layer 1: HeuristicClassifier]
    C -->|low complexity| Z
    C -->|medium-high| D[Layer 1.5: Semantic Router NEW]
    D -->|sim > 0.85| E[Return SKILL_REACT with matched skill]
    D -->|sim 0.6-0.85| F[Pass skill_hint to Layer 2]
    D -->|sim < 0.6| G[Layer 2: LLM classification]
    F --> G
    G --> H[Return routing result]

State Persistence Architecture

flowchart TB
    subgraph "Current (In-Memory)"
        UT1[UsageTracker dict]
        CD1[CascadeDetector dict]
        ES1[EvolutionStore SQLite]
    end
    subgraph "Target (Persistent)"
        UT2[UsageStore Protocol]
        CD2[CascadeStateStore Protocol]
        ES2[UnifiedEvolutionStore Protocol]
        UT2 -->|redis| R1[Redis Hash agentkit:usage:date]
        UT2 -->|memory| M1[InMemoryUsageStore]
        CD2 -->|redis| R2[Redis INCR agentkit:cascade:session]
        CD2 -->|memory| M2[InMemoryCascadeStore]
        ES2 -->|postgresql| P1[PG EvolutionEventModel + SkillVersionModel]
        ES2 -->|sqlite| S1[PersistentEvolutionStore]
        ES2 -->|memory| M3[InMemoryEvolutionStore]
    end

Scope Boundaries

In Scope

LLM response caching (exact + semantic match, in-memory + Redis backends)
Semantic routing as Layer 1.5 in CostAwareRouter
UsageTracker Redis persistence
CascadeDetector Redis persistence
EvolutionStore interface unification
Configuration for all three features
Architecture design documents for each unit before coding

Deferred for Follow-Up

Semantic cache using pgvector (current semantic match uses in-memory embedding comparison)
Cache warming / pre-population strategies
Routing cache (caching routing results for similar queries)
Usage analytics dashboard (visualization of usage data)
Multi-tenant resource quotas
Rate limiting and concurrency control (P2)
Distributed tracing visualization (P2)

Implementation Units

U1. LLM Cache Core

Goal: Implement the LLMCache Protocol, InMemoryLLMCache, and RedisLLMCache with exact-match and semantic-match capabilities.

Dependencies: None

Files:

src/agentkit/llm/cache.py (new) — LLMCache Protocol, InMemoryLLMCache, RedisLLMCache, CacheResult, CacheKey generation
src/agentkit/llm/cache_key.py (new) — generate_cache_key(), generate_messages_hash(), generate_system_prompt_hash()
tests/unit/llm/test_cache.py (new) — unit tests for cache backends

Approach:

Architecture design before coding:

CacheKey design reasoning: The key must capture all inputs that affect LLM output. model determines which model responds. system_prompt sets behavior. messages carry the conversation. temperature affects randomness (only temperature=0 responses are treated as deterministic). tools affect tool_call availability. Hash each component independently before combining them; a change to any component still produces a different final key, but the hashes of unchanged components can be reused.
Exact match implementation: SHA-256 hash of concatenated component hashes. Store as agentkit:llm_cache:{sha256_hex} in Redis with TTL. In-memory uses OrderedDict keyed by hash string.
Semantic match implementation: For cache misses on exact match, embed the last user message using OpenAIEmbedder. Compare against cached embeddings using compute_cosine_similarity(). Store embeddings alongside cached responses. In-memory: linear scan of all cached embeddings. Redis: store embeddings in a separate key agentkit:llm_cache_emb:{sha256_hex}.
Cache write policy: Exact-match entries are written at any temperature. Semantic-match lookup and storage apply only when temperature == 0 (deterministic); for temperature > 0, only the exact-match cache applies, since outputs are non-deterministic.
Cache invalidation: TTL-based (configurable, default 3600s for exact, 86400s for semantic). Manual invalidation via invalidate(pattern=None) for admin operations.
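The exact-match key described above can be sketched with the standard library alone. The signature of `generate_cache_key()` shown here is an assumption (the plan names the function but not its parameters), as are the `_sha256` and `_canonical` helpers and the separator byte used to join components.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _canonical(value: Any) -> str:
    # Sorted keys and fixed separators make the JSON form independent of
    # dict insertion order, so equal payloads hash identically.
    return json.dumps(value, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def generate_cache_key(
    model: str,
    system_prompt: str,
    messages: List[Dict[str, Any]],
    temperature: float,
    tools: Optional[List[Dict[str, Any]]] = None,
) -> str:
    """Build agentkit:llm_cache:{sha256_hex} from per-component hashes."""
    parts = [
        model,
        _sha256(system_prompt),
        _sha256(_canonical(messages)),
        repr(float(temperature)),  # 0 and 0.0 normalize to the same text
        _sha256(_canonical(tools or [])),
    ]
    # Assumption: a unit-separator byte joins the parts so that adjacent
    # components cannot run together ambiguously.
    return "agentkit:llm_cache:" + _sha256("\x1f".join(parts))
```

Canonical JSON matters here: without `sort_keys`, two logically identical message lists could produce different keys and silently lower the hit rate.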

Patterns to follow:

EmbeddingCache in src/agentkit/memory/embedder.py — LRU + TTL pattern
create_session_store() factory in src/agentkit/session/store.py — backend factory pattern
RedisSessionStore._get_redis() — lazy Redis initialization

Test scenarios:

Exact match: same messages + model → cache hit, returns identical response
Exact miss: different messages → cache miss, calls provider, writes to cache
Semantic match: paraphrased question (similarity > 0.92) → cache hit
Semantic miss: unrelated question (similarity < 0.6) → cache miss
Temperature > 0: only exact match attempted, no semantic match
TTL expiry: cached entry expires after TTL, next request is a miss
Redis unavailable: falls back to in-memory cache with warning log
Cache with tool_calls: response containing tool_calls is cached correctly
Concurrent access: two concurrent requests for same key don't cause double-write issues

Verification: Unit tests pass; cache hit rate metric is observable; no change to LLMGateway public API.

U2. LLM Cache Integration

Goal: Integrate LLMCache into LLMGateway.chat() transparently, with usage tracking on cache hits.

Dependencies: U1

Files:

src/agentkit/llm/gateway.py (modify) — inject cache check before provider call, cache write after provider response
src/agentkit/llm/config.py (modify) — add CacheConfig to LLMConfig
src/agentkit/server/app.py (modify) — pass cache config to LLMGateway
tests/unit/llm/test_gateway_cache.py (new) — integration tests for cached gateway

Approach:

Architecture design before coding:

Insertion point reasoning: Cache check must happen AFTER LLMRequest construction (line ~79 in gateway.py) but BEFORE provider call (line ~87). This ensures all request normalization (alias resolution, model fallback list) has completed. Cache write happens AFTER response validation but BEFORE usage tracking.
Cache hit usage tracking: On cache hit, call _usage_tracker.record() with the original usage data from the cached response but with cost=0 and latency_ms from cache lookup time. This preserves usage query integrity — get_usage() still shows all requests, just with zero cost for cached ones.
Stream handling: chat_stream() is NOT cached in this iteration. Streaming requires collecting all chunks before caching, which adds latency and complexity. Document this as a known limitation.
Configuration integration: Add CacheConfig dataclass with enabled: bool = False, backend: str = "auto", exact_ttl: int = 3600, semantic_ttl: int = 86400, similarity_threshold: float = 0.92, max_entries: int = 10000. Nest under LLMConfig.cache.
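A possible shape for the CacheConfig dataclass, using the field names and defaults listed above. The validation rules and the strict rejection of unknown keys in `from_dict()` are assumptions for illustration; the existing `LLMConfig.from_dict()` pattern may behave differently.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class CacheConfig:
    enabled: bool = False
    backend: str = "auto"  # "auto" | "redis" | "memory"
    exact_ttl: int = 3600
    semantic_ttl: int = 86400
    similarity_threshold: float = 0.92
    max_entries: int = 10000

    def __post_init__(self) -> None:
        # Assumed validation: fail fast on values the cache cannot honor.
        if self.backend not in ("auto", "redis", "memory"):
            raise ValueError(f"unknown cache backend: {self.backend!r}")
        if not 0.0 < self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be in (0, 1]")

    @classmethod
    def from_dict(cls, data: Optional[Mapping[str, Any]]) -> "CacheConfig":
        """Build a config from a parsed YAML mapping; None means defaults."""
        data = dict(data or {})
        known = {f.name for f in fields(cls)}
        unknown = set(data) - known
        if unknown:
            # Assumption: typos in config keys should surface as errors.
            raise ValueError(f"unknown cache options: {sorted(unknown)}")
        return cls(**data)
```

For example, `CacheConfig.from_dict({"enabled": True})` keeps every other field at its default, which matches the "Config from YAML" test scenario.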

Patterns to follow:

LLMConfig dataclass + from_dict() pattern for config
LLMGateway.__init__() dependency injection pattern

Test scenarios:

Cache disabled: requests pass through to provider normally
Cache enabled, first request: cache miss, provider called, response cached
Cache enabled, second identical request: cache hit, provider NOT called
Cache hit usage tracking: usage record has 0 cost, correct token counts
Cache miss + fallback: primary model fails, fallback model response cached under fallback model key
Config from YAML: LLMConfig.from_dict({"cache": {"enabled": true}}) works correctly

Verification: Integration tests pass; LLMGateway.chat() returns same LLMResponse shape whether cached or not; usage tracking includes cache hits.

U3. Semantic Router

Goal: Implement embedding-based semantic routing as Layer 1.5 in CostAwareRouter, using existing OpenAIEmbedder and compute_cosine_similarity().

Dependencies: None (independent of U1/U2, uses existing embedding infrastructure)

Files:

src/agentkit/chat/semantic_router.py (new) — SemanticRouter class, SkillEmbeddingIndex
src/agentkit/chat/skill_routing.py (modify) — integrate Layer 1.5 into CostAwareRouter.route()
tests/unit/chat/test_semantic_router.py (new) — unit tests for semantic router

Approach:

Architecture design before coding:

SkillEmbeddingIndex design reasoning: Pre-compute embeddings for all registered skills at initialization. Source text: f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}". Store as dict[str, tuple[list[float], str]] (skill_name → (embedding, source_text)). On skill registration/update, re-embed only the changed skill. This avoids O(n) embedding computation per query.
Query-time flow: Embed user query → compute cosine similarity against all skill embeddings → return top match if above threshold. This is O(n) in number of skills, but with <100 skills and 1536-dim vectors, this takes <5ms on CPU. No need for approximate nearest neighbor (ANN) index at this scale.
Threshold design: Three zones:
- similarity > 0.85: HIGH confidence → return skill match directly, skip Layer 2 LLM
- 0.6 <= similarity <= 0.85: MEDIUM confidence → pass skill hint to Layer 2, reducing LLM classification tokens
- similarity < 0.6: LOW confidence → no semantic signal, Layer 2 runs unmodified
Integration into CostAwareRouter: Modify the route() method. After Layer 1 (HeuristicClassifier) and before Layer 2 (_classify_merged()), if complexity is medium (0.3-0.7), call semantic_router.route(query). Based on the confidence zone, either return directly or enhance the Layer 2 prompt with a skill hint.
Embedding provider: Use OpenAIEmbedder by default. Support MockEmbedder for testing. Embedder is injected via constructor, not created internally.
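The index and the three confidence zones could be sketched as follows. This is an illustration under assumptions: `SemanticMatch`, the `register()` and `route()` signatures, and the local `cosine_similarity` (standing in for `compute_cosine_similarity()`) are hypothetical, and the embedder is any injected callable from text to a vector.

```python
import logging
import math
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Sequence, Tuple

logger = logging.getLogger("agentkit.semantic_router")
Embedder = Callable[[str], List[float]]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Local stand-in for compute_cosine_similarity()."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass(frozen=True)
class SemanticMatch:
    skill: str
    similarity: float
    zone: str  # "high": return directly; "medium": pass as hint to Layer 2


class SkillEmbeddingIndex:
    def __init__(self, embed: Embedder, high: float = 0.85, low: float = 0.6):
        self._embed = embed
        self._high = high
        self._low = low
        # skill_name -> (embedding, source_text)
        self._index: Dict[str, Tuple[List[float], str]] = {}

    def register(self, name: str, description: str,
                 keywords: Sequence[str], capability_tags: Sequence[str]) -> None:
        """Embed one skill at registration or update time."""
        source = f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}"
        self._index[name] = (self._embed(source), source)

    def route(self, query: str) -> Optional[SemanticMatch]:
        """Return the best match, or None so that Layer 2 runs unmodified."""
        if not self._index:
            return None
        try:
            query_emb = self._embed(query)
        except Exception as exc:  # embedder failure must not break routing
            logger.warning("embedding failed, skipping Layer 1.5: %s", exc)
            return None
        name, sim = max(
            ((n, cosine_similarity(query_emb, emb)) for n, (emb, _) in self._index.items()),
            key=lambda pair: pair[1],
        )
        if sim > self._high:
            return SemanticMatch(name, sim, "high")
        if sim >= self._low:
            return SemanticMatch(name, sim, "medium")
        return None
```

Returning None for the low zone, the empty registry, and embedder failures gives the caller a single "no semantic signal" path.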

Patterns to follow:

OpenAIEmbedder + EmbeddingCache pattern for embedding computation
compute_cosine_similarity() in src/agentkit/utils/vector_math.py
CostAwareRouter constructor injection pattern

Test scenarios:

Exact skill match: query "生成一篇关于AI的文章" ("Generate an article about AI") matches content_generator skill (sim > 0.85)
Partial skill match: query "优化内容" ("Optimize content") matches geo_optimizer skill (sim 0.6-0.85), skill hint passed to LLM
No skill match: query "今天天气怎么样" ("How is the weather today") has sim < 0.6 for all skills, Layer 2 runs normally
Skill registration: new skill added → embedding computed and indexed
Skill update: skill description changed → embedding re-computed
Empty skill registry: semantic router returns None gracefully
Embedder failure: OpenAIEmbedder throws error → semantic router logs warning, returns None, Layer 2 runs normally
Chinese query: "帮我写一篇文章" ("Help me write an article") matches content_generator skill correctly

Verification: Semantic router returns correct skill matches; Layer 2 LLM calls reduced by >50% for medium-complexity queries; no regression in routing accuracy.

U4. UsageStore Persistence

Goal: Persist UsageTracker records to Redis, with in-memory fallback and efficient aggregation queries.

Dependencies: None

Files:

src/agentkit/llm/usage_store.py (new) — UsageStore Protocol, InMemoryUsageStore, RedisUsageStore
src/agentkit/llm/providers/tracker.py (modify) — delegate to UsageStore backend
tests/unit/llm/test_usage_store.py (new) — unit tests for usage store backends

Approach:

Architecture design before coding:

Redis data model reasoning: Use one Redis Hash per date for time-partitioned storage. Key: agentkit:usage:{YYYY-MM-DD}; fields: {agent_name}:{model}:{metric}, one per metric (prompt_tokens, completion_tokens, total_tokens, cost, latency_ms, count), each holding a plain number rather than a JSON value, because HINCRBY and HINCRBYFLOAT can only increment numeric fields. Write via pipeline: HINCRBYFLOAT for the float fields + HINCRBY for the integer counters. Each write is O(1), each increment is atomic, and the data naturally partitions by date.
Aggregation query design: For get_usage(agent=None, start=None, end=None): iterate over the date keys in the range, read each with HGETALL, filter by agent/model in application code, and aggregate in memory. For single-agent queries, use field prefix matching (for example HSCAN with a MATCH pattern). This is O(days × fields per day), which is acceptable for dashboard queries.
UsageStore Protocol: Define record(agent, model, usage: UsageRecord) -> None, query(agent=None, model=None, start=None, end=None) -> list[UsageRecord], get_summary(agent=None, start=None, end=None) -> UsageSummary. Both sync and async versions (sync for backward compat, async for Redis).
Migration from UsageTracker: UsageTracker becomes a thin wrapper that delegates to UsageStore. Existing record() and get_usage() APIs preserved. Internal _records list replaced by store backend.
TTL management: Each date key gets TTL of 90 days (configurable). This prevents unbounded Redis memory growth while preserving 3 months of usage data.
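The date-partitioned layout can be illustrated with an in-memory analogue that uses the same key and field naming. This is a sketch only: the `record()` and `get_summary()` signatures are simplified from the Protocol above, one numeric field per metric is assumed (HINCRBYFLOAT increments a single numeric hash field), and agent names are assumed not to contain a colon.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, Optional

METRICS = ("prompt_tokens", "completion_tokens", "cost", "latency_ms", "count")


class InMemoryUsageStore:
    """Mirrors the planned Redis layout: one hash per date, numeric fields."""

    def __init__(self) -> None:
        # {"agentkit:usage:YYYY-MM-DD": {"agent:model:metric": number}}
        self._hashes: Dict[str, Dict[str, float]] = defaultdict(
            lambda: defaultdict(float)
        )

    @staticmethod
    def _key(day: date) -> str:
        return f"agentkit:usage:{day.isoformat()}"

    def record(self, day: date, agent: str, model: str, *, prompt_tokens: int,
               completion_tokens: int, cost: float, latency_ms: float) -> None:
        fields = self._hashes[self._key(day)]
        prefix = f"{agent}:{model}"
        # In Redis each line below would be one HINCRBY / HINCRBYFLOAT call.
        fields[f"{prefix}:prompt_tokens"] += prompt_tokens
        fields[f"{prefix}:completion_tokens"] += completion_tokens
        fields[f"{prefix}:cost"] += cost
        fields[f"{prefix}:latency_ms"] += latency_ms
        fields[f"{prefix}:count"] += 1

    def get_summary(self, start: date, end: date,
                    agent: Optional[str] = None) -> Dict[str, float]:
        """Aggregate all metrics over an inclusive date range."""
        totals = dict.fromkeys(METRICS, 0.0)
        day = start
        while day <= end:
            # .get() avoids creating empty hashes for days with no data.
            for name, value in self._hashes.get(self._key(day), {}).items():
                prefix, metric = name.rsplit(":", 1)
                field_agent = prefix.split(":", 1)[0]
                if agent is None or field_agent == agent:
                    totals[metric] += value
            day += timedelta(days=1)
        return totals
```

Because the stored values are running totals, per-request records are not recoverable from this layout; only aggregates per agent, model, and day are.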

Patterns to follow:

SessionStore Protocol in src/agentkit/session/store.py — Protocol definition pattern
RedisSessionStore._get_redis() — lazy Redis initialization
create_session_store() — factory function pattern
agentkit:usage: key namespace convention

Test scenarios:

Record and query: record usage → query returns matching records
Date partitioning: records on different dates stored in different keys
Aggregation: multiple records for same agent/model aggregated correctly
Agent filter: query with agent filter returns only that agent's records
Date range filter: query with start/end returns only records in range
TTL: date keys have correct TTL set
Redis unavailable: falls back to in-memory store with warning
Concurrent writes: two concurrent records for same agent/model don't lose data
Empty query: query with no matching records returns empty list

Verification: Usage data survives process restart; get_usage() returns same shape as before; Redis memory usage bounded by TTL.

U5. CascadeStateStore Persistence

Goal: Persist CascadeDetector state to Redis using atomic operations, enabling multi-instance cascade detection.

Dependencies: None

Files:

src/agentkit/quality/cascade_store.py (new) — CascadeStateStore Protocol, InMemoryCascadeStore, RedisCascadeStore
src/agentkit/quality/cascade_detector.py (modify) — delegate to CascadeStateStore backend
tests/unit/quality/test_cascade_store.py (new) — unit tests for cascade store backends

Approach:

Architecture design before coding:

Redis data model reasoning: Use simple string keys with INCR for atomic counting. Key: agentkit:cascade:{session_id}:interactions (INCR + TTL), agentkit:cascade:{session_id}:depth (GET/SET + TTL). TTL aligned with session TTL (default 86400s). INCR is atomic — no race conditions across instances.
Protocol design: CascadeStateStore with increment_interactions(session_id) -> int, get_interactions(session_id) -> int, set_depth(session_id, depth) -> None, get_depth(session_id) -> int, reset(session_id) -> None, get_stats(session_id) -> CascadeStats.
Integration into CascadeDetector: Replace internal _interaction_counts and _loop_depths dicts with CascadeStateStore backend. All methods delegate to store. CascadeDetector becomes stateless — all state lives in the store.
Session TTL alignment: When increment_interactions() is called, refresh the key TTL to match session TTL. This ensures state is cleaned up when sessions expire.
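A sketch of the Protocol and its in-memory fallback, with TTL refresh on write. Assumptions: `get_stats()` is omitted because `CascadeStats` is not defined in this plan, the injectable `clock` parameter is a hypothetical addition for testability, and the in-memory version is not safe for concurrent use (the Redis backend would rely on INCR for that).

```python
import time
from typing import Callable, Dict, Protocol, Tuple


class CascadeStateStore(Protocol):
    def increment_interactions(self, session_id: str) -> int: ...
    def get_interactions(self, session_id: str) -> int: ...
    def set_depth(self, session_id: str, depth: int) -> None: ...
    def get_depth(self, session_id: str) -> int: ...
    def reset(self, session_id: str) -> None: ...


class InMemoryCascadeStore:
    """Single-process fallback that imitates Redis keys with a TTL."""

    def __init__(self, session_ttl: float = 86400.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = session_ttl
        self._clock = clock
        self._values: Dict[str, Tuple[int, float]] = {}  # key -> (value, expires_at)

    def _get(self, key: str) -> int:
        item = self._values.get(key)
        if item is None:
            return 0
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._values[key]  # lazily expire, as a TTL would
            return 0
        return value

    def _set(self, key: str, value: int) -> None:
        # Every write refreshes the TTL, like INCR/SET followed by EXPIRE.
        self._values[key] = (value, self._clock() + self._ttl)

    def increment_interactions(self, session_id: str) -> int:
        key = f"agentkit:cascade:{session_id}:interactions"
        value = self._get(key) + 1
        self._set(key, value)
        return value

    def get_interactions(self, session_id: str) -> int:
        return self._get(f"agentkit:cascade:{session_id}:interactions")

    def set_depth(self, session_id: str, depth: int) -> None:
        self._set(f"agentkit:cascade:{session_id}:depth", depth)

    def get_depth(self, session_id: str) -> int:
        return self._get(f"agentkit:cascade:{session_id}:depth")

    def reset(self, session_id: str) -> None:
        for suffix in ("interactions", "depth"):
            self._values.pop(f"agentkit:cascade:{session_id}:{suffix}", None)
```

With the store injected, CascadeDetector holds no counters of its own, which is what makes swapping in a Redis-backed implementation possible.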

Patterns to follow:

Same Protocol + factory + fallback pattern as U4
Redis INCR atomic operation pattern
agentkit:cascade: key namespace

Test scenarios:

Increment and get: increment interactions → get returns correct count
Set and get depth: set depth → get returns correct depth
Reset: reset session → interactions and depth both cleared
TTL: keys have TTL set, expire after session timeout
Multi-instance: two instances incrementing same session see consistent count
Redis unavailable: falls back to in-memory store
Session isolation: different sessions have independent state

Verification: Cascade detection state survives process restart; multi-instance deployment detects cascades correctly; no false positives from state loss.

U6. EvolutionStore Interface Unification

Goal: Unify EvolutionStore and PersistentEvolutionStore interfaces, add PostgreSQL backend with full feature set.

Dependencies: None

Files:

src/agentkit/evolution/evolution_store.py (modify) — define unified EvolutionStoreProtocol, refactor existing stores
src/agentkit/evolution/models.py (modify) — add SkillVersionModel and ABTestResultModel to async PG models
src/agentkit/evolution/pg_store.py (new) — PostgreSQLEvolutionStore implementing unified Protocol with async SQLAlchemy
tests/unit/evolution/test_unified_store.py (new) — tests for unified interface

Approach:

Architecture design before coding:

Protocol design reasoning: Current EvolutionStore (async PG) has record(), rollback(), list_events(). PersistentEvolutionStore (sync SQLite) adds record_skill_version(), list_skill_versions(), record_ab_test_result(), get_ab_test_results(). The unified Protocol must include ALL methods from both. Each backend implements what it can; unsupported methods raise NotImplementedError with clear message.
PostgreSQL model migration: Add SkillVersionModel and ABTestResultModel to src/agentkit/evolution/models.py using async SQLAlchemy (matching EpisodeModel pattern in memory/models.py). These models already exist for SQLite; the PG versions use the same schema but with async engine.
PostgreSQLEvolutionStore: New class using async SQLAlchemy session (injected via constructor, same pattern as existing EvolutionStore). Implements all Protocol methods. Uses run_in_executor for any sync ORM operations if needed.
Factory update: create_evolution_store(backend="memory"|"sqlite"|"postgresql", ...) returns the appropriate backend. "postgresql" creates PostgreSQLEvolutionStore with async engine.
Backward compatibility: Existing EvolutionStore class is not removed — it becomes an internal implementation detail. The Protocol is the public interface. Code using EvolutionStore directly continues to work.
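The unified Protocol and a compliance check might be sketched like this. The method names come from the plan; every parameter list, the all-async shape, and the interpretation of `rollback()` as flagging an event are assumptions, since the existing signatures are not shown here.

```python
import asyncio
from typing import Any, Dict, List, Protocol, runtime_checkable


@runtime_checkable
class EvolutionStoreProtocol(Protocol):
    # Hypothetical signatures; only the method names are from the plan.
    async def record(self, event_id: str, payload: Dict[str, Any]) -> None: ...
    async def rollback(self, event_id: str) -> None: ...
    async def list_events(self) -> List[Dict[str, Any]]: ...
    async def record_skill_version(self, skill: str, version: str) -> None: ...
    async def list_skill_versions(self, skill: str) -> List[str]: ...
    async def record_ab_test_result(self, test_id: str, result: Dict[str, Any]) -> None: ...
    async def get_ab_test_results(self, test_id: str) -> List[Dict[str, Any]]: ...


class InMemoryEvolutionStore:
    """Reference backend that implements every Protocol method."""

    def __init__(self) -> None:
        self._events: Dict[str, Dict[str, Any]] = {}
        self._versions: Dict[str, List[str]] = {}
        self._ab_results: Dict[str, List[Dict[str, Any]]] = {}

    async def record(self, event_id: str, payload: Dict[str, Any]) -> None:
        self._events[event_id] = {"id": event_id, "rolled_back": False, **payload}

    async def rollback(self, event_id: str) -> None:
        # Assumption: rollback marks the event instead of deleting it.
        self._events[event_id]["rolled_back"] = True

    async def list_events(self) -> List[Dict[str, Any]]:
        return list(self._events.values())

    async def record_skill_version(self, skill: str, version: str) -> None:
        self._versions.setdefault(skill, []).append(version)

    async def list_skill_versions(self, skill: str) -> List[str]:
        return list(self._versions.get(skill, []))

    async def record_ab_test_result(self, test_id: str, result: Dict[str, Any]) -> None:
        self._ab_results.setdefault(test_id, []).append(result)

    async def get_ab_test_results(self, test_id: str) -> List[Dict[str, Any]]:
        return list(self._ab_results.get(test_id, []))


async def demo() -> List[str]:
    store: EvolutionStoreProtocol = InMemoryEvolutionStore()
    await store.record_skill_version("writer", "1.0.0")
    await store.record_skill_version("writer", "1.1.0")
    return await store.list_skill_versions("writer")
```

Running `asyncio.run(demo())` exercises the store through the Protocol type. Note that `isinstance(store, EvolutionStoreProtocol)` with `runtime_checkable` only verifies that the methods exist, not their signatures, so a compliance test should also call each method.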

Patterns to follow:

EpisodeModel in src/agentkit/memory/models.py — async PG model pattern
create_evolution_store() factory — extend with new backend
PersistentEvolutionStore._run_sync() — sync/async bridge pattern

Test scenarios:

Protocol compliance: all backends implement all Protocol methods
PG store: record event → list events returns recorded event
PG store: record skill version → list versions returns version history
PG store: record AB test result → get results returns test data
SQLite store: existing functionality preserved after refactor
Memory store: existing functionality preserved after refactor
Factory: create_evolution_store(backend="postgresql") returns correct type
PG unavailable: falls back to SQLite with warning

Verification: All backends pass unified Protocol compliance test; existing evolution tests pass; PG store supports skill_version and ab_test operations.

U7. Configuration Integration and End-to-End Verification

Goal: Wire all three features into the application configuration, add agentkit.yaml schema support, and verify end-to-end behavior.

Dependencies: U1, U2, U3, U4, U5, U6

Files:

src/agentkit/server/app.py (modify) — initialize cache, usage store, cascade store with config
src/agentkit/cli/main.py (modify) — pass config to gateway and router
agentkit.yaml (modify) — add cache, semantic_routing, usage_store, cascade_store config sections
tests/integration/test_p0_hardening.py (new) — end-to-end integration tests

Approach:

Configuration schema: Add to agentkit.yaml:

llm:
  cache:
    enabled: true
    backend: "auto"          # auto | redis | memory
    exact_ttl: 3600
    semantic_ttl: 86400
    similarity_threshold: 0.92
    max_entries: 10000

routing:
  semantic:
    enabled: true
    similarity_high: 0.85    # direct match threshold
    similarity_low: 0.6      # hint threshold

usage_store:
  backend: "auto"            # auto | redis | memory
  ttl_days: 90

cascade_store:
  backend: "auto"            # auto | redis | memory
  session_ttl: 86400

evolution_store:
  backend: "auto"            # auto | postgresql | sqlite | memory

Application wiring: In app.py lifespan, initialize all stores and inject into gateway/router. Follow existing pattern of creating components from config.
End-to-end verification: Integration test that exercises the full flow: user query → semantic routing → LLM cache → usage tracking → cascade detection → evolution logging.

Test scenarios:

Full flow with Redis: all features use Redis backend, data persists across simulated restart
Full flow without Redis: all features fall back to in-memory, no errors
Config from YAML: agentkit.yaml parsed correctly, all features configured
Cache + routing interaction: cached response for semantically routed query works correctly
Usage tracking with cache: cached requests show 0 cost in usage summary
Cascade detection across instances: simulated multi-instance scenario detects cascade correctly

Verification: All integration tests pass; application starts with new config; features degrade gracefully when backends unavailable.

Risks & Mitigations

Risk	Impact	Likelihood	Mitigation
Semantic cache returns stale/wrong response	High — user gets incorrect answer	Medium — embedding similarity doesn't guarantee semantic equivalence	Default to temperature=0 only for semantic cache; configurable threshold; TTL expiry; admin invalidation API
Redis single point of failure	High — all persistence lost	Low — Redis is typically HA	Auto-fallback to in-memory; health check in doctor command; alert on fallback activation
Embedding API latency adds to routing time	Medium — slower routing for first query	Medium — embedding API ~100ms	Pre-compute skill embeddings; cache query embeddings; async embedding with timeout
UsageStore Redis memory growth	Medium — Redis OOM	Low — TTL + date partitioning bounds growth	90-day TTL default; monitoring on Redis memory; configurable TTL
EvolutionStore interface unification breaks existing code	High — evolution system stops working	Low — Protocol is backward compatible	Keep existing classes as internal implementations; comprehensive test coverage before refactor

Open Questions

Should semantic cache also cache streaming responses (requires chunk collection)? Deferred — current plan only caches non-streaming chat().
Should UsageStore support real-time streaming of usage data (e.g., via Redis Pub/Sub)? Deferred — current plan only supports query-based access.
What is the optimal embedding model for Chinese+English mixed text? text-embedding-3-small is adequate but not optimal. Consider bge-m3 or multilingual-e5 as alternatives. Deferred to implementation-time benchmarking.

Sources & Research

Industry benchmarking: LangChain, Dify, CrewAI, Letta, AutoGen feature comparison (2025-2026)
Project audit: 12 core files analyzed across memory, evolution, routing, quality, and LLM subsystems
Existing patterns: EmbeddingCache, RedisSessionStore, create_evolution_store(), SessionStore Protocol
