---
title: "feat: P0 Production Hardening - LLM Cache, Semantic Routing, State Persistence"
status: active
created_at: 2026-06-14
type: feat
origin: "Industry research and project review (2026-06-14)"
depth: deep
---

# P0 Production Hardening - LLM Cache, Semantic Routing, State Persistence

## Summary

Three P0 gaps were identified from industry benchmarking and a project audit: (1) LLM response caching, targeting a 30-50% reduction in token cost; (2) embedding-based semantic routing to improve intent matching quality at zero LLM cost; (3) persistence of critical state for UsageTracker, EvolutionStore, and CascadeDetector, so that it survives restarts and enables multi-instance deployment. Each unit requires detailed architecture design and code reasoning before implementation: design first, code second.

## Problem Frame

AgentKit has strong differentiation in self-evolution, quality management, and multi-paradigm engines, but three production-critical gaps prevent enterprise deployment:

1. **Every LLM request hits the provider.** There is no caching, so identical or similar requests waste tokens and money. Competitors like Dify have built-in caching.
2. **Routing relies on keyword matching and LLM classification.** There is no semantic understanding. Embedding-based routing is industry standard (Agentic RAG trend), and AgentKit already has embedding infrastructure but doesn't use it for routing.
3. **Critical state lives in memory.** UsageTracker, CascadeDetector, and EvolutionStore lose data on restart. Multi-instance deployment is impossible without shared state.

These gaps are P0 because they directly impact cost (caching), quality (routing accuracy), and reliability (state persistence), the three pillars of production readiness.

---

## Requirements

- R1. LLM cache must support exact-match (hash-based) and semantic-match (embedding-based) cache hits
- R2. LLM cache must integrate transparently into `LLMGateway.chat()` without changing the public API
- R3.
  LLM cache must record usage on cache hits (0 cost) to maintain usage tracking integrity
- R4. Semantic routing must insert between Layer 1 and Layer 2 in `CostAwareRouter`
- R5. Semantic routing must use existing `OpenAIEmbedder` and `compute_cosine_similarity()` infrastructure
- R6. Semantic routing must pre-compute skill embeddings at registration time, not at query time
- R7. UsageTracker must persist records to Redis with O(1) write and efficient aggregation
- R8. CascadeDetector must persist state to Redis using atomic INCR operations
- R9. EvolutionStore must support both PostgreSQL and SQLite backends with a unified interface
- R10. All three features must degrade gracefully when Redis/PG is unavailable (fallback to in-memory)
- R11. Each unit must have detailed architecture design and code reasoning documented before implementation begins

---

## Key Technical Decisions

- **KTD-1. Cache key design**: Use `SHA256(model + system_prompt_hash + messages_content_hash + temperature + tools_hash)` for exact match. For semantic match, embed the last user message and compare against cached embeddings using cosine similarity > threshold. Rationale: exact match is fast and deterministic; semantic match catches paraphrased requests. Both are needed because exact match alone misses too many hits, and semantic match alone is too slow for every request.
- **KTD-2. Cache storage backend**: Implement `LLMCache` as a Protocol with `InMemoryLLMCache` and `RedisLLMCache` backends. In-memory uses `OrderedDict` with LRU eviction (following `EmbeddingCache` pattern). Redis uses `agentkit:llm_cache:{hash}` keys with TTL. Rationale: follows existing factory pattern (`create_message_bus`, `create_session_store`); in-memory for dev/single-instance, Redis for production.
- **KTD-3. Semantic routing insertion point**: Insert as Layer 1.5 between `HeuristicClassifier` and `_classify_merged()`. When Layer 1 returns medium complexity (0.3-0.7), try semantic routing first.
  If similarity > 0.85, return skill match directly (skip LLM). If similarity 0.6-0.85, pass skill hint to Layer 2 LLM (reduces LLM classification tokens). If < 0.6, proceed to Layer 2 unchanged. Rationale: this placement maximizes cost savings by avoiding LLM calls when semantic match is confident, while preserving the existing fallback chain.
- **KTD-4. Skill embedding source text**: Embed `f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}"` for each skill. Cache embeddings in a dict keyed by skill name, re-embed on skill registration/update. Rationale: combines all semantic signals; description alone misses keyword intent; keywords alone miss semantic meaning.
- **KTD-5. UsageTracker persistence strategy**: Use one Redis Hash per date for time-series data. Key pattern: `agentkit:usage:{date}`, with one numeric field per metric, `{agent}:{model}:{metric}`, covering `tokens`, `cost`, `latency_ms`, and `count`. `HINCRBYFLOAT` and `HINCRBY` operate on numeric field values, so the metrics are stored as separate fields rather than as a single JSON value. Write via `HINCRBYFLOAT` for atomic increment. Query via `HGETALL` + client-side aggregation. Rationale: O(1) write, acceptable query performance, natural TTL by date, follows Redis patterns in project.
- **KTD-6. CascadeDetector persistence strategy**: Use Redis atomic operations. Key pattern: `agentkit:cascade:{session_id}:interactions` (INCR + TTL) and `agentkit:cascade:{session_id}:depth` (SET/GET + TTL). Rationale: INCR is atomic, no race conditions across instances; TTL prevents memory leaks; matches session lifecycle.
- **KTD-7. EvolutionStore interface unification**: Extend the base `EvolutionStore` Protocol to include `skill_version` and `ab_test` methods. Make `PersistentEvolutionStore` (SQLite) implement the unified Protocol. Add a new `PostgreSQLEvolutionStore` that uses async SQLAlchemy like the existing `EvolutionStore` but with the full unified interface. Rationale: current split (sync SQLite vs async PG) creates maintenance burden; unified Protocol enables backend-agnostic usage.
- **KTD-8.
  Graceful degradation pattern**: All three features use the same pattern: try preferred backend, catch connection error, log warning, fall back to in-memory. Controlled by `cache.backend`, `usage_store.backend`, `cascade_store.backend` config values (`"auto"` | `"redis"` | `"memory"`). `"auto"` tries Redis, falls back to memory. Rationale: production needs persistence, but dev/testing shouldn't require Redis.

---

## High-Level Technical Design

### LLM Cache Flow

```mermaid
flowchart TB
    A[LLMGateway.chat] --> B{Cache enabled?}
    B -->|no| F[Call Provider]
    B -->|yes| C[Generate exact key]
    C --> D{Exact match?}
    D -->|hit| E[Return cached response]
    D -->|miss| G[Generate embedding of last user msg]
    G --> H{Semantic match? similarity > 0.92}
    H -->|hit| E
    H -->|miss| F
    F --> I[Write to cache]
    I --> J[Record usage]
    E --> K[Record usage with 0 cost]
```

### Semantic Routing Flow

```mermaid
flowchart TB
    A[CostAwareRouter.route] --> B[Layer 0: Regex rules]
    B -->|matched| Z[Return DIRECT_CHAT]
    B -->|unmatched| C[Layer 1: HeuristicClassifier]
    C -->|low complexity| Z
    C -->|medium-high| D[Layer 1.5: Semantic Router NEW]
    D -->|sim > 0.85| E[Return SKILL_REACT with matched skill]
    D -->|sim 0.6-0.85| F[Pass skill_hint to Layer 2]
    D -->|sim < 0.6| G[Layer 2: LLM classification]
    F --> G
    G --> H[Return routing result]
```

### State Persistence Architecture

```mermaid
flowchart TB
    subgraph "Current (In-Memory)"
        UT1[UsageTracker dict]
        CD1[CascadeDetector dict]
        ES1[EvolutionStore SQLite]
    end
    subgraph "Target (Persistent)"
        UT2[UsageStore Protocol]
        CD2[CascadeStateStore Protocol]
        ES2[UnifiedEvolutionStore Protocol]
        UT2 -->|redis| R1[Redis Hash agentkit:usage:date]
        UT2 -->|memory| M1[InMemoryUsageStore]
        CD2 -->|redis| R2[Redis INCR agentkit:cascade:session]
        CD2 -->|memory| M2[InMemoryCascadeStore]
        ES2 -->|postgresql| P1[PG EvolutionEventModel + SkillVersionModel]
        ES2 -->|sqlite| S1[PersistentEvolutionStore]
        ES2 -->|memory| M3[InMemoryEvolutionStore]
    end
```

---

## Scope Boundaries

### In Scope

- LLM response caching (exact + semantic match, in-memory + Redis backends)
- Semantic routing as Layer 1.5 in CostAwareRouter
- UsageTracker Redis persistence
- CascadeDetector Redis persistence
- EvolutionStore interface unification
- Configuration for all three features
- Architecture design documents for each unit before coding

### Deferred for Follow-Up

- Semantic cache using pgvector (current semantic match uses in-memory embedding comparison)
- Cache warming / pre-population strategies
- Routing cache (caching routing results for similar queries)
- Usage analytics dashboard (visualization of usage data)
- Multi-tenant resource quotas
- Rate limiting and concurrency control (P2)
- Distributed tracing visualization (P2)

---

## Implementation Units

### U1. LLM Cache Core

**Goal:** Implement the `LLMCache` Protocol, `InMemoryLLMCache`, and `RedisLLMCache` with exact-match and semantic-match capabilities.

**Dependencies:** None

**Files:**

- `src/agentkit/llm/cache.py` (new): `LLMCache` Protocol, `InMemoryLLMCache`, `RedisLLMCache`, `CacheResult`, `CacheKey` generation
- `src/agentkit/llm/cache_key.py` (new): `generate_cache_key()`, `generate_messages_hash()`, `generate_system_prompt_hash()`
- `tests/unit/llm/test_cache.py` (new): unit tests for cache backends

**Approach:**

Architecture design before coding:

1. **CacheKey design reasoning**: The key must capture all inputs that affect LLM output. `model` determines which model responds. `system_prompt` sets behavior. `messages` carry the conversation. `temperature` affects randomness (semantic matching is restricted to `temperature == 0`, where output is treated as deterministic). `tools` affect tool_call availability. Hash each component independently so that unchanged components (for example, a long system prompt) do not need to be re-hashed when another part of the request changes.
2. **Exact match implementation**: SHA-256 hash of concatenated component hashes. Store as `agentkit:llm_cache:{sha256_hex}` in Redis with TTL. In-memory uses OrderedDict keyed by hash string.
3.
   **Semantic match implementation**: For cache misses on exact match, embed the last user message using `OpenAIEmbedder`. Compare against cached embeddings using `compute_cosine_similarity()`. Store embeddings alongside cached responses. In-memory: linear scan of all cached embeddings. Redis: store embeddings in a separate key `agentkit:llm_cache_emb:{sha256_hex}`.
4. **Cache write policy**: Exact-match caching applies at any temperature. Semantic matching is limited to requests with `temperature == 0` (deterministic); for temperature > 0, outputs are non-deterministic, so only the exact-match path is attempted.
5. **Cache invalidation**: TTL-based (configurable, default 3600s for exact, 86400s for semantic). Manual invalidation via `invalidate(pattern=None)` for admin operations.

**Patterns to follow:**

- `EmbeddingCache` in `src/agentkit/memory/embedder.py`: LRU + TTL pattern
- `create_session_store()` factory in `src/agentkit/session/store.py`: backend factory pattern
- `RedisSessionStore._get_redis()`: lazy Redis initialization

**Test scenarios:**

- Exact match: same messages + model → cache hit, returns identical response
- Exact miss: different messages → cache miss, calls provider, writes to cache
- Semantic match: paraphrased question (similarity > 0.92) → cache hit
- Semantic miss: unrelated question (similarity < 0.6) → cache miss
- Temperature > 0: only exact match attempted, no semantic match
- TTL expiry: cached entry expires after TTL, next request is a miss
- Redis unavailable: falls back to in-memory cache with warning log
- Cache with tool_calls: response containing tool_calls is cached correctly
- Concurrent access: two concurrent requests for same key don't cause double-write issues

**Verification:** Unit tests pass; cache hit rate metric is observable; no change to `LLMGateway` public API.

---

### U2. LLM Cache Integration

**Goal:** Integrate `LLMCache` into `LLMGateway.chat()` transparently, with usage tracking on cache hits.


**Dependencies:** U1

**Files:**

- `src/agentkit/llm/gateway.py` (modify): inject cache check before provider call, cache write after provider response
- `src/agentkit/llm/config.py` (modify): add `CacheConfig` to `LLMConfig`
- `src/agentkit/server/app.py` (modify): pass cache config to `LLMGateway`
- `tests/unit/llm/test_gateway_cache.py` (new): integration tests for cached gateway

**Approach:**

Architecture design before coding:

1. **Insertion point reasoning**: Cache check must happen AFTER `LLMRequest` construction (line ~79 in gateway.py) but BEFORE provider call (line ~87). This ensures all request normalization (alias resolution, model fallback list) has completed. Cache write happens AFTER response validation but BEFORE usage tracking.
2. **Cache hit usage tracking**: On cache hit, call `_usage_tracker.record()` with the original `usage` data from the cached response but with `cost=0` and `latency_ms` from cache lookup time. This preserves usage query integrity: `get_usage()` still shows all requests, just with zero cost for cached ones.
3. **Stream handling**: `chat_stream()` is NOT cached in this iteration. Streaming requires collecting all chunks before caching, which adds latency and complexity. Document this as a known limitation.
4. **Configuration integration**: Add `CacheConfig` dataclass with `enabled: bool = False`, `backend: str = "auto"`, `exact_ttl: int = 3600`, `semantic_ttl: int = 86400`, `similarity_threshold: float = 0.92`, `max_entries: int = 10000`. Nest under `LLMConfig.cache`.

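
The exact-match path described in U1 and U2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the gateway's actual code: `ExactMatchCache`, `cached_chat`, the dict-shaped response with `tokens` and `cost` keys, and the `call_provider` callable are all hypothetical stand-ins, and TTL handling and the semantic path are left out. The `CacheConfig` fields and defaults are the ones listed in the configuration item above.

```python
import hashlib
import json
from collections import OrderedDict
from dataclasses import dataclass, fields
from typing import Any, Callable, Optional


@dataclass
class CacheConfig:
    """Fields and defaults as listed in the configuration item above."""
    enabled: bool = False
    backend: str = "auto"
    exact_ttl: int = 3600
    semantic_ttl: int = 86400
    similarity_threshold: float = 0.92
    max_entries: int = 10000

    @classmethod
    def from_dict(cls, data: dict) -> "CacheConfig":
        # Assumption: unknown keys are ignored rather than rejected.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in data.items() if k in known})


def _digest(value: Any) -> str:
    """Stable SHA-256 hex digest of a JSON-serializable value."""
    payload = json.dumps(value, sort_keys=True, ensure_ascii=False, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def generate_cache_key(model: str, system_prompt: str, messages: list,
                       temperature: float, tools: Optional[list] = None) -> str:
    """SHA-256 over the model name plus independently hashed components."""
    parts = [model, _digest(system_prompt), _digest(messages),
             repr(float(temperature)), _digest(tools or [])]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


class ExactMatchCache:
    """Hypothetical in-memory LRU cache; TTL is omitted for brevity."""

    def __init__(self, max_entries: int) -> None:
        self._entries: "OrderedDict[str, dict]" = OrderedDict()
        self._max_entries = max_entries

    def get(self, key: str) -> Optional[dict]:
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key]

    def set(self, key: str, response: dict) -> None:
        self._entries[key] = response
        self._entries.move_to_end(key)
        while len(self._entries) > self._max_entries:
            self._entries.popitem(last=False)  # evict least recently used


def cached_chat(cache: ExactMatchCache, call_provider: Callable[[str, list], dict],
                usage_log: list, *, model: str, system_prompt: str, messages: list,
                temperature: float = 0.0, tools: Optional[list] = None) -> dict:
    """Check the cache before the provider call; log hits with zero cost."""
    key = generate_cache_key(model, system_prompt, messages, temperature, tools)
    hit = cache.get(key)
    if hit is not None:
        usage_log.append({"model": model, "tokens": hit["tokens"],
                          "cost": 0.0, "cached": True})
        return hit
    response = call_provider(model, messages)
    cache.set(key, response)  # cache write happens before usage tracking
    usage_log.append({"model": model, "tokens": response["tokens"],
                      "cost": response["cost"], "cached": False})
    return response
```

Because `temperature` is part of the key, the same conversation at a different temperature gets its own entry, which keeps exact-match caching safe at any temperature.
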

**Patterns to follow:**

- `LLMConfig` dataclass + `from_dict()` pattern for config
- `LLMGateway.__init__()` dependency injection pattern

**Test scenarios:**

- Cache disabled: requests pass through to provider normally
- Cache enabled, first request: cache miss, provider called, response cached
- Cache enabled, second identical request: cache hit, provider NOT called
- Cache hit usage tracking: usage record has 0 cost, correct token counts
- Cache miss + fallback: primary model fails, fallback model response cached under fallback model key
- Config from YAML: `LLMConfig.from_dict({"cache": {"enabled": True}})` works correctly

**Verification:** Integration tests pass; `LLMGateway.chat()` returns same `LLMResponse` shape whether cached or not; usage tracking includes cache hits.

---

### U3. Semantic Router

**Goal:** Implement embedding-based semantic routing as Layer 1.5 in `CostAwareRouter`, using existing `OpenAIEmbedder` and `compute_cosine_similarity()`.

**Dependencies:** None (independent of U1/U2, uses existing embedding infrastructure)

**Files:**

- `src/agentkit/chat/semantic_router.py` (new): `SemanticRouter` class, `SkillEmbeddingIndex`
- `src/agentkit/chat/skill_routing.py` (modify): integrate Layer 1.5 into `CostAwareRouter.route()`
- `tests/unit/chat/test_semantic_router.py` (new): unit tests for semantic router

**Approach:**

Architecture design before coding:

1. **SkillEmbeddingIndex design reasoning**: Pre-compute embeddings for all registered skills at initialization. Source text: `f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}"`. Store as `dict[str, tuple[list[float], str]]` (skill_name → (embedding, source_text)). On skill registration/update, re-embed only the changed skill. This avoids O(n) embedding computation per query.
2. **Query-time flow**: Embed user query → compute cosine similarity against all skill embeddings → return top match if above threshold.
   This is O(n) in the number of skills, but with <100 skills and 1536-dim vectors, this takes <5ms on CPU. No need for an approximate nearest neighbor (ANN) index at this scale.
3. **Threshold design**: Three zones:
   - `similarity > 0.85`: HIGH confidence → return skill match directly, skip Layer 2 LLM
   - `0.6 <= similarity <= 0.85`: MEDIUM confidence → pass skill hint to Layer 2, reducing LLM classification tokens
   - `similarity < 0.6`: LOW confidence → no semantic signal, Layer 2 runs unmodified
4. **Integration into CostAwareRouter**: Modify `route()` method. After Layer 1 (`HeuristicClassifier`) and before Layer 2 (`_classify_merged()`), if complexity is medium (0.3-0.7), call `semantic_router.route(query)`. Based on confidence zone, either return directly or enhance the Layer 2 prompt with skill hint.
5. **Embedding provider**: Use `OpenAIEmbedder` by default. Support `MockEmbedder` for testing. Embedder is injected via constructor, not created internally.

**Patterns to follow:**

- `OpenAIEmbedder` + `EmbeddingCache` pattern for embedding computation
- `compute_cosine_similarity()` in `src/agentkit/utils/vector_math.py`
- `CostAwareRouter` constructor injection pattern

**Test scenarios:**

- Exact skill match: query "生成一篇关于AI的文章" ("Generate an article about AI") matches `content_generator` skill (sim > 0.85)
- Partial skill match: query "优化内容" ("Optimize content") matches `geo_optimizer` skill (sim 0.6-0.85), skill hint passed to LLM
- No skill match: query "今天天气怎么样" ("How is the weather today?") has sim < 0.6 for all skills, Layer 2 runs normally
- Skill registration: new skill added → embedding computed and indexed
- Skill update: skill description changed → embedding re-computed
- Empty skill registry: semantic router returns None gracefully
- Embedder failure: OpenAIEmbedder throws error → semantic router logs warning, returns None, Layer 2 runs normally
- Chinese query: "帮我写一篇文章" ("Help me write an article") matches content_generator skill correctly

**Verification:** Semantic router returns correct skill matches; Layer 2 LLM calls reduced by >50% for medium-complexity queries; no regression in routing accuracy.

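
The three-zone threshold logic of U3 can be sketched as below. This is an illustrative sketch only: `SemanticMatch`, the `zone` labels, and the injected `embed` callable are hypothetical, and the local `cosine_similarity` merely stands in for the project's `compute_cosine_similarity()`, whose exact signature is not shown in this plan.

```python
import logging
import math
from dataclasses import dataclass
from typing import Callable, Optional

logger = logging.getLogger("semantic_router_sketch")


def cosine_similarity(a: list, b: list) -> float:
    """Plain cosine similarity; returns 0.0 for zero-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass
class SemanticMatch:
    skill: str
    similarity: float
    zone: str  # "high": return the skill directly; "medium": pass as a hint


class SemanticRouter:
    def __init__(self, embed: Callable[[str], list],
                 high: float = 0.85, low: float = 0.6) -> None:
        self._embed = embed  # injected, never created internally
        self._high = high
        self._low = low
        self._index: dict = {}  # skill name -> pre-computed embedding

    def register_skill(self, name: str, description: str,
                       keywords: list, capability_tags: list) -> None:
        """Embed once at registration (or update) time, not per query."""
        source = f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}"
        self._index[name] = self._embed(source)

    def route(self, query: str) -> Optional[SemanticMatch]:
        """Return the best match, or None when Layer 2 should run unmodified."""
        if not self._index:
            return None
        try:
            query_vec = self._embed(query)
        except Exception as exc:  # embedder failure must not break routing
            logger.warning("embedder failed, skipping semantic routing: %s", exc)
            return None
        name, sim = max(
            ((n, cosine_similarity(query_vec, v)) for n, v in self._index.items()),
            key=lambda pair: pair[1],
        )
        if sim > self._high:
            return SemanticMatch(name, sim, "high")
        if sim >= self._low:
            return SemanticMatch(name, sim, "medium")
        return None
```

A caller would treat a `"high"` result as a direct skill match, a `"medium"` result as a hint for the Layer 2 prompt, and `None` as "no semantic signal".
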

---

### U4. UsageStore Persistence

**Goal:** Persist UsageTracker records to Redis, with in-memory fallback and efficient aggregation queries.

**Dependencies:** None

**Files:**

- `src/agentkit/llm/usage_store.py` (new): `UsageStore` Protocol, `InMemoryUsageStore`, `RedisUsageStore`
- `src/agentkit/llm/providers/tracker.py` (modify): delegate to `UsageStore` backend
- `tests/unit/llm/test_usage_store.py` (new): unit tests for usage store backends

**Approach:**

Architecture design before coding:

1. **Redis data model reasoning**: Use Redis Hash per date for time-partitioned storage. Key: `agentkit:usage:{YYYY-MM-DD}`; fields: one numeric field per metric, `{agent_name}:{model}:{metric}`, covering `prompt_tokens`, `completion_tokens`, `total_tokens`, `cost`, `latency_ms`, and `count`. `HINCRBYFLOAT` and `HINCRBY` operate on numeric field values, so the metrics are kept as separate fields rather than as one JSON value. Write via pipeline: `HINCRBYFLOAT` for numeric fields + `HINCRBY` for count. This is O(1) per write, atomic, and naturally partitions by date.
2. **Aggregation query design**: For `get_usage(agent=None, start=None, end=None)`: scan date keys in range via `HGETALL`, filter by agent/model in application code, aggregate in memory. For single-agent queries, use field prefix matching. This is O(days × agents), which is acceptable for dashboard queries.
3. **UsageStore Protocol**: Define `record(agent, model, usage: UsageRecord) -> None`, `query(agent=None, model=None, start=None, end=None) -> list[UsageRecord]`, `get_summary(agent=None, start=None, end=None) -> UsageSummary`. Both sync and async versions (sync for backward compat, async for Redis).
4. **Migration from UsageTracker**: `UsageTracker` becomes a thin wrapper that delegates to `UsageStore`. Existing `record()` and `get_usage()` APIs preserved. Internal `_records` list replaced by store backend.
5. **TTL management**: Each date key gets TTL of 90 days (configurable). This prevents unbounded Redis memory growth while preserving 3 months of usage data.

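
One way the per-date hash layout could look in code is sketched below. It is a hedged illustration rather than the planned `RedisUsageStore`: the class name `UsageStoreSketch`, the plain-dict usage payload, and the `{agent}:{model}:{metric}` field naming are assumptions made for this example. The client is injected; with redis-py one would pass a `redis.Redis` instance, whose `pipeline()`, `hincrby()`, `hincrbyfloat()`, `expire()`, and `hgetall()` methods are the only calls used here.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Any, Optional

INT_METRICS = ("prompt_tokens", "completion_tokens", "total_tokens")
FLOAT_METRICS = ("cost", "latency_ms")


def usage_key(day: date) -> str:
    """One hash per calendar day: agentkit:usage:{YYYY-MM-DD}."""
    return f"agentkit:usage:{day.isoformat()}"


def _text(value: Any) -> str:
    """Redis clients may return bytes; normalize field names to str."""
    return value.decode("utf-8") if isinstance(value, bytes) else str(value)


class UsageStoreSketch:
    def __init__(self, client: Any, ttl_days: int = 90) -> None:
        self._client = client  # injected Redis-like client
        self._ttl_seconds = ttl_days * 86400

    def record(self, agent: str, model: str, usage: dict,
               day: Optional[date] = None) -> None:
        """Increment every metric for (agent, model) in a single pipeline."""
        key = usage_key(day or date.today())
        prefix = f"{agent}:{model}"
        pipe = self._client.pipeline()
        for metric in INT_METRICS:
            pipe.hincrby(key, f"{prefix}:{metric}", int(usage.get(metric, 0)))
        for metric in FLOAT_METRICS:
            pipe.hincrbyfloat(key, f"{prefix}:{metric}", float(usage.get(metric, 0.0)))
        pipe.hincrby(key, f"{prefix}:count", 1)
        pipe.expire(key, self._ttl_seconds)  # bound memory by date-key TTL
        pipe.execute()

    def get_summary(self, start: date, end: date,
                    agent: Optional[str] = None) -> dict:
        """HGETALL each date key in range, then aggregate client-side."""
        totals: dict = defaultdict(float)
        day = start
        while day <= end:
            for raw_field, raw_value in self._client.hgetall(usage_key(day)).items():
                # Assumption: agent names contain no ":"; model names may.
                field_agent, rest = _text(raw_field).split(":", 1)
                _model, metric = rest.rsplit(":", 1)
                if agent is None or field_agent == agent:
                    totals[metric] += float(raw_value)
            day += timedelta(days=1)
        return dict(totals)
```

Keeping each metric in its own hash field is what lets the increments run server-side; a JSON-encoded value would need a read-modify-write cycle and would lose the atomicity the plan relies on.
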

**Patterns to follow:**

- `SessionStore` Protocol in `src/agentkit/session/store.py`: Protocol definition pattern
- `RedisSessionStore._get_redis()`: lazy Redis initialization
- `create_session_store()`: factory function pattern
- `agentkit:usage:` key namespace convention

**Test scenarios:**

- Record and query: record usage → query returns matching records
- Date partitioning: records on different dates stored in different keys
- Aggregation: multiple records for same agent/model aggregated correctly
- Agent filter: query with agent filter returns only that agent's records
- Date range filter: query with start/end returns only records in range
- TTL: date keys have correct TTL set
- Redis unavailable: falls back to in-memory store with warning
- Concurrent writes: two concurrent records for same agent/model don't lose data
- Empty query: query with no matching records returns empty list

**Verification:** Usage data survives process restart; `get_usage()` returns same shape as before; Redis memory usage bounded by TTL.

---

### U5. CascadeStateStore Persistence

**Goal:** Persist CascadeDetector state to Redis using atomic operations, enabling multi-instance cascade detection.

**Dependencies:** None

**Files:**

- `src/agentkit/quality/cascade_store.py` (new): `CascadeStateStore` Protocol, `InMemoryCascadeStore`, `RedisCascadeStore`
- `src/agentkit/quality/cascade_detector.py` (modify): delegate to `CascadeStateStore` backend
- `tests/unit/quality/test_cascade_store.py` (new): unit tests for cascade store backends

**Approach:**

Architecture design before coding:

1. **Redis data model reasoning**: Use simple string keys with INCR for atomic counting. Key: `agentkit:cascade:{session_id}:interactions` (INCR + TTL), `agentkit:cascade:{session_id}:depth` (GET/SET + TTL). TTL aligned with session TTL (default 86400s). INCR is atomic, so there are no race conditions across instances.
2.
   **Protocol design**: `CascadeStateStore` with `increment_interactions(session_id) -> int`, `get_interactions(session_id) -> int`, `set_depth(session_id, depth) -> None`, `get_depth(session_id) -> int`, `reset(session_id) -> None`, `get_stats(session_id) -> CascadeStats`.
3. **Integration into CascadeDetector**: Replace internal `_interaction_counts` and `_loop_depths` dicts with `CascadeStateStore` backend. All methods delegate to store. `CascadeDetector` becomes stateless; all state lives in the store.
4. **Session TTL alignment**: When `increment_interactions()` is called, refresh the key TTL to match session TTL. This ensures state is cleaned up when sessions expire.

**Patterns to follow:**

- Same Protocol + factory + fallback pattern as U4
- Redis INCR atomic operation pattern
- `agentkit:cascade:` key namespace

**Test scenarios:**

- Increment and get: increment interactions → get returns correct count
- Set and get depth: set depth → get returns correct depth
- Reset: reset session → interactions and depth both cleared
- TTL: keys have TTL set, expire after session timeout
- Multi-instance: two instances incrementing same session see consistent count
- Redis unavailable: falls back to in-memory store
- Session isolation: different sessions have independent state

**Verification:** Cascade detection state survives process restart; multi-instance deployment detects cascades correctly; no false positives from state loss.

---

### U6. EvolutionStore Interface Unification

**Goal:** Unify `EvolutionStore` and `PersistentEvolutionStore` interfaces, add PostgreSQL backend with full feature set.


**Dependencies:** None

**Files:**

- `src/agentkit/evolution/evolution_store.py` (modify): define unified `EvolutionStoreProtocol`, refactor existing stores
- `src/agentkit/evolution/models.py` (modify): add `SkillVersionModel` and `ABTestResultModel` to async PG models
- `src/agentkit/evolution/pg_store.py` (new): `PostgreSQLEvolutionStore` implementing unified Protocol with async SQLAlchemy
- `tests/unit/evolution/test_unified_store.py` (new): tests for unified interface

**Approach:**

Architecture design before coding:

1. **Protocol design reasoning**: Current `EvolutionStore` (async PG) has `record()`, `rollback()`, `list_events()`. `PersistentEvolutionStore` (sync SQLite) adds `record_skill_version()`, `list_skill_versions()`, `record_ab_test_result()`, `get_ab_test_results()`. The unified Protocol must include ALL methods from both. Each backend implements what it can; unsupported methods raise `NotImplementedError` with a clear message.
2. **PostgreSQL model migration**: Add `SkillVersionModel` and `ABTestResultModel` to `src/agentkit/evolution/models.py` using async SQLAlchemy (matching `EpisodeModel` pattern in memory/models.py). These models already exist for SQLite; the PG versions use the same schema but with async engine.
3. **PostgreSQLEvolutionStore**: New class using async SQLAlchemy session (injected via constructor, same pattern as existing `EvolutionStore`). Implements all Protocol methods. Uses `run_in_executor` for any sync ORM operations if needed.
4. **Factory update**: `create_evolution_store(backend="memory"|"sqlite"|"postgresql", ...)` returns the appropriate backend. `"postgresql"` creates `PostgreSQLEvolutionStore` with async engine.
5. **Backward compatibility**: Existing `EvolutionStore` class is not removed; it becomes an internal implementation detail. The Protocol is the public interface. Code using `EvolutionStore` directly continues to work.

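
A unified Protocol along these lines might look like the sketch below. The method names come from the plan; every parameter name, return type, and the `rolled_back` flag are assumptions made for illustration, since the plan does not fix the signatures. The in-memory class is only there to show a Protocol compliance check.

```python
import asyncio
import uuid
from typing import Any, Optional, Protocol, runtime_checkable


@runtime_checkable
class EvolutionStoreProtocol(Protocol):
    """Union of the async PG store and the SQLite store method sets."""

    async def record(self, event: dict) -> str: ...
    async def rollback(self, event_id: str) -> bool: ...
    async def list_events(self, skill: Optional[str] = None) -> list: ...
    async def record_skill_version(self, skill: str, version: str, payload: dict) -> None: ...
    async def list_skill_versions(self, skill: str) -> list: ...
    async def record_ab_test_result(self, test_id: str, result: dict) -> None: ...
    async def get_ab_test_results(self, test_id: str) -> list: ...


class InMemoryEvolutionStore:
    """Hypothetical reference backend used to exercise the Protocol."""

    def __init__(self) -> None:
        self._events: dict = {}
        self._versions: dict = {}
        self._ab_results: dict = {}

    async def record(self, event: dict) -> str:
        event_id = uuid.uuid4().hex
        self._events[event_id] = {**event, "id": event_id, "rolled_back": False}
        return event_id

    async def rollback(self, event_id: str) -> bool:
        # Assumption: rollback marks the event instead of deleting it.
        if event_id not in self._events:
            return False
        self._events[event_id]["rolled_back"] = True
        return True

    async def list_events(self, skill: Optional[str] = None) -> list:
        return [dict(e) for e in self._events.values()
                if skill is None or e.get("skill") == skill]

    async def record_skill_version(self, skill: str, version: str, payload: dict) -> None:
        self._versions.setdefault(skill, []).append({"version": version, **payload})

    async def list_skill_versions(self, skill: str) -> list:
        return list(self._versions.get(skill, []))

    async def record_ab_test_result(self, test_id: str, result: dict) -> None:
        self._ab_results.setdefault(test_id, []).append(dict(result))

    async def get_ab_test_results(self, test_id: str) -> list:
        return list(self._ab_results.get(test_id, []))


def check_compliance(store: Any) -> bool:
    """Structural check: every Protocol method must be present on the store."""
    return isinstance(store, EvolutionStoreProtocol)
```

Note that `runtime_checkable` only verifies that the methods exist, not their signatures or whether they raise `NotImplementedError`, so a behavioral compliance test per backend is still needed.
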

**Patterns to follow:**

- `EpisodeModel` in `src/agentkit/memory/models.py`: async PG model pattern
- `create_evolution_store()` factory: extend with new backend
- `PersistentEvolutionStore._run_sync()`: sync/async bridge pattern

**Test scenarios:**

- Protocol compliance: all backends implement all Protocol methods
- PG store: record event → list events returns recorded event
- PG store: record skill version → list versions returns version history
- PG store: record AB test result → get results returns test data
- SQLite store: existing functionality preserved after refactor
- Memory store: existing functionality preserved after refactor
- Factory: `create_evolution_store(backend="postgresql")` returns correct type
- PG unavailable: falls back to SQLite with warning

**Verification:** All backends pass unified Protocol compliance test; existing evolution tests pass; PG store supports skill_version and ab_test operations.

---

### U7. Configuration Integration and End-to-End Verification

**Goal:** Wire all three features into the application configuration, add `agentkit.yaml` schema support, and verify end-to-end behavior.

**Dependencies:** U1, U2, U3, U4, U5, U6

**Files:**

- `src/agentkit/server/app.py` (modify): initialize cache, usage store, cascade store with config
- `src/agentkit/cli/main.py` (modify): pass config to gateway and router
- `agentkit.yaml` (modify): add cache, semantic_routing, usage_store, cascade_store config sections
- `tests/integration/test_p0_hardening.py` (new): end-to-end integration tests

**Approach:**

1.
   **Configuration schema**: Add to `agentkit.yaml`:

   ```yaml
   llm:
     cache:
       enabled: true
       backend: "auto"  # auto | redis | memory
       exact_ttl: 3600
       semantic_ttl: 86400
       similarity_threshold: 0.92
       max_entries: 10000
   routing:
     semantic:
       enabled: true
       similarity_high: 0.85  # direct match threshold
       similarity_low: 0.6  # hint threshold
   usage_store:
     backend: "auto"  # auto | redis | memory
     ttl_days: 90
   cascade_store:
     backend: "auto"  # auto | redis | memory
     session_ttl: 86400
   evolution_store:
     backend: "auto"  # auto | postgresql | sqlite | memory
   ```

2. **Application wiring**: In `app.py` lifespan, initialize all stores and inject into gateway/router. Follow existing pattern of creating components from config.
3. **End-to-end verification**: Integration test that exercises the full flow: user query → semantic routing → LLM cache → usage tracking → cascade detection → evolution logging.

**Test scenarios:**

- Full flow with Redis: all features use Redis backend, data persists across simulated restart
- Full flow without Redis: all features fall back to in-memory, no errors
- Config from YAML: `agentkit.yaml` parsed correctly, all features configured
- Cache + routing interaction: cached response for semantically routed query works correctly
- Usage tracking with cache: cached requests show 0 cost in usage summary
- Cascade detection across instances: simulated multi-instance scenario detects cascade correctly

**Verification:** All integration tests pass; application starts with new config; features degrade gracefully when backends are unavailable.

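
The `"auto"` | `"redis"` | `"memory"` backend selection with graceful degradation could be wired roughly as follows. This is a sketch under assumptions: `create_store` and its factory arguments are hypothetical names, and treating an explicit `"redis"` setting as strict (raise instead of fall back) is a choice made for this example, not something the plan specifies.

```python
import logging
from typing import Callable, Tuple, Type, TypeVar

logger = logging.getLogger("store_factory_sketch")
T = TypeVar("T")


def create_store(
    backend: str,
    make_redis: Callable[[], T],
    make_memory: Callable[[], T],
    name: str = "store",
    connection_errors: Tuple[Type[BaseException], ...] = (OSError,),
) -> T:
    """Pick a backend by config value, degrading to memory in "auto" mode.

    `connection_errors` defaults to OSError, which also covers the builtin
    ConnectionError. Client libraries often define their own exception
    classes, so real wiring would pass those in as well.
    """
    if backend == "memory":
        return make_memory()
    if backend not in ("auto", "redis"):
        raise ValueError(f"unknown backend for {name}: {backend!r}")
    try:
        return make_redis()
    except connection_errors as exc:
        if backend == "redis":
            # Assumption: an explicit "redis" setting should fail loudly.
            raise
        logger.warning("%s: Redis unavailable (%s), using in-memory backend", name, exc)
        return make_memory()
```

In the application lifespan, each of the cache, usage store, and cascade store would go through the same helper, so a missing Redis in development degrades all three consistently.
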

---

## Risks & Mitigations

| Risk | Impact | Likelihood | Mitigation |
|------|--------|-----------|------------|
| Semantic cache returns stale/wrong response | High: user gets incorrect answer | Medium: embedding similarity doesn't guarantee semantic equivalence | Default to temperature=0 only for semantic cache; configurable threshold; TTL expiry; admin invalidation API |
| Redis single point of failure | High: all persistence lost | Low: Redis is typically HA | Auto-fallback to in-memory; health check in doctor command; alert on fallback activation |
| Embedding API latency adds to routing time | Medium: slower routing for first query | Medium: embedding API ~100ms | Pre-compute skill embeddings; cache query embeddings; async embedding with timeout |
| UsageStore Redis memory growth | Medium: Redis OOM | Low: TTL + date partitioning bounds growth | 90-day TTL default; monitoring on Redis memory; configurable TTL |
| EvolutionStore interface unification breaks existing code | High: evolution system stops working | Low: Protocol is backward compatible | Keep existing classes as internal implementations; comprehensive test coverage before refactor |

---

## Open Questions

- Should semantic cache also cache streaming responses (requires chunk collection)? Deferred: current plan only caches non-streaming `chat()`.
- Should UsageStore support real-time streaming of usage data (e.g., via Redis Pub/Sub)? Deferred: current plan only supports query-based access.
- What is the optimal embedding model for Chinese+English mixed text? `text-embedding-3-small` is adequate but not optimal. Consider `bge-m3` or `multilingual-e5` as alternatives. Deferred to implementation-time benchmarking.


---

## Sources & Research

- Industry benchmarking: LangChain, Dify, CrewAI, Letta, AutoGen feature comparison (2025-2026)
- Project audit: 12 core files analyzed across memory, evolution, routing, quality, and LLM subsystems
- Existing patterns: `EmbeddingCache`, `RedisSessionStore`, `create_evolution_store()`, `SessionStore` Protocol