---
title: "feat: P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence"
status: active
created_at: 2026-06-14
type: feat
origin: "Industry research and project review (2026-06-14)"
depth: deep
---

# P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence

## Summary

Three P0 gaps were identified from industry benchmarking and a project audit: (1) LLM response caching to cut token cost by an estimated 30-50%, (2) embedding-based semantic routing to improve intent matching quality at zero LLM cost, (3) critical state persistence for UsageTracker, EvolutionStore, and CascadeDetector to survive restarts and enable multi-instance deployment. Each unit requires detailed architecture design and code reasoning before implementation: design first, code second.

## Problem Frame

AgentKit has strong differentiation in self-evolution, quality management, and multi-paradigm engines, but three production-critical gaps prevent enterprise deployment:

1. **Every LLM request hits the provider** — no caching. Identical or similar requests waste tokens and money. Competitors like Dify have built-in caching.
2. **Routing relies on keyword matching and LLM classification** — no semantic understanding. Embedding-based routing is industry standard (Agentic RAG trend) and AgentKit already has embedding infrastructure but doesn't use it for routing.
3. **Critical state lives in memory** — UsageTracker, CascadeDetector, and EvolutionStore lose data on restart. Multi-instance deployment is impossible without shared state.

These gaps are P0 because they directly impact cost (caching), quality (routing accuracy), and reliability (state persistence) — the three pillars of production readiness.

---

## Requirements

- R1. LLM cache must support exact-match (hash-based) and semantic-match (embedding-based) cache hits
- R2. LLM cache must integrate transparently into `LLMGateway.chat()` without changing the public API
- R3. LLM cache must record usage on cache hits (0 cost) to maintain usage tracking integrity
- R4. Semantic routing must insert between Layer 1 and Layer 2 in `CostAwareRouter`
- R5. Semantic routing must use existing `OpenAIEmbedder` and `compute_cosine_similarity()` infrastructure
- R6. Semantic routing must pre-compute skill embeddings at registration time, not at query time
- R7. UsageTracker must persist records to Redis with O(1) write and efficient aggregation
- R8. CascadeDetector must persist state to Redis using atomic INCR operations
- R9. EvolutionStore must support both PostgreSQL and SQLite backends with a unified interface
- R10. All three features must degrade gracefully when Redis/PG is unavailable (fallback to in-memory)
- R11. Each unit must have detailed architecture design and code reasoning documented before implementation begins

---

## Key Technical Decisions

- **KTD-1. Cache key design**: Use `SHA256(model + system_prompt_hash + messages_content_hash + temperature + tools_hash)` for exact match. For semantic match, embed the last user message and compare against cached embeddings using cosine similarity > threshold. Rationale: exact match is fast and deterministic; semantic match catches paraphrased requests. Both are needed because exact match alone misses too many hits, and semantic match alone is too slow for every request.

- **KTD-2. Cache storage backend**: Implement `LLMCache` as a Protocol with `InMemoryLLMCache` and `RedisLLMCache` backends. In-memory uses `OrderedDict` with LRU eviction (following `EmbeddingCache` pattern). Redis uses `agentkit:llm_cache:{hash}` keys with TTL. Rationale: follows existing factory pattern (`create_message_bus`, `create_session_store`); in-memory for dev/single-instance, Redis for production.

- **KTD-3. Semantic routing insertion point**: Insert as Layer 1.5 between `HeuristicClassifier` and `_classify_merged()`. When Layer 1 returns medium complexity (0.3-0.7), try semantic routing first. If similarity > 0.85, return skill match directly (skip LLM). If similarity 0.6-0.85, pass skill hint to Layer 2 LLM (reduces LLM classification tokens). If < 0.6, proceed to Layer 2 unchanged. Rationale: this placement maximizes cost savings by avoiding LLM calls when semantic match is confident, while preserving the existing fallback chain.

- **KTD-4. Skill embedding source text**: Embed `f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}"` for each skill. Cache embeddings in a dict keyed by skill name, re-embed on skill registration/update. Rationale: combines all semantic signals; description alone misses keyword intent; keywords alone misses semantic meaning.

- **KTD-5. UsageTracker persistence strategy**: Use one Redis Hash per day for time-series data. Key pattern: `agentkit:usage:{date}`. Because `HINCRBYFLOAT` increments a numeric field value and cannot update a JSON blob in place, store each metric in its own field, `{agent}:{model}:{metric}`, for the metrics `tokens`, `cost`, `latency_ms`, and `count`. Write via `HINCRBYFLOAT` for atomic increment. Query via `HGETALL` + client-side aggregation. Rationale: O(1) write, acceptable query performance, natural TTL by date, follows Redis patterns in project.

- **KTD-6. CascadeDetector persistence strategy**: Use Redis atomic operations. Key pattern: `agentkit:cascade:{session_id}:interactions` (INCR + TTL) and `agentkit:cascade:{session_id}:depth` (SET/GET + TTL). Rationale: INCR is atomic, no race conditions across instances; TTL prevents memory leaks; matches session lifecycle.

- **KTD-7. EvolutionStore interface unification**: Extend the base `EvolutionStore` Protocol to include `skill_version` and `ab_test` methods. Make `PersistentEvolutionStore` (SQLite) implement the unified Protocol. Add a new `PostgreSQLEvolutionStore` that uses async SQLAlchemy like the existing `EvolutionStore` but with the full unified interface. Rationale: current split (sync SQLite vs async PG) creates maintenance burden; unified Protocol enables backend-agnostic usage.

- **KTD-8. Graceful degradation pattern**: All three features use the same pattern — try preferred backend, catch connection error, log warning, fall back to in-memory. Controlled by `cache.backend`, `usage_store.backend`, `cascade_store.backend` config values (`"auto"` | `"redis"` | `"memory"`). `"auto"` tries Redis, falls back to memory. Rationale: production needs persistence, but dev/testing shouldn't require Redis.
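
The degradation pattern in KTD-8 could be shared by all three features through one small helper. The sketch below is illustrative only: `create_with_fallback` and its factory arguments are hypothetical names, and it assumes that an explicit `"redis"` setting should surface the connection failure while `"auto"` falls back silently with a warning (the plan does not state this either way).

```python
import logging
from typing import Callable, TypeVar

logger = logging.getLogger("agentkit.stores")
T = TypeVar("T")


def create_with_fallback(
    backend: str,
    make_redis: Callable[[], T],
    make_memory: Callable[[], T],
) -> T:
    """Resolve a backend setting ("auto" | "redis" | "memory") to a store."""
    if backend == "memory":
        return make_memory()
    if backend not in ("auto", "redis"):
        raise ValueError(f"unknown backend: {backend!r}")
    try:
        return make_redis()
    except (ConnectionError, OSError) as exc:
        # A real implementation would also catch the Redis client's own
        # connection exception type here.
        if backend == "redis":
            raise  # assumption: explicit "redis" should not hide the failure
        logger.warning("Redis unavailable (%s); falling back to in-memory", exc)
        return make_memory()
```

With this shape, each `create_*` factory only has to supply the two constructors, so the fallback and logging behavior stays identical across the cache, usage store, and cascade store.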

---

## High-Level Technical Design

### LLM Cache Flow

```mermaid
flowchart TB
    A[LLMGateway.chat] --> B{Cache enabled?}
    B -->|no| F[Call Provider]
    B -->|yes| C[Generate exact key]
    C --> D{Exact match?}
    D -->|hit| E[Return cached response]
    D -->|miss| G[Generate embedding of last user msg]
    G --> H{Semantic match? similarity > 0.92}
    H -->|hit| E
    H -->|miss| F
    F --> I[Write to cache]
    I --> J[Record usage]
    E --> K[Record usage with 0 cost]
```
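
The lookup order in the diagram (exact key first, embedding comparison only on a miss) can be sketched in a few lines. This is a minimal in-memory illustration, not the planned `InMemoryLLMCache`: `TwoTierCache` is a hypothetical name, `cosine` is a local stand-in for the project's `compute_cosine_similarity()`, and LRU eviction and TTL are left out.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Local stand-in for compute_cosine_similarity()."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class TwoTierCache:
    threshold: float = 0.92
    exact: Dict[str, str] = field(default_factory=dict)               # key -> response
    embeddings: Dict[str, List[float]] = field(default_factory=dict)  # key -> embedding

    def put(self, key: str, embedding: Sequence[float], response: str) -> None:
        self.exact[key] = response
        self.embeddings[key] = list(embedding)

    def get(
        self, key: str, embedding: Optional[Sequence[float]] = None
    ) -> Optional[Tuple[str, str]]:
        # Tier 1: exact hash match, no embedding needed.
        if key in self.exact:
            return "exact", self.exact[key]
        if embedding is None:
            return None
        # Tier 2: linear scan for the most similar entry above the threshold.
        best_key, best_sim = None, self.threshold
        for cached_key, cached_emb in self.embeddings.items():
            sim = cosine(embedding, cached_emb)
            if sim > best_sim:
                best_key, best_sim = cached_key, sim
        if best_key is None:
            return None
        return "semantic", self.exact[best_key]
```

Passing `embedding=None` skips the second tier entirely, which is one way a caller could restrict a request to exact-match lookups.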

### Semantic Routing Flow

```mermaid
flowchart TB
    A[CostAwareRouter.route] --> B[Layer 0: Regex rules]
    B -->|matched| Z[Return DIRECT_CHAT]
    B -->|unmatched| C[Layer 1: HeuristicClassifier]
    C -->|low complexity| Z
    C -->|medium-high| D[Layer 1.5: Semantic Router NEW]
    D -->|sim > 0.85| E[Return SKILL_REACT with matched skill]
    D -->|sim 0.6-0.85| F[Pass skill_hint to Layer 2]
    D -->|sim < 0.6| G[Layer 2: LLM classification]
    F --> G
    G --> H[Return routing result]
```

### State Persistence Architecture

```mermaid
flowchart TB
    subgraph "Current (In-Memory)"
        UT1[UsageTracker dict]
        CD1[CascadeDetector dict]
        ES1[EvolutionStore SQLite]
    end
    subgraph "Target (Persistent)"
        UT2[UsageStore Protocol]
        CD2[CascadeStateStore Protocol]
        ES2[UnifiedEvolutionStore Protocol]
        UT2 -->|redis| R1[Redis Hash agentkit:usage:date]
        UT2 -->|memory| M1[InMemoryUsageStore]
        CD2 -->|redis| R2[Redis INCR agentkit:cascade:session]
        CD2 -->|memory| M2[InMemoryCascadeStore]
        ES2 -->|postgresql| P1[PG EvolutionEventModel + SkillVersionModel]
        ES2 -->|sqlite| S1[PersistentEvolutionStore]
        ES2 -->|memory| M3[InMemoryEvolutionStore]
    end
```

---

## Scope Boundaries

### In Scope

- LLM response caching (exact + semantic match, in-memory + Redis backends)
- Semantic routing as Layer 1.5 in CostAwareRouter
- UsageTracker Redis persistence
- CascadeDetector Redis persistence
- EvolutionStore interface unification
- Configuration for all three features
- Architecture design documents for each unit before coding

### Deferred for Follow-Up

- Semantic cache using pgvector (current semantic match uses in-memory embedding comparison)
- Cache warming / pre-population strategies
- Routing cache (caching routing results for similar queries)
- Usage analytics dashboard (visualization of usage data)
- Multi-tenant resource quotas
- Rate limiting and concurrency control (P2)
- Distributed tracing visualization (P2)

---

## Implementation Units

### U1. LLM Cache Core

**Goal:** Implement the `LLMCache` Protocol, `InMemoryLLMCache`, and `RedisLLMCache` with exact-match and semantic-match capabilities.

**Dependencies:** None

**Files:**
- `src/agentkit/llm/cache.py` (new) — `LLMCache` Protocol, `InMemoryLLMCache`, `RedisLLMCache`, `CacheResult`, `CacheKey` generation
- `src/agentkit/llm/cache_key.py` (new) — `generate_cache_key()`, `generate_messages_hash()`, `generate_system_prompt_hash()`
- `tests/unit/llm/test_cache.py` (new) — unit tests for cache backends

**Approach:**

Architecture design before coding:

1. **CacheKey design reasoning**: The key must capture all inputs that affect LLM output. `model` determines which model responds. `system_prompt` sets behavior. `messages` carry the conversation. `temperature` affects randomness (semantic matching applies only at temperature=0). `tools` affect tool_call availability. Hash each component independently so that an unchanged component (for example, a stable system prompt) can reuse its hash; a change to any component still yields a different final key.

2. **Exact match implementation**: SHA-256 hash of concatenated component hashes. Store as `agentkit:llm_cache:{sha256_hex}` in Redis with TTL. In-memory uses OrderedDict keyed by hash string.

3. **Semantic match implementation**: For cache misses on exact match, embed the last user message using `OpenAIEmbedder`. Compare against cached embeddings using `compute_cosine_similarity()`. Store embeddings alongside cached responses. In-memory: linear scan of all cached embeddings. Redis: store embeddings in a separate key `agentkit:llm_cache_emb:{sha256_hex}`.

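4. **Cache write policy**: Every response is written to the exact-match cache. Semantic matching is attempted only when `temperature == 0` (deterministic output). For temperature > 0, only the exact-match cache applies, since outputs are non-deterministic and a paraphrased request should not receive a previously sampled answer.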

5. **Cache invalidation**: TTL-based (configurable, default 3600s for exact, 86400s for semantic). Manual invalidation via `invalidate(pattern=None)` for admin operations.
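
The exact-match key from steps 1 and 2 might look like the following sketch. The function name `generate_cache_key()` comes from the file list above, but its signature, the canonical JSON serialization, and the newline separator are assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, List, Optional


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _canonical(value: Any) -> str:
    # sort_keys gives a stable serialization regardless of dict key order.
    return json.dumps(value, sort_keys=True, ensure_ascii=False, separators=(",", ":"))


def generate_cache_key(
    model: str,
    messages: List[dict],
    system_prompt: str = "",
    temperature: float = 0.0,
    tools: Optional[List[dict]] = None,
) -> str:
    """Hash each component separately, then hash the joined component hashes."""
    parts = [
        model,
        _sha256(system_prompt),
        _sha256(_canonical(messages)),
        repr(float(temperature)),
        _sha256(_canonical(tools or [])),
    ]
    return "agentkit:llm_cache:" + _sha256("\n".join(parts))
```

Canonical serialization matters here: two requests that differ only in dict key order should map to the same key, otherwise the exact-match hit rate drops for no semantic reason.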

**Patterns to follow:**
- `EmbeddingCache` in `src/agentkit/memory/embedder.py` — LRU + TTL pattern
- `create_session_store()` factory in `src/agentkit/session/store.py` — backend factory pattern
- `RedisSessionStore._get_redis()` — lazy Redis initialization

**Test scenarios:**
- Exact match: same messages + model → cache hit, returns identical response
- Exact miss: different messages → cache miss, calls provider, writes to cache
- Semantic match: paraphrased question (similarity > 0.92) → cache hit
- Semantic miss: unrelated question (similarity < 0.6) → cache miss
- Temperature > 0: only exact match attempted, no semantic match
- TTL expiry: cached entry expires after TTL, next request is a miss
- Redis unavailable: falls back to in-memory cache with warning log
- Cache with tool_calls: response containing tool_calls is cached correctly
- Concurrent access: two concurrent requests for same key don't cause double-write issues

**Verification:** Unit tests pass; cache hit rate metric is observable; no change to `LLMGateway` public API.

---

### U2. LLM Cache Integration

**Goal:** Integrate `LLMCache` into `LLMGateway.chat()` transparently, with usage tracking on cache hits.

**Dependencies:** U1

**Files:**
- `src/agentkit/llm/gateway.py` (modify) — inject cache check before provider call, cache write after provider response
- `src/agentkit/llm/config.py` (modify) — add `CacheConfig` to `LLMConfig`
- `src/agentkit/server/app.py` (modify) — pass cache config to `LLMGateway`
- `tests/unit/llm/test_gateway_cache.py` (new) — integration tests for cached gateway

**Approach:**

Architecture design before coding:

1. **Insertion point reasoning**: Cache check must happen AFTER `LLMRequest` construction (line ~79 in gateway.py) but BEFORE provider call (line ~87). This ensures all request normalization (alias resolution, model fallback list) has completed. Cache write happens AFTER response validation but BEFORE usage tracking.

2. **Cache hit usage tracking**: On cache hit, call `_usage_tracker.record()` with the original `usage` data from the cached response but with `cost=0` and `latency_ms` from cache lookup time. This preserves usage query integrity — `get_usage()` still shows all requests, just with zero cost for cached ones.

3. **Stream handling**: `chat_stream()` is NOT cached in this iteration. Streaming requires collecting all chunks before caching, which adds latency and complexity. Document this as a known limitation.

4. **Configuration integration**: Add `CacheConfig` dataclass with `enabled: bool = False`, `backend: str = "auto"`, `exact_ttl: int = 3600`, `semantic_ttl: int = 86400`, `similarity_threshold: float = 0.92`, `max_entries: int = 10000`. Nest under `LLMConfig.cache`.
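
A minimal sketch of the `CacheConfig` dataclass described in step 4, using the field names and defaults listed there. The validation rules and the choice to ignore unknown keys in `from_dict()` are assumptions, not behavior confirmed by the existing `LLMConfig` code.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class CacheConfig:
    enabled: bool = False
    backend: str = "auto"            # auto | redis | memory
    exact_ttl: int = 3600            # seconds
    semantic_ttl: int = 86400        # seconds
    similarity_threshold: float = 0.92
    max_entries: int = 10000

    def __post_init__(self) -> None:
        if self.backend not in ("auto", "redis", "memory"):
            raise ValueError(f"unknown cache backend: {self.backend!r}")
        if not 0.0 < self.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be in (0, 1]")

    @classmethod
    def from_dict(cls, data: Optional[Mapping[str, Any]]) -> "CacheConfig":
        # Assumption: unknown keys are ignored rather than rejected.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in (data or {}).items() if k in known})
```

Keeping `enabled` at `False` by default means existing deployments see no behavior change until the cache section is added to their configuration.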

**Patterns to follow:**
- `LLMConfig` dataclass + `from_dict()` pattern for config
- `LLMGateway.__init__()` dependency injection pattern

**Test scenarios:**
- Cache disabled: requests pass through to provider normally
- Cache enabled, first request: cache miss, provider called, response cached
- Cache enabled, second identical request: cache hit, provider NOT called
- Cache hit usage tracking: usage record has 0 cost, correct token counts
- Cache miss + fallback: primary model fails, fallback model response cached under fallback model key
- Config from YAML: `LLMConfig.from_dict({"cache": {"enabled": true}})` works correctly

**Verification:** Integration tests pass; `LLMGateway.chat()` returns same `LLMResponse` shape whether cached or not; usage tracking includes cache hits.

---

### U3. Semantic Router

**Goal:** Implement embedding-based semantic routing as Layer 1.5 in `CostAwareRouter`, using existing `OpenAIEmbedder` and `compute_cosine_similarity()`.

**Dependencies:** None (independent of U1/U2, uses existing embedding infrastructure)

**Files:**
- `src/agentkit/chat/semantic_router.py` (new) — `SemanticRouter` class, `SkillEmbeddingIndex`
- `src/agentkit/chat/skill_routing.py` (modify) — integrate Layer 1.5 into `CostAwareRouter.route()`
- `tests/unit/chat/test_semantic_router.py` (new) — unit tests for semantic router

**Approach:**

Architecture design before coding:

1. **SkillEmbeddingIndex design reasoning**: Pre-compute embeddings for all registered skills at initialization. Source text: `f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}"`. Store as `dict[str, tuple[list[float], str]]` (skill_name → (embedding, source_text)). On skill registration/update, re-embed only the changed skill. This avoids O(n) embedding computation per query.

2. **Query-time flow**: Embed user query → compute cosine similarity against all skill embeddings → return top match if above threshold. The similarity scan is O(n) in the number of skills; with <100 skills and 1536-dim vectors it is expected to take <5ms on CPU, not counting the call that embeds the query. No need for an approximate nearest neighbor (ANN) index at this scale.

3. **Threshold design**: Three zones:
   - `similarity > 0.85`: HIGH confidence → return skill match directly, skip Layer 2 LLM
   - `0.6 <= similarity <= 0.85`: MEDIUM confidence → pass skill hint to Layer 2, reducing LLM classification tokens
   - `similarity < 0.6`: LOW confidence → no semantic signal, Layer 2 runs unmodified

4. **Integration into CostAwareRouter**: Modify the `route()` method. After Layer 1 (`HeuristicClassifier`) and before Layer 2 (`_classify_merged()`), if complexity is medium (0.3-0.7), call `semantic_router.route(query)`. Depending on the confidence zone, either return the skill match directly or add the skill hint to the Layer 2 prompt.

5. **Embedding provider**: Use `OpenAIEmbedder` by default. Support `MockEmbedder` for testing. Embedder is injected via constructor, not created internally.
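
The three confidence zones and the top-match scan can be sketched as below. `SemanticMatch`, `classify_zone`, and `best_match` are hypothetical names; `cosine` stands in for `compute_cosine_similarity()`, and the index is simplified to skill name → embedding (the plan's index also stores the source text).

```python
import math
from dataclasses import dataclass
from typing import Mapping, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Local stand-in for compute_cosine_similarity()."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass(frozen=True)
class SemanticMatch:
    skill: str
    similarity: float
    zone: str  # "high" | "medium" | "low"


def classify_zone(similarity: float, high: float = 0.85, low: float = 0.6) -> str:
    if similarity > high:
        return "high"      # return the skill directly, skip Layer 2
    if similarity >= low:
        return "medium"    # pass the skill as a hint to Layer 2
    return "low"           # no semantic signal, Layer 2 runs unmodified


def best_match(
    query_embedding: Sequence[float],
    index: Mapping[str, Sequence[float]],
) -> Optional[SemanticMatch]:
    """Linear scan over precomputed skill embeddings; None for an empty registry."""
    if not index:
        return None
    skill, sim = max(
        ((name, cosine(query_embedding, emb)) for name, emb in index.items()),
        key=lambda pair: pair[1],
    )
    return SemanticMatch(skill, sim, classify_zone(sim))
```

Note the boundary handling: a similarity of exactly 0.85 or exactly 0.6 falls in the medium zone, matching the `0.6 <= similarity <= 0.85` range in step 3.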

**Patterns to follow:**
- `OpenAIEmbedder` + `EmbeddingCache` pattern for embedding computation
- `compute_cosine_similarity()` in `src/agentkit/utils/vector_math.py`
- `CostAwareRouter` constructor injection pattern

**Test scenarios:**
- Exact skill match: query "生成一篇关于AI的文章" ("Generate an article about AI") matches `content_generator` skill (sim > 0.85)
- Partial skill match: query "优化内容" ("Optimize the content") matches `geo_optimizer` skill (sim 0.6-0.85), skill hint passed to LLM
- No skill match: query "今天天气怎么样" ("How is the weather today?") has sim < 0.6 for all skills, Layer 2 runs normally
- Skill registration: new skill added → embedding computed and indexed
- Skill update: skill description changed → embedding re-computed
- Empty skill registry: semantic router returns None gracefully
- Embedder failure: OpenAIEmbedder throws error → semantic router logs warning, returns None, Layer 2 runs normally
- Chinese query: "帮我写一篇文章" ("Help me write an article") matches `content_generator` skill correctly

**Verification:** Semantic router returns correct skill matches; Layer 2 LLM calls reduced by >50% for medium-complexity queries; no regression in routing accuracy.

---

### U4. UsageStore Persistence

**Goal:** Persist UsageTracker records to Redis, with in-memory fallback and efficient aggregation queries.

**Dependencies:** None

**Files:**
- `src/agentkit/llm/usage_store.py` (new) — `UsageStore` Protocol, `InMemoryUsageStore`, `RedisUsageStore`
- `src/agentkit/llm/providers/tracker.py` (modify) — delegate to `UsageStore` backend
- `tests/unit/llm/test_usage_store.py` (new) — unit tests for usage store backends

**Approach:**

Architecture design before coding:

1. **Redis data model reasoning**: Use one Redis Hash per date for time-partitioned storage. Key: `agentkit:usage:{YYYY-MM-DD}`. Fields: one per metric, `{agent_name}:{model}:{metric}`, where the metric is one of `prompt_tokens`, `completion_tokens`, `total_tokens`, `cost`, `latency_ms`, `count`. Each value is a plain number, because `HINCRBYFLOAT` and `HINCRBY` increment numeric field values and cannot update a JSON blob in place. Write via pipeline: `HINCRBYFLOAT` for the float metrics + `HINCRBY` for the integer counters. Each increment is O(1) and atomic, and the data is naturally partitioned by date.

2. **Aggregation query design**: For `get_usage(agent=None, start=None, end=None)`: read each date key in the range via `HGETALL`, filter by agent/model in application code, aggregate in memory. For single-agent queries, match on the field prefix (for example, `HSCAN` with a `MATCH` pattern). This is O(days × agents), which is acceptable for dashboard queries.

3. **UsageStore Protocol**: Define `record(agent, model, usage: UsageRecord) -> None`, `query(agent=None, model=None, start=None, end=None) -> list[UsageRecord]`, `get_summary(agent=None, start=None, end=None) -> UsageSummary`. Both sync and async versions (sync for backward compat, async for Redis).

4. **Migration from UsageTracker**: `UsageTracker` becomes a thin wrapper that delegates to `UsageStore`. Existing `record()` and `get_usage()` APIs preserved. Internal `_records` list replaced by store backend.

5. **TTL management**: Each date key gets TTL of 90 days (configurable). This prevents unbounded Redis memory growth while preserving 3 months of usage data.
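
To make the data model concrete, here is a hedged sketch of an in-memory store that mirrors the Redis layout: one hash per date and one numeric field per metric, so that every write maps onto an increment. The `record()` signature is simplified for illustration (the plan's Protocol takes a `UsageRecord`), and the sketch assumes agent names contain no colon.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Optional

VALUE_METRICS = ("prompt_tokens", "completion_tokens", "total_tokens", "cost", "latency_ms")


class InMemoryUsageStore:
    """One dict per date key, one numeric field per {agent}:{model}:{metric}."""

    def __init__(self) -> None:
        self._hashes: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))

    def record(self, agent: str, model: str, day: date, **values: float) -> None:
        unknown = [m for m in values if m not in VALUE_METRICS]
        if unknown:
            raise ValueError(f"unknown metrics: {unknown}")
        bucket = self._hashes[f"agentkit:usage:{day.isoformat()}"]
        for metric, amount in values.items():
            bucket[f"{agent}:{model}:{metric}"] += amount  # HINCRBYFLOAT equivalent
        bucket[f"{agent}:{model}:count"] += 1              # HINCRBY equivalent

    def get_summary(
        self, start: date, end: date, agent: Optional[str] = None
    ) -> Dict[str, float]:
        totals = {metric: 0.0 for metric in VALUE_METRICS + ("count",)}
        for key, bucket in self._hashes.items():
            day = date.fromisoformat(key.rsplit(":", 1)[1])
            if not (start <= day <= end):
                continue
            for field_name, amount in bucket.items():
                field_agent = field_name.split(":", 1)[0]  # assumes no ":" in agent names
                metric = field_name.rsplit(":", 1)[1]
                if agent is None or field_agent == agent:
                    totals[metric] += amount
        return totals
```

Because the in-memory and Redis backends would share the same key and field naming, the aggregation code can be tested once against this store and reused for the values returned by `HGETALL`.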

**Patterns to follow:**
- `SessionStore` Protocol in `src/agentkit/session/store.py` — Protocol definition pattern
- `RedisSessionStore._get_redis()` — lazy Redis initialization
- `create_session_store()` — factory function pattern
- `agentkit:usage:` key namespace convention

**Test scenarios:**
- Record and query: record usage → query returns matching records
- Date partitioning: records on different dates stored in different keys
- Aggregation: multiple records for same agent/model aggregated correctly
- Agent filter: query with agent filter returns only that agent's records
- Date range filter: query with start/end returns only records in range
- TTL: date keys have correct TTL set
- Redis unavailable: falls back to in-memory store with warning
- Concurrent writes: two concurrent records for same agent/model don't lose data
- Empty query: query with no matching records returns empty list

**Verification:** Usage data survives process restart; `get_usage()` returns same shape as before; Redis memory usage bounded by TTL.

---

### U5. CascadeStateStore Persistence

**Goal:** Persist CascadeDetector state to Redis using atomic operations, enabling multi-instance cascade detection.

**Dependencies:** None

**Files:**
- `src/agentkit/quality/cascade_store.py` (new) — `CascadeStateStore` Protocol, `InMemoryCascadeStore`, `RedisCascadeStore`
- `src/agentkit/quality/cascade_detector.py` (modify) — delegate to `CascadeStateStore` backend
- `tests/unit/quality/test_cascade_store.py` (new) — unit tests for cascade store backends

**Approach:**

Architecture design before coding:

1. **Redis data model reasoning**: Use simple string keys with INCR for atomic counting. Key: `agentkit:cascade:{session_id}:interactions` (INCR + TTL), `agentkit:cascade:{session_id}:depth` (GET/SET + TTL). TTL aligned with session TTL (default 86400s). INCR is atomic — no race conditions across instances.

2. **Protocol design**: `CascadeStateStore` with `increment_interactions(session_id) -> int`, `get_interactions(session_id) -> int`, `set_depth(session_id, depth) -> None`, `get_depth(session_id) -> int`, `reset(session_id) -> None`, `get_stats(session_id) -> CascadeStats`.

3. **Integration into CascadeDetector**: Replace internal `_interaction_counts` and `_loop_depths` dicts with `CascadeStateStore` backend. All methods delegate to store. `CascadeDetector` becomes stateless — all state lives in the store.

4. **Session TTL alignment**: When `increment_interactions()` is called, refresh the key TTL to match session TTL. This ensures state is cleaned up when sessions expire.
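
The Protocol from step 2 and the TTL refresh from step 4 can be sketched with an in-memory backend. This is an illustration under stated assumptions: `get_stats()` is omitted because `CascadeStats` is not defined in this plan, the injectable `clock` parameter is a hypothetical testing aid, and the class is not thread-safe.

```python
import time
from typing import Callable, Dict, Protocol, Tuple


class CascadeStateStore(Protocol):
    def increment_interactions(self, session_id: str) -> int: ...
    def get_interactions(self, session_id: str) -> int: ...
    def set_depth(self, session_id: str, depth: int) -> None: ...
    def get_depth(self, session_id: str) -> int: ...
    def reset(self, session_id: str) -> None: ...


class InMemoryCascadeStore:
    def __init__(
        self, session_ttl: float = 86400, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = session_ttl
        self._clock = clock
        self._values: Dict[str, Tuple[int, float]] = {}  # key -> (value, expires_at)

    def _get(self, key: str) -> int:
        value, expires_at = self._values.get(key, (0, 0.0))
        if self._clock() >= expires_at:
            self._values.pop(key, None)  # expired or missing
            return 0
        return value

    def _set(self, key: str, value: int) -> None:
        self._values[key] = (value, self._clock() + self._ttl)

    def increment_interactions(self, session_id: str) -> int:
        key = f"agentkit:cascade:{session_id}:interactions"
        value = self._get(key) + 1
        self._set(key, value)  # also refreshes the TTL, like INCR then EXPIRE
        return value

    def get_interactions(self, session_id: str) -> int:
        return self._get(f"agentkit:cascade:{session_id}:interactions")

    def set_depth(self, session_id: str, depth: int) -> None:
        self._set(f"agentkit:cascade:{session_id}:depth", depth)

    def get_depth(self, session_id: str) -> int:
        return self._get(f"agentkit:cascade:{session_id}:depth")

    def reset(self, session_id: str) -> None:
        for suffix in ("interactions", "depth"):
            self._values.pop(f"agentkit:cascade:{session_id}:{suffix}", None)
```

In the sketch only the interactions key has its TTL refreshed on increment; whether the depth key should be refreshed at the same time is a design detail the Redis backend would need to settle.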

**Patterns to follow:**
- Same Protocol + factory + fallback pattern as U4
- Redis INCR atomic operation pattern
- `agentkit:cascade:` key namespace

**Test scenarios:**
- Increment and get: increment interactions → get returns correct count
- Set and get depth: set depth → get returns correct depth
- Reset: reset session → interactions and depth both cleared
- TTL: keys have TTL set, expire after session timeout
- Multi-instance: two instances incrementing same session see consistent count
- Redis unavailable: falls back to in-memory store
- Session isolation: different sessions have independent state

**Verification:** Cascade detection state survives process restart; multi-instance deployment detects cascades correctly; no false positives from state loss.

---

### U6. EvolutionStore Interface Unification

**Goal:** Unify `EvolutionStore` and `PersistentEvolutionStore` interfaces, add PostgreSQL backend with full feature set.

**Dependencies:** None

**Files:**
- `src/agentkit/evolution/evolution_store.py` (modify) — define unified `EvolutionStoreProtocol`, refactor existing stores
- `src/agentkit/evolution/models.py` (modify) — add `SkillVersionModel` and `ABTestResultModel` to async PG models
- `src/agentkit/evolution/pg_store.py` (new) — `PostgreSQLEvolutionStore` implementing unified Protocol with async SQLAlchemy
- `tests/unit/evolution/test_unified_store.py` (new) — tests for unified interface

**Approach:**

Architecture design before coding:

1. **Protocol design reasoning**: Current `EvolutionStore` (async PG) has `record()`, `rollback()`, `list_events()`. `PersistentEvolutionStore` (sync SQLite) adds `record_skill_version()`, `list_skill_versions()`, `record_ab_test_result()`, `get_ab_test_results()`. The unified Protocol must include ALL methods from both. Each backend implements what it can; unsupported methods raise `NotImplementedError` with clear message.

2. **PostgreSQL model migration**: Add `SkillVersionModel` and `ABTestResultModel` to `src/agentkit/evolution/models.py` using async SQLAlchemy (matching `EpisodeModel` pattern in memory/models.py). These models already exist for SQLite; the PG versions use the same schema but with async engine.

3. **PostgreSQLEvolutionStore**: New class using async SQLAlchemy session (injected via constructor, same pattern as existing `EvolutionStore`). Implements all Protocol methods. Uses `run_in_executor` for any sync ORM operations if needed.

4. **Factory update**: `create_evolution_store(backend="memory"|"sqlite"|"postgresql", ...)` returns the appropriate backend. `"postgresql"` creates `PostgreSQLEvolutionStore` with async engine.

5. **Backward compatibility**: Existing `EvolutionStore` class is not removed — it becomes an internal implementation detail. The Protocol is the public interface. Code using `EvolutionStore` directly continues to work.
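
A unified Protocol and a simple compliance check could be sketched as follows. The method names are the ones listed in step 1; the parameter lists and the `missing_methods` helper are assumptions for illustration. Note that `isinstance()` against a `runtime_checkable` Protocol only checks that the attributes exist, not their signatures.

```python
from typing import Any, List, Protocol, runtime_checkable

PROTOCOL_METHODS = (
    "record",
    "rollback",
    "list_events",
    "record_skill_version",
    "list_skill_versions",
    "record_ab_test_result",
    "get_ab_test_results",
)


@runtime_checkable
class EvolutionStoreProtocol(Protocol):
    # Signatures below are illustrative assumptions, not the existing API.
    async def record(self, event: Any) -> None: ...
    async def rollback(self, event_id: str) -> None: ...
    async def list_events(self) -> List[Any]: ...
    async def record_skill_version(self, skill: str, version: Any) -> None: ...
    async def list_skill_versions(self, skill: str) -> List[Any]: ...
    async def record_ab_test_result(self, result: Any) -> None: ...
    async def get_ab_test_results(self, test_id: str) -> List[Any]: ...


def missing_methods(store: object) -> List[str]:
    """Names of Protocol methods that a backend does not provide."""
    return [name for name in PROTOCOL_METHODS if not callable(getattr(store, name, None))]
```

A helper like `missing_methods` gives the Protocol compliance test a readable failure message (the list of absent methods) instead of a bare `isinstance` failure.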

**Patterns to follow:**
- `EpisodeModel` in `src/agentkit/memory/models.py` — async PG model pattern
- `create_evolution_store()` factory — extend with new backend
- `PersistentEvolutionStore._run_sync()` — sync/async bridge pattern

**Test scenarios:**
- Protocol compliance: all backends implement all Protocol methods
- PG store: record event → list events returns recorded event
- PG store: record skill version → list versions returns version history
- PG store: record AB test result → get results returns test data
- SQLite store: existing functionality preserved after refactor
- Memory store: existing functionality preserved after refactor
- Factory: `create_evolution_store(backend="postgresql")` returns correct type
- PG unavailable: falls back to SQLite with warning

**Verification:** All backends pass unified Protocol compliance test; existing evolution tests pass; PG store supports skill_version and ab_test operations.

---

### U7. Configuration Integration and End-to-End Verification

**Goal:** Wire all three features into the application configuration, add `agentkit.yaml` schema support, and verify end-to-end behavior.

**Dependencies:** U1, U2, U3, U4, U5, U6

**Files:**
- `src/agentkit/server/app.py` (modify) — initialize cache, usage store, cascade store with config
- `src/agentkit/cli/main.py` (modify) — pass config to gateway and router
- `agentkit.yaml` (modify) — add cache, semantic_routing, usage_store, cascade_store config sections
- `tests/integration/test_p0_hardening.py` (new) — end-to-end integration tests

**Approach:**

1. **Configuration schema**: Add to `agentkit.yaml`:
```yaml
llm:
  cache:
    enabled: true
    backend: "auto"          # auto | redis | memory
    exact_ttl: 3600
    semantic_ttl: 86400
    similarity_threshold: 0.92
    max_entries: 10000

routing:
  semantic:
    enabled: true
    similarity_high: 0.85    # direct match threshold
    similarity_low: 0.6      # hint threshold

usage_store:
  backend: "auto"            # auto | redis | memory
  ttl_days: 90

cascade_store:
  backend: "auto"            # auto | redis | memory
  session_ttl: 86400

evolution_store:
  backend: "auto"            # auto | postgresql | sqlite | memory
```

2. **Application wiring**: In `app.py` lifespan, initialize all stores and inject into gateway/router. Follow existing pattern of creating components from config.

3. **End-to-end verification**: Integration test that exercises the full flow: user query → semantic routing → LLM cache → usage tracking → cascade detection → evolution logging.

**Test scenarios:**
- Full flow with Redis: all features use Redis backend, data persists across simulated restart
- Full flow without Redis: all features fall back to in-memory, no errors
- Config from YAML: `agentkit.yaml` parsed correctly, all features configured
- Cache + routing interaction: cached response for semantically routed query works correctly
- Usage tracking with cache: cached requests show 0 cost in usage summary
- Cascade detection across instances: simulated multi-instance scenario detects cascade correctly

**Verification:** All integration tests pass; application starts with new config; features degrade gracefully when backends unavailable.

---

## Risks & Mitigations

| Risk | Impact | Likelihood | Mitigation |
|------|--------|-----------|------------|
| Semantic cache returns stale/wrong response | High — user gets incorrect answer | Medium — embedding similarity doesn't guarantee semantic equivalence | Default to temperature=0 only for semantic cache; configurable threshold; TTL expiry; admin invalidation API |
| Redis single point of failure | High — all persistence lost | Low — Redis is typically HA | Auto-fallback to in-memory; health check in doctor command; alert on fallback activation |
| Embedding API latency adds to routing time | Medium — slower routing for first query | Medium — embedding API ~100ms | Pre-compute skill embeddings; cache query embeddings; async embedding with timeout |
| UsageStore Redis memory growth | Medium — Redis OOM | Low — TTL + date partitioning bounds growth | 90-day TTL default; monitoring on Redis memory; configurable TTL |
| EvolutionStore interface unification breaks existing code | High — evolution system stops working | Low — Protocol is backward compatible | Keep existing classes as internal implementations; comprehensive test coverage before refactor |

---

## Open Questions

- Should the LLM cache also cover streaming responses (requires chunk collection)? Deferred: the current plan only caches non-streaming `chat()`.
- Should UsageStore support real-time streaming of usage data (e.g., via Redis Pub/Sub)? Deferred — current plan only supports query-based access.
- What is the optimal embedding model for Chinese+English mixed text? `text-embedding-3-small` is adequate but not optimal. Consider `bge-m3` or `multilingual-e5` as alternatives. Deferred to implementation-time benchmarking.

---

## Sources & Research

- Industry benchmarking: LangChain, Dify, CrewAI, Letta, AutoGen feature comparison (2025-2026)
- Project audit: 12 core files analyzed across memory, evolution, routing, quality, and LLM subsystems
- Existing patterns: `EmbeddingCache`, `RedisSessionStore`, `create_evolution_store()`, `SessionStore` Protocol