---
title: "feat: P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence"
status: active
created_at: 2026-06-14
type: feat
origin: "Industry research and project review 2026-06-14"
depth: deep
---
# P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence
## Summary
Industry benchmarking and a project audit identified three P0 gaps: (1) LLM response caching, to reduce token cost by 30-50%; (2) embedding-based semantic routing, to improve intent-matching quality without an LLM call; (3) persistence of critical state in UsageTracker, EvolutionStore, and CascadeDetector, so that it survives restarts and supports multi-instance deployment. Each unit requires a detailed architecture design and code reasoning before implementation: design first, code second.
## Problem Frame
AgentKit has strong differentiation in self-evolution, quality management, and multi-paradigm engines, but three production-critical gaps prevent enterprise deployment:
1. **Every LLM request hits the provider** — no caching. Identical or similar requests waste tokens and money. Competitors like Dify have built-in caching.
2. **Routing relies on keyword matching and LLM classification** — no semantic understanding. Embedding-based routing is industry standard (Agentic RAG trend) and AgentKit already has embedding infrastructure but doesn't use it for routing.
3. **Critical state lives in memory** — UsageTracker, CascadeDetector, and EvolutionStore lose data on restart. Multi-instance deployment is impossible without shared state.
These gaps are P0 because they directly impact cost (caching), quality (routing accuracy), and reliability (state persistence) — the three pillars of production readiness.
---
## Requirements
- R1. LLM cache must support exact-match (hash-based) and semantic-match (embedding-based) cache hits
- R2. LLM cache must integrate transparently into `LLMGateway.chat()` without changing the public API
- R3. LLM cache must record usage on cache hits (0 cost) to maintain usage tracking integrity
- R4. Semantic routing must insert between Layer 1 and Layer 2 in `CostAwareRouter`
- R5. Semantic routing must use existing `OpenAIEmbedder` and `compute_cosine_similarity()` infrastructure
- R6. Semantic routing must pre-compute skill embeddings at registration time, not at query time
- R7. UsageTracker must persist records to Redis with O(1) write and efficient aggregation
- R8. CascadeDetector must persist state to Redis using atomic INCR operations
- R9. EvolutionStore must support both PostgreSQL and SQLite backends with a unified interface
- R10. All three features must degrade gracefully when Redis/PG is unavailable (fallback to in-memory)
- R11. Each unit must have detailed architecture design and code reasoning documented before implementation begins
---
## Key Technical Decisions
- **KTD-1. Cache key design**: Use `SHA256(model + system_prompt_hash + messages_content_hash + temperature + tools_hash)` for exact match. For semantic match, embed the last user message and compare against cached embeddings using cosine similarity > threshold. Rationale: exact match is fast and deterministic; semantic match catches paraphrased requests. Both are needed because exact match alone misses too many hits, and semantic match alone is too slow for every request.
- **KTD-2. Cache storage backend**: Implement `LLMCache` as a Protocol with `InMemoryLLMCache` and `RedisLLMCache` backends. In-memory uses `OrderedDict` with LRU eviction (following `EmbeddingCache` pattern). Redis uses `agentkit:llm_cache:{hash}` keys with TTL. Rationale: follows existing factory pattern (`create_message_bus`, `create_session_store`); in-memory for dev/single-instance, Redis for production.
- **KTD-3. Semantic routing insertion point**: Insert as Layer 1.5 between `HeuristicClassifier` and `_classify_merged()`. When Layer 1 returns medium complexity (0.3-0.7), try semantic routing first. If similarity > 0.85, return skill match directly (skip LLM). If similarity 0.6-0.85, pass skill hint to Layer 2 LLM (reduces LLM classification tokens). If < 0.6, proceed to Layer 2 unchanged. Rationale: this placement maximizes cost savings by avoiding LLM calls when semantic match is confident, while preserving the existing fallback chain.
- **KTD-4. Skill embedding source text**: Embed `f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}"` for each skill. Cache embeddings in a dict keyed by skill name, re-embed on skill registration/update. Rationale: combines all semantic signals; description alone misses keyword intent; keywords alone misses semantic meaning.
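- **KTD-5. UsageTracker persistence strategy**: Use one Redis Hash per date for time-series data. Key pattern: `agentkit:usage:{date}`, holding counters for `{tokens, cost, latency_ms, count}` grouped by `{agent}:{model}`. Write via `HINCRBY`/`HINCRBYFLOAT` for atomic increments; because these commands increment numeric field values and cannot update a JSON blob in place, each metric needs its own field (for example `{agent}:{model}:{metric}`). Query via `HGETALL` + client-side aggregation. Rationale: O(1) write, acceptable query performance, natural TTL by date, follows Redis patterns in project.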
- **KTD-6. CascadeDetector persistence strategy**: Use Redis atomic operations. Key pattern: `agentkit:cascade:{session_id}:interactions` (INCR + TTL) and `agentkit:cascade:{session_id}:depth` (SET/GET + TTL). Rationale: INCR is atomic, no race conditions across instances; TTL prevents memory leaks; matches session lifecycle.
- **KTD-7. EvolutionStore interface unification**: Extend the base `EvolutionStore` Protocol to include `skill_version` and `ab_test` methods. Make `PersistentEvolutionStore` (SQLite) implement the unified Protocol. Add a new `PostgreSQLEvolutionStore` that uses async SQLAlchemy like the existing `EvolutionStore` but with the full unified interface. Rationale: current split (sync SQLite vs async PG) creates maintenance burden; unified Protocol enables backend-agnostic usage.
- **KTD-8. Graceful degradation pattern**: All three features use the same pattern: try the preferred backend, catch the connection error, log a warning, and fall back to in-memory. Controlled by `cache.backend`, `usage_store.backend`, `cascade_store.backend` config values (`"auto"` | `"redis"` | `"memory"`). `"auto"` tries Redis and falls back to memory. Rationale: production needs persistence, but dev/testing shouldn't require Redis.
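
The degradation pattern in KTD-8 can be sketched as a small resolver. This is a minimal sketch under stated assumptions, not AgentKit's actual factory: `resolve_backend`, `connect_redis`, and `make_memory` are hypothetical names, and raising on an explicit `"redis"` setting (instead of falling back) is an assumption that the plan does not spell out.

```python
import logging
from typing import Callable

logger = logging.getLogger("agentkit.backend")


def resolve_backend(
    mode: str,
    connect_redis: Callable[[], object],
    make_memory: Callable[[], object],
) -> tuple[str, object]:
    """Pick a store for mode "auto" | "redis" | "memory".

    `connect_redis` is a hypothetical callable that returns a ready store
    or raises ConnectionError when Redis cannot be reached.
    """
    if mode not in ("auto", "redis", "memory"):
        raise ValueError(f"unknown backend mode: {mode!r}")
    if mode == "memory":
        return "memory", make_memory()
    try:
        return "redis", connect_redis()
    except ConnectionError as exc:
        if mode == "redis":
            # Assumption: an explicit "redis" setting should fail loudly.
            raise
        logger.warning("Redis unavailable (%s); falling back to in-memory", exc)
        return "memory", make_memory()
```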
---
## High-Level Technical Design
### LLM Cache Flow
```mermaid
flowchart TB
A[LLMGateway.chat] --> B{Cache enabled?}
B -->|no| F[Call Provider]
B -->|yes| C[Generate exact key]
C --> D{Exact match?}
D -->|hit| E[Return cached response]
D -->|miss| G[Generate embedding of last user msg]
G --> H{Semantic match? similarity > 0.92}
H -->|hit| E
H -->|miss| F
F --> I[Write to cache]
I --> J[Record usage]
E --> K[Record usage with 0 cost]
```
### Semantic Routing Flow
```mermaid
flowchart TB
A[CostAwareRouter.route] --> B[Layer 0: Regex rules]
B -->|matched| Z[Return DIRECT_CHAT]
B -->|unmatched| C[Layer 1: HeuristicClassifier]
C -->|low complexity| Z
C -->|medium-high| D[Layer 1.5: Semantic Router NEW]
D -->|sim > 0.85| E[Return SKILL_REACT with matched skill]
D -->|sim 0.6-0.85| F[Pass skill_hint to Layer 2]
D -->|sim < 0.6| G[Layer 2: LLM classification]
F --> G
G --> H[Return routing result]
```
### State Persistence Architecture
```mermaid
flowchart TB
subgraph "Current (In-Memory)"
UT1[UsageTracker dict]
CD1[CascadeDetector dict]
ES1[EvolutionStore SQLite]
end
subgraph "Target (Persistent)"
UT2[UsageStore Protocol]
CD2[CascadeStateStore Protocol]
ES2[UnifiedEvolutionStore Protocol]
UT2 -->|redis| R1[Redis Hash agentkit:usage:date]
UT2 -->|memory| M1[InMemoryUsageStore]
CD2 -->|redis| R2[Redis INCR agentkit:cascade:session]
CD2 -->|memory| M2[InMemoryCascadeStore]
ES2 -->|postgresql| P1[PG EvolutionEventModel + SkillVersionModel]
ES2 -->|sqlite| S1[PersistentEvolutionStore]
ES2 -->|memory| M3[InMemoryEvolutionStore]
end
```
---
## Scope Boundaries
### In Scope
- LLM response caching (exact + semantic match, in-memory + Redis backends)
- Semantic routing as Layer 1.5 in CostAwareRouter
- UsageTracker Redis persistence
- CascadeDetector Redis persistence
- EvolutionStore interface unification
- Configuration for all three features
- Architecture design documents for each unit before coding
### Deferred for Follow-Up
- Semantic cache using pgvector (current semantic match uses in-memory embedding comparison)
- Cache warming / pre-population strategies
- Routing cache (caching routing results for similar queries)
- Usage analytics dashboard (visualization of usage data)
- Multi-tenant resource quotas
- Rate limiting and concurrency control (P2)
- Distributed tracing visualization (P2)
---
## Implementation Units
### U1. LLM Cache Core
**Goal:** Implement the `LLMCache` Protocol, `InMemoryLLMCache`, and `RedisLLMCache` with exact-match and semantic-match capabilities.
**Dependencies:** None
**Files:**
- `src/agentkit/llm/cache.py` (new): `LLMCache` Protocol, `InMemoryLLMCache`, `RedisLLMCache`, `CacheResult`, `CacheKey` generation
- `src/agentkit/llm/cache_key.py` (new): `generate_cache_key()`, `generate_messages_hash()`, `generate_system_prompt_hash()`
- `tests/unit/llm/test_cache.py` (new): unit tests for cache backends
**Approach:**
Architecture design before coding:
1. **CacheKey design reasoning**: The key must capture all inputs that affect LLM output. `model` determines which model responds. `system_prompt` sets behavior. `messages` carry the conversation. `temperature` affects randomness (only `temperature == 0` is treated as deterministic, which matters for semantic matching). `tools` affect tool_call availability. Hash each component independently and combine the digests into the final key, so a change to one component only requires recomputing that component's hash.
2. **Exact match implementation**: SHA-256 hash of concatenated component hashes. Store as `agentkit:llm_cache:{sha256_hex}` in Redis with TTL. In-memory uses OrderedDict keyed by hash string.
3. **Semantic match implementation**: For cache misses on exact match, embed the last user message using `OpenAIEmbedder`. Compare against cached embeddings using `compute_cosine_similarity()`. Store embeddings alongside cached responses. In-memory: linear scan of all cached embeddings. Redis: store embeddings in a separate key `agentkit:llm_cache_emb:{sha256_hex}`.
4. **Cache write policy**: Exact-match entries are written for any temperature. Semantic-match entries are written and served only when `temperature == 0` (deterministic); for temperature > 0, only the exact-match cache applies, since outputs are non-deterministic.
5. **Cache invalidation**: TTL-based (configurable, default 3600s for exact, 86400s for semantic). Manual invalidation via `invalidate(pattern=None)` for admin operations.
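
The key design and in-memory backend described above could look roughly like the following. It is a sketch, not the final `cache.py`: the signature of `generate_cache_key()`, the `get_exact`/`get_semantic`/`put` method names, and the local `_cosine` helper (standing in for `compute_cosine_similarity()`) are assumptions, messages and tools are assumed to be JSON-serializable, and TTL handling is omitted for brevity.

```python
from __future__ import annotations

import hashlib
import json
import math
from collections import OrderedDict


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def generate_cache_key(model, system_prompt, messages, temperature, tools=None) -> str:
    """Hash each component on its own, then hash the joined digests."""
    parts = [
        model,
        _sha256(system_prompt or ""),
        _sha256(json.dumps(messages, sort_keys=True, ensure_ascii=False)),
        repr(float(temperature)),
        _sha256(json.dumps(tools or [], sort_keys=True, ensure_ascii=False)),
    ]
    return _sha256("|".join(parts))


def _cosine(a: list[float], b: list[float]) -> float:
    # Stand-in for the project's compute_cosine_similarity().
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class InMemoryLLMCache:
    """LRU cache with exact lookup and a linear-scan semantic lookup."""

    def __init__(self, max_entries: int = 10000, similarity_threshold: float = 0.92):
        self._entries: OrderedDict = OrderedDict()  # key -> (response, embedding)
        self._max = max_entries
        self._threshold = similarity_threshold

    def get_exact(self, key: str):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]

    def get_semantic(self, embedding: list[float]):
        best, best_sim = None, self._threshold
        for response, cached in self._entries.values():
            if cached is not None and (sim := _cosine(embedding, cached)) > best_sim:
                best, best_sim = response, sim
        return best

    def put(self, key: str, response, embedding: list[float] | None = None) -> None:
        self._entries[key] = (response, embedding)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)  # evict least recently used
```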
**Patterns to follow:**
- `EmbeddingCache` in `src/agentkit/memory/embedder.py` — LRU + TTL pattern
- `create_session_store()` factory in `src/agentkit/session/store.py` — backend factory pattern
- `RedisSessionStore._get_redis()` — lazy Redis initialization
**Test scenarios:**
- Exact match: same messages + model → cache hit, returns identical response
- Exact miss: different messages → cache miss, calls provider, writes to cache
- Semantic match: paraphrased question (similarity > 0.92) → cache hit
- Semantic miss: unrelated question (similarity < 0.6) → cache miss
- Temperature > 0: only exact match attempted, no semantic match
- TTL expiry: cached entry expires after TTL, next request is a miss
- Redis unavailable: falls back to in-memory cache with warning log
- Cache with tool_calls: response containing tool_calls is cached correctly
- Concurrent access: two concurrent requests for same key don't cause double-write issues
**Verification:** Unit tests pass; cache hit rate metric is observable; no change to `LLMGateway` public API.
---
### U2. LLM Cache Integration
**Goal:** Integrate `LLMCache` into `LLMGateway.chat()` transparently, with usage tracking on cache hits.
**Dependencies:** U1
**Files:**
- `src/agentkit/llm/gateway.py` (modify) — inject cache check before provider call, cache write after provider response
- `src/agentkit/llm/config.py` (modify) — add `CacheConfig` to `LLMConfig`
- `src/agentkit/server/app.py` (modify) — pass cache config to `LLMGateway`
- `tests/unit/llm/test_gateway_cache.py` (new) — integration tests for cached gateway
**Approach:**
Architecture design before coding:
1. **Insertion point reasoning**: Cache check must happen AFTER `LLMRequest` construction (line ~79 in gateway.py) but BEFORE provider call (line ~87). This ensures all request normalization (alias resolution, model fallback list) has completed. Cache write happens AFTER response validation but BEFORE usage tracking.
2. **Cache hit usage tracking**: On cache hit, call `_usage_tracker.record()` with the original `usage` data from the cached response but with `cost=0` and `latency_ms` from cache lookup time. This preserves usage query integrity — `get_usage()` still shows all requests, just with zero cost for cached ones.
3. **Stream handling**: `chat_stream()` is NOT cached in this iteration. Streaming requires collecting all chunks before caching, which adds latency and complexity. Document this as a known limitation.
4. **Configuration integration**: Add `CacheConfig` dataclass with `enabled: bool = False`, `backend: str = "auto"`, `exact_ttl: int = 3600`, `semantic_ttl: int = 86400`, `similarity_threshold: float = 0.92`, `max_entries: int = 10000`. Nest under `LLMConfig.cache`.
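
To make the hit/miss flow and the zero-cost usage record concrete, here is a minimal sketch covering only the exact-match path. The `CacheConfig` fields follow the list above; everything else is assumed for illustration: `chat_with_cache`, the plain-dict cache, the `call_provider` and `record_usage` callables, and the response shape with a `usage` dict are hypothetical and do not reflect the real `LLMGateway` internals.

```python
from __future__ import annotations

import time
from dataclasses import dataclass, fields


@dataclass
class CacheConfig:
    enabled: bool = False
    backend: str = "auto"
    exact_ttl: int = 3600
    semantic_ttl: int = 86400
    similarity_threshold: float = 0.92
    max_entries: int = 10000

    @classmethod
    def from_dict(cls, data: dict | None) -> "CacheConfig":
        # Ignore unknown keys so older config files keep loading.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in (data or {}).items() if k in known})


def chat_with_cache(key, cache: dict, call_provider, record_usage, config: CacheConfig):
    """Check the cache first; on a miss call the provider and store the result."""
    start = time.perf_counter()
    if config.enabled and key in cache:
        response = cache[key]
        # Keep the original token counts but report zero cost for the hit.
        record_usage(
            tokens=response["usage"]["total_tokens"],
            cost=0.0,
            latency_ms=(time.perf_counter() - start) * 1000,
        )
        return response
    response = call_provider()
    if config.enabled:
        cache[key] = response  # write to cache before usage tracking
    record_usage(
        tokens=response["usage"]["total_tokens"],
        cost=response["usage"]["cost"],
        latency_ms=(time.perf_counter() - start) * 1000,
    )
    return response
```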
**Patterns to follow:**
- `LLMConfig` dataclass + `from_dict()` pattern for config
- `LLMGateway.__init__()` dependency injection pattern
**Test scenarios:**
- Cache disabled: requests pass through to provider normally
- Cache enabled, first request: cache miss, provider called, response cached
- Cache enabled, second identical request: cache hit, provider NOT called
- Cache hit usage tracking: usage record has 0 cost, correct token counts
- Cache miss + fallback: primary model fails, fallback model response cached under fallback model key
- Config from YAML: `LLMConfig.from_dict({"cache": {"enabled": true}})` works correctly
**Verification:** Integration tests pass; `LLMGateway.chat()` returns same `LLMResponse` shape whether cached or not; usage tracking includes cache hits.
---
### U3. Semantic Router
**Goal:** Implement embedding-based semantic routing as Layer 1.5 in `CostAwareRouter`, using existing `OpenAIEmbedder` and `compute_cosine_similarity()`.
**Dependencies:** None (independent of U1/U2, uses existing embedding infrastructure)
**Files:**
- `src/agentkit/chat/semantic_router.py` (new) — `SemanticRouter` class, `SkillEmbeddingIndex`
- `src/agentkit/chat/skill_routing.py` (modify) — integrate Layer 1.5 into `CostAwareRouter.route()`
- `tests/unit/chat/test_semantic_router.py` (new) — unit tests for semantic router
**Approach:**
Architecture design before coding:
1. **SkillEmbeddingIndex design reasoning**: Pre-compute embeddings for all registered skills at initialization. Source text: `f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}"`. Store as `dict[str, tuple[list[float], str]]` (skill_name → (embedding, source_text)). On skill registration/update, re-embed only the changed skill. This avoids O(n) embedding computation per query.
2. **Query-time flow**: Embed user query → compute cosine similarity against all skill embeddings → return top match if above threshold. This is O(n) in number of skills, but with <100 skills and 1536-dim vectors, this takes <5ms on CPU. No need for approximate nearest neighbor (ANN) index at this scale.
3. **Threshold design**: Three zones:
   - `similarity > 0.85`: HIGH confidence → return skill match directly, skip Layer 2 LLM
   - `0.6 <= similarity <= 0.85`: MEDIUM confidence → pass skill hint to Layer 2, reducing LLM classification tokens
   - `similarity < 0.6`: LOW confidence → no semantic signal, Layer 2 runs unmodified
4. **Integration into CostAwareRouter**: Modify `route()` method. After Layer 1 (`HeuristicClassifier`), if complexity is medium (0.3-0.7), call `semantic_router.route(query)` before falling through to Layer 2 (`_classify_merged()`). Based on confidence zone, either return directly or enhance the Layer 2 prompt with skill hint.
5. **Embedding provider**: Use `OpenAIEmbedder` by default. Support `MockEmbedder` for testing. Embedder is injected via constructor, not created internally.
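
The index and the three confidence zones above can be sketched as follows. This is an illustration under assumptions rather than the planned `semantic_router.py`: the embedder is modelled as a synchronous callable injected through the constructor (the real `OpenAIEmbedder` may well be async), `_cosine` stands in for `compute_cosine_similarity()`, and `register`, `best_match`, and `classify_zone` are hypothetical names.

```python
from __future__ import annotations

import math
from typing import Callable


def _cosine(a: list[float], b: list[float]) -> float:
    # Stand-in for the project's compute_cosine_similarity().
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


class SkillEmbeddingIndex:
    """Maps skill_name -> (embedding, source_text), computed at registration."""

    def __init__(self, embed: Callable[[str], list[float]]):
        self._embed = embed
        self._index: dict[str, tuple[list[float], str]] = {}

    def register(self, name, description, keywords=(), capability_tags=()) -> None:
        # Re-registering a skill overwrites only that skill's embedding.
        source = f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}"
        self._index[name] = (self._embed(source), source)

    def best_match(self, query: str) -> tuple[str, float] | None:
        if not self._index:
            return None  # empty registry: no semantic signal
        q = self._embed(query)
        scored = ((name, _cosine(q, emb)) for name, (emb, _) in self._index.items())
        return max(scored, key=lambda pair: pair[1])


def classify_zone(similarity: float, high: float = 0.85, low: float = 0.6) -> str:
    """Return "direct", "hint", or "llm" for the three confidence zones."""
    if similarity > high:
        return "direct"
    if similarity >= low:
        return "hint"
    return "llm"
```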
**Patterns to follow:**
- `OpenAIEmbedder` + `EmbeddingCache` pattern for embedding computation
- `compute_cosine_similarity()` in `src/agentkit/utils/vector_math.py`
- `CostAwareRouter` constructor injection pattern
**Test scenarios:**
- Exact skill match: query "生成一篇关于AI的文章" ("Generate an article about AI") matches `content_generator` skill (sim > 0.85)
- Partial skill match: query "优化内容" ("Optimize content") matches `geo_optimizer` skill (sim 0.6-0.85), skill hint passed to LLM
- No skill match: query "今天天气怎么样" ("How is the weather today?") has sim < 0.6 for all skills, Layer 2 runs normally
- Skill registration: new skill added → embedding computed and indexed
- Skill update: skill description changed → embedding re-computed
- Empty skill registry: semantic router returns None gracefully
- Embedder failure: OpenAIEmbedder throws error → semantic router logs warning, returns None, Layer 2 runs normally
- Chinese query: "帮我写一篇文章" ("Help me write an article") matches `content_generator` skill correctly
**Verification:** Semantic router returns correct skill matches; Layer 2 LLM calls reduced by >50% for medium-complexity queries; no regression in routing accuracy.
---
### U4. UsageStore Persistence
**Goal:** Persist UsageTracker records to Redis, with in-memory fallback and efficient aggregation queries.
**Dependencies:** None
**Files:**
- `src/agentkit/llm/usage_store.py` (new) — `UsageStore` Protocol, `InMemoryUsageStore`, `RedisUsageStore`
- `src/agentkit/llm/providers/tracker.py` (modify) — delegate to `UsageStore` backend
- `tests/unit/llm/test_usage_store.py` (new) — unit tests for usage store backends
**Approach:**
Architecture design before coding:
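1. **Redis data model reasoning**: Use one Redis Hash per date for time-partitioned storage. Key: `agentkit:usage:{YYYY-MM-DD}`. Metrics tracked per `{agent_name}:{model}` pair: `{prompt_tokens, completion_tokens, total_tokens, cost, latency_ms, count}`. Because `HINCRBY` and `HINCRBYFLOAT` increment numeric field values and cannot update a JSON blob in place, store each metric in its own field, for example `{agent_name}:{model}:{metric}`. Write via pipeline: `HINCRBYFLOAT` for float fields + `HINCRBY` for integer fields and count. Each increment is O(1) and atomic, and the data is naturally partitioned by date.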
2. **Aggregation query design**: For `get_usage(agent=None, start=None, end=None)`: scan date keys in range via `HGETALL`, filter by agent/model in application code, aggregate in memory. For single-agent queries, use field prefix matching. This is O(days × agents) which is acceptable for dashboard queries.
3. **UsageStore Protocol**: Define `record(agent, model, usage: UsageRecord) -> None`, `query(agent=None, model=None, start=None, end=None) -> list[UsageRecord]`, `get_summary(agent=None, start=None, end=None) -> UsageSummary`. Both sync and async versions (sync for backward compat, async for Redis).
4. **Migration from UsageTracker**: `UsageTracker` becomes a thin wrapper that delegates to `UsageStore`. Existing `record()` and `get_usage()` APIs preserved. Internal `_records` list replaced by store backend.
5. **TTL management**: Each date key gets TTL of 90 days (configurable). This prevents unbounded Redis memory growth while preserving 3 months of usage data.
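
The write and aggregation paths can be sketched against a redis-py style client. Since `HINCRBY`/`HINCRBYFLOAT` increment a numeric field rather than a JSON value, this sketch assumes one hash field per metric, named `{agent}:{model}:{metric}`. It further assumes a synchronous client created with `decode_responses=True`, agent and model names without `:`, and a plain dict in place of `UsageRecord`; the `summary` method name is hypothetical.

```python
from __future__ import annotations

from collections import defaultdict
from datetime import date

INT_METRICS = ("prompt_tokens", "completion_tokens", "total_tokens", "count")
FLOAT_METRICS = ("cost", "latency_ms")


class RedisUsageStore:
    def __init__(self, client, ttl_days: int = 90):
        self._r = client  # assumed: redis.Redis(decode_responses=True) or compatible
        self._ttl = ttl_days * 86400

    def record(self, agent: str, model: str, usage: dict, day: date | None = None) -> None:
        key = f"agentkit:usage:{(day or date.today()).isoformat()}"
        pipe = self._r.pipeline()
        for metric in INT_METRICS:
            # "count" goes up by one per call; other metrics add the reported value.
            amount = 1 if metric == "count" else int(usage.get(metric, 0))
            pipe.hincrby(key, f"{agent}:{model}:{metric}", amount)
        for metric in FLOAT_METRICS:
            pipe.hincrbyfloat(key, f"{agent}:{model}:{metric}", float(usage.get(metric, 0.0)))
        pipe.expire(key, self._ttl)  # bound memory growth per date key
        pipe.execute()

    def summary(self, day: date, agent: str | None = None) -> dict[str, float]:
        totals: dict[str, float] = defaultdict(float)
        raw = self._r.hgetall(f"agentkit:usage:{day.isoformat()}")
        for field, value in raw.items():
            field_agent, _model, metric = field.split(":", 2)
            if agent is None or field_agent == agent:
                totals[metric] += float(value)  # client-side aggregation
        return dict(totals)
```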
**Patterns to follow:**
- `SessionStore` Protocol in `src/agentkit/session/store.py` — Protocol definition pattern
- `RedisSessionStore._get_redis()` — lazy Redis initialization
- `create_session_store()` — factory function pattern
- `agentkit:usage:` key namespace convention
**Test scenarios:**
- Record and query: record usage → query returns matching records
- Date partitioning: records on different dates stored in different keys
- Aggregation: multiple records for same agent/model aggregated correctly
- Agent filter: query with agent filter returns only that agent's records
- Date range filter: query with start/end returns only records in range
- TTL: date keys have correct TTL set
- Redis unavailable: falls back to in-memory store with warning
- Concurrent writes: two concurrent records for same agent/model don't lose data
- Empty query: query with no matching records returns empty list
**Verification:** Usage data survives process restart; `get_usage()` returns same shape as before; Redis memory usage bounded by TTL.
---
### U5. CascadeStateStore Persistence
**Goal:** Persist CascadeDetector state to Redis using atomic operations, enabling multi-instance cascade detection.
**Dependencies:** None
**Files:**
- `src/agentkit/quality/cascade_store.py` (new) — `CascadeStateStore` Protocol, `InMemoryCascadeStore`, `RedisCascadeStore`
- `src/agentkit/quality/cascade_detector.py` (modify) — delegate to `CascadeStateStore` backend
- `tests/unit/quality/test_cascade_store.py` (new) — unit tests for cascade store backends
**Approach:**
Architecture design before coding:
1. **Redis data model reasoning**: Use simple string keys with INCR for atomic counting. Key: `agentkit:cascade:{session_id}:interactions` (INCR + TTL), `agentkit:cascade:{session_id}:depth` (GET/SET + TTL). TTL aligned with session TTL (default 86400s). INCR is atomic — no race conditions across instances.
2. **Protocol design**: `CascadeStateStore` with `increment_interactions(session_id) -> int`, `get_interactions(session_id) -> int`, `set_depth(session_id, depth) -> None`, `get_depth(session_id) -> int`, `reset(session_id) -> None`, `get_stats(session_id) -> CascadeStats`.
3. **Integration into CascadeDetector**: Replace internal `_interaction_counts` and `_loop_depths` dicts with `CascadeStateStore` backend. All methods delegate to store. `CascadeDetector` becomes stateless — all state lives in the store.
4. **Session TTL alignment**: When `increment_interactions()` is called, refresh the key TTL to match session TTL. This ensures state is cleaned up when sessions expire.
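
A Redis-backed store following this design might look like the sketch below. It assumes a synchronous redis-py style client injected through the constructor (the project may use an async client instead) and omits `get_stats()` because the shape of `CascadeStats` is not defined here. The commands used (`INCR`, `EXPIRE`, `GET`, `SET` with `ex`, `DEL`) are standard Redis operations.

```python
from __future__ import annotations

from typing import Protocol


class CascadeStateStore(Protocol):
    def increment_interactions(self, session_id: str) -> int: ...
    def get_interactions(self, session_id: str) -> int: ...
    def set_depth(self, session_id: str, depth: int) -> None: ...
    def get_depth(self, session_id: str) -> int: ...
    def reset(self, session_id: str) -> None: ...


class RedisCascadeStore:
    def __init__(self, client, session_ttl: int = 86400):
        self._r = client  # assumed: redis.Redis or a compatible object
        self._ttl = session_ttl

    def _key(self, session_id: str, suffix: str) -> str:
        return f"agentkit:cascade:{session_id}:{suffix}"

    def increment_interactions(self, session_id: str) -> int:
        key = self._key(session_id, "interactions")
        count = self._r.incr(key)  # atomic across instances
        self._r.expire(key, self._ttl)  # refresh TTL on every interaction
        return int(count)

    def get_interactions(self, session_id: str) -> int:
        return int(self._r.get(self._key(session_id, "interactions")) or 0)

    def set_depth(self, session_id: str, depth: int) -> None:
        self._r.set(self._key(session_id, "depth"), depth, ex=self._ttl)

    def get_depth(self, session_id: str) -> int:
        return int(self._r.get(self._key(session_id, "depth")) or 0)

    def reset(self, session_id: str) -> None:
        self._r.delete(
            self._key(session_id, "interactions"),
            self._key(session_id, "depth"),
        )
```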
**Patterns to follow:**
- Same Protocol + factory + fallback pattern as U4
- Redis INCR atomic operation pattern
- `agentkit:cascade:` key namespace
**Test scenarios:**
- Increment and get: increment interactions → get returns correct count
- Set and get depth: set depth → get returns correct depth
- Reset: reset session → interactions and depth both cleared
- TTL: keys have TTL set, expire after session timeout
- Multi-instance: two instances incrementing same session see consistent count
- Redis unavailable: falls back to in-memory store
- Session isolation: different sessions have independent state
**Verification:** Cascade detection state survives process restart; multi-instance deployment detects cascades correctly; no false positives from state loss.
---
### U6. EvolutionStore Interface Unification
**Goal:** Unify `EvolutionStore` and `PersistentEvolutionStore` interfaces, add PostgreSQL backend with full feature set.
**Dependencies:** None
**Files:**
- `src/agentkit/evolution/evolution_store.py` (modify) — define unified `EvolutionStoreProtocol`, refactor existing stores
- `src/agentkit/evolution/models.py` (modify) — add `SkillVersionModel` and `ABTestResultModel` to async PG models
- `src/agentkit/evolution/pg_store.py` (new) — `PostgreSQLEvolutionStore` implementing unified Protocol with async SQLAlchemy
- `tests/unit/evolution/test_unified_store.py` (new) — tests for unified interface
**Approach:**
Architecture design before coding:
1. **Protocol design reasoning**: Current `EvolutionStore` (async PG) has `record()`, `rollback()`, `list_events()`. `PersistentEvolutionStore` (sync SQLite) adds `record_skill_version()`, `list_skill_versions()`, `record_ab_test_result()`, `get_ab_test_results()`. The unified Protocol must include ALL methods from both. Each backend implements what it can; unsupported methods raise `NotImplementedError` with clear message.
2. **PostgreSQL model migration**: Add `SkillVersionModel` and `ABTestResultModel` to `src/agentkit/evolution/models.py` using async SQLAlchemy (matching `EpisodeModel` pattern in memory/models.py). These models already exist for SQLite; the PG versions use the same schema but with async engine.
3. **PostgreSQLEvolutionStore**: New class using async SQLAlchemy session (injected via constructor, same pattern as existing `EvolutionStore`). Implements all Protocol methods. Uses `run_in_executor` for any sync ORM operations if needed.
4. **Factory update**: `create_evolution_store(backend="memory"|"sqlite"|"postgresql", ...)` returns the appropriate backend. `"postgresql"` creates `PostgreSQLEvolutionStore` with async engine.
5. **Backward compatibility**: Existing `EvolutionStore` class is not removed — it becomes an internal implementation detail. The Protocol is the public interface. Code using `EvolutionStore` directly continues to work.
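
A unified Protocol covering both method sets could be sketched as below. The method names come from the two existing stores as listed above, but every signature here is an assumption, as are the dict-based event shape and the rollback semantics (marking the event rather than deleting it). `runtime_checkable` only verifies that the methods exist, so a compliance test would still need to exercise each one.

```python
from __future__ import annotations

from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class EvolutionStoreProtocol(Protocol):
    async def record(self, event: dict[str, Any]) -> None: ...
    async def rollback(self, event_id: str) -> None: ...
    async def list_events(self) -> list[dict[str, Any]]: ...
    async def record_skill_version(self, skill: str, version: dict[str, Any]) -> None: ...
    async def list_skill_versions(self, skill: str) -> list[dict[str, Any]]: ...
    async def record_ab_test_result(self, test_id: str, result: dict[str, Any]) -> None: ...
    async def get_ab_test_results(self, test_id: str) -> list[dict[str, Any]]: ...


class InMemoryEvolutionStore:
    """Reference backend that satisfies the full Protocol."""

    def __init__(self) -> None:
        self._events: list[dict[str, Any]] = []
        self._versions: dict[str, list[dict[str, Any]]] = {}
        self._ab_results: dict[str, list[dict[str, Any]]] = {}

    async def record(self, event: dict[str, Any]) -> None:
        self._events.append(dict(event))

    async def rollback(self, event_id: str) -> None:
        for event in self._events:
            if event.get("id") == event_id:
                event["rolled_back"] = True  # assumed rollback semantics

    async def list_events(self) -> list[dict[str, Any]]:
        return list(self._events)

    async def record_skill_version(self, skill: str, version: dict[str, Any]) -> None:
        self._versions.setdefault(skill, []).append(dict(version))

    async def list_skill_versions(self, skill: str) -> list[dict[str, Any]]:
        return list(self._versions.get(skill, []))

    async def record_ab_test_result(self, test_id: str, result: dict[str, Any]) -> None:
        self._ab_results.setdefault(test_id, []).append(dict(result))

    async def get_ab_test_results(self, test_id: str) -> list[dict[str, Any]]:
        return list(self._ab_results.get(test_id, []))
```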
**Patterns to follow:**
- `EpisodeModel` in `src/agentkit/memory/models.py` — async PG model pattern
- `create_evolution_store()` factory — extend with new backend
- `PersistentEvolutionStore._run_sync()` — sync/async bridge pattern
**Test scenarios:**
- Protocol compliance: all backends implement all Protocol methods
- PG store: record event → list events returns recorded event
- PG store: record skill version → list versions returns version history
- PG store: record AB test result → get results returns test data
- SQLite store: existing functionality preserved after refactor
- Memory store: existing functionality preserved after refactor
- Factory: `create_evolution_store(backend="postgresql")` returns correct type
- PG unavailable: falls back to SQLite with warning
**Verification:** All backends pass unified Protocol compliance test; existing evolution tests pass; PG store supports skill_version and ab_test operations.
---
### U7. Configuration Integration and End-to-End Verification
**Goal:** Wire all three features into the application configuration, add `agentkit.yaml` schema support, and verify end-to-end behavior.
**Dependencies:** U1, U2, U3, U4, U5, U6
**Files:**
- `src/agentkit/server/app.py` (modify) — initialize cache, usage store, cascade store with config
- `src/agentkit/cli/main.py` (modify) — pass config to gateway and router
- `agentkit.yaml` (modify) — add cache, semantic_routing, usage_store, cascade_store config sections
- `tests/integration/test_p0_hardening.py` (new) — end-to-end integration tests
**Approach:**
1. **Configuration schema**: Add to `agentkit.yaml`:
```yaml
llm:
  cache:
    enabled: true
    backend: "auto"  # auto | redis | memory
    exact_ttl: 3600
    semantic_ttl: 86400
    similarity_threshold: 0.92
    max_entries: 10000
routing:
  semantic:
    enabled: true
    similarity_high: 0.85  # direct match threshold
    similarity_low: 0.6  # hint threshold
usage_store:
  backend: "auto"  # auto | redis | memory
  ttl_days: 90
cascade_store:
  backend: "auto"  # auto | redis | memory
  session_ttl: 86400
evolution_store:
  backend: "auto"  # auto | postgresql | sqlite | memory
```
2. **Application wiring**: In `app.py` lifespan, initialize all stores and inject into gateway/router. Follow existing pattern of creating components from config.
3. **End-to-end verification**: Integration test that exercises the full flow: user query → semantic routing → LLM cache → usage tracking → cascade detection → evolution logging.
**Test scenarios:**
- Full flow with Redis: all features use Redis backend, data persists across simulated restart
- Full flow without Redis: all features fall back to in-memory, no errors
- Config from YAML: `agentkit.yaml` parsed correctly, all features configured
- Cache + routing interaction: cached response for semantically routed query works correctly
- Usage tracking with cache: cached requests show 0 cost in usage summary
- Cascade detection across instances: simulated multi-instance scenario detects cascade correctly
**Verification:** All integration tests pass; application starts with new config; features degrade gracefully when backends unavailable.
---
## Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation |
|------|--------|-----------|------------|
| Semantic cache returns stale/wrong response | High — user gets incorrect answer | Medium — embedding similarity doesn't guarantee semantic equivalence | Default to temperature=0 only for semantic cache; configurable threshold; TTL expiry; admin invalidation API |
| Redis single point of failure | High — all persistence lost | Low — Redis is typically HA | Auto-fallback to in-memory; health check in doctor command; alert on fallback activation |
| Embedding API latency adds to routing time | Medium — slower routing for first query | Medium — embedding API ~100ms | Pre-compute skill embeddings; cache query embeddings; async embedding with timeout |
| UsageStore Redis memory growth | Medium — Redis OOM | Low — TTL + date partitioning bounds growth | 90-day TTL default; monitoring on Redis memory; configurable TTL |
| EvolutionStore interface unification breaks existing code | High — evolution system stops working | Low — Protocol is backward compatible | Keep existing classes as internal implementations; comprehensive test coverage before refactor |
---
## Open Questions
- Should semantic cache also cache streaming responses (requires chunk collection)? Deferred — current plan only caches non-streaming `chat()`.
- Should UsageStore support real-time streaming of usage data (e.g., via Redis Pub/Sub)? Deferred — current plan only supports query-based access.
- What is the optimal embedding model for Chinese+English mixed text? `text-embedding-3-small` is adequate but not optimal. Consider `bge-m3` or `multilingual-e5` as alternatives. Deferred to implementation-time benchmarking.
---
## Sources & Research
- Industry benchmarking: LangChain, Dify, CrewAI, Letta, AutoGen feature comparison (2025-2026)
- Project audit: 12 core files analyzed across memory, evolution, routing, quality, and LLM subsystems
- Existing patterns: `EmbeddingCache`, `RedisSessionStore`, `create_evolution_store()`, `SessionStore` Protocol