---
title: "feat: P0 Production Hardening: LLM Cache, Semantic Routing, State Persistence"
status: active
created_at: 2026-06-14
type: feat
origin: "Industry research and project review (2026-06-14)"
depth: deep
---

# P0 Production Hardening: LLM Cache, Semantic Routing, State Persistence

## Summary

Industry benchmarking and a project audit identified three P0 gaps: (1) LLM response caching, targeting a 30-50% reduction in token cost; (2) embedding-based semantic routing, to improve intent matching quality at zero LLM cost; and (3) persistence of critical state in UsageTracker, EvolutionStore, and CascadeDetector, so that it survives restarts and supports multi-instance deployment. Each unit requires a detailed architecture design and code reasoning before implementation: design first, code second.

## Problem Frame

AgentKit is strongly differentiated in self-evolution, quality management, and multi-paradigm engines, but three production-critical gaps block enterprise deployment:

1. **Every LLM request hits the provider.** There is no caching, so identical or similar requests waste tokens and money. Competitors such as Dify have built-in caching.
2. **Routing relies on keyword matching and LLM classification.** There is no semantic understanding. Embedding-based routing is industry standard (the Agentic RAG trend), and AgentKit already has embedding infrastructure but does not use it for routing.
3. **Critical state lives in memory.** UsageTracker, CascadeDetector, and EvolutionStore lose data on restart, and multi-instance deployment is impossible without shared state.

These gaps are P0 because they directly affect cost (caching), quality (routing accuracy), and reliability (state persistence), the three pillars of production readiness.

---

## Requirements

- R1. LLM cache must support exact-match (hash-based) and semantic-match (embedding-based) cache hits
- R2. LLM cache must integrate transparently into `LLMGateway.chat()` without changing the public API
- R3. LLM cache must record usage on cache hits (0 cost) to maintain usage tracking integrity
- R4. Semantic routing must insert between Layer 1 and Layer 2 in `CostAwareRouter`
- R5. Semantic routing must use existing `OpenAIEmbedder` and `compute_cosine_similarity()` infrastructure
- R6. Semantic routing must pre-compute skill embeddings at registration time, not at query time
- R7. UsageTracker must persist records to Redis with O(1) write and efficient aggregation
- R8. CascadeDetector must persist state to Redis using atomic INCR operations
- R9. EvolutionStore must support both PostgreSQL and SQLite backends with a unified interface
- R10. All three features must degrade gracefully when Redis/PG is unavailable (fallback to in-memory)
- R11. Each unit must have detailed architecture design and code reasoning documented before implementation begins

---

## Key Technical Decisions

- **KTD-1. Cache key design**: Use `SHA256(model + system_prompt_hash + messages_content_hash + temperature + tools_hash)` for exact match. For semantic match, embed the last user message, compare it against cached embeddings, and treat a cosine similarity above the configured threshold as a hit. Rationale: exact match is fast and deterministic; semantic match catches paraphrased requests. Both are needed because exact match alone misses too many hits, and semantic match alone is too slow to run on every request.

- **KTD-2. Cache storage backend**: Implement `LLMCache` as a Protocol with `InMemoryLLMCache` and `RedisLLMCache` backends. In-memory uses `OrderedDict` with LRU eviction (following the `EmbeddingCache` pattern). Redis uses `agentkit:llm_cache:{hash}` keys with TTL. Rationale: follows the existing factory pattern (`create_message_bus`, `create_session_store`); in-memory for dev/single-instance, Redis for production.

- **KTD-3. Semantic routing insertion point**: Insert as Layer 1.5 between `HeuristicClassifier` and `_classify_merged()`. When Layer 1 returns medium complexity (0.3-0.7), try semantic routing first. If similarity > 0.85, return the skill match directly (skip the LLM). If similarity is 0.6-0.85, pass a skill hint to the Layer 2 LLM (reduces LLM classification tokens). If similarity < 0.6, proceed to Layer 2 unchanged. Rationale: this placement maximizes cost savings by avoiding LLM calls when the semantic match is confident, while preserving the existing fallback chain.

- **KTD-4. Skill embedding source text**: Embed `f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}"` for each skill. Cache embeddings in a dict keyed by skill name, and re-embed on skill registration/update. Rationale: this combines all semantic signals; the description alone misses keyword intent, and keywords alone miss semantic meaning.

- **KTD-5. UsageTracker persistence strategy**: Use a Redis Hash for time-series data. Key pattern: `agentkit:usage:{date}`, with one numeric field per metric, `{agent}:{model}:{metric}`, for `tokens`, `cost`, `latency_ms`, and `count`. A single JSON value per `{agent}:{model}` field would not work here, because `HINCRBYFLOAT` increments a whole field value and cannot update a number inside a JSON string. Write via `HINCRBYFLOAT` for atomic increments. Query via `HGETALL` plus client-side aggregation. Rationale: O(1) write, acceptable query performance, natural expiry by date, and consistency with the Redis patterns already used in the project.

- **KTD-6. CascadeDetector persistence strategy**: Use Redis atomic operations. Key pattern: `agentkit:cascade:{session_id}:interactions` (INCR + TTL) and `agentkit:cascade:{session_id}:depth` (SET/GET + TTL). Rationale: INCR is atomic, so there are no race conditions across instances; TTL prevents memory leaks; the design matches the session lifecycle.

- **KTD-7. EvolutionStore interface unification**: Extend the base `EvolutionStore` Protocol to include `skill_version` and `ab_test` methods. Make `PersistentEvolutionStore` (SQLite) implement the unified Protocol. Add a new `PostgreSQLEvolutionStore` that uses async SQLAlchemy like the existing `EvolutionStore` but with the full unified interface. Rationale: the current split (sync SQLite vs async PG) creates a maintenance burden; a unified Protocol enables backend-agnostic usage.

- **KTD-8. Graceful degradation pattern**: All three features use the same pattern: try the preferred backend, catch the connection error, log a warning, and fall back to in-memory. This is controlled by the `cache.backend`, `usage_store.backend`, and `cascade_store.backend` config values (`"auto"` | `"redis"` | `"memory"`). `"auto"` tries Redis and falls back to memory. Rationale: production needs persistence, but dev/testing shouldn't require Redis.
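
The fallback pattern in KTD-8 can be sketched as a small factory. This is a minimal illustration, not the project's implementation: `KeyValueStore`, `InMemoryStore`, `create_store`, and the `connect_redis` callable are hypothetical names, and the sketch assumes the connection attempt raises the built-in `ConnectionError` (a real Redis client raises its own exception type, which would need to be caught instead). The plan does not say whether an explicit `"redis"` setting should also fall back; here it behaves like `"auto"`.

```python
import logging
from typing import Callable, Optional, Protocol

logger = logging.getLogger("agentkit.sketch")


class KeyValueStore(Protocol):
    """Hypothetical minimal store interface, used only in this sketch."""

    def get(self, key: str) -> Optional[str]: ...
    def set(self, key: str, value: str) -> None: ...


class InMemoryStore:
    """Dict-backed fallback store."""

    def __init__(self) -> None:
        self._data: dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)

    def set(self, key: str, value: str) -> None:
        self._data[key] = value


def create_store(backend: str, connect_redis: Callable[[], KeyValueStore]) -> KeyValueStore:
    """Pick a backend following the auto | redis | memory convention.

    connect_redis is a caller-supplied callable (an assumption of this sketch)
    that returns a connected store or raises ConnectionError.
    """
    if backend == "memory":
        return InMemoryStore()
    if backend not in ("auto", "redis"):
        raise ValueError(f"unknown backend: {backend!r}")
    try:
        return connect_redis()
    except ConnectionError as exc:
        # Degrade gracefully: warn and continue with in-memory state.
        logger.warning("Redis unavailable, falling back to in-memory store: %s", exc)
        return InMemoryStore()
```

Passing the connection attempt in as a callable keeps the factory testable without a running Redis instance.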

---

## High-Level Technical Design

### LLM Cache Flow

```mermaid
flowchart TB
    A[LLMGateway.chat] --> B{Cache enabled?}
    B -->|no| F[Call Provider]
    B -->|yes| C[Generate exact key]
    C --> D{Exact match?}
    D -->|hit| E[Return cached response]
    D -->|miss| G[Generate embedding of last user msg]
    G --> H{Semantic match? similarity > 0.92}
    H -->|hit| E
    H -->|miss| F
    F --> I[Write to cache]
    I --> J[Record usage]
    E --> K[Record usage with 0 cost]
```

### Semantic Routing Flow

```mermaid
flowchart TB
    A[CostAwareRouter.route] --> B[Layer 0: Regex rules]
    B -->|matched| Z[Return DIRECT_CHAT]
    B -->|unmatched| C[Layer 1: HeuristicClassifier]
    C -->|low complexity| Z
    C -->|medium-high| D[Layer 1.5: Semantic Router NEW]
    D -->|sim > 0.85| E[Return SKILL_REACT with matched skill]
    D -->|sim 0.6-0.85| F[Pass skill_hint to Layer 2]
    D -->|sim < 0.6| G[Layer 2: LLM classification]
    F --> G
    G --> H[Return routing result]
```

### State Persistence Architecture

```mermaid
flowchart TB
    subgraph "Current (In-Memory)"
        UT1[UsageTracker dict]
        CD1[CascadeDetector dict]
        ES1[EvolutionStore SQLite]
    end
    subgraph "Target (Persistent)"
        UT2[UsageStore Protocol]
        CD2[CascadeStateStore Protocol]
        ES2[UnifiedEvolutionStore Protocol]
        UT2 -->|redis| R1[Redis Hash agentkit:usage:date]
        UT2 -->|memory| M1[InMemoryUsageStore]
        CD2 -->|redis| R2[Redis INCR agentkit:cascade:session]
        CD2 -->|memory| M2[InMemoryCascadeStore]
        ES2 -->|postgresql| P1[PG EvolutionEventModel + SkillVersionModel]
        ES2 -->|sqlite| S1[PersistentEvolutionStore]
        ES2 -->|memory| M3[InMemoryEvolutionStore]
    end
```

---

## Scope Boundaries

### In Scope

- LLM response caching (exact + semantic match, in-memory + Redis backends)
- Semantic routing as Layer 1.5 in CostAwareRouter
- UsageTracker Redis persistence
- CascadeDetector Redis persistence
- EvolutionStore interface unification
- Configuration for all three features
- Architecture design documents for each unit before coding

### Deferred for Follow-Up

- Semantic cache using pgvector (current semantic match uses in-memory embedding comparison)
- Cache warming / pre-population strategies
- Routing cache (caching routing results for similar queries)
- Usage analytics dashboard (visualization of usage data)
- Multi-tenant resource quotas
- Rate limiting and concurrency control (P2)
- Distributed tracing visualization (P2)

---

## Implementation Units

### U1. LLM Cache Core

**Goal:** Implement the `LLMCache` Protocol, `InMemoryLLMCache`, and `RedisLLMCache` with exact-match and semantic-match capabilities.

**Dependencies:** None

**Files:**
- `src/agentkit/llm/cache.py` (new): `LLMCache` Protocol, `InMemoryLLMCache`, `RedisLLMCache`, `CacheResult`, `CacheKey` generation
- `src/agentkit/llm/cache_key.py` (new): `generate_cache_key()`, `generate_messages_hash()`, `generate_system_prompt_hash()`
- `tests/unit/llm/test_cache.py` (new): unit tests for cache backends

**Approach:**

Architecture design before coding:

1. **CacheKey design reasoning**: The key must capture every input that affects LLM output. `model` determines which model responds. `system_prompt` sets behavior. `messages` carry the conversation. `temperature` affects randomness (semantic matching is limited to temperature=0 requests; see item 4). `tools` affect tool_call availability. Hash each component independently so that component boundaries stay unambiguous and individual hashes (for example, the system prompt hash) can be reused across requests. A change to any component still produces a different final key.

2. **Exact match implementation**: SHA-256 hash of the concatenated component hashes. Store as `agentkit:llm_cache:{sha256_hex}` in Redis with TTL. In-memory uses an OrderedDict keyed by hash string.

3. **Semantic match implementation**: On an exact-match miss, embed the last user message using `OpenAIEmbedder`. Compare against cached embeddings using `compute_cosine_similarity()`. Store embeddings alongside cached responses. In-memory: linear scan of all cached embeddings. Redis: store embeddings in a separate key `agentkit:llm_cache_emb:{sha256_hex}`.

4. **Cache write policy**: Semantic-match caching applies only where `temperature == 0` (deterministic output). For temperature > 0, only the exact-match cache applies (no semantic match, since outputs are non-deterministic).

5. **Cache invalidation**: TTL-based (configurable, default 3600s for exact, 86400s for semantic). Manual invalidation via `invalidate(pattern=None)` for admin operations.
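
The exact-match key described in items 1 and 2 can be sketched as follows. This is an illustrative version under stated assumptions, not the final `cache_key.py`: the `"|"` separator, the canonical JSON serialization, and the helper names `_sha256` and `_canonical` are choices made for this sketch, and the parameter list of `generate_cache_key()` is assumed rather than taken from the codebase.

```python
import hashlib
import json
from typing import Any, Optional


def _sha256(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _canonical(obj: Any) -> str:
    """Serialize to JSON with sorted keys so equal inputs hash equally."""
    return json.dumps(obj, sort_keys=True, ensure_ascii=False, separators=(",", ":"))


def generate_cache_key(
    model: str,
    system_prompt: str,
    messages: list[dict[str, Any]],
    temperature: float,
    tools: Optional[list[dict[str, Any]]] = None,
) -> str:
    """Build the exact-match key from independently hashed components."""
    parts = [
        model,
        _sha256(system_prompt),          # system_prompt_hash
        _sha256(_canonical(messages)),   # messages_content_hash
        repr(float(temperature)),        # 0 and 0.0 normalize to the same text
        _sha256(_canonical(tools or [])),  # tools_hash
    ]
    # The separator keeps component boundaries unambiguous (sketch assumption).
    return "agentkit:llm_cache:" + _sha256("|".join(parts))
```

Sorting JSON keys means two messages that differ only in dict key order map to the same key, which avoids spurious misses when callers build message dicts differently.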

**Patterns to follow:**
- `EmbeddingCache` in `src/agentkit/memory/embedder.py`: LRU + TTL pattern
- `create_session_store()` factory in `src/agentkit/session/store.py`: backend factory pattern
- `RedisSessionStore._get_redis()`: lazy Redis initialization

**Test scenarios:**
- Exact match: same messages + model → cache hit, returns identical response
- Exact miss: different messages → cache miss, calls provider, writes to cache
- Semantic match: paraphrased question (similarity > 0.92) → cache hit
- Semantic miss: unrelated question (similarity < 0.6) → cache miss
- Temperature > 0: only exact match attempted, no semantic match
- TTL expiry: cached entry expires after TTL, next request is a miss
- Redis unavailable: falls back to in-memory cache with warning log
- Cache with tool_calls: response containing tool_calls is cached correctly
- Concurrent access: two concurrent requests for same key don't cause double-write issues

**Verification:** Unit tests pass; cache hit rate metric is observable; no change to `LLMGateway` public API.

---

### U2. LLM Cache Integration

**Goal:** Integrate `LLMCache` into `LLMGateway.chat()` transparently, with usage tracking on cache hits.

**Dependencies:** U1

**Files:**
- `src/agentkit/llm/gateway.py` (modify): inject cache check before provider call, cache write after provider response
- `src/agentkit/llm/config.py` (modify): add `CacheConfig` to `LLMConfig`
- `src/agentkit/server/app.py` (modify): pass cache config to `LLMGateway`
- `tests/unit/llm/test_gateway_cache.py` (new): integration tests for cached gateway

**Approach:**

Architecture design before coding:

1. **Insertion point reasoning**: The cache check must happen AFTER `LLMRequest` construction (line ~79 in gateway.py) but BEFORE the provider call (line ~87). This ensures all request normalization (alias resolution, model fallback list) has completed. The cache write happens AFTER response validation but BEFORE usage tracking.

2. **Cache hit usage tracking**: On a cache hit, call `_usage_tracker.record()` with the original `usage` data from the cached response but with `cost=0` and `latency_ms` taken from the cache lookup time. This preserves usage query integrity: `get_usage()` still shows all requests, just with zero cost for cached ones.

3. **Stream handling**: `chat_stream()` is NOT cached in this iteration. Streaming requires collecting all chunks before caching, which adds latency and complexity. Document this as a known limitation.

4. **Configuration integration**: Add a `CacheConfig` dataclass with `enabled: bool = False`, `backend: str = "auto"`, `exact_ttl: int = 3600`, `semantic_ttl: int = 86400`, `similarity_threshold: float = 0.92`, `max_entries: int = 10000`. Nest it under `LLMConfig.cache`.
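
The ordering in items 1 and 2 (check cache, call provider on a miss, write to cache, then record usage) can be sketched with stand-in types. Everything here is hypothetical scaffolding for illustration: `Response`, `UsageEntry`, and `CachedGateway` are not the real `LLMResponse`, usage record, or `LLMGateway`, and the sketch is synchronous for brevity.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class Response:
    """Hypothetical stand-in for LLMResponse."""
    text: str
    total_tokens: int
    cost: float


@dataclass(frozen=True)
class UsageEntry:
    """Hypothetical usage record."""
    total_tokens: int
    cost: float
    latency_ms: float
    cache_hit: bool


@dataclass
class CachedGateway:
    """Hypothetical gateway wrapper showing only the cache/usage ordering."""
    provider: Callable[[str], Response]
    cache: dict[str, Response] = field(default_factory=dict)
    usage: list[UsageEntry] = field(default_factory=list)

    def chat(self, cache_key: str, prompt: str) -> Response:
        start = time.perf_counter()
        cached = self.cache.get(cache_key)
        if cached is not None:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            # Keep the original token counts, but record zero cost for the hit.
            self.usage.append(UsageEntry(cached.total_tokens, 0.0, elapsed_ms, True))
            return cached
        response = self.provider(prompt)
        self.cache[cache_key] = response  # cache write precedes usage tracking
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        self.usage.append(UsageEntry(response.total_tokens, response.cost, elapsed_ms, False))
        return response
```

Because the hit path returns the same response object shape as the miss path, callers cannot tell the two apart except through the usage records.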

**Patterns to follow:**
- `LLMConfig` dataclass + `from_dict()` pattern for config
- `LLMGateway.__init__()` dependency injection pattern

**Test scenarios:**
- Cache disabled: requests pass through to provider normally
- Cache enabled, first request: cache miss, provider called, response cached
- Cache enabled, second identical request: cache hit, provider NOT called
- Cache hit usage tracking: usage record has 0 cost, correct token counts
- Cache miss + fallback: primary model fails, fallback model response cached under fallback model key
- Config from YAML: `LLMConfig.from_dict({"cache": {"enabled": true}})` works correctly

**Verification:** Integration tests pass; `LLMGateway.chat()` returns same `LLMResponse` shape whether cached or not; usage tracking includes cache hits.

---

### U3. Semantic Router

**Goal:** Implement embedding-based semantic routing as Layer 1.5 in `CostAwareRouter`, using existing `OpenAIEmbedder` and `compute_cosine_similarity()`.

**Dependencies:** None (independent of U1/U2, uses existing embedding infrastructure)

**Files:**
- `src/agentkit/chat/semantic_router.py` (new): `SemanticRouter` class, `SkillEmbeddingIndex`
- `src/agentkit/chat/skill_routing.py` (modify): integrate Layer 1.5 into `CostAwareRouter.route()`
- `tests/unit/chat/test_semantic_router.py` (new): unit tests for semantic router

**Approach:**

Architecture design before coding:

1. **SkillEmbeddingIndex design reasoning**: Pre-compute embeddings for all registered skills at initialization. Source text: `f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}"`. Store as `dict[str, tuple[list[float], str]]` (skill_name → (embedding, source_text)). On skill registration/update, re-embed only the changed skill. This avoids O(n) embedding computation per query.

2. **Query-time flow**: Embed the user query → compute cosine similarity against all skill embeddings → return the top match if it is above the threshold. This is O(n) in the number of skills, but with <100 skills and 1536-dim vectors it takes <5ms on CPU. There is no need for an approximate nearest neighbor (ANN) index at this scale.

3. **Threshold design**: Three zones:
   - `similarity > 0.85`: HIGH confidence → return skill match directly, skip Layer 2 LLM
   - `0.6 <= similarity <= 0.85`: MEDIUM confidence → pass skill hint to Layer 2, reducing LLM classification tokens
   - `similarity < 0.6`: LOW confidence → no semantic signal, Layer 2 runs unmodified

4. **Integration into CostAwareRouter**: Modify the `route()` method. After Layer 1 (`HeuristicClassifier`) and before Layer 2 (`_classify_merged()`), if complexity is medium (0.3-0.7), call `semantic_router.route(query)`. Based on the confidence zone, either return directly or enhance the Layer 2 prompt with the skill hint.

5. **Embedding provider**: Use `OpenAIEmbedder` by default. Support `MockEmbedder` for testing. The embedder is injected via the constructor, not created internally.
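
The three-zone decision in item 3 can be sketched as a pure function over pre-computed embeddings. This is an illustration only: `SemanticDecision`, `route_semantic`, and the zone labels are hypothetical, and the local `cosine_similarity` stands in for the project's `compute_cosine_similarity()` so that the block is self-contained.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Local stand-in for compute_cosine_similarity(); returns 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


@dataclass(frozen=True)
class SemanticDecision:
    zone: str              # "direct" | "hint" | "none" (labels chosen for this sketch)
    skill: Optional[str]
    similarity: float


def route_semantic(
    query_embedding: Sequence[float],
    skill_embeddings: dict[str, Sequence[float]],
    high: float = 0.85,
    low: float = 0.6,
) -> SemanticDecision:
    """Linear scan over skill embeddings, then map the best score to a zone."""
    if not skill_embeddings:
        return SemanticDecision("none", None, 0.0)
    best_skill, best_sim = max(
        ((name, cosine_similarity(query_embedding, emb)) for name, emb in skill_embeddings.items()),
        key=lambda pair: pair[1],
    )
    if best_sim > high:
        return SemanticDecision("direct", best_skill, best_sim)  # skip Layer 2
    if best_sim >= low:
        return SemanticDecision("hint", best_skill, best_sim)    # pass hint to Layer 2
    return SemanticDecision("none", None, best_sim)              # Layer 2 unmodified
```

The query embedding would come from the injected embedder; keeping the zone logic separate from embedding calls makes the thresholds easy to unit test with small hand-built vectors.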

**Patterns to follow:**
- `OpenAIEmbedder` + `EmbeddingCache` pattern for embedding computation
- `compute_cosine_similarity()` in `src/agentkit/utils/vector_math.py`
- `CostAwareRouter` constructor injection pattern

**Test scenarios:**
- Exact skill match: query "生成一篇关于AI的文章" ("Generate an article about AI") matches `content_generator` skill (sim > 0.85)
- Partial skill match: query "优化内容" ("Optimize content") matches `geo_optimizer` skill (sim 0.6-0.85), skill hint passed to LLM
- No skill match: query "今天天气怎么样" ("What's the weather like today?") has sim < 0.6 for all skills, Layer 2 runs normally
- Skill registration: new skill added → embedding computed and indexed
- Skill update: skill description changed → embedding re-computed
- Empty skill registry: semantic router returns None gracefully
- Embedder failure: OpenAIEmbedder raises an error → semantic router logs warning, returns None, Layer 2 runs normally
- Chinese query: "帮我写一篇文章" ("Help me write an article") matches content_generator skill correctly

**Verification:** Semantic router returns correct skill matches; Layer 2 LLM calls reduced by >50% for medium-complexity queries; no regression in routing accuracy.

---

### U4. UsageStore Persistence

**Goal:** Persist UsageTracker records to Redis, with in-memory fallback and efficient aggregation queries.

**Dependencies:** None

**Files:**
- `src/agentkit/llm/usage_store.py` (new): `UsageStore` Protocol, `InMemoryUsageStore`, `RedisUsageStore`
- `src/agentkit/llm/providers/tracker.py` (modify): delegate to `UsageStore` backend
- `tests/unit/llm/test_usage_store.py` (new): unit tests for usage store backends

**Approach:**

Architecture design before coding:

1. **Redis data model reasoning**: Use one Redis Hash per date for time-partitioned storage. Key: `agentkit:usage:{YYYY-MM-DD}`. Fields: `{agent_name}:{model}:{metric}`, one per metric (`prompt_tokens`, `completion_tokens`, `total_tokens`, `cost`, `latency_ms`, `count`), each holding a plain number. A single JSON value per `{agent_name}:{model}` field would not work with the increment commands, which operate on whole field values. Write via a pipeline: `HINCRBYFLOAT` for float metrics and `HINCRBY` for integer counters such as `count`. Each increment is O(1) and atomic, and the data is naturally partitioned by date.

2. **Aggregation query design**: For `get_usage(agent=None, start=None, end=None)`: read the date keys in range via `HGETALL`, filter by agent/model in application code, and aggregate in memory. For single-agent queries, use field prefix matching. This is O(days × agents), which is acceptable for dashboard queries.

3. **UsageStore Protocol**: Define `record(agent, model, usage: UsageRecord) -> None`, `query(agent=None, model=None, start=None, end=None) -> list[UsageRecord]`, and `get_summary(agent=None, start=None, end=None) -> UsageSummary`. Provide both sync and async versions (sync for backward compatibility, async for Redis).

4. **Migration from UsageTracker**: `UsageTracker` becomes a thin wrapper that delegates to `UsageStore`. The existing `record()` and `get_usage()` APIs are preserved. The internal `_records` list is replaced by the store backend.

5. **TTL management**: Each date key gets a TTL of 90 days (configurable). This prevents unbounded Redis memory growth while preserving 3 months of usage data.
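
The date-partitioned hash layout and client-side aggregation can be sketched with plain dicts standing in for Redis hashes. This is a simplified model under stated assumptions: it tracks a reduced set of metrics, uses one numeric field per metric (`{agent}:{model}:{metric}`), assumes agent and model names contain no `:`, and the `record`/`summary` signatures are hypothetical rather than the planned `UsageStore` Protocol.

```python
from collections import defaultdict
from datetime import date, timedelta
from typing import Optional


class InMemoryUsageStore:
    """Dict-of-dicts model of the per-date Redis hashes (illustrative only)."""

    def __init__(self) -> None:
        # "agentkit:usage:YYYY-MM-DD" -> {"agent:model:metric": running total}
        self._hashes: dict[str, dict[str, float]] = {}

    @staticmethod
    def _key(day: date) -> str:
        return f"agentkit:usage:{day.isoformat()}"

    def record(self, day: date, agent: str, model: str,
               total_tokens: int, cost: float, latency_ms: float) -> None:
        fields = self._hashes.setdefault(self._key(day), {})
        increments = (
            ("total_tokens", total_tokens),
            ("cost", cost),
            ("latency_ms", latency_ms),
            ("count", 1),
        )
        for metric, amount in increments:
            name = f"{agent}:{model}:{metric}"
            # Mirrors what HINCRBYFLOAT / HINCRBY would do on one hash field.
            fields[name] = fields.get(name, 0.0) + amount

    def summary(self, start: date, end: date, agent: Optional[str] = None) -> dict[str, float]:
        """Aggregate all fields across the inclusive date range."""
        totals: dict[str, float] = defaultdict(float)
        day = start
        while day <= end:
            for name, value in self._hashes.get(self._key(day), {}).items():
                field_agent, _model, metric = name.split(":")
                if agent is None or field_agent == agent:
                    totals[metric] += value
            day += timedelta(days=1)
        return dict(totals)
```

The aggregation loop visits one hash per day in the range, which matches the O(days × agents) cost noted in item 2.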

**Patterns to follow:**
- `SessionStore` Protocol in `src/agentkit/session/store.py`: Protocol definition pattern
- `RedisSessionStore._get_redis()`: lazy Redis initialization
- `create_session_store()`: factory function pattern
- `agentkit:usage:` key namespace convention

**Test scenarios:**
- Record and query: record usage → query returns matching records
- Date partitioning: records on different dates stored in different keys
- Aggregation: multiple records for same agent/model aggregated correctly
- Agent filter: query with agent filter returns only that agent's records
- Date range filter: query with start/end returns only records in range
- TTL: date keys have correct TTL set
- Redis unavailable: falls back to in-memory store with warning
- Concurrent writes: two concurrent records for same agent/model don't lose data
- Empty query: query with no matching records returns empty list

**Verification:** Usage data survives process restart; `get_usage()` returns same shape as before; Redis memory usage bounded by TTL.

---

### U5. CascadeStateStore Persistence

**Goal:** Persist CascadeDetector state to Redis using atomic operations, enabling multi-instance cascade detection.

**Dependencies:** None

**Files:**
- `src/agentkit/quality/cascade_store.py` (new): `CascadeStateStore` Protocol, `InMemoryCascadeStore`, `RedisCascadeStore`
- `src/agentkit/quality/cascade_detector.py` (modify): delegate to `CascadeStateStore` backend
- `tests/unit/quality/test_cascade_store.py` (new): unit tests for cascade store backends

**Approach:**

Architecture design before coding:

1. **Redis data model reasoning**: Use simple string keys with INCR for atomic counting. Keys: `agentkit:cascade:{session_id}:interactions` (INCR + TTL) and `agentkit:cascade:{session_id}:depth` (GET/SET + TTL). TTL is aligned with the session TTL (default 86400s). INCR is atomic, so there are no race conditions across instances.

2. **Protocol design**: `CascadeStateStore` with `increment_interactions(session_id) -> int`, `get_interactions(session_id) -> int`, `set_depth(session_id, depth) -> None`, `get_depth(session_id) -> int`, `reset(session_id) -> None`, `get_stats(session_id) -> CascadeStats`.

3. **Integration into CascadeDetector**: Replace the internal `_interaction_counts` and `_loop_depths` dicts with a `CascadeStateStore` backend. All methods delegate to the store. `CascadeDetector` becomes stateless: all state lives in the store.

4. **Session TTL alignment**: When `increment_interactions()` is called, refresh the key TTL to match the session TTL. This ensures state is cleaned up when sessions expire.
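
The Protocol and TTL-refresh behavior above can be sketched with an in-memory backend. This is a simplified illustration: `get_stats()` is omitted because `CascadeStats` is not defined in this plan, the injectable `clock` parameter is an addition for testability, and the class is not thread-safe (the Redis backend would rely on INCR for atomicity instead).

```python
import time
from typing import Callable, Protocol


class CascadeStateStore(Protocol):
    def increment_interactions(self, session_id: str) -> int: ...
    def get_interactions(self, session_id: str) -> int: ...
    def set_depth(self, session_id: str, depth: int) -> None: ...
    def get_depth(self, session_id: str) -> int: ...
    def reset(self, session_id: str) -> None: ...


class InMemoryCascadeStore:
    """Fallback backend; keys mirror the Redis naming scheme."""

    def __init__(self, session_ttl: float = 86400.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = session_ttl
        self._clock = clock
        self._entries: dict[str, tuple[int, float]] = {}  # key -> (value, expires_at)

    def _read(self, key: str) -> int:
        entry = self._entries.get(key)
        if entry is None or entry[1] <= self._clock():
            self._entries.pop(key, None)  # absent or expired
            return 0
        return entry[0]

    def _write(self, key: str, value: int) -> None:
        # Every write refreshes the expiry, like INCR/SET followed by EXPIRE.
        self._entries[key] = (value, self._clock() + self._ttl)

    def increment_interactions(self, session_id: str) -> int:
        key = f"agentkit:cascade:{session_id}:interactions"
        count = self._read(key) + 1
        self._write(key, count)
        return count

    def get_interactions(self, session_id: str) -> int:
        return self._read(f"agentkit:cascade:{session_id}:interactions")

    def set_depth(self, session_id: str, depth: int) -> None:
        self._write(f"agentkit:cascade:{session_id}:depth", depth)

    def get_depth(self, session_id: str) -> int:
        return self._read(f"agentkit:cascade:{session_id}:depth")

    def reset(self, session_id: str) -> None:
        for suffix in ("interactions", "depth"):
            self._entries.pop(f"agentkit:cascade:{session_id}:{suffix}", None)
```

Injecting the clock lets TTL tests advance time deterministically instead of sleeping.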

**Patterns to follow:**
- Same Protocol + factory + fallback pattern as U4
- Redis INCR atomic operation pattern
- `agentkit:cascade:` key namespace

**Test scenarios:**
- Increment and get: increment interactions → get returns correct count
- Set and get depth: set depth → get returns correct depth
- Reset: reset session → interactions and depth both cleared
- TTL: keys have TTL set, expire after session timeout
- Multi-instance: two instances incrementing same session see consistent count
- Redis unavailable: falls back to in-memory store
- Session isolation: different sessions have independent state

**Verification:** Cascade detection state survives process restart; multi-instance deployment detects cascades correctly; no false positives from state loss.

---

### U6. EvolutionStore Interface Unification

**Goal:** Unify `EvolutionStore` and `PersistentEvolutionStore` interfaces, add PostgreSQL backend with full feature set.

**Dependencies:** None

**Files:**
- `src/agentkit/evolution/evolution_store.py` (modify): define unified `EvolutionStoreProtocol`, refactor existing stores
- `src/agentkit/evolution/models.py` (modify): add `SkillVersionModel` and `ABTestResultModel` to async PG models
- `src/agentkit/evolution/pg_store.py` (new): `PostgreSQLEvolutionStore` implementing unified Protocol with async SQLAlchemy
- `tests/unit/evolution/test_unified_store.py` (new): tests for unified interface

**Approach:**

Architecture design before coding:

1. **Protocol design reasoning**: The current `EvolutionStore` (async PG) has `record()`, `rollback()`, `list_events()`. `PersistentEvolutionStore` (sync SQLite) adds `record_skill_version()`, `list_skill_versions()`, `record_ab_test_result()`, `get_ab_test_results()`. The unified Protocol must include ALL methods from both. Each backend implements what it can; unsupported methods raise `NotImplementedError` with a clear message.

2. **PostgreSQL model migration**: Add `SkillVersionModel` and `ABTestResultModel` to `src/agentkit/evolution/models.py` using async SQLAlchemy (matching the `EpisodeModel` pattern in memory/models.py). These models already exist for SQLite; the PG versions use the same schema but with an async engine.

3. **PostgreSQLEvolutionStore**: New class using an async SQLAlchemy session (injected via the constructor, same pattern as the existing `EvolutionStore`). Implements all Protocol methods. Uses `run_in_executor` for any sync ORM operations if needed.

4. **Factory update**: `create_evolution_store(backend="memory"|"sqlite"|"postgresql", ...)` returns the appropriate backend. `"postgresql"` creates a `PostgreSQLEvolutionStore` with an async engine.

5. **Backward compatibility**: The existing `EvolutionStore` class is not removed; it becomes an internal implementation detail. The Protocol is the public interface. Code using `EvolutionStore` directly continues to work.
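
A Protocol compliance check of the kind this unit calls for can be sketched with `typing.runtime_checkable`. The method names come from item 1, but every signature, the integer event IDs, and both store classes below are assumptions made for illustration. Note that `isinstance()` against a runtime-checkable Protocol only verifies that the methods exist, not their signatures or behavior.

```python
import asyncio
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class EvolutionStoreProtocol(Protocol):
    """Union of both existing interfaces; signatures are assumed for this sketch."""

    async def record(self, event: dict[str, Any]) -> int: ...
    async def rollback(self, event_id: int) -> bool: ...
    async def list_events(self) -> list[dict[str, Any]]: ...
    async def record_skill_version(self, skill: str, version: str) -> None: ...
    async def list_skill_versions(self, skill: str) -> list[str]: ...
    async def record_ab_test_result(self, test_id: str, result: dict[str, Any]) -> None: ...
    async def get_ab_test_results(self, test_id: str) -> list[dict[str, Any]]: ...


class EventsOnlyStore:
    """Hypothetical backend with only the three event methods."""

    def __init__(self) -> None:
        self._events: list[dict[str, Any]] = []

    async def record(self, event: dict[str, Any]) -> int:
        self._events.append(dict(event))
        return len(self._events) - 1  # list index doubles as the event ID

    async def rollback(self, event_id: int) -> bool:
        if 0 <= event_id < len(self._events):
            self._events[event_id]["rolled_back"] = True
            return True
        return False

    async def list_events(self) -> list[dict[str, Any]]:
        return list(self._events)


class InMemoryEvolutionStore(EventsOnlyStore):
    """Hypothetical backend covering the full unified interface."""

    def __init__(self) -> None:
        super().__init__()
        self._versions: dict[str, list[str]] = {}
        self._ab_results: dict[str, list[dict[str, Any]]] = {}

    async def record_skill_version(self, skill: str, version: str) -> None:
        self._versions.setdefault(skill, []).append(version)

    async def list_skill_versions(self, skill: str) -> list[str]:
        return list(self._versions.get(skill, []))

    async def record_ab_test_result(self, test_id: str, result: dict[str, Any]) -> None:
        self._ab_results.setdefault(test_id, []).append(dict(result))

    async def get_ab_test_results(self, test_id: str) -> list[dict[str, Any]]:
        return list(self._ab_results.get(test_id, []))
```

A compliance test can then assert `isinstance(store, EvolutionStoreProtocol)` for each backend returned by the factory, with behavioral tests layered on top.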

**Patterns to follow:**
- `EpisodeModel` in `src/agentkit/memory/models.py`: async PG model pattern
- `create_evolution_store()` factory: extend with new backend
- `PersistentEvolutionStore._run_sync()`: sync/async bridge pattern

**Test scenarios:**
- Protocol compliance: all backends implement all Protocol methods
- PG store: record event → list events returns recorded event
- PG store: record skill version → list versions returns version history
- PG store: record AB test result → get results returns test data
- SQLite store: existing functionality preserved after refactor
- Memory store: existing functionality preserved after refactor
- Factory: `create_evolution_store(backend="postgresql")` returns correct type
- PG unavailable: falls back to SQLite with warning

**Verification:** All backends pass unified Protocol compliance test; existing evolution tests pass; PG store supports skill_version and ab_test operations.

---

### U7. Configuration Integration and End-to-End Verification

**Goal:** Wire all three features into the application configuration, add `agentkit.yaml` schema support, and verify end-to-end behavior.

**Dependencies:** U1, U2, U3, U4, U5, U6

**Files:**
- `src/agentkit/server/app.py` (modify): initialize cache, usage store, cascade store with config
- `src/agentkit/cli/main.py` (modify): pass config to gateway and router
- `agentkit.yaml` (modify): add cache, semantic_routing, usage_store, cascade_store config sections
- `tests/integration/test_p0_hardening.py` (new): end-to-end integration tests

**Approach:**

1. **Configuration schema**: Add to `agentkit.yaml`:
   ```yaml
   llm:
     cache:
       enabled: true
       backend: "auto"  # auto | redis | memory
       exact_ttl: 3600
       semantic_ttl: 86400
       similarity_threshold: 0.92
       max_entries: 10000

   routing:
     semantic:
       enabled: true
       similarity_high: 0.85  # direct match threshold
       similarity_low: 0.6    # hint threshold

   usage_store:
     backend: "auto"  # auto | redis | memory
     ttl_days: 90

   cascade_store:
     backend: "auto"  # auto | redis | memory
     session_ttl: 86400

   evolution_store:
     backend: "auto"  # auto | postgresql | sqlite | memory
   ```

2. **Application wiring**: In the `app.py` lifespan, initialize all stores and inject them into the gateway/router. Follow the existing pattern of creating components from config.

3. **End-to-end verification**: Integration test that exercises the full flow: user query → semantic routing → LLM cache → usage tracking → cascade detection → evolution logging.
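
Parsing the `llm.cache` section into a validated dataclass might look like the sketch below. The field names and defaults come from the `CacheConfig` description in U2; the validation rules (rejecting unknown keys, restricting `backend`, bounding the threshold) are assumptions added for illustration and are not specified by this plan.

```python
from dataclasses import dataclass, fields
from typing import Any


@dataclass(frozen=True)
class CacheConfig:
    enabled: bool = False
    backend: str = "auto"
    exact_ttl: int = 3600
    semantic_ttl: int = 86400
    similarity_threshold: float = 0.92
    max_entries: int = 10000

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "CacheConfig":
        """Build a config from the parsed llm.cache mapping."""
        known = {f.name for f in fields(cls)}
        unknown = set(data) - known
        if unknown:
            # Fail fast on typos such as "exact_tll" (sketch assumption).
            raise ValueError(f"unknown cache config keys: {sorted(unknown)}")
        config = cls(**data)
        if config.backend not in ("auto", "redis", "memory"):
            raise ValueError(f"unsupported cache backend: {config.backend!r}")
        if not 0.0 < config.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be in (0, 1]")
        return config
```

Omitted keys fall back to the dataclass defaults, so `{"enabled": True}` alone is a valid section.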

**Test scenarios:**
- Full flow with Redis: all features use Redis backend, data persists across simulated restart
- Full flow without Redis: all features fall back to in-memory, no errors
- Config from YAML: `agentkit.yaml` parsed correctly, all features configured
- Cache + routing interaction: cached response for semantically routed query works correctly
- Usage tracking with cache: cached requests show 0 cost in usage summary
- Cascade detection across instances: simulated multi-instance scenario detects cascade correctly

**Verification:** All integration tests pass; application starts with new config; features degrade gracefully when backends unavailable.

---

## Risks & Mitigations

| Risk | Impact | Likelihood | Mitigation |
|------|--------|-----------|------------|
| Semantic cache returns stale/wrong response | High: user gets incorrect answer | Medium: embedding similarity doesn't guarantee semantic equivalence | Restrict semantic cache to temperature=0 by default; configurable threshold; TTL expiry; admin invalidation API |
| Redis single point of failure | High: all persistence lost | Low: Redis is typically HA | Auto-fallback to in-memory; health check in doctor command; alert on fallback activation |
| Embedding API latency adds to routing time | Medium: slower routing for first query | Medium: embedding API ~100ms | Pre-compute skill embeddings; cache query embeddings; async embedding with timeout |
| UsageStore Redis memory growth | Medium: Redis OOM | Low: TTL + date partitioning bounds growth | 90-day TTL default; monitoring on Redis memory; configurable TTL |
| EvolutionStore interface unification breaks existing code | High: evolution system stops working | Low: Protocol is backward compatible | Keep existing classes as internal implementations; comprehensive test coverage before refactor |

---

## Open Questions

- Should the semantic cache also cache streaming responses (requires chunk collection)? Deferred: the current plan only caches non-streaming `chat()`.
- Should UsageStore support real-time streaming of usage data (e.g., via Redis Pub/Sub)? Deferred: the current plan only supports query-based access.
- What is the optimal embedding model for mixed Chinese+English text? `text-embedding-3-small` is adequate but not optimal. Consider `bge-m3` or `multilingual-e5` as alternatives. Deferred to implementation-time benchmarking.

---

## Sources & Research

- Industry benchmarking: LangChain, Dify, CrewAI, Letta, AutoGen feature comparison (2025-2026)
- Project audit: 12 core files analyzed across memory, evolution, routing, quality, and LLM subsystems
- Existing patterns: `EmbeddingCache`, `RedisSessionStore`, `create_evolution_store()`, `SessionStore` Protocol