| title | status | created_at | type | origin | depth |
|---|---|---|---|---|---|
| feat: P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence | active | 2026-06-14 | feat | Industry research and project review (2026-06-14) | deep |
P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence
Summary
Three P0 gaps were identified from industry benchmarking and a project audit: (1) LLM response caching, targeting a 30-50% reduction in token cost, (2) embedding-based semantic routing to improve intent-matching quality at zero LLM cost, and (3) persistence of critical state for UsageTracker, EvolutionStore, and CascadeDetector so that it survives restarts and enables multi-instance deployment. Each unit requires detailed architecture design and code reasoning before implementation: design first, code second.
Problem Frame
AgentKit has strong differentiation in self-evolution, quality management, and multi-paradigm engines, but three production-critical gaps prevent enterprise deployment:
- Every LLM request hits the provider — no caching. Identical or similar requests waste tokens and money. Competitors like Dify have built-in caching.
- Routing relies on keyword matching and LLM classification — no semantic understanding. Embedding-based routing is industry standard (Agentic RAG trend) and AgentKit already has embedding infrastructure but doesn't use it for routing.
- Critical state lives in memory — UsageTracker, CascadeDetector, and EvolutionStore lose data on restart. Multi-instance deployment is impossible without shared state.
These gaps are P0 because they directly impact cost (caching), quality (routing accuracy), and reliability (state persistence) — the three pillars of production readiness.
Requirements
- R1. LLM cache must support exact-match (hash-based) and semantic-match (embedding-based) cache hits
- R2. LLM cache must integrate transparently into LLMGateway.chat() without changing the public API
- R3. LLM cache must record usage on cache hits (0 cost) to maintain usage tracking integrity
- R4. Semantic routing must insert between Layer 1 and Layer 2 in CostAwareRouter
- R5. Semantic routing must use existing OpenAIEmbedder and compute_cosine_similarity() infrastructure
- R6. Semantic routing must pre-compute skill embeddings at registration time, not at query time
- R7. UsageTracker must persist records to Redis with O(1) write and efficient aggregation
- R8. CascadeDetector must persist state to Redis using atomic INCR operations
- R9. EvolutionStore must support both PostgreSQL and SQLite backends with a unified interface
- R10. All three features must degrade gracefully when Redis/PG is unavailable (fallback to in-memory)
- R11. Each unit must have detailed architecture design and code reasoning documented before implementation begins
Key Technical Decisions
- KTD-1. Cache key design: Use SHA256(model + system_prompt_hash + messages_content_hash + temperature + tools_hash) for exact match. For semantic match, embed the last user message and compare it against cached embeddings, treating cosine similarity above a threshold as a hit. Rationale: exact match is fast and deterministic; semantic match catches paraphrased requests. Both are needed because exact match alone misses too many hits, and semantic match alone is too slow for every request.
- KTD-2. Cache storage backend: Implement LLMCache as a Protocol with InMemoryLLMCache and RedisLLMCache backends. In-memory uses OrderedDict with LRU eviction (following the EmbeddingCache pattern). Redis uses agentkit:llm_cache:{hash} keys with TTL. Rationale: follows the existing factory pattern (create_message_bus, create_session_store); in-memory for dev/single-instance, Redis for production.
- KTD-3. Semantic routing insertion point: Insert as Layer 1.5 between HeuristicClassifier and _classify_merged(). When Layer 1 returns medium complexity (0.3-0.7), try semantic routing first. If similarity > 0.85, return the skill match directly (skip LLM). If similarity is 0.6-0.85, pass a skill hint to the Layer 2 LLM (reduces LLM classification tokens). If similarity < 0.6, proceed to Layer 2 unchanged. Rationale: this placement maximizes cost savings by avoiding LLM calls when the semantic match is confident, while preserving the existing fallback chain.
- KTD-4. Skill embedding source text: Embed f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}" for each skill. Cache embeddings in a dict keyed by skill name, and re-embed on skill registration/update. Rationale: combines all semantic signals; description alone misses keyword intent; keywords alone miss semantic meaning.
- KTD-5. UsageTracker persistence strategy: Use one Redis Hash per date. Key pattern: agentkit:usage:{date}, with one numeric field per metric, {agent}:{model}:{metric}, for tokens, cost, latency_ms, and count; the JSON shape {tokens, cost, latency_ms, count} is assembled at read time. Write via HINCRBYFLOAT (HINCRBY for integer counters) for atomic increments; these commands operate on a single numeric field and cannot update a value stored as a JSON string. Query via HGETALL + client-side aggregation. Rationale: O(1) write, acceptable query performance, natural TTL by date, follows Redis patterns in the project.
- KTD-6. CascadeDetector persistence strategy: Use Redis atomic operations. Key pattern: agentkit:cascade:{session_id}:interactions (INCR + TTL) and agentkit:cascade:{session_id}:depth (SET/GET + TTL). Rationale: INCR is atomic, so there are no race conditions across instances; TTL prevents memory leaks; matches session lifecycle.
- KTD-7. EvolutionStore interface unification: Extend the base EvolutionStore Protocol to include skill_version and ab_test methods. Make PersistentEvolutionStore (SQLite) implement the unified Protocol. Add a new PostgreSQLEvolutionStore that uses async SQLAlchemy like the existing EvolutionStore but with the full unified interface. Rationale: the current split (sync SQLite vs async PG) creates maintenance burden; a unified Protocol enables backend-agnostic usage.
- KTD-8. Graceful degradation pattern: All three features use the same pattern: try the preferred backend, catch the connection error, log a warning, fall back to in-memory. Controlled by the cache.backend, usage_store.backend, and cascade_store.backend config values ("auto" | "redis" | "memory"). "auto" tries Redis and falls back to memory. Rationale: production needs persistence, but dev/testing shouldn't require Redis.
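The exact-match key in KTD-1 could be sketched roughly as follows. The function name generate_cache_key() comes from the file list in U1, but its signature, the "|" separator, and the JSON canonicalization are assumptions made for illustration, not the project's actual implementation.

```python
import hashlib
import json


def _sha256(text: str) -> str:
    """Hex SHA-256 of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def generate_cache_key(model, system_prompt, messages, temperature, tools=None):
    """Combine per-component hashes into one exact-match key (assumed signature)."""
    parts = [
        model,
        _sha256(system_prompt or ""),
        # sort_keys makes the serialized form independent of dict key order
        _sha256(json.dumps(messages, sort_keys=True, ensure_ascii=False)),
        # normalize 0 and 0.0 to the same text
        repr(float(temperature)),
        _sha256(json.dumps(tools or [], sort_keys=True, ensure_ascii=False)),
    ]
    return "agentkit:llm_cache:" + _sha256("|".join(parts))
```

Serializing with sorted keys means two requests that differ only in dict key order map to the same key, while any change to model, prompt, messages, temperature, or tools produces a different one.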
High-Level Technical Design
LLM Cache Flow
flowchart TB
A[LLMGateway.chat] --> B{Cache enabled?}
B -->|no| F[Call Provider]
B -->|yes| C[Generate exact key]
C --> D{Exact match?}
D -->|hit| E[Return cached response]
D -->|miss| G[Generate embedding of last user msg]
G --> H{Semantic match? similarity > 0.92}
H -->|hit| E
H -->|miss| F
F --> I[Write to cache]
I --> J[Record usage]
E --> K[Record usage with 0 cost]
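The flow above can be read as a small piece of control logic. The sketch below is a hypothetical helper, not the LLMGateway code: every callable parameter is an assumed stand-in for a cache backend, the provider, or the usage tracker, and the temperature gate on semantic lookup follows the U1 write policy rather than the diagram.

```python
from typing import Callable, Optional


def cached_chat(
    key: str,
    temperature: float,
    exact_get: Callable[[str], Optional[str]],
    semantic_get: Callable[[], Optional[str]],
    call_provider: Callable[[], str],
    cache_put: Callable[[str, str], None],
    record_usage: Callable[[bool], None],
) -> str:
    """Exact lookup, then semantic lookup, then provider call."""
    cached = exact_get(key)
    if cached is None and temperature == 0:
        # semantic match is only attempted for deterministic requests
        cached = semantic_get()
    if cached is not None:
        record_usage(True)   # cache hit: recorded with zero cost
        return cached
    response = call_provider()
    cache_put(key, response)
    record_usage(False)      # cache miss: recorded with real cost
    return response
```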
Semantic Routing Flow
flowchart TB
A[CostAwareRouter.route] --> B[Layer 0: Regex rules]
B -->|matched| Z[Return DIRECT_CHAT]
B -->|unmatched| C[Layer 1: HeuristicClassifier]
C -->|low complexity| Z
C -->|medium-high| D[Layer 1.5: Semantic Router NEW]
D -->|sim > 0.85| E[Return SKILL_REACT with matched skill]
D -->|sim 0.6-0.85| F[Pass skill_hint to Layer 2]
D -->|sim < 0.6| G[Layer 2: LLM classification]
F --> G
G --> H[Return routing result]
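The three similarity zones in this flow reduce to a single comparison function. In the sketch below, SemanticDecision and decide() are illustrative names that do not appear in the plan; the default thresholds are the 0.85 and 0.6 values given above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SemanticDecision:
    action: str            # "direct" | "hint" | "llm"
    skill: Optional[str]   # matched skill, if any


def decide(best_skill: Optional[str], similarity: float,
           high: float = 0.85, low: float = 0.6) -> SemanticDecision:
    """Map the top similarity score onto the three routing zones."""
    if best_skill is None or similarity < low:
        return SemanticDecision("llm", None)          # Layer 2 runs unchanged
    if similarity > high:
        return SemanticDecision("direct", best_skill)  # skip Layer 2
    return SemanticDecision("hint", best_skill)        # Layer 2 with skill hint
```

Both boundary values (exactly 0.6 and exactly 0.85) fall in the hint zone, matching the 0.6 <= similarity <= 0.85 range stated in U3.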
State Persistence Architecture
flowchart TB
subgraph "Current (In-Memory)"
UT1[UsageTracker dict]
CD1[CascadeDetector dict]
ES1[EvolutionStore SQLite]
end
subgraph "Target (Persistent)"
UT2[UsageStore Protocol]
CD2[CascadeStateStore Protocol]
ES2[UnifiedEvolutionStore Protocol]
UT2 -->|redis| R1[Redis Hash agentkit:usage:date]
UT2 -->|memory| M1[InMemoryUsageStore]
CD2 -->|redis| R2[Redis INCR agentkit:cascade:session]
CD2 -->|memory| M2[InMemoryCascadeStore]
ES2 -->|postgresql| P1[PG EvolutionEventModel + SkillVersionModel]
ES2 -->|sqlite| S1[PersistentEvolutionStore]
ES2 -->|memory| M3[InMemoryEvolutionStore]
end
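Each Protocol in the target architecture selects its backend the same way, so the fallback from KTD-8 can be sketched as one generic factory. create_store() and its parameters are hypothetical. The plan does not say what an explicit "redis" setting should do when Redis is down, so re-raising in that case is an assumption.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger("agentkit.store")


def create_store(backend: str,
                 make_redis: Callable[[], T],
                 make_memory: Callable[[], T]) -> T:
    """Pick a backend; in "auto" mode fall back to memory on connection failure."""
    if backend == "memory":
        return make_memory()
    if backend not in ("auto", "redis"):
        raise ValueError(f"unknown backend: {backend!r}")
    try:
        return make_redis()
    # A real Redis-backed factory would also catch redis.exceptions.ConnectionError.
    except (ConnectionError, OSError) as exc:
        if backend == "redis":
            raise  # assumption: an explicit "redis" setting surfaces the failure
        logger.warning("Redis unavailable (%s); falling back to in-memory store", exc)
        return make_memory()
```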
Scope Boundaries
In Scope
- LLM response caching (exact + semantic match, in-memory + Redis backends)
- Semantic routing as Layer 1.5 in CostAwareRouter
- UsageTracker Redis persistence
- CascadeDetector Redis persistence
- EvolutionStore interface unification
- Configuration for all three features
- Architecture design documents for each unit before coding
Deferred for Follow-Up
- Semantic cache using pgvector (current semantic match uses in-memory embedding comparison)
- Cache warming / pre-population strategies
- Routing cache (caching routing results for similar queries)
- Usage analytics dashboard (visualization of usage data)
- Multi-tenant resource quotas
- Rate limiting and concurrency control (P2)
- Distributed tracing visualization (P2)
Implementation Units
U1. LLM Cache Core
Goal: Implement the LLMCache Protocol, InMemoryLLMCache, and RedisLLMCache with exact-match and semantic-match capabilities.
Dependencies: None
Files:
- src/agentkit/llm/cache.py (new): LLMCache Protocol, InMemoryLLMCache, RedisLLMCache, CacheResult, CacheKey generation
- src/agentkit/llm/cache_key.py (new): generate_cache_key(), generate_messages_hash(), generate_system_prompt_hash()
- tests/unit/llm/test_cache.py (new): unit tests for cache backends
Approach:
Architecture design before coding:
- CacheKey design reasoning: The key must capture all inputs that affect LLM output. model determines which model responds. system_prompt sets behavior. messages carry the conversation. temperature affects randomness (semantic caching is limited to temperature=0, where output is deterministic). tools affect tool_call availability. Hash each component independently so that the hash of an unchanged component (for example, a long system prompt) can be reused; a change to any component still yields a different final key.
- Exact match implementation: SHA-256 hash of the concatenated component hashes. Store as agentkit:llm_cache:{sha256_hex} in Redis with TTL. In-memory uses an OrderedDict keyed by hash string.
- Semantic match implementation: On an exact-match miss, embed the last user message using OpenAIEmbedder. Compare against cached embeddings using compute_cosine_similarity(). Store embeddings alongside cached responses. In-memory: linear scan of all cached embeddings. Redis: store embeddings in a separate key agentkit:llm_cache_emb:{sha256_hex}.
- Cache write policy: Exact-match caching applies at any temperature. Semantic-match caching applies only when temperature == 0 (deterministic); for temperature > 0, outputs are non-deterministic, so no semantic match is attempted.
- Cache invalidation: TTL-based (configurable, default 3600s for exact, 86400s for semantic). Manual invalidation via invalidate(pattern=None) for admin operations.
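A minimal sketch of the in-memory exact-match backend, combining OrderedDict LRU eviction with TTL expiry. The class name InMemoryLLMCache comes from this plan, but the method names, the string-valued payload, and the injectable clock are assumptions chosen to keep the example small and testable.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional


class InMemoryLLMCache:
    """LRU + TTL cache keyed by the exact-match hash (illustrative only)."""

    def __init__(self, max_entries: int = 10000, ttl: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._data = OrderedDict()  # key -> (expires_at, value)
        self._max = max_entries
        self._ttl = ttl
        self._clock = clock

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self._clock() >= expires_at:
            del self._data[key]       # expired entries are dropped lazily
            return None
        self._data.move_to_end(key)   # mark as most recently used
        return value

    def set(self, key: str, value: str) -> None:
        self._data[key] = (self._clock() + self._ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict least recently used
```

Passing the clock as a parameter lets the TTL-expiry test scenario run without sleeping.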
Patterns to follow:
- EmbeddingCache in src/agentkit/memory/embedder.py: LRU + TTL pattern
- create_session_store() factory in src/agentkit/session/store.py: backend factory pattern
- RedisSessionStore._get_redis(): lazy Redis initialization
Test scenarios:
- Exact match: same messages + model → cache hit, returns identical response
- Exact miss: different messages → cache miss, calls provider, writes to cache
- Semantic match: paraphrased question (similarity > 0.92) → cache hit
- Semantic miss: unrelated question (similarity < 0.6) → cache miss
- Temperature > 0: only exact match attempted, no semantic match
- TTL expiry: cached entry expires after TTL, next request is a miss
- Redis unavailable: falls back to in-memory cache with warning log
- Cache with tool_calls: response containing tool_calls is cached correctly
- Concurrent access: two concurrent requests for same key don't cause double-write issues
Verification: Unit tests pass; cache hit rate metric is observable; no change to LLMGateway public API.
U2. LLM Cache Integration
Goal: Integrate LLMCache into LLMGateway.chat() transparently, with usage tracking on cache hits.
Dependencies: U1
Files:
- src/agentkit/llm/gateway.py (modify): inject cache check before provider call, cache write after provider response
- src/agentkit/llm/config.py (modify): add CacheConfig to LLMConfig
- src/agentkit/server/app.py (modify): pass cache config to LLMGateway
- tests/unit/llm/test_gateway_cache.py (new): integration tests for cached gateway
Approach:
Architecture design before coding:
- Insertion point reasoning: Cache check must happen AFTER LLMRequest construction (line ~79 in gateway.py) but BEFORE provider call (line ~87). This ensures all request normalization (alias resolution, model fallback list) has completed. Cache write happens AFTER response validation but BEFORE usage tracking.
- Cache hit usage tracking: On cache hit, call _usage_tracker.record() with the original usage data from the cached response but with cost=0 and latency_ms from cache lookup time. This preserves usage query integrity: get_usage() still shows all requests, just with zero cost for cached ones.
- Stream handling: chat_stream() is NOT cached in this iteration. Streaming requires collecting all chunks before caching, which adds latency and complexity. Document this as a known limitation.
- Configuration integration: Add a CacheConfig dataclass with enabled: bool = False, backend: str = "auto", exact_ttl: int = 3600, semantic_ttl: int = 86400, similarity_threshold: float = 0.92, max_entries: int = 10000. Nest under LLMConfig.cache.
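The CacheConfig fields listed above translate directly into a dataclass. The from_dict() behavior shown here (ignoring unknown keys, validating backend and threshold) is an assumption modeled on the "dataclass + from_dict()" pattern the plan refers to, not the existing LLMConfig code.

```python
from dataclasses import dataclass, fields
from typing import Any, Optional


@dataclass
class CacheConfig:
    enabled: bool = False
    backend: str = "auto"            # auto | redis | memory
    exact_ttl: int = 3600            # seconds
    semantic_ttl: int = 86400        # seconds
    similarity_threshold: float = 0.92
    max_entries: int = 10000

    @classmethod
    def from_dict(cls, data: Optional[dict[str, Any]]) -> "CacheConfig":
        """Build a config from a parsed YAML mapping (assumed behavior)."""
        data = data or {}
        known = {f.name for f in fields(cls)}
        # unknown keys are ignored here; rejecting them is an equally valid choice
        cfg = cls(**{k: v for k, v in data.items() if k in known})
        if cfg.backend not in ("auto", "redis", "memory"):
            raise ValueError(f"invalid cache backend: {cfg.backend!r}")
        if not 0.0 < cfg.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be in (0, 1]")
        return cfg
```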
Patterns to follow:
- LLMConfig dataclass + from_dict() pattern for config
- LLMGateway.__init__() dependency injection pattern
Test scenarios:
- Cache disabled: requests pass through to provider normally
- Cache enabled, first request: cache miss, provider called, response cached
- Cache enabled, second identical request: cache hit, provider NOT called
- Cache hit usage tracking: usage record has 0 cost, correct token counts
- Cache miss + fallback: primary model fails, fallback model response cached under fallback model key
- Config from YAML: LLMConfig.from_dict({"cache": {"enabled": True}}) works correctly
Verification: Integration tests pass; LLMGateway.chat() returns same LLMResponse shape whether cached or not; usage tracking includes cache hits.
U3. Semantic Router
Goal: Implement embedding-based semantic routing as Layer 1.5 in CostAwareRouter, using existing OpenAIEmbedder and compute_cosine_similarity().
Dependencies: None (independent of U1/U2, uses existing embedding infrastructure)
Files:
- src/agentkit/chat/semantic_router.py (new): SemanticRouter class, SkillEmbeddingIndex
- src/agentkit/chat/skill_routing.py (modify): integrate Layer 1.5 into CostAwareRouter.route()
- tests/unit/chat/test_semantic_router.py (new): unit tests for semantic router
Approach:
Architecture design before coding:
- SkillEmbeddingIndex design reasoning: Pre-compute embeddings for all registered skills at initialization. Source text: f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}". Store as dict[str, tuple[list[float], str]] (skill_name → (embedding, source_text)). On skill registration/update, re-embed only the changed skill. This avoids re-embedding all n skills on every query; only the query itself is embedded at query time.
- Query-time flow: Embed user query → compute cosine similarity against all skill embeddings → return top match if above threshold. This is O(n) in the number of skills, but with <100 skills and 1536-dim vectors, this takes <5ms on CPU. No approximate nearest neighbor (ANN) index is needed at this scale.
- Threshold design: Three zones:
  - similarity > 0.85: HIGH confidence → return skill match directly, skip Layer 2 LLM
  - 0.6 <= similarity <= 0.85: MEDIUM confidence → pass skill hint to Layer 2, reducing LLM classification tokens
  - similarity < 0.6: LOW confidence → no semantic signal, Layer 2 runs unmodified
- Integration into CostAwareRouter: Modify the route() method. After Layer 1 (HeuristicClassifier) and before Layer 2 (_classify_merged()), if complexity is medium (0.3-0.7), call semantic_router.route(query). Based on the confidence zone, either return directly or enhance the Layer 2 prompt with a skill hint.
- Embedding provider: Use OpenAIEmbedder by default. Support MockEmbedder for testing. The embedder is injected via constructor, not created internally.
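The index described above might look like the following sketch. It takes the embedder as a plain callable (standing in for OpenAIEmbedder or MockEmbedder) and includes a local cosine() helper in place of the project's compute_cosine_similarity(); the register() and best_match() method names are assumptions.

```python
import math
from typing import Callable, Optional

Embed = Callable[[str], list[float]]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SkillEmbeddingIndex:
    def __init__(self, embed: Embed) -> None:
        self._embed = embed
        self._entries = {}  # skill_name -> (embedding, source_text)

    def register(self, name: str, description: str,
                 keywords: list[str], capability_tags: list[str]) -> None:
        """Embed one skill at registration/update time, not at query time."""
        text = f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}"
        self._entries[name] = (self._embed(text), text)

    def best_match(self, query: str) -> Optional[tuple[str, float]]:
        """Linear scan over all skills; None when the registry is empty."""
        if not self._entries:
            return None
        q = self._embed(query)
        scored = ((name, cosine(q, emb)) for name, (emb, _) in self._entries.items())
        return max(scored, key=lambda pair: pair[1])
```

Calling register() again with the same skill name overwrites its entry, which covers the skill-update scenario without touching other skills.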
Patterns to follow:
- OpenAIEmbedder + EmbeddingCache pattern for embedding computation
- compute_cosine_similarity() in src/agentkit/utils/vector_math.py
- CostAwareRouter constructor injection pattern
Test scenarios:
- Exact skill match: query "生成一篇关于AI的文章" ("Generate an article about AI") matches content_generator skill (sim > 0.85)
- Partial skill match: query "优化内容" ("Optimize the content") matches geo_optimizer skill (sim 0.6-0.85), skill hint passed to LLM
- No skill match: query "今天天气怎么样" ("What's the weather like today?") has sim < 0.6 for all skills, Layer 2 runs normally
- Skill registration: new skill added → embedding computed and indexed
- Skill update: skill description changed → embedding re-computed
- Empty skill registry: semantic router returns None gracefully
- Embedder failure: OpenAIEmbedder throws error → semantic router logs warning, returns None, Layer 2 runs normally
- Chinese query: "帮我写一篇文章" ("Help me write an article") matches content_generator skill correctly
Verification: Semantic router returns correct skill matches; Layer 2 LLM calls reduced by >50% for medium-complexity queries; no regression in routing accuracy.
U4. UsageStore Persistence
Goal: Persist UsageTracker records to Redis, with in-memory fallback and efficient aggregation queries.
Dependencies: None
Files:
- src/agentkit/llm/usage_store.py (new): UsageStore Protocol, InMemoryUsageStore, RedisUsageStore
- src/agentkit/llm/providers/tracker.py (modify): delegate to UsageStore backend
- tests/unit/llm/test_usage_store.py (new): unit tests for usage store backends
Approach:
Architecture design before coding:
- Redis data model reasoning: Use one Redis Hash per date for time-partitioned storage. Key: agentkit:usage:{YYYY-MM-DD}. Because HINCRBYFLOAT and HINCRBY increment a single numeric field and cannot modify a JSON string, each metric gets its own field, {agent_name}:{model}:{metric}, where metric is one of prompt_tokens, completion_tokens, total_tokens, cost, latency_ms, count; the JSON record is assembled at read time. Write via pipeline: HINCRBYFLOAT for float metrics + HINCRBY for integer counters such as count. Each increment is O(1) and atomic, and the data naturally partitions by date.
- Aggregation query design: For get_usage(agent=None, start=None, end=None): read each date key in the range via HGETALL, filter by agent/model in application code, and aggregate in memory. For single-agent queries, use field prefix matching. This is O(days × agent/model pairs), which is acceptable for dashboard queries.
- UsageStore Protocol: Define record(agent, model, usage: UsageRecord) -> None, query(agent=None, model=None, start=None, end=None) -> list[UsageRecord], get_summary(agent=None, start=None, end=None) -> UsageSummary. Provide both sync and async versions (sync for backward compatibility, async for Redis).
- Migration from UsageTracker: UsageTracker becomes a thin wrapper that delegates to UsageStore. The existing record() and get_usage() APIs are preserved. The internal _records list is replaced by the store backend.
- TTL management: Each date key gets a TTL of 90 days (configurable). This prevents unbounded Redis memory growth while preserving 3 months of usage data.
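To make the per-date hash layout concrete, here is an in-memory sketch that mirrors it with nested dicts, using one numeric field per agent, model, and metric so that every write is a plain increment. The method names and keyword-argument style are assumptions; the real UsageStore Protocol takes a UsageRecord, which is not reproduced here.

```python
from collections import defaultdict
from datetime import date
from typing import Optional


class InMemoryUsageStore:
    """Mirrors the Redis layout: one hash per date, one numeric field per metric."""

    def __init__(self) -> None:
        self._hashes = defaultdict(lambda: defaultdict(float))

    def record(self, day: date, agent: str, model: str, **metrics: float) -> None:
        h = self._hashes[f"agentkit:usage:{day.isoformat()}"]
        for name, value in metrics.items():
            h[f"{agent}:{model}:{name}"] += value   # HINCRBYFLOAT equivalent
        h[f"{agent}:{model}:count"] += 1            # HINCRBY equivalent

    def summary(self, day: date, agent: Optional[str] = None) -> dict:
        """HGETALL equivalent followed by client-side grouping."""
        out = defaultdict(dict)
        fields = self._hashes.get(f"agentkit:usage:{day.isoformat()}", {})
        for field, value in fields.items():
            # assumes agent names contain no ":"; model names may contain it
            field_agent, rest = field.split(":", 1)
            model, metric = rest.rsplit(":", 1)
            if agent is None or field_agent == agent:
                out[(field_agent, model)][metric] = value
        return dict(out)
```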
Patterns to follow:
- SessionStore Protocol in src/agentkit/session/store.py: Protocol definition pattern
- RedisSessionStore._get_redis(): lazy Redis initialization
- create_session_store(): factory function pattern
- agentkit:usage: key namespace convention
Test scenarios:
- Record and query: record usage → query returns matching records
- Date partitioning: records on different dates stored in different keys
- Aggregation: multiple records for same agent/model aggregated correctly
- Agent filter: query with agent filter returns only that agent's records
- Date range filter: query with start/end returns only records in range
- TTL: date keys have correct TTL set
- Redis unavailable: falls back to in-memory store with warning
- Concurrent writes: two concurrent records for same agent/model don't lose data
- Empty query: query with no matching records returns empty list
Verification: Usage data survives process restart; get_usage() returns same shape as before; Redis memory usage bounded by TTL.
U5. CascadeStateStore Persistence
Goal: Persist CascadeDetector state to Redis using atomic operations, enabling multi-instance cascade detection.
Dependencies: None
Files:
- src/agentkit/quality/cascade_store.py (new): CascadeStateStore Protocol, InMemoryCascadeStore, RedisCascadeStore
- src/agentkit/quality/cascade_detector.py (modify): delegate to CascadeStateStore backend
- tests/unit/quality/test_cascade_store.py (new): unit tests for cascade store backends
Approach:
Architecture design before coding:
- Redis data model reasoning: Use simple string keys with INCR for atomic counting. Keys: agentkit:cascade:{session_id}:interactions (INCR + TTL), agentkit:cascade:{session_id}:depth (GET/SET + TTL). TTL is aligned with the session TTL (default 86400s). INCR is atomic, so there are no race conditions across instances.
- Protocol design: CascadeStateStore with increment_interactions(session_id) -> int, get_interactions(session_id) -> int, set_depth(session_id, depth) -> None, get_depth(session_id) -> int, reset(session_id) -> None, get_stats(session_id) -> CascadeStats.
- Integration into CascadeDetector: Replace the internal _interaction_counts and _loop_depths dicts with a CascadeStateStore backend. All methods delegate to the store. CascadeDetector becomes stateless: all state lives in the store.
- Session TTL alignment: When increment_interactions() is called, refresh the key TTL to match the session TTL. This ensures state is cleaned up when sessions expire.
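The Protocol and its in-memory fallback could be sketched as below. get_stats() is left out because CascadeStats is not defined in this plan, and TTL handling is omitted; the lock is an assumed way to give the fallback the same single-step increment semantics that INCR provides in Redis.

```python
import threading
from collections import defaultdict
from typing import Protocol


class CascadeStateStore(Protocol):
    def increment_interactions(self, session_id: str) -> int: ...
    def get_interactions(self, session_id: str) -> int: ...
    def set_depth(self, session_id: str, depth: int) -> None: ...
    def get_depth(self, session_id: str) -> int: ...
    def reset(self, session_id: str) -> None: ...


class InMemoryCascadeStore:
    """Fallback backend; state is per-process and lost on restart."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._interactions = defaultdict(int)
        self._depths = {}

    def increment_interactions(self, session_id: str) -> int:
        with self._lock:  # increment and read as one step, like INCR
            self._interactions[session_id] += 1
            return self._interactions[session_id]

    def get_interactions(self, session_id: str) -> int:
        with self._lock:
            return self._interactions.get(session_id, 0)

    def set_depth(self, session_id: str, depth: int) -> None:
        with self._lock:
            self._depths[session_id] = depth

    def get_depth(self, session_id: str) -> int:
        with self._lock:
            return self._depths.get(session_id, 0)

    def reset(self, session_id: str) -> None:
        with self._lock:
            self._interactions.pop(session_id, None)
            self._depths.pop(session_id, None)
```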
Patterns to follow:
- Same Protocol + factory + fallback pattern as U4
- Redis INCR atomic operation pattern
- agentkit:cascade: key namespace
Test scenarios:
- Increment and get: increment interactions → get returns correct count
- Set and get depth: set depth → get returns correct depth
- Reset: reset session → interactions and depth both cleared
- TTL: keys have TTL set, expire after session timeout
- Multi-instance: two instances incrementing same session see consistent count
- Redis unavailable: falls back to in-memory store
- Session isolation: different sessions have independent state
Verification: Cascade detection state survives process restart; multi-instance deployment detects cascades correctly; no false positives from state loss.
U6. EvolutionStore Interface Unification
Goal: Unify EvolutionStore and PersistentEvolutionStore interfaces, add PostgreSQL backend with full feature set.
Dependencies: None
Files:
- src/agentkit/evolution/evolution_store.py (modify): define unified EvolutionStore Protocol, refactor existing stores
- src/agentkit/evolution/models.py (modify): add SkillVersionModel and ABTestResultModel to async PG models
- src/agentkit/evolution/pg_store.py (new): PostgreSQLEvolutionStore implementing the unified Protocol with async SQLAlchemy
- tests/unit/evolution/test_unified_store.py (new): tests for unified interface
Approach:
Architecture design before coding:
- Protocol design reasoning: The current EvolutionStore (async PG) has record(), rollback(), list_events(). PersistentEvolutionStore (sync SQLite) adds record_skill_version(), list_skill_versions(), record_ab_test_result(), get_ab_test_results(). The unified Protocol must include ALL methods from both. Each backend implements what it can; unsupported methods raise NotImplementedError with a clear message.
- PostgreSQL model migration: Add SkillVersionModel and ABTestResultModel to src/agentkit/evolution/models.py using async SQLAlchemy (matching the EpisodeModel pattern in memory/models.py). These models already exist for SQLite; the PG versions use the same schema but with an async engine.
- PostgreSQLEvolutionStore: New class using an async SQLAlchemy session (injected via constructor, same pattern as the existing EvolutionStore). Implements all Protocol methods. Uses run_in_executor for any sync ORM operations if needed.
- Factory update: create_evolution_store(backend="memory"|"sqlite"|"postgresql", ...) returns the appropriate backend. "postgresql" creates PostgreSQLEvolutionStore with an async engine.
- Backward compatibility: The existing EvolutionStore class is not removed; it becomes an internal implementation detail. The Protocol is the public interface. Code using EvolutionStore directly continues to work.
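One possible shape for the unified Protocol is sketched below. The method names come from the plan, but the parameter lists are assumptions, rollback() is omitted for brevity, and SketchEvolutionStore is a hypothetical stand-in that shows the "implement what you can, raise NotImplementedError otherwise" rule.

```python
import asyncio
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class UnifiedEvolutionStore(Protocol):
    async def record(self, event: dict[str, Any]) -> None: ...
    async def list_events(self) -> list[dict[str, Any]]: ...
    async def record_skill_version(self, skill: str, version: str) -> None: ...
    async def list_skill_versions(self, skill: str) -> list[str]: ...
    async def record_ab_test_result(self, test_id: str, result: dict[str, Any]) -> None: ...
    async def get_ab_test_results(self, test_id: str) -> list[dict[str, Any]]: ...


class SketchEvolutionStore:
    """Partial backend: events and skill versions only."""

    def __init__(self) -> None:
        self._events = []
        self._versions = {}

    async def record(self, event):
        self._events.append(event)

    async def list_events(self):
        return list(self._events)

    async def record_skill_version(self, skill, version):
        self._versions.setdefault(skill, []).append(version)

    async def list_skill_versions(self, skill):
        return list(self._versions.get(skill, []))

    async def record_ab_test_result(self, test_id, result):
        raise NotImplementedError("SketchEvolutionStore does not store A/B test results")

    async def get_ab_test_results(self, test_id):
        raise NotImplementedError("SketchEvolutionStore does not store A/B test results")


def demo() -> list[str]:
    """Record two versions and read them back through the async interface."""
    store = SketchEvolutionStore()
    asyncio.run(store.record_skill_version("content_generator", "v1"))
    asyncio.run(store.record_skill_version("content_generator", "v2"))
    return asyncio.run(store.list_skill_versions("content_generator"))
```

An isinstance() check against a runtime_checkable Protocol only verifies that the methods exist, not that they are supported, so a compliance test would still need to call each method on each backend.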
Patterns to follow:
- EpisodeModel in src/agentkit/memory/models.py: async PG model pattern
- create_evolution_store() factory: extend with new backend
- PersistentEvolutionStore._run_sync(): sync/async bridge pattern
Test scenarios:
- Protocol compliance: all backends implement all Protocol methods
- PG store: record event → list events returns recorded event
- PG store: record skill version → list versions returns version history
- PG store: record AB test result → get results returns test data
- SQLite store: existing functionality preserved after refactor
- Memory store: existing functionality preserved after refactor
- Factory: create_evolution_store(backend="postgresql") returns correct type
- PG unavailable: falls back to SQLite with warning
Verification: All backends pass unified Protocol compliance test; existing evolution tests pass; PG store supports skill_version and ab_test operations.
U7. Configuration Integration and End-to-End Verification
Goal: Wire all three features into the application configuration, add agentkit.yaml schema support, and verify end-to-end behavior.
Dependencies: U1, U2, U3, U4, U5, U6
Files:
- src/agentkit/server/app.py (modify): initialize cache, usage store, cascade store with config
- src/agentkit/cli/main.py (modify): pass config to gateway and router
- agentkit.yaml (modify): add cache, semantic_routing, usage_store, cascade_store config sections
- tests/integration/test_p0_hardening.py (new): end-to-end integration tests
Approach:
- Configuration schema: Add to agentkit.yaml:

    llm:
      cache:
        enabled: true
        backend: "auto"             # auto | redis | memory
        exact_ttl: 3600
        semantic_ttl: 86400
        similarity_threshold: 0.92
        max_entries: 10000
    routing:
      semantic:
        enabled: true
        similarity_high: 0.85       # direct match threshold
        similarity_low: 0.6         # hint threshold
    usage_store:
      backend: "auto"               # auto | redis | memory
      ttl_days: 90
    cascade_store:
      backend: "auto"               # auto | redis | memory
      session_ttl: 86400
    evolution_store:
      backend: "auto"               # auto | postgresql | sqlite | memory

- Application wiring: In the app.py lifespan, initialize all stores and inject them into the gateway/router. Follow the existing pattern of creating components from config.
- End-to-end verification: Integration test that exercises the full flow: user query → semantic routing → LLM cache → usage tracking → cascade detection → evolution logging.
Test scenarios:
- Full flow with Redis: all features use Redis backend, data persists across simulated restart
- Full flow without Redis: all features fall back to in-memory, no errors
- Config from YAML: agentkit.yaml parsed correctly, all features configured
- Cache + routing interaction: cached response for semantically routed query works correctly
- Usage tracking with cache: cached requests show 0 cost in usage summary
- Cascade detection across instances: simulated multi-instance scenario detects cascade correctly
Verification: All integration tests pass; application starts with new config; features degrade gracefully when backends unavailable.
Risks & Mitigations
| Risk | Impact | Likelihood | Mitigation |
|---|---|---|---|
| Semantic cache returns stale/wrong response | High — user gets incorrect answer | Medium — embedding similarity doesn't guarantee semantic equivalence | Default to temperature=0 only for semantic cache; configurable threshold; TTL expiry; admin invalidation API |
| Redis single point of failure | High — all persistence lost | Low — Redis is typically HA | Auto-fallback to in-memory; health check in doctor command; alert on fallback activation |
| Embedding API latency adds to routing time | Medium — slower routing for first query | Medium — embedding API ~100ms | Pre-compute skill embeddings; cache query embeddings; async embedding with timeout |
| UsageStore Redis memory growth | Medium — Redis OOM | Low — TTL + date partitioning bounds growth | 90-day TTL default; monitoring on Redis memory; configurable TTL |
| EvolutionStore interface unification breaks existing code | High — evolution system stops working | Low — Protocol is backward compatible | Keep existing classes as internal implementations; comprehensive test coverage before refactor |
Open Questions
- Should semantic cache also cache streaming responses (requires chunk collection)? Deferred: current plan only caches non-streaming chat().
- Should UsageStore support real-time streaming of usage data (e.g., via Redis Pub/Sub)? Deferred: current plan only supports query-based access.
- What is the optimal embedding model for Chinese+English mixed text? text-embedding-3-small is adequate but not optimal. Consider bge-m3 or multilingual-e5 as alternatives. Deferred to implementation-time benchmarking.
Sources & Research
- Industry benchmarking: LangChain, Dify, CrewAI, Letta, AutoGen feature comparison (2025-2026)
- Project audit: 12 core files analyzed across memory, evolution, routing, quality, and LLM subsystems
- Existing patterns: EmbeddingCache, RedisSessionStore, create_evolution_store(), SessionStore Protocol