fischer-agentkit/docs/plans/2026-06-14-001-feat-p0-prod...

title: feat: P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence
status: active
created_at: 2026-06-14
type: feat
origin: Industry research and project review, 2026-06-14
depth: deep

P0 Production Hardening — LLM Cache, Semantic Routing, State Persistence

Summary

Industry benchmarking and a project audit identified three P0 gaps: (1) LLM response caching, expected to cut token cost by an estimated 30-50%, (2) embedding-based semantic routing to improve intent matching quality without an LLM call, (3) critical state persistence for UsageTracker, EvolutionStore, and CascadeDetector so that state survives restarts and multi-instance deployment becomes possible. Each unit requires detailed architecture design and code reasoning before implementation: design first, code second.

Problem Frame

AgentKit has strong differentiation in self-evolution, quality management, and multi-paradigm engines, but three production-critical gaps prevent enterprise deployment:

  1. Every LLM request hits the provider — no caching. Identical or similar requests waste tokens and money. Competitors like Dify have built-in caching.
  2. Routing relies on keyword matching and LLM classification — no semantic understanding. Embedding-based routing is industry standard (Agentic RAG trend) and AgentKit already has embedding infrastructure but doesn't use it for routing.
  3. Critical state lives in memory — UsageTracker, CascadeDetector, and EvolutionStore lose data on restart. Multi-instance deployment is impossible without shared state.

These gaps are P0 because they directly impact cost (caching), quality (routing accuracy), and reliability (state persistence) — the three pillars of production readiness.


Requirements

  • R1. LLM cache must support exact-match (hash-based) and semantic-match (embedding-based) cache hits
  • R2. LLM cache must integrate transparently into LLMGateway.chat() without changing the public API
  • R3. LLM cache must record usage on cache hits (0 cost) to maintain usage tracking integrity
  • R4. Semantic routing must insert between Layer 1 and Layer 2 in CostAwareRouter
  • R5. Semantic routing must use existing OpenAIEmbedder and compute_cosine_similarity() infrastructure
  • R6. Semantic routing must pre-compute skill embeddings at registration time, not at query time
  • R7. UsageTracker must persist records to Redis with O(1) write and efficient aggregation
  • R8. CascadeDetector must persist state to Redis using atomic INCR operations
  • R9. EvolutionStore must support both PostgreSQL and SQLite backends with a unified interface
  • R10. All three features must degrade gracefully when Redis/PG is unavailable (fallback to in-memory)
  • R11. Each unit must have detailed architecture design and code reasoning documented before implementation begins

Key Technical Decisions

  • KTD-1. Cache key design: Use SHA256(model + system_prompt_hash + messages_content_hash + temperature + tools_hash) for exact match. For semantic match, embed the last user message and compare against cached embeddings using cosine similarity > threshold. Rationale: exact match is fast and deterministic; semantic match catches paraphrased requests. Both are needed because exact match alone misses too many hits, and semantic match alone is too slow for every request.

  • KTD-2. Cache storage backend: Implement LLMCache as a Protocol with InMemoryLLMCache and RedisLLMCache backends. In-memory uses OrderedDict with LRU eviction (following EmbeddingCache pattern). Redis uses agentkit:llm_cache:{hash} keys with TTL. Rationale: follows existing factory pattern (create_message_bus, create_session_store); in-memory for dev/single-instance, Redis for production.

  • KTD-3. Semantic routing insertion point: Insert as Layer 1.5 between HeuristicClassifier and _classify_merged(). When Layer 1 returns medium complexity (0.3-0.7), try semantic routing first. If similarity > 0.85, return skill match directly (skip LLM). If similarity 0.6-0.85, pass skill hint to Layer 2 LLM (reduces LLM classification tokens). If < 0.6, proceed to Layer 2 unchanged. Rationale: this placement maximizes cost savings by avoiding LLM calls when semantic match is confident, while preserving the existing fallback chain.

  • KTD-4. Skill embedding source text: Embed f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}" for each skill. Cache embeddings in a dict keyed by skill name, re-embed on skill registration/update. Rationale: combines all semantic signals; description alone misses keyword intent; keywords alone misses semantic meaning.

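  • KTD-5. UsageTracker persistence strategy: Use one Redis Hash per day for time-partitioned counters. Key pattern: agentkit:usage:{date} with one numeric field per metric, {agent}:{model}:{metric}, where metric is one of tokens, cost, latency_ms, count. Write via HINCRBY (integer metrics) and HINCRBYFLOAT (cost, latency_ms) for atomic increments; both commands require plain numeric field values, so a single JSON value per {agent}:{model} field would not work with them. Query via HGETALL + client-side aggregation. Rationale: O(1) write, acceptable query performance, natural TTL by date, follows Redis patterns in project.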

  • KTD-6. CascadeDetector persistence strategy: Use Redis atomic operations. Key pattern: agentkit:cascade:{session_id}:interactions (INCR + TTL) and agentkit:cascade:{session_id}:depth (SET/GET + TTL). Rationale: INCR is atomic, no race conditions across instances; TTL prevents memory leaks; matches session lifecycle.

  • KTD-7. EvolutionStore interface unification: Extend the base EvolutionStore Protocol to include skill_version and ab_test methods. Make PersistentEvolutionStore (SQLite) implement the unified Protocol. Add a new PostgreSQLEvolutionStore that uses async SQLAlchemy like the existing EvolutionStore but with the full unified interface. Rationale: current split (sync SQLite vs async PG) creates maintenance burden; unified Protocol enables backend-agnostic usage.

  • KTD-8. Graceful degradation pattern: All three features use the same pattern — try preferred backend, catch connection error, log warning, fall back to in-memory. Controlled by cache.backend, usage_store.backend, cascade_store.backend config values ("auto" | "redis" | "memory"). "auto" tries Redis, falls back to memory. Rationale: production needs persistence, but dev/testing shouldn't require Redis.
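The degradation pattern in KTD-8 can be sketched as a small selection helper. This is a minimal sketch, not the project's factory: select_backend, make_redis, and make_memory are hypothetical names, and treating an explicit "redis" setting as a hard requirement (rather than falling back) is an assumption the plan does not settle. A real Redis client raises its own connection exception type, which would also need to be caught.

```python
import logging
from typing import Callable, TypeVar

logger = logging.getLogger("agentkit.backends")
T = TypeVar("T")


def select_backend(
    backend: str,
    make_redis: Callable[[], T],
    make_memory: Callable[[], T],
) -> T:
    """Return a store for the configured backend ("auto" | "redis" | "memory").

    make_redis and make_memory are hypothetical factories; make_redis is
    expected to raise OSError (ConnectionError is a subclass) when Redis
    cannot be reached.
    """
    if backend == "memory":
        return make_memory()
    if backend not in ("auto", "redis"):
        raise ValueError(f"unknown backend: {backend!r}")
    try:
        return make_redis()
    except OSError as exc:
        if backend == "redis":
            # Assumption: an explicit "redis" setting is a hard requirement.
            raise
        logger.warning("Redis unavailable (%s); falling back to in-memory", exc)
        return make_memory()
```

With backend="auto", a failing make_redis yields the in-memory store plus a warning log, which is the behavior the "Redis unavailable" test scenarios describe.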


High-Level Technical Design

LLM Cache Flow

flowchart TB
    A[LLMGateway.chat] --> B{Cache enabled?}
    B -->|no| F[Call Provider]
    B -->|yes| C[Generate exact key]
    C --> D{Exact match?}
    D -->|hit| E[Return cached response]
    D -->|miss| G[Generate embedding of last user msg]
    G --> H{Semantic match? similarity > 0.92}
    H -->|hit| E
    H -->|miss| F
    F --> I[Write to cache]
    I --> J[Record usage]
    E --> K[Record usage with 0 cost]
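The lookup half of this flow (exact key first, then a linear scan over stored embeddings) might look like the sketch below. TwoStageCache and CacheEntry are illustrative names, not the planned LLMCache classes, and the plain-Python cosine function stands in for the project's compute_cosine_similarity().

```python
import math
from dataclasses import dataclass, field
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain-Python cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class CacheEntry:
    response: str
    embedding: Optional[Sequence[float]] = None  # absent for exact-only entries


@dataclass
class TwoStageCache:
    threshold: float = 0.92
    entries: dict = field(default_factory=dict)  # exact key -> CacheEntry

    def put(self, key: str, response: str,
            embedding: Optional[Sequence[float]] = None) -> None:
        self.entries[key] = CacheEntry(response, embedding)

    def lookup(self, key: str,
               query_embedding: Optional[Sequence[float]] = None):
        """Return (response, kind); kind is "exact", "semantic" or "miss"."""
        entry = self.entries.get(key)
        if entry is not None:
            return entry.response, "exact"
        if query_embedding is None:
            return None, "miss"
        best, best_sim = None, self.threshold
        for candidate in self.entries.values():
            if candidate.embedding is None:
                continue
            sim = cosine_similarity(query_embedding, candidate.embedding)
            if sim > best_sim:  # strictly above the threshold, as in the flow
                best, best_sim = candidate, sim
        if best is None:
            return None, "miss"
        return best.response, "semantic"
```

Because the semantic stage compares only the last user message, an implementation would probably need to scope candidates by model and system prompt so that a paraphrase under a different prompt is not served the wrong answer; the sketch leaves that out.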

Semantic Routing Flow

flowchart TB
    A[CostAwareRouter.route] --> B[Layer 0: Regex rules]
    B -->|matched| Z[Return DIRECT_CHAT]
    B -->|unmatched| C[Layer 1: HeuristicClassifier]
    C -->|low complexity| Z
    C -->|medium-high| D[Layer 1.5: Semantic Router NEW]
    D -->|sim > 0.85| E[Return SKILL_REACT with matched skill]
    D -->|sim 0.6-0.85| F[Pass skill_hint to Layer 2]
    D -->|sim < 0.6| G[Layer 2: LLM classification]
    F --> G
    G --> H[Return routing result]

State Persistence Architecture

flowchart TB
    subgraph "Current (In-Memory)"
        UT1[UsageTracker dict]
        CD1[CascadeDetector dict]
        ES1[EvolutionStore SQLite]
    end
    subgraph "Target (Persistent)"
        UT2[UsageStore Protocol]
        CD2[CascadeStateStore Protocol]
        ES2[UnifiedEvolutionStore Protocol]
        UT2 -->|redis| R1[Redis Hash agentkit:usage:date]
        UT2 -->|memory| M1[InMemoryUsageStore]
        CD2 -->|redis| R2[Redis INCR agentkit:cascade:session]
        CD2 -->|memory| M2[InMemoryCascadeStore]
        ES2 -->|postgresql| P1[PG EvolutionEventModel + SkillVersionModel]
        ES2 -->|sqlite| S1[PersistentEvolutionStore]
        ES2 -->|memory| M3[InMemoryEvolutionStore]
    end

Scope Boundaries

In Scope

  • LLM response caching (exact + semantic match, in-memory + Redis backends)
  • Semantic routing as Layer 1.5 in CostAwareRouter
  • UsageTracker Redis persistence
  • CascadeDetector Redis persistence
  • EvolutionStore interface unification
  • Configuration for all three features
  • Architecture design documents for each unit before coding

Deferred for Follow-Up

  • Semantic cache using pgvector (current semantic match uses in-memory embedding comparison)
  • Cache warming / pre-population strategies
  • Routing cache (caching routing results for similar queries)
  • Usage analytics dashboard (visualization of usage data)
  • Multi-tenant resource quotas
  • Rate limiting and concurrency control (P2)
  • Distributed tracing visualization (P2)

Implementation Units

U1. LLM Cache Core

Goal: Implement the LLMCache Protocol, InMemoryLLMCache, and RedisLLMCache with exact-match and semantic-match capabilities.

Dependencies: None

Files:

  • src/agentkit/llm/cache.py (new) — LLMCache Protocol, InMemoryLLMCache, RedisLLMCache, CacheResult, CacheKey generation
  • src/agentkit/llm/cache_key.py (new) — generate_cache_key(), generate_messages_hash(), generate_system_prompt_hash()
  • tests/unit/llm/test_cache.py (new) — unit tests for cache backends

Approach:

Architecture design before coding:

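  1. CacheKey design reasoning: The key must capture all inputs that affect LLM output. model determines which model responds. system_prompt sets behavior. messages carry the conversation. temperature affects randomness, so it is part of the key (semantic matching is further restricted to temperature=0; see the cache write policy). tools affect tool_call availability. Hash each component independently, then hash the combined component hashes, so that large components such as the system prompt are reduced to fixed-size digests before the final key is computed. Any change to any component still produces a different final key.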

  2. Exact match implementation: SHA-256 hash of concatenated component hashes. Store as agentkit:llm_cache:{sha256_hex} in Redis with TTL. In-memory uses OrderedDict keyed by hash string.

  3. Semantic match implementation: For cache misses on exact match, embed the last user message using OpenAIEmbedder. Compare against cached embeddings using compute_cosine_similarity(). Store embeddings alongside cached responses. In-memory: linear scan of all cached embeddings. Redis: store embeddings in a separate key agentkit:llm_cache_emb:{sha256_hex}.

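  4. Cache write policy: Exact-match entries are written for every temperature, since temperature is part of the key. Semantic matching applies only when temperature == 0 (deterministic). For temperature > 0, outputs are non-deterministic, so only the exact-match cache applies and no semantic match is attempted.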

  5. Cache invalidation: TTL-based (configurable, default 3600s for exact, 86400s for semantic). Manual invalidation via invalidate(pattern=None) for admin operations.
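The exact-match key described in items 1 and 2 could be built as follows. This is a sketch under assumptions: the plan names generate_cache_key() but does not give its signature, so the parameters here are hypothetical, as are the helper names _sha256, _canonical, and redis_key. The component hashes are combined through canonical JSON rather than raw string concatenation so that component boundaries stay unambiguous.

```python
import hashlib
import json
from typing import Any, Optional, Sequence


def _sha256(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _canonical(value: Any) -> str:
    # sort_keys makes the JSON stable regardless of dict insertion order.
    return json.dumps(value, sort_keys=True, ensure_ascii=False,
                      separators=(",", ":"))


def generate_cache_key(
    model: str,
    messages: Sequence[dict],
    temperature: float = 0.0,
    system_prompt: str = "",
    tools: Optional[Sequence[dict]] = None,
) -> str:
    """SHA-256 over independently hashed components (hypothetical signature)."""
    parts = [
        model,
        _sha256(system_prompt),
        _sha256(_canonical(list(messages))),
        repr(float(temperature)),          # 0 and 0.0 map to the same key
        _sha256(_canonical(list(tools or []))),
    ]
    return _sha256(_canonical(parts))


def redis_key(cache_key: str) -> str:
    """Key layout from the plan: agentkit:llm_cache:{sha256_hex}."""
    return f"agentkit:llm_cache:{cache_key}"
```

Two requests whose message dicts differ only in key order produce the same key, while any change in model, temperature, or tools produces a different one.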

Patterns to follow:

  • EmbeddingCache in src/agentkit/memory/embedder.py — LRU + TTL pattern
  • create_session_store() factory in src/agentkit/session/store.py — backend factory pattern
  • RedisSessionStore._get_redis() — lazy Redis initialization

Test scenarios:

  • Exact match: same messages + model → cache hit, returns identical response
  • Exact miss: different messages → cache miss, calls provider, writes to cache
  • Semantic match: paraphrased question (similarity > 0.92) → cache hit
  • Semantic miss: unrelated question (similarity < 0.6) → cache miss
  • Temperature > 0: only exact match attempted, no semantic match
  • TTL expiry: cached entry expires after TTL, next request is a miss
  • Redis unavailable: falls back to in-memory cache with warning log
  • Cache with tool_calls: response containing tool_calls is cached correctly
  • Concurrent access: two concurrent requests for same key don't cause double-write issues

Verification: Unit tests pass; cache hit rate metric is observable; no change to LLMGateway public API.


U2. LLM Cache Integration

Goal: Integrate LLMCache into LLMGateway.chat() transparently, with usage tracking on cache hits.

Dependencies: U1

Files:

  • src/agentkit/llm/gateway.py (modify) — inject cache check before provider call, cache write after provider response
  • src/agentkit/llm/config.py (modify) — add CacheConfig to LLMConfig
  • src/agentkit/server/app.py (modify) — pass cache config to LLMGateway
  • tests/unit/llm/test_gateway_cache.py (new) — integration tests for cached gateway

Approach:

Architecture design before coding:

  1. Insertion point reasoning: Cache check must happen AFTER LLMRequest construction (line ~79 in gateway.py) but BEFORE provider call (line ~87). This ensures all request normalization (alias resolution, model fallback list) has completed. Cache write happens AFTER response validation but BEFORE usage tracking.

  2. Cache hit usage tracking: On cache hit, call _usage_tracker.record() with the original usage data from the cached response but with cost=0 and latency_ms from cache lookup time. This preserves usage query integrity — get_usage() still shows all requests, just with zero cost for cached ones.

  3. Stream handling: chat_stream() is NOT cached in this iteration. Streaming requires collecting all chunks before caching, which adds latency and complexity. Document this as a known limitation.

  4. Configuration integration: Add CacheConfig dataclass with enabled: bool = False, backend: str = "auto", exact_ttl: int = 3600, semantic_ttl: int = 86400, similarity_threshold: float = 0.92, max_entries: int = 10000. Nest under LLMConfig.cache.
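The CacheConfig in item 4 might be written like this. The field names and defaults come from the plan; the from_dict() validation rules (rejecting unknown keys, checking the backend value and threshold range) are assumptions added for illustration.

```python
from dataclasses import dataclass, fields
from typing import Any, Mapping, Optional


@dataclass
class CacheConfig:
    enabled: bool = False
    backend: str = "auto"            # auto | redis | memory
    exact_ttl: int = 3600            # seconds
    semantic_ttl: int = 86400        # seconds
    similarity_threshold: float = 0.92
    max_entries: int = 10000

    @classmethod
    def from_dict(cls, data: Optional[Mapping[str, Any]]) -> "CacheConfig":
        """Build a config from a parsed YAML mapping (validation is assumed)."""
        values = dict(data or {})
        known = {f.name for f in fields(cls)}
        unknown = sorted(set(values) - known)
        if unknown:
            raise ValueError(f"unknown cache config keys: {unknown}")
        cfg = cls(**values)
        if cfg.backend not in ("auto", "redis", "memory"):
            raise ValueError(f"invalid cache backend: {cfg.backend!r}")
        if not 0.0 < cfg.similarity_threshold <= 1.0:
            raise ValueError("similarity_threshold must be in (0, 1]")
        return cfg
```

Defaulting enabled to False keeps the cache opt-in, so existing deployments see no behavior change until the setting is turned on.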

Patterns to follow:

  • LLMConfig dataclass + from_dict() pattern for config
  • LLMGateway.__init__() dependency injection pattern

Test scenarios:

  • Cache disabled: requests pass through to provider normally
  • Cache enabled, first request: cache miss, provider called, response cached
  • Cache enabled, second identical request: cache hit, provider NOT called
  • Cache hit usage tracking: usage record has 0 cost, correct token counts
  • Cache miss + fallback: primary model fails, fallback model response cached under fallback model key
  • Config from YAML: LLMConfig.from_dict({"cache": {"enabled": true}}) works correctly

Verification: Integration tests pass; LLMGateway.chat() returns same LLMResponse shape whether cached or not; usage tracking includes cache hits.


U3. Semantic Router

Goal: Implement embedding-based semantic routing as Layer 1.5 in CostAwareRouter, using existing OpenAIEmbedder and compute_cosine_similarity().

Dependencies: None (independent of U1/U2, uses existing embedding infrastructure)

Files:

  • src/agentkit/chat/semantic_router.py (new) — SemanticRouter class, SkillEmbeddingIndex
  • src/agentkit/chat/skill_routing.py (modify) — integrate Layer 1.5 into CostAwareRouter.route()
  • tests/unit/chat/test_semantic_router.py (new) — unit tests for semantic router

Approach:

Architecture design before coding:

  1. SkillEmbeddingIndex design reasoning: Pre-compute embeddings for all registered skills at initialization. Source text: f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}". Store as dict[str, tuple[list[float], str]] (skill_name → (embedding, source_text)). On skill registration/update, re-embed only the changed skill. This avoids O(n) embedding computation per query.

  2. Query-time flow: Embed user query → compute cosine similarity against all skill embeddings → return top match if above threshold. The similarity scan is O(n) in number of skills, but with <100 skills and 1536-dim vectors it is expected to take <5ms on CPU, excluding the query embedding call itself. No need for approximate nearest neighbor (ANN) index at this scale.

  3. Threshold design: Three zones:

    • similarity > 0.85: HIGH confidence → return skill match directly, skip Layer 2 LLM
    • 0.6 <= similarity <= 0.85: MEDIUM confidence → pass skill hint to Layer 2, reducing LLM classification tokens
    • similarity < 0.6: LOW confidence → no semantic signal, Layer 2 runs unmodified

  4. Integration into CostAwareRouter: Modify route() method. After Layer 1 (HeuristicClassifier) and before Layer 2 (_classify_merged()), if complexity is medium (0.3-0.7), call semantic_router.route(query). Based on confidence zone, either return directly or enhance the Layer 2 prompt with skill hint.

  5. Embedding provider: Use OpenAIEmbedder by default. Support MockEmbedder for testing. Embedder is injected via constructor, not created internally.
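The index and the three confidence zones can be sketched together. This is an illustration under assumptions: the embedder is modeled as a synchronous callable (the real OpenAIEmbedder interface may well be async), SemanticMatch and the zone labels are hypothetical, and a plain-Python cosine function stands in for compute_cosine_similarity().

```python
import logging
import math
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

logger = logging.getLogger("agentkit.semantic_router")
Embedder = Callable[[str], Sequence[float]]  # assumed sync interface


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class SemanticMatch:
    skill: str
    similarity: float
    zone: str  # "high": return directly; "medium": pass as hint to Layer 2


class SkillEmbeddingIndex:
    def __init__(self, embed: Embedder, high: float = 0.85, low: float = 0.6):
        self._embed = embed
        self._high, self._low = high, low
        self._vectors: dict = {}  # skill name -> embedding

    def register(self, name: str, description: str,
                 keywords: Sequence[str] = (),
                 capability_tags: Sequence[str] = ()) -> None:
        """Embed once at registration; re-registering replaces the vector."""
        text = f"{description} | {' '.join(keywords)} | {' '.join(capability_tags)}"
        self._vectors[name] = self._embed(text)

    def route(self, query: str) -> Optional[SemanticMatch]:
        if not self._vectors:
            return None
        try:
            q = self._embed(query)
        except Exception as exc:  # embedder failure must not break routing
            logger.warning("embedding failed (%s); skipping Layer 1.5", exc)
            return None
        name, sim = max(
            ((n, cosine_similarity(q, v)) for n, v in self._vectors.items()),
            key=lambda pair: pair[1],
        )
        if sim > self._high:
            return SemanticMatch(name, sim, "high")
        if sim >= self._low:
            return SemanticMatch(name, sim, "medium")
        return None  # low confidence: Layer 2 runs unmodified
```

Returning None for both "no signal" and "embedder failed" keeps the caller simple: in either case the router falls through to Layer 2 unchanged.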

Patterns to follow:

  • OpenAIEmbedder + EmbeddingCache pattern for embedding computation
  • compute_cosine_similarity() in src/agentkit/utils/vector_math.py
  • CostAwareRouter constructor injection pattern

Test scenarios:

  • Exact skill match: query "生成一篇关于AI的文章" matches content_generator skill (sim > 0.85)
  • Partial skill match: query "优化内容" matches geo_optimizer skill (sim 0.6-0.85), skill hint passed to LLM
  • No skill match: query "今天天气怎么样" has sim < 0.6 for all skills, Layer 2 runs normally
  • Skill registration: new skill added → embedding computed and indexed
  • Skill update: skill description changed → embedding re-computed
  • Empty skill registry: semantic router returns None gracefully
  • Embedder failure: OpenAIEmbedder throws error → semantic router logs warning, returns None, Layer 2 runs normally
  • Chinese query: "帮我写一篇文章" matches content_generator skill correctly

Verification: Semantic router returns correct skill matches; Layer 2 LLM calls reduced by >50% for medium-complexity queries; no regression in routing accuracy.


U4. UsageStore Persistence

Goal: Persist UsageTracker records to Redis, with in-memory fallback and efficient aggregation queries.

Dependencies: None

Files:

  • src/agentkit/llm/usage_store.py (new) — UsageStore Protocol, InMemoryUsageStore, RedisUsageStore
  • src/agentkit/llm/providers/tracker.py (modify) — delegate to UsageStore backend
  • tests/unit/llm/test_usage_store.py (new) — unit tests for usage store backends

Approach:

Architecture design before coding:

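  1. Redis data model reasoning: Use one Redis Hash per date for time-partitioned storage. Key: agentkit:usage:{YYYY-MM-DD}. Fields: one per metric, {agent_name}:{model}:{metric}, for prompt_tokens, completion_tokens, total_tokens, cost, latency_ms, and count, each holding a plain number (HINCRBY and HINCRBYFLOAT cannot increment values stored inside a JSON string). Write via pipeline: HINCRBYFLOAT for cost and latency_ms, HINCRBY for token counts and count. Each increment is O(1) and atomic; if the whole group must be applied together, run the pipeline as a MULTI/EXEC transaction. Keys naturally partition by date.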

  2. Aggregation query design: For get_usage(agent=None, start=None, end=None): enumerate the date keys in the requested range, read each with HGETALL, filter by agent/model in application code, and aggregate in memory. For single-agent queries, HSCAN with a MATCH pattern on the field prefix avoids transferring other agents' fields. The cost is O(days × agents), which is acceptable for dashboard queries.

  3. UsageStore Protocol: Define record(agent, model, usage: UsageRecord) -> None, query(agent=None, model=None, start=None, end=None) -> list[UsageRecord], get_summary(agent=None, start=None, end=None) -> UsageSummary. Both sync and async versions (sync for backward compat, async for Redis).

  4. Migration from UsageTracker: UsageTracker becomes a thin wrapper that delegates to UsageStore. Existing record() and get_usage() APIs preserved. Internal _records list replaced by store backend.

  5. TTL management: Each date key gets TTL of 90 days (configurable). This prevents unbounded Redis memory growth while preserving 3 months of usage data.
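A write and aggregation path for this data model might look like the sketch below. It assumes one numeric hash field per metric ({agent}:{model}:{metric}) so that the increment commands apply, and a client object exposing redis-py style pipeline(), hincrby(), hincrbyfloat(), and expire() methods. record_usage, aggregate, and usage_key are hypothetical names, and agent names are assumed not to contain ":".

```python
from collections import defaultdict
from datetime import date
from typing import Any, Mapping

INT_METRICS = ("prompt_tokens", "completion_tokens", "total_tokens")
FLOAT_METRICS = ("cost", "latency_ms")


def usage_key(day: date) -> str:
    return f"agentkit:usage:{day.isoformat()}"


def record_usage(client: Any, day: date, agent: str, model: str,
                 usage: Mapping[str, float], ttl_days: int = 90) -> None:
    """Increment per-metric counters for one request and refresh the key TTL."""
    key = usage_key(day)
    prefix = f"{agent}:{model}"
    pipe = client.pipeline()  # redis-py pipelines are transactional by default
    for metric in INT_METRICS:
        pipe.hincrby(key, f"{prefix}:{metric}", int(usage.get(metric, 0)))
    pipe.hincrby(key, f"{prefix}:count", 1)
    for metric in FLOAT_METRICS:
        pipe.hincrbyfloat(key, f"{prefix}:{metric}", float(usage.get(metric, 0.0)))
    pipe.expire(key, ttl_days * 86400)
    pipe.execute()


def aggregate(raw: Mapping[Any, Any]) -> dict:
    """Group an HGETALL result into {(agent, model): {metric: value}}."""
    out: dict = defaultdict(dict)
    for field, value in raw.items():
        if isinstance(field, bytes):  # clients without decode_responses
            field = field.decode("utf-8")
        agent, rest = field.split(":", 1)
        model, metric = rest.rsplit(":", 1)  # model names may contain ":"
        out[(agent, model)][metric] = float(value)
    return dict(out)
```

Because every write is an increment, two instances recording the same agent/model concurrently cannot overwrite each other, which is the property the "Concurrent writes" scenario checks.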

Patterns to follow:

  • SessionStore Protocol in src/agentkit/session/store.py — Protocol definition pattern
  • RedisSessionStore._get_redis() — lazy Redis initialization
  • create_session_store() — factory function pattern
  • agentkit:usage: key namespace convention

Test scenarios:

  • Record and query: record usage → query returns matching records
  • Date partitioning: records on different dates stored in different keys
  • Aggregation: multiple records for same agent/model aggregated correctly
  • Agent filter: query with agent filter returns only that agent's records
  • Date range filter: query with start/end returns only records in range
  • TTL: date keys have correct TTL set
  • Redis unavailable: falls back to in-memory store with warning
  • Concurrent writes: two concurrent records for same agent/model don't lose data
  • Empty query: query with no matching records returns empty list

Verification: Usage data survives process restart; get_usage() returns same shape as before; Redis memory usage bounded by TTL.


U5. CascadeStateStore Persistence

Goal: Persist CascadeDetector state to Redis using atomic operations, enabling multi-instance cascade detection.

Dependencies: None

Files:

  • src/agentkit/quality/cascade_store.py (new) — CascadeStateStore Protocol, InMemoryCascadeStore, RedisCascadeStore
  • src/agentkit/quality/cascade_detector.py (modify) — delegate to CascadeStateStore backend
  • tests/unit/quality/test_cascade_store.py (new) — unit tests for cascade store backends

Approach:

Architecture design before coding:

  1. Redis data model reasoning: Use simple string keys with INCR for atomic counting. Key: agentkit:cascade:{session_id}:interactions (INCR + TTL), agentkit:cascade:{session_id}:depth (GET/SET + TTL). TTL aligned with session TTL (default 86400s). INCR is atomic — no race conditions across instances.

  2. Protocol design: CascadeStateStore with increment_interactions(session_id) -> int, get_interactions(session_id) -> int, set_depth(session_id, depth) -> None, get_depth(session_id) -> int, reset(session_id) -> None, get_stats(session_id) -> CascadeStats.

  3. Integration into CascadeDetector: Replace internal _interaction_counts and _loop_depths dicts with CascadeStateStore backend. All methods delegate to store. CascadeDetector becomes stateless — all state lives in the store.

  4. Session TTL alignment: When increment_interactions() is called, refresh the key TTL to match session TTL. This ensures state is cleaned up when sessions expire.
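The Protocol and its Redis backend could be sketched as below. Method names follow item 2; get_stats() is omitted because CascadeStats is not defined in this plan. The client is assumed to expose redis-py style incr(), expire(), get(), set(..., ex=...), and delete() methods.

```python
from typing import Any, Protocol


class CascadeStateStore(Protocol):
    def increment_interactions(self, session_id: str) -> int: ...
    def get_interactions(self, session_id: str) -> int: ...
    def set_depth(self, session_id: str, depth: int) -> None: ...
    def get_depth(self, session_id: str) -> int: ...
    def reset(self, session_id: str) -> None: ...


class RedisCascadeStore:
    """Sketch of a Redis-backed store; `client` is an assumed redis-py client."""

    def __init__(self, client: Any, session_ttl: int = 86400):
        self._client = client
        self._ttl = session_ttl

    @staticmethod
    def _key(session_id: str, suffix: str) -> str:
        return f"agentkit:cascade:{session_id}:{suffix}"

    def increment_interactions(self, session_id: str) -> int:
        key = self._key(session_id, "interactions")
        count = self._client.incr(key)        # atomic across instances
        self._client.expire(key, self._ttl)   # refresh TTL on every increment
        return int(count)

    def get_interactions(self, session_id: str) -> int:
        return int(self._client.get(self._key(session_id, "interactions")) or 0)

    def set_depth(self, session_id: str, depth: int) -> None:
        self._client.set(self._key(session_id, "depth"), depth, ex=self._ttl)

    def get_depth(self, session_id: str) -> int:
        return int(self._client.get(self._key(session_id, "depth")) or 0)

    def reset(self, session_id: str) -> None:
        self._client.delete(self._key(session_id, "interactions"),
                            self._key(session_id, "depth"))
```

INCR and EXPIRE are issued as two commands here; if a crash between them is a concern, an implementation could run both in one transaction so a counter is never left without a TTL.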

Patterns to follow:

  • Same Protocol + factory + fallback pattern as U4
  • Redis INCR atomic operation pattern
  • agentkit:cascade: key namespace

Test scenarios:

  • Increment and get: increment interactions → get returns correct count
  • Set and get depth: set depth → get returns correct depth
  • Reset: reset session → interactions and depth both cleared
  • TTL: keys have TTL set, expire after session timeout
  • Multi-instance: two instances incrementing same session see consistent count
  • Redis unavailable: falls back to in-memory store
  • Session isolation: different sessions have independent state

Verification: Cascade detection state survives process restart; multi-instance deployment detects cascades correctly; no false positives from state loss.


U6. EvolutionStore Interface Unification

Goal: Unify EvolutionStore and PersistentEvolutionStore interfaces, add PostgreSQL backend with full feature set.

Dependencies: None

Files:

  • src/agentkit/evolution/evolution_store.py (modify) — define unified EvolutionStoreProtocol, refactor existing stores
  • src/agentkit/evolution/models.py (modify) — add SkillVersionModel and ABTestResultModel to async PG models
  • src/agentkit/evolution/pg_store.py (new) — PostgreSQLEvolutionStore implementing unified Protocol with async SQLAlchemy
  • tests/unit/evolution/test_unified_store.py (new) — tests for unified interface

Approach:

Architecture design before coding:

  1. Protocol design reasoning: Current EvolutionStore (async PG) has record(), rollback(), list_events(). PersistentEvolutionStore (sync SQLite) adds record_skill_version(), list_skill_versions(), record_ab_test_result(), get_ab_test_results(). The unified Protocol must include ALL methods from both. Each backend implements what it can; unsupported methods raise NotImplementedError with clear message.

  2. PostgreSQL model migration: Add SkillVersionModel and ABTestResultModel to src/agentkit/evolution/models.py using async SQLAlchemy (matching EpisodeModel pattern in memory/models.py). These models already exist for SQLite; the PG versions use the same schema but with async engine.

  3. PostgreSQLEvolutionStore: New class using async SQLAlchemy session (injected via constructor, same pattern as existing EvolutionStore). Implements all Protocol methods. Uses run_in_executor for any sync ORM operations if needed.

  4. Factory update: create_evolution_store(backend="memory"|"sqlite"|"postgresql", ...) returns the appropriate backend. "postgresql" creates PostgreSQLEvolutionStore with async engine.

  5. Backward compatibility: Existing EvolutionStore class is not removed — it becomes an internal implementation detail. The Protocol is the public interface. Code using EvolutionStore directly continues to work.
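A unified Protocol plus a small compliance helper might look like this. The seven method names are the ones listed in item 1; every parameter and return type is a placeholder, since the plan does not specify signatures, and missing_methods() is a hypothetical helper for the "Protocol compliance" test scenario.

```python
from typing import Any, Protocol

REQUIRED_METHODS = (
    "record", "rollback", "list_events",
    "record_skill_version", "list_skill_versions",
    "record_ab_test_result", "get_ab_test_results",
)


class EvolutionStoreProtocol(Protocol):
    # Signatures below are illustrative assumptions, not the project's API.
    async def record(self, event: dict) -> str: ...
    async def rollback(self, event_id: str) -> bool: ...
    async def list_events(self, limit: int = 100) -> list: ...
    async def record_skill_version(self, skill: str, version: dict) -> None: ...
    async def list_skill_versions(self, skill: str) -> list: ...
    async def record_ab_test_result(self, test_id: str, result: dict) -> None: ...
    async def get_ab_test_results(self, test_id: str) -> list: ...


def missing_methods(store: Any) -> list:
    """Return the Protocol method names that `store` does not provide."""
    return [name for name in REQUIRED_METHODS
            if not callable(getattr(store, name, None))]
```

A check like this only confirms that the methods exist; a backend that defines a method but raises NotImplementedError inside it would still pass, so behavior needs separate tests per backend.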

Patterns to follow:

  • EpisodeModel in src/agentkit/memory/models.py — async PG model pattern
  • create_evolution_store() factory — extend with new backend
  • PersistentEvolutionStore._run_sync() — sync/async bridge pattern

Test scenarios:

  • Protocol compliance: every backend exposes all Protocol methods; any method a backend cannot support raises NotImplementedError with a clear message
  • PG store: record event → list events returns recorded event
  • PG store: record skill version → list versions returns version history
  • PG store: record AB test result → get results returns test data
  • SQLite store: existing functionality preserved after refactor
  • Memory store: existing functionality preserved after refactor
  • Factory: create_evolution_store(backend="postgresql") returns correct type
  • PG unavailable: falls back to SQLite with warning

Verification: All backends pass unified Protocol compliance test; existing evolution tests pass; PG store supports skill_version and ab_test operations.


U7. Configuration Integration and End-to-End Verification

Goal: Wire all three features into the application configuration, add agentkit.yaml schema support, and verify end-to-end behavior.

Dependencies: U1, U2, U3, U4, U5, U6

Files:

  • src/agentkit/server/app.py (modify) — initialize cache, usage store, cascade store with config
  • src/agentkit/cli/main.py (modify) — pass config to gateway and router
  • agentkit.yaml (modify) — add cache, semantic_routing, usage_store, cascade_store config sections
  • tests/integration/test_p0_hardening.py (new) — end-to-end integration tests

Approach:

  1. Configuration schema: Add to agentkit.yaml:
llm:
  cache:
    enabled: true
    backend: "auto"          # auto | redis | memory
    exact_ttl: 3600
    semantic_ttl: 86400
    similarity_threshold: 0.92
    max_entries: 10000

routing:
  semantic:
    enabled: true
    similarity_high: 0.85    # direct match threshold
    similarity_low: 0.6      # hint threshold

usage_store:
  backend: "auto"            # auto | redis | memory
  ttl_days: 90

cascade_store:
  backend: "auto"            # auto | redis | memory
  session_ttl: 86400

evolution_store:
  backend: "auto"            # auto | postgresql | sqlite | memory

  2. Application wiring: In app.py lifespan, initialize all stores and inject into gateway/router. Follow existing pattern of creating components from config.

  3. End-to-end verification: Integration test that exercises the full flow: user query → semantic routing → LLM cache → usage tracking → cascade detection → evolution logging.

Test scenarios:

  • Full flow with Redis: all features use Redis backend, data persists across simulated restart
  • Full flow without Redis: all features fall back to in-memory, no errors
  • Config from YAML: agentkit.yaml parsed correctly, all features configured
  • Cache + routing interaction: cached response for semantically routed query works correctly
  • Usage tracking with cache: cached requests show 0 cost in usage summary
  • Cascade detection across instances: simulated multi-instance scenario detects cascade correctly

Verification: All integration tests pass; application starts with new config; features degrade gracefully when backends unavailable.


Risks & Mitigations

| Risk | Impact | Likelihood | Mitigation |
| --- | --- | --- | --- |
| Semantic cache returns stale/wrong response | High: user gets incorrect answer | Medium: embedding similarity doesn't guarantee semantic equivalence | Default to temperature=0 only for semantic cache; configurable threshold; TTL expiry; admin invalidation API |
| Redis single point of failure | High: all persistence lost | Low: Redis is typically HA | Auto-fallback to in-memory; health check in doctor command; alert on fallback activation |
| Embedding API latency adds to routing time | Medium: slower routing for first query | Medium: embedding API ~100ms | Pre-compute skill embeddings; cache query embeddings; async embedding with timeout |
| UsageStore Redis memory growth | Medium: Redis OOM | Low: TTL + date partitioning bounds growth | 90-day TTL default; monitoring on Redis memory; configurable TTL |
| EvolutionStore interface unification breaks existing code | High: evolution system stops working | Low: Protocol is backward compatible | Keep existing classes as internal implementations; comprehensive test coverage before refactor |

Open Questions

  • Should semantic cache also cache streaming responses (requires chunk collection)? Deferred — current plan only caches non-streaming chat().
  • Should UsageStore support real-time streaming of usage data (e.g., via Redis Pub/Sub)? Deferred — current plan only supports query-based access.
  • What is the optimal embedding model for Chinese+English mixed text? text-embedding-3-small is adequate but not optimal. Consider bge-m3 or multilingual-e5 as alternatives. Deferred to implementation-time benchmarking.

Sources & Research

  • Industry benchmarking: LangChain, Dify, CrewAI, Letta, AutoGen feature comparison (2025-2026)
  • Project audit: 12 core files analyzed across memory, evolution, routing, quality, and LLM subsystems
  • Existing patterns: EmbeddingCache, RedisSessionStore, create_evolution_store(), SessionStore Protocol