# U3 Architecture Design: Semantic Router

> Status: APPROVED. Design follows existing CostAwareRouter layer pattern
> Date: 2026-06-14
> Unit: U3 of P0 Production Hardening Plan

---

## 1. Design Goals

1. **Zero LLM cost for confident matches**: When semantic similarity > 0.85, skip Layer 2 LLM classification entirely
2. **Reduce LLM tokens for medium matches**: When similarity 0.6-0.85, pass skill hint to Layer 2, reducing classification tokens
3. **Chinese-first**: Embedding model must handle Chinese+English mixed text well
4. **Pre-computed skill embeddings**: Compute at skill registration time, not query time
5. **Graceful degradation**: If embedder fails, fall through to existing Layer 1/2 flow

---

## 2. Insertion Point Analysis

### Current `CostAwareRouter.route()` flow:

```
Layer 0: Rule-based (zero cost)
  → explicit_skill / greeting / chat_mode / identity → return
Layer 1: Complexity classification
  → low (<0.3) → DIRECT_CHAT → return
  → medium (0.3-0.7) → _classify_merged() or IntentRouter → return
  → high (>0.7) → Layer 2
Layer 2: Capability matching / Auction → return
```

### Semantic Router insertion:

**Between Layer 1 complexity classification and the medium/high branching**

```
Layer 1: Complexity classification → complexity score
  → low (<0.3) → DIRECT_CHAT → return
  → medium (0.3-0.7):
      ┌─ Layer 1.5: Semantic Router (NEW) ──────────────────┐
      │ embed query → compare with skill embeddings         │
      │ sim > 0.85   → SKILL_REACT with matched skill       │
      │ sim 0.6-0.85 → pass skill_hint to _classify_merged  │
      │ sim < 0.6    → proceed to _classify_merged normally │
      └─────────────────────────────────────────────────────┘
  → high (>0.7):
      ┌─ Layer 1.5: Semantic Router (NEW) ──────────────────┐
      │ sim > 0.85   → SKILL_REACT with matched skill       │
      │ sim 0.6-0.85 → pass skill_hint to Layer 2           │
      │ sim < 0.6    → proceed to Layer 2 normally          │
      └─────────────────────────────────────────────────────┘
```

**Why both medium AND high complexity?** The plan says "when Layer 1 returns medium complexity (0.3-0.7), try
semantic routing first." But semantic routing is also valuable for high complexity: if we can confidently match a skill at zero cost, we should. The cost saving is even greater for high complexity, which would otherwise use more expensive Layer 2 LLM calls.

---

## 3. Component Design

### 3.1 SkillEmbeddingIndex

```python
class SkillEmbeddingIndex:
    """Pre-computed embedding index for registered skills."""

    def __init__(self, embedder: Embedder):
        self._embedder = embedder
        # skill_name → (embedding, source_text)
        self._index: dict[str, tuple[list[float], str]] = {}

    async def build(self, skill_registry) -> None:
        """Build index from all registered skills."""
        ...

    async def update_skill(self, skill_name: str, skill) -> None:
        """Re-embed a single skill (on registration/update)."""
        ...

    def remove_skill(self, skill_name: str) -> None:
        """Remove a skill from the index."""
        ...

    async def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Search for skills matching the query. Returns [(skill_name, similarity)]."""
        ...
```

### 3.2 SemanticRouter

```python
from dataclasses import dataclass


# Defined before SemanticRouter so the return annotation on route() resolves.
@dataclass
class SemanticRouteResult:
    confidence: str  # "high" | "medium" | "low"
    skill_name: str | None
    similarity: float


class SemanticRouter:
    """Embedding-based semantic routing as Layer 1.5."""

    def __init__(
        self,
        embedder: Embedder,
        similarity_high: float = 0.85,
        similarity_low: float = 0.6,
    ):
        self._index = SkillEmbeddingIndex(embedder)
        self._similarity_high = similarity_high
        self._similarity_low = similarity_low
        self._enabled = True

    async def route(self, query: str) -> SemanticRouteResult:
        """Route a query using semantic similarity.

        Returns:
            SemanticRouteResult with:
            - confidence: "high" | "medium" | "low"
            - skill_name: matched skill name (None if low confidence)
            - similarity: cosine similarity score
        """
        ...
```

---

## 4. Skill Embedding Source Text

**Design Decision**: What text to embed for each skill?
```python
source_text = (
    f"{skill.description} | "
    f"{' '.join(skill.intent.keywords)} | "
    f"{' '.join(cap.tag for cap in skill.capabilities)}"
)
```

**Why this combination?**

- `description`: Captures the semantic meaning of what the skill does
- `intent.keywords`: Captures the trigger phrases users might use
- `capability tags`: Captures the functional categories

**Chinese consideration**: Skill descriptions and keywords are often in Chinese. The embedding model must handle this well. `bge-m3` is the default for this reason.

---

## 5. Integration into CostAwareRouter

### 5.1 Constructor Change

```python
class CostAwareRouter:
    def __init__(self, ..., semantic_router: SemanticRouter | None = None):
        self._semantic_router = semantic_router
        ...
```

### 5.2 Route Method Modification

The key change is in `route()`, after Layer 1 complexity classification:

```python
# After complexity is determined (medium or high)
skill_hint: str | None = None  # stays None unless a medium-confidence match is found
if self._semantic_router is not None and complexity >= 0.3:
    try:
        semantic_result = await self._semantic_router.route(clean_content)
        if semantic_result.confidence == "high":
            # Direct skill match: skip Layer 2
            result = await resolve_skill_routing(
                content=content,
                skill_registry=skill_registry,
                intent_router=intent_router,
                ...,
                force_skill=semantic_result.skill_name,  # NEW parameter
            )
            result.match_method = "semantic_high"
            result.match_confidence = semantic_result.similarity
            result.execution_mode = ExecutionMode.SKILL_REACT
            return result
        elif semantic_result.confidence == "medium":
            # Pass skill hint to _classify_merged() or Layer 2
            skill_hint = semantic_result.skill_name
    except Exception as e:
        # Any failure leaves skill_hint as None and the existing flow runs unchanged
        logger.warning(f"Semantic routing failed, falling through: {e}")
```

### 5.3 Skill Hint Propagation

For medium confidence matches, the skill hint is passed to `_classify_merged()` or `_route_layer2()` via a new `skill_hint` parameter. The hint gives the LLM a strong signal, which is intended to reduce the tokens spent on classification.

**Implementation**: Add `skill_hint: str | None = None` parameter to `_classify_merged()` and `_route_layer2()`. When provided, include it in the LLM prompt: "Based on semantic analysis, the query may relate to skill '{skill_hint}'. Please confirm or override."

---

## 6. Embedding Caching

Skill embeddings are pre-computed and cached in `SkillEmbeddingIndex`. Query embeddings are computed per-request but can be cached using the existing `EmbeddingCache` from `agentkit.memory.embedder`.

**Design**: The `SemanticRouter` uses an `OpenAIEmbedder` with `EmbeddingCache` for query embeddings. Skill embeddings are stored in `SkillEmbeddingIndex` and only re-computed on skill registration/update.

---

## 7. Edge Cases

| Edge Case | Behavior |
|-----------|----------|
| No skills registered | `SkillEmbeddingIndex` is empty, `route()` returns low confidence |
| Embedder API fails | Catch exception, return low confidence, fall through to existing flow |
| Skill has no description | Use `skill.name` as fallback source text |
| Chinese query, English skill description | `bge-m3` handles cross-lingual matching |
| Multiple skills with similar embeddings | Return top match; if top_k > 1, could return alternatives (deferred) |
| Semantic router disabled (None) | Existing flow unchanged, zero overhead |

---

## 8. Test Strategy

1. **test_semantic_high_confidence**: Query matches skill with sim > 0.85 → SKILL_REACT returned
2. **test_semantic_medium_confidence**: Query matches skill with sim 0.6-0.85 → skill_hint passed
3. **test_semantic_low_confidence**: Query has sim < 0.6 → normal routing proceeds
4. **test_semantic_router_disabled**: No semantic_router → existing flow unchanged
5. **test_embedder_failure**: Embedder throws error → falls through gracefully
6. **test_skill_registration_updates_index**: New skill added → embedding computed
7. **test_chinese_query**: Chinese query matches Chinese skill description

---

## 9. Argumentation Summary

| Design Choice | Alternatives Considered | Why This Choice |
|--------------|------------------------|----------------|
| Layer 1.5 for both medium AND high | Only medium | High complexity benefits even more from zero-cost skill match |
| Pre-computed skill embeddings | Compute per query | O(n) embedding per query is ~100ms × n_skills; pre-compute is O(1) per query |
| bge-m3 default | text-embedding-3-small | Chinese+English mixed text; bge-m3 is SOTA for multilingual |
| Skill hint for medium confidence | Direct match for medium | Medium confidence isn't reliable enough for direct match; hint reduces LLM tokens without risking wrong routing |
| Separate SemanticRouter class | Inline in CostAwareRouter | Separation of concerns; testable independently; can be disabled without touching router |