
U3 Architecture Design: Semantic Router

Status: APPROVED (design follows the existing CostAwareRouter layer pattern)
Date: 2026-06-14
Unit: U3 of P0 Production Hardening Plan

1. Design Goals

Zero LLM cost for confident matches: When semantic similarity > 0.85, skip Layer 2 LLM classification entirely
Reduce LLM tokens for medium matches: When similarity 0.6-0.85, pass skill hint to Layer 2, reducing classification tokens
Chinese-first: Embedding model must handle Chinese+English mixed text well
Pre-computed skill embeddings: Compute at skill registration time, not query time
Graceful degradation: If embedder fails, fall through to existing Layer 1/2 flow

2. Insertion Point Analysis

Current `CostAwareRouter.route()` flow:

Layer 0: Rule-based (zero cost)
  → explicit_skill / greeting / chat_mode / identity → return

Layer 1: Complexity classification
  → low (<0.3) → DIRECT_CHAT → return
  → medium (0.3-0.7) → _classify_merged() or IntentRouter → return
  → high (>0.7) → Layer 2

Layer 2: Capability matching / Auction
  → return

Semantic Router insertion: After Layer 1 complexity classification, at the start of both the medium and the high branch. Low-complexity queries never reach it.

Layer 1: Complexity classification → complexity score

  → low (<0.3) → DIRECT_CHAT → return

  → medium (0.3-0.7):
    ┌─ Layer 1.5: Semantic Router (NEW) ───────────────────┐
    │  embed query → compare with skill embeddings         │
    │  sim > 0.85 → SKILL_REACT with matched skill         │
    │  sim 0.6-0.85 → pass skill_hint to _classify_merged  │
    │  sim < 0.6 → proceed to _classify_merged normally    │
    └──────────────────────────────────────────────────────┘

  → high (>0.7):
    ┌─ Layer 1.5: Semantic Router (NEW) ───────────────────┐
    │  sim > 0.85 → SKILL_REACT with matched skill         │
    │  sim 0.6-0.85 → pass skill_hint to Layer 2           │
    │  sim < 0.6 → proceed to Layer 2 normally             │
    └──────────────────────────────────────────────────────┘

Why both medium AND high complexity? The plan says "when Layer 1 returns medium complexity (0.3-0.7), try semantic routing first." But semantic routing is also valuable for high complexity — if we can confidently match a skill at zero cost, we should. The cost saving is even greater for high complexity (which would use more expensive Layer 2 LLM calls).
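
The branching above can be summarized as a small pure function. This is only a sketch of the decision table: `Action`, `plan_route`, and the treatment of a missing similarity (`None`, for example after an embedder failure) are illustrative names and assumptions, not part of the existing `CostAwareRouter`.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    """Hypothetical labels for the next routing step."""
    DIRECT_CHAT = "direct_chat"
    SKILL_REACT = "skill_react"
    CLASSIFY_MERGED = "classify_merged"
    LAYER2 = "layer2"


def plan_route(
    complexity: float,
    similarity: Optional[float],
    high: float = 0.85,
    low: float = 0.6,
) -> tuple[Action, bool]:
    """Return (next action, whether a skill hint is attached)."""
    # Low complexity never reaches the semantic router.
    if complexity < 0.3:
        return Action.DIRECT_CHAT, False
    # Confident match: skip LLM classification entirely.
    if similarity is not None and similarity > high:
        return Action.SKILL_REACT, False
    # Otherwise continue with the existing branch for this complexity band.
    fallback = Action.CLASSIFY_MERGED if complexity <= 0.7 else Action.LAYER2
    # A medium match only attaches a hint; it does not change the branch.
    with_hint = similarity is not None and similarity >= low
    return fallback, with_hint
```

Keeping the decision separate from the I/O makes it straightforward to check every band boundary without an embedder.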

3. Component Design

3.1 SkillEmbeddingIndex

class SkillEmbeddingIndex:
    """Pre-computed embedding index for registered skills."""

    def __init__(self, embedder: Embedder):
        self._embedder = embedder
        self._index: dict[str, tuple[list[float], str]] = {}  # skill_name → (embedding, source_text)

    async def build(self, skill_registry) -> None:
        """Build index from all registered skills."""
        ...

    async def update_skill(self, skill_name: str, skill) -> None:
        """Re-embed a single skill (on registration/update)."""
        ...

    def remove_skill(self, skill_name: str) -> None:
        """Remove a skill from the index."""
        ...

    async def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Search for skills matching the query. Returns [(skill_name, similarity)]."""
        ...
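
One way the elided method bodies could look is sketched below, using a plain in-memory dictionary and cosine similarity. The `Embedder` protocol (a single async `embed` method), the `InMemorySkillIndex` name, and the `ToyEmbedder` are assumptions for illustration. `update_skill` takes the source text directly here to keep the example self-contained.

```python
import asyncio
import math
from typing import Protocol


class Embedder(Protocol):
    # Assumed interface; the real agentkit Embedder may differ.
    async def embed(self, text: str) -> list[float]: ...


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemorySkillIndex:
    def __init__(self, embedder: Embedder) -> None:
        self._embedder = embedder
        # skill_name -> (embedding, source_text)
        self._index: dict[str, tuple[list[float], str]] = {}

    async def update_skill(self, skill_name: str, source_text: str) -> None:
        vector = await self._embedder.embed(source_text)
        self._index[skill_name] = (vector, source_text)

    def remove_skill(self, skill_name: str) -> None:
        self._index.pop(skill_name, None)

    async def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        if not self._index:
            return []  # no skills registered: nothing to embed or compare
        q = await self._embedder.embed(query)
        scored = [(name, cosine(q, vec)) for name, (vec, _) in self._index.items()]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


class ToyEmbedder:
    """Stand-in embedder for the demo: counts a few fixed keywords."""
    VOCAB = ("weather", "invoice", "translate")

    async def embed(self, text: str) -> list[float]:
        lowered = text.lower()
        return [float(lowered.count(word)) for word in self.VOCAB]


async def demo() -> list[tuple[str, float]]:
    index = InMemorySkillIndex(ToyEmbedder())
    await index.update_skill("weather_skill", "weather forecast weather")
    await index.update_skill("billing_skill", "invoice lookup")
    return await index.search("what is the weather today", top_k=1)


results = asyncio.run(demo())
```

A linear scan is likely sufficient for a small number of registered skills. A vector library would only become relevant if the registry grows large.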

3.2 SemanticRouter

class SemanticRouter:
    """Embedding-based semantic routing as Layer 1.5."""

    def __init__(
        self,
        embedder: Embedder,
        similarity_high: float = 0.85,
        similarity_low: float = 0.6,
    ):
        self._index = SkillEmbeddingIndex(embedder)
        self._similarity_high = similarity_high
        self._similarity_low = similarity_low
        self._enabled = True

    async def route(self, query: str) -> SemanticRouteResult:
        """Route a query using semantic similarity.

        Returns:
            SemanticRouteResult with:
            - confidence: "high" | "medium" | "low"
            - skill_name: matched skill name (None if low confidence)
            - similarity: cosine similarity score
        """
        ...

@dataclass
class SemanticRouteResult:
    confidence: str  # "high" | "medium" | "low"
    skill_name: str | None
    similarity: float
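
The mapping from a similarity score to a confidence band might be implemented roughly as follows. `classify_match` is a hypothetical helper, and placing the exact boundary values 0.6 and 0.85 in the medium band is an assumption derived from the "> 0.85" and "< 0.6" wording used in this document.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SemanticRouteResult:
    confidence: str  # "high" | "medium" | "low"
    skill_name: Optional[str]
    similarity: float


def classify_match(
    matches: list[tuple[str, float]],
    similarity_high: float = 0.85,
    similarity_low: float = 0.6,
) -> SemanticRouteResult:
    """Turn ranked (skill_name, similarity) pairs into a route result."""
    if not matches:
        # Empty index or no candidates: treat as low confidence.
        return SemanticRouteResult("low", None, 0.0)
    skill_name, similarity = matches[0]  # matches are sorted best-first
    if similarity > similarity_high:
        return SemanticRouteResult("high", skill_name, similarity)
    if similarity >= similarity_low:
        return SemanticRouteResult("medium", skill_name, similarity)
    # Below the low threshold the skill name is dropped, as documented.
    return SemanticRouteResult("low", None, similarity)
```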

4. Skill Embedding Source Text

Design Decision: What text to embed for each skill?

source_text = f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}"

Why this combination?

description: Captures the semantic meaning of what the skill does
intent.keywords: Captures the trigger phrases users might use
capability tags: Captures the functional categories

Chinese consideration: Skill descriptions and keywords are often in Chinese. The embedding model must handle this well. bge-m3 is the default for this reason.
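
A minimal sketch of building the source text, including the `skill.name` fallback for a missing description. The `Skill`, `Intent`, and `Capability` dataclasses are stand-ins for the real skill model, and skipping empty segments (instead of emitting ` |  | `) is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    tag: str


@dataclass
class Intent:
    keywords: list[str] = field(default_factory=list)


@dataclass
class Skill:
    name: str
    description: str = ""
    intent: Intent = field(default_factory=Intent)
    capabilities: list[Capability] = field(default_factory=list)


def build_source_text(skill: Skill) -> str:
    """Combine description, intent keywords, and capability tags."""
    # Fall back to the skill name when there is no description.
    description = skill.description.strip() or skill.name
    parts = [
        description,
        " ".join(skill.intent.keywords),
        " ".join(cap.tag for cap in skill.capabilities),
    ]
    # Assumption: drop empty segments so the embedder sees no stray separators.
    return " | ".join(part for part in parts if part)
```

For example, a skill with description `查询天气预报`, keywords `天气` and `forecast`, and tags `weather` and `geo` yields `查询天气预报 | 天气 forecast | weather geo`.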

5. Integration into CostAwareRouter

5.1 Constructor Change

class CostAwareRouter:
    def __init__(self, ..., semantic_router: SemanticRouter | None = None):
        self._semantic_router = semantic_router
        ...

5.2 Route Method Modification

The key change is in route(), after Layer 1 complexity classification:

# After complexity is determined (medium or high)
skill_hint: str | None = None  # stays None unless a medium-confidence match is found
if self._semantic_router is not None and complexity >= 0.3:
    try:
        semantic_result = await self._semantic_router.route(clean_content)
        if semantic_result.confidence == "high":
            # Direct skill match — skip Layer 2
            result = await resolve_skill_routing(
                content=content,
                skill_registry=skill_registry,
                intent_router=intent_router,
                ...,
                force_skill=semantic_result.skill_name,  # NEW parameter
            )
            result.match_method = "semantic_high"
            result.match_confidence = semantic_result.similarity
            result.execution_mode = ExecutionMode.SKILL_REACT
            return result
        elif semantic_result.confidence == "medium":
            # Pass skill hint to _classify_merged() (medium complexity) or Layer 2 (high complexity)
            skill_hint = semantic_result.skill_name
    except Exception as e:
        logger.warning(f"Semantic routing failed, falling through: {e}")

5.3 Skill Hint Propagation

For medium-confidence matches, the skill hint is passed to _classify_merged() or _route_layer2() via a new skill_hint parameter. The hint is intended to reduce the tokens the LLM spends on classification by giving it a strong prior signal.

Implementation: Add skill_hint: str | None = None parameter to _classify_merged() and _route_layer2(). When provided, include it in the LLM prompt: "Based on semantic analysis, the query may relate to skill '{skill_hint}'. Please confirm or override."
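
The prompt change could be as small as the following sketch. `build_classification_prompt` and `base_prompt` are hypothetical names, and appending the hint after the existing prompt text is an assumption. The hint sentence itself is the one quoted above.

```python
from typing import Optional

HINT_TEMPLATE = (
    "Based on semantic analysis, the query may relate to skill "
    "'{skill_hint}'. Please confirm or override."
)


def build_classification_prompt(
    base_prompt: str, skill_hint: Optional[str] = None
) -> str:
    """Append the semantic hint to the prompt only when one is available."""
    if not skill_hint:
        # No hint: the prompt is byte-for-byte what it was before this change.
        return base_prompt
    return f"{base_prompt}\n\n{HINT_TEMPLATE.format(skill_hint=skill_hint)}"
```

Returning the prompt unchanged when `skill_hint` is `None` keeps the existing flow identical when the semantic router is disabled or finds no match.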

6. Embedding Caching

Skill embeddings are pre-computed and cached in SkillEmbeddingIndex. Query embeddings are computed per-request but can be cached using the existing EmbeddingCache from agentkit.memory.embedder.

Design: The SemanticRouter uses an OpenAIEmbedder with EmbeddingCache for query embeddings. Skill embeddings are stored in SkillEmbeddingIndex and only re-computed on skill registration/update.
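
To illustrate the query-side caching idea, here is a small LRU wrapper around an embedder. It does not use or reproduce the API of the existing `EmbeddingCache`. `CachingEmbedder`, `CountingEmbedder`, and the async `embed` method are assumptions made so the example is self-contained.

```python
import asyncio
from collections import OrderedDict


class CachingEmbedder:
    """Hypothetical LRU cache in front of any object with an async embed()."""

    def __init__(self, inner, max_entries: int = 1024) -> None:
        self._inner = inner
        self._max_entries = max_entries
        self._cache: OrderedDict[str, list[float]] = OrderedDict()

    async def embed(self, text: str) -> list[float]:
        if text in self._cache:
            self._cache.move_to_end(text)  # mark as most recently used
            return self._cache[text]
        vector = await self._inner.embed(text)
        self._cache[text] = vector
        if len(self._cache) > self._max_entries:
            self._cache.popitem(last=False)  # evict least recently used
        return vector


class CountingEmbedder:
    """Demo backend that records how often it is actually called."""

    def __init__(self) -> None:
        self.calls = 0

    async def embed(self, text: str) -> list[float]:
        self.calls += 1
        return [float(len(text))]


async def demo() -> int:
    inner = CountingEmbedder()
    cached = CachingEmbedder(inner, max_entries=2)
    await cached.embed("查天气")
    await cached.embed("查天气")  # served from cache, no backend call
    return inner.calls


backend_calls = asyncio.run(demo())
```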

7. Edge Cases

| Edge Case | Behavior |
| --- | --- |
| No skills registered | `SkillEmbeddingIndex` is empty, `route()` returns low confidence |
| Embedder API fails | Catch exception, return low confidence, fall through to existing flow |
| Skill has no description | Use `skill.name` as fallback source text |
| Chinese query, English skill description | `bge-m3` handles cross-lingual matching |
| Multiple skills with similar embeddings | Return top match; if top_k > 1, could return alternatives (deferred) |
| Semantic router disabled (None) | Existing flow unchanged, zero overhead |

8. Test Strategy

test_semantic_high_confidence: Query matches skill with sim > 0.85 → SKILL_REACT returned
test_semantic_medium_confidence: Query matches skill with sim 0.6-0.85 → skill_hint passed
test_semantic_low_confidence: Query has sim < 0.6 → normal routing proceeds
test_semantic_router_disabled: No semantic_router → existing flow unchanged
test_embedder_failure: Embedder throws error → falls through gracefully
test_skill_registration_updates_index: New skill added → embedding computed
test_chinese_query: Chinese query matches Chinese skill description
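
Two of these cases, sketched without the real router so they run standalone. `route_or_low` is a hypothetical wrapper that mirrors the try/except in section 5.2, and `FailingRouter` simulates an embedder outage. The real tests would exercise `CostAwareRouter.route()` instead.

```python
import asyncio
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger(__name__)


@dataclass
class SemanticRouteResult:
    confidence: str  # "high" | "medium" | "low"
    skill_name: Optional[str]
    similarity: float


class FailingRouter:
    """Test double whose route() always raises, like a dead embedder API."""

    async def route(self, query: str) -> SemanticRouteResult:
        raise ConnectionError("embedder unavailable")


async def route_or_low(router, query: str) -> SemanticRouteResult:
    """Return the router's result, or low confidence if disabled or failing."""
    if router is None:
        return SemanticRouteResult("low", None, 0.0)
    try:
        return await router.route(query)
    except Exception as exc:
        logger.warning("Semantic routing failed, falling through: %s", exc)
        return SemanticRouteResult("low", None, 0.0)


def test_embedder_failure() -> None:
    result = asyncio.run(route_or_low(FailingRouter(), "查一下天气"))
    assert result.confidence == "low"
    assert result.skill_name is None


def test_semantic_router_disabled() -> None:
    result = asyncio.run(route_or_low(None, "hello"))
    assert result == SemanticRouteResult("low", None, 0.0)
```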

9. Argumentation Summary

| Design Choice | Alternatives Considered | Why This Choice |
| --- | --- | --- |
| Layer 1.5 for both medium AND high | Only medium | High complexity benefits even more from zero-cost skill match |
| Pre-computed skill embeddings | Compute per query | Embedding every skill per query costs roughly 100 ms × n_skills; with pre-computation each query needs a single embedding call plus in-memory similarity comparisons |
| bge-m3 default | text-embedding-3-small | Chinese+English mixed text; bge-m3 is designed for multilingual retrieval |
| Skill hint for medium confidence | Direct match for medium | Medium confidence isn't reliable enough for direct match; hint reduces LLM tokens without risking wrong routing |
| Separate SemanticRouter class | Inline in CostAwareRouter | Separation of concerns; testable independently; can be disabled without touching router |
