
U3 Architecture Design: Semantic Router

Status: APPROVED (design follows the existing CostAwareRouter layer pattern)
Date: 2026-06-14
Unit: U3 of P0 Production Hardening Plan


1. Design Goals

  1. Zero LLM cost for confident matches: When semantic similarity > 0.85, skip LLM classification (_classify_merged() or Layer 2) entirely
  2. Reduce LLM tokens for medium matches: When similarity is 0.6-0.85, pass a skill hint to the LLM classifier (_classify_merged() or Layer 2), reducing classification tokens
  3. Chinese-first: The embedding model must handle mixed Chinese and English text well
  4. Pre-computed skill embeddings: Compute them at skill registration time, not at query time
  5. Graceful degradation: If the embedder fails, fall through to the existing Layer 1/2 flow

2. Insertion Point Analysis

Current CostAwareRouter.route() flow:

Layer 0: Rule-based (zero cost)
  → explicit_skill / greeting / chat_mode / identity → return

Layer 1: Complexity classification
  → low (<0.3) → DIRECT_CHAT → return
  → medium (0.3-0.7) → _classify_merged() or IntentRouter → return
  → high (>0.7) → Layer 2

Layer 2: Capability matching / Auction
  → return

Semantic Router insertion: After Layer 1 complexity classification, once low-complexity queries have exited, and before the medium/high branch handling

Layer 1: Complexity classification → complexity score

  → low (<0.3) → DIRECT_CHAT → return

  → medium (0.3-0.7):
    ┌─ Layer 1.5: Semantic Router (NEW) ───────────────────┐
    │  embed query → compare with skill embeddings         │
    │  sim > 0.85 → SKILL_REACT with matched skill         │
    │  sim 0.6-0.85 → pass skill_hint to _classify_merged  │
    │  sim < 0.6 → proceed to _classify_merged normally    │
    └──────────────────────────────────────────────────────┘

  → high (>0.7):
    ┌─ Layer 1.5: Semantic Router (NEW) ───────────────────┐
    │  sim > 0.85 → SKILL_REACT with matched skill         │
    │  sim 0.6-0.85 → pass skill_hint to Layer 2           │
    │  sim < 0.6 → proceed to Layer 2 normally             │
    └──────────────────────────────────────────────────────┘

Why both medium AND high complexity? The plan says "when Layer 1 returns medium complexity (0.3-0.7), try semantic routing first." But semantic routing is also valuable for high complexity: if we can confidently match a skill at zero cost, we should. The saving is even greater for high complexity, because those queries would otherwise go to the more expensive Layer 2 LLM calls.
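The branching in the diagram can be summarised as a small pure function. This is a minimal sketch, assuming a strict > for the 0.85 threshold and an inclusive >= for 0.6 (the plan does not say which side the boundaries fall on). `similarity_band` and `layer_1_5_action` are hypothetical helper names, and the returned strings are illustrative labels, not real ExecutionMode values.

```python
from typing import Literal

Confidence = Literal["high", "medium", "low"]

# Thresholds from the design goals. Boundary handling (which band exactly
# 0.85 and 0.6 fall into) is an assumption of this sketch.
SIMILARITY_HIGH = 0.85
SIMILARITY_LOW = 0.6


def similarity_band(similarity: float,
                    high: float = SIMILARITY_HIGH,
                    low: float = SIMILARITY_LOW) -> Confidence:
    """Map a cosine similarity to the router's confidence band."""
    if similarity > high:
        return "high"
    if similarity >= low:
        return "medium"
    return "low"


def layer_1_5_action(complexity: float, similarity: float) -> str:
    """Return an illustrative label for what happens after Layer 1."""
    if complexity < 0.3:
        return "DIRECT_CHAT"  # low complexity: the semantic router never runs
    # medium is 0.3-0.7 inclusive, high is > 0.7
    next_step = "_classify_merged" if complexity <= 0.7 else "Layer 2"
    band = similarity_band(similarity)
    if band == "high":
        return "SKILL_REACT"  # zero-cost direct skill match
    if band == "medium":
        return f"{next_step} + skill_hint"  # LLM still decides, with a hint
    return next_step  # normal flow, no hint
```

For example, `layer_1_5_action(0.5, 0.72)` yields `"_classify_merged + skill_hint"`, while the same similarity at complexity 0.8 yields `"Layer 2 + skill_hint"`.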


3. Component Design

3.1 SkillEmbeddingIndex

class SkillEmbeddingIndex:
    """Pre-computed embedding index for registered skills."""

    def __init__(self, embedder: Embedder):
        self._embedder = embedder
        self._index: dict[str, tuple[list[float], str]] = {}  # skill_name → (embedding, source_text)

    async def build(self, skill_registry) -> None:
        """Build index from all registered skills."""
        ...

    async def update_skill(self, skill_name: str, skill) -> None:
        """Re-embed a single skill (on registration/update)."""
        ...

    def remove_skill(self, skill_name: str) -> None:
        """Remove a skill from the index."""
        ...

    async def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Search for skills matching the query. Returns [(skill_name, similarity)]."""
        ...

3.2 SemanticRouter

from dataclasses import dataclass
from typing import Literal


# Defined before SemanticRouter because route() uses it in its annotation.
@dataclass
class SemanticRouteResult:
    confidence: Literal["high", "medium", "low"]
    skill_name: str | None  # None when confidence is "low"
    similarity: float       # cosine similarity of the best match


class SemanticRouter:
    """Embedding-based semantic routing as Layer 1.5."""

    def __init__(
        self,
        embedder: Embedder,
        similarity_high: float = 0.85,
        similarity_low: float = 0.6,
    ):
        self._index = SkillEmbeddingIndex(embedder)
        self._similarity_high = similarity_high
        self._similarity_low = similarity_low
        self._enabled = True

    async def route(self, query: str) -> SemanticRouteResult:
        """Route a query using semantic similarity.

        Returns:
            SemanticRouteResult with:
            - confidence: "high" | "medium" | "low"
            - skill_name: matched skill name (None if low confidence)
            - similarity: cosine similarity score
        """
        ...

4. Skill Embedding Source Text

Design Decision: What text to embed for each skill?

source_text = f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}"

Why this combination?

  • description: Captures the semantic meaning of what the skill does
  • intent.keywords: Captures the trigger phrases users might use
  • capability tags: Captures the functional categories

Chinese consideration: Skill descriptions and keywords are often in Chinese. The embedding model must handle this well. bge-m3 is the default for this reason.


5. Integration into CostAwareRouter

5.1 Constructor Change

class CostAwareRouter:
    def __init__(self, ..., semantic_router: SemanticRouter | None = None):
        self._semantic_router = semantic_router
        ...

5.2 Route Method Modification

The key change is in route(), after Layer 1 complexity classification:

# After Layer 1 has produced `complexity` (medium or high)
skill_hint: str | None = None  # stays None unless the match is medium confidence
if self._semantic_router is not None and complexity >= 0.3:
    try:
        semantic_result = await self._semantic_router.route(clean_content)
        if semantic_result.confidence == "high":
            # Direct skill match: skip _classify_merged() and Layer 2
            result = await resolve_skill_routing(
                content=content,
                skill_registry=skill_registry,
                intent_router=intent_router,
                # ... other existing arguments unchanged ...
                force_skill=semantic_result.skill_name,  # NEW parameter
            )
            result.match_method = "semantic_high"
            result.match_confidence = semantic_result.similarity
            result.execution_mode = ExecutionMode.SKILL_REACT
            return result
        elif semantic_result.confidence == "medium":
            # Pass the hint to _classify_merged() (medium) or Layer 2 (high)
            skill_hint = semantic_result.skill_name
    except Exception as e:
        logger.warning("Semantic routing failed, falling through: %s", e)

5.3 Skill Hint Propagation

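For medium-confidence matches, the skill hint is passed to _classify_merged() (medium complexity) or _route_layer2() (high complexity) via a new skill_hint parameter. The hint gives the LLM classifier a strong signal about the likely skill, which is intended to reduce the tokens spent on classification.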

Implementation: Add skill_hint: str | None = None parameter to _classify_merged() and _route_layer2(). When provided, include it in the LLM prompt: "Based on semantic analysis, the query may relate to skill '{skill_hint}'. Please confirm or override."


6. Embedding Caching

Skill embeddings are pre-computed and cached in SkillEmbeddingIndex. Query embeddings are computed per-request but can be cached using the existing EmbeddingCache from agentkit.memory.embedder.

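Design: The SemanticRouter uses an OpenAIEmbedder with EmbeddingCache for query embeddings. Skill embeddings are stored in SkillEmbeddingIndex and only re-computed on skill registration/update. Query and skill embeddings must come from the same embedding model: vectors from different models live in different spaces, so cosine scores between them are meaningless.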
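The EmbeddingCache API is not described in this plan, so the following is a generic, hypothetical LRU cache in front of an async embed function, not a usage example of EmbeddingCache. It keys on the stripped query text (an assumption about normalisation) and evicts the least recently used entry once `max_entries` is exceeded. It does not deduplicate concurrent misses for the same text.

```python
from collections import OrderedDict
from typing import Awaitable, Callable

EmbedFn = Callable[[str], Awaitable[list[float]]]


class CachedQueryEmbedder:
    """Illustrative LRU cache for query embeddings (hypothetical class)."""

    def __init__(self, embed: EmbedFn, max_entries: int = 1024):
        self._embed = embed
        self._max = max_entries
        self._cache: OrderedDict[str, list[float]] = OrderedDict()
        self.misses = 0  # number of real embedding calls made

    async def embed(self, text: str) -> list[float]:
        key = text.strip()  # light normalisation; real keying may differ
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        self.misses += 1
        vector = await self._embed(key)
        self._cache[key] = vector
        if len(self._cache) > self._max:
            self._cache.popitem(last=False)  # evict least recently used
        return vector
```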


7. Edge Cases

| Edge Case | Behavior |
|---|---|
| No skills registered | SkillEmbeddingIndex is empty; route() returns low confidence |
| Embedder API fails | Catch the exception, return low confidence, fall through to the existing flow |
| Skill has no description | Use skill.name as fallback source text |
| Chinese query, English skill description | bge-m3 handles cross-lingual matching |
| Multiple skills with similar embeddings | Return the top match; returning alternatives when top_k > 1 is deferred |
| Semantic router disabled (None) | Existing flow unchanged, zero overhead |

8. Test Strategy

  1. test_semantic_high_confidence: Query matches skill with sim > 0.85 → SKILL_REACT returned
  2. test_semantic_medium_confidence: Query matches skill with sim 0.6-0.85 → skill_hint passed
  3. test_semantic_low_confidence: Query has sim < 0.6 → normal routing proceeds
  4. test_semantic_router_disabled: No semantic_router → existing flow unchanged
  5. test_embedder_failure: Embedder throws error → falls through gracefully
  6. test_skill_registration_updates_index: New skill added → embedding computed
  7. test_chinese_query: Chinese query matches Chinese skill description
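For these tests, similarity scores are easiest to control with a fake embedder that returns hand-picked vectors. The sketch relies on a simple identity: the unit vector [s, sqrt(1 - s^2)] has cosine similarity s with [1, 0], so each fixture query can be placed in a chosen band. `FakeEmbedder`, `vector_with_similarity` and the fixture keys are hypothetical test helpers; the `fail` flag is one way to drive test_embedder_failure, and the `calls` counter can confirm that registering a skill triggered an embedding (test_skill_registration_updates_index).

```python
import math


def vector_with_similarity(s: float) -> list[float]:
    """Unit 2-D vector whose cosine similarity to [1.0, 0.0] is s."""
    return [s, math.sqrt(1.0 - s * s)]


class FakeEmbedder:
    """Deterministic stand-in for the real embedder (hypothetical)."""

    def __init__(self, table: dict[str, list[float]], fail: bool = False):
        self._table = table
        self._fail = fail
        self.calls = 0  # lets tests assert that an embedding was computed

    async def embed(self, text: str) -> list[float]:
        self.calls += 1
        if self._fail:  # simulates an embedder/API outage
            raise RuntimeError("embedder unavailable")
        return self._table[text]


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


SKILL_VEC = [1.0, 0.0]
FIXTURES = {
    "weather skill": SKILL_VEC,
    "high query": vector_with_similarity(0.95),    # lands in the "high" band
    "medium query": vector_with_similarity(0.70),  # lands in the "medium" band
    "low query": vector_with_similarity(0.20),     # lands in the "low" band
}
```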

9. Argumentation Summary

| Design Choice | Alternatives Considered | Why This Choice |
|---|---|---|
| Layer 1.5 for both medium AND high complexity | Only medium | High complexity benefits even more from a zero-cost skill match |
| Pre-computed skill embeddings | Compute per query | Embedding every skill per query costs ~100 ms × n_skills; pre-computing leaves one query embedding per request plus cheap O(n_skills) vector comparisons |
| bge-m3 default | text-embedding-3-small | Mixed Chinese+English text; bge-m3 is built for multilingual and cross-lingual retrieval |
| Skill hint for medium confidence | Direct match for medium | Medium confidence isn't reliable enough for a direct match; the hint reduces LLM tokens without risking wrong routing |
| Separate SemanticRouter class | Inline in CostAwareRouter | Separation of concerns; testable independently; can be disabled without touching the router |