
# U3 Architecture Design: Semantic Router

> Status: APPROVED — Design follows existing CostAwareRouter layer pattern
> Date: 2026-06-14
> Unit: U3 of P0 Production Hardening Plan

---

## 1. Design Goals

1. **Zero LLM cost for confident matches**: When semantic similarity > 0.85, skip LLM classification (`_classify_merged()` or Layer 2) entirely
2. **Reduce LLM tokens for medium matches**: When similarity is 0.6-0.85, pass a skill hint to `_classify_merged()` or Layer 2 to reduce classification tokens
3. **Chinese-first**: Embedding model must handle Chinese+English mixed text well
4. **Pre-computed skill embeddings**: Compute at skill registration time, not query time
5. **Graceful degradation**: If embedder fails, fall through to existing Layer 1/2 flow
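Goals 1 and 2 imply a three-way banding of the best similarity score. A minimal sketch of that banding, assuming the medium band is inclusive at both ends (so exactly 0.85 and exactly 0.6 both count as medium); `band_similarity` is a hypothetical helper name:

```python
from typing import Literal

Confidence = Literal["high", "medium", "low"]


def band_similarity(
    similarity: float,
    high: float = 0.85,
    low: float = 0.6,
) -> Confidence:
    """Map a cosine similarity to a routing confidence band (hypothetical helper).

    > high       -> "high"   (route directly, no LLM classification)
    low .. high  -> "medium" (pass a skill hint to the LLM classifier)
    < low        -> "low"    (normal routing)
    """
    if similarity > high:
        return "high"
    if similarity >= low:
        return "medium"
    return "low"


print(band_similarity(0.91), band_similarity(0.7), band_similarity(0.4))
# → high medium low
```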

---

## 2. Insertion Point Analysis

### Current `CostAwareRouter.route()` flow:

```
Layer 0: Rule-based (zero cost)
  → explicit_skill / greeting / chat_mode / identity → return

Layer 1: Complexity classification
  → low (<0.3) → DIRECT_CHAT → return
  → medium (0.3-0.7) → _classify_merged() or IntentRouter → return
  → high (>0.7) → Layer 2

Layer 2: Capability matching / Auction
  → return
```

### Semantic Router insertion: **After Layer 1 complexity classification, inside both the medium and high branches**

```
Layer 1: Complexity classification → complexity score

  → low (<0.3) → DIRECT_CHAT → return

  → medium (0.3-0.7):
    ┌─ Layer 1.5: Semantic Router (NEW) ─────────────────────┐
    │  embed query → compare with skill embeddings           │
    │  sim > 0.85 → SKILL_REACT with matched skill           │
    │  sim 0.6-0.85 → pass skill_hint to _classify_merged()  │
    │  sim < 0.6 → proceed to _classify_merged() normally    │
    └────────────────────────────────────────────────────────┘

  → high (>0.7):
    ┌─ Layer 1.5: Semantic Router (NEW) ─────────────────────┐
    │  embed query → compare with skill embeddings           │
    │  sim > 0.85 → SKILL_REACT with matched skill           │
    │  sim 0.6-0.85 → pass skill_hint to Layer 2             │
    │  sim < 0.6 → proceed to Layer 2 normally               │
    └────────────────────────────────────────────────────────┘
```
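The branching in the diagram can be modelled as a pure function from the Layer 1 complexity score and the semantic band to the next routing step. This is an illustrative sketch only: `next_step` and its step labels are hypothetical strings, not identifiers from the codebase, and `None` stands for "semantic router disabled or failed".

```python
from typing import Literal, Optional

Band = Literal["high", "medium", "low"]


def next_step(complexity: float, semantic_band: Optional[Band]) -> str:
    """Return which routing step handles a query (labels are illustrative)."""
    if complexity < 0.3:
        return "DIRECT_CHAT"  # low complexity never reaches Layer 1.5
    # Layer 1.5 applies to both medium (0.3-0.7) and high (>0.7) complexity
    if semantic_band == "high":
        return "SKILL_REACT"  # confident match: no LLM classification
    hint = "+skill_hint" if semantic_band == "medium" else ""
    if complexity <= 0.7:
        return "_classify_merged" + hint
    return "Layer 2" + hint


for c, band in [(0.1, "high"), (0.5, "high"), (0.5, "medium"), (0.9, "low"), (0.9, None)]:
    print(c, band, "->", next_step(c, band))
```

Note that a high-confidence semantic match short-circuits both branches, which is exactly why the design places Layer 1.5 in the high branch too.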

**Why both medium AND high complexity?** The plan says "when Layer 1 returns medium complexity (0.3-0.7), try semantic routing first." But semantic routing is also valuable for high complexity — if we can confidently match a skill at zero cost, we should. The cost saving is even greater for high complexity (which would use more expensive Layer 2 LLM calls).

---

## 3. Component Design

### 3.1 SkillEmbeddingIndex

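```python
from __future__ import annotations  # keeps the Embedder annotation lazy in this stub

# Embedder is the project's embedder interface; its import is omitted in this sketch.


class SkillEmbeddingIndex: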
    """Pre-computed embedding index for registered skills."""

    def __init__(self, embedder: Embedder):
        self._embedder = embedder
        self._index: dict[str, tuple[list[float], str]] = {}  # skill_name → (embedding, source_text)

    async def build(self, skill_registry) -> None:
        """Build index from all registered skills."""
        ...

    async def update_skill(self, skill_name: str, skill) -> None:
        """Re-embed a single skill (on registration/update)."""
        ...

    def remove_skill(self, skill_name: str) -> None:
        """Remove a skill from the index."""
        ...

    async def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Search for skills matching the query. Returns [(skill_name, similarity)]."""
        ...
```
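A minimal in-memory sketch of the index, assuming an embedder that exposes an async `embed(texts) -> list[list[float]]` method (the project's actual `Embedder` interface may differ). To stay self-contained, `update_skill` here takes the source text directly rather than a skill object, and `build()` from a registry is omitted. `ToyEmbedder` is a hypothetical stand-in that counts characters into a small vector so the sketch runs without a model; it has no semantic quality.

```python
import asyncio
import math
from typing import Protocol


class Embedder(Protocol):
    # Assumed interface; the project's real Embedder may differ.
    async def embed(self, texts: list[str]) -> list[list[float]]: ...


class ToyEmbedder:
    """Hypothetical stand-in: character-count vector, no model needed."""

    async def embed(self, texts: list[str]) -> list[list[float]]:
        vecs = []
        for text in texts:
            v = [0.0] * 64
            for ch in text.lower():
                v[ord(ch) % 64] += 1.0
            vecs.append(v)
        return vecs


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SkillEmbeddingIndex:
    def __init__(self, embedder: Embedder):
        self._embedder = embedder
        self._index: dict[str, tuple[list[float], str]] = {}  # name → (embedding, source_text)

    async def update_skill(self, skill_name: str, source_text: str) -> None:
        # Embed once at registration/update time, not per query
        [vec] = await self._embedder.embed([source_text])
        self._index[skill_name] = (vec, source_text)

    def remove_skill(self, skill_name: str) -> None:
        self._index.pop(skill_name, None)

    async def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        if not self._index:
            return []  # no skills registered
        [q] = await self._embedder.embed([query])  # one embedding call per query
        scored = [(name, cosine(q, vec)) for name, (vec, _) in self._index.items()]
        scored.sort(key=lambda item: item[1], reverse=True)
        return scored[:top_k]


async def main() -> None:
    index = SkillEmbeddingIndex(ToyEmbedder())
    await index.update_skill("weather", "weather forecast temperature rain")
    await index.update_skill("calendar", "schedule meeting calendar event")
    print(await index.search("will it rain tomorrow", top_k=1))


asyncio.run(main())
```

The linear scan is fine for tens or hundreds of skills; a vector index would only matter at much larger skill counts.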

### 3.2 SemanticRouter

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Literal

# Embedder and SkillEmbeddingIndex are the types described above; imports omitted.


@dataclass
class SemanticRouteResult:
    confidence: Literal["high", "medium", "low"]
    skill_name: str | None  # None when confidence is "low"
    similarity: float       # cosine similarity of the best match


class SemanticRouter:
    """Embedding-based semantic routing as Layer 1.5."""

    def __init__(
        self,
        embedder: Embedder,
        similarity_high: float = 0.85,
        similarity_low: float = 0.6,
    ):
        self._index = SkillEmbeddingIndex(embedder)
        self._similarity_high = similarity_high
        self._similarity_low = similarity_low
        self._enabled = True

    async def route(self, query: str) -> SemanticRouteResult:
        """Route a query using semantic similarity.

        Returns:
            SemanticRouteResult with:
            - confidence: "high" | "medium" | "low"
            - skill_name: matched skill name (None if low confidence)
            - similarity: cosine similarity score
        """
        ...
```
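One way `route()` could combine the index search, the two thresholds, and the graceful-degradation goal, sketched as a free function over an injected `search` coroutine so it runs standalone. `semantic_route` and the stub searches are hypothetical; in the design this logic would live in `SemanticRouter.route()` and call `SkillEmbeddingIndex.search()`.

```python
import asyncio
import logging
from dataclasses import dataclass
from typing import Awaitable, Callable, Literal, Optional

logger = logging.getLogger(__name__)

Search = Callable[[str, int], Awaitable[list[tuple[str, float]]]]


@dataclass
class SemanticRouteResult:
    confidence: Literal["high", "medium", "low"]
    skill_name: Optional[str]
    similarity: float


async def semantic_route(
    search: Search, query: str, high: float = 0.85, low: float = 0.6
) -> SemanticRouteResult:
    try:
        matches = await search(query, 1)  # top-1 only; alternatives are deferred
    except Exception as e:  # embedder/API failure: degrade to low confidence
        logger.warning("Semantic search failed: %s", e)
        return SemanticRouteResult("low", None, 0.0)
    if not matches:  # no skills registered
        return SemanticRouteResult("low", None, 0.0)
    name, sim = matches[0]
    if sim > high:
        return SemanticRouteResult("high", name, sim)
    if sim >= low:
        return SemanticRouteResult("medium", name, sim)
    return SemanticRouteResult("low", None, sim)


async def demo() -> None:
    async def fixed(query: str, top_k: int) -> list[tuple[str, float]]:
        return [("weather", 0.72)]

    async def broken(query: str, top_k: int) -> list[tuple[str, float]]:
        raise TimeoutError("embedding API timed out")

    print(await semantic_route(fixed, "will it rain tomorrow"))
    # → SemanticRouteResult(confidence='medium', skill_name='weather', similarity=0.72)
    print(await semantic_route(broken, "will it rain tomorrow"))
    # → SemanticRouteResult(confidence='low', skill_name=None, similarity=0.0)


asyncio.run(demo())
```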

---

## 4. Skill Embedding Source Text

**Design Decision**: What text to embed for each skill?

```python
source_text = f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}"
```

**Why this combination?**
- `description`: Captures the semantic meaning of what the skill does
- `intent.keywords`: Captures the trigger phrases users might use
- `capability tags`: Captures the functional categories

**Chinese consideration**: Skill descriptions and keywords are often in Chinese. The embedding model must handle this well. `bge-m3` is the default for this reason.
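A sketch of the source-text construction that also applies the `skill.name` fallback from the edge-case table in Section 7. As an added assumption, it drops empty segments so a skill without keywords does not embed bare `|` separators. The `Skill`, `Intent`, and `Capability` dataclasses are hypothetical minimal shapes holding only the attributes this section references; the real skill objects carry more.

```python
from dataclasses import dataclass, field


@dataclass
class Intent:
    keywords: list[str] = field(default_factory=list)


@dataclass
class Capability:
    tag: str


@dataclass
class Skill:
    name: str
    description: str = ""
    intent: Intent = field(default_factory=Intent)
    capabilities: list[Capability] = field(default_factory=list)


def build_source_text(skill: Skill) -> str:
    """description | keywords | capability tags, falling back to the skill name."""
    description = skill.description.strip() or skill.name  # fallback for empty description
    keywords = " ".join(skill.intent.keywords)
    tags = " ".join(cap.tag for cap in skill.capabilities)
    # Drop empty segments so the embedding is not diluted by bare separators
    return " | ".join(part for part in (description, keywords, tags) if part)


weather = Skill(
    name="weather",
    description="查询城市天气预报",  # "look up a city's weather forecast"
    intent=Intent(keywords=["天气", "下雨", "weather"]),  # "weather", "rain", "weather"
    capabilities=[Capability(tag="query"), Capability(tag="forecast")],
)
print(build_source_text(weather))  # → 查询城市天气预报 | 天气 下雨 weather | query forecast
print(build_source_text(Skill(name="translate")))  # → translate
```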

---

## 5. Integration into CostAwareRouter

### 5.1 Constructor Change

```python
class CostAwareRouter:
    def __init__(self, ..., semantic_router: SemanticRouter | None = None):
        self._semantic_router = semantic_router
        ...
```

### 5.2 Route Method Modification

The key change is in `route()`, after Layer 1 complexity classification:

```python
# After complexity is determined (medium or high)
skill_hint: str | None = None  # stays None unless a medium-confidence match is found
if self._semantic_router is not None and complexity >= 0.3:
    try:
        # Embed the cleaned query text and compare against skill embeddings
        semantic_result = await self._semantic_router.route(clean_content)
        if semantic_result.confidence == "high":
            # Direct skill match: skip _classify_merged() / Layer 2 entirely
            result = await resolve_skill_routing(
                content=content,
                skill_registry=skill_registry,
                intent_router=intent_router,
                # (remaining existing keyword arguments unchanged)
                force_skill=semantic_result.skill_name,  # NEW parameter
            )
            result.match_method = "semantic_high"
            result.match_confidence = semantic_result.similarity
            result.execution_mode = ExecutionMode.SKILL_REACT
            return result
        elif semantic_result.confidence == "medium":
            # Pass the hint to _classify_merged() (medium) or _route_layer2() (high)
            skill_hint = semantic_result.skill_name
    except Exception as e:
        logger.warning("Semantic routing failed, falling through: %s", e)
```

### 5.3 Skill Hint Propagation

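For medium-confidence matches, the skill hint is passed to `_classify_merged()` (medium complexity) or `_route_layer2()` (high complexity) via a new `skill_hint` parameter. The hint itself adds a sentence to the prompt; the intended saving comes from giving the LLM classifier a strong prior toward one skill, which should reduce classification tokens (Design Goal 2).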

**Implementation**: Add `skill_hint: str | None = None` parameter to `_classify_merged()` and `_route_layer2()`. When provided, include it in the LLM prompt: "Based on semantic analysis, the query may relate to skill '{skill_hint}'. Please confirm or override."
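A sketch of how the hint might be appended to a classification prompt using the wording above. `with_skill_hint` and `HINT_TEMPLATE` are hypothetical names; where the hint actually sits inside `_classify_merged()` and `_route_layer2()` depends on how those methods assemble their prompts.

```python
from typing import Optional

HINT_TEMPLATE = (
    "Based on semantic analysis, the query may relate to skill '{skill_hint}'. "
    "Please confirm or override."
)


def with_skill_hint(prompt: str, skill_hint: Optional[str] = None) -> str:
    """Append the semantic-router hint to an LLM prompt, if there is one."""
    if not skill_hint:
        return prompt  # low confidence or router disabled: prompt unchanged
    return f"{prompt}\n\n{HINT_TEMPLATE.format(skill_hint=skill_hint)}"


base = "Classify the user query into one of the registered skills."
print(with_skill_hint(base, "weather"))
```

Keeping the "confirm or override" phrasing matters: the LLM remains the final arbiter for medium-confidence matches, which is the reason medium matches are hinted rather than routed directly.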

---

## 6. Embedding Caching

Skill embeddings are pre-computed and cached in `SkillEmbeddingIndex`. Query embeddings are computed per-request but can be cached using the existing `EmbeddingCache` from `agentkit.memory.embedder`.

**Design**: The `SemanticRouter` uses an `OpenAIEmbedder` with `EmbeddingCache` for query embeddings. Skill embeddings are stored in `SkillEmbeddingIndex` and only re-computed on skill registration/update.
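The existing `EmbeddingCache` is the intended cache for query embeddings, and its API is not described here. The following only illustrates the idea with a hypothetical standalone LRU wrapper (`CachedQueryEmbedder`) around an async single-text embed function; collapsing whitespace before lookup is an assumption, not documented `EmbeddingCache` behavior.

```python
import asyncio
from collections import OrderedDict
from typing import Awaitable, Callable

EmbedFn = Callable[[str], Awaitable[list[float]]]


class CachedQueryEmbedder:
    """Hypothetical LRU cache for per-request query embeddings."""

    def __init__(self, embed_one: EmbedFn, max_size: int = 1024):
        self._embed_one = embed_one
        self._max_size = max_size
        self._cache: OrderedDict[str, list[float]] = OrderedDict()

    async def embed(self, query: str) -> list[float]:
        key = " ".join(query.split())  # collapse whitespace so trivial variants share an entry
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            return self._cache[key]
        vec = await self._embed_one(key)
        self._cache[key] = vec
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return vec


async def demo() -> None:
    calls = 0

    async def fake_embed(text: str) -> list[float]:
        nonlocal calls
        calls += 1
        return [float(len(text))]

    cache = CachedQueryEmbedder(fake_embed, max_size=2)
    await cache.embed("check the weather")
    await cache.embed("  check   the weather ")  # cache hit after normalization
    print("embedder calls:", calls)  # → embedder calls: 1


asyncio.run(demo())
```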

---

## 7. Edge Cases

| Edge Case | Behavior |
|-----------|----------|
| No skills registered | `SkillEmbeddingIndex` is empty, `route()` returns low confidence |
| Embedder API fails | Catch exception, return low confidence, fall through to existing flow |
| Skill has no description | Use `skill.name` as fallback source text |
| Chinese query, English skill description | `bge-m3` handles cross-lingual matching |
| Multiple skills with similar embeddings | Return top match; if top_k > 1, could return alternatives (deferred) |
| Semantic router disabled (None) | Existing flow unchanged, zero overhead |

---

## 8. Test Strategy

1. **test_semantic_high_confidence**: Query matches skill with sim > 0.85 → SKILL_REACT returned
2. **test_semantic_medium_confidence**: Query matches skill with sim 0.6-0.85 → skill_hint passed
3. **test_semantic_low_confidence**: Query has sim < 0.6 → normal routing proceeds
4. **test_semantic_router_disabled**: No semantic_router → existing flow unchanged
5. **test_embedder_failure**: Embedder throws error → falls through gracefully
6. **test_skill_registration_updates_index**: New skill added → embedding computed
7. **test_chinese_query**: Chinese query matches Chinese skill description
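Tests 3 to 5 only exercise the fall-through contract, so they can be written against a stub router. A sketch using `unittest.mock.AsyncMock`; `route_with_semantic_layer` is a hypothetical, stripped-down stand-in for the Section 5.2 branch that returns a label and the hint instead of a real routing result.

```python
import asyncio
from types import SimpleNamespace
from typing import Optional
from unittest.mock import AsyncMock


async def route_with_semantic_layer(semantic_router, query: str) -> tuple[str, Optional[str]]:
    """Simplified stand-in for the Section 5.2 logic: returns (path, skill_hint)."""
    skill_hint: Optional[str] = None
    if semantic_router is not None:
        try:
            result = await semantic_router.route(query)
            if result.confidence == "high":
                return "semantic_high", result.skill_name
            if result.confidence == "medium":
                skill_hint = result.skill_name
        except Exception:
            pass  # graceful degradation: fall through to the existing flow
    return "existing_flow", skill_hint


def test_semantic_router_disabled() -> None:
    assert asyncio.run(route_with_semantic_layer(None, "hi")) == ("existing_flow", None)


def test_embedder_failure() -> None:
    router = SimpleNamespace(route=AsyncMock(side_effect=RuntimeError("embedding API down")))
    assert asyncio.run(route_with_semantic_layer(router, "hi")) == ("existing_flow", None)
    router.route.assert_awaited_once_with("hi")


def test_semantic_medium_confidence() -> None:
    medium = SimpleNamespace(confidence="medium", skill_name="weather", similarity=0.7)
    router = SimpleNamespace(route=AsyncMock(return_value=medium))
    assert asyncio.run(route_with_semantic_layer(router, "rain?")) == ("existing_flow", "weather")


if __name__ == "__main__":
    test_semantic_router_disabled()
    test_embedder_failure()
    test_semantic_medium_confidence()
    print("ok")
```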

---

## 9. Argumentation Summary

| Design Choice | Alternatives Considered | Why This Choice |
|--------------|------------------------|----------------|
| Layer 1.5 for both medium AND high | Only medium | High complexity benefits even more from zero-cost skill match |
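| Pre-computed skill embeddings | Compute per query | Embedding every skill per query costs O(n) embedding calls (~100ms × n_skills); with pre-computation each query needs one embedding call (for the query) plus n cheap in-memory similarity computations |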
| bge-m3 default | text-embedding-3-small | Chinese+English mixed text; bge-m3 is a strong multilingual embedding model (100+ languages) with good Chinese coverage |
| Skill hint for medium confidence | Direct match for medium | Medium confidence isn't reliable enough for direct match; hint reduces LLM tokens without risking wrong routing |
| Separate SemanticRouter class | Inline in CostAwareRouter | Separation of concerns; testable independently; can be disabled without touching router |