# U3 Architecture Design: Semantic Router

> Status: APPROVED (design follows the existing `CostAwareRouter` layer pattern)
> Date: 2026-06-14
> Unit: U3 of P0 Production Hardening Plan

---

## 1. Design Goals

1. **Zero LLM cost for confident matches**: When semantic similarity > 0.85, skip Layer 2 LLM classification entirely
2. **Reduce LLM tokens for medium matches**: When similarity 0.6-0.85, pass skill hint to Layer 2, reducing classification tokens
3. **Chinese-first**: Embedding model must handle Chinese+English mixed text well
4. **Pre-computed skill embeddings**: Compute at skill registration time, not query time
5. **Graceful degradation**: If embedder fails, fall through to existing Layer 1/2 flow


---

## 2. Insertion Point Analysis

### Current `CostAwareRouter.route()` flow:

```
Layer 0: Rule-based (zero cost)
  → explicit_skill / greeting / chat_mode / identity → return

Layer 1: Complexity classification
  → low (<0.3) → DIRECT_CHAT → return
  → medium (0.3-0.7) → _classify_merged() or IntentRouter → return
  → high (>0.7) → Layer 2

Layer 2: Capability matching / Auction
  → return
```

### Semantic Router insertion: **Between Layer 1 complexity classification and the medium/high branching**

```
Layer 1: Complexity classification → complexity score

  → low (<0.3) → DIRECT_CHAT → return

  → medium (0.3-0.7):
    ┌─ Layer 1.5: Semantic Router (NEW) ─────────────────────┐
    │ embed query → compare with skill embeddings            │
    │ sim > 0.85   → SKILL_REACT with matched skill          │
    │ sim 0.6-0.85 → pass skill_hint to _classify_merged     │
    │ sim < 0.6    → proceed to _classify_merged normally    │
    └────────────────────────────────────────────────────────┘

  → high (>0.7):
    ┌─ Layer 1.5: Semantic Router (NEW) ─────────────────────┐
    │ sim > 0.85   → SKILL_REACT with matched skill          │
    │ sim 0.6-0.85 → pass skill_hint to Layer 2              │
    │ sim < 0.6    → proceed to Layer 2 normally             │
    └────────────────────────────────────────────────────────┘
```

**Why both medium AND high complexity?** The plan says "when Layer 1 returns medium complexity (0.3-0.7), try semantic routing first." But semantic routing is also valuable for high complexity: if we can confidently match a skill at zero cost, we should. The cost saving is even greater for high complexity (which would use more expensive Layer 2 LLM calls).

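The branching in the diagram above can be condensed into one pure function. This is a minimal sketch, not the actual `route()` body: the `Decision` enum and the `decide()` helper are hypothetical names introduced for illustration, and a `similarity` of `None` is assumed to mean that the semantic router is disabled or failed. Boundary values (exactly 0.6 or 0.85) are treated as medium confidence, which matches the `>` and `<` comparisons used in the diagram.

```python
from __future__ import annotations

from enum import Enum


class Decision(Enum):
    """Hypothetical labels for the outcomes shown in the diagram."""

    DIRECT_CHAT = "direct_chat"
    SKILL_REACT = "skill_react"          # confident semantic match, no LLM call
    CLASSIFY_MERGED = "classify_merged"  # existing medium-complexity path
    LAYER_2 = "layer_2"                  # existing high-complexity path


def decide(
    complexity: float,
    similarity: float | None,
    sim_high: float = 0.85,
    sim_low: float = 0.6,
) -> tuple[Decision, bool]:
    """Return (decision, pass_skill_hint) for one query."""
    if complexity < 0.3:
        # Low complexity returns before Layer 1.5 is consulted.
        return Decision.DIRECT_CHAT, False
    fallback = Decision.CLASSIFY_MERGED if complexity <= 0.7 else Decision.LAYER_2
    if similarity is None or similarity < sim_low:
        return fallback, False
    if similarity > sim_high:
        return Decision.SKILL_REACT, False
    # sim_low <= similarity <= sim_high: keep the LLM step, but add a hint.
    return fallback, True
```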

---

## 3. Component Design

### 3.1 SkillEmbeddingIndex

```python
class SkillEmbeddingIndex:
    """Pre-computed embedding index for registered skills."""

    def __init__(self, embedder: Embedder):
        self._embedder = embedder
        self._index: dict[str, tuple[list[float], str]] = {}  # skill_name → (embedding, source_text)

    async def build(self, skill_registry) -> None:
        """Build index from all registered skills."""
        ...

    async def update_skill(self, skill_name: str, skill) -> None:
        """Re-embed a single skill (on registration/update)."""
        ...

    def remove_skill(self, skill_name: str) -> None:
        """Remove a skill from the index."""
        ...

    async def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        """Search for skills matching the query. Returns [(skill_name, similarity)]."""
        ...
```

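To make the interface above concrete, here is a minimal in-memory sketch of how the index could store vectors and rank them by cosine similarity. Several parts are assumptions made for illustration: `ToyEmbedder` is a stand-in that counts a tiny fixed vocabulary (a real deployment would call the embedding model), the embedder is assumed to expose an async `embed(text)` method, and `update_skill()` takes the source text directly rather than a skill object. A linear scan is used because the number of registered skills is expected to be small.

```python
from __future__ import annotations

import asyncio
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class ToyEmbedder:
    """Stand-in embedder (assumption): counts a fixed vocabulary of words."""

    VOCAB = ("weather", "forecast", "translate", "text")

    async def embed(self, text: str) -> list[float]:
        words = text.lower().split()
        return [float(words.count(w)) for w in self.VOCAB]


class InMemorySkillIndex:
    """Hypothetical simplification of SkillEmbeddingIndex."""

    def __init__(self, embedder: ToyEmbedder) -> None:
        self._embedder = embedder
        self._index: dict[str, tuple[list[float], str]] = {}

    async def update_skill(self, skill_name: str, source_text: str) -> None:
        # Embedding happens here, at registration time, not per query.
        vector = await self._embedder.embed(source_text)
        self._index[skill_name] = (vector, source_text)

    def remove_skill(self, skill_name: str) -> None:
        self._index.pop(skill_name, None)

    async def search(self, query: str, top_k: int = 5) -> list[tuple[str, float]]:
        if not self._index:
            return []  # nothing registered: skip the embedding call entirely
        q = await self._embedder.embed(query)
        scored = [(name, cosine(q, vec)) for name, (vec, _) in self._index.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]


async def demo() -> list[tuple[str, float]]:
    index = InMemorySkillIndex(ToyEmbedder())
    await index.update_skill("weather", "weather forecast")
    await index.update_skill("translator", "translate text")
    return await index.search("weather forecast for tomorrow", top_k=1)
```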
### 3.2 SemanticRouter

```python
from __future__ import annotations

from dataclasses import dataclass


# Defined before SemanticRouter so the return annotation of route() resolves.
@dataclass
class SemanticRouteResult:
    confidence: str  # "high" | "medium" | "low"
    skill_name: str | None
    similarity: float


class SemanticRouter:
    """Embedding-based semantic routing as Layer 1.5."""

    def __init__(
        self,
        embedder: Embedder,
        similarity_high: float = 0.85,
        similarity_low: float = 0.6,
    ):
        self._index = SkillEmbeddingIndex(embedder)
        self._similarity_high = similarity_high
        self._similarity_low = similarity_low
        self._enabled = True

    async def route(self, query: str) -> SemanticRouteResult:
        """Route a query using semantic similarity.

        Returns:
            SemanticRouteResult with:
            - confidence: "high" | "medium" | "low"
            - skill_name: matched skill name (None if low confidence)
            - similarity: cosine similarity score
        """
        ...
```

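One possible body for `route()` is sketched below as a free function, so that it runs without the rest of the codebase. It is an illustration under stated assumptions, not the final implementation: `semantic_route()`, `good_search()`, and `failing_search()` are hypothetical names, and the `search` argument stands in for `SkillEmbeddingIndex.search`. The sketch also shows the graceful-degradation rule: an empty index or any exception from the embedder is reported as low confidence, so the caller continues with the existing flow.

```python
from __future__ import annotations

import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class SemanticRouteResult:
    confidence: str  # "high" | "medium" | "low"
    skill_name: str | None
    similarity: float


# Stand-in for SkillEmbeddingIndex.search (assumption).
SearchFn = Callable[[str], Awaitable[list[tuple[str, float]]]]


async def semantic_route(
    query: str, search: SearchFn, high: float = 0.85, low: float = 0.6
) -> SemanticRouteResult:
    try:
        hits = await search(query)
    except Exception:
        # Embedder or index failure: report "low" so the caller falls through.
        return SemanticRouteResult("low", None, 0.0)
    if not hits:
        return SemanticRouteResult("low", None, 0.0)  # no skills registered
    name, sim = max(hits, key=lambda hit: hit[1])
    if sim > high:
        return SemanticRouteResult("high", name, sim)
    if sim >= low:
        return SemanticRouteResult("medium", name, sim)
    return SemanticRouteResult("low", None, sim)


async def good_search(query: str) -> list[tuple[str, float]]:
    return [("weather", 0.91), ("translator", 0.40)]


async def failing_search(query: str) -> list[tuple[str, float]]:
    raise RuntimeError("embedder unavailable")
```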

---

## 4. Skill Embedding Source Text

**Design Decision**: What text to embed for each skill?

```python
source_text = f"{skill.description} | {' '.join(skill.intent.keywords)} | {' '.join(cap.tag for cap in skill.capabilities)}"
```

**Why this combination?**
- `description`: Captures the semantic meaning of what the skill does
- `intent.keywords`: Captures the trigger phrases users might use
- `capability tags`: Captures the functional categories

**Chinese consideration**: Skill descriptions and keywords are often in Chinese. The embedding model must handle this well. `bge-m3` is the default for this reason.

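The source-text rule can be wrapped in a small helper that also covers a skill with no description. The sketch below is illustrative only: the `Skill`, `Intent`, and `Capability` dataclasses are stand-ins that mirror just the attributes used in the f-string above, and `build_source_text()` is a hypothetical name. Unlike the one-line f-string, this variant drops empty segments, so a skill without keywords does not produce a dangling `|` separator.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Capability:
    tag: str


@dataclass
class Intent:
    keywords: list[str] = field(default_factory=list)


@dataclass
class Skill:
    name: str
    description: str = ""
    intent: Intent = field(default_factory=Intent)
    capabilities: list[Capability] = field(default_factory=list)


def build_source_text(skill: Skill) -> str:
    """Join description, keywords, and capability tags, skipping empty parts."""
    parts = [
        skill.description.strip() or skill.name,  # fall back to the skill name
        " ".join(skill.intent.keywords),
        " ".join(cap.tag for cap in skill.capabilities),
    ]
    return " | ".join(part for part in parts if part)


weather = Skill(
    name="weather",
    description="查询天气预报",
    intent=Intent(keywords=["天气", "forecast"]),
    capabilities=[Capability(tag="weather")],
)
print(build_source_text(weather))
print(build_source_text(Skill(name="echo")))
```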

---

## 5. Integration into CostAwareRouter

### 5.1 Constructor Change

```python
class CostAwareRouter:
    def __init__(
        self,
        # ... existing parameters unchanged ...
        semantic_router: SemanticRouter | None = None,
    ):
        self._semantic_router = semantic_router
        ...
```

### 5.2 Route Method Modification

The key change is in `route()`, after Layer 1 complexity classification:

```python
# After complexity is determined (medium or high)
skill_hint: str | None = None  # stays None unless Layer 1.5 finds a medium match
if self._semantic_router is not None and complexity >= 0.3:
    try:
        semantic_result = await self._semantic_router.route(clean_content)
        if semantic_result.confidence == "high":
            # Direct skill match: skip LLM classification and Layer 2
            result = await resolve_skill_routing(
                content=content,
                skill_registry=skill_registry,
                intent_router=intent_router,
                # ... other existing arguments unchanged ...
                force_skill=semantic_result.skill_name,  # NEW parameter
            )
            result.match_method = "semantic_high"
            result.match_confidence = semantic_result.similarity
            result.execution_mode = ExecutionMode.SKILL_REACT
            return result
        elif semantic_result.confidence == "medium":
            # Pass skill hint to _classify_merged() or Layer 2
            skill_hint = semantic_result.skill_name
    except Exception as e:
        logger.warning("Semantic routing failed, falling through: %s", e)
```

### 5.3 Skill Hint Propagation

For medium-confidence matches, the skill hint is passed to `_classify_merged()` or `_route_layer2()` via a new `skill_hint` parameter. The hint gives the classifier a strong signal, which is intended to reduce the tokens spent on LLM classification.

**Implementation**: Add a `skill_hint: str | None = None` parameter to `_classify_merged()` and `_route_layer2()`. When provided, include it in the LLM prompt: "Based on semantic analysis, the query may relate to skill '{skill_hint}'. Please confirm or override."

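A minimal sketch of how the hint sentence could be spliced into a classification prompt follows. Only the hint wording comes from this document. The `build_classify_prompt()` helper, the `base_prompt` argument, and the `Query:` suffix are assumptions for illustration, since the real prompt layout of `_classify_merged()` and `_route_layer2()` is not shown here. When no hint is available the prompt is left exactly as it would be without Layer 1.5.

```python
from __future__ import annotations

HINT_TEMPLATE = (
    "Based on semantic analysis, the query may relate to skill '{skill_hint}'. "
    "Please confirm or override."
)


def build_classify_prompt(
    base_prompt: str, query: str, skill_hint: str | None = None
) -> str:
    """Append the semantic hint only when Layer 1.5 produced one."""
    sections = [base_prompt]
    if skill_hint:
        sections.append(HINT_TEMPLATE.format(skill_hint=skill_hint))
    sections.append(f"Query: {query}")
    return "\n\n".join(sections)


print(build_classify_prompt("Classify the intent.", "明天会下雨吗", skill_hint="weather"))
```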

---

## 6. Embedding Caching

Skill embeddings are pre-computed and cached in `SkillEmbeddingIndex`. Query embeddings are computed per-request but can be cached using the existing `EmbeddingCache` from `agentkit.memory.embedder`.

**Design**: The `SemanticRouter` uses an `OpenAIEmbedder` with `EmbeddingCache` for query embeddings. Skill embeddings are stored in `SkillEmbeddingIndex` and only re-computed on skill registration/update.

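The effect of caching query embeddings can be illustrated with a small LRU wrapper. This sketch does not use or reproduce the API of the existing `EmbeddingCache`: `CachedQueryEmbedder` and `CountingEmbedder` are hypothetical stand-ins, and the async `embed(text)` method is an assumed interface. The cache key is the exact query text, so any normalization (trimming, case folding) would be a separate design choice.

```python
from __future__ import annotations

import asyncio
from collections import OrderedDict


class CountingEmbedder:
    """Stand-in embedder (assumption) that records how often it is called."""

    def __init__(self) -> None:
        self.calls = 0

    async def embed(self, text: str) -> list[float]:
        self.calls += 1
        return [float(len(text)), 1.0]  # placeholder vector, not a real embedding


class CachedQueryEmbedder:
    """LRU cache in front of an embedder, keyed by the exact query text."""

    def __init__(self, embedder: CountingEmbedder, max_size: int = 1024) -> None:
        self._embedder = embedder
        self._max_size = max_size
        self._cache: OrderedDict[str, list[float]] = OrderedDict()

    async def embed(self, text: str) -> list[float]:
        if text in self._cache:
            self._cache.move_to_end(text)  # mark as most recently used
            return self._cache[text]
        vector = await self._embedder.embed(text)
        self._cache[text] = vector
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return vector


async def demo() -> int:
    inner = CountingEmbedder()
    cached = CachedQueryEmbedder(inner, max_size=2)
    for query in ("天气怎么样", "天气怎么样", "translate this", "天气怎么样"):
        await cached.embed(query)
    return inner.calls  # number of real embedding calls made
```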

---

## 7. Edge Cases

| Edge Case | Behavior |
|-----------|----------|
| No skills registered | `SkillEmbeddingIndex` is empty, `route()` returns low confidence |
| Embedder API fails | Catch exception, return low confidence, fall through to existing flow |
| Skill has no description | Use `skill.name` as fallback source text |
| Chinese query, English skill description | `bge-m3` handles cross-lingual matching |
| Multiple skills with similar embeddings | Return top match; if top_k > 1, could return alternatives (deferred) |
| Semantic router disabled (None) | Existing flow unchanged, zero overhead |

---

## 8. Test Strategy

1. **test_semantic_high_confidence**: Query matches skill with sim > 0.85 → SKILL_REACT returned
2. **test_semantic_medium_confidence**: Query matches skill with sim 0.6-0.85 → skill_hint passed
3. **test_semantic_low_confidence**: Query has sim < 0.6 → normal routing proceeds
4. **test_semantic_router_disabled**: No semantic_router → existing flow unchanged
5. **test_embedder_failure**: Embedder throws error → falls through gracefully
6. **test_skill_registration_updates_index**: New skill added → embedding computed
7. **test_chinese_query**: Chinese query matches Chinese skill description


---

## 9. Argumentation Summary

| Design Choice | Alternatives Considered | Why This Choice |
|--------------|------------------------|----------------|
| Layer 1.5 for both medium AND high | Only medium | High complexity benefits even more from zero-cost skill match |
| Pre-computed skill embeddings | Compute per query | Embedding every skill at query time costs roughly 100 ms × n_skills; with pre-computed embeddings, each query needs one embedding call plus n_skills in-memory similarity comparisons |
| bge-m3 default | text-embedding-3-small | Chinese+English mixed text; bge-m3 is a strong multilingual embedding model |
| Skill hint for medium confidence | Direct match for medium | Medium confidence isn't reliable enough for direct match; hint reduces LLM tokens without risking wrong routing |
| Separate SemanticRouter class | Inline in CostAwareRouter | Separation of concerns; testable independently; can be disabled without touching router |