373 lines
17 KiB
Markdown
373 lines
17 KiB
Markdown
---
|
||
title: "feat: Chat Response Speed Optimization — Sub-1s First Token"
|
||
status: active
|
||
created: 2026-06-12
|
||
plan-type: feat
|
||
depth: standard
|
||
---
|
||
|
||
# feat: Chat Response Speed Optimization — Sub-1s First Token
|
||
|
||
## Summary
|
||
|
||
Optimize the fischer-agentkit conversation response pipeline to achieve sub-1-second first-token latency. The primary bottleneck is 1–2 extra LLM calls in the routing layer before the main ReAct loop. Secondary optimizations include parallel tool execution, async session I/O, and connection pool tuning. All changes are gated by configuration flags for safe rollback.
|
||
|
||
## Problem Frame
|
||
|
||
Users experience 5–10 second delays before seeing any response in the chat interface. The root cause is a serial chain of LLM calls: CostAwareRouter.quick_classify() → IntentRouter._classify_with_llm() → ReActEngine LLM Think. The first two calls are routing overhead that add 2–6 seconds with no user-visible value. The third call is the actual reasoning step and cannot be eliminated, but its perceived latency can be reduced via streaming.
|
||
|
||
**Current worst-case latency chain:**
|
||
|
||
```
|
||
User message → quick_classify() [1-2s LLM]
|
||
→ _classify_with_llm() [1-2s LLM]
|
||
→ ReActEngine Think [2-5s LLM]
|
||
→ Tool Act [0.5-5s]
|
||
→ First token visible to user
|
||
```
|
||
|
||
**Target latency chain:**
|
||
|
||
```
|
||
User message → Local rule classification [<1ms]
|
||
→ ReActEngine Think (streaming) [first token in <1s]
|
||
→ Tool Act (parallel when possible)
|
||
→ First token visible to user
|
||
```
|
||
|
||
## Requirements
|
||
|
||
| ID | Requirement | Priority |
|
||
|----|-------------|----------|
|
||
| R1 | First token latency must be under 1 second for simple conversations (greetings, Q&A) | P0 |
|
||
| R2 | First token latency must be under 1 second for routed conversations when keyword matching succeeds | P0 |
|
||
| R3 | Routing accuracy must not degrade more than 10% compared to current LLM-based classification | P1 |
|
||
| R4 | All optimizations must be configurable with on/off switches for safe rollback | P0 |
|
||
| R5 | Parallel tool execution must preserve conversation history ordering | P1 |
|
||
| R6 | Async session writes must not lose messages on process crash | P1 |
|
||
|
||
## Key Technical Decisions
|
||
|
||
### KTD1: Replace LLM quick_classify with local heuristic
|
||
|
||
**Decision:** Replace `CostAwareRouter.quick_classify()` LLM call with a zero-cost local heuristic based on message length, keyword density, and tool-hint detection.
|
||
|
||
**Rationale:** The LLM classification adds 1–2s latency for a binary decision (simple vs complex). A local heuristic using the same signals already present in the message content (length, presence of tool-related keywords, question marks, etc.) can achieve ~85% accuracy at zero latency cost.
|
||
|
||
**Alternative considered:** Cache LLM classification results. Rejected because cache hit rate would be near-zero for conversational messages (each is unique).
|
||
|
||
### KTD2: Merge quick_classify and intent classification into single LLM call
|
||
|
||
**Decision:** When LLM routing is needed (heuristic uncertainty), combine complexity scoring and intent classification into a single LLM call instead of two serial calls.
|
||
|
||
**Rationale:** Currently `quick_classify()` and `_classify_with_llm()` are separate LLM calls that could be merged into one prompt returning both complexity score and matched skill. This halves the routing LLM overhead when it cannot be avoided.
|
||
|
||
### KTD3: Parallel execution of independent tool_calls
|
||
|
||
**Decision:** Execute multiple tool_calls from a single LLM response in parallel using `asyncio.gather()`, with results appended to conversation in tool_call_id order.
|
||
|
||
**Rationale:** When LLM returns multiple tool calls (e.g., search + calculate), they are independent and can run concurrently. Results must be appended in order for the next LLM call to see them correctly.
|
||
|
||
**Risk:** Some tool calls may have implicit dependencies. Mitigation: the LLM generally does not return dependent calls in a single response (it waits for results before calling the next). Add a config flag `react.parallel_tools: false` to disable if needed.
|
||
|
||
### KTD4: Fire-and-forget session writes with write-ahead buffer
|
||
|
||
**Decision:** Make `SessionManager.append_message()` non-blocking by returning immediately after queuing the write, with a background task performing the actual I/O. Add a small in-memory buffer as write-ahead log to prevent message loss.
|
||
|
||
**Rationale:** Session writes (especially `save_session()` for updated_at) add unnecessary blocking. The user doesn't need to wait for persistence before seeing a response. A write-ahead buffer ensures messages survive brief failures.
|
||
|
||
### KTD5: Unified httpx connection pool configuration
|
||
|
||
**Decision:** Configure explicit `httpx.Limits` on all LLM provider clients with sensible defaults for keepalive and connection pooling.
|
||
|
||
**Rationale:** Default httpx settings are reasonable but not optimized for high-frequency LLM API calls. Explicit configuration ensures consistent behavior across providers and enables tuning.
|
||
|
||
## Scope Boundaries
|
||
|
||
### In Scope
|
||
|
||
- Routing layer optimization (CostAwareRouter, IntentRouter)
|
||
- ReActEngine parallel tool execution
|
||
- Session I/O async optimization
|
||
- httpx connection pool tuning
|
||
- Configuration flags for all changes
|
||
- Test coverage for new behavior
|
||
|
||
### Out of Scope
|
||
|
||
- Frontend rendering optimization (separate concern)
|
||
- LLM provider response time optimization (external dependency)
|
||
- Memory/RAG pipeline optimization (covered by existing plan 009)
|
||
- Compression strategy changes (covered by existing plan 013)
|
||
- New LLM provider implementations
|
||
|
||
### Deferred to Follow-Up Work
|
||
|
||
- A/B testing framework for routing accuracy measurement
|
||
- Performance benchmarking CI pipeline
|
||
- WebSocket chat flow test coverage
|
||
- WenxinProvider token-refresh client reuse
|
||
|
||
---
|
||
|
||
## Implementation Units
|
||
|
||
### U1. Local heuristic classifier for CostAwareRouter
|
||
|
||
**Goal:** Replace the LLM-based `quick_classify()` with a zero-cost local heuristic, gated by config flag `router.classifier: heuristic | llm`.
|
||
|
||
**Requirements:** R1, R2, R3, R4
|
||
|
||
**Dependencies:** None
|
||
|
||
**Files:**
|
||
- `src/agentkit/chat/skill_routing.py` — add `HeuristicClassifier` class, modify `CostAwareRouter.route()`
|
||
- `src/agentkit/server/config.py` — add `router` config section
|
||
- `agentkit.yaml` — add `router` section with defaults
|
||
- `tests/unit/test_cost_aware_router.py` — add heuristic classifier tests
|
||
|
||
**Approach:**
|
||
|
||
1. Create `HeuristicClassifier` class with a `classify(content: str) -> float` method that returns a complexity score (0.0–1.0) based on:
|
||
- Message length: short messages (<20 chars) → low complexity
|
||
- Question patterns: presence of "为什么", "如何", "怎么", "how", "why", "what" → moderate complexity
|
||
- Tool hints: presence of tool-related keywords (existing `_tokenize_content` + `tool_hints` list already in code) → high complexity
|
||
- Multi-sentence: messages with multiple sentences → higher complexity
|
||
- Code patterns: presence of code-like patterns (backticks, brackets) → higher complexity
|
||
|
||
2. Modify `CostAwareRouter.__init__` to accept a `classifier_mode` parameter (`"heuristic"` or `"llm"`)
|
||
|
||
3. Modify `CostAwareRouter.route()` Phase 1 to use `HeuristicClassifier.classify()` when mode is `"heuristic"`
|
||
|
||
4. Add `router` config section to `ServerConfig` with `classifier` field (default: `"heuristic"`)
|
||
|
||
5. Wire config through `create_app()` to `CostAwareRouter`
|
||
|
||
**Patterns to follow:** Existing `CostAwareRouter._match_layer0()` rule-based pattern; existing `_tokenize_content()` for keyword extraction.
|
||
|
||
**Test scenarios:**
|
||
- Short greeting → complexity < 0.3
|
||
- Single question with "如何" → complexity 0.3–0.7
|
||
- Multi-step request with tool keywords → complexity > 0.7
|
||
- Code-related request → complexity > 0.7
|
||
- Empty string → complexity 0.0
|
||
- Very long message (>500 chars) → complexity > 0.5
|
||
- Config flag `classifier: llm` falls back to LLM classification
|
||
- Config flag `classifier: heuristic` uses local heuristic
|
||
|
||
**Verification:** All existing `test_cost_aware_router.py` tests pass; new heuristic tests pass; manual test shows first-token latency <1s for simple messages.
|
||
|
||
---
|
||
|
||
### U2. Merged routing LLM call
|
||
|
||
**Goal:** When LLM routing is needed (heuristic uncertain or config forces LLM), combine complexity scoring and intent classification into a single LLM call.
|
||
|
||
**Requirements:** R2, R3, R4
|
||
|
||
**Dependencies:** U1
|
||
|
||
**Files:**
|
||
- `src/agentkit/chat/skill_routing.py` — add `MergedRouter` method
|
||
- `src/agentkit/router/intent.py` — add `route_with_complexity()` method
|
||
- `tests/unit/test_cost_aware_router.py` — add merged routing tests
|
||
- `tests/unit/test_intent_router.py` — add merged routing tests
|
||
|
||
**Approach:**
|
||
|
||
1. Add `IntentRouter.route_with_complexity()` method that returns both a `RoutingResult` and a complexity score in a single LLM call. The prompt asks the LLM to return `{"skill": "...", "confidence": 0.9, "complexity": 0.5}`.
|
||
|
||
2. Modify `CostAwareRouter.route()` so that when `classifier` is `"llm"`, it calls `route_with_complexity()` instead of making two separate calls.
|
||
|
||
3. When `classifier` is `"heuristic"` and the heuristic returns uncertainty (score in 0.3–0.7 range), use `route_with_complexity()` as a single fallback call.
|
||
|
||
**Patterns to follow:** Existing `_classify_with_llm()` prompt structure; existing `quick_classify()` prompt structure.
|
||
|
||
**Test scenarios:**
|
||
- Merged call returns both skill match and complexity score
|
||
- Merged call with no matching skill returns complexity only
|
||
- Merged call with invalid LLM response falls back to rule-based evaluation
|
||
- Heuristic uncertain + merged call produces correct routing
|
||
- Config `classifier: llm` uses merged call instead of two separate calls
|
||
|
||
**Verification:** Existing tests pass; merged routing reduces LLM calls from 2 to 1 when LLM routing is needed.
|
||
|
||
---
|
||
|
||
### U3. Parallel tool execution in ReActEngine
|
||
|
||
**Goal:** Execute multiple independent tool_calls from a single LLM response in parallel, gated by config flag `react.parallel_tools`.
|
||
|
||
**Requirements:** R5, R4
|
||
|
||
**Dependencies:** None
|
||
|
||
**Files:**
|
||
- `src/agentkit/core/react.py` — modify `_execute_loop()` and `execute_stream()` to use `asyncio.gather()`
|
||
- `src/agentkit/server/config.py` — add `react.parallel_tools` config
|
||
- `agentkit.yaml` — add `react` section
|
||
- `tests/unit/test_react_engine.py` — add parallel execution tests
|
||
|
||
**Approach:**
|
||
|
||
1. Add `parallel_tools: bool = True` parameter to `ReActEngine.__init__`.
|
||
|
||
2. In `_execute_loop()` and `execute_stream()`, when `response.tool_calls` has >1 items and `parallel_tools` is True:
|
||
- Execute all tool calls concurrently with `asyncio.gather(*[_execute_tool(tc.name, tc.arguments, tools) for tc in response.tool_calls], return_exceptions=True)`
|
||
- Build tool result messages in tool_call_id order
|
||
- Append all results to conversation in order
|
||
|
||
3. When `parallel_tools` is False, keep current serial behavior.
|
||
|
||
4. For `execute_stream()`, yield all `tool_call` events first, then all `tool_result` events after gather completes.
|
||
|
||
**Patterns to follow:** Existing `_execute_tool()` method; existing `Orchestrator._execute_plan()` parallel group pattern in `orchestrator.py`.
|
||
|
||
**Test scenarios:**
|
||
- Two independent tools execute in parallel, both results present in conversation
|
||
- Parallel execution preserves tool_call_id ordering in conversation
|
||
- One tool fails, other succeeds — partial results preserved
|
||
- `parallel_tools: false` falls back to serial execution
|
||
- Single tool_call works identically with parallel mode on/off
|
||
- Tool results appended to conversation in correct order for next LLM call
|
||
|
||
**Verification:** Existing ReAct tests pass; new parallel tests pass; manual test with multi-tool request shows reduced execution time.
|
||
|
||
---
|
||
|
||
### U4. Async session writes with write-ahead buffer
|
||
|
||
**Goal:** Make `SessionManager.append_message()` non-blocking by deferring `save_session()` and making `append_message()` fire-and-forget with a small write-ahead buffer.
|
||
|
||
**Requirements:** R6, R4
|
||
|
||
**Dependencies:** None
|
||
|
||
**Files:**
|
||
- `src/agentkit/session/manager.py` — add async write queue and WAL buffer
|
||
- `tests/unit/test_session_manager.py` — add async write tests
|
||
|
||
**Approach:**
|
||
|
||
1. Add an `AsyncWriteQueue` to `SessionManager` that:
|
||
- Accepts write operations (append_message, save_session) as tasks
|
||
- Executes them in a background `asyncio.Task`
|
||
- Maintains a small in-memory buffer of recent writes for crash recovery
|
||
- Provides `await flush()` for graceful shutdown
|
||
|
||
2. Modify `append_message()`:
|
||
- Keep `get_session()` + validation as synchronous (needed for error checking)
|
||
- Queue `store.append_message()` + `store.save_session()` as a single async task
|
||
- Return the `Message` object immediately without waiting for persistence
|
||
|
||
3. Modify `get_chat_messages()` to first check the WAL buffer for uncommitted messages, then fall back to store.
|
||
|
||
4. Add `flush()` method called during session close and app shutdown.
|
||
|
||
**Patterns to follow:** Existing `BackgroundRunner` pattern in `server/runner.py`; existing `TaskStore` cleanup pattern.
|
||
|
||
**Test scenarios:**
|
||
- append_message returns immediately, message persisted asynchronously
|
||
- get_chat_messages includes WAL-buffered messages not yet persisted
|
||
- flush() ensures all pending writes complete
|
||
- Multiple rapid append_messages are batched correctly
|
||
- Session close flushes pending writes
|
||
- App shutdown flushes pending writes
|
||
|
||
**Verification:** Existing session tests pass; new async write tests pass; no message loss during normal operation.
|
||
|
||
---
|
||
|
||
### U5. httpx connection pool configuration
|
||
|
||
**Goal:** Configure explicit `httpx.Limits` on all LLM provider clients for optimal connection reuse.
|
||
|
||
**Requirements:** R4
|
||
|
||
**Dependencies:** None
|
||
|
||
**Files:**
|
||
- `src/agentkit/llm/providers/openai.py` — add `httpx.Limits` configuration
|
||
- `src/agentkit/llm/providers/anthropic.py` — add `httpx.Limits` configuration
|
||
- `src/agentkit/llm/providers/gemini.py` — add `httpx.Limits` configuration
|
||
- `src/agentkit/llm/config.py` — add connection pool config fields
|
||
- `tests/unit/test_llm_provider.py` — verify connection pool settings
|
||
|
||
**Approach:**
|
||
|
||
1. Add `connection_pool` section to `ProviderConfig`:
|
||
- `max_connections: int = 100`
|
||
- `max_keepalive_connections: int = 20`
|
||
- `keepalive_expiry: float = 30.0`
|
||
|
||
2. Pass `httpx.Limits` to all provider constructors.
|
||
|
||
3. Configure `httpx.AsyncClient` with explicit limits in each provider.
|
||
|
||
**Patterns to follow:** Existing `ProviderConfig` dataclass pattern; existing `timeout` parameter pattern.
|
||
|
||
**Test scenarios:**
|
||
- Provider creates httpx client with configured limits
|
||
- Default limits applied when not configured
|
||
- Custom limits from config override defaults
|
||
- Connection reuse verified via mock
|
||
|
||
**Verification:** Existing provider tests pass; connection pool settings applied correctly.
|
||
|
||
---
|
||
|
||
### U6. Chat route pipeline optimization
|
||
|
||
**Goal:** Optimize the WebSocket chat handler to overlap I/O operations and reduce serial waits.
|
||
|
||
**Requirements:** R1, R2
|
||
|
||
**Dependencies:** U1, U4
|
||
|
||
**Files:**
|
||
- `src/agentkit/server/routes/chat.py` — parallelize session operations
|
||
- `tests/unit/test_chat_routes.py` — add pipeline optimization tests
|
||
|
||
**Approach:**
|
||
|
||
1. In `_handle_chat_message()`, parallelize:
|
||
- `sm.append_message()` (user message) and `sm.get_chat_messages()` — these can run concurrently since append_message now returns immediately (U4)
|
||
|
||
2. Move assistant message `append_message()` to fire-and-forget after streaming completes (already non-blocking with U4).
|
||
|
||
3. Reuse `ReActEngine` instance per session instead of creating new one per message.
|
||
|
||
**Patterns to follow:** Existing `asyncio.gather` pattern in orchestrator.
|
||
|
||
**Test scenarios:**
|
||
- User message append and chat messages retrieval run concurrently
|
||
- Assistant message persisted after streaming completes
|
||
- ReActEngine reuse across messages in same session
|
||
- Error during parallel operations handled gracefully
|
||
|
||
**Verification:** Existing chat route tests pass; manual test shows reduced latency.
|
||
|
||
---
|
||
|
||
## Risks & Mitigations
|
||
|
||
| Risk | Likelihood | Impact | Mitigation |
|
||
|------|-----------|--------|------------|
|
||
| Heuristic classifier misroutes requests | Medium | Medium — wrong skill or wrong execution mode | Config flag to revert to LLM; monitor routing accuracy via telemetry |
|
||
| Parallel tool execution breaks implicit dependencies | Low | High — incorrect results | Config flag to disable; LLM rarely returns dependent calls in single response |
|
||
| Async session writes lose messages on crash | Low | Medium — missing conversation history | WAL buffer + flush on shutdown; acceptable trade-off for speed |
|
||
| Merged LLM call prompt confuses the model | Low | Low — falls back to separate calls | Fallback to separate calls on parse failure |
|
||
|
||
## System-Wide Impact
|
||
|
||
- **Routing layer:** CostAwareRouter and IntentRouter behavior changes when heuristic mode is active; existing LLM-based routing preserved as fallback
|
||
- **ReAct engine:** Tool execution changes from serial to parallel; conversation history ordering preserved
|
||
- **Session management:** Write operations become asynchronous; read operations check WAL buffer
|
||
- **Configuration:** New `router` and `react` config sections in `agentkit.yaml`
|
||
- **Telemetry:** Existing OpenTelemetry spans continue to work; new spans for heuristic classification
|
||
|
||
## Open Questions
|
||
|
||
- What is the actual routing accuracy of the current LLM-based classifier? Need baseline measurement before comparing heuristic accuracy.
|
||
- Should the heuristic classifier be extensible (plugin pattern) or hardcoded? Starting with hardcoded for simplicity, can extend later.
|