---
title: "feat: Chat Response Speed Optimization — Sub-1s First Token"
status: active
created: 2026-06-12
plan-type: feat
depth: standard
---

# feat: Chat Response Speed Optimization — Sub-1s First Token

## Summary

Optimize the fischer-agentkit conversation response pipeline to achieve sub-1-second first-token latency. The primary bottleneck is 1–2 extra LLM calls in the routing layer before the main ReAct loop. Secondary optimizations include parallel tool execution, async session I/O, and connection pool tuning. All changes are gated by configuration flags for safe rollback.

## Problem Frame

Users experience 5–10 second delays before seeing any response in the chat interface. The root cause is a serial chain of LLM calls: `CostAwareRouter.quick_classify()` → `IntentRouter._classify_with_llm()` → ReActEngine LLM Think. The first two calls are routing overhead that adds 2–4 seconds (1–2 seconds each) with no user-visible value. The third call is the actual reasoning step and cannot be eliminated, but its perceived latency can be reduced via streaming.

**Current worst-case latency chain:**

```
User message
  → quick_classify() [1-2s LLM]
  → _classify_with_llm() [1-2s LLM]
  → ReActEngine Think [2-5s LLM]
  → Tool Act [0.5-5s]
  → First token visible to user
```

**Target latency chain:**

```
User message
  → Local rule classification [<1ms]
  → ReActEngine Think (streaming) [first token in <1s]
  → Tool Act (parallel when possible)
  → First token visible to user
```

## Requirements

| ID | Requirement | Priority |
|----|-------------|----------|
| R1 | First token latency must be under 1 second for simple conversations (greetings, Q&A) | P0 |
| R2 | First token latency must be under 1 second for routed conversations when keyword matching succeeds | P0 |
| R3 | Routing accuracy must not degrade by more than 10% compared to current LLM-based classification | P1 |
| R4 | All optimizations must be configurable with on/off switches for safe rollback | P0 |
| R5 | Parallel tool execution must preserve conversation history ordering | P1 |
| R6 | Async session writes must not lose messages on process crash | P1 |

## Key Technical Decisions

### KTD1: Replace LLM quick_classify with local heuristic

**Decision:** Replace `CostAwareRouter.quick_classify()` LLM call with a zero-cost local heuristic based on message length, keyword density, and tool-hint detection.

**Rationale:** The LLM classification adds 1–2s latency for a binary decision (simple vs complex). A local heuristic using the same signals already present in the message content (length, presence of tool-related keywords, question marks, etc.) can achieve ~85% accuracy at zero latency cost.

**Alternative considered:** Cache LLM classification results. Rejected because cache hit rate would be near-zero for conversational messages (each is unique).
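
A minimal sketch of the kind of local heuristic this decision describes. The `HeuristicClassifier` name and the `classify(content: str) -> float` signature follow U1 of this plan; the keyword lists, weights, and regexes are illustrative assumptions, not values from the codebase, and would need tuning against a measured routing-accuracy baseline.

```python
import re

# Illustrative keyword lists and weights (assumptions, not taken from agentkit).
QUESTION_WORDS_EN = {"how", "why", "what"}
QUESTION_WORDS_ZH = ("为什么", "如何", "怎么")
TOOL_HINTS_EN = {"search", "calculate", "download", "translate"}
TOOL_HINTS_ZH = ("搜索", "计算")
CODE_PATTERN = re.compile(r"`|[{}\[\]]")
SENTENCE_END = re.compile(r"[.!?。！？]+")


class HeuristicClassifier:
    """Scores message complexity in [0.0, 1.0] without calling an LLM."""

    def classify(self, content: str) -> float:
        text = content.strip()
        if not text:
            return 0.0
        lowered = text.lower()
        # Whole-word matching for English keywords avoids hits like "show" -> "how".
        words = set(re.findall(r"[a-z_]+", lowered))
        score = 0.0

        # Message length: short messages add nothing, very long ones add the most.
        if len(text) > 500:
            score += 0.6
        elif len(text) >= 20:
            score += 0.2

        # Question patterns suggest moderate complexity.
        if words & QUESTION_WORDS_EN or any(w in lowered for w in QUESTION_WORDS_ZH):
            score += 0.4

        # Tool-related keywords suggest the request needs tool use.
        if words & TOOL_HINTS_EN or any(w in lowered for w in TOOL_HINTS_ZH):
            score += 0.5

        # More than one sentence usually means a multi-step request.
        sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
        if len(sentences) > 1:
            score += 0.2

        # Backticks or brackets hint at code.
        if CODE_PATTERN.search(text):
            score += 0.75

        return min(score, 1.0)
```

Because the score is a capped sum of independent signals, each signal can be tuned or disabled separately, and a middle band of scores can be treated as "uncertain" and handed to the LLM path.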

### KTD2: Merge quick_classify and intent classification into single LLM call

**Decision:** When LLM routing is needed (heuristic uncertainty), combine complexity scoring and intent classification into a single LLM call instead of two serial calls.

**Rationale:** Currently `quick_classify()` and `_classify_with_llm()` are separate LLM calls that could be merged into one prompt returning both complexity score and matched skill. This halves the routing LLM overhead when it cannot be avoided.

### KTD3: Parallel execution of independent tool_calls

**Decision:** Execute multiple tool_calls from a single LLM response in parallel using `asyncio.gather()`, with results appended to conversation in tool_call_id order.

**Rationale:** When LLM returns multiple tool calls (e.g., search + calculate), they are independent and can run concurrently. Results must be appended in order for the next LLM call to see them correctly.

**Risk:** Some tool calls may have implicit dependencies. Mitigation: the LLM generally does not return dependent calls in a single response (it waits for results before calling the next). Add a config flag `react.parallel_tools: false` to disable if needed.

### KTD4: Fire-and-forget session writes with write-ahead buffer

**Decision:** Make `SessionManager.append_message()` non-blocking by returning immediately after queuing the write, with a background task performing the actual I/O. Add a small in-memory buffer as write-ahead log to prevent message loss.

**Rationale:** Session writes (especially `save_session()` for updated_at) add unnecessary blocking. The user doesn't need to wait for persistence before seeing a response. A write-ahead buffer ensures messages survive brief failures.

### KTD5: Unified httpx connection pool configuration

**Decision:** Configure explicit `httpx.Limits` on all LLM provider clients with sensible defaults for keepalive and connection pooling.

**Rationale:** Default httpx settings are reasonable but not optimized for high-frequency LLM API calls. Explicit configuration ensures consistent behavior across providers and enables tuning.

## Scope Boundaries

### In Scope

- Routing layer optimization (CostAwareRouter, IntentRouter)
- ReActEngine parallel tool execution
- Session I/O async optimization
- httpx connection pool tuning
- Configuration flags for all changes
- Test coverage for new behavior

### Out of Scope

- Frontend rendering optimization (separate concern)
- LLM provider response time optimization (external dependency)
- Memory/RAG pipeline optimization (covered by existing plan 009)
- Compression strategy changes (covered by existing plan 013)
- New LLM provider implementations

### Deferred to Follow-Up Work

- A/B testing framework for routing accuracy measurement
- Performance benchmarking CI pipeline
- WebSocket chat flow test coverage
- WenxinProvider token-refresh client reuse

---

## Implementation Units

### U1. Local heuristic classifier for CostAwareRouter

**Goal:** Replace the LLM-based `quick_classify()` with a zero-cost local heuristic, gated by config flag `router.classifier: heuristic | llm`.

**Requirements:** R1, R2, R3, R4

**Dependencies:** None

**Files:**

- `src/agentkit/chat/skill_routing.py` — add `HeuristicClassifier` class, modify `CostAwareRouter.route()`
- `src/agentkit/server/config.py` — add `router` config section
- `agentkit.yaml` — add `router` section with defaults
- `tests/unit/test_cost_aware_router.py` — add heuristic classifier tests

**Approach:**

1. Create `HeuristicClassifier` class with a `classify(content: str) -> float` method that returns a complexity score (0.0–1.0) based on:
   - Message length: short messages (<20 chars) → low complexity
   - Question patterns: presence of "为什么", "如何", "怎么", "how", "why", "what" → moderate complexity
   - Tool hints: presence of tool-related keywords (existing `_tokenize_content` + `tool_hints` list already in code) → high complexity
   - Multi-sentence: messages with multiple sentences → higher complexity
   - Code patterns: presence of code-like patterns (backticks, brackets) → higher complexity
2. Modify `CostAwareRouter.__init__` to accept a `classifier_mode` parameter (`"heuristic"` or `"llm"`)
3. Modify `CostAwareRouter.route()` Phase 1 to use `HeuristicClassifier.classify()` when mode is `"heuristic"`
4. Add `router` config section to `ServerConfig` with `classifier` field (default: `"heuristic"`)
5. Wire config through `create_app()` to `CostAwareRouter`

**Patterns to follow:** Existing `CostAwareRouter._match_layer0()` rule-based pattern; existing `_tokenize_content()` for keyword extraction.

**Test scenarios:**

- Short greeting → complexity < 0.3
- Single question with "如何" → complexity 0.3–0.7
- Multi-step request with tool keywords → complexity > 0.7
- Code-related request → complexity > 0.7
- Empty string → complexity 0.0
- Very long message (>500 chars) → complexity > 0.5
- Config flag `classifier: llm` falls back to LLM classification
- Config flag `classifier: heuristic` uses local heuristic

**Verification:** All existing `test_cost_aware_router.py` tests pass; new heuristic tests pass; manual test shows first-token latency <1s for simple messages.

---

### U2. Merged routing LLM call

**Goal:** When LLM routing is needed (heuristic uncertain or config forces LLM), combine complexity scoring and intent classification into a single LLM call.
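
A merged call has to yield both a skill match and a complexity score from one reply. The following is a minimal sketch of parsing such a reply, assuming the JSON shape this unit specifies (`skill`, `confidence`, `complexity`). `MergedRoutingResult` and `parse_merged_response` are hypothetical names, not existing agentkit APIs, and returning `None` on any malformed reply is one assumed way to let the caller trigger its fallback.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergedRoutingResult:  # hypothetical container, not an agentkit type
    skill: Optional[str]
    confidence: float
    complexity: float


def _clamp(value: float) -> float:
    """Keep scores inside the documented 0.0-1.0 range."""
    return max(0.0, min(1.0, value))


def parse_merged_response(raw: str) -> Optional[MergedRoutingResult]:
    """Parse the merged routing reply; return None so the caller can fall back."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            return None
        skill = data.get("skill")
        return MergedRoutingResult(
            # A missing, null, or empty skill means "no matching skill".
            skill=skill if isinstance(skill, str) and skill else None,
            confidence=_clamp(float(data.get("confidence", 0.0))),
            # Complexity is required; a missing key raises KeyError below.
            complexity=_clamp(float(data["complexity"])),
        )
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, a missing complexity field, and non-numeric values.
        return None
```

Treating `skill` as optional but `complexity` as required matches the scenario where the merged call finds no matching skill and returns complexity only.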

**Requirements:** R2, R3, R4

**Dependencies:** U1

**Files:**

- `src/agentkit/chat/skill_routing.py` — add `MergedRouter` method
- `src/agentkit/router/intent.py` — add `route_with_complexity()` method
- `tests/unit/test_cost_aware_router.py` — add merged routing tests
- `tests/unit/test_intent_router.py` — add merged routing tests

**Approach:**

1. Add `IntentRouter.route_with_complexity()` method that returns both a `RoutingResult` and a complexity score in a single LLM call. The prompt asks the LLM to return `{"skill": "...", "confidence": 0.9, "complexity": 0.5}`.
2. Modify `CostAwareRouter.route()` so that when `classifier` is `"llm"`, it calls `route_with_complexity()` instead of making two separate calls.
3. When `classifier` is `"heuristic"` and the heuristic returns uncertainty (score in 0.3–0.7 range), use `route_with_complexity()` as a single fallback call.

**Patterns to follow:** Existing `_classify_with_llm()` prompt structure; existing `quick_classify()` prompt structure.

**Test scenarios:**

- Merged call returns both skill match and complexity score
- Merged call with no matching skill returns complexity only
- Merged call with invalid LLM response falls back to rule-based evaluation
- Heuristic uncertain + merged call produces correct routing
- Config `classifier: llm` uses merged call instead of two separate calls

**Verification:** Existing tests pass; merged routing reduces LLM calls from 2 to 1 when LLM routing is needed.

---

### U3. Parallel tool execution in ReActEngine

**Goal:** Execute multiple independent tool_calls from a single LLM response in parallel, gated by config flag `react.parallel_tools`.
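
A minimal sketch of the parallel path this unit targets, assuming each tool is an async callable looked up by name. `ToolCall`, `run_tool_calls`, and the message dict shape are hypothetical stand-ins for the engine's own types; only the use of `asyncio.gather(..., return_exceptions=True)` is taken from this plan.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List


@dataclass
class ToolCall:  # hypothetical stand-in for the engine's tool-call type
    id: str
    name: str
    arguments: Dict[str, Any]


ToolFn = Callable[..., Awaitable[Any]]


async def run_tool_calls(
    tool_calls: List[ToolCall],
    tools: Dict[str, ToolFn],
    parallel: bool = True,
) -> List[Dict[str, str]]:
    """Run tool calls and return result messages in the original call order."""

    async def run_one(tc: ToolCall) -> Any:
        return await tools[tc.name](**tc.arguments)

    if parallel and len(tool_calls) > 1:
        # gather returns results in argument order, not completion order;
        # return_exceptions=True keeps one failure from discarding the rest.
        results = await asyncio.gather(
            *(run_one(tc) for tc in tool_calls), return_exceptions=True
        )
    else:
        # Serial path: same error capture, one call at a time.
        results = []
        for tc in tool_calls:
            try:
                results.append(await run_one(tc))
            except Exception as exc:
                results.append(exc)

    messages = []
    for tc, result in zip(tool_calls, results):
        if isinstance(result, BaseException):
            content = f"error: {result}"
        else:
            content = str(result)
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": content})
    return messages
```

Zipping the results with the original `tool_calls` list is what preserves conversation ordering: a slow first tool still produces the first message, and a failed tool yields an error message in its own slot instead of aborting the batch.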

**Requirements:** R5, R4

**Dependencies:** None

**Files:**

- `src/agentkit/core/react.py` — modify `_execute_loop()` and `execute_stream()` to use `asyncio.gather()`
- `src/agentkit/server/config.py` — add `react.parallel_tools` config
- `agentkit.yaml` — add `react` section
- `tests/unit/test_react_engine.py` — add parallel execution tests

**Approach:**

1. Add `parallel_tools: bool = True` parameter to `ReActEngine.__init__`.
2. In `_execute_loop()` and `execute_stream()`, when `response.tool_calls` contains more than one item and `parallel_tools` is True:
   - Execute all tool calls concurrently with `asyncio.gather(*[_execute_tool(tc.name, tc.arguments, tools) for tc in response.tool_calls], return_exceptions=True)`
   - Build tool result messages in tool_call_id order
   - Append all results to conversation in order
3. When `parallel_tools` is False, keep current serial behavior.
4. For `execute_stream()`, yield all `tool_call` events first, then all `tool_result` events after gather completes.

**Patterns to follow:** Existing `_execute_tool()` method; existing `Orchestrator._execute_plan()` parallel group pattern in `orchestrator.py`.

**Test scenarios:**

- Two independent tools execute in parallel, both results present in conversation
- Parallel execution preserves tool_call_id ordering in conversation
- One tool fails, other succeeds — partial results preserved
- `parallel_tools: false` falls back to serial execution
- Single tool_call works identically with parallel mode on/off
- Tool results appended to conversation in correct order for next LLM call

**Verification:** Existing ReAct tests pass; new parallel tests pass; manual test with multi-tool request shows reduced execution time.

---

### U4. Async session writes with write-ahead buffer

**Goal:** Make `SessionManager.append_message()` non-blocking by deferring `save_session()` and making `append_message()` fire-and-forget with a small write-ahead buffer.
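
A minimal sketch of a fire-and-forget write queue with a flush hook, assuming a single background worker and writes expressed as async callables. The `AsyncWriteQueue` name and `flush()` follow this plan; `submit`, `pending`, and `close` are hypothetical. The `pending` list lives in memory only, so it lets reads see unpersisted messages but does not by itself survive a process crash.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional

WriteFn = Callable[[Any], Awaitable[None]]


class AsyncWriteQueue:
    """Runs queued writes in submission order on one background task."""

    def __init__(self) -> None:
        self._queue: Optional[asyncio.Queue] = None
        self._worker: Optional[asyncio.Task] = None
        self.pending: List[Any] = []  # items queued but not yet persisted

    def submit(self, item: Any, write: WriteFn) -> None:
        """Queue write(item) and return at once; call from a running event loop."""
        if self._queue is None:
            # Created lazily so the queue and worker belong to the running loop.
            self._queue = asyncio.Queue()
            self._worker = asyncio.create_task(self._drain())
        self.pending.append(item)
        self._queue.put_nowait((item, write))

    async def _drain(self) -> None:
        assert self._queue is not None
        while True:
            item, write = await self._queue.get()
            try:
                await write(item)
                self.pending.remove(item)
            except Exception:
                # The item stays in pending; a real implementation would
                # log the failure and retry here.
                pass
            finally:
                self._queue.task_done()

    async def flush(self) -> None:
        """Wait until every queued write has been attempted."""
        if self._queue is not None:
            await self._queue.join()

    async def close(self) -> None:
        """Flush, then stop the background worker (for shutdown)."""
        await self.flush()
        if self._worker is not None:
            self._worker.cancel()
            try:
                await self._worker
            except asyncio.CancelledError:
                pass
```

A single worker keeps writes for a session in submission order, which a pool of concurrent writers would not guarantee; `flush()` relies on `asyncio.Queue.join()`, so it returns only after each queued item has been marked done.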

**Requirements:** R6, R4

**Dependencies:** None

**Files:**

- `src/agentkit/session/manager.py` — add async write queue and WAL buffer
- `tests/unit/test_session_manager.py` — add async write tests

**Approach:**

1. Add an `AsyncWriteQueue` to `SessionManager` that:
   - Accepts write operations (append_message, save_session) as tasks
   - Executes them in a background `asyncio.Task`
   - Maintains a small in-memory buffer of recent writes for crash recovery
   - Provides `await flush()` for graceful shutdown
2. Modify `append_message()`:
   - Keep `get_session()` + validation as synchronous (needed for error checking)
   - Queue `store.append_message()` + `store.save_session()` as a single async task
   - Return the `Message` object immediately without waiting for persistence
3. Modify `get_chat_messages()` to first check the WAL buffer for uncommitted messages, then fall back to store.
4. Add `flush()` method called during session close and app shutdown.

**Patterns to follow:** Existing `BackgroundRunner` pattern in `server/runner.py`; existing `TaskStore` cleanup pattern.

**Test scenarios:**

- append_message returns immediately, message persisted asynchronously
- get_chat_messages includes WAL-buffered messages not yet persisted
- flush() ensures all pending writes complete
- Multiple rapid append_messages are batched correctly
- Session close flushes pending writes
- App shutdown flushes pending writes

**Verification:** Existing session tests pass; new async write tests pass; no message loss during normal operation.

---

### U5. httpx connection pool configuration

**Goal:** Configure explicit `httpx.Limits` on all LLM provider clients for optimal connection reuse.

**Requirements:** R4

**Dependencies:** None

**Files:**

- `src/agentkit/llm/providers/openai.py` — add `httpx.Limits` configuration
- `src/agentkit/llm/providers/anthropic.py` — add `httpx.Limits` configuration
- `src/agentkit/llm/providers/gemini.py` — add `httpx.Limits` configuration
- `src/agentkit/llm/config.py` — add connection pool config fields
- `tests/unit/test_llm_provider.py` — verify connection pool settings

**Approach:**

1. Add `connection_pool` section to `ProviderConfig`:
   - `max_connections: int = 100`
   - `max_keepalive_connections: int = 20`
   - `keepalive_expiry: float = 30.0`
2. Pass `httpx.Limits` to all provider constructors.
3. Configure `httpx.AsyncClient` with explicit limits in each provider.

**Patterns to follow:** Existing `ProviderConfig` dataclass pattern; existing `timeout` parameter pattern.

**Test scenarios:**

- Provider creates httpx client with configured limits
- Default limits applied when not configured
- Custom limits from config override defaults
- Connection reuse verified via mock

**Verification:** Existing provider tests pass; connection pool settings applied correctly.

---

### U6. Chat route pipeline optimization

**Goal:** Optimize the WebSocket chat handler to overlap I/O operations and reduce serial waits.

**Requirements:** R1, R2

**Dependencies:** U1, U4

**Files:**

- `src/agentkit/server/routes/chat.py` — parallelize session operations
- `tests/unit/test_chat_routes.py` — add pipeline optimization tests

**Approach:**

1. In `_handle_chat_message()`, parallelize:
   - `sm.append_message()` (user message) and `sm.get_chat_messages()` — these can run concurrently since append_message now returns immediately (U4)
2. Move assistant message `append_message()` to fire-and-forget after streaming completes (already non-blocking with U4).
3. Reuse `ReActEngine` instance per session instead of creating new one per message.

**Patterns to follow:** Existing `asyncio.gather` pattern in orchestrator.

**Test scenarios:**

- User message append and chat messages retrieval run concurrently
- Assistant message persisted after streaming completes
- ReActEngine reuse across messages in same session
- Error during parallel operations handled gracefully

**Verification:** Existing chat route tests pass; manual test shows reduced latency.

---

## Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Heuristic classifier misroutes requests | Medium | Medium — wrong skill or wrong execution mode | Config flag to revert to LLM; monitor routing accuracy via telemetry |
| Parallel tool execution breaks implicit dependencies | Low | High — incorrect results | Config flag to disable; LLM rarely returns dependent calls in single response |
| Async session writes lose messages on crash | Low | Medium — missing conversation history | WAL buffer + flush on shutdown; acceptable trade-off for speed |
| Merged LLM call prompt confuses the model | Low | Low — falls back to separate calls | Fallback to separate calls on parse failure |

## System-Wide Impact

- **Routing layer:** CostAwareRouter and IntentRouter behavior changes when heuristic mode is active; existing LLM-based routing preserved as fallback
- **ReAct engine:** Tool execution changes from serial to parallel; conversation history ordering preserved
- **Session management:** Write operations become asynchronous; read operations check WAL buffer
- **Configuration:** New `router` and `react` config sections in `agentkit.yaml`
- **Telemetry:** Existing OpenTelemetry spans continue to work; new spans for heuristic classification

## Open Questions

- What is the actual routing accuracy of the current LLM-based classifier? Need baseline measurement before comparing heuristic accuracy.
- Should the heuristic classifier be extensible (plugin pattern) or hardcoded? Starting with hardcoded for simplicity, can extend later.