| title | status | created | plan-type | depth |
|---|---|---|---|---|
| feat: Chat Response Speed Optimization — Sub-1s First Token | active | 2026-06-12 | feat | standard |
feat: Chat Response Speed Optimization — Sub-1s First Token
Summary
Optimize the fischer-agentkit conversation response pipeline to achieve sub-1-second first-token latency. The primary bottleneck is 1–2 extra LLM calls in the routing layer before the main ReAct loop. Secondary optimizations include parallel tool execution, async session I/O, and connection pool tuning. All changes are gated by configuration flags for safe rollback.
Problem Frame
Users experience 5–10 second delays before seeing any response in the chat interface. The root cause is a serial chain of LLM calls: CostAwareRouter.quick_classify() → IntentRouter._classify_with_llm() → ReActEngine LLM Think. The first two calls are routing overhead that adds 2–4 seconds (1–2 seconds each) with no user-visible value. The third call is the actual reasoning step and cannot be eliminated, but its perceived latency can be reduced via streaming.
Current worst-case latency chain:
User message → quick_classify() [1-2s LLM]
→ _classify_with_llm() [1-2s LLM]
→ ReActEngine Think [2-5s LLM]
→ Tool Act [0.5-5s]
→ First token visible to user
Target latency chain:
User message → Local rule classification [<1ms]
→ ReActEngine Think (streaming) [first token in <1s]
→ Tool Act (parallel when possible)
→ First token visible to user
Requirements
| ID | Requirement | Priority |
|---|---|---|
| R1 | First token latency must be under 1 second for simple conversations (greetings, Q&A) | P0 |
| R2 | First token latency must be under 1 second for routed conversations when keyword matching succeeds | P0 |
| R3 | Routing accuracy must not degrade more than 10% compared to current LLM-based classification | P1 |
| R4 | All optimizations must be configurable with on/off switches for safe rollback | P0 |
| R5 | Parallel tool execution must preserve conversation history ordering | P1 |
| R6 | Async session writes must not lose messages on process crash | P1 |
Key Technical Decisions
KTD1: Replace LLM quick_classify with local heuristic
Decision: Replace CostAwareRouter.quick_classify() LLM call with a zero-cost local heuristic based on message length, keyword density, and tool-hint detection.
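Rationale: The LLM classification adds 1–2s latency for a binary decision (simple vs complex). A local heuristic using the same signals already present in the message content (length, presence of tool-related keywords, question marks, etc.) is expected to reach roughly 85% accuracy at zero latency cost. This figure is an estimate: the baseline accuracy of the current LLM classifier has not yet been measured (see Open Questions).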
Alternative considered: Cache LLM classification results. Rejected because cache hit rate would be near-zero for conversational messages (each is unique).
KTD2: Merge quick_classify and intent classification into single LLM call
Decision: When LLM routing is needed (heuristic uncertainty), combine complexity scoring and intent classification into a single LLM call instead of two serial calls.
Rationale: Currently quick_classify() and _classify_with_llm() are separate LLM calls that could be merged into one prompt returning both complexity score and matched skill. This halves the routing LLM overhead when it cannot be avoided.
KTD3: Parallel execution of independent tool_calls
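Decision: Execute multiple tool_calls from a single LLM response in parallel using asyncio.gather(), with results appended to the conversation in the order the calls appeared in the response, each paired with its tool_call_id. asyncio.gather() returns results in input order regardless of which call finishes first, so no re-sorting is needed.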
Rationale: When LLM returns multiple tool calls (e.g., search + calculate), they are independent and can run concurrently. Results must be appended in order for the next LLM call to see them correctly.
Risk: Some tool calls may have implicit dependencies. Mitigation: the LLM generally does not return dependent calls in a single response (it waits for results before calling the next). Add a config flag react.parallel_tools: false to disable if needed.
KTD4: Fire-and-forget session writes with write-ahead buffer
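Decision: Make SessionManager.append_message() non-blocking by returning immediately after queuing the write, with a background task performing the actual I/O. Add a small in-memory buffer as a write-ahead log to prevent message loss.
Rationale: Session writes (especially save_session() for updated_at) add unnecessary blocking. The user doesn't need to wait for persistence before seeing a response. A write-ahead buffer lets pending messages survive brief store failures and stay visible to readers until they are persisted. Because the buffer lives in memory, it does not survive a process crash on its own, so meeting R6 also depends on flushing at shutdown or on making the buffer durable.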
KTD5: Unified httpx connection pool configuration
Decision: Configure explicit httpx.Limits on all LLM provider clients with sensible defaults for keepalive and connection pooling.
Rationale: Default httpx settings are reasonable but not optimized for high-frequency LLM API calls. Explicit configuration ensures consistent behavior across providers and enables tuning.
Scope Boundaries
In Scope
- Routing layer optimization (CostAwareRouter, IntentRouter)
- ReActEngine parallel tool execution
- Session I/O async optimization
- httpx connection pool tuning
- Configuration flags for all changes
- Test coverage for new behavior
Out of Scope
- Frontend rendering optimization (separate concern)
- LLM provider response time optimization (external dependency)
- Memory/RAG pipeline optimization (covered by existing plan 009)
- Compression strategy changes (covered by existing plan 013)
- New LLM provider implementations
Deferred to Follow-Up Work
- A/B testing framework for routing accuracy measurement
- Performance benchmarking CI pipeline
- WebSocket chat flow test coverage
- WenxinProvider token-refresh client reuse
Implementation Units
U1. Local heuristic classifier for CostAwareRouter
Goal: Replace the LLM-based quick_classify() with a zero-cost local heuristic, gated by config flag router.classifier: heuristic | llm.
Requirements: R1, R2, R3, R4
Dependencies: None
Files:
- src/agentkit/chat/skill_routing.py: add HeuristicClassifier class, modify CostAwareRouter.route()
- src/agentkit/server/config.py: add router config section
- agentkit.yaml: add router section with defaults
- tests/unit/test_cost_aware_router.py: add heuristic classifier tests
Approach:
1. Create a HeuristicClassifier class with a classify(content: str) -> float method that returns a complexity score (0.0–1.0) based on:
   - Message length: short messages (<20 chars) → low complexity
   - Question patterns: presence of "为什么", "如何", "怎么", "how", "why", "what" → moderate complexity
   - Tool hints: presence of tool-related keywords (existing _tokenize_content + tool_hints list already in code) → high complexity
   - Multi-sentence: messages with multiple sentences → higher complexity
   - Code patterns: presence of code-like patterns (backticks, brackets) → higher complexity
2. Modify CostAwareRouter.__init__ to accept a classifier_mode parameter ("heuristic" or "llm")
3. Modify CostAwareRouter.route() Phase 1 to use HeuristicClassifier.classify() when mode is "heuristic"
4. Add router config section to ServerConfig with classifier field (default: "heuristic")
5. Wire config through create_app() to CostAwareRouter
Patterns to follow: Existing CostAwareRouter._match_layer0() rule-based pattern; existing _tokenize_content() for keyword extraction.
Test scenarios:
- Short greeting → complexity < 0.3
- Single question with "如何" → complexity 0.3–0.7
- Multi-step request with tool keywords → complexity > 0.7
- Code-related request → complexity > 0.7
- Empty string → complexity 0.0
- Very long message (>500 chars) → complexity > 0.5
- Config flag classifier: llm falls back to LLM classification
- Config flag classifier: heuristic uses local heuristic
Verification: All existing test_cost_aware_router.py tests pass; new heuristic tests pass; manual test shows first-token latency <1s for simple messages.
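The approach and test scenarios for U1 can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the keyword tuples, the regular expressions, and every weight are assumptions chosen only so that the listed scenarios hold, and the real code would reuse the existing _tokenize_content and tool_hints rather than the hypothetical TOOL_HINTS below.

```python
import re

# Hypothetical keyword lists; the real tool_hints list lives in the codebase.
QUESTION_WORDS = ("为什么", "如何", "怎么", "how", "why", "what")
TOOL_HINTS = ("search", "calculate", "download", "搜索", "计算")
CODE_PATTERN = re.compile(r"`|[{}\[\]]|\bdef\b|\bclass\b")
SENTENCE_END = re.compile(r"[.!?。！？]+")


class HeuristicClassifier:
    """Scores message complexity in [0.0, 1.0] without calling an LLM."""

    def classify(self, content: str) -> float:
        text = content.strip()
        if not text:
            return 0.0
        lowered = text.lower()
        score = 0.0
        # Length: messages under 20 chars add nothing, very long ones start above 0.5.
        if len(text) > 500:
            score += 0.6
        elif len(text) >= 20:
            score += 0.2
        # Question words suggest moderate complexity (plain substring match).
        if any(word in lowered for word in QUESTION_WORDS):
            score += 0.3
        # Tool hints and code-like patterns each push the score above 0.7.
        if any(hint in lowered for hint in TOOL_HINTS):
            score += 0.75
        if CODE_PATTERN.search(text):
            score += 0.75
        # More than one sentence usually signals a multi-step request.
        sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
        if len(sentences) > 1:
            score += 0.2
        return min(score, 1.0)
```

Substring matching keeps the sketch short but will produce false positives in English (for example "how" inside "show"), so a production version would likely match on tokens instead.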
U2. Merged routing LLM call
Goal: When LLM routing is needed (heuristic uncertain or config forces LLM), combine complexity scoring and intent classification into a single LLM call.
Requirements: R2, R3, R4
Dependencies: U1
Files:
- src/agentkit/chat/skill_routing.py: add MergedRouter method
- src/agentkit/router/intent.py: add route_with_complexity() method
- tests/unit/test_cost_aware_router.py: add merged routing tests
- tests/unit/test_intent_router.py: add merged routing tests
Approach:
1. Add IntentRouter.route_with_complexity() method that returns both a RoutingResult and a complexity score in a single LLM call. The prompt asks the LLM to return {"skill": "...", "confidence": 0.9, "complexity": 0.5}.
2. Modify CostAwareRouter.route() so that when classifier is "llm", it calls route_with_complexity() instead of making two separate calls.
3. When classifier is "heuristic" and the heuristic returns uncertainty (score in 0.3–0.7 range), use route_with_complexity() as a single fallback call.
Patterns to follow: Existing _classify_with_llm() prompt structure; existing quick_classify() prompt structure.
Test scenarios:
- Merged call returns both skill match and complexity score
- Merged call with no matching skill returns complexity only
- Merged call with invalid LLM response falls back to rule-based evaluation
- Heuristic uncertain + merged call produces correct routing
- Config classifier: llm uses merged call instead of two separate calls
Verification: Existing tests pass; merged routing reduces LLM calls from 2 to 1 when LLM routing is needed.
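One way the merged reply from U2 might be parsed is sketched below. The names MergedRouting, parse_merged_response, and needs_llm_routing are hypothetical and do not come from the codebase. Only the JSON shape and the 0.3–0.7 uncertainty band are taken from the approach above, and the sketch returns None on any malformed reply so the caller can apply whatever fallback the router defines.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergedRouting:
    skill: Optional[str]  # None when the LLM matched no skill
    confidence: float
    complexity: float


def parse_merged_response(raw: str) -> Optional[MergedRouting]:
    """Parse the single routing reply; return None so the caller can fall back."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            return None
        complexity = float(data["complexity"])
        confidence = float(data.get("confidence", 0.0))
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, a missing complexity field, and non-numeric values.
        return None
    if not 0.0 <= complexity <= 1.0:
        return None
    skill = data.get("skill")
    if not isinstance(skill, str) or not skill:
        skill = None
    return MergedRouting(skill, confidence, complexity)


def needs_llm_routing(score: float, low: float = 0.3, high: float = 0.7) -> bool:
    """Treat the heuristic as uncertain when its score falls inside [low, high]."""
    return low <= score <= high
```

Returning None rather than raising keeps the fallback decision in one place, the router, instead of spreading exception handling across callers.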
U3. Parallel tool execution in ReActEngine
Goal: Execute multiple independent tool_calls from a single LLM response in parallel, gated by config flag react.parallel_tools.
Requirements: R5, R4
Dependencies: None
Files:
- src/agentkit/core/react.py: modify _execute_loop() and execute_stream() to use asyncio.gather()
- src/agentkit/server/config.py: add react.parallel_tools config
- agentkit.yaml: add react section
- tests/unit/test_react_engine.py: add parallel execution tests
Approach:
1. Add parallel_tools: bool = True parameter to ReActEngine.__init__.
2. In _execute_loop() and execute_stream(), when response.tool_calls has >1 items and parallel_tools is True:
   - Execute all tool calls concurrently with asyncio.gather(*[_execute_tool(tc.name, tc.arguments, tools) for tc in response.tool_calls], return_exceptions=True)
   - Build tool result messages in tool_call_id order
   - Append all results to conversation in order
3. When parallel_tools is False, keep current serial behavior.
4. For execute_stream(), yield all tool_call events first, then all tool_result events after gather completes.
Patterns to follow: Existing _execute_tool() method; existing Orchestrator._execute_plan() parallel group pattern in orchestrator.py.
Test scenarios:
- Two independent tools execute in parallel, both results present in conversation
- Parallel execution preserves tool_call_id ordering in conversation
- One tool fails, other succeeds — partial results preserved
- parallel_tools: false falls back to serial execution
- Single tool_call works identically with parallel mode on/off
- Tool results appended to conversation in correct order for next LLM call
Verification: Existing ReAct tests pass; new parallel tests pass; manual test with multi-tool request shows reduced execution time.
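The parallel path in U3 could look roughly like the sketch below. ToolCall and run_tool_calls are hypothetical stand-ins rather than the engine's real types, and the execute_tool callback is simplified to two arguments where the plan's _execute_tool takes three. The point is only to show that asyncio.gather() with return_exceptions=True yields results in call order and keeps partial results when one tool fails.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, List


@dataclass
class ToolCall:  # hypothetical stand-in for the engine's tool-call type
    id: str
    name: str
    arguments: dict


async def run_tool_calls(
    tool_calls: List[ToolCall],
    execute_tool: Callable[[str, dict], Awaitable[Any]],
    parallel_tools: bool = True,
) -> List[dict]:
    """Return one tool message per call, in the order the LLM emitted the calls."""
    if parallel_tools and len(tool_calls) > 1:
        # gather() returns results in input order, regardless of completion order.
        results = await asyncio.gather(
            *(execute_tool(tc.name, tc.arguments) for tc in tool_calls),
            return_exceptions=True,
        )
    else:
        results = []
        for tc in tool_calls:
            try:
                results.append(await execute_tool(tc.name, tc.arguments))
            except Exception as exc:  # mirror return_exceptions=True
                results.append(exc)
    messages = []
    for tc, result in zip(tool_calls, results):
        if isinstance(result, BaseException):
            content = f"error: {result}"
        else:
            content = str(result)
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": content})
    return messages
```

Because both branches build messages from the same zip over tool_calls, turning parallel_tools off changes timing but not the conversation contents.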
U4. Async session writes with write-ahead buffer
Goal: Make SessionManager.append_message() non-blocking by deferring save_session() and making append_message() fire-and-forget with a small write-ahead buffer.
Requirements: R6, R4
Dependencies: None
Files:
- src/agentkit/session/manager.py: add async write queue and WAL buffer
- tests/unit/test_session_manager.py: add async write tests
Approach:
1. Add an AsyncWriteQueue to SessionManager that:
   - Accepts write operations (append_message, save_session) as tasks
   - Executes them in a background asyncio.Task
   - Maintains a small in-memory buffer of recent writes for crash recovery
   - Provides await flush() for graceful shutdown
2. Modify append_message():
   - Keep get_session() + validation as synchronous (needed for error checking)
   - Queue store.append_message() + store.save_session() as a single async task
   - Return the Message object immediately without waiting for persistence
3. Modify get_chat_messages() to first check the WAL buffer for uncommitted messages, then fall back to store.
4. Add flush() method called during session close and app shutdown.
Patterns to follow: Existing BackgroundRunner pattern in server/runner.py; existing TaskStore cleanup pattern.
Test scenarios:
- append_message returns immediately, message persisted asynchronously
- get_chat_messages includes WAL-buffered messages not yet persisted
- flush() ensures all pending writes complete
- Multiple rapid append_messages are batched correctly
- Session close flushes pending writes
- App shutdown flushes pending writes
Verification: Existing session tests pass; new async write tests pass; no message loss during normal operation.
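A minimal sketch of the write queue described in U4 follows. The method names submit and pending are assumptions, messages are plain dicts for brevity, and the write callback stands in for store.append_message() plus store.save_session(). The buffer here is in-memory only, so it covers transient store failures and read-your-writes but not a process crash.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class AsyncWriteQueue:
    """Runs queued writes in FIFO order on one background task (illustrative only)."""

    def __init__(self) -> None:
        # Construct inside a running event loop so the queue binds to it.
        self._queue: asyncio.Queue = asyncio.Queue()
        self._pending: List[dict] = []  # messages accepted but not yet persisted
        self._worker: Optional[asyncio.Task] = None

    def submit(self, message: dict, write: Callable[[dict], Awaitable[None]]) -> None:
        """Queue a write and return immediately; call from the event loop."""
        self._pending.append(message)
        self._queue.put_nowait((message, write))
        if self._worker is None or self._worker.done():
            self._worker = asyncio.create_task(self._drain())

    async def _drain(self) -> None:
        while not self._queue.empty():
            message, write = self._queue.get_nowait()
            try:
                await write(message)
            except Exception:
                # Keep the message buffered; a real implementation would retry or log.
                pass
            else:
                self._pending.remove(message)
            finally:
                self._queue.task_done()

    def pending(self) -> List[dict]:
        """Unpersisted messages, for readers such as get_chat_messages()."""
        return list(self._pending)

    async def flush(self) -> None:
        """Wait until every queued write has been attempted."""
        await self._queue.join()
```

A single worker task keeps writes ordered per queue; batching several messages into one store call would be a further optimization on top of this shape.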
U5. httpx connection pool configuration
Goal: Configure explicit httpx.Limits on all LLM provider clients for optimal connection reuse.
Requirements: R4
Dependencies: None
Files:
- src/agentkit/llm/providers/openai.py: add httpx.Limits configuration
- src/agentkit/llm/providers/anthropic.py: add httpx.Limits configuration
- src/agentkit/llm/providers/gemini.py: add httpx.Limits configuration
- src/agentkit/llm/config.py: add connection pool config fields
- tests/unit/test_llm_provider.py: verify connection pool settings
Approach:
1. Add connection_pool section to ProviderConfig:
   - max_connections: int = 100
   - max_keepalive_connections: int = 20
   - keepalive_expiry: float = 30.0
2. Pass httpx.Limits to all provider constructors.
3. Configure httpx.AsyncClient with explicit limits in each provider.
Patterns to follow: Existing ProviderConfig dataclass pattern; existing timeout parameter pattern.
Test scenarios:
- Provider creates httpx client with configured limits
- Default limits applied when not configured
- Custom limits from config override defaults
- Connection reuse verified via mock
Verification: Existing provider tests pass; connection pool settings applied correctly.
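The pool settings in U5 map directly onto the keyword arguments of httpx.Limits, as the sketch below tries to show. ConnectionPoolConfig and build_client are hypothetical names, the 60-second timeout is a placeholder, and httpx is imported lazily only so the config class can be loaded and tested without the dependency installed.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConnectionPoolConfig:
    """Field names match httpx.Limits keyword arguments one-to-one."""

    max_connections: int = 100
    max_keepalive_connections: int = 20
    keepalive_expiry: float = 30.0  # seconds an idle connection is kept open


def build_client(pool: ConnectionPoolConfig, timeout: float = 60.0):
    """Create an AsyncClient with explicit limits (timeout value is a placeholder)."""
    import httpx  # third-party; imported here so the config stays dependency-free

    limits = httpx.Limits(**asdict(pool))
    return httpx.AsyncClient(limits=limits, timeout=timeout)
```

For comparison, httpx's built-in defaults are 100 connections, 20 keepalive connections, and a 5-second keepalive expiry, so of the proposed values only keepalive_expiry actually changes behavior.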
U6. Chat route pipeline optimization
Goal: Optimize the WebSocket chat handler to overlap I/O operations and reduce serial waits.
Requirements: R1, R2
Dependencies: U1, U4
Files:
- src/agentkit/server/routes/chat.py: parallelize session operations
- tests/unit/test_chat_routes.py: add pipeline optimization tests
Approach:
1. In _handle_chat_message(), parallelize:
   - sm.append_message() (user message) and sm.get_chat_messages(): these can run concurrently since append_message now returns immediately (U4)
2. Move assistant message append_message() to fire-and-forget after streaming completes (already non-blocking with U4).
3. Reuse ReActEngine instance per session instead of creating new one per message.
Patterns to follow: Existing asyncio.gather pattern in orchestrator.
Test scenarios:
- User message append and chat messages retrieval run concurrently
- Assistant message persisted after streaming completes
- ReActEngine reuse across messages in same session
- Error during parallel operations handled gracefully
Verification: Existing chat route tests pass; manual test shows reduced latency.
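Per-session engine reuse from U6 could be handled by a small cache like the one sketched here. SessionEngineCache is a hypothetical helper, the factory stands in for however ReActEngine is actually constructed, and the sketch assumes the engine keeps no per-message state that would leak between turns.

```python
from typing import Callable, Dict, Generic, TypeVar

E = TypeVar("E")


class SessionEngineCache(Generic[E]):
    """Creates one engine per session on first use and reuses it afterwards."""

    def __init__(self, factory: Callable[[], E]) -> None:
        self._factory = factory
        self._engines: Dict[str, E] = {}

    def get(self, session_id: str) -> E:
        engine = self._engines.get(session_id)
        if engine is None:
            engine = self._engines[session_id] = self._factory()
        return engine

    def release(self, session_id: str) -> None:
        """Drop the engine when the session closes so the cache cannot grow unbounded."""
        self._engines.pop(session_id, None)
```

Calling release() from the same session-close path that flushes pending writes would keep the two lifecycles aligned.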
Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| Heuristic classifier misroutes requests | Medium | Medium — wrong skill or wrong execution mode | Config flag to revert to LLM; monitor routing accuracy via telemetry |
| Parallel tool execution breaks implicit dependencies | Low | High — incorrect results | Config flag to disable; LLM rarely returns dependent calls in single response |
| Async session writes lose messages on crash | Low | Medium — missing conversation history | WAL buffer + flush on shutdown; acceptable trade-off for speed |
| Merged LLM call prompt confuses the model | Low | Low — falls back to separate calls | Fallback to separate calls on parse failure |
System-Wide Impact
- Routing layer: CostAwareRouter and IntentRouter behavior changes when heuristic mode is active; existing LLM-based routing preserved as fallback
- ReAct engine: Tool execution changes from serial to parallel; conversation history ordering preserved
- Session management: Write operations become asynchronous; read operations check WAL buffer
- Configuration: New router and react config sections in agentkit.yaml
- Telemetry: Existing OpenTelemetry spans continue to work; new spans for heuristic classification
Open Questions
- What is the actual routing accuracy of the current LLM-based classifier? Need baseline measurement before comparing heuristic accuracy.
- Should the heuristic classifier be extensible (plugin pattern) or hardcoded? The plan starts with a hardcoded classifier for simplicity; it can be made extensible later.