title	status	created	plan-type	depth
feat: Chat Response Speed Optimization — Sub-1s First Token	active	2026-06-12	feat	standard

feat: Chat Response Speed Optimization — Sub-1s First Token

Summary

Optimize the fischer-agentkit conversation response pipeline to achieve sub-1-second first-token latency. The primary bottleneck is 1–2 extra LLM calls in the routing layer before the main ReAct loop. Secondary optimizations include parallel tool execution, async session I/O, and connection pool tuning. All changes are gated by configuration flags for safe rollback.

Problem Frame

Users experience 5–10 second delays before seeing any response in the chat interface. The root cause is a serial chain of LLM calls: CostAwareRouter.quick_classify() → IntentRouter._classify_with_llm() → ReActEngine LLM Think. The first two calls are routing overhead that adds 2–4 seconds (1–2 seconds each) with no user-visible value. The third call is the actual reasoning step and cannot be eliminated, but its perceived latency can be reduced via streaming.

Current worst-case latency chain:

User message → quick_classify() [1-2s LLM]
             → _classify_with_llm() [1-2s LLM]
             → ReActEngine Think [2-5s LLM]
             → Tool Act [0.5-5s]
             → First token visible to user
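
As a quick check on the numbers, the per-stage ranges in the chain above can be summed directly. The sketch below only restates those ranges; the dictionary keys are labels chosen for illustration.

```python
# Per-stage latency ranges in seconds, copied from the chain above.
STAGES = {
    "quick_classify": (1.0, 2.0),
    "_classify_with_llm": (1.0, 2.0),
    "react_think": (2.0, 5.0),
    "tool_act": (0.5, 5.0),
}


def total_range(stages):
    """Sum the lower and upper bounds of stages that run serially."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def routing_overhead(stages):
    """Latency attributable to the two routing calls only."""
    routing = {k: stages[k] for k in ("quick_classify", "_classify_with_llm")}
    return total_range(routing)


print(total_range(STAGES))       # (4.5, 14.0)
print(routing_overhead(STAGES))  # (2.0, 4.0)
```

Removing both routing calls therefore takes 2–4 seconds off the serial path before the Think step even starts.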

Target latency chain:

User message → Local rule classification [<1ms]
             → ReActEngine Think (streaming) [first token in <1s]
             → Tool Act (parallel when possible)
             → First token visible to user

Requirements

ID	Requirement	Priority
R1	First token latency must be under 1 second for simple conversations (greetings, Q&A)	P0
R2	First token latency must be under 1 second for routed conversations when keyword matching succeeds	P0
R3	Routing accuracy must not degrade more than 10% compared to current LLM-based classification	P1
R4	All optimizations must be configurable with on/off switches for safe rollback	P0
R5	Parallel tool execution must preserve conversation history ordering	P1
R6	Async session writes must not lose messages on process crash	P1

Key Technical Decisions

KTD1: Replace LLM quick_classify with local heuristic

Decision: Replace CostAwareRouter.quick_classify() LLM call with a zero-cost local heuristic based on message length, keyword density, and tool-hint detection.

Rationale: The LLM classification adds 1–2s latency for a binary decision (simple vs complex). A local heuristic using signals already present in the message content (length, tool-related keywords, question marks, etc.) is expected to reach roughly 85% accuracy at zero latency cost. This figure is an estimate: the accuracy of the current LLM-based classifier has not yet been measured (see Open Questions).

Alternative considered: Cache LLM classification results. Rejected because cache hit rate would be near-zero for conversational messages (each is unique).

KTD2: Merge quick_classify and intent classification into single LLM call

Decision: When LLM routing is needed (heuristic uncertainty), combine complexity scoring and intent classification into a single LLM call instead of two serial calls.

Rationale: Currently quick_classify() and _classify_with_llm() are separate LLM calls that could be merged into one prompt returning both complexity score and matched skill. This halves the routing LLM overhead when it cannot be avoided.

KTD3: Parallel execution of independent tool_calls

Decision: Execute multiple tool_calls from a single LLM response in parallel using asyncio.gather(), with results appended to the conversation in the order the calls appear in the response, each tagged with its tool_call_id.

Rationale: When LLM returns multiple tool calls (e.g., search + calculate), they are independent and can run concurrently. Results must be appended in order for the next LLM call to see them correctly.

Risk: Some tool calls may have implicit dependencies. Mitigation: the LLM generally does not return dependent calls in a single response (it waits for results before calling the next). Add a config flag react.parallel_tools: false to disable if needed.

KTD4: Fire-and-forget session writes with write-ahead buffer

Decision: Make SessionManager.append_message() non-blocking by returning immediately after queuing the write, with a background task performing the actual I/O. Add a small in-memory buffer as write-ahead log to prevent message loss.

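Rationale: Session writes (especially save_session() for updated_at) block the response path unnecessarily. The user does not need to wait for persistence before seeing a response. The buffer keeps unpersisted messages readable and retryable across brief store failures. Because it lives in memory, it is lost if the process crashes, so R6 holds only for messages already flushed to the store.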

KTD5: Unified httpx connection pool configuration

Decision: Configure explicit httpx.Limits on all LLM provider clients with sensible defaults for keepalive and connection pooling.

Rationale: Default httpx settings are reasonable but not optimized for high-frequency LLM API calls. Explicit configuration ensures consistent behavior across providers and enables tuning.

Scope Boundaries

In Scope

Routing layer optimization (CostAwareRouter, IntentRouter)
ReActEngine parallel tool execution
Session I/O async optimization
httpx connection pool tuning
Configuration flags for all changes
Test coverage for new behavior

Out of Scope

Frontend rendering optimization (separate concern)
LLM provider response time optimization (external dependency)
Memory/RAG pipeline optimization (covered by existing plan 009)
Compression strategy changes (covered by existing plan 013)
New LLM provider implementations

Deferred to Follow-Up Work

A/B testing framework for routing accuracy measurement
Performance benchmarking CI pipeline
WebSocket chat flow test coverage
WenxinProvider token-refresh client reuse

Implementation Units

U1. Local heuristic classifier for CostAwareRouter

Goal: Replace the LLM-based quick_classify() with a zero-cost local heuristic, gated by config flag router.classifier: heuristic | llm.

Requirements: R1, R2, R3, R4

Dependencies: None

Files:

src/agentkit/chat/skill_routing.py — add HeuristicClassifier class, modify CostAwareRouter.route()
src/agentkit/server/config.py — add router config section
agentkit.yaml — add router section with defaults
tests/unit/test_cost_aware_router.py — add heuristic classifier tests

Approach:

Create HeuristicClassifier class with a classify(content: str) -> float method that returns a complexity score (0.0–1.0) based on:
- Message length: short messages (<20 chars) → low complexity
- Question patterns: presence of "为什么", "如何", "怎么", "how", "why", "what" → moderate complexity
- Tool hints: presence of tool-related keywords (existing _tokenize_content + tool_hints list already in code) → high complexity
- Multi-sentence: messages with multiple sentences → higher complexity
- Code patterns: presence of code-like patterns (backticks, brackets) → higher complexity
Modify CostAwareRouter.__init__ to accept a classifier_mode parameter ("heuristic" or "llm")
Modify CostAwareRouter.route() Phase 1 to use HeuristicClassifier.classify() when mode is "heuristic"
Add router config section to ServerConfig with classifier field (default: "heuristic")
Wire config through create_app() to CostAwareRouter
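
The scoring signals listed above could be combined roughly as follows. This is a minimal sketch, not the planned implementation: the weights, the DEFAULT_TOOL_HINTS values, and the regular expressions are assumptions chosen so that the test scenarios below hold, and the real tool hints would come from the existing _tokenize_content and tool_hints code.

```python
import re
from dataclasses import dataclass
from typing import Tuple

# Question markers from the plan; English words are matched on word boundaries
# so that "show" does not count as "how".
QUESTION_PATTERN = re.compile(r"为什么|如何|怎么|\b(?:how|why|what)\b", re.IGNORECASE)
CODE_PATTERN = re.compile(r"[`{}\[\]]")
SENTENCE_END = re.compile(r"[.!?。！？]+")

# Hypothetical hints for illustration only.
DEFAULT_TOOL_HINTS: Tuple[str, ...] = ("search", "calculate", "搜索", "计算")


@dataclass
class HeuristicClassifier:
    tool_hints: Tuple[str, ...] = DEFAULT_TOOL_HINTS

    def classify(self, content: str) -> float:
        """Return a complexity score in [0.0, 1.0] without calling an LLM."""
        text = content.strip()
        if not text:
            return 0.0
        lowered = text.lower()
        score = 0.1  # baseline for any non-empty message (assumed weight)
        if len(text) >= 20:
            score += 0.1
        if len(text) > 500:
            score += 0.5
        if QUESTION_PATTERN.search(text):
            score += 0.3
        # Plain substring match: crude, so "research" also matches "search".
        if any(hint in lowered for hint in self.tool_hints):
            score += 0.6
        sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
        if len(sentences) > 1:
            score += 0.1
        if CODE_PATTERN.search(text):
            score += 0.7
        return min(score, 1.0)
```

With these assumed weights, a greeting such as "你好" scores below 0.3, a single "如何" question lands in the 0.3–0.7 band, and tool-related or code-like messages score above 0.7.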

Patterns to follow: Existing CostAwareRouter._match_layer0() rule-based pattern; existing _tokenize_content() for keyword extraction.

Test scenarios:

Short greeting → complexity < 0.3
Single question with "如何" → complexity 0.3–0.7
Multi-step request with tool keywords → complexity > 0.7
Code-related request → complexity > 0.7
Empty string → complexity 0.0
Very long message (>500 chars) → complexity > 0.5
Config flag classifier: llm falls back to LLM classification
Config flag classifier: heuristic uses local heuristic

Verification: All existing test_cost_aware_router.py tests pass; new heuristic tests pass; manual test shows first-token latency <1s for simple messages.

U2. Merged routing LLM call

Goal: When LLM routing is needed (heuristic uncertain or config forces LLM), combine complexity scoring and intent classification into a single LLM call.

Requirements: R2, R3, R4

Dependencies: U1

Files:

src/agentkit/chat/skill_routing.py — add MergedRouter method
src/agentkit/router/intent.py — add route_with_complexity() method
tests/unit/test_cost_aware_router.py — add merged routing tests
tests/unit/test_intent_router.py — add merged routing tests

Approach:

Add IntentRouter.route_with_complexity() method that returns both a RoutingResult and a complexity score in a single LLM call. The prompt asks the LLM to return {"skill": "...", "confidence": 0.9, "complexity": 0.5}.
Modify CostAwareRouter.route() so that when classifier is "llm", it calls route_with_complexity() instead of making two separate calls.
When classifier is "heuristic" and the heuristic returns uncertainty (score in 0.3–0.7 range), use route_with_complexity() as a single fallback call.
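
One way to handle the merged reply is sketched below, assuming the LLM returns the JSON shape described above. MergedRouting, parse_merged_response, and heuristic_is_uncertain are hypothetical names, not existing agentkit APIs. Returning None on a malformed reply leaves the fallback decision to the caller.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergedRouting:
    skill: Optional[str]  # None when no skill matched
    confidence: float
    complexity: float


def heuristic_is_uncertain(score: float, low: float = 0.3, high: float = 0.7) -> bool:
    """True when the heuristic score falls in the band that triggers the LLM fallback."""
    return low <= score <= high


def parse_merged_response(raw: str) -> Optional[MergedRouting]:
    """Parse the single routing reply; return None so the caller can fall back."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            return None
        complexity = float(data["complexity"])
        confidence = float(data.get("confidence", 0.0))
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, a missing complexity field, and non-numeric values.
        return None
    if not (0.0 <= complexity <= 1.0 and 0.0 <= confidence <= 1.0):
        return None
    skill = data.get("skill")
    if not isinstance(skill, str) or not skill:
        skill = None  # complexity-only result
    return MergedRouting(skill=skill, confidence=confidence, complexity=complexity)
```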

Patterns to follow: Existing _classify_with_llm() prompt structure; existing quick_classify() prompt structure.

Test scenarios:

Merged call returns both skill match and complexity score
Merged call with no matching skill returns complexity only
Merged call with invalid LLM response falls back to rule-based evaluation
Heuristic uncertain + merged call produces correct routing
Config classifier: llm uses merged call instead of two separate calls

Verification: Existing tests pass; merged routing reduces LLM calls from 2 to 1 when LLM routing is needed.

U3. Parallel tool execution in ReActEngine

Goal: Execute multiple independent tool_calls from a single LLM response in parallel, gated by config flag react.parallel_tools.

Requirements: R5, R4

Dependencies: None

Files:

src/agentkit/core/react.py — modify _execute_loop() and execute_stream() to use asyncio.gather()
src/agentkit/server/config.py — add react.parallel_tools config
agentkit.yaml — add react section
tests/unit/test_react_engine.py — add parallel execution tests

Approach:

Add parallel_tools: bool = True parameter to ReActEngine.__init__.
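In _execute_loop() and execute_stream(), when response.tool_calls has more than one item and parallel_tools is True:
- Execute all tool calls concurrently with asyncio.gather(*[_execute_tool(tc.name, tc.arguments, tools) for tc in response.tool_calls], return_exceptions=True)
- Build tool result messages in the order of response.tool_calls (asyncio.gather returns results in input order, regardless of completion order), pairing each result with its tool_call_id
- Convert any exception returned by gather into an error result for that tool call, so partial results are preserved
- Append all results to conversation in order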
When parallel_tools is False, keep current serial behavior.
For execute_stream(), yield all tool_call events first, then all tool_result events after gather completes.
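
The parallel path can be illustrated with a small standalone sketch. ToolCall, run_tool_calls, and the message dictionary shape are assumptions for illustration; the real engine would call its existing _execute_tool() instead of looking up a plain dict of coroutines.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List


@dataclass
class ToolCall:
    id: str
    name: str
    arguments: Dict[str, Any]


ToolFn = Callable[..., Awaitable[Any]]


async def run_tool_calls(
    tool_calls: List[ToolCall],
    tools: Dict[str, ToolFn],
    parallel: bool = True,
) -> List[dict]:
    """Run tool calls and return result messages in the original call order."""

    async def run_one(tc: ToolCall) -> Any:
        return await tools[tc.name](**tc.arguments)

    if parallel and len(tool_calls) > 1:
        # gather returns results in input order, whatever the completion order.
        results = await asyncio.gather(
            *(run_one(tc) for tc in tool_calls), return_exceptions=True
        )
    else:
        results = []
        for tc in tool_calls:
            try:
                results.append(await run_one(tc))
            except Exception as exc:  # mirror return_exceptions=True
                results.append(exc)

    messages = []
    for tc, result in zip(tool_calls, results):
        if isinstance(result, BaseException):
            content = f"error: {result}"
        else:
            content = str(result)
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": content})
    return messages
```

A failing tool produces an error message in its own slot, so the other results still reach the next LLM call in the expected positions.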

Patterns to follow: Existing _execute_tool() method; existing Orchestrator._execute_plan() parallel group pattern in orchestrator.py.

Test scenarios:

Two independent tools execute in parallel, both results present in conversation
Parallel execution preserves tool_call_id ordering in conversation
One tool fails, other succeeds — partial results preserved
parallel_tools: false falls back to serial execution
Single tool_call works identically with parallel mode on/off
Tool results appended to conversation in correct order for next LLM call

Verification: Existing ReAct tests pass; new parallel tests pass; manual test with multi-tool request shows reduced execution time.

U4. Async session writes with write-ahead buffer

Goal: Make SessionManager.append_message() non-blocking by deferring save_session() and making append_message() fire-and-forget with a small write-ahead buffer.

Requirements: R6, R4

Dependencies: None

Files:

src/agentkit/session/manager.py — add async write queue and WAL buffer
tests/unit/test_session_manager.py — add async write tests

Approach:

Add an AsyncWriteQueue to SessionManager that:
- Accepts write operations (append_message, save_session) as tasks
- Executes them in a background asyncio.Task
- Maintains a small in-memory buffer of recent writes for crash recovery
- Provides await flush() for graceful shutdown
Modify append_message():
- Keep get_session() + validation as synchronous (needed for error checking)
- Queue store.append_message() + store.save_session() as a single async task
- Return the Message object immediately without waiting for persistence
Modify get_chat_messages() to first check the WAL buffer for uncommitted messages, then fall back to store.
Add flush() method called during session close and app shutdown.
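
A minimal sketch of such a queue follows, under stated assumptions: the persist callback stands in for store.append_message() plus store.save_session(), messages are plain dicts, and enqueue() is called from within a running event loop. The buffer here is purely in memory, so it keeps unpersisted messages readable but does not survive a process crash.

```python
import asyncio
from typing import Awaitable, Callable, List, Optional


class AsyncWriteQueue:
    """Queue writes for a background task and keep unpersisted items readable."""

    def __init__(self, persist: Callable[[dict], Awaitable[None]]) -> None:
        self._persist = persist
        self._queue: asyncio.Queue = asyncio.Queue()
        self._pending: List[dict] = []  # in-memory buffer of unpersisted messages
        self._worker: Optional[asyncio.Task] = None

    def enqueue(self, message: dict) -> None:
        """Return immediately; the background worker performs the I/O."""
        self._pending.append(message)
        self._queue.put_nowait(message)
        if self._worker is None or self._worker.done():
            self._worker = asyncio.create_task(self._drain())

    async def _drain(self) -> None:
        while True:
            message = await self._queue.get()
            try:
                await self._persist(message)
            except Exception:
                # Keep the message in the buffer so it stays visible to readers.
                pass
            else:
                self._pending.remove(message)
            finally:
                self._queue.task_done()

    def pending(self) -> List[dict]:
        """Messages queued but not yet persisted (for get_chat_messages to merge)."""
        return list(self._pending)

    async def flush(self) -> None:
        """Wait until every queued write has been attempted."""
        await self._queue.join()

    async def close(self) -> None:
        """Flush, then stop the background worker."""
        await self.flush()
        if self._worker is not None:
            self._worker.cancel()
            await asyncio.gather(self._worker, return_exceptions=True)
```

A single worker draining a FIFO queue also keeps writes in submission order, which matters for conversation history.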

Patterns to follow: Existing BackgroundRunner pattern in server/runner.py; existing TaskStore cleanup pattern.

Test scenarios:

append_message returns immediately, message persisted asynchronously
get_chat_messages includes WAL-buffered messages not yet persisted
flush() ensures all pending writes complete
Multiple rapid append_messages are batched correctly
Session close flushes pending writes
App shutdown flushes pending writes

Verification: Existing session tests pass; new async write tests pass; no message loss during normal operation.

U5. httpx connection pool configuration

Goal: Configure explicit httpx.Limits on all LLM provider clients for optimal connection reuse.

Requirements: R4

Dependencies: None

Files:

src/agentkit/llm/providers/openai.py — add httpx.Limits configuration
src/agentkit/llm/providers/anthropic.py — add httpx.Limits configuration
src/agentkit/llm/providers/gemini.py — add httpx.Limits configuration
src/agentkit/llm/config.py — add connection pool config fields
tests/unit/test_llm_provider.py — verify connection pool settings

Approach:

Add connection_pool section to ProviderConfig:
- max_connections: int = 100
- max_keepalive_connections: int = 20
- keepalive_expiry: float = 30.0
Pass httpx.Limits to all provider constructors.
Configure httpx.AsyncClient with explicit limits in each provider.
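
The pool settings above might be wired into a client as sketched here. ConnectionPoolConfig and make_async_client are hypothetical names, and the 60-second timeout is an assumed placeholder. httpx's documented defaults at the time of writing are 100 connections, 20 keepalive connections, and a 5-second keepalive expiry, so of the proposed values only keepalive_expiry changes behavior; the rest make the defaults explicit and tunable.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConnectionPoolConfig:
    max_connections: int = 100
    max_keepalive_connections: int = 20
    keepalive_expiry: float = 30.0  # seconds an idle connection is kept open


def make_async_client(pool: ConnectionPoolConfig, timeout: float = 60.0):
    """Build an httpx.AsyncClient with explicit pool limits."""
    # Third-party dependency, imported here so the config class stays importable
    # without httpx installed.
    import httpx

    limits = httpx.Limits(**asdict(pool))
    return httpx.AsyncClient(limits=limits, timeout=timeout)
```

Sharing one client per provider instance, rather than creating a client per request, is what allows the keepalive pool to be reused across calls.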

Patterns to follow: Existing ProviderConfig dataclass pattern; existing timeout parameter pattern.

Test scenarios:

Provider creates httpx client with configured limits
Default limits applied when not configured
Custom limits from config override defaults
Connection reuse verified via mock

Verification: Existing provider tests pass; connection pool settings applied correctly.

U6. Chat route pipeline optimization

Goal: Optimize the WebSocket chat handler to overlap I/O operations and reduce serial waits.

Requirements: R1, R2

Dependencies: U1, U4

Files:

src/agentkit/server/routes/chat.py — parallelize session operations
tests/unit/test_chat_routes.py — add pipeline optimization tests

Approach:

In _handle_chat_message(), parallelize:
- sm.append_message() (user message) and sm.get_chat_messages() — these can run concurrently since append_message now returns immediately (U4)
Move assistant message append_message() to fire-and-forget after streaming completes (already non-blocking with U4).
Reuse ReActEngine instance per session instead of creating new one per message.
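
Per-session engine reuse could be as simple as a keyed cache. The sketch below is generic and makes no claim about ReActEngine's constructor: PerSessionCache is a hypothetical helper, and the factory argument stands in for whatever currently builds an engine for each message.

```python
from typing import Callable, Dict, Generic, TypeVar

E = TypeVar("E")


class PerSessionCache(Generic[E]):
    """Create one object per session id and reuse it for later messages."""

    def __init__(self, factory: Callable[[str], E]) -> None:
        self._factory = factory
        self._items: Dict[str, E] = {}

    def get(self, session_id: str) -> E:
        """Return the cached object, building it on first use."""
        if session_id not in self._items:
            self._items[session_id] = self._factory(session_id)
        return self._items[session_id]

    def discard(self, session_id: str) -> None:
        """Drop the cached object, for example when the session closes."""
        self._items.pop(session_id, None)
```

Calling discard() on session close keeps the cache from growing without bound. Reuse is only safe if the cached engine holds no per-message state.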

Patterns to follow: Existing asyncio.gather pattern in orchestrator.

Test scenarios:

User message append and chat messages retrieval run concurrently
Assistant message persisted after streaming completes
ReActEngine reuse across messages in same session
Error during parallel operations handled gracefully

Verification: Existing chat route tests pass; manual test shows reduced latency.

Risks & Mitigations

Risk	Likelihood	Impact	Mitigation
Heuristic classifier misroutes requests	Medium	Medium — wrong skill or wrong execution mode	Config flag to revert to LLM; monitor routing accuracy via telemetry
Parallel tool execution breaks implicit dependencies	Low	High — incorrect results	Config flag to disable; LLM rarely returns dependent calls in single response
Async session writes lose messages on crash	Low	Medium — missing conversation history	WAL buffer + flush on shutdown; acceptable trade-off for speed
Merged LLM call prompt confuses the model	Low	Low — falls back to separate calls	Fallback to separate calls on parse failure

System-Wide Impact

Routing layer: CostAwareRouter and IntentRouter behavior changes when heuristic mode is active; existing LLM-based routing preserved as fallback
ReAct engine: Tool execution changes from serial to parallel; conversation history ordering preserved
Session management: Write operations become asynchronous; read operations check WAL buffer
Configuration: New router and react config sections in agentkit.yaml
Telemetry: Existing OpenTelemetry spans continue to work; new spans for heuristic classification
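
The new router and react configuration sections could map onto typed config objects along these lines. This is a sketch under assumptions: RouterConfig, ReactConfig, SpeedConfig, and load_speed_config are illustrative names, and the input is taken to be the dictionary obtained from parsing agentkit.yaml.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class RouterConfig:
    classifier: str = "heuristic"  # "heuristic" or "llm"

    def __post_init__(self) -> None:
        if self.classifier not in ("heuristic", "llm"):
            raise ValueError(f"unknown router.classifier: {self.classifier!r}")


@dataclass
class ReactConfig:
    parallel_tools: bool = True


@dataclass
class SpeedConfig:
    router: RouterConfig = field(default_factory=RouterConfig)
    react: ReactConfig = field(default_factory=ReactConfig)


def load_speed_config(raw: Dict[str, Any]) -> SpeedConfig:
    """Build the config from a parsed YAML mapping; missing sections use defaults."""
    return SpeedConfig(
        router=RouterConfig(**(raw.get("router") or {})),
        react=ReactConfig(**(raw.get("react") or {})),
    )
```

Setting classifier to "llm" and parallel_tools to false restores the previous behavior, which is the rollback path R4 asks for.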

Open Questions

What is the actual routing accuracy of the current LLM-based classifier? Need baseline measurement before comparing heuristic accuracy.
Should the heuristic classifier be extensible (plugin pattern) or hardcoded? Starting with hardcoded for simplicity, can extend later.
