| title | status | created | plan-type | depth |
| --- | --- | --- | --- | --- |
| feat: Chat Response Speed Optimization — Sub-1s First Token | active | 2026-06-12 | feat | standard |

feat: Chat Response Speed Optimization — Sub-1s First Token

Summary

Optimize the fischer-agentkit conversation response pipeline to achieve sub-1-second first-token latency. The primary bottleneck is 1-2 extra LLM calls in the routing layer before the main ReAct loop. Secondary optimizations include parallel tool execution, async session I/O, and connection pool tuning. All changes are gated by configuration flags for safe rollback.

Problem Frame

Users experience 5-10 second delays before seeing any response in the chat interface. The root cause is a serial chain of LLM calls: CostAwareRouter.quick_classify() → IntentRouter._classify_with_llm() → ReActEngine LLM Think. The first two calls are routing overhead that adds 2-6 seconds with no user-visible value. The third call is the actual reasoning step and cannot be eliminated, but its perceived latency can be reduced via streaming.

Current worst-case latency chain:

User message → quick_classify() [1-2s LLM]
             → _classify_with_llm() [1-2s LLM]
             → ReActEngine Think [2-5s LLM]
             → Tool Act [0.5-5s]
             → First token visible to user

Target latency chain:

User message → Local rule classification [<1ms]
             → ReActEngine Think (streaming) [first token in <1s]
             → Tool Act (parallel when possible)
             → First token visible to user

Requirements

| ID | Requirement | Priority |
| --- | --- | --- |
| R1 | First token latency must be under 1 second for simple conversations (greetings, Q&A) | P0 |
| R2 | First token latency must be under 1 second for routed conversations when keyword matching succeeds | P0 |
| R3 | Routing accuracy must not degrade more than 10% compared to current LLM-based classification | P1 |
| R4 | All optimizations must be configurable with on/off switches for safe rollback | P0 |
| R5 | Parallel tool execution must preserve conversation history ordering | P1 |
| R6 | Async session writes must not lose messages on process crash | P1 |

Key Technical Decisions

KTD1: Replace LLM quick_classify with local heuristic

Decision: Replace CostAwareRouter.quick_classify() LLM call with a zero-cost local heuristic based on message length, keyword density, and tool-hint detection.

Rationale: The LLM classification adds 1-2s latency for a binary decision (simple vs complex). A local heuristic using the same signals already present in the message content (length, presence of tool-related keywords, question marks, etc.) can achieve ~85% accuracy at zero latency cost.

Alternative considered: Cache LLM classification results. Rejected because cache hit rate would be near-zero for conversational messages (each is unique).

KTD2: Merge quick_classify and intent classification into single LLM call

Decision: When LLM routing is needed (heuristic uncertainty), combine complexity scoring and intent classification into a single LLM call instead of two serial calls.

Rationale: Currently quick_classify() and _classify_with_llm() are separate LLM calls that could be merged into one prompt returning both complexity score and matched skill. This halves the routing LLM overhead when it cannot be avoided.

KTD3: Parallel execution of independent tool_calls

Decision: Execute multiple tool_calls from a single LLM response in parallel using asyncio.gather(), with results appended to conversation in tool_call_id order.

Rationale: When LLM returns multiple tool calls (e.g., search + calculate), they are independent and can run concurrently. Results must be appended in order for the next LLM call to see them correctly.

Risk: Some tool calls may have implicit dependencies. Mitigation: the LLM generally does not return dependent calls in a single response (it waits for results before calling the next). Add a config flag react.parallel_tools: false to disable if needed.

KTD4: Fire-and-forget session writes with write-ahead buffer

Decision: Make SessionManager.append_message() non-blocking by returning immediately after queuing the write, with a background task performing the actual I/O. Add a small in-memory buffer as write-ahead log to prevent message loss.

Rationale: Session writes (especially save_session() for updated_at) add unnecessary blocking. The user doesn't need to wait for persistence before seeing a response. A write-ahead buffer ensures messages survive brief failures.

KTD5: Unified httpx connection pool configuration

Decision: Configure explicit httpx.Limits on all LLM provider clients with sensible defaults for keepalive and connection pooling.

Rationale: Default httpx settings are reasonable but not optimized for high-frequency LLM API calls. Explicit configuration ensures consistent behavior across providers and enables tuning.

Scope Boundaries

In Scope

  • Routing layer optimization (CostAwareRouter, IntentRouter)
  • ReActEngine parallel tool execution
  • Session I/O async optimization
  • httpx connection pool tuning
  • Configuration flags for all changes
  • Test coverage for new behavior

Out of Scope

  • Frontend rendering optimization (separate concern)
  • LLM provider response time optimization (external dependency)
  • Memory/RAG pipeline optimization (covered by existing plan 009)
  • Compression strategy changes (covered by existing plan 013)
  • New LLM provider implementations

Deferred to Follow-Up Work

  • A/B testing framework for routing accuracy measurement
  • Performance benchmarking CI pipeline
  • WebSocket chat flow test coverage
  • WenxinProvider token-refresh client reuse

Implementation Units

U1. Local heuristic classifier for CostAwareRouter

Goal: Replace the LLM-based quick_classify() with a zero-cost local heuristic, gated by config flag router.classifier: heuristic | llm.

Requirements: R1, R2, R3, R4

Dependencies: None

Files:

  • src/agentkit/chat/skill_routing.py — add HeuristicClassifier class, modify CostAwareRouter.route()
  • src/agentkit/server/config.py — add router config section
  • agentkit.yaml — add router section with defaults
  • tests/unit/test_cost_aware_router.py — add heuristic classifier tests

Approach:

  1. Create HeuristicClassifier class with a classify(content: str) -> float method that returns a complexity score (0.0 to 1.0) based on:

    • Message length: short messages (<20 chars) → low complexity
    • Question patterns: presence of "为什么", "如何", "怎么", "how", "why", "what" → moderate complexity
    • Tool hints: presence of tool-related keywords (existing _tokenize_content + tool_hints list already in code) → high complexity
    • Multi-sentence: messages with multiple sentences → higher complexity
    • Code patterns: presence of code-like patterns (backticks, brackets) → higher complexity
  2. Modify CostAwareRouter.__init__ to accept a classifier_mode parameter ("heuristic" or "llm")

  3. Modify CostAwareRouter.route() Phase 1 to use HeuristicClassifier.classify() when mode is "heuristic"

  4. Add router config section to ServerConfig with classifier field (default: "heuristic")

  5. Wire config through create_app() to CostAwareRouter
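
The scoring rules in step 1 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the keyword lists, the regular expressions, and every weight are assumptions chosen only so that the test scenarios listed below come out in the expected bands. The real tool_hints list and _tokenize_content() helper live in skill_routing.py and are not reproduced here.

```python
import re

# Hypothetical keyword lists; the real tool_hints list is defined in skill_routing.py.
QUESTION_WORDS = ("为什么", "如何", "怎么", "how", "why", "what")
TOOL_HINTS = ("search", "calculate", "file", "run", "搜索", "计算")
CODE_PATTERN = re.compile(r"[`{}\[\]]|\(\)")
SENTENCE_END = re.compile(r"[.!?。！？]+")


class HeuristicClassifier:
    """Local complexity scorer. All weights are illustrative, not tuned."""

    def classify(self, content: str) -> float:
        text = content.strip()
        if not text:
            return 0.0
        lowered = text.lower()

        # Message length: short messages start low, very long ones start higher.
        score = 0.1 if len(text) < 20 else 0.3
        if len(text) > 500:
            score = max(score, 0.6)

        # Question patterns push the score into the moderate band.
        if any(word in lowered for word in QUESTION_WORDS):
            score = max(score, 0.4)

        # Multiple sentences suggest a multi-step request.
        sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
        if len(sentences) > 1:
            score += 0.15

        # Tool hints and code-like patterns both indicate high complexity.
        if any(hint in lowered for hint in TOOL_HINTS):
            score = max(score, 0.8)
        if CODE_PATTERN.search(text):
            score = max(score, 0.8)

        return min(score, 1.0)
```

Substring matching is used here for brevity, so a word such as "somewhat" would count as a question; a real implementation would likely match on tokens instead.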

Patterns to follow: Existing CostAwareRouter._match_layer0() rule-based pattern; existing _tokenize_content() for keyword extraction.

Test scenarios:

  • Short greeting → complexity < 0.3
  • Single question with "如何" → complexity 0.3 to 0.7
  • Multi-step request with tool keywords → complexity > 0.7
  • Code-related request → complexity > 0.7
  • Empty string → complexity 0.0
  • Very long message (>500 chars) → complexity > 0.5
  • Config flag classifier: llm falls back to LLM classification
  • Config flag classifier: heuristic uses local heuristic

Verification: All existing test_cost_aware_router.py tests pass; new heuristic tests pass; manual test shows first-token latency <1s for simple messages.


U2. Merged routing LLM call

Goal: When LLM routing is needed (heuristic uncertain or config forces LLM), combine complexity scoring and intent classification into a single LLM call.

Requirements: R2, R3, R4

Dependencies: U1

Files:

  • src/agentkit/chat/skill_routing.py — add MergedRouter method
  • src/agentkit/router/intent.py — add route_with_complexity() method
  • tests/unit/test_cost_aware_router.py — add merged routing tests
  • tests/unit/test_intent_router.py — add merged routing tests

Approach:

  1. Add IntentRouter.route_with_complexity() method that returns both a RoutingResult and a complexity score in a single LLM call. The prompt asks the LLM to return {"skill": "...", "confidence": 0.9, "complexity": 0.5}.

  2. Modify CostAwareRouter.route() so that when classifier is "llm", it calls route_with_complexity() instead of making two separate calls.

  3. When classifier is "heuristic" and the heuristic returns uncertainty (score in the 0.3 to 0.7 range), use route_with_complexity() as a single fallback call.
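
Handling the merged response might look like the sketch below. It assumes the LLM replies with the JSON shape quoted in step 1; MergedRoutingResult, parse_merged_response, and is_uncertain are hypothetical names introduced for illustration, and the existing RoutingResult type is not modeled. Returning None on any malformed reply leaves the fallback decision to the caller.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergedRoutingResult:
    skill: Optional[str]  # None when no skill matched
    confidence: float
    complexity: float


def is_uncertain(score: float) -> bool:
    """True when the heuristic score falls in the 0.3 to 0.7 band from step 3."""
    return 0.3 <= score <= 0.7


def parse_merged_response(raw: str) -> Optional[MergedRoutingResult]:
    """Parse the merged LLM reply; return None so the caller can fall back."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            return None
        complexity = float(data["complexity"])
        confidence = float(data.get("confidence", 0.0))
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        return None

    # Reject out-of-range values (this also rejects NaN).
    if not (0.0 <= complexity <= 1.0 and 0.0 <= confidence <= 1.0):
        return None

    skill = data.get("skill")
    if not isinstance(skill, str) or not skill:
        skill = None  # complexity-only result
    return MergedRoutingResult(skill, confidence, complexity)
```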

Patterns to follow: Existing _classify_with_llm() prompt structure; existing quick_classify() prompt structure.

Test scenarios:

  • Merged call returns both skill match and complexity score
  • Merged call with no matching skill returns complexity only
  • Merged call with invalid LLM response falls back to rule-based evaluation
  • Heuristic uncertain + merged call produces correct routing
  • Config classifier: llm uses merged call instead of two separate calls

Verification: Existing tests pass; merged routing reduces LLM calls from 2 to 1 when LLM routing is needed.


U3. Parallel tool execution in ReActEngine

Goal: Execute multiple independent tool_calls from a single LLM response in parallel, gated by config flag react.parallel_tools.

Requirements: R5, R4

Dependencies: None

Files:

  • src/agentkit/core/react.py — modify _execute_loop() and execute_stream() to use asyncio.gather()
  • src/agentkit/server/config.py — add react.parallel_tools config
  • agentkit.yaml — add react section
  • tests/unit/test_react_engine.py — add parallel execution tests

Approach:

  1. Add parallel_tools: bool = True parameter to ReActEngine.__init__.

  2. In _execute_loop() and execute_stream(), when response.tool_calls contains more than one item and parallel_tools is True:

    • Execute all tool calls concurrently with asyncio.gather(*[_execute_tool(tc.name, tc.arguments, tools) for tc in response.tool_calls], return_exceptions=True)
    • Build tool result messages in tool_call_id order
    • Append all results to conversation in order
  3. When parallel_tools is False, keep current serial behavior.

  4. For execute_stream(), yield all tool_call events first, then all tool_result events after gather completes.
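
A minimal sketch of the gather-then-append step, under stated assumptions: ToolCall and run_tool_calls are hypothetical stand-ins, tools are modeled as plain async callables, and the message dict shape is illustrative. asyncio.gather() returns results in the order its awaitables were passed, so pairing results with the original call list keeps the order in which the LLM emitted the calls regardless of which tool finishes first.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List


@dataclass
class ToolCall:  # hypothetical stand-in for the engine's tool call object
    id: str
    name: str
    arguments: dict


ToolFn = Callable[[dict], Awaitable[Any]]


async def run_tool_calls(
    calls: List[ToolCall], tools: Dict[str, ToolFn], parallel: bool = True
) -> List[dict]:
    async def run_one(call: ToolCall) -> Any:
        return await tools[call.name](call.arguments)

    if parallel and len(calls) > 1:
        # return_exceptions=True keeps one failing tool from discarding the others.
        results = await asyncio.gather(
            *(run_one(c) for c in calls), return_exceptions=True
        )
    else:
        # Serial path, used when parallel_tools is False or there is one call.
        results = []
        for c in calls:
            try:
                results.append(await run_one(c))
            except Exception as exc:
                results.append(exc)

    # Results line up with `calls`, so messages are built in emission order.
    messages = []
    for call, result in zip(calls, results):
        if isinstance(result, BaseException):
            content = f"error: {result}"
        else:
            content = str(result)
        messages.append({"role": "tool", "tool_call_id": call.id, "content": content})
    return messages
```

Both paths produce the same message list for the same inputs, which makes the parallel_tools flag easy to test in isolation.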

Patterns to follow: Existing _execute_tool() method; existing Orchestrator._execute_plan() parallel group pattern in orchestrator.py.

Test scenarios:

  • Two independent tools execute in parallel, both results present in conversation
  • Parallel execution preserves tool_call_id ordering in conversation
  • One tool fails, other succeeds — partial results preserved
  • parallel_tools: false falls back to serial execution
  • Single tool_call works identically with parallel mode on/off
  • Tool results appended to conversation in correct order for next LLM call

Verification: Existing ReAct tests pass; new parallel tests pass; manual test with multi-tool request shows reduced execution time.


U4. Async session writes with write-ahead buffer

Goal: Make SessionManager.append_message() non-blocking by deferring save_session() and making append_message() fire-and-forget with a small write-ahead buffer.

Requirements: R6, R4

Dependencies: None

Files:

  • src/agentkit/session/manager.py — add async write queue and WAL buffer
  • tests/unit/test_session_manager.py — add async write tests

Approach:

  1. Add an AsyncWriteQueue to SessionManager that:

    • Accepts write operations (append_message, save_session) as tasks
    • Executes them in a background asyncio.Task
    • Maintains a small in-memory buffer of recent writes for crash recovery
    • Provides await flush() for graceful shutdown
  2. Modify append_message():

    • Keep get_session() + validation as synchronous (needed for error checking)
    • Queue store.append_message() + store.save_session() as a single async task
    • Return the Message object immediately without waiting for persistence
  3. Modify get_chat_messages() to first check the WAL buffer for uncommitted messages, then fall back to store.

  4. Add flush() method called during session close and app shutdown.
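
The queue described in step 1 could be sketched as below. The class name comes from the plan, but the method names enqueue() and pending() and the write_fn callback are assumptions made for illustration. The buffer here lives only in memory, so it covers reads of not-yet-persisted messages and graceful shutdown via flush(); surviving a hard process crash would need a durable log, which this sketch does not attempt.

```python
import asyncio
from typing import Any, Awaitable, Callable, List, Optional

WriteFn = Callable[[Any], Awaitable[None]]


class AsyncWriteQueue:
    """Queues writes for a background task; must be created inside a running loop."""

    def __init__(self, write_fn: WriteFn) -> None:
        self._write_fn = write_fn
        self._queue: asyncio.Queue = asyncio.Queue()
        self._buffer: List[Any] = []  # queued but not yet persisted
        self._worker: Optional[asyncio.Task] = None

    def enqueue(self, item: Any) -> None:
        """Return immediately; the background task performs the actual I/O."""
        self._buffer.append(item)
        self._queue.put_nowait(item)
        if self._worker is None or self._worker.done():
            self._worker = asyncio.create_task(self._drain())

    async def _drain(self) -> None:
        while True:
            item = await self._queue.get()
            try:
                await self._write_fn(item)
                self._buffer.remove(item)  # persisted, drop from buffer
            except Exception:
                # Failed items stay in the buffer; a real implementation
                # would log and retry here.
                pass
            finally:
                self._queue.task_done()

    def pending(self) -> List[Any]:
        """Items readers should merge with what the store returns."""
        return list(self._buffer)

    async def flush(self) -> None:
        """Wait until every queued write has been attempted."""
        await self._queue.join()
```

get_chat_messages() would then combine the store's result with pending(), and session close and app shutdown would await flush().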

Patterns to follow: Existing BackgroundRunner pattern in server/runner.py; existing TaskStore cleanup pattern.

Test scenarios:

  • append_message returns immediately, message persisted asynchronously
  • get_chat_messages includes WAL-buffered messages not yet persisted
  • flush() ensures all pending writes complete
  • Multiple rapid append_messages are batched correctly
  • Session close flushes pending writes
  • App shutdown flushes pending writes

Verification: Existing session tests pass; new async write tests pass; no message loss during normal operation.


U5. httpx connection pool configuration

Goal: Configure explicit httpx.Limits on all LLM provider clients for optimal connection reuse.

Requirements: R4

Dependencies: None

Files:

  • src/agentkit/llm/providers/openai.py — add httpx.Limits configuration
  • src/agentkit/llm/providers/anthropic.py — add httpx.Limits configuration
  • src/agentkit/llm/providers/gemini.py — add httpx.Limits configuration
  • src/agentkit/llm/config.py — add connection pool config fields
  • tests/unit/test_llm_provider.py — verify connection pool settings

Approach:

  1. Add connection_pool section to ProviderConfig:

    • max_connections: int = 100
    • max_keepalive_connections: int = 20
    • keepalive_expiry: float = 30.0
  2. Pass httpx.Limits to all provider constructors.

  3. Configure httpx.AsyncClient with explicit limits in each provider.
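
One way the config fields could map onto httpx is sketched below. ConnectionPoolConfig and build_async_client are hypothetical names, and the 60-second timeout default is an assumption; the three field names match the keyword arguments that httpx.Limits accepts. httpx is imported inside the factory only so the config class can be loaded and tested without the dependency.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConnectionPoolConfig:
    # Defaults taken from the plan's proposed connection_pool section.
    max_connections: int = 100
    max_keepalive_connections: int = 20
    keepalive_expiry: float = 30.0

    def limits_kwargs(self) -> dict:
        """Keyword arguments suitable for httpx.Limits(**kwargs)."""
        return asdict(self)


def build_async_client(pool: ConnectionPoolConfig, timeout: float = 60.0):
    """Create an httpx.AsyncClient with explicit pool limits."""
    import httpx  # imported lazily; see the note above

    limits = httpx.Limits(**pool.limits_kwargs())
    return httpx.AsyncClient(limits=limits, timeout=timeout)
```

Each provider would call the factory once and reuse the client, since connection reuse only helps when the same client instance serves consecutive requests.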

Patterns to follow: Existing ProviderConfig dataclass pattern; existing timeout parameter pattern.

Test scenarios:

  • Provider creates httpx client with configured limits
  • Default limits applied when not configured
  • Custom limits from config override defaults
  • Connection reuse verified via mock

Verification: Existing provider tests pass; connection pool settings applied correctly.


U6. Chat route pipeline optimization

Goal: Optimize the WebSocket chat handler to overlap I/O operations and reduce serial waits.

Requirements: R1, R2

Dependencies: U1, U4

Files:

  • src/agentkit/server/routes/chat.py — parallelize session operations
  • tests/unit/test_chat_routes.py — add pipeline optimization tests

Approach:

  1. In _handle_chat_message(), parallelize:

    • sm.append_message() (user message) and sm.get_chat_messages() — these can run concurrently since append_message now returns immediately (U4)
  2. Move assistant message append_message() to fire-and-forget after streaming completes (already non-blocking with U4).

  3. Reuse ReActEngine instance per session instead of creating new one per message.
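
Steps 1 and 3 might be sketched as follows. EngineCache and load_turn_context are hypothetical helpers, and the append_message() and get_chat_messages() signatures shown are assumptions about the session manager, which is treated as exposing both as coroutines. Because the read runs concurrently with the append, it may or may not see the new message, so the sketch adds it explicitly when it is missing.

```python
import asyncio
from typing import Any, Callable, Dict, List


class EngineCache:
    """Keeps one engine per session so it is not rebuilt for every message."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._engines: Dict[str, Any] = {}

    def get(self, session_id: str) -> Any:
        if session_id not in self._engines:
            self._engines[session_id] = self._factory()
        return self._engines[session_id]

    def drop(self, session_id: str) -> None:
        """Call on session close so engines do not accumulate."""
        self._engines.pop(session_id, None)


async def load_turn_context(sm: Any, session_id: str, content: str) -> List[Any]:
    """Append the user message and fetch history concurrently."""
    appended, history = await asyncio.gather(
        sm.append_message(session_id, content),
        sm.get_chat_messages(session_id),
    )
    # The concurrent read may have missed the message that was just queued.
    if appended not in history:
        history = [*history, appended]
    return history
```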

Patterns to follow: Existing asyncio.gather pattern in orchestrator.

Test scenarios:

  • User message append and chat messages retrieval run concurrently
  • Assistant message persisted after streaming completes
  • ReActEngine reuse across messages in same session
  • Error during parallel operations handled gracefully

Verification: Existing chat route tests pass; manual test shows reduced latency.


Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| Heuristic classifier misroutes requests | Medium | Medium: wrong skill or wrong execution mode | Config flag to revert to LLM; monitor routing accuracy via telemetry |
| Parallel tool execution breaks implicit dependencies | Low | High: incorrect results | Config flag to disable; LLM rarely returns dependent calls in single response |
| Async session writes lose messages on crash | Low | Medium: missing conversation history | WAL buffer + flush on shutdown; acceptable trade-off for speed |
| Merged LLM call prompt confuses the model | Low | Low: falls back to separate calls | Fallback to separate calls on parse failure |

System-Wide Impact

  • Routing layer: CostAwareRouter and IntentRouter behavior changes when heuristic mode is active; existing LLM-based routing preserved as fallback
  • ReAct engine: Tool execution changes from serial to parallel; conversation history ordering preserved
  • Session management: Write operations become asynchronous; read operations check WAL buffer
  • Configuration: New router and react config sections in agentkit.yaml
  • Telemetry: Existing OpenTelemetry spans continue to work; new spans for heuristic classification
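
The new router and react sections could be modeled along these lines. This is a sketch under assumptions: the class names and the loader function are hypothetical, and the real ServerConfig structure is not shown in this plan. Only the two flags named above (router.classifier and react.parallel_tools) and their stated defaults are taken from the document.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass
class RouterConfig:
    classifier: str = "heuristic"  # "heuristic" or "llm"

    def __post_init__(self) -> None:
        if self.classifier not in ("heuristic", "llm"):
            raise ValueError(f"unknown router.classifier: {self.classifier!r}")


@dataclass
class ReactConfig:
    parallel_tools: bool = True


@dataclass
class OptimizationConfig:
    router: RouterConfig = field(default_factory=RouterConfig)
    react: ReactConfig = field(default_factory=ReactConfig)


def load_optimization_config(raw: Mapping[str, Any]) -> OptimizationConfig:
    """Build the config from a parsed agentkit.yaml mapping; missing sections use defaults."""
    return OptimizationConfig(
        router=RouterConfig(**(raw.get("router") or {})),
        react=ReactConfig(**(raw.get("react") or {})),
    )
```

Rolling back both optimizations would then amount to loading `{"router": {"classifier": "llm"}, "react": {"parallel_tools": False}}`.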

Open Questions

  • What is the actual routing accuracy of the current LLM-based classifier? Need baseline measurement before comparing heuristic accuracy.
  • Should the heuristic classifier be extensible (plugin pattern) or hardcoded? Starting with hardcoded for simplicity, can extend later.