---
title: "feat: Chat Response Speed Optimization — Sub-1s First Token"
status: active
created: 2026-06-12
plan-type: feat
depth: standard
---

# feat: Chat Response Speed Optimization — Sub-1s First Token

## Summary

Optimize the fischer-agentkit conversation response pipeline to achieve sub-1-second first-token latency. The primary bottleneck is 1–2 extra LLM calls in the routing layer before the main ReAct loop. Secondary optimizations include parallel tool execution, async session I/O, and connection pool tuning. All changes are gated by configuration flags for safe rollback.

## Problem Frame

Users experience 5–10 second delays before seeing any response in the chat interface. The root cause is a serial chain of LLM calls: `CostAwareRouter.quick_classify()` → `IntentRouter._classify_with_llm()` → ReActEngine LLM Think. The first two calls are routing overhead that adds 2–4 seconds (1–2 seconds each) with no user-visible value. The third call is the actual reasoning step and cannot be eliminated, but its perceived latency can be reduced via streaming.

**Current worst-case latency chain:**

```
User message → quick_classify() [1-2s LLM]
             → _classify_with_llm() [1-2s LLM]
             → ReActEngine Think [2-5s LLM]
             → Tool Act [0.5-5s]
             → First token visible to user
```
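
As a rough check on where the time goes, the per-stage ranges in the chain above can be summed. The sketch assumes the stages run strictly one after another and that every stage runs for the message, which is the worst case the diagram describes; the dictionary keys are only labels for the stages.

```python
# (min_seconds, max_seconds) per stage, copied from the worst-case chain.
STAGES = {
    "quick_classify": (1.0, 2.0),
    "_classify_with_llm": (1.0, 2.0),
    "react_think": (2.0, 5.0),
    "tool_act": (0.5, 5.0),
}
ROUTING_STAGES = ("quick_classify", "_classify_with_llm")


def total(names):
    """Sum the lower and upper bounds of the named stages."""
    low = sum(STAGES[n][0] for n in names)
    high = sum(STAGES[n][1] for n in names)
    return low, high


print(total(STAGES))          # (4.5, 14.0) seconds end to end
print(total(ROUTING_STAGES))  # (2.0, 4.0) seconds of routing overhead
```

Under these assumptions, routing alone accounts for 2–4 seconds of the serial chain, which is the portion U1 and U2 target.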

**Target latency chain:**

```
User message → Local rule classification [<1ms]
             → ReActEngine Think (streaming) [first token visible to user in <1s]
             → Tool Act (parallel when possible)
```

## Requirements

| ID | Requirement | Priority |
|----|-------------|----------|
| R1 | First token latency must be under 1 second for simple conversations (greetings, Q&A) | P0 |
| R2 | First token latency must be under 1 second for routed conversations when keyword matching succeeds | P0 |
| R3 | Routing accuracy must not degrade by more than 10% compared to the current LLM-based classification (baseline still to be measured; see Open Questions) | P1 |
| R4 | All optimizations must be configurable with on/off switches for safe rollback | P0 |
| R5 | Parallel tool execution must preserve conversation history ordering | P1 |
| R6 | Async session writes must not lose messages on process crash | P1 |

## Key Technical Decisions

### KTD1: Replace LLM quick_classify with local heuristic

**Decision:** Replace `CostAwareRouter.quick_classify()` LLM call with a zero-cost local heuristic based on message length, keyword density, and tool-hint detection.

**Rationale:** The LLM classification adds 1–2s latency for a binary decision (simple vs complex). A local heuristic using the same signals already present in the message content (length, presence of tool-related keywords, question marks, etc.) is expected to reach roughly 85% accuracy at zero latency cost. This figure is an estimate until the baseline listed under Open Questions has been measured.

**Alternative considered:** Cache LLM classification results. Rejected because cache hit rate would be near-zero for conversational messages (each is unique).

### KTD2: Merge quick_classify and intent classification into single LLM call

**Decision:** When LLM routing is needed (heuristic uncertainty), combine complexity scoring and intent classification into a single LLM call instead of two serial calls.

**Rationale:** Currently `quick_classify()` and `_classify_with_llm()` are separate LLM calls that could be merged into one prompt returning both complexity score and matched skill. This halves the routing LLM overhead when it cannot be avoided.

### KTD3: Parallel execution of independent tool_calls

**Decision:** Execute multiple tool_calls from a single LLM response in parallel using `asyncio.gather()`, with results appended to the conversation in the order the calls appear in the response, each tagged with its tool_call_id.

**Rationale:** When the LLM returns multiple tool calls (e.g., search + calculate), they are usually independent and can run concurrently. Results must be appended in order for the next LLM call to see them correctly.

**Risk:** Some tool calls may have implicit dependencies. Mitigation: the LLM generally does not return dependent calls in a single response (it waits for results before calling the next). Add a config flag `react.parallel_tools: false` to disable if needed.

### KTD4: Fire-and-forget session writes with write-ahead buffer

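**Decision:** Make `SessionManager.append_message()` non-blocking by returning immediately after queuing the write, with a background task performing the actual I/O. Add a small in-memory buffer that acts as a write-ahead log, so queued messages stay readable and retryable until they are persisted.

**Rationale:** Session writes (especially `save_session()` for `updated_at`) add unnecessary blocking. The user doesn't need to wait for persistence before seeing a response. The write-ahead buffer lets messages survive brief store failures. Because the buffer lives in process memory, writes that have not been flushed are still lost on a hard process crash, so R6 is only met for graceful shutdown unless the buffer is made durable.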

### KTD5: Unified httpx connection pool configuration

**Decision:** Configure explicit `httpx.Limits` on all LLM provider clients with sensible defaults for keepalive and connection pooling.

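**Rationale:** httpx's default pool limits (100 connections, 20 keep-alive connections, 5-second keep-alive expiry) are reasonable but not optimized for high-frequency LLM API calls: with a 5-second expiry, an idle connection can be closed between chat turns, forcing a new handshake. Explicit configuration ensures consistent behavior across providers and enables tuning.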

## Scope Boundaries

### In Scope

- Routing layer optimization (CostAwareRouter, IntentRouter)
- ReActEngine parallel tool execution
- Session I/O async optimization
- httpx connection pool tuning
- Configuration flags for all changes
- Test coverage for new behavior

### Out of Scope

- Frontend rendering optimization (separate concern)
- LLM provider response time optimization (external dependency)
- Memory/RAG pipeline optimization (covered by existing plan 009)
- Compression strategy changes (covered by existing plan 013)
- New LLM provider implementations

### Deferred to Follow-Up Work

- A/B testing framework for routing accuracy measurement
- Performance benchmarking CI pipeline
- WebSocket chat flow test coverage
- WenxinProvider token-refresh client reuse

---

## Implementation Units

### U1. Local heuristic classifier for CostAwareRouter

**Goal:** Replace the LLM-based `quick_classify()` with a zero-cost local heuristic, gated by config flag `router.classifier: heuristic | llm`.

**Requirements:** R1, R2, R3, R4

**Dependencies:** None

**Files:**
- `src/agentkit/chat/skill_routing.py` — add `HeuristicClassifier` class, modify `CostAwareRouter.route()`
- `src/agentkit/server/config.py` — add `router` config section
- `agentkit.yaml` — add `router` section with defaults
- `tests/unit/test_cost_aware_router.py` — add heuristic classifier tests

**Approach:**

1. Create `HeuristicClassifier` class with a `classify(content: str) -> float` method that returns a complexity score (0.0–1.0) based on:
   - Message length: short messages (<20 chars) → low complexity
   - Question patterns: presence of "为什么", "如何", "怎么", "how", "why", "what" → moderate complexity
   - Tool hints: presence of tool-related keywords (existing `_tokenize_content` + `tool_hints` list already in code) → high complexity
   - Multi-sentence: messages with multiple sentences → higher complexity
   - Code patterns: presence of code-like patterns (backticks, brackets) → higher complexity

2. Modify `CostAwareRouter.__init__` to accept a `classifier_mode` parameter (`"heuristic"` or `"llm"`)

3. Modify `CostAwareRouter.route()` Phase 1 to use `HeuristicClassifier.classify()` when mode is `"heuristic"`

4. Add `router` config section to `ServerConfig` with `classifier` field (default: `"heuristic"`)

5. Wire config through `create_app()` to `CostAwareRouter`
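
The scoring rules in step 1 could be sketched as follows. This is a minimal illustration, not the planned implementation: the weights, the `TOOL_HINTS` entries, and the regular expressions are all assumptions, chosen so the scores land in the ranges listed under the test scenarios. The real classifier would reuse `_tokenize_content()` and the existing `tool_hints` list.

```python
import re

# The question words come from the plan. Every weight and every TOOL_HINTS
# entry below is an illustrative assumption, not a value from the codebase.
QUESTION_WORDS_ZH = ("为什么", "如何", "怎么")
QUESTION_WORDS_EN = {"how", "why", "what"}
TOOL_HINTS = {"search", "calculate", "download", "run"}  # hypothetical list
CODE_PATTERN = re.compile(r"[`{}\[\]]")        # backticks and brackets
SENTENCE_SPLIT = re.compile(r"[.!?。！？]+")


class HeuristicClassifier:
    """Zero-cost complexity scorer returning a value in [0.0, 1.0]."""

    def classify(self, content: str) -> float:
        text = content.strip()
        if not text:
            return 0.0
        # Whole-word matching for English so "show" does not match "how".
        words = set(re.findall(r"[a-z]+", text.lower()))

        score = 0.1 if len(text) < 20 else 0.2   # short messages start lower
        if len(text) > 500:
            score += 0.4
        if words & QUESTION_WORDS_EN or any(q in text for q in QUESTION_WORDS_ZH):
            score += 0.3
        if words & TOOL_HINTS:
            score += 0.7
        sentences = [s for s in SENTENCE_SPLIT.split(text) if s.strip()]
        if len(sentences) > 1:
            score += 0.2
        if CODE_PATTERN.search(text):
            score += 0.7
        return min(score, 1.0)
```

With these assumed weights, a bare greeting stays below 0.3, a single "如何" question falls in the uncertain 0.3–0.7 band, and any tool or code signal pushes the score above 0.7. The weights would need tuning against real traffic before R3 can be evaluated.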

**Patterns to follow:** Existing `CostAwareRouter._match_layer0()` rule-based pattern; existing `_tokenize_content()` for keyword extraction.

**Test scenarios:**
- Short greeting → complexity < 0.3
- Single question with "如何" → complexity 0.3–0.7
- Multi-step request with tool keywords → complexity > 0.7
- Code-related request → complexity > 0.7
- Empty string → complexity 0.0
- Very long message (>500 chars) → complexity > 0.5
- Config flag `classifier: llm` falls back to LLM classification
- Config flag `classifier: heuristic` uses local heuristic

**Verification:** All existing `test_cost_aware_router.py` tests pass; new heuristic tests pass; manual test shows first-token latency <1s for simple messages.

---

### U2. Merged routing LLM call

**Goal:** When LLM routing is needed (heuristic uncertain or config forces LLM), combine complexity scoring and intent classification into a single LLM call.

**Requirements:** R2, R3, R4

**Dependencies:** U1

**Files:**
- `src/agentkit/chat/skill_routing.py` — add `MergedRouter` method
- `src/agentkit/router/intent.py` — add `route_with_complexity()` method
- `tests/unit/test_cost_aware_router.py` — add merged routing tests
- `tests/unit/test_intent_router.py` — add merged routing tests

**Approach:**

1. Add `IntentRouter.route_with_complexity()` method that returns both a `RoutingResult` and a complexity score in a single LLM call. The prompt asks the LLM to return `{"skill": "...", "confidence": 0.9, "complexity": 0.5}`.

2. Modify `CostAwareRouter.route()` so that when `classifier` is `"llm"`, it calls `route_with_complexity()` instead of making two separate calls.

3. When `classifier` is `"heuristic"` and the heuristic returns uncertainty (score in 0.3–0.7 range), use `route_with_complexity()` as a single fallback call.
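
Parsing the merged response defensively matters because a malformed reply must trigger the fallback path rather than raise. A minimal sketch, assuming the JSON shape from step 1; the names `MergedRouting` and `parse_merged_response` are hypothetical, as is the choice to treat a missing `confidence` as 0.0:

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class MergedRouting:
    skill: Optional[str]   # None when no skill matched
    confidence: float
    complexity: float


def parse_merged_response(raw: str) -> Optional[MergedRouting]:
    """Return the parsed result, or None so the caller can fall back."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            return None
        skill = data.get("skill")
        if skill is not None and not isinstance(skill, str):
            return None
        confidence = float(data.get("confidence", 0.0))
        complexity = float(data["complexity"])   # required field
    except (ValueError, TypeError, KeyError):
        return None
    # Reject out-of-range scores (this also rejects NaN).
    if not (0.0 <= confidence <= 1.0 and 0.0 <= complexity <= 1.0):
        return None
    return MergedRouting(skill or None, confidence, complexity)
```

Returning `None` instead of raising keeps the fallback decision in `CostAwareRouter.route()`, where the config flag is available.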

**Patterns to follow:** Existing `_classify_with_llm()` prompt structure; existing `quick_classify()` prompt structure.

**Test scenarios:**
- Merged call returns both skill match and complexity score
- Merged call with no matching skill returns complexity only
- Merged call with invalid LLM response falls back to rule-based evaluation
- Heuristic uncertain + merged call produces correct routing
- Config `classifier: llm` uses merged call instead of two separate calls

**Verification:** Existing tests pass; merged routing reduces LLM calls from 2 to 1 when LLM routing is needed.

---

### U3. Parallel tool execution in ReActEngine

**Goal:** Execute multiple independent tool_calls from a single LLM response in parallel, gated by config flag `react.parallel_tools`.

**Requirements:** R5, R4

**Dependencies:** None

**Files:**
- `src/agentkit/core/react.py` — modify `_execute_loop()` and `execute_stream()` to use `asyncio.gather()`
- `src/agentkit/server/config.py` — add `react.parallel_tools` config
- `agentkit.yaml` — add `react` section
- `tests/unit/test_react_engine.py` — add parallel execution tests

**Approach:**

1. Add `parallel_tools: bool = True` parameter to `ReActEngine.__init__`.

2. In `_execute_loop()` and `execute_stream()`, when `response.tool_calls` has more than one item and `parallel_tools` is True:
   - Execute all tool calls concurrently with `asyncio.gather(*[self._execute_tool(tc.name, tc.arguments, tools) for tc in response.tool_calls], return_exceptions=True)`
   - Build tool result messages in the original `tool_calls` order (`asyncio.gather()` returns results in input order), each carrying its tool_call_id
   - Append all results to the conversation in that order

3. When `parallel_tools` is False, keep current serial behavior.

4. For `execute_stream()`, yield all `tool_call` events first, then all `tool_result` events after gather completes.
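
The gather-then-append pattern from step 2 can be sketched in isolation. This is an illustration under assumptions, not the engine code: `ToolCall`, `run_tool_calls`, and the `execute` callback are hypothetical stand-ins for the engine's own types and `_execute_tool()`, and the message dictionary shape is assumed.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List


@dataclass
class ToolCall:            # hypothetical stand-in for the engine's type
    id: str
    name: str
    arguments: Dict[str, Any]


async def run_tool_calls(
    tool_calls: List[ToolCall],
    execute: Callable[[str, Dict[str, Any]], Awaitable[Any]],
    parallel_tools: bool = True,
) -> List[Dict[str, str]]:
    if parallel_tools and len(tool_calls) > 1:
        # gather() returns results in input order, regardless of which
        # call finishes first; exceptions come back as result values.
        results = await asyncio.gather(
            *(execute(tc.name, tc.arguments) for tc in tool_calls),
            return_exceptions=True,
        )
    else:
        results = []
        for tc in tool_calls:   # serial path, same error handling
            try:
                results.append(await execute(tc.name, tc.arguments))
            except Exception as exc:
                results.append(exc)

    messages = []
    for tc, result in zip(tool_calls, results):
        failed = isinstance(result, BaseException)
        content = f"error: {result}" if failed else str(result)
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": content})
    return messages
```

Because ordering comes from `zip(tool_calls, results)` rather than completion time, a failing or slow tool cannot reorder the conversation, which is the property R5 asks for.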

**Patterns to follow:** Existing `_execute_tool()` method; existing `Orchestrator._execute_plan()` parallel group pattern in `orchestrator.py`.

**Test scenarios:**
- Two independent tools execute in parallel, both results present in conversation
- Parallel execution preserves tool_call_id ordering in conversation
- One tool fails, other succeeds — partial results preserved
- `parallel_tools: false` falls back to serial execution
- Single tool_call works identically with parallel mode on/off
- Tool results appended to conversation in correct order for next LLM call

**Verification:** Existing ReAct tests pass; new parallel tests pass; manual test with multi-tool request shows reduced execution time.

---

### U4. Async session writes with write-ahead buffer

**Goal:** Make `SessionManager.append_message()` non-blocking: queue the store write and the `save_session()` update as background work, backed by a small write-ahead buffer.

**Requirements:** R6, R4

**Dependencies:** None

**Files:**
- `src/agentkit/session/manager.py` — add async write queue and WAL buffer
- `tests/unit/test_session_manager.py` — add async write tests

**Approach:**

1. Add an `AsyncWriteQueue` to `SessionManager` that:
   - Accepts write operations (append_message, save_session) as tasks
   - Executes them in a background `asyncio.Task`
   - Maintains a small in-memory buffer of writes that are not yet persisted, so they stay readable and can be retried (an in-memory buffer does not survive a process crash)
   - Provides `await flush()` for graceful shutdown

2. Modify `append_message()`:
   - Keep `get_session()` + validation as synchronous (needed for error checking)
   - Queue `store.append_message()` + `store.save_session()` as a single async task
   - Return the `Message` object immediately without waiting for persistence

3. Modify `get_chat_messages()` to first check the WAL buffer for uncommitted messages, then fall back to store.

4. Add `flush()` method called during session close and app shutdown.
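
One possible shape for the queue described in step 1 is sketched below. It is a simplified illustration: the `submit()` and `pending()` method names and the `errors` list are assumptions, retries are left out, and the buffer is in-memory only, so it does not cover a process crash.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional


class AsyncWriteQueue:
    """Runs writes in the background and tracks the ones not yet persisted."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._buffer: Dict[int, Any] = {}   # seq -> item awaiting persistence
        self._seq = 0
        self._worker: Optional[asyncio.Task] = None
        self.errors: List[Exception] = []

    def submit(self, item: Any, write: Callable[[], Awaitable[None]]) -> None:
        """Queue a write and return immediately (needs a running event loop)."""
        self._seq += 1
        self._buffer[self._seq] = item
        self._queue.put_nowait((self._seq, write))
        if self._worker is None or self._worker.done():
            self._worker = asyncio.create_task(self._drain())

    async def _drain(self) -> None:
        while not self._queue.empty():
            seq, write = self._queue.get_nowait()
            try:
                await write()
                self._buffer.pop(seq, None)   # persisted, drop from buffer
            except Exception as exc:
                self.errors.append(exc)       # item stays readable in buffer
            finally:
                self._queue.task_done()

    def pending(self) -> List[Any]:
        """Items submitted but not yet persisted, in submission order."""
        return list(self._buffer.values())

    async def flush(self) -> None:
        """Wait until every queued write has been attempted."""
        await self._queue.join()
```

A single worker draining a FIFO queue keeps writes in submission order. `get_chat_messages()` could merge `pending()` with the stored messages to give read-your-writes behavior before persistence completes.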

**Patterns to follow:** Existing `BackgroundRunner` pattern in `server/runner.py`; existing `TaskStore` cleanup pattern.

**Test scenarios:**
- append_message returns immediately, message persisted asynchronously
- get_chat_messages includes WAL-buffered messages not yet persisted
- flush() ensures all pending writes complete
- Multiple rapid `append_message()` calls are persisted in submission order
- Session close flushes pending writes
- App shutdown flushes pending writes

**Verification:** Existing session tests pass; new async write tests pass; no message loss during normal operation.

---

### U5. httpx connection pool configuration

**Goal:** Configure explicit `httpx.Limits` on all LLM provider clients for optimal connection reuse.

**Requirements:** R4

**Dependencies:** None

**Files:**
- `src/agentkit/llm/providers/openai.py` — add `httpx.Limits` configuration
- `src/agentkit/llm/providers/anthropic.py` — add `httpx.Limits` configuration
- `src/agentkit/llm/providers/gemini.py` — add `httpx.Limits` configuration
- `src/agentkit/llm/config.py` — add connection pool config fields
- `tests/unit/test_llm_provider.py` — verify connection pool settings

**Approach:**

1. Add `connection_pool` section to `ProviderConfig`:
   - `max_connections: int = 100`
   - `max_keepalive_connections: int = 20`
   - `keepalive_expiry: float = 30.0`

2. Pass `httpx.Limits` to all provider constructors.

3. Configure `httpx.AsyncClient` with explicit limits in each provider.
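
The three steps above might fit together as shown here. The `ConnectionPoolConfig` class and `build_async_client()` helper are hypothetical names; the field names and defaults come from step 1, and the 60-second timeout is a placeholder. `httpx` is imported inside the helper so the config can be loaded and tested without it.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConnectionPoolConfig:
    """Pool settings from the plan; field names match httpx.Limits keywords."""
    max_connections: int = 100
    max_keepalive_connections: int = 20
    keepalive_expiry: float = 30.0   # seconds an idle connection is kept


def build_async_client(pool: ConnectionPoolConfig, timeout: float = 60.0):
    """Create an AsyncClient with explicit limits (timeout is a placeholder)."""
    import httpx  # third-party; imported lazily on purpose

    limits = httpx.Limits(**asdict(pool))
    return httpx.AsyncClient(limits=limits, timeout=timeout)
```

Keeping the field names identical to the `httpx.Limits` keyword arguments lets each provider pass the config through without a per-provider mapping.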

**Patterns to follow:** Existing `ProviderConfig` dataclass pattern; existing `timeout` parameter pattern.

**Test scenarios:**
- Provider creates httpx client with configured limits
- Default limits applied when not configured
- Custom limits from config override defaults
- Connection reuse verified via mock

**Verification:** Existing provider tests pass; connection pool settings applied correctly.

---

### U6. Chat route pipeline optimization

**Goal:** Optimize the WebSocket chat handler to overlap I/O operations and reduce serial waits.

**Requirements:** R1, R2

**Dependencies:** U1, U4

**Files:**
- `src/agentkit/server/routes/chat.py` — parallelize session operations
- `tests/unit/test_chat_routes.py` — add pipeline optimization tests

**Approach:**

1. In `_handle_chat_message()`, parallelize:
   - `sm.append_message()` (user message) and `sm.get_chat_messages()` — these can run concurrently since append_message now returns immediately (U4)

2. Move assistant message `append_message()` to fire-and-forget after streaming completes (already non-blocking with U4).

3. Reuse one `ReActEngine` instance per session instead of creating a new one per message.
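
Per-session reuse from step 3 could be as simple as a keyed cache. The sketch assumes the engine holds no per-message mutable state that would leak between turns; `SessionEngineCache` and its methods are hypothetical names.

```python
from typing import Callable, Dict, Generic, TypeVar

E = TypeVar("E")


class SessionEngineCache(Generic[E]):
    """Keeps one engine per session id and builds it on first use."""

    def __init__(self, factory: Callable[[], E]) -> None:
        self._factory = factory
        self._engines: Dict[str, E] = {}

    def get(self, session_id: str) -> E:
        engine = self._engines.get(session_id)
        if engine is None:
            engine = self._factory()
            self._engines[session_id] = engine
        return engine

    def discard(self, session_id: str) -> None:
        """Drop the engine when the session closes to avoid unbounded growth."""
        self._engines.pop(session_id, None)
```

Calling `discard()` from the session-close path keeps the cache from growing with every session the server has ever seen.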

**Patterns to follow:** Existing `asyncio.gather` pattern in orchestrator.

**Test scenarios:**
- User message append and chat messages retrieval run concurrently
- Assistant message persisted after streaming completes
- ReActEngine reuse across messages in same session
- Error during parallel operations handled gracefully

**Verification:** Existing chat route tests pass; manual test shows reduced latency.

---

## Risks & Mitigations

| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Heuristic classifier misroutes requests | Medium | Medium — wrong skill or wrong execution mode | Config flag to revert to LLM; monitor routing accuracy via telemetry |
| Parallel tool execution breaks implicit dependencies | Low | High — incorrect results | Config flag to disable; LLM rarely returns dependent calls in single response |
| Async session writes lose messages on crash | Low | Medium — missing conversation history | WAL buffer + flush on shutdown; acceptable trade-off for speed |
| Merged LLM call prompt confuses the model | Low | Low — falls back to separate calls | Fallback to separate calls on parse failure |

## System-Wide Impact

- **Routing layer:** CostAwareRouter and IntentRouter behavior changes when heuristic mode is active; existing LLM-based routing preserved as fallback
- **ReAct engine:** Tool execution changes from serial to parallel; conversation history ordering preserved
- **Session management:** Write operations become asynchronous; read operations check WAL buffer
- **Configuration:** New `router` and `react` config sections in `agentkit.yaml`
- **Telemetry:** Existing OpenTelemetry spans continue to work; new spans for heuristic classification
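
The two new config sections might map onto typed settings along these lines. The class names `RouterConfig`, `ReactConfig`, and `FeatureFlags` are hypothetical; only the keys `router.classifier` and `react.parallel_tools` and their defaults come from this plan, and the real fields would live in `ServerConfig`.

```python
from dataclasses import dataclass, field
from typing import Any, Mapping


@dataclass
class RouterConfig:
    classifier: str = "heuristic"   # "heuristic" or "llm"

    def __post_init__(self) -> None:
        if self.classifier not in ("heuristic", "llm"):
            raise ValueError(f"unknown router.classifier: {self.classifier!r}")


@dataclass
class ReactConfig:
    parallel_tools: bool = True


@dataclass
class FeatureFlags:
    router: RouterConfig = field(default_factory=RouterConfig)
    react: ReactConfig = field(default_factory=ReactConfig)

    @classmethod
    def from_dict(cls, raw: Mapping[str, Any]) -> "FeatureFlags":
        """Build flags from a parsed config mapping; missing keys use defaults."""
        return cls(
            router=RouterConfig(**raw.get("router", {})),
            react=ReactConfig(**raw.get("react", {})),
        )
```

Rolling back is then a config change: setting `classifier` to `"llm"` and `parallel_tools` to `False` restores the current behavior without a code deploy.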

## Open Questions

- What is the actual routing accuracy of the current LLM-based classifier? Need baseline measurement before comparing heuristic accuracy.
- Should the heuristic classifier be extensible (plugin pattern) or hardcoded? The plan starts with a hardcoded classifier for simplicity; it can be extended later.