---
title: "feat: Chat Response Speed Optimization — Sub-1s First Token"
status: active
created: 2026-06-12
plan-type: feat
depth: standard
---
# feat: Chat Response Speed Optimization — Sub-1s First Token
## Summary
Optimize the fischer-agentkit conversation response pipeline to achieve sub-1-second first-token latency. The primary bottleneck is 1-2 extra LLM calls in the routing layer before the main ReAct loop. Secondary optimizations include parallel tool execution, async session I/O, and connection pool tuning. All changes are gated by configuration flags for safe rollback.
## Problem Frame
Users experience 5-10 second delays before seeing any response in the chat interface. The root cause is a serial chain of LLM calls: CostAwareRouter.quick_classify() → IntentRouter._classify_with_llm() → ReActEngine LLM Think. The first two calls are routing overhead that adds 2-6 seconds with no user-visible value. The third call is the actual reasoning step and cannot be eliminated, but its perceived latency can be reduced via streaming.
**Current worst-case latency chain:**
```
User message → quick_classify() [1-2s LLM]
→ _classify_with_llm() [1-2s LLM]
→ ReActEngine Think [2-5s LLM]
→ Tool Act [0.5-5s]
→ First token visible to user
```
**Target latency chain:**
```
User message → Local rule classification [<1ms]
→ ReActEngine Think (streaming) [first token in <1s]
→ Tool Act (parallel when possible)
→ First token visible to user
```
## Requirements
| ID | Requirement | Priority |
|----|-------------|----------|
| R1 | First token latency must be under 1 second for simple conversations (greetings, Q&A) | P0 |
| R2 | First token latency must be under 1 second for routed conversations when keyword matching succeeds | P0 |
| R3 | Routing accuracy must not degrade by more than 10% compared to current LLM-based classification | P1 |
| R4 | All optimizations must be configurable with on/off switches for safe rollback | P0 |
| R5 | Parallel tool execution must preserve conversation history ordering | P1 |
| R6 | Async session writes must not lose messages on process crash | P1 |
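
R1 and R2 are stated as first-token latency, so it helps to measure exactly that rather than total response time. The following is a minimal sketch, assuming the chat pipeline can be exposed as an async iterator of text chunks; `time_to_first_token` is a hypothetical helper, not an existing agentkit function.

```python
import asyncio
import time
from typing import AsyncIterator, Optional, Tuple


async def time_to_first_token(
    stream: AsyncIterator[str],
) -> Tuple[Optional[str], float]:
    """Return the first chunk of a stream and the seconds waited for it.

    Hypothetical helper: `stream` stands in for whatever async iterator
    the chat handler yields tokens from.
    """
    start = time.perf_counter()
    async for token in stream:
        # Stop at the first chunk; the rest of the stream is not consumed.
        return token, time.perf_counter() - start
    # The stream ended without producing anything.
    return None, time.perf_counter() - start
```

A latency check for R1 could then assert that the returned duration stays under 1.0 for a simple greeting.
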
## Key Technical Decisions
### KTD1: Replace LLM quick_classify with local heuristic
**Decision:** Replace `CostAwareRouter.quick_classify()` LLM call with a zero-cost local heuristic based on message length, keyword density, and tool-hint detection.
**Rationale:** The LLM classification adds 1-2s of latency for a binary decision (simple vs complex). A local heuristic using signals already present in the message content (length, presence of tool-related keywords, question marks, etc.) is expected to reach roughly 85% accuracy at zero latency cost. That figure is an estimate until the baseline listed under Open Questions has been measured.
**Alternative considered:** Cache LLM classification results. Rejected because cache hit rate would be near-zero for conversational messages (each is unique).
### KTD2: Merge quick_classify and intent classification into single LLM call
**Decision:** When LLM routing is needed (heuristic uncertainty), combine complexity scoring and intent classification into a single LLM call instead of two serial calls.
**Rationale:** Currently `quick_classify()` and `_classify_with_llm()` are separate LLM calls that could be merged into one prompt returning both complexity score and matched skill. This halves the routing LLM overhead when it cannot be avoided.
### KTD3: Parallel execution of independent tool_calls
**Decision:** Execute multiple tool_calls from a single LLM response in parallel using `asyncio.gather()`, with results appended to conversation in tool_call_id order.
**Rationale:** When LLM returns multiple tool calls (e.g., search + calculate), they are independent and can run concurrently. Results must be appended in order for the next LLM call to see them correctly.
**Risk:** Some tool calls may have implicit dependencies. Mitigation: the LLM generally does not return dependent calls in a single response (it waits for results before calling the next). Add a config flag `react.parallel_tools: false` to disable if needed.
### KTD4: Fire-and-forget session writes with write-ahead buffer
**Decision:** Make `SessionManager.append_message()` non-blocking by returning immediately after queuing the write, with a background task performing the actual I/O. Add a small in-memory buffer as write-ahead log to prevent message loss.
**Rationale:** Session writes (especially `save_session()` for updated_at) add unnecessary blocking. The user doesn't need to wait for persistence before seeing a response. A write-ahead buffer ensures messages survive brief failures.
### KTD5: Unified httpx connection pool configuration
**Decision:** Configure explicit `httpx.Limits` on all LLM provider clients with sensible defaults for keepalive and connection pooling.
**Rationale:** Default httpx settings are reasonable but not optimized for high-frequency LLM API calls. Explicit configuration ensures consistent behavior across providers and enables tuning.
## Scope Boundaries
### In Scope
- Routing layer optimization (CostAwareRouter, IntentRouter)
- ReActEngine parallel tool execution
- Session I/O async optimization
- httpx connection pool tuning
- Configuration flags for all changes
- Test coverage for new behavior
### Out of Scope
- Frontend rendering optimization (separate concern)
- LLM provider response time optimization (external dependency)
- Memory/RAG pipeline optimization (covered by existing plan 009)
- Compression strategy changes (covered by existing plan 013)
- New LLM provider implementations
### Deferred to Follow-Up Work
- A/B testing framework for routing accuracy measurement
- Performance benchmarking CI pipeline
- WebSocket chat flow test coverage
- WenxinProvider token-refresh client reuse
---
## Implementation Units
### U1. Local heuristic classifier for CostAwareRouter
**Goal:** Replace the LLM-based `quick_classify()` with a zero-cost local heuristic, gated by config flag `router.classifier: heuristic | llm`.
**Requirements:** R1, R2, R3, R4
**Dependencies:** None
**Files:**
- `src/agentkit/chat/skill_routing.py` — add `HeuristicClassifier` class, modify `CostAwareRouter.route()`
- `src/agentkit/server/config.py` — add `router` config section
- `agentkit.yaml` — add `router` section with defaults
- `tests/unit/test_cost_aware_router.py` — add heuristic classifier tests
**Approach:**
1. Create `HeuristicClassifier` class with a `classify(content: str) -> float` method that returns a complexity score (0.0-1.0) based on:
   - Message length: short messages (<20 chars) → low complexity
   - Question patterns: presence of "为什么", "如何", "怎么", "how", "why", "what" → moderate complexity
   - Tool hints: presence of tool-related keywords (existing `_tokenize_content` + `tool_hints` list already in code) → high complexity
   - Multi-sentence: messages with multiple sentences → higher complexity
   - Code patterns: presence of code-like patterns (backticks, brackets) → higher complexity
2. Modify `CostAwareRouter.__init__` to accept a `classifier_mode` parameter (`"heuristic"` or `"llm"`)
3. Modify `CostAwareRouter.route()` Phase 1 to use `HeuristicClassifier.classify()` when mode is `"heuristic"`
4. Add `router` config section to `ServerConfig` with `classifier` field (default: `"heuristic"`)
5. Wire config through `create_app()` to `CostAwareRouter`
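
The scoring signals in step 1 can be sketched as a small additive scorer. This is an illustration only: the weights, the `TOOL_HINTS` tuple, and the regular expressions are assumptions chosen so that the test scenarios listed for this unit hold. The real classifier would reuse `_tokenize_content()` and the existing `tool_hints` list.

```python
import re
from dataclasses import dataclass

# Question words are taken from the plan. TOOL_HINTS is a hypothetical
# stand-in for the router's existing tool_hints list.
QUESTION_WORDS = ("为什么", "如何", "怎么", "how", "why", "what")
TOOL_HINTS = ("search", "calculate", "搜索", "计算")
SENTENCE_END = re.compile(r"[.!?。！？]+")
CODE_PATTERN = re.compile(r"[`{}\[\]]")


@dataclass(frozen=True)
class HeuristicClassifier:
    short_length: int = 20   # below this a message counts as short
    long_length: int = 500   # above this a message counts as very long

    def classify(self, content: str) -> float:
        """Return a complexity score in [0.0, 1.0] without calling an LLM."""
        text = content.strip()
        if not text:
            return 0.0
        lowered = text.lower()
        points = 1  # score in tenths, so the arithmetic stays exact
        if len(text) >= self.short_length:
            points += 1
        if len(text) > self.long_length:
            points += 4
        # Plain substring matching: cheap, but "show" also matches "how".
        if any(word in lowered for word in QUESTION_WORDS):
            points += 3
        if any(hint in lowered for hint in TOOL_HINTS):
            points += 4
        sentences = [s for s in SENTENCE_END.split(text) if s.strip()]
        if len(sentences) > 1:
            points += 2
        if CODE_PATTERN.search(text):
            points += 7
        return min(points, 10) / 10
```

Scores between 0.3 and 0.7 would be treated as uncertain and handed to the merged LLM call described in U2.
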
**Patterns to follow:** Existing `CostAwareRouter._match_layer0()` rule-based pattern; existing `_tokenize_content()` for keyword extraction.
**Test scenarios:**
- Short greeting → complexity < 0.3
- Single question with "如何" → complexity 0.3-0.7
- Multi-step request with tool keywords → complexity > 0.7
- Code-related request → complexity > 0.7
- Empty string → complexity 0.0
- Very long message (>500 chars) → complexity > 0.5
- Config flag `classifier: llm` falls back to LLM classification
- Config flag `classifier: heuristic` uses local heuristic
**Verification:** All existing `test_cost_aware_router.py` tests pass; new heuristic tests pass; manual test shows first-token latency <1s for simple messages.
---
### U2. Merged routing LLM call
**Goal:** When LLM routing is needed (heuristic uncertain or config forces LLM), combine complexity scoring and intent classification into a single LLM call.
**Requirements:** R2, R3, R4
**Dependencies:** U1
**Files:**
- `src/agentkit/chat/skill_routing.py` — add `MergedRouter` method
- `src/agentkit/router/intent.py` — add `route_with_complexity()` method
- `tests/unit/test_cost_aware_router.py` — add merged routing tests
- `tests/unit/test_intent_router.py` — add merged routing tests
**Approach:**
1. Add `IntentRouter.route_with_complexity()` method that returns both a `RoutingResult` and a complexity score in a single LLM call. The prompt asks the LLM to return `{"skill": "...", "confidence": 0.9, "complexity": 0.5}`.
2. Modify `CostAwareRouter.route()` so that when `classifier` is `"llm"`, it calls `route_with_complexity()` instead of making two separate calls.
3. When `classifier` is `"heuristic"` and the heuristic returns uncertainty (score in 0.3-0.7 range), use `route_with_complexity()` as a single fallback call.
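
Because the merged call depends on the model returning well-formed JSON, the parsing step is where the fallback has to be triggered. Below is a minimal sketch of that step, assuming the response shape from step 1. `MergedRouting` and `parse_merged_response` are hypothetical names, and returning `None` is one possible way to signal the caller to fall back.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MergedRouting:
    skill: Optional[str]   # None when no skill matched
    confidence: float
    complexity: float


def parse_merged_response(raw: str) -> Optional[MergedRouting]:
    """Parse the merged routing JSON; return None on any malformed input."""
    try:
        data = json.loads(raw)
        if not isinstance(data, dict):
            return None
        complexity = float(data["complexity"])
        confidence = float(data.get("confidence", 0.0))
    except (ValueError, KeyError, TypeError):
        # Covers invalid JSON, a missing complexity field, and null values.
        return None
    # Reject out-of-range scores (this also rejects NaN).
    if not (0.0 <= complexity <= 1.0 and 0.0 <= confidence <= 1.0):
        return None
    skill = data.get("skill")
    if not isinstance(skill, str) or not skill:
        skill = None
    return MergedRouting(skill=skill, confidence=confidence, complexity=complexity)
```

A response with `"skill": null` still yields a usable complexity score, which matches the "no matching skill returns complexity only" scenario.
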
**Patterns to follow:** Existing `_classify_with_llm()` prompt structure; existing `quick_classify()` prompt structure.
**Test scenarios:**
- Merged call returns both skill match and complexity score
- Merged call with no matching skill returns complexity only
- Merged call with invalid LLM response falls back to rule-based evaluation
- Heuristic uncertain + merged call produces correct routing
- Config `classifier: llm` uses merged call instead of two separate calls
**Verification:** Existing tests pass; merged routing reduces LLM calls from 2 to 1 when LLM routing is needed.
---
### U3. Parallel tool execution in ReActEngine
**Goal:** Execute multiple independent tool_calls from a single LLM response in parallel, gated by config flag `react.parallel_tools`.
**Requirements:** R5, R4
**Dependencies:** None
**Files:**
- `src/agentkit/core/react.py` — modify `_execute_loop()` and `execute_stream()` to use `asyncio.gather()`
- `src/agentkit/server/config.py` — add `react.parallel_tools` config
- `agentkit.yaml` — add `react` section
- `tests/unit/test_react_engine.py` — add parallel execution tests
**Approach:**
1. Add `parallel_tools: bool = True` parameter to `ReActEngine.__init__`.
2. In `_execute_loop()` and `execute_stream()`, when `response.tool_calls` has >1 items and `parallel_tools` is True:
   - Execute all tool calls concurrently with `asyncio.gather(*[_execute_tool(tc.name, tc.arguments, tools) for tc in response.tool_calls], return_exceptions=True)`
   - Build tool result messages in tool_call_id order
   - Append all results to conversation in order
3. When `parallel_tools` is False, keep current serial behavior.
4. For `execute_stream()`, yield all `tool_call` events first, then all `tool_result` events after gather completes.
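
The ordering guarantee in step 2 follows from `asyncio.gather()` returning results in the order its awaitables were passed, regardless of which finishes first. The sketch below illustrates this under simplified assumptions: `ToolCall`, `run_tool_calls`, and the message dictionaries are hypothetical stand-ins for the engine's own types and for `_execute_tool()`.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Dict, List

ToolFn = Callable[..., Awaitable[Any]]


@dataclass
class ToolCall:
    id: str
    name: str
    arguments: Dict[str, Any]


async def run_tool_calls(
    tool_calls: List[ToolCall],
    tools: Dict[str, ToolFn],
    parallel: bool = True,
) -> List[Dict[str, Any]]:
    """Run tool calls and return result messages in the original call order."""

    async def run_one(tc: ToolCall) -> Any:
        return await tools[tc.name](**tc.arguments)

    if parallel and len(tool_calls) > 1:
        # gather() keeps input order; failures come back as exception objects.
        results = await asyncio.gather(
            *(run_one(tc) for tc in tool_calls), return_exceptions=True
        )
    else:
        results = []
        for tc in tool_calls:
            try:
                results.append(await run_one(tc))
            except Exception as exc:
                results.append(exc)

    messages = []
    for tc, result in zip(tool_calls, results):
        failed = isinstance(result, BaseException)
        content = f"error: {result}" if failed else str(result)
        messages.append({"role": "tool", "tool_call_id": tc.id, "content": content})
    return messages
```

Both branches produce the same message list, so switching `parallel_tools` off changes timing only, not conversation content.
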
**Patterns to follow:** Existing `_execute_tool()` method; existing `Orchestrator._execute_plan()` parallel group pattern in `orchestrator.py`.
**Test scenarios:**
- Two independent tools execute in parallel, both results present in conversation
- Parallel execution preserves tool_call_id ordering in conversation
- One tool fails, other succeeds — partial results preserved
- `parallel_tools: false` falls back to serial execution
- Single tool_call works identically with parallel mode on/off
- Tool results appended to conversation in correct order for next LLM call
**Verification:** Existing ReAct tests pass; new parallel tests pass; manual test with multi-tool request shows reduced execution time.
---
### U4. Async session writes with write-ahead buffer
**Goal:** Make `SessionManager.append_message()` non-blocking by deferring `save_session()` and making `append_message()` fire-and-forget with a small write-ahead buffer.
**Requirements:** R6, R4
**Dependencies:** None
**Files:**
- `src/agentkit/session/manager.py` — add async write queue and WAL buffer
- `tests/unit/test_session_manager.py` — add async write tests
**Approach:**
1. Add an `AsyncWriteQueue` to `SessionManager` that:
   - Accepts write operations (append_message, save_session) as tasks
   - Executes them in a background `asyncio.Task`
   - Maintains a small in-memory buffer of recent writes for crash recovery
   - Provides `await flush()` for graceful shutdown
2. Modify `append_message()`:
   - Keep `get_session()` + validation as synchronous (needed for error checking)
   - Queue `store.append_message()` + `store.save_session()` as a single async task
   - Return the `Message` object immediately without waiting for persistence
3. Modify `get_chat_messages()` to first check the WAL buffer for uncommitted messages, then fall back to store.
4. Add `flush()` method called during session close and app shutdown.
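
The queue, buffer, and flush behaviour described above can be sketched with an `asyncio.Queue` drained by one background task. This is a simplified illustration, not the planned implementation: the method names `submit`, `pending`, and `close` are assumptions, and the buffer lives only in memory, so it covers reads of unpersisted messages and store failures rather than a hard process kill.

```python
import asyncio
import contextlib
from typing import Any, Awaitable, Callable, List, Optional

WriteFn = Callable[[Any], Awaitable[None]]


class AsyncWriteQueue:
    """Queue writes for a background task and track unpersisted items."""

    def __init__(self) -> None:
        # Construct inside a running event loop so the queue binds to it.
        self._queue: asyncio.Queue = asyncio.Queue()
        self._pending: List[Any] = []   # accepted but not yet persisted
        self._worker: Optional[asyncio.Task] = None

    def submit(self, item: Any, write: WriteFn) -> None:
        """Accept a write and return immediately."""
        self._pending.append(item)
        self._queue.put_nowait((item, write))
        if self._worker is None or self._worker.done():
            self._worker = asyncio.create_task(self._drain())

    def pending(self) -> List[Any]:
        """Items a reader should merge with what the store already holds."""
        return list(self._pending)

    async def _drain(self) -> None:
        while True:
            item, write = await self._queue.get()
            try:
                await write(item)
                self._pending.remove(item)
            except Exception:
                # The item stays buffered; a real version would log and retry.
                pass
            finally:
                self._queue.task_done()

    async def flush(self) -> None:
        """Wait until every queued write has been attempted."""
        await self._queue.join()

    async def close(self) -> None:
        await self.flush()
        if self._worker is not None:
            self._worker.cancel()
            with contextlib.suppress(asyncio.CancelledError):
                await self._worker
```

A single worker keeps writes in submission order, which keeps persisted history consistent with the order in which messages were accepted.
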
**Patterns to follow:** Existing `BackgroundRunner` pattern in `server/runner.py`; existing `TaskStore` cleanup pattern.
**Test scenarios:**
- append_message returns immediately, message persisted asynchronously
- get_chat_messages includes WAL-buffered messages not yet persisted
- flush() ensures all pending writes complete
- Multiple rapid append_messages are batched correctly
- Session close flushes pending writes
- App shutdown flushes pending writes
**Verification:** Existing session tests pass; new async write tests pass; no message loss during normal operation.
---
### U5. httpx connection pool configuration
**Goal:** Configure explicit `httpx.Limits` on all LLM provider clients for optimal connection reuse.
**Requirements:** R4
**Dependencies:** None
**Files:**
- `src/agentkit/llm/providers/openai.py` — add `httpx.Limits` configuration
- `src/agentkit/llm/providers/anthropic.py` — add `httpx.Limits` configuration
- `src/agentkit/llm/providers/gemini.py` — add `httpx.Limits` configuration
- `src/agentkit/llm/config.py` — add connection pool config fields
- `tests/unit/test_llm_provider.py` — verify connection pool settings
**Approach:**
1. Add `connection_pool` section to `ProviderConfig`:
   - `max_connections: int = 100`
   - `max_keepalive_connections: int = 20`
   - `keepalive_expiry: float = 30.0`
2. Pass `httpx.Limits` to all provider constructors.
3. Configure `httpx.AsyncClient` with explicit limits in each provider.
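
As a rough sketch of steps 1 to 3, the pool settings can live in a small dataclass whose fields map one-to-one onto `httpx.Limits` keyword arguments. `ConnectionPoolConfig` and `build_async_client` are hypothetical names, and the 60-second timeout is an assumed placeholder for the provider's existing `timeout` parameter.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ConnectionPoolConfig:
    # Field names match httpx.Limits keyword arguments exactly.
    max_connections: int = 100
    max_keepalive_connections: int = 20
    keepalive_expiry: float = 30.0


def build_async_client(pool: ConnectionPoolConfig, timeout: float = 60.0):
    """Create an httpx.AsyncClient using the configured pool limits."""
    # Imported here so the config module loads even without httpx installed.
    import httpx

    limits = httpx.Limits(**asdict(pool))
    return httpx.AsyncClient(limits=limits, timeout=timeout)
```

httpx's documented defaults are already 100 connections and 20 keepalive connections with a 5-second keepalive expiry, so with the values above the effective change is the longer keepalive window.
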
**Patterns to follow:** Existing `ProviderConfig` dataclass pattern; existing `timeout` parameter pattern.
**Test scenarios:**
- Provider creates httpx client with configured limits
- Default limits applied when not configured
- Custom limits from config override defaults
- Connection reuse verified via mock
**Verification:** Existing provider tests pass; connection pool settings applied correctly.
---
### U6. Chat route pipeline optimization
**Goal:** Optimize the WebSocket chat handler to overlap I/O operations and reduce serial waits.
**Requirements:** R1, R2
**Dependencies:** U1, U4
**Files:**
- `src/agentkit/server/routes/chat.py` — parallelize session operations
- `tests/unit/test_chat_routes.py` — add pipeline optimization tests
**Approach:**
1. In `_handle_chat_message()`, parallelize:
   - `sm.append_message()` (user message) and `sm.get_chat_messages()` — these can run concurrently since append_message now returns immediately (U4)
2. Move assistant message `append_message()` to fire-and-forget after streaming completes (already non-blocking with U4).
3. Reuse `ReActEngine` instance per session instead of creating new one per message.
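
Step 3 needs somewhere to keep one engine per session. One possible shape is a small least-recently-used cache keyed by session id, sketched below. `SessionEngineCache`, the factory callback, and the `max_sessions` bound are all assumptions; the plan does not specify an eviction policy.

```python
from collections import OrderedDict
from typing import Callable, Generic, TypeVar

E = TypeVar("E")


class SessionEngineCache(Generic[E]):
    """Keep one engine per session, evicting the least recently used."""

    def __init__(self, factory: Callable[[str], E], max_sessions: int = 256) -> None:
        self._factory = factory
        self._max_sessions = max_sessions
        self._engines: "OrderedDict[str, E]" = OrderedDict()

    def get(self, session_id: str) -> E:
        engine = self._engines.get(session_id)
        if engine is None:
            # First message in this session: build the engine once.
            engine = self._factory(session_id)
            self._engines[session_id] = engine
        else:
            # Mark the session as most recently used.
            self._engines.move_to_end(session_id)
        while len(self._engines) > self._max_sessions:
            self._engines.popitem(last=False)
        return engine

    def discard(self, session_id: str) -> None:
        """Drop the engine when a session closes."""
        self._engines.pop(session_id, None)
```

Bounding the cache avoids holding engines for abandoned sessions; calling `discard()` on session close releases them sooner.
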
**Patterns to follow:** Existing `asyncio.gather` pattern in orchestrator.
**Test scenarios:**
- User message append and chat messages retrieval run concurrently
- Assistant message persisted after streaming completes
- ReActEngine reuse across messages in same session
- Error during parallel operations handled gracefully
**Verification:** Existing chat route tests pass; manual test shows reduced latency.
---
## Risks & Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|-----------|--------|------------|
| Heuristic classifier misroutes requests | Medium | Medium — wrong skill or wrong execution mode | Config flag to revert to LLM; monitor routing accuracy via telemetry |
| Parallel tool execution breaks implicit dependencies | Low | High — incorrect results | Config flag to disable; LLM rarely returns dependent calls in single response |
| Async session writes lose messages on crash | Low | Medium — missing conversation history | WAL buffer + flush on shutdown; acceptable trade-off for speed |
| Merged LLM call prompt confuses the model | Low | Low — falls back to separate calls | Fallback to separate calls on parse failure |
## System-Wide Impact
- **Routing layer:** CostAwareRouter and IntentRouter behavior changes when heuristic mode is active; existing LLM-based routing preserved as fallback
- **ReAct engine:** Tool execution changes from serial to parallel; conversation history ordering preserved
- **Session management:** Write operations become asynchronous; read operations check WAL buffer
- **Configuration:** New `router` and `react` config sections in `agentkit.yaml`
- **Telemetry:** Existing OpenTelemetry spans continue to work; new spans for heuristic classification
## Open Questions
- What is the actual routing accuracy of the current LLM-based classifier? A baseline measurement is needed before heuristic accuracy can be compared against it.
- Should the heuristic classifier be extensible (plugin pattern) or hardcoded? The plan starts with a hardcoded classifier for simplicity; it can be made extensible later.