From 027f7909aa00faf1719ff82b657b8cd69d9ac843 Mon Sep 17 00:00:00 2001
From: chiguyong
Date: Fri, 3 Jul 2026 09:40:09 +0800
Subject: [PATCH] docs(solutions): CJK token estimation undercount fix

Document the ContextCompressor CJK 4x underestimation bug and fix:
- estimate_text_tokens() CJK 1:1 / ASCII 4:1 heuristic
- _summarize() max_chars budget fix (P1: was * 4, allowed 4x CJK budget)
- Linear compress flow + structured logging
- Prevention: charset-aware heuristics, audit dependent truncation points
---
 ...context-compressor-cjk-token-estimation.md | 152 ++++++++++++++++++
 1 file changed, 152 insertions(+)
 create mode 100644 docs/solutions/logic-errors/context-compressor-cjk-token-estimation.md

diff --git a/docs/solutions/logic-errors/context-compressor-cjk-token-estimation.md b/docs/solutions/logic-errors/context-compressor-cjk-token-estimation.md
new file mode 100644
index 0000000..884d03f
--- /dev/null
+++ b/docs/solutions/logic-errors/context-compressor-cjk-token-estimation.md
@@ -0,0 +1,152 @@
+---
+title: "ContextCompressor CJK token estimation undercounted by 4x"
+date: 2026-07-03
+module: core/compressor
+component: assistant
+tags:
+  - cjk
+  - token-estimation
+  - context-compression
+  - react-engine
+  - heuristic
+problem_type: logic_error
+severity: high
+symptoms:
+  - "estimate_tokens() uses len(content) // 4 (ASCII heuristic), undercounting CJK tokens by ~4x since CJK chars are ~1 char per token"
+  - "Context compression triggers too late for CJK-heavy conversations, risking context window overflow"
+  - "_summarize() pre-truncation uses max_chars = max_input_tokens * 4, allowing CJK text to send 4x the token budget to the LLM"
+  - "ReActEngine._should_compress() fallback inherits the same flawed len // 4 estimation for compressors without should_compress()"
+root_cause: logic_error
+resolution_type: code_fix
+---
+
+## Problem
+
+`ContextCompressor.estimate_tokens()` estimates token counts with `len(content) // 4`, an ASCII heuristic that assumes 4 characters ≈ 1 token. For CJK (Chinese/Japanese/Korean) text, 1 character ≈ 1 token, so the estimator undercounts CJK tokens by roughly 4x. As a result, compression triggers too late for CJK-heavy conversations, risking context window overflow. The pre-truncation in `_summarize()`, `max_chars = max_input_tokens * 4`, compounds the problem: it lets CJK text send 4x the token budget to the LLM.
+
+## Symptoms
+
+- CJK-heavy conversations trigger compression only after far exceeding the expected threshold (by roughly 4x): `should_compress()` returns `False` even though the actual token count already exceeds `model_context_limit`
+- `_summarize()` may send CJK text worth 4x the token budget to the LLM (P1 defect: can hit the context limit and cause a 400 error)
+- Long Chinese conversations run a high risk of context window overflow and failed requests
+- The compression flow depends on the recursion depth counter `_compression_depth`, which makes it hard to observe and debug, and it lacks structured logging
+
+## What Didn't Work
+
+The old `len(content) // 4` estimator is based on the ASCII/Latin average ratio (about 4 characters per token). For pure CJK text (1 character ≈ 1 token), it yields an estimate about 4x too low. For example, 4000 Chinese characters are estimated at 1000 tokens when the actual count is about 4000. Consequences:
+
+- The headroom threshold in `should_compress()` (`model_context_limit * 0.8`) does not fire until the actual token count is about 4x the intended threshold
+- In `_summarize()`, `max_chars = max_input_tokens * 4` lets through 12800 characters (≈12800 CJK tokens) on a 3200-token budget; this P1 defect was caught by ce-code-review (session history)
+- The old recursive `compress()` relies on the `_compression_depth` counter, which makes the flow hard to read, and `_compress_aggressive()` receives the already-compressed `compressed` list, risking a summary-of-summary (F-010)
+
+## Solution
+
+**1. Add a CJK-aware token estimator** (`src/agentkit/core/compressor.py`):
+
+```python
+def _is_cjk(char: str) -> bool:
+    """Check if a character is CJK (1 token ≈ 1 char)."""
+    cp = ord(char)
+    return (
+        0x4E00 <= cp <= 0x9FFF  # CJK Unified Ideographs
+        or 0x3040 <= cp <= 0x30FF  # Hiragana + Katakana
+        or 0xAC00 <= cp <= 0xD7AF  # Hangul Syllables
+    )
+
+
+def estimate_text_tokens(text: str) -> int:
+    """Estimate token count: CJK 1:1, other characters 4:1."""
+    cjk_count = 0
+    non_cjk_count = 0
+    for char in text:
+        if _is_cjk(char):
+            cjk_count += 1
+        else:
+            non_cjk_count += 1
+    return cjk_count + non_cjk_count // 4
+```
+
+`estimate_tokens()` now delegates to `estimate_text_tokens()`:
+
+```python
+def estimate_tokens(self, messages: list[dict]) -> int:
+    """Estimate total tokens in message list (CJK 1:1, ASCII 4:1)"""
+    return sum(estimate_text_tokens(str(m.get("content", ""))) for m in messages)
+```
+
+**2. Fix the `_summarize()` pre-truncation** (P1 defect, caught by ce-code-review):
+
+```python
+# Before (CJK could exceed the budget by 4x):
+max_chars = max_input_tokens * 4
+
+# After (exact for CJK at 1:1, conservative for ASCII at 4:1):
+max_chars = max_input_tokens
+conversation_text = conversation_text[:max_chars] + "\n...[truncated]"
+```
+
+**3. The `ReActEngine._should_compress()` fallback path now uses `estimate_text_tokens` as well** (`src/agentkit/core/react.py`, around line 1750):
+
+```python
+# Fallback: fixed threshold for compressors without headroom support
+estimated_tokens = sum(
+    estimate_text_tokens(str(m.get("content", ""))) for m in conversation
+)
+return estimated_tokens > self._DEFAULT_COMPRESS_THRESHOLD
+```
+
+**4. Rewrite `compress()` as a linear flow**, removing the recursive `_compression_depth`:
+
+```python
+async def compress(self, messages: list[dict]) -> list[dict]:
+    """Linear flow: summarize -> aggressive -> truncate."""
+    tokens_before = self.estimate_tokens(messages)
+    if tokens_before <= self._max_tokens:
+        return messages
+    # ... split into system/old/recent ...
+    # Step 1: summarize
+    # Step 2: aggressive (F-010: pass the original messages, not compressed, to avoid summary-of-summary)
+    if self.estimate_tokens(compressed) > self._max_tokens:
+        compressed = await self._compress_aggressive(messages)
+    # Step 3: truncate as last resort
+    if self.estimate_tokens(compressed) > self._max_tokens:
+        compressed = self._truncate(compressed)
+    # Step 4: structured logging
+    self._log_compression(tokens_before, tokens_after, len(messages), len(compressed), strategy)
+    return compressed
+```
+
+**5. Add `_log_compression()` structured logging**:
+
+```python
+logger.info(
+    "context compressed: %d -> %d tokens (%.1f%%), messages: %d -> %d, strategy: %s",
+    tokens_before, tokens_after, ratio * 100,
+    msg_count_before, msg_count_after, strategy,
+)
+```
+
+## Why This Works
+
+**Root cause**: in mainstream tokenizers (BPE/WordPiece/SentencePiece), CJK characters map to tokens at roughly 1:1, whereas ASCII/Latin text averages about 4 characters per token. `len(content) // 4` flattens that 4x difference, so CJK estimates come out systematically low.
+
+The fixed `estimate_text_tokens()` counts CJK characters at 1:1 and keeps 4:1 for non-CJK characters, correcting the CJK bias while preserving ASCII behavior. In `_summarize()`, `max_chars = max_input_tokens` is exact for CJK (1:1) and conservative for ASCII (it truncates to about 1/4 of the budget, which is safe), eliminating the "4x over budget" path entirely.
+
+`headroom_threshold=0.8` absorbs the 10-20% estimation error that pure CJK text may still show (the `ponytail:` comment documents the ceiling and the upgrade path: `litellm.token_counter` or a provider-specific tokenizer).
+
+**Maintainability improvements**: the linear `compress()` flow (summarize → aggressive → truncate) removes the recursion depth counter, so every degradation path can be understood in a single read; `_compress_aggressive()` receives the original `messages` instead of the already-compressed `compressed`, avoiding the F-010 summary-of-summary; `_log_compression()` emits structured observability fields (before/after/ratio/msg_count/strategy), making compression behavior debuggable and alertable.
+
+## Prevention
+
+- **Charset-aware estimation heuristics**: any token estimation logic must account for character set differences. `len // 4` holds only for ASCII; CJK, emoji, and other multi-byte scripts need separate handling. For multilingual input, prefer `litellm.token_counter` or a provider-specific tokenizer (the `ponytail` comment on `estimate_text_tokens` already documents this upgrade path)
+- **Audit every dependent truncation point when changing estimation logic**: this fix audited `estimate_tokens()`, the `_summarize()` pre-truncation, and the `ReActEngine._should_compress()` fallback path together. `_truncate()` still checks `len(content) > self._max_tokens * 4` (deferred item OQ21) and needs to migrate to charset-aware logic in a follow-up
+- **Cover the charset matrix in tests**: new tests cover pure CJK, pure ASCII, mixed CJK+ASCII, Hiragana/Katakana, Hangul, and similar cases (`tests/unit/test_context_compressor.py`), verifying the estimator across character sets
+- **Avoid summary-of-summary**: in multi-stage compression, later stages should receive the original input rather than the previous stage's output, so information does not degrade stage by stage (lesson from F-010)
+- **Structured logging first**: compression is a black-box operation and must emit structured fields such as before/after tokens, compression ratio, message counts, and strategy, so production issues are easier to diagnose
+
+## References
+
+- Plan: `docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md` (3 rounds of ce-doc-review, converged on U1 + U3)
+- PR #21: `feat/context-compressor-cjk` (commits `be45fe4` + `3a05c4d`)
+- Upstream context: `docs/plans/2026-06-24-004-feat-long-horizon-reliability-optimization-plan.md` (where headroom compression was introduced)
+- Residual: OQ21 (`_truncate()` `* 4` consistency, P2 manual follow-up)