fischer-agentkit/context-compressor-cjk-token-estimation.md at 00b2dad36e08cdd1651a29a51864e04b1ae75efd

8.2 KiB

Raw Blame History

title

date

module

component

Problem

ContextCompressor.estimate_tokens() 使用 len(content) // 4（ASCII 启发式，4 字符 ≈ 1 token）估算 token 数。对 CJK（中文/日文/韩文）文本而言，1 个字符 ≈ 1 token，因此该估算器将 CJK token 数低估约 4 倍。这导致 CJK 为主的会话压缩触发过晚，存在上下文窗口溢出风险。_summarize() 的预截断 max_chars = max_input_tokens * 4 进一步放大了问题——允许 CJK 文本向 LLM 发送 4 倍 token 预算的输入。

Symptoms

CJK 为主的会话在远超预期阈值（约 4 倍）后才触发压缩，should_compress() 返回 False 而实际 token 已超出 model_context_limit
_summarize() 可能向 LLM 发送 4 倍 token 预算的 CJK 文本（P1 缺陷——可能触发上下文上限 / 400 错误）
中文长会话面临上下文窗口溢出 / 请求失败的高风险
压缩流程依赖递归深度 _compression_depth，难以观测与调试，缺乏结构化日志

What Didn't Work

旧的 len(content) // 4 估算器基于 ASCII/拉丁语平均比例（约 4 字符/token）。对纯 CJK 文本（1 字符 ≈ 1 token），该估算产生约 4 倍偏低的估计。例如：4000 个中文字符会被估算为 1000 token，但实际约 4000 token。这导致：

should_compress() 的 headroom 阈值（model_context_limit * 0.8）直到实际 token 已 4 倍超出预期阈值才触发
_summarize() 中 max_chars = max_input_tokens * 4 在预算为 3200 token 时放行 12800 字符（≈12800 CJK token）— 此 P1 缺陷由 ce-code-review 捕获 (session history)
旧的递归式 compress() 依赖 _compression_depth 计数器，流程难以阅读，且 _compress_aggressive() 接收已压缩的 compressed 列表，存在 summary-of-summary（F-010）风险

Solution

1. 新增 CJK 感知的 token 估算器（src/agentkit/core/compressor.py）：

def _is_cjk(char: str) -> bool:
    """Check if a character is CJK (1 token ≈ 1 char)."""
    cp = ord(char)
    return (
        0x4E00 <= cp <= 0x9FFF  # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF  # Hiragana + Katakana
        or 0xAC00 <= cp <= 0xD7AF  # Hangul Syllables
    )


def estimate_text_tokens(text: str) -> int:
    """Estimate token count: CJK 1:1, other characters 4:1."""
    cjk_count = 0
    non_cjk_count = 0
    for char in text:
        if _is_cjk(char):
            cjk_count += 1
        else:
            non_cjk_count += 1
    return cjk_count + non_cjk_count // 4

estimate_tokens() 改用 estimate_text_tokens()：

def estimate_tokens(self, messages: list[dict]) -> int:
    """Estimate total tokens in message list (CJK 1:1, ASCII 4:1)"""
    return sum(estimate_text_tokens(str(m.get("content", ""))) for m in messages)

2. 修复 _summarize() 预截断（P1 缺陷，由 ce-code-review 捕获）：

# 修改前（CJK 可 4 倍超预算）:
max_chars = max_input_tokens * 4

# 修改后（CJK 1:1 精确，ASCII 4:1 保守）:
max_chars = max_input_tokens
conversation_text = conversation_text[:max_chars] + "\n...[truncated]"

3. ReActEngine._should_compress() 回退路径同步使用 estimate_text_tokens（src/agentkit/core/react.py 约 1750 行）：

# Fallback: fixed threshold for compressors without headroom support
estimated_tokens = sum(
    estimate_text_tokens(str(m.get("content", ""))) for m in conversation
)
return estimated_tokens > self._DEFAULT_COMPRESS_THRESHOLD

4. 重写 compress() 为线性流程，移除递归式 _compression_depth：

async def compress(self, messages: list[dict]) -> list[dict]:
    """Linear flow: summarize -> aggressive -> truncate."""
    tokens_before = self.estimate_tokens(messages)
    if tokens_before <= self._max_tokens:
        return messages
    # ... 分离 system/old/recent ...
    # Step 1: summarize
    # Step 2: aggressive (F-010: 传入原始 messages 而非 compressed，避免 summary-of-summary)
    if self.estimate_tokens(compressed) > self._max_tokens:
        compressed = await self._compress_aggressive(messages)
    # Step 3: truncate as last resort
    if self.estimate_tokens(compressed) > self._max_tokens:
        compressed = self._truncate(compressed)
    # Step 4: 结构化日志
    self._log_compression(tokens_before, tokens_after, len(messages), len(compressed), strategy)
    return compressed

5. 新增 _log_compression() 结构化日志：

logger.info(
    "context compressed: %d -> %d tokens (%.1f%%), messages: %d -> %d, strategy: %s",
    tokens_before, tokens_after, ratio * 100,
    msg_count_before, msg_count_after, strategy,
)

Why This Works

根本原因：CJK 字符在主流 tokenizer（BPE/WordPiece/SentencePiece）中近似 1:1 映射为 token，而 ASCII/拉丁文约 4 字符/token。len(content) // 4 把这 4 倍差异抹平了，导致 CJK 估算系统性偏低。

修复后的 estimate_text_tokens() 对 CJK 字符按 1:1 计数、对非 CJK 保留 4:1，既纠正了 CJK 偏差又维持了 ASCII 行为。_summarize() 的 max_chars = max_input_tokens 对 CJK 精确（1:1）、对 ASCII 保守（截断到 1/4 预算但安全），彻底消除了"4 倍超预算"路径。

headroom_threshold=0.8 吸收了纯 CJK 仍可能存在的 10-20% 估算偏差（ponytail: 注释已标注上限与升级路径——litellm.token_counter 或 provider 专用 tokenizer）。

可维护性改进：线性 compress() 流程（summarize → aggressive → truncate）移除了递归深度计数器，单次读取即可理解全部降级路径；_compress_aggressive() 接收原始 messages 而非已压缩的 compressed，规避了 F-010 的 summary-of-summary；_log_compression() 提供结构化观测字段（before/after/ratio/msg_count/strategy），使压缩行为可调试、可告警。

Prevention

字符集感知的估算启发式：任何 token 估算逻辑必须考虑字符集差异。len // 4 仅对 ASCII 成立；CJK/emoji/其他多字节脚本需分别处理。涉及多语言输入时，优先使用 litellm.token_counter 或 provider 专用 tokenizer（当前 estimate_text_tokens 的 ponytail 注释已标注此升级路径）
修改估算逻辑时审计所有依赖的截断点：本次修复同步审计了 estimate_tokens()、_summarize() 预截断、ReActEngine._should_compress() 回退路径。_truncate() 仍用 len(content) > self._max_tokens * 4 判断（OQ21 延期项），后续需同步迁移至字符集感知逻辑
测试覆盖字符集矩阵：新增测试覆盖纯 CJK、纯 ASCII、CJK+ASCII 混合、平假名/片假名、谚文等场景（tests/unit/test_context_compressor.py），验证估算器在各字符集下的正确性
避免 summary-of-summary：多级压缩时，后续阶段应接收原始输入而非前级压缩产物，防止信息逐级失真（F-010 教训）
结构化日志先行：压缩是黑盒操作，必须输出 before/after token、压缩比、消息数、策略等结构化字段，便于线上问题定位

References

Plan: docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md（3 轮 ce-doc-review，收敛到 U1 + U3）
PR #21: feat/context-compressor-cjk（commits be45fe4 + 3a05c4d）
Upstream context: docs/plans/2026-06-24-004-feat-long-horizon-reliability-optimization-plan.md（headroom 压缩引入点）
Residual: OQ21（_truncate() * 4 一致性，P2 manual 跟进）

8.2 KiB Raw Blame History Unescape Escape