feat(compressor): CJK-aware token estimation + linear compress flow #21
Summary
Implements U1 (CJK-aware token estimation) and U3 (linear compress flow + structured logging) from `docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md`.

U1: CJK-aware token estimation
- `estimate_text_tokens()`: CJK 1:1 / ASCII 4:1 heuristic
- `_is_cjk()`: covers CJK Unified Ideographs, Hiragana/Katakana, Hangul Syllables
- `estimate_tokens()` now uses `estimate_text_tokens()` (was `len(content) // 4`, which underestimates CJK text by about 4x)
- `_summarize()` pre-truncate uses CJK-aware `max_chars = max_input_tokens` (was `* 4`, which allowed CJK input up to 4x the token budget)
- `ReActEngine._should_compress()` fallback uses `estimate_text_tokens` for compressors without `should_compress()`

U3: Linear compress flow + structured logging
- `compress()` changed to a linear flow: summarize -> aggressive -> truncate (removed recursive `_compression_depth`)
- `_compress_aggressive()` now receives the original `messages` (not `compressed`) to avoid summary-of-summary (F-010)
- `_log_compression()`: structured info log with tokens before/after, ratio, message counts, strategy

Review Fixes (ce-code-review)
Applied 4 findings from `mode:agent` review:
- `_summarize()`: `max_chars = max_input_tokens` (was `* 4`, which allowed 4x the CJK token budget)
- `test_summarize_cjk_pre_truncation` (CJK truncation coverage)
- `test_should_compress_cjk_fallback_path` (react.py fallback coverage)
- Truncation assertion checks the `...[truncated]` marker, not just length

Also applied `ce-simplify-code`: `estimate_tokens()` -> `sum()` generator one-liner.

Residual Review Findings
- `_truncate()` `* 4` char assumption is inconsistent with CJK estimation. Documented in the plan file as OQ21: conservative truncation is safe on the truncate fallback path, and expanding U1 scope to the truncate path was deliberately deferred.
- `_is_cjk()` does not cover full-width punctuation. Plan R1 explicitly limits scope to CJK Unified Ideographs + kana + hangul; full-width punctuation handling is deferred.
- `estimate_text_tokens()` iterates char-by-char. Not actionable for typical message sizes; advisory only.

Test Results
`test_context_compressor.py` + `test_react_compression.py` + `test_compressor_auxiliary.py` + `test_compression_strategy.py`: 99 passed

Plan Reference
`docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md` (3 rounds of ce-doc-review, converged to U1 + U3)
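
For reviewers who want a concrete picture of the U1 heuristic listed above, here is a minimal sketch. It is not the code in this branch. The function names mirror the ones in the summary, but the exact code point ranges, the rounding of the ASCII share, and the `{"content": ...}` message shape are assumptions made for illustration.

```python
from typing import Dict, Iterable, Optional


def _is_cjk(ch: str) -> bool:
    """Illustrative check for the scripts named in U1 (ranges are assumed)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul Syllables
    )


def estimate_text_tokens(text: str) -> int:
    """Count CJK characters 1:1 and all other characters at 4 chars per token."""
    cjk = sum(1 for ch in text if _is_cjk(ch))
    other = len(text) - cjk
    # Floor division for the non-CJK share is an assumption of this sketch.
    return cjk + other // 4


def estimate_tokens(messages: Iterable[Dict[str, Optional[str]]]) -> int:
    """sum() generator one-liner over message contents (message shape assumed)."""
    return sum(estimate_text_tokens(m.get("content") or "") for m in messages)
```

Under these assumptions a four-character Chinese string is estimated at 4 tokens, where `len(content) // 4` would have reported 1. Full-width punctuation falls into the non-CJK share, which matches the residual finding noted above.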
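
The U3 linear flow can be pictured in a similar way. The sketch below is hypothetical: `compress_linear`, its parameters, and the log message format are invented for illustration and are not the API in this branch. It tries each stage once in a fixed order, always passes the original messages to a stage (the point of F-010), and emits one structured log line at the end. Passing the original messages to every stage, including truncate, is a simplification of this sketch.

```python
import logging
from typing import Callable, Dict, List, Sequence, Tuple

logger = logging.getLogger("compressor_sketch")

Message = Dict[str, str]
Strategy = Callable[[List[Message]], List[Message]]


def compress_linear(
    messages: List[Message],
    budget: int,
    strategies: Sequence[Tuple[str, Strategy]],
    count_tokens: Callable[[List[Message]], int],
) -> Tuple[List[Message], str]:
    """Run stages in order without recursion and return (result, stage name)."""
    tokens_before = count_tokens(messages)
    result, used = messages, "none"
    if tokens_before > budget:
        for name, strategy in strategies:
            # Each stage starts from the original messages, never from an
            # earlier stage's output, so a summary is never re-summarized.
            result, used = strategy(messages), name
            if count_tokens(result) <= budget:
                break
        # If no stage fits the budget, the last stage's output is returned.
    tokens_after = count_tokens(result)
    ratio = tokens_after / tokens_before if tokens_before else 1.0
    logger.info(
        "compression strategy=%s tokens_before=%d tokens_after=%d ratio=%.2f "
        "messages_before=%d messages_after=%d",
        used, tokens_before, tokens_after, ratio, len(messages), len(result),
    )
    return result, used
```

A caller would pass the stages as an ordered list such as `[("summarize", ...), ("aggressive", ...), ("truncate", ...)]`. Because the loop has a fixed length, no depth counter is needed to bound the work.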