feat(compressor): CJK-aware token estimation + linear compress flow #21

Merged
fischer merged 3 commits from feat/context-compressor-cjk into main 2026-07-03 09:40:29 +08:00

3 Commits

Author SHA1 Message Date
chiguyong 027f7909aa docs(solutions): CJK token estimation undercount fix
Test / backend-test (pull_request) Has been cancelled
Test / frontend-unit (pull_request) Has been cancelled
Test / api-e2e (pull_request) Has been cancelled
Test / frontend-e2e (pull_request) Has been cancelled
Document the ContextCompressor CJK 4x underestimation bug and fix:
- estimate_text_tokens() CJK 1:1 / ASCII 4:1 heuristic
- _summarize() max_chars budget fix (P1: was max_input_tokens * 4, which let CJK text through at up to 4x the token budget)
- Linear compress flow + structured logging
- Prevention: charset-aware heuristics, audit dependent truncation points
2026-07-03 09:40:09 +08:00
chiguyong 3a05c4d1e6 fix(review): CJK pre-truncate budget + simplify estimate_tokens + test gaps
Test / backend-test (pull_request) Has been cancelled
Test / frontend-unit (pull_request) Has been cancelled
Test / api-e2e (pull_request) Has been cancelled
Test / frontend-e2e (pull_request) Has been cancelled
Apply 4 ce-code-review findings:
- P1: _summarize() max_chars = max_input_tokens (was max_input_tokens * 4, which let CJK text through at up to 4x the token budget)
- P1: add test_summarize_cjk_pre_truncation (CJK truncation coverage)
- P2: add test_should_compress_cjk_fallback_path (react.py fallback coverage)
- P3: strengthen truncate test assertion (verify marker, not just length)
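
The first P1 finding can be illustrated with a small sketch. This is not the project's _summarize() code: pre_truncate(), the chars_per_token parameter, and the marker string are hypothetical names chosen for illustration. It only shows why a character budget of max_input_tokens * 4 keeps four times too much text when each character costs roughly one token.

```python
# Hypothetical marker; the real truncation marker is not shown in this PR.
TRUNCATION_MARKER = "\n...[truncated]"


def pre_truncate(text: str, max_input_tokens: int, chars_per_token: int = 1) -> str:
    """Cut text to a character budget derived from a token budget.

    chars_per_token=4 mirrors the old ASCII-only assumption;
    chars_per_token=1 is the conservative budget that is safe for CJK.
    """
    max_chars = max_input_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    return text[:max_chars] + TRUNCATION_MARKER


cjk_text = "汉" * 10_000  # roughly one token per character

old = pre_truncate(cjk_text, max_input_tokens=1000, chars_per_token=4)
new = pre_truncate(cjk_text, max_input_tokens=1000, chars_per_token=1)

# Old budget keeps 4000 CJK characters (about 4000 tokens for a 1000-token limit).
print(old.count("汉"), new.count("汉"))
```

A test along the lines of the P3 finding would assert that the output ends with the marker, not only that it is shorter than the input.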

Also apply ce-simplify-code: estimate_tokens() -> sum() generator one-liner.

Tests: 99 passed. Ruff: clean.
2026-07-03 08:03:06 +08:00
chiguyong be45fe42c5 feat(compressor): CJK-aware token estimation + linear compress flow
U1: Add estimate_text_tokens() module-level function with CJK 1:1 / ASCII
4:1 heuristic. Update estimate_tokens(), _summarize() pre-truncation, and
react.py _should_compress() fallback to use it. Fixes 4x token
underestimation for Chinese/Japanese/Korean conversations.
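
A minimal sketch of the CJK 1:1 / ASCII 4:1 heuristic described above. The function name matches the commit message, but the body is an assumption: the Unicode ranges treated as CJK and the round-up of the non-CJK part are illustrative choices, not necessarily what the merged code does.

```python
def _is_cjk(ch: str) -> bool:
    """Assumed CJK test covering a few common Unicode blocks."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Unified Ideographs Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul Syllables
    )


def estimate_text_tokens(text: str) -> int:
    """Estimate tokens: one per CJK character, one per four other characters."""
    cjk = sum(1 for ch in text if _is_cjk(ch))
    other = len(text) - cjk
    return cjk + (other + 3) // 4  # round the non-CJK part up


def naive_estimate(text: str) -> int:
    """The charset-blind len/4 estimate, shown only for comparison."""
    return len(text) // 4


sample = "你好世界" * 100  # 400 CJK characters
print(naive_estimate(sample), estimate_text_tokens(sample))  # 100 vs 400
```

On pure CJK input the charset-blind estimate comes out at a quarter of the CJK-aware one, which is the 4x undercount this commit addresses.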

U3: Rewrite compress() from recursive _compression_depth to linear flow
(summarize -> aggressive -> truncate). Add _log_compression() structured
logging with tokens_before/after/ratio/strategy. Remove _compression_depth
parameter from compress() and _compress_aggressive().
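
The linear flow can be sketched as a single loop over stages, with no recursion depth to track. Everything here is an assumption made for illustration: compress_linear(), log_compression(), the stand-in stages, and the choice to feed each stage the previous stage's output are hypothetical and may differ from the real compress() and _log_compression().

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("compressor_sketch")

# A stage takes (text, token_budget) and returns shorter text.
Stage = Callable[[str, int], str]


def log_compression(strategy: str, tokens_before: int, tokens_after: int) -> None:
    """Emit one structured record per stage via the logging `extra` dict."""
    ratio = tokens_after / tokens_before if tokens_before else 1.0
    logger.info(
        "compression step",
        extra={
            "strategy": strategy,
            "tokens_before": tokens_before,
            "tokens_after": tokens_after,
            "ratio": round(ratio, 3),
        },
    )


def compress_linear(
    text: str,
    budget: int,
    stages: List[Tuple[str, Stage]],
    estimate: Callable[[str], int] = len,
) -> Tuple[str, str]:
    """Try each stage in order and stop at the first result within budget."""
    tokens_before = estimate(text)
    if tokens_before <= budget:
        return text, "none"
    current, used = text, "none"
    for name, stage in stages:
        current, used = stage(current, budget), name
        tokens_after = estimate(current)
        log_compression(name, tokens_before, tokens_after)
        if tokens_after <= budget:
            break
    return current, used


# Stand-in stages; a real summarize stage would call a model.
STAGES: List[Tuple[str, Stage]] = [
    ("summarize", lambda t, b: t[: len(t) // 2]),
    ("aggressive", lambda t, b: t[: len(t) // 4]),
    ("truncate", lambda t, b: t[:b]),
]

result, strategy = compress_linear("x" * 100, budget=10, stages=STAGES)
print(len(result), strategy)
```

Because truncate is the last stage and cuts directly to the budget, the loop terminates after at most three steps, which is the property the recursive _compression_depth parameter previously had to guard.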

Per plan: docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md
2026-07-03 07:32:57 +08:00