feat(compressor): CJK-aware token estimation + linear compress flow #21

fischer · 2026-07-03T08:05:14+08:00

Summary

Implements U1 (CJK-aware token estimation) and U3 (linear compress flow + structured logging) from docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md.

U1: CJK-aware token estimation

Added estimate_text_tokens(): a heuristic that counts one token per CJK character (1:1) and one token per four other characters (4:1)
Added _is_cjk(): covers CJK Unified Ideographs, Hiragana/Katakana, and Hangul Syllables
estimate_tokens() now uses estimate_text_tokens() (was len(content) // 4, which underestimates CJK text by about 4x)
_summarize() pre-truncation now uses the CJK-aware max_chars = max_input_tokens (was max_input_tokens * 4, which let CJK input reach 4x the token budget)
ReActEngine._should_compress() fallback now uses estimate_text_tokens() for compressors that do not implement should_compress()

U3: Linear compress flow + structured logging

Rewrote compress() as a linear flow: summarize -> aggressive -> truncate (removed the recursive _compression_depth parameter)
_compress_aggressive() now receives the original messages (not the already-summarized compressed list) to avoid producing a summary of a summary (F-010)
Added _log_compression(): a structured info log with tokens before/after, ratio, message counts, and strategy
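
The linear flow can be pictured with the sketch below. Everything in it is hypothetical: compress_linear, the stage callables, the stand-in estimator, and the log field names are illustrative and are not the signatures of compress() or _log_compression(). Passing the original messages to every stage is also a simplification; the PR confirms that only for the aggressive stage.

```python
import logging
from typing import Callable

logger = logging.getLogger("compressor_sketch")

Messages = list[dict]
Stage = tuple[str, Callable[[Messages], Messages]]


def estimate(messages: Messages) -> int:
    # Stand-in estimator (4 chars per token); the PR uses a CJK-aware one.
    return sum(len(m["content"]) // 4 for m in messages)


def compress_linear(
    messages: Messages, budget: int, stages: list[Stage]
) -> tuple[Messages, str]:
    """Try each stage in order, with no recursion, until the result fits."""
    before = estimate(messages)
    result, used = messages, "none"
    if before > budget:
        for name, stage in stages:
            # Each stage starts from the original messages, so a later
            # stage never compresses the output of an earlier one.
            result, used = stage(messages), name
            if estimate(result) <= budget:
                break
    after = estimate(result)
    ratio = after / before if before else 1.0
    logger.info(
        "compression strategy=%s tokens_before=%d tokens_after=%d "
        "ratio=%.2f messages_before=%d messages_after=%d",
        used, before, after, ratio, len(messages), len(result),
    )
    return result, used


def shrink_to(n_chars: int) -> Callable[[Messages], Messages]:
    # Toy stage that just cuts each message to n_chars characters.
    return lambda msgs: [{"content": m["content"][:n_chars]} for m in msgs]


stages = [
    ("summarize", shrink_to(200)),
    ("aggressive", shrink_to(100)),
    ("truncate", shrink_to(40)),
]
history = [{"content": "x" * 400}]
compressed, strategy = compress_linear(history, budget=30, stages=stages)
```

In this toy run the summarize stage still exceeds the budget, so the loop moves on to the aggressive stage and stops there; the truncate stage is never reached.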

Review Fixes (ce-code-review)

Applied 4 findings from mode:agent review:

P1: _summarize() now sets max_chars = max_input_tokens (was max_input_tokens * 4, which allowed CJK input to use 4x the token budget)
P1: Added test_summarize_cjk_pre_truncation (CJK truncation coverage)
P2: Added test_should_compress_cjk_fallback_path (react.py fallback coverage)
P3: Strengthened truncate test assertion (verify ...[truncated] marker, not just length)
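
The first P1 finding comes down to simple arithmetic, sketched below. pre_truncate and its chars_per_token parameter are hypothetical helpers for illustration only; the real _summarize() is not shown in this PR description.

```python
def pre_truncate(text: str, max_input_tokens: int, chars_per_token: int) -> str:
    """Cut text to a character limit derived from a token budget."""
    max_chars = max_input_tokens * chars_per_token
    return text[:max_chars]


cjk_text = "漢" * 8000        # roughly 8000 tokens under the 1:1 CJK heuristic
budget = 2000                 # max_input_tokens

old = pre_truncate(cjk_text, budget, chars_per_token=4)  # old: max_chars = budget * 4
new = pre_truncate(cjk_text, budget, chars_per_token=1)  # new: max_chars = budget

print(len(old), len(new))
```

With the old limit the 8000-character CJK input passes through uncut, which is about 4x the intended budget; with the new limit it is cut to 2000 characters. The trade-off is that pure ASCII input is now cut earlier than strictly necessary, since 2000 ASCII characters are only about 500 tokens under the 4:1 heuristic.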

Also applied ce-simplify-code: estimate_tokens() -> sum() generator one-liner.

Residual Review Findings

P2 (manual, OQ21): _truncate() still assumes 4 characters per token, which is inconsistent with the CJK-aware estimate. Documented in the plan file as OQ21: conservative truncation is safe on the truncate fallback path, and expanding the U1 scope to the truncate path was deliberately deferred.
P2 (gated_auto, skipped): _is_cjk() does not cover full-width punctuation. Plan R1 explicitly limits scope to CJK Unified Ideographs + kana + hangul; full-width punctuation handling deferred.
P3 (advisory): estimate_text_tokens() iterates char-by-char. Not actionable for typical message sizes; advisory only.
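
The full-width punctuation gap noted above can be demonstrated with a small check. The ranges below are assumptions derived from the Unicode blocks named in this PR, not a copy of _is_cjk(), and in_assumed_ranges is a hypothetical helper.

```python
# Assumed code point ranges for the blocks listed in the PR.
ASSUMED_CJK_RANGES = (
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3040, 0x30FF),  # Hiragana + Katakana
    (0xAC00, 0xD7AF),  # Hangul Syllables
)


def in_assumed_ranges(ch: str) -> bool:
    return any(lo <= ord(ch) <= hi for lo, hi in ASSUMED_CJK_RANGES)


sample = "你好，世界。"
uncovered = [ch for ch in sample if not in_assumed_ranges(ch)]
print(uncovered)
```

Here the full-width comma (U+FF0C) and the ideographic full stop (U+3002) fall outside the assumed ranges, so a heuristic built on these ranges would count them at the 4:1 rate rather than 1:1.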

Test Results

test_context_compressor.py + test_react_compression.py + test_compressor_auxiliary.py + test_compression_strategy.py: 99 passed
Ruff check + format: clean

Plan Reference

docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md (3 rounds of ce-doc-review, converged to U1 + U3)


fischer added 2 commits 2026-07-03 08:05:14 +08:00

be45fe42c5 feat(compressor): CJK-aware token estimation + linear compress flow

U1: Add estimate_text_tokens() module-level function with CJK 1:1 / ASCII
4:1 heuristic. Update estimate_tokens(), _summarize() pre-truncation, and
react.py _should_compress() fallback to use it. Fixes 4x token
underestimation for Chinese/Japanese/Korean conversations.

U3: Rewrite compress() from recursive _compression_depth to linear flow
(summarize -> aggressive -> truncate). Add _log_compression() structured
logging with tokens_before/after/ratio/strategy. Remove _compression_depth
parameter from compress() and _compress_aggressive().

Per plan: docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md

Test / backend-test (pull_request) Has been cancelled

Test / frontend-unit (pull_request) Has been cancelled

Test / api-e2e (pull_request) Has been cancelled

Test / frontend-e2e (pull_request) Has been cancelled

3a05c4d1e6 fix(review): CJK pre-truncate budget + simplify estimate_tokens + test gaps

Apply 4 ce-code-review findings:
- P1: _summarize() max_chars = max_input_tokens (was * 4, allowed 4x CJK budget)
- P1: add test_summarize_cjk_pre_truncation (CJK truncation coverage)
- P2: add test_should_compress_cjk_fallback_path (react.py fallback coverage)
- P3: strengthen truncate test assertion (verify marker, not just length)

Also apply ce-simplify-code: estimate_tokens() -> sum() generator one-liner.

Tests: 99 passed. Ruff: clean.

fischer added 1 commit 2026-07-03 09:40:12 +08:00

Test / backend-test (pull_request) Has been cancelled

Test / frontend-unit (pull_request) Has been cancelled

Test / api-e2e (pull_request) Has been cancelled

Test / frontend-e2e (pull_request) Has been cancelled

027f7909aa docs(solutions): CJK token estimation undercount fix

Document the ContextCompressor CJK 4x underestimation bug and fix:
- estimate_text_tokens() CJK 1:1 / ASCII 4:1 heuristic
- _summarize() max_chars budget fix (P1: was * 4, allowed 4x CJK budget)
- Linear compress flow + structured logging
- Prevention: charset-aware heuristics, audit dependent truncation points

fischer merged commit 00b2dad36e into main

2026-07-03 09:40:29 +08:00

fischer referenced this issue from a commit

2026-07-03 09:40:31 +08:00

feat(compressor): CJK-aware token estimation + linear compress flow (#21)

Reference: fischer/fischer-agentkit#21