feat(compressor): CJK-aware token estimation + linear compress flow #21

Merged
fischer merged 3 commits from feat/context-compressor-cjk into main 2026-07-03 09:40:29 +08:00

Summary

Implements U1 (CJK-aware token estimation) and U3 (linear compress flow + structured logging) from docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md.

U1: CJK-aware token estimation

  • Added estimate_text_tokens() — CJK 1:1 / ASCII 4:1 heuristic
  • Added _is_cjk() — covers CJK Unified Ideographs, Hiragana/Katakana, Hangul Syllables
  • estimate_tokens() now uses estimate_text_tokens() (was len(content) // 4, which underestimates CJK text by about 4x)
  • _summarize() pre-truncation now uses a CJK-aware budget, max_chars = max_input_tokens (was max_input_tokens * 4, which let CJK input through at up to 4x the token budget)
  • ReActEngine._should_compress() fallback uses estimate_text_tokens for compressors without should_compress()
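
The heuristic above can be sketched roughly as follows. This is an illustration, not the merged code. The function names and the 1:1 / 4:1 ratios come from this PR; the exact code point ranges, the floor division for non-CJK characters, and the message shape (dicts with a `content` key) are assumptions.

```python
def _is_cjk(ch: str) -> bool:
    """True for the scripts listed in this PR (ranges are assumed, not copied)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x309F   # Hiragana
        or 0x30A0 <= cp <= 0x30FF   # Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul Syllables
    )


def estimate_text_tokens(text: str) -> int:
    """Count CJK characters at 1 token each and everything else at 4 chars per token."""
    cjk = sum(1 for ch in text if _is_cjk(ch))
    other = len(text) - cjk
    return cjk + other // 4  # floor division is an assumption


def estimate_tokens(messages: list[dict]) -> int:
    """Sum the per-message estimates (message shape is hypothetical)."""
    return sum(estimate_text_tokens(m.get("content") or "") for m in messages)


if __name__ == "__main__":
    text = "你好世界"
    print(len(text) // 4, estimate_text_tokens(text))  # old vs. CJK-aware estimate
```

For a four-character Chinese string the old `len(content) // 4` rule yields 1 token, while the CJK-aware estimate yields 4, which is the 4x gap this PR addresses.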

U3: Linear compress flow + structured logging

  • Rewrote compress() to linear flow: summarize -> aggressive -> truncate (removed recursive _compression_depth)
  • _compress_aggressive() now receives original messages (not compressed) to avoid summary-of-summary (F-010)
  • Added _log_compression() — structured info log with tokens before/after, ratio, message counts, strategy
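
A minimal sketch of the linear flow and the structured log record, under stated assumptions: the stage callables, the `compress_linear` and `log_compression` names, the logger name, and the `extra` field names are all hypothetical. Only the stage order, the "original messages" rule, and the logged quantities are taken from this PR.

```python
import logging
from typing import Callable

logger = logging.getLogger("compressor_sketch")  # hypothetical logger name

Messages = list[dict]
Stage = tuple[str, Callable[[Messages], Messages]]


def log_compression(strategy: str, before: Messages, after: Messages,
                    tokens_before: int, tokens_after: int) -> None:
    """Emit one structured info record per compress call."""
    ratio = tokens_after / tokens_before if tokens_before else 1.0
    logger.info(
        "context compressed",
        extra={
            "strategy": strategy,
            "tokens_before": tokens_before,
            "tokens_after": tokens_after,
            "ratio": round(ratio, 3),
            "messages_before": len(before),
            "messages_after": len(after),
        },
    )


def compress_linear(messages: Messages, budget: int,
                    estimate: Callable[[Messages], int],
                    stages: list[Stage]) -> tuple[Messages, str]:
    """Try each stage once, in order, and stop at the first result that fits."""
    result, strategy = messages, "none"
    for strategy, stage in stages:
        # Each stage gets the ORIGINAL messages, never a previous stage's
        # output, so a summary is never summarized again.
        result = stage(messages)
        if estimate(result) <= budget:
            break
    log_compression(strategy, messages, result, estimate(messages), estimate(result))
    return result, strategy
```

Passing the stages as an ordered list such as `[("summarize", ...), ("aggressive", ...), ("truncate", ...)]` bounds the work at one pass per stage, so no recursion depth counter is needed.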

Review Fixes (ce-code-review)

Applied 4 findings from mode:agent review:

  • P1: _summarize() max_chars = max_input_tokens (was * 4, allowed 4x CJK token budget)
  • P1: Added test_summarize_cjk_pre_truncation (CJK truncation coverage)
  • P2: Added test_should_compress_cjk_fallback_path (react.py fallback coverage)
  • P3: Strengthened truncate test assertion (verify ...[truncated] marker, not just length)
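
The first P1 finding comes down to simple arithmetic, sketched below. The `pre_truncate` helper and its `chars_per_token` parameter are hypothetical; the real `_summarize()` truncation logic is not shown in this PR.

```python
def pre_truncate(text: str, max_input_tokens: int, chars_per_token: int) -> str:
    """Cut text to a character budget derived from a token budget."""
    max_chars = max_input_tokens * chars_per_token
    return text[:max_chars]


cjk_text = "汉" * 10_000

# Old budget: 4 chars per token. For CJK at roughly 1 token per character,
# 4000 characters is about 4000 tokens, four times the intended 1000.
old = pre_truncate(cjk_text, max_input_tokens=1000, chars_per_token=4)

# New budget: max_chars = max_input_tokens, so CJK input stays near 1000 tokens.
new = pre_truncate(cjk_text, max_input_tokens=1000, chars_per_token=1)

print(len(old), len(new))
```

The trade-off is that ASCII-heavy input is now cut to about a quarter of its token budget before summarization, which errs on the conservative side.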

Also applied ce-simplify-code: estimate_tokens() -> sum() generator one-liner.

Residual Review Findings

  • P2 (manual, OQ21): _truncate() still assumes 4 chars per token, which is inconsistent with the CJK-aware estimation. Documented in the plan file as OQ21: conservative truncation is safe on the truncate fallback path, and expanding U1 scope to the truncate path was deliberately deferred.
  • P2 (gated_auto, skipped): _is_cjk() does not cover full-width punctuation. Plan R1 explicitly limits scope to CJK Unified Ideographs + kana + hangul; full-width punctuation handling deferred.
  • P3 (advisory): estimate_text_tokens() iterates char-by-char. Not actionable for typical message sizes; advisory only.
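
To illustrate the skipped P2 finding, the check below shows that common full-width punctuation falls outside the script ranges named in this PR. The concrete code point ranges and the `in_listed_ranges` helper are assumptions for illustration, not the merged `_is_cjk()`.

```python
LISTED_RANGES = [
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0xAC00, 0xD7A3),  # Hangul Syllables
]


def in_listed_ranges(ch: str) -> bool:
    """True if the character's code point is inside one of the listed ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LISTED_RANGES)


for ch in "你，。！":
    # The ideograph is covered; the three punctuation marks are not.
    print(f"U+{ord(ch):04X} covered={in_listed_ranges(ch)}")
```

Under these assumed ranges, such punctuation is counted at the non-CJK 4:1 rate, so punctuation-heavy CJK text would still be slightly underestimated.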

Test Results

  • test_context_compressor.py + test_react_compression.py + test_compressor_auxiliary.py + test_compression_strategy.py: 99 passed
  • Ruff check + format: clean

Plan Reference

docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md (3 rounds of ce-doc-review, converged to U1 + U3)

fischer added 2 commits 2026-07-03 08:05:14 +08:00
be45fe42c5 feat(compressor): CJK-aware token estimation + linear compress flow
U1: Add estimate_text_tokens() module-level function with CJK 1:1 / ASCII
4:1 heuristic. Update estimate_tokens(), _summarize() pre-truncation, and
react.py _should_compress() fallback to use it. Fixes 4x token
underestimation for Chinese/Japanese/Korean conversations.

U3: Rewrite compress() from recursive _compression_depth to linear flow
(summarize -> aggressive -> truncate). Add _log_compression() structured
logging with tokens_before/after/ratio/strategy. Remove _compression_depth
parameter from compress() and _compress_aggressive().

Per plan: docs/plans/2026-07-02-003-feat-context-compressor-cjk-prefix-enhancement-plan.md
Test / backend-test (pull_request) Has been cancelled Details
Test / frontend-unit (pull_request) Has been cancelled Details
Test / api-e2e (pull_request) Has been cancelled Details
Test / frontend-e2e (pull_request) Has been cancelled Details
3a05c4d1e6
fix(review): CJK pre-truncate budget + simplify estimate_tokens + test gaps
Apply 4 ce-code-review findings:
- P1: _summarize() max_chars = max_input_tokens (was * 4, allowed 4x CJK budget)
- P1: add test_summarize_cjk_pre_truncation (CJK truncation coverage)
- P2: add test_should_compress_cjk_fallback_path (react.py fallback coverage)
- P3: strengthen truncate test assertion (verify marker, not just length)

Also apply ce-simplify-code: estimate_tokens() -> sum() generator one-liner.

Tests: 99 passed. Ruff: clean.
fischer added 1 commit 2026-07-03 09:40:12 +08:00
Test / backend-test (pull_request) Has been cancelled Details
Test / frontend-unit (pull_request) Has been cancelled Details
Test / api-e2e (pull_request) Has been cancelled Details
Test / frontend-e2e (pull_request) Has been cancelled Details
027f7909aa
docs(solutions): CJK token estimation undercount fix
Document the ContextCompressor CJK 4x underestimation bug and fix:
- estimate_text_tokens() CJK 1:1 / ASCII 4:1 heuristic
- _summarize() max_chars budget fix (P1: was * 4, allowed 4x CJK budget)
- Linear compress flow + structured logging
- Prevention: charset-aware heuristics, audit dependent truncation points
fischer merged commit 00b2dad36e into main 2026-07-03 09:40:29 +08:00
Reference: fischer/fischer-agentkit#21