docs(ce-compound): document the portal-platform security/reliability fix batch

Document the 10 P1/P2/P3 fixes from the ce-code-review fix batch (commit 53faa60):
- P1: WeCom replay, cross-user cache leak, webhook exception storm, shutdown leak
- P2: Feishu TTL, unbounded task set, quota N+1, redundant SHA-256, unused parameter
- P3: DIRECT_CHAT deduplication

Add docs/solutions/security-issues/portal-platform-security-reliability-fixes.md.
Add 3 domain terms to CONCEPTS.md: Per-User Cache Namespace, Webhook Signature Freshness, Webhook Backpressure.
2026-06-26 01:47:57 +08:00 · 2026-06-26 01:47:57 +08:00 · 75e9b58e46
parent 53faa60472
commit 75e9b58e46
2 changed files with 307 additions and 0 deletions
--- a/CONCEPTS.md
+++ b/CONCEPTS.md
@@ -29,3 +29,14 @@ A phase dynamically inserted into a team plan when divergence is detected betwee

 ### Resume
 The act of rebuilding a crashed pipeline's runtime state from persisted checkpoints. Restores completed and failed phase statuses, rebuilds runtime counters, and re-persists any dynamically inserted phases so the restored plan matches what was executing at crash time.
+
+## Channels & Caching
+
+### Per-User Cache Namespace
+A security pattern for LLM response caching where the cache key includes `user_id` so that cached responses are isolated per user. When `per_user_namespace=True`, anonymous requests (`user_id=None`) must be rejected from caching — they cannot be namespaced and would pollute other users' cache blocks. `should_cache()` enforces this by returning `False` when the namespace is on but `user_id` is missing.
+
+### Webhook Signature Freshness
+The timestamp validation layered on top of webhook signature verification that bounds the replay window. Computes `abs(now - ts)` against a max-age constant (e.g. 300s) and rejects requests outside the window — the absolute value defends against both historical replays and future-dated requests. Without this, a valid signature alone provides zero replay protection.
+
+### Webhook Backpressure
+The pattern of bounding a webhook handler's in-flight background task set with a cap (typically `max_concurrent * 2`) and returning HTTP 429 when exceeded. The 2x margin absorbs short spikes; the 429 forces clients to back off rather than snowballing memory and coroutine exhaustion. The task set is also awaited on app shutdown so in-flight replies are not dropped.
--- a/docs/solutions/security-issues/portal-platform-security-reliability-fixes.md
+++ b/docs/solutions/security-issues/portal-platform-security-reliability-fixes.md
@@ -0,0 +1,296 @@
+---
+title: "Portal platform security & reliability fixes — channels, LLM cache, server shutdown"
+date: 2026-06-26
+category: docs/solutions/security-issues
+module: channels/llm/server
+problem_type: security_issue
+component: service_object
+severity: high
+symptoms:
+  - "WeCom webhook accepted replayed requests within an unlimited time window (no timestamp freshness check)"
+  - "Anonymous LLM requests polluted per-user cache namespace when user_id was None (per_user_namespace=True ignored)"
+  - "Webhook receive_message() exceptions returned HTTP 500, triggering platform retry storms"
+  - "App shutdown leaked httpx connections and dropped in-flight IM replies (no await on _pending_webhook_tasks / close_all_adapters)"
+  - "Feishu token cache TTL was 300s instead of ~6900s, causing token refresh storms (24x too short)"
+root_cause: missing_validation
+resolution_type: code_fix
+tags: [webhook-security, replay-attack, cache-isolation, shutdown-cleanup, token-ttl, backpressure, n-plus-1, code-review-fixes]
+related_components:
+  - channels/wecom
+  - channels/feishu
+  - llm/cache
+  - llm/cache_key
+  - llm/gateway
+  - llm/config
+  - server/app
+  - server/routes/channels
+---
+
+# Portal platform security & reliability fixes — channels, LLM cache, server shutdown
+
+## Problem
+
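+A ce-code-review fix batch (commit `53faa60`) on the `feat/portal-platform-evolution` branch addressed 10 security, reliability, and performance defects found in code review. The defects span channel ingestion (WeCom/Feishu webhooks), the LLM gateway (cache namespace, quota queries), and the application lifecycle (shutdown leaks). By severity: 4 P1 (security/reliability), 5 P2 (efficiency/maintainability), and 1 P3 (deduplication).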
+
+## Symptoms
+
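+- **WeCom webhook replay risk**: `verify_signature()` checked only that the signature was correct, not that the timestamp was fresh, so an attacker could replay a previously valid request indefinitely.
+- **Cross-user LLM cache leak**: `should_cache()` discarded its `user_id` argument with `_ = user_id`, so with `per_user_namespace` enabled a `user_id=None` request could still hit another user's cache entries.
+- **Webhook exceptions triggering a 500 retry storm**: a parse failure in `receive_message()` raised an uncaught exception; the platform received HTTP 500 and kept retrying under its backoff policy, amplifying the failure.
+- **Resource leak on shutdown**: the shutdown path neither called `close_all_adapters()` nor awaited `_pending_webhook_tasks`, leaking httpx connections and dropping IM replies.
+- **Quota N+1 queries and redundant cache-key hashing**: the quota check called `get_usage()` 4 times per department per period (only 2 unique queries were actually needed); `generate_cache_key` ran SHA-256 on every component and then hashed the result once more, for 8-10 redundant hashes.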
+
+## What Didn't Work
+
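+- **Pre-existing test failures (environment issue, not a code issue)**: with `litellm` not installed, the tests that depend on it are skipped locally or report a collection error; with `jieba` not installed, the word-segmentation tests fail. These are missing environment dependencies unrelated to this batch and should not be mistaken for regressions.
+- **WeCom tests used the fixed timestamp `1609459200` (2021-01-01)**: once the `_SIGNATURE_MAX_AGE_SECONDS = 300` freshness check was added, that timestamp fell far outside the 5-minute window and the tests failed immediately. They need to generate the timestamp dynamically with `int(time.time())`.
+- **Cache tests called `should_cache()` with the default `user_id=None`**: after the security fix, `per_user_namespace=True` combined with `user_id=None` returns `False`, so test cases that relied on the default argument no longer cache. They need to pass `user_id` explicitly or disable `per_user_namespace` in the test.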
+
+## Solution
+
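+### P1 #1 — WeCom webhook replay attack fix
+
+**File**: [src/agentkit/channels/wecom.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/channels/wecom.py)
+
+**Problem**: `verify_signature()` only compared signatures and did not check timestamp freshness, leaving the replay window unbounded.
+
+**Fix**: add the constant `_SIGNATURE_MAX_AGE_SECONDS = 300` and check the absolute timestamp skew before verifying the signature.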
+
+```python
+_SIGNATURE_MAX_AGE_SECONDS = 300
+
+# In verify_signature():
+try:
+    ts_int = int(timestamp)
+except (TypeError, ValueError):
+    return False
+now = int(time.time())
+if abs(now - ts_int) > _SIGNATURE_MAX_AGE_SECONDS:
+    logger.warning(
+        "WeCom webhook timestamp outside the %ds window: ts=%s now=%d",
+        _SIGNATURE_MAX_AGE_SECONDS, timestamp, now,
+    )
+    return False
+```
+
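+**Note**: the absolute value `abs(now - ts_int)` guards against both future-dated timestamps and historical replays; the 300s window balances tolerance for clock drift against attack surface.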
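A minimal, self-contained sketch of the check order described here: freshness first, signature second. The helper names `sign` and `verify_fresh_signature`, the injected `now` parameter, and the SHA-1-over-sorted-fields scheme are assumptions for illustration only; they are not the code in `wecom.py`.

```python
import hashlib
import hmac
import time

MAX_AGE_SECONDS = 300  # mirrors _SIGNATURE_MAX_AGE_SECONDS from the fix


def sign(token: str, timestamp: str, nonce: str) -> str:
    """Illustrative signature (assumed scheme): SHA-1 over the sorted fields."""
    joined = "".join(sorted([token, timestamp, nonce]))
    return hashlib.sha1(joined.encode()).hexdigest()


def verify_fresh_signature(token, timestamp, nonce, signature, now=None) -> bool:
    """Reject stale or future-dated timestamps before comparing signatures."""
    try:
        ts_int = int(timestamp)
    except (TypeError, ValueError):
        return False  # missing or non-numeric timestamp
    current = int(time.time()) if now is None else now
    if abs(current - ts_int) > MAX_AGE_SECONDS:
        return False  # outside the replay window in either direction
    expected = sign(token, timestamp, nonce)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Passing `now` explicitly lets a test pin the clock, which avoids the hard-coded-timestamp failure described under "What Didn't Work".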
+
+### P1 #2 — LiteLLM cache cross-user leak fix
+
+**File**: [src/agentkit/llm/cache.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/llm/cache.py)
+
+**Problem**: `should_cache()` explicitly discarded its `user_id` parameter (`_ = user_id`), so with `per_user_namespace` enabled a `user_id=None` request could still hit the cache, leaking data across users.
+
+**Fix**: rewrite `should_cache()` to enforce per-user namespace safety.
+
+```python
+def should_cache(self, kb_caching_disabled: bool = False, user_id: str | None = None) -> bool:
+    if kb_caching_disabled:
+        return False
+    if self._config.per_user_namespace and user_id is None:
+        logger.debug("should_cache: per_user_namespace on but user_id=None — skip cache")
+        return False
+    return True
+```
+
+**Note**: with `per_user_namespace` enabled, a `user_id=None` request cannot be namespaced, so it skips the cache entirely rather than hitting another user's cache block.
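The decision table is easy to check in isolation. The sketch below wraps the same logic in two hypothetical stand-ins, `CacheConfig` and `ResponseCache`, which are assumptions for illustration and not the real classes in `cache.py`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheConfig:
    """Hypothetical stand-in for the real cache configuration object."""
    per_user_namespace: bool = True


class ResponseCache:
    """Hypothetical wrapper that holds only the caching decision."""

    def __init__(self, config: CacheConfig) -> None:
        self._config = config

    def should_cache(self, kb_caching_disabled: bool = False,
                     user_id: Optional[str] = None) -> bool:
        if kb_caching_disabled:
            return False
        # An anonymous request cannot be namespaced, so it must not be cached.
        if self._config.per_user_namespace and user_id is None:
            return False
        return True


namespaced = ResponseCache(CacheConfig(per_user_namespace=True))
shared = ResponseCache(CacheConfig(per_user_namespace=False))
```

With these stand-ins, `namespaced.should_cache()` is the case that changed behavior: it now returns `False`, which is why tests relying on the default `user_id=None` had to be updated.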
+
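+### P1 #3 — Webhook exception storm defense
+
+**File**: [src/agentkit/server/routes/channels.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/server/routes/channels.py)
+
+**Problem**: when `receive_message()` raised, the route returned HTTP 500 and the platform retried indefinitely under its retry policy, producing an exception storm.
+
+**Fix**: add a catch-all exception handler after the URLVerification handling that returns `code:0` to suppress platform retries.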
+
+```python
+except Exception as exc:  # noqa: BLE001 - keep a receive_message error from returning 500 and triggering a platform retry storm
+    logger.warning("receive_message failed to parse, channel=%s: %s", channel_id, exc)
+    return {"code": 0, "msg": "invalid_payload"}
+```
+
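+**Note**: `code:0` is the platform's agreed "received" signal and stops the retries; the broad `BLE001` catch is intentional, since a webhook entry point must swallow every parsing exception.
+
+### P1 #4 — Application shutdown leak fix
+
+**File**: [src/agentkit/server/app.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/server/app.py)
+
+**Problem**: the shutdown path neither closed the channel adapters nor waited for background webhook tasks, leaking httpx connections and dropping IM replies.
+
+**Fix**: after the calendar scheduler stops, add logic that waits for the webhook tasks and closes the adapters.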
+
+```python
+from agentkit.server.routes.channels import _pending_webhook_tasks
+
+if _pending_webhook_tasks:
+    logger.info("Waiting for %d background webhook tasks to finish", len(_pending_webhook_tasks))
+    await asyncio.gather(*_pending_webhook_tasks, return_exceptions=True)
+try:
+    from agentkit.server.routes.channels import close_all_adapters
+    await close_all_adapters()
+except Exception:
+    logger.debug("close_all_adapters exception ignored")
+```
+
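+**Note**: `return_exceptions=True` ensures that one failed task does not prevent the others from finishing; exceptions from `close_all_adapters` are swallowed because raising again on the shutdown path serves no purpose.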
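The effect of `return_exceptions=True` can be seen in a small standalone run. The `reply` and `drain` coroutines below are hypothetical stand-ins for in-flight IM replies and the shutdown wait; only `asyncio.gather` itself comes from the fix.

```python
import asyncio


async def reply(name: str, fail: bool = False) -> str:
    """Stand-in for an in-flight IM reply task (hypothetical)."""
    await asyncio.sleep(0)
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name} sent"


async def drain(pending: set) -> list:
    """Wait for every pending task; failures are returned as values, not raised."""
    if not pending:
        return []
    return await asyncio.gather(*pending, return_exceptions=True)


async def main() -> list:
    pending = {
        asyncio.create_task(reply("a")),
        asyncio.create_task(reply("b", fail=True)),
        asyncio.create_task(reply("c")),
    }
    return await drain(pending)


# Result order follows set iteration order, so inspect results by type.
results = asyncio.run(main())
```

Without `return_exceptions=True`, the first failure would propagate out of `gather` to the caller while the remaining tasks were still unfinished, which is the situation the shutdown path needs to avoid.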
+
+### P2 #8 — Feishu token TTL correction
+
+**File**: [src/agentkit/channels/feishu.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/channels/feishu.py)
+
+**Problem**: `_TOKEN_CACHE_TTL = 300.0` was 24x shorter than the token's actual validity (2h), forcing a token refresh every 5 minutes and needlessly increasing QPS.
+
+**Fix**: `300.0` → `6900.0` (2h minus a 5min margin), with the comment updated to match.
+
+```python
+_TOKEN_CACHE_TTL = 6900.0  # 2h minus a 5min margin, to avoid expiry at the boundary
+```
+
+### P2 #10 — Unbounded webhook task set
+
+**File**: [src/agentkit/server/routes/channels.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/server/routes/channels.py)
+
+**Problem**: under heavy load the `_pending_webhook_tasks` set grew without bound and could exhaust memory and coroutines.
+
+**Fix**: add an upper-bound check (`_WEBHOOK_MAX_CONCURRENT * 2`) and return HTTP 429 when it is exceeded.
+
+```python
+if len(_pending_webhook_tasks) >= _WEBHOOK_MAX_CONCURRENT * 2:
+    logger.warning("webhook background task backlog at %d, rejecting new task", len(_pending_webhook_tasks))
+    raise HTTPException(status_code=429, detail="Server busy, please retry later")
+```
+
+**Note**: the 2x margin absorbs short spikes; the 429 makes clients back off instead of cascading into failure.
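A cap only helps if finished tasks leave the set. The sketch below shows one way to pair the bound with a done-callback, without any web framework. `MAX_CONCURRENT`, `handle`, and `admit` are hypothetical names, and returning a status integer stands in for raising `HTTPException`; whether the real route uses a done-callback is an assumption.

```python
import asyncio

MAX_CONCURRENT = 2                # hypothetical value for _WEBHOOK_MAX_CONCURRENT
MAX_PENDING = MAX_CONCURRENT * 2  # the 2x cap from the fix

pending: set = set()


async def handle(payload: str) -> None:
    """Stand-in for background webhook processing."""
    await asyncio.sleep(0.01)


def admit(payload: str) -> int:
    """Return an HTTP-style status: 200 if scheduled, 429 if the backlog is full."""
    if len(pending) >= MAX_PENDING:
        return 429
    task = asyncio.create_task(handle(payload))
    pending.add(task)
    # Drop the task from the set once it finishes so the backlog can shrink.
    task.add_done_callback(pending.discard)
    return 200


async def main() -> tuple:
    burst = [admit(f"msg-{i}") for i in range(6)]  # 6 arrivals against a cap of 4
    await asyncio.gather(*pending)
    await asyncio.sleep(0)  # give any remaining done-callbacks a turn
    late = admit("msg-late")  # accepted again once the backlog has drained
    await asyncio.gather(*pending)
    return burst, late


burst_statuses, late_status = asyncio.run(main())
```

In this run the first four arrivals are scheduled and the last two of the burst are rejected; a request that arrives after the backlog drains is accepted again.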
+
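+### P2 #12 — Quota check N+1 query elimination
+
+**File**: [src/agentkit/llm/gateway.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/llm/gateway.py)
+
+**Problem**: the quota check called `get_usage()` 4 times per department per period (once each for token and cost, although they can be merged into a single query); the duplicate queries amplified database load.
+
+**Fix**: extract `_get_usage_summary` so that each period is queried only once; token and cost share the summary and are then checked separately via `_check_quota_value`.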
+
+```python
+for period in ("daily", "monthly"):
+    summary = self._get_usage_summary(dept_id, period)
+    current_tokens = int(summary.total_tokens)
+    current_cost_cents = float(summary.total_cost) * 100.0
+    await self._check_quota_value(quota_service, db, dept_id, period, "token_limit", current_tokens)
+    await self._check_quota_value(quota_service, db, dept_id, period, "cost_limit", current_cost_cents)
+
+def _get_usage_summary(self, department_id: str, period: str) -> UsageSummary:
+    now = datetime.now(timezone.utc)
+    if period == "monthly":
+        start = now.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
+    else:
+        start = now.replace(hour=0, minute=0, second=0, microsecond=0)
+    return self._usage_tracker.get_usage(department_id=department_id, start_time=start, end_time=now)
+```
+
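+**Note**: queries drop from 4 to 1 per department per period, a 75% reduction.
+
+### P2 #13 — Redundant SHA-256 elimination in cache keys
+
+**File**: [src/agentkit/llm/cache_key.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/llm/cache_key.py)
+
+**Problem**: `generate_cache_key` ran SHA-256 on each component separately (8-10 times) and then once more over the concatenated hashes, wasting CPU with no security gain.
+
+**Fix**: switch to a single SHA-256, using `\x1f` (Unit Separator) to delimit components as a guard against component content injecting the delimiter.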
+
+```python
+parts = [
+    f"m:{model}",
+    f"s:{system_prompt}",
+    f"msg:{json.dumps(messages, sort_keys=True, ensure_ascii=False)}",
+    f"t:{temperature:.2f}",
+    f"tools:{json.dumps(tools, sort_keys=True, ensure_ascii=False) if tools is not None else 'null'}",
+    f"tc:{tool_choice}",
+    f"mt:{max_tokens}",
+]
+if user_id is not None:
+    parts.append(f"u:{user_id}")
+if kb_acl_hash is not None:
+    parts.append(f"a:{kb_acl_hash}")
+combined = "\x1f".join(parts)  # US (Unit Separator) guards against delimiter injection from component content
+return hashlib.sha256(combined.encode()).hexdigest()
+```
+
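+**Note**: the `_hash_str` and `_hash_json` helpers were removed; `\x1f` is an ASCII control character that does not normally appear in ordinary text, which makes delimiter injection unlikely.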
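To see why the choice of delimiter matters, compare a visible separator with the Unit Separator on two component lists that differ only in where the boundary falls. The `key` helper and the sample components are hypothetical and much simpler than `generate_cache_key`.

```python
import hashlib

US = "\x1f"  # ASCII Unit Separator, as used in the fix


def key(parts: list, sep: str) -> str:
    """Join the components once with `sep`, then hash a single time."""
    return hashlib.sha256(sep.join(parts).encode()).hexdigest()


# Same characters overall, different component boundaries.
left = ["s:a|b", "u:c"]
right = ["s:a", "b|u:c"]

visible_collides = key(left, "|") == key(right, "|")
us_collides = key(left, US) == key(right, US)
```

With `|` as the separator both lists join to the same string and therefore the same key; with `\x1f` they stay distinct. The protection holds only as long as the components themselves cannot contain `\x1f`, so any raw, user-controlled field would still need to be escaped or rejected.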
+
+### P2 #18 — Remove the unused secrets_store parameter
+
+**File**: [src/agentkit/llm/config.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/llm/config.py)
+
+**Problem**: `get_api_key()` accepted a `secrets_store` parameter but never used it (a synchronous method cannot await the async `get_secret`), which made the API misleading.
+
+**Fix**: remove the `secrets_store` parameter from the signature.
+
+```python
+def get_api_key(self) -> str:
+    """Read the API key synchronously; returns plaintext."""
+    if self.api_key_encrypted:
+        logger.debug("get_api_key: encrypted key set — use aget_api_key for decryption")
+    return self.api_key
+```
+
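+**Note**: callers that need the plaintext key synchronously call this method directly; callers that need decryption use `aget_api_key()`.
+
+### P3 #21 — DIRECT_CHAT path deduplication
+
+**File**: [src/agentkit/server/routes/channels.py](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/src/agentkit/server/routes/channels.py)
+
+**Problem**: the DIRECT_CHAT logic was implemented twice, in the main path and in the ReAct fallback path, and could easily drift during maintenance.
+
+**Fix**: extract a `_direct_chat()` helper shared by both paths.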
+
+```python
+async def _direct_chat(llm_gateway: Any, routing: Any) -> str:
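+    """DIRECT_CHAT path: call the LLM directly (shared by the main path and the ReAct fallback)."""
+    # NOTE: `message` is not a parameter in this excerpt; it is assumed to come from the enclosing scope.
+    response = await llm_gateway.chat(
+        messages=[{"role": "user", "content": message.content}],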
+        model=routing.model or "default",
+    )
+    return response.content
+```
+
+## Verification
+
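+After the fixes, verification ran as follows:
+
+- **ruff check**: `src/` is clean apart from a pre-existing `gui_mode` F821 (unrelated to this batch).
+- **Channels tests**: `pytest tests/unit/server/routes/test_channels.py` (137 passed).
+- **Config migration tests**: `pytest tests/unit/llm/test_config_migration.py` (all passed).
+- **Quota enforcement tests**: `pytest tests/unit/llm/test_quota_enforcement.py` (11 passed).
+- **litellm-related tests**: skipped because `litellm` is not installed (environment issue, not a code issue).
+- **WeCom test adjustment**: after replacing the fixed timestamp `1609459200` with a dynamic `int(time.time())`, the freshness-check tests pass.
+- **Cache test adjustment**: after passing `user_id` explicitly or disabling `per_user_namespace`, the `should_cache()` tests pass.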
+
+## Prevention
+
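+### Security
+
+- **Webhook signatures must check timestamp freshness**: any timestamped webhook signature check should also verify that `abs(now - ts)` falls within a reasonable window (300s recommended), using the absolute value to guard against future-dated timestamps. Verifying the signature alone provides no replay protection.
+- **Per-user cache namespaces must enforce a user_id check**: with `per_user_namespace=True`, a `user_id=None` request must skip the cache and must not fall back to a global cache. `should_cache`-style methods should explicitly reject requests with a missing user_id.
+- **Use an ASCII control character as the cache-key delimiter**: when concatenating multiple components into a hash key, use `\x1f` (Unit Separator) rather than a visible character (`:`, `|`) to shrink the attack surface for delimiter injection from component content.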
+
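+### Reliability
+
+- **Webhook handlers must have a catch-all exception handler**: every webhook entry route should wrap its business logic in `except Exception` and return the platform's agreed "received" signal (such as `code:0`) so that a 500 does not trigger a retry storm. The `# noqa: BLE001` is intentional.
+- **Application shutdown must close resources and wait for background tasks**: the shutdown path must explicitly call `close_all_adapters()` and `await asyncio.gather(*_pending_webhook_tasks, return_exceptions=True)` to avoid leaked connections and dropped replies.
+- **Background task sets must be bounded**: any `_pending_tasks` set should be capped at `max_concurrent * 2`, returning 429 when the cap is exceeded so that clients back off and memory is not exhausted.
+- **Token TTLs must match the real validity period**: when caching a token, set the TTL to "real validity minus a 5min margin" to avoid expiry at the boundary; do not pick a short value by intuition.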
+
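+### Performance
+
+- **Avoid N+1 in quota/statistics queries**: within one period, token and cost share a single `get_usage()` query; do not query once per metric. Extract `_get_usage_summary` as the single entry point.
+- **Avoid redundant hashing in cache keys**: a single SHA-256 is sufficiently secure; there is no need to hash each component and then hash again. Remove `_hash_str` / `_hash_json`-style helpers.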
+
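+### Testing
+
+- **Use dynamic values in tests that involve timestamps**: do not hard-code historical timestamps such as `1609459200`; generate them with `int(time.time())` so that tests do not break as soon as a freshness check is introduced.
+- **Pass user_id explicitly in tests that involve per-user namespaces**: when testing `should_cache`-style methods, pass `user_id` explicitly or disable `per_user_namespace` in the fixture so that a change in default arguments does not cause regressions.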
+
+## Related Docs
+
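+- [long-horizon-reliability-code-review-fixes.md](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/docs/solutions/logic-errors/long-horizon-reliability-code-review-fixes.md): the previous batch of U1-U7 long-horizon reliability code-review fixes. It belongs to the same code-review fix series as this batch and is a useful cross-reference for similar reliability problems.
+- [bitable-companion-service-security-reliability-patterns.md](file:///Users/Chiguyong/Code/Fischer/fischer-agentkit/docs/solutions/architecture-patterns/bitable-companion-service-security-reliability-patterns.md): security/reliability architecture patterns for the Bitable companion service (SSRF, SQL injection, IDOR, cache invalidation, and so on). Its threat model differs from this batch's LLM cache isolation, but the two read well side by side.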