fischer-agentkit/docs/plans/2026-06-26-006-fix-p0-secur...

---
title: "fix: P0 安全与多实例一致性加固"
date: 2026-06-26
type: fix
status: planned
origin: 代码走查报告（portal-platform-evolution 合并后审查）
---

# fix: P0 安全与多实例一致性加固

## Summary

修复代码走查发现的 4 项高优先级问题，分两组：(1) 安全 fail-open 默认值 — Gateway KB cache fail-closed + MCP dangerous-tool 黑名单过滤；(2) 多实例部署一致性 — SecretsStore 与 Channels 全局状态迁移到 Redis，消除多 worker 下的安全与状态共享缺口。

## Problem Frame

`portal-platform-evolution` 合并引入了多端渠道（U10-U12）、MCP 协议（U13/U16）、LiteLLM 缓存（U17）等特性。代码走查发现两类系统性风险：

1. **安全 fail-open**：Gateway 在 KB settings 读取异常时默认启用缓存（可能泄漏禁用缓存的 KB 数据）；MCP 端点暴露所有工具（含 ShellTool 等危险工具），绕过 chat 流程的 confirmation 机制。
2. **多实例不一致**：SecretsStore 和 Channels 路由使用模块级内存字典存储凭证、nonce、限流、backpressure 状态，多 worker 部署下失效，与 README 宣称的 K8s 多实例部署目标冲突。

## Requirements

- **R1**：Gateway KB cache 在 settings 读取失败时必须 fail-closed（禁用缓存），不得 fail-open
- **R2**：MCP `/tools/call` 和 `/tools/list`（含 JSON-RPC 端点）必须过滤危险工具，禁止绕过 chat confirmation 流程
- **R3**：SecretsStore 加密凭证必须存储在 Redis，多 worker 共享，保留 AES-256-GCM 加密层
- **R4**：Channels webhook nonce dedup、rate limit、backpressure、渠道配置必须存储在 Redis，多 worker 一致
- **R5**：所有改动不破坏现有单进程行为（Redis 不可用时降级到内存，需显式 log warning）
- **R6**：每个修复单元包含可运行的验证测试

## Key Technical Decisions

### KTD1: H4 危险工具过滤采用黑名单

**决策**：使用 `_MCP_BLOCKED_TOOLS: frozenset[str]` 黑名单，而非在 Tool 基类增加 `requires_confirmation` 元数据字段。

**理由**：黑名单改动最小（ponytail 原则），仅影响 mcp/server.py 一个文件。元数据方案需修改 Tool ABC + 所有工具子类 + ToolRegistry，改动面过大。黑名单天花板已在代码注释标注（`ponytail: 黑名单需手动维护，新增危险工具需同步更新`）。

### KTD2: H1/H2 Redis 连接复用 app.state.working_redis_client

**决策**：复用 `app.py` lifespan 已创建的 `app.state.working_redis_client`（`aioredis.from_url(redis_url, decode_responses=True)`），通过 FastAPI `Depends` 或 `request.app.state` 注入到 channels 路由和 SecretsStore。

**理由**：避免新建 Redis 连接池（资源浪费），与 WorkingMemory、TaskManager 等现有子系统共享同一连接。`decode_responses=True` 与 RedisBus 模式一致。

### KTD3: 降级策略 — Redis 不可用时 fail-closed

**决策**：Redis 连接失败时，webhook 端点返回 503（而非降级到内存），secrets 读取返回 None（而非空字典）。

**理由**：安全优先。降级到内存会导致多 worker 间状态不一致，nonce dedup 失效可能引发重放攻击。fail-closed 符合 AGENTS.md 安全约定。

## Implementation Units

### U1. Gateway KB cache fail-closed

**Goal**: KB settings 读取异常时禁用缓存，而非默认启用。

**Requirements**: R1

**Dependencies**: 无

**Files**:
- `src/agentkit/llm/gateway.py`（修改第 137-146 行）
- `tests/unit/llm/test_gateway_cache_failclosed.py`（新建）

**Approach**: 将 `kb_caching_disabled` 默认值从 `False` 改为 `True`（fail-closed），仅在 settings 成功读取且 `caching_disabled=False` 时才启用缓存。读取异常或 settings 为 None 时保持 `True`。

**Patterns to follow**: 现有 `should_cache` 已有 `per_user_namespace + user_id=None` 防护逻辑，本单元补齐 KB settings 读取层的防护。

**Test scenarios**:
- **Happy path**: settings 正常读取 `caching_disabled=False` → 缓存启用，`should_cache` 返回 True
- **Happy path**: settings 正常读取 `caching_disabled=True` → 缓存禁用，`should_cache` 返回 False
- **Error path**: `get_settings_store().get_settings()` 抛异常 → `kb_caching_disabled=True`，`should_cache` 返回 False（fail-closed）
- **Edge case**: settings 返回 None（KB 不存在）→ `kb_caching_disabled=True`
- **Edge case**: `kb_id=None`（非 RAG 请求）→ 不查 settings，缓存正常启用

**Verification**: `pytest tests/unit/llm/test_gateway_cache_failclosed.py -v` 全绿；mock settings store 抛异常时断言 `should_cache` 返回 False。

---

### U2. MCP dangerous-tool 黑名单过滤

**Goal**: 阻止危险工具通过 MCP 端点执行，绕过 chat confirmation 流程。

**Requirements**: R2

**Dependencies**: 无

**Files**:
- `src/agentkit/mcp/server.py`（修改 `_all_tools`、`_find_tool`、REST `call_tool`、JSON-RPC `tools/call`）
- `tests/unit/mcp/test_server_dangerous_tools.py`（新建）

**Approach**: 在 `server.py` 顶部定义 `_MCP_BLOCKED_TOOLS` 黑名单（`shell`、`file_write`、`file_delete`、`terminal_execute`）。`_all_tools()` 和 `_find_tool()` 过滤黑名单工具。`list_tools` 和 JSON-RPC `tools/list` 不返回黑名单工具。`call_tool` 对黑名单工具返回 404。

**Technical design**（directional guidance）:

```python
# ponytail: 黑名单需手动维护，新增危险工具需同步更新。
# 天花板：若工具名动态生成或别名注册，黑名单可能漏判。
# 升级路径：Tool 基类增加 requires_confirmation 元数据字段。
_MCP_BLOCKED_TOOLS: frozenset[str] = frozenset({
    "shell", "file_write", "file_delete", "terminal_execute",
})
```

**Patterns to follow**: `shell.py` 已有 `_DANGEROUS_BINARIES` frozenset 模式。

**Test scenarios**:
- **Happy path**: `GET /tools/list` 不返回黑名单工具
- **Happy path**: `POST /tools/call` with `{"name": "safe_tool"}` 正常执行
- **Error path**: `POST /tools/call` with `{"name": "shell"}` → 404
- **Error path**: `POST /tools/call` with `{"name": "file_delete"}` → 404
- **Integration**: JSON-RPC `tools/call` with `{"method": "tools/call", "params": {"name": "shell"}}` → `isError: True`
- **Edge case**: JSON-RPC `tools/list` 不包含黑名单工具

**Verification**: `pytest tests/unit/mcp/test_server_dangerous_tools.py -v` 全绿；黑名单工具在 list 和 call 端点均不可见。

---

### U3. SecretsStore Redis 迁移

**Goal**: 加密凭证存储迁移到 Redis，多 worker 共享，保留 AES-256-GCM 加密层。

**Requirements**: R3, R5

**Dependencies**: 无（可与 U4 并行，共享 Redis 注入基础设施）

**Files**:
- `src/agentkit/channels/secrets.py`（修改 `SecretsStore.__init__`、`set_secret`、`get_secret`、`delete_secret`、`list_keys`）
- `src/agentkit/server/routes/channels.py`（修改 `_get_secrets_store` 注入 Redis）
- `src/agentkit/server/app.py`（lifespan 中将 `working_redis_client` 传入 SecretsStore）
- `tests/unit/channels/test_secrets_redis.py`（新建）

**Approach**: `SecretsStore.__init__` 新增 `redis` 参数（`aioredis.Redis | None`）。`_store` 字典替换为 Redis 调用：`set_secret` → `redis.set(f"secrets:{key}", entry.model_dump_json())`；`get_secret` → `redis.get` + `SecretEntry.model_validate_json`；`delete_secret` → `redis.delete`；`list_keys` → `redis.scan_iter`。加密层（`encrypt`/`decrypt`/`_derive_key`）不变。Redis 为 None 时降级到内存（仅开发模式，log warning）。

**Patterns to follow**: `RedisBus._get_redis()` 懒初始化模式；`SharedWorkspace` 的 `redis_client` 注入模式。

**Test scenarios**:
- **Happy path**: `set_secret("k", "v")` 后 `get_secret("k")` 返回 `"v"`（Redis mock）
- **Happy path**: `delete_secret("k")` 后 `get_secret("k")` 返回 None
- **Happy path**: `list_keys(prefix="feishu:")` 返回匹配前缀的 key 列表
- **Edge case**: Redis 为 None（降级模式）→ 内存字典行为，log warning
- **Error path**: Redis 连接失败 → `get_secret` 抛异常或返回 None（fail-closed）
- **Integration**: 加密-存储-读取-解密往返：明文 → encrypt → Redis SET → Redis GET → decrypt → 明文一致

**Verification**: `pytest tests/unit/channels/test_secrets_redis.py -v` 全绿；多 worker 场景下 worker A `set_secret` 后 worker B `get_secret` 能读到（需 Redis faker 或集成测试）。

---

### U4. Channels 全局状态 Redis 迁移

**Goal**: nonce dedup、rate limit、backpressure、渠道配置迁移到 Redis，多 worker 一致。

**Requirements**: R4, R5, R6

**Dependencies**: U3（共享 Redis 注入基础设施）

**Files**:
- `src/agentkit/server/routes/channels.py`（修改 `_channels`、`_rate_limits`、`_seen_nonces`、`_pending_webhook_tasks` 相关逻辑）
- `tests/unit/channels/test_webhook_redis_state.py`（新建）

**Approach**:

1. **nonce dedup** → Redis `SET NX EX`：`await redis.set(f"nonce:{nonce}", "1", ex=300, nx=True)` 返回 None 表示重复。
2. **rate limit** → Redis ZSET 滑动窗口：`ZREMRANGEBYSCORE` 清理过期 + `ZADD` + `ZCARD` 计数。
3. **backpressure** → Redis `INCR`/`DECR` 共享计数器：`INCR webhook:concurrent`，超限 `DECR` 并返回 429，执行完毕 `DECR`。`EXPIRE 30s` 防 crash 计数不归零。
4. **渠道配置** → Redis Hash：`HSET channel:{id}` 存配置 JSON，`HGETALL` 读取。`_adapter_cache` 保留进程内缓存（只读快照，配置变更时 invalidate）。
5. **Redis 注入**：webhook handler 通过 `request.app.state.working_redis_client` 获取连接。

**Technical design**（directional guidance）:

```python
# nonce dedup — 原子操作，TTL 自动过期
is_new = await redis.set(f"nonce:{nonce}", "1", ex=int(_NONCE_TTL), nx=True)
if not is_new:
    return Response(status_code=200)  # 重复事件，飞书要求 3s 内响应

# rate limit — ZSET 滑动窗口
key = f"ratelimit:{client_ip}"
now = time.time()
pipe = redis.pipeline()
pipe.zremrangebyscore(key, 0, now - _RATE_LIMIT_WINDOW)
pipe.zadd(key, {str(now): now})
pipe.zcard(key)
pipe.expire(key, int(_RATE_LIMIT_WINDOW))
_, _, count, _ = await pipe.execute()
if count > _RATE_LIMIT_MAX:
    return Response(status_code=429)

# backpressure — 共享计数器
current = await redis.incr("webhook:concurrent")
if current > _WEBHOOK_MAX_CONCURRENT * 2:
    await redis.decr("webhook:concurrent")
    return Response(status_code=429)
try:
    # ... process webhook ...
finally:
    await redis.decr("webhook:concurrent")
```

**Patterns to follow**: Redis pipeline 模式（`usage_store.py` 已有）；`_adapter_cache` 保留进程内缓存（`_invalidate_adapter_cache` 逻辑不变）。

**Test scenarios**:
- **Happy path**: 首次 nonce → `SET NX` 返回 True，请求正常处理
- **Happy path**: 重复 nonce → `SET NX` 返回 None，返回 200（不重复处理）
- **Happy path**: rate limit 窗口内 100 请求通过，第 101 请求返回 429
- **Happy path**: backpressure 并发 < 2x 上限时请求通过
- **Error path**: backpressure 并发 >= 2x 上限时返回 429
- **Edge case**: nonce TTL 过期后相同 nonce 可再次使用（mock `time.sleep` 或 Redis TTL）
- **Edge case**: rate limit 窗口滚动后计数重置
- **Integration**: 渠道配置写入 Redis 后，新 worker 的 `_adapter_cache` miss 时从 Redis 读取并缓存

**Verification**: `pytest tests/unit/channels/test_webhook_redis_state.py -v` 全绿；`ruff check src/ && ruff format src/` 通过；现有 `tests/unit/channels/test_wecom.py` 等不回归。

---

## Scope Boundaries

### In Scope

- 4 项高优先级问题修复（H3/H4/H1/H2）
- Redis 连接复用现有 `app.state.working_redis_client`
- 单元测试覆盖每个修复单元

### Out of Scope

- 中优先级问题（M1-M8）：feishu/slack decode 异常、dingtalk 时间戳窗口、mcp/client 性能、app.py shutdown 超时等
- 低优先级问题（L1-L3）：X-Forwarded-For 信任、生产 guard 触发条件、Any 类型
- PostgreSQL 持久化迁移（Redis 已满足多实例共享需求）
- MCP confirmation 协议实现（黑名单方案足够，元数据方案为后续升级路径）
- Tool 基类 `requires_confirmation` 元数据重构

### Deferred to Follow-Up Work

- M1 feishu/slack `body.decode("utf-8")` 异常捕获 → 独立 PR
- M4 mcp/server.py 异常消息脱敏 + REST/JSON-RPC 代码去重 → 独立 PR
- M6 app.py shutdown `asyncio.gather` 超时 → 独立 PR
- H4 升级路径：Tool 基类 `requires_confirmation` 元数据 → 下个迭代

## Risks & Dependencies

- **风险 1**：H1/H2 Redis 迁移后，Redis 不可用会导致 webhook 端点完全失效（fail-closed）。缓解：lifespan 启动时 ping Redis，不可用时 log error 但不阻止启动（单进程降级模式仍可用）。
- **风险 2**：U4 backpressure `INCR/DECR` 非原子，crash 可能导致计数不归零。缓解：`EXPIRE 30s` 安全网 + 定期清理。
- **依赖 1**：U4 依赖 U3 完成的 Redis 注入基础设施。
- **依赖 2**：所有单元测试使用 `fakeredis` 或 `unittest.mock.AsyncMock` mock Redis，不依赖真实 Redis 实例。

## System-Wide Impact

- **运维**：多 worker 部署（gunicorn -w 4）现在可正确共享渠道状态，无需 sticky session
- **安全**：MCP 端点不再暴露危险工具；KB 缓存 fail-closed 防止数据泄漏
- **性能**：Redis 网络往返替代内存字典，单请求延迟增加 ~1ms（可接受）；backpressure 跨 worker 生效
- **兼容性**：单进程开发模式（无 Redis）降级到内存存储，行为不变

## Open Questions

无 — 所有技术决策已在 KTD1-KTD3 中明确。