14 KiB

Raw Blame History

title	date	type	status	origin
fix: P0 安全与多实例一致性加固	2026-06-26	fix	planned	代码走查报告（portal-platform-evolution 合并后审查）

fix: P0 安全与多实例一致性加固

Summary

修复代码走查发现的 4 项高优先级问题，分两组：(1) 安全 fail-open 默认值 — Gateway KB cache fail-closed + MCP dangerous-tool 黑名单过滤；(2) 多实例部署一致性 — SecretsStore 与 Channels 全局状态迁移到 Redis，消除多 worker 下的安全与状态共享缺口。

Problem Frame

portal-platform-evolution 合并引入了多端渠道（U10-U12）、MCP 协议（U13/U16）、LiteLLM 缓存（U17）等特性。代码走查发现两类系统性风险：

安全 fail-open：Gateway 在 KB settings 读取异常时默认启用缓存（可能泄漏禁用缓存的 KB 数据）；MCP 端点暴露所有工具（含 ShellTool 等危险工具），绕过 chat 流程的 confirmation 机制。
多实例不一致：SecretsStore 和 Channels 路由使用模块级内存字典存储凭证、nonce、限流、backpressure 状态，多 worker 部署下失效，与 README 宣称的 K8s 多实例部署目标冲突。

Requirements

R1：Gateway KB cache 在 settings 读取失败时必须 fail-closed（禁用缓存），不得 fail-open
R2：MCP /tools/call 和 /tools/list（含 JSON-RPC 端点）必须过滤危险工具，禁止绕过 chat confirmation 流程
R3：SecretsStore 加密凭证必须存储在 Redis，多 worker 共享，保留 AES-256-GCM 加密层
R4：Channels webhook nonce dedup、rate limit、backpressure、渠道配置必须存储在 Redis，多 worker 一致
R5：所有改动不破坏现有单进程行为（Redis 不可用时降级到内存，需显式 log warning）
R6：每个修复单元包含可运行的验证测试

Key Technical Decisions

KTD1: H4 危险工具过滤采用黑名单

决策：使用 _MCP_BLOCKED_TOOLS: frozenset[str] 黑名单，而非在 Tool 基类增加 requires_confirmation 元数据字段。

理由：黑名单改动最小（ponytail 原则），仅影响 mcp/server.py 一个文件。元数据方案需修改 Tool ABC + 所有工具子类 + ToolRegistry，改动面过大。黑名单天花板已在代码注释标注（ponytail: 黑名单需手动维护，新增危险工具需同步更新）。

KTD2: H1/H2 Redis 连接复用 app.state.working_redis_client

决策：复用 app.py lifespan 已创建的 app.state.working_redis_client（aioredis.from_url(redis_url, decode_responses=True)），通过 FastAPI Depends 或 request.app.state 注入到 channels 路由和 SecretsStore。

理由：避免新建 Redis 连接池（资源浪费），与 WorkingMemory、TaskManager 等现有子系统共享同一连接。decode_responses=True 与 RedisBus 模式一致。

KTD3: 降级策略 — Redis 不可用时 fail-closed

决策：Redis 连接失败时，webhook 端点返回 503（而非降级到内存），secrets 读取返回 None（而非空字典）。

理由：安全优先。降级到内存会导致多 worker 间状态不一致，nonce dedup 失效可能引发重放攻击。fail-closed 符合 AGENTS.md 安全约定。

Implementation Units

U1. Gateway KB cache fail-closed

Goal: KB settings 读取异常时禁用缓存，而非默认启用。

Requirements: R1

Dependencies: 无

Files:

src/agentkit/llm/gateway.py（修改第 137-146 行）
tests/unit/llm/test_gateway_cache_failclosed.py（新建）

Approach: 将 kb_caching_disabled 默认值从 False 改为 True（fail-closed），仅在 settings 成功读取且 caching_disabled=False 时才启用缓存。读取异常或 settings 为 None 时保持 True。

Patterns to follow: 现有 should_cache 已有 per_user_namespace + user_id=None 防护逻辑，本单元补齐 KB settings 读取层的防护。

Test scenarios:

Happy path: settings 正常读取 caching_disabled=False → 缓存启用，should_cache 返回 True
Happy path: settings 正常读取 caching_disabled=True → 缓存禁用，should_cache 返回 False
Error path: get_settings_store().get_settings() 抛异常 → kb_caching_disabled=True，should_cache 返回 False（fail-closed）
Edge case: settings 返回 None（KB 不存在）→ kb_caching_disabled=True
Edge case: kb_id=None（非 RAG 请求）→ 不查 settings，缓存正常启用

Verification: pytest tests/unit/llm/test_gateway_cache_failclosed.py -v 全绿；mock settings store 抛异常时断言 should_cache 返回 False。

U2. MCP dangerous-tool 黑名单过滤

Goal: 阻止危险工具通过 MCP 端点执行，绕过 chat confirmation 流程。

Requirements: R2

Dependencies: 无

Files:

src/agentkit/mcp/server.py（修改 _all_tools、_find_tool、REST call_tool、JSON-RPC tools/call）
tests/unit/mcp/test_server_dangerous_tools.py（新建）

Approach: 在 server.py 顶部定义 _MCP_BLOCKED_TOOLS 黑名单（shell、file_write、file_delete、terminal_execute）。_all_tools() 和 _find_tool() 过滤黑名单工具。list_tools 和 JSON-RPC tools/list 不返回黑名单工具。call_tool 对黑名单工具返回 404。

Technical design（directional guidance）:

# ponytail: 黑名单需手动维护，新增危险工具需同步更新。
# 天花板：若工具名动态生成或别名注册，黑名单可能漏判。
# 升级路径：Tool 基类增加 requires_confirmation 元数据字段。
_MCP_BLOCKED_TOOLS: frozenset[str] = frozenset({
    "shell", "file_write", "file_delete", "terminal_execute",
})

Patterns to follow: shell.py 已有 _DANGEROUS_BINARIES frozenset 模式。

Test scenarios:

Happy path: GET /tools/list 不返回黑名单工具
Happy path: POST /tools/call with {"name": "safe_tool"} 正常执行
Error path: POST /tools/call with {"name": "shell"} → 404
Error path: POST /tools/call with {"name": "file_delete"} → 404
Integration: JSON-RPC tools/call with {"method": "tools/call", "params": {"name": "shell"}} → isError: True
Edge case: JSON-RPC tools/list 不包含黑名单工具

Verification: pytest tests/unit/mcp/test_server_dangerous_tools.py -v 全绿；黑名单工具在 list 和 call 端点均不可见。

U3. SecretsStore Redis 迁移

Goal: 加密凭证存储迁移到 Redis，多 worker 共享，保留 AES-256-GCM 加密层。

Requirements: R3, R5

Dependencies: 无（可与 U4 并行，共享 Redis 注入基础设施）

Files:

src/agentkit/channels/secrets.py（修改 SecretsStore.__init__、set_secret、get_secret、delete_secret、list_keys）
src/agentkit/server/routes/channels.py（修改 _get_secrets_store 注入 Redis）
src/agentkit/server/app.py（lifespan 中将 working_redis_client 传入 SecretsStore）
tests/unit/channels/test_secrets_redis.py（新建）

Approach: SecretsStore.__init__ 新增 redis 参数（aioredis.Redis | None）。_store 字典替换为 Redis 调用：set_secret → redis.set(f"secrets:{key}", entry.model_dump_json())；get_secret → redis.get + SecretEntry.model_validate_json；delete_secret → redis.delete；list_keys → redis.scan_iter。加密层（encrypt/decrypt/_derive_key）不变。Redis 为 None 时降级到内存（仅开发模式，log warning）。

Patterns to follow: RedisBus._get_redis() 懒初始化模式；SharedWorkspace 的 redis_client 注入模式。

Test scenarios:

Happy path: set_secret("k", "v") 后 get_secret("k") 返回 "v"（Redis mock）
Happy path: delete_secret("k") 后 get_secret("k") 返回 None
Happy path: list_keys(prefix="feishu:") 返回匹配前缀的 key 列表
Edge case: Redis 为 None（降级模式）→ 内存字典行为，log warning
Error path: Redis 连接失败 → get_secret 抛异常或返回 None（fail-closed）
Integration: 加密-存储-读取-解密往返：明文 → encrypt → Redis SET → Redis GET → decrypt → 明文一致

Verification: pytest tests/unit/channels/test_secrets_redis.py -v 全绿；多 worker 场景下 worker A set_secret 后 worker B get_secret 能读到（需 Redis faker 或集成测试）。

U4. Channels 全局状态 Redis 迁移

Goal: nonce dedup、rate limit、backpressure、渠道配置迁移到 Redis，多 worker 一致。

Requirements: R4, R5, R6

Dependencies: U3（共享 Redis 注入基础设施）

Files:

src/agentkit/server/routes/channels.py（修改 _channels、_rate_limits、_seen_nonces、_pending_webhook_tasks 相关逻辑）
tests/unit/channels/test_webhook_redis_state.py（新建）

Approach:

nonce dedup → Redis SET NX EX：await redis.set(f"nonce:{nonce}", "1", ex=300, nx=True) 返回 None 表示重复。
rate limit → Redis ZSET 滑动窗口：ZREMRANGEBYSCORE 清理过期 + ZADD + ZCARD 计数。
backpressure → Redis INCR/DECR 共享计数器：INCR webhook:concurrent，超限 DECR 并返回 429，执行完毕 DECR。EXPIRE 30s 防 crash 计数不归零。
渠道配置 → Redis Hash：HSET channel:{id} 存配置 JSON，HGETALL 读取。_adapter_cache 保留进程内缓存（只读快照，配置变更时 invalidate）。
Redis 注入：webhook handler 通过 request.app.state.working_redis_client 获取连接。

Technical design（directional guidance）:

# nonce dedup — 原子操作，TTL 自动过期
is_new = await redis.set(f"nonce:{nonce}", "1", ex=int(_NONCE_TTL), nx=True)
if not is_new:
    return Response(status_code=200)  # 重复事件，飞书要求 3s 内响应

# rate limit — ZSET 滑动窗口
key = f"ratelimit:{client_ip}"
now = time.time()
pipe = redis.pipeline()
pipe.zremrangebyscore(key, 0, now - _RATE_LIMIT_WINDOW)
pipe.zadd(key, {str(now): now})
pipe.zcard(key)
pipe.expire(key, int(_RATE_LIMIT_WINDOW))
_, _, count, _ = await pipe.execute()
if count > _RATE_LIMIT_MAX:
    return Response(status_code=429)

# backpressure — 共享计数器
current = await redis.incr("webhook:concurrent")
if current > _WEBHOOK_MAX_CONCURRENT * 2:
    await redis.decr("webhook:concurrent")
    return Response(status_code=429)
try:
    # ... process webhook ...
finally:
    await redis.decr("webhook:concurrent")

Patterns to follow: Redis pipeline 模式（usage_store.py 已有）；_adapter_cache 保留进程内缓存（_invalidate_adapter_cache 逻辑不变）。

Test scenarios:

Happy path: 首次 nonce → SET NX 返回 True，请求正常处理
Happy path: 重复 nonce → SET NX 返回 None，返回 200（不重复处理）
Happy path: rate limit 窗口内 100 请求通过，第 101 请求返回 429
Happy path: backpressure 并发 < 2x 上限时请求通过
Error path: backpressure 并发 >= 2x 上限时返回 429
Edge case: nonce TTL 过期后相同 nonce 可再次使用（mock time.sleep 或 Redis TTL）
Edge case: rate limit 窗口滚动后计数重置
Integration: 渠道配置写入 Redis 后，新 worker 的 _adapter_cache miss 时从 Redis 读取并缓存

Verification: pytest tests/unit/channels/test_webhook_redis_state.py -v 全绿；ruff check src/ && ruff format src/ 通过；现有 tests/unit/channels/test_wecom.py 等不回归。

Scope Boundaries

In Scope

4 项高优先级问题修复（H3/H4/H1/H2）
Redis 连接复用现有 app.state.working_redis_client
单元测试覆盖每个修复单元

Out of Scope

中优先级问题（M1-M8）：feishu/slack decode 异常、dingtalk 时间戳窗口、mcp/client 性能、app.py shutdown 超时等
低优先级问题（L1-L3）：X-Forwarded-For 信任、生产 guard 触发条件、Any 类型
PostgreSQL 持久化迁移（Redis 已满足多实例共享需求）
MCP confirmation 协议实现（黑名单方案足够，元数据方案为后续升级路径）
Tool 基类 requires_confirmation 元数据重构

Deferred to Follow-Up Work

M1 feishu/slack body.decode("utf-8") 异常捕获 → 独立 PR
M4 mcp/server.py 异常消息脱敏 + REST/JSON-RPC 代码去重 → 独立 PR
M6 app.py shutdown asyncio.gather 超时 → 独立 PR
H4 升级路径：Tool 基类 requires_confirmation 元数据 → 下个迭代

Risks & Dependencies

风险 1：H1/H2 Redis 迁移后，Redis 不可用会导致 webhook 端点完全失效（fail-closed）。缓解：lifespan 启动时 ping Redis，不可用时 log error 但不阻止启动（单进程降级模式仍可用）。
风险 2：U4 backpressure INCR/DECR 非原子，crash 可能导致计数不归零。缓解：EXPIRE 30s 安全网 + 定期清理。
依赖 1：U4 依赖 U3 完成的 Redis 注入基础设施。
依赖 2：所有单元测试使用 fakeredis 或 unittest.mock.AsyncMock mock Redis，不依赖真实 Redis 实例。

System-Wide Impact

运维：多 worker 部署（gunicorn -w 4）现在可正确共享渠道状态，无需 sticky session
安全：MCP 端点不再暴露危险工具；KB 缓存 fail-closed 防止数据泄漏
性能：Redis 网络往返替代内存字典，单请求延迟增加 ~1ms（可接受）；backpressure 跨 worker 生效
兼容性：单进程开发模式（无 Redis）降级到内存存储，行为不变

Open Questions

无 — 所有技术决策已在 KTD1-KTD3 中明确。

14 KiB Raw Blame History Unescape Escape

fix: P0 安全与多实例一致性加固

Summary

Problem Frame

Requirements

Key Technical Decisions

KTD1: H4 危险工具过滤采用黑名单

KTD2: H1/H2 Redis 连接复用 app.state.working_redis_client

KTD3: 降级策略 — Redis 不可用时 fail-closed

Implementation Units

U1. Gateway KB cache fail-closed

U2. MCP dangerous-tool 黑名单过滤

U3. SecretsStore Redis 迁移

U4. Channels 全局状态 Redis 迁移

Scope Boundaries

In Scope

Out of Scope

Deferred to Follow-Up Work

Risks & Dependencies

System-Wide Impact

Open Questions

14 KiB

Raw Blame History