docs: add bitable security/reliability patterns solution doc + CONCEPTS.md
Deploy to Production / deploy (push) Has been cancelled Details

- docs/solutions/architecture-patterns/bitable-companion-service-security-reliability-patterns.md
  Knowledge-track doc capturing 10 security/reliability patterns from the
  bitable companion service (SSRF prevention, SQL injection, IDOR, atomic
  task claiming, cache invalidation, composite cursor, batch ops, async
  I/O safety, OOM prevention, internal token auth)

- CONCEPTS.md
  Seeded with 3 core domain nouns: Bitable, Field Ownership, Recalc

- AGENTS.md
  Added discoverability tips for docs/solutions/ and CONCEPTS.md
This commit is contained in:
chiguyong 2026-06-25 01:25:06 +08:00
parent bbbf9cd40a
commit 71eaf8dc7c
3 changed files with 448 additions and 0 deletions

View File

@ -196,6 +196,8 @@ CLI 参数 > `agentkit.yaml` > 环境变量(`${VAR:-default}`> `.env` > 硬
- 测试文件:`tests/unit/` 和 `tests/integration/`
- 前端 storesPinia每个领域一个chat、team、settings
- 前端组件:`src/agentkit/server/frontend/src/components/`
- 知识库:`docs/solutions/` — 已记录的问题解决方案bug 修复、最佳实践、架构模式按类别组织YAML frontmatter 可搜索(`module`、`tags`、`problem_type`)。在已记录的领域内实现或调试时相关。
- 领域词汇:`CONCEPTS.md` — 项目共享领域词汇表(实体、命名流程、状态概念),在熟悉代码库或讨论领域概念时相关。
## 边界

14
CONCEPTS.md Normal file
View File

@ -0,0 +1,14 @@
# Concepts
Shared domain vocabulary for this project — entities, named processes, and status concepts with project-specific meaning. Seeded with core domain vocabulary, then accretes as ce-compound and ce-compound-refresh process learnings; direct edits are fine. Glossary only, not a spec or catch-all.
## Bitable
### Bitable
A companion service providing multi-dimensional table storage (structured records with typed fields, views, and formula columns) that runs alongside the main AgentKit server. Owns its own PostgreSQL schema and exposes a REST API at `/api/v1/bitable`. Agents interact with it over HTTP using an internal service token; end-users interact via JWT auth.
### Field Ownership
The model controlling which actor may modify a field's definition or values. Every field has an owner of either `agent` or `user`. Agent-owned fields are written by the BitableTool during data ingestion; user-owned fields are edited by humans in the UI. Upsert operations update only agent-owned fields via `jsonb_set` — user-owned fields are never overwritten by agent writes.
### Recalc
The background process that recomputes formula field values after source data changes. When a record's data columns are written, a recalc task is enqueued; a RecalcWorker claims tasks atomically (FOR UPDATE SKIP LOCKED), evaluates the formula DAG in topological order, and writes results back. The reaper resets stale `calculating` tasks to `pending` for crash recovery.

View File

@ -0,0 +1,432 @@
---
title: "Bitable Companion Service: Security and Reliability Patterns"
date: 2026-06-25
category: docs/solutions/architecture-patterns/
module: bitable
problem_type: architecture_pattern
component: service_object
severity: high
applies_when:
- Building a companion service that runs alongside a main server with its own schema and REST API boundary
- Accepting user-supplied URLs for server-side fetching (SSRF prevention required)
- Interpolating dynamic identifiers into SQL (field_id, table_id validation before interpolation)
- Multi-tenant resource access where every table/record endpoint must verify ownership (IDOR)
- Concurrent background workers claiming tasks from a shared queue (atomic claiming)
- Caching computed values (formula engines) that must invalidate on schema changes
- Paginating with non-id sort columns requiring stable composite cursors
- Bulk data import/export paths that risk OOM without streaming and batching
- Mixing synchronous libraries (openpyxl, pandas) into an async server (asyncio.to_thread)
- Service-to-service authentication between companion and main server (internal token)
tags: [security, ssrf, sql-injection, idor, concurrency, cache-invalidation, pagination, batch-operations, async-io, internal-auth, oom-prevention, companion-service, bitable]
---
# Bitable Companion Service: Security and Reliability Patterns
## Context
The Fischer AgentKit project (Python 3.11+, FastAPI, SQLAlchemy 2 async, Vue 3) introduced a **Bitable** (multi-dimensional table) companion service to give agents and end-users a structured-record store alongside AgentKit's existing chat, memory, and orchestration layers. Unlike a greenfield module, the Bitable service had to slot into a deployed system that already runs an HTTP API, a PostgreSQL database, and a Redis bus — and it had to do so without weakening AgentKit's existing security posture.
During `ce-code-review` and `ce-debug` of the new service, multiple security and reliability issues surfaced. They were not exotic bugs; they were the standard failure modes that show up whenever a CRUD service with URL-based ingestion, dynamic SQL filtering, file uploads, and background workers is bolted onto an existing platform:
- SSRF exposure from a "fetch Excel from URL" feature
- SQL injection through field IDs interpolated into JSONB path literals
- IDOR (insecure direct object reference) on every table-level endpoint
- Race conditions when multiple recalc workers pull from the same queue
- Stale formula-engine caches after field schema changes
- Broken pagination when sorting by non-id columns
- N+1 queries on batch lookups and upserts
- Event-loop blocking from sync Excel/DB-reflect calls
- OOM risk from unbounded uploads and bulk deletes
This document captures the architecture patterns and security/reliability practices applied during the fix, so the next companion service (or the next reviewer of this one) starts from a known-good baseline rather than re-deriving each rule from first principles.
## Guidance
The patterns below are organized by the concern they address. Each is shown with the actual code shape used in the Bitable service.
### 1. Companion Service Isolation
A companion service should be **isolated at the data, API, and auth layers**, even when co-deployed with the host system. Sharing tables or in-process calls saves a few lines today and couples failure domains forever.
**Data isolation.** The Bitable service owns its own `bitable` PostgreSQL schema. No AgentKit core table is read or written by Bitable code, and vice versa. A bug in either system cannot corrupt the other's data.
**API boundary.** Even though Bitable ships inside the same FastAPI app, agents and core code call it over HTTP at `/api/v1/bitable`. This keeps the boundary explicit, makes the service trivially extractable later, and forces every caller through the same auth and validation path. The lifespan init in `src/agentkit/server/app.py` mounts the Bitable router and initializes its engine/schema alongside the core app, but the route prefix is the contract.
**Internal token auth.** Agent-to-Bitable calls use a separate `X-Internal-Token` header verified with `hmac.compare_digest` (constant-time, side-channel resistant). This is distinct from the user-facing JWT auth — internal service callers get a bypass on ownership checks (see pattern 4) but never on validation.
```python
# src/agentkit/server/routes/bitable.py
async def require_bitable_auth(request: Request) -> dict[str, Any]:
internal_token = getattr(request.app.state, "bitable_internal_token", None)
if internal_token:
provided = request.headers.get("X-Internal-Token", "")
if provided and hmac.compare_digest(provided, internal_token):
return {"user_id": "__bitable_internal__", "internal": True}
user = await get_current_user(request)
if user is None:
raise HTTPException(status_code=401, detail="Authentication required")
return user
```
### 2. SSRF Prevention in URL-based Ingestion
Any feature that fetches a user-supplied URL is an SSRF vector. The standard defense is: validate scheme, resolve hostname, reject private/loopback/reserved IPs, and **re-validate on every redirect hop**. The Bitable `parse_excel_url` flow implements all four.
```python
# src/agentkit/bitable/ingestion/excel.py
def _assert_safe_host(host: str) -> None:
try:
addr = ipaddress.ip_address(host)
except ValueError:
infos = socket.getaddrinfo(host, None)
for info in infos:
ip_str = info[4][0]
try:
addr = ipaddress.ip_address(ip_str)
except ValueError:
continue
if _is_unsafe_ip(addr):
raise ValueError(f"Host {host!r} resolves to private/reserved IP {addr}")
return
if _is_unsafe_ip(addr):
raise ValueError(f"Host {host!r} is a private/loopback/reserved IP: {addr}")
def _is_unsafe_ip(addr: ipaddress._BaseAddress) -> bool:
return (
addr.is_private or addr.is_loopback or addr.is_link_local
or addr.is_reserved or addr.is_multicast or addr.is_unspecified
)
```
Two details matter:
1. **Check every resolved address, not just the first.** DNS can return multiple A records; an attacker may mix a public and a private IP. Reject if *any* resolved address is unsafe.
2. **Follow redirects manually with re-validation.** `httpx.get(url, follow_redirects=True)` silently follows to `http://169.254.169.254/...`. The ingestion code disables auto-redirect, inspects the `Location` header, and runs `_assert_safe_host` on each hop before fetching.
### 3. SQL Injection Prevention via Field ID Validation
The Bitable repository interpolates field IDs into JSONB path literals (`values->>'field_id'`). Because these are string interpolations into SQL fragments, an unvalidated `field_id` is a direct injection vector. The defense is **allow-list validation at the service layer**: before any filter or sort reaches the repository, the service confirms every `field_id` against the table's actual fields.
```python
# src/agentkit/bitable/service.py — list_records_filtered
if filters or sorts:
fields = await self._repo.list_fields(table_id)
valid_field_ids = {f.id for f in fields}
for f in filters or []:
fid = f.get("field_id", "")
if fid not in valid_field_ids:
raise ValueError(f"Unknown field_id in filter: {fid}")
```
This is stronger than parameterization alone: parameterization prevents injection but still lets a caller reference a field from a *different* table if the ID leaks. The allow-list check enforces that the field belongs to the table being queried.
### 4. IDOR Protection via Ownership Checks
Every table-level endpoint (`GET /tables/{id}`, `POST /tables/{id}/records`, etc.) is an IDOR risk if it accepts an ID and returns data without checking ownership. The Bitable service centralizes this in a `_check_table_ownership` helper called at the top of every table-scoped route.
```python
# src/agentkit/server/routes/bitable.py
async def _check_table_ownership(service, table_id, user):
table = await service.get_table(table_id)
if table is None:
raise HTTPException(status_code=404, detail="Table not found")
if user.get("internal"):
return # Internal service token bypasses ownership
if table.owner_user_id and table.owner_user_id != user.get("user_id"):
raise HTTPException(status_code=403, detail="Not authorized")
```
Two design choices are load-bearing:
- **404 before 403.** Returning 404 for a table the caller doesn't own prevents existence disclosure. An attacker probing IDs cannot distinguish "does not exist" from "exists but not yours."
- **Internal bypass is explicit and scoped.** Internal service callers (pattern 1) skip ownership but still go through validation. The bypass is a single `if user.get("internal"): return` — never a blanket `if internal: skip_all_checks`.
### 5. Atomic Task Claiming with FOR UPDATE SKIP LOCKED
When multiple recalc workers pull from the same `recalc_queue` table, a naive `SELECT ... WHERE status='pending' LIMIT 1` lets two workers grab the same row. The fix is a single atomic claim query using `FOR UPDATE SKIP LOCKED`:
```python
# src/agentkit/bitable/repository.py — claim_recalc_tasks
subq = (
select(RecalcQueueModel.id)
.where(RecalcQueueModel.status == RecalcStatus.pending.value)
.order_by(RecalcQueueModel.queued_at)
.limit(limit)
.with_for_update(skip_locked=True)
).subquery()
stmt = (
update(RecalcQueueModel)
.where(RecalcQueueModel.id.in_(select(subq.c.id)))
.values(status=RecalcStatus.calculating.value)
.returning(RecalcQueueModel)
)
```
`SKIP LOCKED` means a row already locked by another worker is invisible to this query, so concurrent workers naturally partition the queue without coordination. The `UPDATE ... RETURNING` wraps the claim and the state transition in one statement, so there is no window where a row is "claimed but still pending."
### 6. Formula Engine Cache Invalidation
The `RecalcWorker` caches a `FormulaEngine` per `table_id` (`_engines`). Without invalidation, a field rename or type change leaves the worker computing against a stale schema. The service calls `worker.invalidate_engine(table_id)` on every field create/update/delete.
```python
# src/agentkit/bitable/service.py — _invalidate_engine_cache
def _invalidate_engine_cache(self, table_id: str) -> None:
if self._recalc_worker is not None:
self._recalc_worker.invalidate_engine(table_id)
```
The construction-order wrinkle: the service creates the worker, but the worker needs a back-reference to the service for recalc triggers. This cycle is broken by `set_recalc_worker()` — the service constructs the worker, then registers itself. Cache invalidation is the service calling *into* the worker; recalc triggering is the worker calling *into* the service. Both directions work because registration happens after construction.
### 7. Composite Cursor Pagination for Non-id Sorts
When a list endpoint sorts by a field value (not the primary key), the naive `WHERE id > cursor` breaks: the cursor is an id, but the sort is by `sort_val`, so the next page is wrong. The fix is a **composite cursor**: base64-encoded JSON carrying both the last row's id and its sort values.
```json
{"id": "rec_abc", "sv": ["2026-06-20", "high"]}
```
The repository decodes the cursor and uses a row-value comparison to fetch the next page:
```sql
-- ASC sort
WHERE (sort_col1, sort_col2, id) > (cursor_val1, cursor_val2, cursor_id)
-- DESC sort
WHERE (sort_col1, sort_col2, id) < (cursor_val1, cursor_val2, cursor_id)
```
The `id` is always the final tie-breaker so pagination is stable even when sort values collide. This lives in `src/agentkit/bitable/repository.py` (`list_records_filtered`).
### 8. Batch Operations to Eliminate N+1
Three batch patterns eliminate the most common N+1 loops in the service:
- **`find_records_by_pk_batch`**: one `SELECT ... WHERE pk = ANY(:pk_values)` instead of N lookups.
- **`create_records_batch`**: `insert().values(rows).returning()` for bulk insert with returned rows.
- **`upsert_records`**: one `SELECT` to load existing records by primary key, then partition the input into insert and update lists — two queries total, not N+1.
```python
# src/agentkit/bitable/service.py — upsert_records (batch pattern)
pk_values_by_str = {str(rec_values.get(pk_field_id)): rec_values for rec_values in records}
existing_map = await self._repo.find_records_by_pk_batch(
table_id, primary_key_field_id, list(pk_values_by_str.keys())
)
for pk_str, rec_values in pk_values_by_str.items():
existing = existing_map.get(pk_str)
if existing is None:
to_insert.append(rec_values)
else:
to_update.append((rec_values, existing.id))
```
### 9. Async I/O Safety
Sync I/O must never run on the event loop thread. The Bitable tool layer wraps every sync call in `asyncio.to_thread()`:
```python
# src/agentkit/tools/bitable_tool.py
if file_path:
sheets = await asyncio.to_thread(parse_excel, file_path)
else:
sheets = await asyncio.to_thread(parse_excel_url, file_url)
```
This covers `parse_excel` (openpyxl is sync), `parse_excel_url` (httpx sync client), and `import_db_table` (SQLAlchemy reflection is sync). Without `to_thread`, a single slow Excel parse blocks every other request on the worker.
### 10. OOM Prevention
Three bounds keep memory predictable under load:
- **Streaming file upload.** The upload endpoint reads the body in 64KB chunks and rejects the request the moment the running size exceeds 10MB — no buffering the whole file first.
- **Batch delete via cursor iteration.** Attachment cleanup on table delete iterates records with cursor pagination and deletes attachments per page, instead of `SELECT *` loading every record into memory.
- **Batch insert in 500-row chunks.** `create_records_batch` splits large inputs into 500-row chunks to keep statement size and memory bounded.
```python
# src/agentkit/server/routes/bitable.py — streaming upload
total_size = 0
with open(file_path, "wb") as f:
while True:
chunk = await file.read(64 * 1024) # 64KB chunks
if not chunk:
break
total_size += len(chunk)
if total_size > MAX_UPLOAD_SIZE:
f.close()
file_path.unlink(missing_ok=True)
raise HTTPException(status_code=413, detail="File exceeds 10 MB limit")
f.write(chunk)
```
## Why This Matters
These patterns are not theoretical. Each maps to a concrete failure mode observed in review or debug:
| Pattern | Failure it prevents |
|---|---|
| Service isolation | A Bitable bug corrupting AgentKit core tables; an AgentKit outage taking down Bitable |
| SSRF prevention | Cloud metadata endpoint exfiltration (`169.254.169.254`); internal port scanning |
| Field ID validation | SQL injection via `values->>'field_id'` path literals |
| Ownership checks | IDOR — user A reading user B's tables by guessing IDs |
| `SKIP LOCKED` claiming | Duplicate recalc execution; lost tasks under worker contention |
| Cache invalidation | Stale formula results after field schema changes — silent data corruption |
| Composite cursor | Broken pagination — skipped or duplicated rows on non-id sorts |
| Batch operations | N+1 queries turning a 1000-record import into 1001 round-trips |
| `asyncio.to_thread` | Event-loop stalls blocking every concurrent request |
| OOM bounds | Worker killed mid-request on a 500MB upload or a million-row delete |
The cost of skipping any one of these is a security advisory or an outage. The cost of applying them is a few dozen lines per pattern, most of which are reusable across the next companion service.
A second reason this matters: **the patterns compose**. Service isolation (1) is what makes the internal-token bypass (4) safe — the bypass is scoped to one service, not the whole app. Field ID validation (3) is what makes the composite cursor (7) safe — the cursor carries sort values that are already validated. `SKIP LOCKED` (5) is what makes the recalc cache invalidation (6) correct — without atomic claiming, two workers could invalidate and recompute the same engine concurrently. Treating these as independent checklists misses the dependencies.
## When to Apply
Apply these patterns whenever you are:
- **Adding a companion service to an existing platform.** The isolation, internal-token, and ownership-check patterns apply regardless of the service's domain.
- **Building any feature that fetches a user-supplied URL.** SSRF prevention (scheme validation, IP allow-listing, manual redirect re-validation) is mandatory, not optional — even for "internal-only" features, because internal-only has a way of becoming external.
- **Interpolating user-controlled identifiers into SQL.** Field IDs, column names, and table names cannot be parameterized; they must be allow-listed against the actual schema.
- **Running background workers against a shared queue.** `FOR UPDATE SKIP LOCKED` is the correct default for any claim-then-process pattern on PostgreSQL.
- **Caching derived state tied to a mutable schema.** Any cache keyed by `table_id` (or equivalent) needs an invalidation hook on schema changes, and the hook must be called from the same code path that mutates the schema.
- **Paginating by a non-unique sort key.** A composite cursor with the primary key as the final tie-breaker is the general solution.
- **Doing bulk reads, writes, or deletes.** Batch the operation and bound the chunk size.
- **Calling sync libraries from an async framework.** `asyncio.to_thread` is the floor; an async-native client is better when available.
These patterns do **not** apply when:
- The service is a throwaway prototype with no external callers — isolation overhead is not justified.
- The URL fetch is hardcoded to a trusted internal endpoint with no user input — SSRF prevention is redundant (but document why).
- The sort is always by primary key — a simple `WHERE id > cursor` is correct and cheaper than a composite cursor.
## Examples
### Example 1: SSRF — before and after
**Before (vulnerable):**
```python
def parse_excel_url(url: str) -> Workbook:
resp = httpx.get(url, follow_redirects=True) # follows redirects silently, no IP check
return parse_excel_bytes(resp.content)
```
An attacker submits `http://internal-service.attacker.com/redirect` which 302s to `http://169.254.169.254/latest/meta-data/iam/security-credentials/`. The server happily fetches cloud metadata.
**After (hardened):**
```python
def parse_excel_url(url: str) -> list[ParsedSheet]:
parsed = urlparse(url)
if parsed.scheme not in ("http", "https"):
raise ValueError(f"Disallowed URL scheme: {parsed.scheme!r}")
_assert_safe_host(parsed.hostname)
resp = httpx.get(url, follow_redirects=False)
seen_redirects = 0
while resp.is_redirect and seen_redirects < 5:
seen_redirects += 1
next_url = httpx.URL(url).join(resp.headers["location"])
if next_url.scheme not in ("http", "https") or not next_url.host:
raise ValueError(f"Unsafe redirect target: {next_url}")
_assert_safe_host(next_url.host)
resp = httpx.get(next_url, follow_redirects=False)
url = str(next_url)
return parse_excel_bytes(resp.content)
```
Every hop is re-validated. A redirect to a private IP raises before the fetch.
### Example 2: IDOR — before and after
**Before (vulnerable):**
```python
@router.get("/tables/{table_id}/records")
async def list_records(table_id: str, user=Depends(get_user)):
return await service.list_records(table_id) # no ownership check
```
User B passes user A's `table_id` and reads all records.
**After (hardened):**
```python
@router.get("/tables/{table_id}/records")
async def list_records(table_id: str, user=Depends(require_bitable_auth)):
await _check_table_ownership(service, table_id, user)
return await service.list_records(table_id)
```
The ownership helper returns 404 (not 403) for tables the caller doesn't own, so existence is not disclosed.
### Example 3: N+1 — before and after
**Before (N+1):**
```python
async def upsert_records(table_id, records):
for r in records:
existing = await repo.find_record_by_pk(table_id, r.pk) # N queries
if existing:
await repo.update_record(existing.id, r) # N queries
else:
await repo.create_record(table_id, r) # N queries
```
1000 records → 2000+ queries.
**After (batched):**
```python
async def upsert_records(table_id, records, pk_field_id):
pk_values = [str(r.get(pk_field_id)) for r in records]
existing_map = await repo.find_records_by_pk_batch(table_id, pk_field_id, pk_values) # 1 query
to_insert = [r for r in records if str(r.get(pk_field_id)) not in existing_map]
to_update = [(r, existing_map[str(r.get(pk_field_id))].id) for r in records if str(r.get(pk_field_id)) in existing_map]
for r in to_insert:
await repo.create_record(table_id, r)
for r, eid in to_update:
await repo.upsert_record_agent_fields(eid, agent_values)
```
1000 records → 1 SELECT + N INSERT/UPDATE (batch INSERT available via `create_records_batch`).
### Example 4: Composite cursor — before and after
**Before (broken on non-id sort):**
```python
# sort by "priority" field, paginate by id
stmt = text("SELECT * FROM bitable_records WHERE table_id = :tid ORDER BY values->>'priority' LIMIT 50")
if cursor:
stmt = stmt.where(text("id > :cursor")) # WRONG: cursor is id, sort is priority
```
Page 2 returns rows with `id > last_id` but sorted by priority — rows with high priority but low id are skipped.
**After (composite cursor):**
```python
# cursor = base64({"id": "rec_abc", "sv": ["high"]})
if cursor:
decoded = json.loads(base64.b64decode(cursor))
cursor_id = decoded["id"]
cursor_sort_vals = decoded["sv"]
# Row-value comparison: (priority, id) > (cursor_val, cursor_id) for ASC
where_clauses.append(
f"(values->>'priority', id) > (:csv0, :cursor_id)"
)
```
Pagination is stable regardless of sort direction or value collisions.
## Related
- **`docs/plans/2026-06-24-001-feat-bitable-companion-service-plan.md`** — implementation plan with KTD7 (formula AST security), KTD11 (internal token auth), batch chunking, async recalc worker design.
- **`docs/brainstorms/2026-06-24-bitable-module-requirements.md`** — origin requirements defining companion service architecture, upsert semantics, field ownership model.
- **`docs/plans/2026-06-14-001-feat-p0-production-hardening-plan.md`** — cross-cutting SSRF protection patterns (KTD4: protocol whitelist + private IP range blocking).
- **`docs/plans/2026-06-10-018-fix-agentkit-p2-hardening-plan.md`** — SSRF DNS rebinding, `hmac.compare_digest`, concurrency safety patterns.
- **AgentKit AGENTS.md** — project-level rules on async generator safety, `hmac.compare_digest` for API key comparison, and the `HandoffTransport` bounded-queue sentinel pattern.
- **PostgreSQL docs — `FOR UPDATE SKIP LOCKED`** — canonical reference for concurrent queue claiming.
- **OWASP A01:2021 — Broken Access Control** — covers the IDOR pattern addressed by ownership checks.
- **OWASP A10:2021 — Server-Side Request Forgery** — covers the URL-ingestion SSRF pattern.
Future companion services in this repo (a vector store, a workflow runner, etc.) should be reviewed against this checklist before merge. The patterns are domain-agnostic; only the field names change.