23 KiB

Raw Blame History

title

date

Bitable Companion Service: Security and Reliability Patterns

Context

The Fischer AgentKit project (Python 3.11+, FastAPI, SQLAlchemy 2 async, Vue 3) introduced a Bitable (multi-dimensional table) companion service to give agents and end-users a structured-record store alongside AgentKit's existing chat, memory, and orchestration layers. Unlike a greenfield module, the Bitable service had to slot into a deployed system that already runs an HTTP API, a PostgreSQL database, and a Redis bus — and it had to do so without weakening AgentKit's existing security posture.

During ce-code-review and ce-debug of the new service, multiple security and reliability issues surfaced. They were not exotic bugs; they were the standard failure modes that show up whenever a CRUD service with URL-based ingestion, dynamic SQL filtering, file uploads, and background workers is bolted onto an existing platform:

SSRF exposure from a "fetch Excel from URL" feature
SQL injection through field IDs interpolated into JSONB path literals
IDOR (insecure direct object reference) on every table-level endpoint
Race conditions when multiple recalc workers pull from the same queue
Stale formula-engine caches after field schema changes
Broken pagination when sorting by non-id columns
N+1 queries on batch lookups and upserts
Event-loop blocking from sync Excel/DB-reflect calls
OOM risk from unbounded uploads and bulk deletes

This document captures the architecture patterns and security/reliability practices applied during the fix, so the next companion service (or the next reviewer of this one) starts from a known-good baseline rather than re-deriving each rule from first principles.

Guidance

The patterns below are organized by the concern they address. Each is shown with the actual code shape used in the Bitable service.

1. Companion Service Isolation

A companion service should be isolated at the data, API, and auth layers, even when co-deployed with the host system. Sharing tables or in-process calls saves a few lines today and couples failure domains forever.

Data isolation. The Bitable service owns its own bitable PostgreSQL schema. No AgentKit core table is read or written by Bitable code, and vice versa. A bug in either system cannot corrupt the other's data.

API boundary. Even though Bitable ships inside the same FastAPI app, agents and core code call it over HTTP at /api/v1/bitable. This keeps the boundary explicit, makes the service trivially extractable later, and forces every caller through the same auth and validation path. The lifespan init in src/agentkit/server/app.py mounts the Bitable router and initializes its engine/schema alongside the core app, but the route prefix is the contract.

Internal token auth. Agent-to-Bitable calls use a separate X-Internal-Token header verified with hmac.compare_digest (constant-time, side-channel resistant). This is distinct from the user-facing JWT auth — internal service callers get a bypass on ownership checks (see pattern 4) but never on validation.

# src/agentkit/server/routes/bitable.py
async def require_bitable_auth(request: Request) -> dict[str, Any]:
    internal_token = getattr(request.app.state, "bitable_internal_token", None)
    if internal_token:
        provided = request.headers.get("X-Internal-Token", "")
        if provided and hmac.compare_digest(provided, internal_token):
            return {"user_id": "__bitable_internal__", "internal": True}
    user = await get_current_user(request)
    if user is None:
        raise HTTPException(status_code=401, detail="Authentication required")
    return user

2. SSRF Prevention in URL-based Ingestion

Any feature that fetches a user-supplied URL is an SSRF vector. The standard defense is: validate scheme, resolve hostname, reject private/loopback/reserved IPs, and re-validate on every redirect hop. The Bitable parse_excel_url flow implements all four.

# src/agentkit/bitable/ingestion/excel.py
def _assert_safe_host(host: str) -> None:
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        infos = socket.getaddrinfo(host, None)
        for info in infos:
            ip_str = info[4][0]
            try:
                addr = ipaddress.ip_address(ip_str)
            except ValueError:
                continue
            if _is_unsafe_ip(addr):
                raise ValueError(f"Host {host!r} resolves to private/reserved IP {addr}")
        return
    if _is_unsafe_ip(addr):
        raise ValueError(f"Host {host!r} is a private/loopback/reserved IP: {addr}")

def _is_unsafe_ip(addr: ipaddress._BaseAddress) -> bool:
    return (
        addr.is_private or addr.is_loopback or addr.is_link_local
        or addr.is_reserved or addr.is_multicast or addr.is_unspecified
    )

Two details matter:

Check every resolved address, not just the first. DNS can return multiple A records; an attacker may mix a public and a private IP. Reject if any resolved address is unsafe.
Follow redirects manually with re-validation. httpx.get(url, follow_redirects=True) silently follows to http://169.254.169.254/.... The ingestion code disables auto-redirect, inspects the Location header, and runs _assert_safe_host on each hop before fetching.

3. SQL Injection Prevention via Field ID Validation

The Bitable repository interpolates field IDs into JSONB path literals (values->>'field_id'). Because these are string interpolations into SQL fragments, an unvalidated field_id is a direct injection vector. The defense is allow-list validation at the service layer: before any filter or sort reaches the repository, the service confirms every field_id against the table's actual fields.

# src/agentkit/bitable/service.py — list_records_filtered
if filters or sorts:
    fields = await self._repo.list_fields(table_id)
    valid_field_ids = {f.id for f in fields}
    for f in filters or []:
        fid = f.get("field_id", "")
        if fid not in valid_field_ids:
            raise ValueError(f"Unknown field_id in filter: {fid}")

This is stronger than parameterization alone: parameterization prevents injection but still lets a caller reference a field from a different table if the ID leaks. The allow-list check enforces that the field belongs to the table being queried.

4. IDOR Protection via Ownership Checks

Every table-level endpoint (GET /tables/{id}, POST /tables/{id}/records, etc.) is an IDOR risk if it accepts an ID and returns data without checking ownership. The Bitable service centralizes this in a _check_table_ownership helper called at the top of every table-scoped route.

# src/agentkit/server/routes/bitable.py
async def _check_table_ownership(service, table_id, user):
    table = await service.get_table(table_id)
    if table is None:
        raise HTTPException(status_code=404, detail="Table not found")
    if user.get("internal"):
        return  # Internal service token bypasses ownership
    if table.owner_user_id and table.owner_user_id != user.get("user_id"):
        raise HTTPException(status_code=403, detail="Not authorized")

Two design choices are load-bearing:

404 before 403. Returning 404 for a table the caller doesn't own prevents existence disclosure. An attacker probing IDs cannot distinguish "does not exist" from "exists but not yours."
Internal bypass is explicit and scoped. Internal service callers (pattern 1) skip ownership but still go through validation. The bypass is a single if user.get("internal"): return — never a blanket if internal: skip_all_checks.

5. Atomic Task Claiming with FOR UPDATE SKIP LOCKED

When multiple recalc workers pull from the same recalc_queue table, a naive SELECT ... WHERE status='pending' LIMIT 1 lets two workers grab the same row. The fix is a single atomic claim query using FOR UPDATE SKIP LOCKED:

# src/agentkit/bitable/repository.py — claim_recalc_tasks
subq = (
    select(RecalcQueueModel.id)
    .where(RecalcQueueModel.status == RecalcStatus.pending.value)
    .order_by(RecalcQueueModel.queued_at)
    .limit(limit)
    .with_for_update(skip_locked=True)
).subquery()
stmt = (
    update(RecalcQueueModel)
    .where(RecalcQueueModel.id.in_(select(subq.c.id)))
    .values(status=RecalcStatus.calculating.value)
    .returning(RecalcQueueModel)
)

SKIP LOCKED means a row already locked by another worker is invisible to this query, so concurrent workers naturally partition the queue without coordination. The UPDATE ... RETURNING wraps the claim and the state transition in one statement, so there is no window where a row is "claimed but still pending."

6. Formula Engine Cache Invalidation

The RecalcWorker caches a FormulaEngine per table_id (_engines). Without invalidation, a field rename or type change leaves the worker computing against a stale schema. The service calls worker.invalidate_engine(table_id) on every field create/update/delete.

# src/agentkit/bitable/service.py — _invalidate_engine_cache
def _invalidate_engine_cache(self, table_id: str) -> None:
    if self._recalc_worker is not None:
        self._recalc_worker.invalidate_engine(table_id)

The construction-order wrinkle: the service creates the worker, but the worker needs a back-reference to the service for recalc triggers. This cycle is broken by set_recalc_worker() — the service constructs the worker, then registers itself. Cache invalidation is the service calling into the worker; recalc triggering is the worker calling into the service. Both directions work because registration happens after construction.

7. Composite Cursor Pagination for Non-id Sorts

When a list endpoint sorts by a field value (not the primary key), the naive WHERE id > cursor breaks: the cursor is an id, but the sort is by sort_val, so the next page is wrong. The fix is a composite cursor: base64-encoded JSON carrying both the last row's id and its sort values.

{"id": "rec_abc", "sv": ["2026-06-20", "high"]}

The repository decodes the cursor and uses a row-value comparison to fetch the next page:

-- ASC sort
WHERE (sort_col1, sort_col2, id) > (cursor_val1, cursor_val2, cursor_id)
-- DESC sort
WHERE (sort_col1, sort_col2, id) < (cursor_val1, cursor_val2, cursor_id)

The id is always the final tie-breaker so pagination is stable even when sort values collide. This lives in src/agentkit/bitable/repository.py (list_records_filtered).

8. Batch Operations to Eliminate N+1

Three batch patterns eliminate the most common N+1 loops in the service:

find_records_by_pk_batch: one SELECT ... WHERE pk = ANY(:pk_values) instead of N lookups.
create_records_batch: insert().values(rows).returning() for bulk insert with returned rows.
upsert_records: one SELECT to load existing records by primary key, then partition the input into insert and update lists — two queries total, not N+1.

# src/agentkit/bitable/service.py — upsert_records (batch pattern)
pk_values_by_str = {str(rec_values.get(pk_field_id)): rec_values for rec_values in records}
existing_map = await self._repo.find_records_by_pk_batch(
    table_id, primary_key_field_id, list(pk_values_by_str.keys())
)
for pk_str, rec_values in pk_values_by_str.items():
    existing = existing_map.get(pk_str)
    if existing is None:
        to_insert.append(rec_values)
    else:
        to_update.append((rec_values, existing.id))

9. Async I/O Safety

Sync I/O must never run on the event loop thread. The Bitable tool layer wraps every sync call in asyncio.to_thread():

# src/agentkit/tools/bitable_tool.py
if file_path:
    sheets = await asyncio.to_thread(parse_excel, file_path)
else:
    sheets = await asyncio.to_thread(parse_excel_url, file_url)

This covers parse_excel (openpyxl is sync), parse_excel_url (httpx sync client), and import_db_table (SQLAlchemy reflection is sync). Without to_thread, a single slow Excel parse blocks every other request on the worker.

10. OOM Prevention

Three bounds keep memory predictable under load:

Streaming file upload. The upload endpoint reads the body in 64KB chunks and rejects the request the moment the running size exceeds 10MB — no buffering the whole file first.
Batch delete via cursor iteration. Attachment cleanup on table delete iterates records with cursor pagination and deletes attachments per page, instead of SELECT * loading every record into memory.
Batch insert in 500-row chunks. create_records_batch splits large inputs into 500-row chunks to keep statement size and memory bounded.

# src/agentkit/server/routes/bitable.py — streaming upload
total_size = 0
with open(file_path, "wb") as f:
    while True:
        chunk = await file.read(64 * 1024)  # 64KB chunks
        if not chunk:
            break
        total_size += len(chunk)
        if total_size > MAX_UPLOAD_SIZE:
            f.close()
            file_path.unlink(missing_ok=True)
            raise HTTPException(status_code=413, detail="File exceeds 10 MB limit")
        f.write(chunk)

Why This Matters

These patterns are not theoretical. Each maps to a concrete failure mode observed in review or debug:

Pattern	Failure it prevents
Service isolation	A Bitable bug corrupting AgentKit core tables; an AgentKit outage taking down Bitable
SSRF prevention	Cloud metadata endpoint exfiltration (`169.254.169.254`); internal port scanning
Field ID validation	SQL injection via `values->>'field_id'` path literals
Ownership checks	IDOR — user A reading user B's tables by guessing IDs
`SKIP LOCKED` claiming	Duplicate recalc execution; lost tasks under worker contention
Cache invalidation	Stale formula results after field schema changes — silent data corruption
Composite cursor	Broken pagination — skipped or duplicated rows on non-id sorts
Batch operations	N+1 queries turning a 1000-record import into 1001 round-trips
`asyncio.to_thread`	Event-loop stalls blocking every concurrent request
OOM bounds	Worker killed mid-request on a 500MB upload or a million-row delete

The cost of skipping any one of these is a security advisory or an outage. The cost of applying them is a few dozen lines per pattern, most of which are reusable across the next companion service.

A second reason this matters: the patterns compose. Service isolation (1) is what makes the internal-token bypass (4) safe — the bypass is scoped to one service, not the whole app. Field ID validation (3) is what makes the composite cursor (7) safe — the cursor carries sort values that are already validated. SKIP LOCKED (5) is what makes the recalc cache invalidation (6) correct — without atomic claiming, two workers could invalidate and recompute the same engine concurrently. Treating these as independent checklists misses the dependencies.

When to Apply

Apply these patterns whenever you are:

Adding a companion service to an existing platform. The isolation, internal-token, and ownership-check patterns apply regardless of the service's domain.
Building any feature that fetches a user-supplied URL. SSRF prevention (scheme validation, IP allow-listing, manual redirect re-validation) is mandatory, not optional — even for "internal-only" features, because internal-only has a way of becoming external.
Interpolating user-controlled identifiers into SQL. Field IDs, column names, and table names cannot be parameterized; they must be allow-listed against the actual schema.
Running background workers against a shared queue. FOR UPDATE SKIP LOCKED is the correct default for any claim-then-process pattern on PostgreSQL.
Caching derived state tied to a mutable schema. Any cache keyed by table_id (or equivalent) needs an invalidation hook on schema changes, and the hook must be called from the same code path that mutates the schema.
Paginating by a non-unique sort key. A composite cursor with the primary key as the final tie-breaker is the general solution.
Doing bulk reads, writes, or deletes. Batch the operation and bound the chunk size.
Calling sync libraries from an async framework. asyncio.to_thread is the floor; an async-native client is better when available.

These patterns do not apply when:

The service is a throwaway prototype with no external callers — isolation overhead is not justified.
The URL fetch is hardcoded to a trusted internal endpoint with no user input — SSRF prevention is redundant (but document why).
The sort is always by primary key — a simple WHERE id > cursor is correct and cheaper than a composite cursor.

Examples

Example 1: SSRF — before and after

Before (vulnerable):

def parse_excel_url(url: str) -> Workbook:
    resp = httpx.get(url, follow_redirects=True)  # follows redirects silently, no IP check
    return parse_excel_bytes(resp.content)

An attacker submits http://internal-service.attacker.com/redirect which 302s to http://169.254.169.254/latest/meta-data/iam/security-credentials/. The server happily fetches cloud metadata.

After (hardened):

def parse_excel_url(url: str) -> list[ParsedSheet]:
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        raise ValueError(f"Disallowed URL scheme: {parsed.scheme!r}")
    _assert_safe_host(parsed.hostname)

    resp = httpx.get(url, follow_redirects=False)
    seen_redirects = 0
    while resp.is_redirect and seen_redirects < 5:
        seen_redirects += 1
        next_url = httpx.URL(url).join(resp.headers["location"])
        if next_url.scheme not in ("http", "https") or not next_url.host:
            raise ValueError(f"Unsafe redirect target: {next_url}")
        _assert_safe_host(next_url.host)
        resp = httpx.get(next_url, follow_redirects=False)
        url = str(next_url)
    return parse_excel_bytes(resp.content)

Every hop is re-validated. A redirect to a private IP raises before the fetch.

Example 2: IDOR — before and after

Before (vulnerable):

@router.get("/tables/{table_id}/records")
async def list_records(table_id: str, user=Depends(get_user)):
    return await service.list_records(table_id)  # no ownership check

User B passes user A's table_id and reads all records.

After (hardened):

@router.get("/tables/{table_id}/records")
async def list_records(table_id: str, user=Depends(require_bitable_auth)):
    await _check_table_ownership(service, table_id, user)
    return await service.list_records(table_id)

The ownership helper returns 404 (not 403) for tables the caller doesn't own, so existence is not disclosed.

Example 3: N+1 — before and after

Before (N+1):

async def upsert_records(table_id, records):
    for r in records:
        existing = await repo.find_record_by_pk(table_id, r.pk)  # N queries
        if existing:
            await repo.update_record(existing.id, r)  # N queries
        else:
            await repo.create_record(table_id, r)  # N queries

1000 records → 2000+ queries.

After (batched):

async def upsert_records(table_id, records, pk_field_id):
    pk_values = [str(r.get(pk_field_id)) for r in records]
    existing_map = await repo.find_records_by_pk_batch(table_id, pk_field_id, pk_values)  # 1 query
    to_insert = [r for r in records if str(r.get(pk_field_id)) not in existing_map]
    to_update = [(r, existing_map[str(r.get(pk_field_id))].id) for r in records if str(r.get(pk_field_id)) in existing_map]
    for r in to_insert:
        await repo.create_record(table_id, r)
    for r, eid in to_update:
        await repo.upsert_record_agent_fields(eid, agent_values)

1000 records → 1 SELECT + N INSERT/UPDATE (batch INSERT available via create_records_batch).

Example 4: Composite cursor — before and after

Before (broken on non-id sort):

# sort by "priority" field, paginate by id
stmt = text("SELECT * FROM bitable_records WHERE table_id = :tid ORDER BY values->>'priority' LIMIT 50")
if cursor:
    stmt = stmt.where(text("id > :cursor"))  # WRONG: cursor is id, sort is priority

Page 2 returns rows with id > last_id but sorted by priority — rows with high priority but low id are skipped.

After (composite cursor):

# cursor = base64({"id": "rec_abc", "sv": ["high"]})
if cursor:
    decoded = json.loads(base64.b64decode(cursor))
    cursor_id = decoded["id"]
    cursor_sort_vals = decoded["sv"]
    # Row-value comparison: (priority, id) > (cursor_val, cursor_id) for ASC
    where_clauses.append(
        f"(values->>'priority', id) > (:csv0, :cursor_id)"
    )

Pagination is stable regardless of sort direction or value collisions.

docs/plans/2026-06-24-001-feat-bitable-companion-service-plan.md — implementation plan with KTD7 (formula AST security), KTD11 (internal token auth), batch chunking, async recalc worker design.
docs/brainstorms/2026-06-24-bitable-module-requirements.md — origin requirements defining companion service architecture, upsert semantics, field ownership model.
docs/plans/2026-06-14-001-feat-p0-production-hardening-plan.md — cross-cutting SSRF protection patterns (KTD4: protocol whitelist + private IP range blocking).
docs/plans/2026-06-10-018-fix-agentkit-p2-hardening-plan.md — SSRF DNS rebinding, hmac.compare_digest, concurrency safety patterns.
AgentKit AGENTS.md — project-level rules on async generator safety, hmac.compare_digest for API key comparison, and the HandoffTransport bounded-queue sentinel pattern.
PostgreSQL docs — FOR UPDATE SKIP LOCKED — canonical reference for concurrent queue claiming.
OWASP A01:2021 — Broken Access Control — covers the IDOR pattern addressed by ownership checks.
OWASP A10:2021 — Server-Side Request Forgery — covers the URL-ingestion SSRF pattern.

Future companion services in this repo (a vector store, a workflow runner, etc.) should be reviewed against this checklist before merge. The patterns are domain-agnostic; only the field names change.

23 KiB Raw Blame History