fischer-agentkit/docs/plans/2026-06-10-018-fix-agentkit...

13 KiB

fix: AgentKit P2 Security & Reliability Hardening

Status: active Created: 2026-06-10 Origin: ce-code-review findings on main branch after 7-capabilities merge


Summary

Fix all remaining P2 issues from the final code review of the feat/agentkit-7-capabilities merge. These 16 issues span security (SSRF, auth, injection), reliability (retry, concurrency, resource leaks), and architecture (auth consistency, config reload). All are non-blocking for current functionality but represent real risk in production.

Problem Frame

After merging 7 core capabilities and fixing P0/P1 issues, the codebase has 16 P2 findings that fall into four categories:

  1. Security gaps — SSRF DNS rebinding, API key timing attacks, WebSocket auth bypass, approval race conditions, shell pattern matching, .env injection
  2. Reliability gaps — retry storms, unbounded memory growth, concurrent modification, config change loss
  3. Architecture gaps — dual auth layer inconsistency, WebSocket broadcast without isolation
  4. Defensive gaps — process kill race, lazy init race, httpx client double-creation

Requirements

  • R1: All P2 security findings must be resolved or explicitly deferred with documented rationale
  • R2: Retry logic must use exponential backoff with jitter
  • R3: All in-memory stores must have bounded growth with proper eviction
  • R4: Concurrent access to shared mutable state must be protected
  • R5: Authentication must be consistent across all endpoints and transports
  • R6: No regression in existing test suite (764+ tests must pass)

Key Technical Decisions

  • KTD1: DNS resolution for SSRF — Use asyncio.getaddrinfo (not blocking socket.getaddrinfo) to resolve hostnames before IP validation. Fail-closed on DNS errors. Accept that TOCTOU remains between check and connect; full mitigation requires IP pinning at the httpx transport level (deferred).
  • KTD2: Auth consolidation strategy — Fix per-endpoint auth first (hmac.compare_digest, WebSocket auth-before-accept), then consolidate into middleware in a separate pass. This avoids a big-bang refactor and allows incremental testing.
  • KTD3: WorkflowStore concurrency — Add asyncio.Lock to WorkflowStore, convert sync methods to async. This is the most invasive change but necessary for correctness under concurrent load.
  • KTD4: Shell dangerous patterns — Replace substring matching with token-based matching using the already-parsed shlex tokens. Separate binary-level and flag-level dangerous patterns.
  • KTD5: WebSocket isolation — Replace flat subscriber list with dict[str, set[WebSocket]] keyed by execution_id. Clients subscribe to specific executions.

Implementation Units

U1. API Key Timing Attack & WebSocket Auth-Before-Accept

Goal: Eliminate timing side-channel in API key comparison and fix WebSocket accept-before-auth pattern.

Requirements: R5

Dependencies: None

Files:

  • src/agentkit/server/routes/portal.py
  • src/agentkit/server/routes/workflows.py
  • src/agentkit/server/middleware.py
  • tests/unit/server/test_portal_routes.py
  • tests/unit/server/test_workflow_routes.py

Approach:

  1. Add import hmac to portal.py, workflows.py, middleware.py
  2. Replace all != API key comparisons with hmac.compare_digest()
  3. In both WebSocket handlers, move auth validation BEFORE websocket.accept(). On auth failure, call await websocket.close(code=4001) without accepting, then return
  4. In middleware.py, replace provided_key not in valid_keys with any(hmac.compare_digest(...) for k in valid_keys)
  5. Handle None/empty provided key by defaulting to empty string before comparison

Test scenarios:

  • API key comparison with correct key returns True
  • API key comparison with incorrect key returns False
  • API key comparison with None provided defaults to empty string
  • WebSocket connection without api_key is rejected (no 101 upgrade)
  • WebSocket connection with wrong api_key is rejected
  • WebSocket connection with correct api_key is accepted

Verification: All existing tests pass; new auth tests pass


U2. Shell Dangerous Pattern Token-Based Matching

Goal: Replace substring-based dangerous pattern matching with precise token-based matching to eliminate false positives and bypasses.

Requirements: R1

Dependencies: None

Files:

  • src/agentkit/tools/shell.py
  • tests/unit/test_shell_tool.py

Approach:

  1. Replace _DANGEROUS_PATTERNS tuple with two new structures:
    • _DANGEROUS_BINARIES: set of binary names that are always dangerous (e.g., rm, mkfs, dd, shutdown)
    • _DANGEROUS_BINARY_FLAGS: dict mapping binary name to set of dangerous flag combinations (e.g., kill: {-9, -KILL}, chmod: {777})
  2. In _is_dangerous(), after shlex parsing:
    • Check binary against _DANGEROUS_BINARIES (exact match)
    • Check binary + flags against _DANGEROUS_BINARY_FLAGS
    • Keep regex-based arg patterns for cross-token matches (e.g., > /dev/, drop table)
  3. Remove overly broad patterns like format, erase as substrings; add format as binary-only
  4. Normalize whitespace in command before matching

Test scenarios:

  • rm -rf / detected as dangerous (binary match)
  • echo 'performing rm operations' NOT detected as dangerous (no binary match)
  • kill -9 1234 detected as dangerous (binary + flag match)
  • kill -15 1234 NOT detected as dangerous (flag not in dangerous set)
  • rm\t/tmp/file detected as dangerous (tab normalization)
  • format C: detected as dangerous (binary match)
  • informatica NOT detected as dangerous (not a binary match)

Verification: Shell tool tests pass with new pattern logic


U3. .env File Prefix Filtering & Retry Backoff

Goal: Prevent arbitrary environment variable injection from .env files and add exponential backoff to plan executor retries.

Requirements: R1, R2

Dependencies: None

Files:

  • src/agentkit/server/app.py
  • src/agentkit/core/plan_executor.py
  • tests/unit/test_plan_executor.py

Approach:

  1. In app.py .env loading, add an allowlist of prefixes: AGENTKIT_, OPENAI_, ANTHROPIC_, GEMINI_, TAVILY_, SERPER_, plus exact matches for DATABASE_URL, REDIS_URL
  2. Log a warning when a .env variable is skipped due to prefix filtering
  3. In plan_executor.py _execute_step_with_retry(), add exponential backoff with jitter after each failure:
    • base_retry_delay=1.0s, max_retry_delay=30.0s
    • delay = min(base * 2^(retry-1), max) * (0.5 + random() * 0.5)
    • Skip delay on first attempt (retry_count == 0)
  4. Add base_retry_delay and max_retry_delay as constructor parameters with defaults

Test scenarios:

  • .env with AGENTKIT_FOO=bar is loaded
  • .env with PATH=/malicious is skipped with warning
  • .env with OPENAI_API_KEY=sk-xxx is loaded
  • Retry backoff: first retry waits ~1s, second ~2s, third ~4s (with jitter)
  • Max retry delay caps at 30s
  • Plan executor with max_retries=0 still works (no backoff)

Verification: Unit tests pass; .env loading respects prefix filter


U4. WorkflowStore Concurrency & Resource Management

Goal: Make WorkflowStore thread-safe, fix unbounded growth, add proper cleanup for approval events and running tasks.

Requirements: R3, R4

Dependencies: U1 (hmac.compare_digest in workflow routes)

Files:

  • src/agentkit/server/routes/workflows.py
  • tests/unit/server/test_workflow_routes.py

Approach:

  1. Add asyncio.Lock to WorkflowStore; convert save, delete, create_execution, update_execution to async methods
  2. Initialize _running_tasks in __init__ instead of lazy getattr
  3. Add _evict_execution() helper that removes execution AND its approval events (with event.set() to wake waiting coroutines)
  4. Replace _ws_subscribers: list with _ws_subscribers: dict[str, set[WebSocket]] keyed by execution_id, protected by asyncio.Lock
  5. Add _subscribe(execution_id, ws) and _unsubscribe(execution_id, ws) helpers
  6. Modify _broadcast_ws to accept execution_id parameter and only send to relevant subscribers
  7. Update all call sites from sync to async (all endpoint functions already async)
  8. Add _execution_locks: dict[str, asyncio.Lock] for per-execution approve/cancel serialization

Test scenarios:

  • Concurrent save operations don't cause KeyError or over-eviction
  • Evicted execution's approval events are cleaned up
  • WebSocket subscriber receives only messages for subscribed execution_id
  • WebSocket disconnect properly cleans up from subscriber dict
  • Concurrent approve/cancel for same execution serialized by lock
  • _running_tasks initialized in init, no getattr fallback

Verification: Workflow route tests pass; no RuntimeError from concurrent modification


U5. SSRF DNS Resolution & ComputerUse Client Safety

Goal: Add DNS resolution to SSRF protection and make ComputerUseTool's httpx client creation thread-safe.

Requirements: R1

Dependencies: None

Files:

  • src/agentkit/utils/security.py
  • src/agentkit/tools/computer_use.py
  • tests/unit/test_security.py
  • tests/unit/tools/test_computer_use.py

Approach:

  1. In is_safe_url(), after the ValueError catch for domain names, add DNS resolution:
    • Use socket.getaddrinfo(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM) (sync, acceptable for tool init)
    • Validate each resolved IP against _is_unsafe_ip()
    • Fail-closed on DNS errors (return False)
    • Add 3-second timeout on DNS resolution
  2. Add is_safe_url_async() variant using asyncio.getaddrinfo for async callers
  3. In ComputerUseTool, add asyncio.Lock to _get_http_client(), make it async with double-check locking
  4. Update caller from client = self._get_http_client() to client = await self._get_http_client()

Test scenarios:

  • is_safe_url with domain resolving to 127.0.0.1 returns False
  • is_safe_url with domain resolving to public IP returns True
  • is_safe_url with unresolvable domain returns False (fail-closed)
  • is_safe_url with ::ffff:127.0.0.1 returns False
  • is_safe_url with .internal TLD returns False
  • ComputerUseTool concurrent _get_http_client returns same client instance
  • ComputerUseTool close() properly cleans up client

Verification: Security and computer_use tests pass


U6. Config Hot-Reload & Defensive Fixes

Goal: Fix config hot-reload lock race, add proc.kill() error handling, and initialize config reload lock eagerly.

Requirements: R4

Dependencies: None

Files:

  • src/agentkit/server/app.py
  • src/agentkit/tools/shell.py

Approach:

  1. In app.py _on_config_change():
    • Initialize lock eagerly in create_app() (store on app.state._config_reload_lock)
    • Replace lock.locked() skip with pending-flag pattern: set app.state._config_reload_pending = True, then in _reload() loop while pending flag is set
    • This ensures no config change is silently dropped
  2. In shell.py _execute_standalone():
    • Wrap proc.kill() in try/except (ProcessLookupError, OSError)
    • Use proc.returncode if available instead of hardcoded -1

Test scenarios:

  • Two rapid config changes both get applied (no silent drop)
  • Config reload pending flag is cleared after reload completes
  • Shell timeout with already-exited process returns clean error (no ProcessLookupError)
  • Shell timeout output still shows timeout message

Verification: App startup and shell tool tests pass


Scope Boundaries

In Scope

  • All 16 P2 findings from ce-code-review
  • Test coverage for new security/reliability patterns

Deferred to Follow-Up Work

  • Full auth middleware consolidation (ASGI-level WebSocket middleware) — complex, separate PR
  • IP pinning at httpx transport level for complete SSRF protection
  • python-dotenv migration for robust .env parsing
  • clients.yaml caching in middleware (per-request disk read)
  • WorkflowStore persistence (currently in-memory only)

Outside This Plan's Identity

  • New features or capability additions
  • Performance optimization beyond what's needed for reliability
  • UI/UX changes

Risks & Mitigations

Risk Impact Mitigation
WorkflowStore async refactor breaks existing endpoints High Update all call sites; comprehensive test coverage
DNS resolution adds latency to URL validation Low 3s timeout; acceptable for tool initialization
Token-based shell matching misses new bypass patterns Medium Default-to-dangerous fallback; add patterns as discovered
.env prefix filtering breaks existing deployments Medium Comprehensive allowlist; log warnings for skipped vars
WebSocket isolation breaks existing WS clients Medium Maintain backward compat: accept without subscription, send all
Config reload pending loop runs indefinitely Low Coalescing naturally limits; only re-runs if new change arrived

System-Wide Impact

  • Security posture significantly improved: timing attacks, SSRF, auth bypass all addressed
  • Reliability improved: retry storms prevented, memory bounded, concurrent access safe
  • API compatibility: WorkflowStore method signatures change from sync to async — all call sites updated
  • WebSocket protocol: subscription message becomes recommended but not required for backward compat