13 KiB
fix: AgentKit P2 Security & Reliability Hardening
Status: active Created: 2026-06-10 Origin: ce-code-review findings on main branch after 7-capabilities merge
Summary
Fix all remaining P2 issues from the final code review of the feat/agentkit-7-capabilities merge. These 16 issues span security (SSRF, auth, injection), reliability (retry, concurrency, resource leaks), and architecture (auth consistency, config reload). All are non-blocking for current functionality but represent real risk in production.
Problem Frame
After merging 7 core capabilities and fixing P0/P1 issues, the codebase has 16 P2 findings that fall into four categories:
- Security gaps — SSRF DNS rebinding, API key timing attacks, WebSocket auth bypass, approval race conditions, shell pattern matching, .env injection
- Reliability gaps — retry storms, unbounded memory growth, concurrent modification, config change loss
- Architecture gaps — dual auth layer inconsistency, WebSocket broadcast without isolation
- Defensive gaps — process kill race, lazy init race, httpx client double-creation
Requirements
- R1: All P2 security findings must be resolved or explicitly deferred with documented rationale
- R2: Retry logic must use exponential backoff with jitter
- R3: All in-memory stores must have bounded growth with proper eviction
- R4: Concurrent access to shared mutable state must be protected
- R5: Authentication must be consistent across all endpoints and transports
- R6: No regression in existing test suite (764+ tests must pass)
Key Technical Decisions
- KTD1: DNS resolution for SSRF — Use
asyncio.getaddrinfo(not blockingsocket.getaddrinfo) to resolve hostnames before IP validation. Fail-closed on DNS errors. Accept that TOCTOU remains between check and connect; full mitigation requires IP pinning at the httpx transport level (deferred). - KTD2: Auth consolidation strategy — Fix per-endpoint auth first (hmac.compare_digest, WebSocket auth-before-accept), then consolidate into middleware in a separate pass. This avoids a big-bang refactor and allows incremental testing.
- KTD3: WorkflowStore concurrency — Add
asyncio.Lockto WorkflowStore, convert sync methods to async. This is the most invasive change but necessary for correctness under concurrent load. - KTD4: Shell dangerous patterns — Replace substring matching with token-based matching using the already-parsed shlex tokens. Separate binary-level and flag-level dangerous patterns.
- KTD5: WebSocket isolation — Replace flat subscriber list with
dict[str, set[WebSocket]]keyed by execution_id. Clients subscribe to specific executions.
Implementation Units
U1. API Key Timing Attack & WebSocket Auth-Before-Accept
Goal: Eliminate timing side-channel in API key comparison and fix WebSocket accept-before-auth pattern.
Requirements: R5
Dependencies: None
Files:
src/agentkit/server/routes/portal.pysrc/agentkit/server/routes/workflows.pysrc/agentkit/server/middleware.pytests/unit/server/test_portal_routes.pytests/unit/server/test_workflow_routes.py
Approach:
- Add
import hmacto portal.py, workflows.py, middleware.py - Replace all
!=API key comparisons withhmac.compare_digest() - In both WebSocket handlers, move auth validation BEFORE
websocket.accept(). On auth failure, callawait websocket.close(code=4001)without accepting, then return - In middleware.py, replace
provided_key not in valid_keyswithany(hmac.compare_digest(...) for k in valid_keys) - Handle None/empty provided key by defaulting to empty string before comparison
Test scenarios:
- API key comparison with correct key returns True
- API key comparison with incorrect key returns False
- API key comparison with None provided defaults to empty string
- WebSocket connection without api_key is rejected (no 101 upgrade)
- WebSocket connection with wrong api_key is rejected
- WebSocket connection with correct api_key is accepted
Verification: All existing tests pass; new auth tests pass
U2. Shell Dangerous Pattern Token-Based Matching
Goal: Replace substring-based dangerous pattern matching with precise token-based matching to eliminate false positives and bypasses.
Requirements: R1
Dependencies: None
Files:
src/agentkit/tools/shell.pytests/unit/test_shell_tool.py
Approach:
- Replace
_DANGEROUS_PATTERNStuple with two new structures:_DANGEROUS_BINARIES: set of binary names that are always dangerous (e.g.,rm,mkfs,dd,shutdown)_DANGEROUS_BINARY_FLAGS: dict mapping binary name to set of dangerous flag combinations (e.g.,kill: {-9, -KILL},chmod: {777})
- In
_is_dangerous(), after shlex parsing:- Check
binaryagainst_DANGEROUS_BINARIES(exact match) - Check
binary+ flags against_DANGEROUS_BINARY_FLAGS - Keep regex-based arg patterns for cross-token matches (e.g.,
> /dev/,drop table)
- Check
- Remove overly broad patterns like
format,eraseas substrings; addformatas binary-only - Normalize whitespace in command before matching
Test scenarios:
rm -rf /detected as dangerous (binary match)echo 'performing rm operations'NOT detected as dangerous (no binary match)kill -9 1234detected as dangerous (binary + flag match)kill -15 1234NOT detected as dangerous (flag not in dangerous set)rm\t/tmp/filedetected as dangerous (tab normalization)format C:detected as dangerous (binary match)informaticaNOT detected as dangerous (not a binary match)
Verification: Shell tool tests pass with new pattern logic
U3. .env File Prefix Filtering & Retry Backoff
Goal: Prevent arbitrary environment variable injection from .env files and add exponential backoff to plan executor retries.
Requirements: R1, R2
Dependencies: None
Files:
src/agentkit/server/app.pysrc/agentkit/core/plan_executor.pytests/unit/test_plan_executor.py
Approach:
- In app.py .env loading, add an allowlist of prefixes:
AGENTKIT_,OPENAI_,ANTHROPIC_,GEMINI_,TAVILY_,SERPER_, plus exact matches forDATABASE_URL,REDIS_URL - Log a warning when a .env variable is skipped due to prefix filtering
- In plan_executor.py
_execute_step_with_retry(), add exponential backoff with jitter after each failure:base_retry_delay=1.0s,max_retry_delay=30.0sdelay = min(base * 2^(retry-1), max) * (0.5 + random() * 0.5)- Skip delay on first attempt (retry_count == 0)
- Add
base_retry_delayandmax_retry_delayas constructor parameters with defaults
Test scenarios:
- .env with
AGENTKIT_FOO=baris loaded - .env with
PATH=/maliciousis skipped with warning - .env with
OPENAI_API_KEY=sk-xxxis loaded - Retry backoff: first retry waits ~1s, second ~2s, third ~4s (with jitter)
- Max retry delay caps at 30s
- Plan executor with max_retries=0 still works (no backoff)
Verification: Unit tests pass; .env loading respects prefix filter
U4. WorkflowStore Concurrency & Resource Management
Goal: Make WorkflowStore thread-safe, fix unbounded growth, add proper cleanup for approval events and running tasks.
Requirements: R3, R4
Dependencies: U1 (hmac.compare_digest in workflow routes)
Files:
src/agentkit/server/routes/workflows.pytests/unit/server/test_workflow_routes.py
Approach:
- Add
asyncio.Lockto WorkflowStore; convertsave,delete,create_execution,update_executionto async methods - Initialize
_running_tasksin__init__instead of lazygetattr - Add
_evict_execution()helper that removes execution AND its approval events (withevent.set()to wake waiting coroutines) - Replace
_ws_subscribers: listwith_ws_subscribers: dict[str, set[WebSocket]]keyed by execution_id, protected byasyncio.Lock - Add
_subscribe(execution_id, ws)and_unsubscribe(execution_id, ws)helpers - Modify
_broadcast_wsto acceptexecution_idparameter and only send to relevant subscribers - Update all call sites from sync to async (all endpoint functions already async)
- Add
_execution_locks: dict[str, asyncio.Lock]for per-execution approve/cancel serialization
Test scenarios:
- Concurrent save operations don't cause KeyError or over-eviction
- Evicted execution's approval events are cleaned up
- WebSocket subscriber receives only messages for subscribed execution_id
- WebSocket disconnect properly cleans up from subscriber dict
- Concurrent approve/cancel for same execution serialized by lock
- _running_tasks initialized in init, no getattr fallback
Verification: Workflow route tests pass; no RuntimeError from concurrent modification
U5. SSRF DNS Resolution & ComputerUse Client Safety
Goal: Add DNS resolution to SSRF protection and make ComputerUseTool's httpx client creation thread-safe.
Requirements: R1
Dependencies: None
Files:
src/agentkit/utils/security.pysrc/agentkit/tools/computer_use.pytests/unit/test_security.pytests/unit/tools/test_computer_use.py
Approach:
- In
is_safe_url(), after the ValueError catch for domain names, add DNS resolution:- Use
socket.getaddrinfo(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM)(sync, acceptable for tool init) - Validate each resolved IP against
_is_unsafe_ip() - Fail-closed on DNS errors (return False)
- Add 3-second timeout on DNS resolution
- Use
- Add
is_safe_url_async()variant usingasyncio.getaddrinfofor async callers - In ComputerUseTool, add
asyncio.Lockto_get_http_client(), make it async with double-check locking - Update caller from
client = self._get_http_client()toclient = await self._get_http_client()
Test scenarios:
- is_safe_url with domain resolving to 127.0.0.1 returns False
- is_safe_url with domain resolving to public IP returns True
- is_safe_url with unresolvable domain returns False (fail-closed)
- is_safe_url with ::ffff:127.0.0.1 returns False
- is_safe_url with .internal TLD returns False
- ComputerUseTool concurrent _get_http_client returns same client instance
- ComputerUseTool close() properly cleans up client
Verification: Security and computer_use tests pass
U6. Config Hot-Reload & Defensive Fixes
Goal: Fix config hot-reload lock race, add proc.kill() error handling, and initialize config reload lock eagerly.
Requirements: R4
Dependencies: None
Files:
src/agentkit/server/app.pysrc/agentkit/tools/shell.py
Approach:
- In app.py
_on_config_change():- Initialize lock eagerly in
create_app()(store onapp.state._config_reload_lock) - Replace
lock.locked()skip with pending-flag pattern: setapp.state._config_reload_pending = True, then in_reload()loop while pending flag is set - This ensures no config change is silently dropped
- Initialize lock eagerly in
- In shell.py
_execute_standalone():- Wrap
proc.kill()intry/except (ProcessLookupError, OSError) - Use
proc.returncodeif available instead of hardcoded -1
- Wrap
Test scenarios:
- Two rapid config changes both get applied (no silent drop)
- Config reload pending flag is cleared after reload completes
- Shell timeout with already-exited process returns clean error (no ProcessLookupError)
- Shell timeout output still shows timeout message
Verification: App startup and shell tool tests pass
Scope Boundaries
In Scope
- All 16 P2 findings from ce-code-review
- Test coverage for new security/reliability patterns
Deferred to Follow-Up Work
- Full auth middleware consolidation (ASGI-level WebSocket middleware) — complex, separate PR
- IP pinning at httpx transport level for complete SSRF protection
- python-dotenv migration for robust .env parsing
- clients.yaml caching in middleware (per-request disk read)
- WorkflowStore persistence (currently in-memory only)
Outside This Plan's Identity
- New features or capability additions
- Performance optimization beyond what's needed for reliability
- UI/UX changes
Risks & Mitigations
| Risk | Impact | Mitigation |
|---|---|---|
| WorkflowStore async refactor breaks existing endpoints | High | Update all call sites; comprehensive test coverage |
| DNS resolution adds latency to URL validation | Low | 3s timeout; acceptable for tool initialization |
| Token-based shell matching misses new bypass patterns | Medium | Default-to-dangerous fallback; add patterns as discovered |
| .env prefix filtering breaks existing deployments | Medium | Comprehensive allowlist; log warnings for skipped vars |
| WebSocket isolation breaks existing WS clients | Medium | Maintain backward compat: accept without subscription, send all |
| Config reload pending loop runs indefinitely | Low | Coalescing naturally limits; only re-runs if new change arrived |
System-Wide Impact
- Security posture significantly improved: timing attacks, SSRF, auth bypass all addressed
- Reliability improved: retry storms prevented, memory bounded, concurrent access safe
- API compatibility: WorkflowStore method signatures change from sync to async — all call sites updated
- WebSocket protocol: subscription message becomes recommended but not required for backward compat