# fix: AgentKit P2 Security & Reliability Hardening **Status:** active **Created:** 2026-06-10 **Origin:** ce-code-review findings on main branch after 7-capabilities merge --- ## Summary Fix all remaining P2 issues from the final code review of the `feat/agentkit-7-capabilities` merge. These 16 issues span security (SSRF, auth, injection), reliability (retry, concurrency, resource leaks), and architecture (auth consistency, config reload). All are non-blocking for current functionality but represent real risk in production. ## Problem Frame After merging 7 core capabilities and fixing P0/P1 issues, the codebase has 16 P2 findings that fall into four categories: 1. **Security gaps** — SSRF DNS rebinding, API key timing attacks, WebSocket auth bypass, approval race conditions, shell pattern matching, .env injection 2. **Reliability gaps** — retry storms, unbounded memory growth, concurrent modification, config change loss 3. **Architecture gaps** — dual auth layer inconsistency, WebSocket broadcast without isolation 4. **Defensive gaps** — process kill race, lazy init race, httpx client double-creation ## Requirements - R1: All P2 security findings must be resolved or explicitly deferred with documented rationale - R2: Retry logic must use exponential backoff with jitter - R3: All in-memory stores must have bounded growth with proper eviction - R4: Concurrent access to shared mutable state must be protected - R5: Authentication must be consistent across all endpoints and transports - R6: No regression in existing test suite (764+ tests must pass) ## Key Technical Decisions - **KTD1: DNS resolution for SSRF** — Use `asyncio.getaddrinfo` (not blocking `socket.getaddrinfo`) to resolve hostnames before IP validation. Fail-closed on DNS errors. Accept that TOCTOU remains between check and connect; full mitigation requires IP pinning at the httpx transport level (deferred). - **KTD2: Auth consolidation strategy** — Fix per-endpoint auth first (hmac.compare_digest, WebSocket auth-before-accept), then consolidate into middleware in a separate pass. This avoids a big-bang refactor and allows incremental testing. - **KTD3: WorkflowStore concurrency** — Add `asyncio.Lock` to WorkflowStore, convert sync methods to async. This is the most invasive change but necessary for correctness under concurrent load. - **KTD4: Shell dangerous patterns** — Replace substring matching with token-based matching using the already-parsed shlex tokens. Separate binary-level and flag-level dangerous patterns. - **KTD5: WebSocket isolation** — Replace flat subscriber list with `dict[str, set[WebSocket]]` keyed by execution_id. Clients subscribe to specific executions. --- ## Implementation Units ### U1. API Key Timing Attack & WebSocket Auth-Before-Accept **Goal:** Eliminate timing side-channel in API key comparison and fix WebSocket accept-before-auth pattern. **Requirements:** R5 **Dependencies:** None **Files:** - `src/agentkit/server/routes/portal.py` - `src/agentkit/server/routes/workflows.py` - `src/agentkit/server/middleware.py` - `tests/unit/server/test_portal_routes.py` - `tests/unit/server/test_workflow_routes.py` **Approach:** 1. Add `import hmac` to portal.py, workflows.py, middleware.py 2. Replace all `!=` API key comparisons with `hmac.compare_digest()` 3. In both WebSocket handlers, move auth validation BEFORE `websocket.accept()`. On auth failure, call `await websocket.close(code=4001)` without accepting, then return 4. In middleware.py, replace `provided_key not in valid_keys` with `any(hmac.compare_digest(...) for k in valid_keys)` 5. Handle None/empty provided key by defaulting to empty string before comparison **Test scenarios:** - API key comparison with correct key returns True - API key comparison with incorrect key returns False - API key comparison with None provided defaults to empty string - WebSocket connection without api_key is rejected (no 101 upgrade) - WebSocket connection with wrong api_key is rejected - WebSocket connection with correct api_key is accepted **Verification:** All existing tests pass; new auth tests pass --- ### U2. Shell Dangerous Pattern Token-Based Matching **Goal:** Replace substring-based dangerous pattern matching with precise token-based matching to eliminate false positives and bypasses. **Requirements:** R1 **Dependencies:** None **Files:** - `src/agentkit/tools/shell.py` - `tests/unit/test_shell_tool.py` **Approach:** 1. Replace `_DANGEROUS_PATTERNS` tuple with two new structures: - `_DANGEROUS_BINARIES`: set of binary names that are always dangerous (e.g., `rm`, `mkfs`, `dd`, `shutdown`) - `_DANGEROUS_BINARY_FLAGS`: dict mapping binary name to set of dangerous flag combinations (e.g., `kill: {-9, -KILL}`, `chmod: {777}`) 2. In `_is_dangerous()`, after shlex parsing: - Check `binary` against `_DANGEROUS_BINARIES` (exact match) - Check `binary` + flags against `_DANGEROUS_BINARY_FLAGS` - Keep regex-based arg patterns for cross-token matches (e.g., `> /dev/`, `drop table`) 3. Remove overly broad patterns like `format`, `erase` as substrings; add `format` as binary-only 4. Normalize whitespace in command before matching **Test scenarios:** - `rm -rf /` detected as dangerous (binary match) - `echo 'performing rm operations'` NOT detected as dangerous (no binary match) - `kill -9 1234` detected as dangerous (binary + flag match) - `kill -15 1234` NOT detected as dangerous (flag not in dangerous set) - `rm\t/tmp/file` detected as dangerous (tab normalization) - `format C:` detected as dangerous (binary match) - `informatica` NOT detected as dangerous (not a binary match) **Verification:** Shell tool tests pass with new pattern logic --- ### U3. .env File Prefix Filtering & Retry Backoff **Goal:** Prevent arbitrary environment variable injection from .env files and add exponential backoff to plan executor retries. **Requirements:** R1, R2 **Dependencies:** None **Files:** - `src/agentkit/server/app.py` - `src/agentkit/core/plan_executor.py` - `tests/unit/test_plan_executor.py` **Approach:** 1. In app.py .env loading, add an allowlist of prefixes: `AGENTKIT_`, `OPENAI_`, `ANTHROPIC_`, `GEMINI_`, `TAVILY_`, `SERPER_`, plus exact matches for `DATABASE_URL`, `REDIS_URL` 2. Log a warning when a .env variable is skipped due to prefix filtering 3. In plan_executor.py `_execute_step_with_retry()`, add exponential backoff with jitter after each failure: - `base_retry_delay=1.0s`, `max_retry_delay=30.0s` - `delay = min(base * 2^(retry-1), max) * (0.5 + random() * 0.5)` - Skip delay on first attempt (retry_count == 0) 4. Add `base_retry_delay` and `max_retry_delay` as constructor parameters with defaults **Test scenarios:** - .env with `AGENTKIT_FOO=bar` is loaded - .env with `PATH=/malicious` is skipped with warning - .env with `OPENAI_API_KEY=sk-xxx` is loaded - Retry backoff: first retry waits ~1s, second ~2s, third ~4s (with jitter) - Max retry delay caps at 30s - Plan executor with max_retries=0 still works (no backoff) **Verification:** Unit tests pass; .env loading respects prefix filter --- ### U4. WorkflowStore Concurrency & Resource Management **Goal:** Make WorkflowStore thread-safe, fix unbounded growth, add proper cleanup for approval events and running tasks. **Requirements:** R3, R4 **Dependencies:** U1 (hmac.compare_digest in workflow routes) **Files:** - `src/agentkit/server/routes/workflows.py` - `tests/unit/server/test_workflow_routes.py` **Approach:** 1. Add `asyncio.Lock` to WorkflowStore; convert `save`, `delete`, `create_execution`, `update_execution` to async methods 2. Initialize `_running_tasks` in `__init__` instead of lazy `getattr` 3. Add `_evict_execution()` helper that removes execution AND its approval events (with `event.set()` to wake waiting coroutines) 4. Replace `_ws_subscribers: list` with `_ws_subscribers: dict[str, set[WebSocket]]` keyed by execution_id, protected by `asyncio.Lock` 5. Add `_subscribe(execution_id, ws)` and `_unsubscribe(execution_id, ws)` helpers 6. Modify `_broadcast_ws` to accept `execution_id` parameter and only send to relevant subscribers 7. Update all call sites from sync to async (all endpoint functions already async) 8. Add `_execution_locks: dict[str, asyncio.Lock]` for per-execution approve/cancel serialization **Test scenarios:** - Concurrent save operations don't cause KeyError or over-eviction - Evicted execution's approval events are cleaned up - WebSocket subscriber receives only messages for subscribed execution_id - WebSocket disconnect properly cleans up from subscriber dict - Concurrent approve/cancel for same execution serialized by lock - _running_tasks initialized in __init__, no getattr fallback **Verification:** Workflow route tests pass; no RuntimeError from concurrent modification --- ### U5. SSRF DNS Resolution & ComputerUse Client Safety **Goal:** Add DNS resolution to SSRF protection and make ComputerUseTool's httpx client creation thread-safe. **Requirements:** R1 **Dependencies:** None **Files:** - `src/agentkit/utils/security.py` - `src/agentkit/tools/computer_use.py` - `tests/unit/test_security.py` - `tests/unit/tools/test_computer_use.py` **Approach:** 1. In `is_safe_url()`, after the ValueError catch for domain names, add DNS resolution: - Use `socket.getaddrinfo(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM)` (sync, acceptable for tool init) - Validate each resolved IP against `_is_unsafe_ip()` - Fail-closed on DNS errors (return False) - Add 3-second timeout on DNS resolution 2. Add `is_safe_url_async()` variant using `asyncio.getaddrinfo` for async callers 3. In ComputerUseTool, add `asyncio.Lock` to `_get_http_client()`, make it async with double-check locking 4. Update caller from `client = self._get_http_client()` to `client = await self._get_http_client()` **Test scenarios:** - is_safe_url with domain resolving to 127.0.0.1 returns False - is_safe_url with domain resolving to public IP returns True - is_safe_url with unresolvable domain returns False (fail-closed) - is_safe_url with ::ffff:127.0.0.1 returns False - is_safe_url with .internal TLD returns False - ComputerUseTool concurrent _get_http_client returns same client instance - ComputerUseTool close() properly cleans up client **Verification:** Security and computer_use tests pass --- ### U6. Config Hot-Reload & Defensive Fixes **Goal:** Fix config hot-reload lock race, add proc.kill() error handling, and initialize config reload lock eagerly. **Requirements:** R4 **Dependencies:** None **Files:** - `src/agentkit/server/app.py` - `src/agentkit/tools/shell.py` **Approach:** 1. In app.py `_on_config_change()`: - Initialize lock eagerly in `create_app()` (store on `app.state._config_reload_lock`) - Replace `lock.locked()` skip with pending-flag pattern: set `app.state._config_reload_pending = True`, then in `_reload()` loop while pending flag is set - This ensures no config change is silently dropped 2. In shell.py `_execute_standalone()`: - Wrap `proc.kill()` in `try/except (ProcessLookupError, OSError)` - Use `proc.returncode` if available instead of hardcoded -1 **Test scenarios:** - Two rapid config changes both get applied (no silent drop) - Config reload pending flag is cleared after reload completes - Shell timeout with already-exited process returns clean error (no ProcessLookupError) - Shell timeout output still shows timeout message **Verification:** App startup and shell tool tests pass --- ## Scope Boundaries ### In Scope - All 16 P2 findings from ce-code-review - Test coverage for new security/reliability patterns ### Deferred to Follow-Up Work - Full auth middleware consolidation (ASGI-level WebSocket middleware) — complex, separate PR - IP pinning at httpx transport level for complete SSRF protection - python-dotenv migration for robust .env parsing - clients.yaml caching in middleware (per-request disk read) - WorkflowStore persistence (currently in-memory only) ### Outside This Plan's Identity - New features or capability additions - Performance optimization beyond what's needed for reliability - UI/UX changes --- ## Risks & Mitigations | Risk | Impact | Mitigation | |------|--------|------------| | WorkflowStore async refactor breaks existing endpoints | High | Update all call sites; comprehensive test coverage | | DNS resolution adds latency to URL validation | Low | 3s timeout; acceptable for tool initialization | | Token-based shell matching misses new bypass patterns | Medium | Default-to-dangerous fallback; add patterns as discovered | | .env prefix filtering breaks existing deployments | Medium | Comprehensive allowlist; log warnings for skipped vars | | WebSocket isolation breaks existing WS clients | Medium | Maintain backward compat: accept without subscription, send all | | Config reload pending loop runs indefinitely | Low | Coalescing naturally limits; only re-runs if new change arrived | --- ## System-Wide Impact - **Security posture** significantly improved: timing attacks, SSRF, auth bypass all addressed - **Reliability** improved: retry storms prevented, memory bounded, concurrent access safe - **API compatibility**: WorkflowStore method signatures change from sync to async — all call sites updated - **WebSocket protocol**: subscription message becomes recommended but not required for backward compat