fischer-agentkit/docs/plans/2026-06-10-018-fix-agentkit...

# fix: AgentKit P2 Security & Reliability Hardening

**Status:** active
**Created:** 2026-06-10
**Origin:** ce-code-review findings on main branch after 7-capabilities merge

---

## Summary

Fix all remaining P2 issues from the final code review of the `feat/agentkit-7-capabilities` merge. These 16 issues span security (SSRF, auth, injection), reliability (retry, concurrency, resource leaks), and architecture (auth consistency, config reload). All are non-blocking for current functionality but represent real risk in production.

## Problem Frame

After merging 7 core capabilities and fixing P0/P1 issues, the codebase has 16 P2 findings that fall into four categories:

1. **Security gaps** — SSRF DNS rebinding, API key timing attacks, WebSocket auth bypass, approval race conditions, shell pattern matching, .env injection
2. **Reliability gaps** — retry storms, unbounded memory growth, concurrent modification, config change loss
3. **Architecture gaps** — dual auth layer inconsistency, WebSocket broadcast without isolation
4. **Defensive gaps** — process kill race, lazy init race, httpx client double-creation

## Requirements

- R1: All P2 security findings must be resolved or explicitly deferred with documented rationale
- R2: Retry logic must use exponential backoff with jitter
- R3: All in-memory stores must have bounded growth with proper eviction
- R4: Concurrent access to shared mutable state must be protected
- R5: Authentication must be consistent across all endpoints and transports
- R6: No regression in existing test suite (764+ tests must pass)

## Key Technical Decisions

- **KTD1: DNS resolution for SSRF** — Use `asyncio.getaddrinfo` (not blocking `socket.getaddrinfo`) to resolve hostnames before IP validation. Fail-closed on DNS errors. Accept that TOCTOU remains between check and connect; full mitigation requires IP pinning at the httpx transport level (deferred).
- **KTD2: Auth consolidation strategy** — Fix per-endpoint auth first (hmac.compare_digest, WebSocket auth-before-accept), then consolidate into middleware in a separate pass. This avoids a big-bang refactor and allows incremental testing.
- **KTD3: WorkflowStore concurrency** — Add `asyncio.Lock` to WorkflowStore, convert sync methods to async. This is the most invasive change but necessary for correctness under concurrent load.
- **KTD4: Shell dangerous patterns** — Replace substring matching with token-based matching using the already-parsed shlex tokens. Separate binary-level and flag-level dangerous patterns.
- **KTD5: WebSocket isolation** — Replace flat subscriber list with `dict[str, set[WebSocket]]` keyed by execution_id. Clients subscribe to specific executions.

---

## Implementation Units

### U1. API Key Timing Attack & WebSocket Auth-Before-Accept

**Goal:** Eliminate timing side-channel in API key comparison and fix WebSocket accept-before-auth pattern.

**Requirements:** R5

**Dependencies:** None

**Files:**
- `src/agentkit/server/routes/portal.py`
- `src/agentkit/server/routes/workflows.py`
- `src/agentkit/server/middleware.py`
- `tests/unit/server/test_portal_routes.py`
- `tests/unit/server/test_workflow_routes.py`

**Approach:**
1. Add `import hmac` to portal.py, workflows.py, middleware.py
2. Replace all `!=` API key comparisons with `hmac.compare_digest()`
3. In both WebSocket handlers, move auth validation BEFORE `websocket.accept()`. On auth failure, call `await websocket.close(code=4001)` without accepting, then return
4. In middleware.py, replace `provided_key not in valid_keys` with `any(hmac.compare_digest(...) for k in valid_keys)`
5. Handle None/empty provided key by defaulting to empty string before comparison

**Test scenarios:**
- API key comparison with correct key returns True
- API key comparison with incorrect key returns False
- API key comparison with None provided defaults to empty string
- WebSocket connection without api_key is rejected (no 101 upgrade)
- WebSocket connection with wrong api_key is rejected
- WebSocket connection with correct api_key is accepted

**Verification:** All existing tests pass; new auth tests pass

---

### U2. Shell Dangerous Pattern Token-Based Matching

**Goal:** Replace substring-based dangerous pattern matching with precise token-based matching to eliminate false positives and bypasses.

**Requirements:** R1

**Dependencies:** None

**Files:**
- `src/agentkit/tools/shell.py`
- `tests/unit/test_shell_tool.py`

**Approach:**
1. Replace `_DANGEROUS_PATTERNS` tuple with two new structures:
   - `_DANGEROUS_BINARIES`: set of binary names that are always dangerous (e.g., `rm`, `mkfs`, `dd`, `shutdown`)
   - `_DANGEROUS_BINARY_FLAGS`: dict mapping binary name to set of dangerous flag combinations (e.g., `kill: {-9, -KILL}`, `chmod: {777}`)
2. In `_is_dangerous()`, after shlex parsing:
   - Check `binary` against `_DANGEROUS_BINARIES` (exact match)
   - Check `binary` + flags against `_DANGEROUS_BINARY_FLAGS`
   - Keep regex-based arg patterns for cross-token matches (e.g., `> /dev/`, `drop table`)
3. Remove overly broad patterns like `format`, `erase` as substrings; add `format` as binary-only
4. Normalize whitespace in command before matching

**Test scenarios:**
- `rm -rf /` detected as dangerous (binary match)
- `echo 'performing rm operations'` NOT detected as dangerous (no binary match)
- `kill -9 1234` detected as dangerous (binary + flag match)
- `kill -15 1234` NOT detected as dangerous (flag not in dangerous set)
- `rm\t/tmp/file` detected as dangerous (tab normalization)
- `format C:` detected as dangerous (binary match)
- `informatica` NOT detected as dangerous (not a binary match)

**Verification:** Shell tool tests pass with new pattern logic

---

### U3. .env File Prefix Filtering & Retry Backoff

**Goal:** Prevent arbitrary environment variable injection from .env files and add exponential backoff to plan executor retries.

**Requirements:** R1, R2

**Dependencies:** None

**Files:**
- `src/agentkit/server/app.py`
- `src/agentkit/core/plan_executor.py`
- `tests/unit/test_plan_executor.py`

**Approach:**
1. In app.py .env loading, add an allowlist of prefixes: `AGENTKIT_`, `OPENAI_`, `ANTHROPIC_`, `GEMINI_`, `TAVILY_`, `SERPER_`, plus exact matches for `DATABASE_URL`, `REDIS_URL`
2. Log a warning when a .env variable is skipped due to prefix filtering
3. In plan_executor.py `_execute_step_with_retry()`, add exponential backoff with jitter after each failure:
   - `base_retry_delay=1.0s`, `max_retry_delay=30.0s`
   - `delay = min(base * 2^(retry-1), max) * (0.5 + random() * 0.5)`
   - Skip delay on first attempt (retry_count == 0)
4. Add `base_retry_delay` and `max_retry_delay` as constructor parameters with defaults

**Test scenarios:**
- .env with `AGENTKIT_FOO=bar` is loaded
- .env with `PATH=/malicious` is skipped with warning
- .env with `OPENAI_API_KEY=sk-xxx` is loaded
- Retry backoff: first retry waits ~1s, second ~2s, third ~4s (with jitter)
- Max retry delay caps at 30s
- Plan executor with max_retries=0 still works (no backoff)

**Verification:** Unit tests pass; .env loading respects prefix filter

---

### U4. WorkflowStore Concurrency & Resource Management

**Goal:** Make WorkflowStore thread-safe, fix unbounded growth, add proper cleanup for approval events and running tasks.

**Requirements:** R3, R4

**Dependencies:** U1 (hmac.compare_digest in workflow routes)

**Files:**
- `src/agentkit/server/routes/workflows.py`
- `tests/unit/server/test_workflow_routes.py`

**Approach:**
1. Add `asyncio.Lock` to WorkflowStore; convert `save`, `delete`, `create_execution`, `update_execution` to async methods
2. Initialize `_running_tasks` in `__init__` instead of lazy `getattr`
3. Add `_evict_execution()` helper that removes execution AND its approval events (with `event.set()` to wake waiting coroutines)
4. Replace `_ws_subscribers: list` with `_ws_subscribers: dict[str, set[WebSocket]]` keyed by execution_id, protected by `asyncio.Lock`
5. Add `_subscribe(execution_id, ws)` and `_unsubscribe(execution_id, ws)` helpers
6. Modify `_broadcast_ws` to accept `execution_id` parameter and only send to relevant subscribers
7. Update all call sites from sync to async (all endpoint functions already async)
8. Add `_execution_locks: dict[str, asyncio.Lock]` for per-execution approve/cancel serialization

**Test scenarios:**
- Concurrent save operations don't cause KeyError or over-eviction
- Evicted execution's approval events are cleaned up
- WebSocket subscriber receives only messages for subscribed execution_id
- WebSocket disconnect properly cleans up from subscriber dict
- Concurrent approve/cancel for same execution serialized by lock
- _running_tasks initialized in __init__, no getattr fallback

**Verification:** Workflow route tests pass; no RuntimeError from concurrent modification

---

### U5. SSRF DNS Resolution & ComputerUse Client Safety

**Goal:** Add DNS resolution to SSRF protection and make ComputerUseTool's httpx client creation thread-safe.

**Requirements:** R1

**Dependencies:** None

**Files:**
- `src/agentkit/utils/security.py`
- `src/agentkit/tools/computer_use.py`
- `tests/unit/test_security.py`
- `tests/unit/tools/test_computer_use.py`

**Approach:**
1. In `is_safe_url()`, after the ValueError catch for domain names, add DNS resolution:
   - Use `socket.getaddrinfo(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM)` (sync, acceptable for tool init)
   - Validate each resolved IP against `_is_unsafe_ip()`
   - Fail-closed on DNS errors (return False)
   - Add 3-second timeout on DNS resolution
2. Add `is_safe_url_async()` variant using `asyncio.getaddrinfo` for async callers
3. In ComputerUseTool, add `asyncio.Lock` to `_get_http_client()`, make it async with double-check locking
4. Update caller from `client = self._get_http_client()` to `client = await self._get_http_client()`

**Test scenarios:**
- is_safe_url with domain resolving to 127.0.0.1 returns False
- is_safe_url with domain resolving to public IP returns True
- is_safe_url with unresolvable domain returns False (fail-closed)
- is_safe_url with ::ffff:127.0.0.1 returns False
- is_safe_url with .internal TLD returns False
- ComputerUseTool concurrent _get_http_client returns same client instance
- ComputerUseTool close() properly cleans up client

**Verification:** Security and computer_use tests pass

---

### U6. Config Hot-Reload & Defensive Fixes

**Goal:** Fix config hot-reload lock race, add proc.kill() error handling, and initialize config reload lock eagerly.

**Requirements:** R4

**Dependencies:** None

**Files:**
- `src/agentkit/server/app.py`
- `src/agentkit/tools/shell.py`

**Approach:**
1. In app.py `_on_config_change()`:
   - Initialize lock eagerly in `create_app()` (store on `app.state._config_reload_lock`)
   - Replace `lock.locked()` skip with pending-flag pattern: set `app.state._config_reload_pending = True`, then in `_reload()` loop while pending flag is set
   - This ensures no config change is silently dropped
2. In shell.py `_execute_standalone()`:
   - Wrap `proc.kill()` in `try/except (ProcessLookupError, OSError)`
   - Use `proc.returncode` if available instead of hardcoded -1

**Test scenarios:**
- Two rapid config changes both get applied (no silent drop)
- Config reload pending flag is cleared after reload completes
- Shell timeout with already-exited process returns clean error (no ProcessLookupError)
- Shell timeout output still shows timeout message

**Verification:** App startup and shell tool tests pass

---

## Scope Boundaries

### In Scope
- All 16 P2 findings from ce-code-review
- Test coverage for new security/reliability patterns

### Deferred to Follow-Up Work
- Full auth middleware consolidation (ASGI-level WebSocket middleware) — complex, separate PR
- IP pinning at httpx transport level for complete SSRF protection
- python-dotenv migration for robust .env parsing
- clients.yaml caching in middleware (per-request disk read)
- WorkflowStore persistence (currently in-memory only)

### Outside This Plan's Identity
- New features or capability additions
- Performance optimization beyond what's needed for reliability
- UI/UX changes

---

## Risks & Mitigations

| Risk | Impact | Mitigation |
|------|--------|------------|
| WorkflowStore async refactor breaks existing endpoints | High | Update all call sites; comprehensive test coverage |
| DNS resolution adds latency to URL validation | Low | 3s timeout; acceptable for tool initialization |
| Token-based shell matching misses new bypass patterns | Medium | Default-to-dangerous fallback; add patterns as discovered |
| .env prefix filtering breaks existing deployments | Medium | Comprehensive allowlist; log warnings for skipped vars |
| WebSocket isolation breaks existing WS clients | Medium | Maintain backward compat: accept without subscription, send all |
| Config reload pending loop runs indefinitely | Low | Coalescing naturally limits; only re-runs if new change arrived |

---

## System-Wide Impact

- **Security posture** significantly improved: timing attacks, SSRF, auth bypass all addressed
- **Reliability** improved: retry storms prevented, memory bounded, concurrent access safe
- **API compatibility**: WorkflowStore method signatures change from sync to async — all call sites updated
- **WebSocket protocol**: subscription message becomes recommended but not required for backward compat