fischer-agentkit/docs/plans/2026-06-10-018-fix-agentkit...

288 lines
13 KiB
Markdown

# fix: AgentKit P2 Security & Reliability Hardening
**Status:** active
**Created:** 2026-06-10
**Origin:** ce-code-review findings on main branch after 7-capabilities merge
---
## Summary
Fix all remaining P2 issues from the final code review of the `feat/agentkit-7-capabilities` merge. These 16 issues span security (SSRF, auth, injection), reliability (retry, concurrency, resource leaks), and architecture (auth consistency, config reload). All are non-blocking for current functionality but represent real risk in production.
## Problem Frame
After merging 7 core capabilities and fixing P0/P1 issues, the codebase has 16 P2 findings that fall into four categories:
1. **Security gaps** — SSRF DNS rebinding, API key timing attacks, WebSocket auth bypass, approval race conditions, shell pattern matching, .env injection
2. **Reliability gaps** — retry storms, unbounded memory growth, concurrent modification, config change loss
3. **Architecture gaps** — dual auth layer inconsistency, WebSocket broadcast without isolation
4. **Defensive gaps** — process kill race, lazy init race, httpx client double-creation
## Requirements
- R1: All P2 security findings must be resolved or explicitly deferred with documented rationale
- R2: Retry logic must use exponential backoff with jitter
- R3: All in-memory stores must have bounded growth with proper eviction
- R4: Concurrent access to shared mutable state must be protected
- R5: Authentication must be consistent across all endpoints and transports
- R6: No regression in existing test suite (764+ tests must pass)
## Key Technical Decisions
- **KTD1: DNS resolution for SSRF** — Use `asyncio.getaddrinfo` (not blocking `socket.getaddrinfo`) to resolve hostnames before IP validation. Fail-closed on DNS errors. Accept that TOCTOU remains between check and connect; full mitigation requires IP pinning at the httpx transport level (deferred).
- **KTD2: Auth consolidation strategy** — Fix per-endpoint auth first (hmac.compare_digest, WebSocket auth-before-accept), then consolidate into middleware in a separate pass. This avoids a big-bang refactor and allows incremental testing.
- **KTD3: WorkflowStore concurrency** — Add `asyncio.Lock` to WorkflowStore, convert sync methods to async. This is the most invasive change but necessary for correctness under concurrent load.
- **KTD4: Shell dangerous patterns** — Replace substring matching with token-based matching using the already-parsed shlex tokens. Separate binary-level and flag-level dangerous patterns.
- **KTD5: WebSocket isolation** — Replace flat subscriber list with `dict[str, set[WebSocket]]` keyed by execution_id. Clients subscribe to specific executions.
---
## Implementation Units
### U1. API Key Timing Attack & WebSocket Auth-Before-Accept
**Goal:** Eliminate timing side-channel in API key comparison and fix WebSocket accept-before-auth pattern.
**Requirements:** R5
**Dependencies:** None
**Files:**
- `src/agentkit/server/routes/portal.py`
- `src/agentkit/server/routes/workflows.py`
- `src/agentkit/server/middleware.py`
- `tests/unit/server/test_portal_routes.py`
- `tests/unit/server/test_workflow_routes.py`
**Approach:**
1. Add `import hmac` to portal.py, workflows.py, middleware.py
2. Replace all `!=` API key comparisons with `hmac.compare_digest()`
3. In both WebSocket handlers, move auth validation BEFORE `websocket.accept()`. On auth failure, call `await websocket.close(code=4001)` without accepting, then return
4. In middleware.py, replace `provided_key not in valid_keys` with `any(hmac.compare_digest(...) for k in valid_keys)`
5. Handle None/empty provided key by defaulting to empty string before comparison
**Test scenarios:**
- API key comparison with correct key returns True
- API key comparison with incorrect key returns False
- API key comparison with None provided defaults to empty string
- WebSocket connection without api_key is rejected (no 101 upgrade)
- WebSocket connection with wrong api_key is rejected
- WebSocket connection with correct api_key is accepted
**Verification:** All existing tests pass; new auth tests pass
---
### U2. Shell Dangerous Pattern Token-Based Matching
**Goal:** Replace substring-based dangerous pattern matching with precise token-based matching to eliminate false positives and bypasses.
**Requirements:** R1
**Dependencies:** None
**Files:**
- `src/agentkit/tools/shell.py`
- `tests/unit/test_shell_tool.py`
**Approach:**
1. Replace `_DANGEROUS_PATTERNS` tuple with two new structures:
- `_DANGEROUS_BINARIES`: set of binary names that are always dangerous (e.g., `rm`, `mkfs`, `dd`, `shutdown`)
- `_DANGEROUS_BINARY_FLAGS`: dict mapping binary name to set of dangerous flag combinations (e.g., `kill: {-9, -KILL}`, `chmod: {777}`)
2. In `_is_dangerous()`, after shlex parsing:
- Check `binary` against `_DANGEROUS_BINARIES` (exact match)
- Check `binary` + flags against `_DANGEROUS_BINARY_FLAGS`
- Keep regex-based arg patterns for cross-token matches (e.g., `> /dev/`, `drop table`)
3. Remove overly broad patterns like `format`, `erase` as substrings; add `format` as binary-only
4. Normalize whitespace in command before matching
**Test scenarios:**
- `rm -rf /` detected as dangerous (binary match)
- `echo 'performing rm operations'` NOT detected as dangerous (no binary match)
- `kill -9 1234` detected as dangerous (binary + flag match)
- `kill -15 1234` NOT detected as dangerous (flag not in dangerous set)
- `rm\t/tmp/file` detected as dangerous (tab normalization)
- `format C:` detected as dangerous (binary match)
- `informatica` NOT detected as dangerous (not a binary match)
**Verification:** Shell tool tests pass with new pattern logic
---
### U3. .env File Prefix Filtering & Retry Backoff
**Goal:** Prevent arbitrary environment variable injection from .env files and add exponential backoff to plan executor retries.
**Requirements:** R1, R2
**Dependencies:** None
**Files:**
- `src/agentkit/server/app.py`
- `src/agentkit/core/plan_executor.py`
- `tests/unit/test_plan_executor.py`
**Approach:**
1. In app.py .env loading, add an allowlist of prefixes: `AGENTKIT_`, `OPENAI_`, `ANTHROPIC_`, `GEMINI_`, `TAVILY_`, `SERPER_`, plus exact matches for `DATABASE_URL`, `REDIS_URL`
2. Log a warning when a .env variable is skipped due to prefix filtering
3. In plan_executor.py `_execute_step_with_retry()`, add exponential backoff with jitter after each failure:
- `base_retry_delay=1.0s`, `max_retry_delay=30.0s`
- `delay = min(base * 2^(retry-1), max) * (0.5 + random() * 0.5)`
- Skip delay on first attempt (retry_count == 0)
4. Add `base_retry_delay` and `max_retry_delay` as constructor parameters with defaults
**Test scenarios:**
- .env with `AGENTKIT_FOO=bar` is loaded
- .env with `PATH=/malicious` is skipped with warning
- .env with `OPENAI_API_KEY=sk-xxx` is loaded
- Retry backoff: first retry waits ~1s, second ~2s, third ~4s (with jitter)
- Max retry delay caps at 30s
- Plan executor with max_retries=0 still works (no backoff)
**Verification:** Unit tests pass; .env loading respects prefix filter
---
### U4. WorkflowStore Concurrency & Resource Management
**Goal:** Make WorkflowStore thread-safe, fix unbounded growth, add proper cleanup for approval events and running tasks.
**Requirements:** R3, R4
**Dependencies:** U1 (hmac.compare_digest in workflow routes)
**Files:**
- `src/agentkit/server/routes/workflows.py`
- `tests/unit/server/test_workflow_routes.py`
**Approach:**
1. Add `asyncio.Lock` to WorkflowStore; convert `save`, `delete`, `create_execution`, `update_execution` to async methods
2. Initialize `_running_tasks` in `__init__` instead of lazy `getattr`
3. Add `_evict_execution()` helper that removes execution AND its approval events (with `event.set()` to wake waiting coroutines)
4. Replace `_ws_subscribers: list` with `_ws_subscribers: dict[str, set[WebSocket]]` keyed by execution_id, protected by `asyncio.Lock`
5. Add `_subscribe(execution_id, ws)` and `_unsubscribe(execution_id, ws)` helpers
6. Modify `_broadcast_ws` to accept `execution_id` parameter and only send to relevant subscribers
7. Update all call sites from sync to async (all endpoint functions already async)
8. Add `_execution_locks: dict[str, asyncio.Lock]` for per-execution approve/cancel serialization
**Test scenarios:**
- Concurrent save operations don't cause KeyError or over-eviction
- Evicted execution's approval events are cleaned up
- WebSocket subscriber receives only messages for subscribed execution_id
- WebSocket disconnect properly cleans up from subscriber dict
- Concurrent approve/cancel for same execution serialized by lock
- _running_tasks initialized in __init__, no getattr fallback
**Verification:** Workflow route tests pass; no RuntimeError from concurrent modification
---
### U5. SSRF DNS Resolution & ComputerUse Client Safety
**Goal:** Add DNS resolution to SSRF protection and make ComputerUseTool's httpx client creation thread-safe.
**Requirements:** R1
**Dependencies:** None
**Files:**
- `src/agentkit/utils/security.py`
- `src/agentkit/tools/computer_use.py`
- `tests/unit/test_security.py`
- `tests/unit/tools/test_computer_use.py`
**Approach:**
1. In `is_safe_url()`, after the ValueError catch for domain names, add DNS resolution:
- Use `socket.getaddrinfo(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM)` (sync, acceptable for tool init)
- Validate each resolved IP against `_is_unsafe_ip()`
- Fail-closed on DNS errors (return False)
- Add 3-second timeout on DNS resolution
2. Add `is_safe_url_async()` variant using `asyncio.getaddrinfo` for async callers
3. In ComputerUseTool, add `asyncio.Lock` to `_get_http_client()`, make it async with double-check locking
4. Update caller from `client = self._get_http_client()` to `client = await self._get_http_client()`
**Test scenarios:**
- is_safe_url with domain resolving to 127.0.0.1 returns False
- is_safe_url with domain resolving to public IP returns True
- is_safe_url with unresolvable domain returns False (fail-closed)
- is_safe_url with ::ffff:127.0.0.1 returns False
- is_safe_url with .internal TLD returns False
- ComputerUseTool concurrent _get_http_client returns same client instance
- ComputerUseTool close() properly cleans up client
**Verification:** Security and computer_use tests pass
---
### U6. Config Hot-Reload & Defensive Fixes
**Goal:** Fix config hot-reload lock race, add proc.kill() error handling, and initialize config reload lock eagerly.
**Requirements:** R4
**Dependencies:** None
**Files:**
- `src/agentkit/server/app.py`
- `src/agentkit/tools/shell.py`
**Approach:**
1. In app.py `_on_config_change()`:
- Initialize lock eagerly in `create_app()` (store on `app.state._config_reload_lock`)
- Replace `lock.locked()` skip with pending-flag pattern: set `app.state._config_reload_pending = True`, then in `_reload()` loop while pending flag is set
- This ensures no config change is silently dropped
2. In shell.py `_execute_standalone()`:
- Wrap `proc.kill()` in `try/except (ProcessLookupError, OSError)`
- Use `proc.returncode` if available instead of hardcoded -1
**Test scenarios:**
- Two rapid config changes both get applied (no silent drop)
- Config reload pending flag is cleared after reload completes
- Shell timeout with already-exited process returns clean error (no ProcessLookupError)
- Shell timeout output still shows timeout message
**Verification:** App startup and shell tool tests pass
---
## Scope Boundaries
### In Scope
- All 16 P2 findings from ce-code-review
- Test coverage for new security/reliability patterns
### Deferred to Follow-Up Work
- Full auth middleware consolidation (ASGI-level WebSocket middleware) — complex, separate PR
- IP pinning at httpx transport level for complete SSRF protection
- python-dotenv migration for robust .env parsing
- clients.yaml caching in middleware (per-request disk read)
- WorkflowStore persistence (currently in-memory only)
### Outside This Plan's Identity
- New features or capability additions
- Performance optimization beyond what's needed for reliability
- UI/UX changes
---
## Risks & Mitigations
| Risk | Impact | Mitigation |
|------|--------|------------|
| WorkflowStore async refactor breaks existing endpoints | High | Update all call sites; comprehensive test coverage |
| DNS resolution adds latency to URL validation | Low | 3s timeout; acceptable for tool initialization |
| Token-based shell matching misses new bypass patterns | Medium | Default-to-dangerous fallback; add patterns as discovered |
| .env prefix filtering breaks existing deployments | Medium | Comprehensive allowlist; log warnings for skipped vars |
| WebSocket isolation breaks existing WS clients | Medium | Maintain backward compat: accept without subscription, send all |
| Config reload pending loop runs indefinitely | Low | Coalescing naturally limits; only re-runs if new change arrived |
---
## System-Wide Impact
- **Security posture** significantly improved: timing attacks, SSRF, auth bypass all addressed
- **Reliability** improved: retry storms prevented, memory bounded, concurrent access safe
- **API compatibility**: WorkflowStore method signatures change from sync to async — all call sites updated
- **WebSocket protocol**: subscription message becomes recommended but not required for backward compat