13 KiB

Raw Permalink Blame History

fix: AgentKit P2 Security & Reliability Hardening

Status: active Created: 2026-06-10 Origin: ce-code-review findings on main branch after 7-capabilities merge

Summary

Fix all remaining P2 issues from the final code review of the feat/agentkit-7-capabilities merge. These 16 issues span security (SSRF, auth, injection), reliability (retry, concurrency, resource leaks), and architecture (auth consistency, config reload). All are non-blocking for current functionality but represent real risk in production.

Problem Frame

After merging 7 core capabilities and fixing P0/P1 issues, the codebase has 16 P2 findings that fall into four categories:

Security gaps — SSRF DNS rebinding, API key timing attacks, WebSocket auth bypass, approval race conditions, shell pattern matching, .env injection
Reliability gaps — retry storms, unbounded memory growth, concurrent modification, config change loss
Architecture gaps — dual auth layer inconsistency, WebSocket broadcast without isolation
Defensive gaps — process kill race, lazy init race, httpx client double-creation

Requirements

R1: All P2 security findings must be resolved or explicitly deferred with documented rationale
R2: Retry logic must use exponential backoff with jitter
R3: All in-memory stores must have bounded growth with proper eviction
R4: Concurrent access to shared mutable state must be protected
R5: Authentication must be consistent across all endpoints and transports
R6: No regression in existing test suite (764+ tests must pass)

Key Technical Decisions

KTD1: DNS resolution for SSRF — Use asyncio.getaddrinfo (not blocking socket.getaddrinfo) to resolve hostnames before IP validation. Fail-closed on DNS errors. Accept that TOCTOU remains between check and connect; full mitigation requires IP pinning at the httpx transport level (deferred).
KTD2: Auth consolidation strategy — Fix per-endpoint auth first (hmac.compare_digest, WebSocket auth-before-accept), then consolidate into middleware in a separate pass. This avoids a big-bang refactor and allows incremental testing.
KTD3: WorkflowStore concurrency — Add asyncio.Lock to WorkflowStore, convert sync methods to async. This is the most invasive change but necessary for correctness under concurrent load.
KTD4: Shell dangerous patterns — Replace substring matching with token-based matching using the already-parsed shlex tokens. Separate binary-level and flag-level dangerous patterns.
KTD5: WebSocket isolation — Replace flat subscriber list with dict[str, set[WebSocket]] keyed by execution_id. Clients subscribe to specific executions.

Implementation Units

U1. API Key Timing Attack & WebSocket Auth-Before-Accept

Goal: Eliminate timing side-channel in API key comparison and fix WebSocket accept-before-auth pattern.

Requirements: R5

Dependencies: None

Files:

src/agentkit/server/routes/portal.py
src/agentkit/server/routes/workflows.py
src/agentkit/server/middleware.py
tests/unit/server/test_portal_routes.py
tests/unit/server/test_workflow_routes.py

Approach:

Add import hmac to portal.py, workflows.py, middleware.py
Replace all != API key comparisons with hmac.compare_digest()
In both WebSocket handlers, move auth validation BEFORE websocket.accept(). On auth failure, call await websocket.close(code=4001) without accepting, then return
In middleware.py, replace provided_key not in valid_keys with any(hmac.compare_digest(...) for k in valid_keys)
Handle None/empty provided key by defaulting to empty string before comparison

Test scenarios:

API key comparison with correct key returns True
API key comparison with incorrect key returns False
API key comparison with None provided defaults to empty string
WebSocket connection without api_key is rejected (no 101 upgrade)
WebSocket connection with wrong api_key is rejected
WebSocket connection with correct api_key is accepted

Verification: All existing tests pass; new auth tests pass

U2. Shell Dangerous Pattern Token-Based Matching

Goal: Replace substring-based dangerous pattern matching with precise token-based matching to eliminate false positives and bypasses.

Requirements: R1

Dependencies: None

Files:

src/agentkit/tools/shell.py
tests/unit/test_shell_tool.py

Approach:

Replace _DANGEROUS_PATTERNS tuple with two new structures:
- _DANGEROUS_BINARIES: set of binary names that are always dangerous (e.g., rm, mkfs, dd, shutdown)
- _DANGEROUS_BINARY_FLAGS: dict mapping binary name to set of dangerous flag combinations (e.g., kill: {-9, -KILL}, chmod: {777})
In _is_dangerous(), after shlex parsing:
- Check binary against _DANGEROUS_BINARIES (exact match)
- Check binary + flags against _DANGEROUS_BINARY_FLAGS
- Keep regex-based arg patterns for cross-token matches (e.g., > /dev/, drop table)
Remove overly broad patterns like format, erase as substrings; add format as binary-only
Normalize whitespace in command before matching

Test scenarios:

rm -rf / detected as dangerous (binary match)
echo 'performing rm operations' NOT detected as dangerous (no binary match)
kill -9 1234 detected as dangerous (binary + flag match)
kill -15 1234 NOT detected as dangerous (flag not in dangerous set)
rm\t/tmp/file detected as dangerous (tab normalization)
format C: detected as dangerous (binary match)
informatica NOT detected as dangerous (not a binary match)

Verification: Shell tool tests pass with new pattern logic

U3. .env File Prefix Filtering & Retry Backoff

Goal: Prevent arbitrary environment variable injection from .env files and add exponential backoff to plan executor retries.

Requirements: R1, R2

Dependencies: None

Files:

src/agentkit/server/app.py
src/agentkit/core/plan_executor.py
tests/unit/test_plan_executor.py

Approach:

In app.py .env loading, add an allowlist of prefixes: AGENTKIT_, OPENAI_, ANTHROPIC_, GEMINI_, TAVILY_, SERPER_, plus exact matches for DATABASE_URL, REDIS_URL
Log a warning when a .env variable is skipped due to prefix filtering
In plan_executor.py _execute_step_with_retry(), add exponential backoff with jitter after each failure:
- base_retry_delay=1.0s, max_retry_delay=30.0s
- delay = min(base * 2^(retry-1), max) * (0.5 + random() * 0.5)
- Skip delay on first attempt (retry_count == 0)
Add base_retry_delay and max_retry_delay as constructor parameters with defaults

Test scenarios:

.env with AGENTKIT_FOO=bar is loaded
.env with PATH=/malicious is skipped with warning
.env with OPENAI_API_KEY=sk-xxx is loaded
Retry backoff: first retry waits ~1s, second ~2s, third ~4s (with jitter)
Max retry delay caps at 30s
Plan executor with max_retries=0 still works (no backoff)

Verification: Unit tests pass; .env loading respects prefix filter

U4. WorkflowStore Concurrency & Resource Management

Goal: Make WorkflowStore thread-safe, fix unbounded growth, add proper cleanup for approval events and running tasks.

Requirements: R3, R4

Dependencies: U1 (hmac.compare_digest in workflow routes)

Files:

src/agentkit/server/routes/workflows.py
tests/unit/server/test_workflow_routes.py

Approach:

Add asyncio.Lock to WorkflowStore; convert save, delete, create_execution, update_execution to async methods
Initialize _running_tasks in __init__ instead of lazy getattr
Add _evict_execution() helper that removes execution AND its approval events (with event.set() to wake waiting coroutines)
Replace _ws_subscribers: list with _ws_subscribers: dict[str, set[WebSocket]] keyed by execution_id, protected by asyncio.Lock
Add _subscribe(execution_id, ws) and _unsubscribe(execution_id, ws) helpers
Modify _broadcast_ws to accept execution_id parameter and only send to relevant subscribers
Update all call sites from sync to async (all endpoint functions already async)
Add _execution_locks: dict[str, asyncio.Lock] for per-execution approve/cancel serialization

Test scenarios:

Concurrent save operations don't cause KeyError or over-eviction
Evicted execution's approval events are cleaned up
WebSocket subscriber receives only messages for subscribed execution_id
WebSocket disconnect properly cleans up from subscriber dict
Concurrent approve/cancel for same execution serialized by lock
_running_tasks initialized in init, no getattr fallback

Verification: Workflow route tests pass; no RuntimeError from concurrent modification

U5. SSRF DNS Resolution & ComputerUse Client Safety

Goal: Add DNS resolution to SSRF protection and make ComputerUseTool's httpx client creation thread-safe.

Requirements: R1

Dependencies: None

Files:

src/agentkit/utils/security.py
src/agentkit/tools/computer_use.py
tests/unit/test_security.py
tests/unit/tools/test_computer_use.py

Approach:

In is_safe_url(), after the ValueError catch for domain names, add DNS resolution:
- Use socket.getaddrinfo(hostname, None, socket.AF_UNSPEC, socket.SOCK_STREAM) (sync, acceptable for tool init)
- Validate each resolved IP against _is_unsafe_ip()
- Fail-closed on DNS errors (return False)
- Add 3-second timeout on DNS resolution
Add is_safe_url_async() variant using asyncio.getaddrinfo for async callers
In ComputerUseTool, add asyncio.Lock to _get_http_client(), make it async with double-check locking
Update caller from client = self._get_http_client() to client = await self._get_http_client()

Test scenarios:

is_safe_url with domain resolving to 127.0.0.1 returns False
is_safe_url with domain resolving to public IP returns True
is_safe_url with unresolvable domain returns False (fail-closed)
is_safe_url with ::ffff:127.0.0.1 returns False
is_safe_url with .internal TLD returns False
ComputerUseTool concurrent _get_http_client returns same client instance
ComputerUseTool close() properly cleans up client

Verification: Security and computer_use tests pass

U6. Config Hot-Reload & Defensive Fixes

Goal: Fix config hot-reload lock race, add proc.kill() error handling, and initialize config reload lock eagerly.

Requirements: R4

Dependencies: None

Files:

src/agentkit/server/app.py
src/agentkit/tools/shell.py

Approach:

In app.py _on_config_change():
- Initialize lock eagerly in create_app() (store on app.state._config_reload_lock)
- Replace lock.locked() skip with pending-flag pattern: set app.state._config_reload_pending = True, then in _reload() loop while pending flag is set
- This ensures no config change is silently dropped
In shell.py _execute_standalone():
- Wrap proc.kill() in try/except (ProcessLookupError, OSError)
- Use proc.returncode if available instead of hardcoded -1

Test scenarios:

Two rapid config changes both get applied (no silent drop)
Config reload pending flag is cleared after reload completes
Shell timeout with already-exited process returns clean error (no ProcessLookupError)
Shell timeout output still shows timeout message

Verification: App startup and shell tool tests pass

Scope Boundaries

In Scope

All 16 P2 findings from ce-code-review
Test coverage for new security/reliability patterns

Deferred to Follow-Up Work

Full auth middleware consolidation (ASGI-level WebSocket middleware) — complex, separate PR
IP pinning at httpx transport level for complete SSRF protection
python-dotenv migration for robust .env parsing
clients.yaml caching in middleware (per-request disk read)
WorkflowStore persistence (currently in-memory only)

Outside This Plan's Identity

New features or capability additions
Performance optimization beyond what's needed for reliability
UI/UX changes

Risks & Mitigations

Risk	Impact	Mitigation
WorkflowStore async refactor breaks existing endpoints	High	Update all call sites; comprehensive test coverage
DNS resolution adds latency to URL validation	Low	3s timeout; acceptable for tool initialization
Token-based shell matching misses new bypass patterns	Medium	Default-to-dangerous fallback; add patterns as discovered
.env prefix filtering breaks existing deployments	Medium	Comprehensive allowlist; log warnings for skipped vars
WebSocket isolation breaks existing WS clients	Medium	Maintain backward compat: accept without subscription, send all
Config reload pending loop runs indefinitely	Low	Coalescing naturally limits; only re-runs if new change arrived

System-Wide Impact

Security posture significantly improved: timing attacks, SSRF, auth bypass all addressed
Reliability improved: retry storms prevented, memory bounded, concurrent access safe
API compatibility: WorkflowStore method signatures change from sync to async — all call sites updated
WebSocket protocol: subscription message becomes recommended but not required for backward compat

13 KiB Raw Permalink Blame History

fix: AgentKit P2 Security & Reliability Hardening

Summary

Problem Frame

Requirements

Key Technical Decisions

Implementation Units

U1. API Key Timing Attack & WebSocket Auth-Before-Accept

U2. Shell Dangerous Pattern Token-Based Matching

U3. .env File Prefix Filtering & Retry Backoff

U4. WorkflowStore Concurrency & Resource Management

U5. SSRF DNS Resolution & ComputerUse Client Safety

U6. Config Hot-Reload & Defensive Fixes

Scope Boundaries

In Scope

Deferred to Follow-Up Work

Outside This Plan's Identity

Risks & Mitigations

System-Wide Impact

13 KiB

Raw Permalink Blame History