Commit Graph

5 Commits

Author SHA1 Message Date
chiguyong 840d1afd6a fix: resolve benchmark failures from root cause (LLM timeout, WebSocket, latency stats)
U1: LLM reasoning - difficulty-based timeout (easy=20s/medium=40s/hard=60s)
    + streaming keyword detection for hard tasks with non-stream fallback
U2: GUI WebSocket - remove unreliable HTTP pre-check (FastAPI returns 404
    for HTTP GET to WS endpoints), directly test WS connection, treat
    {"type":"connected"} as pass (ping/pong is bonus info)
U3: Verification latency - exclude timeout-tagged cases from P95/P99
    percentile calculation (accuracy stats unaffected)
U4: LLM Gateway - add timeout field to LLMRequest, gateway.chat()/
    chat_stream() passthrough for provider-level timeout support

Test results: 62/63 pass (98.4%), gui-004 fixed, no regressions
pytest: 64 passed, ruff: clean
2026-06-17 13:32:54 +08:00
chiguyong a1318df420 feat: add LLM and GUI benchmark modes with real agent testing 2026-06-17 12:55:19 +08:00
chiguyong 1fbfd9d132 refactor: standardize benchmark with industry methodology (P/R/F1, multi-run, baseline) 2026-06-17 12:01:34 +08:00
chiguyong d361177cc7 docs: add detailed Chinese benchmark report with industry comparison 2026-06-17 11:34:56 +08:00
chiguyong d00995504d feat: comprehensive capability benchmark and agentkit benchmark CLI 2026-06-17 11:28:09 +08:00
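
Note on 840d1afd6a: items U1 and U3 of that commit describe two small mechanisms, a per-difficulty timeout and latency percentiles that skip timeout-tagged cases while accuracy still counts them. The sketch below is only an illustration of that idea, not the repository's code. The timeout values come from the commit message; the names `CaseResult`, `timeout_for`, `percentile`, and `summarize`, and the nearest-rank percentile method, are assumptions made for this example.

```python
import math
from dataclasses import dataclass

# Timeouts in seconds per task difficulty, as listed in the commit message (U1).
TIMEOUT_BY_DIFFICULTY = {"easy": 20.0, "medium": 40.0, "hard": 60.0}


def timeout_for(difficulty: str) -> float:
    """Return the timeout for a difficulty; raises KeyError for unknown values."""
    return TIMEOUT_BY_DIFFICULTY[difficulty]


@dataclass
class CaseResult:
    """Hypothetical per-case benchmark record (not the project's real type)."""
    case_id: str
    latency_s: float
    passed: bool
    timed_out: bool = False


def percentile(values, pct):
    """Nearest-rank percentile; returns None when there are no values."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(results):
    """Latency percentiles skip timed-out cases; accuracy uses every case (U3)."""
    latencies = [r.latency_s for r in results if not r.timed_out]
    accuracy = sum(r.passed for r in results) / len(results) if results else None
    return {
        "p95": percentile(latencies, 95),
        "p99": percentile(latencies, 99),
        "accuracy": accuracy,
    }
```

With this shape, a case that hits its timeout still lowers accuracy but no longer drags P95/P99 up to the timeout ceiling, which matches the "accuracy stats unaffected" remark in the commit message.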