fischer-agentkit/test-results/benchmark
chiguyong 840d1afd6a fix: resolve benchmark failures from root cause (LLM timeout, WebSocket, latency stats)
U1: LLM reasoning - difficulty-based timeout (easy=20s/medium=40s/hard=60s)
    + streaming keyword detection for hard tasks with non-stream fallback
U2: GUI WebSocket - remove unreliable HTTP pre-check (FastAPI returns 404
    for HTTP GET to WS endpoints), directly test WS connection, treat
    {"type":"connected"} as pass (ping/pong is bonus info)
U3: Verification latency - exclude timeout-tagged cases from P95/p99
    percentile calculation (accuracy stats unaffected)
U4: LLM Gateway - add timeout field to LLMRequest, gateway.chat()/
    chat_stream() passthrough for provider-level timeout support

Test results: 62/63 pass (98.4%), gui-004 fixed, no regressions
pytest: 64 passed, ruff: clean
2026-06-17 13:32:54 +08:00
..
baseline.json refactor: standardize benchmark with industry methodology (P/R/F1, multi-run, baseline) 2026-06-17 12:01:34 +08:00
benchmark_report.html feat: comprehensive capability benchmark and agentkit benchmark CLI 2026-06-17 11:28:09 +08:00
benchmark_report.json fix: resolve benchmark failures from root cause (LLM timeout, WebSocket, latency stats) 2026-06-17 13:32:54 +08:00
benchmark_report.md fix: resolve benchmark failures from root cause (LLM timeout, WebSocket, latency stats) 2026-06-17 13:32:54 +08:00
benchmark_report.txt feat: add LLM and GUI benchmark modes with real agent testing 2026-06-17 12:55:19 +08:00
benchmark_report_cn.md docs: add detailed Chinese benchmark report with industry comparison 2026-06-17 11:34:56 +08:00