AgentKit Benchmark Report

Timestamp: 2026-06-17T03:26:25.072956+00:00

Version: 0.1.0

Overall Score: 98.0%

Summary: 50/51 tests passed (1 failed) across 7 dimensions.

Dimension Results

Dimension        Total  Pass  Fail   Score
preprocessing       15    14     1   93.3%
overfitting          3     3     0  100.0%
efficiency           5     5     0  100.0%
tool_search         10    10     0  100.0%
event_model          6     6     0  100.0%
spec_management      7     7     0  100.0%
verification         5     5     0  100.0%
OVERALL             51    50     1   98.0%
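
The scores above can be rechecked from the pass and total counts. The sketch below is not part of AgentKit; the names `results` and `score` are hypothetical. It assumes each score is passed / total expressed as a percentage rounded to one decimal place, and that the overall score pools all tests rather than averaging the per-dimension scores. The pooled figure matches the reported 98.0%, while an unweighted mean of the seven dimension scores would not.

```python
# (passed, total) per dimension, copied from the Dimension Results table.
results = {
    "preprocessing": (14, 15),
    "overfitting": (3, 3),
    "efficiency": (5, 5),
    "tool_search": (10, 10),
    "event_model": (6, 6),
    "spec_management": (7, 7),
    "verification": (5, 5),
}


def score(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to one decimal place (assumed rule)."""
    return round(100.0 * passed / total, 1)


# Per-dimension scores, e.g. preprocessing -> 93.3
per_dimension = {name: score(p, t) for name, (p, t) in results.items()}

# Pooled overall score: all passed tests over all tests.
total_passed = sum(p for p, _ in results.values())
total_tests = sum(t for _, t in results.values())
overall = score(total_passed, total_tests)

# Unweighted mean of dimension pass rates, shown only for contrast.
mean_of_dimensions = round(
    sum(100.0 * p / t for p, t in results.values()) / len(results), 1
)

print(f"overall (pooled): {overall}%")
print(f"mean of dimensions: {mean_of_dimensions}%")
```

Pooling weights each dimension by its test count, so the single preprocessing failure costs 1/51 of the overall score rather than 1/7 of a dimension's share.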

Failed Cases

[preprocessing] skill_prefix_direct
expected: skill_react
actual: direct_chat