ToolHop Leaderboard

About ToolHop

ToolHop evaluates whether models can decompose multi-hop queries, invoke the right tools in order, and use feedback from each step. The dataset is built with a query-driven pipeline (tool creation, document refinement, code generation).
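As a rough illustration of what a multi-hop query with ordered tool calls looks like, here is a minimal sketch. The tools, their data, and the `run_chain` helper are hypothetical toy stand-ins and are not taken from the ToolHop dataset or its evaluation code. Each step consumes the output of the previous one, which is the "feedback" the model has to use.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical toy tools with hard-coded data (illustration only).
def lookup_author(title: str) -> str:
    books = {"Dune": "Frank Herbert"}
    return books[title]

def lookup_birth_year(person: str) -> int:
    years = {"Frank Herbert": 1920}
    return years[person]

TOOLS: Dict[str, Callable[[Any], Any]] = {
    "lookup_author": lookup_author,
    "lookup_birth_year": lookup_birth_year,
}

def run_chain(steps: List[str], initial: Any) -> Tuple[Any, List[Tuple[str, Any]]]:
    """Call the named tools in order, feeding each result into the next call."""
    value = initial
    trace: List[Tuple[str, Any]] = []
    for name in steps:
        value = TOOLS[name](value)   # feedback from this hop becomes the next input
        trace.append((name, value))
    return value, trace

# Two-hop query: "In which year was the author of Dune born?"
answer, trace = run_chain(["lookup_author", "lookup_birth_year"], "Dune")
```

Calling the tools in the wrong order fails here because `lookup_birth_year` has no entry for a book title, which is the kind of ordering mistake the benchmark is meant to expose.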

Scenarios

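The leaderboard reports three scenarios. Direct: the model answers without tools. Mandatory: the model must use the provided tools. Free: the model may use the provided tools but is not required to. Scores are answer correctness; higher is better (↑). Expanded rows show the per-scenario breakdown.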
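A per-scenario correctness score could be computed along the following lines. This is only a sketch: the record fields (`scenario`, `prediction`, `answer`), the `normalize` helper, and the use of case-insensitive exact match are assumptions made for illustration, and the repository's evaluation scripts may define correctness differently.

```python
from collections import defaultdict
from typing import Dict, List

def normalize(text: str) -> str:
    """Assumed normalization: trim whitespace and lowercase."""
    return text.strip().lower()

def accuracy_by_scenario(records: List[Dict[str, str]]) -> Dict[str, float]:
    """Return the percentage of correct answers for each scenario."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for rec in records:
        scenario = rec["scenario"]
        totals[scenario] += 1
        if normalize(rec["prediction"]) == normalize(rec["answer"]):
            correct[scenario] += 1
    return {s: 100.0 * correct[s] / totals[s] for s in totals}

# Hypothetical records, not real benchmark results.
records = [
    {"scenario": "direct", "prediction": "1920", "answer": "1920"},
    {"scenario": "direct", "prediction": "1921", "answer": "1920"},
    {"scenario": "mandatory", "prediction": " Paris ", "answer": "paris"},
    {"scenario": "free", "prediction": "Rome", "answer": "Milan"},
]
scores = accuracy_by_scenario(records)
```

Keeping the three scenarios separate, rather than pooling them, matches the per-scenario breakdown shown in the expanded rows.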

Submit results

To add a model, run the evaluation scripts in the repository, then open a pull request or an issue with your numbers so we can update this table.