ToolHop

A query-driven benchmark for multi-hop tool use in large language models

Fudan University, ByteDance

995 user queries | 3,912 tools | 35 models

Leaderboard

Rank | Model | Family | Answer correctness

About ToolHop

ToolHop evaluates whether a model can decompose a multi-hop query, invoke the right tools in the right order, and use the feedback returned at each step to reach the final answer. The dataset is built with a query-driven pipeline that has three stages: tool creation, document refinement, and code generation.
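As a rough illustration of what "invoking tools in order and using feedback from each step" can look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the tools `find_author` and `birth_year`, their toy data, and the `run_hops` helper are hypothetical and are not taken from the ToolHop dataset or its evaluation scripts. The plan is fixed by hand here, whereas in the benchmark the model has to work out the sequence of calls itself.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical toy tools with made-up data (not part of ToolHop).
AUTHORS = {"Example Book": "Jane Doe"}
BIRTH_YEARS = {"Jane Doe": 1970}


def find_author(book: str) -> str:
    """Hop 1: map a book title to its author."""
    return AUTHORS[book]


def birth_year(person: str) -> int:
    """Hop 2: map a person to their birth year."""
    return BIRTH_YEARS[person]


TOOLS: Dict[str, Callable[[Any], Any]] = {
    "find_author": find_author,
    "birth_year": birth_year,
}


def run_hops(
    start: Any, plan: List[str], tools: Dict[str, Callable[[Any], Any]]
) -> Tuple[Optional[Any], List[Tuple[str, Any]]]:
    """Call the tools named in `plan` in order, feeding each result forward.

    Returns the final value (or None on failure) and a trace of every step,
    so that the feedback from each call, including errors, stays visible.
    """
    value = start
    trace: List[Tuple[str, Any]] = []
    for name in plan:
        try:
            value = tools[name](value)
        except KeyError as exc:
            # Record the failure as feedback and stop the chain.
            trace.append((name, f"error: lookup failed for {exc}"))
            return None, trace
        trace.append((name, value))
    return value, trace


answer, trace = run_hops("Example Book", ["find_author", "birth_year"], TOOLS)
print(answer)  # 1970
print(trace)   # [('find_author', 'Jane Doe'), ('birth_year', 1970)]
```

The trace is the part that matters for a multi-hop setting: a wrong or failed intermediate call shows up there, which is the kind of per-step feedback a model would need to read before deciding on its next call.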

Scenarios

Direct: no tools are provided.
Mandatory: the model must use the provided tools.
Free: tools are provided, but using them is optional.

Reported numbers are answer correctness (higher is better). Expanded rows show the per-scenario breakdown.
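A per-scenario breakdown of answer correctness could be computed along the following lines. This is only a sketch under stated assumptions: the record fields (`scenario`, `prediction`, `gold`) are a hypothetical format, and the normalized exact-match comparison is a stand-in, since the repository's evaluation scripts define the actual scoring rule.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping

SCENARIOS = ("direct", "mandatory", "free")


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization)."""
    return " ".join(answer.strip().lower().split())


def answer_correctness(records: Iterable[Mapping[str, str]]) -> Dict[str, float]:
    """Return the percentage of correct answers for each scenario seen."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for rec in records:
        scenario = rec["scenario"]
        if scenario not in SCENARIOS:
            raise ValueError(f"unknown scenario: {scenario!r}")
        totals[scenario] += 1
        # Assumed scoring rule: exact match after normalization.
        if normalize(rec["prediction"]) == normalize(rec["gold"]):
            correct[scenario] += 1
    return {s: 100.0 * correct[s] / totals[s] for s in SCENARIOS if totals[s]}


# Made-up records, for illustration only.
records = [
    {"scenario": "direct", "prediction": "Paris", "gold": "paris"},
    {"scenario": "direct", "prediction": "1969", "gold": "1970"},
    {"scenario": "mandatory", "prediction": " 1970 ", "gold": "1970"},
    {"scenario": "free", "prediction": "London", "gold": "Paris"},
]

scores = answer_correctness(records)
print(scores)  # {'direct': 50.0, 'mandatory': 100.0, 'free': 0.0}
```

Scenarios with no records are left out of the result rather than reported as zero, so a missing run is not mistaken for a failed one.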

Submit results

To add a model, use the evaluation scripts in the repository, then open a PR or issue with your numbers so we can update this table.