| Rank | Model | Family | Answer correctness |
|---|
ToolHop evaluates whether models can decompose multi-hop queries, invoke the right tools in order, and use feedback from each step. The dataset is built with a query-driven pipeline (tool creation, document refinement, code generation).
Direct: no tools. Mandatory: models must use provided tools. Free: tools are optional. Numbers are answer correctness (↑). Expanded rows show the per-scenario breakdown.
To add a model, use the evaluation scripts in the repository, then open a PR or issue with your numbers so we can update this table.