GSM8K Benchmark Leaderboard

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. Models must solve problems that require multi-step mathematical reasoning.

Benchmark Details

GSM8K uses exact_match_number scoring and currently lists 1,319 examples (the size of the GSM8K test split) in version 2024-01.

Public Runs

10 recent public runs are included in this static snapshot.

What GSM8K Measures

GSM8K (Grade School Math 8K) contains roughly 8,500 grade-school-level math word problems, each requiring two to eight steps of arithmetic reasoning. Problems are linguistically diverse and designed to resist pattern matching: each requires genuine multi-step reasoning rather than formula lookup. The model is expected to show its work and conclude with a final numerical answer following a "####" delimiter.

Benchscope uses exact match on the number extracted after the "####" delimiter. Only the final number is compared, so answers that reach the correct value through different intermediate steps are scored as correct.
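
The scoring rule described above can be sketched in a few lines of Python. This is an illustration only, not Benchscope's actual scorer: the function names, the regular expression, and the choice to strip thousands separators and compare as decimals are all assumptions made for the example.

```python
import re
from decimal import Decimal
from typing import Optional

# Assumed number pattern: optional sign, digits with optional thousands
# separators, optional decimal part. The real scorer may differ.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[Decimal]:
    """Return the first number after the last "####", or None if absent."""
    _, delimiter, tail = text.rpartition("####")
    if not delimiter:
        return None  # no delimiter: nothing to score
    match = _NUMBER.search(tail)
    if match is None:
        return None  # delimiter present but no number follows it
    return Decimal(match.group().replace(",", ""))


def exact_match_number(prediction: str, reference: str) -> bool:
    """True when both texts yield a final number and the values are equal."""
    predicted = extract_final_number(prediction)
    expected = extract_final_number(reference)
    return predicted is not None and expected is not None and predicted == expected
```

Because only the text after the last "####" is inspected, the intermediate steps in the prediction have no effect on the result, which matches the behavior described above.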

Why GSM8K Matters and Caveats

GSM8K differentiates models that can follow multi-step arithmetic from those that cannot, and remains useful for comparing provider-hosted endpoints serving the same model. Unlike on MATH, most frontier models score above 85% on GSM8K, so it has limited headroom for differentiating top-tier endpoints. It is most useful for comparing mid-tier models or confirming that a provider's infrastructure has not degraded basic reasoning.

The "####" delimiter convention is prompt-dependent. Custom-prompt runs may not produce outputs in the expected format, causing scoring failures unrelated to model capability. GSM8K was published in 2021, and high-capability models were likely trained on these problems, so high scores may partly reflect training data exposure.
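
One way to see how much of a low score comes from formatting rather than arithmetic is to count the two failure types separately. The sketch below is a hypothetical helper, not part of Benchscope: the labels "correct", "wrong", and "format_error", the function names, and the number pattern are assumptions chosen for illustration.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Assumed number pattern; thousands separators are stripped before comparing.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def classify(output: str, expected: str) -> str:
    """Label one model output as correct, wrong, or format_error."""
    if "####" not in output:
        return "format_error"  # delimiter missing, so nothing can be scored
    tail = output.rsplit("####", 1)[1]
    match = _NUMBER.search(tail)
    if match is None:
        return "format_error"  # delimiter present but no number follows it
    value = float(match.group().replace(",", ""))
    return "correct" if value == float(expected) else "wrong"


def tally(pairs: Iterable[Tuple[str, str]]) -> Counter:
    """Count labels over (model_output, expected_number) pairs."""
    return Counter(classify(output, expected) for output, expected in pairs)
```

A run whose failures are mostly "format_error" points to a prompt or output-format mismatch, while a run dominated by "wrong" points to actual arithmetic mistakes.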

Top GSM8K Models and Endpoints

Use GSM8K to compare whether hosted endpoints preserve basic multi-step arithmetic reasoning. The best GSM8K endpoint should combine a high exact-match score with low p50 latency and enough evaluated samples to make the result stable.
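
To judge whether a result has "enough evaluated samples", one common approach is a confidence interval on the exact-match rate. The sketch below uses the Wilson score interval. That choice is an assumption for illustration and is not stated to be Benchscope's method. The function name and the default z = 1.96 (roughly 95% confidence) are also illustrative.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion of correct answers."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical counts: 1,299 correct out of 1,319 evaluated samples.
low, high = wilson_interval(1299, 1319)
print(f"{low:.3f} to {high:.3f}")
```

With all 1,319 samples evaluated, the interval around a high score is only about a percentage point wide in each direction. A run that evaluated far fewer samples would produce a much wider interval, so small score differences between endpoints should be read with that in mind.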

For harder math reasoning, compare the same provider or model family on MATH. GSM8K is useful for arithmetic reliability, while MATH is more useful for differentiating strong reasoning endpoints.

Recent GSM8K Runs

code.newcli.com on GSM8K: failed; score unavailable; latency unavailable; 1,319 samples.
api.code-relay.com on GSM8K: partial; 98.5%; 5,748 ms p50 latency; 1,319 samples.
4router.net on GSM8K: failed; score unavailable; latency unavailable; 1,319 samples.
code.newcli.com on GSM8K: cancelled; score unavailable; latency unavailable; 1,319 samples.
4router.net on GSM8K: cancelled; score unavailable; latency unavailable; 1,319 samples.
