GSM8K Benchmark Results
GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. Solving each problem requires multi-step mathematical reasoning.
Benchmark Details
GSM8K uses exact_match_number scoring and currently lists 1,319 examples in version 2024-01.
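This page does not document how exact_match_number is implemented, so the following is only a minimal sketch of one plausible reading: take the last number in the model output, compare it with the last number in the reference answer, and report the share of exact matches. The function names, the regular expression, and the "last number wins" rule are all assumptions for illustration, not the scorer's actual code.

```python
import re
from typing import Optional, Sequence

# Assumed pattern: optional sign, digits with optional thousands commas,
# optional decimal part. The real scorer may normalize differently.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number found in text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def exact_match_number(prediction: str, reference: str) -> bool:
    """True when both strings end in the same numeric value (hypothetical rule)."""
    pred = extract_final_number(prediction)
    ref = extract_final_number(reference)
    return pred is not None and ref is not None and pred == ref


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose final number matches the reference."""
    pairs = list(zip(predictions, references))
    if not pairs:
        return 0.0
    return sum(exact_match_number(p, r) for p, r in pairs) / len(pairs)
```

Under this reading, a response ending in "The answer is 18." matches a reference written as "#### 18", and a response that contains no number at all counts as incorrect.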
Public Runs
5 recent public runs are included in this static snapshot.
Recent GSM8K Runs
- Groq / Llama 3.3 70B on GSM8K: completed; 96.0%; 766 ms p50 latency; 100 samples.
- Together / Llama 3.3 70B on GSM8K: completed; 97.0%; 1989 ms p50 latency; 100 samples.
- Cerebras / Qwen 3 235B A22B Instruct on GSM8K: completed; 63.0%; 344 ms p50 latency; 100 samples.
- Cerebras / Qwen 3 235B A22B Instruct on GSM8K: failed; score unavailable; latency unavailable; 100 samples.
- Groq / Llama 3.3 70B on GSM8K: completed; 75.0%; 726 ms p50 latency; 100 samples.
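The run lines above follow a regular pattern (provider, model, status, score, p50 latency, sample count), so they can be turned into records for comparison. The sketch below assumes that exact line layout; the Run class, its field names, and parse_run are hypothetical helpers, not part of Benchscope.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    provider: str
    model: str
    status: str
    score_pct: Optional[float]       # None when the score is unavailable
    p50_latency_ms: Optional[int]    # None when the latency is unavailable
    samples: int


# Assumed layout: "- <provider> / <model> on GSM8K: <status>; <score>; <latency>; <n> samples."
_LINE = re.compile(
    r"^- (?P<provider>[^/]+) / (?P<model>.+?) on GSM8K: "
    r"(?P<status>\w+); (?P<score>[^;]+); (?P<latency>[^;]+); "
    r"(?P<samples>\d+) samples\.$"
)


def parse_run(line: str) -> Run:
    """Parse one run line from the snapshot into a Run record."""
    m = _LINE.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognized run line: {line!r}")
    score, latency = m["score"], m["latency"]
    return Run(
        provider=m["provider"].strip(),
        model=m["model"],
        status=m["status"],
        score_pct=float(score.rstrip("%")) if score.endswith("%") else None,
        p50_latency_ms=int(latency.split()[0]) if latency[0].isdigit() else None,
        samples=int(m["samples"]),
    )
```

Keeping unavailable values as None, rather than 0, prevents the failed run from being mistaken for a run that scored 0% when averaging or sorting.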