GSM8K Benchmark Results

GSM8K is a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. Each problem requires multi-step mathematical reasoning to solve.

Benchmark Details

GSM8K uses exact_match_number scoring and currently lists 1,319 examples in version 2024-01.
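As a rough illustration of what number-based exact-match scoring can look like, the sketch below extracts the last number from a model response and compares it to the reference answer. This is an assumption about how a scorer named exact_match_number might behave, not Benchscope's actual implementation; the helper names (extract_last_number, exact_match_number, accuracy) and the normalization rules (dropping thousands separators, treating 18 and 18.0 as equal) are hypothetical. GSM8K reference solutions end with a line of the form "#### 72", which the same extraction handles.

```python
import re
from decimal import Decimal, InvalidOperation

# Hypothetical pattern: an optional minus sign, digits with optional
# thousands separators, and an optional decimal part.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_last_number(text):
    """Return the last number found in text as a Decimal, or None."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    raw = matches[-1].replace(",", "")  # assumed normalization: drop commas
    try:
        return Decimal(raw)
    except InvalidOperation:
        return None


def exact_match_number(prediction, reference):
    """True when both texts contain a number and the final numbers are equal."""
    pred = extract_last_number(prediction)
    ref = extract_last_number(reference)
    return pred is not None and ref is not None and pred == ref


def accuracy(pairs):
    """Fraction of (prediction, reference) pairs that match; 0.0 if empty."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(exact_match_number(p, r) for p, r in pairs) / len(pairs)
```

Comparing Decimal values rather than strings means that formatting differences such as "1,234" versus "1234" do not count as mismatches. A stricter scorer could compare the raw strings instead.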

Public Runs

Five recent public runs are included in this static snapshot.

Recent GSM8K Runs

  • Groq / Llama 3.3 70B on GSM8K: completed; 96.0%; 766 ms p50 latency; 100 samples.
  • Together / Llama 3.3 70B on GSM8K: completed; 97.0%; 1989 ms p50 latency; 100 samples.
  • Cerebras / Qwen 3 235B A22B Instruct on GSM8K: completed; 63.0%; 344 ms p50 latency; 100 samples.
  • Cerebras / Qwen 3 235B A22B Instruct on GSM8K: failed; score unavailable; latency unavailable; 100 samples.
  • Groq / Llama 3.3 70B on GSM8K: completed; 75.0%; 726 ms p50 latency; 100 samples.
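Each run above covers 100 samples, so a reported score carries noticeable sampling uncertainty. The sketch below computes a Wilson score interval for a run, assuming the percentage is simply the number of correct answers out of 100 (for example, 96.0% means 96 correct). The function name wilson_interval and the choice of a 95% level (z = 1.96) are illustrative assumptions, not part of the benchmark.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of correct answers (assumed to be score% * n / 100)
    n: number of samples in the run
    z: normal quantile; 1.96 corresponds to roughly 95% coverage
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Example: the 96.0% run over 100 samples, assumed to be 96 correct answers.
low, high = wilson_interval(96, 100)
print(f"96/100 -> approximately [{low:.3f}, {high:.3f}]")
```

With only 100 samples, an interval of this kind spans several percentage points, which is worth keeping in mind when comparing runs whose scores are close together.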
