MATH Benchmark Results

MATH is a dataset of competition math problems covering algebra, geometry, number theory, and more; each problem requires an exact symbolic answer. Models are evaluated with LaTeX-normalised expression matching.
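
As an illustration of what LaTeX-normalised matching involves, the sketch below shows a simplified normaliser in Python. It is an assumption-laden example, not the scorer this benchmark actually uses: the helper names (normalize_latex, answers_match) are hypothetical, and real scorers handle many more cases (units, \text{}, symbolic equivalence checking).

    import re

    def normalize_latex(expr: str) -> str:
        """Reduce a LaTeX answer string to a canonical form for comparison (illustrative only)."""
        s = expr.strip()
        s = s.replace(r"\left", "").replace(r"\right", "")              # drop sizing commands
        s = s.replace(r"\dfrac", r"\frac").replace(r"\tfrac", r"\frac") # unify fraction variants
        s = s.replace(r"\!", "").replace(r"\,", "").replace("~", "")    # drop spacing commands
        s = re.sub(r"\s+", "", s)                                       # remove remaining whitespace
        return s.strip("$")                                             # strip math-mode delimiters

    def answers_match(predicted: str, reference: str) -> bool:
        return normalize_latex(predicted) == normalize_latex(reference)

    # Both forms reduce to the same canonical string.
    assert answers_match(r"\dfrac{1}{2}", r"$ \frac{1}{2} $")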

Benchmark Details

MATH uses lighteval_native scoring and currently lists 5,000 examples in version 2024-01.

Public Runs

Five recent public runs are included in this static snapshot; in the interactive interface, runs can be filtered by provider, model family, prompt mode, and status. A sketch of how each run's summary figures can be derived from per-sample records follows the list below.

Recent MATH Runs

  • Together / Llama 3.3 70B on MATH: partial; 44.4% accuracy; 6,133 ms p50 latency; 10 samples.
  • Cerebras / Qwen 3 235B A22B Instruct on MATH: completed; 60.0% accuracy; 2,118 ms p50 latency; 20 samples.
  • Groq / Llama 3.3 70B on MATH: completed; 30.4% accuracy; 1,967 ms p50 latency; 23 samples.
  • Cerebras / Qwen 3 235B A22B Instruct on MATH: partial; 57.9% accuracy; 1,714 ms p50 latency; 20 samples.
  • Groq / Llama 3.3 70B on MATH: completed; 39.1% accuracy; 2,092 ms p50 latency; 23 samples.
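
The summary figures shown for each run can be reproduced from per-sample records. The sketch below is a minimal example assuming each sample stores a boolean correctness flag and a latency in milliseconds; the field names (correct, latency_ms) and the summarize_run helper are illustrative, not Benchscope's actual schema.

    from statistics import median

    def summarize_run(samples: list[dict]) -> dict:
        """Aggregate per-sample records into the run-level figures listed above."""
        n = len(samples)
        accuracy = sum(1 for s in samples if s["correct"]) / n
        return {
            "samples": n,
            "accuracy_pct": round(100 * accuracy, 1),
            "p50_latency_ms": median(s["latency_ms"] for s in samples),
        }

    # For example, 9 correct answers out of 23 samples gives 39.1%,
    # matching the last Groq / Llama 3.3 70B run above.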
