MATH Benchmark Results
MATH is a dataset of competition math problems covering algebra, geometry, number theory, and more, requiring exact symbolic answers. Models are evaluated with LaTeX-normalised expression matching.
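The matching step can be sketched as follows. This is a minimal, illustrative normaliser, not the scorer's actual code; the function names and the specific rewrite rules (unwrapping `\boxed{...}`, collapsing `\dfrac`/`\tfrac`, stripping sizing commands and whitespace) are assumptions about what such normalisation typically covers.

```python
import re

def normalize_latex(ans: str) -> str:
    """Minimal LaTeX answer normalisation (illustrative sketch only)."""
    s = ans.strip()
    # Unwrap a surrounding \boxed{...} if present
    m = re.fullmatch(r"\\boxed\{(.*)\}", s)
    if m:
        s = m.group(1)
    # Drop \left / \right sizing commands
    s = s.replace(r"\left", "").replace(r"\right", "")
    # Canonicalise fraction spelling variants
    s = s.replace(r"\dfrac", r"\frac").replace(r"\tfrac", r"\frac")
    # Strip whitespace and stray math delimiters
    s = s.replace(" ", "").strip("$")
    return s

def answers_match(pred: str, gold: str) -> bool:
    # Exact string equality after normalisation
    return normalize_latex(pred) == normalize_latex(gold)

print(answers_match(r"\boxed{\dfrac{1}{2}}", r"\frac{1}{2}"))  # True
```

Real scorers normalise far more aggressively (e.g. symbolic equivalence checks); this sketch only shows the string-matching shape of the approach.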
Benchmark Details
MATH uses lighteval_native scoring and currently lists 5,000 examples in version 2024-01.
Public Runs
Five recent public runs are included in this static snapshot. Enable JavaScript to filter by provider, model family, prompt mode, and status.
Recent MATH Runs
- Together / Llama 3.3 70B: partial; 44.4% accuracy; p50 latency 6,133 ms; 10 samples.
- Cerebras / Qwen 3 235B A22B Instruct: completed; 60.0% accuracy; p50 latency 2,118 ms; 20 samples.
- Groq / Llama 3.3 70B: completed; 30.4% accuracy; p50 latency 1,967 ms; 23 samples.
- Cerebras / Qwen 3 235B A22B Instruct: partial; 57.9% accuracy; p50 latency 1,714 ms; 20 samples.
- Groq / Llama 3.3 70B: completed; 39.1% accuracy; p50 latency 2,092 ms; 23 samples.
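Because the runs use different sample counts, a plain average of the five scores would be misleading; a sample-weighted mean is the natural aggregate. A small sketch using the figures from the run list above:

```python
# Sample-weighted accuracy across the five runs listed above.
# Scores and sample counts are taken directly from the run list.
runs = [
    (0.444, 10),  # Together / Llama 3.3 70B (partial)
    (0.600, 20),  # Cerebras / Qwen 3 235B A22B Instruct
    (0.304, 23),  # Groq / Llama 3.3 70B
    (0.579, 20),  # Cerebras / Qwen 3 235B A22B Instruct (partial)
    (0.391, 23),  # Groq / Llama 3.3 70B
]
total = sum(n for _, n in runs)                       # 96 samples overall
weighted = sum(acc * n for acc, n in runs) / total    # ~45.8%
print(f"{weighted:.1%} over {total} samples")
```

Note this mixes partial and completed runs and two different models, so it describes the snapshot rather than any single system's performance.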
Benchscope is a JavaScript app. If the interactive interface does not load, enable JavaScript or use the links above for the main public sections.