LLM Math Benchmark Leaderboard
MATH is a dataset of competition math problems covering algebra, geometry, number theory, and more, requiring exact symbolic answers. Models are evaluated with LaTeX-normalised expression matching.
Benchmark Details
MATH uses lighteval_native scoring and currently lists 5,000 examples in version 2024-01.
Public Runs
6 recent public runs are included in this static snapshot.
What MATH Measures
MATH is a benchmark of competition-level mathematics problems drawn from AMC 10, AMC 12, AIME, and similar competitions. Problems span seven subject areas: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus. Difficulty levels range from 1 (accessible) to 5 (competition-hard). The full dataset contains 12,500 problems, split into 7,500 training problems and 5,000 test problems.
Scoring uses equivalence checking on the extracted final answer. Benchscope parses the model output to extract a mathematical expression or value and checks it against the reference answer. The scoring method and the answer normalization rules both affect the result: small differences in how LaTeX expressions are handled (for example, whether \dfrac{1}{2} and \frac{1}{2} count as the same answer) can shift scores.
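To make the extraction and normalization step concrete, here is a minimal sketch in Python. It is not Benchscope's or lighteval's actual scorer. The function names (`extract_boxed`, `normalize`, `is_match`) and the specific normalization rules are assumptions chosen for illustration, and the sketch assumes the model places its final answer in `\boxed{...}`, as the MATH reference solutions do.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None.

    Braces are counted so that nested groups such as \\frac{1}{2} survive.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    collected = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(collected)
        collected.append(ch)
    return None  # unbalanced braces: treat as "no answer found"


def normalize(answer: str) -> str:
    """Apply a few illustrative (hypothetical) LaTeX normalization rules."""
    s = answer.strip()
    # Treat display-style and text-style fractions as plain \frac.
    s = s.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    # Drop \left and \right sizing commands, but not e.g. \leftarrow.
    s = re.sub(r"\\(left|right)(?![a-zA-Z])", "", s)
    # Drop thin-space commands, dollar signs, and all whitespace.
    s = s.replace("\\!", "").replace("\\,", "").replace("$", "")
    s = re.sub(r"\s+", "", s)
    return s.rstrip(".")


def is_match(model_output: str, reference: str) -> bool:
    """String-level match after normalization; no symbolic equivalence."""
    predicted = extract_boxed(model_output)
    if predicted is None:
        return False
    return normalize(predicted) == normalize(reference)


print(is_match("So the answer is $\\boxed{\\dfrac{1}{2}}$.", "\\frac{1}{2}"))  # True
print(is_match("The answer is \\boxed{0.5}", "\\frac{1}{2}"))  # False
```

The second call shows why implementations disagree: a purely string-based normalizer rejects 0.5 against \frac{1}{2}, while a scorer that adds symbolic or numeric equivalence checking would accept it.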
Why MATH Matters for Endpoint Selection
MATH scores are highly differentiated across model families. Unlike benchmarks where top models cluster above 90%, MATH exposes wider score gaps and is a strong signal for identifying genuinely capable reasoning endpoints.
For the same model family hosted by different providers, MATH scores can reveal whether quantization or serving configuration degrades multi-step reasoning ability. A provider that introduces lossy quantization may show a smaller drop on factual benchmarks than on MATH, where reasoning chains are longer and errors compound.
How to Interpret MATH Results
Scores vary substantially across subjects and difficulty levels within the same run. A model scoring 70% overall may be near-perfect on Prealgebra but below 30% on Precalculus. Benchscope reports a single aggregate score over the evaluated sample, so compare runs with matching sample sizes where possible.
Latency on MATH problems tends to be longer than on multiple-choice benchmarks because problems require extended reasoning and longer model outputs. Partial runs on MATH deserve particular scrutiny: if the subset underrepresents hard problems, the aggregate score will appear artificially high.
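The effect of a skewed subset can be shown with a small arithmetic sketch. The per-level accuracies and problem counts below are hypothetical numbers made up for illustration, not results from any run on this page, and `aggregate_accuracy` is an illustrative helper rather than part of any scoring library.

```python
from typing import Dict


def aggregate_accuracy(acc_by_level: Dict[int, float],
                       count_by_level: Dict[int, int]) -> float:
    """Overall accuracy as the count-weighted mean of per-level accuracy."""
    total = sum(count_by_level.values())
    if total == 0:
        raise ValueError("no problems in the sample")
    weighted = sum(acc_by_level[level] * n for level, n in count_by_level.items())
    return weighted / total


# Hypothetical per-level accuracy for one model (levels 1 to 5).
acc = {1: 0.9, 2: 0.8, 3: 0.7, 4: 0.6, 5: 0.4}

# Hypothetical balanced sample: 20 problems at every level.
balanced = {1: 20, 2: 20, 3: 20, 4: 20, 5: 20}

# Hypothetical partial run that never reached the level-5 problems.
skewed = {1: 40, 2: 30, 3: 20, 4: 10, 5: 0}

print(f"balanced: {aggregate_accuracy(acc, balanced):.2f}")  # balanced: 0.68
print(f"skewed:   {aggregate_accuracy(acc, skewed):.2f}")    # skewed:   0.80
```

The model is identical in both cases. Only the difficulty mix changed, yet the aggregate moves from 68% to 80%, which is why the composition of a partial run matters as much as its headline number.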
Caveats
MATH requires careful answer normalization: LaTeX and symbolic expressions can be mathematically equivalent but written differently, and scoring implementations differ in which forms they accept. Scores are also sensitive to the prompting strategy. Few-shot examples and chain-of-thought prompting significantly affect results, so canonical vs custom prompt status matters more here than on factual recall benchmarks.
Competition-level problems test a specialized form of mathematical reasoning that does not generalize directly to applied math, programming, or everyday arithmetic. A high MATH score does not guarantee good performance on coding problems that involve math, or on word problems that require math and language understanding together.
Competition Math Benchmark Results
Benchscope's MATH page covers competition-style math: prealgebra, algebra, geometry, number theory, counting and probability, intermediate algebra, and precalculus. For easier grade-school arithmetic word problems, use GSM8K instead.
Recent MATH Runs
- Vertex MaaS / MiniMax M2 on MATH: queued; score unavailable; latency unavailable; 1 sample.
- Together / Llama 3.3 70B on MATH: partial; 44.4%; 6133 ms p50 latency; 10 samples.
- Cerebras / Qwen 3 235B A22B Instruct on MATH: completed; 60.0%; 2118 ms p50 latency; 20 samples.
- Groq / Llama 3.3 70B on MATH: completed; 30.4%; 1967 ms p50 latency; 23 samples.
- Cerebras / Qwen 3 235B A22B Instruct on MATH: partial; 57.9%; 1714 ms p50 latency; 20 samples.
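With sample counts this small, the listed percentages carry wide uncertainty. The sketch below computes a 95% Wilson score interval for a pass rate. The correct-answer counts used in the example (12 of 20 and 7 of 23) are inferred from the percentages of the two completed runs above and are an assumption, since the page does not list raw counts. `wilson_interval` is an illustrative helper, not a Benchscope function.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# Counts inferred from the listed scores (assumption): 60.0% of 20, 30.4% of 23.
for label, correct, total in [("60.0% on 20 samples", 12, 20),
                              ("30.4% on 23 samples", 7, 23)]:
    low, high = wilson_interval(correct, total)
    print(f"{label}: {low:.1%} to {high:.1%}")
```

Under these assumptions the two intervals come out at roughly 39% to 78% and 16% to 51%. They overlap, so a 30-point gap between two runs of about 20 samples each is weaker evidence than it looks.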