Best LLM Endpoint for MATH

Compare hosted LLM endpoints on the MATH benchmark using public Benchscope runs. The MATH test set contains 5,000 competition-style math problems and is useful for evaluating multi-step mathematical reasoning. Use this page to compare endpoint score, latency, prompt mode, sample size, and raw outputs.

Current Recommendation

Best MATH score: Cerebras / Qwen 3 235B A22B Instruct — 60.0%
Fastest endpoint: Groq / Llama 3.3 70B — 1967 ms p50

Based on completed canonical-prompt public runs. Partial and custom-prompt runs are excluded from winner claims.

Top MATH Scores — Canonical Prompt, Public Runs

| # | Endpoint | Provider | Model | Score | p50 Latency | Samples |
|---|----------|----------|-------|-------|-------------|---------|
| 1 | Cerebras / Qwen 3 235B A22B Instruct | Cerebras | Qwen 3 235B A22B Instruct | 60.0% | 2118 ms | 20 |
| 2 | Groq / Llama 3.3 70B | Groq | Llama 3.3 70B | 39.1% | 2092 ms | 23 |
| 3 | Groq / Llama 3.3 70B | Groq | Llama 3.3 70B | 30.4% | 1967 ms | 23 |
| 4 | Cerebras / Qwen 3 235B A22B Instruct (partial) | Cerebras | Qwen 3 235B A22B Instruct | 57.9% | 1714 ms | 20 |
| 5 | Together / Llama 3.3 70B (partial) | Together AI | Llama 3.3 70B | 44.4% | 6133 ms | 10 |

Canonical-prompt public runs. Completed runs are listed first, sorted by score; partial runs follow and are excluded from winner claims. Explore all MATH runs →
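The winner rule stated above (only completed canonical-prompt public runs count) can be sketched as a filter over the table rows. The `Run` record and its field names are hypothetical and are not Benchscope's schema; the values are copied from the table.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Run:
    """Hypothetical record for one table row (not Benchscope's schema)."""
    endpoint: str
    score_pct: float
    p50_ms: int
    samples: int
    partial: bool


# Values copied from the rankings table.
RUNS = [
    Run("Cerebras / Qwen 3 235B A22B Instruct", 60.0, 2118, 20, False),
    Run("Groq / Llama 3.3 70B", 39.1, 2092, 23, False),
    Run("Groq / Llama 3.3 70B", 30.4, 1967, 23, False),
    Run("Cerebras / Qwen 3 235B A22B Instruct", 57.9, 1714, 20, True),
    Run("Together / Llama 3.3 70B", 44.4, 6133, 10, True),
]


def eligible(runs: List[Run]) -> List[Run]:
    """Keep only completed runs; partial runs cannot win."""
    return [r for r in runs if not r.partial]


best = max(eligible(RUNS), key=lambda r: r.score_pct)
fastest = min(eligible(RUNS), key=lambda r: r.p50_ms)

print(f"Best MATH score: {best.endpoint} at {best.score_pct}%")
print(f"Fastest endpoint: {fastest.endpoint} at {fastest.p50_ms} ms p50")
```

This explains why the 1714 ms partial run is not reported as the fastest endpoint: it is filtered out before the comparison.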

What MATH Measures

MATH is a benchmark of competition-level problems drawn from AMC 10, AMC 12, AIME, and similar contests. Problems span seven subjects: Prealgebra, Algebra, Number Theory, Counting and Probability, Geometry, Intermediate Algebra, and Precalculus. Scoring uses equivalence checking on the extracted final answer.

This comparison uses canonical-prompt runs only, so results are comparable across providers and model families when the sample scope also matches. MATH is more demanding than GSM8K and often better at exposing reasoning differences between endpoints, but it is narrower than MMLU and should not be treated as a general intelligence score.
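The scoring step mentioned above (equivalence checking on the extracted final answer) might look roughly like the sketch below. Benchscope's actual checker is not described on this page, so this is an illustration only. It assumes the model marks its final answer with `\boxed{...}`, as MATH reference solutions do, and the function names and normalization rules are assumptions chosen for the example.

```python
import re
from typing import Optional


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...}, handling nested braces."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never balanced


def normalize(answer: str) -> str:
    """Apply a few illustrative string normalizations (not exhaustive)."""
    answer = answer.strip().rstrip(".")
    answer = answer.replace("\\dfrac", "\\frac").replace("\\tfrac", "\\frac")
    answer = re.sub(r"\\(left|right)(?![a-zA-Z])", "", answer)
    answer = answer.replace("\\!", "").replace("\\,", "").replace("$", "")
    return re.sub(r"\s+", "", answer)


def is_equivalent(model_output: str, reference: str) -> bool:
    """Compare the extracted final answer with the reference after normalization."""
    predicted = extract_boxed(model_output)
    if predicted is None:
        return False
    return normalize(predicted) == normalize(reference)
```

A purely string-based check like this one marks `\boxed{0.5}` as wrong against a reference of `\frac{1}{2}`. Differences of this kind in extraction and normalization are one reason MATH scores vary between benchmarking systems.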

How to Interpret the Results

Higher MATH scores indicate stronger multi-step mathematical reasoning under these specific evaluation conditions. Benchscope compares endpoints, not just model names: the same model family can behave differently across providers because of serving stack, quantization, rate limits, and inference settings, so score differences between providers hosting the same model can reflect infrastructure rather than underlying model capability. Sampling noise also matters at small sample sizes. In the table above, two runs of the same Groq / Llama 3.3 70B endpoint score 39.1% and 30.4% on 23 samples each, a gap of two problems. Prefer full-benchmark runs for stable score estimates.
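To see why full-benchmark runs give more stable estimates, it can help to put a confidence interval around each score. The sketch below uses the Wilson score interval for a binomial proportion. This is a standard statistical technique, not something Benchscope is stated to report. It assumes the Samples column is the number of scored problems, so 60.0% of 20 corresponds to 12 correct and 39.1% of 23 to 9 correct.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Counts inferred from the table under the assumption stated above.
lo_a, hi_a = wilson_interval(12, 20)  # Cerebras / Qwen 3 235B A22B Instruct
lo_b, hi_b = wilson_interval(9, 23)   # Groq / Llama 3.3 70B, higher-scoring run

print(f"60.0% on 20 samples: {lo_a:.1%} to {hi_a:.1%}")
print(f"39.1% on 23 samples: {lo_b:.1%} to {hi_b:.1%}")
```

At these sample sizes the intervals are wide (roughly 39% to 78% and 22% to 59%) and they overlap, so a 20-point gap between two runs is weaker evidence than it looks.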

Caveats

A high MATH score does not mean this endpoint will perform best for your use case. MATH requires careful answer normalization, so scores may differ from results in other benchmarking systems even for identical model versions. Prompting strategy also significantly affects MATH results: all results here use canonical prompts, but the specific canonical prompt matters more on MATH than on multiple-choice benchmarks. Treat small score differences cautiously unless the runs use the same prompt mode, sample scope, benchmark version, and generation settings.
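The last caveat can be turned into a simple pre-comparison check. The sketch below assumes each run is available as a dictionary of metadata. The key names and example values are hypothetical, chosen to mirror the four conditions named above, and do not come from a Benchscope export format.

```python
from typing import Dict, List

# Hypothetical metadata keys mirroring the four conditions in the caveat.
COMPARABILITY_FIELDS = (
    "prompt_mode",
    "sample_scope",
    "benchmark_version",
    "generation_settings",
)


def mismatched_fields(run_a: Dict[str, object], run_b: Dict[str, object]) -> List[str]:
    """Return the comparability fields on which two runs differ."""
    return [f for f in COMPARABILITY_FIELDS if run_a.get(f) != run_b.get(f)]


# Illustrative values only.
run_a = {
    "prompt_mode": "canonical",
    "sample_scope": "full",
    "benchmark_version": "v1",
    "generation_settings": {"temperature": 0.0, "max_tokens": 2048},
}
run_b = dict(run_a, sample_scope="partial")

differences = mismatched_fields(run_a, run_b)
if differences:
    print("Compare with caution; runs differ on:", ", ".join(differences))
```

An empty result means the two runs match on all four conditions. It does not by itself make a small score difference statistically meaningful.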

How to Choose an Endpoint for Math Reasoning

  • Choose the highest-scoring endpoint if answer correctness matters more than speed.
  • Choose the fastest endpoint if you need interactive tutoring, grading, or agent loops.
  • Choose the best cost-adjusted endpoint if you run large batches of math problems.
  • Use MATH together with GSM8K: GSM8K is easier grade-school math, while MATH is more demanding competition-style reasoning.
  • Prefer canonical-prompt runs with full benchmark samples for stable estimates.
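This page lists no pricing, so "cost-adjusted" is not defined above. One plausible reading is cost per correct answer: the price of one attempted problem divided by the fraction answered correctly. The sketch below uses that definition. The prices are placeholders for illustration and are not real provider rates; only the two scores come from the table.

```python
def cost_per_correct(cost_per_problem_usd: float, score_pct: float) -> float:
    """Expected spend per correctly answered problem, in USD."""
    if score_pct <= 0:
        return float("inf")  # no correct answers at any price
    return cost_per_problem_usd / (score_pct / 100.0)


# Placeholder prices (NOT real rates) paired with scores from the table.
higher_score = cost_per_correct(cost_per_problem_usd=0.002, score_pct=60.0)
lower_score = cost_per_correct(cost_per_problem_usd=0.001, score_pct=39.1)

print(f"Higher-scoring endpoint: ${higher_score:.5f} per correct answer")
print(f"Lower-scoring endpoint:  ${lower_score:.5f} per correct answer")
```

With these placeholder prices the lower-scoring endpoint is cheaper per correct answer, which shows why the highest score and the best cost-adjusted choice can differ for large batches.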

Frequently Asked Questions

What is the best LLM endpoint for MATH?

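The best MATH endpoint changes as providers update their deployments. See the rankings table above for the current top-scoring canonical-prompt run on Benchscope. MATH scores spread widely across endpoints (30.4% to 60.0% among the completed runs above), which makes the top ranking more informative than on saturated benchmarks, although small gaps between runs of 20 to 23 samples should be read cautiously.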

What does the MATH benchmark measure?

The MATH test set contains 5,000 competition-style math problems spanning prealgebra, algebra, number theory, counting and probability, geometry, intermediate algebra, and precalculus. It tests multi-step mathematical reasoning rather than simple arithmetic or factual recall. See all MATH runs on Benchscope.

Is MATH harder than GSM8K?

Yes. GSM8K tests grade-school multi-step arithmetic, where most frontier models score above 85%. MATH tests competition-level problems, where scores spread more widely, making it a stronger signal for reasoning capability. Compare endpoints on GSM8K.

Which endpoint is fastest for math reasoning?

See the rankings table above for latency data. MATH problems require longer outputs than multiple-choice benchmarks, so p50 latency is higher and varies more across providers.

Are MATH benchmark results comparable across providers?

Canonical-prompt runs with matching sample sizes are directly comparable. Score differences between providers hosting the same model then reflect infrastructure choices and sampling noise rather than underlying model differences. Read the methodology for full comparability rules.

Why do provider-hosted endpoints differ for the same model family?

Provider-hosted versions of the same model may differ in quantization level, serving configuration, hardware, and inference settings. These differences compound on MATH because multi-step reasoning chains are sensitive to token-level differences. See the Llama 3.3 70B comparison for an example.

Evaluate Your Endpoint

Benchscope is free to use. Run MATH on your own hosted endpoint and make your results public to appear in this comparison.

Run this benchmark on your endpoint → · Explore public MATH runs
