LLM Benchmark Results by Provider
Each benchmark page groups comparable public evaluation runs for one task suite, anchored to a fixed prompt, scoring method, and sample size. Use these pages to compare how providers and endpoints perform on the same task under the same conditions.
How to Use Benchmark Pages
Open a benchmark page to see accuracy scores and latency across all providers that have run it publicly. Filter by provider, model family, or prompt mode. Canonical-prompt runs give the fairest cross-provider comparisons. Check the methodology for how each benchmark is scored.
Flagship Benchmarks
MMLU tests knowledge breadth across 57 subjects and has the most public runs on Benchscope (23). MATH tests competition-level mathematical problem solving and produces the most differentiated scores across providers. These two benchmarks are the best starting points for endpoint comparison.
What Makes Results Comparable
Benchscope anchors each run to a specific task, version, sample size, prompt mode, and scoring method. Two runs are directly comparable only when all of these match; if the prompt mode or sample size differs, account for that difference before comparing scores. The methodology page explains each of these factors.
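As a rough illustration of the comparability rule, the sketch below groups runs by the five anchoring factors named above. The `Run` class, its field names, the provider labels, and the accuracy values are all hypothetical placeholders; they are not Benchscope's data model or real results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    # Hypothetical record; field names are assumptions, not Benchscope's schema.
    provider: str
    task: str
    version: str
    sample_size: int
    prompt_mode: str
    scoring_method: str
    accuracy: float


def comparability_key(run: Run) -> tuple:
    """Return the factors that must all match for two runs to be compared directly."""
    return (run.task, run.version, run.sample_size, run.prompt_mode, run.scoring_method)


def group_comparable(runs: list[Run]) -> dict[tuple, list[Run]]:
    """Bucket runs so that each bucket holds only directly comparable runs."""
    groups: dict[tuple, list[Run]] = {}
    for run in runs:
        groups.setdefault(comparability_key(run), []).append(run)
    return groups


# Placeholder runs with made-up accuracy values, for illustration only.
example_runs = [
    Run("provider-a", "MMLU", "v1", 14042, "canonical", "exact-match", 0.80),
    Run("provider-b", "MMLU", "v1", 14042, "canonical", "exact-match", 0.78),
    Run("provider-c", "MMLU", "v1", 14042, "custom", "exact-match", 0.82),
]

groups = group_comparable(example_runs)
for key, members in groups.items():
    print(key, [r.provider for r in members])
```

In this sketch, `provider-a` and `provider-b` land in the same bucket, while `provider-c` is kept apart because its prompt mode differs.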
Current Benchmark Coverage
- MMLU: Massive Multitask Language Understanding; 14,042 examples; 23 public runs.
- GSM8K: Grade School Math 8K; 1,319 examples; 18 public runs.
- MuSR: Multi-Step Soft Reasoning; 756 examples; 15 public runs.
- IFEval: Instruction Following Evaluation; 541 examples; 14 public runs.
- MATH: Competition Mathematics; 5,000 examples; 6 public runs.
- GAIA (Text-Only): Curated text-only subset of GAIA real-world reasoning tasks; 18 examples; 3 public runs.
- Garak: Refusal rate on Garak's static harmful prompt datasets; 953 examples; 0 public runs.
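The example counts above vary by almost three orders of magnitude, which affects how much weight a small score gap deserves. The sketch below estimates an approximate 95% margin of error for an accuracy score using the standard normal approximation to the binomial. It assumes each run scores every listed example, and it uses an accuracy of 0.5 as a worst case; neither assumption comes from Benchscope's methodology, and the approximation is only a rough guide for very small sets.

```python
import math


def accuracy_margin(accuracy: float, n: int, z: float = 1.96) -> float:
    """Approximate margin of error for an accuracy measured on n examples.

    Uses the normal approximation to the binomial; z = 1.96 corresponds
    to a 95% confidence level.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Example counts copied from the coverage list above.
SAMPLE_SIZES = {
    "MMLU": 14042,
    "GSM8K": 1319,
    "MuSR": 756,
    "IFEval": 541,
    "MATH": 5000,
    "GAIA (Text-Only)": 18,
    "Garak": 953,
}

# Worst-case margin (accuracy = 0.5) for each benchmark, in percentage points.
for name, n in SAMPLE_SIZES.items():
    margin = accuracy_margin(0.5, n)
    print(f"{name}: +/- {margin * 100:.1f} points")
```

Under these assumptions the worst-case margin is under one percentage point for MMLU but more than twenty points for the 18-example GAIA subset, so small differences on GAIA (Text-Only) should be read with caution.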
Browse by Benchmark
- MMLU — 57-subject knowledge breadth across STEM, humanities, and professional domains
- MATH — competition-level mathematical problem solving across algebra, geometry, and number theory
- GSM8K — grade school math word problems requiring multi-step arithmetic reasoning
- IFEval — instruction-following compliance verified by programmatic rule checks
- MuSR — multi-step compositional reasoning over narrative contexts
- GAIA — real-world multi-step reasoning tasks (text-only subset)
- Garak Safety — LLM refusal rate and jailbreak resistance across DoNotAnswer and DAN prompts
Read the methodology to understand how runs are defined, scored, and compared across benchmarks.