MMLU Benchmark Leaderboard
MMLU is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. The benchmark covers 57 subjects across STEM, humanities, social sciences, and other areas.
Benchmark Details
MMLU uses exact_match_letter scoring and currently lists 14,042 examples in version 2024-01.
Public Runs
This static snapshot includes 10 recent public runs. Filtering by provider, model family, prompt mode, and status requires JavaScript.
What MMLU Measures
MMLU (Massive Multitask Language Understanding) presents 14,042 four-choice multiple-choice questions across 57 subjects, covering STEM fields, humanities, social sciences, and professional domains like medicine and law. Questions test factual knowledge and reasoning at difficulty levels ranging from elementary to professional. The benchmark was designed to evaluate knowledge acquired during pretraining, using zero-shot and few-shot prompting.
Each question has one correct letter answer (A, B, C, or D). Benchscope extracts the first letter from the model output and scores it as correct or incorrect. The primary score is the fraction of questions answered correctly across the evaluated sample.
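The scoring rule described above can be sketched roughly as follows. This is a minimal illustration, not Benchscope's actual implementation: the function names `extract_letter` and `accuracy` are hypothetical, and the sketch assumes "first letter" means the first standalone uppercase A, B, C, or D in the output.

```python
import re
from typing import List, Optional

# Assumption: the answer is the first standalone uppercase A-D token.
# Benchscope's real extraction rule is not specified in this page.
_LETTER = re.compile(r"\b([ABCD])\b")


def extract_letter(output: str) -> Optional[str]:
    """Return the first standalone A-D letter in the output, or None."""
    match = _LETTER.search(output)
    return match.group(1) if match else None


def accuracy(outputs: List[str], gold: List[str]) -> float:
    """Fraction of outputs whose extracted letter equals the gold letter."""
    if len(outputs) != len(gold):
        raise ValueError("outputs and gold must have the same length")
    if not outputs:
        raise ValueError("cannot score an empty sample")
    correct = sum(extract_letter(o) == g for o, g in zip(outputs, gold))
    return correct / len(outputs)
```

One limitation of this particular sketch: an output that begins with the article "A" (for example "A good guess is C") is scored as answering A, so a stricter pattern may be needed for verbose model outputs.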
Why MMLU Matters for Endpoint Selection
MMLU is one of the most widely cited benchmarks in LLM evaluations because it tests broad knowledge coverage across diverse domains. High MMLU scores correlate with general knowledge capability but do not guarantee instruction following, code generation, or creative output quality.
MMLU results on Benchscope let you compare how different provider-hosted endpoints handle the same model family under comparable conditions. Because canonical-prompt MMLU runs use a fixed evaluation prompt, score differences between providers hosting the same model most likely reflect infrastructure, quantization, or serving configuration rather than underlying model differences. On small samples, part of the gap may also be sampling noise.
How to Interpret MMLU Results
Compare canonical-prompt runs for the cleanest signal. Check sample size: runs on the full 14,042-example dataset provide more stable estimates than smaller subsets. Scores above 85% are common for frontier models, so MMLU leaves limited headroom for differentiating top-tier endpoints in that range. Latency figures show the p50 (median) response time per example, which indicates typical per-request speed independent of accuracy. It does not measure throughput or tail latency.
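To make the sample-size point concrete, one option is to attach a confidence interval to each score. The sketch below uses the Wilson score interval for a binomial proportion; this is an illustrative choice, not something Benchscope is stated to report, and `wilson_interval` is a hypothetical helper name.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an accuracy of correct/total.

    z = 1.96 corresponds to an approximate 95% interval.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

Under these assumptions, a single correct answer out of one sample gives an interval of roughly 0.21 to 1.0, which says almost nothing about the endpoint, while a score near 92.5% on about 14,000 examples gives an interval less than one percentage point wide.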
Partial runs may skew results if the evaluated subset does not represent the full subject distribution. Because MMLU spans 57 subjects at widely varying difficulty, a partial sample can over- or under-represent particular subjects and shift the aggregate score.
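One way to keep a subset representative is to sample each subject in proportion to its size. The following is a minimal sketch under assumed conditions: the example records are dictionaries with a `subject` key (a hypothetical schema), and `stratified_sample` is an illustrative helper, not part of Benchscope.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_sample(examples: List[Dict], fraction: float, seed: int = 0) -> List[Dict]:
    """Draw roughly `fraction` of the examples from every subject."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)  # fixed seed so the subset is reproducible

    # Group examples by subject (assumed key name).
    by_subject = defaultdict(list)
    for example in examples:
        by_subject[example["subject"]].append(example)

    sample = []
    for subject in sorted(by_subject):
        items = by_subject[subject]
        # Keep at least one example so no subject disappears entirely.
        k = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, k))
    return sample
```

The floor of one example per subject slightly over-represents very small subjects, which is a deliberate trade-off in this sketch to preserve coverage.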
Caveats
MMLU has known contamination risk. Questions may have appeared in training data, so high scores may partly reflect memorization rather than generalizable reasoning. The benchmark was last significantly updated in 2021 and some questions contain known errors. Results from different benchmark versions are not directly comparable.
MMLU tests knowledge recall under multiple-choice constraints and does not measure generation quality, instruction following, or compositional reasoning.
Top MMLU Models and Endpoints
Use the public run table to identify the strongest MMLU endpoint by score, then compare p50 latency before choosing a provider. The best endpoint for broad knowledge is not always the fastest endpoint for production workloads.
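The selection procedure above could be sketched as a two-step filter: keep completed runs whose score is close to the best, then order them by p50 latency. Everything here is illustrative: the `Run` fields, the `shortlist` helper, and the default thresholds are assumptions, not Benchscope's ranking logic.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    endpoint: str
    status: str                # e.g. "completed", "partial", "failed"
    score: Optional[float]     # accuracy in [0, 1], None if unavailable
    p50_ms: Optional[float]    # median latency in ms, None if unavailable
    samples: int


def shortlist(runs: List[Run], min_samples: int = 1000, tolerance: float = 0.01) -> List[Run]:
    """Completed runs within `tolerance` of the best score, fastest first."""
    eligible = [
        r for r in runs
        if r.status == "completed"
        and r.score is not None
        and r.p50_ms is not None
        and r.samples >= min_samples
    ]
    if not eligible:
        return []
    best = max(r.score for r in eligible)
    near_best = [r for r in eligible if best - r.score <= tolerance]
    return sorted(near_best, key=lambda r: r.p50_ms)
```

The `min_samples` threshold excludes runs too small to trust, and `tolerance` treats scores within one percentage point as a tie to be broken by speed. Both values are placeholders to tune for a given workload.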
For commercial endpoint selection, pair this benchmark page with the ranked MMLU comparison page. The benchmark page gives the full public run context; the comparison page summarizes the best LLM endpoint for MMLU by score, latency, provider, and model family.
MMLU Benchmark Latency
Latency matters when MMLU-style prompts are used for routing, grading, classification, or high-volume evaluation jobs. Benchscope reports p50 latency alongside score so you can compare quality and speed instead of ranking endpoints by accuracy alone.
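For readers unfamiliar with the metric, p50 is the median of the per-request latencies. The sketch below contrasts it with the mean on made-up numbers; how Benchscope actually aggregates latency is not specified here, and `p50_ms` is a hypothetical helper name.

```python
import statistics
from typing import Sequence


def p50_ms(latencies_ms: Sequence[float]) -> float:
    """Median (50th percentile) of per-request latencies in milliseconds."""
    if not latencies_ms:
        raise ValueError("need at least one latency measurement")
    return statistics.median(latencies_ms)


# Illustrative values only: two fast requests and one slow outlier.
example_latencies = [100.0, 200.0, 5000.0]
typical = p50_ms(example_latencies)
average = statistics.mean(example_latencies)
```

In this example the median stays at the middle request while the mean is pulled up by the single slow one, which is why p50 describes typical speed but hides tail behavior.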
Recent MMLU Runs
- 4router.net on MMLU: completed; 100.0%; 2748 ms p50 latency; 1 sample.
- code.newcli.com on MMLU: queued; score unavailable; latency unavailable; 14,042 samples.
- api.code-relay.com on MMLU: failed; score unavailable; latency unavailable; 14,042 samples.
- 4router.net on MMLU: partial; 92.5%; 3192 ms p50 latency; 14,042 samples.
- code.newcli.com on MMLU: cancelled; score unavailable; latency unavailable; 14,042 samples.