MMLU Benchmark Results

MMLU (Massive Multitask Language Understanding) is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. It covers 57 subjects across STEM, the humanities, the social sciences, and other areas.

Benchmark Details

MMLU uses exact_match_letter scoring and currently lists 14,042 examples in version 2024-01.
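
The page names the metric but does not show how it is computed. As a rough illustration only, letter-based exact matching might look like the sketch below. It assumes four answer options labeled A to D and a model prompted to reply with a single letter; the function names `normalize_letter` and `exact_match_letter_accuracy` are hypothetical and are not taken from Benchscope.

```python
from typing import Optional, Sequence

VALID_LETTERS = {"A", "B", "C", "D"}  # assumption: four-option multiple choice


def normalize_letter(output: str) -> Optional[str]:
    """Reduce a raw model reply such as "(b)" or "C." to a bare letter.

    Returns None when the reply is not a single answer letter, so longer
    free-text replies are counted as wrong rather than guessed at.
    """
    cleaned = output.strip().strip("().:").strip().upper()
    return cleaned if cleaned in VALID_LETTERS else None


def exact_match_letter_accuracy(
    predictions: Sequence[str], references: Sequence[str]
) -> float:
    """Fraction of replies whose normalized letter equals the gold letter."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        raise ValueError("at least one example is required")
    correct = sum(
        normalize_letter(pred) == gold
        for pred, gold in zip(predictions, references)
    )
    return correct / len(references)


if __name__ == "__main__":
    preds = ["B", "(C)", "The answer is A", "d."]
    golds = ["B", "C", "A", "D"]
    print(exact_match_letter_accuracy(preds, golds))
```

Under this strict reading, a reply like "The answer is A" scores as incorrect because it is not a bare letter. A real scorer may be more lenient about extraction, so identical model outputs can score differently depending on that choice.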

Public Runs

Four recent public runs are included in this static snapshot.

Recent MMLU Runs

Together / Llama 3.3 70B on MMLU: completed; 85.0%; 885 ms p50 latency; 100 samples.
Groq / Llama 3.3 70B on MMLU: partial; 86.5%; 374 ms p50 latency; 100 samples.
Groq / Llama 3.3 70B on MMLU: completed; 30.0%; 608 ms p50 latency; 100 samples.
Groq / Llama 3.3 70B on MMLU: completed; 30.0%; 824 ms p50 latency; 10 samples.
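
Each run above scores 10 or 100 samples, a small fraction of the 14,042 listed examples, so the reported percentages carry noticeable sampling noise. The sketch below estimates that noise with the usual binomial standard error, sqrt(p(1 - p) / n). It assumes the samples were drawn independently and at random, which the page does not state; the helper name `accuracy_standard_error` is hypothetical.

```python
import math


def accuracy_standard_error(accuracy: float, n_samples: int) -> float:
    """Binomial standard error of an accuracy estimated from n_samples items.

    Assumes independent, randomly drawn samples (not confirmed by the source).
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction between 0 and 1")
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


if __name__ == "__main__":
    # (accuracy, sample count) pairs copied from the run list above.
    runs = [(0.850, 100), (0.865, 100), (0.300, 100), (0.300, 10)]
    for acc, n in runs:
        se = accuracy_standard_error(acc, n)
        print(f"{acc:.1%} on {n} samples: standard error {se:.1%}")
```

Under these assumptions, the 85.0% run has a standard error of roughly 3.6 percentage points and the 10-sample run roughly 14.5 points, so the 85.0% and 86.5% results are not clearly distinguishable at this sample size.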
