MMLU Benchmark Results
MMLU is a benchmark designed to measure knowledge acquired during pretraining by evaluating models exclusively in zero-shot and few-shot settings. The benchmark covers 57 subjects across STEM, humanities, social sciences, and other areas.
Benchmark Details
MMLU uses exact_match_letter scoring (the first answer letter extracted from the model output must match the gold letter) and currently lists 14,042 examples in version 2024-01.
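A minimal sketch of what exact_match_letter scoring typically looks like, assuming answers are the letters A–D; the function name and regex here are illustrative, not Benchscope's actual implementation:

```python
import re

def exact_match_letter(prediction: str, gold: str) -> bool:
    """Score 1/0 by comparing the first standalone answer letter
    (A-D) found in the model output against the gold letter."""
    match = re.search(r"\b([ABCD])\b", prediction)
    return match is not None and match.group(1) == gold

# Free-form output still scores correctly as long as a bare letter appears.
print(exact_match_letter("The answer is B.", "B"))  # True
print(exact_match_letter("I believe it is C", "B"))  # False
```

Accuracy over a run is then the mean of these per-example scores.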
Public Runs
Four recent public runs are included in this static snapshot.
Recent MMLU Runs
- Together / Llama 3.3 70B on MMLU: completed; 85.0%; 885 ms p50 latency; 100 samples.
- Groq / Llama 3.3 70B on MMLU: partial; 86.5%; 374 ms p50 latency; 100 samples.
- Groq / Llama 3.3 70B on MMLU: completed; 30.0%; 608 ms p50 latency; 100 samples.
- Groq / Llama 3.3 70B on MMLU: completed; 30.0%; 824 ms p50 latency; 10 samples.