How LLM Benchmark Results Work
Benchscope records evaluation metadata at every step so results can be inspected, compared, and trusted with appropriate caveats. This page explains what each field means and what you can and cannot infer from public runs.
What a Run Is
A run is one evaluation of a specific provider-hosted endpoint on a specific benchmark. It records the score, latency, sample size, prompt used, lifecycle state, and raw per-example outputs — including the rendered prompt and the full model output for every question.
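The fields listed above can be pictured as a small record type. The following is a minimal sketch only: the class names, field names, and the choice to compute the score as the fraction of correct examples are illustrative assumptions, not Benchscope's actual schema or scoring rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleOutput:
    """One benchmark question as recorded in a run (hypothetical field names)."""
    rendered_prompt: str  # the prompt exactly as it was sent to the endpoint
    model_output: str     # the full, unedited model response
    correct: bool         # scoring verdict for this example


@dataclass
class Run:
    """One evaluation of one endpoint on one benchmark (hypothetical schema)."""
    endpoint: str
    benchmark: str
    prompt_mode: str                  # "canonical" or "custom"
    lifecycle_state: str              # e.g. "completed" or "partial"
    latency_seconds: Optional[float]  # None if no latency was measured
    outputs: List[ExampleOutput] = field(default_factory=list)

    @property
    def sample_size(self) -> int:
        """Number of examples that were actually evaluated."""
        return len(self.outputs)

    @property
    def score(self) -> Optional[float]:
        """Fraction of examples scored correct (assumed metric), or None if empty."""
        if not self.outputs:
            return None
        return sum(o.correct for o in self.outputs) / len(self.outputs)
```

Deriving the sample size and score from the per-example outputs, rather than storing them separately, keeps the aggregate number tied to the raw records it summarizes.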
What Makes Runs Comparable
Runs are most directly comparable when they share the same benchmark, version, and sample size, and both use the canonical prompt. If any of these differ, take care before interpreting a score gap as a capability difference.
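One way to apply this rule is to list every factor on which two runs differ before comparing their scores. The sketch below assumes each run is a plain dictionary; the key names and the `comparability_issues` helper are hypothetical and are not part of any Benchscope API.

```python
from typing import Dict, List

# Factors the text above says should match (key names are hypothetical).
COMPARABILITY_KEYS = ("benchmark", "version", "prompt_mode", "sample_size")


def comparability_issues(run_a: Dict, run_b: Dict) -> List[str]:
    """Return the reasons two runs are not directly comparable (empty if none)."""
    issues = []
    for key in COMPARABILITY_KEYS:
        if run_a.get(key) != run_b.get(key):
            issues.append(f"{key} differs: {run_a.get(key)!r} vs {run_b.get(key)!r}")
    # Matching prompt modes are not enough: both runs should be canonical.
    same_mode = run_a.get("prompt_mode") == run_b.get("prompt_mode")
    if same_mode and run_a.get("prompt_mode") != "canonical":
        issues.append("both runs use a non-canonical prompt")
    return issues
```

An empty list means none of the listed factors stands in the way of a direct comparison. A non-empty list does not forbid comparing the runs; it names what to account for first.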
Canonical vs Custom Prompts
Canonical prompts follow the standard evaluation prompt and are intended for direct cross-provider comparison. Custom prompts deviate from that standard; their results may differ significantly from canonical results and are flagged. A custom-prompt score reflects both model capability and prompt design choices.
Sample Scope and Sample Size
Full runs evaluate every example in the benchmark dataset. Partial runs evaluate a subset, either by design or because the run was interrupted. Partial runs are marked and their sample size is always shown. Small samples produce less stable score estimates, particularly on diverse benchmarks like MMLU.
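To get a feel for how much sample size affects stability, a confidence interval can be put around an accuracy score. The sketch below uses the Wilson score interval, a standard choice for proportions. It assumes the score is a simple fraction of correct answers and that examples are independent draws, which is a simplification: on a diverse benchmark, a subset may also over- or under-represent some subjects, and this interval does not capture that.

```python
import math
from typing import Tuple


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for an accuracy of correct/total (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    z2 = z * z
    denom = 1 + z2 / total
    centre = (p + z2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z2 / (4 * total * total)) / denom
    return centre - half_width, centre + half_width


# The same 70% accuracy measured on 100 examples and on 10,000 examples.
small = wilson_interval(70, 100)
large = wilson_interval(7000, 10000)
```

With these inputs the 100-example interval spans roughly 0.60 to 0.78, while the 10,000-example interval is about 0.69 to 0.71. A gap of a few points between two small partial runs can therefore fall well within noise.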
Lifecycle States
A run is always in one of the following states:
- Queued: submitted, waiting to start
- Running: actively evaluating
- Completed: all examples evaluated and scored
- Partial: stopped before completion; results are available but incomplete
- Failed: an error prevented results
- Cancelled
Only Completed and Partial runs have score data.
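When working with run records in your own code, the rule about which states carry score data can be encoded once and reused. This is a minimal sketch; the enum, its string values, and the helper name are illustrative assumptions rather than identifiers Benchscope exposes.

```python
from enum import Enum


class RunState(Enum):
    """Lifecycle states described above (string values are hypothetical)."""
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    PARTIAL = "partial"
    FAILED = "failed"
    CANCELLED = "cancelled"


# Per the text above, only these two states have score data.
STATES_WITH_SCORES = frozenset({RunState.COMPLETED, RunState.PARTIAL})


def has_score_data(state: RunState) -> bool:
    """Return True if a run in this state is expected to have a score."""
    return state in STATES_WITH_SCORES
```

Filtering on `has_score_data` before reading a score avoids treating a queued, running, failed, or cancelled run as if it had scored zero.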
What You Can and Cannot Infer
You can compare scores across providers hosting the same model family on canonical-prompt runs of the same benchmark with matching sample size. You cannot infer that a higher-scoring endpoint will perform better on your specific task, because benchmarks test specific skills under controlled conditions. A benchmark score reflects the combination of model capability, provider infrastructure, and evaluation methodology.
Why Raw Outputs Matter
Aggregate scores can hide patterns that are only visible in individual examples. Benchscope records the full rendered prompt and raw model output for every example so you can inspect how a specific question was presented, what the model said, and why it was scored as correct or incorrect. Raw outputs are the ground truth behind every aggregate number.
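As an illustration of what per-example records make possible, the sketch below breaks an overall accuracy down by group and pulls out the failures for manual reading. The record layout (`subject`, `correct`, `model_output`) and both helper names are assumptions made for this example, and the sample data is invented.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def accuracy_by_group(examples: Iterable[Dict], key: str = "subject") -> Dict[str, float]:
    """Compute the fraction of correct examples within each group."""
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for ex in examples:
        group = ex[key]
        totals[group] += 1
        hits[group] += int(ex["correct"])
    return {group: hits[group] / totals[group] for group in totals}


def failed_examples(examples: Iterable[Dict]) -> List[Dict]:
    """Return the records scored incorrect, ready for manual inspection."""
    return [ex for ex in examples if not ex["correct"]]


# Invented toy records: the overall accuracy is 50%, but it is uneven by subject.
examples = [
    {"subject": "algebra", "correct": True, "model_output": "x = 4"},
    {"subject": "algebra", "correct": True, "model_output": "x = 7"},
    {"subject": "law", "correct": False, "model_output": "Answer: B"},
    {"subject": "law", "correct": False, "model_output": "The answer is (C)."},
]
per_subject = accuracy_by_group(examples)
to_review = failed_examples(examples)
```

Reading the failed records directly can also show whether a miss came from the model's reasoning or from how its answer was formatted, which the aggregate score alone cannot distinguish.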
Explore on Benchscope
- MMLU — see how 57-subject knowledge is scored and compared
- MATH — see how competition math answers are extracted and scored
- GSM8K — see how grade school math word problems are scored
- GAIA — see how real-world reasoning tasks are scored
- All benchmarks on Benchscope
- Model families and hosted endpoints
- Groq benchmark results
- Together AI benchmark results
- Cerebras benchmark results
- Best LLM endpoint for MMLU
- Best LLM endpoint for MATH
- Best LLM endpoint for GSM8K
- Llama 3.3 70B on Groq vs Together AI