How LLM Benchmark Results Work

Benchscope records evaluation metadata at every step so results can be inspected, compared, and trusted with appropriate caveats. This page explains what each field means and what you can and cannot infer from public runs.

What a Run Is

A run is one evaluation of a specific provider-hosted endpoint on a specific benchmark. It records the score, latency, sample size, prompt, lifecycle state, and raw per-example outputs, including the rendered prompt and the full model output for every question.
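To make the shape of a run concrete, here is a minimal sketch of how such a record could be modelled. Every class and field name below (Run, ExampleOutput, endpoint, latency_ms, and so on) is an assumption made for illustration, as are the sample values; this is not Benchscope's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ExampleOutput:
    """One benchmark question as recorded in a run (hypothetical layout)."""
    rendered_prompt: str  # the prompt exactly as it was presented
    model_output: str     # the full, unmodified model response
    correct: bool         # scoring verdict for this example


@dataclass
class Run:
    """One evaluation of one endpoint on one benchmark (hypothetical layout)."""
    endpoint: str
    benchmark: str
    version: str
    prompt_mode: str                    # "canonical" or "custom"
    state: str                          # lifecycle state, e.g. "Completed"
    sample_size: int
    score: Optional[float] = None       # None when no score data exists
    latency_ms: Optional[float] = None
    examples: List[ExampleOutput] = field(default_factory=list)


# Illustrative values only, not real results.
run = Run(
    endpoint="provider-a/model-x",
    benchmark="MMLU",
    version="1",
    prompt_mode="canonical",
    state="Completed",
    sample_size=2,
    score=0.5,
    examples=[
        ExampleOutput("Q1 ... Answer:", "B", True),
        ExampleOutput("Q2 ... Answer:", "D", False),
    ],
)
```

Keeping the per-example outputs inside the run record means the aggregate score can always be checked against the examples it was computed from.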

What Makes Runs Comparable

Runs are most directly comparable when they share the same benchmark, the same version, the canonical prompt mode, and the same sample size. A difference in any of these factors requires care before a score gap is read as a capability difference.
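The comparability rule can be sketched as a small check that reports which factors stand in the way of a direct comparison. The dictionary keys and the function name comparability_issues are hypothetical and chosen only for this example.

```python
from typing import Dict, List

# Factors named in the text above; the key names are assumptions.
COMPARABILITY_FIELDS = ("benchmark", "version", "prompt_mode", "sample_size")


def comparability_issues(a: Dict, b: Dict) -> List[str]:
    """Return the factors that prevent a direct comparison of two runs.

    An empty list means the runs share benchmark, version and sample size,
    and both use the canonical prompt.
    """
    issues = [name for name in COMPARABILITY_FIELDS if a[name] != b[name]]
    both_canonical = (
        a["prompt_mode"] == "canonical" and b["prompt_mode"] == "canonical"
    )
    # Two custom-prompt runs have matching modes but are still not canonical.
    if not both_canonical and "prompt_mode" not in issues:
        issues.append("prompt_mode")
    return issues


run_a = {"benchmark": "MMLU", "version": "1",
         "prompt_mode": "canonical", "sample_size": 500}
run_b = dict(run_a, sample_size=100)

print(comparability_issues(run_a, run_a))  # no issues
print(comparability_issues(run_a, run_b))  # sample sizes differ
```

A non-empty result does not make a comparison meaningless; it lists what to account for before interpreting the gap.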

Canonical vs Custom Prompts

Canonical prompts use the benchmark's standard evaluation prompt and are intended for direct cross-provider comparison. Custom prompts deviate from that standard, so their runs are flagged and their results may differ significantly. A custom-prompt score reflects both model capability and prompt design choices.

Sample Scope and Sample Size

Full runs evaluate every example in the benchmark dataset. Partial runs evaluate a subset, either by design or because the run was interrupted. Partial runs are marked and their sample size is always shown. Small samples produce less stable score estimates, particularly on diverse benchmarks like MMLU.
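As a rough illustration of why small samples are less stable, the sketch below estimates the 95% margin of error of an accuracy score using the normal approximation to the binomial. It assumes the score is a simple fraction of correct answers and that examples are independent; the function name score_margin is hypothetical, and nothing here describes how Benchscope itself reports uncertainty.

```python
import math


def score_margin(accuracy: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an accuracy measured on n examples.

    Uses the normal approximation: z * sqrt(p * (1 - p) / n).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# The same 70% accuracy measured on 100 examples and on 10,000 examples.
small = score_margin(0.70, 100)
large = score_margin(0.70, 10_000)
print(f"n=100:    +/- {small:.3f}")
print(f"n=10000:  +/- {large:.3f}")
```

Under these assumptions, a 70% score on 100 examples carries a margin of roughly 9 percentage points, while the same score on 10,000 examples carries a margin of under 1 point. A gap of a few points between two small partial runs may therefore be noise.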

Lifecycle States

A run is always in one of six states: Queued (submitted, waiting to start), Running (actively evaluating), Completed (all examples evaluated and scored), Partial (stopped before completion, results available but incomplete), Failed (an error prevented results), or Cancelled. Only Completed and Partial runs have score data.
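The lifecycle states and the rule about score data can be sketched as an enumeration. The names RunState and has_score_data are assumptions for this example; only the six state labels and the scoring rule come from the description above.

```python
from enum import Enum


class RunState(Enum):
    QUEUED = "Queued"
    RUNNING = "Running"
    COMPLETED = "Completed"
    PARTIAL = "Partial"
    FAILED = "Failed"
    CANCELLED = "Cancelled"


# Only these two states carry score data.
SCORED_STATES = frozenset({RunState.COMPLETED, RunState.PARTIAL})


def has_score_data(state: RunState) -> bool:
    """Return True if a run in this state is expected to have a score."""
    return state in SCORED_STATES


for state in RunState:
    print(f"{state.value:<10} has score data: {has_score_data(state)}")
```

Filtering on this rule before reading a score avoids treating a Failed or Cancelled run as a run that scored zero.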

What You Can and Cannot Infer

You can compare scores across providers hosting the same model family, provided the runs use the canonical prompt and matching sample sizes. You cannot infer that a higher-scoring endpoint will perform better on your specific task, because benchmarks test specific skills under controlled conditions. A benchmark score reflects the combination of model capability, provider infrastructure, and evaluation methodology.

Why Raw Outputs Matter

Aggregate scores can hide interesting patterns. Benchscope records the full rendered prompt and raw model output for every example so you can inspect how a specific question was presented, what the model said, and why it was scored as correct or incorrect. Raw outputs are the ground truth behind every aggregate number.
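A short sketch of how per-example records expose a pattern that the aggregate hides. The record layout, the category field, and the sample data are all invented for illustration; real exports may be structured differently.

```python
from collections import defaultdict
from typing import Dict, List


def overall_accuracy(examples: List[dict]) -> float:
    """Fraction of examples scored as correct."""
    return sum(1 for ex in examples if ex["correct"]) / len(examples)


def accuracy_by_category(examples: List[dict]) -> Dict[str, float]:
    """Accuracy computed separately for each category label."""
    counts = defaultdict(lambda: [0, 0])  # category -> [correct, total]
    for ex in examples:
        bucket = counts[ex["category"]]
        bucket[0] += int(ex["correct"])
        bucket[1] += 1
    return {cat: correct / total for cat, (correct, total) in counts.items()}


# Invented per-example records, not real results.
examples = [
    {"category": "algebra", "model_output": "B", "correct": True},
    {"category": "algebra", "model_output": "A", "correct": True},
    {"category": "law", "model_output": "C", "correct": False},
    {"category": "law", "model_output": "I am not sure.", "correct": False},
]

print(overall_accuracy(examples))
print(accuracy_by_category(examples))

# Pull the raw outputs behind the incorrect verdicts for manual inspection.
for ex in examples:
    if not ex["correct"]:
        print(ex["category"], "->", repr(ex["model_output"]))
```

In this invented data the overall score is 50%, yet every algebra question is answered correctly and every law question is missed. Reading the raw outputs of the missed questions also shows whether the model answered wrongly or produced something the scorer could not parse.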
