Benchmark Methodology

Benchscope records evaluation metadata transparently so that public LLM benchmark results can be inspected and compared with the right caveats.

What Counts As A Run

A run is one evaluation of a specific provider-hosted endpoint on a specific benchmark. It records score, latency, sample size, prompt, lifecycle state, and raw per-example outputs.
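
As a sketch, a run record might look like the following. The field names are illustrative assumptions, not Benchscope's actual schema:

```ts
// Hypothetical shape of a single run record; field names are
// assumptions for illustration, not Benchscope's actual schema.
interface Run {
  benchmark: string;            // e.g. "mmlu"
  modelFamily: string;          // underlying model, e.g. "gpt-4o"
  hostingProvider: string;      // who serves it, e.g. "azure"
  promptType: "canonical" | "custom";
  state: "queued" | "running" | "completed" | "partial" | "failed" | "cancelled";
  score: number | null;         // null until scoring has produced a result
  meanLatencyMs: number | null; // null until at least one example completes
  sampleSize: number;           // number of examples actually evaluated
  rawOutputs: string[];         // raw per-example model outputs
}
```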

Model Family Vs Hosting Provider

A model family names the underlying model. A hosting provider names the company or platform serving that model. Because the same model can perform differently depending on how and where it is served, Benchscope treats each provider-hosted endpoint as a separate comparison target.
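
In practice this means comparisons are keyed on the (model family, hosting provider) pair, never on the model family alone. A minimal sketch, with hypothetical names:

```ts
// Hypothetical: a comparison target is identified by the pair of
// model family and hosting provider, not the family alone.
interface Endpoint {
  modelFamily: string;     // e.g. "llama-3-70b"
  hostingProvider: string; // e.g. "groq" vs "together"
}

function comparisonKey(e: Endpoint): string {
  return `${e.hostingProvider}/${e.modelFamily}`;
}

// Two endpoints serving the same model family remain distinct targets:
comparisonKey({ modelFamily: "llama-3-70b", hostingProvider: "groq" });
// => "groq/llama-3-70b"
comparisonKey({ modelFamily: "llama-3-70b", hostingProvider: "together" });
// => "together/llama-3-70b"
```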

Canonical Vs Custom Prompts

Canonical runs use the benchmark's standard evaluation prompt and are intended for direct comparison. Runs with custom prompts are flagged, since prompt changes can significantly affect measured performance.
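
One way a consumer of the data might respect that flag, assuming a promptType field as in the earlier sketch (an illustrative assumption, not Benchscope's API):

```ts
// Hypothetical filter: keep only canonical-prompt runs when
// building a direct cross-provider comparison.
type PromptType = "canonical" | "custom";

interface ScoredRun {
  promptType: PromptType;
  score: number;
}

function directlyComparable(runs: ScoredRun[]): ScoredRun[] {
  return runs.filter((r) => r.promptType === "canonical");
}
```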

Run States

Runs can be queued, running, completed, partial, failed, or cancelled. Partial results may still be informative, but check a run's sample size and completion status before comparing it against completed runs.
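
A sketch of how those states might gate comparison. The minimum sample size here is an illustrative assumption, not a Benchscope rule:

```ts
type RunState =
  | "queued"
  | "running"
  | "completed"
  | "partial"
  | "failed"
  | "cancelled";

// Hypothetical guard: only completed runs, or partial runs with
// enough evaluated examples, are admitted to a comparison. The
// threshold is illustrative, not a Benchscope policy.
function usableForComparison(
  state: RunState,
  sampleSize: number,
  minSamples = 100,
): boolean {
  if (state === "completed") return true;
  if (state === "partial") return sampleSize >= minSamples;
  return false; // queued, running, failed, cancelled
}
```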
