Compare LLM Endpoints by Benchmark

Find the best-performing provider-hosted LLM endpoint for your benchmark. Browse MMLU, MATH, GSM8K, IFEval, and MuSR results — then inspect scores, latency, prompts, and raw outputs for every public run.

Benchmark Results

Each benchmark page groups comparable public runs for one task suite, anchored to a fixed prompt, scoring method, and sample size. MMLU and MATH have the most coverage.
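As a rough illustration of what anchors that grouping (the names below are hypothetical, not Benchscope's actual schema), the key shared by every run on a benchmark page could be sketched in TypeScript as:

// Hypothetical sketch of the grouping key a benchmark page fixes.
// All names and types are illustrative assumptions, not Benchscope's schema.
interface BenchmarkPageKey {
  taskSuite: "MMLU" | "MATH" | "GSM8K" | "IFEval" | "MuSR"; // one task suite per page
  promptId: string;       // the fixed prompt every grouped run uses
  scoringMethod: string;  // how answers are scored, e.g. exact match
  sampleSize: number;     // how many examples each run evaluates
}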

Model Families and Endpoints

A model family is the underlying model identity. A hosted endpoint is a specific provider's deployment of that model. The same model family can produce different benchmark results depending on which provider hosts it.
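To make the distinction concrete, here is a minimal sketch, assuming hypothetical type and field names rather than Benchscope's actual data model:

// Hypothetical sketch: one model family, many provider-hosted endpoints.
// Names are illustrative assumptions, not Benchscope's actual data model.
interface ModelFamily {
  familyId: string;  // the underlying model identity
  name: string;
}

interface HostedEndpoint {
  endpointId: string;  // a specific provider's deployment of the model
  familyId: string;    // which model family it serves
  provider: string;    // who hosts it; scores can differ by provider
}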

Runs

Public runs record the endpoint, benchmark, prompt mode, lifecycle state, score, latency, sample count, and raw per-example outputs. Use canonical-prompt runs for the fairest cross-provider comparisons.
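A minimal sketch of that run record, with hypothetical field names (not Benchscope's actual schema), plus the canonical-prompt filter suggested above:

// Hypothetical sketch of a public run record; names are illustrative only.
interface PublicRun {
  endpointId: string;
  benchmark: string;
  promptMode: "canonical" | "custom";  // canonical-prompt runs compare most fairly
  lifecycleState: string;              // run lifecycle state, e.g. completed
  score: number;                       // aggregate score on the benchmark
  latencyMs: number;                   // observed endpoint latency
  sampleCount: number;                 // number of evaluated examples
  examples: { input: string; rawOutput: string }[];  // raw per-example outputs
}

// Keep only canonical-prompt runs for the fairest cross-provider comparison.
const comparableRuns = (runs: PublicRun[]): PublicRun[] =>
  runs.filter((run) => run.promptMode === "canonical");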

Methodology

Benchscope records full evaluation metadata at every step. Read the methodology to understand how runs are defined, what makes results comparable, and what caveats apply.

Flagship Benchmarks

Provider Benchmark Results

Editorial Comparisons

Benchscope is a JavaScript app. If the interactive interface does not load, enable JavaScript or use the links above for the main public sections.