Compare LLM Endpoints by Benchmark

Find the best-performing provider-hosted LLM endpoint for your benchmark. Browse MMLU, MATH, GSM8K, IFEval, and MuSR results — then inspect scores, latency, prompts, and raw outputs for every public run.

What you can inspect

Benchscope makes the following inspectable by default (a sketch of the run record follows this list):

  • Public runs: Representative benchmark runs are linked and inspectable.
  • Raw outputs: Review model responses instead of relying only on scores.
  • Prompt details: See prompt mode, setup, and run configuration.
  • Score + latency: Compare quality and speed in the same view.
  • Sample counts: Check how much data supports each result.
  • Methodology: Understand comparability, caveats, and scoring.
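
For concreteness, this is roughly the record shape a public run could expose. Benchscope's actual schema is not documented here; every field name and type below is an illustrative assumption drawn from the list above and the "Runs" section below.

  // Hypothetical shape of a public run record. All names and types
  // are illustrative assumptions, not Benchscope's published schema.
  interface PublicRun {
    endpoint: string;        // provider-hosted deployment being measured
    benchmark: string;       // task suite, e.g. "MMLU" or "MATH"
    promptMode: string;      // prompt setup and run configuration
    lifecycleState: string;  // run lifecycle, e.g. "published" (assumed value)
    score: number;           // aggregate benchmark score
    latencyMs: number;       // observed response latency
    sampleCount: number;     // number of examples behind the score
    rawOutputs: string[];    // per-example model responses
  }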

Benchmark Results

Each benchmark page groups comparable public runs for one task suite, anchored to a fixed prompt, scoring method, and sample size. MMLU and MATH have the most coverage.
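
A minimal sketch of that grouping rule, under assumed field names: two runs land on the same page only if they agree on benchmark, prompt, scoring method, and sample size.

  // Fields that define comparability. Names are illustrative assumptions.
  interface RunKeyFields {
    benchmark: string;       // task suite the run targets
    promptMode: string;      // the fixed prompt the page is anchored to
    scoringMethod: string;   // how outputs were scored
    sampleCount: number;     // fixed sample size for the group
  }

  // Runs sharing one key are comparable; everything else stays apart.
  function comparabilityKey(run: RunKeyFields): string {
    return [run.benchmark, run.promptMode, run.scoringMethod, run.sampleCount].join("|");
  }

  function groupComparableRuns<T extends RunKeyFields>(runs: T[]): Map<string, T[]> {
    const groups = new Map<string, T[]>();
    for (const run of runs) {
      const key = comparabilityKey(run);
      const bucket = groups.get(key) ?? [];
      bucket.push(run);
      groups.set(key, bucket);
    }
    return groups;
  }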

Model Families and Endpoints

A model family is the underlying model identity. A hosted endpoint is a specific provider's deployment of that model. The same model family can produce different benchmark results depending on which provider hosts it.
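
One way to picture the distinction is as two linked records, where benchmark results attach to the endpoint rather than the family. The field names are assumptions for illustration, not Benchscope's schema.

  interface ModelFamily {
    familyId: string;        // underlying model identity, e.g. "model-x-70b" (hypothetical)
  }

  interface HostedEndpoint {
    familyId: string;        // which model family this deployment serves
    provider: string;        // hosting provider (hypothetical name)
    endpointId: string;      // the unit a benchmark score is recorded against
  }

  // The same family served by two providers yields two distinct endpoints,
  // and each can post a different score on the same benchmark.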

Runs

Public runs record the endpoint, benchmark, prompt mode, lifecycle state, score, latency, sample count, and raw per-example outputs. Use canonical-prompt runs for the fairest cross-provider comparisons.
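
As a sketch of that recommendation, reusing the hypothetical PublicRun shape above: keep only canonical-prompt runs, then rank endpoints by score, breaking ties on latency. The "canonical" label is an assumed value for the prompt-mode field.

  // Keep canonical-prompt runs and rank them: highest score first,
  // lower latency breaking ties.
  function rankCanonicalRuns(runs: PublicRun[]): PublicRun[] {
    return runs
      .filter((run) => run.promptMode === "canonical")
      .sort((a, b) => b.score - a.score || a.latencyMs - b.latencyMs);
  }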

Methodology

Benchscope records full evaluation metadata at every step. Read the methodology to understand how runs are defined, what makes results comparable, and what caveats apply.

Main public sections

  • Flagship Benchmarks
  • Provider Benchmark Results
  • Featured Model Pages
  • Editorial Comparisons

Benchscope is a JavaScript app. If the interactive interface does not load, enable JavaScript or navigate directly to the main public sections listed above.