Model Benchmark Results Across Providers

A model family is the underlying model, such as Llama 3.3 70B, GPT-4o, or Gemini 1.5 Pro. A hosted endpoint is a specific provider's deployment of that model. The same model family can produce different benchmark results depending on which provider hosts it.

Why Provider Matters

Provider-hosted versions of the same model family may differ in quantization level, serving infrastructure, rate limits, and configuration. These differences can affect accuracy, latency, and instruction-following behavior. Benchscope treats each provider-hosted endpoint as a separate comparison target so these differences are visible.
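
As a rough illustration of what "a separate comparison target" can mean in practice, the sketch below keys every run by an endpoint, which is a (model family, provider) pair, and then lines up the endpoints of one family for one benchmark. The class names, the `side_by_side` helper, the provider labels, and the scores are all hypothetical placeholders, not Benchscope's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    family: str    # underlying model, e.g. "Llama 3.3 70B"
    provider: str  # hypothetical provider label


@dataclass(frozen=True)
class Run:
    endpoint: Endpoint
    benchmark: str
    score: float   # placeholder value, not a real result


def side_by_side(runs, family, benchmark):
    """Collect scores per provider for one model family and one benchmark."""
    table = defaultdict(list)
    for run in runs:
        if run.endpoint.family == family and run.benchmark == benchmark:
            table[run.endpoint.provider].append(run.score)
    return dict(table)


# Made-up runs: two providers host the same family and are kept apart.
runs = [
    Run(Endpoint("Llama 3.3 70B", "provider-a"), "MMLU", 0.5),
    Run(Endpoint("Llama 3.3 70B", "provider-b"), "MMLU", 0.25),
    Run(Endpoint("Llama 3.3 70B", "provider-a"), "MATH", 0.75),
]
comparison = side_by_side(runs, "Llama 3.3 70B", "MMLU")
```

Because the provider is part of the key, two deployments of the same family never get averaged together, so a difference in quantization or configuration shows up as a difference between two entries.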

How to Use Model Pages

Open a model family page to see all hosted endpoints and their benchmark results side by side. Filter by benchmark or provider. MMLU and MATH runs have the widest cross-provider coverage, so start there when comparing providers. The methodology page explains how endpoints and model families are defined.

Model Families With Public Runs

Llama 3.3 70B: 25 public runs across 3 endpoints.
Qwen 3 235B A22B Instruct: 11 public runs across 2 endpoints.
MiniMax M2: 6 public runs across 2 endpoints.
Z.ai GLM 4.7: 6 public runs across 1 endpoint.
Qwen3 32B: 4 public runs across 1 endpoint.
Llama 3.1 8B: 2 public runs across 3 endpoints.
GPT-OSS 120B: 1 public run across 2 endpoints.
GPT-OSS 20B: 1 public run across 1 endpoint.
Kimi K2.5: 1 public run across 1 endpoint.
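
Summary lines like the ones above could be produced from a flat list of runs along the following lines. This is only a sketch: the `summarize` function and the record shape are assumptions, and it counts only endpoints that appear in at least one run, whereas the list above includes families with more endpoints than public runs, so the page evidently counts endpoints some other way. The sample records are invented.

```python
from collections import Counter, defaultdict


def summarize(records):
    """Build one summary line per family from (family, endpoint_id) pairs."""
    runs = Counter()
    endpoints = defaultdict(set)
    for family, endpoint_id in records:
        runs[family] += 1
        endpoints[family].add(endpoint_id)

    lines = []
    # most_common() sorts families by run count, highest first.
    for family, n_runs in runs.most_common():
        n_endpoints = len(endpoints[family])
        run_word = "run" if n_runs == 1 else "runs"
        endpoint_word = "endpoint" if n_endpoints == 1 else "endpoints"
        lines.append(
            f"{family}: {n_runs} public {run_word} "
            f"across {n_endpoints} {endpoint_word}."
        )
    return lines


# Invented records for illustration only.
records = [
    ("family-x", "ep-1"),
    ("family-x", "ep-2"),
    ("family-x", "ep-1"),
    ("family-y", "ep-3"),
]
lines = summarize(records)
```

Choosing the singular or plural form from the count keeps a family with a single run from reading "1 public runs".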

Read the methodology for how Benchscope defines model families, hosting providers, and endpoints — and what makes results comparable across providers.

Featured Model Pages

Llama 3.3 70B benchmark results — Compare Llama 3.3 70B benchmark results across providers and endpoints.
Qwen 3 235B A22B benchmark results — Compare Qwen 3 235B A22B benchmark results across providers and endpoints.
DeepSeek R1 benchmark results — Review DeepSeek R1 benchmark scores, latency, and provider coverage.
Llama 4 Scout 17B benchmark results — Review Llama 4 Scout 17B benchmark scores, latency, and provider coverage.
GPT-OSS 120B benchmark results — Review GPT-OSS 120B benchmark scores, latency, and provider coverage.
