Together AI Benchmark Results
Together AI is a cloud inference platform hosting a broad range of open-weight model families. Benchscope records public evaluation runs for one model family hosted on Together AI, covering MUSR, MATH, MMLU, IFEVAL, and GSM8K.
About Together AI Endpoints
Together AI provides broad model coverage across open-weight families. Benchmark scores from Together AI endpoints reflect their specific deployment of each model. Results can differ from the same model hosted by another provider due to quantization choices, serving configuration, and infrastructure differences. Use canonical-prompt runs for the cleanest cross-provider comparisons.
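As a concrete illustration of that filtering step, the sketch below shows how canonical-prompt runs might be selected before ranking providers. The EvalRun shape and its field names are assumptions made for this example, not Benchscope's published schema.

```typescript
// Hypothetical run-record shape; Benchscope's actual export schema may differ.
interface EvalRun {
  provider: string;         // e.g. "together" or "groq"
  model: string;            // e.g. "llama-3.3-70b"
  benchmark: string;        // e.g. "MMLU"
  canonicalPrompt: boolean; // true when the shared canonical prompt was used
  scorePct: number;         // accuracy in percent
  p50LatencyMs: number;     // median per-request latency
}

// Keep only canonical-prompt runs so scores are comparable across providers,
// then sort descending by score for a side-by-side ranking.
function comparableRuns(runs: EvalRun[], model: string, benchmark: string): EvalRun[] {
  return runs
    .filter(r => r.model === model && r.benchmark === benchmark && r.canonicalPrompt)
    .sort((a, b) => b.scorePct - a.scorePct);
}
```

Filtering before sorting keeps quantization- and configuration-dependent runs out of the comparison entirely, rather than trying to adjust for them afterward.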
Hosted Model Families
Model families with public evaluation runs on Together AI: Llama 3.3 70B.
Recent Together AI Runs
- Together / Llama 3.3 70B on MUSR: completed; 30.0%; 597 ms p50 latency; 20 samples.
- Together / Llama 3.3 70B on MATH: partial; 44.4%; 6133 ms p50 latency; 10 samples.
- Together / Llama 3.3 70B on MMLU: completed; 85.0%; 885 ms p50 latency; 100 samples.
- Together / Llama 3.3 70B on IFEVAL: partial; 90.3%; 3824 ms p50 latency; 100 samples.
- Together / Llama 3.3 70B on GSM8K: completed; 97.0%; 1989 ms p50 latency; 100 samples.
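For readers who want to script against these numbers, here is the list above transcribed into a small TypeScript table. The Run type and its field names are illustrative rather than Benchscope's export format, and the reading of a partial status as a run that has not completed its full sample set is an assumption, not documented behavior.

```typescript
type RunStatus = "completed" | "partial";

interface Run {
  benchmark: string;
  status: RunStatus;
  scorePct: number; // accuracy in percent
  p50Ms: number;    // median latency in milliseconds
  samples: number;  // evaluated sample count
}

// The five Together AI / Llama 3.3 70B runs listed above, transcribed verbatim.
const togetherLlama33Runs: Run[] = [
  { benchmark: "MUSR",   status: "completed", scorePct: 30.0, p50Ms: 597,  samples: 20 },
  { benchmark: "MATH",   status: "partial",   scorePct: 44.4, p50Ms: 6133, samples: 10 },
  { benchmark: "MMLU",   status: "completed", scorePct: 85.0, p50Ms: 885,  samples: 100 },
  { benchmark: "IFEVAL", status: "partial",   scorePct: 90.3, p50Ms: 3824, samples: 100 },
  { benchmark: "GSM8K",  status: "completed", scorePct: 97.0, p50Ms: 1989, samples: 100 },
];

// Assuming partial runs have not finished their full sample set,
// flag their scores as provisional before drawing comparisons.
const provisional = togetherLlama33Runs
  .filter(r => r.status === "partial")
  .map(r => r.benchmark);
console.log("Provisional scores:", provisional); // ["MATH", "IFEVAL"]
```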
Related
- MMLU benchmark results across all providers
- MATH benchmark results across all providers
- GSM8K benchmark results across all providers
- All model families on Benchscope
- Best LLM endpoint for MMLU
- Best LLM endpoint for MATH
- Best LLM endpoint for GSM8K
- Llama 3.3 70B on Groq vs Together AI
- How benchmark results are defined and compared
Benchscope is a JavaScript app. If the interactive interface does not load, enable JavaScript or use the links above for the main public sections.