IFEval Benchmark Results
IFEval measures a model's ability to follow verifiable natural-language instructions such as "respond in bullet points" or "use exactly 3 paragraphs". Scoring is prompt-level: a response counts as correct only if it satisfies every instruction in the prompt.
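Prompt-level scoring can be sketched as an all-or-nothing check over a prompt's instructions. The checker functions below are illustrative stand-ins, not IFEval's actual verifiers:

```python
# Minimal sketch of prompt-level (strict) scoring, assuming each
# instruction is represented as a boolean check over the response.
# These checks are hypothetical examples, not IFEval's real verifier set.

def uses_bullets(response: str) -> bool:
    # Every non-empty line must be a bullet point.
    return all(line.startswith("- ")
               for line in response.splitlines() if line.strip())

def has_exact_paragraphs(response: str, n: int = 3) -> bool:
    # Paragraphs separated by blank lines; count must equal n exactly.
    return len([p for p in response.split("\n\n") if p.strip()]) == n

def prompt_level_score(response: str, checks) -> bool:
    # Prompt-level: ALL instructions must pass for the prompt to count.
    return all(check(response) for check in checks)

resp = "- point one\n- point two"
print(prompt_level_score(resp, [uses_bullets]))          # True
print(prompt_level_score(resp, [uses_bullets,
                                has_exact_paragraphs]))  # False: only one paragraph
```

Because the score is a conjunction of all checks, a single failed instruction zeroes out the whole prompt, which is why prompt-level accuracy is stricter than per-instruction accuracy.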
Benchmark Details
IFEval uses lighteval_native scoring and currently lists 541 examples in version 2024-01.
Public Runs
10 recent public runs are included in this static snapshot.
Recent IFEval Runs
- Together / Llama 3.3 70B on IFEval: partial; 90.3%; 3824 ms p50 latency; 100 samples.
- Groq / Llama 3.3 70B on IFEval: completed; 90.0%; 1108 ms p50 latency; 100 samples.
- Cerebras / Qwen 3 235B A22B Instruct on IFEval: completed; 67.0%; 553 ms p50 latency; 100 samples.
- Groq / Llama 3.3 70B on IFEval: partial; 70.7%; 907 ms p50 latency; 100 samples.
- Groq / Llama 3.3 70B on IFEval: completed; 100.0%; 1121 ms p50 latency; 10 samples.