Judge Arena: Benchmarking LLMs as Evaluators

Voting has been sunset. The leaderboard below reflects the final community rankings.



Judge Arena uses Together AI for inference of open-source models. FP8 models carry the "Turbo" suffix where their performance closely matches that of the FP16 reference models:

"Together Turbo achieves this performance while maintaining full accuracy compared to Meta's reference implementation across all models. Llama-3.1-405B-Instruct-Turbo matches the accuracy of Meta reference models."