Judge Arena: Benchmarking LLMs as Evaluators
Voting has been sunset; the leaderboard below reflects the final community rankings.
Judge Arena uses Together AI for inference of open-source models. FP8 models carry the "Turbo" suffix where their performance closely matches that of the FP16 reference models: