Welcome to the TuRTLe Model Leaderboard! TuRTLe is a unified evaluation framework designed to systematically assess Large Language Models (LLMs) on RTL (Register-Transfer Level) generation for hardware design. Evaluation criteria include syntax correctness, functional accuracy, synthesizability, and post-synthesis quality (PPA: Power, Performance, Area). TuRTLe integrates multiple benchmarks to highlight the strengths and weaknesses of available LLMs. Use the filters below to explore different RTL benchmarks, simulators, and models.
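For context, functional accuracy on code-generation benchmarks of this kind is commonly reported as a pass@k score: the probability that at least one of k sampled completions passes the testbench. The sketch below shows the standard unbiased estimator from Chen et al. (2021) as a point of reference, not necessarily TuRTLe's exact scoring code; the function name and example numbers are illustrative.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that pass the testbench
    k: budget of attempts being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing completion
    # 1 minus the probability that all k drawn completions fail
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical example: 20 RTL completions sampled, 7 pass simulation
print(pass_at_k(n=20, c=7, k=1))  # 0.35  (expected pass@1)
print(pass_at_k(n=20, c=7, k=5))  # ~0.917 (expected pass@5)
```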
UPDATE (DEC 2025): We introduce a brand-new benchmark for Module Completion, based on Tiny Tapeout, with more than 1,000 real-world designs
UPDATE (NOV 2025): We release a new codebase, TuRTLe v2, with API support and local Docker evaluation. Added Kimi K2 Instruct, DeepSeek V3.1 Terminus, and Google's Gemini 2.5 Flash
UPDATE (OCT 2025): Added Hermes-4-14B, Qwen3-8B, and Seed-OSS-36B to the leaderboard. Implemented an Other Models tab and moved some existing models into it
UPDATE (SEPT 2025): Added gpt-oss-20b and gpt-oss-120b to the leaderboard
UPDATE (JULY 2025): Our TuRTLe paper was accepted at MLCAD 2025, taking place in September in Santa Cruz, CA
UPDATE (JUNE 2025): We open-source our framework on GitHub and add 7 new recent models, for a total of 40 base and instruct models and 5 RTL benchmarks!