Welcome to the TuRTLe Model Leaderboard! TuRTLe is a unified evaluation framework designed to systematically assess Large Language Models (LLMs) on RTL (Register-Transfer Level) generation for hardware design. Evaluation criteria include syntax correctness, functional accuracy, synthesizability, and post-synthesis quality (PPA: Power, Performance, Area). TuRTLe integrates multiple benchmarks to highlight the strengths and weaknesses of available LLMs. Use the filters below to explore different RTL benchmarks, simulators, and models.
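For context, functional accuracy on code-generation benchmarks of this kind is commonly reported as a pass@k score: the probability that at least one of k sampled completions passes the testbench. The sketch below shows the standard unbiased estimator from Chen et al. (2021) as a point of reference, not necessarily TuRTLe's exact scoring code; the function name and example numbers are illustrative.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total completions sampled for a problem
    c: completions that pass the testbench
    k: budget of attempts being scored
    """
    if n - c < k:
        return 1.0  # every size-k draw must contain a passing completion
    # 1 minus the probability that all k drawn completions fail
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical example: 20 RTL completions sampled, 7 pass simulation
print(pass_at_k(n=20, c=7, k=1))  # 0.35  (expected pass@1)
print(pass_at_k(n=20, c=7, k=5))  # ~0.917 (expected pass@5)
```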
UPDATE (DEC 2025): We introduce a brand-new benchmark for Module Completion, based on Tiny Tapeout, with more than 1,000 real-world designs
UPDATE (NOV 2025): We release a new codebase, TuRTLe v2, with API support and local Docker evaluation. Added Kimi K2 Instruct, DeepSeek V3.1 Terminus, and Google's Gemini 2.5 Flash
UPDATE (OCT 2025): Added Hermes-4-14B, Qwen3-8B, and Seed-OSS-36B to the leaderboard. Implemented an Other Models tab and moved some existing models into it
UPDATE (SEPT 2025): Added gpt-oss-20b and gpt-oss-120b to the leaderboard
UPDATE (JULY 2025): Our TuRTLe paper was accepted at MLCAD 2025, taking place in September in Santa Cruz, CA
UPDATE (JUNE 2025): We open-source our framework on GitHub and add 7 new recent models, for a total of 40 base and instruct models and 5 RTL benchmarks!