We're releasing Darwin-4B-David, the first second-generation model in the Darwin Opus family. By evolving an already-evolved model, it reaches 85.0% on GPQA Diamond, surpassing its ancestor's 58.6% and even gemma-4-31B (84.3%), with just 4.5B parameters.
Second-Generation Evolution

Most merges start from a base model and produce a single offspring. Darwin-4B-David breaks this pattern. The Father (Darwin-4B-Opus) was already evolved from gemma-4-E4B-it with Claude Opus reasoning distillation, making it a Gen-1 model. The Mother (DavidAU's DECKARD-Expresso-Universe) brings Unsloth deep tuning across 5 in-house datasets, with thinking mode enabled by default. Crossbreeding these two produced the first Gen-2 Darwin model.
Darwin V6's Model MRI scanned both parents across all 42 layers, assigning an independent optimal ratio to each layer. The Mother's creativity and Korean-language hotspot (Layers 22-25, weight 0.95) was absorbed almost entirely, while the Father's reasoning core (Layers 30-40, weight 0.48) was preserved. This is "Merge = Evolve" applied recursively: evolution of evolution.
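The per-layer ratio idea can be sketched as a simple interpolation. Everything below (function names, the ratio table, the tiny weight vectors) is an illustrative assumption, not the actual Darwin V6 pipeline:

```python
# Hypothetical sketch of per-layer weighted merging: each layer gets its own
# mother/father mixing ratio instead of one global ratio for the whole model.
def merge_layer(father_w, mother_w, mother_ratio):
    """Linearly interpolate one layer's weights between the two parents."""
    return [(1.0 - mother_ratio) * f + mother_ratio * m
            for f, m in zip(father_w, mother_w)]

# Illustrative per-layer ratios echoing the hotspots quoted above:
layer_ratios = {layer: 0.95 for layer in range(22, 26)}        # creativity hotspot
layer_ratios.update({layer: 0.48 for layer in range(30, 41)})  # reasoning core

# Merging a toy two-weight "layer 22" pulls it almost entirely toward the Mother.
merged = merge_layer([1.0, 2.0], [3.0, 4.0], layer_ratios[22])
```

The real method layers DARE-TIES on top of this (sparsifying and sign-resolving deltas before combining), but the per-layer ratio assignment is the part that makes the merge "MRI-guided".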
Benchmarks

Darwin-4B-David scores 85.0% on GPQA Diamond (+26.4 points over the original 58.6%), evaluated generatively with maj@8 (8 generations per question, majority vote), the Epoch AI prompt format, thinking mode enabled, and 50 sampled questions. On ARC-Challenge (25-shot, loglikelihood), both models score 64.93%, which is expected: loglikelihood scoring doesn't capture thinking-mode reasoning differences.
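The maj@8 protocol is a plain majority vote over sampled generations. A minimal sketch, where `maj_at_k` is a hypothetical helper rather than the actual evaluation harness:

```python
from collections import Counter

def maj_at_k(answers):
    """Majority vote over k sampled generations for one question.

    Ties are broken by first-seen order, a simplifying assumption here.
    """
    return Counter(answers).most_common(1)[0][0]

# Eight generations for one question; the most frequent answer wins.
votes = ["B", "B", "A", "B", "C", "B", "A", "B"]
final = maj_at_k(votes)
```

Voting smooths over sampling noise in thinking-mode generation, which is why a generative maj@8 score can diverge from a single-sample or loglikelihood score on the same model.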
Why This Matters

gemma-4-31B (30.7B) scores 84.3%. Darwin-4B-David surpasses it at roughly 1/7th the size, with no training and no RL: just 45 minutes of MRI-guided DARE-TIES on a single H100. The name "David" honors the Mother's creator, DavidAU, and evokes David vs. Goliath.
World Model Bench: does your world model actually think?
FID measures realism. FVD measures smoothness. But neither tells you whether the model understood the scene.
We just released WM Bench, the first benchmark for cognitive intelligence in world models. The core question: when a beast charges from 3 meters away, does the model know to sprint, not walk? Does it respond differently to a human vs. an animal? Does it remember that the left corridor was blocked two steps ago?
Those are cognitive questions. No existing benchmark asks them. So we built one.
- P1 Perception (25%): Can it read the scene?
- P2 Cognition (45%): Does it predict threats, escalate emotions, utilize memory?
- P3 Embodiment (30%): Does the body respond with the right motion?
All evaluation is via simple JSON I/O: no 3D engine, no special hardware. Any model with an API can participate.
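To make the JSON-in / JSON-out style concrete, here is a minimal sketch. All field names below are assumptions for illustration only, not the actual WM Bench schema:

```python
import json

# Hypothetical observation the harness might send to the model under test.
observation = {
    "entities": [{"type": "beast", "distance_m": 3.0, "state": "charging"}],
    "memory": ["left corridor blocked (t-2)"],
}

# Hypothetical action the model might return.
response = {"action": "sprint", "direction": "away"}

payload = json.dumps(observation)   # serialized request sent over the API
decoded = json.loads(payload)       # what the model's side parses back
```

Because the interface is just serialized text, any model behind an HTTP API can be scored without a simulator or GPU on the evaluator's side.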
We also built PROMETHEUS as a live reference implementation. It runs in your browser on a T4, no install needed, and combines FloodDiffusion motion generation with an LLM cognitive brain (Perceive → Predict → Decide → Act). It scored 726/1000 (Grade B) on Track C, the only directly verified model so far. Submissions from other teams are very welcome.
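The Perceive → Predict → Decide → Act loop can be caricatured in a few lines. Every function body here is a toy stub assumed for illustration; PROMETHEUS itself uses an LLM for the cognitive steps and FloodDiffusion for motion:

```python
def perceive(obs):
    # P1: read the raw observation into structured features.
    return {"close_threat": obs.get("distance_m", 99.0) < 5.0}

def predict(scene):
    # P2: anticipate what happens next.
    return "imminent" if scene["close_threat"] else "safe"

def decide(threat):
    # P2: pick a response appropriate to the prediction.
    return "sprint" if threat == "imminent" else "walk"

def act(action):
    # P3: turn the decision into body motion.
    return {"motion": action}

def step(observation, memory):
    scene = perceive(observation)
    memory.append(scene)            # remember what was seen for later steps
    return act(decide(predict(scene)))

memory = []
out = step({"distance_m": 3.0}, memory)
```

The split mirrors the three benchmark tracks: perception feeds cognition, cognition feeds embodiment, and the memory list is what lets cognition use past observations.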
FINAL Bench Released: The Real Bottleneck to AGI Is Self-Correction
We release FINAL Bench, the first benchmark for measuring functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Every existing benchmark measures final-answer accuracy. None measures whether a model knows it is wrong.
Our 5-axis rubric separates what no prior benchmark could: MA (Metacognitive Accuracy), the ability to say "I might be wrong", and ER (Error Recovery), the ability to actually fix it. This maps directly onto the monitoring-control model of Nelson & Narens (1990) in cognitive psychology.
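A toy calculation shows why MA and ER are separable quantities. The records below are invented, and the real rubric has five axes and a richer scoring scheme; this only illustrates the monitoring/control split:

```python
# Invented per-item records: did the model get it right, did it flag
# uncertainty ("I might be wrong"), and did a revision pass fix the answer?
records = [
    {"correct": False, "flagged": True,  "fixed": True},
    {"correct": False, "flagged": True,  "fixed": False},
    {"correct": False, "flagged": False, "fixed": False},
    {"correct": True,  "flagged": False, "fixed": False},
]

wrong = [r for r in records if not r["correct"]]
# MA (monitoring): of the items it got wrong, how often did it flag doubt?
ma = sum(r["flagged"] for r in wrong) / len(wrong)
# ER (control): of the flagged errors, how often did revision actually fix them?
flagged = [r for r in wrong if r["flagged"]]
er = sum(r["fixed"] for r in flagged) / len(flagged)
```

A model can score high MA and low ER, which is exactly the declarative-procedural gap reported below: it says it might be wrong, then fails to repair the answer.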
Three Findings Across 9 SOTA Models
We evaluated GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, and others across 100 expert-level tasks:
1. ER Dominance. 94.8% of the MetaCog gain comes from Error Recovery alone. The bottleneck to AGI is not knowledge or reasoning; it is self-correction.
2. Declarative-Procedural Gap. All 9 models can verbalize uncertainty (MA = 0.694) but cannot act on it (ER = 0.302). They sound humble but fail to self-correct, which is the most dangerous AI safety profile.
3. Difficulty Effect. Harder tasks benefit dramatically more from metacognition (Pearson r = -0.777, p < 0.001).
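The difficulty effect is an ordinary Pearson correlation between a task's baseline solvability and its metacognitive gain. A self-contained sketch with invented toy numbers (not the paper's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: baseline accuracy falls as tasks get harder, while the gain from
# metacognition rises, so the two series correlate strongly negatively.
baseline_acc = [0.9, 0.7, 0.5, 0.3, 0.1]
metacog_gain = [0.02, 0.05, 0.12, 0.20, 0.30]
r = pearson_r(baseline_acc, metacog_gain)
```

A negative r here means easy tasks gain little from self-correction and hard tasks gain a lot, matching the reported r = -0.777.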
from datasets import load_dataset

# Load the FINAL Bench tasks from the Hugging Face Hub.
dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in LLMs
FINAL Bench is the first tool that distinguishes what an AI truly knows from what it merely pretends to know.