We're releasing Darwin-4B-David, the first second-generation model in the Darwin Opus family. By evolving an already-evolved model, it reaches 85.0% on GPQA Diamond, surpassing its ancestor's 58.6% and even gemma-4-31B (84.3%), with just 4.5B parameters.
Second-Generation Evolution

Most merges start from a base model and produce a single offspring. Darwin-4B-David breaks this pattern. The Father (Darwin-4B-Opus) was already evolved from gemma-4-E4B-it with Claude Opus reasoning distillation, making it a Gen-1 model. The Mother (DavidAU's DECKARD-Expresso-Universe) brings Unsloth deep tuning across 5 in-house datasets, with thinking mode enabled by default. Crossbreeding these two produced the first Gen-2 Darwin model.
Darwin V6's Model MRI scanned both parents across all 42 layers, assigning an independent optimal ratio to each layer. The Mother's creativity and Korean-language hotspot (Layers 22-25, weight 0.95) was absorbed almost entirely, while the Father's reasoning core (Layers 30-40, weight 0.48) was preserved. This is "Merge = Evolve" applied recursively: evolution of evolution.
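The per-layer ratio idea can be sketched as a simple interpolation. Everything below (function names, the ratio table, the tiny weight vectors) is an illustrative assumption, not the actual Darwin V6 pipeline:

```python
# Hypothetical sketch of per-layer weighted merging: each layer gets its own
# mother/father mixing ratio instead of one global ratio for the whole model.
def merge_layer(father_w, mother_w, mother_ratio):
    """Linearly interpolate one layer's weights between the two parents."""
    return [(1.0 - mother_ratio) * f + mother_ratio * m
            for f, m in zip(father_w, mother_w)]

# Illustrative per-layer ratios echoing the hotspots quoted above:
layer_ratios = {layer: 0.95 for layer in range(22, 26)}        # creativity hotspot
layer_ratios.update({layer: 0.48 for layer in range(30, 41)})  # reasoning core

# Merging a toy two-weight "layer 22" pulls it almost entirely toward the Mother.
merged = merge_layer([1.0, 2.0], [3.0, 4.0], layer_ratios[22])
```

The real method layers DARE-TIES on top of this (sparsifying and sign-resolving deltas before combining), but the per-layer ratio assignment is the part that makes the merge "MRI-guided".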
Benchmarks

Darwin-4B-David scores 85.0% on GPQA Diamond (+26.4 points over the original 58.6%), evaluated generatively with maj@8 (8 generations per question, majority vote), the Epoch AI prompt format, thinking mode enabled, and 50 sampled questions. On ARC-Challenge (25-shot, loglikelihood), both models score 64.93%, which is expected: loglikelihood scoring doesn't capture thinking-mode reasoning differences.
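The maj@8 protocol is a plain majority vote over sampled generations. A minimal sketch, where `maj_at_k` is a hypothetical helper rather than the actual evaluation harness:

```python
from collections import Counter

def maj_at_k(answers):
    """Majority vote over k sampled generations for one question.

    Ties are broken by first-seen order, a simplifying assumption here.
    """
    return Counter(answers).most_common(1)[0][0]

# Eight generations for one question; the most frequent answer wins.
votes = ["B", "B", "A", "B", "C", "B", "A", "B"]
final = maj_at_k(votes)
```

Voting smooths over sampling noise in thinking-mode generation, which is why a generative maj@8 score can diverge from a single-sample or loglikelihood score on the same model.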
Why This Matters

gemma-4-31B (30.7B) scores 84.3%. Darwin-4B-David surpasses it at roughly 1/7th the size, with no training and no RL: just 45 minutes of MRI-guided DARE-TIES on a single H100. The name "David" honors the Mother's creator, DavidAU, and evokes David vs. Goliath.
World Model Bench: does your world model actually think?
FID measures realism. FVD measures smoothness. But neither tells you whether the model understood the scene.
We just released WM Bench, the first benchmark for cognitive intelligence in world models. The core question: when a beast charges from 3 meters away, does the model know to sprint, not walk? Does it respond differently to a human vs. an animal? Does it remember that the left corridor was blocked two steps ago?
Those are cognitive questions. No existing benchmark asks them. So we built one.
- P1 Perception (25%): Can it read the scene?
- P2 Cognition (45%): Does it predict threats, escalate emotions, utilize memory?
- P3 Embodiment (30%): Does the body respond with the right motion?
All evaluation is via simple JSON I/O: no 3D engine, no special hardware. Any model with an API can participate.
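To make the JSON-in / JSON-out style concrete, here is a minimal sketch. All field names below are assumptions for illustration only, not the actual WM Bench schema:

```python
import json

# Hypothetical observation the harness might send to the model under test.
observation = {
    "entities": [{"type": "beast", "distance_m": 3.0, "state": "charging"}],
    "memory": ["left corridor blocked (t-2)"],
}

# Hypothetical action the model might return.
response = {"action": "sprint", "direction": "away"}

payload = json.dumps(observation)   # serialized request sent over the API
decoded = json.loads(payload)       # what the model's side parses back
```

Because the interface is just serialized text, any model behind an HTTP API can be scored without a simulator or GPU on the evaluator's side.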
We also built PROMETHEUS as a live reference implementation. It runs in your browser on a T4, no install needed, and combines FloodDiffusion motion generation with an LLM cognitive brain (Perceive → Predict → Decide → Act). It scored 726/1000 (Grade B) on Track C, the only directly verified model so far. Submissions from other teams are very welcome.
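The Perceive → Predict → Decide → Act loop can be caricatured in a few lines. Every function body here is a toy stub assumed for illustration; PROMETHEUS itself uses an LLM for the cognitive steps and FloodDiffusion for motion:

```python
def perceive(obs):
    # P1: read the raw observation into structured features.
    return {"close_threat": obs.get("distance_m", 99.0) < 5.0}

def predict(scene):
    # P2: anticipate what happens next.
    return "imminent" if scene["close_threat"] else "safe"

def decide(threat):
    # P2: pick a response appropriate to the prediction.
    return "sprint" if threat == "imminent" else "walk"

def act(action):
    # P3: turn the decision into body motion.
    return {"motion": action}

def step(observation, memory):
    scene = perceive(observation)
    memory.append(scene)            # remember what was seen for later steps
    return act(decide(predict(scene)))

memory = []
out = step({"distance_m": 3.0}, memory)
```

The split mirrors the three benchmark tracks: perception feeds cognition, cognition feeds embodiment, and the memory list is what lets cognition use past observations.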
FINAL Bench Released: The Real Bottleneck to AGI Is Self-Correction
We release FINAL Bench, the first benchmark for measuring functional metacognition in LLMs: the ability to detect and correct one's own reasoning errors. Every existing benchmark measures final-answer accuracy. None measures whether a model knows it is wrong.
Our 5-axis rubric separates what no prior benchmark could: MA (Metacognitive Accuracy), the ability to say "I might be wrong", and ER (Error Recovery), the ability to actually fix it. This maps directly onto the monitoring-control model of Nelson & Narens (1990) in cognitive psychology.
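A toy calculation shows why MA and ER are separable quantities. The records below are invented, and the real rubric has five axes and a richer scoring scheme; this only illustrates the monitoring/control split:

```python
# Invented per-item records: did the model get it right, did it flag
# uncertainty ("I might be wrong"), and did a revision pass fix the answer?
records = [
    {"correct": False, "flagged": True,  "fixed": True},
    {"correct": False, "flagged": True,  "fixed": False},
    {"correct": False, "flagged": False, "fixed": False},
    {"correct": True,  "flagged": False, "fixed": False},
]

wrong = [r for r in records if not r["correct"]]
# MA (monitoring): of the items it got wrong, how often did it flag doubt?
ma = sum(r["flagged"] for r in wrong) / len(wrong)
# ER (control): of the flagged errors, how often did revision actually fix them?
flagged = [r for r in wrong if r["flagged"]]
er = sum(r["fixed"] for r in flagged) / len(flagged)
```

A model can score high MA and low ER, which is exactly the declarative-procedural gap reported below: it says it might be wrong, then fails to repair the answer.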
Three Findings Across 9 SOTA Models
We evaluated GPT-5.2, Claude Opus 4.6, Gemini 3 Pro, DeepSeek-V3.2, Kimi K2.5, and others across 100 expert-level tasks:
1. ER Dominance. 94.8% of the MetaCog gain comes from Error Recovery alone. The bottleneck to AGI is not knowledge or reasoning; it is self-correction.
2. Declarative-Procedural Gap. All 9 models can verbalize uncertainty (MA = 0.694) but cannot act on it (ER = 0.302). They sound humble but fail to self-correct, which is the most dangerous AI safety profile.
3. Difficulty Effect. Harder tasks benefit dramatically more from metacognition (Pearson r = -0.777, p < 0.001).
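The difficulty effect is an ordinary Pearson correlation between a task's baseline solvability and its metacognitive gain. A self-contained sketch with invented toy numbers (not the paper's data):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy data: baseline accuracy falls as tasks get harder, while the gain from
# metacognition rises, so the two series correlate strongly negatively.
baseline_acc = [0.9, 0.7, 0.5, 0.3, 0.1]
metacog_gain = [0.02, 0.05, 0.12, 0.20, 0.30]
r = pearson_r(baseline_acc, metacog_gain)
```

A negative r here means easy tasks gain little from self-correction and hard tasks gain a lot, matching the reported r = -0.777.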
from datasets import load_dataset

# Load the FINAL Bench tasks from the Hugging Face Hub.
dataset = load_dataset("FINAL-Bench/Metacognitive", split="train")
Paper: FINAL Bench: Measuring Functional Metacognitive Reasoning in LLMs
FINAL Bench is the first tool that distinguishes what an AI truly knows from what it merely pretends to know.