⏱️ Built a small Space for Visual Chronometer / Pulse of Motion.
Upload a video and estimate its Physical FPS: the frame rate implied by visual motion, independent of metadata. Useful to inspect “chronometric hallucination” in generated videos: clips that look smooth, but move with the wrong physical time scale.
Most leaderboards measure accuracy. We measure metacognition — whether a model catches its own errors. Benchmark + leaderboard + adapters, all open. 🎉
The surprise: even a K-AI #1 model (JGOS-31B-Citizen) is the strongest on multiple-choice traps (trap_rate 0.005 — ~2 misses in 400) yet blind to its own free-form mistakes (self-confidence AUROC = 0.5, pure random). A tiny base-frozen adapter recovers that signal.
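For readers unfamiliar with the metric, here is a minimal sketch of how an AUROC for error detection can be computed, assuming label 1 means "the answer was wrong" and the score is a predicted P(wrong). The function name and the toy data are illustrative assumptions, not the benchmark's actual scoring code.

```python
def auroc(scores, labels):
    """Chance that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (made up): 1 = the model's answer was wrong.
is_wrong = [1, 0, 1, 0, 0, 1]
uninformative = [0.5] * 6                      # same confidence on every item
informative = [0.9, 0.2, 0.8, 0.1, 0.3, 0.7]   # higher score on every wrong item

print(auroc(uninformative, is_wrong))  # 0.5
print(auroc(informative, is_wrong))    # 1.0
```

An AUROC of 0.5 means the score ranks wrong answers no better than a coin flip, which is the "pure random" case described above.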
Two independent axes (never compared across a row): ① trap_rate — does it fall for tempting trap options? (lower = stronger) ② adapter gain Δ — how much a lightweight adapter catches errors the model itself misses. (higher = more adapter value)
What's open: 📊 300+100 trap problems (each with a hidden trap + TICOS type) 🏆 24-model leaderboard 🧩 11 per-model adapters — adapters, NOT fine-tunes (base stays frozen; the adapter just reads the hidden state → P(wrong))
Submit any HF model → auto-scored daily at 09:00 KST and added to the board.
- Arctic RL integrates with VeRL and SkyRL today; enable ZoRRo with one config flag, no code changes required
- ZoRRo delivers up to 6x actor-update acceleration and a 3.5x end-to-end training speedup, reducing Arctic-Text2SQL-R2 training from ~5 days to ~36 hours on 32 H200 GPUs
- Arctic-Text2SQL-R2 achieved higher accuracy scores (48.7) than Gemini 3.1 Pro (47.9) and Claude 4.7 (47.3) on Snowflake's evaluated enterprise SQL benchmark under the tested conditions
- Two open source recipes ship with this release: a text-to-SQL recipe that improved BIRD dev accuracy from 59.92% to 70.35%, and a multi-hop QA recipe that improved average accuracy from 69.6% to 72.3%
🍳 The RoboCasa Kitchen Leaderboard What does it take for a robot to handle kitchen chores the way a person does? It has to see (Vision), understand instructions (Language), and actually act (Action) — and VLA (Vision-Language-Action) models are emerging as the answer. They're the bridge between large multimodal models and real-world embodied control.
RoboCasa Kitchen is a leading robot-learning benchmark in which a single-arm robot (Franka Panda) performs 24 atomic manipulation tasks — picking up cups and bowls, opening drawers and doors, turning faucets, pressing buttons, and more — inside a photorealistic simulated kitchen. Because the layout and object placement are randomized every episode, it tests genuine generalization rather than memorized motions. The score (success rate, SR) is the average fraction of the 24 tasks completed as instructed, measured over multiple seeds so results aren't down to luck.
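As a rough illustration of the score described above, the sketch below averages per-episode success within each seed, then across seeds, then across tasks. The nested-dict layout, the task names, and the exact averaging order are assumptions; individual papers may aggregate differently.

```python
from statistics import mean


def success_rate(results):
    """results: {task_name: {seed: [True/False per episode]}} (hypothetical layout)."""
    per_task = []
    for by_seed in results.values():
        per_seed = [mean(1.0 if ok else 0.0 for ok in episodes)
                    for episodes in by_seed.values()]
        per_task.append(mean(per_seed))
    return mean(per_task)


# Toy example with 2 tasks and 2 seeds; the benchmark itself uses 24 tasks.
toy_results = {
    "open_drawer": {0: [True, True], 1: [True, False]},
    "press_button": {0: [False, False], 1: [True, False]},
}
print(success_rate(toy_results))  # 0.5
```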
The catch: this benchmark has no official leaderboard, and protocols (number of demonstrations, evaluation setup) differ from paper to paper, leaving scores scattered. Lining the numbers up naively quickly turns into an apples-to-oranges comparison.
This leaderboard fixes that by collecting published scores with their sources and comparing only what is genuinely comparable. It's split into three tables:
🏆 Kitchen 24-task (matched): head-to-head under identical conditions (per the RLDX-1 Technical Report). This is the core ranking you can actually trust.
➕ Other protocols: self-reported under different setups (e.g. fewer demos). Not directly comparable, so kept separate.
🤖 GR1-Tabletop: a different, humanoid-based variant suite, separated to avoid confusion.
Any researcher can submit their own model's score directly, and submissions are reviewed before they appear on the board. Every number links to its source paper, so you can verify it yourself.
I placed 🥈 2nd in the LeHome Challenge (ICRA 2026), and 🥇 1st of 62 teams in the first simulation round. Now I'm open-sourcing the full solution — code, tech report, and final weights.
The task: teach a cheap two-armed robot (SO-ARM101) to fold 4 garment types — long/short tops and pants. Garment category is hidden at eval. Round 1 in sim (auto-scored), round 2 on a real robot (jury-scored).
I trained a VLA policy with an RL loop on top. The key ideas:
🧠 The policy is its own value function. From the same forward pass that picks the next action chunk, cheap heads predict success probability, task completion %, garment type, and future keypoint distances + a Q-residual. Those become the advantage signal for RL — no separate critic.
🔁 A fully asynchronous RL loop coordinated only through the HF Hub: 1 trainer (H200) ships a fresh checkpoint ~every 40 min while N rollout workers (and a human doing teleop / DAgger corrections) collect data in parallel. Nobody waits — it uses the off-policy nature of the loop to the fullest.
📈 Binary success is too sparse, so I densify it into per-frame advantage via GAE — from objective keypoint checkpoints, the success-probability value baseline, and completion %.
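A minimal sketch of the textbook GAE recursion, to show how a sparse success signal plus a value baseline turns into a per-frame advantage. The reward shaping here (a small bonus at a keypoint checkpoint, 1.0 on final success), the value numbers, and the hyperparameters are illustrative assumptions, not the values used in the actual solution.

```python
def gae_advantages(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one episode.

    rewards: per-frame rewards, length T.
    values:  value estimates, length T + 1 (last entry is the bootstrap value
             after the final frame, 0.0 for a terminated episode).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error, then an exponentially weighted sum of future errors.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Hypothetical episode: checkpoint bonus at frame 1, success at the last frame.
rewards = [0.0, 0.2, 0.0, 1.0]
values = [0.3, 0.4, 0.6, 0.8, 0.0]  # e.g. predicted success probability per frame
print(gae_advantages(rewards, values))
```

With `lam=0` this reduces to the one-step TD error; with `gamma=lam=1` it becomes the return-to-go minus the baseline.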
🎛️ The RL combines AWR + RECAP. I also tune the inference knobs — execution length, playback speed, inpainting overlap, CFG scale, best-of-N — with a per-parameter Thompson-sampling bandit folded into rollout collection.
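A minimal sketch of what a per-parameter Thompson-sampling bandit could look like, assuming each knob has a small discrete set of options and each rollout yields a binary success. The class name, the option values, and the simulated rollout are hypothetical; the real setup may model rewards differently.

```python
import random


class KnobBandit:
    """Beta-Bernoulli Thompson sampling over the discrete options of one knob."""

    def __init__(self, options, rng=None):
        self.options = list(options)
        self.rng = rng or random.Random()
        self.successes = {o: 1.0 for o in self.options}  # Beta(1, 1) prior
        self.failures = {o: 1.0 for o in self.options}

    def sample(self):
        # Draw a plausible success rate per option and play the best draw.
        draws = {o: self.rng.betavariate(self.successes[o], self.failures[o])
                 for o in self.options}
        return max(draws, key=draws.get)

    def update(self, option, success):
        if success:
            self.successes[option] += 1.0
        else:
            self.failures[option] += 1.0

    def posterior_mean(self, option):
        s, f = self.successes[option], self.failures[option]
        return s / (s + f)


def simulated_rollout(config, rng):
    # Hypothetical stand-in for a real rollout: made-up success probabilities.
    p = 0.3
    if config["execution_length"] == 16:
        p += 0.3
    if config["cfg_scale"] == 1.5:
        p += 0.2
    return rng.random() < p


rng = random.Random(0)
bandits = {
    "execution_length": KnobBandit([8, 16, 32], rng),
    "cfg_scale": KnobBandit([1.0, 1.5, 2.0], rng),
}
for _ in range(500):
    config = {name: bandit.sample() for name, bandit in bandits.items()}
    success = simulated_rollout(config, rng)
    for name, bandit in bandits.items():
        bandit.update(config[name], success)  # every knob learns from the same outcome

for name, bandit in bandits.items():
    print(name, {o: round(bandit.posterior_mean(o), 2) for o in bandit.options})
```

Because each knob keeps its own posterior, exploration happens inside normal rollout collection rather than in a separate sweep.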
🔧 Round 2: I had only ~1 week and no access to the eval robot, so the pipeline was sim → my robot → their robot, leaning on heavy augmentation to make the policy more robust.
🇮🇳 New in my Hindi LLM Series: Gemma-4 E4B, fine-tuned for Hindi — and it runs on your laptop's CPU. I fine-tuned Google's new Gemma-4 E4B on ~10k Hindi instruction pairs (AI4Bharat: anudesh + dolly) using Unsloth + LoRA, on a single L4 GPU. Then I ran an honest side-by-side eval: base Gemma-4 vs my fine-tune, across 25 Hindi prompts. The results were interesting 👇 ✅ My fine-tune is more concise — ask for "3 tips" and it gives exactly 3. Base writes a 1,200-character essay.
✅ Pure native Hindi — base keeps slipping into English ("संतुलित आहार (Eat a Balanced Diet)", "तारा (Star)"). My fine-tune stays in clean Hindi.
✅ Tighter instruction-following — ask for a "short message" and it gives one, not a menu of options. ⚖️ And to be honest: base Gemma-4 is more detailed and comprehensive. I didn't build a "smarter" model — I built a focused, Hindi-native, edge-friendly one that runs as a 5GB GGUF (Q4) on CPU. 🔗 Try it:
We’re excited to release NRS_QWEN_MYTHOS_1M — a powerful reasoning model built on Qwen 3.5 9B! At SKT AI LABS, we’ve supercharged this 9B model with our proprietary Neural Reasoning System (NRS) to deliver next-level performance.
🔥 Why This Model is a Game-Changer: ✅ 100x Reasoning Capacity — Exceptional deep logical thinking and complex problem-solving ✅ 1 Million Token Context — Perfect for massive codebases, long documents, and multi-turn agentic workflows ✅ Advanced Thinking Mode — Native <think> tags for true step-by-step Chain-of-Thought reasoning ✅ Tool-Use Ready — Optimized for Python execution, Web Search, and self-correction ✅ Blazing Fast — Runs smoothly on consumer GPUs like RTX 3090/4090
Whether you’re a developer building coding agents, a researcher working with long-context data, or someone who loves powerful reasoning — this model is built for you.
We're moving beyond model capabilities and toward the infrastructure needed for agents to work together.
Over the past few weeks we've seen meaningful momentum around the foundational building blocks of the emerging agentic web.
Agent Name Service (ANS) is addressing identity and trust. Agentic Resource Discovery (ARD) is helping standardize how agents discover resources and capabilities.
Together, these efforts represent something bigger than individual projects.
They point toward an ecosystem built on open, interoperable infrastructure rather than isolated implementations.
As builders, we'll likely spend the next few years solving challenges around identity, discovery, trust, interoperability, and governance—not just model performance.
It will be interesting to see how these efforts evolve—and where the community chooses to collaborate next.
--- 🚀 Gemma-4-A4B 98e v7-coder cohort — loop-fixed re-release. Two 20.8B MoE coders (4B-active), fresh-map prunes of Gemma 4 26B-A4B, 30/128 experts dropped per layer. The headline isn't a benchmark: the agentic loop is gone at the weights, not papered over by the sampler.
🔧 How: at prune time we force-keep the 46 agentic_eog experts a loop-protection signal flags as load-bearing for clean multi-turn termination (+ shared-FFN α=1.2). Result: 0 loops across 48 seeds on every published tier.
🎯 Both land near GPQA ~51 — graduate science is the budget axis, neither is a science model. Pick v7-coder for the broad LCB-medium + HumanEval lead; v7-coderx for the all-hard slice and HE+.
🧪 The harness we used to prove the fix is now an omk tool: agentic-loop-harness replays a frozen agentic conversation across a sampler×seed matrix and reports a fail-rate per chat-template, so you can isolate a loop to one variable. Model-agnostic — any OpenAI-compatible server. The version we shared with Google: google/gemma-4-12B-it#41
hey, I'm doing some experimenting, looping around 🙂
---
**kompress-v6** *shipped*: trained on Claude Code agent patterns (bash output, file reads, stack traces, search results, JSON tool responses). 3k synthetic pairs + 2k existing, fine-tuned from v4, $0.20 on vast.ai.
Results: heretic exact_pct 0.962 (v4: 0.967), keep_rate 0.854 (v4: 0.823), override delta 0. Model got more conservative — higher keep_rate on structured technical content. Real proxy: v4 compressed 9.5%, v6 compressed 4.2% on the same session. Less aggressive, fewer must-keep tokens dropped on paths and identifiers.
Interesting failure: self-labeling with v4+override collapsed mk_in_ref to 0.652. TokenExpiredError splits into Token+Expired+Error — subtokens that don't individually match the must-keep regex, so the force-keep never fires. Generator references (mk_in_ref=1.0 by construction) ended up being better labels than v4's compressed output for agent data. Fix for next run: slide a 2-3 subtoken window instead of checking individual subtokens. Would let self-labeling work on agent content and potentially produce a more compression-aggressive v7.
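A minimal sketch of the proposed fix, assuming subtokens concatenate back to the original text and the must-keep rule is a regex over identifiers. The CamelCase pattern and the function name are hypothetical stand-ins for the real must-keep regex.

```python
import re

# Hypothetical must-keep pattern: CamelCase identifiers with at least two humps.
MUST_KEEP = re.compile(r"[A-Z][a-z]+(?:[A-Z][a-z]+)+")


def must_keep_mask(subtokens, max_window=3):
    """Mark subtokens covered by any window (1..max_window wide) matching MUST_KEEP."""
    keep = [False] * len(subtokens)
    for width in range(1, max_window + 1):
        for start in range(len(subtokens) - width + 1):
            joined = "".join(subtokens[start:start + width])
            if MUST_KEEP.fullmatch(joined):
                for i in range(start, start + width):
                    keep[i] = True
    return keep


subtokens = ["raise", " ", "Token", "Expired", "Error", " ", "now"]
print(must_keep_mask(subtokens, max_window=1))
# [False, False, False, False, False, False, False]
print(must_keep_mask(subtokens, max_window=3))
# [False, False, True, True, True, False, False]
```

With `max_window=1` no single piece of `TokenExpiredError` matches, so the force-keep never fires; widening the window to 3 recovers the whole identifier.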
🚀 Introducing PerceptionDLM — the first multimodal diffusion LLM for parallel region perception!
Most MLLMs are autoregressive, so captioning N regions costs N sequential passes. PerceptionDLM instead describes ALL masked regions in a single denoising process. 🧩
✨ Highlights • ⚡ Up to 3.4× faster on dense multi-region captioning, with stable per-image latency • 🏆 PerceptionDLM-Base beats LLaDA-V on 15/16 multimodal benchmarks (new SOTA among open diffusion VLMs) • 📊 New benchmark: ParaDLC-Bench — jointly evaluates caption quality AND inference efficiency • 🔓 Code, models & benchmark all open-sourced
Over the past few days, SupraLabs has been mentioned in a public discussion regarding small language models, scaling laws, and training methodology. We'd like to clarify our position.
Before anything else, we want to make one thing absolutely clear: we have great respect for Lane and the work being done at Glint Research. At no point was our intention to disrespect Lane, Glint Research, or their research. What began as a technical discussion about model scaling and training methodology unfortunately became much more personal than we ever intended. From our perspective, it was simply an exchange of technical opinions, and we sincerely hope it remains that way. We'd also like to acknowledge that one of our own comments during the discussion was poorly worded. Referring to a benchmark as "fake" was imprecise. What we intended to criticize was the comparison methodology, not the integrity of the evaluation itself. Comparing a merged checkpoint against a single checkpoint is, in our view, not an apples-to-apples comparison.
That said, this was never the core of the discussion.
Our disagreement was not about SLERP, model merging, or whether training a small model on massive amounts of data is an interesting research direction. We support experimentation and unconventional ideas.
The actual point of disagreement was much simpler.
The statement that a 1M parameter model trained on 1 trillion tokens will become a "100M killer" is, today, a prediction, not an experimental result. Could it happen? Perhaps. Would it be exciting if it did? Absolutely.
But until benchmark results, reproducible evaluations, and independent validation exist, we believe such statements should be presented as hypotheses rather than established conclusions. Research advances by testing ideas, not by assuming their outcomes.
We sincerely wish Lane and everyone at Glint Research success in their experiments.
Excited to share our paper: Think Again or Think Longer? Selective Verification for Budget-Aware Reasoning
A common assumption in test-time reasoning is that giving a model more chances to think or verify should improve performance. Our results show that this is only partly true.
We introduce SEVRA, a serving-layer controller that decides when a frozen reasoning model should keep its initial answer and when it should actively verify it. Instead of treating verification as always useful, SEVRA asks a more deployment-focused question:
Is this specific attempt likely recoverable by verification?
We evaluate this through helpful fixes, harmful flips, extra calls, and realized token cost.
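To make these four quantities concrete, here is a small bookkeeping sketch. It assumes a helpful fix is a verified attempt that goes from wrong to right and a harmful flip is one that goes from right to wrong; the `Attempt` record and its field names are hypothetical, not SEVRA's actual logging schema.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    initial_correct: bool   # was the first answer right?
    verified: bool          # did the controller trigger verification?
    final_correct: bool     # was the answer right after any verification?
    extra_tokens: int = 0   # tokens spent on verification (0 if skipped)


def summarize(attempts):
    n = len(attempts)
    return {
        "helpful_fixes": sum(a.verified and not a.initial_correct and a.final_correct
                             for a in attempts),
        "harmful_flips": sum(a.verified and a.initial_correct and not a.final_correct
                             for a in attempts),
        "extra_calls": sum(a.verified for a in attempts),
        "extra_tokens": sum(a.extra_tokens for a in attempts),
        "initial_accuracy": sum(a.initial_correct for a in attempts) / n,
        "final_accuracy": sum(a.final_correct for a in attempts) / n,
    }


# Toy log (made up): one fix, one flip, two attempts kept without verification.
toy_attempts = [
    Attempt(initial_correct=False, verified=True, final_correct=True, extra_tokens=120),
    Attempt(initial_correct=True, verified=True, final_correct=False, extra_tokens=90),
    Attempt(initial_correct=True, verified=False, final_correct=True),
    Attempt(initial_correct=False, verified=False, final_correct=False),
]
summary = summarize(toy_attempts)
print(summary)
```

In this toy log the one fix and the one flip cancel out, so accuracy stays at 0.5 while two extra calls and 210 extra tokens are spent, which is the kind of trade-off these metrics are meant to expose.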
Some key takeaways:
* Selective verification improves over always verifying on MATH500 while reducing harmful flips.
* On GSM8K, the controller verifies only a small fraction of examples but still improves accuracy.
* However, a longer initial solve can sometimes match selective verification with fewer realized tokens.
* Cheap serving-visible features, such as completion status, token count, and finalizer use, nearly match larger learned gates.
* On CommonsenseQA, always-on verification hurts, showing that the best test-time compute action is workload-dependent.
The main deployment lesson is simple:
Tune the initial reasoning budget first. Then use selective recovery when explicit checks, bounded retries, auditability, or regression-risk control matter.
Would love feedback from the community, especially on broader test-time compute allocation, risk-aware verification, and practical serving policies for reasoning models.
Excited to open-source the VisDrone Aerial Object Detection Model Zoo on Hugging Face.
The collection includes multiple YOLO variants trained and evaluated on the VisDrone benchmark for aerial object detection, with accompanying documentation and performance metrics.
If you're working on drones, aerial surveillance, robotics, or small-object detection, I hope these models save you some time.
SRT Showcase: Watch a Frozen Model Think, Token by Token
A frozen Qwen-2.5-7B now narrates its own interpretation in real time. SRT Showcase is the most complete public demonstration of computational semiotics to date, running the backbone with the SRT Adapter and Activation Verbalizer. As the model generates, every token is tinted by its predictive effort, and at the highest-effort positions the Verbalizer decodes the hidden state directly into natural language. You see what the model is representing at the exact moment its computation is most active.
Every verbalization is validated, not asserted. Each decoded thought is re-encoded and compared back to the original hidden state, and the reconstruction closely approximates it. The "this is what the model was thinking" claim carries its own fidelity badge. This is grounded introspection, not plausible narration.
The Showcase goes further than the trace. An A/B panel runs the same prompt with SRT injection on and off under an identical seed, so the side-channel's effect is directly observable. A curated gallery walks through confident recall, false premises, misconceptions, reasoning pivots, genuine uncertainty, and safety boundaries. Live entropy and divergence meters track the crystallization process token by token, with per-layer traces and reflexivity estimates on hover.
None of the backbone weights are touched. The entire mechanism is a lightweight reflexive layer over a frozen model, which is why the same read-out heads already port from Qwen-2.5-7B up to a 235B Mixture of Experts. Frozen models can now be verbalized in real time. No retraining. No fine-tuning. No black box.
First request is a brief cold start while ZeroGPU acquires a GPU. Bring your own prompt.
I just released Inflect-Nano-v1, an ultra-small 4.63M-parameter text-to-speech model.
The main idea is simple: instead of only making the acoustic model tiny and relying on a larger external vocoder, Inflect-Nano-v1 keeps the complete text-to-waveform stack under 5M parameters.
Quick facts:
- 4.63M total inference parameters
- 3.46M acoustic model
- 1.17M vocoder
- 24 kHz audio
- English-only
- Single male voice
- Runs locally with a simple PyTorch inference script
Why I made it: Most modern TTS models are much larger, and even many “small TTS” projects depend on a separate vocoder. I wanted to see how far a complete tiny TTS stack could be pushed while still producing usable speech.
It is not SOTA, and I am not trying to claim it competes with large TTS systems. The interesting part is the size-to-functionality ratio.
What works: It can generate arbitrary English speech locally, and the model is small enough to be interesting for:
- local voice assistants - embedded/edge experiments - browser or WASM-style TTS exploration - efficient inference research - tiny-model baselines
Limitations: The quality is still limited. It can sound robotic, stumble on difficult unseen text, and the vocoder is still a clear bottleneck. Long or unusual prompts are less reliable.
So I would frame this as a research/demo release, not a production TTS engine.
I’d love feedback from people interested in: - tiny speech models - vocoders - local TTS - efficient inference - embedded speech synthesis - improving small-model generalization
If people find it useful, I’m interested in putting more training budget into a stronger v2.
The article on aleph attention routing needs more work on the vision side: the vision portion has not been fully validated, while the LM prototype has been semi-validated at small and medium-small scale. In the coming days I will post my findings on the consequences of training an LM and a ViT with the prototype system.
The current structure for the Geometric Vocabulary does nearly reflect the intended shape as discussed in the earlier posts and articles, so that's coming along nicely - but there are stipulations and problems involved that I did not foresee.
My apologies for the incomplete article I just released on a whim. I jumped to the conclusion a bit early in anticipation before the formulas were fully converged. I also released an early post the other day speaking about the prototype AlephLM - which I removed as an invalid conclusion.
I'm doing my best to release only validated empirical information rather than speculation. Still, I do sometimes jump to conclusions without proper validation. When I get theory-overzealous, the work needs tidying up through thorough experimentation, which is what I'm doing now.
Published a small source-backed dataset for reviewing AI-assisted code and AI-written English without turning it into an accusation game.
Dataset: yava-code/ai-authorship-signals-2026
The dataset has 10 review signals across two domains:
- code: comment-to-code ratio, dependency hallucination, security misses, edge cases
- writing: overused AI vocabulary, low section variation, detector bias against non-native English
Each row includes: signal, why it matters, risk level, review action, source ids.
The main idea: do not ask "was this made by AI?" first. Ask what needs review, what evidence exists, and what failure mode would hurt production.
I also grouped the related work here: yava-code/applied-small-ai-portfolio-6a304c83f9f1d089a28c101b
A tiny ReAct-style agent where the trace is the interface: click a thought, retry a branch, label weak/useful nodes, and export preference pairs for DPO/RL-style training.
Space: build-small-hackathon/glass-box-agent Demo: included in the Space at assets/glass-box-agent-demo.mp4 Track: An Adventure in Thousand Token Wood
From Plain English to DuckDB SQL: Building LFEDS 🏫
I just shipped Local First Education Data Stack (LFEDS), a plain-English-to-SQL assistant for school district analytics, for the HF Build Small Hackathon.
The problem: school staff have useful data (attendance, grades, enrollment, discipline) but no fast, private way to ask questions. Most AI tools send that data to a cloud API. LFEDS doesn't.
What it does: → Type a question like "What's the average GPA for chronically absent students in 2023-2024?" → A fine-tuned Qwen2.5-Coder-14B model generates DuckDB SQL → A validation layer rejects anything that isn't a SELECT → Results come back as a summary, table, CSV download, and the SQL itself
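The validation step could look something like the sketch below: a conservative, standard-library-only check that accepts a single SELECT statement and rejects everything else. This is an assumption about the general approach, not the actual LFEDS validator; the function name and keyword list are hypothetical, and a real deployment would likely pair it with a proper SQL parser and a read-only database connection.

```python
import re

# Hypothetical deny-list of statement keywords that should never appear.
FORBIDDEN = re.compile(
    r"\b(insert|update|delete|drop|alter|create|attach|detach|copy|pragma|install|load|export)\b",
    re.IGNORECASE,
)


def is_read_only_select(sql: str) -> bool:
    """Heuristic check: exactly one statement, starting with SELECT, no write keywords."""
    text = re.sub(r"--[^\n]*", " ", sql)                      # strip line comments
    text = re.sub(r"/\*.*?\*/", " ", text, flags=re.DOTALL)   # strip block comments
    text = re.sub(r"'(?:[^']|'')*'", "''", text)              # blank out string literals
    text = text.strip().rstrip(";").strip()
    if not text or ";" in text:                               # empty or multi-statement
        return False
    if not re.match(r"select\b", text, flags=re.IGNORECASE):
        return False
    return FORBIDDEN.search(text) is None


print(is_read_only_select(
    "SELECT AVG(gpa) FROM students WHERE chronically_absent AND school_year = '2023-2024';"
))  # True
print(is_read_only_select("SELECT 1; DROP TABLE students"))  # False
```

The check is deliberately strict: it also rejects harmless queries such as CTEs that start with `WITH`, trading some convenience for a smaller attack surface.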
Two flavors: - Live Space demo: transformers + PEFT on HF ZeroGPU - Local-first: llama.cpp + GGUF Q4_K_M on your own machine — no data leaves
The fine-tune: - 27,859 synthetic NL→SQL pairs - Unsloth QLoRA r=32 on Qwen2.5-Coder-14B - Trained on Modal A10G
The hardest lessons were not about model training:
1. Scope the model's job tightly: schema + few-shots + SELECT only.
2. Validate before executing. Always.
3. ZeroGPU is PyTorch-only; llama.cpp won't work there.
4. Gradio's scoped Svelte CSS beats generic selectors, so inspect the live DOM.
5. modal deploy + fn.spawn() is fire-and-forget; modal run dies if your terminal drops.
6. Data artifacts matter as much as the model: Parquet seeds, dataset card, model card.
I also published the training dataset: 25,886 question→SQL pairs on the Hub.