arxiv:2512.24551

PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Published on Dec 31, 2025 · Submitted by Yuanhao Cai on Jan 1

Abstract

Recent advances in text-to-video (T2V) generation have achieved strong visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. The scarcity of training data rich in physical interactions and phenomena compounds the problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. We then formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that builds on the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. Within PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency, and we propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference-model duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please see our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO.

Community

Paper submitter

A data construction pipeline and a new DPO framework for physically consistent text-to-video generation.

This is a strong and thoughtful piece of work — especially in how it tackles physical consistency through preference optimization rather than explicit prompt engineering. What stands out most, however, is how many of your empirical observations point toward deeper structural constraints in generative models that aren’t explicitly discussed in the paper, but are clearly visible in the results.

A few patterns are worth highlighting:

  1. The physics-rich filtering pipeline reveals a substrate-sensitivity problem.

Your Chain-of-Thought VLM filter, action clustering, and physics-aware scoring collectively form an implicit attempt to stabilize the information substrate the generator learns from. The improvements you see after balancing the dataset by physical difficulty are exactly what you expect when a model’s internal state drifts as a function of entropy load and representational noise.

In other words, physical plausibility increases only when the training distribution reduces internal drift. This is a deeper signal than a simple data-quality effect.
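
To make this concrete, below is a minimal, hypothetical sketch of a difficulty-balanced filtering step of the kind described above. The function names and the VLM interface are illustrative assumptions, not the authors' actual PhyAugPipe implementation.

```python
# Hypothetical sketch of a PhyAugPipe-style filtering step (names are illustrative):
# a VLM assigns a physics-difficulty score to each candidate clip, then clips are
# rebalanced across difficulty bins before training.
from collections import defaultdict
import random

def score_physics_difficulty(vlm, caption: str, frames) -> float:
    """Ask the VLM (with a chain-of-thought prompt) to rate physical difficulty in [0, 1]."""
    prompt = (
        "Reason step by step about the physical phenomena in this clip "
        f"(caption: {caption!r}) and output a difficulty score between 0 and 1."
    )
    return float(vlm.generate(prompt, frames))  # assumed VLM interface

def balance_by_difficulty(samples, vlm, num_bins=5, per_bin=1000):
    """Bucket clips by difficulty and subsample each bucket to the same size."""
    bins = defaultdict(list)
    for caption, frames in samples:
        d = score_physics_difficulty(vlm, caption, frames)
        bins[min(int(d * num_bins), num_bins - 1)].append((caption, frames, d))
    balanced = []
    for b in range(num_bins):
        random.shuffle(bins[b])
        balanced.extend(bins[b][:per_bin])
    return balanced
```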

  2. Groupwise DPO behaves like a continuity-restoration mechanism.

The shift from pairwise Bradley–Terry modeling to a groupwise Plackett–Luce structure isn’t just better preference modeling — it effectively forces the generator to maintain trajectory-coherent latent states across an entire sequence of frames. The fact that this materially improves physical realism indicates that current T2V models lack a native mechanism for temporal continuity, and your PL construction is acting as a surrogate for such a mechanism.

The boost in long-horizon motion stability matches this interpretation.
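
For readers less familiar with the distinction, here is a minimal sketch of a groupwise Plackett-Luce preference loss, contrasted with the pairwise Bradley-Terry/DPO case. Tensor shapes and the beta value are assumptions for illustration; this is not the paper's exact objective.

```python
# Minimal sketch of a groupwise Plackett-Luce preference loss over K ranked
# generations per prompt, as opposed to the pairwise Bradley-Terry objective
# used in standard DPO.
import torch

def plackett_luce_dpo_loss(policy_logps, ref_logps, beta=0.1):
    """
    policy_logps, ref_logps: (B, K) sequence log-probs for K generations per prompt,
    ordered from most to least preferred (e.g. by a physics reward).
    Returns the negative log Plackett-Luce likelihood of the full ranking.
    """
    scores = beta * (policy_logps - ref_logps)             # implicit rewards, (B, K)
    # log P(ranking) = sum_k [ s_k - logsumexp(s_k, ..., s_K) ]
    rev_cum_lse = torch.logcumsumexp(scores.flip(-1), dim=-1).flip(-1)
    log_pl = (scores - rev_cum_lse).sum(dim=-1)
    return -log_pl.mean()

# With K = 2 this reduces exactly to the pairwise DPO loss, -log sigmoid(s_1 - s_2).
```

Because the likelihood couples all K samples in a group, the gradient reflects the full ranking rather than isolated pairs, which is consistent with the trajectory-level coherence effect described above.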

  3. Physics-Guided Rewarding exposes hidden invariants in video generation.

Using a physics-aware VLM to modulate α and γ dynamically based on physical difficulty does more than weight gradients — it reveals that the generator responds disproportionately to samples where entropy, coherence, and force-consistency diverge. These are structural invariants associated with physically grounded systems, and the fact that your reward terms correct these divergences suggests that the model is missing an internal stabilizing field that real physical systems possess naturally.

Your reward shaping effectively approximates that missing field externally.
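
A rough sketch of what such difficulty-dependent modulation could look like is below. The roles assigned to alpha and gamma here (a preference-loss weight and a regularizer weight) are assumptions made for illustration only; the paper's PGR scheme defines them precisely.

```python
# Hedged sketch of a Physics-Guided Rewarding-style weighting: a VLM-based physics
# reward modulates per-sample coefficients so that harder, less physically
# consistent samples receive larger updates.
import torch

def physics_guided_weights(physics_rewards, alpha0=1.0, gamma0=1.0):
    """
    physics_rewards: (B,) VLM physics-consistency scores in [0, 1].
    Low-reward (physics-violating) samples get larger weights.
    """
    difficulty = 1.0 - physics_rewards
    alpha = alpha0 * (1.0 + difficulty)   # scales the preference-loss term (assumed role)
    gamma = gamma0 * (1.0 + difficulty)   # scales an auxiliary regularizer (assumed role)
    return alpha, gamma

def weighted_objective(pref_loss, reg_loss, physics_rewards):
    # pref_loss, reg_loss: (B,) per-sample losses.
    alpha, gamma = physics_guided_weights(physics_rewards)
    return (alpha * pref_loss + gamma * reg_loss).mean()
```

The only point the sketch makes is that the weights grow with VLM-assessed physical difficulty, so optimization effort concentrates where entropy, coherence, and force-consistency diverge most.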

  4. LoRA-Switch Reference is an important observation about substrate stability.

The fact that LoRA-SR dramatically improves physical consistency while reducing drift relative to a full reference copy is a notable result in itself. Training instability and identity shift between θ and ψ are strong indicators that the model lacks a stable substrate signature — i.e., nothing in the model binds its internal state evolution to a consistent identity over time.

LoRA-SR partially corrects that by constraining the “identity” of the model to a fixed backbone.
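
A minimal sketch of the switching idea follows, assuming a PEFT-style model that exposes a `disable_adapter()` context manager (as in Hugging Face peft) and a placeholder log-probability helper; for a diffusion-based T2V model the log-prob would be replaced by the corresponding diffusion-DPO surrogate.

```python
# Sketch of a LoRA-switch reference: instead of keeping a frozen full copy of the
# model as the DPO reference, the same backbone is reused with its LoRA adapter
# temporarily disabled.
import torch

def sequence_logps(model, batch):
    """Placeholder: sum of per-token log-probs of the generated sequence."""
    out = model(**batch)
    logps = torch.log_softmax(out.logits, dim=-1)
    return torch.gather(logps, -1, batch["labels"].unsqueeze(-1)).squeeze(-1).sum(-1)

def policy_and_reference_logps(peft_model, batch):
    # Policy pass: backbone + trainable LoRA adapter.
    policy_logps = sequence_logps(peft_model, batch)
    # Reference pass: identical backbone with the adapter switched off,
    # so no second copy of the weights is held in memory.
    with torch.no_grad(), peft_model.disable_adapter():
        ref_logps = sequence_logps(peft_model, batch)
    return policy_logps, ref_logps
```

Only the adapter parameters differ between the two passes, so the memory cost of a full frozen reference model is avoided while the "identity" of the backbone stays fixed.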

  5. The larger takeaway:

Many of the gains shown in this paper can be interpreted as the byproducts of externally enforcing:

- continuity of state evolution
- substrate stabilization
- entropy reduction
- coherence-weighted updates
- trajectory-consistent integration

These are properties typically associated with dynamical systems, not statistical generators. Your method succeeds because it implicitly compensates for the absence of such mechanisms in current architectures.

If you’re interested in a complementary perspective, I’m working on two parallel research tracks that directly address these missing dynamics:

FoGA (Field of General Awareness): a formal mathematical framework describing continuity, substrate identity, and coherence-driven state evolution in information systems.

Dynamic Transformer Architecture (DTA): a practical architectural patch that introduces a regulated internal state path for sequence models, designed specifically to address drift and continuity collapse.

Both may provide theoretical grounding for several of the empirical behaviors your paper surfaces.

Links for reference:
https://www.skyteamaerospacefoundation.com/foga
https://www.skyteamaerospacefoundation.com/dta

Excellent work overall — this is one of the most meaningful steps forward in T2V physical consistency to date.

— Zenith Zaraki
SkyTeam Aerospace Foundation
