More explorations can be done for diffusion in the spectral domain.
What if we trained a diffusion model directly in the spectral domain? We could use a causal transformer in which high-frequency tokens attend to low-frequency ones but not the reverse. Would that also be more natural for super-resolution tasks?
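A rough sketch of what such a frequency-ordered causal mask could look like, purely as an illustration of the idea above. It assumes tokens are already sorted from lowest to highest frequency and that each token may also attend to itself; the function names `spectral_causal_mask` and `masked_attention` are hypothetical and not taken from any paper or library.

```python
import numpy as np

def spectral_causal_mask(num_tokens: int) -> np.ndarray:
    """mask[i, j] is True when token i may attend to token j.

    Assumption: index 0 is the lowest frequency, so token i sees
    every token of equal or lower frequency (lower triangle).
    """
    return np.tril(np.ones((num_tokens, num_tokens), dtype=bool))

def masked_attention(q, k, v, mask):
    """Single-head scaled dot-product attention with a boolean mask."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)  # block high-frequency keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```

With this mask the lowest-frequency token only sees itself, while the highest-frequency token sees the whole spectrum, which is the coarse-to-fine ordering that super-resolution would rely on.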
High-fidelity generation is hitting a scaling crisis as DiT compute grows with image resolution and video length. But do we need high-resolution denoising at every step?
We introduce Spectral Progressive Diffusion, a plug-and-play framework for efficient image and video
Introducing Omni, one unified model that supports any-to-any multimodal modeling, including multimodal understanding, image/video generation and editing, world modeling, and 3D reconstruction. It is all in one model that adopts a standard mixture-of-experts architecture with only 3B active parameters.
The discriminator is also trained on ImageNet (initialized from a pretrained SiT/JiT and later trained jointly with the generator), but it does not cause any metric hacking.
If you are referring to this prior research (arxiv.org/abs/2203.06026), it shows that training an FFHQ/LSUN GAN with an ImageNet feature network as the discriminator can lead to leakage and artificially improved FID scores. However, when both training and evaluation are conducted on ImageNet, there is no such leakage.
Then the question becomes whether using a discriminator in general helps exploit the null space of the FID metric. We do not believe so, as the improvement in FID is clearly reflected in improved human perceptual preference. We present uncurated ImageNet comparison results in the appendix of the paper.
The method also works well with guidance; full metrics are provided in the paper.
Continuous Adversarial Flow Models (CAFMs)
Paper: arxiv.org/abs/2604.11521
Flow matching generates poor samples without guidance because the MSE loss induces incorrect generalization. Instead of an isotropic Euclidean distance, we need a manifold-aware criterion—but how can we obtain it?
CAFMs bring adversarial training to continuous time. Learning velocity with a discriminator induces better generalization because the discriminator as a criterion can learn the manifold!
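For contrast, here is a minimal sketch of the standard flow matching criterion that the discriminator is meant to replace. It assumes a linear interpolation between noise (t = 0) and data (t = 1); other conventions exist, and the function name `flow_matching_mse` is hypothetical. This is not the CAFM loss, only the MSE baseline it is compared against.

```python
import numpy as np

def flow_matching_mse(velocity_fn, x_data, rng):
    """MSE flow matching loss on one batch of shape (batch, dim).

    Assumption: x_t = (1 - t) * noise + t * data, so the target
    velocity is data - noise.
    """
    noise = rng.standard_normal(x_data.shape)
    t = rng.uniform(size=(x_data.shape[0], 1))  # one timestep per sample
    x_t = (1.0 - t) * noise + t * x_data
    target = x_data - noise
    pred = velocity_fn(x_t, t)
    # Every coordinate is weighted equally: an isotropic Euclidean
    # criterion with no notion of the data manifold.
    return float(np.mean((pred - target) ** 2))
```

The squared error treats all directions in data space the same, which is the property the posts above blame for incorrect generalization.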
Also, unlike flow matching’s forward KL objective, adversarial training allows optimizing different divergences. CAFMs can generate sharper and higher-quality samples.
Adversarial training in continuous time also avoids the vanishing gradient problem, leading to stable training.
CAFMs can be trained from scratch or used to post-train existing flow models. Post-training SiT/JiT for just 10 epochs yields large FID improvements. We also observe significant GenEval and DPG improvements when post-training text-to-image models.
More details in this thread!
@Jacoed @YouJiacheng Not quite the same. The discriminator projects a high-dimensional input to a scalar output, and the loss is applied to that scalar, but the transformation is learned and can capture the manifold. This is the key difference compared to a fixed isotropic Euclidean distance.
The intuition is that, regardless of the method, we need a criterion metric. Most work just uses Euclidean distance, but this induces incorrect generalization. We want to switch to some kind of perceptual metric, but this needs to be learned and generalized by another network. If we use a fixed perceptual network, the generator may exploit its null space, so we have to adversarially train the criterion network too. This becomes GAN-like. Maybe there will be better approaches in the future, but this was my intuition for researching adversarial methods.
Guidance is not faithful to the original data distribution. CFG can literally drop modes! It can also generate out-of-distribution samples (overly canonical, overexposed, synthetic-looking images instead of photorealistic ones).
Today we are using guidance to compensate for the flaw in the original flow matching objective. CAFMs try to "fix" the problem at its source. Guidance can still be applied on top, and we can achieve even better performance with both combined.
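To make the faithfulness point concrete, here is the usual classifier-free guidance combination written out as a small sketch. The helper name `cfg_combine` is hypothetical, and the two inputs stand in for a model's conditional and unconditional predictions.

```python
import numpy as np

def cfg_combine(v_cond, v_uncond, scale):
    """Classifier-free guidance: move from the unconditional
    prediction toward (and past) the conditional one."""
    return v_uncond + scale * (v_cond - v_uncond)

v_cond = np.array([1.0, 2.0])
v_uncond = np.array([0.0, 1.0])
same = cfg_combine(v_cond, v_uncond, 1.0)    # equals v_cond
pushed = cfg_combine(v_cond, v_uncond, 3.0)  # extrapolates beyond v_cond
```

At a scale of 1 the result is exactly the conditional prediction. Any larger scale extrapolates beyond what the model predicted for the data, so the sampled distribution is no longer the one the model was trained to match.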
I actually wrote a paper called "Diffusion Model with Perceptual Loss" (arxiv.org/abs/2401.00110) two years ago. The problem with using a fixed perceptual network is that the generator will try to exploit the null space, creating artifacts.
CAFMs are my solution to that. With an adversarial approach, the criterion is updated in the loop so the generator cannot exploit the null space.
@junmingong Flow matching is indeed very efficient and scalable for pre-training. But the problem with flow matching is its use of the MSE loss, which induces the wrong generalization. CAFMs can be used as a post-training approach. This combines the best of both worlds.
@_akhaliq Thanks for sharing! Continuous Adversarial Flow Models bring adversarial training to continuous time. Instead of learning velocity with an MSE criterion as in flow matching, we use a learned criterion that captures the data manifold. Quick post-training. Better generation!
5. Conclusion and thoughts
Our Adversarial Flow line of research explores ways to integrate adversarial modeling and flow modeling, two of the most influential paradigms in generative modeling. Adversarial Flow Models (AFMs) bring adversarial training to discrete-time flow modeling. Now, Continuous Adversarial Flow Models (CAFMs) further extend this idea to continuous time.
I think being able to do adversarial training in continuous time will unlock many more interesting explorations!
Our method is also different from guidance. Both CAFM and FM ensure convergence to the empirical distribution (i.e., the overfitted ground-truth distribution). They differ only in their finite-capacity generalization, while still remaining faithful to the original distribution. In contrast, guidance does not guarantee faithfulness to the original distribution. Guidance can lead to out-of-distribution (OOD) samples, canonical samples, and other distortions. Accurately and faithfully generating the original data distribution remains an important area of research.
We do not claim that CAFMs can always generate high-quality samples without guidance. When training samples are sparse or contain outliers, the manifold learned by the discriminator is not guaranteed to be correct. Guidance can still be applied orthogonally to achieve low-temperature sampling.
Recently, representational latent spaces (RAE) have become a popular research direction. These methods change the data space in which flow matching operates and therefore implicitly affect the model’s generalization. However, they do not directly address the problem of MSE and require operating in latent space. CAFMs directly modify the loss objective to induce different generalization and work effectively even in pixel space. Other representation-alignment approaches (REPA) may also have implicit connections to our work. We hope our work inspires further insights in the research community.