DRAGON: Distributional Rewards Optimize Diffusion Generative Models

Yatong Bai1,2*   Jonah Casebeer2   Somayeh Sojoudi1   Nicholas J. Bryan2

1University of California, Berkeley
2Adobe Research
*Work done during an internship at Adobe Research


Abstract


We present Distributional Rewards for Generative Optimization (DRAGON), a versatile framework for fine-tuning content generation models towards a desired outcome. Compared with traditional reinforcement learning from human feedback (RLHF) or pairwise preference approaches such as direct preference optimization (DPO), DRAGON is more flexible. It can optimize reward functions that evaluate either individual examples or a distribution of them, making it compatible with a broad spectrum of instance-wise, instance-to-distribution, and distribution-to-distribution rewards. Leveraging this versatility, we construct novel reward functions by selecting an encoder and a set of reference examples to create an exemplar distribution. When cross-modality encoders such as CLAP are used, the reference examples may be of a different modality (e.g., text versus audio). DRAGON then gathers online and on-policy generations, scores them to construct a positive demonstration set and a negative set, and leverages the contrast between the two sets to maximize the reward. For evaluation, we fine-tune an audio-domain text-to-music diffusion model with 20 different reward functions, including a custom music aesthetics model, CLAP score, Vendi diversity, and Fréchet audio distance (FAD). We further compare instance-wise (per-song) and full-dataset FAD settings while ablating multiple FAD encoders and reference sets. Over all 20 target rewards, DRAGON achieves an 81.45% average win rate. Moreover, reward functions based on exemplar sets indeed enhance generations and are comparable to model-based rewards. With an appropriate exemplar set, DRAGON achieves a 60.95% human-voted music quality win rate without training on human preference annotations. As such, DRAGON demonstrates a new approach to designing and optimizing reward functions for improving human-perceived quality. Example generations can be found at https://ml-dragon.github.io/web.
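As a rough illustration of two ideas in the abstract, the sketch below computes a Fréchet distance between Gaussians fit to two embedding sets (the quantity behind FAD, in the squared form commonly reported) and splits scored generations into a positive and a negative set. It is not the authors' implementation: the function names, the random stand-in embeddings, and the median threshold used for the split are all assumptions made for this example, and a real setup would obtain embeddings from an audio encoder.

```python
import numpy as np


def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Squared Frechet distance between Gaussians fit to two embedding sets.

    Each input has shape (num_examples, embed_dim), with num_examples >= 2
    and embed_dim >= 2.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semidefinite inputs. Clip tiny negatives caused by round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b).real
    tr_sqrt = np.sqrt(np.clip(eigvals, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


def split_by_reward(rewards: np.ndarray):
    """Return (positive_idx, negative_idx).

    Assumption: a median threshold, chosen only for illustration.
    """
    threshold = np.median(rewards)
    positive = np.flatnonzero(rewards > threshold)
    negative = np.flatnonzero(rewards <= threshold)
    return positive, negative


rng = np.random.default_rng(0)
reference = rng.normal(size=(256, 8))       # stand-in exemplar embeddings
generated = rng.normal(size=(64, 8)) + 0.5  # stand-in generation embeddings
print("distance to exemplar set:", frechet_distance(generated, reference))

rewards = rng.normal(size=64)               # placeholder instance-wise scores
pos_idx, neg_idx = split_by_reward(rewards)
print(len(pos_idx), "positive /", len(neg_idx), "negative examples")
```

A lower distance to the exemplar set corresponds to a higher reward, so a distribution-level reward could be taken as the negated distance. How the positive and negative sets are then contrasted during fine-tuning is described in the paper and is not reproduced here.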


Bibtex

          
          @article{bai2025dragon,
              title={DRAGON: Distributional Rewards Optimize Diffusion Generative Models},
              author={Bai, Yatong and Casebeer, Jonah and Sojoudi, Somayeh and Bryan, Nicholas J.},
              year={2025},
              archivePrefix={arXiv},
              primaryClass={cs.SD},
          }
                    

Good Examples Comparison

Comparing audio examples from different stages/models. All audio samples are provided in their original format.
Text Prompt | Reference | Aesthetics (ours) | Per-Song VAE-FAD (ours)
Only instrumental and based on electronic samples that picks up as the song progresses
A cinematic piece with a very bright piano, later joined by a drum machine, hand clapping and a violin synth
film music in a polka or circus style that is melodic and in a waltz
Christmas music, jingle bells instruments, happy, playful, nostalgic
hard-core rap music with an angry mood
Epic world music for film, featuring dynamic drums, orchestra drums, and congas
Eurodance pop song with synth stabs and a heavy stereo echo, and arpeggios
Broadcast programming and internet documentary music with a dreamy and inspiring mood, featuring acoustic guit
Yearning film score with slow strings, heartfelt piano, and deep love at 60 bpm
super hard core rap with an 808 kick
Progressive electronic song with an intro of African percussions
Bossa nova or samba duet with jazzy chords on a nylon guitar, and maracas
a show stopping broadway musical opening number
Electro dance song to play in the pub to cheer up the crowd

Randomly Picked Examples

All audio samples are provided in MP3 format.
Text Prompt | Reference | Aesthetics (ours) | Per-Song VAE-FAD (ours)
Popular music with a moderate tempo, synthesized bass, and happy mood
Vlog-worthy pop track with a moderate tempo, featuring dynamic drums, bass, piano, and vocal elements
cute and happy acoustic folk guitar with piano
Emotional and powerful solo piano performance in a classical film style with a quick pace
Film score with a mysterious and suspenseful mood
Positive vibe like something good is happening or about to happen based primarily on finger picking acoustic
This song gives a narrative of discovery or working something out I could imagine hearing it at the end of a v
A great song with a fun chorus
chill, groovy hip-hop
sweet violin for a romantic evening
theme song for a 90s sitcom
Haunting expansive sound as if you are in space
NOT piano
Festive and happy electronic music with piano, keyboard synthesizer, and drums
Powerful and emotional acoustic pop
Powerful and emotional film score with a mix of piano, keyboard synthesizer, winds, and flute
authentic Japanese music for a travel show
An electronic music track with a dark synth theme and a muffled drum machine beat. The track has a lot of spac
solo guitar, rock
Nostalgic and sad techno tune
This is an electronic music track with spacey ethereal synths accompanied with string like elements and later
A sentimental and romantic instrumental track for a film, using electric guitar, piano, and keyboard synthesiz
solo acoustic paino, bright
Fast-paced product sale background music with electric piano
Pop rock song with a hopeful feel and a chorus that audience can sing along to
Indie rock song from a 4 piece band that has a faster tempo and energetic beat
Dreamy and emotional film score for a documentary or drama
Electro-ambient track with a simple soft but sharp drum machine beat, the synth parts enter one after the othe
Untitled
music for a crime scene podcast