Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis

Qi Yang1, Binjie Mao2, Zili Wang1, Xing Nie1, Pengfei Gao2, Ying Guo2, Cheng Zhen2, Pengfei Yan2, Shiming Xiang1
1Institute of Automation, Chinese Academy of Sciences, 2Meituan

Visualization

Overview.

Vanilla Foley vs. Draw an Audio. Traditional methods generate the entire audio track from the video input alone, which limits controllability and flexibility. Draw an Audio instead accepts multiple instructions to produce high-quality, synchronized audio, and can build up mixed audio over multiple stages, making it better suited to practical applications.
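The multi-stage mixing mentioned above can be illustrated with a minimal sketch. It assumes each stage yields a mono waveform as a list of float samples at a shared sample rate, and that mixing is plain sample-wise summation with peak rescaling. The function name `mix_tracks` and this mixing rule are illustrative assumptions, not the authors' pipeline.

```python
from typing import List, Sequence


def mix_tracks(tracks: Sequence[Sequence[float]], peak: float = 1.0) -> List[float]:
    """Sum mono tracks sample by sample, then rescale if the sum exceeds `peak`.

    Hypothetical helper for illustration only; the tracks are assumed to
    share one sample rate. Shorter tracks are treated as silent past their end.
    """
    if not tracks:
        return []
    length = max(len(track) for track in tracks)
    mixed = [0.0] * length
    for track in tracks:
        for i, sample in enumerate(track):
            mixed[i] += sample
    # Rescale only when the summed signal would clip.
    loudest = max((abs(s) for s in mixed), default=0.0)
    if loudest > peak:
        scale = peak / loudest
        mixed = [s * scale for s in mixed]
    return mixed
```

In this sketch, a first pass could generate one sound source (for example, speech) and a second pass another (for example, ambient water), with `mix_tracks([audio_1, audio_2])` combining them into a single track.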

Architecture

Overview.

The architecture of Draw an Audio incorporates a Latent Diffusion Model (LDM) as the foundational model, a Text Condition Model for text instruction, a Masked-Attention Module (MAM) for video instruction, and a Time-Loudness Module (TLM) for signal instruction.
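To make the idea of a signal instruction more concrete, the sketch below computes a loudness-over-time curve from a waveform using frame-wise RMS energy in decibels. This is one common way to describe loudness; whether the TLM consumes a signal computed this way is an assumption, and `loudness_envelope`, its frame parameters, and the silence floor are hypothetical choices made for illustration.

```python
import math
from typing import List, Sequence


def loudness_envelope(
    samples: Sequence[float],
    frame_size: int = 1024,
    hop: int = 512,
    floor_db: float = -80.0,
) -> List[float]:
    """Return frame-wise RMS loudness in dB (relative to full scale 1.0).

    Hypothetical helper for illustration only. Silent frames are
    clamped to `floor_db` so the curve stays finite.
    """
    envelope: List[float] = []
    last_start = max(len(samples) - frame_size + 1, 1)
    for start in range(0, last_start, hop):
        frame = samples[start:start + frame_size]
        if not frame:
            break
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        db = 20.0 * math.log10(rms) if rms > 0.0 else floor_db
        envelope.append(max(db, floor_db))
    return envelope
```

A curve like this has one value per frame, so it can be aligned with video frames or, conversely, sketched by hand and used as a target for when the generated audio should be loud or quiet.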

Demos

For the best experience, use earphones, raise the volume, and zoom in on the demo videos.

Multiple Instruction Videos

Generated Audio 1

Generated Audio 2

Mixed Audio

The little squirrel is smiling and speaking to us.

The sound of river waves.

(Video from Sora)

Comparison with other methods

Ground Truth

SpecVQGAN

Diff-Foley

Draw an Audio (Ours)

The train moving down the tracks, the train horn being blown, and the traffic lights changing colors, all of which contribute to the overall soundscape of the scene.

The ball being hit by a baseball bat, the players running and shouting during the game, and the movements of the players on the field.

A goat vocalizing softly and then screaming loudly.

Crickets chirping very loudly.

A helicopter is taking flight with the blades slapping.

Camera muffling and wind blowing into a microphone as a vehicle engine runs while idle.

A man is playing the violin in front of a crowd, and another man is playing the cello, both of them producing sound through their instruments as they perform for the audience.