
See-2-Sound @ SIGGRAPH

Work led by: Rishit Dagli, Computer Science, University of Toronto (also an intern at NVIDIA)

Authors: Rishit Dagli, Shivesh Prakash, Robert Wu, Houman Khosravani

We are pleased that Rishit’s work was accepted as a poster at SIGGRAPH 2025.

Generating combined visual and auditory sensory experiences is critical for immersive content. We introduce SEE-2-SOUND, a training-free pipeline that turns an image, GIF, or video into 5.1 spatial audio. SEE-2-SOUND works sequentially: (i) it segments visual sound sources; (ii) estimates their 3-D positions from monocular depth; (iii) synthesizes mono audio for every source; and (iv) renders the mix with room acoustics. Built entirely from off-the-shelf models, the method needs no fine-tuning and runs zero-shot on real or generated media. We demonstrate compelling results for generating spatial audio from videos, images, dynamic images, and media produced by learned generative approaches.
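To make the four stages concrete, here is a minimal structural sketch of how such a pipeline composes. Every function body is a hypothetical placeholder; the names, signatures, and stand-in logic are illustrative assumptions and do not correspond to the actual SEE-2-SOUND code or the off-the-shelf models it wraps.

```python
# Illustrative sketch of a SEE-2-SOUND-style pipeline (not the authors' implementation).
# Each stage would call an off-the-shelf model in practice; here the bodies are stubs.

from dataclasses import dataclass
import numpy as np

SAMPLE_RATE = 48_000
CHANNELS_5_1 = 6  # L, R, C, LFE, Ls, Rs


@dataclass
class SoundSource:
    mask: np.ndarray                 # per-pixel segmentation mask of the visual source
    position: tuple | None = None    # (x, y, z) estimate, filled in step (ii)
    audio: np.ndarray | None = None  # mono waveform, filled in step (iii)


def segment_sound_sources(image: np.ndarray) -> list[SoundSource]:
    """(i) Segment regions likely to emit sound (stub: the whole frame as one source)."""
    return [SoundSource(mask=np.ones(image.shape[:2], dtype=bool))]


def estimate_position(source: SoundSource, depth: np.ndarray) -> tuple:
    """(ii) Lift a source to 3-D using a monocular depth map (stub: mask centroid)."""
    ys, xs = np.nonzero(source.mask)
    return (float(xs.mean()), float(ys.mean()), float(depth[ys, xs].mean()))


def synthesize_mono_audio(source: SoundSource, duration_s: float = 2.0) -> np.ndarray:
    """(iii) Generate mono audio conditioned on the source (stub: silence)."""
    return np.zeros(int(duration_s * SAMPLE_RATE), dtype=np.float32)


def render_5_1(sources: list[SoundSource]) -> np.ndarray:
    """(iv) Mix all sources into a 5.1 bed (stub: equal-gain sum on every channel)."""
    n = max(len(s.audio) for s in sources)
    mix = np.zeros((CHANNELS_5_1, n), dtype=np.float32)
    for s in sources:
        mix[:, : len(s.audio)] += s.audio  # a real renderer would pan by s.position
    return mix


def see_2_sound(image: np.ndarray, depth: np.ndarray) -> np.ndarray:
    sources = segment_sound_sources(image)
    for s in sources:
        s.position = estimate_position(s, depth)
        s.audio = synthesize_mono_audio(s)
    return render_5_1(sources)


if __name__ == "__main__":
    frame = np.zeros((480, 640, 3), dtype=np.uint8)     # stand-in for an input image
    depth_map = np.ones((480, 640), dtype=np.float32)   # stand-in for monocular depth
    print(see_2_sound(frame, depth_map).shape)          # -> (6, 96000)
```

Because the stages only exchange masks, positions, and waveforms, each stub above could be swapped for a pretrained model without retraining anything, which is what makes the training-free, zero-shot composition possible.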

Relevant Links
