See-2-Sound @ SIGGRAPH 2025
Work led by: Rishit Dagli (Computer Science, University of Toronto; also an intern at NVIDIA)
Authors: Rishit Dagli, Shivesh Prakash, Robert Wu, Houman Khosravani
We were so pleased that Rishit’s work was accepted to SIGGRAPH 2025 as a poster.
Generating combined visual and auditory sensory experiences is critical for immersive content. We introduce SEE-2-SOUND, a training-free pipeline that turns an image, GIF, or video into 5.1 spatial audio. SEE-2-SOUND works sequentially: it (i) segments visual sound sources; (ii) estimates their 3D positions from monocular depth; (iii) synthesizes mono audio for each source; and (iv) renders the mix with room acoustics. Built entirely from off-the-shelf models, the method needs no fine-tuning and runs zero-shot on real or generated media. We demonstrate compelling results for generating spatial audio from videos, images, dynamic images, and media generated by learned approaches.
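To make step (ii) concrete, here is a minimal sketch of how a segmented source’s pixel location plus an estimated monocular depth could be lifted to a 3D position. The function name, image dimensions, and the pinhole-camera model are our assumptions for illustration; the paper itself places sources on a viewing sphere around the listener.

```python
import numpy as np


def pixel_to_3d(u: float, v: float, depth: float,
                width: int = 1920, height: int = 1080,
                fov_deg: float = 60.0) -> np.ndarray:
    """Back-project pixel (u, v) with an estimated monocular depth (metres)
    into camera-space 3D coordinates.

    Illustrative assumption: a pinhole camera with a known horizontal field
    of view, not the method's actual viewing-sphere placement.
    """
    # Focal length in pixels, derived from the horizontal field of view.
    f = (width / 2.0) / np.tan(np.radians(fov_deg) / 2.0)
    x = (u - width / 2.0) * depth / f   # right of the optical axis
    y = (v - height / 2.0) * depth / f  # below the axis (image y grows downward)
    z = depth                           # along the viewing direction
    return np.array([x, y, z])


# Example: a source segmented right of centre, estimated ~3.5 m away.
print(pixel_to_3d(u=1400.0, v=540.0, depth=3.5))
```

A position like this, computed per source, is what a downstream spatial renderer needs to place each mono track in the 5.1 mix.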
Relevant Links
- Paper: SIGGRAPH 2025
- Paper: arXiv
- Project Website: github.io