Have you ever watched an ASMR video where the ASMRist speaks into ear microphones? You get sound in both audio channels as if someone is whispering directly into your ears. That’s called spatial audio, and Adobe researchers are using these types of videos to teach machines to match the location of a sound to its position in the video.
Because we, as humans, can establish spatial correspondences between our visual and auditory senses, we immediately notice that the visual and audio streams are consistent in the first video and flipped in the second. Our ability to link the location of what we see with what we hear helps us interpret and navigate the world more effectively (e.g., a loud clatter draws our visual attention, telling us where to look; when interacting with a group of people, we leverage spatial cues to disambiguate different speakers). In turn, understanding audio-visual spatial correspondence could enable machines to interact more seamlessly with the real world, improving performance on audio-visual tasks such as video understanding and robot navigation.
In our work, “Telling left from right: Learning spatial correspondence between sight and sound,” by Karren Yang, Bryan Russell, and Justin Salamon, which is being presented this week as an oral presentation at the Conference on Computer Vision and Pattern Recognition (CVPR) 2020 and as an invited talk at the CVPR 2020 Sight and Sound workshop, we present a novel approach for teaching computers the spatial correspondence between sight and sound.
To teach our system, we train it to tell whether the spatial audio (i.e., the left and right audio channels) has been flipped. This surprisingly simple task yields a strong audio-visual representation that can be used in a variety of downstream applications. Our approach is “self-supervised,” which means it does not rely on manual annotations. Instead, the structure of the data itself provides supervision to the algorithm, so all we need to train it is a large dataset of videos with spatial audio. Our approach goes beyond previous approaches in that it learns cues directly from stereo audio in order to match the perceived location of a sound with its position in the video.
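To make the pretext task concrete, here is a minimal sketch of how flipped training examples could be generated. This is an illustrative helper, not the paper’s actual pipeline: the function name and interface are assumptions, and the real system pairs each audio clip with video frames and trains an audio-visual classifier on the pair.

```python
import numpy as np

def make_flip_example(stereo_audio, rng):
    """Randomly swap the left/right channels of a stereo clip.

    stereo_audio: array of shape (2, n_samples); row 0 = left, row 1 = right.
    Returns (audio, label), where label is 1 if the channels were flipped.
    Hypothetical helper illustrating the self-supervised flip task; the
    classifier's job is to recover `label` from the (audio, video) pair.
    """
    flip = rng.random() < 0.5
    if flip:
        stereo_audio = stereo_audio[::-1]  # reverse the channel axis
    return stereo_audio, int(flip)

# Example: a tiny two-channel clip.
clip = np.array([[1.0, 2.0], [3.0, 4.0]])
audio, label = make_flip_example(clip, np.random.default_rng(0))
```

Because the label is derived from the data itself (did we swap the channels or not?), no human annotation is needed; any corpus of videos with spatial audio supplies the supervision for free.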
If you want to try something similar yourself (or if you’re in need of some tingles), check out the YouTube-ASMR-300K dataset, which contains over 900 hours of ASMR video with spatial audio.