Meta has built three new artificial intelligence models designed to make sounds more realistic in mixed and virtual reality experiences
Meta has built three new artificial intelligence (AI) models to make sounds more realistic in mixed and virtual reality experiences. The three AI models, Visual-Acoustic Matching, Visually-Informed Dereverberation and VisualVoice, focus on human speech and sounds in video and are designed to push “us toward a more immersive reality at a faster rate,” the company said.
“Acoustics play a role in how sound will be experienced in the metaverse, and we believe AI will be core to delivering realistic sound quality,” said Meta’s AI researchers and audio specialists from its Reality Labs team.
They built the AI models in collaboration with researchers from the University of Texas at Austin and are making these models for audio-visual understanding open to developers.
The self-supervised Visual-Acoustic Matching model, called AViTAR, adjusts audio to match the space of a target image. Its training objective learns acoustic matching from in-the-wild web videos, even though such videos offer neither acoustically mismatched audio nor labelled data.
VisualVoice learns in a way that’s similar to how people master new skills, picking up visual and auditory cues from unlabelled videos to achieve audio-visual speech separation.
For example, imagine attending a group meeting in the metaverse with colleagues from around the world. Instead of people having fewer conversations or talking over one another, the reverberation and acoustics would adjust as they moved around the virtual space and joined smaller groups.
“VisualVoice generalises well to challenging real-world videos of diverse scenarios,” said Meta AI researchers.