Microsoft’s new AI bot VALL-E can simulate a voice from only a three-second audio sample
A team of Microsoft researchers has created an innovative text-to-speech AI model named VALL-E. It can replicate a person’s voice almost perfectly, and it needs only a three-second audio sample of that speaker to do so.
Moreover, the researchers claim that once VALL-E has learned a specific voice, it can synthesise audio of that person saying anything, in a way that attempts to preserve the speaker’s emotional tone as well as the acoustic environment of the original recording. Its developers suggest VALL-E could be used for high-quality text-to-speech applications and for speech editing, in which a recording of a person’s voice is altered to match an edited text transcript, and in combination with other generative AI models such as GPT-3 to create audio content. VALL-E is built on a technique dubbed EnCodec, a neural audio codec that Meta revealed in October 2022.
In contrast with conventional text-to-speech systems, which typically synthesise speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It analyses how a person sounds, breaks that information into discrete tokens using EnCodec, and then uses its training data to match what it “knows” about how that voice would sound speaking phrases beyond the three-second sample.
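To make that data flow more concrete, here is a minimal, purely illustrative Python sketch. Every name in it (encode_to_tokens, predict_tokens, decode_to_waveform, the codebook and frame-rate constants) is a hypothetical stand-in rather than VALL-E’s or EnCodec’s real API; the stubs only mimic the shape of the pipeline: prompt audio to codec tokens, text plus prompt tokens to predicted tokens, tokens back to a waveform.

```python
import numpy as np

# Hypothetical sketch of a codec-token TTS pipeline in the style described
# above. None of these functions are VALL-E's actual API; they are stubs
# standing in for (1) a neural codec encoder such as EnCodec, (2) a language
# model over codec tokens, and (3) the codec decoder.

N_CODEBOOKS = 8        # number of residual codebooks (assumed value)
FRAMES_PER_SEC = 75    # codec frame rate (assumed value)

def encode_to_tokens(waveform: np.ndarray, sample_rate: int) -> np.ndarray:
    """Stand-in for a neural codec encoder: waveform -> discrete tokens."""
    n_frames = int(len(waveform) / sample_rate * FRAMES_PER_SEC)
    return np.random.randint(0, 1024, size=(N_CODEBOOKS, n_frames))

def predict_tokens(text: str, prompt_tokens: np.ndarray, seed: int = 0) -> np.ndarray:
    """Stand-in for the language model: given the target text and the
    three-second acoustic prompt's tokens, predict codec tokens for the
    new utterance (here, just random tokens of a plausible length)."""
    rng = np.random.default_rng(seed)
    n_frames = len(text.split()) * 30  # rough duration heuristic for the stub
    return rng.integers(0, 1024, size=(N_CODEBOOKS, n_frames))

def decode_to_waveform(tokens: np.ndarray, sample_rate: int = 24_000) -> np.ndarray:
    """Stand-in for the codec decoder: discrete tokens -> waveform."""
    n_samples = int(tokens.shape[1] / FRAMES_PER_SEC * sample_rate)
    return np.zeros(n_samples, dtype=np.float32)

# Three seconds of "prompt" audio from the target speaker (silence here).
prompt = np.zeros(3 * 16_000, dtype=np.float32)
prompt_tokens = encode_to_tokens(prompt, sample_rate=16_000)

new_tokens = predict_tokens("Hello from a cloned voice.", prompt_tokens)
speech = decode_to_waveform(new_tokens)
print(speech.shape)
```

The point of the sketch is only the intermediate representation: speech is handled as sequences of discrete codec tokens that a language model can predict, rather than as raw waveforms.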
Microsoft trained the synthesis abilities of its new VALL-E voice model using LibriLight, an audio library assembled by Meta, the parent company of Facebook. It contains 60,000 hours of English-language speech from more than 7,000 different speakers, primarily extracted from LibriVox public-domain audiobooks. For Microsoft’s new AI bot to produce an acceptable result, the voice in the three-second sample must closely resemble a voice in the training data.
In addition to preserving a speaker’s vocal timbre and emotional tone, VALL-E can also imitate the “acoustic environment” of the sample audio. If the sample comes from a telephone call, for example, the synthesised output will simulate the acoustic and frequency characteristics of a phone call. Furthermore, Microsoft’s samples (included in the “Synthesis of Diversity” section of the project page) demonstrate that VALL-E can generate variations in voice tone by changing the random seed used during generation. Microsoft AI Research says it is creating artificial intelligence that complements human reasoning to augment and enrich human experience and competencies.
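As a hypothetical illustration of the random-seed behaviour mentioned above, the snippet below continues the stub pipeline from the earlier sketch; predict_tokens, prompt_tokens, and decode_to_waveform are the placeholder names defined there, not real VALL-E functions.

```python
# Same text and same three-second prompt, three different seeds: each seed
# drives a different sampling path through the stub generator, standing in
# for the different "takes" a sampling-based model can produce.
for seed in (0, 1, 2):
    tokens = predict_tokens("The same sentence, spoken three ways.", prompt_tokens, seed=seed)
    audio = decode_to_waveform(tokens)
    print(f"seed={seed}: {tokens.shape[1]} codec frames, {len(audio)} samples")
```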