When Machines Speak Up


Synthetic voices can turn text into spoken audio with fine control over diction, accent, tempo, intonation and inflection across languages. Datatechvibe delves into the tech that’s helping machines find their voice

Last week, Instagram added two features to Reels: text-to-speech and voice effects. Text-to-speech (TTS) systems were first developed to aid the visually challenged by offering a computer-generated voice that could “read” on-screen text aloud. Nowadays, Instagrammers often use text-to-speech voiceovers not so much for their accessibility benefits as because it’s a funny trend to have a monotone, computerised voice narrating their content. But that’s far from the most exciting part of this news. The platform currently supports 90 languages, including Arabic, Hindi, Japanese and Portuguese, among others. The technology that powers TTS has come a long way.

In 1961, one of the first recorded uses of computer speech synthesis took place on a room-sized IBM computer at Bell Labs, where researchers recreated the song Daisy Bell. The audio quality left much to be desired, and it would be years before it saw much improvement. The advent of Machine Learning and advances in Deep Neural Networks (DNN) were a game-changer for TTS: they made human-sounding synthetic speech possible.

Today, Microsoft Azure offers users more than 270 neural voices across 119 languages and variants. Arabic is the fifth most widely spoken language globally, with more than 400 million speakers. But the language is complex, with some 30 main dialects, which is the main reason Arabic TTS support lagged behind. In 2019, Amazon Polly, a cloud text-to-speech service, added Zeina, an Arabic female voice profile, to its lineup of 59 voices in 29 languages. Zeina speaks Modern Standard Arabic (MSA), the most widely understood version. It was a big win for the region, to say the least. Further, in 2020, California-based startup Kanari AI and the Qatar Computing Research Institute together built a model able to detect 19 Arabic dialects.

Testing tongues

The first thing a TTS system does is convert input text into a linguistic representation. This is called text-to-phonetic or grapheme-to-phoneme conversion. Certain languages pose more of a challenge than others here. In Finnish, for example, the conversion is simple because written text corresponds almost entirely to how it is pronounced. For English and most other languages, the conversion is much more complicated: a large set of rules, and their exceptions, is needed to produce correct pronunciation and intonation in synthesised speech. At this stage, the system also tackles homographs, words spelled the same but pronounced differently according to what they mean (think of read, which can rhyme with either reed or red).
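At its simplest, the lookup stage of grapheme-to-phoneme conversion can be sketched as a dictionary match. The mini-dictionary and ARPAbet-style symbols below are illustrative, not taken from any real TTS system, and the fallback for unknown words is deliberately crude:

```python
# Toy grapheme-to-phoneme lookup: map each word to a phoneme sequence.
# A real system would back off to learned letter-to-sound rules for
# out-of-vocabulary words instead of spelling them out.
PHONEME_DICT = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
}

def graphemes_to_phonemes(text):
    """Convert input text to a flat list of phoneme symbols."""
    phonemes = []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in PHONEME_DICT:
            phonemes.extend(PHONEME_DICT[word])
        else:
            # Crude letter-by-letter fallback for unknown words
            phonemes.extend(ch.upper() for ch in word)
    return phonemes

print(graphemes_to_phonemes("Hello, world!"))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```

A language like Finnish, where spelling tracks pronunciation closely, could get by with something close to the fallback branch; English needs the dictionary and many exception rules.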

After processing the text, the machine must focus on creating the linguistic data needed for correct pronunciation, and on analysing prosodic features for correct intonation, stress and duration. These are language-specific problems. In addition, written text is ambiguous. Words can mean more than one thing, and it falls to human beings to fill in the blanks and add context to make sense of them. Things like numbers, dates, times, abbreviations, acronyms and special characters such as currency symbols need to be turned into words. This is harder than it sounds. The number 1789 might refer to a quantity of items (one thousand seven hundred and eighty-nine), a year or a time (seventeen eighty-nine), or maybe a padlock combination (one-seven-eight-nine). In each case, it needs to be read out differently.
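A toy normaliser makes the ambiguity concrete. The style argument below is a stand-in for the context analysis a real system would perform to decide how a number should be read:

```python
# Expand a numeric token into words, in two of the readings the article
# mentions: as a year ("seventeen eighty-nine") or digit by digit
# ("one-seven-eight-nine"). Hand-rolled for illustration only.
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def two_digits(n):
    """Read a number 0-99 as words."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize_number(token, style):
    """Expand a four-digit token according to a context-dependent style."""
    if style == "year":          # 1789 -> "seventeen eighty-nine"
        return two_digits(int(token[:2])) + " " + two_digits(int(token[2:]))
    if style == "digits":        # 1789 -> "one-seven-eight-nine"
        return "-".join(ONES[int(d)] for d in token)
    raise ValueError(style)

print(normalize_number("1789", "year"))    # seventeen eighty-nine
print(normalize_number("1789", "digits"))  # one-seven-eight-nine
```

The hard part in production is not the expansion itself but choosing the style, which is exactly the context problem described above.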

The human brain fills in a lot of these gaps. But teaching a machine to do it is much like teaching a child to read, with the addition of statistical probability techniques, of course. Here, Hidden Markov Models (HMMs) are a popular solution for spotting patterns in text and adding part-of-speech tags that help the machine work out, from context, how a word should be read.
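A drastically simplified, bigram-style stand-in for an HMM tagger shows the idea: the tag chosen for the homograph "read", and hence its pronunciation, depends on the word before it. All probabilities here are invented for illustration; real systems learn them from tagged corpora:

```python
# Disambiguate "read": REED (present tense) vs RED (past tense),
# using hand-picked P(tag | previous function word) as a toy stand-in
# for an HMM's learned transition probabilities.
STATES = ["PRESENT", "PAST"]
START = {"PRESENT": 0.5, "PAST": 0.5}          # no context: a coin flip
TRANS = {
    "will": {"PRESENT": 0.95, "PAST": 0.05},   # "will read" -> present
    "to":   {"PRESENT": 0.95, "PAST": 0.05},   # "to read"   -> present
    "has":  {"PRESENT": 0.05, "PAST": 0.95},   # "has read"  -> past
    "had":  {"PRESENT": 0.05, "PAST": 0.95},   # "had read"  -> past
}
PRONUNCIATION = {"PRESENT": "R IY D", "PAST": "R EH D"}

def pronounce_read(prev_word):
    """Pick the most likely tag, and thus pronunciation, from context."""
    probs = TRANS.get(prev_word.lower(), START)
    best = max(STATES, key=lambda s: probs[s])
    return PRONUNCIATION[best]

print(pronounce_read("will"))  # R IY D  (rhymes with reed)
print(pronounce_read("has"))   # R EH D  (rhymes with red)
```

A real HMM tagger would run the Viterbi algorithm over the whole sentence rather than looking one word back, but the payoff is the same: a part-of-speech tag that selects the right pronunciation.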

Now, to convert words to sounds, the machine matches the two using a phoneme dictionary. TTS models may take one of three approaches to do this: using recordings of humans saying the phonemes, generating the phonemes by computer, or mimicking the mechanism of the human voice. In the first case, human voices are recorded saying particular phrases to cover the entire range of sounds, and each snippet can then be rearranged. This is also called the concatenative approach, and it was long considered to produce the most natural-sounding voice. Descript does just this with its recently launched Overdub, which lets users clone their voice. After recording a set of phrases and completing some licensing documentation, users can simply type and have their voice feed future videos, virtual assistants and IVR.
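In miniature, concatenative synthesis is a lookup-and-join over stored audio units. The sample values below are placeholders for real recorded clips, and a production system would also smooth the joins between units:

```python
# Concatenative synthesis sketch: pre-recorded units (here, tiny fake
# sample lists standing in for audio clips of phonemes) are looked up
# by phoneme symbol and stitched together in order.
UNIT_DB = {
    "HH": [0.0, 0.1, 0.2],
    "AY": [0.3, 0.4, 0.3],
}

def concatenate(phonemes):
    """Join stored audio units into one waveform (a flat list of samples)."""
    waveform = []
    for p in phonemes:
        waveform.extend(UNIT_DB[p])   # append the pre-recorded snippet
    return waveform

print(concatenate(["HH", "AY"]))
# [0.0, 0.1, 0.2, 0.3, 0.4, 0.3]
```

The approach’s limitation is visible even here: it can only ever rearrange what was recorded, which is where the next approach comes in.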

Computer-generated speech synthesisers, on the other hand, work in much the same way as music synthesisers. This approach is also called formant synthesis, after the resonant frequencies that the human vocal apparatus generates. Unlike concatenation, which is limited to rearranging pre-recorded sounds, formant synthesisers can say absolutely anything, even words that don’t exist. This makes the approach a good choice for GPS navigation, which needs computers to read out the names of places.
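The core of a formant synthesiser can be sketched as a sum of sine waves at a vowel’s formant frequencies. The frequencies below are rough, illustrative values for an "ah"-like vowel; a real synthesiser shapes a source signal with resonant filters rather than summing pure tones:

```python
import math

def formant_vowel(formants, duration=0.02, rate=8000):
    """Synthesise a vowel-like tone by summing sines at formant frequencies."""
    n = int(duration * rate)
    samples = []
    for i in range(n):
        t = i / rate
        s = sum(math.sin(2 * math.pi * f * t) for f in formants)
        samples.append(s / len(formants))   # keep amplitude within [-1, 1]
    return samples

# Rough formant frequencies for an "ah"-like vowel (illustrative values).
wave = formant_vowel([700, 1200, 2600])
print(len(wave))  # 160 samples for 20 ms at 8 kHz
```

Because the sound is generated from parameters rather than recordings, any frequency recipe, and so any word, real or invented, can be produced.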

As TTS advances, it adds ease of use to the many benefits it brings to the table. For example, it can help the travel and tourism industry which has perennially struggled to fluently communicate with visitors from a variety of linguistic backgrounds. The hospitality industry has been using TTS to enable the creation of self-guided audio tours powered by synthetic voices. The sky’s the limit since brands can personalise the voice and tonality, even add accents and cultural context to woo users based on individual preference.

Direct-to-consumer brands have started saving a lot of their funds that would otherwise be spent on voiceover artists, and have been able to trim their customer call centres. Interactive voice response (IVR) systems that use the right voice to handle customer queries are believed to help foster customer loyalty and deepen connections. For example, Nuance’s TTS technology leverages neural network techniques to deliver a human‑like, engaging, and personalised user experience.

The auto industry has been a popular success story when it comes to integrating TTS into a vehicle’s system for voice-enabled, hands-free controls. In education, it helps learners partake in bimodal learning: when educational or training content is presented in both audio and visual formats, learner retention rates have been shown to be higher.


There will soon be more to talk about when it comes to TTS. Google partnered with DeepMind to build an API that delivers near-human-quality speech synthesis with 220+ voices across 40+ languages and variants, including Mandarin, Arabic, Russian and more. Apart from picking a voice, users can adjust the pitch by up to 20 semitones and change the speaking speed. And if that’s not enough, they can customise the speech using Speech Synthesis Markup Language (SSML) tags, which allow them to add pauses, numbers, date and time formatting, and other pronunciation instructions.
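Building SSML by hand is just assembling markup. The speak, break and say-as elements below are standard SSML, though how a given cloud API accepts and interprets the markup varies, so the exact attributes should be checked against the provider’s documentation:

```python
# Assemble a small SSML document: a date read out as a date, a 500 ms
# pause, then more speech. The date value and sentences are made up.
ssml = (
    "<speak>"
    "The archive opens on "
    '<say-as interpret-as="date" format="yyyymmdd">20211115</say-as>.'
    '<break time="500ms"/>'
    "Please hold."
    "</speak>"
)
print(ssml)
```

Passed to a TTS API in place of plain text, markup like this is what lets users control pauses, number and date formatting, and pronunciation without touching the voice model itself.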
