Amazon Is Open-Sourcing MASSIVE Speech Dataset

The dataset will help researchers develop virtual assistants, even for the world’s least-represented languages

Amazon has announced the open-source release of a dataset it calls MASSIVE. The aim is to help researchers scale natural-language-understanding technology to every language. The dataset will assist researchers in developing virtual assistants that generalise easily, even to the world’s least-represented languages. In addition to the dataset, Amazon has also published open-source modelling code to help developers create more capable virtual assistants.

The Multilingual Amazon SLURP for Slot Filling, Intent Classification, and Virtual-assistant Evaluation, or MASSIVE for short, is a “parallel dataset”, meaning that every utterance appears in all 51 languages. It contains one million labelled utterances, covers languages that lack properly labelled data, and comes with open-source code that demonstrates massively multilingual NLU modelling. Alexa is currently available in seven languages, and the company aims to expand it to the more than 7,000 languages spoken around the world.

Professional translators created the dataset by carefully translating the existing English-only SLURP dataset into 50 diverse languages, including ones that lacked labelled data. According to Amazon, the MASSIVE dataset will be especially useful for improving spoken-language understanding, in which audio is converted to text before NLU is performed. Natural language understanding (NLU) is a branch of natural language processing (NLP) that deals with converting human language into a machine-readable format.
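To make "machine-readable format" concrete, here is a minimal Python sketch of what intent classification and slot filling produce for a single utterance. The example utterance, the intent label `alarm_set`, the bracketed `[slot : value]` notation and the helper `parse_utterance` are all assumptions made for illustration; they are not taken from Amazon's release or its modelling code.

```python
import re

# Hypothetical labelled utterance. The "[slot : value]" bracket notation
# and the intent label are assumptions made for this illustration.
ANNOTATED = "wake me up at [time : nine am] on [date : friday]"
INTENT = "alarm_set"

# Matches one "[slot_name : slot value]" annotation.
SLOT_PATTERN = re.compile(r"\[\s*([^:\]]+?)\s*:\s*([^\]]+?)\s*\]")


def parse_utterance(annotated, intent):
    """Turn an annotated utterance into a machine-readable record."""
    # Slot filling: collect each labelled span as name -> value.
    slots = {m.group(1): m.group(2) for m in SLOT_PATTERN.finditer(annotated)}
    # Recover the plain text by replacing each annotation with its value.
    text = SLOT_PATTERN.sub(lambda m: m.group(2), annotated)
    return {"text": text, "intent": intent, "slots": slots}


record = parse_utterance(ANNOTATED, INTENT)
print(record)
```

In a parallel dataset, the same record would exist in every language with the same intent and slot names; only the utterance text and the slot values would be translated.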

The primary shortcoming of voice-controlled personal assistants is that they are available in only a few widely spoken languages. The MASSIVE dataset is a step towards changing that: it spans many languages, including less common ones, so that researchers can build multilingual natural-language-understanding models that adapt smoothly to languages for which training data is scarce. The goal is to let people all over the world use conversational AI systems like Alexa in their native languages.

Amazon is also launching a new competition, Massively Multilingual NLU 2022 (MMNLU-22), which will use the MASSIVE dataset to encourage academics to design models that adapt readily to new languages and to spur the creation of more third-party apps for Alexa. The competition will be hosted on an online platform and will include two tasks. Its results will be presented in December at an EMNLP 2022 workshop in Abu Dhabi and in an online session called Massively Multilingual NLU 2022. The workshop will also feature presentations by guest speakers, along with oral and poster sessions for submitted papers on multilingual natural-language processing. Amazon’s vision is for products like Alexa and Echo to be available to all customers and on all devices.