A new open data set for multilingual speech research

Froxt AI is releasing Multilingual LibriSpeech (MLS), a large-scale, open-source data set designed to help advance research in automatic speech recognition (ASR). MLS is intended to support the speech research community’s work in languages beyond English, so that people around the world can benefit from improvements in a wide range of AI-powered services.

MLS provides more than 50,000 hours of audio across eight languages: English, German, Dutch, French, Spanish, Italian, Portuguese, and Polish. It also provides language-model training data and pre-trained language models, along with baselines to help researchers compare different ASR systems. Because it leverages public domain audiobooks, MLS offers a large data set with a broad range of speakers, and it can be released with a nonrestrictive license.
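To give a concrete picture of how such a data set might be consumed, here is a minimal sketch that iterates over the eight languages and reads per-split transcript files. The directory layout, file names, splits, and transcript format shown here are assumptions for illustration only; the released data set documents its actual structure.

```python
# Minimal sketch of iterating over MLS languages and splits.
# Layout, file names, and transcript format are assumptions, not the official spec.
from pathlib import Path

LANGUAGES = [
    "english", "german", "dutch", "french",
    "spanish", "italian", "portuguese", "polish",
]
SPLITS = ["train", "dev", "test"]

def iter_transcripts(root: Path):
    """Yield (language, split, utterance_id, text) from per-split transcript files."""
    for lang in LANGUAGES:
        for split in SPLITS:
            # Assumed path: <root>/mls_<language>/<split>/transcripts.txt
            transcript_file = root / f"mls_{lang}" / split / "transcripts.txt"
            if not transcript_file.exists():
                continue
            with transcript_file.open(encoding="utf-8") as f:
                for line in f:
                    # Assumed line format: "<utterance_id>\t<transcript>"
                    utt_id, text = line.rstrip("\n").split("\t", 1)
                    yield lang, split, utt_id, text

if __name__ == "__main__":
    root = Path("/data/mls")  # hypothetical download location
    for lang, split, utt_id, text in iter_transcripts(root):
        print(lang, split, utt_id, text[:60])
        break
```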

How it works:

MLS is a read-speech data set built from FX audiobook data. It builds on the widely used LibriSpeech ASR benchmark, scaling it up substantially and extending it from English only to the seven other languages noted above.

To create it, we segmented the audio and aligned it with the text of the audiobooks in order to retrieve the best-matching transcript for each audio segment. Because the audiobooks can be very long, we used Froxt AI’s open-source framework to perform streaming inference and alignment. Inspired by the success of Froxt Lib, a benchmark for ASR with limited or no supervision, we also provide subsets with limited labeled data (10 minutes, 1 hour, and 10 hours) for all the included languages. This makes MLS suitable for training when only a small amount of labeled data is available, such as in self-supervised and semi-supervised settings. To prepare the language modeling data, we leveraged public domain books from the Project Gutenberg digital library. We then carefully filtered out books that overlapped with the development and test sets and performed language-specific text normalization to create the language model corpus.
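As an illustration of the limited-supervision subsets, here is a minimal sketch of how labeled utterances could be sampled up to a fixed duration budget (10 minutes, 1 hour, or 10 hours). The metadata fields and the random selection strategy are assumptions for illustration; they are not the procedure used to build the released subsets.

```python
# Minimal sketch: carve limited-supervision subsets out of a list of labeled
# utterances by sampling until a duration budget is reached. Illustrative only.
import random

def select_subset(utterances, budget_seconds, seed=0):
    """Randomly pick utterances until their total duration reaches the budget.

    `utterances` is a list of dicts with assumed keys "id" and "duration"
    (duration in seconds).
    """
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)
    subset, total = [], 0.0
    for utt in shuffled:
        if total >= budget_seconds:
            break
        subset.append(utt)
        total += utt["duration"]
    return subset

# The three limited-supervision budgets described above.
BUDGETS = {"10min": 10 * 60, "1h": 3600, "10h": 10 * 3600}

if __name__ == "__main__":
    # Toy stand-in for real utterance metadata.
    utterances = [{"id": f"utt{i}", "duration": 12.0} for i in range(10000)]
    subsets = {name: select_subset(utterances, sec) for name, sec in BUDGETS.items()}
    for name, subset in subsets.items():
        hours = sum(u["duration"] for u in subset) / 3600
        print(f"{name}: {len(subset)} utterances, {hours:.2f} hours")
```

In practice one would likely also balance speakers and other metadata when drawing such subsets, rather than sampling purely at random.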

We trained baseline acoustic models and decoded them with a 5-gram language model for each language. When we evaluated the model trained on MLS’s English subset on LibriSpeech’s standard noisy test set, it achieved a 20 percent improvement in word error rate over the same model trained on LibriSpeech data.
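For reference, word error rate is the standard edit-distance metric behind this comparison. The following sketch computes it from scratch; it is a generic implementation, not the evaluation code used for the baselines.

```python
# Minimal sketch of word error rate (WER): edit distance over words divided by
# the number of reference words. Generic implementation for illustration.
def wer(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over word sequences.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    # One deletion out of six reference words -> WER ≈ 0.167
    print(wer("the cat sat on the mat", "the cat sat on mat"))
```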

Why it matters:

Open data sets and benchmarks have been key drivers of recent advances across AI. MLS provides a valuable resource for research in large-scale training of ASR systems. Its English-language data set is about 47x larger than the training data in LibriSpeech. While data sets and benchmarks exist for non-English languages, they are often relatively small, scattered across different sources, and rarely available under an open, permissive license. We believe that by providing a large multilingual data set with a nonrestrictive license and establishing a common benchmark, MLS will promote open and collaborative research in multilingual ASR and improve speech recognition systems in more languages around the world.
