Testing multilingual text-to-speech tools to serve underserved communities

We translated English text to Hindi audio with the India Currents newsroom

With AI-based text-to-speech tools advancing so quickly, we wanted to determine whether any existing tools could transform written English news content into a high-quality audio recording in another language.

To test this, we partnered with India Currents, a community news organization based in the San Francisco Bay Area. The India Currents team was eager to incorporate more audio stories on their website and social media, as well as engage their Indian American community members who often speak multiple Indian languages in addition to English.

With audio news in Hindi, India Currents hoped to foster stronger engagement and community among Indian Americans who prefer to listen to local news in a non-English language.

We envisioned the ideal AI tool as one that would take news stories composed in English and translate them into the target language of Hindi with minimal editing needed for cultural context. Then, the tool would synthesize natural, human-like audio of the translated text using text-to-speech technology that incorporates proper cadence, inflection and pronunciation for that language.

Multilingual text-to-speech tools and their limitations

While many AI vendors boast “multilingual” products, we quickly found that support for text translation is lacking. The “multilingual” features simply meant the tools could vocalize text in different languages, as long as the submitted text was already in that language. When we entered English text and chose a Hindi-speaking voice to read it in Speechify, the content remained in English but in an Indian accent – which was definitely not what we were aiming for.

We then discovered another AI tool called SeamlessM4T that claimed to offer text-to-text translation, text-to-speech translation, and speech-to-speech translation all in one. At first glance, this seemed like the perfect solution.

However, the translations it produced were error-prone and lacked proper context. The Hindi audio sounded robotic and was spoken far too quickly, with overlapping words that made it impossible for a Hindi speaker to understand. It even struggled with English text-to-speech. On top of that, SeamlessM4T could only translate snippets of audio up to 20 seconds long.

It’s likely these tools will improve in the not-so-distant future with ongoing advances in natural language processing, but they are not fully functional yet. We needed to continue researching to find an AI tool or combination of tools capable of high-quality text-to-speech translation across languages.

Screenshot from SeamlessM4T — A screenshot showing the text-to-speech and translation function of SeamlessM4T.

Our solution

Because current AI tools cannot yet perform integrated text translation and speech synthesis well, our solution is a two-step process. First, we used a translation tool to translate the English news stories into Hindi text. Though imperfect, this provided an initial Hindi version of the story. We tested various translation tools and found that Google Translate provided the most accurate Hindi translations from English. Next, we used this translated Hindi text as input for a text-to-speech AI tool to generate an audio file of the news story in Hindi.
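For newsrooms that want to script this workflow, the two steps can be pictured with a minimal Python sketch. Everything named here is hypothetical: `english_to_hindi_audio`, the `translate`, `synthesize` and `review` callables, and the stand-in functions are illustrations, not the API of Google Translate or of any text-to-speech vendor. A real setup would wrap whichever services the newsroom chooses behind the same two signatures.

```python
from typing import Callable

# Hypothetical signatures: any translation or speech service could be
# wrapped to match them.
Translator = Callable[[str], str]      # English text -> Hindi text
Synthesizer = Callable[[str], bytes]   # Hindi text -> audio bytes


def english_to_hindi_audio(
    english_text: str,
    translate: Translator,
    synthesize: Synthesizer,
    review: Callable[[str], str] = lambda hindi: hindi,
) -> bytes:
    """Translate, optionally let a human review, then synthesize audio."""
    hindi_text = translate(english_text)  # Step 1: machine translation
    hindi_text = review(hindi_text)       # Human editing hook (no-op by default)
    return synthesize(hindi_text)         # Step 2: text-to-speech


# Stand-in functions for demonstration only; they call no real service.
def fake_translate(text: str) -> str:
    return f"[hi] {text}"


def fake_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


audio = english_to_hindi_audio("Local news update", fake_translate, fake_synthesize)
```

Keeping the review step as its own hook reflects the point that machine translation still needs a human editor before the text is turned into audio.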

This approach of separate translation and speech synthesis is not as sophisticated as the ideal single AI tool we initially envisioned. However, we were able to work around the technological constraints to still achieve our broader goal, even if the translations and audio may lack some nuance. As AI research continues to advance, we hope to eventually find an integrated tool that can perform proficient text-to-speech translation seamlessly across multiple languages.

We evaluated each text-to-speech tool on three aspects: pricing, user experience and sound quality. For sound quality, we preferred voices with a comfortable speed, accurate pronunciation, a human-like sound and an appropriate tone. We wanted to deliver the news story with authentic and sincere audio. For instance, we didn’t want an overly cheery voice describing a serious story.

Step 1

Because India Currents is a small team, they were looking for a tool that would require minimal human intervention – though with any AI tool, we knew that some editing and review would be necessary. Because of this, we first compared three possible tools for machine translation: Google Translate, Claude, and QuillBot.

As others have found, a specific translation tool like Google’s often works better than a generative AI chatbot like Claude or ChatGPT.

Google Translate

Claude

QuillBot

After checking with Hindi speakers, we found that Google Translate was the most accurate of the three. Users can translate text, files, images and websites, but the translation tool has a 5,000-character limit.
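Stories longer than the character limit have to be translated in pieces. The sketch below shows one way to split a story at sentence boundaries so that each piece stays under a limit. The function name `split_for_translation` is hypothetical, and the 5,000-character default is simply the limit noted above. The sketch assumes sentences end with `.`, `!`, `?` or the Hindi danda (`।`) followed by whitespace, which will not hold for every text.

```python
import re


def split_for_translation(text: str, max_chars: int = 5000) -> list[str]:
    """Split text into chunks of at most max_chars, breaking after sentences."""
    # Split on whitespace that follows sentence-ending punctuation,
    # keeping the punctuation attached to its sentence.
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # A single sentence longer than the limit is cut at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


pieces = split_for_translation("One. Two. Three.", max_chars=10)
```

Splitting at sentence boundaries rather than at a fixed character count keeps each sentence whole, which gives the translation tool more context to work with.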

We also found that most machine translation tools could not reliably apply Hindi's gendered grammar to match the gender of the narrator or author. Human editing is still needed to correct machine translation errors.

Screenshot from Google Translate showing text in English and Hindi

Screenshot of IndiaCurrents in Hindi — Google Translate proved to be the most reliable machine translation when compared with other tools that are not designed specifically for this purpose.

Step 2

We tested three text-to-speech AI tools, ElevenLabs, Speechify and LOVO.ai, to generate audio from Hindi text files. For each tool, we selected the voice that best fit the narration and turned the text into speech. Speechify and LOVO.ai require paid subscriptions to export the audio file, while ElevenLabs allows downloading audio files for free. All three offered free plans for testing the tool.

ElevenLabs

English Version

Hindi Version

Without a paid subscription, ElevenLabs limits users to generating 10,000 characters per month. The Creator pricing level, which starts at $11 for the first month and then increases to $22/month, allows users to upload long-form content and full documents. The price level below that starts at $1/month and increases to $5/month for 30,000 characters.
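A quick way to judge whether a free plan is enough is to count characters before generating any audio. The sketch below is a hypothetical helper, `stories_within_quota`, that counts how many stories fit in a monthly allowance when they are processed in order. It assumes the quota is measured in characters of the submitted text, here the Hindi translation, and the example lengths are made up for illustration.

```python
def stories_within_quota(story_lengths: list[int], monthly_quota: int = 10_000) -> int:
    """Count how many stories, taken in order, fit within the character quota."""
    used = 0
    count = 0
    for length in story_lengths:
        if used + length > monthly_quota:
            break  # The next story would exceed the monthly allowance.
        used += length
        count += 1
    return count


# Made-up character counts for four translated stories.
example_lengths = [3_200, 4_100, 2_900, 5_000]
fits = stories_within_quota(example_lengths)
```

In practice the lengths would come from `len(hindi_text)` for each translated story, since the translated text can be longer or shorter than the English original.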

ElevenLabs has a limited number of voice and language selections, even with a paid subscription. For this project, we could only find one female voice in Hindi. The sound is very realistic with a natural pace, though the tone could be more human-like.

Despite the limitations, we believe that ElevenLabs would be a good starting point for newsrooms because of its free downloads, simplicity and natural-sounding voices. The India Currents team also chose ElevenLabs as their favorite when comparing the audio files in Hindi.

Screenshot of Speech Synthesis form from ElevenLabs — ElevenLabs may be a great tool to start with since it allows users to download files in the free version.

Speechify

English Version

Hindi Version

Speechify’s Voice Over tool offers the widest variety of languages, including nine official languages of India, along with male and female voices, accents and dialects. It also has strong editing features, such as adding music, images and video to the audio, and voice cloning. The voices in Speechify are very human-like, but many of them sound too jovial for news narration. Because there are so many voices to choose from, though, it is much more likely that you will find a fitting one for different news stories, with more effort of course.

Speechify and Genny, the text-to-speech tool from LOVO.ai, both have tools for updating the pronunciation of specific words using phonetic spelling. However, listen carefully during review to make sure that the pronunciation is applied throughout. For example, in the English version, we corrected the pronunciation of the name Neha using Genny, and it followed that pronunciation in one place but then mispronounced it in another. Similarly, you may have to update the pronunciation of the same word multiple times in Speechify.

Speechify divides text into ‘text blocks’ of 1,000 characters, and users can edit the sound and voice for each text block. This feature could be great for interviews and stories with quotes from different people.

Speechify also has the best collaboration features among the three tools. Users can share a generated audio project through a web link, and the recipient can duplicate the file, edit it, and then export it or share the link with others for collaboration.

Screenshot of Edit Voice Over form from Speechify — Speechify includes several editing tools within its platform.

LOVO.ai

English Version

Hindi Version

Genny from LOVO.ai is the most expensive tool among the three. Users must pay a subscription fee of at least $19 per month to use its ‘multi-language’ feature. The free version can only turn English text into speech and does not allow exporting or downloading the file. We needed to pay $25 for one month of the basic subscription in order to test the text-to-speech tool in Hindi. LOVO.ai does have the most realistic and authentic sound, with a more human-like tone and pace, among the three tools.

Screenshot from Genny by LOVO.ai — While the multi-language features of Genny by LOVO.ai are not available in the free version, we appreciated the user experience of this tool.

The plus side of LOVO.ai is that it has in-house editing features, including AI writing and art generation. Like Speechify, it also has a sharing feature for copying and sharing the audio link with others, or sharing it directly on social media platforms. For privacy, users can set access codes to share with only select people, although those people cannot edit the audio. For newsrooms that are looking for a more creative approach to engaging their community, and for those that have a budget for technology, LOVO.ai would be a great tool to try.

Compare and contrast

Table with information comparing features among Speechify, ElevenLabs and LOVO.ai


RJI

University of Missouri


Yuxi Lei
