Xiaomi has announced that its AI Lab’s new-generation Kaldi team has open-sourced a new text-to-speech (TTS) model called OmniVoice. According to the company, the model is designed to deliver high-quality speech synthesis across hundreds of languages while also supporting voice cloning and customizable speech generation.
The announcement was shared through Xiaomi’s official WeChat account, where the company claimed that OmniVoice performs strongly in both Chinese and English scenarios and competes with, and in some multilingual tasks surpasses, existing commercial systems.
1. Xiaomi OmniVoice focuses on multilingual speech synthesis
One of the biggest highlights of OmniVoice is its support for low-resource languages. Xiaomi says the model can generate speech in “almost any language imaginable,” including languages with very limited online training data. The company describes OmniVoice as the industry’s first voice cloning TTS model that covers hundreds of languages.
In multilingual testing, OmniVoice outperformed several commercial systems across 24 languages in terms of speech similarity and intelligibility, even when trained only on open-source datasets. The company also claims that in testing across 102 languages, OmniVoice's speech intelligibility was close to, or in some cases better than, real human speech.
The model is also designed to work with limited training data. According to the company, languages with fewer than 10 hours of training material can still yield high-quality speech synthesis, which could help expand speech technology support for smaller regional and niche languages.
2. Simpler architecture with faster performance
Xiaomi also says OmniVoice uses a much simpler architecture compared to many current speech synthesis systems. Instead of relying on several different modules and prediction stages, the model uses a single bidirectional Transformer network to directly convert text into speech. This removes the need for separate text modeling, complex hybrid structures, and multi-level token prediction systems that are commonly found in modern TTS models.
The simplified design also improves speed: Xiaomi claims OmniVoice can complete training on 100,000 hours of data in a single day. During inference, the model can run at up to 40 times real-time speed using PyTorch, which could make it easier to deploy in consumer applications and services.
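A claim like "40 times real-time" is normally expressed as a real-time factor (RTF): seconds of audio produced per second of wall-clock compute. A minimal sketch of the arithmetic, with made-up timings rather than actual OmniVoice measurements:

```python
# Real-time factor (RTF), here defined as audio duration divided by
# synthesis time. "40x real time" means RTF >= 40 under this convention.
# (Some papers use the inverse, compute_seconds / audio_seconds.)

def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of audio produced per second of computation."""
    return audio_seconds / compute_seconds

# Hypothetical example: 60 s of speech synthesized in 1.5 s of compute.
rtf = real_time_factor(60.0, 1.5)
print(f"RTF = {rtf:.1f}x real time")  # RTF = 40.0x real time
```

Under this convention, higher is faster; a system at RTF 40 could, in principle, keep 40 concurrent audio streams ahead of playback on the same hardware.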
According to Xiaomi, two major design choices helped improve the model’s performance. The first is a “full codebook random masking strategy,” which reportedly boosts training efficiency and overall model capability.
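Xiaomi's announcement does not spell out the masking recipe, but in masked-token TTS systems speech is typically represented as several parallel codec codebooks, and "full codebook" masking would plausibly mean masking random time steps across all codebooks at once rather than one codebook level at a time. The sketch below illustrates that idea only; the function name, sentinel value, and mask ratio are illustrative assumptions, not OmniVoice's actual implementation:

```python
import random

MASK_ID = -1  # sentinel for a masked codec token (illustrative choice)

def mask_all_codebooks(codes, mask_ratio, rng):
    """Mask the same random time steps across every codebook.

    codes: list of codebooks, each an equal-length list of token ids.
    Returns (masked_codes, masked_positions).
    """
    num_steps = len(codes[0])
    num_masked = max(1, int(mask_ratio * num_steps))
    positions = set(rng.sample(range(num_steps), num_masked))
    masked = [
        [MASK_ID if t in positions else tok for t, tok in enumerate(book)]
        for book in codes
    ]
    return masked, sorted(positions)

rng = random.Random(0)
# Toy example: 3 codebooks, 8 time steps of fake codec tokens.
codes = [[rng.randrange(1024) for _ in range(8)] for _ in range(3)]
masked, targets = mask_all_codebooks(codes, mask_ratio=0.5, rng=rng)
```

In such a setup, a bidirectional (non-autoregressive) Transformer is trained to predict the original tokens at the masked positions, conditioned on the input text, which is consistent with the single-network design described above.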
The second is the use of a large language model during pre-training. Xiaomi says this is the first time a large language model has been effectively integrated into a non-autoregressive TTS model to improve pronunciation accuracy and speech intelligibility.
3. Real-world use features
Alongside multilingual speech generation, OmniVoice includes several practical features. Users can create custom voices simply by describing characteristics such as age, gender, pitch, accent, dialect, or speaking style. The model can even generate whispering voices and other special speech styles without requiring a reference audio sample.
Another feature focuses on noisy audio environments. Xiaomi says OmniVoice can automatically remove background noise from reference recordings and extract clearer voice characteristics, allowing better-quality voice cloning even when the original audio is recorded in less-than-ideal conditions.
The model also supports expressive speech synthesis through intonation controls, including laughter and sighing effects, making generated voices sound more natural and conversational.
For pronunciation accuracy, OmniVoice includes tools that allow users to manually correct difficult pronunciations, including polyphonic Chinese characters and English proper nouns. Xiaomi says this can improve the reliability of synthesized speech in real-world applications.