
Xiaomi has announced that its AI Lab’s new-generation Kaldi team has open-sourced a new text-to-speech (TTS) model called OmniVoice. According to the company, the model is designed to deliver high-quality speech synthesis across hundreds of languages while also supporting voice cloning and customizable speech generation.

The announcement was shared through Xiaomi’s official WeChat account, where the company claimed that OmniVoice performs strongly in both Chinese and English scenarios and competes with, and in some multilingual tasks surpasses, existing commercial systems.


1. Xiaomi OmniVoice focuses on multilingual speech synthesis

One of the biggest highlights of OmniVoice is its support for low-resource languages. Xiaomi says the model can generate speech in “almost any language imaginable,” including languages with very limited online training data. The company describes OmniVoice as the industry’s first voice cloning TTS model that covers hundreds of languages.

According to Xiaomi, OmniVoice outperformed several commercial systems across 24 languages in multilingual testing, measured by speech similarity and intelligibility, despite being trained only on open-source datasets. The company also claims that in testing across 102 languages, OmniVoice’s speech intelligibility was close to, or in some cases better than, real human speech.

The model is also designed to work with limited training data. According to Xiaomi, even languages with fewer than 10 hours of training material can still achieve high-quality speech synthesis, which could help expand speech technology support for smaller regional and niche languages.

2. Simpler architecture with faster performance

Xiaomi also says OmniVoice uses a much simpler architecture compared to many current speech synthesis systems. Instead of relying on several different modules and prediction stages, the model uses a single bidirectional Transformer network to directly convert text into speech. This removes the need for separate text modeling, complex hybrid structures, and multi-level token prediction systems that are commonly found in modern TTS models.

The simplified design also improves speed: Xiaomi claims OmniVoice can complete training on 100,000 hours of data in a single day. During inference, the model can run at up to 40 times real-time speed using PyTorch, which could make it easier to deploy in consumer applications and services.
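To put those figures in perspective, the short Python sketch below works through the arithmetic. It only restates Xiaomi's claimed numbers (40 times real-time, 100,000 hours in one day); the function names are hypothetical and chosen for illustration, and actual speed will depend on hardware and settings that the announcement does not specify.

```python
def synthesis_seconds(audio_seconds, speedup=40.0):
    """Wall-clock time needed to generate `audio_seconds` of speech
    at a given multiple of real-time speed (40x is Xiaomi's claimed peak)."""
    return audio_seconds / speedup


def training_hours_per_wall_hour(data_hours=100_000.0, wall_hours=24.0):
    """Hours of audio processed per wall-clock hour if the whole
    dataset is consumed within `wall_hours` (one day, per the claim)."""
    return data_hours / wall_hours


if __name__ == "__main__":
    # One minute of speech at the claimed peak inference speed.
    print(synthesis_seconds(60))
    # Implied training throughput for 100,000 hours in a single day.
    print(round(training_hours_per_wall_hour()))
```

In other words, at the claimed peak speed a one-minute clip would take about 1.5 seconds to generate, and the training claim implies processing roughly 4,167 hours of audio per wall-clock hour.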

According to Xiaomi, two major design choices helped improve the model’s performance. The first is a “full codebook random masking strategy,” which reportedly boosts training efficiency and overall model capability. 
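Xiaomi's announcement does not spell out how the masking strategy works, so the following Python sketch is only one plausible reading, not the company's implementation. It assumes speech is represented as a grid of discrete tokens (one row per codebook, one column per frame) and that training hides a random subset of positions drawn from every codebook at once, rather than masking one codebook level at a time. The `MASK_ID` value and the function name are hypothetical.

```python
import random

MASK_ID = -1  # hypothetical placeholder id standing in for a masked token


def random_mask_all_codebooks(tokens, mask_ratio, rng):
    """Mask a random subset of positions sampled across all codebook rows.

    tokens: list of rows, one per codebook, each a list of token ids per frame.
    Returns (masked_tokens, is_masked), both shaped like `tokens`.
    """
    num_codebooks = len(tokens)
    num_frames = len(tokens[0])
    # Every (codebook, frame) cell is a candidate, so no codebook is exempt.
    positions = [(c, t) for c in range(num_codebooks) for t in range(num_frames)]
    num_masked = round(mask_ratio * len(positions))
    chosen = set(rng.sample(positions, num_masked))

    masked_tokens = [
        [MASK_ID if (c, t) in chosen else tokens[c][t] for t in range(num_frames)]
        for c in range(num_codebooks)
    ]
    is_masked = [
        [(c, t) in chosen for t in range(num_frames)]
        for c in range(num_codebooks)
    ]
    return masked_tokens, is_masked


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy grid: 3 codebooks x 4 frames of made-up token ids.
    grid = [[10, 11, 12, 13], [20, 21, 22, 23], [30, 31, 32, 33]]
    masked, flags = random_mask_all_codebooks(grid, mask_ratio=0.5, rng=rng)
    for row in masked:
        print(row)
```

In a masked-prediction setup of this kind, the model would be trained to recover the original ids at the masked positions from the text and the remaining visible tokens; whether OmniVoice follows exactly this scheme is not confirmed by the source.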

The second is the use of a large language model during pre-training. Xiaomi says this is the first time a large language model has been effectively integrated into a non-autoregressive TTS model to improve pronunciation accuracy and speech intelligibility.

3. Real-world use features

Alongside multilingual speech generation, OmniVoice includes several practical features. Users can create custom voices simply by describing characteristics such as age, gender, pitch, accent, dialect, or speaking style. The model can even generate whispering voices and other special speech styles without requiring a reference audio sample.

Another feature focuses on noisy audio environments. Xiaomi says OmniVoice can automatically remove background noise from reference recordings and extract clearer voice characteristics, allowing better-quality voice cloning even when the original audio is recorded in less-than-ideal conditions.

The model also supports expressive speech synthesis through intonation controls, including laughter and sighing effects, making generated voices sound more natural and conversational.

For pronunciation accuracy, OmniVoice includes tools that allow users to manually correct difficult pronunciations, including polyphonic Chinese characters and English proper nouns. Xiaomi says this can improve the reliability of synthesized speech in real-world applications.

(Github | Demo | Huggingface)
