Back in March, Xiaomi introduced its MiMo-V2-TTS speech synthesis model, which focuses on detailed control over tone, emotion, and speaking style. The company said at the time that it could handle everything from natural conversations to singing, with support for multiple Chinese dialects.
Now, Xiaomi is updating that work with a system that covers both how machines speak and how they listen. The company has announced the MiMo-V2.5-TTS series alongside MiMo-V2.5-ASR, positioning the two together as a “full-link” voice model stack for what it calls the agent era.

The output models
On the synthesis side, the MiMo-V2.5-TTS series includes three different models, all available through Xiaomi’s MiMo Open Platform for a limited time at no cost. Each model shares a common framework for style instructions, audio tag controls, and text understanding, but they target slightly different use cases.
The base MiMo-V2.5-TTS model comes with a set of prebuilt voices and allows detailed adjustments to speech rate, tone, and emotion.
Meanwhile, MiMo-V2.5-TTS-VoiceDesign lets users generate entirely new voice timbres from just a short input sentence.
The third option, MiMo-V2.5-TTS-VoiceClone, focuses on reproducing a specific voice from a small number of samples while maintaining consistency across different styles and instructions.
A big part of Xiaomi’s pitch here is how the model interprets instructions. Instead of relying on structured parameters, users can describe how a voice should sound in plain language, almost like directing a voice actor. For more complex use cases, like game characters or audio dramas, the system also supports layered script-style inputs where character traits, scenes, and dialogue can be adjusted independently without breaking consistency.
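To make the layered-input idea concrete, here is a minimal sketch of what such a script-style prompt could look like. The field names (`character`, `scene`, `lines`) and their structure are illustrative assumptions, not Xiaomi's documented schema:

```python
# Hypothetical layered, script-style TTS prompt.
# Field names ("character", "scene", "lines") are assumptions for
# illustration only -- Xiaomi has not published the exact schema.
script_prompt = {
    "character": {
        "name": "Narrator",
        "traits": "middle-aged male, warm, unhurried",
    },
    "scene": "late-night radio studio, intimate atmosphere",
    "lines": [
        {"style": "calm, reflective", "text": "It was raining again."},
        {"style": "suddenly brighter", "text": "And then the phone rang."},
    ],
}

# Because the layers are independent, the scene can be swapped out
# while the character definition stays fixed, which is what keeps the
# voice consistent across scenes.
script_prompt["scene"] = "crowded train platform, announcements nearby"
```

The point of the separation is that changing one layer, such as the scene, does not require rewriting the character description or the dialogue.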
The models also introduce inline audio tags, which allow users to control emotion or delivery at specific points within a sentence. These tags can be mixed within the same text and are claimed to work across both Chinese and English.
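As a rough illustration of how inline tags interleave with text, the sketch below splits a tagged sentence into segments, each paired with the tag that governs it. The `[tag]` bracket syntax is an assumption chosen for readability; Xiaomi has not published the exact tag format here:

```python
import re

# Assumed [tag] syntax for inline audio tags -- illustrative only.
TAG = re.compile(r"\[(\w+)\]")

def split_tagged(text):
    """Split a tagged sentence into (active_tag, segment) pairs."""
    # re.split with a capturing group alternates text and tag names:
    # [leading text, tag, text, tag, text, ...]
    parts = TAG.split(text)
    segments = []
    if parts[0]:
        segments.append((None, parts[0]))  # text before any tag
    for i in range(1, len(parts), 2):
        if parts[i + 1]:
            segments.append((parts[i], parts[i + 1]))
    return segments

print(split_tagged("She paused [whisper]then leaned in[laugh] and smiled."))
# → [(None, 'She paused '), ('whisper', 'then leaned in'),
#    ('laugh', ' and smiled.')]
```

A synthesis engine consuming such segments could then render each span with the emotion or delivery the tag names, which is the behavior Xiaomi describes.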
The input model
On the input side, Xiaomi is releasing MiMo-V2.5-ASR as an open-source model. According to Xiaomi, the speech recognition system here is designed to handle less predictable, real-world scenarios, including bilingual conversations, regional dialects, and noisy environments.
The ASR model supports several Chinese dialects such as Wu, Cantonese, Minnan, and Sichuanese, while also performing well in complex English scenarios. It can switch between Chinese and English without requiring preset language tags, and it’s capable of recognizing song lyrics even when music and vocals are mixed together.
The model also targets situations with multiple speakers, such as meetings, and can transcribe overlapping conversations with a degree of speaker separation. Xiaomi says it can maintain accuracy even in high-noise environments or with far-field audio pickup.
Another detail here is how the system handles punctuation and structure. Instead of outputting raw text that needs cleanup, MiMo-V2.5-ASR includes native punctuation based on both phonetics and context. As a result, the transcripts are usable without the need for much post-processing.
In terms of performance, Xiaomi claims the model reaches state-of-the-art or near state-of-the-art results across several benchmarks, including bilingual recognition, dialect handling, and code-switching scenarios.
The TTS models are accessible through Xiaomi’s platform and can also be tested in MiMo Studio, while the ASR model is available with open-source weights and code for direct use or further customization.