Wed May 21 2025 • 13 min Read
Evaluating the Lightning-v2 Multilingual TTS Model
Benchmarking
Akshat Mandloi
Data Scientist | CTO
Smallest.ai's Lightning-v2 is a general-purpose streaming multilingual TTS model designed for real-time applications. Lightning-v2 has proven itself to be a versatile model, supporting 16 languages with another 9 in beta testing. This blog presents benchmark results for Lightning-v2 against Eleven Labs' real-time model, Eleven Flash v2.5.
In our evaluation, we synthesized audio samples at 24 kHz with both models running in streaming mode and assessed them across nine languages (German, English, Spanish, French, Italian, Dutch, Polish, Russian, and Arabic) using well-established speech-quality metrics.
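Because both models stream their output, each synthesis call yields raw audio chunks that are buffered into a 24 kHz WAV file before scoring. The snippet below is a minimal sketch of that buffering step, assuming 16-bit mono PCM chunks; `stream_tts_chunks` is a placeholder for whichever vendor SDK call produces the audio, not an actual API from either provider.

```python
import wave
from typing import Iterable

SAMPLE_RATE = 24_000  # both models were evaluated at 24 kHz
SAMPLE_WIDTH = 2      # assume 16-bit PCM
CHANNELS = 1          # mono speech

def save_stream_to_wav(chunks: Iterable[bytes], path: str) -> None:
    """Buffer raw PCM chunks from a streaming TTS response into a WAV file."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH)
        wav.setframerate(SAMPLE_RATE)
        for chunk in chunks:
            wav.writeframes(chunk)

# Illustrative usage: stream_tts_chunks() stands in for a vendor SDK call
# that yields raw 16-bit PCM chunks for a given sentence.
# save_stream_to_wav(stream_tts_chunks("Guten Morgen!"), "de_sample_001.wav")
```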
The following sections describe the metrics used, present the aggregated results, and analyze performance language by language.
Ensemble of Mean Opinion Score Models
To reliably assess the quality of the audio generated by the models, we use an ensemble of mean opinion score (MOS) prediction models. After extensive research across the literature, we settled on the following four metrics.
These metrics were chosen to assess the models from complementary angles and are described in detail below:
- SIGMOS is an objective, non-intrusive speech-quality metric introduced in the ICASSP 2024 Speech Signal Improvement Challenge. It assesses full-band audio quality, something most neural metrics had not previously supported, enabling automated, standardized evaluation across seven perceptual dimensions such as clarity, reverberation, and noisiness.
- NISQA estimates speech quality from a degraded audio signal without requiring a clean reference. It provides not only an overall MOS (Mean Opinion Score), but also detailed predictions across four perceptual dimensions: noisiness, coloration, discontinuity, and loudness.
- SHEET-SSQA MOS is a model-based metric within the broader SHEET toolkit—a Speech Human Evaluation Estimation Toolkit—designed to predict human Mean Opinion Scores (MOS) using self-supervised learning (SSL) representations, notably via SSL-MOS models.
- WVMOS is a widely used metric for speech analysis. What sets it apart is its robustness across multiple datasets and speakers and its close alignment with subjective human judgments, making it especially useful for benchmarking TTS and voice conversion systems.
We have also developed an in-house voice evaluation framework that runs these models consistently, making evaluations repeatable.
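As a rough illustration of how such an ensemble can be wired up, the sketch below wraps each predictor behind a common interface and averages the four MOS estimates per clip. The `sigmos_mos`-style callables are hypothetical placeholders for the actual SIGMOS, NISQA, SHEET-SSQA, and WVMOS inference code, not part of any of those toolkits.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict

# A MOS predictor takes a WAV path and returns a predicted MOS on a 1-5 scale.
MosPredictor = Callable[[str], float]

@dataclass
class EnsembleResult:
    per_metric: Dict[str, float]  # individual metric scores
    ensemble: float               # mean across the predictors

def score_clip(wav_path: str, predictors: Dict[str, MosPredictor]) -> EnsembleResult:
    """Run every MOS predictor on one audio clip and average the results."""
    per_metric = {name: fn(wav_path) for name, fn in predictors.items()}
    return EnsembleResult(per_metric=per_metric, ensemble=mean(per_metric.values()))

# Illustrative wiring; sigmos_mos, nisqa_mos, sheet_ssqa_mos, and wvmos_mos
# would be thin wrappers around the respective open-source models.
# predictors = {"SIGMOS": sigmos_mos, "NISQA": nisqa_mos,
#               "SHEET-SSQA": sheet_ssqa_mos, "WVMOS": wvmos_mos}
# result = score_clip("de_sample_001.wav", predictors)
```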
Evaluation Process and Overall Results
To evaluate the models with these metrics, we used a diverse set of sentences randomly generated by LLMs. We then synthesized audio for these sentences with both models using the default settings given in their respective documentation, and scored the generated audio with the metrics described above; the results are shared below.
To get a single decisive number for comparing the speech generations, we use the average of the scores predicted by our ensemble of MOS models.
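Conceptually, the per-language numbers in the table below come from a loop like the one sketched here: synthesize every test sentence with a given model, score each clip with the ensemble, and average per language. The `synthesize` and `score_clip` callables are illustrative placeholders rather than the vendors' actual SDK functions.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List

def benchmark(
    sentences_by_language: Dict[str, List[str]],
    synthesize: Callable[[str, str], str],   # (language, text) -> path of generated WAV
    score_clip: Callable[[str], float],      # WAV path -> ensemble MOS (see earlier sketch)
) -> Dict[str, float]:
    """Average the ensemble MOS over all test sentences, per language."""
    scores: Dict[str, List[float]] = defaultdict(list)
    for language, sentences in sentences_by_language.items():
        for text in sentences:
            wav_path = synthesize(language, text)
            scores[language].append(score_clip(wav_path))
    return {language: round(mean(vals), 2) for language, vals in scores.items()}
```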
| Language | Eleven Labs Flash v2.5 (avg. MOS) | Smallest.ai Lightning-v2 (avg. MOS) |
| --- | --- | --- |
| Arabic | 4.01 | 4.04 |
| German | 3.99 | 4.04 |
| Spanish | 4.21 | 4.25 |
| French | 4.36 | 4.28 |
| Italian | 4.17 | 4.20 |
| Dutch | 4.34 | 4.23 |
| Polish | 4.29 | 4.30 |
| Russian | Not supported | 4.17 |
| English | 4.23 | 4.28 |
Language-Wise Performance Analysis
In this section, we further detail Lightning-v2's results for each evaluated language on the individual metrics, alongside samples demonstrating the model's ability to generate highly intelligible and expressive speech.
German (de): German outputs achieve strong MOS predictor scores (SIGMOS 3.63, NISQA 4.12), indicating listeners would likely find the speech natural and pleasant. Overall, the German voice is clear, natural, and easy to understand.
English (en): English generations perform exceptionally well, with high SIGMOS (3.88) and NISQA (4.47) scores reflecting natural prosody. The WVMOS (4.26) further suggests consistently natural-sounding audio. Together with the strong SSQA (4.54), these metrics confirm that English outputs are highly natural and engaging.
Spanish (es): Spanish outputs are well-balanced, with SIGMOS (3.83) and NISQA (4.51) among the highest, pointing to smooth, natural-sounding speech. SSQA (4.63) is excellent, while WVMOS (4.04) indicates consistently natural audio. Overall, Spanish speech sounds polished and easy to follow.
French (fr): French speech combines strong perceptual quality with excellent intelligibility. SSQA (4.64) is the highest across languages, complemented by NISQA (4.27) and WVMOS (3.90). Together, these results suggest that French outputs are technically precise and highly pleasant to listen to.
Italian (it): Italian outputs maintain consistently strong scores. NISQA (4.39) indicates a highly natural voice quality, while SSQA (4.63) and WVMOS (4.01) confirm smooth delivery and clarity. These results suggest Italian speech is clear, engaging, and listener-friendly.
Dutch (nl): Dutch generations show impressive perceptual results. NISQA (4.50) and SSQA (4.55) are both very high, reflecting smooth and natural speech. The strong WVMOS (4.23) confirms Dutch outputs are notably pleasant and easy to follow.
Polish (pl): Polish speech exhibits excellent performance. NISQA (4.53) and SSQA (4.62) are high, reflecting naturalness and clarity. The strong WVMOS (4.47) further indicates that Polish speech is both easy to understand and of very high perceived quality.
Russian (ru): Russian outputs deliver consistently strong perceptual metrics. SIGMOS (3.75) and NISQA (4.57) indicate smooth, natural-sounding speech. The SSQA (4.59) adds further clarity, though the WVMOS (3.76) is slightly lower compared to others. Overall, Russian outputs are highly listenable and natural.
Arabic (ar): Arabic outputs demonstrate balanced quality. NISQA (4.13) and SSQA (4.63) indicate natural voice quality and good clarity. While the WVMOS (3.59) is lower than some other languages, the SIGMOS (3.83) confirms intelligibility. Overall, Arabic speech feels clear, measured, and listener-friendly.
Conclusion
Across all evaluated languages, Lightning-v2 demonstrates strong performance on metrics of naturalness, clarity, and overall audio quality. Every language achieves MOS predictions in the 3.5–4.6 range, consistent with human-perceived high-quality speech. These results confirm that Lightning-v2 reliably generates natural, intelligible, and engaging speech across English, Spanish, French, Italian, Dutch, Polish, Russian, Arabic, and German.