Evaluating the Lightning-v2 Multilingual TTS Model

Benchmarking

Akshat Mandloi

Updated on December 18, 2025 at 12:50 PM

Smallest.ai's Lightning-v2 is a general-purpose, streaming, multilingual TTS model designed for real-time applications. It supports 16 languages, with another 9 in beta testing. This blog presents benchmark results for Lightning-v2 against Eleven Labs' real-time model, Eleven Flash v2.5.

In our evaluation, both models generated audio samples in streaming mode at 24 kHz. We then assessed the samples across nine languages (German, English, Spanish, French, Italian, Dutch, Polish, Russian, and Arabic) using well-established speech quality metrics.
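As a rough illustration of how streamed output could be collected for scoring, the sketch below joins audio chunks into a single 24 kHz WAV file in memory. The post does not describe the wire format of either API, so the 16-bit mono PCM layout and the helper names `chunks_to_wav_bytes` and `duration_seconds` are assumptions made for this example only.

```python
import io
import wave
from typing import Iterable

SAMPLE_RATE_HZ = 24_000   # sampling rate stated in the post
SAMPLE_WIDTH_BYTES = 2    # assumption: 16-bit PCM samples
CHANNELS = 1              # assumption: mono audio


def chunks_to_wav_bytes(chunks: Iterable[bytes]) -> bytes:
    """Concatenate raw PCM chunks and wrap them in a WAV container."""
    pcm = b"".join(chunks)
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buffer.getvalue()


def duration_seconds(wav_bytes: bytes) -> float:
    """Read the WAV header back and return the clip length in seconds."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Writing the clip at the same rate for both models keeps the inputs to the quality metrics comparable.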

The following sections describe the metrics used, present the aggregated results, and analyze performance language by language.

Ensemble of Mean Opinion Scores

To reliably assess the quality of the audio generated by the models, we use an ensemble of mean opinion score (MOS) prediction models. After reviewing the available literature, we settled on the following four metrics.

These metrics were chosen because each assesses the models in a different way. They are described in detail below:

SIGMOS is an objective, non-intrusive speech-quality metric introduced in the ICASSP 2024 Speech Signal Improvement Challenge. It can assess full-band audio quality, something most neural metrics had not previously supported, enabling automated, standardized evaluation across seven perceptual dimensions such as clarity, reverberation, and noisiness.
NISQA estimates speech quality from a degraded audio signal without requiring a clean reference. It provides not only an overall MOS (Mean Opinion Score) but also detailed predictions across four perceptual dimensions: noisiness, coloration, discontinuity, and loudness.
SHEET-SSQA MOS is a model-based metric within the broader SHEET toolkit (Speech Human Evaluation Estimation Toolkit), designed to predict human Mean Opinion Scores (MOS) using self-supervised learning (SSL) representations, notably via SSL-MOS models.
WVMOS is a widely used metric for speech analysis. What sets it apart is its robustness across multiple datasets and speakers and its close alignment with subjective human judgments, making it especially useful for benchmarking TTS and voice conversion systems.

We have also developed an in-house voice evaluation framework to reliably use these models for repeatable evaluations.

Evaluation Process and Overall Results

To evaluate the models with these metrics, we chose a diverse set of sentences randomly generated using LLMs. We then generated audio for these sentences using the default settings of both models, as described in their respective online documentation. Finally, we scored the generated audio with each metric; the results are shared below.
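The process described above can be sketched as a small loop: synthesize every sentence with every model, score each clip with every metric, and average per language. This is only an outline under stated assumptions, not our in-house framework. The `Synthesizer` and `MosPredictor` signatures and the `evaluate` function are hypothetical and do not correspond to any vendor's or toolkit's real API.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical interfaces, introduced only for this sketch.
Synthesizer = Callable[[str, str], bytes]   # (text, language) -> audio bytes
MosPredictor = Callable[[bytes], float]     # audio bytes -> predicted MOS (no reference needed)


def evaluate(
    sentences: Dict[str, List[str]],
    models: Dict[str, Synthesizer],
    predictors: Dict[str, MosPredictor],
) -> Dict[str, Dict[str, Dict[str, float]]]:
    """Return results[model][language][metric] = mean predicted MOS over sentences."""
    results: Dict[str, Dict[str, Dict[str, float]]] = {}
    for model_name, synthesize in models.items():
        results[model_name] = {}
        for language, texts in sentences.items():
            # Every model sees the same sentences, so the scores are comparable.
            clips = [synthesize(text, language) for text in texts]
            results[model_name][language] = {
                metric: mean(predict(clip) for clip in clips)
                for metric, predict in predictors.items()
            }
    return results
```

Because all four metrics are non-intrusive, each predictor takes only the generated clip, with no clean reference recording.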

To get a single number for comparing the generated speech, we use the average of the scores predicted by our ensemble of MOS models:

Language	Eleven Labs Flash v2.5	Smallest.ai Lightning v2
Arabic	4.048	4.042
German	4.047	4.036
Spanish	4.226	4.253
French	4.149	4.132
Italian	4.174	4.203
Dutch	4.202	4.239
Polish	4.165	4.299
Russian	4.090	4.169
English	4.264	4.287
Overall	4.152	4.185

Our analysis shows a close contest: Eleven Flash v2.5 is marginally ahead in Arabic, German, and French, while Lightning-v2 leads in the other six languages and overall (4.185 vs. 4.152).
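To make the comparison easier to inspect, the short script below copies the per-language scores from the table and computes the gap for each language. It assumes the Overall row is an unweighted mean across the nine languages, which the table does not state explicitly; the function names are illustrative.

```python
from statistics import mean

# (Eleven Labs Flash v2.5, Smallest.ai Lightning v2) scores copied from the table.
SCORES = {
    "Arabic": (4.048, 4.042),
    "German": (4.047, 4.036),
    "Spanish": (4.226, 4.253),
    "French": (4.149, 4.132),
    "Italian": (4.174, 4.203),
    "Dutch": (4.202, 4.239),
    "Polish": (4.165, 4.299),
    "Russian": (4.090, 4.169),
    "English": (4.264, 4.287),
}


def deltas(scores):
    """Lightning v2 minus Flash v2.5, per language, rounded to 3 decimals."""
    return {
        lang: round(lightning - eleven, 3)
        for lang, (eleven, lightning) in scores.items()
    }


def summarize(scores):
    """Split languages by which model scores higher and average each column."""
    gaps = deltas(scores)
    return {
        "lightning_ahead": sorted(l for l, d in gaps.items() if d > 0),
        "eleven_ahead": sorted(l for l, d in gaps.items() if d < 0),
        "eleven_mean": mean(e for e, _ in scores.values()),
        "lightning_mean": mean(l for _, l in scores.values()),
    }


if __name__ == "__main__":
    summary = summarize(SCORES)
    print(summary["lightning_ahead"])
    print(summary["eleven_ahead"])
    print(round(summary["eleven_mean"], 3), round(summary["lightning_mean"], 3))
```

Recomputing from the rounded table values gives a Lightning v2 mean of about 4.184 rather than the published 4.185; a gap of that size is consistent with the published figure being averaged before rounding. The largest per-language gap is in Polish (0.134), and the gaps in Arabic, German, and French are all below 0.02.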

Language-Wise Performance Analysis

In this section, we further detail the results of the languages evaluated on the individual metrics alongside samples demonstrating the model's ability to generate highly intelligible and expressive speech.

German (de): German outputs from Lightning-v2 achieve strong MOS predictor scores. SSQA (4.51) and NISQA (4.24) indicate that listeners would likely find the speech natural and pleasant. Overall, the German voice is clear, natural, and easy to understand.

English (en): English generations from Lightning-v2 perform exceptionally well. High SSQA (4.54) and NISQA (4.47) scores reflect natural prosody, and the WVMOS (4.26) suggests consistently natural-sounding audio. Together with a strong SIGMOS (3.88), these metrics indicate that English outputs are highly natural and engaging.

Spanish (es): Spanish outputs from Lightning-v2 are well-balanced. SIGMOS (3.83) and NISQA (4.51) are among the highest, pointing to smooth, natural-sounding speech. SSQA (4.63) is excellent, while WVMOS (4.04) indicates consistently natural audio. Overall, the Spanish speech sounds polished and easy to follow.

French (fr): French speech from Lightning-v2 combines strong perceptual quality with excellent intelligibility. SSQA (4.64) is the highest across languages, complemented by NISQA (4.27) and WVMOS (3.89). Together, these results suggest the French outputs are technically precise and very pleasant to listen to.

Italian (it): Italian outputs from Lightning-v2 maintain consistently strong scores. NISQA (4.39) indicates a highly natural voice quality, while SSQA (4.63) and WVMOS (4.01) confirm smooth delivery and clarity. These results suggest the Italian speech is clear, engaging, and listener-friendly.

Dutch (nl): Dutch generations from Lightning-v2 show impressive perceptual results. Both NISQA (4.50) and SSQA (4.55) are very high, reflecting smooth and natural speech. A strong WVMOS (4.23) confirms that the outputs are notably pleasant and easy to follow.

Polish (pl): Polish speech from Lightning-v2 exhibits excellent performance. NISQA (4.53) and SSQA (4.62) are high, reflecting naturalness and clarity. A strong WVMOS (4.47) further indicates that the speech is both easy to understand and of a very high perceived quality.

Russian (ru): Russian outputs from Lightning-v2 deliver consistently strong perceptual metrics. SIGMOS (3.75) and NISQA (4.57) indicate smooth, natural-sounding speech. The SSQA (4.59) adds further clarity, though the WVMOS (3.76) is slightly lower compared to others. Overall, the outputs are highly listenable and natural.

Arabic (ar): Arabic outputs from Lightning-v2 demonstrate balanced quality. The SSQA (4.63) indicates good clarity, and NISQA (4.13) reflects a natural voice quality. SIGMOS (3.83) confirms intelligibility, while the WVMOS (3.59) is lower compared to some other languages. Overall, the speech is clear, measured, and listener-friendly.
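As a quick consistency check, the per-metric scores quoted above can be averaged and compared with the ensemble numbers in the overall results table. The sketch below does this for Spanish, Russian, and Arabic, the languages for which all four metrics are quoted. It assumes the ensemble score is a plain unweighted mean of the four metrics, which is an assumption of this example rather than a stated detail of the evaluation.

```python
from statistics import mean

# Lightning-v2 per-metric scores as quoted in the language paragraphs.
PER_METRIC = {
    "Spanish": {"SIGMOS": 3.83, "NISQA": 4.51, "SSQA": 4.63, "WVMOS": 4.04},
    "Russian": {"SIGMOS": 3.75, "NISQA": 4.57, "SSQA": 4.59, "WVMOS": 3.76},
    "Arabic": {"SIGMOS": 3.83, "NISQA": 4.13, "SSQA": 4.63, "WVMOS": 3.59},
}

# Lightning v2 ensemble scores from the overall results table.
PUBLISHED = {"Spanish": 4.253, "Russian": 4.169, "Arabic": 4.042}


def ensemble_mean(metric_scores):
    """Unweighted mean of the metric scores (assumed aggregation rule)."""
    return mean(metric_scores.values())


def gaps_to_published(per_metric, published):
    """Absolute difference between the recomputed and published ensemble scores."""
    return {
        lang: abs(ensemble_mean(scores) - published[lang])
        for lang, scores in per_metric.items()
    }


if __name__ == "__main__":
    for lang, gap in gaps_to_published(PER_METRIC, PUBLISHED).items():
        print(lang, round(ensemble_mean(PER_METRIC[lang]), 4), round(gap, 4))
```

Under that assumption, each recomputed average lands within 0.005 of the published value, which is the size of difference expected when the per-metric scores are rounded to two decimals before averaging.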

Conclusion

Across all evaluated languages, Lightning-v2 demonstrates strong performance on metrics of naturalness, clarity, and overall audio quality. The per-metric scores reported above range from roughly 3.6 to 4.6, and every language's ensemble average is above 4.0, consistent with speech that listeners perceive as high quality. These results indicate that Lightning-v2 reliably generates natural, intelligible, and engaging speech across English, Spanish, French, Italian, Dutch, Polish, Russian, Arabic, and German.
