Wed May 21 2025 | 13 min read

Evaluating the Lightning-v2 Multilingual TTS Model

Benchmarking

Akshat Mandloi

Data Scientist | CTO

Smallest.ai's Lightning-v2 is a general-purpose, streaming, multilingual TTS model designed for real-time applications. It has proven to be a versatile model, supporting 16 languages with another 9 in beta testing. This blog presents benchmark results for Lightning-v2 against Eleven Labs' real-time model (Eleven Flash v2.5).

In our evaluation, we synthesized audio samples at 24 kHz with both models in streaming mode and assessed them across nine languages (German, English, Spanish, French, Italian, Dutch, Polish, Russian and Arabic) using well-established speech quality metrics.

The following sections describe the metrics used, present the aggregated results, and analyze performance language by language.

Ensemble of Mean Opinion Scores

To reliably assess the quality of the audio generated by the models, we use an ensemble of mean opinion score (MOS) prediction models. After surveying the available literature, we settled on the following four metrics.

These metrics were chosen because each one assesses the models from a different angle. They are described in detail below:

  • SIGMOS is an objective, non-intrusive speech-quality metric introduced in the ICASSP 2024 Speech Signal Improvement Challenge. It can assess full-band audio quality, something most neural metrics had not previously supported, which enables automated, standardized evaluation across seven perceptual dimensions such as clarity, reverberation, and noisiness.
  • NISQA estimates speech quality from a degraded audio signal without requiring a clean reference. It provides not only an overall MOS (Mean Opinion Score), but also detailed predictions across four perceptual dimensions: noisiness, coloration, discontinuity, and loudness.
  • SHEET-SSQA MOS is a model-based metric from the broader SHEET toolkit (Speech Human Evaluation Estimation Toolkit). It is designed to predict human Mean Opinion Scores (MOS) using self-supervised learning (SSL) representations, notably via SSL-MOS models.
  • WVMOS is a widely used metric for speech analysis. What sets it apart is that it is robust across multiple datasets and speakers and closely aligned with subjective human judgments, making it especially useful for benchmarking TTS and voice conversion systems.

We have also developed an in-house voice evaluation framework to reliably use these models for repeatable evaluations.
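To give a feel for what a repeatable evaluation loop can look like, here is a minimal sketch. It is not our in-house framework: the `Scorer` type, the `evaluate_clips` function, the file names, and the stub scorers are all hypothetical and only illustrate the idea of running every MOS predictor over the same set of clips. Real wrappers around SIGMOS, NISQA, SSQA and WVMOS would replace the stubs.

```python
from statistics import fmean
from typing import Callable, Dict, List

# Hypothetical interface: a scorer maps an audio file path to a predicted MOS.
Scorer = Callable[[str], float]


def evaluate_clips(paths: List[str], scorers: Dict[str, Scorer]) -> Dict[str, float]:
    """Return the mean predicted MOS per metric over all clips."""
    if not paths:
        raise ValueError("need at least one audio clip")
    return {name: fmean(score(p) for p in paths) for name, score in scorers.items()}


# Stub scorers standing in for real model wrappers (assumption for illustration).
# They return fixed values so the sketch runs without any model weights.
stub_scorers: Dict[str, Scorer] = {
    "SIGMOS": lambda path: 3.8,
    "NISQA": lambda path: 4.5,
}

# Hypothetical clip names; the stubs never open the files.
per_metric = evaluate_clips(["clip_001.wav", "clip_002.wav"], stub_scorers)
print(per_metric)
```

Keeping every predictor behind the same call signature means the same clips and the same aggregation are used for each model under test, which is what makes runs comparable.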

Evaluation Process and Overall Results

To evaluate the models with these metrics, we chose a diverse set of sentences randomly generated using LLMs. We then generated audio for these sentences using the default settings of both models, as given in their respective online documentation. Finally, we scored the generated audio with each metric; the results are shared below.

To obtain a single number for comparing the speech generations, we use the average of the scores predicted by our ensemble of MOS models.
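As a small worked example of this averaging, the sketch below takes the per-metric Lightning-v2 scores reported for Spanish and Russian in the language-wise analysis and combines them. It assumes the ensemble score is the unweighted mean of the four metrics; the function name `ensemble_score` is ours for illustration.

```python
from statistics import fmean

# Per-metric Lightning-v2 scores as reported in the language-wise analysis.
scores = {
    "Spanish": {"SIGMOS": 3.83, "NISQA": 4.51, "SSQA": 4.63, "WVMOS": 4.04},
    "Russian": {"SIGMOS": 3.75, "NISQA": 4.57, "SSQA": 4.59, "WVMOS": 3.76},
}


def ensemble_score(metric_scores):
    """Unweighted mean of the individual MOS predictions (assumed weighting)."""
    return fmean(metric_scores.values())


for language, metric_scores in scores.items():
    print(f"{language}: {ensemble_score(metric_scores):.2f}")
# Spanish: 4.25
# Russian: 4.17
```

Under that assumption, both values match the Lightning v2 column of the results table.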

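| Language | Eleven Labs Flash v2.5 | Smallest.ai Lightning v2 |
|----------|------------------------|--------------------------|
| Arabic   | 4.01                   | 4.04                     |
| German   | 3.99                   | 4.04                     |
| Spanish  | 4.21                   | 4.25                     |
| French   | 4.36                   | 4.28                     |
| Italian  | 4.17                   | 4.20                     |
| Dutch    | 4.34                   | 4.23                     |
| Polish   | 4.29                   | 4.30                     |
| Russian  | Not Supported          | 4.17                     |
| English  | 4.23                   | 4.28                     |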
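To make the table easier to read at a glance, the following sketch computes the per-language gap between the two models from the ensemble averages in the table above. The variable names are ours, and Russian is excluded from the comparison because Eleven Flash v2.5 does not support it.

```python
# Ensemble averages copied from the table: (Eleven Flash v2.5, Lightning v2).
# None marks a language the model does not support.
results = {
    "Arabic": (4.01, 4.04),
    "German": (3.99, 4.04),
    "Spanish": (4.21, 4.25),
    "French": (4.36, 4.28),
    "Italian": (4.17, 4.20),
    "Dutch": (4.34, 4.23),
    "Polish": (4.29, 4.30),
    "Russian": (None, 4.17),
    "English": (4.23, 4.28),
}

# Positive delta means Lightning v2 scored higher for that language.
deltas = {
    lang: round(lightning - flash, 2)
    for lang, (flash, lightning) in results.items()
    if flash is not None
}

lightning_higher = [lang for lang, d in deltas.items() if d > 0]
flash_higher = [lang for lang, d in deltas.items() if d < 0]

print("Lightning v2 higher:", lightning_higher)
print("Eleven Flash v2.5 higher:", flash_higher)
```

On these numbers, Lightning v2 scores higher in six of the eight languages both models support, while Eleven Flash v2.5 scores higher in French and Dutch. The gaps are small (at most 0.11), so they are best read as indicative.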

Language-Wise Performance Analysis

In this section, we further detail the results for each evaluated language on the individual metrics, alongside samples demonstrating the model's ability to generate highly intelligible and expressive speech.

German (de): German outputs achieve strong MOS predictor scores (SIGMOS 3.63, NISQA 4.12), indicating listeners would likely find the speech natural and pleasant. Overall, the German voice is clear, natural, and easy to understand.

English (en): English generations perform exceptionally well, with high SIGMOS (3.88) and NISQA (4.47) scores reflecting natural prosody. The WVMOS (4.26) further suggests consistently natural-sounding audio. Together with the strong SSQA (4.54), these metrics confirm that English outputs are highly natural and engaging.

Spanish (es): Spanish outputs are well-balanced, with SIGMOS (3.83) and NISQA (4.51) among the highest, pointing to smooth, natural-sounding speech. SSQA (4.63) is excellent, while WVMOS (4.04) indicates consistently natural audio. Overall, Spanish speech sounds polished and easy to follow.

French (fr): French speech combines strong perceptual quality with excellent intelligibility. SSQA (4.64) is the highest across languages, complemented by NISQA (4.27) and WVMOS (3.90). Together, these results suggest that French outputs are technically precise and highly pleasant to listen to.

Italian (it): Italian outputs maintain consistently strong scores. NISQA (4.39) indicates a highly natural voice quality, while SSQA (4.63) and WVMOS (4.01) confirm smooth delivery and clarity. These results suggest Italian speech is clear, engaging, and listener-friendly.

Dutch (nl): Dutch generations show impressive perceptual results. NISQA (4.50) and SSQA (4.55) are both very high, reflecting smooth and natural speech. The strong WVMOS (4.23) confirms Dutch outputs are notably pleasant and easy to follow.

Polish (pl): Polish speech exhibits excellent performance. NISQA (4.53) and SSQA (4.62) are high, reflecting naturalness and clarity. The strong WVMOS (4.47) further indicates that Polish speech is both easy to understand and of very high perceived quality.

Russian (ru): Russian outputs deliver consistently strong perceptual metrics. SIGMOS (3.75) and NISQA (4.57) indicate smooth, natural-sounding speech. The SSQA (4.59) adds further clarity, though the WVMOS (3.76) is slightly lower compared to others. Overall, Russian outputs are highly listenable and natural.

Arabic (ar): Arabic outputs demonstrate balanced quality. NISQA (4.13) and SSQA (4.63) indicate natural voice quality and good clarity. While the WVMOS (3.59) is lower than for some other languages, the SIGMOS (3.83) supports good overall signal quality. Overall, Arabic speech feels clear, measured, and listener-friendly.

Conclusion

Across all evaluated languages, Lightning-v2 demonstrates strong performance on metrics of naturalness, clarity, and overall audio quality. The individual metric scores reported above range from roughly 3.6 to 4.6, and every language's ensemble average is above 4.0, a range typically associated with high-quality speech. These results indicate that Lightning-v2 reliably generates natural, intelligible, and engaging speech across English, Spanish, French, Italian, Dutch, Polish, Russian, Arabic, and German.