October 9, 2024 • 6 min Read
Introducing Lightning: World's Fastest Text-to-Speech Model
Lightning will allow voicebot companies to drastically reduce latencies and costs by simplifying their architectures
Sudarshan Kamath
Founder
We at smallest.ai are proud to release a new state-of-the-art multilingual text-to-speech (TTS) model - Lightning. It is the result of months of hard work by the team, pushing the boundaries of size and performance in multi-modal AI.
Want to skip the blog and try it out? Login to Waves or check the Waves API.
Overview of Lightning’s Capabilities
- Speed - Lightning can generate 10 seconds of ultra-realistic audio in just 100 ms - a real-time factor of 0.01 (0.1 s of compute per 10 s of audio) - making it the world's fastest text-to-speech model.
- Size - Lightning requires well under 1 GB of VRAM, making it easy to run on most consumer and edge devices.
- Languages - Lightning currently supports English and Hindi in multiple accents, and we plan to add many more languages quickly.
- New Data Adaptation - Lightning also adapts to new languages, accents, and speakers very quickly - often requiring just an hour of data to train.
Why Lightning is Revolutionary: Comparison with Current TTS Models
While we will be launching a detailed technical report soon, here are the top highlights.
The Limitations of Current TTS Models
Auto-regressive models currently lead speech-generation benchmarks: they are great at capturing speech nuances and can model emotion and spontaneity in their output very well.
However, they suffer from slow decoding. Generating audio in such models is a sequential process, so longer clips introduce longer delays. The first byte of audio might arrive quickly, but the full clip is generated slowly. A 10-second clip, for example, can take up to 5 seconds to generate.
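Put another way, that is a real-time factor of roughly 0.5 (5 seconds of compute for 10 seconds of audio), compared with the 0.01 quoted for Lightning above.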
Moreover, autoregressive models often require WebSocket connections for real-time applications, which are harder to scale and maintain than simpler REST API integrations. These connections must stay open for the entire call and can max out your CPU very quickly.
Non-autoregressive models have historically not come close to autoregressive ones in speech quality, because they often lack context: the next tokens in the sequence are not conditioned on previous tokens.
This is no longer true now with Lightning.
How Lightning Overcomes These Challenges
Lightning uses a non-auto-regressive architecture to synthesize an entire clip of audio in a single pass, unlike auto-regressive models that generate audio step by step. But making non-auto-regressive models work, and training them in a scalable manner, is not easy. Here are some of the ideas that have worked to solve these challenges:
- Style Diffusor - Lightning uses a dedicated style diffusor that adds style to the generated audio - making it conversational, for example, or matching a reference provided by the user.
- Phoneme-Based Inputs - We switched from BPE-tokenizer-based inputs to phoneme-based inputs, as this helps us add new languages quickly. While not all languages share the same phonemes, we have observed that the model can overcome phonemization issues implicitly.
- Conditioning Encoder - Lightning allows a high degree of control by conditioning the generated audio on speaker, style, accent, and other latents with the help of our custom conditioning encoder. This encoder captures these latents with low correlation between them, and we continue to explore simpler ways to expose these controls to our users.
- Low Model Size - Lightning is super-fast because it contains far fewer parameters than traditional large audio models, pushing it into the sub-gigabyte memory range. This has been achieved through rigorous pruning of unwanted weights, quantization, and distillation using a proprietary algorithm that we at smallest.ai are continuously improving (see the generic quantization sketch after this list).
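To give a sense of why quantization shrinks a model's memory footprint, here is a generic illustration of post-training dynamic quantization in PyTorch on a toy model. This is not smallest.ai's proprietary compression pipeline, and the layer sizes are made up; it only shows that storing weights as int8 instead of fp32 cuts checkpoint size roughly fourfold.

import io
import torch
import torch.nn as nn

# Toy stand-in for an audio model; Lightning's real architecture is proprietary.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

def checkpoint_size_mb(m: nn.Module) -> float:
    # Serialize the state dict to memory and measure its size.
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

# Post-training dynamic quantization: Linear weights become int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(f"fp32 checkpoint: {checkpoint_size_mb(model):.1f} MB")
print(f"int8 checkpoint: {checkpoint_size_mb(quantized):.1f} MB")  # roughly 4x smaller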
How Lightning Works in Real-Time Applications
Lightning’s speed and efficiency make it perfect for real-time products such as Voice Assistants.
We provide a simple REST API, so developers can easily integrate Lightning into their systems without the complexity of WebSockets.
Lightning's simple integration and fast responses ensure that voice bots can run with sub-1-second latency and scale rapidly without issues.
Here are some simple examples where we show how to integrate Lightning into voice bots, telephony providers, and more.
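For a flavour of what that looks like, here is a minimal sketch of a per-turn TTS call inside a voice bot's response loop. It assumes the same Lightning endpoint and payload fields used in the quick-start below; the helper name, example reply text, and timeout value are ours, not part of the official examples.

import requests

LIGHTNING_URL = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
API_KEY = "<token>"  # replace with your Waves API key

def synthesize_reply(reply_text: str) -> bytes:
    # One REST call per bot turn: text in, raw audio bytes out.
    response = requests.post(
        LIGHTNING_URL,
        json={"voice_id": "jasmine", "text": reply_text, "sample_rate": 16000},
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        timeout=5,
    )
    response.raise_for_status()
    return response.content  # raw PCM audio, ready to stream back to the caller

# Example: after the bot decides what to say, fetch the audio for that turn.
audio_bytes = synthesize_reply("Sure, I can help you reschedule that appointment.")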
How to Run Lightning Locally in 5 Minutes
Getting started with Lightning is simple. Here’s how to run it locally in just 5 minutes:
- Log in to waves.smallest.ai.
- Navigate to the API key section on the left-hand panel and copy your API key.
- Go to the Read API Documentation and, from the left menu, open the Waves API.
a. Paste the API key into the authorization box.
b. In the Path section, select the lightning model.
c. Input the voice_id (here is the list). Just enter the string enclosed between the two quotes.
d. Then enter the text you want to listen to.
e. Finally, select the sample_rate. Select 16000; other options are available as well.
- To run it locally using Python, copy the Python code from the black box on the right and paste it into your code editor. Replace the token placeholder with your actual API key.
import requests

# Lightning's speech-generation endpoint
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"

payload = {
    "voice_id": "jasmine",
    "text": "Hello world, I am lightning a fast text to speech model from smallest dot ai team.",
    "sample_rate": 16000
}
headers = {
    "Authorization": "Bearer <token>",  # replace <token> with your API key
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

# The response body is the raw audio; the next example shows how to save it as a playable WAV file.
print(response.text)
- To make it easier, here are some additions to the code so that you can listen to the audio locally as well.
import requests
import uuid
import os
import wave

AUDIO_DIR = "audio_output"
SAMPLE_RATE = 16000   # The sample rate in Hz
NUM_CHANNELS = 1      # Assuming mono audio, adjust if needed
SAMPLE_WIDTH = 2      # 2 bytes per sample (16-bit PCM)

# Create the output directory if it doesn't exist
if not os.path.exists(AUDIO_DIR):
    os.makedirs(AUDIO_DIR)

url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"

payload = {
    "voice_id": "arman",
    "text": "Hey guys, I am lightning! An extremely fast text to speech model from smallest dot ai.",
    "sample_rate": SAMPLE_RATE
}
headers = {
    "Authorization": "Bearer <token>",  # replace <token> with your API key
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

if response.status_code == 200:
    # Get the raw PCM data from the response content
    pcm_data = response.content

    # Create a unique file name for the output
    audio_file = f"{uuid.uuid4()}.wav"
    audio_path = os.path.join(AUDIO_DIR, audio_file)

    # Write the PCM data into a WAV file
    with wave.open(audio_path, 'wb') as wav_file:
        wav_file.setnchannels(NUM_CHANNELS)   # Mono channel
        wav_file.setsampwidth(SAMPLE_WIDTH)   # 16-bit audio (2 bytes)
        wav_file.setframerate(SAMPLE_RATE)    # Sample rate
        wav_file.writeframes(pcm_data)        # Write the raw PCM data

    print(f"Audio saved successfully at: {audio_path}")
else:
    print(f"Error: {response.status_code}, {response.text}")
- Open the terminal and run python {YOUR_FILENAME}.py. Here's the audio file, which can be played in the code editor as well.
Speech from Lightning Model
That’s it! You can now generate high-quality speech locally in milliseconds.
How Much Does Lightning Cost?
Lightning is affordable and scalable, with pricing starting at $0.04 USD per minute. For enterprises using over 100,000 minutes per month, custom pricing is available. Contact info@smallest.ai for details.
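As a rough worked example at list price, 50,000 minutes in a month would come to about $2,000, and the 100,000-minute mark where custom pricing kicks in works out to roughly $4,000 per month.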
What next?
Even though AI voices are hyper-realistic, why don't humans talk to AI every day? What is missing? What can help us cross the last mile and unlock true multi-modal AI interactions at population scale?
This is the answer we are looking for.
We believe some hints to the answer lie in a radical shift to how we think about AI today.
While the last revolution in AI came from the realization that an attention block can scale dramatically with data and learn vast amounts of information, the next big revolution will come from the ability to interact with humans a billion times a day and improve every second. The next leap will be a data leap: it requires models to observe, interact, and improve from their environment a billion times a day.
Lightning was a step toward understanding the capabilities of small multi-modal models that can run on the edge. The next step is figuring out how Lightning and similar small multi-modal models can run in close interaction loops with humans on edge devices, so that a strong active-learning pipeline is established to collect high-quality, hyper-personalized multi-modal data.
The future does not contain one large language model running on the cloud, being trained once in a while. The future involves billions of smaller models that will be hyper-localized to their environments by constantly improving on the edge.
Our aim is to make small strides in this direction.