October 9, 2024 • 6 min Read
Introducing Lightning: World's Fastest Text-to-Speech Model
Lightning will allow voicebot companies to drastically reduce latencies and costs by simplifying their architectures
Sudarshan Kamath
Founder
We at smallest.ai are proud to release a new state-of-the-art multilingual text-to-speech (TTS) model - Lightning. It is the result of months of hard work by the team, pushing the boundaries of size and performance in multi-modal AI.
Want to skip the blog and try it out? Login to Waves or check the Waves API.
Overview of Lightning’s Capabilities
- Speed - Lightning can generate 10 seconds of ultra-realistic audio in just 100 ms - a real-time factor of 0.01 (0.1 s of compute per 10 s of audio) - making it the world's fastest text-to-speech model.
- Size - Lightning requires well under 1 GB of VRAM, making it easy to run on most consumer and edge devices.
- Languages - Lightning currently supports English and Hindi in multiple accents, and we plan to add many more languages quickly.
- New Data Adaptation - Lightning also adapts to new languages, accents, and speakers very quickly - often requiring just an hour of data to train.
Why Lightning is Revolutionary: Comparison with Current TTS Models
While we will be launching a detailed technical report soon, here are the top highlights.
The Limitations of Current TTS Models
Auto-regressive models currently lead speech-generation benchmarks: they are great at capturing speech nuances and can model emotion and spontaneity in their output very well.
However, they suffer from slow decoding. Generating audio in such models is a sequential process, so longer clips introduce longer delays. The first byte of audio might arrive quickly, but the full clip is generated slowly. A 10-second clip, for example, can take up to 5 seconds to generate.
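Put another way, that is a real-time factor of roughly 0.5 (5 seconds of compute for 10 seconds of audio), compared with the 0.01 quoted for Lightning above.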
Moreover, autoregressive models often require WebSocket connections for real-time applications, which are harder to scale and maintain than simpler REST API integrations. These connections must stay open for the entire call and can max out your CPU very quickly.
Non-autoregressive models have historically not come close to autoregressive ones in speech quality, because they often lack context: the next tokens in the sequence are not conditioned on previous tokens.
This is no longer true now with Lightning.
How Lightning Overcomes These Challenges
Lightning uses a non-auto-regressive architecture to synthesize an entire clip of audio in a single pass, unlike auto-regressive models that generate audio step by step. But making non-auto-regressive models work, and training them in a scalable manner, is not easy. Here are some of the ideas that have worked to solve these challenges:
- Style Diffusor - Lightning uses a dedicated style diffusor that adds style to the generated audio - making it conversational, for example, or matching a reference provided by the user.
- Phoneme-Based Inputs - We switched from BPE-tokenizer-based inputs to phoneme-based inputs, as this helps us add new languages quickly. While not all languages share the same phonemes, we have observed that the model can overcome phonemization issues implicitly.
- Conditioning Encoder - Lightning allows a high degree of control by conditioning the generated audio on speaker, style, accent, and other latents with the help of our custom conditioning encoder. This encoder captures these latents with low correlation between them, and we continue to explore simpler ways to expose these controls to our users.
- Low Model Size - Lightning is super-fast because it contains far fewer parameters than traditional large audio models, pushing it into the sub-gigabyte memory range. This has been achieved through rigorous pruning of unwanted weights, quantization, and distillation using a proprietary algorithm that we at smallest.ai are continuously improving (see the generic quantization sketch after this list).
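To give a sense of why quantization shrinks a model's memory footprint, here is a generic illustration of post-training dynamic quantization in PyTorch on a toy model. This is not smallest.ai's proprietary compression pipeline, and the layer sizes are made up; it only shows that storing weights as int8 instead of fp32 cuts checkpoint size roughly fourfold.

import io
import torch
import torch.nn as nn

# Toy stand-in for an audio model; Lightning's real architecture is proprietary.
model = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))

def checkpoint_size_mb(m: nn.Module) -> float:
    # Serialize the state dict to memory and measure its size.
    buffer = io.BytesIO()
    torch.save(m.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

# Post-training dynamic quantization: Linear weights become int8.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(f"fp32 checkpoint: {checkpoint_size_mb(model):.1f} MB")
print(f"int8 checkpoint: {checkpoint_size_mb(quantized):.1f} MB")  # roughly 4x smaller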
How Lightning Works in Real-Time Applications
Lightning’s speed and efficiency make it perfect for real-time products such as Voice Assistants.
We provide a simple REST API, so developers can easily integrate Lightning into their systems without the complexity of WebSockets.
Lightning's simple integration and fast responses ensure that voice bots can run with sub-1-second latency and scale rapidly without issues.
Here are some simple examples where we show how to integrate Lightning into voice bots, telephony providers, and more.
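For a flavour of what that looks like, here is a minimal sketch of a per-turn TTS call inside a voice bot's response loop. It assumes the same Lightning endpoint and payload fields used in the quick-start below; the helper name, example reply text, and timeout value are ours, not part of the official examples.

import requests

LIGHTNING_URL = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"
API_KEY = "<token>"  # replace with your Waves API key

def synthesize_reply(reply_text: str) -> bytes:
    # One REST call per bot turn: text in, raw audio bytes out.
    response = requests.post(
        LIGHTNING_URL,
        json={"voice_id": "jasmine", "text": reply_text, "sample_rate": 16000},
        headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
        timeout=5,
    )
    response.raise_for_status()
    return response.content  # raw PCM audio, ready to stream back to the caller

# Example: after the bot decides what to say, fetch the audio for that turn.
audio_bytes = synthesize_reply("Sure, I can help you reschedule that appointment.")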
How to Run Lightning Locally in 5 Minutes
Getting started with Lightning is simple. Here’s how to run it locally in just 5 minutes:
- Log in to waves.smallest.ai.
- Navigate to the API key section on the left-hand panel and copy your API key.
- Go to the Read API Documentation and, from the left menu, open the Waves API.
a. Paste the API key into the authorization box.
b. In the Path section, select the lightning model.
c. Input the voice_id (here is the list). Just enter the string enclosed between the two quotes.
d. Then enter the text you want to listen to.
e. Finally, select the sample_rate. Select 16000; other options are available as well.
- To run it locally using Python, copy the Python code from the black box on the right and paste it into your code editor. Replace the token placeholder with your actual API key.
import requests

# Lightning's speech-generation endpoint
url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"

payload = {
    "voice_id": "jasmine",
    "text": "Hello world, I am lightning a fast text to speech model from smallest dot ai team.",
    "sample_rate": 16000
}
headers = {
    "Authorization": "Bearer <token>",  # replace <token> with your API key
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

# The response body is the raw audio; the next example shows how to save it as a playable WAV file.
print(response.text)
- To make it easier, here are some additions to the code so that you can listen to the audio locally as well.
import requests
import uuid
import os
import wave

AUDIO_DIR = "audio_output"
SAMPLE_RATE = 16000   # The sample rate in Hz
NUM_CHANNELS = 1      # Assuming mono audio, adjust if needed
SAMPLE_WIDTH = 2      # 2 bytes per sample (16-bit PCM)

# Create the output directory if it doesn't exist
if not os.path.exists(AUDIO_DIR):
    os.makedirs(AUDIO_DIR)

url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"

payload = {
    "voice_id": "arman",
    "text": "Hey guys, I am lightning! An extremely fast text to speech model from smallest dot ai.",
    "sample_rate": SAMPLE_RATE
}
headers = {
    "Authorization": "Bearer <token>",  # replace <token> with your API key
    "Content-Type": "application/json"
}

response = requests.post(url, json=payload, headers=headers)

if response.status_code == 200:
    # Get the raw PCM data from the response content
    pcm_data = response.content

    # Create a unique file name for the output
    audio_file = f"{uuid.uuid4()}.wav"
    audio_path = os.path.join(AUDIO_DIR, audio_file)

    # Write the PCM data into a WAV file
    with wave.open(audio_path, 'wb') as wav_file:
        wav_file.setnchannels(NUM_CHANNELS)   # Mono channel
        wav_file.setsampwidth(SAMPLE_WIDTH)   # 16-bit audio (2 bytes)
        wav_file.setframerate(SAMPLE_RATE)    # Sample rate
        wav_file.writeframes(pcm_data)        # Write the raw PCM data

    print(f"Audio saved successfully at: {audio_path}")
else:
    print(f"Error: {response.status_code}, {response.text}")
- Open the terminal and run python {YOUR_FILENAME}.py. Here's the audio file, which can be played in the code editor as well.
Speech from Lightning Model
That’s it! You can now generate high-quality speech locally in milliseconds.
How Much Does Lightning Cost?
Lightning is affordable and scalable, with pricing starting at $0.04 USD per minute. For enterprises using over 100,000 minutes per month, custom pricing is available. Contact info@smallest.ai for details.
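As a rough worked example at list price, 50,000 minutes in a month would come to about $2,000, and the 100,000-minute mark where custom pricing kicks in works out to roughly $4,000 per month.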
What next?
Even though AI voices are hyper-realistic, why don't humans talk to AI every day? What is missing? What can help us cross the last mile and unlock true multi-modal AI interactions at population scale?
This is the answer we are looking for.
We believe some hints to the answer lie in a radical shift to how we think about AI today.
While the last revolution in AI came from the realization that an attention block can scale dramatically with data and learn vast amounts of information, the next big revolution will come from the ability to interact with humans a billion times a day and improve every second. The next leap will be a data leap: it requires models to observe, interact, and improve from their environment a billion times a day.
Lightning was a step toward understanding the capabilities of small multi-modal models that can run on the edge. The next step is figuring out how Lightning and similar small multi-modal models can run in close interaction loops with humans on edge devices, so that a strong active-learning pipeline is established to collect high-quality, hyper-personalized multi-modal data.
The future does not contain one large language model running on the cloud, being trained once in a while. The future involves billions of smaller models that will be hyper-localized to their environments by constantly improving on the edge.
Our aim is to make small strides in this direction.