Nov 2, 2024 • 5 min Read

Best Text to Speech API for Modern Day Applications

Choose the most cost-effective, high-fidelity, low-latency APIs from the lot.

Kaushal Choudhary

Senior Developer Advocate

APIs enable seamless integration of AI-driven models and services into websites, applications, and platforms, allowing developers to leverage sophisticated functionality without the need for extensive compute resources. Large language models and audio generation models, for example, typically require substantial computational power that may not be accessible to all users. By providing APIs, companies give broader access to these powerful models, deliver results in a standardized format, and make the output easier to store and parse.

Text-to-Speech (TTS) technology is advancing rapidly, with a growing variety of models and providers catering to different applications—from parsing PDFs and text files to producing complete podcasts, voiceovers, and narrations. Selecting the right TTS API is crucial, as it should align with the specific utility, audience, and goals of the intended tools or applications.

How to Choose the Right API for Text-to-Speech (TTS)?

Selecting the right Text-to-Speech (TTS) API requires a detailed evaluation of several factors, each affecting the API's performance, cost, and suitability for specific applications. Below, we explore the key criteria to help make an informed decision when integrating TTS into your product or platform.

1. On-boarding Time

Ease of Integration: Assess the API’s documentation, SDKs, and examples. A well-documented API reduces time-to-value by making it easier for developers to integrate and test the TTS service.
Developer Tools and Libraries: Check if the API offers language-specific libraries (e.g., Python, JavaScript) or helper functions to streamline implementation.
Community and Support: For rapid troubleshooting, a responsive support team and an active developer community are valuable.

2. Cost

Pricing Model: Analyze whether the API charges per character, per second, or per request. Different models may be more cost-effective depending on your usage volume.
Free Tiers and Trials: For initial testing, free tiers are advantageous, providing a way to benchmark performance before committing to paid usage.
Scalability: As usage scales, consider volume discounts or enterprise packages. Some providers may offer custom pricing for high-volume or enterprise-level users.
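
To make the pricing-model comparison concrete, here is a minimal sketch that converts a monthly workload into an estimated bill under per-character, per-minute, and per-request pricing. The `Workload` class, the `monthly_cost` function, the speaking rate, and every rate in `plans` are illustrative assumptions, not quotes from any provider.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Monthly usage profile (all numbers are illustrative assumptions)."""
    requests: int                # number of API calls per month
    chars_per_request: int       # average characters of text per call
    chars_per_minute: int = 900  # assumed speaking rate (~150 words/min, ~6 chars/word)


def monthly_cost(workload: Workload, model: str, rate: float) -> float:
    """Estimate the monthly bill in USD for one pricing model.

    model: "per_character", "per_minute", or "per_request"
    rate:  price in USD per unit of that model (hypothetical)
    """
    total_chars = workload.requests * workload.chars_per_request
    if model == "per_character":
        return total_chars * rate
    if model == "per_minute":
        # Convert text length to minutes of audio using the assumed speaking rate
        return total_chars / workload.chars_per_minute * rate
    if model == "per_request":
        return workload.requests * rate
    raise ValueError(f"unknown pricing model: {model}")


if __name__ == "__main__":
    usage = Workload(requests=10_000, chars_per_request=450)
    # Hypothetical rates, chosen only to show how the comparison works
    plans = {
        "per_character": 0.00002,
        "per_minute": 0.02,
        "per_request": 0.01,
    }
    for model, rate in plans.items():
        print(f"{model:>14}: ${monthly_cost(usage, model, rate):,.2f}")
```

Running the same workload through each plan shows which pricing model suits your traffic: many short requests tend to favor per-character or per-minute billing, while a few long requests tend to favor per-request billing.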

3. Voice Quality

Naturalness and Accuracy: High-quality TTS APIs utilize neural networks or advanced language models to produce more lifelike and natural-sounding voices, reducing synthetic artifacts.
Language and Accent Support: Depending on your target audience, it’s essential to choose an API that supports relevant languages and regional accents.
Voice Variety: Many TTS APIs offer a range of voices (e.g., gender, age, tone) to suit different applications and enhance user experience.
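
One way to turn the three criteria above into a decision is a simple weighted score. The sketch below is only an illustration: the weights, the provider names, and the 1 to 5 ratings are placeholders you would replace with results from your own trials.

```python
from typing import Dict, List, Tuple

# Relative importance of each criterion (assumed values; adjust to your project)
WEIGHTS: Dict[str, float] = {"onboarding": 0.2, "cost": 0.3, "voice_quality": 0.5}


def weighted_score(ratings: Dict[str, float], weights: Dict[str, float] = WEIGHTS) -> float:
    """Combine per-criterion ratings (e.g. 1-5) into a single weighted average."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(ratings[name] * w for name, w in weights.items()) / total_weight


def rank_providers(candidates: Dict[str, Dict[str, float]]) -> List[Tuple[str, float]]:
    """Return (provider, score) pairs sorted from best to worst."""
    scored = [(name, weighted_score(ratings)) for name, ratings in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Placeholder providers and ratings, not real measurements
    candidates = {
        "provider_a": {"onboarding": 5, "cost": 4, "voice_quality": 3},
        "provider_b": {"onboarding": 3, "cost": 3, "voice_quality": 5},
    }
    for name, score in rank_providers(candidates):
        print(f"{name}: {score:.2f}")
```

Keeping the weights explicit makes the trade-off visible: a voice-first product would raise `voice_quality`, while a prototype on a tight budget would raise `cost`.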

Best APIs for Text-to-Speech

Below is a curated list of top TTS APIs, each offering unique strengths in voice quality, customization, and developer support.

1. Smallest.ai TTS

Waves by smallest.ai has the world's fastest Text-to-Speech model, with ultra-low latency and ultra-realistic speech in multiple languages. One of the unique advantages of Waves is its adaptability and the short time it needs to train on a new language or accent. Inference with the latest SOTA model costs as little as $0.02 USD per minute.

It also has one of the simplest APIs: there is no SDK to install, just a working Python environment with the requests library.

import requests

api_key = ""  # your Smallest.ai API key

def smallest_tts(text: str):
	url = "https://waves-api.smallest.ai/api/v1/lightning/get_speech"

	payload = {
		"voice_id": "emily",
		"text": text,
		"sample_rate": 12000,
		"add_wav_header": True
	}

	headers = {
		"Authorization": f"Bearer {api_key}",
		"Content-Type": "application/json"
	}

	# Send the synthesis request; a successful response body is the raw audio
	response = requests.post(url, json=payload, headers=headers)

	if response.status_code == 200:
		# Path to save the audio file
		save_file_path = "smallest_tts_audio.wav"
		# Write the audio bytes to the file
		with open(save_file_path, "wb") as f:
			f.write(response.content)  # response.content holds the bytes

		print(f"{save_file_path}: Smallest.ai Audio Saved Successfully")
		return save_file_path
	else:
		print("Error:", response.status_code, response.text)
		return None

2. DeepGram TTS

The DeepGram API is extremely easy to set up, with a generous free tier and SDK support for Python and JavaScript. It is fairly fast, with low latency, high-fidelity output, and nuanced pronunciation, so it can be used in real-time applications. However, it is geared more toward speech-to-text and is fairly new to TTS, so it misses some common intonations and accent-specific words.

Setup takes a few minutes, from logging in to running the code on Google Colab.

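from deepgram import DeepgramClient, SpeakOptions

filename = "deepgram.wav"  # output path (example value)
SPEAK_OPTIONS = {"text": "Hello, this is a text to speech example."}  # example text

def deepgram():
	try:
		# create a Deepgram client using the API key
		client = DeepgramClient(api_key="")

		# configure the voice model and output format
		options = SpeakOptions(
			model="aura-asteria-en",
			encoding="linear16",
			container="wav"
		)

		# call the save method on the speak property to write the audio file
		response = client.speak.v("1").save(filename, SPEAK_OPTIONS, options)
		print(response.to_json(indent=4))

	except Exception as e:
		print(f"Exception : {e}")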

3. Play.ht TTS

Play.ht has sub-190ms latency and high-fidelity audio output. It also supports SSML. Along with natural voices, it is suitable for real-time tasks and offers a generous free tier for experimenting. Play.ht provides a good user experience but sits at the higher end of the price scale compared with other platforms.

Setup requires installing the pyht library, but the code is fairly easy to run.

from pyht import Client
from pyht.client import TTSOptions

speech_file_path = "playht.wav"

client = Client(
	user_id="",
	api_key="",
)

options = TTSOptions(voice="s3://voice-cloning-zero-shot/775ae416-49bb-4fb6-bd45-740f205d20a1/jennifersaad/manifest.json")

# Stream the synthesized audio once and write each chunk to the file
with open(speech_file_path, "wb") as f:  # Open file in write-binary mode
	for chunk in client.tts("Hi, I am Jennifer from play ht. Nice to meet you!", options):
		f.write(chunk)
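
Because latency claims depend on your network and region, it can help to measure them yourself. The following is a minimal sketch, assuming the provider returns audio as a lazy iterator of byte chunks (as the `client.tts(...)` loop above does). `measure_stream` and `fake_tts_stream` are hypothetical helpers written for this illustration; the fake stream only stands in for a real API call.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_stream(chunks: Iterable[bytes]) -> Tuple[float, float, int]:
    """Consume a stream of audio chunks and time it.

    Returns (seconds until the first non-empty chunk, total seconds, total bytes).
    If no audio arrives, the first value equals the total time.
    """
    start = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0
    for chunk in chunks:
        if chunk and first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        total_bytes += len(chunk)
    end = time.perf_counter()
    if first_chunk_at is None:
        first_chunk_at = end
    return first_chunk_at - start, end - start, total_bytes


def fake_tts_stream(n_chunks: int = 3, delay: float = 0.05) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS call; yields dummy bytes."""
    for _ in range(n_chunks):
        time.sleep(delay)  # simulate network and synthesis delay
        yield b"\x00" * 1024


if __name__ == "__main__":
    first, total, size = measure_stream(fake_tts_stream())
    print(f"first chunk: {first * 1000:.0f} ms, total: {total * 1000:.0f} ms, {size} bytes")
```

To time a real provider, pass its chunk iterator instead of the fake one, for example `measure_stream(client.tts(text, options))`. The numbers include network round-trip time, so run several trials from the environment where your application will be deployed.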

4. Elevenlabs TTS

Elevenlabs is a mature Text-to-Speech platform with a production-grade API. It offers multilingual support across various languages and dialects, along with good voice cloning. It has been widely used to build LLM-based apps such as call answering and voice question answering. However, its polished user experience comes with a relatively expensive premium plan, which can hinder users from building production-grade or user-facing applications.

from elevenlabs import VoiceSettings
from elevenlabs.client import ElevenLabs

ELEVENLABS_API_KEY = ""

client = ElevenLabs(
	api_key=ELEVENLABS_API_KEY,
)

def elevenlabs_tts(text: str) -> str:
	# Calling the text_to_speech conversion API with detailed parameters
	response = client.text_to_speech.convert(
		voice_id="pNInz6obpgDQGcFmaJgB", # Adam pre-made voice
		output_format="mp3_22050_32",
		text=text,
		model_id="eleven_turbo_v2_5", # use the turbo model for low latency

		voice_settings=VoiceSettings(
			stability=0.0,
			similarity_boost=1.0,
			style=0.0,
			use_speaker_boost=True,
		),
	)

	save_file_path = "elevenlabs.mp3"
	# Writing the audio to a file
	with open(save_file_path, "wb") as f:
		for chunk in response:
			if chunk:
				f.write(chunk)

	print(f"{save_file_path}: Elevenlabs Audio Saved Successfully")

	# Return the path of the saved audio file
	return save_file_path

Conclusion

In a rapidly evolving digital landscape, selecting the right TTS API is crucial for delivering a seamless and high-quality user experience. APIs offer powerful AI-driven capabilities, enabling applications to access advanced functionalities like text-to-speech without the need for extensive computational resources. Evaluating factors such as on-boarding time, cost, voice quality, and customization ensures that the chosen API aligns with the unique requirements of the application.

Whether you're looking to create interactive customer support solutions, produce professional voiceovers, or personalize content with custom voices, selecting the right TTS API can significantly impact both the user experience and operational efficiency. By choosing an API that balances technical quality with cost-effectiveness and flexibility, developers can create dynamic, accessible, and user-centered experiences, leveraging AI’s potential to its fullest extent.
