Oct 28, 2024 • 5 min Read
How to Use Amazon Polly Text-to-Speech: A Complete Guide
Explore how to use Amazon Polly, AWS's Text-to-Speech service. Learn setup, SSML customization, lexicon support, and programmatic integration for lifelike speech.
Kaushal Choudhary
Senior Developer Advocate
We often discuss Text-to-Speech apps, Chrome extensions, websites, and platforms, but Amazon Web Services (AWS) also provides a Text-to-Speech service of its own: Amazon Polly. Amazon Polly is a cloud service on AWS that uses deep learning models to synthesize speech from text.
In this article, we are going to explore how we can harness the power of Text-to-Speech from Amazon Polly.
How to access Amazon Polly in AWS
A service should be easy to access and easy to understand so that you can start using it quickly. AWS makes it easy to set up, test, and integrate Polly into applications. Let's dive into the process.
Step 1: Log In to your AWS Console
Make sure you have an AWS account and know how to set up root and IAM user access. You can log in as the root user or as an IAM user. Here, we log in as an IAM user.
Step 2: Search for Amazon Polly
After logging in, search for Amazon Polly in the top search bar. You will find a service named Amazon Polly in the results.
Step 3: Generate Simple Speech
After clicking Try Polly, you will see the speech synthesis UI.
Let's understand each component in brief:
- Engine: The backend model that synthesizes speech from the text. Not all regions support all engines. Here, we will go with the Neural engine option.
- Language: A dropdown listing the languages that Polly supports. Polly is multilingual and currently covers the major languages.
- Voice: The different voices that Amazon Polly supports. Currently it has 13 voices: 7 female and 6 male. You can choose any one of them to generate speech in the desired style.
- Input Text: Where we provide the text we want read aloud. In the top right corner there is an SSML toggle, which lets you use SSML to customize the generated speech.
After entering the text, click the Listen button in the upper right to generate the speech. A Download button is also provided to download the generated speech in mp3 or wav format.
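The same request can also be made from code. The following is a minimal sketch using the AWS SDK for Python (boto3), not an official example: the voice `Joanna`, the region `us-east-1`, and the file name `hello.mp3` are assumptions chosen for illustration, and the helper names `build_request`, `synthesize_bytes`, and `main` are hypothetical. For audio, the API's `OutputFormat` accepts `mp3`, `ogg_vorbis`, and `pcm`.

```python
from pathlib import Path


def build_request(text, voice="Joanna", engine="neural", audio_format="mp3"):
    """Map the console fields (Engine, Voice, Input Text) to API parameters."""
    return {
        "Engine": engine,
        "VoiceId": voice,  # assumed voice; pick any voice your region offers
        "OutputFormat": audio_format,
        "Text": text,
    }


def synthesize_bytes(polly_client, text, **options):
    """Call synthesize_speech and return the raw audio bytes."""
    response = polly_client.synthesize_speech(**build_request(text, **options))
    return response["AudioStream"].read()


def main():
    # Requires `pip install boto3` and configured AWS credentials.
    import boto3

    client = boto3.client("polly", region_name="us-east-1")  # assumed region
    audio = synthesize_bytes(client, "Hello from Amazon Polly!")
    Path("hello.mp3").write_bytes(audio)  # hypothetical output file name
```

Calling `main()` performs a real, billable request, so the helpers take the client as a parameter and can be tried with a stub object first.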
Audio from Amazon Polly
Step 4: Generate Customized Speech with SSML
As heard above, the speech is good, but it does not yet sound fully lifelike. We can use SSML (Speech Synthesis Markup Language) to improve the generated speech. Read about SSML here.
Let's use this customized text, which uses SSML tags to control intonation, prosody, and the expansion of abbreviations.
<speak>
<p>Hello guys!</p>
My name is Amazon Polly and I am an <sub alias="Artificial Intelligence">AI</sub> model.
I can pronounce <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>
and also <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
I have internal modules made up of <sub alias="Mercury">Hg</sub>, as it remains liquid at room temperature.
Let me let you in on a secret: <prosody volume="-6dB">I am just a system of if-else statements actually. Shh!</prosody>
</speak>
As we can see, the above text uses the sub, phoneme, and prosody tags to control speech generation more precisely.
Let's listen to the generated speech from Polly using SSML
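When SSML is assembled in code rather than typed by hand, it helps to escape user text and catch malformed markup before sending it. The sketch below is one possible approach using only the Python standard library. The helper names (`sub`, `quiet`, `speak`, `ssml_request`) are hypothetical, and the voice `Joanna` is an assumption. The one Polly-specific detail is that `synthesize_speech` needs `TextType="ssml"` for the text to be treated as SSML.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def sub(text, alias):
    """Return a <sub> tag so that `alias` is spoken in place of `text`."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def quiet(text, volume="-6dB"):
    """Return a <prosody> tag that lowers the volume of `text`."""
    return f"<prosody volume={quoteattr(volume)}>{escape(text)}</prosody>"


def speak(*parts):
    """Wrap already-escaped parts in <speak> and reject malformed markup."""
    ssml = "<speak>" + " ".join(parts) + "</speak>"
    ET.fromstring(ssml)  # raises ET.ParseError if the markup is not well-formed
    return ssml


def ssml_request(ssml, voice="Joanna", engine="neural", audio_format="mp3"):
    """Build synthesize_speech parameters for SSML input."""
    return {
        "Engine": engine,
        "VoiceId": voice,  # assumed voice
        "OutputFormat": audio_format,
        "Text": ssml,
        "TextType": "ssml",  # tells Polly to interpret Text as SSML
    }
```

For example, `speak(escape("I am made of"), sub("Hg", "Mercury"), quiet("Shh!"))` returns one SSML string that can be passed to `ssml_request`. The check only verifies that the XML is well-formed; it does not know which tags a given engine supports, and it would reject tags with the `amazon:` prefix because the prefix is not declared.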
Step 5: Customize Pronunciation
Amazon Polly also allows users to upload lexicons, which define custom pronunciations for more nuanced and customized speech. See "uploading a lexicon" and "using multiple lexicons" to learn how to add a lexicon file to the engine.
A typical lexicon file, which has a .xml extension, has the following contents:
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="ipa"
      xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>
You can use this format to define pronunciations for any text that you want Polly to read differently.
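A lexicon like the one above can also be generated and uploaded from code. The following sketch makes several assumptions: the helper names `build_lexicon` and `upload_lexicon` are hypothetical, the generated file omits the optional schema attributes, and only alias entries are handled. `put_lexicon` is the boto3 call for storing a lexicon, and a stored lexicon is applied by passing its name in the `LexiconNames` parameter of `synthesize_speech`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Render {grapheme: alias} pairs as a PLS lexicon document."""
    lexemes = "".join(
        "  <lexeme>\n"
        f"    <grapheme>{escape(grapheme)}</grapheme>\n"
        f"    <alias>{escape(alias)}</alias>\n"
        "  </lexeme>\n"
        for grapheme, alias in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NS}" '
        f'alphabet="ipa" xml:lang={quoteattr(lang)}>\n'
        f"{lexemes}"
        "</lexicon>\n"
    )


def upload_lexicon(polly_client, name, aliases, lang="en-US"):
    """Build the lexicon, check that it parses, and store it under `name`."""
    content = build_lexicon(aliases, lang)
    ET.fromstring(content.encode("utf-8"))  # fail early on malformed XML
    polly_client.put_lexicon(Name=name, Content=content)
    return content
```

A call such as `upload_lexicon(client, "acronyms", {"W3C": "World Wide Web Consortium"})` would store the lexicon under the hypothetical name `acronyms`. Lexicon names are restricted to a short alphanumeric string, so check the current limits before choosing one.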
Access Polly Programmatically
Now that we have tried the service, some of you might be keen to integrate it into your own application. AWS has you covered here.
Using the AWS SDK, we can integrate Polly into any application.
- First, install the AWS SDK. It supports various programming languages, including Java, Python, JavaScript (Node.js), and Ruby.
- Initialize an instance of the Polly client in your application code using the AWS SDK, then use the client to synthesize speech from text. Polly supports several languages and voices.
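The two steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than official code: the region `us-east-1` is an assumption, and `make_polly_client`, `list_voices`, and `count_by_gender` are hypothetical helper names. `describe_voices` is the boto3 call that lists voices and accepts `Engine` and `LanguageCode` filters, so it is a convenient way to check which voices exist for a given engine and language before synthesizing.

```python
from collections import Counter


def make_polly_client(region_name="us-east-1"):
    """Create a Polly client. Needs `pip install boto3` and AWS credentials."""
    import boto3  # imported lazily so the helpers below work with a stub client

    return boto3.client("polly", region_name=region_name)


def list_voices(polly_client, engine="neural", language_code="en-US"):
    """Return every voice for the engine and language, following pagination."""
    kwargs = {"Engine": engine, "LanguageCode": language_code}
    voices = []
    while True:
        page = polly_client.describe_voices(**kwargs)
        voices.extend(page["Voices"])
        token = page.get("NextToken")
        if not token:
            return voices
        kwargs["NextToken"] = token


def count_by_gender(voices):
    """Summarize a voice list as {gender: count}."""
    return dict(Counter(voice["Gender"] for voice in voices))
```

For example, `count_by_gender(list_voices(make_polly_client()))` reports how many female and male voices are available for US English on the neural engine in the chosen region. The larger wrapper class from the developer docs is discussed next.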
The PollyWrapper class below comes from the AWS developer docs, so you can save it in a simple Python file and try it out. It is largely self-explanatory.
However, keep in mind that you need the boto3 package, configured AWS credentials, and an S3 resource to make it work properly. The constructor takes the S3 resource as an argument, although the synthesize method shown here does not use it.
See full examples on their GitHub.
import json
import logging

from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)


class PollyWrapper:
    """Encapsulates Amazon Polly functions."""

    def __init__(self, polly_client, s3_resource):
        """
        :param polly_client: A Boto3 Amazon Polly client.
        :param s3_resource: A Boto3 Amazon Simple Storage Service (Amazon S3) resource.
        """
        self.polly_client = polly_client
        self.s3_resource = s3_resource
        self.voice_metadata = None

    def synthesize(
        self, text, engine, voice, audio_format, lang_code=None, include_visemes=False
    ):
        """
        Synthesizes speech or speech marks from text, using the specified voice.

        :param text: The text to synthesize.
        :param engine: The kind of engine used. Can be standard or neural.
        :param voice: The ID of the voice to use.
        :param audio_format: The audio format to return for synthesized speech. When
                             speech marks are synthesized, the output format is JSON.
        :param lang_code: The language code of the voice to use. This has an effect
                          only when a bilingual voice is selected.
        :param include_visemes: When True, a second request is made to Amazon Polly
                                to synthesize a list of visemes, using the specified
                                text and voice. A viseme represents the visual position
                                of the face and mouth when saying part of a word.
        :return: The audio stream that contains the synthesized speech and a list
                 of visemes that are associated with the speech audio.
        """
        try:
            kwargs = {
                "Engine": engine,
                "OutputFormat": audio_format,
                "Text": text,
                "VoiceId": voice,
            }
            if lang_code is not None:
                kwargs["LanguageCode"] = lang_code
            response = self.polly_client.synthesize_speech(**kwargs)
            audio_stream = response["AudioStream"]
            logger.info("Got audio stream spoken by %s.", voice)
            visemes = None
            if include_visemes:
                # Second request: same text and voice, but ask for viseme speech marks.
                kwargs["OutputFormat"] = "json"
                kwargs["SpeechMarkTypes"] = ["viseme"]
                response = self.polly_client.synthesize_speech(**kwargs)
                # Speech marks arrive as one JSON object per line.
                visemes = [
                    json.loads(v)
                    for v in response["AudioStream"].read().decode().split()
                    if v
                ]
                logger.info("Got %s visemes.", len(visemes))
        except ClientError:
            logger.exception("Couldn't get audio stream.")
            raise
        else:
            return audio_stream, visemes
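The viseme handling in the class above relies on Polly returning speech marks as one JSON object per line. The sketch below isolates that parsing step so that it can be tried without an AWS call. The function names `parse_speech_marks` and `viseme_timeline` are hypothetical, and the sample bytes in the usage note are made up in the viseme format of `time`, `type`, and `value` fields.

```python
import json


def parse_speech_marks(raw):
    """Parse newline-delimited JSON speech marks into a list of dicts."""
    lines = raw.decode("utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]


def viseme_timeline(marks):
    """Return (time in ms, viseme value) pairs, skipping other mark types."""
    return [
        (mark["time"], mark["value"])
        for mark in marks
        if mark.get("type") == "viseme"
    ]
```

Given the bytes `b'{"time":0,"type":"viseme","value":"p"}\n{"time":125,"type":"viseme","value":"E"}\n'`, `viseme_timeline(parse_speech_marks(...))` yields `[(0, "p"), (125, "E")]`. Splitting on lines instead of on all whitespace is a deliberate choice here: it should also cope with word or sentence marks, whose values can contain spaces.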
Conclusion
In conclusion, Amazon Polly offers a powerful and flexible solution for converting text to lifelike speech, thanks to its advanced neural engine, support for multiple languages, and a variety of customization options. By using SSML tags and custom lexicons, you can fine-tune Polly to meet unique pronunciation and stylistic needs, making it a versatile tool for applications across industries. Furthermore, Polly’s integration with the AWS SDK allows seamless programmatic access, enabling developers to embed high-quality text-to-speech capabilities directly into their applications. Whether for accessibility, interactive user experiences, or content creation, Amazon Polly provides robust tools to bring voice to your text.