
Oct 28, 2024 · 5 min read

How to Use Amazon Polly Text-to-Speech: A Complete Guide

Explore how to use Amazon Polly, AWS's Text-to-Speech service. Learn setup, SSML customization, lexicon support, and programmatic integration for lifelike speech.

cover image

Kaushal Choudhary

Senior Developer Advocate


We have covered many Text-to-Speech apps, Chrome extensions, websites, and platforms, but Amazon Web Services (AWS) offers a notable Text-to-Speech service of its own: Amazon Polly. Amazon Polly is a cloud service that uses deep learning models to synthesize speech from text.

In this article, we are going to explore how we can harness the power of Text-to-Speech from Amazon Polly.

How to access Amazon Polly in AWS

A service is only useful if it is easy to access and quick to understand. AWS makes it easy to set up, test, and integrate Polly into applications. Let's dive into the process.

Step 1: Log In to your AWS Console

Make sure you have an AWS account and know how to set up root and IAM user access. You can log in as the root user or as an IAM user. Here, we log in as an IAM user.

login-to-aws-console

Step 2: Search for Amazon Polly

After logging in, search for Amazon Polly in the top search bar. The service appears in the results as Amazon Polly and looks similar to this.

amazon-polly

Step 3: Generate Simple Speech

After clicking Try Polly, you will see a UI similar to the image below.

Let's understand each component in brief:

  1. Engine: The backend model that synthesizes speech from the text. Not all regions support all engines; here, we use the Neural engine option.

  2. Language: A dropdown listing the languages Polly supports. Polly is multilingual and covers many major languages.

  3. Voice: The voices Amazon Polly offers. At the time of writing, it shows 13 voices: 7 female and 6 male. Choose any of them to generate speech in the desired style.

  4. Input Text: Where we provide the text we want read aloud. In the top right corner there is an SSML toggle, which lets you use SSML to customize the generated speech.
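The engine, language, and voice choices above can also be inspected from code. The following is a minimal sketch, assuming Boto3 and a Polly client created with boto3.client("polly"). The helper name voices_for is hypothetical; it only wraps the describe_voices call and follows its pagination token.

```python
def voices_for(polly_client, engine="neural", language_code="en-US"):
    """Return sorted (voice_id, gender) pairs for one engine and language.

    `polly_client` is assumed to be a Boto3 Polly client, for example
    boto3.client("polly"); any object with a compatible describe_voices
    method works.
    """
    kwargs = {"Engine": engine, "LanguageCode": language_code}
    voices = []
    while True:
        response = polly_client.describe_voices(**kwargs)
        voices.extend((v["Id"], v["Gender"]) for v in response["Voices"])
        # describe_voices is paginated; continue while a token is returned.
        token = response.get("NextToken")
        if not token:
            return sorted(voices)
        kwargs["NextToken"] = token
```

Calling voices_for(boto3.client("polly")) with configured credentials should list the voices the console shows for that engine and language in your region.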

After entering the text, click the Listen button in the upper right to generate the speech. A Download button is also provided to save the generated speech as an audio file such as MP3.
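The Listen and Download buttons correspond roughly to a single synthesize_speech call in the SDK. Here is a minimal sketch, assuming Boto3 and a Polly client created with boto3.client("polly"). The helper name save_speech and the default voice "Joanna" are illustrative choices, not part of the console flow described above.

```python
from contextlib import closing


def save_speech(polly_client, text, path, voice_id="Joanna", engine="neural"):
    """Synthesize `text` to an MP3 file and return the number of bytes written."""
    response = polly_client.synthesize_speech(
        Engine=engine,
        OutputFormat="mp3",  # the API also accepts "ogg_vorbis" and "pcm"
        Text=text,
        VoiceId=voice_id,
    )
    # AudioStream is a file-like streaming body; close it once it has been read.
    with closing(response["AudioStream"]) as stream, open(path, "wb") as out:
        return out.write(stream.read())
```

With credentials configured, save_speech(boto3.client("polly"), "Hello from Polly", "hello.mp3") should leave a playable file next to your script.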

generate-tts-with-polly

Audio from Amazon Polly

Step 4: Generate Customized Speech with SSML

As heard above, the speech is good, but plain text alone is not enough to produce lifelike, human-sounding voices. We can use SSML (Speech Synthesis Markup Language) to make the generated speech even better. Read about SSML here.

Let's use the following text, which uses SSML tags to control pronunciation, prosody, and how abbreviations are read.

<speak>

    <p>Hello guys!</p>
    My name is Amazon Polly and I am an <sub alias="Artificial Intelligence">AI</sub> model.
    I can pronounce <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>
    and also <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
    I have internal modules made up of <sub alias="Mercury">Hg</sub>, as it remains liquid at room temperature.
    Let me tell you a secret. <prosody volume="-6dB">I am just a system of if-else statements actually. Shh!</prosody>

</speak>

As we can see, the text above uses the sub, phoneme, and prosody tags to control speech generation more precisely.
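When SSML is sent through the SDK instead of the console toggle, the request has to say that the text is SSML. The sketch below assumes Boto3 and a Polly client created with boto3.client("polly"). The helpers sub, quiet, speak, and synthesize_ssml are hypothetical names; they only build escaped tags and pass TextType="ssml" to synthesize_speech.

```python
from xml.sax.saxutils import escape, quoteattr


def sub(text, alias):
    """Wrap `text` in a <sub> tag so that `alias` is spoken instead."""
    return f"<sub alias={quoteattr(alias)}>{escape(text)}</sub>"


def quiet(text, volume="-6dB"):
    """Wrap `text` in a <prosody> tag that changes the volume."""
    return f"<prosody volume={quoteattr(volume)}>{escape(text)}</prosody>"


def speak(*parts):
    """Join fragments that are already escaped or marked up into <speak>."""
    return "<speak>" + " ".join(parts) + "</speak>"


def synthesize_ssml(polly_client, ssml, voice_id="Joanna", engine="neural"):
    """Send SSML to Polly; TextType="ssml" makes the service parse the tags."""
    return polly_client.synthesize_speech(
        Engine=engine,
        OutputFormat="mp3",
        Text=ssml,
        TextType="ssml",
        VoiceId=voice_id,
    )
```

For example, speak(escape("I am an"), sub("AI", "Artificial Intelligence"), escape("model."), quiet("Shh!")) produces a document shaped like the one above. Escaping plain text first keeps characters such as & and < from breaking the markup.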

Let's listen to the speech Polly generates from this SSML.

with-ssml

Step 5: Customize Pronunciation

Amazon Polly also lets users upload lexicons that define custom pronunciations for more nuanced speech. See the AWS documentation on uploading a lexicon and on using multiple lexicons to learn how to add a lexicon file to the engine.

A typical lexicon file, saved with a .xml extension, has the following contents:

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
        http://www.w3.org/TR/2007/CR-pronunciation-lexicon-20071212/pls.xsd"
      alphabet="ipa"
      xml:lang="en-US">
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>

You can follow this pattern to add lexicon entries for any term whose pronunciation you want to customize.
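The same lexicon can be generated and uploaded from code. This is a minimal sketch, assuming Boto3 and a Polly client created with boto3.client("polly"). The helpers build_lexicon and speak_with_lexicon are hypothetical names; the sketch covers alias entries only and omits the optional schema attributes from the file above.

```python
from xml.sax.saxutils import escape

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"


def build_lexicon(aliases, lang="en-US"):
    """Build a PLS lexicon string mapping each grapheme to a spoken alias."""
    lexemes = "".join(
        f"  <lexeme>\n"
        f"    <grapheme>{escape(grapheme)}</grapheme>\n"
        f"    <alias>{escape(alias)}</alias>\n"
        f"  </lexeme>\n"
        for grapheme, alias in aliases.items()
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<lexicon version="1.0" xmlns="{PLS_NS}" alphabet="ipa" xml:lang="{lang}">\n'
        f"{lexemes}"
        "</lexicon>\n"
    )


def speak_with_lexicon(polly_client, name, lexicon, text,
                       voice_id="Joanna", engine="neural"):
    """Upload `lexicon` under `name`, then synthesize `text` with it applied."""
    polly_client.put_lexicon(Name=name, Content=lexicon)
    return polly_client.synthesize_speech(
        Engine=engine,
        LexiconNames=[name],
        OutputFormat="mp3",
        Text=text,
        VoiceId=voice_id,
    )
```

A call such as speak_with_lexicon(client, "w3c", build_lexicon({"W3C": "World Wide Web Consortium"}), "I work with the W3C.") should read the abbreviation in full, provided the lexicon language matches the language of the chosen voice.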

custom-pronunciation-with-polly

Access Polly Programmatically

Having tried the service in the console, some of you may be keen to integrate it into your own application. AWS does not disappoint here.

Using the AWS SDK, we can integrate Polly into an application.

  • First, install the AWS SDK. It is available for several programming languages, including Java, Python, JavaScript (Node.js), and Ruby.
  • Initialize an instance of the Polly client in your application code using the AWS SDK, then use that client to synthesize speech from text. Polly supports several languages and voices.

The code below comes from the AWS developer docs. It defines a wrapper class, so running the file on its own produces no output; you call its synthesize method from your own code. To use it, you need AWS credentials configured for Boto3, a Boto3 Polly client, and a Boto3 S3 resource to pass to the constructor.

See the full examples in the AWS repository on GitHub.

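import json
import logging

from botocore.exceptions import ClientError

logger = logging.getLogger(__name__)


class PollyWrapper: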
    """Encapsulates Amazon Polly functions."""

    def __init__(self, polly_client, s3_resource):
        """
        :param polly_client: A Boto3 Amazon Polly client.
        :param s3_resource: A Boto3 Amazon Simple Storage Service (Amazon S3) resource.
        """
        self.polly_client = polly_client
        self.s3_resource = s3_resource
        self.voice_metadata = None


    def synthesize(
        self, text, engine, voice, audio_format, lang_code=None, include_visemes=False
    ):
        """
        Synthesizes speech or speech marks from text, using the specified voice.

        :param text: The text to synthesize.
        :param engine: The kind of engine used. Can be standard or neural.
        :param voice: The ID of the voice to use.
        :param audio_format: The audio format to return for synthesized speech. When
                             speech marks are synthesized, the output format is JSON.
        :param lang_code: The language code of the voice to use. This has an effect
                          only when a bilingual voice is selected.
        :param include_visemes: When True, a second request is made to Amazon Polly
                                to synthesize a list of visemes, using the specified
                                text and voice. A viseme represents the visual position
                                of the face and mouth when saying part of a word.
        :return: The audio stream that contains the synthesized speech and a list
                 of visemes that are associated with the speech audio.
        """
        try:
            kwargs = {
                "Engine": engine,
                "OutputFormat": audio_format,
                "Text": text,
                "VoiceId": voice,
            }
            if lang_code is not None:
                kwargs["LanguageCode"] = lang_code
            response = self.polly_client.synthesize_speech(**kwargs)
            audio_stream = response["AudioStream"]
            logger.info("Got audio stream spoken by %s.", voice)
            visemes = None
            if include_visemes:
                kwargs["OutputFormat"] = "json"
                kwargs["SpeechMarkTypes"] = ["viseme"]
                response = self.polly_client.synthesize_speech(**kwargs)
                visemes = [
                    json.loads(v)
                    for v in response["AudioStream"].read().decode().split()
                    if v
                ]
                logger.info("Got %s visemes.", len(visemes))
        except ClientError:
            logger.exception("Couldn't get audio stream.")
            raise
        else:
            return audio_stream, visemes
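The viseme list returned by synthesize comes from Polly speech marks, which arrive as one JSON object per line with time, type, and value fields. The sketch below shows one way to turn that raw text into a timeline. The helper names parse_speech_marks and viseme_timeline are hypothetical and are not part of the AWS example.

```python
import json


def parse_speech_marks(raw):
    """Parse Polly speech marks, which arrive as one JSON object per line."""
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def viseme_timeline(marks):
    """Return (time_ms, viseme) pairs for the viseme marks, in time order."""
    return sorted((m["time"], m["value"]) for m in marks if m["type"] == "viseme")
```

A timeline like this can drive a simple lip-sync animation: at each time offset, in milliseconds from the start of the audio, show the mouth shape named by the viseme value.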

Conclusion

Amazon Polly offers a powerful and flexible way to convert text to lifelike speech, thanks to its neural engine, support for multiple languages, and a variety of customization options. With SSML tags and custom lexicons, you can fine-tune Polly to meet specific pronunciation and stylistic needs. Polly’s integration with the AWS SDK also gives developers programmatic access, so they can embed text-to-speech directly into their applications. Whether for accessibility, interactive user experiences, or content creation, Amazon Polly provides the tools to bring voice to your text.