Oct 8, 2024 • 5 min Read
What is Speech Synthesis Markup Language (SSML)
Learn about SSML and how it can help in text-to-speech
Kaushal Choudhary
Senior Developer Advocate
What is Speech Synthesis Markup Language?
Speech Synthesis Markup Language (SSML) is a standard from W3C for the generation of synthetic speech in web and other applications. It provides a rich, XML-based markup language that assists authors and developers in gaining fine-grained control over various speech aspects, such as pronunciation, volume, pitch, and rate across different platforms. In this article, we will explore SSML in depth and how it augments the speech synthesis process for modern applications.
History
The origin of SSML can be traced back to IBM's introduction of Generalized Markup Language (GML) in 1969. GML provided a set of macros or tags to define text structures, such as phrases, paragraphs, lists, and tables. An example of early GML:
:h1 id='intr'.Chapter 1: Introduction
:p.GML supported hierarchical containers like
:ol.
:li.Ordered lists,
:li.Unordered lists,
:li.Definition lists
:eol.
:p.Markup minimization allowed the omission of end-tags for elements like "h1" and "p."
GML evolved into Standard Generalized Markup Language (SGML), based on two key principles:
- Declarative: Markup should describe document structure rather than specify processing.
- Rigorous: Markup must strictly define objects to ensure compatibility with processing techniques.
SGML, a metalanguage for defining markup languages, was designed for large-scale document sharing in government and industry. In 1998, it was reworked into XML, a simplified version widely adopted for storing and transmitting data. However, text-to-speech (TTS) systems still relied on idiosyncratic tags and specifications, making it difficult for developers to adopt a general speech synthesis standard.
In response, SABLE was introduced in 1998 as an XML/SGML-based markup scheme for TTS. SABLE aimed to provide a common control system for TTS but lacked standardization, was complex, and struggled with cross-platform compatibility. This led to the development of SSML, a more robust, standardized, and widely compatible solution, which was adopted by the W3C.
How SSML Works
We will now explore how SSML can be used in the speech synthesis process.
A Text-to-Speech system that supports SSML converts a document into spoken output, using the markup to ensure that the speech sounds as the author intended.
Steps Involved
- XML Parsing: Think of the XML parser as a document reader that extracts the main structure and content from your text. It identifies tags and attributes, which play a crucial role in how the rest of the process works.
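This parsing step can be sketched with Python's standard-library XML parser; the short SSML snippet below is illustrative, not tied to any particular TTS engine:

```python
# A minimal sketch of the XML-parsing step: read an SSML document
# and extract each element's tag, attributes, and text content.
import xml.etree.ElementTree as ET

ssml = """<speak>
  <p>
    <s>This is a sentence.</s>
    <s>Here's another one.</s>
  </p>
</speak>"""

root = ET.fromstring(ssml)

# Walk the element tree the way a TTS front end would,
# collecting structure (tags/attributes) and spoken content (text).
for elem in root.iter():
    print(elem.tag, elem.attrib, (elem.text or "").strip())
```

A real TTS front end does the same walk, but hands the tags and attributes on to the later stages (normalization, prosody, and so on).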
- Structure Analysis: The entire text needs to be enclosed in this structure:
<?xml version="1.0"?>
<speak version="1.1"
xmlns="http://www.w3.org/2001/10/synthesis"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
xml:lang="en-US">
<p>
<s>This is a sentence.</s>
<s>Here’s another one.</s>
</p>
</speak>
Tags like <p> for paragraphs and <s> for sentences help organize the content. If these tags are missing, the Text-to-Speech (TTS) system uses punctuation and language rules to make sense of it.
Note: All the text that needs to be spoken must be inside the <speak>...</speak> element.
- Text Normalization: This step converts written text into how it would be spoken, such as turning "$200" into "two hundred dollars." Tokens cannot cross tag boundaries, except within <token> or <w> tags; the <w> element is an alias for the <token> element. You can use tags like <sub>, which provides an alias for the text-to-speech system to pronounce abbreviations. However, keep in mind that the TTS system may try to guess pronunciations and could make mistakes.
For example, the alias for W3C can be written as:
<!-- abbreviation control -->
<sub alias="World Wide Web Consortium">W3C</sub>
Here, the system will speak the alias, not the abbreviation.
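To make the "$200" example above concrete, here is a toy sketch of what a normalizer does. Real TTS normalizers handle dates, times, ordinals, currencies, and much more; this illustrative function only expands whole-hundred dollar amounts:

```python
# A toy text normalizer: expand "$200" into its spoken form.
# Only whole hundreds under $1000 are handled, for illustration.
import re

ONES = ["zero", "one", "two", "three", "four", "five",
        "six", "seven", "eight", "nine"]

def normalize_dollars(text: str) -> str:
    def spell(match: re.Match) -> str:
        amount = int(match.group(1))
        if amount % 100 == 0 and 0 < amount < 1000:
            return f"{ONES[amount // 100]} hundred dollars"
        return match.group(0)  # leave anything else unchanged
    return re.sub(r"\$(\d+)", spell, text)

print(normalize_dollars("The ticket costs $200."))
# -> The ticket costs two hundred dollars.
```

The key point is that normalization happens on the token level, which is why SSML forbids tokens from crossing tag boundaries.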
- Text-to-Phoneme Conversion: In this step, the TTS system turns words into sounds (phonemes), which can be tricky because written and spoken forms often differ. You can specify how a word should be pronounced using the <phoneme> tag. If no specific pronunciation is given, the system looks up words in pronunciation dictionaries to figure out how to say them. The <say-as> element might also be used to indicate that text is a proper name, allowing the TTS system to apply special rules to determine the pronunciation. The <lexicon> and <lookup> elements can also be used to reference external definitions of pronunciations.
<speak>
<!-- lexicon -->
<lexicon uri="http://www.example.com/lexicon.pls"
xml:id="pls"/>
<lexicon uri="http://www.example.com/strange-words.file"
xml:id="sw"
type="media-type"/>
<!-- lookup -->
<lookup ref="pls">
tokens here are looked up in lexicon.pls
<lookup ref="sw">
tokens here are looked up first in strange-words.file and then, if not found, in lexicon.pls
</lookup>
tokens here are looked up in lexicon.pls
</lookup>
tokens here are not looked up in lexicon documents
</speak>
- Prosody Analysis: This involves analyzing the rhythm and melody of speech, making it sound more natural. You can guide the TTS system with tags like <emphasis>, <break>, and <prosody> to emphasize certain words or control pacing.
<speak>
<!-- emphasis -->
That is a <emphasis>big</emphasis> car!
That is a <emphasis level="strong">huge</emphasis> bank account!
Take a deep breath <break/> then continue.
Press 1 or wait for the tone. <break time="3s"/>
I didn't hear you! <break strength="weak"/> Please repeat.
<!-- prosody -->
<s>I am speaking this at the default volume for this voice.</s>
<s><prosody volume="+6dB">
I am speaking this at approximately twice the original signal amplitude.
</prosody></s>
<s><prosody volume="-6dB">
I am speaking this at approximately half the original signal amplitude.
</prosody></s>
</speak>
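The "+6dB is roughly twice the amplitude" claim in the example follows directly from how decibels are defined for signal amplitude (ratio = 10^(dB/20)), which a couple of lines of Python can verify:

```python
# Decibels relate to signal amplitude as ratio = 10 ** (dB / 20),
# which is why +6dB is about double and -6dB about half.
def db_to_amplitude_ratio(db: float) -> float:
    return 10 ** (db / 20)

print(db_to_amplitude_ratio(6.0))   # ~1.995 (about twice the amplitude)
print(db_to_amplitude_ratio(-6.0))  # ~0.501 (about half the amplitude)
```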
- Waveform Production: Finally, the TTS system creates the actual audio output using all the gathered information. You can specify different voices with the <voice> tag or add audio clips with the <audio> tag. The default settings for volume, speed, and pitch are based on the original sound, so be sure to tweak them if needed!
<speak>
<!-- voice -->
<voice gender="female" required="languages gender age" languages="en-US ja">
Any female voice here.
<voice age="6">
A female child voice here.
<lang xml:lang="ja">
<!-- Same female child voice rendering Japanese text. -->
</lang>
</voice>
</voice>
<!-- Empty element -->
Please say your name after the tone. <audio src="beep.wav"/>
<!-- Container element with alternative text -->
<audio src="prompt.au">What city do you want to fly from?</audio>
<audio src="welcome.wav">
<emphasis>Welcome</emphasis> to the Voice Portal.
</audio>
</speak>
This was a basic structure of how SSML can be used to fine-tune text-to-speech generation.
For example, let's see how Amazon Polly uses this:
<speak>
Here is a simple word:
<p>hello</p>
Here is the same word spelled out:
<say-as interpret-as='spell-out'>hello</say-as>.
Here is different pronunciation:
You say, <phoneme alphabet="ipa" ph="pɪˈkɑːn">pecan</phoneme>.
I say, <phoneme alphabet="ipa" ph="ˈpi.kæn">pecan</phoneme>.
Abbreviation support using alias:
My favorite chemical element is <sub alias="Mercury">Hg</sub>, because it looks so shiny.
Prosody control for volume:
Sometimes a lower volume <prosody volume="-6dB">is a more effective way of
interacting with your audience.</prosody>
</speak>
Audio from Amazon Polly
As we can see, it uses the <say-as> element to define how "hello" will be spoken, spelling it out as h e l l o. We can then use <sub alias="word">Actual word</sub> to add support for abbreviations, and control the volume of speech with the <prosody> element. You can find all the supported tags, with examples, in the Amazon Polly SSML documentation.
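One practical note when assembling SSML documents like these programmatically: because SSML is XML, reserved characters (&, <, >) in user-supplied text must be escaped or the document will not parse. Python's standard library handles this; `wrap_in_speak` here is a hypothetical helper for illustration:

```python
# SSML is XML, so reserved characters in spoken text must be escaped
# before wrapping it in markup, or the parser will reject the document.
from xml.sax.saxutils import escape

def wrap_in_speak(text: str) -> str:
    return f"<speak>{escape(text)}</speak>"

print(wrap_in_speak("Tom & Jerry < 3"))
# -> <speak>Tom &amp; Jerry &lt; 3</speak>
```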
Conclusion
In summary, SSML empowers users with fine-grained control over the Text-to-Speech (TTS) system, allowing for a more tailored auditory experience. With tags like <audio>, we can seamlessly integrate recorded audio to enhance speech for various use cases. Each element we discussed is part of the broader SSML document structure, where additional elements and helpful tips are available to further augment the speech generation process. Other controls, such as the rate attribute of <prosody>, let us adjust the speech rate, making it adaptable to different contexts and audiences. Overall, SSML significantly enriches how we interact with TTS technology, paving the way for more engaging and effective communication.