Speech Synthesis Markup Language

Source: Wikipedia, the free encyclopedia.

Speech Synthesis Markup Language (SSML) is an XML-based markup language for speech synthesis (text-to-speech, TTS) applications. It is used, for example, to produce speech through Azure Cognitive Services' Text to Speech API and when writing third-party skills for Google Assistant or Amazon Alexa.

SSML is based on the Java Speech Markup Language (JSML) developed by Sun Microsystems, although the current recommendation was developed mostly by speech synthesis vendors. It covers virtually all aspects of synthesis, although some areas have been left unspecified, so each vendor accepts a different variant of the language. Also, in the absence of markup, the synthesizer is expected to do its own interpretation of the text.


Here is an example of an SSML document:

<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       version="1.0">
  <metadata>
    <dc:title xml:lang="en">Telephone Menu: Level 1</dc:title>
  </metadata>
  <s xml:lang="en-US">
    <voice name="David" gender="male" age="25">
      For English, press <emphasis>one</emphasis>.
    </voice>
  </s>
  <s xml:lang="es-MX">
    <voice name="Miguel" gender="male" age="25">
      Para español, oprima el <emphasis>dos</emphasis>.
    </voice>
  </s>
</speak>
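
As a rough illustration of how a program might read such a document, the following sketch uses Python's standard xml.etree.ElementTree module. It embeds a well-formed version of the telephone menu and lists each sentence with its language and voice. The helper name `sentences` is hypothetical and not part of SSML or of any vendor API.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Well-formed version of the telephone menu example.
DOCUMENT = """<?xml version="1.0"?>
<speak xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:dc="http://purl.org/dc/elements/1.1/"
       version="1.0">
  <metadata>
    <dc:title xml:lang="en">Telephone Menu: Level 1</dc:title>
  </metadata>
  <s xml:lang="en-US">
    <voice name="David" gender="male" age="25">
      For English, press <emphasis>one</emphasis>.
    </voice>
  </s>
  <s xml:lang="es-MX">
    <voice name="Miguel" gender="male" age="25">
      Para español, oprima el <emphasis>dos</emphasis>.
    </voice>
  </s>
</speak>"""


def sentences(ssml: bytes):
    """Return (language, voice name, spoken text) for each <s> element."""
    root = ET.fromstring(ssml)
    result = []
    for s in root.iter(f"{{{SSML_NS}}}s"):
        lang = s.get(f"{{{XML_NS}}}lang")
        voice = s.find(f"{{{SSML_NS}}}voice")
        name = voice.get("name") if voice is not None else None
        # itertext() flattens nested markup such as <emphasis>;
        # split/join collapses the indentation whitespace.
        text = " ".join("".join(s.itertext()).split())
        result.append((lang, name, text))
    return result


for lang, name, text in sentences(DOCUMENT.encode("utf-8")):
    print(lang, name, text)
```

The sketch only extracts text and attributes. A real synthesizer would additionally act on the voice and emphasis markup when rendering audio.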



SSML also defines markup for prosody, through its prosody element, which the above example does not use. This includes markup for:

  • pitch
  • contour
  • pitch range
  • rate
  • duration
  • volume
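
The properties above correspond to attributes of the SSML prosody element, with pitch range written as the `range` attribute. A minimal sketch of generating such markup in Python follows. The helper name `prosody` is hypothetical, and the values `slow` and `soft` are labels defined by the W3C recommendation; individual vendors may support only a subset of attributes and values.

```python
import xml.etree.ElementTree as ET

# Attribute names of the SSML <prosody> element.
# "pitch range" in the list above is the `range` attribute.
PROSODY_ATTRIBUTES = ("pitch", "contour", "range", "rate", "duration", "volume")


def prosody(text: str, **attributes: str) -> str:
    """Wrap text in a <prosody> element carrying the given attributes."""
    unknown = set(attributes) - set(PROSODY_ATTRIBUTES)
    if unknown:
        raise ValueError(f"not a prosody attribute: {sorted(unknown)}")
    element = ET.Element("prosody", attributes)
    element.text = text  # special characters are escaped on serialization
    return ET.tostring(element, encoding="unicode")


print(prosody("Please hold.", rate="slow", volume="soft"))
```

The resulting fragment would be placed inside a speak element before being sent to a synthesizer.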

See also