Mastering Pronunciation

Listening to a monotone speech is boring: good orators have learnt to play with rhythm and prosody, adding pauses and emphasis when necessary. Making the robot speak with eloquence is not easy, but the following tools will help you get it just right.

## More TTS tags

You are already familiar with the pause tag. Now you can take it to the next level with tags that modify the speech itself:

1.  "\vol=value\" (value between 0 and 100): changes the volume of TTS. The default value is 80; higher values can introduce clipping in the audio signal.


“Do I have to \vol=100\ shout it out?”

2.  "\rspd=value\" (value between 50 and 400): changes the speed of TTS (default value is 100).


“I’ll only say it once, so listen well: \rspd=200\ supercalifragilisticexpialidocious!”

3.  "\vct=value\" (value between 50 and 200): changes the pitch of TTS (default value is 100).


“That’s \vct=120\ awesome!”

Please note that the value of these first three tags is relative, and depends on the original parameters of TTS.
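As a minimal sketch, assuming you assemble the text in Python before sending it to the robot, the ranges listed above can be checked at the moment a tag is built. The names `TAG_RANGES` and `tts_tag` are hypothetical helpers for illustration, not part of any robot API.

```python
# Hypothetical helper: the ranges are the ones listed in this section.
TAG_RANGES = {
    "vol": (0, 100),    # volume
    "rspd": (50, 400),  # speed
    "vct": (50, 200),   # pitch
}


def tts_tag(name, value):
    """Return a tag such as \\vct=120\\ after checking the documented range."""
    low, high = TAG_RANGES[name]
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    return f"\\{name}={value}\\"


sentence = f"That's {tts_tag('vct', 120)} awesome!"
print(sentence)  # That's \vct=120\ awesome!
```

An out-of-range value such as `tts_tag("rspd", 10)` raises a `ValueError` instead of producing a tag the TTS engine may ignore or mishandle.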

If not reset, these first three tags will change the corresponding parameters of TTS until the end of the current utterance, that is, until the end of the “say” function. To reset these parameters, you can either use the tag again, with the original values, or use the following:

4. “\rst\”: resets all parameters to the default values


“\rspd=120\ \vct=120\ Wait \pau=150\ \vct=110\ wait \rspd=100\ \pau=150\ \vct=100\ wait \pau=200\ \rspd=80\ \vct=100\ wait \pau=400\ \rspd=60\ \vct=110\ wait! \rst\ ”

Note: QiChat resets these parameters automatically, so you do not need \rst\ when testing in Choregraphe.
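When strings are built outside QiChat, forgetting the reset is an easy mistake. The sketch below, which assumes the text is prepared in Python, appends \rst\ only when one of the three persistent tags is present and the text does not already end with a reset. `with_reset` is a hypothetical helper name.

```python
PERSISTENT_TAGS = ("\\vol=", "\\rspd=", "\\vct=")
RESET_TAG = "\\rst\\"


def with_reset(text):
    """Append the reset tag when a persistent tag is used and no reset ends the text."""
    stripped = text.rstrip()
    uses_persistent = any(tag in stripped for tag in PERSISTENT_TAGS)
    if uses_persistent and not stripped.endswith(RESET_TAG):
        return stripped + " " + RESET_TAG
    return text


print(with_reset("\\rspd=70\\ Wow!"))  # \rspd=70\ Wow! \rst\
print(with_reset("Hello"))             # Hello
```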

The following tags have a one-time effect and don’t need to be reset:

5.  "\emph=value\": changes the emphasis of TTS for the word following the tag
    - 0: reduced
    - 1: stressed
    - 2: accented


“Sorry, we \emph=2\ are \emph=2\ late.”

6.  “\bound=value\”: changes the prosodic boundary after the next word (all languages but Japanese)
    - W: Weak phrase boundary (equivalent to a comma, but without the small pause)
    - S: Strong phrase boundary (equivalent to a comma)
    - N: No boundary (suppresses a boundary set by TTS)


“\bound=S\ Mmm Okay?”

7.  “\eos=value\”: suppresses (value=0) or forces (value=1) a sentence break (all languages but Japanese)


“Ah \eos=1\ Of course!”

Pitch: the height of a tone.
Emphasis: stress given to a word or a phrase to indicate particular importance.
Prosodic boundary: defines a prosodic unit (a segment of speech that occurs with a single prosodic contour, aka pitch and rhythm contour).
Sentence break: equivalent to an end punctuation.
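The one-time tags accept a small, fixed set of values, so they lend themselves to a lookup table. The following sketch assumes the text is assembled in Python; `ONE_SHOT_VALUES` and `one_shot_tag` are hypothetical names, and the allowed values are the ones listed above.

```python
ONE_SHOT_VALUES = {
    "emph": {"0", "1", "2"},   # reduced, stressed, accented
    "bound": {"W", "S", "N"},  # weak, strong, no boundary
    "eos": {"0", "1"},         # suppress, force a sentence break
}


def one_shot_tag(name, value):
    """Return a one-time tag such as \\emph=2\\ after checking the value."""
    value = str(value)
    if value not in ONE_SHOT_VALUES[name]:
        allowed = ", ".join(sorted(ONE_SHOT_VALUES[name]))
        raise ValueError(f"{name} accepts {allowed}; got {value}")
    return f"\\{name}={value}\\"


emph = one_shot_tag("emph", 2)
print(f"Sorry, we {emph} are {emph} late.")
# Sorry, we \emph=2\ are \emph=2\ late.
```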

### Examples of tweaked text in English and Japanese

#### English

In the following dialog, two Pepper robots (PP1 and PP2) are chatting at the end of a conference.

PP2 : Wow! That was intense!

PP1 : Really interesting conference. Really!

PP2 : And they talked about us.

PP1 : Don't you think it sounds promising for the future of technologies?

PP2 : Robots! Awesomer robots!

PP1 : Yes, but I'm not sure you can say "awesomer" though.

PP2 : Really? Well, too bad… But can I dance?

PP1 : What? You still want to dance?

PP2: Yes!

PP1 : Don't you have anything else in mind?

PP2: Nope!

PP2: Oh come on! It's going to be fun! Do just like me.

The original text is enriched with speed variation (rspd), emphasis (emph), pitch modulation (vct), pauses (pau), weak or strong boundaries (bound), and spelling changes to improve pronunciation. \rst\ at the end of a sentence resets all parameters to the default values.

PP2 : \rspd=70\ Wow! \rspd=80\ \emph=2\ That! \emph=2\ \vct=110\ Was! \pau=700\ \vct=120\ Intense! \rst\

PP1 : \rspd=80\ \emph=2\ \bound=W\ \vct=110\ Really \vct=100\ interesting conference. Really! \rst\

PP2 : \vct=90\ \rspd=80\ And they \emph=1\ \vct=100\ talked \emph=2\ \vct=90\ about \emph=2\ \rspd=70\ \vct=60\ us? \rst\

PP1 : \rspd=90\ Don’t you \emph=0\ think it \emph=0\ sounds \bound=W\ \rspd=80\ promising \rspd=100\ for the \rspd=80\ future \vct=100\ of technologies? \rst\

PP2 : \rspd=100\ Ro\vct=130\bots! \vct=100\ \rspd=80\ \pau=500\ \emph=2\ Awesomer \rspd=100\ \pau=05\ \vct=160\ro\vct=110\bots ! \rst\

PP1 : \bound=N\ \rspd=110\ Yes \emph=0\ but \pau=300\ \rspd=90\ I'm \rspd=80\ not \rspd=70\ sure \rspd=90\ you can say \pau=50\ \bound=S\ \vct=90\ “awesomer” \emph=0\ though. \rst\

PP2 : \vct=120\ \rspd=90\ Really! \pau=300\ \vct=100\ Well, \emph=0\ \rspd=60\ too \rspd=70\ \emph=0\ bad. \pau=800\ \vct=120\ But \pau=1000\ \rspd=120\ \vct=100\ can I \rspd=100\ dance?

PP1 : \rspd=80\ What, you still \rspd=100\ want to \rspd=100\ \vct=130\ dentz! \rst\

PP2 : \rspd=100\ \vct=120\ Yes \rst\

PP1 : Don’t you \bound=S\ have \emph=2\ \vct=120\ any \emph=0\ thing \pau=50\ else \pau=200\ \vct=100\ in mind? \rst\

PP2: \rspd=60\ Nope!

PP2: \rspd=70\ Oh \rspd=60\ \vct=120\ come \rspd=90\ \vct=110\ on \pau=1000\ \rspd=100\ \vct=140\ It’s \vct=90\ gonna \vct=110\ be \vct=120\ fun! \pau=500\ \vct=170\ Do \vct=150\ just \vct=130\ like \vct=90\ me! \rst\

The first sentence illustrates almost all the notions seen so far, so let’s recreate the tweaking step by step!

Original sentence:

Wow! That was intense!

Isolate each word with punctuation to put more emphasis on each one:

Wow! That! Was! Intense!

Add a pause for a more dramatic effect:

Wow! That! Was! \pau=700\ Intense!

Slow down the TTS speed:

\rspd=70\ Wow! \rspd=80\ That! Was! \pau=700\ Intense!

Raise the pitch on the last two words:

\rspd=70\ Wow! \rspd=80\ That! \vct=110\ Was! \pau=700\ \vct=120\ Intense!

Add some more emphasis so that every word is accented:

\rspd=70\ Wow! \rspd=80\ \emph=2\ That! \emph=2\ \vct=110\ Was! \pau=700\ \vct=120\ Intense!

Don’t forget to reset at the end, and you’re done!

\rspd=70\ Wow! \rspd=80\ \emph=2\ That! \emph=2\ \vct=110\ Was! \pau=700\ \vct=120\ Intense! \rst\
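The walkthrough above can also be expressed as data: each word is paired with the tags that precede it, and the final string is assembled in one pass. This is only a sketch of one way to keep tweaks readable while experimenting; `steps` and `assemble` are hypothetical names.

```python
# Each entry pairs the tags placed before a word with the word itself.
steps = [
    (["\\rspd=70\\"], "Wow!"),
    (["\\rspd=80\\", "\\emph=2\\"], "That!"),
    (["\\emph=2\\", "\\vct=110\\"], "Was!"),
    (["\\pau=700\\", "\\vct=120\\"], "Intense!"),
]


def assemble(parts, reset=True):
    """Join tags and words with single spaces, optionally ending with a reset."""
    pieces = []
    for tags, word in parts:
        pieces.extend(tags)
        pieces.append(word)
    if reset:
        pieces.append("\\rst\\")
    return " ".join(pieces)


tweaked = assemble(steps)
print(tweaked)
```

Changing a single value in `steps` and re-running makes the trial-and-error loop faster than editing one long string by hand.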

#### Japanese

This example displays variations of the same answer: when the robot hears its name, it replies “Yes?” (“はい?” in Japanese, usually written in hiragana, but the tweaks use all three writing systems: hiragana, katakana, and kanji).

"\vct=135\ \rspd=110\はイッ！"

"\vct=128\ \rspd=90\ハイい？"

"\vct=130\ \rspd=100\ハ愛！。"

"\vct=138\ \rspd=90\歯亞あ\vct=148\いっ？"

"\vct=125\ \rspd=110\歯ぃぃ？"

"\vct=120\ \rspd=108\ハィッッ。"

"\rspd=80\ \vct=135\はーいッ！。"

"\rspd=100\ \vct=125\はーーぁい？。"

"\rspd=90\ \vct=120\ハイィ！。 "

"\rspd=90\ \vct=120\ハイィィ！。"

"\rspd=80\ \vct=120\ハイッ！。 "

"\rspd=100\ \vct=120\ふぁぁい？。 "

Note the use of both spelling changes and various values for the speed and pitch. Here, the idea was to create as many variants as possible to have more variability in the robot’s answer. Each sounds slightly different while still sounding good.

Also, you can see that the question mark is sometimes used for its prosodic value, making the intonation rise without necessarily marking a strong interrogation (the effect is closer to a rhetorical question). This is often used in Japanese to make the robot more expressive.
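To get the variability described above, the application can pick one variant at random each time. The sketch below assumes the variants are stored in a Python list and avoids repeating the previous answer; `YES_VARIANTS` and `pick_variant` are hypothetical names, and only a few of the variants are copied here.

```python
import random

# A few of the variants listed above, copied verbatim.
YES_VARIANTS = [
    "\\vct=135\\ \\rspd=110\\はイッ！",
    "\\vct=128\\ \\rspd=90\\ハイい？",
    "\\rspd=80\\ \\vct=135\\はーいッ！。",
    "\\rspd=100\\ \\vct=120\\ふぁぁい？。",
]


def pick_variant(variants, previous=None, rng=random):
    """Pick a random variant, avoiding an immediate repeat when possible."""
    candidates = [v for v in variants if v != previous] or list(variants)
    return rng.choice(candidates)


first = pick_variant(YES_VARIANTS)
second = pick_variant(YES_VARIANTS, previous=first)
print(first)
print(second)
```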

Please bear in mind that tweaking is not an exact science: it is impossible to give an exhaustive list of best practices, or even a few tips that would work all the time. Tweaking is best done through trial and error, and all the examples presented throughout this lesson are only meant to give you an overview of what is possible, not to say that this is what should always be done (especially as each language is different!).

## Expressive tags (English, French and Chinese only)

The English, French and Chinese language packages have an added feature: expressive custom voices. For each of these voices, three speaking styles are available: neutral, joyful, and didactic. The speaking style can be changed by entering a tag in the text:

1. \style=neutral\: this is the default voice, so using this tag on its own will not change the robot’s voice. Use it to return to the default after one of the other two styles.

2. \style=didactic\: this voice is quite similar to the neutral one, but with a more storytelling quality.

3. \style=joyful\: this voice has a higher pitch and speed, making the robot sound more excited.


For example:
Hello, my name is Pepper!

\style=didactic\ Hello, my name is Pepper! \style=neutral\

\style=joyful\ Hello, my name is Pepper! \style=neutral\

\style=joyful\ \vct=110\ Exciting! \style=neutral\ \vct=100\ I’m \emph=0\ \vct=110\ going to \emph=2\ show \emph=0\ them a \emph=2\ little \rspd=100\ \emph=2\ dance I was just \emph=0\ preparing \vct=110\ batstage! \rst\

Be careful, though: the neutral and didactic voices are quite similar, but the joyful one is very different. Switching between them can make the robot sound a bit odd, so test extensively.
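Because every non-default style should be followed by a return to neutral, a small wrapper can keep the two tags paired. This is a sketch under the assumption that the text is built in Python; `STYLES` and `with_style` are hypothetical names.

```python
STYLES = ("neutral", "joyful", "didactic")


def with_style(text, style):
    """Wrap text in a style tag and switch back to neutral afterwards."""
    if style not in STYLES:
        raise ValueError(f"unknown style {style!r}; expected one of {STYLES}")
    if style == "neutral":
        # Neutral is the default voice, so no tag is needed.
        return text
    return f"\\style={style}\\ {text} \\style=neutral\\"


print(with_style("Hello, my name is Pepper!", "joyful"))
# \style=joyful\ Hello, my name is Pepper! \style=neutral\
```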

## Phonetic writing (all languages except Japanese)

In the previous section, we saw that a good way to fix a mispronunciation was to change the spelling until TTS got it right. Most language packages offer a more elegant solution: phonetic writing, with an equivalent of the IPA (International Phonetic Alphabet) that uses only the Latin alphabet to avoid character encoding issues.

Use the \toi=lhp\ tag before the phonetic phrase, and the \toi=orth\ one after.


This is particularly useful in the case of heteronyms (a word that has a different pronunciation and meaning from another word but the same spelling).

For example, in the sentence “I would like to resign”, TTS chooses /rɪˈzaɪn/, meaning “to quit”, but we can change it to /riːˈsaɪn/, to mean the contrary: “to sign again”.

“I would like to \toi=lhp\ riːˈsaIn \toi=orth\.”

Another example is the sentence “Please wait a minute, I’ll call an operator.”, where the robot should pronounce _minute_ as /ˈmɪnɪt/, meaning “60 seconds”, but it says /maɪˈnjuːt/ instead, as in “extremely small”.

“Please wait a \toi=lhp\ ˈmInIt \toi=orth\, I’ll call an operator.”
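Since the opening tag \toi=lhp\ must always be matched by the closing tag \toi=orth\, a wrapper function can keep them together. The sketch below assumes the sentence is built in Python; `phonetic` is a hypothetical helper, and the phonetic string is copied from the example above rather than checked against the phonetic alphabet.

```python
def phonetic(lhp_text):
    """Wrap a phonetic transcription in the opening and closing toi tags."""
    return f"\\toi=lhp\\ {lhp_text} \\toi=orth\\"


sentence = f"Please wait a {phonetic('ˈmInIt')}, I'll call an operator."
print(sentence)
# Please wait a \toi=lhp\ ˈmInIt \toi=orth\, I'll call an operator.
```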

## Tweaking in 2.9 (Japanese only)

In 2.9, tags specific to the Japanese language package are supported when enclosed within the proper functions:

Use the function ^rawStart before the tags, and the function ^rawEnd after them.

For example:

u:(もしもし) ^rawStart  <S>オ^キャクサ!マーーッ$2_2ヨ!カッタラ$2_2ボ!クト|0オ^ハナシシマセ!ンカ|0ア^アア<R><S>|0|0<F> ^rawEnd


Be careful to only use katakana between these functions.
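A quick way to catch stray hiragana or kanji before testing on the robot is to scan the raw text by Unicode block. The sketch below is an assumption-laden check written in Python, not a validator for the tag syntax itself: it flags only characters in the Hiragana block (U+3040 to U+309F) and the CJK Unified Ideographs block (U+4E00 to U+9FFF), and it leaves katakana, digits and the control symbols alone. `non_katakana_japanese` is a hypothetical name.

```python
def non_katakana_japanese(text):
    """Return the hiragana and kanji characters found in text, in order."""
    found = []
    for char in text:
        code = ord(char)
        is_hiragana = 0x3040 <= code <= 0x309F
        is_kanji = 0x4E00 <= code <= 0x9FFF
        if is_hiragana or is_kanji:
            found.append(char)
    return found


print(non_katakana_japanese("オ^キャクサ!マーーッ$2_2ヨ!カッタラ"))  # []
print(non_katakana_japanese("ハイい"))  # ['い']
```

An empty list means no hiragana or kanji was found in the text placed between ^rawStart and ^rawEnd.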