Questions about scripting with the Festival text-to-speech engine

I’m trying to script a series of spoken messages using the Festival text-to-speech engine, specifically the text2wave command. I’ve looked at the basics of how Festival scripting works via scm and xml files, but there are things I can’t find any useful information on. If anyone is familiar with the software, I’d like to ask how I’m meant to use it for this.

What I essentially want is to have different phrases spoken at different points in the resulting audio, using different voices if possible. Something along the lines of: wait 5 seconds, say “foo” in voice X, wait 10 seconds, say “bar” in voice Y. Is this possible to script in a single scm / xml definition, and are there any examples of how to do it?

I’d also like to include other sounds in the equation. Can the definition passed to the text2wave command take another wav / ogg file and mix it in with the spoken voices? Overlap is okay; I was thinking of using this to add music without needing a separate ffmpeg command.

In addition: is there a way to change the pitch of a voice? I only found a way to set the speed in the scm, using the line (Parameter.set 'Duration_Stretch 1). Do I need to make my own variant of a voice to do that, and if so, how is this done?

It looks to me like the better way is to have Festival generate the samples and use another program to stitch them together and raise the pitch.

Seems like it. I read somewhere that you can do advanced scripting with Festival, so it got me wondering whether I could create a detailed text reading entirely with it. I’m using a bash script anyway, so it’s no problem to write each reading to its own wav file and then use ffmpeg to mix them with other audio.
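For anyone finding this later, the per-file approach described above could be sketched roughly as follows. This is only a sketch under assumptions: the voice names (voice_kal_diphone, voice_rab_diphone), the file names, and the offsets are placeholders, the offsets are absolute positions in milliseconds from the start of the mix rather than gaps between phrases, and build_filter and say_to_wav are helper names made up for this example. It sticks to POSIX shell syntax, so it also runs under bash.

```shell
# Sketch only: voice names, offsets and file names are placeholders.

# Build an ffmpeg -filter_complex string that delays input N by the N-th
# argument (milliseconds) and mixes all inputs into one stream named [out].
# The body runs in a subshell so its variables stay private.
build_filter() (
  filter=""
  labels=""
  i=0
  for ms in "$@"; do
    filter="${filter}[${i}:a]adelay=${ms}[a${i}];"
    labels="${labels}[a${i}]"
    i=$((i + 1))
  done
  printf '%s%samix=inputs=%d:duration=longest[out]' "$filter" "$labels" "$i"
)

# Synthesize one line of text with a given voice into a WAV file.
# $1 = voice function name without parentheses, $2 = text, $3 = output file
say_to_wav() {
  printf '%s\n' "$2" | text2wave -eval "($1)" -o "$3"
}

# Only run the real tools when both are installed.
if command -v text2wave >/dev/null && command -v ffmpeg >/dev/null; then
  say_to_wav voice_kal_diphone "foo" foo.wav
  say_to_wav voice_rab_diphone "bar" bar.wav
  # "foo" starts at 5 s and "bar" at 15 s in the mixed output.
  ffmpeg -y -i foo.wav -i bar.wav \
    -filter_complex "$(build_filter 5000 15000)" -map "[out]" mix.wav
fi
```

To get “wait 10 seconds after foo” rather than a fixed start time, the second offset would have to be the first offset plus the length of foo.wav plus 10000. A music file could presumably be added as one more -i input with its own delay argument. As far as I know amix lowers the level of each input by default, so the result may need a volume boost afterwards.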

I still want to know if there’s a way to change the pitch and customize individual voices. I found a way to edit pitch with the rubberband command, but it feels like a workaround and may introduce audible artifacts too. I believe there was a way to override voice parameters in the scm, but I couldn’t find exact info on how.
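One possible way to override voice parameters from an scm file is sketched below, with heavy caveats. It assumes the voice predicts its F0 targets through the int_lr_params variable, as the stock diphone voices such as voice_kal_diphone appear to do; other voice types may ignore this setting entirely. The file name voice_settings.scm and all numeric values are placeholders picked for illustration.

```shell
# Sketch only: the voice name and all numeric values are placeholders.
settings=voice_settings.scm

# Write a small Scheme file that text2wave loads before synthesis.
cat > "$settings" <<'EOF'
; Select the voice first, since selecting a voice resets its parameters.
(voice_kal_diphone)
; Values above 1 slow the speech down, values below 1 speed it up.
(Parameter.set 'Duration_Stretch 1.2)
; Assumption: this voice reads its F0 targets from int_lr_params.
; Raising target_f0_mean should raise the average pitch.
(set! int_lr_params
      '((target_f0_mean 140) (target_f0_std 14)
        (model_f0_mean 170) (model_f0_std 34)))
EOF

# Only call Festival when it is installed.
if command -v text2wave >/dev/null; then
  echo "bar" | text2wave -eval "$settings" -o bar_high.wav
fi
```

Keeping one such file per voice variant would let the bash script pick a “customized voice” just by passing a different file to -eval, without building a new voice.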

For the pitch, did you have a look at section 20.1.1, “Generating pitchmarks”?

That offers part of the solution but not all of it. Where do I get the .lar and .pm files? And after a new pitchmark file is generated, what parameter makes the festival and text2wave commands use it?