Questions about scripting with the Festival text-to-speech engine

MirceaKitsune · August 26, 2021, 10:08pm

I’m trying to script a series of spoken messages using the Festival text-to-speech engine, the text2wave command in particular. I took a look at the basics of how Festival scripting works via scm and xml files, yet there are things I can’t seem to find any useful information on. If anyone is familiar with the software I wanted to ask about how I’m meant to use the system in this format.

What I essentially want is to have different messages spoken at different points in the resulting audio, using different voices if possible. Something along the lines of: wait 5 seconds, say “foo” in voice X, wait 10 seconds, say “bar” in voice Y. Is this possible to script in a single scm / xml definition, and are there any examples of how to do it?

I’d also like to include other sounds in the equation. Can the script for the text2wave command take another wav / ogg file and mix it in with the spoken voices? Overlap is okay; I was thinking of using this to add music without needing a separate ffmpeg command.

In addition: Is there a way to change the pitch of a voice? I only found a way to set the speed in the scm using the line (Parameter.set 'Duration_Stretch 1). Do I need to make my own variant for a voice to do that, and how is this done if yes?

marel · August 28, 2021, 10:41am

It looks to me like the better way is to have Festival generate the samples and use another program to patch them together and raise the pitch.

MirceaKitsune · August 28, 2021, 12:59pm

Seems like it. I read somewhere that you can do advanced scripting with Festival so it got me wondering if I could create a detailed text reading entirely from it. I’m using a bash script anyway so it’s no problem making each read its own wav file then using ffmpeg to mix them with other audio.
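A minimal sketch of that per-message approach, written in Python rather than bash so the waiting gaps can be handled without extra tools. It assumes text2wave reads the text from standard input and accepts its -eval, -F and -o options, and that the voice passed in is installed (the name "kal_diphone" in the usage note is only an example). The helper names synthesize and concat_with_gaps are made up for illustration.

```python
import subprocess
import wave


def synthesize(text, voice, out_path, rate=16000):
    """Render `text` to a WAV file with text2wave, selecting `voice` first."""
    # Assumption: `voice` names an installed Festival voice, so that
    # (voice_<name>) is a valid Scheme call, e.g. (voice_kal_diphone).
    # -F forces a common sample rate so the files can be joined later.
    subprocess.run(
        ["text2wave", "-eval", f"(voice_{voice})", "-F", str(rate), "-o", out_path],
        input=text,
        text=True,
        check=True,
    )


def concat_with_gaps(segments, out_path):
    """Join WAV files, inserting silence before each one.

    `segments` is a list of (silence_seconds_before, wav_path) pairs.
    Assumes signed PCM input (such as 16-bit), where zero bytes are silence.
    """
    if not segments:
        raise ValueError("no segments given")
    fmt = None
    with wave.open(out_path, "wb") as out:
        for gap, path in segments:
            with wave.open(path, "rb") as src:
                cur = (src.getnchannels(), src.getsampwidth(), src.getframerate())
                if fmt is None:
                    # The first file decides the output format.
                    fmt = cur
                    out.setnchannels(cur[0])
                    out.setsampwidth(cur[1])
                    out.setframerate(cur[2])
                elif cur != fmt:
                    raise ValueError(f"{path} has a different format: {cur} vs {fmt}")
                bytes_per_frame = cur[0] * cur[1]
                silent_frames = round(gap * cur[2])
                out.writeframes(b"\x00" * (silent_frames * bytes_per_frame))
                out.writeframes(src.readframes(src.getnframes()))
```

Usage would look something like synthesize("foo", "kal_diphone", "foo.wav"), the same for "bar" with a second voice, then concat_with_gaps([(5, "foo.wav"), (10, "bar.wav")], "speech.wav"). Music could still be mixed over the result with ffmpeg afterwards.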

Still wanted to know if there’s a way to change the pitch and customize individual voices. I found a way to edit pitch with the rubberband command, but it feels like a workaround and may create lossy artifacts in the audio too. I believe there was a way to override voice parameters in the scm but couldn’t find exact info on how.
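For reference, the rubberband workaround can be wrapped roughly like this. It is only a sketch: it assumes the rubberband command-line tool is installed and that its --pitch option takes a shift in semitones, followed by the input and output files. The function names are made up for illustration.

```python
import subprocess


def rubberband_command(in_path, out_path, semitones):
    """Build the argument list for shifting pitch by `semitones`."""
    # Assumption: the CLI form is `rubberband --pitch <semitones> <in> <out>`.
    return ["rubberband", "--pitch", str(semitones), in_path, out_path]


def shift_pitch(in_path, out_path, semitones):
    """Run rubberband on a WAV file; raises CalledProcessError on failure."""
    subprocess.run(rubberband_command(in_path, out_path, semitones), check=True)
```

Calling shift_pitch("foo.wav", "foo_high.wav", 3) would raise the pitch by three semitones; negative values lower it. This is post-processing of the finished file, so it does not answer the question of overriding the voice parameters inside Festival itself.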

marel · August 28, 2021, 3:10pm

For the pitch, did you have a look at 20.1.1 Generating pitchmarks?

MirceaKitsune · August 29, 2021, 1:50pm

That offers part of the solution but not all. Where do I get those .lar and .pm files? And after a new pitchmark is generated, what’s the parameter to use it with the festival and text2wave commands?