I’m trying to script a series of spoken messages using the Festival text-to-speech engine, the text2wave command in particular. I took a look at the basics of how Festival scripting works via scm and xml files, but there are a few things I can’t find any useful information on. If anyone is familiar with the software, I’d like to ask how I’m meant to use it for this.
What I essentially want is to have messages spoken at different points in the resulting audio, using different voices if possible. Something along the lines of: wait 5 seconds, say “foo” in voice X, wait 10 seconds, say “bar” in voice Y. Is this possible to script in a single scm / xml definition, and are there any examples of how to do it?
I’d also like to include other sounds in the equation. Can the scm / xml definition given to the text2wave command take another wav / ogg file and mix it in with the spoken voices? Overlap is okay… I was thinking of using this to add music without needing a separate ffmpeg command.
In addition: is there a way to change the pitch of a voice? I only found a way to set the speed in the scm using the line (Parameter.set 'Duration_Stretch 1). Do I need to make my own variant of a voice to do that, and if so, how is it done?
Seems like it. I read somewhere that you can do advanced scripting with Festival, so it got me wondering whether I could create a detailed text reading entirely from it. I’m using a bash script anyway, so it’s no problem to render each line to its own wav file and then use ffmpeg to mix them with other audio.
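For anyone following along, here is a rough sketch of that bash approach, not a tested recipe. The voice names, file names and timings are placeholders, and it assumes text2wave reads the text from stdin and that your ffmpeg build has the adelay and amix filters. The helper names speak and mix_filter are just ones I made up.

```shell
#!/bin/sh
# Sketch only: voice names, file names and timings below are placeholders.

# Render one utterance to a wav file with a given Festival voice.
# $1 = voice function name (e.g. voice_kal_diphone), $2 = text, $3 = output wav
speak() {
    printf '%s\n' "$2" | text2wave -eval "($1)" -o "$3"
}

# Build an ffmpeg filter graph that mixes input 0 (e.g. background music)
# with inputs 1..N, each delayed by the matching argument in milliseconds.
# The mixed stream is labelled [out]. Runs in a subshell so no variables leak.
mix_filter() (
    i=1
    graph=""
    labels="[0:a]"
    for ms in "$@"; do
        # "ms|ms" delays the first two channels; extra values are ignored for mono.
        graph="${graph}[$i:a]adelay=${ms}|${ms}[d$i];"
        labels="${labels}[d$i]"
        i=$((i + 1))
    done
    printf '%s%samix=inputs=%d:duration=longest[out]' "$graph" "$labels" "$i"
)

# Intended use (left commented out so the helpers can be sourced safely):
#   speak voice_kal_diphone "foo" foo.wav
#   speak voice_rab_diphone "bar" bar.wav
#   ffmpeg -i music.ogg -i foo.wav -i bar.wav \
#          -filter_complex "$(mix_filter 5000 15000)" -map "[out]" result.wav
```

The delays are absolute offsets from the start of the mix, so “wait 10 seconds after foo” means 5000 plus the length of foo.wav plus 10000, not simply 15000. Also note that amix lowers the level of each input by default, so the volume may need adjusting afterwards.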
I still wanted to know if there’s a way to change the pitch and customize individual voices. I found a way to edit pitch with the rubberband command, but it feels like a workaround and may introduce artifacts in the audio too. I believe there was a way to override voice parameters in the scm but couldn’t find exact info on how.
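The closest thing I can piece together is the sketch below, which is an untested guess rather than a confirmed answer. It assumes the voice is a diphone voice that uses Festival’s linear-regression intonation model, where the F0 targets are read from the int_lr_params variable, so setting that variable after selecting the voice should shift the pitch. The model_f0_mean / model_f0_std values are what I remember from the stock kal_diphone definition and should be treated as assumptions; voice_with_pitch is a helper name I made up.

```shell
#!/bin/sh
# Sketch only: assumes a voice that uses the LR intonation model (int_lr_params).
# Other voice types (e.g. unit-selection or HTS voices) may ignore this setting.

# Emit a Scheme expression that selects a voice and then overrides its
# target F0 mean and standard deviation (both in Hz).
# $1 = voice function name, $2 = target F0 mean, $3 = target F0 std
voice_with_pitch() {
    # model_f0_mean / model_f0_std are assumed values, copied from memory.
    printf "(begin (%s) (set! int_lr_params '((target_f0_mean %d) (target_f0_std %d) (model_f0_mean 170) (model_f0_std 34))))" "$1" "$2" "$3"
}

# Intended use (left commented out so the helper can be sourced safely):
#   printf '%s\n' "bar" | text2wave -eval "$(voice_with_pitch voice_kal_diphone 140 14)" -o bar.wav
```

A higher target_f0_mean should give a higher-pitched voice and a larger target_f0_std a livelier one. The override only lasts until the next voice is selected, since selecting a voice resets its parameters.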