Is believable dynamic text-to-speech at runtime a pipe dream, or are we nearer than we think?
As games become ever more cinematic, and writing for games becomes more of an understood craft, the amount of voice content in games has increased by a staggering amount. Take Mass Effect 2, for example: recent statistics suggest that the latest installment of BioWare’s space opus contains over 31,000 lines of dialogue.
Naturally, that means an awful lot more voice recording sessions and an awful lot more disc space to store them. Imagine, then, being able to procedurally generate speech that sounded just like a human, with the right intonation, emotion and expression to match the best voice actors in the industry.
Imagining it? Now stop. It doesn’t exist. Such lofty aims are the Holy Grail for the speech technology industry, but nobody knows exactly when they’ll get there. In the meantime, Cambridge-based Phonetic Arts has a solution to help game developers go beyond just playing pre-recorded clips.
“We’re being very cautious not to over-promise anything,” explains CEO Paul Taylor when we meet him. “PA Studio 2009 does a lot of good things – a lot of simple things – very well. We can have fairly natural-sounding voices that can say anything, or very natural voices that can say a few things. We really don’t want to say that this first product is the be-all and end-all; it’s not like that at all. It doesn’t yet perform miracles.”
The first thing that developers need is a source voice file: the system needs a reference, as it essentially ‘mimics’ an existing voice through analysis rather than generating one out of the ether. Once input into PA Studio, the magic starts to happen.
“Essentially, what PA Studio does is build a statistical model of the voice – that’s where the magic is,” explains Taylor. “There’s a lot of signal processing, statistical learning and linguistic analysis going on. Once it’s complete you have this compiled voice, which is basically the asset.”
It’s here that Phonetic Arts’ offering splits in two, depending on what exactly you want to do. The first option is Composer, an ‘intelligent way of combining pre-existing waveforms’. If that doesn’t sound massively innovative, this is where Taylor’s earlier point comes in: doing something simple very well.
“If you just try to brute force two speech clips together, at best you’ll get a ‘click’ where they join and at worst it’ll sound like a train announcer. So Composer gives you a very easy way of combining these speech waveforms. We’ve got this phoneme blending technology that blends through the join – it’s perfect, and you can’t tell that they weren’t originally a single line.”
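Even a simple crossfade shows why a hard cut ‘clicks’ and a blend doesn’t. The following toy Python sketch is purely illustrative – Phonetic Arts’ actual phoneme blending is statistically learned and far more sophisticated:

```python
def crossfade_join(clip_a, clip_b, overlap=100):
    """Join two audio clips (lists of samples) with a linear crossfade
    over `overlap` samples, instead of a hard cut that would 'click'."""
    head = clip_a[:-overlap]
    tail = clip_b[overlap:]
    blended = []
    for i in range(overlap):
        w = i / overlap  # fade weight: 0 -> 1 across the overlap region
        a = clip_a[len(clip_a) - overlap + i]
        b = clip_b[i]
        blended.append((1 - w) * a + w * b)
    return head + blended + tail
```

A hard cut would jump instantly between the two signals; the crossfade walks smoothly from one to the other over the overlap region.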
A typical use-case is sports announcers in games. “If you take a sentence like ‘Beckham passes to Ronaldo,’ that ‘passes to’ is a carrier sentence, so with Composer you can change both of those names and generate the new waveform.”
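The carrier-sentence idea amounts to a template with swappable slots over pre-recorded fragments. A hypothetical sketch – the fragment names and structure are invented for illustration, not Composer’s actual API:

```python
# Hypothetical waveform assets, keyed by fragment name.
fragments = {
    "Beckham": "vo/beckham.wav",
    "Ronaldo": "vo/ronaldo.wav",
    "Rooney": "vo/rooney.wav",
    "passes_to": "vo/passes_to.wav",
}

def announce(passer, receiver):
    """Fill the '<passer> passes to <receiver>' carrier with two name
    fragments; a real system would blend the joins into one waveform
    rather than simply listing the clips in order."""
    return [fragments[passer], fragments["passes_to"], fragments[receiver]]
```

Record each player name once, and every pairing of passer and receiver comes for free.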
The other option is the more freeform Generator, which takes this statistical voice model, plus any text that you feed it, and generates audio – like the text-to-speech you might have played with before, but of hugely better quality.
“We’re seeing an awful lot of people really excited about this for generating placeholder dialogue so they can get rough timings for animations. Just give it 50,000 lines of script and it’ll just spit it out.”
Both Composer and Generator can be run offline through the PA Studio app, but the real magic is in the real-time versions. For Composer, that means you can store your dialogue as compiled voice files – typically about 30 bytes per sentence – and generate the waveforms at run-time through a lightweight component. Meanwhile, Generator can also run at run-time, allowing, for example, the player to choose their own name and still have it spoken by the cast.
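That tiny per-sentence footprint is plausible once you consider what a compiled line needs to hold: not audio, but a compact symbolic recipe that the runtime component renders into a waveform. A back-of-the-envelope sketch – the phoneme inventory and encoding here are invented for illustration:

```python
# Hypothetical phoneme inventory: well under 256 symbols, so one byte each.
PHONEMES = ["p", "ae", "s", "ih", "z", "t", "uw", " "]  # toy subset
INDEX = {ph: i for i, ph in enumerate(PHONEMES)}

def compile_line(phoneme_seq):
    """Encode a phoneme sequence as raw bytes, one byte per phoneme.
    'passes to' is ~8 phonemes -> ~8 bytes; a full sentence lands in
    the tens of bytes, versus kilobytes for a stored waveform."""
    return bytes(INDEX[ph] for ph in phoneme_seq)

line = compile_line(["p", "ae", "s", "ih", "z", " ", "t", "uw"])
```

The waveform itself only ever exists transiently, synthesised on demand from the recipe.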
So, not the Holy Grail yet – but an interesting, and targeted, first step.
As mentioned earlier, the technology needs source audio in order to work its magic.
“You need a decent amount,” says CEO Paul Taylor. “For Composer, the absolute minimum is about 30 minutes. All of the stuff that it does – the joining between the samples – is statistically learned from their general speech patterns, so you need quite a lot of phonetic material for it to work from. 60 minutes would be ideal really, especially for a main character.”
Obviously, given that Generator does even funkier voodoo with the source material, it also needs quite a bit more – but a new version, currently being worked on by the team, will most likely bring requirements in line with Composer.
“Today’s version of Generator requires about three hours of speech to function properly, which is quite a lot. A new version that we’re currently working on has a new adaptation technique, where we’ll include what’s essentially a generic voice – the average of everybody’s voice – and after that you just record a small sample for each person and morph the base towards that. I can’t put a number on it, but maybe around 30 minutes again.
“We’re using this because we don’t want to impose huge extra recording costs upon developers. For something like Mass Effect, three hours of dialogue is not a massive undertaking, but for something else – maybe a sports game with an announcer – that’s quite a lot.”
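The adaptation technique Taylor describes – a generic ‘average voice’ morphed towards a short sample of the target speaker – can be caricatured as interpolation between parameter sets. A deliberately toy sketch; real voice models are far richer than a flat vector of numbers:

```python
def adapt_voice(average, target_stats, strength=0.8):
    """Morph an average-voice parameter vector towards statistics
    estimated from a small recording of the target speaker.
    strength=0 keeps the generic voice unchanged; strength=1 jumps
    fully to the (noisy) target estimate."""
    return [(1 - strength) * a + strength * t
            for a, t in zip(average, target_stats)]
```

The appeal is exactly what Taylor outlines: the heavy lifting lives in the shared average model, and only the short per-character sample has to be recorded.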