Computerised talking heads start making sense

A team of British computing experts has successfully developed a new approach to tackling one of the biggest challenges in computer animation: how to generate realistic images of a talking head.

The research has been carried out by Dr Michael Brooke and Dr Simon Scott at the Media Technology Research Centre, University of Bath, with funding from the Engineering and Physical Sciences Research Council.

There are many potential applications for realistic computer images of a talking face, says Dr Brooke, ranging from helping people to lip-read, to giving ‘speaking’ hole-in-the-wall machines a face so that you can communicate with someone you can see as well as hear. The entertainment industry could also be a major longer-term beneficiary of the work.

However, the challenge is not trivial. ‘Human beings are extremely complex and there is a real problem in trying to get good, life-like speech movements with a computer model,’ says Dr Brooke.

Efforts have been made, notably by researchers in the United States, to create 3-D models of the human head whose movements are defined by a complex set of rules. ‘Some of these models look quite plausible, but it is still very difficult to get features like the tongue and teeth right,’ says Dr Brooke.

The Bath team has adopted a different approach to take into account the subtle variations in the appearance of the face when different sounds are being made. These are important because inaccurate visual cues can seriously confuse listeners.

It is not sufficient simply to generate a given set of images to correspond to a given sound. For example, says Dr Brooke, the shape your mouth makes for the ‘huh’ sound in the words ‘who’d’, ‘heard’, ‘heed’, ‘hoard’ or ‘hard’ is very different in each case, because the ‘huh’ sound occurs in a different phonetic context in each word, and this must be taken into account. Furthermore, the same word spoken by someone at different times can produce slightly different facial expressions.

Given the extent of these variations, the researchers devised a way to train a computer to learn the facial patterns corresponding to different sounds in a range of phonetic contexts, effectively teaching the computer how a talker’s face behaves.

A volunteer was recorded on videotape speaking standard lists of sentences that contain the major speech sounds in many phonetic contexts.

The resulting tape was analysed frame by frame, with each group of images being assigned a speech sound and its phonetic context.

For example, all the frames with the ‘ah’ sound between an ‘f’ and a ‘th’ sound, as in words like ‘father’, were categorised together, and so on.
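
By way of illustration only (the article does not describe the team’s data structures, and the frame names and phoneme labels below are invented), frames labelled in this way could be collected under context-dependent keys of the form (preceding sound, sound, following sound):

    from collections import defaultdict

    # Hypothetical frame records: (frame image, phoneme, previous phoneme, next phoneme).
    # The labels are illustrative, not the transcription scheme used at Bath.
    frames = [
        ("frame_0041.png", "ah", "f", "th"),   # e.g. the 'ah' in "father"
        ("frame_0102.png", "ah", "f", "th"),
        ("frame_0230.png", "ah", "h", "d"),    # e.g. the 'ah' in "hard"
    ]

    # Group frames by their phonetic context, so that all examples of a sound
    # occurring in the same context sit together for later modelling.
    groups = defaultdict(list)
    for image, phoneme, prev_ph, next_ph in frames:
        groups[(prev_ph, phoneme, next_ph)].append(image)

    print(groups[("f", "ah", "th")])  # all frames of 'ah' between 'f' and 'th'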

The images were digitally encoded to reduce the amount of storage needed. ‘We have managed around 30-fold compression with a minimal loss of accuracy,’ says Dr Brooke.
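
The article does not say which encoding the team used, but as a rough sketch of how roughly 30-fold compression with small losses can be achieved, a standard technique is to project each frame onto a small number of principal components. Everything in the example below (image size, number of frames, number of components) is assumed for illustration:

    import numpy as np

    # Minimal sketch of image encoding by principal component analysis (PCA).
    # This is an assumption; the article only states that around 30-fold
    # compression was achieved with minimal loss of accuracy.

    rng = np.random.default_rng(0)
    images = rng.random((200, 32 * 24))      # 200 mouth-region frames, flattened

    mean = images.mean(axis=0)
    centred = images - mean

    # Keep only the leading components; each frame is then stored as a short
    # coefficient vector instead of a full pixel array.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:25]                      # 25 coefficients per frame
    codes = centred @ components.T            # encoded (compressed) frames

    # Decoding: approximate reconstruction from the coefficients.
    reconstructed = codes @ components + mean
    print(codes.shape, reconstructed.shape)   # (200, 25) vs. (200, 768) ~ 30-fold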

The coded images for each sound, taking its phonetic context into account, were then subjected to statistical manipulation to capture the natural variability of the speaker’s facial gestures. Just as one person may move his mouth differently when saying ‘father’ on different occasions, so the computer-generated output has similar variability built in.
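
Again purely as a sketch, and not the team’s published method, one way to build in such variability is to fit a simple statistical model (here a Gaussian) to the encoded frames for each sound-in-context and draw a fresh sample every time that sound is synthesised:

    import numpy as np

    # Assumed illustration: model the encoded frames for one sound-in-context
    # as a Gaussian, then sample from it so that each synthesis run produces
    # slightly different, natural-looking gestures.

    rng = np.random.default_rng(1)

    # Encoded training frames for, say, 'ah' between 'f' and 'th'
    # (25 coefficients per frame, as in the compression sketch above).
    training_codes = rng.normal(size=(40, 25))

    mean = training_codes.mean(axis=0)
    cov = np.cov(training_codes, rowvar=False)

    # Draw a fresh code each time the sound is synthesised; decoding it
    # yields a slightly different image on each occasion.
    sampled_code = rng.multivariate_normal(mean, cov)
    print(sampled_code.shape)  # (25,)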

The teeth and tongue movements, and other cues such as skin texture and shadowing, are obtained quite naturally because the computer has learnt from images of a real speaker.

Any sentence in English can be typed into the computer and the program will convert it into the appropriate sequence of speech sounds. The computer can then calculate and display a plausibly realistic face speaking the sounds that make up the sentence. It is fast, with a relatively modest PC taking only about six to eight seconds to generate the images for a complete spoken sentence.
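
A heavily simplified outline of such a pipeline might look as follows; the pronunciation dictionary, phoneme labels and placeholder frame generator are all hypothetical stand-ins, not the Bath program itself:

    # Sketch: typed text -> sequence of speech sounds -> one frame per sound,
    # chosen with its phonetic context (neighbouring sounds) in mind.

    PRONUNCIATIONS = {"who'd": ["h", "uu", "d"], "heard": ["h", "er", "d"]}

    def text_to_phonemes(sentence: str) -> list[str]:
        """Convert typed text into a sequence of speech sounds."""
        phonemes = []
        for word in sentence.lower().split():
            phonemes.extend(PRONUNCIATIONS.get(word, []))
        return phonemes

    def synthesise_frames(phonemes: list[str]) -> list[str]:
        """Stand-in for the learned model: label a frame for each sound,
        keyed by the sound and its neighbours."""
        frames = []
        for i, ph in enumerate(phonemes):
            prev_ph = phonemes[i - 1] if i > 0 else "sil"
            next_ph = phonemes[i + 1] if i < len(phonemes) - 1 else "sil"
            frames.append(f"frame({prev_ph}-{ph}-{next_ph})")
        return frames

    print(synthesise_frames(text_to_phonemes("who'd heard")))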

Having demonstrated that the technique works in principle, the researchers now intend to refine the program to make it work even faster and more accurately.

Contact: Dr Michael Brooke

Tel: 01225 826004

Fax: 01225 826492

e-mail: nmb@maths.bath.ac.uk