Read my lips

Mary 101’s face belongs to a real person, but her image is now a video ventriloquist’s dummy. MIT researchers Tomaso Poggio and Tony F. Ezzat can make her say anything they want.

To date, artificially animated human faces have looked jerky and unrealistic. But now, Tomaso Poggio, an investigator with MIT’s McGovern Institute for Brain Research, and Tony Ezzat, an MIT graduate student in electrical engineering and computer science, have simulated mouth movements that look so real, most viewers can’t tell that Mary 101 isn’t an ordinary videotape of a person speaking.

Given a few minutes of footage of an individual, the researchers can pair virtually any audio with that person’s videotaped face, matching mouth movements to the words.

Unlike previous efforts, which relied on 3D computer-modelling techniques, Ezzat and Poggio’s approach builds a computer model from example video footage of the person talking.

For the Mary 101 project, Ezzat used the facilities of MIT Video Production Services to videotape model Mary 101 speaking for eight minutes, gathering 15,000 individual digitised images of her face. He then developed software that automatically chooses, from those images, a small set covering the range of Mary 101’s mouth movements.
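The article does not say how that selection software works. One common way to pick a small, representative set of frames is to cluster them and keep one real image per cluster; the Python sketch below assumes that approach, and the function name, frame sizes and cluster count are illustrative rather than taken from the MIT system.

```python
# Illustrative sketch only: assumes prototype selection is done by k-means
# clustering of flattened mouth images, keeping the real frame nearest each
# cluster centre so the prototypes are actual photographs, not blurry means.
import numpy as np
from sklearn.cluster import KMeans

def select_prototype_frames(frames: np.ndarray, n_prototypes: int = 40) -> np.ndarray:
    """Pick a small set of frames spanning the range of mouth shapes.

    frames       -- array of shape (n_frames, height * width), one row per
                    flattened greyscale mouth image
    n_prototypes -- how many representative images to keep (value assumed)
    Returns the indices of the chosen frames.
    """
    kmeans = KMeans(n_clusters=n_prototypes, n_init=10, random_state=0)
    kmeans.fit(frames)

    # For each cluster, keep the real frame closest to the cluster centre.
    prototype_indices = []
    for centre in kmeans.cluster_centers_:
        distances = np.linalg.norm(frames - centre, axis=1)
        prototype_indices.append(int(np.argmin(distances)))
    return np.array(sorted(set(prototype_indices)))

# Example with stand-in data: 15,000 tiny 32x32 "frames" of random pixels.
fake_frames = np.random.rand(15000, 32 * 32)
print(select_prototype_frames(fake_frames, n_prototypes=10))
```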

The computer re-combines these mouth images in a high-dimensional ‘morph space’. Using a learning algorithm in that space, it works out from the original video footage how Mary 101’s face moves, which allows the software to re-synthesise new utterances.
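In broad strokes, a point in that morph space can be thought of as a set of blending weights, one per prototype image, and a new utterance as a trajectory of such weight vectors over time. The Python sketch below is a simplified stand-in for this idea: it cross-fades the prototypes rather than warping them, and the function name, array shapes and blending scheme are assumptions for illustration, not the MIT system’s actual method.

```python
# Highly simplified sketch: genuine morphing also warps the prototype images
# toward one another, but a convex blend already shows how a point in 'morph
# space' (one weight per prototype) maps back to a single synthetic image.
import numpy as np

def synthesise_frame(prototypes: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Blend prototype mouth images according to morph-space weights.

    prototypes -- array of shape (n_prototypes, height, width)
    weights    -- non-negative array of shape (n_prototypes,)
    """
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                      # convex combination of prototypes
    return np.tensordot(w, prototypes, axes=1)

# A synthetic utterance is then a trajectory through morph space:
# one weight vector per output video frame, blended frame by frame.
prototypes = np.random.rand(5, 32, 32)   # stand-in prototype images
trajectory = np.random.rand(90, 5)       # 90 frames, 5 weights each
video = np.stack([synthesise_frame(prototypes, w) for w in trajectory])
print(video.shape)                       # (90, 32, 32)
```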

‘The work is still in its infancy, but it proves the point that we can take existing video footage and re-animate it in interesting ways,’ Ezzat said. ‘At this point, we can re-animate realistic speech, but we still need to work on re-animating emotions. In addition, we cannot handle footage with profile views of a person, but we are making progress toward addressing these issues.’

The video produced by the system was evaluated in a perceptual test to see whether human subjects could distinguish between real sequences and synthetic ones. Gadi Geiger, a researcher at MIT’s Center for Biological and Computational Learning (CBCL) working with Ezzat, found that viewers could not tell the real video from the synthetic sequences generated by the computer.
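The article does not describe the test protocol. One simple way to run such a check, sketched below, is to have viewers label each clip as real or synthetic and then test whether their accuracy beats the 50% expected from pure guessing; the trial counts here are hypothetical.

```python
# Illustrative only: assumes a forced-choice design in which each judgement
# is scored correct or incorrect, then checks whether overall accuracy is
# statistically better than chance (50%).
from scipy.stats import binomtest

n_trials = 200    # hypothetical number of real-or-synthetic judgements
n_correct = 104   # hypothetical number judged correctly

result = binomtest(n_correct, n_trials, p=0.5, alternative='greater')
print(f"accuracy = {n_correct / n_trials:.2%}, p-value = {result.pvalue:.3f}")
# A large p-value means viewers did no better than chance, i.e. they could
# not reliably tell the synthetic sequences from the real video.
```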

Poggio can imagine a future in which a celebrity such as Michael Jordan might sell his image and the right to create a virtual video version of himself for advertising and other purposes. Or perhaps the estates of John Wayne, Marilyn Monroe or Elvis Presley would be willing to have the performers make a virtual comeback – for a price.

A more realistic scenario for the near future would be webcasts or TV newscasts incorporating the face of a model like Mary 101, programmed to give weather updates or read the day’s winning lottery numbers.

The method could also be used to redub a film from one language to another, eliminating the need for subtitles and avoiding ‘Japanese film syndrome’, where the actors’ lips are still moving long after the shorter English phrase has been uttered.

The work is funded by the National Science Foundation and NTT through the NTT-MIT Research Collaboration.