Brain activity decoded to deliver synthetic speech

Neuroscientists in the US have developed a virtual vocal tract that produces accurate synthetic speech using decoded brain activity, opening the door for future speech prosthetics.

People who have lost speech due to stroke, brain injury or neurological disease currently rely on painfully slow synthesisers that track eye movement to spell out words. Perhaps best associated with the late Stephen Hawking, these technologies are limited to around 10 words per minute. By comparison, natural speech is in the region of 100-150 words per minute.

The researchers, based at UC San Francisco (UCSF), established in a previous study that the brain’s speech centres encode movements of the lips, jaw, and tongue rather than direct acoustic information. For the latest work, they first recorded patients reading sentences aloud while also logging the corresponding brain activity. Using linguistic principles, they then reverse-engineered the vocal movements required to produce the sounds and mapped them to the brain activity associated with those sounds. This allowed the team to create a realistic virtual vocal tract for each participant that could be controlled by their brain activity. The work, published in Nature, could pave the way for devices that replicate natural speech in real time.

“For the first time, this study demonstrates that we can generate entire spoken sentences based on an individual’s brain activity,” said Edward Chang, a professor of neurological surgery and member of the UCSF Weill Institute for Neuroscience.

“This is an exhilarating proof of principle that with technology that is already within reach, we should be able to build a device that is clinically viable in patients with speech loss.”

The electrodes used to log brain activity (Credit: UCSF)

The virtual vocal tract at the core of the technology comprises two machine learning algorithms: a decoder that transforms brain activity into vocal tract movements, and a synthesiser that converts these movements into an approximation of the participant’s voice. According to the UCSF team, synthetic speech produced by these algorithms was significantly better than speech decoded directly from participants’ brain activity without the intermediary virtual vocal tract.
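The two-stage structure described above can be pictured with a toy sketch. This is not the UCSF team's implementation: the function names, the fixed linear weights and the feature sizes are all hypothetical, chosen only to show brain activity passing first through a movement decoder and then through a synthesiser. A real system would learn both stages from recordings.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def linear_stage(weights: Sequence[Sequence[float]]) -> Callable[[Sequence[float]], Vector]:
    """Return a function applying a fixed linear map (a stand-in for a trained model)."""
    def apply(x: Sequence[float]) -> Vector:
        return [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return apply


# Hypothetical toy weights, for illustration only.
# Stage 1: 3 neural channels -> 2 vocal tract movement features.
decode_movements = linear_stage([[0.5, 0.0, 0.5],
                                 [0.0, 1.0, 0.0]])
# Stage 2: 2 movement features -> 1 acoustic feature.
synthesise_audio = linear_stage([[1.0, -1.0]])


def virtual_vocal_tract(neural_frames: Sequence[Sequence[float]]) -> List[Vector]:
    """Run each frame of brain activity through the decoder, then the synthesiser."""
    return [synthesise_audio(decode_movements(frame)) for frame in neural_frames]


frames = [[1.0, 0.5, 3.0], [0.0, 0.0, 0.0]]
acoustic = virtual_vocal_tract(frames)
print(acoustic)
```

The point of the intermediate step is that the synthesiser never sees brain activity directly; it only sees the movement features produced by the decoder, which mirrors the intermediary role the article attributes to the virtual vocal tract.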

During crowdsourced testing using Amazon’s Mechanical Turk platform, subjects accurately identified 69 per cent of synthesised words from lists of 25 alternatives and transcribed 43 per cent of sentences with perfect accuracy. With a more challenging 50 words to choose from, overall accuracy dropped to 47 per cent, though subjects were still able to understand 21 per cent of synthesised sentences perfectly.
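To put those word-identification figures in context, it can help to compare them with a rough guessing baseline. The short sketch below assumes, purely for illustration, that a listener with no information would pick uniformly at random from the word list; that baseline is an assumption made here, not a figure reported by the researchers.

```python
def chance_accuracy(num_choices: int) -> float:
    """Accuracy expected from uniform random guessing among num_choices words."""
    return 1.0 / num_choices


# Word-identification accuracies quoted in the article, keyed by list size.
reported = {25: 0.69, 50: 0.47}

# How many times better than the assumed guessing baseline each result is.
ratios = {n: acc / chance_accuracy(n) for n, acc in reported.items()}

for n, acc in reported.items():
    print(f"{n} choices: reported {acc:.0%}, "
          f"chance {chance_accuracy(n):.0%}, "
          f"about {ratios[n]:.1f} times chance")
```

Under that assumption, the larger word list lowers raw accuracy but also lowers the guessing baseline, so the drop from 69 to 47 per cent does not by itself mean the synthesised speech carried less information.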

“We still have a ways to go to perfectly mimic spoken language,” said Josh Chartier, a bioengineering graduate student in Chang’s lab. “We’re quite good at synthesising slower speech sounds like ‘sh’ and ‘z’ as well as maintaining the rhythms and intonations of speech and the speaker’s gender and identity, but some of the more abrupt sounds like ‘b’s and ‘p’s get a bit fuzzy. Still, the levels of accuracy we produced here would be an amazing improvement in real-time communication compared to what’s currently available.”

The researchers also found that the neural code for vocal movements partially overlapped across participants, and that one research subject’s vocal tract simulation could be adapted to respond to the neural instructions recorded from another participant’s brain. These findings suggest that individuals with speech loss due to neurological impairment may be able to learn to control a speech prosthesis modelled on the voice of someone with intact speech.

“People who can’t move their arms and legs have learned to control robotic limbs with their brains,” Chartier said. “We are hopeful that one day people with speech disabilities will be able to learn to speak again using this brain-controlled artificial vocal tract.”