A new study on the way humans perceive and organise speech could lead to better algorithms that will improve the accuracy of speech recognition systems.
Researchers from Aston University hope the results of their EPSRC-funded project will indirectly lead to speech recognition systems that better target a person’s voice in the midst of loud background noise, such as that found on the factory floor. The research could also offer significant improvements to hearing aids.
It is fairly uncommon in everyday life for people to hear the speech of a person talking in the absence of other background sounds, and so the human auditory system is faced with the challenge of grouping together sounds that come from one source and segregating them from those arising from other sources.
‘People have just speculated until now how this is achieved,’ explained Brian Roberts, the project’s principal investigator, ‘but we thought with a very systematic study we might be able to unravel the mystery a bit further.’
Roberts’ research team will play artificial speech stimuli to volunteers over headphones.
The artificial stimuli will sound like a slightly robotic female or male voice. The researchers chose these over real voices because they make it easier to segregate the important formants: peaks in an acoustic frequency spectrum that result from the resonant frequencies of an acoustic system.
‘The first three formants carry most of the linguistic information,’ Roberts said. ‘If you produce synthetic speech based on those first three formants and how they change over time, you can produce very intelligible artificial speech.’
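Synthesis along these lines is often sketched as a source-filter model: a glottal pulse train passed through a cascade of second-order resonators, one per formant. The sketch below is purely illustrative, it is an assumption about the general technique, not a description of the researchers’ actual stimuli, and the formant frequencies, bandwidths, and function names are invented for the example.

```python
import math

def resonator_coeffs(freq_hz, bandwidth_hz, sample_rate):
    """Coefficients for a second-order IIR resonator (digital formant filter)."""
    r = math.exp(-math.pi * bandwidth_hz / sample_rate)
    theta = 2.0 * math.pi * freq_hz / sample_rate
    b0 = 1.0 - r  # rough gain normalisation, adequate for a sketch
    a1 = -2.0 * r * math.cos(theta)
    a2 = r * r
    return b0, a1, a2

def apply_resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Filter a signal through one formant resonator."""
    b0, a1, a2 = resonator_coeffs(freq_hz, bandwidth_hz, sample_rate)
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = b0 * x - a1 * y1 - a2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesise_vowel(formants, duration_s=0.2, f0=120.0, sample_rate=16000):
    """Source-filter sketch: glottal pulse train through formant resonators in series."""
    n = int(duration_s * sample_rate)
    period = int(sample_rate / f0)
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    signal = source
    for freq, bw in formants:
        signal = apply_resonator(signal, freq, bw, sample_rate)
    return signal

# Approximate formant frequencies and bandwidths (Hz) for an /a/-like vowel
vowel = synthesise_vowel([(700, 90), (1200, 110), (2600, 170)])
```

Changing the formant frequencies over time, rather than holding them fixed as here, is what turns isolated vowel-like buzzes into intelligible synthetic speech.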
The test volunteers will hear the first and third formants in one ear and the second formant in the other.
‘In the absence of any other sounds, that cross-ear fusion works very well and people find it very easy to identify speech,’ he said. ‘The interesting thing comes if you put a competitor, a possible alternative to the second formant, in the same ear as the other two. If the ear groups the true first and third formants with the competitor, intelligibility will fail.’
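The dichotic design described here amounts to a simple mixing step once each formant exists as a separate signal. The sketch below is a hypothetical illustration of that step (the function and parameter names are invented, and it assumes the formant signals have already been synthesised as equal-length sample lists):

```python
def dichotic_stimulus(f1, f2, f3, competitor=None):
    """Build left/right ear signals for the dichotic condition:
    F1 and F3 go to one ear, F2 to the other; an optional competitor
    (a rival second formant) is mixed into the same ear as F1 and F3."""
    left = [a + b for a, b in zip(f1, f3)]
    if competitor is not None:
        left = [s + c for s, c in zip(left, competitor)]
    right = list(f2)
    return left, right
```

Intelligibility can then be compared across conditions with and without the competitor: if listeners group the competitor with F1 and F3, the true F2 in the other ear is excluded and comprehension should fall.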
The researchers will then be able to determine which types of sound competitors most disrupt the comprehension of speech.
‘By doing that we can find out what properties the competitor has to have before it will group with the rest of the speech and ruin the intelligibility of the speaker,’ Roberts said.
It is hoped this will reveal which acoustic features of speech are important for grouping and binding speech together. ‘If it’s possible to describe those acoustic cues, then computer modellers could process the mixture of speech and look for certain types of relationships,’ he said. ‘That might allow them to more successfully prise out all the formant tracks that come from one speaker and separate them from another.’
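One common way modellers recover formant tracks from a signal, offered here purely as an illustrative sketch (the article does not specify a method; linear-predictive analysis is an assumption), is to fit an all-pole model to each short frame of audio and read candidate formant frequencies off the peaks of its spectral envelope:

```python
import cmath
import math

def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations for LPC coefficients a[0..order] (a[0] = 1)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 1e-9:  # predictor already (numerically) exact
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a

def formant_peaks(frame, sample_rate, order=8, n_points=256):
    """Estimate formant frequencies as local peaks of the LPC spectral envelope."""
    r = autocorrelation(frame, order)
    a = levinson_durbin(r, order)
    env = []
    for k in range(n_points):
        w = math.pi * k / n_points  # frequency grid from 0 Hz to Nyquist
        denom = sum(a[j] * cmath.exp(-1j * w * j) for j in range(order + 1))
        env.append(1.0 / max(abs(denom), 1e-12))
    return [k * sample_rate / (2 * n_points)
            for k in range(1, n_points - 1)
            if env[k] > env[k - 1] and env[k] > env[k + 1]]
```

Running this frame by frame yields candidate formant frequencies over time; grouping those candidates into per-speaker tracks is exactly where the acoustic cues the project hopes to identify would come into play.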
‘If that’s possible to do in principle,’ he added, ‘in the long term it could be possible to develop processing strategies for things like hearing aids and cochlear implants that might use these algorithms to improve the signal-to-noise ratio for the listener. The biggest problem that people with hearing impairments have is their hearing ability is often not too bad in quiet, one-to-one conversation, but in a noisy pub or party they find it quite difficult.’
The information could also be used to improve speech recognition systems used in busy environments. Roberts envisages a future scenario where speech recognition systems are able to distinguish a voice from the hubbub that occurs on a factory floor.
‘Our primary interest is understanding how the normal hearing listener functions, but the spin-off from that would be for others to use the information to improve algorithms,’ he said. ‘We don’t have the expertise to do that directly, but the outcome of our project will be significant for those who are developing those platforms.’