Speak and spell with phonetic voice recognition

UK firm Novauris is aiming to take on the market-leading computer voice-recognition system with a different approach to the problem

2012 could be remembered as the year when voice-controlled computers went mainstream. When Apple launched its new iPhone complete with Siri speech-recognition software at the end of last year, there was a sense that the decades-old technology was now good enough for the average consumer. And where Apple leads, the rest will follow.

The voice-recognition market is dominated by US-based firm Nuance Communications, whose software is thought to power Siri, and most technologies rely on a method of turning speech into text that can then be understood by the computer. But a British company, called Novauris, is hoping to take on the big boys with an alternative method of capturing voice commands.

Emerging from the prominent voice recognition software firm Dragon Systems (now absorbed into Nuance), Novauris’s founders John Bridle and Melvyn Hunt have been working away relatively quietly for the last decade developing specific applications, from a media download search function for Verizon Wireless to a Japanese rail travel app.

But the firm now hopes to make inroads into the global smartphone market and plans to announce partnerships with two household-name electronics firms (one Japanese, one Korean) that could help it compete with the likes of Siri and Google’s Voice Search more directly.

The most visible difference between Novauris's NovaSearch and most rival software is that devices don't need to be connected to the internet to use it. So you could tell a smartphone or computer to search through a database or answer a question when no 3G or wireless signal is available. And if there is an internet connection, the technology can sift through huge databases in a couple of seconds – so far it has worked with lists of up to 245 million entries.

NovaSearch allows more rapid database searching using voice commands because it decodes the speech on a smartphone instead of over the internet

‘Google and Siri basically convert the speech to a stream of words for input to a web search or an AI [artificial intelligence] interpreter,’ said company president Melvyn Hunt.

‘Ours is different because we search custom databases directly with the speech input. We produce a stream of phonetic symbols and use that in a very fast, patented search technique. It’s more efficient for that application and has enabled search of large databases.’

By using a phonetic-based method, the software can search for an entire phrase in one go, rather than for individual words one at a time. After decoding the phonetic sounds into symbols, it produces a shortlist of likely matches and then uses a process called acoustic rescoring to refine the results using a more detailed modelling system. This means that it can search, for example, for entire addresses in one process instead of matching each line, working backwards from the country or city to the house number.
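The two-stage idea – a cheap coarse pass over the whole database to build a shortlist, then a more expensive re-ranking of just that shortlist – can be sketched in a few lines. This is an illustrative toy, not Novauris's patented technique: the phonetic transcriptions are made up, and simple sequence matching stands in for both the fast search and the detailed acoustic rescoring.

```python
from difflib import SequenceMatcher

# Toy database mapping entries to made-up phonetic transcriptions;
# a real system would derive these from a pronunciation lexicon.
DATABASE = {
    "10 Downing Street, London":
        "t eh n d aw n ih ng s t r iy t l ah n d ah n",
    "221B Baker Street, London":
        "t uw t uw w ah n b iy b ey k er s t r iy t l ah n d ah n",
    "1 Infinite Loop, Cupertino":
        "w ah n ih n f ih n ih t l uw p k uw p er t iy n ow",
}

def coarse_score(query, entry):
    """Cheap first-pass similarity over whole phoneme sequences."""
    return SequenceMatcher(None, query.split(), entry.split()).ratio()

def fine_score(query, entry):
    """Stand-in for acoustic rescoring: a more detailed comparison,
    run only on the shortlisted entries."""
    return SequenceMatcher(None, query, entry).ratio()

def phonetic_search(query_phonemes, shortlist_size=2):
    # Stage 1: fast coarse pass over the entire database -> shortlist.
    shortlist = sorted(DATABASE,
                       key=lambda k: coarse_score(query_phonemes, DATABASE[k]),
                       reverse=True)[:shortlist_size]
    # Stage 2: re-rank only the shortlist with the more expensive model.
    return max(shortlist, key=lambda k: fine_score(query_phonemes, DATABASE[k]))

# The whole phrase is matched in one go, not word by word.
print(phonetic_search("d aw n ih ng s t r iy t"))
```

Because the first stage only has to be roughly right, it can be made very fast; the detailed model then only pays its cost on a handful of candidates rather than millions of entries.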

Because the system is searching phonetic symbols rather than words, it can run much faster and with far less processing power than speech-to-text search. Although the software is still most useful when searching the internet, running the speech-recognition function locally on the device has certain key advantages, said Hunt. 'You're sending a lot less data,' he said. 'Speech is an incredibly expensive signal to send. To send a request for a journey is much less data to send. It's just a great deal faster as well.'
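Hunt's point about the cost of sending speech is easy to check with back-of-envelope arithmetic. The figures below are assumptions for illustration (narrowband 16-bit audio and an invented journey request), not numbers from Novauris:

```python
# Assumed figures: telephone-quality uncompressed audio vs. a short text request.
sample_rate_hz = 8_000      # narrowband speech sampling rate
bytes_per_sample = 2        # 16-bit PCM
seconds_of_speech = 3       # a short spoken query

audio_bytes = sample_rate_hz * bytes_per_sample * seconds_of_speech

# A hypothetical decoded journey request, sent as plain text instead.
text_request = "from:London Paddington to:Oxford"
text_bytes = len(text_request.encode("utf-8"))

print(audio_bytes)                 # 48000 bytes of raw audio
print(text_bytes)                  # 32 bytes of text
print(audio_bytes // text_bytes)   # audio is ~1500x larger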

Phonetic searching isn’t a new idea and it does have its limitations, according to Prof Steve Young, head of information engineering at Cambridge University and part of the Speech Research Group there. ‘There are many weaknesses which in my view make phonetic search a short-term limited solution,’ he said.

Phonetic search can struggle with background noise and poor articulation, Young explained. It can have difficulty identifying commands and requests that don’t match its pre-programmed database or that are paraphrases, and it can’t dictate long messages.

‘And finally, a huge advantage of server-based solutions is that when people speak, their data can be collected and analysed. A major factor in recognition accuracy is the amount of data available to train the acoustic models. The more people that speak to Google Search and Siri, the better those systems will get.’

Applications for the software include finding public transport routes by speaking your destination into a smartphone or computer.

Hunt is confident that Novauris can find ways to get around many of these issues. He said NovaSearch already has a highly robust system for dealing with background noise that had been thoroughly tested. The software can already handle some paraphrasing, and Hunt said that building in capabilities comparable with the likes of Siri should, in principle, be possible.

With regard to NovaSearch's ability to learn from user data, he said some applications already had the ability to transmit speech data when a network was available and to receive results from training material on servers. And individual devices could be trained to learn from their users.

But he also recognised that there would always be a trade-off between flexibility and efficient, accurate performance in specific circumstances. 'We took a decision when the company was founded to use our inevitably limited resources to do one class of tasks exceptionally well, and that is what we believe we do,' he said.