Finding formulae

Penn State College of Information Sciences and Technology researchers have created ChemxSeer, the first publicly available search engine designed specifically for chemical formulae.

According to the scientists, the tool is more accurate than general-purpose search engines: they say it can tell when ‘He’ refers to helium rather than the pronoun at least nine times out of 10.

C Lee Giles, professor of information sciences and technology and co-director of the IST Cyber Infrastructure Lab, said that the new algorithm can also identify related chemicals with different formula representations and chemicals with related substructures or similarities.

‘Results from our search engine are much more relevant than results returned by popular search engines,’ said Giles. ‘It is one of several cyber tools under development in our lab which will enable better access to and sharing of information and data among scientists and scholars.’

To create ChemxSeer, the researchers ‘taught’ machines how to recognise chemical formulae by providing training samples that contained both chemical formulae and similar-looking text that is not a chemical formula.

‘Teaching the computer to classify what is a formula and what is not was complex because language is inherently context sensitive and judging the meaning of a term using its context is hard for a machine,’ said Prasenjit Mitra, one of the researchers who developed the software.
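As a rough illustration of learning from labelled examples in context, the sketch below trains a tiny naive Bayes classifier on the words surrounding the token ‘He’. The article does not say which learning method ChemxSeer uses, so the choice of naive Bayes is an assumption made for illustration, and the training sentences, labels and function names are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical hand-labelled contexts for the token "He" (not ChemxSeer data).
TRAINING = [
    ("the balloon was filled with He gas", "formula"),
    ("He and Ne are noble gases", "formula"),
    ("liquid He cools the magnet", "formula"),
    ("He said the results were relevant", "other"),
    ("He went to the lab yesterday", "other"),
    ("then He wrote the paper", "other"),
]


def context_words(text, target="He"):
    """Lowercased words around the target token, with the target itself dropped."""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", text) if w != target]


def train(samples):
    """Count context words per label; returns (word_counts, label_counts, vocab)."""
    word_counts = {}
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(context_words(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab


def classify(text, model):
    """Pick the label with the highest naive Bayes log score (add-one smoothing)."""
    word_counts, label_counts, vocab = model
    total = sum(label_counts.values())
    scores = {}
    for label, counts in word_counts.items():
        n = sum(counts.values())
        score = math.log(label_counts[label] / total)
        for w in context_words(text):
            score += math.log((counts[w] + 1) / (n + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


MODEL = train(TRAINING)
print(classify("He gas was cooled", MODEL))
print(classify("He said the paper was late", MODEL))
```

With only six training sentences this is a toy; it is meant to show why surrounding words such as ‘gas’ or ‘said’ carry the signal, not to reproduce the accuracy the researchers report.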

Other obstacles the scientists had to overcome included the multiple representations possible for each formula: a person searching for CH4 (methane) must also be able to find the chemical when it is written as H4C.
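One common way to make CH4 and H4C match is to reduce every formula to a single canonical form before indexing and searching. The sketch below does this with the Hill system (carbon first, hydrogen second, then the remaining elements alphabetically; purely alphabetical when there is no carbon). It is an assumption for illustration, not a description of how ChemxSeer handles the problem; the function names are hypothetical, and the parser handles only simple formulae without brackets or charges and does not check that symbols are real elements.

```python
import re
from collections import Counter

# A simple formula: one or more element symbols, each with an optional count.
_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def element_counts(formula):
    """Return a Counter mapping element symbol to atom count, e.g. H4C -> C:1, H:4."""
    if not _FORMULA.fullmatch(formula):
        raise ValueError(f"not a simple formula: {formula!r}")
    counts = Counter()
    for symbol, digits in _TOKEN.findall(formula):
        counts[symbol] += int(digits) if digits else 1
    return counts


def canonical_key(formula):
    """Rewrite a formula in Hill order so equivalent spellings share one key."""
    counts = element_counts(formula)
    symbols = sorted(counts)
    if "C" in counts:
        rest = [s for s in symbols if s not in ("C", "H")]
        symbols = ["C"] + (["H"] if "H" in counts else []) + rest
    return "".join(s + (str(counts[s]) if counts[s] > 1 else "") for s in symbols)


print(canonical_key("CH4"))  # CH4
print(canonical_key("H4C"))  # CH4
```

Because the key depends only on element counts, it would also merge distinct compounds that share the same atoms, so a real system would need more than this to separate them.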

ChemxSeer was developed using experience the researchers gained from designing information-extraction algorithms for CiteSeer, a search engine for academic and scientific documents.