The rules can then be used to generate new and meaningful sentences. The method also works for other sequential data, such as sheet music or protein sequences.
The development – which has a patent pending – has implications for speech recognition and for other applications in natural language engineering, as well as for genomics and proteomics. It also offers new insights into language acquisition and psycholinguistics.
“This is the first time an unsupervised algorithm is shown capable of learning complex syntax, generating grammatical new sentences and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics,” explained Shimon Edelman, a computer scientist who is a professor of psychology at Cornell.
Unlike previous attempts at developing computer algorithms for language learning, the new method, called Automatic Distillation of Structure (ADIOS), successfully identifies complex patterns in raw texts. The algorithm discovers the patterns by repeatedly aligning sentences and looking for overlapping parts.
For example, the sentences I would like to book a first-class flight to
If the system also encounters the sentences I need to book a direct flight from
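The idea of lining up sentences and looking for overlapping parts can be illustrated with a small sketch. This is not the ADIOS algorithm itself, which relies on a statistical criterion and on structured generalisation; it is a much simpler stand-in that only collects word sequences shared by several sentences. The function name `shared_patterns`, its parameters and the toy sentences are all hypothetical, chosen for illustration and not taken from the researchers' work.

```python
from collections import defaultdict


def shared_patterns(sentences, min_len=2, min_support=2):
    """Find maximal word sequences shared by at least `min_support` sentences.

    A simplified illustration of overlap discovery, NOT the ADIOS method.
    Returns a dict mapping each pattern (tuple of words) to the sorted
    indices of the sentences that contain it.
    """
    # Record which sentences contain each contiguous word sequence.
    support = defaultdict(set)
    for idx, sentence in enumerate(sentences):
        words = sentence.lower().split()
        for start in range(len(words)):
            for end in range(start + min_len, len(words) + 1):
                support[tuple(words[start:end])].add(idx)

    # Keep only sequences that recur in enough different sentences.
    frequent = {g: ids for g, ids in support.items() if len(ids) >= min_support}

    def contained_in(short, long):
        n = len(short)
        return any(long[i:i + n] == short for i in range(len(long) - n + 1))

    # Drop a sequence if a longer one covers it in every sentence where it occurs.
    maximal = {}
    for gram, ids in frequent.items():
        dominated = any(
            len(other) > len(gram) and ids <= other_ids and contained_in(gram, other)
            for other, other_ids in frequent.items()
        )
        if not dominated:
            maximal[gram] = sorted(ids)
    return maximal


# Toy input, invented for this example.
toy_sentences = [
    "I would like to order a large pizza today",
    "I want to order a large coffee now",
    "please order a large salad",
]
patterns = shared_patterns(toy_sentences)
for pattern, where in patterns.items():
    print(" ".join(pattern), "-> sentences", where)
```

In this sketch, the words that differ around a shared sequence (here, the item being ordered) are the natural candidates for an interchangeable slot, which is the kind of generalisation the article attributes to the real system.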
He added, “ADIOS relies on a statistical method for pattern extraction and on structured generalisation – two processes that have been implicated in language acquisition. Our experiments show that it can acquire intricate structures from raw data, including transcripts of parents’ speech directed at 2- or 3-year-olds. This may eventually help researchers understand how children, who learn language in a similar item-by-item fashion and with very little supervision, eventually master the full complexities of their native tongue.”
In addition to child-directed language, the algorithm has been tested on the full text of the Bible in several languages, on artificial context-free languages with thousands of rules and on musical notation. It has also been applied to biological data, such as nucleotide base pairs and amino acid sequences. In analysing proteins, for example, the algorithm extracted patterns from amino acid sequences, and these patterns were highly correlated with the functional properties of the proteins.
The new method was developed jointly with David Horn and Eytan Ruppin, professors of physics and computer science, respectively, at