Using watermarks to preserve the integrity of printed documents dates back 2,000 years, but researchers at Purdue University could bring that time-tested method into the electronic age.
Mikhail Atallah, professor of computer science, and Victor Raskin, professor of English, have developed a way to embed a watermark in ‘natural language’ text documents as well as sensitive electronic documents. Natural language includes all the spoken languages but not languages created for special purposes, such as computer languages.
The concept of placing watermarks on electronic documents is not new, but what makes natural language watermarking unique is that it embeds the watermark in the syntax, or grammatical structure, of the language. A future version of the prototype will also embed the watermark in the meaning of the language. This process has never been done before electronically.
One factor making text so difficult to watermark is that, compared to a photographic image, a text document has very few places in which to hide watermarks.
‘Every pixel in a full-screen image contains information,’ said Raskin. ‘There is a lot of redundancy in the image.’
That redundancy is what makes it possible to embed a watermark. One could, for example, switch a few blue pixels to red. If a field of blue surrounds the red pixels, the image itself is still seen as blue.
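The pixel example can be made concrete with a small sketch. It is only an illustration under simple assumptions: the image is a flat list of RGB tuples, and the names `embed`, `extract` and `dominant_color` are hypothetical, not part of any real image-watermarking tool.

```python
from collections import Counter

BLUE = (0, 0, 255)
RED = (255, 0, 0)

def embed(pixels, positions):
    """Return a copy of the image with the pixels at `positions` switched to red."""
    marked = list(pixels)
    for i in positions:
        marked[i] = RED
    return marked

def extract(pixels):
    """Recover the mark by listing the positions of the red pixels."""
    return [i for i, pixel in enumerate(pixels) if pixel == RED]

def dominant_color(pixels):
    """Return the most common color, a rough stand-in for what a viewer sees."""
    return Counter(pixels).most_common(1)[0][0]

image = [BLUE] * 100            # a field of blue
mark_positions = [7, 42, 93]    # where the mark is hidden (assumed values)
marked = embed(image, mark_positions)
```

Three red pixels out of 100 leave the dominant color blue, yet `extract(marked)` recovers the hidden positions exactly.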
Text documents are another story, Raskin said. ‘In natural language, there is no redundancy. That is, every word means something. If you change it, you change the meaning of the sentence. That’s the difficulty.’
To get around this problem, Atallah and Raskin have developed a way to embed a watermark using the structure of language itself.
Natural language watermarks, unlike those used in images, do not embed something physical in the text. Language watermarks instead introduce very slight changes in the grammatical makeup of selected sentences throughout a document, while keeping the meaning intact.
‘What we embed is not something you can see,’ said Raskin. ‘It’s in the invisible syntactic structure.’
A watermark is introduced throughout a document using an encryption algorithm based on a very large prime number. That number works as a key: without it, the watermark cannot be retrieved. The algorithm selects certain sentences in a document and subtly changes their syntactic structure.
For example, a sentence in a document may read ‘Ships in the vicinity may provide some additional assistance.’ After the document has been watermarked, the sentence will read, ‘Some additional assistance may be provided by the ships in the vicinity.’
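A rough sketch of that idea, keyed sentence selection followed by a change of voice, might look like the following. It is not the Purdue algorithm: a standard keyed hash (HMAC-SHA256) stands in for the prime-based scheme, the active-to-passive rewrite is a hand-written lookup table rather than a real parser, and every name here (`select_sentences`, `to_passive`, `watermark`, `SECRET`) is a hypothetical chosen for illustration.

```python
import hashlib
import hmac

ACTIVE = "Ships in the vicinity may provide some additional assistance."
PASSIVE = "Some additional assistance may be provided by the ships in the vicinity."

# Assumption: a tiny table of known rewrites; a real system would parse the sentence.
REWRITES = {ACTIVE: PASSIVE}

def select_sentences(secret, sentences, count):
    """Pick `count` sentence indices in an order only the key holder can reproduce."""
    def tag(i):
        return hmac.new(secret, sentences[i].encode("utf-8"), hashlib.sha256).digest()
    ranked = sorted(range(len(sentences)), key=tag)
    return sorted(ranked[:count])

def to_passive(sentence):
    """Rewrite a sentence in the passive voice if a rewrite is known, else leave it."""
    return REWRITES.get(sentence, sentence)

def watermark(secret, sentences, count):
    """Change the syntactic structure of the selected sentences only."""
    chosen = set(select_sentences(secret, sentences, count))
    return [to_passive(s) if i in chosen else s for i, s in enumerate(sentences)]

SECRET = b"stand-in for the large secret number"
sentences = ["The storm damaged the hull.", ACTIVE, "The crew is safe."]
marked = watermark(SECRET, sentences, count=2)
```

Anyone without `SECRET` cannot tell which sentences were candidates for rewriting, which is the role the article assigns to the large number.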
Another factor making this technique difficult to implement is that the watermark must survive changes to the text, Atallah said.
‘We wanted to make our scheme resilient to simple changes in the text that are easy to make by automated processes, such as synonym substitutions,’ he said. ‘If you change one word for another throughout the whole document, we would expect the watermark to still be there. It turns out that it’s resilient to a lot more than that. It’s also resistant to insertions and deletions of sentences.’
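Why a mark carried by sentence structure can survive synonym substitution is easy to show with a toy example. The detector below is an assumption made for illustration: it treats the pattern "be ...ed by" as a sign of the passive voice, which is far cruder than a real syntactic analysis, and the synonym list is invented. Insertions and deletions of sentences are not modeled here.

```python
import re

# Assumption: a crude passive-voice detector, not a real parser.
PASSIVE_PATTERN = re.compile(r"\bbe \w+ed by\b")

# Assumption: a small synonym list an automated attacker might apply.
SYNONYMS = {"assistance": "help", "vicinity": "area", "additional": "extra"}

def read_bit(sentence):
    """Read 1 if the sentence looks passive, 0 otherwise."""
    return 1 if PASSIVE_PATTERN.search(sentence) else 0

def substitute_synonyms(sentence):
    """Replace every word that has an entry in SYNONYMS, leaving the rest alone."""
    return re.sub(r"[A-Za-z]+", lambda m: SYNONYMS.get(m.group(0), m.group(0)), sentence)

original = "Ships in the vicinity may provide some additional assistance."
marked = "Some additional assistance may be provided by the ships in the vicinity."
attacked = substitute_synonyms(marked)
```

The attacked sentence uses different words, but its structure is still passive, so `read_bit` returns the same value before and after the substitution.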