Arthur C. Clarke famously wrote, “Any sufficiently advanced technology is indistinguishable from magic”. In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes skills out of this world to follow the logic of modern tax return instructions.
Bureaucracy often has its place and reason. Laws protect exploitable minorities; procedures serve to avoid known problems; cross-referencing forms are filled in triplicate to allow for audit and protection against falsification. The problem is not the bureaucracy as such but rather the fact that it eventually outgrows any individual person’s ability to comprehend it. At that point, only dedicated specialists can understand the process and the rest of us have to offer sacrifices to those acolytes in hopes of beneficial results.
Enter computers. It turns out that computers can bring the complexity of information back down, within the reach of the non-specialist. The more bureaucratic a process, the better a computer can figure it out. What is a mind-numbing in-triplicate form to a human is, to the computer, a structured source of information with cross-checking redundancy.
This area of research is called “Natural Language Processing” – NLP. It is not an obscure field – any Google user has benefited from this type of research. Other applications of NLP include speech recognition and machine translation.
NLP is not a new branch of science. Back in the 1950s, software was being developed in the USA to translate from German into English. These early systems were grammar-based, and their translation quality was very poor. Nevertheless, even the possibility of machine translation was so impressive that about US$20 million was spent on the research before the enchantment fizzled out and funding virtually stopped. NLP did not die at that point, but it certainly slowed down.
Statistical approaches to NLP have been around nearly as long as grammar-based ones. Because they require large quantities of data, however, they did not become feasible until the mid-1990s. Once they reached popularity, the research advanced rapidly, taking advantage of ever-increasing computer speed and available storage. Statistical approaches do not rely on language comprehension. Instead, given sufficient amounts of text, common patterns can be established without understanding the rules of their formation.
A good example is Google’s new translation engine from Arabic to English. The engine won the NIST 2005 machine translation competition, even though its software developers did not know Arabic. Instead, they used existing United Nations documents that had been translated in parallel by professionals – some 200 billion words of content in total. It is perhaps symbolic that, even in such a deeply technical area, the Universal Declaration of Human Rights helps to ensure humans all over the world will be able to communicate with each other.
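The idea of learning from parallel documents without knowing the language can be pictured with a toy sketch. It is not how any production engine works; it only counts which words tend to appear together in paired sentences and scores each pairing with the Dice coefficient. The four-sentence German-English corpus and the function names (dice_scores, best_translation) are assumptions made up for this illustration.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus: each source sentence is paired with its translation.
PARALLEL = [
    ("das haus", "the house"),
    ("das buch", "the book"),
    ("ein buch", "a book"),
    ("ein haus", "a house"),
]


def dice_scores(pairs):
    """Score every (source word, target word) pair by how often they co-occur."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        # Every source word co-occurs with every target word in the pair.
        joint.update(product(src_words, tgt_words))
    # Dice coefficient: 2 * joint count / (source count + target count).
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


def best_translation(word, scores):
    """Pick the target word with the highest score for the given source word."""
    candidates = {t: v for (s, t), v in scores.items() if s == word}
    return max(candidates, key=candidates.get)


scores = dice_scores(PARALLEL)
print(best_translation("haus", scores))
print(best_translation("ein", scores))
```

No grammar or dictionary is involved: the pairing of "haus" with "house" emerges only because the two words keep turning up in the same sentence pairs. With real data the same principle needs vastly more text, which is why large professionally translated collections matter so much.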
On its own, however, a statistical approach is not a panacea either. Since there is no real understanding involved, a statistical NLP system has no way to recover from invalid conclusions.
There is more to the puzzle. Most real-world texts are about somebody or something. The entity could be a person, a company, or a committee. Sometimes, the name of that entity is very long. Documents of the United Nations are known for names that even a human would struggle with. “The Ad Hoc Committee on the Scope of Legal Protection under the Convention on the Safety of United Nations and Associated Personnel” is one of those. Other large organisations have similar problems.
Currently, neither of the above approaches is sufficient on its own. Grammar-based systems break on complex names; statistical ones mark ‘The Committee’ as a completely separate entity, rather than a reference to the full name.
The ideal system that we’re working on would be able to identify the complex names using a combination of techniques. It would also be capable of using multiple appearances in different contexts to confirm the identification, including linking different forms of the same name. Once these goals are achieved, documents in the legal and medical domains can benefit fully from other, already available, research.
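Linking different forms of the same name can be sketched very roughly as follows. This is only a minimal illustration under simplifying assumptions, not the system described here: a short mention is linked to the most recently seen full name that contains all of its content words. The stopword list and the function names (content_words, link_mention) are hypothetical.

```python
# Hypothetical list of function words to ignore when comparing names.
STOPWORDS = {"the", "a", "an", "of", "on", "and", "under", "for", "in"}


def content_words(name):
    """Lower-case a name and drop the function words."""
    return [w for w in name.lower().split() if w not in STOPWORDS]


def link_mention(mention, known_names):
    """Return the full name a short mention most plausibly refers to, or None.

    known_names is ordered from earliest to most recently seen.
    """
    # An exact match (ignoring case) always wins.
    for full in known_names:
        if full.lower() == mention.lower():
            return full
    wanted = set(content_words(mention))
    if not wanted:
        return None
    # Otherwise prefer the most recent name that contains every content word.
    for full in reversed(known_names):
        if wanted <= set(content_words(full)):
            return full
    return None


names = [
    "United Nations",
    "The Ad Hoc Committee on the Scope of Legal Protection under the "
    "Convention on the Safety of United Nations and Associated Personnel",
]
print(link_mention("The Committee", names))
print(link_mention("the Council", names))
```

A rule this simple breaks quickly, for example when two committees appear in the same document, which is why confirmation from multiple appearances in different contexts matters.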
Soon, the day will come when computers understand what humans write or say. Hopefully, without needing the triplicates.