Category Archives: My PhD research

Bulk converting doc files into txt (or html)

I have written before about converting Microsoft Word files into text or html using OpenOffice. However, the wizards I described in that article kept crashing once the number of files grew into the several hundreds.

I had written some macros to do the conversion, but they were scary-looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. DocConverter by Danny Brewer and Dan Horwood converts a whole directory of files at a time, from any OpenOffice-understood format to any other.

I have just converted more than a thousand documents from doc to txt without any problems. Actually, I had one small problem, but it was my fault: some corrupted files that OO would not open were breaking DocConverter and throwing an ugly-looking Basic runtime error. I had to delete the problem files, kill OpenOffice (stopping the macro did not help) and rerun the tool. Otherwise, it just ran.
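For anyone who would rather script the whole thing than drive a GUI tool, here is a minimal sketch of the same idea in Python. It assumes a LibreOffice-style install whose soffice binary supports the --headless and --convert-to options, and the directory names are just placeholders; adjust both for your own setup.

# Minimal sketch: bulk-convert every .doc file in a directory to .txt.
# Assumes an soffice binary that supports --headless and --convert-to;
# the SRC and DST paths below are hypothetical placeholders.
import subprocess
from pathlib import Path

SRC = Path("doc_files")   # directory holding the .doc files
DST = Path("txt_files")   # where the converted .txt files should go
DST.mkdir(exist_ok=True)

for doc in sorted(SRC.glob("*.doc")):
    # Run one conversion per file so a single corrupted document
    # does not take the whole batch down with it.
    result = subprocess.run(
        ["soffice", "--headless", "--convert-to", "txt:Text",
         "--outdir", str(DST), str(doc)],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        print(f"Skipping {doc.name}: {result.stderr.strip()}")

Converting one file per invocation is slower than handing the whole directory over in a single call, but it turns a corrupted file into a per-file failure instead of the batch-killing runtime error described above.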

Artificial Intelligence discussion at BarCampNYC3

They say at BarCamp that if you don’t like the session you are in, feel free to go to a better one. No hard feelings. But what do you do if you show up for an announced moderated discussion session and the moderator does not?

That’s what happened to us with the last (5:15pm) slot of the second day of BarCampNYC3. So, after waiting for 10 minutes past the start time, I decided to step in and moderate.

We talked a bit about everything: a definition of Artificial Intelligence (no agreement), and statistical algorithms that try to find tanks, tune adverts and prevent spam. We discussed the state of the art in computer vision and why a once well-known consumer company in that space (Riya) still failed miserably. Near the end, we also talked about artificial intelligence as emotional intelligence and whether Pleo is intelligent.

Altogether, it was a very spirited discussion, and most of the people contributed their opinions and their knowledge. We may not have discussed what the original moderator had in mind, but we certainly discussed interesting topics.

Unravelling the black magic of bureaucracy

Arthur C. Clarke once famously wrote “Any sufficiently advanced technology is indistinguishable from magic”. In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes skills out of this world to follow the logic of modern tax return instructions.

Bureaucracy often has its place and reason. Laws protect exploitable minorities; procedures serve to avoid known problems; cross-referencing forms are filled in triplicate to allow for audit and protection against falsification. The problem is not the bureaucracy as such but rather the fact that it eventually outgrows any individual person’s ability to comprehend it. At that point, only dedicated specialists can understand the process and the rest of us have to offer sacrifices to those acolytes in hopes of beneficial results.

Enter computers. It turns out that computers can bring the complexity of information back down within reach of the non-specialist. The more bureaucratic a process is, the better a computer can figure it out. What is a mind-numbing in-triplicate form to a human is a structured source of information with cross-checking redundancy to a computer.

This area of research is called “Natural Language Processing” – NLP. It is not an obscure field – any Google user has benefited from this type of research. Other applications of NLP include speech recognition and machine translation.

NLP is not a new branch of science. Back in the 1950s, software was being developed in the USA to translate from German into English. The translation quality of those grammar-based systems was very poor. Nevertheless, even the possibility of machine translation was so impressive that about US$20 million was spent on the research before the enchantment fizzled out and funding virtually stopped. NLP did not die at that point, but it certainly slowed down.

Statistical approaches to NLP have been around nearly as long as grammar-based ones. However, because they require large quantities of data, they did not become feasible until the mid-1990s. Once they did reach popularity, the research advanced rapidly, taking advantage of ever-increasing computer speed and available storage. Statistical approaches do not rely on language comprehension. Instead, given sufficient amounts of text, common patterns can be established without understanding the rules of their formation.
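As a toy illustration of that idea (and nothing more than an illustration – no real system is this simple), consider counting which word tends to follow which in a small corpus. No grammar rules are consulted; the only “knowledge” is co-occurrence counts, and with enough text such counts become usable probabilities.

# Toy illustration of the statistical idea: count word bigrams in a corpus
# and use the counts to guess which word tends to follow another.
# No grammar rules are consulted; the "knowledge" is just co-occurrence counts.
from collections import Counter, defaultdict

corpus = [
    "the committee approved the report",
    "the committee rejected the proposal",
    "the committee approved the budget",
]

following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for left, right in zip(words, words[1:]):
        following[left][right] += 1

# The most common continuation of "committee" in this tiny corpus:
print(following["committee"].most_common(1))   # [('approved', 2)]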

A good example is Google’s new Arabic-to-English translation engine. The engine won the NIST 2005 machine translation competition, even though its developers did not know Arabic. Instead, they used existing parallel United Nations documents, translated by professionals – some 200 billion words of content in total. It is perhaps symbolic that, even in such a deeply technical area, the Universal Declaration of Human Rights helps to ensure that humans all over the world will be able to communicate with each other.

On its own, however, a statistical approach is not a panacea either. Since there is no real understanding involved, a statistical NLP system has no way to recover from invalid conclusions.

There is more to the puzzle. Most real-world texts are about somebody or something. The entity could be a person, a company, or a committee. Sometimes the name of that entity is very long. United Nations documents are known for names that even a human would struggle with; “The Ad Hoc Committee on the Scope of Legal Protection under the Convention on the Safety of United Nations and Associated Personnel” is one of them. Other large organisations have similar problems.

Currently, neither of the above approaches is sufficient on its own. Grammar-based systems break on complex names; statistical ones mark ‘The Committee’ as a completely separate entity, rather than a reference to the full name.

The ideal system that we’re working on would be able to identify such complex names using a combination of techniques. It would also use multiple appearances in different contexts to confirm the identification, including linking different forms of the same name. Once these goals are achieved, documents in the legal and medical domains can get the full benefit of other, already available, research.
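Purely to illustrate the linking idea – this is a crude sketch, not the system described above – one could match a short form such as “The Committee” against previously seen long names by checking whether its content words appear in the long name. The function name and stopword list here are my own illustrative choices.

# Crude illustration of linking a short form back to a previously seen
# long name by matching shared content words. A sketch of the idea only.
def link_short_form(short_form, known_names):
    """Return the first known long name containing every content word
    of the short form (case-insensitive), or None if nothing matches."""
    stopwords = {"the", "of", "on", "and", "under"}
    tokens = [w.lower() for w in short_form.split() if w.lower() not in stopwords]
    for name in known_names:
        name_words = {w.lower().strip(",.") for w in name.split()}
        if tokens and all(tok in name_words for tok in tokens):
            return name
    return None

names = [
    "The Ad Hoc Committee on the Scope of Legal Protection under the "
    "Convention on the Safety of United Nations and Associated Personnel",
]
print(link_short_form("The Committee", names))  # prints the full name

A real system needs far more than head-word overlap – context, document structure and statistics all matter – but even this toy shows why treating “The Committee” as a brand-new entity throws information away.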

Soon, the day will come when computers understand what humans write or say. Hopefully, without needing the triplicates.