Computational Linguistics



I have written before about converting Microsoft Word files into text or HTML using OpenOffice. However, the wizards I described in that article crashed once the number of files reached several hundred.

I had written some macros to do the conversion, but they were scary looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. DocConverter by Danny Brewer and Dan Horwood lets you convert a whole directory of files at a time from any format OpenOffice understands to any other.

I have just converted more than a thousand documents from doc to txt without any problems. Actually, I had a small problem, but it was my fault. I had some corrupted files that OpenOffice would not open, and that was breaking DocConverter and throwing an ugly looking Basic runtime error. I had to delete the problem files, kill OpenOffice (stopping the macro was not enough) and rerun the tool. Otherwise, it just ran.


While reading the Weka Data Mining book, I came across this impressive example of using machine learning to confirm a person’s authorship (p. 358).

In the 19th century, there lived a famous rabbinic scholar, Ben Ish Chai, who among other writings had two collections of letters. Ben Ish Chai claimed that only one collection was his and that the other was somebody else’s, found by him. Modern scholars thought both collections were his, but could not prove it conclusively because the style of writing was different.

Machine Learning to the rescue! In 2004, Moshe Koppel and Jonathan Schler discovered that it may help to look not at the differences in writing style themselves (as the style may have been faked), but rather at how deep those differences run. For example, an author could fake a stylistic mismatch by consciously avoiding favourite words, but would still write in long run-on sentences, use more passive verb forms or display many other measurable behaviours.

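So, if the most obvious differences were removed one by one, the speed at which the remaining features came to look identical could be a good indicator: texts by the same author become indistinguishable quickly, while texts by different authors stay distinguishable for longer. They called this technique unmasking, and the mystery of Ben Ish Chai was solved for good.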
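To make the idea a bit more concrete, here is a minimal Python sketch of the unmasking loop as I understand it from the description above. It is a toy under stated assumptions, not the method from the paper: I have swapped in a simple nearest-centroid classifier and plain word frequencies as features, and the function names and the default values for chunk_size, n_features, drop and rounds are all my own invention for illustration.

```python
from collections import Counter


def chunk_features(words, chunk_size, vocab):
    """Cut a word list into equal chunks; return relative frequencies of vocab words."""
    starts = range(0, len(words) - chunk_size + 1, chunk_size)
    rows = []
    for start in starts:
        counts = Counter(words[start:start + chunk_size])
        rows.append({w: counts[w] / chunk_size for w in vocab})
    return rows


def centroid(rows, vocab):
    """Average feature vector of a set of chunks."""
    return {w: sum(r[w] for r in rows) / len(rows) for w in vocab}


def distance(row, cen, vocab):
    """Squared Euclidean distance restricted to the current vocabulary."""
    return sum((row[w] - cen[w]) ** 2 for w in vocab)


def loo_accuracy(rows_a, rows_b, vocab):
    """Leave-one-out accuracy of a nearest-centroid classifier (ties count as half)."""
    total = 0.0
    for own_rows, other_rows in ((rows_a, rows_b), (rows_b, rows_a)):
        other_centroid = centroid(other_rows, vocab)
        for i, row in enumerate(own_rows):
            rest = own_rows[:i] + own_rows[i + 1:]
            d_own = distance(row, centroid(rest, vocab), vocab)
            d_other = distance(row, other_centroid, vocab)
            if d_own < d_other:
                total += 1.0
            elif d_own == d_other:
                total += 0.5
    return total / (len(rows_a) + len(rows_b))


def unmask(words_a, words_b, chunk_size=500, n_features=250, drop=6, rounds=10):
    """Return the accuracy curve as the most telling features are removed.

    A curve that collapses quickly hints at one author; a curve that stays
    high hints at two. All defaults are arbitrary illustrative choices.
    """
    vocab = [w for w, _ in Counter(words_a + words_b).most_common(n_features)]
    rows_a = chunk_features(words_a, chunk_size, vocab)
    rows_b = chunk_features(words_b, chunk_size, vocab)
    if len(rows_a) < 2 or len(rows_b) < 2:
        raise ValueError("each text needs at least two full chunks")
    curve = []
    for _ in range(rounds):
        if not vocab:
            break
        curve.append(loo_accuracy(rows_a, rows_b, vocab))
        cen_a, cen_b = centroid(rows_a, vocab), centroid(rows_b, vocab)
        # Rank features by how strongly they separate the two texts.
        ranked = sorted(vocab, key=lambda w: abs(cen_a[w] - cen_b[w]), reverse=True)
        removed = set(ranked[:drop])
        vocab = [w for w in vocab if w not in removed]
    return curve


if __name__ == "__main__":
    # Two toy "texts" that differ in a single favourite word only.
    text_a = "the the of of and and alpha alpha".split() * 4
    text_b = "the the of of and and beta beta".split() * 4
    print(unmask(text_a, text_b, chunk_size=8, n_features=5, drop=2, rounds=3))
```

The interesting output is the shape of the curve rather than any single number: in the toy example the two texts differ only in one surface word, so the accuracy should fall to chance as soon as that word is taken away.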

I think what impressed me here was not the clever math. The whole field of determining authorship is based on clever math. It is rather the fact that the math was looking at hints within the hints of the language - the invisible aspects that become noticeable only after the eye learns to see beyond what the most obvious reality offers. I cannot explain it better, but to me it has a special elegance that just counting the words and sentence lengths does not offer.


What could Computational Linguistics and Aerobics have in common? Quite a lot, as it turns out.

Dance descriptions, while not really in English, do have a regular structure and can be thought of as a sub-language with a full set of syntactic, semantic and pragmatic levels.

There are the basic words of the language (move names), correct ways of putting them into a sentence (a routine), and so on all the way up to good flowing text (classes that do not hurt the participants).

I had been thinking about the relationship between dance instructions and computational linguistics in the context of Scottish Country Dancing for at least a year. My imagined benefits were that codified dance instructions would allow for automatic dance animations, superior teacher aids and other applications that currently require a lot of sweat and toil. Dance evening programmes, which are currently put together manually for each event, could be assisted by automated evaluation that points out awkward sequences of dances.

Unfortunately, my attempts at explaining the connection made no sense to the people around me. So, I was ecstatic to discover that such a link had already been found by others before me.

More than 10 years ago, Adam Bull tried to apply principles of computational linguistics to Aerobics for his MPhil degree, in a paper entitled The formal description of aerobic dance exercise - a corpus-based computational linguistics approach. While the report is not complete, it sets out many of the same arguments I have tried to make myself.

Unfortunately, an electronic copy of the document was not available. After some effort, I got in touch with Adam and he sent me a copy of the report with permission to distribute it. I have put a copy of it on my own server.

I hope his research will get rediscovered and improved upon. That way, when I get some time to apply my own PhD skills to Scottish Country Dancing, there will be more than one person on whose shoulders I can stand.


From time to time I experiment with the GATE NLP toolkit. Just now I tried to upgrade to the latest version (version 4) and ran into a really strange problem with the ANNIE system not loading correctly. Later, when I uninstalled the older GATE version, it stopped loading at all.

The problem is the user configuration file gate.xml, which is stored in a shared location, usually the home directory. On Windows, that is C:\Documents and Settings\[ProfileName]\.

One of the settings in that file pointed to where the plugins were loaded from and was still referring to GATE 3.1’s locations. That caused NullPointerExceptions in GATE, and everything was breaking from that point on.

I found this by using FileMon, but later realised that it could have been found more easily by changing the runtime.spawn property to false in GATE’s build.xml file, which is used to start the program. Using ant to start a program is a new one for me, but I guess it makes sense in some cases. Setting the property to false shows the startup messages and the exception that the wrong directories cause.

I have deleted the old gate.xml and gate.session files in my home directory and everything started to work. Back to actually trying to use the software.
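If you would rather not delete the files outright, the clean-up can be sketched in a few lines of Python. This is only an illustration: the two file names come from my own setup described above, while the function name, the .bak suffix and the choice to rename instead of delete are assumptions of mine, not anything GATE provides.

```python
from pathlib import Path

# Per-user files that GATE left in my home directory (names as described above).
STALE_FILES = ("gate.xml", "gate.session")


def set_aside_gate_config(home, suffix=".bak"):
    """Rename stale GATE user files in `home` so GATE starts with fresh settings.

    Returns the list of backup paths that were created. Renaming instead of
    deleting is a cautious variation on what I actually did.
    """
    moved = []
    for name in STALE_FILES:
        source = Path(home) / name
        if source.is_file():
            target = source.with_name(source.name + suffix)
            source.replace(target)  # overwrites an older backup, if any
            moved.append(target)
    return moved


if __name__ == "__main__":
    for backup in set_aside_gate_config(Path.home()):
        print("moved aside:", backup)
```

Run it with GATE closed. If the new version then starts cleanly, the backups can be removed by hand.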


As part of doing a PhD in Computational Linguistics, I need to understand both computers and linguistics. I am fine with computers, but linguistics is not my strong point. Unfortunately, many of the linguistics books and resources are quite dry.

So, I was really happy to discover the audio course Story of Human Language from The Teaching Company, taught by John McWhorter. It is quite long and covers a lot of material, but - apart from some overly long parts on universal language - it is really interesting and Professor McWhorter is a great presenter.

I actually had a chance both to listen to the audio version of the course and to see some of it on DVD. Personally, I prefer just audio, for several reasons.

Firstly, I can listen to the course on my MP3 player when I am walking or doing chores. The video version requires setting aside dedicated time, which for such a long course would be difficult.

Secondly, I actually found the visual part of the presentation quite boring - for the most part the professor just stands behind the lectern and talks from his notes. In fact, I found that the visual part distracted me from the really great and expressive rhetoric.

There were a number of great sections in the course, but I found the one explaining the language structure of Arabic and Chinese particularly interesting. He talked about Arabic first and I was all keen to learn that language. Then he switched over to Chinese and I found it even more fascinating. And then there were the comparisons of languages and his cat. This has to be heard to be believed.

The course is obviously available for purchase, but it can also be found in quite a few libraries. If you do borrow it from the library, try requesting all volumes at once. I only requested one volume and it was quite annoying to then have to wait a long time for the rest of the course to arrive. That impatience is another way I knew the course was enjoyable, as I had plenty of other audio material to listen to in the meantime.