arafalov on September 1st, 2007

Arthur C. Clarke once famously wrote, “Any sufficiently advanced technology is indistinguishable from magic”. In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes out-of-this-world skills to follow the logic of modern tax return instructions.
Bureaucracy often has its place [...]

Continue reading about Unravelling the black magic of bureaucracy

When the OpenNLP toolkit uses the MaxEnt parser, it has to read in about 25 MB of model files. The model reader uses a basic, unbuffered file input stream. The result is an excessive number of system calls (and disk accesses) during parser startup.
The fix is extremely simple:

In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace

new FileInputStream(f) with
new BufferedInputStream(new FileInputStream(f), 1000000)

Recompile the maxent library
Deploy new [...]
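
Why the wrapper helps can be sketched without OpenNLP itself. The example below is an illustration only: the class and method names are hypothetical, an in-memory byte array stands in for the model file, and a counting stream stands in for the operating system, so each call it records corresponds roughly to one system call on a real file. It reads the data one byte at a time, first directly and then through a BufferedInputStream with the same 1000000-byte buffer as in the fix.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of OpenNLP or maxent.
final class BufferedReadSketch {

    /** Counts the read calls that reach the wrapped stream (a stand-in for system calls). */
    static final class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads all of data one byte at a time and returns the number of calls made to the source. */
    static long underlyingCalls(byte[] data, boolean buffered) {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        // Same wrapping as in the fix: a large buffer in front of the raw stream.
        InputStream in = buffered ? new BufferedInputStream(source, 1000000) : source;
        try {
            while (in.read() != -1) {
                // Discard the byte; only the call count matters here.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        byte[] data = new byte[3000000]; // stand-in for a model file
        System.out.println("unbuffered calls: " + underlyingCalls(data, false));
        System.out.println("buffered calls:   " + underlyingCalls(data, true));
    }
}
```

Without the buffer, every single-byte read goes all the way to the source; with it, the source is only asked for a new block when the buffer runs empty, so the count drops from millions to a handful.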

Continue reading about Reducing disk thrashing of OpenNLP/MaxEnt parser - with one line code change

arafalov on August 11th, 2007

I was not able to get the OpenNLP parser to work. There were no samples to play with, no command line tools to run. And I don’t even want to talk about documentation. That’s because there was not any. There was an attempt at a lame joke (at least that’s the only sense I can make of [...]

Continue reading about Getting OpenNLP parser to work

arafalov on August 5th, 2007

Bikel’s statistical parser is designed to be run from the command line. I need to run it from my own code.
The following wrapper seems to do the trick on Windows (substitute your own value for |parserdir|):

String settingsFile = "|parserdir|\\settings\\collins.properties";
Settings.load(settingsFile);
Parser parser = new Parser("|parserdir|\\bikel\\wsj-02-21.obj.gz");
Sexp result = parser.parse(Sexp.read("(This is a funny world)").list());

There is a complaint when running [...]

Continue reading about Running Bikel’s parser programmatically

I have been using the Stanford NLP Parser from the command line with the -tagSeparator flag to supply it with partially tagged input. As the parser seems to be really bad with date expressions and complex named entities, I need this functionality.
Now, I need to wrap the parser in my own code to add input/output batching and I [...]

Continue reading about Duplicating -tagSeparator effect when using Stanford Parser programmatically

arafalov on July 27th, 2007

This was the fastest beta invite confirmation ever. Unfortunately, Digger’s Terms of Service do not allow any sort of disclosure about its features or results. This is very different from Powerset, which has been going out of its way to let beta subscribers (even unconfirmed ones) know what they are doing. Digger does [...]

Continue reading about I received the Digger beta invite

arafalov on July 26th, 2007

Powerset hasn’t even started competing with Google yet and already it has its own competitor.
Digger - which is currently in private beta - does sense disambiguation of the search terms like everybody else. Unlike everybody else, however, they expose the underlying WordNet definitions to the searcher and allow them to pick, rate and even discuss [...]

Continue reading about Digger - Another NLP enhanced search engine (beta)

arafalov on July 23rd, 2007

I found another online syntax tree visualiser that can cope with large trees - phpSyntaxTree. It requires square brackets instead of the Lisp s-expression parentheses, but it should not be too hard to convert from one to the other. There is also a Ruby version of the application from a different developer, but it refused to [...]
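
The bracket conversion mentioned above can be sketched in a few lines of Java. This is a minimal illustration with hypothetical class and method names; it assumes the tree contains no literal parentheses inside tokens (Penn Treebank style output normally escapes them as -LRB- and -RRB-), so a plain character swap is enough.

```java
// Hypothetical helper; not part of phpSyntaxTree or any parser.
final class BracketConverter {

    /** Swaps s-expression parentheses for the square brackets the visualiser expects. */
    static String toSquareBrackets(String sexp) {
        return sexp.replace('(', '[').replace(')', ']');
    }

    public static void main(String[] args) {
        String tree = "(S (NP This) (VP (V is) (NP a funny world)))";
        // prints [S [NP This] [VP [V is] [NP a funny world]]]
        System.out.println(toSquareBrackets(tree));
    }
}
```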

Continue reading about Another large syntax tree visualiser

arafalov on July 20th, 2007

In my review of WordChamp and LingQ I mentioned that an ideal language learning system would have deep support for the specifics of the learner’s target language. I was asked to clarify what I mean by that.
I have now found an example of what could be a step in the right direction. It is an [...]

Continue reading about Learning English prepositions - the smart way

arafalov on July 17th, 2007

Just a link to an interesting article by Sunayana on Natural Language Processing as applied to problems in India.
She has an interesting point that because NLP is so underdeveloped in India, even undergraduate-level projects may be contributing to the cutting edge of research.
This is similar to what was mentioned in the podcast about Somali speech [...]

Continue reading about Link: NLP - The Indian perspective