Computational Linguistics


Arthur C. Clarke once famously wrote, “Any sufficiently advanced technology is indistinguishable from magic.” In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes skills out of this world to follow the logic of modern tax return instructions.

Bureaucracy often has its place and reason. Laws protect exploitable minorities; procedures serve to avoid known problems; cross-referencing forms are filled in triplicate to allow for audit and protection against falsification. The problem is not the bureaucracy as such but rather the fact that it eventually outgrows any individual person’s ability to comprehend it. At that point, only dedicated specialists can understand the process and the rest of us have to offer sacrifices to those acolytes in hopes of beneficial results.

Enter computers. It turns out that computers can bring the complexity of information back down, within the reach of the non-specialist. The more bureaucratic a process, the better a computer can figure it out. What is a mind-numbing in-triplicate form to a human is, to the computer, a structured source of information with cross-checking redundancy.

This area of research is called “Natural Language Processing” - NLP. It is not an obscure field - any Google user has benefited from this type of research. Other applications of NLP include speech recognition and machine translation.

NLP is not a new branch of science. Back in the 1950s, software was being developed in the USA to translate from German into English. The translation quality of these grammar-based systems was very poor. Nevertheless, even the possibility of machine translation was so impressive that about US$20 million was spent on the research before the enchantment fizzled out and funding virtually stopped. NLP did not die at that point, but it certainly slowed down.

Statistical approaches to NLP have been around nearly as long as grammar-based ones. Because they require large quantities of data, however, they did not become feasible until the mid-1990s. Once they became popular, the research advanced rapidly, taking advantage of ever-increasing computer speed and available storage. Statistical approaches do not rely on language comprehension. Instead, given sufficient amounts of text, common patterns can be established without understanding the rules of their formation.

A good example is Google’s new Arabic-to-English translation engine. The engine won the NIST 2005 machine translation competition, even though its software developers did not know Arabic. Instead, they used existing parallel United Nations documents translated by professionals - some 200 billion words of content in total. It is perhaps symbolic that, even in such a deeply technical area, the Universal Declaration of Human Rights helps to ensure that humans all over the world will be able to communicate with each other.

On its own, however, a statistical approach is not a panacea either. Since there is no real understanding involved, a statistical NLP system has no way to recover from invalid conclusions.

There is more to the puzzle. Most real-world texts are about somebody or something. The entity could be a person, a company, or a committee. Sometimes, the name of that entity is very long. United Nations documents are known for names that even a human would struggle with. “The Ad Hoc Committee on the Scope of Legal Protection under the Convention on the Safety of United Nations and Associated Personnel” is one of those. Other large organisations have similar problems.

Currently, neither of the above approaches is sufficient on its own. Grammar-based systems break on complex names; statistical ones mark ‘The Committee’ as a completely separate entity, rather than a reference to the full name.

The ideal system that we’re working on would be able to identify complex names using a combination of techniques. It would also be capable of using multiple appearances in different contexts to confirm the identification, including linking different forms of the same name. Once these goals are achieved, documents in the legal and medical domains can benefit fully from other, already available, research.
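
To make the "linking different forms of the same name" problem concrete, here is a deliberately naive sketch. It is not the system described above: the class ShortFormLinker and its word-overlap rule are hypothetical, invented only to show how a short mention such as ‘The Committee’ might be tied back to a previously seen full name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

/** Hypothetical toy linker; a real system would combine several techniques. */
final class ShortFormLinker {

    /**
     * Returns the most recently seen full name that contains every word of
     * the short mention (ignoring a leading "the"), or empty if none does.
     */
    static Optional<String> link(String mention, List<String> fullNamesSeenSoFar) {
        String stripped = mention.toLowerCase(Locale.ROOT).trim().replaceFirst("^the\\s+", "");
        List<String> mentionWords = Arrays.asList(stripped.split("\\s+"));
        // Walk backwards so that the most recent matching name wins.
        for (int i = fullNamesSeenSoFar.size() - 1; i >= 0; i--) {
            String full = fullNamesSeenSoFar.get(i);
            List<String> fullWords = Arrays.asList(full.toLowerCase(Locale.ROOT).split("\\s+"));
            if (fullWords.containsAll(mentionWords)) {
                return Optional.of(full);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> seen = Arrays.asList(
            "The Ad Hoc Committee on the Scope of Legal Protection under the Convention"
                + " on the Safety of United Nations and Associated Personnel",
            "The General Assembly");
        System.out.println(link("The Committee", seen).orElse("(no match)"));
    }
}
```

A rule this simple breaks as soon as two committees appear in the same document, which is exactly why confirmation from multiple contexts matters.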

Soon, the day will come when computers understand what humans write or say. Hopefully, without needing the triplicates.


When the OpenNLP toolkit uses the MaxEnt parser, it has to read in about 25 MBytes of model files. The model reader uses a basic unbuffered FileInputStream. The result is an excessive number of system calls (and disk accesses) during parser startup.

The fix is extremely simple:

  1. In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace
    • new FileInputStream(f) with
    • new BufferedInputStream(new FileInputStream(f), 1000000)
  2. Recompile the maxent library
  3. Deploy the new version of maxent-2.4.0.jar into OpenNLP’s lib directory

The comparison is striking (the numbers are file-access system calls, before -> after):

  • build.bin.gz - 29830 -> 40
  • chunk.bin.gz - 11853 -> 16
  • tag.bin.gz - 11091 -> 14
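
The effect is easy to reproduce without OpenNLP. The sketch below is only an illustration under stated assumptions: BufferedReadDemo and CountingStream are hypothetical names, an in-memory array stands in for the model file, and byte-by-byte reading stands in for the many small reads a model reader makes. It counts how often the underlying stream is hit with and without a BufferedInputStream in between.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

/** Hypothetical demo; not part of maxent or OpenNLP. */
final class BufferedReadDemo {

    /** Stand-in for a file stream: counts every read request it receives. */
    static final class CountingStream extends InputStream {
        private final InputStream source;
        int calls = 0;

        CountingStream(InputStream source) { this.source = source; }

        @Override public int read() throws IOException {
            calls++;
            return source.read();
        }

        @Override public int read(byte[] buf, int off, int len) throws IOException {
            calls++;
            return source.read(buf, off, len);
        }
    }

    /** Consumes the stream one byte at a time until end of input. */
    private static void drain(InputStream in) {
        try {
            while (in.read() != -1) { /* discard */ }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Number of requests reaching the source when read directly. */
    static int callsUnbuffered(byte[] data) {
        CountingStream raw = new CountingStream(new ByteArrayInputStream(data));
        drain(raw);
        return raw.calls;
    }

    /** Number of requests reaching the source when wrapped in a buffer. */
    static int callsBuffered(byte[] data, int bufferSize) {
        CountingStream raw = new CountingStream(new ByteArrayInputStream(data));
        drain(new BufferedInputStream(raw, bufferSize));
        return raw.calls;
    }

    public static void main(String[] args) {
        byte[] data = new byte[100_000];
        System.out.println("unbuffered: " + callsUnbuffered(data));
        System.out.println("buffered:   " + callsBuffered(data, 1_000_000));
    }
}
```

With a real FileInputStream each of those counted requests would be a system call, so the same wrapper accounts for a drop of the kind listed above.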

I was not able to get the OpenNLP parser to work. There were no samples to play with, no command-line tools to run. And I don’t even want to talk about documentation. That’s because there was not any. There was an attempt at a lame joke (at least that’s the only sense I can make of the what.html file), but no actual documentation.

Finally, I pinged my research colleague who did get the toolkit working (thanks Scott). It turns out there is a whole set of model files missing from the tool’s download. They are linked to from a separate page on the original website (not even in the download).

I am downloading the models now and hopefully will be on my way. But I can certainly see why this particular toolkit is mentioned much less frequently than Stanford’s or Bikel’s.

After the fact, I have also found a mini tutorial by Daniel McLaren explaining the OpenNLP components and showing some sample code and output. It looks better than what’s bundled with OpenNLP itself. Maybe Daniel and Thomas Morton (author of OpenNLP) should talk.


Bikel’s statistical parser is designed to be run from the command line. I need to run it from my own code.

The following wrapper seems to do the trick on Windows (with your own value for |parserdir|):

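import danbikel.lisp.Sexp;
import danbikel.parser.Parser;
import danbikel.parser.Settings;
// Load the settings before constructing the parser from the trained model file
String settingsFile = "|parserdir|\\settings\\collins.properties";
Settings.load(settingsFile);
Parser parser = new Parser("|parserdir|\\bikel\\wsj-02-21.obj.gz");
Sexp result = parser.parse(Sexp.read("(This is a funny world)").list());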

There is a complaint when running the above code:

Settings different during training than now
------------------------------
parser.settingsFile
was |parserdir|\settings\collins.properties
is null

This, however, does not seem to affect anything, and the correct values appear to be picked up.

Also, all the scripts are designed for *nix with a lot of flexibility and variables built in. To get it running on Windows, I hardcoded everything but the input file and this is the result:

set PDIR=|parserdir|
java -Xmx500m -cp "%PDIR%\dbparser.jar;%CLASSPATH%" -Dparser.settingsFile=%PDIR%\settings\collins.properties danbikel.parser.Parser -is %PDIR%\wsj-02-21.obj.gz -sa %1


I have been using the Stanford NLP Parser from the command line with the -tagSeparator flag to supply it with partially tagged input. As the parser seems to be really bad with date expressions and complex named entities, I need this functionality.

Now I need to wrap the parser in my own code to add input/output batching, and I discover that this option is not accepted when constructing the parser from code. Despite the javadoc saying that LexicalizedParser.setOptionFlags() takes the same parameters as the command line, the option sets are actually very different.

In the end, after much poking around, I built a code sequence that seems to produce an identical effect:

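import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.WhitespaceTokenizer;
import edu.stanford.nlp.process.WordToTaggedWordProcessor;
import edu.stanford.nlp.trees.Tree;
LexicalizedParser lp = new LexicalizedParser("..../englishPCFG.ser.gz");
// lp.setOptionFlags(new String[]{"-tagSeparator", "/"}); // not accepted from code
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(new StringReader(text));
List<Word> words = tokenizer.tokenize();
WordToTaggedWordProcessor wttwp = new WordToTaggedWordProcessor('/');
words = wttwp.process(words);
Tree tree = (Tree) lp.apply(words);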

Here, the text variable is a string that is already tokenized with whitespace as the separator, and the ‘/’ character separates each word from its tag.
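
For readers without the parser at hand, the input format can be illustrated in plain Java. This is only a sketch of what partially tagged, whitespace-pretokenized input looks like once split: TaggedInputDemo and Token are hypothetical names, and cutting at the last separator is my assumption, not necessarily what WordToTaggedWordProcessor does internally.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration; these classes are not part of the Stanford parser. */
final class TaggedInputDemo {

    /** A token with an optional part-of-speech tag (null when untagged). */
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }

        @Override public String toString() {
            return tag == null ? word : word + "/" + tag;
        }
    }

    /** Splits on whitespace, then peels a tag off each token at the last separator. */
    static List<Token> parse(String text, char separator) {
        List<Token> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // blank input yields a single empty piece
            }
            int cut = raw.lastIndexOf(separator);
            // No separator, or one at either edge: keep the token as an untagged word.
            if (cut <= 0 || cut == raw.length() - 1) {
                tokens.add(new Token(raw, null));
            } else {
                tokens.add(new Token(raw.substring(0, cut), raw.substring(cut + 1)));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Only the date carries a tag; the parser is left to tag the rest.
        for (Token t : parse("The meeting is on 2006-08-03/CD", '/')) {
            System.out.println(t.word + " -> " + t.tag);
        }
    }
}
```

Tagging only the troublesome tokens, such as dates or long names, is the point of the partially tagged input described above.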

Update (3rd of August):

An email exchange with Christopher Manning and another look through the code showed that the flags in setOptionFlags() are a strict subset of the flags accepted by the main() method. However, 90% of the flags in setOptionFlags() are not documented in that method’s javadoc, so the only ones I cared about were the ones I saw in the main() method.

Further digging turned up some documentation in the classes Options, Test and Train, all within the edu.stanford.nlp.parser.lexparser package. So, some additional documentation does exist, but one has to navigate the maze of code to find it. I guess that’s the normal curse of open source software.