Java


When the OpenNLP toolkit uses the MaxEnt parser, it has to read in about 25 MBytes of model files. The model reader uses a plain, unbuffered FileInputStream. The result is an excessive number of system calls (and disk accesses) during parser startup.

The fix is extremely simple:

  1. In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace
    • new FileInputStream(f) with
    • new BufferedInputStream(new FileInputStream(f), 1000000)
  2. Recompile the maxent library
  3. Deploy the new version of maxent-2.4.0.jar into OpenNLP’s lib directory

The comparison is striking (the numbers are file-access system calls, before -> after):

  • build.bin.gz: 29830 -> 40
  • chunk.bin.gz: 11853 -> 16
  • tag.bin.gz: 11091 -> 14
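
To see why the one-line change matters, here is a minimal, self-contained sketch. It does not touch OpenNLP or maxent: the class BufferingDemo, the CountingInputStream wrapper and the 16-byte chunk size are all made up for illustration, and an in-memory array stands in for the model file. It counts how many read requests reach the underlying stream with and without a BufferedInputStream in between.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

class BufferingDemo {
    /** Counts how many read requests reach the wrapped (underlying) stream. */
    static class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads the stream to the end in small chunks, the way a deserializer might. */
    static long drain(InputStream in, int chunkSize) {
        byte[] chunk = new byte[chunkSize];
        long total = 0;
        try {
            int n;
            while ((n = in.read(chunk, 0, chunk.length)) != -1) {
                total += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return total;
    }

    /** Returns how many reads hit the underlying stream; bufferSize 0 means no buffering. */
    static long countReads(byte[] data, int chunkSize, int bufferSize) {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        InputStream in = bufferSize > 0 ? new BufferedInputStream(source, bufferSize) : source;
        drain(in, chunkSize);
        return source.calls;
    }

    public static void main(String[] args) {
        byte[] data = new byte[1_000_000]; // stand-in for a model file
        System.out.println("unbuffered reads: " + countReads(data, 16, 0));
        System.out.println("buffered reads:   " + countReads(data, 16, 100_000));
    }
}
```

With a real FileInputStream, each of those underlying reads is roughly one system call, which is where the drop in the numbers above comes from. The exact counts depend on how the consumer reads and on the buffer size, so treat the sketch as an illustration of the effect rather than a reproduction of the measurements.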

I was not able to get the OpenNLP parser to work. There were no samples to play with and no command line tools to run. And I don’t even want to talk about documentation, because there was not any. There was an attempt at a lame joke (at least that’s the only sense I can make of the what.html file), but no actual documentation.

Finally, I pinged my research colleague who did get the toolkit working (thanks Scott). It turns out that a whole set of model files is missing from the tool’s download. They are linked from a separate page on the original website (not even in the download).

I am downloading the models now and hopefully will be on my way. But I can certainly see why this particular toolkit is mentioned much less frequently than Stanford’s or Bikel’s.

After the fact, I have also found a mini tutorial by Daniel McLaren explaining OpenNLP components and showing some sample code and output. Looks better than what’s bundled with OpenNLP itself. Maybe Daniel and Thomas Morton (author of OpenNLP) should talk.


Bikel’s statistical parser is designed to be run from the command line. I need to run it from my own code.

The following wrapper seems to do the trick on Windows (with your own value for |parserdir|):

import danbikel.lisp.Sexp;
import danbikel.parser.Parser;
import danbikel.parser.Settings;

String settingsFile = "|parserdir|\\settings\\collins.properties";
Settings.load(settingsFile);
Parser parser = new Parser("|parserdir|\\bikel\\wsj-02-21.obj.gz");
Sexp result = parser.parse(Sexp.read("(This is a funny world)").list());

There is a complaint when running the above code:

Settings different during training than now
------------------------------
parser.settingsFile
was |parserdir|\settings\collins.properties
is null

This, however, does not seem to affect anything, and the correct values appear to be picked up.

Also, all the scripts are designed for *nix with a lot of flexibility and variables built in. To get it running on Windows, I hardcoded everything but the input file and this is the result:

set PDIR=|parserdir|
java -Xmx500m -cp "%PDIR%\dbparser.jar;%CLASSPATH%" -Dparser.settingsFile=%PDIR%\settings\collins.properties danbikel.parser.Parser -is %PDIR%\wsj-02-21.obj.gz -sa %1
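
For anyone who would rather not maintain a batch file, the same command line can be assembled in Java and started with ProcessBuilder. This is only a sketch: the class name BikelCommand and its build() helper are made up for illustration, the flags are copied from the batch file above, and the %CLASSPATH% suffix is left out for brevity. File.separator keeps the paths valid on both Windows and *nix.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class BikelCommand {
    /** Builds the argument list used in the batch file; parserDir plays the role of PDIR. */
    static List<String> build(String parserDir, String inputFile) {
        String sep = File.separator;
        List<String> cmd = new ArrayList<>();
        cmd.add("java");
        cmd.add("-Xmx500m");
        cmd.add("-cp");
        cmd.add(parserDir + sep + "dbparser.jar");
        cmd.add("-Dparser.settingsFile=" + parserDir + sep + "settings" + sep + "collins.properties");
        cmd.add("danbikel.parser.Parser");
        cmd.add("-is");
        cmd.add(parserDir + sep + "wsj-02-21.obj.gz");
        cmd.add("-sa");
        cmd.add(inputFile);
        return cmd;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: BikelCommand <parserdir> <inputfile>");
            return;
        }
        List<String> cmd = build(args[0], args[1]);
        System.out.println(String.join(" ", cmd));
        // Run the parser in a child JVM and share this process's console with it.
        int exit = new ProcessBuilder(cmd).inheritIO().start().waitFor();
        System.out.println("parser exited with code " + exit);
    }
}
```

Passing the arguments as a list avoids the quoting problems that creep into batch files once a path contains spaces.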


I have been using the Stanford NLP Parser from the command line with the -tagSeparator flag to supply it with partially tagged input. As the parser seems to be really bad with date expressions and complex named entities, I need this functionality.

Now I need to wrap the parser in my own code to add input/output batching, and I discovered that this option is not accepted when the parser is constructed from code. Although the javadoc says that LexicalizedParser.setOptionFlags() takes the same parameters as the command line, the option sets are actually very different.

In the end, after much poking around, I built a code sequence that seems to produce an identical effect:

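import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.ling.Word;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.WhitespaceTokenizer;
import edu.stanford.nlp.process.WordToTaggedWordProcessor;
import edu.stanford.nlp.trees.Tree;
LexicalizedParser lp = new LexicalizedParser("..../englishPCFG.ser.gz");
// lp.setOptionFlags(new String[]{"-tagSeparator", "/"});
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(new StringReader(text));
List<Word> words = tokenizer.tokenize();
WordToTaggedWordProcessor wttwp = new WordToTaggedWordProcessor('/');
words = wttwp.process(words);
Tree tree = (Tree) lp.apply(words);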

Here, the text variable is a string that is already tokenized with whitespace as the separator, and the ‘/’ character separates each word from its tag.
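
To make the input format concrete, here is a small library-free sketch of what such partially tagged text looks like once it is split up. It is not the Stanford implementation: the class TaggedInputDemo and its TaggedToken type are made up for illustration, and splitting on the last separator in each token is my assumption, chosen so that words containing the separator themselves (dates, for instance) keep their slashes.

```java
import java.util.ArrayList;
import java.util.List;

class TaggedInputDemo {
    /** A token with an optional part-of-speech tag (null when the input gave none). */
    static final class TaggedToken {
        final String word;
        final String tag;

        TaggedToken(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }

        @Override
        public String toString() {
            return tag == null ? word : word + "/" + tag;
        }
    }

    /** Splits whitespace-separated text into tokens, peeling a trailing tag off each one. */
    static List<TaggedToken> split(String text, char separator) {
        List<TaggedToken> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input produces a single empty string
            }
            int cut = raw.lastIndexOf(separator);
            if (cut > 0 && cut < raw.length() - 1) {
                tokens.add(new TaggedToken(raw.substring(0, cut), raw.substring(cut + 1)));
            } else {
                tokens.add(new TaggedToken(raw, null)); // untagged word
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (TaggedToken t : split("Meet/VB me on 12/03/2007/CD", '/')) {
            System.out.println(t.word + " -> " + t.tag);
        }
    }
}
```

Only the words I care about (here the date) need a tag. The rest stay untagged, leaving the parser free to choose.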

Update (3rd of August):

An email exchange with Christopher Manning and another look through the code showed that the flags in setOptionFlags() are a strict subset of the flags accepted by the main() method. However, 90% of the flags in setOptionFlags() are not documented in that method’s javadoc, so the only ones I cared about were the ones I saw in the main() method.

Yet further digging found some documentation in the classes Options, Test and Train, all within the edu.stanford.nlp.parser.lexparser package. So some additional documentation does exist, but one has to navigate the maze of code to find it. I guess that’s the normal curse of open source software.


I am trying to use the Stanford NLP parser for my research and I need to look at the trees it produces for large, complex sentences. I have found several packages for laying out the output as trees, but they all seem to be targeted at visualizing smaller sentences, suitable for illustrating a point in a published paper.

[Image: sample output of the Graphviz layout for the Stanford Parser’s output]

My trees are large. A sentence of 40 words is an average case rather than an edge one. So all of the display packages I have tried cut off large chunks of the tree. It might be possible to tinker with their LaTeX code to produce output that is not cut off at letter, A4 or even A3 size, but I am not that good with LaTeX yet. And I need to produce these large trees quickly, as I am not even sure whether this parser will suit my needs in the long run.

So, instead, I wrote my own bridging code in Java between the Penn Treebank output of the parser and Graphviz, the graph layout software that I use for many layout tasks. The whole implementation fit in one file of fewer than 100 lines, and that included the logic to highlight maximum spanning subtrees of a particular element (NounPhrase in this example). Click on the small image to see the full example. The Graphviz input file is also available for the curious.

At the moment, it is sufficient to convert to image files. If I ever do convince the parser to understand my 80-word sentences, the resulting trees will probably be large enough to need ZGRViewer.

The Java bridging code is not available yet, as it is very ugly. The secret was in PennTreeReader’s main() method, which showed how to read the parser’s output back in as a Tree suitable for recursive descent. After that, it was just code to walk the tree levels and emit the incredibly simple Graphviz format. I will probably clean the code up a bit over the next couple of weeks and then release it.
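
Until then, here is a rough idea of the approach. This is not the bridging code described above and it does not use the Stanford classes at all. It is a self-contained sketch with made-up names (PennToDot, Node, toDot) that parses a bracketed tree by hand and writes Graphviz DOT. The highlighting is simplified too: every node inside a subtree whose root carries the chosen label is drawn in red.

```java
import java.util.ArrayList;
import java.util.List;

class PennToDot {
    /** One tree node: a phrase or tag label, or a word when it has no children. */
    static final class Node {
        final String label;
        final List<Node> children = new ArrayList<>();

        Node(String label) {
            this.label = label;
        }
    }

    private final String src;
    private int pos = 0;

    private PennToDot(String src) {
        this.src = src;
    }

    /** Parses a bracketed tree such as "(S (NP (DT The) (NN dog)) (VP (VBZ barks)))". */
    static Node parse(String bracketed) {
        return new PennToDot(bracketed).readNode();
    }

    private void skipSpace() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
    }

    private String readToken() {
        int start = pos;
        while (pos < src.length() && "() \t\r\n".indexOf(src.charAt(pos)) < 0) pos++;
        return src.substring(start, pos);
    }

    private Node readNode() {
        skipSpace();
        if (pos < src.length() && src.charAt(pos) == '(') {
            pos++; // consume '('
            skipSpace();
            Node node = new Node(readToken());
            skipSpace();
            while (pos < src.length() && src.charAt(pos) != ')') {
                node.children.add(readNode());
                skipSpace();
            }
            pos++; // consume ')'
            return node;
        }
        return new Node(readToken()); // a leaf word
    }

    /** Renders the tree as DOT; nodes at or below a node labelled highlight are drawn in red. */
    static String toDot(Node root, String highlight) {
        StringBuilder out = new StringBuilder("digraph tree {\n  node [shape=plaintext];\n");
        emit(root, highlight, false, new int[] {0}, out);
        return out.append("}\n").toString();
    }

    private static int emit(Node node, String highlight, boolean inside, int[] counter, StringBuilder out) {
        int id = counter[0]++;
        boolean lit = inside || node.label.equals(highlight);
        out.append("  n").append(id).append(" [label=\"").append(escape(node.label)).append('"');
        if (lit) out.append(", fontcolor=red");
        out.append("];\n");
        for (Node child : node.children) {
            int childId = emit(child, highlight, lit, counter, out);
            out.append("  n").append(id).append(" -> n").append(childId).append(";\n");
        }
        return id;
    }

    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Node tree = parse("(S (NP (DT The) (NN dog)) (VP (VBZ barks)))");
        System.out.print(toDot(tree, "NP"));
    }
}
```

Saving the output as tree.dot and running `dot -Tpng tree.dot -o tree.png` should give an image. Because Graphviz sizes the canvas to fit the graph, wide trees are not clipped to a page size.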

If somebody does like the output and wants to see the code sooner, send me an email at alex@thisdomain.