I have been using the Stanford NLP Parser from the command line with the -tagSeparator flag to supply it with partially tagged input. I need this functionality because the parser seems to handle date expressions and complex named entities really badly.

Now I need to wrap the parser in my own code to add input/output batching, and I have discovered that this option is not accepted when constructing the parser from code. Although the javadoc says that LexicalizedParser.setOptionFlags() takes the same parameters as the command line, the two option sets are actually very different.

In the end, after much poking around, I put together a code sequence that seems to produce an identical effect:

// Load the serialized grammar once.
LexicalizedParser lp = new LexicalizedParser("..../englishPCFG.ser.gz");
// lp.setOptionFlags(new String[]{"-tagSeparator", "/"});  // not accepted here
// Split the pretokenized text on whitespace only.
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(new StringReader(text));
List<Word> words = tokenizer.tokenize();
// Turn each word/tag token into a tagged word, using '/' as the separator.
WordToTaggedWordProcessor wttwp = new WordToTaggedWordProcessor('/');
words = wttwp.process(words);
Tree tree = (Tree) lp.apply(words);

Here, the text variable is a string that is already tokenized, with whitespace as the token separator, and the '/' character is the word/tag separator.
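
To make the input format concrete, here is a minimal plain-Java sketch of the whitespace-then-separator split described above. It does not use the Stanford classes at all: TaggedInputSketch, Token and splitTagged are hypothetical names of my own, and splitting on the last '/' in each token is an assumption I chose so that a date such as 3/8/2009/CD keeps its inner slashes. The real WordToTaggedWordProcessor may behave differently in corner cases.

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedInputSketch {

    /** Hypothetical holder for a token and its optional tag (null when untagged). */
    static final class Token {
        final String word;
        final String tag;

        Token(String word, String tag) {
            this.word = word;
            this.tag = tag;
        }
    }

    /**
     * Splits the text on whitespace, then splits each token on the LAST
     * occurrence of the separator (an assumption of this sketch).
     * Tokens without a usable separator are kept untagged.
     */
    static List<Token> splitTagged(String text, char separator) {
        List<Token> tokens = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields a single empty string
            }
            int at = raw.lastIndexOf(separator);
            if (at > 0 && at < raw.length() - 1) {
                tokens.add(new Token(raw.substring(0, at), raw.substring(at + 1)));
            } else {
                tokens.add(new Token(raw, null));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "The meeting is on 3/8/2009/CD in New_York/NNP";
        for (Token t : splitTagged(text, '/')) {
            System.out.println(t.word + "\t" + (t.tag == null ? "(untagged)" : t.tag));
        }
    }
}
```

Only the tokens that carry a tag are constrained; the rest are left for the parser to tag, which is the whole point of partially tagged input.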

Update (3rd of August):

An email exchange with Christopher Manning and another look through the code showed that the flags accepted by setOptionFlags() are a strict subset of the flags accepted by the main() method. However, 90% of the flags handled by setOptionFlags() are not documented in that method’s javadoc, so the only flags I cared about were the ones I had seen in the main() method.

Further digging turned up some documentation in the classes Options, Test and Train, all within the edu.stanford.nlp.parser.lexparser package. So some additional documentation does exist, but one has to navigate a maze of code to find it. I guess that’s the normal curse of open source software.

2 Responses to “Duplicating -tagSeparator effect when using Stanford Parser programmatically”

  1. Hi
    I’m trying to use the Stanford parser from the command line, but it reloads the parser file every time.
    Is there an option to load the serialized parser file only once?
    Basically, I’m looking for a way to use the parser from the command line the way we use the UI:
    - start the application
    - load the serialized parser file
    - parse sentences
    - close application
    Thank you

  2. Cristian,

    That’s what I am doing above. You initialise the parser once and then just keep calling apply() with the pre-tokenized string. It is much faster that way.

    Try it, and if you are still having trouble, contact me (my email is on the About page). I will send you my (very messy, but working) code sample.
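
As a rough illustration of "initialise once, then keep calling apply()", here is a self-contained sketch of the loop. It is not the messy code mentioned above: ParseLoopSketch and parseAll are hypothetical names, and the Function<String, String> is only a stand-in for a parser object that has already been loaded. In real use it would wrap the tokenize / process / lp.apply() sequence from the post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ParseLoopSketch {

    /**
     * Applies one already-initialised parser to every non-blank sentence.
     * The parser is passed in, so it is never reloaded inside the loop.
     */
    static List<String> parseAll(List<String> sentences, Function<String, String> parser) {
        List<String> results = new ArrayList<>();
        for (String sentence : sentences) {
            if (!sentence.trim().isEmpty()) {
                results.add(parser.apply(sentence));
            }
        }
        return results;
    }

    public static void main(String[] args) {
        // Stand-in for loading the serialized parser: done exactly once.
        // This lambda is a placeholder and does no real parsing.
        Function<String, String> parser = sentence -> "(ROOT " + sentence + ")";

        List<String> batch = Arrays.asList(
                "First/JJ sentence/NN",
                "",
                "Second/JJ sentence/NN");
        for (String tree : parseAll(batch, parser)) {
            System.out.println(tree);
        }
    }
}
```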
