Duplicating -tagSeparator effect when using Stanford Parser programmatically

I have been using the Stanford NLP Parser from the command line with the -tagSeparator flag to supply it with partially tagged input. The parser seems to handle date expressions and complex named entities poorly, so I need this functionality.

Now I need to wrap the parser in my own code to add input/output batching, and I discovered that this option is not accepted when constructing the parser from code. Although the javadoc says that LexicalizedParser.setOptionFlags() takes the same parameters as the command line, the two option sets turned out to be very different.

In the end, after much poking around, I put together a code sequence that seems to produce an identical effect:

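// Load the serialized grammar once.
LexicalizedParser lp = new LexicalizedParser("..../englishPCFG.ser.gz");
// Not accepted here, so the tag separator is handled manually below.
// lp.setOptionFlags(new String[]{"-tagSeparator", "/"});
// The input is pretokenized, so split on whitespace only.
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(new StringReader(text));
List<Word> words = tokenizer.tokenize();
// Turn each word/tag token into a tagged word, using '/' as the separator.
WordToTaggedWordProcessor wttwp = new WordToTaggedWordProcessor('/');
words = wttwp.process(words);
// Parse the partially tagged sentence.
Tree tree = (Tree) lp.apply(words);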

Here, the text variable is a string that is already tokenized with whitespace as the token separator, and the ‘/’ character is the word/tag separator.
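To make the expected input format concrete, here is a minimal sketch in plain Java of how a partially tagged, whitespace-tokenized string breaks down into word/tag pairs. It does not use the Stanford classes and is not the parser's own implementation: the class name TaggedInputSketch, the split() helper, the sample sentence and its tags are all made up for illustration, and splitting at the last ‘/’ in a token is my assumption about how the separator should be treated.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the Stanford Parser API.
public class TaggedInputSketch {

    /**
     * Splits whitespace-pretokenized text into {word, tag} pairs.
     * The tag is null for tokens that carry no separator.
     * Assumption: a token is split at the LAST occurrence of the separator.
     */
    static List<String[]> split(String text, char separator) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields a single empty token
            }
            int idx = token.lastIndexOf(separator);
            if (idx > 0 && idx < token.length() - 1) {
                // Both a word and a tag are present.
                pairs.add(new String[] {token.substring(0, idx), token.substring(idx + 1)});
            } else {
                // Untagged token; the parser would be left to choose the tag.
                pairs.add(new String[] {token, null});
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Only the date tokens are pre-tagged; the tags are illustrative.
        for (String[] p : split("The meeting is on July/NNP 4/CD", '/')) {
            System.out.println(p[0] + " -> " + (p[1] == null ? "(untagged)" : p[1]));
        }
    }
}
```

Tokens without a separator are simply left untagged, which is what makes the input "partially" tagged: only the words the parser tends to get wrong need a tag.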

Update (3rd of August):

An email exchange with Christopher Manning and another look through the code showed that the flags accepted by setOptionFlags() are a strict subset of the flags accepted by the main() method. However, 90% of the flags in setOptionFlags() are not documented in that method’s javadoc, so the only ones I cared about were the ones I saw in the main() method.

Further digging turned up some documentation in the classes Options, Test and Train, all within the edu.stanford.nlp.parser.lexparser package. So some additional documentation does exist, but one has to navigate a maze of code to find it. I guess that’s the usual curse of open source software.

5 thoughts on “Duplicating -tagSeparator effect when using Stanford Parser programmatically”

  1. Hi
    I’m trying to use the Stanford parser from command line but it keeps loading the parser file every time.
    Is there an option to load the serialized parser file only once?
    Basically I’m looking for a way to use the parser from command line like we use the ui:
    – start the application
    – load the serialized parser file
    – parse sentences
    – close application
    Thank you

  2. Cristian,

    That’s what I am doing above. You initialise the parser once and then just keep calling apply() with the pre-tokenized string. It is much faster that way.

    Try it, and if you are still having trouble, contact me (my email is on the About page). I will send you my (very messy, but working) code sample.

  3. Hi,

    I am a beginner with the Stanford parser and I guess there is a lot to discover.

    However, what I need is to pass a sentence or even tokens to the Parser class and get each token with its tag as output, so that I can save the words with their types in a database.

    The ParserDemo I got displays the output in tree format, which I do not need.

    I would appreciate it if I could get the answer here.

    Thank you,

  4. Hello Sameha,

    It has been a long while since I looked at this, but I think the parent of each token in the tree is its type. So, if you just look at all the leaves and check their immediate parents, you will get the types.

    Hope it helped.

  5. Hi, I’m trying to get the Stanford Parser to work with a Java program. I keep hitting a problem when I try to run ParserDemo.java, which should just be a short demonstration. I get an error saying englishPCFG.ser.gz was not found, but I have made sure that the file is there. Is there any way to overcome this?

    Thank you in advance

    Mitesh Patel
