Reducing disk thrashing of OpenNLP/MaxEnt parser – with a one-line code change

When the OpenNLP toolkit uses the MaxEnt parser, it has to read in about 25 MBytes of model files. The model reader opens them with a plain, unbuffered FileInputStream. The result is an excessive number of system calls (and disk accesses) during parser startup.

The fix is extremely simple:

  1. In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace
    • new FileInputStream(f) with
    • new BufferedInputStream(new FileInputStream(f), 1000000)
  2. Recompile the maxent library
  3. Deploy the new version of maxent-2.4.0.jar into OpenNLP’s lib directory
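
For readers who want to see the replacement expression in isolation, here is a minimal sketch. The class and the helper `openBuffered` are hypothetical names made up for illustration and are not part of maxent; only the `new BufferedInputStream(new FileInputStream(f), 1000000)` expression comes from step 1 above.

```java
import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

class BufferedModelOpen {
    // Buffer size used in the post: 1,000,000 bytes.
    static final int BUFFER_SIZE = 1000000;

    // Hypothetical helper (not a maxent API): opens a file with the
    // buffered wrapper that replaces the bare FileInputStream.
    static InputStream openBuffered(File f) throws IOException {
        return new BufferedInputStream(new FileInputStream(f), BUFFER_SIZE);
    }

    public static void main(String[] args) throws IOException {
        File f = new File(args[0]);
        long total = 0;
        // Small reads are now served from the in-memory buffer; the file
        // itself is only touched when the buffer needs refilling.
        try (InputStream in = openBuffered(f)) {
            while (in.read() != -1) {
                total++;
            }
        }
        System.out.println(total + " bytes read");
    }
}
```

Whatever stream the reader layers on top keeps working unchanged, because BufferedInputStream is itself an InputStream.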

The comparison is striking (the numbers are file-access system calls, before -> after):

  • build.bin.gz - 29830 -> 40
  • chunk.bin.gz - 11853 -> 16
  • tag.bin.gz - 11091 -> 14
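
The effect can be illustrated without touching the disk. The sketch below is an assumption-laden toy, not a reproduction of the measurements above: it uses an in-memory source, a hypothetical `CountingInputStream` wrapper, and a 512-byte chunk size chosen only to stand in for "many small reads". It counts how many read requests actually reach the underlying stream with and without the buffer.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

class ReadCallCounter {
    // Hypothetical wrapper: counts read requests that reach the wrapped
    // stream, standing in for the system calls a real file would incur.
    static class CountingInputStream extends FilterInputStream {
        long calls = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Reads `size` zero bytes in chunks of `chunk` bytes and returns how
    // many read requests hit the underlying (counted) stream.
    static long countCalls(int size, int chunk, boolean buffered) throws IOException {
        CountingInputStream source =
                new CountingInputStream(new ByteArrayInputStream(new byte[size]));
        InputStream in = buffered ? new BufferedInputStream(source, 1000000) : source;
        byte[] buf = new byte[chunk];
        while (in.read(buf) != -1) {
            // discard the data; only the call count matters here
        }
        return source.calls;
    }

    public static void main(String[] args) throws IOException {
        int size = 1 << 20; // 1 MiB of dummy data
        System.out.println("unbuffered: " + countCalls(size, 512, false));
        System.out.println("buffered:   " + countCalls(size, 512, true));
    }
}
```

Without the buffer, every small read is passed straight to the source; with it, the source is only asked for data when the 1,000,000-byte buffer runs dry, which is the same pattern as the drop in the numbers above.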

3 comments to Reducing disk thrashing of OpenNLP/MaxEnt parser – with a one-line code change

  • Matthew

    All hail plentiful RAM!

  • Indeed.

    Of course, to parse anything serious, any of the parsers needs 60-100 MBytes of memory. The extra 1 MByte for the disk buffer is a drop in the bucket.

  • Mosa

    I would like to ask your advice, as you are an expert in NLP. I ran into a problem when trying to identify classes. I used the OpenNLP tools to help identify classes and relationships, and I found that the parser is very weak at identifying verbs. For example, in "Library issues Loan item to customer", "issues" is a verb, but the parser tags it as a noun. When a word could be either a noun or a verb, the parser chooses the noun, so in sentences that contain both nouns and verbs it identifies only the nouns and misses the verbs.

    Can you tell me what I should do to avoid this problem, please?
