arafalov on August 15th, 2007
When the OpenNLP toolkit uses the MaxEnt parser, it has to read in about 25 MBytes of model files. The model reader uses a plain, unbuffered FileInputStream. The result is an excessive number of system calls (and disk access calls) during parser startup.
The fix is extremely simple:
- In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace new FileInputStream(f) with new BufferedInputStream(new FileInputStream(f), 1000000)
- Recompile the maxent library
- Deploy the new version of maxent-2.4.0.jar into OpenNLP's lib directory
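To see why the wrapper helps, here is a minimal sketch that counts how many read requests reach the underlying stream with and without a BufferedInputStream in between. It is not the maxent code: the class BufferedReadDemo, its helpers, and the in-memory byte array standing in for the model file are all hypothetical, and it assumes the .gz models are decompressed through java.util.zip.GZIPInputStream, which the post does not show. GZIPInputStream's default internal buffer is only 512 bytes, so without extra buffering every refill becomes a separate small read on the file.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Random;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class BufferedReadDemo {

    /** Counts read requests that reach the wrapped stream (a stand-in for FileInputStream). */
    static class CountingInputStream extends FilterInputStream {
        long reads = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            reads++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            reads++;
            return super.read(b, off, len);
        }
    }

    /** Builds a gzip-compressed blob of pseudo-random bytes (hypothetical stand-in for a model file). */
    public static byte[] gzipOf(int size) throws IOException {
        byte[] raw = new byte[size];
        new Random(42).nextBytes(raw);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(raw);
        }
        return bos.toByteArray();
    }

    /**
     * Decompresses the whole blob and returns how many reads hit the underlying stream.
     * A bufferSize of 0 means "no BufferedInputStream", as in the original reader.
     */
    public static long countReads(byte[] gzipped, int bufferSize) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(gzipped));
        InputStream raw = bufferSize > 0 ? new BufferedInputStream(source, bufferSize) : source;
        try (InputStream in = new GZIPInputStream(raw)) {
            byte[] chunk = new byte[8192];
            while (in.read(chunk) != -1) {
                // discard the decompressed data; only the read count matters here
            }
        }
        return source.reads;
    }

    public static void main(String[] args) throws IOException {
        byte[] model = gzipOf(2_000_000);
        System.out.println("unbuffered reads: " + countReads(model, 0));
        System.out.println("buffered reads:   " + countReads(model, 1_000_000));
    }
}
```

With a real file in place of the byte array, each counted read would correspond roughly to one read system call, which is the kind of reduction the fix is after.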
The comparison is striking (the numbers are file access system calls, before -> after the fix):
- build.bin.gz - 29830 -> 40
- chunk.bin.gz - 11853 -> 16
- tag.bin.gz - 11091 -> 14
August 15th, 2007 at 11:46 am
All hail plentiful RAM!
August 15th, 2007 at 1:31 pm
Indeed.
Of course, to parse anything serious, any of the parsers needs 60-100 MBytes of memory. The extra 1 MByte for the disk buffer is a drop in the bucket.