<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.3.1" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Outer Thoughts &#187; Computational Linguistics</title>
	<link>http://blog.outerthoughts.com</link>
	<description>&#62; From inner thoughts to the outer limits of Alexandre Rafalovitch</description>
	<pubDate>Sun, 20 Apr 2008 00:37:42 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.1</generator>
	<language>en</language>
			<item>
		<title>Bulk converting doc files into txt (or html)</title>
		<link>http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/</link>
		<comments>http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/#comments</comments>
		<pubDate>Sun, 20 Apr 2008 00:37:42 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/</guid>
		<description><![CDATA[I have written about converting Microsoft Word files into text or html using OpenOffice before. However, the wizards I described in that article were crashing when the number of files crossed into several hundreds.
I have written some macros to do the conversion, but they were scary looking and fragile. Fortunately, I now found a tool [...]]]></description>
			<content:encoded><![CDATA[<p>I have written about converting Microsoft Word files into text or html using OpenOffice <a href="http://blog.outerthoughts.com/2006/10/obscure-bulk-format-converters-of-openofficeorg/" title="Previous article about converting files">before</a>. However, the wizards I described in that article were crashing when the number of files crossed into several hundreds.</p>
<p>I had written some macros to do the conversion, but they were scary-looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. <a href="http://www.ooomacros.org/user.php#95532" title="Location of the DocConverter macro">DocConverter</a> by Danny Brewer and Dan Horwood converts a whole directory of files at a time between any formats that OpenOffice understands.</p>
<p>I have just converted more than a thousand documents from doc to txt without any problems. Actually, I had one small problem, but it was my fault. Some corrupted files that OpenOffice would not open were breaking DocConverter and throwing an ugly-looking Basic runtime error. I had to delete the problem files, kill OpenOffice (stopping the macro was not enough) and rerun the tool. Otherwise, it just ran.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/feed/</wfw:commentRss>
		</item>
		<item>
		<title>On uselessness of pretending to be somebody else</title>
		<link>http://blog.outerthoughts.com/2008/01/on-uselessness-of-pretending-to-be-somebody-else/</link>
		<comments>http://blog.outerthoughts.com/2008/01/on-uselessness-of-pretending-to-be-somebody-else/#comments</comments>
		<pubDate>Fri, 25 Jan 2008 00:28:36 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Weird Stuff]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2008/01/on-uselessness-of-pretending-to-be-somebody-else/</guid>
		<description><![CDATA[While reading the Weka Data Mining book, I came across this impressive example of using machine learning to confirm a person&#8217;s authorship (p. 358).
In the 19th century, there lived a famous rabbinic scholar, Ben Ish Chai, who among other writings had two collections of letters. Ben Ish Chai claimed that only one collection was his and that [...]]]></description>
			<content:encoded><![CDATA[<p>While reading the Weka <a href="http://www.worldcat.org/oclc/58451668" title="WorldCat link for the book">Data Mining book</a>, I came across this impressive example of using machine learning to confirm a person&#8217;s authorship (p. 358).</p>
<p>In the 19th century, there lived a famous rabbinic scholar, Ben Ish Chai, who among other writings had two collections of letters. Ben Ish Chai claimed that only one collection was his and that the other one was somebody else&#8217;s, found by him. Modern scholars thought both collections were his, but could not prove it conclusively as the style of writing was different.</p>
<p>Machine Learning to the rescue! In 2004, <span class="m">Moshe Koppel and Jonathan Schler</span> discovered that it may help to look not at the differences in writing style (as the style may have been faked), but rather at how deep those differences were. For example, an author could fake a stylistic mismatch by consciously avoiding favorite words, but would still write in long run-on sentences, use more passive verb forms or display many other measurable behaviours.</p>
<p>So, if the most obvious differences were removed one by one, the speed at which the remaining features came to look identical could be a good indicator. They called this technique <a href="http://citeseer.ist.psu.edu/648176.html" title="Paper about unmasking technique">unmasking</a>, and the mystery of Ben Ish Chai was solved for good.</p>
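<p>To make the idea concrete, here is a toy sketch in Java. It is not Koppel and Schler&#8217;s implementation: it assumes each text has already been cut into chunks and turned into numeric feature vectors, a simple nearest-centroid rule stands in for a properly trained and cross-validated classifier, and all names (<em>UnmaskingSketch</em>, <em>degradationCurve</em>) and the data are made up for illustration. Each round it records how well the two texts can be told apart and then drops the most distinguishing features that remain.</p>

```java
import java.util.Arrays;

public class UnmaskingSketch {

    // Mean of feature f over all chunks of one text.
    static double mean(double[][] chunks, int f) {
        double sum = 0;
        for (double[] c : chunks) sum += c[f];
        return sum / chunks.length;
    }

    // Squared distance from a chunk to a centroid, using only active features.
    static double dist(double[] chunk, double[] centroid, boolean[] active) {
        double d = 0;
        for (int f = 0; f < chunk.length; f++) {
            if (active[f]) d += (chunk[f] - centroid[f]) * (chunk[f] - centroid[f]);
        }
        return d;
    }

    // Fraction of chunks assigned to their own text by a nearest-centroid rule.
    // Ties go to text B. Assumption: measured on the same chunks, no held-out data.
    static double accuracy(double[][] a, double[][] b, boolean[] active) {
        int n = active.length;
        double[] ca = new double[n];
        double[] cb = new double[n];
        for (int f = 0; f < n; f++) {
            ca[f] = mean(a, f);
            cb[f] = mean(b, f);
        }
        int correct = 0;
        for (double[] c : a) if (dist(c, ca, active) < dist(c, cb, active)) correct++;
        for (double[] c : b) if (dist(c, cb, active) <= dist(c, ca, active)) correct++;
        return (double) correct / (a.length + b.length);
    }

    // Records accuracy each round, then removes the dropPerRound features
    // whose means differ most between the two texts.
    static double[] degradationCurve(double[][] a, double[][] b, int rounds, int dropPerRound) {
        int n = a[0].length;
        boolean[] active = new boolean[n];
        Arrays.fill(active, true);
        double[] curve = new double[rounds];
        for (int r = 0; r < rounds; r++) {
            curve[r] = accuracy(a, b, active);
            for (int k = 0; k < dropPerRound; k++) {
                int best = -1;
                double bestGap = -1;
                for (int f = 0; f < n; f++) {
                    double gap = Math.abs(mean(a, f) - mean(b, f));
                    if (active[f] && gap > bestGap) {
                        best = f;
                        bestGap = gap;
                    }
                }
                if (best >= 0) active[best] = false;
            }
        }
        return curve;
    }

    public static void main(String[] args) {
        // Made-up data: feature 0 is a "favorite word" that text B avoids;
        // the remaining features behave the same way in both texts.
        double[][] a = {{5, 1, 2}, {6, 2, 1}, {5, 2, 1}, {6, 1, 2}};
        double[][] b = {{0, 1, 2}, {0, 2, 1}, {0, 2, 1}, {0, 1, 2}};
        System.out.println(Arrays.toString(degradationCurve(a, b, 2, 1))); // prints [1.0, 0.5]
    }
}
```

<p>In this made-up data the two texts differ in a single feature, so accuracy collapses to chance as soon as that feature is removed. That fast drop is the kind of shallow, surface-only difference described above; with genuinely different authors the curve would be expected to fall more slowly.</p>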
<p>I think what impressed me here was not the clever math. The whole field of determining authorship is based on clever math. It is rather the fact that the math was looking at hints <u>within</u> the hints of the language - the invisible aspects that become noticeable only after the eye learns to see beyond what the most obvious reality offers. I cannot explain it better, but to me it has a special elegance that just counting the words and sentence lengths does not offer.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/01/on-uselessness-of-pretending-to-be-somebody-else/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Parsing jumping jacks</title>
		<link>http://blog.outerthoughts.com/2007/12/parsing-jumping-jacks/</link>
		<comments>http://blog.outerthoughts.com/2007/12/parsing-jumping-jacks/#comments</comments>
		<pubDate>Sat, 01 Dec 2007 23:14:25 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[RSCDS]]></category>

		<category><![CDATA[Weird Stuff]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/12/parsing-jumping-jacks/</guid>
		<description><![CDATA[What could be common between Computational Linguistics and Aerobics? Quite a lot, as it turns out.
Dance descriptions, while not really in English, do have a regular structure and can be thought of as a sub-language with a full set of syntactic, semantic and pragmatic levels.
There are basic words of the language (move names), correct [...]]]></description>
			<content:encoded><![CDATA[<p>What could be common between Computational Linguistics and Aerobics? Quite a lot, as it turns out.</p>
<p>Dance descriptions, while not really in English, do have a regular structure and can be thought of as a sub-language with a full set of syntactic, semantic and pragmatic levels.</p>
<p>There are basic words of the language (move names), correct ways of putting them in a sentence (a routine) and all the way up to good flowing text (classes that do not hurt the participants).</p>
<p>I have been thinking about the relationship between dance instructions and computational linguistics in the context of Scottish Country Dancing for at least a year. My imagined benefits were that codified dance instructions would allow for automatic dance animations, superior teacher aids and other applications that currently require a lot of sweat and toil. Dance evening programmes, which are currently put together manually for each event, could be assisted by automated evaluation pointing out awkward sequences of dances.</p>
<p>Unfortunately, my attempts at explaining the connection made no sense to the people around me. So, I was ecstatic to discover that such a link had already been found by others before me.</p>
<p>More than 10 years ago, Adam Bull tried to apply principles of computational linguistics to Aerobics for his MPhil degree, in a report entitled <a href="http://www.comp.leeds.ac.uk/cgi-bin/sis/ext/rs_pub.cgi?cmd=displayabstract&amp;sid=898625237" title="Web page for the report">The formal description of aerobic dance exercise - a corpus-based computational linguistics approach</a>. While the report is not complete, it lays out many of the same arguments I have tried to make myself.</p>
<p>Unfortunately, an electronic copy of the document was not available. After some effort, I got in touch with Adam and he sent me a copy of the report with permission to distribute it. I have put <a href="http://www.outerthoughts.com/files/adam_bull_thesis_aerobics_compling.pdf" title="Copy of Adam's report">a copy of it on my own server</a>.</p>
<p>I hope his research will get rediscovered and improved upon. That way when I get some time to apply my own PhD skills to Scottish Country Dancing, there will be more than one person on whose shoulders I would be able to stand.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/12/parsing-jumping-jacks/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Upgrading to GATE 4? Beware of leftover configuration files.</title>
		<link>http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/</link>
		<comments>http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/#comments</comments>
		<pubDate>Sun, 07 Oct 2007 03:03:40 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/</guid>
		<description><![CDATA[From time to time I experiment with the GATE NLP toolkit. Just now I tried to upgrade to the latest version (version 4) and ran into a really strange problem with the ANNIE system not loading correctly. Later, when I uninstalled the older GATE version, it stopped loading altogether.
The problem is the user configuration file gate.xml that is [...]]]></description>
			<content:encoded><![CDATA[<p>From time to time I experiment with the <a href="http://gate.ac.uk/" title="Home of the GATE - NLP toolkit">GATE NLP toolkit</a>. Just now I tried to upgrade to the latest version (version 4) and ran into a really strange problem with the ANNIE system not loading correctly. Later, when I uninstalled the older GATE version, it stopped loading altogether.</p>
<p>The problem is the user configuration file <em>gate.xml</em>, which is stored in a shared location, usually the home directory. On Windows, that is <em>C:\Documents and Settings\[ProfileName]\</em>.</p>
<p>One of those settings pointed to where the plugins were loaded from and was still referring to GATE 3.1&#8217;s locations. That caused NullPointerExceptions in GATE and everything was breaking from that point on.</p>
<p>I found this by using FileMon, but later realised that it could have been done more easily by changing the <em>runtime.spawn</em> property to <em>false</em> in GATE&#8217;s <em>build.xml</em> file, which is used to start the program. Using <em>ant</em> to start a program is a new one for me, but I guess it makes sense in some cases. Setting the property to false shows the startup messages and the exception that the wrong directories cause.</p>
<p>I have deleted the old <em>gate.xml</em> and <em>gate.session</em> files in my home directory and everything started to work. Back to actually trying to use the software.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Story of Human Language - great introductory audio course on linguistics</title>
		<link>http://blog.outerthoughts.com/2007/09/story-of-human-language-great-introductory-audio-course-on-linguistics/</link>
		<comments>http://blog.outerthoughts.com/2007/09/story-of-human-language-great-introductory-audio-course-on-linguistics/#comments</comments>
		<pubDate>Sat, 29 Sep 2007 16:01:45 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[General Education]]></category>

		<category><![CDATA[Language Acquisition]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/09/story-of-human-language-great-introductory-audio-course-on-linguistics/</guid>
		<description><![CDATA[As part of doing a PhD in Computational Linguistics, I need to understand both computers and linguistics. I am fine with computers, but linguistics is not my strong point.  Unfortunately, many of the linguistics books and resources are quite dry.
So, I was really happy to discover an audio course      [...]]]></description>
			<content:encoded><![CDATA[<p>As part of doing a PhD in Computational Linguistics, I need to understand both computers and linguistics. I am fine with computers, but linguistics is not my strong point.  Unfortunately, many of the linguistics books and resources are quite dry.</p>
<p>So, I was really happy to discover an audio course <span class="courseTitle" style="padding-top: 15px; padding-bottom: 1px"><a href="http://www.teach12.com/ttcx/coursedesclong2.aspx?cid=1600&amp;pc=Professor304" title="Official web page for the audio course">Story of Human Language</a></span> from The Teaching Company taught by John McWhorter. It is quite long and covers a lot of material, but - apart from some overly long parts on universal language - it is really interesting and Professor McWhorter is a great presenter.</p>
<p>I actually had a chance to listen to an audio version of the course and to see some of it on DVD. Personally, I prefer just audio for several reasons.</p>
<p>Firstly, I can listen to the course on my MP3 player when I am walking or doing chores. The video version requires setting aside dedicated time, which for such a long course would be difficult.</p>
<p>Secondly, I actually found the visual part of the presentation quite boring - for the most part the professor just stands behind the lectern and talks from his notes. In fact, I found that the visual part distracted me from the really great and expressive rhetoric.</p>
<p>There were a number of great sections in the course, but I found the one explaining the language structure of Arabic and Chinese particularly interesting. He talked about Arabic first and I was all keen to learn that language. Then he switched over to Chinese and I found it even more fascinating. And then there were comparisons of languages and his cat. This has to be heard to be believed.</p>
<p>The course is obviously <a href="http://www.teach12.com/ttcx/coursedesclong2.aspx?cid=1600&amp;pc=Professor304" title="Original (commercial) source for the course">available for purchase</a>, but it is also <a href="http://www.worldcat.org/oclc/58542774" title="WorldCat entry for the course">found in quite a few libraries</a>. If you do borrow it from the library, try requesting all volumes at once. I only requested one volume and it was quite annoying to then have to wait a long time for the rest of the course to arrive. This is another way I knew that the course was enjoyable, as I had plenty of other audio material to listen to otherwise.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/09/story-of-human-language-great-introductory-audio-course-on-linguistics/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Unravelling the black magic of bureaucracy</title>
		<link>http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/</link>
		<comments>http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/#comments</comments>
		<pubDate>Sat, 01 Sep 2007 22:07:06 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/</guid>
		<description><![CDATA[Arthur C. Clarke once famously wrote &#8220;Any sufficiently advanced technology is indistinguishable from magic&#8221;. In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes skills out of this world to follow the logic of modern tax return instructions.
Bureaucracy often has its place [...]]]></description>
			<content:encoded><![CDATA[<p>Arthur C. Clarke once famously wrote &#8220;Any sufficiently advanced technology is indistinguishable from magic&#8221;. In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes skills out of this world to follow the logic of modern tax return instructions.</p>
<p>Bureaucracy often has its place and reason. Laws protect exploitable minorities; procedures serve to avoid known problems; cross-referencing forms are filled in triplicate to allow for audit and protection against falsification. The problem is not the bureaucracy as such but rather the fact that it eventually outgrows any individual person&#8217;s ability to comprehend it. At that point, only dedicated specialists can understand the process and the rest of us have to offer sacrifices to those acolytes in hopes of beneficial results.</p>
<p>Enter computers. It turns out that computers can bring the complexity of information down, back within the reach of the non-specialist. The more bureaucratic a process, the better a computer can figure it out. What is a mind-numbing in-triplicate form to a human is a structured source of information with cross-checking redundancy to the computer.</p>
<p>This area of research is called &#8220;<a href="http://www.aaai.org/AITopics/html/natlang.html" title="Introduction to NLP">Natural Language Processing</a>&#8221; - NLP. It is not an obscure field - any Google user has benefited from this type of research. Other applications of NLP include speech recognition and machine translation.</p>
<p>NLP is not a new branch of science. Back in the 1950s, software was being developed in the USA to translate from German into English. The translation quality of grammar-based systems was very poor. Nevertheless, even the possibility of machine translation was so impressive that about US$20 million were spent on the research before the enchantment fizzled out and fund allocations virtually stopped. NLP did not die at that point, but it certainly slowed down.</p>
<p>Statistical approaches to NLP have been around nearly as long as grammar-based ones. However, as they require large quantities of data, these did not become feasible until the mid-1990s. Once they did reach popularity, however, the research advanced rapidly, taking advantage of ever increasing computer speed and available storage. Statistical approaches do not rely on language comprehension. Instead, with sufficient amounts of text, common patterns can be established without understanding the rules of their formation.</p>
<p>A good example is Google&#8217;s new translation engine from Arabic to English. The engine won the <a href="http://www.nist.gov/speech/tests/mt/doc/mt05eval_official_results_release_20050801_v3.html" title="Results of NIST 2005 competition">NIST 2005 machine translation competition</a>, even though its software developers did not know Arabic. Instead, <a href="http://blogoscoped.com/archive/2005-05-22-n83.html" title="Story on Google's translation engine">they used existing parallel documents of United Nations</a> translated by professionals - some 200 billion words of content in total. It is perhaps symbolic that, even in such a deeply technical area, the Universal Declaration of Human Rights helps to ensure humans all over the world will be able to communicate with each other.</p>
<p>On its own, however, a statistical approach is not a panacea either. Since there is no real understanding involved, a statistical NLP system has no way to recover from invalid conclusions.</p>
<p>There is more to the puzzle. Most of the real world texts are about somebody or something. The entity could be a person, a company, or a committee. Sometimes, the name of that entity is very long. Documents of the United Nations are known for names that even a human would struggle with. &#8220;<a href="http://www.un.org/law/UNsafetyconvention/index.html" title="Webpage of the Committee">The Ad Hoc Committee on the Scope of Legal Protection under the Convention on the Safety of United Nations and Associated Personnel</a>&#8221; would be one of those. Other large organisations have similar problems.</p>
<p>Currently, neither of the above approaches is sufficient on its own. Grammar-based systems break on complex names; statistical ones mark &#8216;The Committee&#8217; as a completely separate entity, rather than a reference to the full name.</p>
<p>The ideal system that we’re working on would be able to identify the complex names using a combination of techniques. It would also be capable of using multiple appearances in different contexts to confirm the identification, including linking different forms of the same name. Once these goals are achieved, documents in legal and medical domains can get the full benefits from other, already available, research.</p>
<p>Soon, the day will come when computers understand what humans write or say. Hopefully, without needing the triplicates.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Reducing disk thrashing of OpenNLP/MaxEnt parser - with one line code change</title>
		<link>http://blog.outerthoughts.com/2007/08/reducing-disk-thrashing-of-opennlpmaxent-parser-with-one-line-code-change/</link>
		<comments>http://blog.outerthoughts.com/2007/08/reducing-disk-thrashing-of-opennlpmaxent-parser-with-one-line-code-change/#comments</comments>
		<pubDate>Wed, 15 Aug 2007 12:56:12 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/08/reducing-disk-thrashing-of-opennlpmaxent-parser-with-one-line-code-change/</guid>
		<description><![CDATA[When the OpenNLP toolkit uses the MaxEnt parser, it has to read in about 25 MBytes of model files. The model reader uses a basic unbuffered FileInputStream. The result is an excessive number of system calls (and disk access calls) during the parser startup.
The fix is extremely simple:

In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace

new FileInputStream(f) with
new BufferedInputStream(new FileInputStream(f), 1000000)


Recompile maxent library
Deploy new [...]]]></description>
			<content:encoded><![CDATA[<p>When the OpenNLP toolkit uses the MaxEnt parser, it has to read in about 25 MBytes of model files. The model reader uses a basic unbuffered FileInputStream. The result is an excessive number of system calls (and disk access calls) during the parser startup.</p>
<p>The fix is extremely simple:</p>
<ol>
<li>In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace
<ul>
<li><em>new FileInputStream(f)</em> with</li>
<li><em>new BufferedInputStream(new FileInputStream(f), 1000000)</em></li>
</ul>
</li>
<li>Recompile the maxent library</li>
<li>Deploy the new version of <em>maxent-2.4.0.jar</em> into OpenNLP&#8217;s lib directory</li>
</ol>
<p>The comparison is striking (the numbers are file access system calls, before -&gt; after):</p>
<ul>
<li><em>build.bin.gz</em> - <strong>29830</strong> -&gt; <strong>40</strong></li>
<li><em>chunk.bin.gz</em> - <strong>11853</strong> -&gt; <strong>16</strong></li>
<li><em>tag.bin.gz</em> - <strong>11091</strong> -&gt; <strong>14</strong></li>
</ul>
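<p>The effect is easy to reproduce in isolation. The sketch below is not MaxEnt code: it uses an in-memory byte array as a stand-in for a model file and a hypothetical <em>CountingInputStream</em> wrapper to count how many read requests reach the underlying source, assuming the reader consumes the stream one byte at a time. Real system-call counts will depend on how the model reader actually pulls its data.</p>

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BufferingSketch {

    // Hypothetical helper: counts how many read requests reach the wrapped stream.
    static class CountingInputStream extends FilterInputStream {
        long reads = 0;

        CountingInputStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            reads++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            reads++;
            return super.read(b, off, len);
        }
    }

    // Consumes the data one byte at a time and returns the number of read
    // requests that reached the underlying source. A bufferSize of 0 means
    // no buffering at all.
    static long underlyingReads(byte[] data, int bufferSize) throws IOException {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        InputStream in = bufferSize > 0 ? new BufferedInputStream(source, bufferSize) : source;
        while (in.read() != -1) {
            // discard the byte; only the call count matters here
        }
        in.close();
        return source.reads;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000]; // stand-in for a model file
        System.out.println("unbuffered: " + underlyingReads(data, 0));
        System.out.println("buffered:   " + underlyingReads(data, 1_000_000));
    }
}
```

<p>With the 1,000,000-byte buffer, the byte-at-a-time loop is served from memory, so only a handful of requests reach the source instead of one per byte.</p>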
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/08/reducing-disk-thrashing-of-opennlpmaxent-parser-with-one-line-code-change/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Getting OpenNLP parser to work</title>
		<link>http://blog.outerthoughts.com/2007/08/getting-opennlp-parser-to-work/</link>
		<comments>http://blog.outerthoughts.com/2007/08/getting-opennlp-parser-to-work/#comments</comments>
		<pubDate>Sun, 12 Aug 2007 01:48:50 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/08/getting-opennlp-parser-to-work/</guid>
		<description><![CDATA[I was not able to get the OpenNLP parser to work. There were no samples to play with, no command line tools to run. And I don&#8217;t even want to talk about documentation. That&#8217;s because there was not any. There was an attempt at a lame joke (at least that&#8217;s the only sense I can make of [...]]]></description>
			<content:encoded><![CDATA[<p>I was not able to get the <a href="http://opennlp.sourceforge.net/" title="Link to the OpenNLP project page">OpenNLP parser</a> to work. There were no samples to play with, no command line tools to run. And I don&#8217;t even want to talk about documentation. That&#8217;s because there was not any. There was an attempt at a lame joke (at least that&#8217;s the only sense I can make of the <em>what.html</em> file), but no actual documentation.</p>
<p>Finally, I pinged my research colleague who did get the toolkit working (thanks Scott). It turns out there is a whole set of model files missing from the tool&#8217;s download. They are linked from <a href="http://opennlp.sourceforge.net/models.html" title="Link to the page for the models">a separate page on the original website</a> (not even in the download).</p>
<p>I am downloading the models now and hopefully will be on my way. But I can certainly see why this particular toolkit is mentioned much less frequently than Stanford&#8217;s or Bikel&#8217;s.</p>
<p>After the fact, I have also found <a href="http://danielmclaren.net/2007/05/11/getting-started-with-opennlp-natural-language-processing/" title="The tutorial blog entry">a mini tutorial</a> by Daniel McLaren explaining OpenNLP components and showing some sample code and output. Looks better than what&#8217;s bundled with OpenNLP itself. Maybe Daniel and Thomas Morton (author of OpenNLP) should talk.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/08/getting-opennlp-parser-to-work/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Running Bikel&#8217;s parser programmatically</title>
		<link>http://blog.outerthoughts.com/2007/08/running-bikels-parser-programmatically/</link>
		<comments>http://blog.outerthoughts.com/2007/08/running-bikels-parser-programmatically/#comments</comments>
		<pubDate>Mon, 06 Aug 2007 02:44:34 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/08/running-bikels-parser-programmatically/</guid>
		<description><![CDATA[Bikel&#8217;s statistical parser is designed to be run from the command line. I need to run it from my own code.
The following wrapper seems to do the trick on Windows (with your own value for &#124;parserdir&#124;):

String settingsFile = "&#124;parserdir&#124;\\settings\\collins.properties";
Settings.load(settingsFile);
Parser parser = new Parser("&#124;parserdir&#124;\\bikel\\wsj-02-21.obj.gz");
Sexp result = parser.parse(Sexp.read("(This is a funny world)").list());

There is a complaint when running [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cis.upenn.edu/~dbikel/software.html#stat-parser" title="Homepage of the parser">Bikel&#8217;s statistical parser</a> is designed to be run from the command line. I need to run it from my own code.</p>
<p>The following wrapper seems to do the trick on Windows (with your own value for |parserdir|):<br />
<code><br />
String settingsFile = "|parserdir|\\settings\\collins.properties";<br />
Settings.load(settingsFile);<br />
Parser parser = new Parser("|parserdir|\\bikel\\wsj-02-21.obj.gz");<br />
Sexp result = parser.parse(Sexp.read("(This is a funny world)").list());<br />
</code><br />
There is a complaint when running the above code:<br />
<code><br />
Settings different during training than now<br />
------------------------------<br />
parser.settingsFile<br />
was |parserdir|\settings\collins.properties<br />
is null<br />
</code><br />
This, however, does not appear to affect anything, and the correct values seem to be picked up.</p>
<p>Also, all the bundled scripts are designed for *nix, with a lot of flexibility and variables built in. To get the parser running on Windows, I hardcoded everything except the input file, and this is the result:<br />
<code><br />
set PDIR=|parserdir|<br />
java -Xmx500m -cp "%PDIR%\dbparser.jar;%CLASSPATH%" -Dparser.settingsFile=%PDIR%\settings\collins.properties danbikel.parser.Parser -is %PDIR%\wsj-02-21.obj.gz -sa %1<br />
</code></p>
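<p>The same command line can also be assembled from Java, for example to launch the parser as a separate process. The following is only a minimal sketch under stated assumptions: the class and method names (<em>BikelCommandSketch</em>, <em>buildCommand</em>) and the <em>input.txt</em> file name are made up for illustration, the Windows path separators are hardcoded to mirror the batch file above, and the %CLASSPATH% suffix is left out for brevity.</p>

```java
class BikelCommandSketch {

    // Builds the same argument list as the batch file above, minus %CLASSPATH%.
    // Windows path separators are hardcoded on purpose, mirroring that file.
    static String[] buildCommand(String parserDir, String inputFile) {
        return new String[] {
            "java",
            "-Xmx500m",
            "-cp", parserDir + "\\dbparser.jar",
            "-Dparser.settingsFile=" + parserDir + "\\settings\\collins.properties",
            "danbikel.parser.Parser",
            "-is", parserDir + "\\wsj-02-21.obj.gz",
            "-sa", inputFile
        };
    }

    public static void main(String[] args) {
        // "|parserdir|" is the placeholder used in this post;
        // "input.txt" is a hypothetical input file name.
        String[] command = buildCommand("|parserdir|", "input.txt");
        System.out.println(String.join(" ", command));
        // To actually launch the parser, something like this should work:
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

<p>Passing the arguments as an array, rather than as one concatenated string, avoids quoting problems when the parser directory contains spaces.</p>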
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/08/running-bikels-parser-programmatically/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Duplicating -tagSeparator effect when using Stanford Parser programmatically</title>
		<link>http://blog.outerthoughts.com/2007/07/duplicating-tagseparator-effect-when-using-stanford-parser-programmatically/</link>
		<comments>http://blog.outerthoughts.com/2007/07/duplicating-tagseparator-effect-when-using-stanford-parser-programmatically/#comments</comments>
		<pubDate>Tue, 31 Jul 2007 19:08:26 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/07/duplicating-tagseparator-effect-when-using-stanford-parser-programmatically/</guid>
		<description><![CDATA[I have been using the Stanford NLP Parser from the command line with the -tagSeparator flag to supply it with partially tagged input. As the parser seems to be really bad with date expressions and complex named entities, I need this functionality.
Now, I need to wrap the parser in my own code to add input/output batching and I [...]]]></description>
			<content:encoded><![CDATA[<p>I have been using the <a href="http://www-nlp.stanford.edu/downloads/lex-parser.shtml" title="Home of the Stanford NLP parser">Stanford NLP Parser</a> from the command line with the -tagSeparator flag to supply it with partially tagged input. As the parser seems to be really bad with date expressions and complex named entities, I need this functionality.</p>
<p>Now I need to wrap the parser in my own code to add input/output batching, and I discovered that this option is not accepted when constructing the parser from code. Despite the javadoc saying that LexicalizedParser.setOptionFlags() takes the same parameters as the command line, the option sets are actually very different.</p>
<p>In the end, after much poking around, I built a code sequence that seems to produce an identical effect:<br />
<code><br />
LexicalizedParser lp = new LexicalizedParser("..../englishPCFG.ser.gz");<br />
//        lp.setOptionFlags(new String[]{"-tagSeparator", "/"});<br />
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(new StringReader(text));<br />
List&lt;Word&gt; words = tokenizer.tokenize();<br />
WordToTaggedWordProcessor wttwp = new WordToTaggedWordProcessor('/');<br />
words = wttwp.process(words);<br />
Tree tree = (Tree) lp.apply(words);<br />
</code></p>
<p>Here, the <em>text</em> variable is a string that is already tokenized, with whitespace as the token separator, and the &#8216;<em>/</em>&#8217; character is the word/tag separator.</p>
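<p>To make the expected input format concrete, here is a plain-Java illustration of what splitting partially tagged tokens on the separator looks like. It is only a sketch of the idea and not the Stanford implementation: the class and method names (<em>TagSplitSketch</em>, <em>splitToken</em>) and the sample sentence are made up, and splitting at the <em>last</em> separator in a token is an assumption on my part.</p>

```java
class TagSplitSketch {

    // Splits one token at the LAST occurrence of the separator.
    // Returns {word, tag}; tag is null when the token carries no tag.
    static String[] splitToken(String token, char separator) {
        int idx = token.lastIndexOf(separator);
        // No separator, or nothing on one side of it: treat the token as untagged.
        if (idx == -1 || idx == 0 || idx == token.length() - 1) {
            return new String[] { token, null };
        }
        return new String[] { token.substring(0, idx), token.substring(idx + 1) };
    }

    public static void main(String[] args) {
        // Hypothetical partially tagged input: only the date tokens carry tags.
        String text = "We met on May/NNP 5/CD in Tokyo";
        for (String token : text.split("\\s+")) {
            String[] pair = splitToken(token, '/');
            System.out.println(pair[0] + "\t" + (pair[1] == null ? "(untagged)" : pair[1]));
        }
    }
}
```

<p>Splitting at the last separator keeps a token such as 1/2/CD intact as the word 1/2 with the tag CD.</p>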
<p><em>Update (3rd of August): </em></p>
<p>An email exchange with Christopher Manning and another look through the code showed that the flags accepted by setOptionFlags() are a strict subset of the flags accepted by the main() method. However, 90% of the flags handled by setOptionFlags() are not documented in that method&#8217;s javadoc, so the only ones I cared about were the ones I saw in the main() method.</p>
<p>Further digging turned up some documentation in the classes <em>Options</em>, <em>Test</em> and <em>Train</em>, all within the <em>edu.stanford.nlp.parser.lexparser</em> package. So some additional documentation does exist, but one has to navigate a maze of code to find it. I guess that&#8217;s the normal curse of open source software.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/07/duplicating-tagseparator-effect-when-using-stanford-parser-programmatically/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
