<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="wordpress/2.3.1" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	>

<channel>
	<title>Outer Thoughts &#187; Java</title>
	<link>http://blog.outerthoughts.com</link>
	<description>&#62; From inner thoughts to the outer limits of Alexandre Rafalovitch</description>
	<pubDate>Sun, 20 Apr 2008 00:37:42 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.1</generator>
	<language>en</language>
			<item>
		<title>5 unobvious things about Atlassian Crowd&#8217;s Delegated Authentication Directory</title>
		<link>http://blog.outerthoughts.com/2008/03/5-unobvious-things-about-atlassian-crowds-delegated-authentication-directory/</link>
		<comments>http://blog.outerthoughts.com/2008/03/5-unobvious-things-about-atlassian-crowds-delegated-authentication-directory/#comments</comments>
		<pubDate>Fri, 07 Mar 2008 20:19:48 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2008/03/5-unobvious-things-about-atlassian-crowds-delegated-authentication-directory/</guid>
		<description><![CDATA[Atlassian has just released Crowd 1.3 that now has the Delegated Authentication option - two-faced directory with an external LDAP facing part for authentication and an internal Crowd part for authorisation. This double-faced functionality causes some non-obvious interface issues.
The most important issue to understand is that external part is accessed only when user is authenticated [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.atlassian.com/software/crowd/CrowdDownloadCenter.jspa" title="Crowd download page">Atlassian has just released Crowd 1.3</a>, which now has the Delegated Authentication option: a two-faced directory with an external LDAP-facing part for authentication and an internal Crowd part for authorisation. This two-faced design causes some non-obvious interface issues.</p>
<p>The most important thing to understand is that the external part is accessed <strong>only</strong> when a user is authenticated with a full username/password. In any other context, the users and groups are those that have been copied or imported into the internal Crowd side of the directory. This produces several cognitive problems:</p>
<ol>
<li>One cannot look up users from that directory just after the directory is created. The search runs against the internal database and does not even generate an LDAP lookup. This is obvious once you realise that the directory effectively has both remote and local repositories behind one interface, and the search only goes against the local (still empty) one.</li>
<li>Directory permissions also apply to the local directory. In the past, I disabled all modify permissions when configuring an LDAP directory, as I did not want to accidentally change an external user. Doing the same thing with a Delegated directory causes very odd database integrity violation stack traces. (now <a href="http://jira.atlassian.com/browse/CWD-911" title="Jira issue for the problem">CWD-911</a>)</li>
<li>Wildcard handling in the user lookup screen differs between Crowd internal directories and LDAP directories. Internal directories use substring search, while LDAP requires an explicit star (*) character. Searching against a Delegated directory is searching against a Crowd internal directory, so adding a star wildcard will actually cause no matches. (now <a href="http://jira.atlassian.com/browse/CWD-912" title="Jira issue for the problem">CWD-912</a>)</li>
<li>The local directory part seems to store a lot more information about a user than just the username and group associations. It actually stores email, full name, etc. This means that if any information is changed in the original external LDAP, the change may not be reflected in Crowd&#8217;s directory (and therefore in the applications). As there does not seem to be a way for the administrator to easily check for mismatches, such a problem is likely to be extremely hard to troubleshoot. (now <a href="http://jira.atlassian.com/browse/CWD-913" title="Jira issue for the problem">CWD-913</a>)</li>
<li>Finally, there is no easy way to copy small sets of users into the local part of Crowd&#8217;s directory from the remote counterpart. They have to be added (with full information) one by one or copied wholesale from another directory. I have opened <a href="http://jira.atlassian.com/browse/CWD-910" title="Request to improve user import">a request to improve this</a>.</li>
</ol>
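<p>The wildcard mismatch from the third point can be illustrated with a small model. This is only a sketch in plain Java, not Crowd&#8217;s actual code: the class and method names are hypothetical, and the LDAP side is approximated by a simple star-to-regex translation.</p>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical model of the two search styles; not taken from Crowd.
class WildcardSearchSketch {

    // Internal-directory style: the query is a literal substring,
    // so a '*' typed by the user must literally appear in the name.
    static List<String> substringSearch(List<String> users, String query) {
        List<String> hits = new ArrayList<>();
        for (String user : users) {
            if (user.contains(query)) {
                hits.add(user);
            }
        }
        return hits;
    }

    // LDAP-filter style: '*' matches any run of characters,
    // everything else must match exactly.
    static List<String> starSearch(List<String> users, String query) {
        String[] parts = query.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        Pattern pattern = Pattern.compile(regex.toString());
        List<String> hits = new ArrayList<>();
        for (String user : users) {
            if (pattern.matcher(user).matches()) {
                hits.add(user);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> users = Arrays.asList("jsmith", "jdoe", "asmith");
        System.out.println(substringSearch(users, "smith"));
        System.out.println(substringSearch(users, "*smith"));
        System.out.println(starSearch(users, "*smith"));
    }
}
```

<p>In this model the same query string gives different answers depending on which side handles it, which is the confusion described above.</p>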
<p>Crowd&#8217;s Delegated directory option was eagerly awaited for a long time by a great many people, but it obviously still needs an improvement or two. I am looking forward to having those issues addressed soon.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/03/5-unobvious-things-about-atlassian-crowds-delegated-authentication-directory/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Memories, memories</title>
		<link>http://blog.outerthoughts.com/2007/11/memories-memories/</link>
		<comments>http://blog.outerthoughts.com/2007/11/memories-memories/#comments</comments>
		<pubDate>Thu, 15 Nov 2007 19:50:31 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Java]]></category>

		<category><![CDATA[Weird Stuff]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/11/memories-memories/</guid>
		<description><![CDATA[I just found my own oldest webpage (handcoded) and my oldest public source code (Java) at once. Archive.org - that has hosted this long-dead memory since 1999 - is just so great.
Looking back at it, I realise that I was right in the thick of Internet development:

When I just started working with Java, we had [...]]]></description>
			<content:encoded><![CDATA[<p>I just found <a href="http://web.archive.org/web/19990819172348/http://www.geocities.com/SiliconValley/Park/1527/jigsaw.html" title="My oldest Web page (archived)">my own oldest webpage</a> (handcoded) and <a href="http://web.archive.org/web/19990819172348/http://www.geocities.com/SiliconValley/Park/1527/McfFilter.txt" title="Java source code">my oldest public source code (Java)</a> at once. <a href="http://www.archive.org/" title="Archive.org website">Archive.org</a> - that has hosted this long-dead memory since 1999 - is just so great.</p>
<p>Looking back at it, I realise that I was right in the thick of Internet development:</p>
<ul>
<li>When I had just started working with Java, we had to throw out all the printed Javadocs, because jdk1.0b2 was released and a lot of the Java API (e.g. FTP and MAIL) from jdk1.0a3 had been hidden under Sun&#8217;s internal packages</li>
<li>I did <a href="http://lists.w3.org/Archives/Public/www-jigsaw/1996NovDec/0142.html" title="Email discussing the implementation">a first (alpha) implementation of the standard servlet API</a> for W3C&#8217;s Jigsaw server, by porting it from Sun&#8217;s Jeeves</li>
<li>I dabbled in hot 2.5D Apple technology (<a href="http://downlode.org/Etext/MCF/hotsauce_and_mcf.html" title="Explanation of MCF and HotSauce">HotSauce</a>), by generating a web server&#8217;s directory content in MCF format. The format has died, but apparently <a href="http://members.aol.com/plysat/xguide.html" title="Story of Hot Sauce">it turned into RDF</a>. So I was developing Semantic Web applications well before the term became popular.</li>
<li>I contributed to an Open Source project, well before <a href="http://web.archive.org/web/20000126203923/http://sourceforge.net/" title="Archive page for Sourceforge.net">SourceForge&#8217;s first appearance</a></li>
<li>I was a late-comer to /. and my ID is still below 36000</li>
</ul>
<p>I am not bragging! I am just musing out loud at how much personal web history can be retrieved with a few well-placed searches.</p>
<p>The flip side of the coin, of course, is that this history will not go away, even if I wanted it to. Which is why I do not link to my Slashdot account (and this is <strong>not</strong> an invitation for an exercise in forensics). One just hopes that a future recruiter will look at the timestamps of my various web appearances and make appropriate adjustments for skills and effort.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/11/memories-memories/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Chumby: Digital picture frame for parents and much more</title>
		<link>http://blog.outerthoughts.com/2007/10/chumby-digital-picture-frame-for-parents-and-much-more/</link>
		<comments>http://blog.outerthoughts.com/2007/10/chumby-digital-picture-frame-for-parents-and-much-more/#comments</comments>
		<pubDate>Fri, 26 Oct 2007 14:04:20 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Chumby]]></category>

		<category><![CDATA[Java]]></category>

		<category><![CDATA[Weird Stuff]]></category>

		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/10/chumby-digital-picture-frame-for-parents-and-much-more/</guid>
		<description><![CDATA[I want to get my parents a digital picture frame. But at the moment I cannot. That&#8217;s because I don&#8217;t want my somewhat less-technical parents to have to fiddle with memory cards, choosing and transferring photographs or running Vista.
My ideal digital picture frame for them would be one sitting in a living room or a [...]]]></description>
			<content:encoded><![CDATA[<p>I want to get my parents a digital picture frame. But at the moment I cannot. That&#8217;s because I don&#8217;t want my somewhat less-technical parents to have to fiddle with memory cards, choosing and transferring photographs or running Vista.</p>
<p>My ideal digital picture frame for them would be one sitting in a living room or a bedroom with new photos to delight my parents every so often.</p>
<p>Such a device would have to be:</p>
<ul>
<li>Wi-Fi capable - My parents have a wireless router and there is no point for a picture frame to sit next to the computer</li>
<li>Able to pull content from private online photo account, such as Flickr or PicasaWeb, to which our extended family could push photos</li>
<li>No ongoing monthly costs - subscription would make it a gift that keeps taking, rather than giving</li>
<li>Controllable over the internet</li>
<li>Ideally with speakers and/or some way to show video to be more future proof</li>
</ul>
<p>I have been on the lookout for such a device for more than a year and have had no luck. Obviously, digital picture frames are still a personal purchase rather than a gift. Or maybe less technical parents are a smaller niche than I imagine.</p>
<p>But I have hope. Yesterday, I received a small package that contained a <a href="http://www.chumby.com/" title="Chumby's home">Chumby</a>! Chumby is not a digital picture frame. It is quite small (I think the website&#8217;s image is real-size). But it has features that make up for its size.</p>
<p>It has Wi-Fi access, including to password-protected networks; it has no monthly costs; it is configured over the internet and comes with speakers. It also has a touch-sensitive screen, a microphone and an accelerometer (like the one in the Wii controller).</p>
<p>Notice I did not say anything about pictures or videos. That&#8217;s because Chumby is a more generic device. It lets you choose which widgets run on it, and a widget is a program written in Flash, the same environment that allows us to watch Flickr slide-shows and YouTube videos, listen to internet radio and play casual games. It can also double as an alarm clock and an iPod music player.</p>
<p>More importantly, because anybody can develop and share a widget, I am not married to any particular way of presenting photos. A Flickr widget exists already, and widgets for other photo and video services are on the way.</p>
<p>And, if I am still unhappy, I can write my own widgets. Chumby runs Linux under the covers, with a Flash Lite 3 interface. And, unlike Apple&#8217;s position with the iPhone, Chumby Industries encourages people to modify their software, hardware and even <a href="http://www.flickr.com/photos/11410414@N06/1325686272/" title="Modified Chumby">basic device shape</a>. Already, there are compilation packages for python, perl and even <a href="http://wiki.chumby.com/mediawiki/index.php?title=Java" title="Description of putting Java on Chumby">Java (actually JamVM)</a>.</p>
<p>Chumby is not yet for public sale, but that should happen any day now. I was on a mailing list, so got a pre-release invite. That is good, as it means I have some time to really play with my Chumby.</p>
<p>And if all goes well, my Chumby will soon have a new friend or two hiding under the Christmas tree overseas.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/10/chumby-digital-picture-frame-for-parents-and-much-more/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Upgrading to GATE 4? Beware of leftover configuration files.</title>
		<link>http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/</link>
		<comments>http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/#comments</comments>
		<pubDate>Sun, 07 Oct 2007 03:03:40 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/</guid>
		<description><![CDATA[From time to time I experiment with the GATE NLP toolkit. Just now I tried to upgrade to the latest version (version 4) and ran into a really strange problem with the ANNIE system not loading correctly. Later, when I uninstalled the older GATE version, it stopped loading at all.
The problem is the user configuration file gate.xml that is [...]]]></description>
			<content:encoded><![CDATA[<p>From time to time I experiment with the <a href="http://gate.ac.uk/" title="Home of the GATE - NLP toolkit">GATE NLP toolkit</a>. Just now I tried to upgrade to the latest version (version 4) and ran into a really strange problem with the ANNIE system not loading correctly. Later, when I uninstalled the older GATE version, it stopped loading at all.</p>
<p>The problem is the user configuration file <em>gate.xml</em>, which is stored in a shared location, usually the home directory. On Windows, that is <em>C:\Documents and Settings\[ProfileName]\</em>.</p>
<p>One of the settings in that file points to where the plugins are loaded from, and it was still referring to GATE 3.1&#8217;s locations. That caused NullPointerExceptions in GATE, and everything broke from that point on.</p>
<p>I found this by using FileMon, but later realised that it might have been easier to change the <em>runtime.spawn</em> property to <em>false</em> in GATE&#8217;s <em>build.xml</em> file, which is used to start the program. Using <em>ant</em> to start a program is a new one for me, but I guess it makes sense in some cases. Setting the property to false shows the startup messages and the exception that the wrong directories cause.</p>
<p>I deleted the old <em>gate.xml</em> and <em>gate.session</em> files in my home directory and everything started to work. Back to actually trying to use the software.</p>
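<p>A quick way to check for such leftovers before upgrading could look like the sketch below. It only lists the two files named above and deletes nothing. The class and method names are made up for illustration, and it assumes that Java&#8217;s <em>user.home</em> property points at the same directory GATE uses.</p>

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of GATE.
class GateConfigCheck {

    // The two per-user files mentioned in the post.
    static final String[] CONFIG_NAMES = {"gate.xml", "gate.session"};

    // Returns whichever of the configuration files exist in the given directory.
    static List<File> findLeftovers(File homeDir) {
        List<File> found = new ArrayList<>();
        for (String name : CONFIG_NAMES) {
            File candidate = new File(homeDir, name);
            if (candidate.isFile()) {
                found.add(candidate);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Assumption: GATE stores its user configuration in user.home.
        File home = new File(System.getProperty("user.home"));
        for (File leftover : findLeftovers(home)) {
            System.out.println("Leftover GATE config: " + leftover.getAbsolutePath());
        }
    }
}
```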
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/feed/</wfw:commentRss>
		</item>
		<item>
		<title>The Rich Web Experience - day 1</title>
		<link>http://blog.outerthoughts.com/2007/09/the-rich-web-experience-day-1/</link>
		<comments>http://blog.outerthoughts.com/2007/09/the-rich-web-experience-day-1/#comments</comments>
		<pubDate>Fri, 07 Sep 2007 15:18:08 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Java]]></category>

		<category><![CDATA[web2.0]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/09/the-rich-web-experience-day-1/</guid>
		<description><![CDATA[I am currently at The Rich Web Experience 2007 conference. It is interesting to compare it to JavaOne conferences I have been to in the past.
To start, RWE is much smaller. It is about 400 people as compared to 15 thousands at JavaOne. This obviously makes scheduling logistics and eating arrangements simpler, but there is [...]]]></description>
			<content:encoded><![CDATA[<p>I am currently at <a href="http://www.therichwebexperience.com/conference/san_jose/2007/09/index.html" title="Conference website">The Rich Web Experience</a> 2007 conference. It is interesting to compare it to JavaOne conferences I have been to in the past.</p>
<p>To start, RWE is much smaller. It is about 400 people as compared to 15 thousand at JavaOne. This obviously makes scheduling logistics and eating arrangements simpler, but there is also a very different feel in the air. It is hard to walk around without bumping into speakers and/or other moderately famous web people. At JavaOne, it is all about learning; here it is more about sharing.</p>
<p>Another interesting thing I noticed is that a lot more people than I expected came from a Java server-side background. In fact, we had a round of introductions at the <em>Web design</em> Birds-Of-Feather session and more than half of the people in the room had some (often strong) background in Java. To me, this is a great sign, as it shows that the path I am taking (adding HTML/CSS/JavaScript to my Java skills) has already been taken by many people before me without too many problems.</p>
<p>I have gone to the following sessions:</p>
<ol>
<li><a href="http://www.therichwebexperience.com/speaker_topic_view.jsp?topicId=418" title="Session details">Secure application development with Ajax</a> (by Dean H. Saxe) - The presentation itself was great and covered an interesting topic in detail. I did not understand all of the advanced concepts and consequences, but the core message was very clear and the slides give enough hints and terms to do further research on my own. I would have liked a more detailed example (e.g. &#8216;This is why SOP is not applicable&#8217;), but overall it was great.</li>
<li><a href="http://www.therichwebexperience.com/speaker_topic_view.jsp?topicId=408" title="Session information">Merging Ajax and Accessibility</a> (by Mark Meeker) - Another great presentation. I had heard before that designing for accessibility has the beneficial side-effects of increased general usability and better design practices, but it was good to see it confirmed with large commercial sites. Mark also had great examples and talked a bit about <a href="http://domscripting.com/blog/display/41" title="Blog article introducing Hijax">Hijax</a> as a way of building accessibility into the process, rather than trying to bolt it on at the end.</li>
<li><a href="http://www.therichwebexperience.com/speaker_topic_view.jsp?topicId=427" title="Session information">Web Design for Server-Side Developers</a> (by Greg Murray) - This one I found somewhat disappointing. I knew that covering good HTML, CSS, JavaScript, modular design and supporting tools in one presentation might be too ambitious. Still, I was looking forward to some sort of high-level, consistent story tying the bits together, with some best practices thrown in. Unfortunately, Greg was not able to deliver that. He spent too much time jumping between the topics. He also talked about jMaki&#8217;s implementation a lot. That might have been useful, but given that some very important issues (internationalisation, classes vs. IDs, etc.) were still not implemented correctly (by Greg&#8217;s own admission), I felt jMaki was not yet ready to be shown as an example of best practices.</li>
<li>Web design/architecture Birds-Of-Feather session with Aaron Gustafson, David Verba and a couple of others. It was actually interesting, because I had sat with them at the dinner table without realising who they were. But you could see they were really smart and interesting, even in their unstaged moments. True geeks, in the good sense of the word. The session itself was a very interesting discussion and somehow I even managed to hog the floor for a while with my questions. Hopefully, it did not annoy too many people.</li>
</ol>
<p>I am looking forward to the second day.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/09/the-rich-web-experience-day-1/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Reducing disk thrashing of OpenNLP/MaxEnt parser - with one line code change</title>
		<link>http://blog.outerthoughts.com/2007/08/reducing-disk-thrashing-of-opennlpmaxent-parser-with-one-line-code-change/</link>
		<comments>http://blog.outerthoughts.com/2007/08/reducing-disk-thrashing-of-opennlpmaxent-parser-with-one-line-code-change/#comments</comments>
		<pubDate>Wed, 15 Aug 2007 12:56:12 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/08/reducing-disk-thrashing-of-opennlpmaxent-parser-with-one-line-code-change/</guid>
		<description><![CDATA[When the OpenNLP toolkit uses its MaxEnt parser, it has to read in about 25 MBytes of model files. The model reader uses a basic unbuffered FileInputStream. The result is an excessive number of system calls (and disk access calls) during the parser startup.
The fix is extremely simple:

In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace

new FileInputStream(f) with
new BufferedInputStream(new FileInputStream(f), 1000000)


Recompile maxent library
Deploy new [...]]]></description>
			<content:encoded><![CDATA[<p>When the OpenNLP toolkit uses its MaxEnt parser, it has to read in about 25 MBytes of model files. The model reader uses a basic unbuffered FileInputStream. The result is an excessive number of system calls (and disk access calls) during the parser startup.</p>
<p>The fix is extremely simple:</p>
<ol>
<li>In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace
<ul>
<li><em>new FileInputStream(f)</em> with</li>
<li><em>new BufferedInputStream(new FileInputStream(f), 1000000)</em></li>
</ul>
</li>
<li>Recompile maxent library</li>
<li>Deploy new version of <em>maxent-2.4.0.jar</em> into OpenNLP&#8217;s lib directory</li>
</ol>
<p>The comparison is striking (the numbers are file access system calls, before and after the change):</p>
<ul>
<li><em>build.bin.gz</em> - <strong>29830</strong> -&gt; <strong>40</strong></li>
<li><em>chunk.bin.gz</em> - <strong>11853</strong> -&gt; <strong>16</strong></li>
<li><em>tag.bin.gz</em> - <strong>11091</strong> -&gt; <strong>14</strong></li>
</ul>
</ul>
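<p>The effect is easy to reproduce without OpenNLP. The sketch below is a stand-in, not the MaxEnt reader itself: it uses an in-memory stream instead of a file, a hypothetical counting wrapper, and single-byte reads as an assumed worst-case access pattern. Only the 1000000-byte buffer size comes from the fix above.</p>

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of OpenNLP or MaxEnt.
class ReadCallCounter {

    // Counts how many read requests actually reach the underlying stream.
    static class CountingStream extends FilterInputStream {
        int calls = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    // Reads 'size' bytes one at a time and reports the calls seen by the source.
    static int countCalls(int size, boolean buffered) {
        CountingStream source = new CountingStream(new ByteArrayInputStream(new byte[size]));
        try (InputStream in = buffered ? new BufferedInputStream(source, 1000000) : source) {
            while (in.read() != -1) {
                // consume one byte at a time, as a small-read consumer would
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        System.out.println("unbuffered: " + countCalls(100000, false));
        System.out.println("buffered:   " + countCalls(100000, true));
    }
}
```

<p>With a real file, each call that reaches the source is roughly one system call, which is why wrapping the stream makes such a difference.</p>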
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/08/reducing-disk-thrashing-of-opennlpmaxent-parser-with-one-line-code-change/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Getting OpenNLP parser to work</title>
		<link>http://blog.outerthoughts.com/2007/08/getting-opennlp-parser-to-work/</link>
		<comments>http://blog.outerthoughts.com/2007/08/getting-opennlp-parser-to-work/#comments</comments>
		<pubDate>Sun, 12 Aug 2007 01:48:50 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/08/getting-opennlp-parser-to-work/</guid>
		<description><![CDATA[I was not able to get OpenNLP parser to work. There were no samples to play with, no command line tools to run. And I don&#8217;t even want to talk about documentation. That&#8217;s because there was not any. There was an attempt at lame joke (at least that&#8217;s the only sense I can make of [...]]]></description>
			<content:encoded><![CDATA[<p>I was not able to get the <a href="http://opennlp.sourceforge.net/" title="Link to the OpenNLP project page">OpenNLP parser</a> to work. There were no samples to play with, no command line tools to run. And I don&#8217;t even want to talk about documentation. That&#8217;s because there was not any. There was an attempt at a lame joke (at least that&#8217;s the only sense I can make of the <em>what.html</em> file), but no actual documentation.</p>
<p>Finally, I pinged my research colleague who did get the toolkit working (thanks Scott). It turns out there is a whole set of model files missing from the tool&#8217;s download. They are linked to from <a href="http://opennlp.sourceforge.net/models.html" title="Link to the page for the models">a separate page on the original website</a> (not even in the download).</p>
<p>I am downloading the models now and hopefully will be on my way. But I can certainly see why this particular toolkit is mentioned much less frequently than Stanford&#8217;s or Bikel&#8217;s.</p>
<p>After the fact, I have also found <a href="http://danielmclaren.net/2007/05/11/getting-started-with-opennlp-natural-language-processing/" title="The tutorial blog entry">a mini tutorial</a> by Daniel McLaren explaining OpenNLP components and showing some sample code and output. Looks better than what&#8217;s bundled with OpenNLP itself. Maybe Daniel and Thomas Morton (author of OpenNLP) should talk.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/08/getting-opennlp-parser-to-work/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Running Bikel&#8217;s parser programmatically</title>
		<link>http://blog.outerthoughts.com/2007/08/running-bikels-parser-programmatically/</link>
		<comments>http://blog.outerthoughts.com/2007/08/running-bikels-parser-programmatically/#comments</comments>
		<pubDate>Mon, 06 Aug 2007 02:44:34 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/08/running-bikels-parser-programmatically/</guid>
		<description><![CDATA[Bikel&#8217;s statistical parser is designed to be run from the command line. I need to run it from my own code.
The following wrapper seems to do the trick on Windows (with your own value for &#124;parserdir&#124;):

String settingsFile = "&#124;parserdir&#124;\\settings\\collins.properties";
Settings.load(settingsFile);
Parser parser = new Parser("&#124;parserdir&#124;\\bikel\\wsj-02-21.obj.gz");
Sexp result = parser.parse(Sexp.read("(This is a funny world)").list());

There is a complaint when running [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.cis.upenn.edu/~dbikel/software.html#stat-parser" title="Homepage of the parser">Bikel&#8217;s statistical parser</a> is designed to be run from the command line. I need to run it from my own code.</p>
<p>The following wrapper seems to do the trick on Windows (with your own value for |parserdir|):<br />
<code><br />
String settingsFile = "|parserdir|\\settings\\collins.properties";<br />
Settings.load(settingsFile);<br />
Parser parser = new Parser("|parserdir|\\bikel\\wsj-02-21.obj.gz");<br />
Sexp result = parser.parse(Sexp.read("(This is a funny world)").list());<br />
</code><br />
There is a complaint when running the above code:<br />
<code><br />
Settings different during training than now<br />
------------------------------<br />
parser.settingsFile<br />
was |parserdir|\settings\collins.properties<br />
is null<br />
</code><br />
This, however, does not seem to affect anything, and the correct values appear to be picked up.</p>
<p>Also, all the scripts are designed for *nix, with a lot of flexibility and variables built in. To get the parser running on Windows, I hardcoded everything but the input file, and this is the result:<br />
<code><br />
set PDIR=|parserdir|<br />
java -Xmx500m -cp "%PDIR%\dbparser.jar;%CLASSPATH%" -Dparser.settingsFile=%PDIR%\settings\collins.properties danbikel.parser.Parser -is %PDIR%\wsj-02-21.obj.gz -sa %1<br />
</code></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/08/running-bikels-parser-programmatically/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Duplicating -tagSeparator effect when using Stanford Parser programmatically</title>
		<link>http://blog.outerthoughts.com/2007/07/duplicating-tagseparator-effect-when-using-stanford-parser-programmatically/</link>
		<comments>http://blog.outerthoughts.com/2007/07/duplicating-tagseparator-effect-when-using-stanford-parser-programmatically/#comments</comments>
		<pubDate>Tue, 31 Jul 2007 19:08:26 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/07/duplicating-tagseparator-effect-when-using-stanford-parser-programmatically/</guid>
		<description><![CDATA[I have been using the Stanford NLP Parser from the command line with the -tagSeparator flag to supply it with partially tagged input. As the parser seems to be really bad with date expressions and complex named entities, I need this functionality.
Now, I need to wrap the parser in my own code to add input/output batching and I [...]]]></description>
			<content:encoded><![CDATA[<p>I have been using the <a href="http://www-nlp.stanford.edu/downloads/lex-parser.shtml" title="Home of the Stanford NLP parser">Stanford NLP Parser</a> from the command line with the -tagSeparator flag to supply it with partially tagged input. As the parser seems to be really bad with date expressions and complex named entities, I need this functionality.</p>
<p>Now I need to wrap the parser in my own code to add input/output batching, and I discovered that this option is not accepted when constructing the parser from code. Despite the javadoc saying that LexicalizedParser.setOptionFlags() takes the same parameters as the command line, the option sets are actually very different.</p>
<p>In the end, after much poking around, I arrived at a code sequence that seems to produce an identical effect:<br />
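<code><br />
// Load the serialized English PCFG grammar<br />
LexicalizedParser lp = new LexicalizedParser("..../englishPCFG.ser.gz");<br />
// Not accepted when set from code, hence commented out:<br />
//        lp.setOptionFlags(new String[]{"-tagSeparator", "/"});<br />
// Split the pretokenized text on whitespace<br />
WhitespaceTokenizer tokenizer = new WhitespaceTokenizer(new StringReader(text));<br />
List&lt;Word&gt; words = tokenizer.tokenize();<br />
// Turn each word/tag token into a tagged word, splitting on '/'<br />
WordToTaggedWordProcessor wttwp = new WordToTaggedWordProcessor('/');<br />
words = wttwp.process(words);<br />
// Parse the partially tagged word list<br />
Tree tree = (Tree) lp.apply(words);<br />
</code></p>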
<p>Here, the <em>text</em> variable is a string that is already tokenized with whitespace as the separator, and the &#8216;<em>/</em>&#8217; character is the word/tag separator.</p>
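<p>To make the input format concrete, here is a minimal, self-contained sketch of how such word/tag tokens can be split. It does not use the Stanford classes: the <em>TaggedTokenSplitter</em> class, its <em>split</em> method and the sample sentence are all hypothetical, and splitting on the last separator is an assumption made for illustration, not a statement about what WordToTaggedWordProcessor does internally.</p>

```java
// Illustrative sketch only; not part of the Stanford parser API.
class TaggedTokenSplitter {

    // Splits "word/TAG" on the LAST separator (an assumption of this sketch).
    // Returns {word, tag}; tag is null when the token carries no usable tag.
    static String[] split(String token, char separator) {
        int idx = token.lastIndexOf(separator);
        // No separator, empty word part, or empty tag part: treat as untagged.
        if (idx <= 0 || idx == token.length() - 1) {
            return new String[] { token, null };
        }
        return new String[] { token.substring(0, idx), token.substring(idx + 1) };
    }

    public static void main(String[] args) {
        // Hypothetical partially tagged, whitespace-tokenized input.
        String text = "The meeting is on 2007-07-31/CD in New_York/NNP .";
        for (String token : text.split("\\s+")) {
            String[] parts = split(token, '/');
            String tag = (parts[1] == null) ? "(untagged)" : parts[1];
            System.out.println(parts[0] + "\t" + tag);
        }
    }
}
```

<p>In this sketch, tokens without a separator are simply left untagged, which is what allows the input to be only partially tagged.</p>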
<p><em>Update (3rd of August): </em></p>
<p>An email exchange with Christopher Manning and another look through the code showed that the flags accepted by setOptionFlags() are a strict subset of the flags accepted by the main() method. However, 90% of the flags accepted by setOptionFlags() are not documented in that method&#8217;s javadoc, so the only ones I cared about were the ones I saw in the main() method.</p>
<p>Yet further digging found some documentation in the classes <em>Options</em>, <em>Test</em> and <em>Train</em>, all within the <em>edu.stanford.nlp.parser.lexparser</em> package. So, some additional documentation does exist, but one has to navigate the maze of code to find it. I guess that&#8217;s the normal curse of open source software.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/07/duplicating-tagseparator-effect-when-using-stanford-parser-programmatically/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Laying out penn treebank output of Stanford parser</title>
		<link>http://blog.outerthoughts.com/2007/06/laying-out-penn-treebank-output-of-stanford-parser/</link>
		<comments>http://blog.outerthoughts.com/2007/06/laying-out-penn-treebank-output-of-stanford-parser/#comments</comments>
		<pubDate>Sun, 17 Jun 2007 04:31:46 +0000</pubDate>
		<dc:creator>Alexandre Rafalovitch</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/06/laying-out-penn-treebank-output-of-stanford-parser/</guid>
		<description><![CDATA[I am trying to use the Stanford NLP parser for my research and I need to look at the trees it produces for large, complex sentences. I have found several packages for laying out the output as trees, but they all seem to be targeted at visualizing smaller sentences, suitable for illustrating a point in [...]]]></description>
			<content:encoded><![CDATA[<p>I am trying to use the <a href="http://www-nlp.stanford.edu/downloads/lex-parser.shtml" title="Webpage of the parser">Stanford NLP parser</a> for my research and I need to look at the trees it produces for large, complex sentences. I have found several packages for laying out the output as trees, but they all seem to be targeted at visualizing smaller sentences, suitable for illustrating a point in a published paper. <a href="http://www.outerthoughts.com/files/line_tree2.gif" title="Sample output of Graphviz layout for Stanford Parser&#8217;s output"><img src="http://blog.outerthoughts.com/wp-content/uploads/2007/06/line_tree2_icon.gif" title="Sample output of Graphviz layout for Stanford Parser&#8217;s output" alt="Sample output of Graphviz layout for Stanford Parser&#8217;s output" align="left" height="86" vspace="10" width="142" /></a></p>
<p>My trees are large. A sentence of 40 words is an average case, rather than an edge one. So, all of the display packages I have tried cut off large chunks of the tree. It might be possible to tinker with their LaTeX code to produce output that is not cut off at letter, A4 or even A3 size, but I am not that good with LaTeX yet. And I need to produce these large trees quickly, as I am not even sure whether this parser will be suitable for my needs in the long run.</p>
<p>So, instead, I wrote my own bridging code in Java between the Penn Treebank output of the parser and <a href="http://www.graphviz.org/" title="Homepage of the Graphviz software">Graphviz</a>, the graph layout software that I use for many layout tasks. The whole implementation fit in one file of less than 100 lines, including the logic to highlight maximum spanning subtrees of a particular element (NounPhrase in this example). Click on the small image to see the full example. The Graphviz input file is <a href="http://www.outerthoughts.com/files/line_tree2.dot" title="Graphviz intermediate file for the example">also available</a> for the curious.</p>
<p>At the moment, converting to image files is sufficient. If I ever do convince the parser to understand my 80-word sentences, the resulting trees will probably be large enough to need <a href="http://zvtm.sourceforge.net/zgrviewer.html" title="Link to ZGRViewer software for Graphviz files">ZGRViewer</a>.</p>
<p>The Java bridging code is not available yet, as it is very ugly. The secret was in the <a href="http://www-nlp.stanford.edu/viewvc/trunk/javanlp/src/edu/stanford/nlp/trees/PennTreeReader.java?view=markup" title="Source view for the PennTreeReader class">PennTreeReader</a>&#8217;s main() method, which showed how to read the parser&#8217;s output back into a Tree form suitable for recursive descent. After that, it was just code to walk the tree levels and emit the very simple Graphviz format. I will probably clean the code up a bit over the next couple of weeks and then release it.</p>
<p>If somebody does like the output and wants to see the code sooner, send me an email at alex@thisdomain.</p>
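<p>In the meantime, here is a rough sketch of the general idea. It is not the bridging code described above: it does not use PennTreeReader or any other Stanford class, it does no subtree highlighting, and the <em>PennToDot</em> class, its <em>convert</em> method and the sample sentence are hypothetical names chosen for illustration. It assumes well-formed bracketed input and simply walks it by recursive descent, emitting one Graphviz node per label or word and one edge per parent/child pair.</p>

```java
// Illustrative sketch only: a tiny hand-written reader for bracketed trees,
// standing in for the Stanford PennTreeReader. Assumes well-formed input.
class PennToDot {
    private final String src;  // bracketed tree text
    private int pos = 0;       // current read position in src
    private int nextId = 0;    // counter for unique Graphviz node names
    private final StringBuilder dot = new StringBuilder();

    private PennToDot(String src) {
        this.src = src;
    }

    // Converts one bracketed tree into a Graphviz "digraph" description.
    static String convert(String pennTree) {
        PennToDot c = new PennToDot(pennTree);
        c.dot.append("digraph tree {\n  node [shape=plaintext];\n");
        c.parseNode();
        c.dot.append("}\n");
        return c.dot.toString();
    }

    // Reads either "(LABEL child child ...)" or a bare word,
    // emits the node and its outgoing edges, and returns the node id.
    private int parseNode() {
        skipSpaces();
        int id = nextId++;
        if (src.charAt(pos) == '(') {
            pos++;                          // consume '('
            emitNode(id, readToken());      // constituent or tag label
            skipSpaces();
            while (src.charAt(pos) != ')') {
                int child = parseNode();
                dot.append("  n").append(id).append(" -> n").append(child).append(";\n");
                skipSpaces();
            }
            pos++;                          // consume ')'
        } else {
            emitNode(id, readToken());      // leaf: the word itself
        }
        return id;
    }

    // Reads characters up to the next parenthesis or whitespace.
    private String readToken() {
        int start = pos;
        while (pos < src.length() && "() \t\r\n".indexOf(src.charAt(pos)) < 0) {
            pos++;
        }
        return src.substring(start, pos);
    }

    private void skipSpaces() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) {
            pos++;
        }
    }

    // Emits a node line, escaping backslashes and double quotes in the label.
    private void emitNode(int id, String label) {
        String safe = label.replace("\\", "\\\\").replace("\"", "\\\"");
        dot.append("  n").append(id).append(" [label=\"").append(safe).append("\"];\n");
    }

    public static void main(String[] args) {
        System.out.println(convert("(S (NP (DT The) (NN cat)) (VP (VBZ sleeps)))"));
    }
}
```

<p>The printed text can be saved to a file and passed to Graphviz, for example with <em>dot -Tgif tree.dot -o tree.gif</em>, where the file names are placeholders.</p>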
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/06/laying-out-penn-treebank-output-of-stanford-parser/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
