<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Outer Thoughts</title>
	<atom:link href="http://blog.outerthoughts.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.outerthoughts.com</link>
	<description>&#62; From inner thoughts to the outer limits of Alexandre Rafalovitch</description>
	<lastBuildDate>Fri, 20 Aug 2010 05:12:23 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.0.1</generator>
		<item>
		<title>Arabic numerals&#8217; non-WYSIWYG</title>
		<link>http://blog.outerthoughts.com/2010/08/arabic-numerals-non-wysiwyg/</link>
		<comments>http://blog.outerthoughts.com/2010/08/arabic-numerals-non-wysiwyg/#comments</comments>
		<pubDate>Fri, 20 Aug 2010 05:12:23 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Problems and Solutions]]></category>
		<category><![CDATA[Weird Stuff]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=359</guid>
		<description><![CDATA[
<p>For my other project, I needed to process some Arabic text in an HTML file derived from an MS Word document.</p>
<p>Everything was going reasonably well, except that my regular expressions were not picking up the section name/number sequences in every case, which was causing a problem with the 6-language alignment algorithm.</p>
<p>Normally, I just examine the text <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2010/08/arabic-numerals-non-wysiwyg/">Arabic numerals&#8217; non-WYSIWYG</a></span>]]></description>
			<content:encoded><![CDATA[<div class="zemanta-img" style="margin: 1em; display: block;">
<div>
<dl class="wp-caption alignright" style="width: 310px;">
<dt class="wp-caption-dt"><a href="http://commons.wikipedia.org/wiki/File:EgyptphoneKeypad.jpg"><img title="I made this photo myself. Its now in the Publi..." src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/d8/EgyptphoneKeypad.jpg/300px-EgyptphoneKeypad.jpg" alt="I made this photo myself. Its now in the Publi..." width="300" height="298" /></a></dt>
<dd class="wp-caption-dd zemanta-img-attribution" style="font-size: 0.8em;">Image via <a href="http://commons.wikipedia.org/wiki/File:EgyptphoneKeypad.jpg">Wikipedia</a></dd>
</dl>
</div>
</div>
<p>For <a title="UN Corpora project website" href="http://www.uncorpora.org/">my other project</a>, I needed to process some Arabic text in an HTML file derived from an MS Word document.</p>
<p>Everything was going reasonably well, except that my regular expressions were not picking up the section name/number sequences in every case, which was causing a problem with the 6-language alignment algorithm.</p>
<p>Normally, I just examine the text visually, determine a new regular expression pattern, and that particular problem is solved. This time it was not to be.</p>
<p>When I looked at the text, what I saw was the phrase &#8220;<big><strong>Section 1٣</strong></big>&#8221; with the word Section written in Arabic (right-to-left, of course). The problem here is <big><strong>1٣</strong></big>, which means 13, but with the first digit 1 coming from the <a href="http://en.wikipedia.org/wiki/Arabic_numerals">Arabic Numerals</a> set (which is what we use in English) and the second digit <big><strong>٣</strong></big> (3) coming from the <a href="http://en.wikipedia.org/wiki/Eastern_Arabic_numerals">Arabic-Indic Numerals</a> set (which is what at least some Arab countries use). Confusing, I know. We use their numbers and they already use somebody else&#8217;s. What do they know that we haven&#8217;t yet figured out?</p>
<p>Of course, this juxtaposition makes no sense. Why would somebody mix the two numeral sets, especially in an official document? I contacted the authoring departments and &#8211; unbelievably to me &#8211; they looked at the document and it looked correct to them.</p>
<p>I had nothing to go on, so I left that puzzle unsolved for a couple of weeks. That is, until it hit me &#8211; they were looking at it in MS Word, while I was looking at it at the code point level. They had <a class="zem_slink" title="WYSIWYG" rel="wikipedia" href="http://en.wikipedia.org/wiki/WYSIWYG">WYSIWYG</a> on and I did not. So that was the difference.</p>
<p>I went looking around the MS Word interface with Arabic enabled and, sure enough, there was <a title="Microsoft's documentation on Arabic support in MSWord" href="http://www.microsoft.com/middleeast/arabicdev/office/officeXP/wPapers/Word.aspx#_Toc15640940">a whole collection of options for Arabic fonts, numbers and more</a>. One of them was to display all numbers as Arabic-Indic. When that mode is enabled, MS Word displays any digits as Arabic-Indic ones. That answered half of the puzzle: why the original authors could not see the difference. But how did the mix happen in the first place?</p>
<p>My guess is that the original section was copied from somewhere else in the document. The person who worked on that original had the keyboard (not the MS Word display) configured to use Arabic numerals and was actually entering the all too familiar 1,2,3, with MS Word displaying them as <big><strong>١,٢,٣</strong></big>. Then the person who copied the section title had a keyboard configured to use Arabic-Indic characters and replaced or added to the section number using that keyboard. It still displayed consistently, but now had digits from two different numeral sets.</p>
<p>Of course, since the documents were designed for printing, nobody noticed and nobody really had a reason to care. This issue only becomes important when those documents are used as <em>input</em> for bitext alignment or some other computational processing. Then, and only then, it bites the person trying to make sense of it.</p>
<p>The lesson here is that WYSIWYG might be good if all you are doing is looking or printing. But if your documents serve as input to other processes as well, WYSIWYG can cause some very non-obvious issues.</p>
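<p>For anybody hitting the same problem, here is a minimal Python sketch of how such mixed numbers could be found and normalized at the code point level. It is not the code I used on the project; the function names <em>find_mixed_runs</em> and <em>to_ascii_digits</em> are made up for this illustration, and it assumes that the script of a digit can be read off its Unicode character name.</p>

```python
import re
import unicodedata

# For str patterns, \d matches any Unicode decimal digit (category Nd),
# so a run may contain digits from more than one script.
DIGIT_RUN = re.compile(r"\d+")


def digit_script(ch):
    """Return a coarse script label for a digit, based on its Unicode name."""
    name = unicodedata.name(ch)
    if name.startswith("DIGIT "):
        return "ASCII"  # e.g. "DIGIT ONE"
    return name.split(" DIGIT ")[0]  # e.g. "ARABIC-INDIC DIGIT THREE"


def find_mixed_runs(text):
    """Return every digit run that mixes digits from different scripts."""
    mixed = []
    for match in DIGIT_RUN.finditer(text):
        scripts = {digit_script(ch) for ch in match.group()}
        if len(scripts) > 1:
            mixed.append(match.group())
    return mixed


def to_ascii_digits(text):
    """Replace every decimal digit with its ASCII equivalent."""
    return "".join(
        str(unicodedata.decimal(ch)) if unicodedata.category(ch) == "Nd" else ch
        for ch in text
    )
```

<p>Running <em>find_mixed_runs</em> over each section heading before alignment flags the suspicious numbers, and <em>to_ascii_digits</em> makes them comparable across the language versions.</p>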
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2010/08/arabic-numerals-non-wysiwyg/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>jQuery: Cycling between multiple classes with random start</title>
		<link>http://blog.outerthoughts.com/2010/08/jquery-cycling-between-multiple-classes-with-random-start/</link>
		<comments>http://blog.outerthoughts.com/2010/08/jquery-cycling-between-multiple-classes-with-random-start/#comments</comments>
		<pubDate>Thu, 12 Aug 2010 14:32:36 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[jQuery]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=316</guid>
		<description><![CDATA[
<p>I saw an interesting question on StackOverflow on how to cycle between 3 states for list items, but with the initial state of each item being potentially different.</p>
<p>This random start position part of the problem was making me think, so I used it as an exercise to try some newish jQuery functions, such as <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2010/08/jquery-cycling-between-multiple-classes-with-random-start/">jQuery: Cycling between multiple classes with random start</a></span>]]></description>
			<content:encoded><![CDATA[<div class="zemanta-img" style="margin: 1em; display: block;">
<div>
<dl class="wp-caption alignright" style="width: 260px;">
<dt class="wp-caption-dt"><a href="http://commons.wikipedia.org/wiki/File:Directed.svg"><img title="A directed graph." src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/a2/Directed.svg/250px-Directed.svg.png" alt="A directed graph." width="149" height="135" /></a></dt>
<dd class="wp-caption-dd zemanta-img-attribution" style="font-size: 0.8em;">Image via <a href="http://commons.wikipedia.org/wiki/File:Directed.svg">Wikipedia</a></dd>
</dl>
</div>
</div>
<p>I saw an interesting question on <a href="http://stackoverflow.com/">StackOverflow</a> on how to cycle between 3 states for list items, but with the initial state of each item being potentially different.</p>
<p>This <em>random start position</em> part of the problem was making me think, so I used it as an exercise to try some newish <a class="zem_slink" title="JQuery" rel="homepage" href="http://jquery.com/">jQuery</a> functions, such as delegate and advanced class selectors.</p>
<p>My solution was basically to build a reduced case of a state transition diagram with a cycle. The advantage is that any number of states can be iterated through. It could even be a plugin, if need be.</p>
<p>My full test example can be found in <a title="Original Question on StackOverflow, together with my answer" href="http://stackoverflow.com/questions/3463954/jquery-click-through-class-cycle">the original SO question</a>, probably toward the bottom. I am also including it here as a Pastie:<br />
<script src="http://pastie.org/1088322.js"></script></p>
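<p>Stripped of jQuery, the state transition idea fits in a few lines. The Python sketch below is only an illustration of the cycle, not the code from the Pastie; the class names and the helpers <em>build_cycle</em> and <em>advance</em> are made up for the example.</p>

```python
import random


def build_cycle(states):
    """Map each state to its successor, wrapping the last one back to the first."""
    return {
        state: states[(index + 1) % len(states)]
        for index, state in enumerate(states)
    }


def advance(current, cycle):
    """Return the state that follows the current one."""
    return cycle[current]


states = ["red", "green", "blue"]  # hypothetical class names
cycle = build_cycle(states)

# Every item starts in a random state; a "click" moves it one step along the cycle.
items = [random.choice(states) for _ in range(5)]
items = [advance(state, cycle) for state in items]
```

<p>Because the transitions live in a lookup table, adding a fourth state only means extending the list, and each item can start anywhere in the cycle.</p>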
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2010/08/jquery-cycling-between-multiple-classes-with-random-start/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A new camera</title>
		<link>http://blog.outerthoughts.com/2010/08/a-new-camera/</link>
		<comments>http://blog.outerthoughts.com/2010/08/a-new-camera/#comments</comments>
		<pubDate>Fri, 06 Aug 2010 01:36:03 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Photography]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=287</guid>
		<description><![CDATA[<p>I got myself a new digital camera recently, a Canon T2i. It feels really nice and makes it quite hard to go back to point-and-shoots afterward. And it takes really good 18 megapixel shots.</p>
<p>Here is one of a South African warthog, displayed using Microsoft&#8217;s Zoom.it technology. Try zooming <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2010/08/a-new-camera/">A new camera</a></span>]]></description>
			<content:encoded><![CDATA[<p>I got myself a new digital camera recently, a Canon T2i. It feels really nice and makes it quite hard to go back to point-and-shoots afterward. And it takes really good 18 megapixel shots.</p>
<p>Here is one of a South African warthog, displayed using Microsoft&#8217;s <a href="http://zoom.it">Zoom.it</a> technology. Try zooming in on it:</p>
<p><script src="http://zoom.it/ivJR.js?width=auto&amp;height=400px"></script></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2010/08/a-new-camera/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>jQuery for multilingual web development</title>
		<link>http://blog.outerthoughts.com/2009/10/jquery-for-multilingual-web-development/</link>
		<comments>http://blog.outerthoughts.com/2009/10/jquery-for-multilingual-web-development/#comments</comments>
		<pubDate>Sun, 01 Nov 2009 02:54:14 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[jQuery]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=278</guid>
		<description><![CDATA[<p>I have (nearly) finished developing a mini-website in 6 languages (Arabic, Chinese, English, French, Russian, Spanish). The layout was the same, so ideally it would have been driven by a content management system. Not in this case, unfortunately, as I was not given enough time to set up the infrastructure.</p>
<p>As I know nearly nothing of at least <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/10/jquery-for-multilingual-web-development/">jQuery for multilingual web development</a></span>]]></description>
			<content:encoded><![CDATA[<p>I have (nearly) finished developing a mini-website in 6 languages (Arabic, Chinese, English, French, Russian, Spanish). The layout was the same, so ideally it would have been driven by a content management system. Not in this case, unfortunately, as I was not given enough time to set up the infrastructure.</p>
<p>As I know nearly nothing of at least two of the languages above (Arabic and Chinese), I had to keep rechecking the content provided to ensure the right text ended up in the right place on a page. Google Translate helped with that by back-translating into English, so I could make sure I got the right sentence boundaries, etc.</p>
<p>However, even with content in the right place, I still needed to visually verify that things were correct. Also, some of the late-arriving changes needed to be implemented for all 6 sets of files. For example, some of the URLs changed, some classes for JavaScript enhancements were added or removed, and so on.</p>
<p>Initially, I tried to check things in the editor by using regular expressions. This worked for basic things, but as the project progressed and the markup (and JavaScript enhancements) became more complex, regular expressions were no longer sufficient. I needed something that understood HTML structure and was easy to run interactively.</p>
<p>I was already using jQuery for progressive enhancement, and my Firefox always has Firebug set up. And I have been poking at random web pages with the jQuerify bookmarklet for ages. But with this project, the jQuery+Firebug combination has graduated to a first-class development and troubleshooting toolkit specifically for multilingual content.</p>
<p>Here are a couple of basic queries I ran in the Firebug console window:</p>
<ul>
<li>I had most of the links going to a new window and needed to check I did not miss a target attribute: <em><span class="status-body"><span class="entry-content">$("a[target != '_blank']")</span></span></em></li>
<li><span class="status-body"><span class="entry-content">When comparing languages side-by-side, I needed to see whether the links were the same. The easiest way to do that was by looking at where those links were actually pointing. I could of course select an element with Firebug to see all of its content, but it was easier to print a particular attribute automatically when I hovered over the link with a mouse: <em>$("a").mouseenter(function(){console.log( $(this).attr('href'));})</em></span></span></li>
<li><span class="status-body"><span class="entry-content">If I quickly needed to check which elements were affected by a particular class, I would just highlight them:<em> $(".NYOnly").css("background-color", "red")</em></span></span></li>
</ul>
<p><span class="status-body"><span class="entry-content"><a title="A video of using jQuery and Firebug" href="http://encosia.com/2009/09/21/updated-see-how-i-used-firebug-to-learn-jquery/">None of these are hidden secrets</a>; however, it may not always be obvious what can be done and how far a couple of lines of jQuery code can go. Here is an example that can be pasted right into the Firebug window. It uses the <a title="jQuery extension for Google Translation API" href="http://code.google.com/p/jquery-translate/">jQuery-translate</a> extension to hook into the Google Translate API and prints out the translated content of a table cell that is clicked on:</span></span></p>
<blockquote><p><code>$.getScript('http://jquery-translate.googlecode.com/files/jquery.translate-1.3.9.min.js');</code></p>
<p>$("td").click(function(){<br />
$(this).translate('ar', 'en', {<br />
replace: false,<br />
each: function(i){console.log( this.translation[i] ) }<br />
})<br />
});</p></blockquote>
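<p>The first two checks from the list above (missing target attributes and matching links across languages) could also be run offline over the saved files. This is a rough Python sketch using only the standard library; <em>LinkCollector</em>, <em>missing_blank_target</em> and <em>same_links</em> are names invented for the illustration, and it assumes the language versions list their links in the same order.</p>

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href and target of every anchor element in a page."""

    def __init__(self):
        super().__init__()
        self.links = []  # (href, target) tuples in document order

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attributes = dict(attrs)
            self.links.append((attributes.get("href"), attributes.get("target")))


def collect_links(html):
    """Parse a page and return its (href, target) pairs."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


def missing_blank_target(html):
    """Return the hrefs of links that would not open in a new window."""
    return [href for href, target in collect_links(html) if target != "_blank"]


def same_links(html_a, html_b):
    """Return True when two pages point at the same URLs in the same order."""
    hrefs_a = [href for href, _ in collect_links(html_a)]
    hrefs_b = [href for href, _ in collect_links(html_b)]
    return hrefs_a == hrefs_b
```

<p>Feeding the English and the Arabic version of the same page to <em>same_links</em> gives a quick yes/no answer before any visual check in the browser.</p>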
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/10/jquery-for-multilingual-web-development/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Making up with ANTLR</title>
		<link>http://blog.outerthoughts.com/2009/05/making-up-with-antlr/</link>
		<comments>http://blog.outerthoughts.com/2009/05/making-up-with-antlr/#comments</comments>
		<pubDate>Fri, 29 May 2009 02:25:16 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[My PhD research]]></category>
		<category><![CDATA[Problems and Solutions]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=275</guid>
		<description><![CDATA[<p>I like ANTLR! It is a specialized tool that can really be applied to many difficult tasks when regular expressions get all Dust Puppy like. And I have used it in the past with great success.</p>
<p>But, every time I put this particular tool aside, I know that picking it back up will be like making up <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/05/making-up-with-antlr/">Making up with ANTLR</a></span>]]></description>
			<content:encoded><![CDATA[<p>I like <a title="ANTLR's home page" href="http://antlr.org/">ANTLR</a>! It is a specialized tool that can really be applied to many difficult tasks when regular expressions get all <a title="Explanation of Dust Puppy" href="http://www.userfriendly.org/cartoons/dustpuppy/">Dust Puppy</a> like. And I have used it in the past with great success.</p>
<p>But every time I put this particular tool aside, I know that picking it back up will be like making up after a bad breakup. Things feel familiar, but you are still so uncomfortable that you cannot get anything working. Only knowing how great the tool is underneath makes me go through the effort of re-familiarization.</p>
<p>I just downloaded ANTLR 3.1.2 bundled with its own GUI, ANTLRWorks, which offers visual diagrams, a debugger and templates. You would think that would make for an easy out-of-the-box experience. You would be wrong.</p>
<p>You start the GUI and end up facing a blank screen. Lots of options and tabs, for sure, but the only easy starting point seems to be &#8216;Insert rule from template&#8217;.</p>
<p>OK, so here are a couple of rules from templates, trying to parse the &#8220;Hello World!&#8221; string:</p>
<blockquote><p>ID    :    LETTER (LETTER | DIGIT)*<br />
;<br />
LETTER<br />
:    'a'..'z' | 'A'..'Z'<br />
;</p>
<p>DIGIT    :    '0'..'9'<br />
;</p>
<p>WS    :    (' ' | '\t' | '\n' | '\r') { $setType(Token.SKIP); }<br />
;</p></blockquote>
<p>Not good. We are missing a start state apparently. Ok, let&#8217;s add one:</p>
<blockquote><p>hello    :    ID ID '!'<br />
;</p></blockquote>
<p>Still no good. I start looking at examples, trying to see which bits are compulsory. OK, the <em>grammar</em> declaration is missing at the top of the file. Of course, I now have both parser and lexer rules in one file (an ANTLR 3 feature, I believe), but let&#8217;s not worry about deep meaning here.</p>
<blockquote><p>grammar test;</p></blockquote>
<p>Now, suddenly, the syntax diagram starts showing up. Let&#8217;s try saving (as test.g) and compiling. No good:</p>
<blockquote><p>The following token definitions can never be matched because prior tokens match the same input: LETTER</p></blockquote>
<p>So much for following a template. More digging in examples. Memory really starts to bring back the <a title="Seminal book on Compiler technologies" href="http://dragonbook.stanford.edu/">Dragon Book</a>&#8217;s lessons. What&#8217;s the problem with LETTER, and what is the <em>prior token</em> here? Ah, we don&#8217;t want the lexer to return LETTER (or DIGIT), only ID. So LETTER and DIGIT are both token fragments, not tokens. Add <em>fragment</em> in front of both definitions. All good?</p>
<p>Nope! Now we have a problem with:</p>
<blockquote><p>attribute is not a token, parameter, or return value: setType</p></blockquote>
<p>But I did not write <em>setType</em>, the template provided it! Back to the examples! Apparently, somewhere along the way Skip tokens have gone away and we now have hidden channels instead. Swap that bit with one from an example and try again.</p>
<p>SUCCESS. Switch to the interpreter, enter &#8220;Hello World!&#8221; in the input box and run the <em>hello</em> rule. Beauty, we have a parse diagram.</p>
<p>The final running grammar example is here:</p>
<blockquote><p>grammar test;</p>
<p>hello    :    ID ID '!'<br />
;</p>
<p>ID    :    LETTER (LETTER | DIGIT)*<br />
;<br />
fragment LETTER<br />
:    'a'..'z' | 'A'..'Z'<br />
;</p>
<p>fragment DIGIT    :    '0'..'9'<br />
;</p>
<p>WS    :    (' ' | '\t' | '\n' | '\r') {  $channel = HIDDEN;  }<br />
;</p></blockquote>
<p>Hello World! Now, on to the real grammar and (if things really, really work) GATE integration&#8230;</p>
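<p>To make the fragment and hidden channel ideas concrete, here is a hand-rolled Python imitation of what the final grammar does. It is only a sketch and has nothing to do with the code ANTLR generates: fragments become regular expression pieces that never produce tokens themselves, the hidden channel becomes a set of token names that are matched but not passed on, and the token name BANG is made up for the '!' literal.</p>

```python
import re

# Fragments: building blocks that never become tokens on their own.
LETTER = r"[a-zA-Z]"
DIGIT = r"[0-9]"

# Token rules, tried in order at each input position.
TOKEN_RULES = [
    ("ID", re.compile(f"{LETTER}(?:{LETTER}|{DIGIT})*")),
    ("BANG", re.compile(r"!")),
    ("WS", re.compile(r"[ \t\n\r]")),
]

# Tokens on the hidden channel are consumed but never reach the parser rule.
HIDDEN = {"WS"}


def tokenize(text):
    """Split text into (token name, lexeme) pairs, dropping hidden tokens."""
    tokens = []
    pos = 0
    while pos != len(text):
        for name, pattern in TOKEN_RULES:
            match = pattern.match(text, pos)
            if match:
                if name not in HIDDEN:
                    tokens.append((name, match.group()))
                pos = match.end()
                break
        else:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
    return tokens


def matches_hello(text):
    """Check the parser rule  hello : ID ID '!'  against the visible tokens."""
    return [name for name, _ in tokenize(text)] == ["ID", "ID", "BANG"]
```

<p>Since LETTER and DIGIT never appear in the rule list, the lexer can only ever return ID, which is exactly what the <em>fragment</em> keyword achieves in the grammar.</p>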
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/05/making-up-with-antlr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Conjunctions in named entities</title>
		<link>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/</link>
		<comments>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/#comments</comments>
		<pubDate>Fri, 27 Mar 2009 02:34:52 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=273</guid>
		<description><![CDATA[<p>A recent article on lingpipe discussed conjuncted named entities such as Johnson and Johnson and Wallace and Gromit. They suggest that maybe a way of treating this is as a frozen expression. I assume that means relying on statistical measures to see this Multi-Word-Expression repeating enough times to be treated as a unit.</p>
<p>In the United Nations <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/">Conjunctions in named entities</a></span>]]></description>
			<content:encoded><![CDATA[<p>A <a title="Lingpipe's article on conjunctions in named entities" href="http://lingpipe-blog.com/2009/03/26/joint-referential-uncertainty-the-wallace-and-gromit-dilemma/">recent article on lingpipe</a> discussed conjuncted named entities such as <span style="text-decoration: underline;">Johnson and Johnson</span> and <span style="text-decoration: underline;">Wallace and Gromit</span><em>.</em> They suggest that maybe a way of treating this is as a frozen expression. I assume that means relying on statistical measures to see this Multi-Word-Expression repeating enough times to be treated as a unit.</p>
<p>In the United Nations corpus, things can get even more interesting. Let&#8217;s look at a relatively easy example: <em><span style="text-decoration: underline;">draft resolution A/56/L.28 and Add.1</span></em>.</p>
<p>Is this one document (one draft resolution) or two? And if two, then which two? The first one is obviously <span style="text-decoration: underline;">A/56/L.28</span>. But <span style="text-decoration: underline;">Add.1</span> is not a valid document symbol on its own; it is actually an (additive?) coreference to the first one and resolves to <span style="text-decoration: underline;">A/56/L.28/Add.1</span>.</p>
<p>The answer (as good as I can make it so far) could lie in the <a title="Introduction to FRBR" href="http://techessence.info/frbr">FRBR</a> distinction between Expression and Manifestation. A resolution is an expression of Member States&#8217; proposals and negotiations. To some degree, it evolves over several meetings. However, between the discussions, the latest version or changes need to be reported, both to make sure they are formally registered and to ensure that the next round of discussions has the latest documents to work from.</p>
<p>In our case, the first time the draft resolution had to be presented, it was published under <span style="text-decoration: underline;">A/56/L.28</span> (which incidentally means limited-distribution document 28 of the General Assembly&#8217;s 56th regular session). So the initial Manifestation of the draft resolution became this physical document with a distinct symbol assigned.</p>
<p>But apart from its text, a draft resolution has a list of sponsoring Member States. That list can change as the draft resolution gains sponsors. These additional sponsors were listed in the Addendum <span style="text-decoration: underline;">A/56/L.28/Add.1</span>. But the addendum does not make sense without the original document, so both physical documents actually represent one logical draft resolution, which is reflected in the grammar of the text (draft resolution, not resolution<span style="text-decoration: underline;">s</span>).</p>
<p>What this means for named entity annotations and for recognition algorithms is hard to say and is something I am looking at with my PhD research.</p>
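<p>As a first approximation, the coreference step itself can be sketched in Python. This is an illustration only, not part of my research code: the regular expressions are a deliberately simplified guess at what a document symbol looks like, only the Add.N form from the example is handled, and <em>resolve_symbols</em> is a made-up name.</p>

```python
import re

# Simplified full symbol such as A/56/L.28: capital letters, then slash-separated parts.
BASE_SYMBOL = re.compile(r"[A-Z]+(?:/[A-Za-z0-9.]+)+")
# A part that only makes sense relative to a preceding full symbol.
RELATIVE_PART = re.compile(r"Add\.\d+")
# Try the full symbol first, then the relative part.
PART = re.compile(f"{BASE_SYMBOL.pattern}|{RELATIVE_PART.pattern}")


def resolve_symbols(text):
    """Expand relative parts such as Add.1 against the latest full symbol seen."""
    resolved = []
    base = None
    for match in PART.finditer(text):
        token = match.group().rstrip(".")  # drop a sentence-final period
        if RELATIVE_PART.fullmatch(token):
            if base is not None:
                resolved.append(f"{base}/{token}")
        else:
            base = token
            resolved.append(token)
    return resolved
```

<p>For the phrase above, this yields two physical documents, A/56/L.28 and A/56/L.28/Add.1. Deciding that both belong to one logical draft resolution is the part that still needs the Expression/Manifestation distinction.</p>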
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>CiteULike Exhibit visualization</title>
		<link>http://blog.outerthoughts.com/2009/01/citeulike-exhibit-visualization/</link>
		<comments>http://blog.outerthoughts.com/2009/01/citeulike-exhibit-visualization/#comments</comments>
		<pubDate>Wed, 28 Jan 2009 00:57:22 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Problems and Solutions]]></category>
		<category><![CDATA[bibliography]]></category>
		<category><![CDATA[CiteULike]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=270</guid>
		<description><![CDATA[<p>Homegrown visualization is not the only way to quickly navigate CiteULike references. There are other tools that display bibliographies in interesting ways.</p>
<p>One such tool is Exhibit, one of the graduates of the SIMILE project. It lets you build a very interactive webpage driven by just HTML+JavaScript, with no server-side component required. I really like SIMILE&#8217;s tools, even <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/01/citeulike-exhibit-visualization/">CiteULike Exhibit visualization</a></span>]]></description>
			<content:encoded><![CDATA[<p><a title="Previous article on visualizing CiteULike's bibliographies" href="http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/">Homegrown visualization</a> is not the only way to quickly navigate CiteULike references. There are other tools that display bibliographies in interesting ways.</p>
<p>One such tool is <a title="Exhibit and other ex-SIMILE tools" href="http://code.google.com/p/simile-widgets/">Exhibit</a>, one of the graduates of the <a title="SIMILE project's homepage" href="http://simile.mit.edu/">SIMILE</a> project. It lets you build a very interactive webpage driven by just HTML+JavaScript, with no server-side component required. I really like SIMILE&#8217;s tools, even though it feels like development has slowed somewhat recently.</p>
<p>There is <a href="http://simile.mit.edu/wiki/Exhibit/How_to_make_a_publications_exhibit">an example of how to import and display BibTeX within Exhibit</a>. It is not difficult, just a couple of steps. It must have been a popular section, as there is now a dedicated tool for it.</p>
<p><a title="Citeline Exhibit Builder" href="http://citeline.mit.edu/">Citeline Exhibit Builder</a> lets you load in BibTeX and presents an editing interface to customize Exhibit&#8217;s presentation of the publications. It looks great and seems to work well. A nice aspect is that it lets you choose which BibTeX fields to expose as filter facets. With the original tutorial, that would require HTML editing and an understanding of the Exhibit mindset. Citeline nicely hides that from the user.</p>
<p>There were a couple of small problems. Apparently, there is a way to log in and &#8216;claim&#8217; your presentation. I couldn&#8217;t test that, as OpenID authentication failed (something about a nonce). Also, there is a jsMath library but, once the generated Exhibit is downloaded, it fails with cross-server issues. Finally, as with most end-to-end solutions, it does not do data preprocessing/normalization to allow me, for example, to combine author/editor fields for sorting purposes.</p>
<p>Citeline is a very promising tool and I am certainly going to keep it in mind for publishing my bibliographies.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/citeulike-exhibit-visualization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Visualizing CiteULike collections</title>
		<link>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/</link>
		<comments>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/#comments</comments>
		<pubDate>Sun, 25 Jan 2009 07:10:20 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[My PhD research]]></category>
		<category><![CDATA[Problems and Solutions]]></category>
		<category><![CDATA[CiteULike]]></category>
		<category><![CDATA[Graphviz]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=266</guid>
		<description><![CDATA[<p>I am collecting my reading and reference material in CiteULike. I like the service because it can capture details from multiple sources. It also lets you discover what other interesting people have collected, by navigating the graph of tags, people and bookmarks.</p>
<p>Nice as CiteULike is, it is fairly difficult to get an overall picture of one&#8217;s own <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/">Visualizing CiteULike collections</a></span>]]></description>
			<content:encoded><![CDATA[<p>I am collecting my reading and reference material in <a title="My library in CiteULike" href="http://www.citeulike.org/user/arafalov">CiteULike</a>. I like the service because it can capture details from multiple sources. It also lets you discover what other interesting people have collected, by navigating the graph of tags, people and bookmarks.</p>
<p>Nice as CiteULike is, it is fairly difficult to get an overall picture of one&#8217;s own collection. It is especially difficult to see quickly if there are people who serve as hubs by collaborating with multiple different groups. The information is there, but it requires a lot of clicks to find it out.</p>
<p>My usual solution is to export information out, massage it into <a title="Home page of Graphviz" href="http://www.graphviz.org/">Graphviz</a> format and use graph segmentation and layout algorithms to get a better overview. I <a title="Search for my articles mentioning Graphviz" href="http://blog.outerthoughts.com/?s=graphviz">have talked about Graphviz</a> a number of times on this blog before. This is yet another time it proved useful.</p>
<p>I started by exporting the content of my CiteULike library. I found the EndNote export format to be more structured and therefore easier to parse. I then ran it through <a title="My converter" href="http://www.outerthoughts.com/files/paperviz/v1/convert.py">a custom Python program</a> that basically spat out a graph with titles pointing at authors. That produced a <strong>very large</strong> graph and was not particularly useful.</p>
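<p>A minimal sketch of that conversion step, not the linked converter itself. It assumes the export uses the refer-style EndNote tags <em>%T</em> for titles and <em>%A</em> for authors, with a blank line between records; the function names <em>parse_endnote</em>, <em>quote</em> and <em>to_dot</em> are made up for illustration.</p>

```python
from collections import defaultdict


def parse_endnote(text):
    """Split a refer-style EndNote export into records of tag -> list of values."""
    records = []
    current = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Assumption: a blank line ends the current record.
            if current:
                records.append(dict(current))
                current = defaultdict(list)
            continue
        if line.startswith("%") and len(line) > 3:
            tag, value = line[:2], line[3:].strip()
            current[tag].append(value)
    if current:
        records.append(dict(current))
    return records


def quote(label):
    """Quote a label for DOT, escaping backslashes and double quotes."""
    return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_dot(records):
    """Emit a DOT digraph with one edge from each title to each of its authors."""
    lines = ["digraph library {"]
    for record in records:
        for title in record.get("%T", []):
            for author in record.get("%A", []):
                lines.append("  " + quote(title) + " -> " + quote(author) + ";")
    lines.append("}")
    return "\n".join(lines)


sample = (
    "%0 Journal Article\n%T Paper One\n%A Smith, John\n%A Doe, Jane\n"
    "\n"
    "%0 Book\n%T Paper Two\n%A Smith, John\n"
)
print(to_dot(parse_endnote(sample)))
```

<p>The printed text can be saved to a .dot file and passed to the Graphviz tools.</p>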
<p>The next step was to discover disjointed clusters of titles/authors. I used <em>ccomps</em> with -v and -x flags (e.g. <em>ccomps.exe -v -x -o comp.dot output.dot</em>).</p>
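<p>For readers without Graphviz at hand, here is a rough sketch of what splitting a graph into connected components means. It is a plain breadth-first search written for illustration, not how <em>ccomps</em> is implemented, and the function name <em>connected_components</em> is hypothetical.</p>

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group nodes into clusters, treating every edge as undirected."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start.
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(sorted(component))
    # Largest clusters first, since those are usually the interesting ones.
    return sorted(components, key=len, reverse=True)


edges = [("Paper 1", "Smith"), ("Paper 2", "Smith"), ("Paper 3", "Lee")]
for cluster in connected_components(edges):
    print(len(cluster), cluster)
```

<p>Two papers that share an author end up in the same cluster, while a paper with no shared authors forms its own.</p>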
<p><em>ccomps</em> gave me partitioned graphs as well as statistics on the number of nodes and edges in each graph. I could then choose a graph with a large number of nodes/edges (eventually, all of them) and run it through <em>neato</em> with overlap=scale and splines=true (e.g. <em>neato.exe -Tgif -o neato_1.gif -Goverlap=scale -Gsplines=true comp_1.dot</em>).</p>
<p>The resulting graph was still not perfect, but it was a good start. I also tried <em>fdp</em> instead of <em>neato</em>, but that seemed to produce giraffe versions of the graph with graph edges being overly long.</p>
<p>You can see <a title="Output image of one of the clusters" href="http://www.outerthoughts.com/files/paperviz/v1/neato_1.gif">an example</a> of <em>neato</em> output for one of my clusters. Warning: if it causes problems due to its size, try it with <a title="Graphics viewing freeware" href="http://www.irfanview.com/">IrfanView</a>; that program can display even improbably large graphs (e.g. unpartitioned ones).</p>
<p>I also ran into some problems that would either cause partitions to combine or produce duplicate nodes and edges.</p>
<p>The first problem was that sometimes a person was an author and sometimes an editor. I was interested in both, so I collapsed those fields together. That caused some non-people to show up on the graph and connect clusters in unexpected ways. For my library the specific value was &#8216;European&#8217;, so I filtered it out in the code.</p>
<p>The second problem had to do with CiteULike&#8217;s parsing. Sometimes, it would split a first+last name into separate names, probably due to incorrect manual entry at some point. I had to fix those at the source by editing the corresponding CiteULike entry. Probably a good thing to do anyway.</p>
<p>The other problem is right out of the co-reference resolution domain. Sometimes names would include full first names, sometimes only a first name initial. I have worked around that by normalizing all first names to the initials. Obviously, this could collapse entries belonging to multiple real people into one.</p>
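<p>A small sketch of that normalization, assuming names arrive in &#8220;Last, First&#8221; form as they typically do in EndNote exports. The function name <em>normalize_name</em> is hypothetical, and keeping every given-name initial (rather than only the first) is a choice made for this example.</p>

```python
def normalize_name(name):
    """Reduce "Last, First Middle" to "Last, F. M."; leave other shapes alone."""
    if "," not in name:
        # Not in "Last, First" form (for example an organisation); keep as is.
        return name.strip()
    last, given = name.split(",", 1)
    # Treat dots as separators so "J.A." and "J. A." come out the same.
    parts = given.replace(".", " ").split()
    initials = " ".join(part[0].upper() + "." for part in parts)
    return (last.strip() + ", " + initials).strip().rstrip(",")


for raw in ["Smith, John", "Smith, J.", "Smith, John A.", "European"]:
    print(raw, "=", normalize_name(raw))
```

<p>With this, &#8220;Smith, John&#8221; and &#8220;Smith, J.&#8221; map to the same node, which is exactly the over-merging risk mentioned above when two different people share a surname and initial.</p>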
<p>Further on name problems, in cases of non-English names (e.g. Spanish names with multiple surnames), CiteULike would get confused about which part is which and not display or export it correctly. Additionally, sometimes characters such as <strong>ñ</strong> would be entered as plain <strong>n</strong>. Those also needed to be corrected manually.</p>
<p>The project only took a couple of hours including writing code and cleanup. It is already useful to me, as I found a new person who was on an unexpectedly large number of papers and also found a chain of connections that might be interesting to follow more closely.</p>
<p>There is of course a lot more that could be done. Automatic co-reference of misspelt names, layout hints based on number of times authors appeared together, color coding of tags &#8211; these are just some of the easy ideas.</p>
<p>There might even be a small project/paper in doing co-reference resolution and cleaning up CiteULike data. After all, similar projects were done for Wikipedia. I don&#8217;t think CiteULike currently makes a full export available, but they do have <a title="CiteULike's datasets available for research" href="http://www.citeulike.org/faq/data.adp">some</a>, so they might be amenable to exporting a special set for research purposes.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>New mailing list to discuss junction of NLP and Software Engineering</title>
		<link>http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/</link>
		<comments>http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/#comments</comments>
		<pubDate>Sat, 17 Jan 2009 21:20:03 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=260</guid>
		<description><![CDATA[<p>Dr. René Witte has just created a new mailing list (SENLP) to discuss applying NLP techniques to Software Engineering and also to discuss general Software Engineering issues in developing NLP systems.</p>
<p>I am interested in both topics. I did 3 years as senior technical support at BEA and could see how applying NLP techniques on written notes <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/">New mailing list to discuss junction of NLP and Software Engineering</a></span>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.rene-witte.net/">Dr. René Witte</a> has just created a new mailing list (<a title="Introduction to SENLP mailing list" href="http://www.semanticsoftware.info/blog/senlp-mailing-list-connecting-software-engineering-and-nlp">SENLP</a>) to discuss applying NLP techniques to Software Engineering and also to discuss general Software Engineering issues in developing NLP systems.</p>
<p>I am interested in both topics. I did 3 years as senior technical support at BEA and could see how applying NLP techniques to written notes in support cases could have improved the quality of technical support. I did not get to do any of that, but some interest remains.</p>
<p>The second topic is even more interesting and important to me. It can build on discussions currently held on blogs (see &#8216;<a title="Blog entry about Software Engineering and NLP" href="http://www.drni.de/niels/s9y/archives/5-The-USES-Issue.html">The USES Issue</a>&#8217; at Niels Ott&#8217;s blog) and in journals (see &#8216;<a title="Ted Pedersen article on building better NLP software" href="http://www.d.umn.edu/~tpederse/Pubs/pedersen-last-word-2008.pdf">Empiricism Is Not a Matter of Faith</a>&#8217; by Ted Pedersen). While some of the issues are discussed on mailing lists for individual pieces of software, a place to discuss cross-cutting concerns is very welcome.</p>
<p>I have joined the list and hope to see at least some of my readers there as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Where are all legal computational linguistics resources?</title>
		<link>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/</link>
		<comments>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/#comments</comments>
		<pubDate>Wed, 14 Jan 2009 01:41:44 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[Ideas]]></category>
		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=258</guid>
		<description><![CDATA[<p>I am frustrated. I know my corpus (resolutions of the United Nations General Assembly) shares a lot in common with biomedical and legal domain. And I can find interesting articles in biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in legal <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/">Where are all legal computational linguistics resources?</a></span>]]></description>
			<content:encoded><![CDATA[<p>I am frustrated. I know <a href="http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/">my corpus</a> (resolutions of the United Nations General Assembly) has a lot in common with the biomedical and legal domains. And I can find interesting articles in the biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in the legal domain.</p>
<p>I have just gone through all of <a title="Jurix conference" href="http://www.jurix.nl/">Jurix</a>&#8217; proceedings as well as all of <a title="Digital edition of &quot;Artificial Intelligence and Law&quot; journal" href="http://www.springerlink.com/content/100239/">Artificial Intelligence and Law</a> and all I got is <a title="My article set from legal domain" href="http://www.citeulike.org/user/arafalov/tag/legal">between 2 and 4 articles worth following up</a>.</p>
<p>There must be somebody actually trying to parse real legal texts and figuring out how to deal with complex organisation, people and group names. But all I can see are articles dealing with levels from ontology and up.</p>
<p>There might even be money in it!</p>
<p>One of the crazy business ideas I had was to parse all the web-based <em>terms of use</em> and <em>privacy notices</em> and annotate/crowd-vote them for how bad they are. So, before creating a web-based account, I could check it against a database/parser and it would highlight and rate for me passages that I really should pay attention to (e.g. <em>we sell your contact details to every spammer we know</em>). Since the language of those notices is often ritualistically formulaic, extracting an interesting and useful summary would actually be simpler than it looks.</p>
<p>And the business model would center on providing an automatic notification option if a notice from a subscribed website sneakily changed and became much worse. That way one would pay money for peace of mind that there were no unexpected service rule changes.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>
