<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>

<channel>
	<title>Outer Thoughts</title>
	<atom:link href="http://blog.outerthoughts.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.outerthoughts.com</link>
	<description>&#62; From inner thoughts to the outer limits of Alexandre Rafalovitch</description>
	<pubDate>Sun, 01 Nov 2009 02:54:14 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.6.1</generator>
	<language>en</language>
			<item>
		<title>jQuery for multilingual web development</title>
		<link>http://blog.outerthoughts.com/2009/10/jquery-for-multilingual-web-development/</link>
		<comments>http://blog.outerthoughts.com/2009/10/jquery-for-multilingual-web-development/#comments</comments>
		<pubDate>Sun, 01 Nov 2009 02:54:14 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[jQuery]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=278</guid>
		<description><![CDATA[I have (nearly) finished developing a mini-website in 6 languages (Arabic, Chinese, English, French, Russian, Spanish). The layout was the same, so ideally it would have been driven by a content management system. Not in this case unfortunately, as I was not given enough time to set up the infrastructure.
As I know nearly nothing of at [...]]]></description>
			<content:encoded><![CDATA[<p>I have (nearly) finished developing a mini-website in 6 languages (Arabic, Chinese, English, French, Russian, Spanish). The layout was the same, so ideally it would have been driven by a content management system. Not in this case unfortunately, as I was not given enough time to set up the infrastructure.</p>
<p>As I know nearly nothing of at least two of the languages above (Arabic and Chinese), I had to keep rechecking the content provided to ensure the right text ended up in the right place on a page. Google Translate helped with that by back-translating from the other language into English, so I could make sure I got the sentence boundaries right, etc.</p>
<p>However, even with content in the right place, I still needed to visually verify that things were correct. Also, some of the late-arriving changes needed to be implemented for all 6 sets of files. For example, some of the URLs changed, some classes for JavaScript enhancements were added or removed, and so on.</p>
<p>Initially, I tried to check things in the editor by using regular expressions. This worked for basic things, but as the project progressed and the markup (and JavaScript enhancements) became more complex, regular expressions were no longer sufficient. I needed something that understood HTML structure and was easy to run interactively.</p>
<p>I was already using jQuery for progressive enhancement, and my Firefox always has Firebug set up. And I have been poking at random web pages with the jQuerify bookmarklet for ages. But with this project, the jQuery+Firebug combination has graduated to a first-class development and troubleshooting toolkit, specifically for multilingual content.</p>
<p>Here are a couple of basic queries I ran in the Firebug console window:</p>
<ul>
<li>I had most of the links going to a new window and needed to check I did not miss a target attribute: <em><span class="status-body"><span class="entry-content">$("a[target != '_blank']")</span></span></em></li>
<li><span class="status-body"><span class="entry-content">When comparing languages side-by-side, I needed to see whether the links were the same. The easiest way to do that was by looking at where those links were actually pointing. I could of course select an element with Firebug to see all of its content, but it was easier to print a particular attribute automatically when I hovered over a link with the mouse: <em>$("a").mouseenter(function(){console.log( $(this).attr('href'));})</em></span></span></li>
<li><span class="status-body"><span class="entry-content">If I quickly needed to check which elements were affected by a particular class, I would just highlight them:<em> $(".NYOnly").css("background-color", "red")</em></span></span></li>
</ul>
<p><span class="status-body"><span class="entry-content"><a title="A video of using jQuery and Firebug" href="http://encosia.com/2009/09/21/updated-see-how-i-used-firebug-to-learn-jquery/">None of these are hidden secrets</a>; however, it may not always be obvious what can be done and how far a couple of lines of jQuery code can go. Here is an example that gets pasted right into the Firebug console window. It uses the <a title="jQuery extension for Google Translation API" href="http://code.google.com/p/jquery-translate/">jQuery-translate</a> extension to hook into the Google Translate API and prints out the translated content of a table cell that is clicked on:</span></span></p>
<blockquote><p><code>$.getScript('http://jquery-translate.googlecode.com/files/jquery.translate-1.3.9.min.js');</code></p>
<p>$("td").click(function(){<br />
$(this).translate('ar', 'en', {<br />
replace: false,<br />
each: function(i){console.log( this.translation[i] ) }<br />
})<br />
});</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/10/jquery-for-multilingual-web-development/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Making up with ANTLR</title>
		<link>http://blog.outerthoughts.com/2009/05/making-up-with-antlr/</link>
		<comments>http://blog.outerthoughts.com/2009/05/making-up-with-antlr/#comments</comments>
		<pubDate>Fri, 29 May 2009 02:25:16 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[My PhD research]]></category>

		<category><![CDATA[Problems and Solutions]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=275</guid>
		<description><![CDATA[I like ANTLR! It is a specialized tool that can really be applied to many difficult tasks when regular expressions get all Dust Puppy like. And I have used it in the past with great success.
But, every time I put this particular tool aside, I know that picking it back up will be like making [...]]]></description>
			<content:encoded><![CDATA[<p>I like <a title="ANTLR's home page" href="http://antlr.org/">ANTLR</a>! It is a specialized tool that can really be applied to many difficult tasks when regular expressions get all <a title="Explanation of Dust Puppy" href="http://www.userfriendly.org/cartoons/dustpuppy/">Dust Puppy</a> like. And I have used it in the past with great success.</p>
<p>But, every time I put this particular tool aside, I know that picking it back up will be like making up after a bad breakup. Things feel familiar, but you are still so uncomfortable you cannot get anything working. Only knowing how great the tool is underneath makes me go through the effort of re-familiarization.</p>
<p>I just downloaded ANTLR 3.1.2 bundled with its own GUI, ANTLRWorks, which offers visual diagrams, a debugger and templates. You would think that would make for an easy out-of-the-box experience. You would be wrong.</p>
<p>You start the GUI and end up facing a blank screen. Lots of options and tabs for sure, but the only easy starting point seems to be &#8216;Insert rule from template&#8217;.</p>
<p>Ok, so here are a couple of rules from templates, trying to parse the &#8220;Hello World!&#8221; string:</p>
<blockquote><p>ID    :    LETTER (LETTER | DIGIT)*<br />
;<br />
LETTER<br />
:    'a'..'z' | 'A'..'Z'<br />
;</p>
<p>DIGIT    :    '0'..'9'<br />
;</p>
<p>WS    :    (' ' | '\t' | '\n' | '\r') { $setType(Token.SKIP); }<br />
;</p></blockquote>
<p>Not good. We are missing a start state apparently. Ok, let&#8217;s add one:</p>
<blockquote><p>hello    :    ID ID '!'<br />
;</p></blockquote>
<p>Still no good. Start looking at examples, trying to see what bits are compulsory. Ok, the word grammar is missing at the top of the file. Of course, I have both grammar and lexer elements now in one file (ANTLR 3 feature, I believe), but let&#8217;s not worry about deep meaning here.</p>
<blockquote><p>grammar test;</p></blockquote>
<p>Now, suddenly, the syntax diagram starts showing up. Let&#8217;s try saving (as test.g) and compiling. No good:</p>
<blockquote><p>The following token definitions can never be matched because prior tokens match the same input: LETTER</p></blockquote>
<p>So much for following a template. More digging in examples. Memory really starts to bring back the <a title="Seminal book on Compiler technologies" href="http://dragonbook.stanford.edu/">Dragon Book</a>&#8217;s lessons. What&#8217;s the problem with LETTER, and what is the <em>prior token</em> here? Ah, we don&#8217;t want the lexer to return LETTER (or DIGIT), only ID. So, LETTER and DIGIT are both token fragments, not tokens. Add <em>fragment</em> in front of both definitions. All good?</p>
<p>Nope! Now we have a problem with:</p>
<blockquote><p>attribute is not a token, parameter, or return value: setType</p></blockquote>
<p>But I did not write <em>setType</em>, the template provided it! Back to the examples! Apparently, somewhere along the way Skip tokens have gone away and we now have hidden channels instead. Swap that bit with one from an example and try again.</p>
<p>SUCCESS. Switch to interpreter, enter &#8220;Hello World!&#8221; in input box and run <em>hello</em> rule. Beauty, we have a parse diagram.</p>
<p>The final running grammar example is here:</p>
<blockquote><p>grammar test;</p>
<p>hello    :    ID ID '!'<br />
;</p>
<p>ID    :    LETTER (LETTER | DIGIT)*<br />
;<br />
fragment LETTER<br />
:    'a'..'z' | 'A'..'Z'<br />
;</p>
<p>fragment DIGIT    :    '0'..'9'<br />
;</p>
<p>WS    :    (' ' | '\t' | '\n' | '\r') {  $channel = HIDDEN;  }<br />
;</p></blockquote>
<p>Hello World! Now, on to the real grammar and (if things really, really work) GATE integration&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/05/making-up-with-antlr/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Conjunctions in named entities</title>
		<link>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/</link>
		<comments>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/#comments</comments>
		<pubDate>Fri, 27 Mar 2009 02:34:52 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=273</guid>
		<description><![CDATA[A recent article on lingpipe discussed conjuncted named entities such as Johnson and Johnson and Wallace and Gromit. They suggest that maybe a way of treating this is as a frozen expression. I assume that means relying on statistical measures to see this Multi-Word-Expression repeating enough times to be treated as a unit.
In the United [...]]]></description>
			<content:encoded><![CDATA[<p>A <a title="Lingpipe's article on conjunctions in named entities" href="http://lingpipe-blog.com/2009/03/26/joint-referential-uncertainty-the-wallace-and-gromit-dilemma/">recent article on lingpipe</a> discussed conjoined named entities such as <span style="text-decoration: underline;">Johnson and Johnson</span> and <span style="text-decoration: underline;">Wallace and Gromit</span>. They suggest that maybe a way of treating this is as a frozen expression. I assume that means relying on statistical measures to see this Multi-Word-Expression repeating enough times to be treated as a unit.</p>
<p>In the United Nations corpus, things can get even more interesting. Let&#8217;s look at a relatively easy example: <em><span style="text-decoration: underline;">draft resolution A/56/L.28 and Add.1</span></em>.</p>
<p>Is this one document (one draft resolution) or two? And if two, then which two? The first one is obviously <span style="text-decoration: underline;">A/56/L.28</span>. But <span style="text-decoration: underline;">Add.1</span> is not a valid document symbol on its own; is it actually an (additive?) coreference to the first one that resolves to <span style="text-decoration: underline;">A/56/L.28/Add.1</span>?</p>
<p>The answer (as good as I can make it so far) could lie in the <a title="Introduction to FRBR" href="http://techessence.info/frbr">FRBR</a> distinction between Expression and Manifestation. A resolution is an expression of Member States&#8217; proposals and negotiations. To some degree, it evolves over several meetings. However, between the discussions, the latest version or changes need to be reported, both to make sure they are formally registered and to ensure the next round of discussions has the latest documents to work from.</p>
<p>In our case, the first time the draft resolution had to be presented, it was published under <span style="text-decoration: underline;">A/56/L.28</span> (which incidentally means limited distribution document 28 of the General Assembly&#8217;s 56th regular session). So, the initial Manifestation of the draft resolution became this physical document with a distinct symbol assigned.</p>
<p>But apart from its text, a draft resolution has a list of sponsoring Member States. That list can change as the draft resolution gains sponsors. These additional sponsors were in the Addendum <span style="text-decoration: underline;">A/56/L.28/Add.1</span>. But the addendum does not make sense without the original document, so actually both physical documents represent one logical draft resolution, which is reflected in the grammar of the text (draft resolution, not resolution<span style="text-decoration: underline;">s</span>).</p>
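<p>The symbol-level part of this (expanding a bare suffix against the preceding full symbol) can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name, the suffix pattern (Add., Corr., Rev.) and the splitting on &#8220;and&#8221;/commas are hypothetical choices, not a complete model of document symbols, and it says nothing about whether the result is one logical document or two.</p>

```python
import re

# Assumed pattern for a relative suffix such as "Add.1" (hypothetical, not exhaustive).
SUFFIX = re.compile(r"^(Add|Corr|Rev)\.\d+$")

def resolve_symbols(mention):
    """Split a conjoined mention and expand bare suffixes against the last full symbol."""
    resolved = []
    base = None
    for part in re.split(r"\s+and\s+|\s*,\s*", mention.strip()):
        if SUFFIX.match(part):
            if base is None:
                raise ValueError("suffix without a preceding full symbol: " + part)
            # A bare suffix refers back to the most recent full symbol.
            resolved.append(base + "/" + part)
        else:
            base = part
            resolved.append(part)
    return resolved

print(resolve_symbols("A/56/L.28 and Add.1"))
# prints ['A/56/L.28', 'A/56/L.28/Add.1']
```

<p>Note that the sketch returns two physical symbols; grouping them into a single draft resolution is exactly the annotation question that remains open.</p>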
<p>What this means for named entity annotations and for recognition algorithms is hard to say and is something I am looking at in my PhD research.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/feed/</wfw:commentRss>
		</item>
		<item>
		<title>CiteULike Exhibit visualization</title>
		<link>http://blog.outerthoughts.com/2009/01/citeulike-exhibit-visualization/</link>
		<comments>http://blog.outerthoughts.com/2009/01/citeulike-exhibit-visualization/#comments</comments>
		<pubDate>Wed, 28 Jan 2009 00:57:22 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[Problems and Solutions]]></category>

		<category><![CDATA[bibliography]]></category>

		<category><![CDATA[CiteULike]]></category>

		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=270</guid>
		<description><![CDATA[Homegrown visualization is not the only way to quickly navigate CiteULike references. There are other tools that display bibliographies in interesting ways.
One such tool is Exhibit, one of the graduates of the SIMILE project. It lets you build a very interactive web page driven by just HTML+JavaScript, with no server-side component required. I really like SIMILE&#8217;s tools, [...]]]></description>
			<content:encoded><![CDATA[<p><a title="Previous article on visualizing CiteULike's bibliographies" href="http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/">Homegrown visualization</a> is not the only way to quickly navigate CiteULike references. There are other tools that display bibliographies in interesting ways.</p>
<p>One such tool is <a title="Exhibit and other ex-SIMILE tools" href="http://code.google.com/p/simile-widgets/">Exhibit</a>, one of the graduates of the <a title="SIMILE project's homepage" href="http://simile.mit.edu/">SIMILE</a> project. It lets you build a very interactive web page driven by just HTML+JavaScript, with no server-side component required. I really like SIMILE&#8217;s tools, even though it feels like development has slowed somewhat recently.</p>
<p>There is <a href="http://simile.mit.edu/wiki/Exhibit/How_to_make_a_publications_exhibit">an example of how to import and display BibTeX within Exhibit</a>. It is not difficult, just a couple of steps. It must have been a popular section, as there is now a dedicated new tool for it.</p>
<p><a title="Citeline Exhibit Builder" href="http://citeline.mit.edu/">Citeline Exhibit Builder</a> lets you load in BibTeX and presents an editing interface to customize Exhibit&#8217;s presentation of the publications. It looks great and seems to work well. A nice aspect is that it lets you choose which BibTeX fields to expose as filter facets. With the original tutorial, that would require HTML editing and understanding the Exhibit mindset. Citeline nicely hides that from the user.</p>
<p>There were a couple of small problems. Apparently, there is a way to log in and &#8216;claim&#8217; your presentation. I couldn&#8217;t test that as OpenID authentication failed (something about a nonce). Also, there is a jsMath library but, once the generated Exhibit is downloaded, it fails with cross-server issues. Finally, as with most end-to-end solutions, it does not do data preprocessing/normalization to allow me, for example, to combine author/editor fields for sorting purposes.</p>
<p>Citeline is a very promising tool and I am certainly going to keep it in mind for publishing my bibliographies.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/citeulike-exhibit-visualization/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Visualizing CiteULike collections</title>
		<link>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/</link>
		<comments>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/#comments</comments>
		<pubDate>Sun, 25 Jan 2009 07:10:20 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[My PhD research]]></category>

		<category><![CDATA[Problems and Solutions]]></category>

		<category><![CDATA[CiteULike]]></category>

		<category><![CDATA[Graphviz]]></category>

		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=266</guid>
		<description><![CDATA[I am collecting my reading and reference material in CiteULike. I like the service because it can capture details from multiple sources. It also lets me discover what was collected by other interesting people through tags, people and bookmarks graph navigation.
Nice as CiteULike is, it is fairly difficult to get an overall picture of one&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>I am collecting my reading and reference material in <a title="My library in CiteULike" href="http://www.citeulike.org/user/arafalov">CiteULike</a>. I like the service because it can capture details from multiple sources. It also lets me discover what was collected by other interesting people through tags, people and bookmarks graph navigation.</p>
<p>Nice as CiteULike is, it is fairly difficult to get an overall picture of one&#8217;s own collection. It is especially difficult to see quickly if there are people who serve as hubs by collaborating with multiple different groups. The information is there, but it requires a lot of clicks to find it out.</p>
<p>My usual solution is to export information out, massage it into <a title="Home page of Graphviz" href="http://www.graphviz.org/">Graphviz</a> format and use graph segmentation and layout algorithms to get a better overview. I <a title="Search for my articles mentioning Graphviz" href="http://blog.outerthoughts.com/?s=graphviz">have talked about Graphviz</a> a number of times on this blog before. This is yet another time it proved useful.</p>
<p>I started by exporting CiteULike&#8217;s content of my library. I found the Endnote export format to be more structured and therefore easier to parse. I then ran it through <a title="My converter" href="http://www.outerthoughts.com/files/paperviz/v1/convert.py">a custom Python program</a> that basically spat out a graph with titles pointing at authors. That produced a <strong>very large</strong> graph and was not particularly useful.</p>
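<p>To give an idea of the output side of such a converter, here is a minimal sketch of turning (title, authors) pairs into Graphviz DOT text. It is not the linked convert.py: the function names, the graph name and the sample record are made up for illustration, and parsing the Endnote export is left out entirely.</p>

```python
def quote(label):
    """Wrap a label in double quotes for DOT, escaping embedded double quotes."""
    return '"' + label.replace('"', '\\"') + '"'

def to_dot(records):
    """Render (title, [authors]) pairs as a directed graph, titles pointing at authors."""
    lines = ["digraph papers {"]
    for title, authors in records:
        # Draw titles as boxes so they stand apart from author nodes.
        lines.append("  " + quote(title) + " [shape=box];")
        for author in authors:
            lines.append("  " + quote(title) + " -> " + quote(author) + ";")
    lines.append("}")
    return "\n".join(lines)

# Hypothetical sample record, not taken from the real library.
sample = [("Some Paper Title", ["Smith, J.", "Doe, A."])]
print(to_dot(sample))
```

<p>Because every author string becomes a node name, two papers sharing an author end up connected through that node, which is what makes the clusters visible later.</p>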
<p>The next step was to discover disjoint clusters of titles/authors. I used <em>ccomps</em> with the -v and -x flags (e.g. <em>ccomps.exe -v -x -o comp.dot output.dot</em>).</p>
<p><em>ccomps</em> gave me partitioned graphs as well as statistics on the number of nodes/edges in each graph. I could then choose a graph with a large number of nodes/edges (eventually, all of them) and run it through <em>neato</em> with overlap=scale and splines=true (e.g. <em>neato.exe -Tgif -o neato_1.gif -Goverlap=scale -Gsplines=true comp_1.dot</em>).</p>
<p>The resulting graph was still not perfect, but it was a good start. I also tried <em>fdp</em> instead of <em>neato</em>, but that seemed to produce giraffe versions of the graph with graph edges being overly long.</p>
<p>You can see <a title="Output image of one of the clusters" href="http://www.outerthoughts.com/files/paperviz/v1/neato_1.gif">an example</a> of <em>neato</em> output for one of my clusters. Warning: if it causes problems due to its size, try it with <a title="Graphics viewing freeware" href="http://www.irfanview.com/">IrfanView</a>; that program can display even improbably large graphs (e.g. unpartitioned ones).</p>
<p>I also ran into some problems that would either cause partitions to merge or produce duplicate nodes and edges.</p>
<p>The first problem was that sometimes a person was an author and sometimes an editor. I was interested in both, so I collapsed those fields together. That caused some non-people to show up on the graph and connect clusters in unexpected ways. For my library the specific value was &#8216;European&#8217;, so I filtered it out in the code.</p>
<p>The second problem had to do with CiteULike&#8217;s parsing. Sometimes, it would split a first+last name into separate names, probably due to incorrect manual entry at some point. I had to fix those at the source by editing the corresponding CiteULike entry. Probably a good thing to do anyway.</p>
<p>The third problem is right out of the co-reference resolution domain. Sometimes names would include full first names, sometimes only a first-name initial. I worked around that by normalizing all first names to initials. Obviously, this could collapse entries belonging to multiple real people into one.</p>
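<p>The initials workaround can be sketched roughly as follows. This assumes names arrive as &#8220;Last, First Middle&#8221;, which is an assumption about the data rather than something the export guarantees; the function name and the sample names are hypothetical.</p>

```python
def normalize_name(name):
    """Reduce 'Last, First Middle' to 'Last, F. M.'; leave comma-less values untouched."""
    last, sep, first = name.partition(",")
    if not sep:
        # No comma: cannot tell the parts apart, so only trim whitespace.
        return name.strip()
    # Treat existing dots as separators so 'J.' and 'John' both yield 'J.'.
    initials = [part[0].upper() + "." for part in first.replace(".", " ").split()]
    if not initials:
        return last.strip()
    return last.strip() + ", " + " ".join(initials)

print(normalize_name("Smith, John"))   # prints Smith, J.
print(normalize_name("Smith, J."))     # prints Smith, J.
print(normalize_name("Smith, Jane"))   # prints Smith, J.
```

<p>The last call shows the caveat in practice: a different person with the same surname and first initial lands on the same graph node.</p>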
<p>Further on name problems: in cases of non-English names (e.g. Spanish names with multiple surnames), CiteULike would get confused about which part is which and not display or export the name correctly. Additionally, sometimes characters such as <strong>ñ</strong> would be entered as plain <strong>n</strong>. Those also needed to be corrected manually.</p>
<p>The project only took a couple of hours including writing code and cleanup. It is already useful to me, as I found a new person who appeared in an unexpectedly large number of papers and also found a chain of connections that might be interesting to follow more closely.</p>
<p>There is of course a lot more that could be done. Automatic co-reference of misspelt names, layout hints based on the number of times authors appeared together, color coding of tags - these are just some of the easy ideas.</p>
<p>There might even be a small project/paper in doing co-reference resolution and cleaning up CiteULike data? After all, similar projects were done for Wikipedia. I don&#8217;t think CiteULike currently makes a full export available, but they do have <a title="CiteULike's datasets available for research" href="http://www.citeulike.org/faq/data.adp">some</a>, so they might be amenable to exporting a special set for research purposes.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/feed/</wfw:commentRss>
		</item>
		<item>
		<title>New mailing list to discuss junction of NLP and Software Engineering</title>
		<link>http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/</link>
		<comments>http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/#comments</comments>
		<pubDate>Sat, 17 Jan 2009 21:20:03 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=260</guid>
		<description><![CDATA[Dr. René Witte has just created a new mailing list (SENLP) to discuss applying NLP techniques to Software Engineering and also to discuss general Software Engineering issues in developing NLP systems.
I am interested in both topics. I did 3 years as senior technical support at BEA and could see how applying NLP techniques on written [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.rene-witte.net/">Dr. René Witte</a> has just created a new mailing list (<a title="Introduction to SENLP mailing list" href="http://www.semanticsoftware.info/blog/senlp-mailing-list-connecting-software-engineering-and-nlp">SENLP</a>) to discuss applying NLP techniques to Software Engineering and also to discuss general Software Engineering issues in developing NLP systems.</p>
<p>I am interested in both topics. I did 3 years as senior technical support at BEA and could see how applying NLP techniques to written notes in support cases could have improved the quality of technical support. I did not get to do any of that, but some interest remains.</p>
<p>The second topic is even more interesting and important to me. It can build on discussions currently held on blogs (see &#8216;<a title="Blog entry about Software Engineering and NLP" href="http://www.drni.de/niels/s9y/archives/5-The-USES-Issue.html">The USES Issue</a>&#8217; at Niels Ott&#8217;s blog) and in journals (see &#8216;<a title="Ted Pedersen article on building better NLP software" href="http://www.d.umn.edu/~tpederse/Pubs/pedersen-last-word-2008.pdf">Empiricism Is Not a Matter of Faith</a>&#8217; by Ted Pedersen). While some of the issues are discussed on mailing lists for individual pieces of software, a place to discuss cross-cutting concerns is very welcome.</p>
<p>I have joined the list and hope to see at least some of my readers there as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Where are all legal computational linguistics resources?</title>
		<link>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/</link>
		<comments>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/#comments</comments>
		<pubDate>Wed, 14 Jan 2009 01:41:44 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<category><![CDATA[Ideas]]></category>

		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=258</guid>
		<description><![CDATA[I am frustrated. I know my corpus (resolutions of the United Nations General Assembly) has a lot in common with the biomedical and legal domains. And I can find interesting articles in the biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in [...]]]></description>
			<content:encoded><![CDATA[<p>I am frustrated. I know <a href="http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/">my corpus</a> (resolutions of the United Nations General Assembly) has a lot in common with the biomedical and legal domains. And I can find interesting articles in the biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in the legal domain.</p>
<p>I have just gone through all of <a title="Jurix conference" href="http://www.jurix.nl/">Jurix</a>&#8217; proceedings as well as all of <a title="Digital edition of &quot;Artificial Intelligence and Law&quot; journal" href="http://www.springerlink.com/content/100239/">Artificial Intelligence and Law</a> and all I got is <a title="My article set from legal domain" href="http://www.citeulike.org/user/arafalov/tag/legal">between 2 and 4 articles worth following up</a>.</p>
<p>There must be somebody actually trying to parse real legal texts and figuring out how to deal with complex organisation, people and group names. But all I can see are articles dealing with levels from ontology and up.</p>
<p>There might even be money in it!</p>
<p>One of the crazy business ideas I had was to parse all the web-based <em>terms of use</em> and <em>privacy notices</em> and annotate/crowd-vote them for how bad they are. So, before creating a web-based account, I could check it against the database/parser and it would highlight and rate for me passages that I really should pay attention to (e.g. <em>we sell your contact details to every spammer we know</em>). Since the language of those notices is often ritualistically formulaic, extracting an interesting and useful summary would actually be simpler than it looks.</p>
<p>And the business model would center on providing an automatic notification option if a notice from a subscribed website sneakily changed and became much worse. That way one would pay money for peace of mind that there were no unexpected service rule changes.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Weird Wired Magazine (or maybe just stupid)</title>
		<link>http://blog.outerthoughts.com/2009/01/weird-wired-mag/</link>
		<comments>http://blog.outerthoughts.com/2009/01/weird-wired-mag/#comments</comments>
		<pubDate>Sat, 10 Jan 2009 02:19:21 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[Weird Stuff]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=255</guid>
		<description><![CDATA[I really do not get Wired Magazine&#8217;s subscription policy. They are supposed to target smart geeks, yet make really stupid moves.
I used to be a subscriber. But I got annoyed by a large number of ads, deliberate and unnecessary foul language and subscription inserts advertising $8 new subscriptions. So, I did not renew early.
Renewal notices [...]]]></description>
			<content:encoded><![CDATA[<p>I really do not get Wired Magazine&#8217;s subscription policy. They are supposed to target smart geeks, yet make really stupid moves.</p>
<p>I used to be a subscriber. But I got annoyed by a large number of ads, deliberate and unnecessary foul language and subscription inserts advertising $8 new subscriptions. So, I did not renew early.</p>
<p>Renewal notices started arriving and that&#8217;s where things got weird. The first renewal notice was for $12, not $8 as I was expecting. I figured they may have pushed the prices up and that the next magazine&#8217;s subscription insert would be $15 with a renewal offer at $12. Nope! The next magazine still had 6 inserts with the $8 offer.</p>
<p>The next renewal notice had a slightly panicked tone. It called me a valuable subscriber. But it was still $12. The third notice appealed to my need as a professional to know what Wired thinks is cool. It obviously thought I was too busy a professional to remember price point differences.</p>
<p>Finally, they got to me. I did not renew, but instead emailed their subscription support asking about the price difference. Tammi from Wired replied:</p>
<blockquote><p>Thank you for contacting us concerning a lower subscription price that you have recently seen. We have many different offers to attract new subscribers.  These offers can also be available to you.  Please respond with your special offer information and we will be happy to enter your subscription.</p></blockquote>
<p>That was so weird. I eventually got a reply from their Consumer Marketing Director:</p>
<blockquote><p>Your recent email was routed to my office.  I regret that you were offended by seeing an $8 per year subscription offer on an insert card in your copy of Wired.  The explanation is that we periodically test different price points, both higher and lower,  and that is why your subscription copy carried that more discounted offer.  Let me assure you that had you requested the 12/$8 offer, we would have honored your renewal order at that rate rather than the higher priced offer you received with your renewal communication.   We do value our long term subscribers and as a courtesy, along with my apologies, I would be happy to offer you a complimentary year’s subscription.</p></blockquote>
<p>I thought about it and figured I was really bothered by their version of <em>valuing long term subscribers</em>, so I refused to renew at any price. My complaint was never about price, just about being treated as a captive audience, and a stupid one.</p>
<p>That was 6 months ago. Yesterday I got another email from them. They want me back. At $12/year. The online price is $10. This is now SPAM. I made sure GMail knows that.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/weird-wired-mag/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Explaining Computational Linguistics to friends and family</title>
		<link>http://blog.outerthoughts.com/2008/08/explaining-computational-linguistics-to-friends-and-family/</link>
		<comments>http://blog.outerthoughts.com/2008/08/explaining-computational-linguistics-to-friends-and-family/#comments</comments>
		<pubDate>Tue, 26 Aug 2008 02:41:43 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[Computational Linguistics]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=253</guid>
		<description><![CDATA[It is hard enough to explain what we are doing to our professors; explaining it in plain English to our friends and family is nearly impossible.
So it is always good to see people who can explain what a POS tagger is and why it is important without having to throw around references to Norvig or Jurafsky.
Markus [...]]]></description>
			<content:encoded><![CDATA[<p>It is hard enough to explain what we are doing to our professors; explaining it in <em>plain English</em> to our friends and family is nearly impossible.</p>
<p>So it is always good to see people who can explain what a POS tagger is and why it is important without having to throw around references to Norvig or Jurafsky.</p>
<p>Markus Dickinson has managed exactly that kind of explanation in his <a href="http://jones.ling.indiana.edu/~mdickinson/papers/budapest.html">non-linguistic primer</a> to a serious research paper on <a href="http://jones.ling.indiana.edu/%7Emdickinson/papers/dickinson-meurers-03.html">Detecting Errors in Part-of-Speech Annotation</a>. The writing is quite old (2003), but it reads well and still feels relevant. Of course, his research page contains more recent papers on the same topic too.</p>
<p>(via <a href="http://blogamundo.net/dev/2008/08/22/a-plain-english-description-of-a-computational-linguistics-thesis/">Hacklog</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/08/explaining-computational-linguistics-to-friends-and-family/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Vista repeatedly dropping wireless connection - solution</title>
		<link>http://blog.outerthoughts.com/2008/06/vista-repeatedly-dropping-wireless-connection-solution/</link>
		<comments>http://blog.outerthoughts.com/2008/06/vista-repeatedly-dropping-wireless-connection-solution/#comments</comments>
		<pubDate>Sat, 14 Jun 2008 12:44:15 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
		
		<category><![CDATA[Problems and Solutions]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2008/06/vista-repeatedly-dropping-wireless-connection-solution/</guid>
		<description><![CDATA[I am visiting my parents and connecting to their network via a wireless router. My laptop, which is (still!) running Vista, kept dropping the wireless connection every couple of minutes and then reconnecting. Interestingly, the other computers connected to the same router had no problems.
I could not figure out where to even start troubleshooting this issue, until [...]]]></description>
			<content:encoded><![CDATA[<p>I am visiting my parents and connecting to their network via a wireless router. My laptop, which is (still!) running Vista, kept dropping the wireless connection every couple of minutes and then reconnecting. Interestingly, the other computers connected to the same router had no problems.</p>
<p>I could not figure out where to even start troubleshooting this issue, until I noticed that the problem only happened while I was running on battery and not when I was connected to the mains. Once I noticed that, the solution was simple: the power management module must have been too eager, turning off the wireless after 30 seconds of inactivity. Given that I was trying to read emails or webpages, that would occur fairly regularly.</p>
<p>The fix is to go to the power-management control panel and adjust the on-battery behaviour to match the full-power one. I am putting this out because an hour of searching for this problem online did not turn up any results. I hope the next person to be flummoxed by this repeated connection loss will find my blog entry fast.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/06/vista-repeatedly-dropping-wireless-connection-solution/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>
