<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Outer Thoughts &#187; My PhD research</title>
	<atom:link href="http://blog.outerthoughts.com/category/my-phd-research/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.outerthoughts.com</link>
	<description>&#62; From inner thoughts to the outer limits of Alexandre Rafalovitch</description>
	<lastBuildDate>Wed, 27 Jul 2011 00:24:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.4</generator>
		<item>
		<title>My guest post about uncorpora project at TAUS blog</title>
		<link>http://blog.outerthoughts.com/2011/01/my-guest-post-about-uncorpora-project-at-taus-blog/</link>
		<comments>http://blog.outerthoughts.com/2011/01/my-guest-post-about-uncorpora-project-at-taus-blog/#comments</comments>
		<pubDate>Wed, 26 Jan 2011 23:54:00 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=399</guid>
		<description><![CDATA[<p>I was asked to guest blog for TAUS about my research/work project UNCORPORA. The article has now gone live. It might appeal to people interested in UN languages or natural language processing and, by following the links, to XML geeks.</p> <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2011/01/my-guest-post-about-uncorpora-project-at-taus-blog/">My guest post about uncorpora project at TAUS blog</a></span>]]></description>
			<content:encoded><![CDATA[<p>I was asked to guest blog for TAUS about my research/work project <a title="My project on: Corpora of the United Nations for the research purposes" href="http://www.uncorpora.org/">UNCORPORA</a>. <a title="Guest post at TAUS about uncorpora.org project" href="http://bit.ly/fjKiI8">The article</a> has now gone live. It might appeal to people interested in UN languages or natural language processing and, by following the links, to XML geeks.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2011/01/my-guest-post-about-uncorpora-project-at-taus-blog/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Making up with ANTLR</title>
		<link>http://blog.outerthoughts.com/2009/05/making-up-with-antlr/</link>
		<comments>http://blog.outerthoughts.com/2009/05/making-up-with-antlr/#comments</comments>
		<pubDate>Fri, 29 May 2009 02:25:16 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[My PhD research]]></category>
		<category><![CDATA[Problems and Solutions]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=275</guid>
		<description><![CDATA[<p>I like ANTLR! It is a specialized tool that can really be applied to many difficult tasks when regular expressions get all Dust Puppy like. And I have used it in the past with great success.</p> <p>But, every time I put this particular tool aside, I know that picking it back up will be like <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/05/making-up-with-antlr/">Making up with ANTLR</a></span>]]></description>
			<content:encoded><![CDATA[<p>I like <a title="ANTLR's home page" href="http://antlr.org/">ANTLR</a>! It is a specialized tool that can really be applied to many difficult tasks when regular expressions get all <a title="Explanation of Dust Puppy" href="http://www.userfriendly.org/cartoons/dustpuppy/">Dust Puppy</a> like. And I have used it in the past with great success.</p>
<p>But every time I put this particular tool aside, I know that picking it back up will be like making up after a bad break-up. Things feel familiar, but you are still so uncomfortable that you cannot get anything working. Only knowing how great the tool is underneath makes me go through the effort of re-familiarization.</p>
<p>I just downloaded ANTLR 3.1.2 bundled with its own GUI, ANTLRWorks, which offers visual diagrams, a debugger and templates. You would think that would make for an easy out-of-the-box experience. You would be wrong.</p>
<p>You start the GUI and end up facing a blank screen. There are lots of options and tabs for sure, but the only easy starting point seems to be &#8216;Insert rule from template&#8217;.</p>
<p>Ok, so here are a couple of rules from templates, trying to parse the &#8220;Hello World!&#8221; string:</p>
<blockquote><p>ID    :    LETTER (LETTER | DIGIT)*<br />
;<br />
LETTER<br />
:    'a'..'z' | 'A'..'Z'<br />
;</p>
<p>DIGIT    :    '0'..'9'<br />
;</p>
<p>WS    :    (' ' | '\t' | '\n' | '\r') { $setType(Token.SKIP); }<br />
;</p></blockquote>
<p>Not good. We are missing a start state apparently. Ok, let&#8217;s add one:</p>
<blockquote><p>hello    :    ID ID '!'<br />
;</p></blockquote>
<p>Still no good. Start looking at examples, trying to see which bits are compulsory. Ok, the <em>grammar</em> declaration is missing at the top of the file. Of course, I now have both parser and lexer rules in one file (an ANTLR 3 feature, I believe), but let&#8217;s not worry about deep meaning here.</p>
<blockquote><p>grammar test;</p></blockquote>
<p>Now, suddenly, the syntax diagram starts showing up. Let&#8217;s try saving (as test.g) and compiling. No good:</p>
<blockquote><p>The following token definitions can never be matched because prior tokens match the same input: LETTER</p></blockquote>
<p>So much for following a template. More digging in examples. Memory really starts to bring back the <a title="Seminal book on Compiler technologies" href="http://dragonbook.stanford.edu/">Dragon Book</a>&#8217;s lessons. What&#8217;s the problem with LETTER, and which is the <em>prior token</em> here? Ah, we don&#8217;t want the lexer to return LETTER (or DIGIT), only ID. So LETTER and DIGIT are both token fragments, not tokens. Add <em>fragment</em> in front of both definitions. All good?</p>
<p>Nope! Now we have a problem with:</p>
<blockquote><p>attribute is not a token, parameter, or return value: setType</p></blockquote>
<p>But I did not write <em>setType</em>, the template provided it! Back to the examples! Apparently, somewhere along the way Skip tokens have gone away and we now have hidden channels instead. Swap that bit with one from an example and try again.</p>
<p>SUCCESS. Switch to the interpreter, enter &#8220;Hello World!&#8221; in the input box and run the <em>hello</em> rule. Beauty, we have a parse diagram.</p>
<p>The final running grammar example is here:</p>
<blockquote><p>grammar test;</p>
<p>hello    :    ID ID '!'<br />
;</p>
<p>ID    :    LETTER (LETTER | DIGIT)*<br />
;<br />
fragment LETTER<br />
:    'a'..'z' | 'A'..'Z'<br />
;</p>
<p>fragment DIGIT    :    '0'..'9'<br />
;</p>
<p>WS    :    (' ' | '\t' | '\n' | '\r') {  $channel = HIDDEN;  }<br />
;</p></blockquote>
<p>Hello World! Now, on to the real grammar and (if things really, really work) GATE integration&#8230;</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/05/making-up-with-antlr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Conjunctions in named entities</title>
		<link>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/</link>
		<comments>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/#comments</comments>
		<pubDate>Fri, 27 Mar 2009 02:34:52 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=273</guid>
		<description><![CDATA[<p>A recent article on lingpipe discussed conjuncted named entities such as Johnson and Johnson and Wallace and Gromit. They suggest that maybe a way of treating this is as a frozen expression. I assume that means relying on statistical measures to see this Multi-Word-Expression repeating enough times to be treated as a unit.</p> <p>In the <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/">Conjunctions in named entities</a></span>]]></description>
			<content:encoded><![CDATA[<p>A <a title="Lingpipe's article on conjunctions in named entities" href="http://lingpipe-blog.com/2009/03/26/joint-referential-uncertainty-the-wallace-and-gromit-dilemma/">recent article on lingpipe</a> discussed named entities that contain conjunctions, such as <span style="text-decoration: underline;">Johnson and Johnson</span> and <span style="text-decoration: underline;">Wallace and Gromit</span>. The authors suggest that one way of treating such a name is as a frozen expression. I assume that means relying on statistical measures to see this multi-word expression repeated often enough to be treated as a unit.</p>
<p>In the United Nations corpus, things can get even more interesting. Let&#8217;s look at a relatively easy example: <em><span style="text-decoration: underline;">draft resolution A/56/L.28 and Add.1</span></em>.</p>
<p>Is this one document (one draft resolution) or two? And if two, then which two? The first one is obviously <span style="text-decoration: underline;">A/56/L.28</span>. But <span style="text-decoration: underline;">Add.1</span> is not a valid document symbol on its own; it is actually an (additive?) coreference to the first one and resolves to <span style="text-decoration: underline;">A/56/L.28/Add.1</span>.</p>
<p>The answer (as good as I can make it so far) could lie in the <a title="Introduction to FRBR" href="http://techessence.info/frbr">FRBR</a> distinction between Expression and Manifestation. A resolution is an expression of Member States&#8217; proposals and negotiations. To some degree, it evolves over several meetings. However, between the discussions, the latest version or changes need to be reported, both to register them formally and to ensure that the next round of discussions has the latest documents to work from.</p>
<p>In our case, the first time the draft resolution had to be presented, it was published under <span style="text-decoration: underline;">A/56/L.28</span> (which incidentally means limited distribution document 28 of the General Assembly&#8217;s 56th regular session). So, the initial Manifestation of the draft resolution became this physical document with a distinct symbol assigned.</p>
<p>But apart from its text, a draft resolution has a list of sponsoring Member States. That list can change as the draft resolution gains sponsors. These additional sponsors were listed in the Addendum <span style="text-decoration: underline;">A/56/L.28/Add.1</span>. But the addendum does not make sense without the original document, so both physical documents actually represent one logical draft resolution, which is reflected in the grammar of the text (draft resolution, not resolution<span style="text-decoration: underline;">s</span>).</p>
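<p>The mechanical part of that resolution step can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two regular expressions, the suffix list (Add, Corr, Rev) and the function name <em>expand_symbols</em> are mine, and they cover the example above rather than the full range of UN document symbols.</p>

```python
import re

# Assumed pattern for a base document symbol such as "A/56/L.28".
BASE_SYMBOL = re.compile(r"[A-Z]+(?:/[A-Za-z0-9.]+)+")
# Assumed pattern for a dependent suffix such as "and Add.1".
SUFFIX = re.compile(r"\band\s+((?:Add|Corr|Rev)\.\d+)")


def expand_symbols(mention):
    """Return the full document symbols implied by a mention.

    The base symbol comes first, followed by the base symbol
    with each dependent suffix appended.
    """
    base_match = BASE_SYMBOL.search(mention)
    if base_match is None:
        return []
    base = base_match.group(0)
    symbols = [base]
    # Only look for suffixes after the base symbol.
    for suffix in SUFFIX.findall(mention[base_match.end():]):
        symbols.append(base + "/" + suffix)
    return symbols


print(expand_symbols("draft resolution A/56/L.28 and Add.1"))
```

<p>The sketch only recovers the two Manifestations; deciding that they belong to one logical draft resolution is still the annotation question discussed above.</p>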
<p>What this means for named entity annotations and for recognition algorithms is hard to say, and it is something I am looking at in my PhD research.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Visualizing CiteULike collections</title>
		<link>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/</link>
		<comments>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/#comments</comments>
		<pubDate>Sun, 25 Jan 2009 07:10:20 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[My PhD research]]></category>
		<category><![CDATA[Problems and Solutions]]></category>
		<category><![CDATA[CiteULike]]></category>
		<category><![CDATA[Graphviz]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=266</guid>
		<description><![CDATA[<p>I am collecting my reading and reference material in CiteULike. I like the service because it can capture details from multiple sources. It also lets me discover what other interesting people have collected by navigating the graph of tags, people and bookmarks.</p> <p>Nice as CiteULike is, it is fairly difficult to get an overall picture of <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/">Visualizing CiteULike collections</a></span>]]></description>
			<content:encoded><![CDATA[<p>I am collecting my reading and reference material in <a title="My library in CiteULike" href="http://www.citeulike.org/user/arafalov">CiteULike</a>. I like the service because it can capture details from multiple sources. It also lets me discover what other interesting people have collected by navigating the graph of tags, people and bookmarks.</p>
<p>Nice as CiteULike is, it is fairly difficult to get an overall picture of one&#8217;s own collection. It is especially difficult to see quickly if there are people who serve as hubs by collaborating with multiple different groups. The information is there, but it requires a lot of clicks to find it out.</p>
<p>My usual solution is to export information out, massage it into <a title="Home page of Graphviz" href="http://www.graphviz.org/">Graphviz</a> format and use graph segmentation and layout algorithms to get a better overview. I <a title="Search for my articles mentioning Graphviz" href="http://blog.outerthoughts.com/?s=graphviz">have talked about Graphviz</a> a number of times on this blog before. This is yet another time it proved useful.</p>
<p>I started by exporting the content of my CiteULike library. I found the Endnote export format to be more structured and therefore easier to parse. I then ran it through <a title="My converter" href="http://www.outerthoughts.com/files/paperviz/v1/convert.py">a custom Python program</a> that basically spat out a graph with titles pointing at authors. That produced a <strong>very large</strong> graph and was not particularly useful.</p>
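<p>To give an idea of the shape of that output, here is a minimal sketch of writing title-to-author edges in Graphviz DOT format. It is not the linked converter: the record structure, the sample entries and the function names <em>quote</em> and <em>to_dot</em> are assumptions made for illustration, and the Endnote parsing step is left out.</p>

```python
def quote(text):
    """Quote a string so it can be used as a DOT node identifier."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_dot(records):
    """Build a DOT digraph with one edge from each title to each author."""
    lines = ["digraph library {"]
    for record in records:
        for author in record["authors"]:
            lines.append("    " + quote(record["title"]) + " -> " + quote(author) + ";")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample records, not real library entries.
sample = [
    {"title": "Paper One", "authors": ["Smith, J.", "Lee, K."]},
    {"title": "Paper Two", "authors": ["Lee, K."]},
]
print(to_dot(sample))
```

<p>Because both sample papers share an author, they end up in the same cluster once the graph is partitioned.</p>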
<p>The next step was to discover disjoint clusters of titles/authors. I used <em>ccomps</em> with the -v and -x flags (e.g. <em>ccomps.exe -v -x -o comp.dot output.dot</em>).</p>
<p><em>ccomps</em> gave me partitioned graphs as well as statistics on the number of nodes/edges in each graph. I could then choose a graph with a large number of nodes/edges (eventually, all of them) and run it through <em>neato</em> with overlap=scale and splines=true (e.g. <em>neato.exe -Tgif -o neato_1.gif -Goverlap=scale -Gsplines=true comp_1.dot</em>).</p>
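<p>For readers who have not met connected components before, the partitioning idea can be sketched with a small union-find in Python. This illustrates the concept only; it makes no claim about how <em>ccomps</em> is implemented, and the function name <em>components</em> and the sample edges are hypothetical.</p>

```python
def components(edges):
    """Group nodes into connected components using union-find."""
    parent = {}

    def find(node):
        # Register unseen nodes as their own root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Merge the two components joined by each edge.
    for a, b in edges:
        parent[find(a)] = find(b)

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    # Largest clusters first.
    return sorted(groups.values(), key=len, reverse=True)


# Hypothetical title/author edges.
clusters = components([("T1", "A"), ("T2", "A"), ("T3", "B")])
print(clusters)
```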
<p>The resulting graph was still not perfect, but it was a good start. I also tried <em>fdp</em> instead of <em>neato</em>, but that seemed to produce giraffe versions of the graph with graph edges being overly long.</p>
<p>You can see <a title="Output image of one of the clusters" href="http://www.outerthoughts.com/files/paperviz/v1/neato_1.gif">an example</a> of <em>neato</em> output for one of my clusters. Warning: if it causes problems due to its size, try it with <a title="Graphics viewing freeware" href="http://www.irfanview.com/">IrfanView</a>; that program can display even improbably large graphs (e.g. unpartitioned ones).</p>
<p>I ran into some problems as well, which would either cause partitions to merge or produce duplicate nodes and edges.</p>
<p>The first problem was that sometimes a person was an author and sometimes an editor. I was interested in both, so collapsed those fields together. That caused some non-people to then show up on the graph and connect clusters in unexpected ways. For my library the specific value was &#8216;European&#8217;, so I filtered it out in the code.</p>
<p>The second problem had to do with CiteULike&#8217;s parsing. Sometimes, it would split a first+last name into separate names, probably due to incorrect manual entry at some point. I had to fix those at the source by editing the corresponding CiteULike entry. Probably a good thing to do anyway.</p>
<p>The other problem is right out of the co-reference resolution domain. Sometimes names would include full first names, sometimes only a first name initial. I have worked around that by normalizing all first names to the initials. Obviously, this could collapse entries belonging to multiple real people into one.</p>
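<p>A normalization along those lines might look like the sketch below. It assumes names arrive in &#8220;Last, First Middle&#8221; form, which may not match every exported entry, and the function name <em>to_initials</em> is made up for this example.</p>

```python
def to_initials(name):
    """Reduce "Last, First Middle" to "Last, F. M."."""
    last, _, first = name.partition(",")
    # Treat existing dots as separators so "J.A." and "J. A." agree.
    initials = [part[0].upper() + "." for part in first.replace(".", " ").split()]
    # Names without a first-name part come back unchanged.
    return (last.strip() + ", " + " ".join(initials)).rstrip(", ")


print(to_initials("Smith, John"))
print(to_initials("Smith, Jane"))
```

<p>Both sample names collapse to the same key, which is exactly the over-merging risk mentioned above.</p>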
<p>Further on name problems, in cases of non-English names (e.g. Spanish names with multiple surnames), CiteULike would get confused about which part is which and would not display or export the name correctly. Additionally, sometimes characters such as <strong>ñ</strong> would be entered as plain <strong>n</strong>. Those also needed to be corrected manually.</p>
<p>The project only took a couple of hours, including writing code and cleanup. It is already useful to me, as I found a new person who appeared in an unexpectedly large number of papers and also found a chain of connections that might be interesting to follow more closely.</p>
<p>There is of course a lot more that could be done. Automatic co-reference of misspelt names, layout hints based on number of times authors appeared together, color coding of tags &#8211; these are just some of the easy ideas.</p>
<p>There might even be a small project/paper in doing co-reference resolution and cleaning up CiteULike data. After all, similar projects were done for Wikipedia. I don&#8217;t think CiteULike currently makes a full export available, but they do have <a title="CiteULike's datasets available for research" href="http://www.citeulike.org/faq/data.adp">some</a> datasets, so they might be amenable to exporting a special set for research purposes.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Where are all legal computational linguistics resources?</title>
		<link>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/</link>
		<comments>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/#comments</comments>
		<pubDate>Wed, 14 Jan 2009 01:41:44 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[Ideas]]></category>
		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=258</guid>
		<description><![CDATA[<p>I am frustrated. I know my corpus (resolutions of the United Nations General Assembly) has a lot in common with the biomedical and legal domains. And I can find interesting articles in the biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/">Where are all legal computational linguistics resources?</a></span>]]></description>
			<content:encoded><![CDATA[<p>I am frustrated. I know <a href="http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/">my corpus</a> (resolutions of the United Nations General Assembly) has a lot in common with the biomedical and legal domains. And I can find interesting articles in the biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in the legal domain.</p>
<p>I have just gone through all of <a title="Jurix conference" href="http://www.jurix.nl/">Jurix</a>&#8217;s proceedings as well as all of <a title="Digital edition of &quot;Artificial Intelligence and Law&quot; journal" href="http://www.springerlink.com/content/100239/">Artificial Intelligence and Law</a>, and all I got was <a title="My article set from legal domain" href="http://www.citeulike.org/user/arafalov/tag/legal">between 2 and 4 articles worth following up</a>.</p>
<p>There must be somebody actually trying to parse real legal texts and figuring out how to deal with complex organisation, people and group names. But all I can see are articles dealing with levels from ontology and up.</p>
<p>There might even be money in it!</p>
<p>One of the crazy business ideas I had was to parse all the web-based <em>terms of use</em> and <em>privacy notices</em> and annotate/crowd-vote them for how bad they are. So, before creating a web-based account, I could check the site against the database/parser, and it would highlight and rate for me the passages that I really should pay attention to (e.g. <em>we sell your contact details to every spammer we know</em>). Since the language of those notices is often ritualistically formulaic, extracting an interesting and useful summary would actually be simpler than it looks.</p>
<p>And the business model would center on providing an automatic notification option for when a notice from a subscribed website sneakily changed and became much worse. That way one would pay money for the peace of mind that there were no unexpected service rule changes.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Bulk converting doc files into txt (or html)</title>
		<link>http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/</link>
		<comments>http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/#comments</comments>
		<pubDate>Sun, 20 Apr 2008 00:37:42 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/</guid>
		<description><![CDATA[<p>I have written about converting Microsoft Word files into text or html using OpenOffice before. However, the wizards I described in that article were crashing when the number of files ran into the hundreds.</p> <p>I have written some macros to do the conversion, but they were scary looking and fragile. Fortunately, I have now found a <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/">Bulk converting doc files into txt (or html)</a></span>]]></description>
			<content:encoded><![CDATA[<p>I have written about converting Microsoft Word files into text or html using OpenOffice <a href="http://blog.outerthoughts.com/2006/10/obscure-bulk-format-converters-of-openofficeorg/" title="Previous article about converting files">before</a>. However, the wizards I described in that article were crashing when the number of files ran into the hundreds.</p>
<p>I have written some macros to do the conversion, but they were scary looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. <a href="http://www.ooomacros.org/user.php#95532" title="Location of the DocConverter macro">DocConverter</a> by Danny Brewer and Dan Horwood lets you convert a whole directory of files at a time between any formats that OpenOffice understands.</p>
<p>I have just converted more than a thousand documents from doc to txt without any problems. Actually, I had a small problem, but it was my fault. I had some corrupted files that OpenOffice would not open, and that was breaking DocConverter and throwing an ugly looking Basic runtime error. I had to delete the problem files, kill OpenOffice (stopping the macro did not work) and rerun the tool. Otherwise, it just ran.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Artificial Intelligence discussion at BarCampNYC3</title>
		<link>http://blog.outerthoughts.com/2008/03/artificial-intelligence-discussion-at-barcampnyc3/</link>
		<comments>http://blog.outerthoughts.com/2008/03/artificial-intelligence-discussion-at-barcampnyc3/#comments</comments>
		<pubDate>Mon, 17 Mar 2008 02:18:32 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[My PhD research]]></category>
		<category><![CDATA[BarCampNYC3]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2008/03/artificial-intelligence-discussion-at-barcampnyc3/</guid>
		<description><![CDATA[<p>They say at BarCamp that if you don&#8217;t like the session you are in, feel free to go to a better one. No hard feelings. But what do you do if you show up for the announced moderated discussion session, yet the moderator does not?</p> <p>That&#8217;s what happened to us with the last (5:15pm) slot <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2008/03/artificial-intelligence-discussion-at-barcampnyc3/">Artificial Intelligence discussion at BarCampNYC3</a></span>]]></description>
			<content:encoded><![CDATA[<p>They say at BarCamp that if you don&#8217;t like the session you are in, feel free to go to a better one. No hard feelings. But what do you do if you show up for the announced moderated discussion session, yet the moderator does not?</p>
<p>That&#8217;s what happened to us with the last (5:15pm) slot of the second day of BarCampNYC3. So, after waiting for 10 minutes past the start time, I decided to step in and moderate.</p>
<p>We talked a bit about everything: a definition of Artificial Intelligence (no agreement) and statistical algorithms that try to find tanks, tune adverts and prevent spam. We discussed the state of the art in computer vision and why a once well-known consumer company in that space (Riya) still failed miserably. Near the end, we also talked about artificial intelligence as an emotional one and whether <a href="http://www.pleoworld.com/" title="Information about Pleo">Pleo</a> is intelligent.</p>
<p>Altogether, it was a very spirited discussion, and most of the people contributed their opinions and their knowledge. We may not have discussed what the original moderator had in mind, but we certainly discussed interesting topics.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/03/artificial-intelligence-discussion-at-barcampnyc3/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Unravelling the black magic of bureaucracy</title>
		<link>http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/</link>
		<comments>http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/#comments</comments>
		<pubDate>Sat, 01 Sep 2007 22:07:06 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/</guid>
		<description><![CDATA[<p>Arthur C. Clarke once famously wrote &#8220;Any sufficiently advanced technology is indistinguishable from magic&#8221;. In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes skills out of this world to follow the logic of modern tax return instructions.</p> <p>Bureaucracy often has its <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/">Unravelling the black magic of bureaucracy</a></span>]]></description>
			<content:encoded><![CDATA[<p>Arthur C. Clarke once famously wrote &#8220;Any sufficiently advanced technology is indistinguishable from magic&#8221;. In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes skills out of this world to follow the logic of modern tax return instructions.</p>
<p>Bureaucracy often has its place and reason. Laws protect exploitable minorities; procedures serve to avoid known problems; cross-referencing forms are filled in triplicate to allow for audit and protection against falsification. The problem is not the bureaucracy as such but rather the fact that it eventually outgrows any individual person&#8217;s ability to comprehend it. At that point, only dedicated specialists can understand the process and the rest of us have to offer sacrifices to those acolytes in hopes of beneficial results.</p>
<p>Enter computers. It turns out that computers can bring the complexity of information down, back within the reach of the non-specialist. The more bureaucratic a process, the better a computer can figure it out. What is a mind-numbing in-triplicate form to a human is a structured source of information with cross-checking redundancy to the computer.</p>
<p>This area of research is called &#8220;<a href="http://www.aaai.org/AITopics/html/natlang.html" title="Introduction to NLP">Natural Language Processing</a>&#8221; &#8211; NLP. It is not an obscure field &#8211; any Google user has benefited from this type of research. Other applications of NLP include speech recognition and machine translation.</p>
<p>NLP is not a new branch of science. Back in the 1950s, software was being developed in the USA to translate from German into English. The translation quality of grammar-based systems was very poor. Nevertheless, even the possibility of machine translation was so impressive that about US$20 million was spent on the research before the enchantment fizzled out and funding virtually stopped. NLP did not die at that point, but it certainly slowed down.</p>
<p>Statistical approaches to NLP have been around nearly as long as grammar-based ones. However, as they require large quantities of data, these did not become feasible until the mid-1990s. Once they did reach popularity, however, the research advanced rapidly, taking advantage of ever increasing computer speed and available storage. Statistical approaches do not rely on language comprehension. Instead, with sufficient amounts of text, common patterns can be established without understanding the rules of their formation.</p>
<p>A good example is Google&#8217;s new translation engine from Arabic to English. The engine won the <a href="http://www.nist.gov/speech/tests/mt/doc/mt05eval_official_results_release_20050801_v3.html" title="Results of NIST 2005 competition">NIST 2005 machine translation competition</a>, even though its software developers did not know Arabic. Instead, <a href="http://blogoscoped.com/archive/2005-05-22-n83.html" title="Story on Google's translation engine">they used existing parallel documents of the United Nations</a> translated by professionals &#8211; some 200 billion words of content in total. It is perhaps symbolic that, even in such a deeply technical area, the Universal Declaration of Human Rights helps to ensure humans all over the world will be able to communicate with each other.</p>
<p>Standalone, however, a statistical approach is not a panacea either. Since there is no real understanding involved, a statistical NLP system has no way to recover from invalid conclusions.</p>
<p>There is more to the puzzle. Most of the real world texts are about somebody or something. The entity could be a person, a company, or a committee. Sometimes, the name of that entity is very long. Documents of the United Nations are known for names that even a human would struggle with. &#8220;<a href="http://www.un.org/law/UNsafetyconvention/index.html" title="Webpage of the Committee">The Ad Hoc Committee on the Scope of Legal Protection under the Convention on the Safety of United Nations and Associated Personnel</a>&#8221; would be one of those. Other large organisations have similar problems.</p>
<p>Currently, neither of the above approaches is sufficient on its own. Grammar-based systems break on complex names; statistical ones mark &#8216;The Committee&#8217; as a completely separate entity, rather than a reference to the full name.</p>
<p>The ideal system that we’re working on would be able to identify the complex names using a combination of techniques. It would also be capable of using multiple appearances in different contexts to confirm the identification, including linking different forms of the same name. Once these goals are achieved, documents in legal and medical domains can get the full benefits from other, already available, research.</p>
<p>Soon, the day will come when computers understand what humans write or say. Hopefully, without needing the triplicates.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

