<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Outer Thoughts &#187; Computational Linguistics</title>
	<atom:link href="http://blog.outerthoughts.com/category/computational-linguistics/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.outerthoughts.com</link>
	<description>&#62; From inner thoughts to the outer limits of Alexandre Rafalovitch</description>
	<lastBuildDate>Wed, 27 Jul 2011 00:24:19 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.4</generator>
		<item>
		<title>Conjunctions in named entities</title>
		<link>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/</link>
		<comments>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/#comments</comments>
		<pubDate>Fri, 27 Mar 2009 02:34:52 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=273</guid>
		<description><![CDATA[<p>A recent article on lingpipe discussed named entities that contain conjunctions, such as Johnson and Johnson and Wallace and Gromit. It suggests that one way of treating such an entity is as a frozen expression. I assume that means relying on statistical measures to see whether this multi-word expression repeats often enough to be treated as a unit.</p> <p>In the <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/">Conjunctions in named entities</a></span>]]></description>
			<content:encoded><![CDATA[<p>A <a title="Lingpipe's article on conjunctions in named entities" href="http://lingpipe-blog.com/2009/03/26/joint-referential-uncertainty-the-wallace-and-gromit-dilemma/">recent article on lingpipe</a> discussed named entities that contain conjunctions, such as <span style="text-decoration: underline;">Johnson and Johnson</span> and <span style="text-decoration: underline;">Wallace and Gromit</span>. It suggests that one way of treating such an entity is as a frozen expression. I assume that means relying on statistical measures to see whether this multi-word expression repeats often enough to be treated as a unit.</p>
<p>In the United Nations corpus, things can get even more interesting. Let&#8217;s look at a relatively easy example: <em><span style="text-decoration: underline;">draft resolution A/56/L.28 and Add.1</span></em>.</p>
<p>Is this one document (one draft resolution) or two? And if two, then which two? The first one is obviously <span style="text-decoration: underline;">A/56/L.28</span>. But <span style="text-decoration: underline;">Add.1</span> is not a valid document symbol on its own; it is actually an (additive?) coreference to the first one and resolves to <span style="text-decoration: underline;">A/56/L.28/Add.1</span>.</p>
<p>The answer (as good as I can make it so far) could lie in the <a title="Introduction to FRBR" href="http://techessence.info/frbr">FRBR</a> distinction between Expression and Manifestation. A resolution is an expression of Member States&#8217; proposals and negotiations. To some degree, it evolves over several meetings. However, between the discussions, the latest version or changes need to be reported, both to make sure they are formally registered and to ensure that the next round of discussions has the latest documents to work from.</p>
<p>In our case, the first time the draft resolution had to be presented it was published under <span style="text-decoration: underline;">A/56/L.28</span> (which incidentally means a limited distribution document 28 of the General Assembly&#8217;s 56th regular session). So, the initial Manifestation of the draft resolution became this physical document with a distinct symbol assigned.</p>
<p>But apart from its text, a draft resolution has a list of sponsoring Member States. That list can change as the draft resolution gains sponsors. These additional sponsors were listed in the Addendum <span style="text-decoration: underline;">A/56/L.28/Add.1</span>. But the addendum does not make sense without the original document, so both physical documents actually represent one logical draft resolution, which is reflected in the grammar of the text (draft resolution, not resolution<span style="text-decoration: underline;">s</span>).</p>
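<p>To make the coreference concrete, here is a minimal sketch of expanding such a conjoined mention into full document symbols. It is only an illustration under stated assumptions: the function name is made up, and it assumes that suffixes such as Add., Corr. and Rev. are always relative to the nearest preceding full symbol, which real text may not respect.</p>

```python
RELATIVE_PREFIXES = ("Add", "Corr", "Rev")  # assumed list of relative suffixes


def resolve_symbols(mention):
    """Expand 'A/56/L.28 and Add.1' into a list of full document symbols."""
    parts = [part.strip() for part in mention.split(" and ")]
    base = parts[0]
    resolved = [base]
    for part in parts[1:]:
        if part.split(".")[0] in RELATIVE_PREFIXES:
            # A relative mention: attach it to the last full symbol seen.
            resolved.append(base + "/" + part)
        else:
            # Anything else is treated as a new full symbol.
            resolved.append(part)
            base = part
    return resolved


print(resolve_symbols("A/56/L.28 and Add.1"))
```

<p>The sketch resolves the symbols, but it deliberately says nothing about whether the result is one logical draft resolution or two documents; that is the annotation question.</p>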
<p>What this means for named entity annotations and for recognition algorithms is hard to say and is something I am looking at in my PhD research.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/03/conjunctions-in-named-entities/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Visualizing CiteULike collections</title>
		<link>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/</link>
		<comments>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/#comments</comments>
		<pubDate>Sun, 25 Jan 2009 07:10:20 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[My PhD research]]></category>
		<category><![CDATA[Problems and Solutions]]></category>
		<category><![CDATA[CiteULike]]></category>
		<category><![CDATA[Graphviz]]></category>
		<category><![CDATA[visualization]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=266</guid>
		<description><![CDATA[<p>I am collecting my reading and reference material in CiteULike. I like the service because it can capture details from multiple sources. It also lets me discover what other interesting people have collected by navigating the graph of tags, people and bookmarks.</p> <p>Nice as CiteULike is, it is fairly difficult to get an overall picture of <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/">Visualizing CiteULike collections</a></span>]]></description>
			<content:encoded><![CDATA[<p>I am collecting my reading and reference material in <a title="My library in CiteULike" href="http://www.citeulike.org/user/arafalov">CiteULike</a>. I like the service because it can capture details from multiple sources. It also lets me discover what other interesting people have collected by navigating the graph of tags, people and bookmarks.</p>
<p>Nice as CiteULike is, it is fairly difficult to get an overall picture of one&#8217;s own collection. It is especially difficult to see quickly if there are people who serve as hubs by collaborating with multiple different groups. The information is there, but it takes a lot of clicks to find.</p>
<p>My usual solution is to export information out, massage it into <a title="Home page of Graphviz" href="http://www.graphviz.org/">Graphviz</a> format and use graph segmentation and layout algorithms to get a better overview. I <a title="Search for my articles mentioning Graphviz" href="http://blog.outerthoughts.com/?s=graphviz">have talked about Graphviz</a> a number of times on this blog before. This is yet another time it proved useful.</p>
<p>I started by exporting the content of my CiteULike library. I found the EndNote export format to be more structured and therefore easier to parse. I then ran it through <a title="My converter" href="http://www.outerthoughts.com/files/paperviz/v1/convert.py">a custom Python program</a> that basically spat out a graph with titles pointing at authors. That produced a <strong>very large</strong> graph and was not particularly useful.</p>
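<p>For readers who want to try something similar, here is a rough sketch of that conversion step. It is not the linked converter. It assumes the tagged export layout in which <em>%0</em> starts a record, <em>%T</em> holds the title and <em>%A</em> holds one author per line; the function names and the sample record are made up for illustration.</p>

```python
def parse_endnote(text):
    """Collect a title and a list of authors for every tagged record."""
    records = []
    current = None
    for line in text.splitlines():
        tag, _, value = line.strip().partition(" ")
        if tag == "%0":  # a new record starts here
            current = {"title": None, "authors": []}
            records.append(current)
        elif current is None:
            continue  # ignore anything before the first record
        elif tag == "%T":
            current["title"] = value
        elif tag == "%A":
            current["authors"].append(value)
    return records


def quote(label):
    """Quote a label for DOT, escaping backslashes and double quotes."""
    return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'


def to_dot(records):
    """Emit a digraph with one edge from each title to each of its authors."""
    lines = ["digraph papers {"]
    for record in records:
        if record["title"] is None:
            continue
        for author in record["authors"]:
            lines.append("  " + quote(record["title"]) + " -> " + quote(author) + ";")
    lines.append("}")
    return "\n".join(lines)


# Made-up sample record, not taken from a real library.
SAMPLE = """%0 Journal Article
%T Example Paper
%A Smith, John
%A Doe, Jane
"""

print(to_dot(parse_endnote(SAMPLE)))
```

<p>The output can be saved as a .dot file and passed to the Graphviz tools.</p>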
<p>The next step was to discover disjoint clusters of titles/authors. I used <em>ccomps</em> with the -v and -x flags (e.g. <em>ccomps.exe -v -x -o comp.dot output.dot</em>).</p>
<p><em>ccomps</em> gave me partitioned graphs as well as statistics on the number of nodes/edges in each graph. I could then choose a graph with a large number of nodes/edges (eventually, all of them) and run it through <em>neato</em> with overlap=scale and splines=true (e.g. <em>neato.exe -Tgif -o neato_1.gif -Goverlap=scale -Gsplines=true comp_1.dot</em>).</p>
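<p>Conceptually, the partitioning step finds the connected components of the title/author graph. The sketch below shows that idea in plain Python with a breadth-first search. It is an illustration of the concept, not how <em>ccomps</em> is implemented, and the function name and sample edges are invented.</p>

```python
from collections import defaultdict, deque


def connected_components(edges):
    """Group nodes into clusters that share no edges with each other."""
    neighbours = defaultdict(set)
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen = set()
    components = []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        component = []
        while queue:
            node = queue.popleft()
            component.append(node)
            for other in neighbours[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(sorted(component))

    # Largest clusters first, since those were the interesting ones to lay out.
    components.sort(key=len, reverse=True)
    return components


# Made-up edges: two papers share author X, a third paper stands alone.
EDGES = [("Paper A", "X"), ("Paper B", "X"), ("Paper C", "Y")]
for cluster in connected_components(EDGES):
    print(len(cluster), cluster)
```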
<p>The resulting graph was still not perfect, but it was a good start. I also tried <em>fdp</em> instead of <em>neato</em>, but that seemed to produce giraffe versions of the graph with graph edges being overly long.</p>
<p>You can see <a title="Output image of one of the clusters" href="http://www.outerthoughts.com/files/paperviz/v1/neato_1.gif">an example</a> of <em>neato</em> output for one of my clusters. Warning: if it causes problems due to its size, try it with <a title="Graphics viewing freeware" href="http://www.irfanview.com/">IrfanView</a>; that program can display even improbably large graphs (e.g. unpartitioned ones).</p>
<p>I also ran into some problems that would either cause partitions to merge or produce duplicate nodes and edges.</p>
<p>The first problem was that sometimes a person was an author and sometimes an editor. I was interested in both, so I collapsed those fields together. That caused some non-people to then show up on the graph and connect clusters in unexpected ways. For my library the specific value was &#8216;European&#8217;, so I filtered it out in the code.</p>
<p>The second problem was to do with CiteULike&#8217;s parsing. Sometimes, it would split a first+last name into separate names, probably due to incorrect manual entry at some point. I had to fix those at the source by editing the corresponding CiteULike entries. Probably a good thing to do anyway.</p>
<p>The third problem is right out of the co-reference resolution domain. Sometimes names would include full first names, sometimes only a first-name initial. I worked around that by normalizing all first names to initials. Obviously, this could collapse entries belonging to multiple real people into one.</p>
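<p>A minimal sketch of that normalization, assuming names arrive as &#8216;Last, First Middle&#8217; (the function name is hypothetical, and other name layouts would need extra handling):</p>

```python
def normalise_name(name):
    """Reduce the first names in 'Last, First Middle' to initials."""
    last, _, first = name.partition(",")
    # Treat dots as separators so that 'J.A.' and 'J. A.' behave the same.
    initials = [part[0].upper() + "." for part in first.replace(".", " ").split()]
    return (last.strip() + ", " + " ".join(initials)).rstrip(", ")


print(normalise_name("Smith, John A."))
print(normalise_name("Smith, J.A."))
# Two different people now share one key, which is the risk noted above.
print(normalise_name("Smith, Jane") == normalise_name("Smith, John"))
```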
<p>Further on name problems, in cases of non-English names (e.g. Spanish names with multiple surnames), CiteULike would get confused about which part is which and not display or export the name correctly. Additionally, sometimes characters such as <strong>ñ</strong> would be entered as plain <strong>n</strong>. Those also needed to be corrected manually.</p>
<p>The project took only a couple of hours, including writing code and cleanup. It is already useful to me, as I found a new person who appeared in an unexpectedly large number of papers and also found a chain of connections that might be interesting to follow more closely.</p>
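<p>The accented-character cases can at least be found automatically, even if the fix stays manual. The sketch below groups names that differ only by diacritics, using Unicode decomposition from the standard library. The function names and sample names are invented, and it does nothing for the multiple-surname problem.</p>

```python
import unicodedata


def fold_accents(text):
    """Strip combining marks, so that 'ñ' compares equal to 'n'."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def accent_variants(names):
    """Return groups of names that differ only in their diacritics."""
    groups = {}
    for name in names:
        groups.setdefault(fold_accents(name), set()).add(name)
    return {key: sorted(variants) for key, variants in groups.items() if len(variants) > 1}


# Made-up names for illustration.
print(accent_variants(["Muñoz, P.", "Munoz, P.", "Smith, J."]))
```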
<p>There is of course a lot more that could be done. Automatic co-reference of misspelt names, layout hints based on number of times authors appeared together, color coding of tags &#8211; these are just some of the easy ideas.</p>
<p>There might even be a small project/paper in doing co-reference resolution and cleaning up CiteULike data. After all, similar projects were done for Wikipedia. I don&#8217;t think CiteULike currently makes a full export available, but they do have <a title="CiteULike's datasets available for research" href="http://www.citeulike.org/faq/data.adp">some datasets</a>, so they might be amenable to exporting a special set for research purposes.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/visualizing-citeulike-collections/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>New mailing list to discuss junction of NLP and Software Engineering</title>
		<link>http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/</link>
		<comments>http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/#comments</comments>
		<pubDate>Sat, 17 Jan 2009 21:20:03 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=260</guid>
		<description><![CDATA[<p>Dr. René Witte has just created a new mailing list (SENLP) to discuss applying NLP techniques to Software Engineering and also to discuss general Software Engineering issues in developing NLP systems.</p> <p>I am interested in both topics. I did 3 years as senior technical support at BEA and could see how applying NLP techniques on <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/">New mailing list to discuss junction of NLP and Software Engineering</a></span>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.rene-witte.net/">Dr. René Witte</a> has just created a new mailing list (<a title="Introduction to SENLP mailing list" href="http://www.semanticsoftware.info/blog/senlp-mailing-list-connecting-software-engineering-and-nlp">SENLP</a>) to discuss applying NLP techniques to Software Engineering and also to discuss general Software Engineering issues in developing NLP systems.</p>
<p>I am interested in both topics. I did 3 years as senior technical support at BEA and could see how applying NLP techniques to written notes in support cases could have improved the quality of technical support. I did not get to do any of that, but some interest remains.</p>
<p>The second topic is even more interesting and important to me. It can build on discussions currently held on blogs (see &#8216;<a title="Blog entry about Software Engineering and NLP" href="http://www.drni.de/niels/s9y/archives/5-The-USES-Issue.html">The USES Issue</a>&#8217; at Niels Ott&#8217;s blog) and in journals (see &#8216;<a title="Ted Pedersen article on building better NLP software" href="http://www.d.umn.edu/~tpederse/Pubs/pedersen-last-word-2008.pdf">Empiricism Is Not a Matter of Faith</a>&#8217; by Ted Pedersen). While some of the issues are discussed on mailing lists for individual pieces of software, a place to discuss cross-cutting concerns is very welcome.</p>
<p>I have joined the list and hope to see at least some of my readers there as well.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/new-mailing-list-to-discuss-junction-of-nlp-and-software-engineering/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Where are all legal computational linguistics resources?</title>
		<link>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/</link>
		<comments>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/#comments</comments>
		<pubDate>Wed, 14 Jan 2009 01:41:44 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[Ideas]]></category>
		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=258</guid>
		<description><![CDATA[<p>I am frustrated. I know my corpus (resolutions of the United Nations General Assembly) has a lot in common with the biomedical and legal domains. And I can find interesting articles in the biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/">Where are all legal computational linguistics resources?</a></span>]]></description>
			<content:encoded><![CDATA[<p>I am frustrated. I know <a href="http://blog.outerthoughts.com/2007/09/unravelling-the-black-magic-of-bureaucracy/">my corpus</a> (resolutions of the United Nations General Assembly) has a lot in common with the biomedical and legal domains. And I can find interesting articles in the biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in the legal domain.</p>
<p>I have just gone through all of the <a title="Jurix conference" href="http://www.jurix.nl/">Jurix</a> proceedings as well as all of <a title="Digital edition of &quot;Artificial Intelligence and Law&quot; journal" href="http://www.springerlink.com/content/100239/">Artificial Intelligence and Law</a>, and all I got was <a title="My article set from legal domain" href="http://www.citeulike.org/user/arafalov/tag/legal">between 2 and 4 articles worth following up</a>.</p>
<p>There must be somebody actually trying to parse real legal texts and figuring out how to deal with complex organisation, people and group names. But all I can see are articles dealing with levels from ontology up.</p>
<p>There might even be money in it!</p>
<p>One of the crazy business ideas I had was to parse all the web-based <em>terms of use</em> and <em>privacy notices</em> and annotate/crowd-vote them for how bad they are. So, before creating a web-based account, I could check its terms against the database/parser, and it would highlight and rate the passages that I really should pay attention to (e.g. <em>we sell your contact details to every spammer we know</em>). Since the language of those notices is often ritualistically formulaic, extracting an interesting and useful summary would actually be simpler than it looks.</p>
<p>And the business model would center on an automatic notification option for when a notice from a subscribed website sneakily changed and became much worse. That way one would pay for the peace of mind of knowing that there were no unexpected changes to the service rules.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2009/01/where-are-all-legal-computational-linguistics-resources/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Explaining Computational Linguistics to friends and family</title>
		<link>http://blog.outerthoughts.com/2008/08/explaining-computational-linguistics-to-friends-and-family/</link>
		<comments>http://blog.outerthoughts.com/2008/08/explaining-computational-linguistics-to-friends-and-family/#comments</comments>
		<pubDate>Tue, 26 Aug 2008 02:41:43 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/?p=253</guid>
		<description><![CDATA[<p>It is hard enough to explain what we are doing to our professors; explaining it in plain English to our friends and family is nearly impossible.</p> <p>So it is always good to see people who can explain what a POS tagger is and why it is important without having to throw around references to Norvig or <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2008/08/explaining-computational-linguistics-to-friends-and-family/">Explaining Computational Linguistics to friends and family</a></span>]]></description>
			<content:encoded><![CDATA[<p>It is hard enough to explain what we are doing to our professors; explaining it in <em>plain English</em> to our friends and family is nearly impossible.</p>
<p>So it is always good to see people who can explain what a POS tagger is and why it is important without having to throw around references to Norvig or Jurafsky.</p>
<p>Markus Dickinson has managed exactly such an explanation in his <a href="http://jones.ling.indiana.edu/~mdickinson/papers/budapest.html">non-linguistic primer</a> to a serious research paper on <a href="http://jones.ling.indiana.edu/%7Emdickinson/papers/dickinson-meurers-03.html">Detecting Errors in Part-of-Speech Annotation</a>. The writing is quite old (2003), but it reads well and still feels relevant. Of course, his research page contains more recent papers on the same topic too.</p>
<p>(via <a href="http://blogamundo.net/dev/2008/08/22/a-plain-english-description-of-a-computational-linguistics-thesis/">Hacklog</a>)</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/08/explaining-computational-linguistics-to-friends-and-family/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Bulk converting doc files into txt (or html)</title>
		<link>http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/</link>
		<comments>http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/#comments</comments>
		<pubDate>Sun, 20 Apr 2008 00:37:42 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[My PhD research]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/</guid>
		<description><![CDATA[<p>I have written about converting Microsoft Word files into text or html using OpenOffice before. However, the wizards I described in that article were crashing when the number of files ran into the hundreds.</p> <p>I have written some macros to do the conversion, but they were scary looking and fragile. Fortunately, I have now found a <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/">Bulk converting doc files into txt (or html)</a></span>]]></description>
			<content:encoded><![CDATA[<p>I have written about converting Microsoft Word files into text or html using OpenOffice <a href="http://blog.outerthoughts.com/2006/10/obscure-bulk-format-converters-of-openofficeorg/" title="Previous article about converting files">before</a>. However, the wizards I described in that article were crashing when the number of files ran into the hundreds.</p>
<p>I have written some macros to do the conversion, but they were scary looking and fragile. Fortunately, I have now found a tool that does the same job better and with more flexibility. <a href="http://www.ooomacros.org/user.php#95532" title="Location of the DocConverter macro">DocConverter</a> by Danny Brewer and Dan Horwood converts a whole directory of files at a time between any formats that OpenOffice understands.</p>
<p>I have just converted more than a thousand documents from doc to txt without any problems. Actually, I had a small problem, but it was my fault. I had some corrupted files that OO would not open, and that was breaking DocConverter and throwing an ugly-looking Basic runtime error. I had to delete the problem files, kill OpenOffice (stopping the macro was not enough) and rerun the tool. Otherwise, it just ran.</p>
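<p>One way to spot some of the broken files before starting a long run is to check whether each .doc file begins with the 8-byte signature of the binary compound file format that classic Word documents use. This is only a rough sketch with a made-up function name. It is not part of DocConverter, it will flag files that are really RTF or another format under a .doc name, and a file with a valid signature can still be corrupted further in.</p>

```python
import os

# Signature at the start of binary compound files, including classic .doc files.
OLE2_MAGIC = bytes.fromhex("d0cf11e0a1b11ae1")


def suspicious_doc_files(directory):
    """List .doc files in a directory that do not start with the signature."""
    suspects = []
    for name in sorted(os.listdir(directory)):
        if not name.lower().endswith(".doc"):
            continue
        path = os.path.join(directory, name)
        with open(path, "rb") as handle:
            header = handle.read(len(OLE2_MAGIC))
        if header != OLE2_MAGIC:
            suspects.append(name)
    return suspects
```

<p>Calling <em>suspicious_doc_files</em> on the input directory returns the file names worth opening by hand before the batch conversion.</p>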
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/04/bulk-converting-doc-files-into-txt-or-html/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>On uselessness of pretending to be somebody else</title>
		<link>http://blog.outerthoughts.com/2008/01/on-uselessness-of-pretending-to-be-somebody-else/</link>
		<comments>http://blog.outerthoughts.com/2008/01/on-uselessness-of-pretending-to-be-somebody-else/#comments</comments>
		<pubDate>Fri, 25 Jan 2008 00:28:36 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[Weird Stuff]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2008/01/on-uselessness-of-pretending-to-be-somebody-else/</guid>
		<description><![CDATA[<p>While reading the Weka Data Mining book, I came across this impressive example of using machine learning to confirm a person&#8217;s authorship (p. 358).</p> <p>In the 19th century, there lived a famous rabbinic scholar, Ben Ish Chai, who among other writings had two collections of letters. Ben Ish Chai claimed that only one collection was his and <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2008/01/on-uselessness-of-pretending-to-be-somebody-else/">On uselessness of pretending to be somebody else</a></span>]]></description>
			<content:encoded><![CDATA[<p>While reading the Weka <a href="http://www.worldcat.org/oclc/58451668" title="WorldCat link for the book">Data Mining book</a>, I came across this impressive example of using machine learning to confirm a person&#8217;s authorship (p. 358).</p>
<p>In the 19th century, there lived a famous rabbinic scholar, Ben Ish Chai, who among other writings had two collections of letters. Ben Ish Chai claimed that only one collection was his and that the other one was somebody else&#8217;s, found by him. Modern scholars thought both collections were his, but could not prove it conclusively as the style of writing was different.</p>
<p>Machine Learning to the rescue! In 2004, <span class="m">Moshe Koppel and Jonathan Schler</span> discovered that it may help to look not at the writing style differences (as the style may have been faked), but rather at how deep those differences were. For example, an author could fake a stylistic mismatch by consciously avoiding favorite words, but would still write in long run-on sentences, use more passive verb forms or display many other measurable behaviours.</p>
<p>So, if the most obvious differences were removed one by one, the speed at which the two collections became indistinguishable on the remaining features could be a good indicator of common authorship. They called this technique <a href="http://citeseer.ist.psu.edu/648176.html" title="Paper about unmasking technique">unmasking</a> and the mystery of Ben Ish Chai was solved for good.</p>
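<p>Here is a rough sketch of the shape of that loop, to make the idea concrete. It is a toy under loud assumptions: each collection is already split into chunks and each chunk is a dict of feature frequencies, the classifier is a deliberately simple nearest-centroid one (not the learner used in the paper), and all function and parameter names are made up. It measures how well the two collections can be told apart, drops the most telling features, and repeats.</p>

```python
def centroid(vectors, features):
    """Mean value of each feature over a list of feature dicts."""
    return {f: sum(v[f] for v in vectors) / len(vectors) for f in features}


def distance(vector, centre, features):
    """Squared Euclidean distance restricted to the given features."""
    return sum((vector[f] - centre[f]) ** 2 for f in features)


def loo_accuracy(chunks_a, chunks_b, features):
    """Leave-one-out accuracy of a nearest-centroid classifier.

    Needs at least two chunks per collection.
    """
    labelled = [(v, "a") for v in chunks_a] + [(v, "b") for v in chunks_b]
    correct = 0
    for i, (vector, label) in enumerate(labelled):
        rest = labelled[:i] + labelled[i + 1:]
        centre_a = centroid([v for v, l in rest if l == "a"], features)
        centre_b = centroid([v for v, l in rest if l == "b"], features)
        closer_to_b = distance(vector, centre_a, features) > distance(vector, centre_b, features)
        guess = "b" if closer_to_b else "a"
        correct += guess == label
    return correct / len(labelled)


def unmasking_curve(chunks_a, chunks_b, features, rounds=5, drop=2):
    """Accuracy per round as the most distinguishing features are removed."""
    features = list(features)
    curve = []
    for _ in range(rounds):
        if not features:
            break
        curve.append(loo_accuracy(chunks_a, chunks_b, features))
        centre_a = centroid(chunks_a, features)
        centre_b = centroid(chunks_b, features)
        # Drop the features whose average values differ most between collections.
        features.sort(key=lambda f: abs(centre_a[f] - centre_b[f]), reverse=True)
        features = features[drop:]
    return curve
```

<p>In this toy version, a curve that collapses quickly means the differences were shallow, which is the signature of one author writing in two styles; a curve that stays high suggests two different authors.</p>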
<p>I think what impressed me here was not the clever math. The whole field of determining authorship is based on clever math. It is rather the fact that the math was looking at hints <u>within</u> the hints of the language &#8211; the invisible aspects that become noticeable only after the eye learns to see beyond what the most obvious reality offers. I cannot explain it better, but to me it has a special elegance that just counting the words and sentence lengths does not offer.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2008/01/on-uselessness-of-pretending-to-be-somebody-else/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Parsing jumping jacks</title>
		<link>http://blog.outerthoughts.com/2007/12/parsing-jumping-jacks/</link>
		<comments>http://blog.outerthoughts.com/2007/12/parsing-jumping-jacks/#comments</comments>
		<pubDate>Sat, 01 Dec 2007 23:14:25 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[RSCDS]]></category>
		<category><![CDATA[Weird Stuff]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/12/parsing-jumping-jacks/</guid>
		<description><![CDATA[<p>What could be common between Computational Linguistics and Aerobics? Quite a lot, as it turns out.</p> <p>Dance descriptions, while not really in English, do have a regular structure and can be thought of as a sub-language with a full set of syntactic, semantic and pragmatic levels.</p> <p>There are basic words of the language (move <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2007/12/parsing-jumping-jacks/">Parsing jumping jacks</a></span>]]></description>
			<content:encoded><![CDATA[<p>What could be common between Computational Linguistics and Aerobics? Quite a lot, as it turns out.</p>
<p>Dance descriptions, while not really in English, do have a regular structure and can be thought of as a sub-language with a full set of syntactic, semantic and pragmatic levels.</p>
<p>There are basic words of the language (move names), correct ways of putting them in a sentence (a routine) and all the way up to good flowing text (classes that do not hurt the participants).</p>
<p>I have been thinking about the relationship between dance instructions and computational linguistics in the context of Scottish Country Dancing for at least a year. My imagined benefits were that codified dance instructions would allow for automatic dance animations, superior teacher aids and other applications that currently require a lot of sweat and toil. Dance evening programmes, which are currently put together manually for each event, could be assisted by automated evaluation pointing out awkward sequences of dances.</p>
<p>Unfortunately, my attempts at explaining the connection made no sense to the people around me. So, I was ecstatic to discover that such a link had already been found by others before me.</p>
<p>More than 10 years ago, Adam Bull tried to apply principles of computational linguistics to Aerobics for his MPhil degree, in a paper entitled <a href="http://www.comp.leeds.ac.uk/cgi-bin/sis/ext/rs_pub.cgi?cmd=displayabstract&amp;sid=898625237" title="Web page for the report">The formal description of aerobic dance exercise &#8211; a corpus-based computational linguistics approach</a>. While the report is not complete, it puts down many of the same arguments I have tried to make myself.</p>
<p>Unfortunately, an electronic copy of the document was not available. After some effort, I got in touch with Adam and he sent me a copy of the report with permission to distribute it. I have put <a href="http://www.outerthoughts.com/files/adam_bull_thesis_aerobics_compling.pdf" title="Copy of Adam's report">a copy of it on my own server</a>.</p>
<p>I hope his research will get rediscovered and improved upon. That way when I get some time to apply my own PhD skills to Scottish Country Dancing, there will be more than one person on whose shoulders I would be able to stand.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/12/parsing-jumping-jacks/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Upgrading to GATE 4? Beware of leftover configuration files.</title>
		<link>http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/</link>
		<comments>http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/#comments</comments>
		<pubDate>Sun, 07 Oct 2007 03:03:40 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[Java]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/</guid>
		<description><![CDATA[<p>From time to time I experiment with the GATE NLP toolkit. Just now I tried to upgrade to the latest version (version 4) and ran into a really strange problem with the ANNIE system not loading correctly. Later, when I uninstalled the older GATE version, it stopped loading at all.</p> <p>The problem is the user configuration file gate.xml that <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/">Upgrading to GATE 4? Beware of leftover configuration files.</a></span>]]></description>
			<content:encoded><![CDATA[<p>From time to time I experiment with the <a href="http://gate.ac.uk/" title="Home of the GATE - NLP toolkit">GATE NLP toolkit</a>. Just now I tried to upgrade to the latest version (version 4) and ran into a really strange problem with the ANNIE system not loading correctly. Later, when I uninstalled the older GATE version, it stopped loading at all.</p>
<p>The problem is the user configuration file <em>gate.xml</em> that is stored in a shared location, usually the home directory. On Windows, that is <em>C:\Documents and Settings\[ProfileName]\</em>.</p>
<p>One of the settings in that file pointed to where the plugins were loaded from and was still referring to GATE 3.1&#8217;s locations. That caused NullPointerExceptions in GATE and everything was breaking from that point on.</p>
<p>I found this by using FileMon, but later realised that it might have been easier to change the <em>runtime.spawn</em> property to <em>false</em> in GATE&#8217;s <em>build.xml</em> file that is used to start the program. Using <em>ant</em> to start a program is a new one for me, but I guess it makes sense in some cases. Setting the property to false shows the startup messages and the exception that the wrong directories cause.</p>
<p>I have deleted the old <em>gate.xml</em> and <em>gate.session</em> files in my home directory and everything started to work. Back to actually trying to use the software.</p>
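<p>For anyone who would rather check before deleting, here is a minimal sketch in Python. It only assumes what is described above: that the leftover files are called <em>gate.xml</em> and <em>gate.session</em> and sit directly in the home directory. The function names and the <em>.bak</em> suffix are my own choices for illustration, not anything GATE provides.</p>

```python
from pathlib import Path
from typing import List

# File names taken from the post; their location (the home directory)
# is the usual default and is an assumption on other setups.
LEFTOVER_NAMES = ("gate.xml", "gate.session")


def find_leftover_gate_files(home: Path) -> List[Path]:
    """Return the old GATE user configuration files found directly in `home`."""
    return [home / name for name in LEFTOVER_NAMES if (home / name).is_file()]


def back_up_leftover_gate_files(home: Path, suffix: str = ".bak") -> List[Path]:
    """Rename each leftover file by appending `suffix` (hypothetical choice),
    so the files are no longer found under their original names but can
    still be restored. Returns the new paths."""
    moved = []
    for path in find_leftover_gate_files(home):
        target = path.with_name(path.name + suffix)
        path.rename(target)
        moved.append(target)
    return moved


if __name__ == "__main__":
    # Only list the files; call back_up_leftover_gate_files() to move them aside.
    for leftover in find_leftover_gate_files(Path.home()):
        print(leftover)
```

<p>Renaming instead of deleting keeps the old settings around in case something in them is still needed.</p>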
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/10/upgrading-to-gate-4-beware-of-leftover-configuration-files/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Story of Human Language &#8211; great introductory audio course on linguistics</title>
		<link>http://blog.outerthoughts.com/2007/09/story-of-human-language-great-introductory-audio-course-on-linguistics/</link>
		<comments>http://blog.outerthoughts.com/2007/09/story-of-human-language-great-introductory-audio-course-on-linguistics/#comments</comments>
		<pubDate>Sat, 29 Sep 2007 16:01:45 +0000</pubDate>
		<dc:creator>arafalov</dc:creator>
				<category><![CDATA[Computational Linguistics]]></category>
		<category><![CDATA[General Education]]></category>
		<category><![CDATA[Language Acquisition]]></category>

		<guid isPermaLink="false">http://blog.outerthoughts.com/2007/09/story-of-human-language-great-introductory-audio-course-on-linguistics/</guid>
		<description><![CDATA[<p>As part of doing a PhD in Computational Linguistics, I need to understand both computers and linguistics. I am fine with computers, but linguistics is not my strong point. Unfortunately, many of the linguistics books and resources are quite dry.</p> <p>So, I was really happy to discover an audio course Story of Human Language from <span style="color:#777"> . . . &#8594; Read More: <a href="http://blog.outerthoughts.com/2007/09/story-of-human-language-great-introductory-audio-course-on-linguistics/">Story of Human Language &#8211; great introductory audio course on linguistics</a></span>]]></description>
			<content:encoded><![CDATA[<p>As part of doing a PhD in Computational Linguistics, I need to understand both computers and linguistics. I am fine with computers, but linguistics is not my strong point.  Unfortunately, many of the linguistics books and resources are quite dry.</p>
<p>So, I was really happy to discover an audio course <span class="courseTitle" style="padding-top: 15px; padding-bottom: 1px"><a href="http://www.teach12.com/ttcx/coursedesclong2.aspx?cid=1600&amp;pc=Professor304" title="Official web page for the audio course">Story of Human Language</a></span> from The Teaching Company taught by John McWhorter. It is quite long and covers a lot of material, but &#8211; apart from some overly long parts on universal language &#8211; it is really interesting and Professor McWhorter is a great presenter.</p>
<p>I actually had a chance to listen to the audio version of the course and to see some of it on DVD. Personally, I prefer just the audio, for several reasons.</p>
<p>Firstly, I can listen to the course on my MP3 player when I am walking or doing chores. The video version requires setting aside dedicated time, which for such a long course would be difficult.</p>
<p>Secondly, I actually found the visual part of the presentation quite boring &#8211; for the most part the professor just stands behind the lectern and talks from his notes. In fact, I found that the visual part distracted me from the really great and expressive rhetoric.</p>
<p>There were a number of great sections in the course, but I found the one explaining the language structure of Arabic and Chinese particularly interesting. He talked about Arabic first and I was all keen to learn that language. Then he switched over to Chinese and I found it even more fascinating. And then there were comparisons of languages and his cat. This has to be heard to be believed.</p>
<p>The course is obviously <a href="http://www.teach12.com/ttcx/coursedesclong2.aspx?cid=1600&amp;pc=Professor304" title="Original (commercial) source for the course">available for purchase</a>, but it is also <a href="http://www.worldcat.org/oclc/58542774" title="WorldCat entry for the course">found in quite a few libraries</a>. If you do borrow it from the library, try requesting all volumes at once. I only requested one volume and it was quite annoying to then have to wait a long time for the rest of the course to arrive. That the wait annoyed me is another sign of how enjoyable the course was, as I had plenty of other audio material to listen to in the meantime.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.outerthoughts.com/2007/09/story-of-human-language-great-introductory-audio-course-on-linguistics/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

