Computational Linguistics – News update for Nov 15, 2006

November 16, 2006


Lots of new sightings of CL/NLP technologies since the last update:

On the commercial speech recognition front, Nexidia is currently in beta with phoneme-mapping audio search. But don’t go to the company’s site. Instead, read the explanation and collection of links in ResourceShelf’s article.
If, instead of waiting for commercial offerings, you would like to contribute to an open source one, VoxForge always needs more transcribed audio recordings to improve its Command and Control acoustic models.
Switching from speech recognition to speech synthesis, E-health-insider has a fascinating podcast from the field (Somalia), with a practical example of how even an imperfect technology can bring tangible benefits to people in need.
Text generation might also soon become a more interesting topic. Indiana University recently launched The Synthetic Worlds Initiative and, as part of it, very recently started the ARDEN project, which will try to produce a synthetic 3D world set in the universe of William Shakespeare. They are not planning to have bots in there, but can they resist? A virtual world interface and the availability of the full texts of Shakespeare’s works make it an ideal playground for advanced A.L.I.C.E competitions.
If you like text classification tasks and/or machine learning, there is an Agnostic Learning vs. Prior Knowledge Challenge & Workshop. Dataset Nova is the one for text classification; there are others for different machine learning tasks. There might even be a small prize.
For those who only get out of bed for big(ger) prizes, there is the Second Annual CyC prize. The prize is $2,500, but to get it you must publish an academic paper that has something to do with CyC’s knowledge base of assertions about the everyday world. This may or may not be a hard task; you can judge for yourself by checking out the winners of last year’s prize. The deadline is February 21st, 2007, and some people may have a head start, since the competition has been running since February of this year.
Named Entities and the Semantic Web come together in a demo by InFact that parsed public domain books and cross-linked them into a web of names, places and relations. Just don’t try to manually change the URLs; the implementation itself is a bit brittle (the company was notified). On a more abstract level, this demo also shows the benefits of actually having unrestricted full-text access to books. I feel that public domain books are just waiting to be remixed and experimented with beyond what we see now.
Finally, those who missed AOL’s attempt to beat Google’s release of n-gram models by releasing, and then withdrawing, 20 million web queries that included private data can still get access to that data from multiple websites, including one with a semi-useful search interface. One wonders if the AOL executive responsible for the release decision likes proverbs, specifically the one that goes “A word spoken is past recalling”.