From time to time I experiment with the GATE NLP toolkit. Just now I tried to upgrade to the latest version (version 4) and ran into a really strange problem with the ANNIE system not loading correctly. Later, when I uninstalled the older GATE version, it stopped loading altogether.
The problem is the user configuration file gate.xml, which is stored in a shared location, usually the home directory. On Windows, that is C:\Documents and Settings\[ProfileName]\.
One of those settings pointed to where the plugins were loaded from and was still referring to GATE 3.1’s locations. That caused NullPointerExceptions in GATE and everything was breaking from that point on.
I found this by using FileMon, but later realised that it could have been done more easily by changing the runtime.spawn property to false in GATE’s build.xml file, which is used to start the program. Using ant to start a program is a new one for me, but I guess it makes sense in some cases. Setting the property to false shows the startup messages and the exception that the wrong directories cause.
I deleted the old gate.xml and gate.session files in my home directory and everything started to work. Back to actually trying to use the software.
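For anyone hitting the same problem, here is a minimal Java sketch of the cleanup step. It assumes the leftover files are named gate.xml and gate.session and sit directly in the user's home directory, as they did for me; other GATE versions or platforms may differ. The class and method names are made up for illustration and are not part of GATE.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class StaleGateConfig {

    // File names taken from the post; treat them as an assumption for other setups.
    static final List<String> CANDIDATES = Arrays.asList("gate.xml", "gate.session");

    /** Returns the candidate config files that actually exist in the given directory. */
    static List<Path> findStale(Path homeDir) {
        List<Path> found = new ArrayList<>();
        for (String name : CANDIDATES) {
            Path p = homeDir.resolve(name);
            if (Files.isRegularFile(p)) {
                found.add(p);
            }
        }
        return found;
    }

    /** Renames a file to "<name>.bak" next to the original instead of deleting it outright. */
    static Path moveAside(Path file) {
        Path backup = file.resolveSibling(file.getFileName() + ".bak");
        try {
            return Files.move(file, backup, StandardCopyOption.REPLACE_EXISTING);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path home = Paths.get(System.getProperty("user.home"));
        for (Path p : findStale(home)) {
            // Report only; call moveAside(p) once you are sure the file is the stale one.
            System.out.println("Found old GATE config: " + p);
        }
    }
}
```

Renaming rather than deleting keeps the old settings around in case something in them is still needed.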
Use case
Many people who visit foreign countries feel lost or confused when travelling around or getting services. If possible, they like to go places with a local friend who will point out the best features, explain how things work and/or translate requests into the local language. This is a service for those who do not have such a friend.
Basic business flow
- A service kiosk in the airport (visitor’s center) would hire out mobile phones with GPS and a camera built in. A visitor picks up a phone and gives their language preferences.
- At any point, the visitor can call the local service number on speed-dial and get help with anything that voice+GPS+SMS+camera can deliver. For example:
- The agent answers the call in the visitor’s language and can translate the communication between the visitor and locals (via speaker phone). If it is a sign, poster or written material, it can be photographed and sent to the agent for explanation.
- The agent knows where the user is located (via GPS) and has internet access to street directories, toilet maps, phone directories, public transport, traffic maps, Google Earth view, local rules, etc.
- Therefore the agent can advise the visitor, in their language, on any issue the visitor needs resolved. Any notes can be sent via SMS to ensure understanding/recall.
Additional features and up-sell opportunities
- If the visitor brought their different-standard phone over, use the service interface to automatically copy the phone numbers onto the new handset with automatic cheap rates via (for example) Rebtel number substitution. Or phone/Skype integration.
- Provide alarm, booking and similar services
- Provide audio tours via integration with GPS and an IVR system
- If the visitor provides their social network credentials, integrate with those systems to post the geocoded pictures they send to their account and/or provide two-way integration between SMS and Skype
- Provide language lessons (speed dial 8 for the “target language” taught by “your language” tutor)
- The service can be pitched as for emergency use only, so that the fee for hiring the phone without actually using it would be similar to car insurance (couple of dollars per day).
Revenue
- The revenue would come from charging for the services (probably per minute) and for mediating 3rd party solutions such as Rebtel (surcharge per minute). Services that mostly run through an IVR interface could probably be cheaper than human-assisted ones.
- None of these services require phone agents to be present in the target country, only the sales agents and DID phone numbers. SIP trunks allow for that.
- Even sales agents could be minimised with booking the phone over the Internet, etc. Depending on the cost of the phones, they could be prepaid service or Credit Card deposit type.
This idea is released under the Creative Commons Attribution 3.0 Unported license.
Ed Foster has discovered that it is very difficult to sign out from big companies’ websites. Yes, it is true when staying within the website’s rules. But it is dead easy otherwise.
The important thing to remember is that your identity is, most of the time, stored in the browser cookies. So, if you kill the cookies, the session will go away and your identity will go with it.
The easiest (but most destructive) way is to delete all cookies. With Firefox, this is the menu item Tools/Clear Private Data (Control-Shift-Del); on Internet Explorer 6, it is Tools/Internet Options/General/Delete Cookies.
The problem, of course, is that it wipes all your login information for all websites. Then again, if you were shopping on a public computer, that’s the best course of action anyway.
With Firefox, there is a much more precise way to delete the cookies. It comes with the Web Developer Extension - and cookie management is just one of that extension’s invaluable options. Once the extension is installed, it shows up as a toolbar. Cookies is a submenu on the left with a whole host of different options.
Using the extension, the easiest way to delete cookies is to choose Cookies/Delete Domain Cookies while on the target (Amazon, eBay, etc) website. This will delete all cookies set by that site, and on the next page refresh you will be a totally anonymous customer.
Advanced user’s notes
The above works in nearly all cases. Some websites get a bit sneaky and set Flash cookies instead. This is mostly done by websites such as YouTube, but for some reason images.amazon.com sets one as well. Deleting those can be done via Adobe’s Flash Player Settings Manager, which is actually a web page with a specialised Flash application that shows the cookies and lets you clear them.
Finally, deleting the cookie does not mean the website cannot track you otherwise. Google, for example, will apparently use your IP address to correlate searches even across multiple sessions. It is not the same issue as keeping you logged in, so I am only mentioning it in the wider privacy context.
As part of doing a PhD in Computational Linguistics, I need to understand both computers and linguistics. I am fine with computers, but linguistics is not my strong point. Unfortunately, many of the linguistics books and resources are quite dry.
So, I was really happy to discover the audio course Story of Human Language from The Teaching Company, taught by John McWhorter. It is quite long and covers a lot of material, but - apart from some overly long parts on universal language - it is really interesting and Professor McWhorter is a great presenter.
I actually had a chance to listen to the audio version of the course and to see some of it on DVD. Personally, I prefer just audio for several reasons.
Firstly, I can listen to the course on my MP3 player when I am walking or doing chores. The video version requires allocating dedicated time, which for such a long course would be difficult.
Secondly, I actually found the visual part of the presentation quite boring - for the most part the professor just stands behind the lectern and talks from his notes. In fact, I found the visual part distracted me from the really great and expressive rhetoric.
There were a number of great sections in the course, but I found the one explaining the language structure of Arabic and Chinese particularly interesting. He talked about Arabic first and I was all keen to learn that language. Then he switched over to Chinese and I found it even more fascinating. And then, there were comparisons of languages and his cat. This has to be heard to be believed.
The course is obviously available for purchase, but it is also found in quite a few libraries. If you do borrow it from the library, try requesting all volumes at once. I only requested one volume and it was quite annoying to then have to wait a long time for the rest of the course to arrive. That impatience is another way I knew the course was enjoyable, as I had plenty of other audio material to listen to otherwise.
These are new style language-learning websites that are trying to leverage community and/or new capabilities allowed by the internet:
- SpanishSense - they have podcasts, PDFs, daily emails and a lot more. This site is run by the same people who have been doing the really successful ChinesePod for several years now. It looks very slick.
- LiveMocha - they are doing social network style language learning. Others have done it before them, but LiveMocha seems to be a bit stronger on multiple modes of learning than other similar sites. Of course, building yet another social network is a pain and will be a limiting factor.
- Mango - This website is an invite-only beta, but my invite arrived less than 30 minutes after registering. They have lessons for a number of languages, but the lessons themselves are in a PowerPoint-style presentation. I guess they hope that a nice presentation will make up for a somewhat inflexible format. And of course, they are still in beta.
Among these three, my money is on SpanishSense. I feel that trying to do too many languages at the same time means none will be done right.
It is true that the internet makes it possible to leverage the Long Tail effect and create a super-niche website (with a niche per language), but I do not see how one company would have enough time and money to support all those niches well enough. This is my main annoyance with WordChamp, which I quite like otherwise.
As part of The Rich Web Experience, the Fairmont hotel - where the conference is held - offers free WiFi. You have to enter a username and password on the first post-connect page, and then it unlocks browsing.
I love WiFi. I have an HP PocketPC that has WiFi built in. I was fully prepared to read my mail, do research and upload photos. Alas, that was not to be!
The WiFi protection form that collects the username and password uses JavaScript to submit the form, with the submit button being an image with an onClick handler. My PocketPC does not do JavaScript, or at least not the JavaScript they used. Therefore, I was not able to get past the login screen and actually use the WiFi.
I find this extremely ironic given that half of the talks at the conference are about Progressive Enhancement, Hijax and other ways to ensure that the base functionality works even with JavaScript disabled. In my eyes, the ability to submit a two-field form is pretty base.
I am currently at The Rich Web Experience 2007 conference. It is interesting to compare it to JavaOne conferences I have been to in the past.
To start, RWE is much smaller. It is about 400 people as compared to 15 thousand at JavaOne. This obviously makes scheduling logistics and eating arrangements simpler, but there is also a very different feel in the air. It is hard to walk around without bumping into speakers and/or other moderately famous web people. At JavaOne, it is all about learning; here it is more like sharing.
Another interesting thing I noticed is that a lot more people than I expected came from a Java server-side background. In fact, we had a round of introductions at the Web design Birds-Of-Feather session and more than half of the people in the room had some (often strong) background in Java. To me, this is a great sign, as it shows that the path I am taking (adding HTML/CSS/JavaScript to my Java skills) has already been taken by multiple people before without too many problems.
I have gone to the following sessions:
- Secure application development with Ajax (by Dean H. Saxe) - The presentation itself was great and covered an interesting topic in detail. I did not understand all of the advanced concepts and consequences, but the core message was very clear and the slides give enough hints and terms to do further research on my own. I would have liked a more detailed example (e.g. ‘This is why SOP is not applicable’), but overall it was great.
- Merging Ajax and Accessibility (by Mark Meeker) - Another great presentation. I had heard before that designing for accessibility has the beneficial side effects of increased general usability and better design practices, but it was good to see it confirmed with large commercial sites. Mark also had great examples and talked a bit about Hijax as a way of building accessibility into the process, rather than trying to bolt it on at the end.
- Web Design for Server-Side Developers (by Greg Murray) - This one I found somewhat disappointing. I knew that covering good HTML, CSS, JavaScript, modular design and supporting tools in one presentation might be too ambitious. Still, I was looking forward to some sort of high-level, consistent story tying the bits together, with some best practices thrown in. Unfortunately, Greg was not able to deliver that. He spent too much time jumping between the topics. He also talked about jMaki’s implementation a lot. That might have been useful, but given that some very important issues (Internationalisation, classes vs. IDs, etc) were still not implemented correctly (by Greg’s own admission), I felt jMaki was not yet ready to be shown as an example of best practices.
- Web design/architecture Birds-Of-Feather session with Aaron Gustafson, David Verba and a couple of others. It was actually interesting, because I sat with them at the dinner table without realising who they were. But you could see they were really smart and interesting, even in their unstaged moments. True geeks, in the good sense of the word. The session itself was a very interesting discussion and somehow I even managed to hog the floor for a while with my questions. Hopefully, it did not annoy too many people.
I am looking forward to the second day.
Arthur C. Clarke once famously wrote “Any sufficiently advanced technology is indistinguishable from magic”. In the same vein, many people feel that any sufficiently established bureaucracy is like black magic, sorcery even. Certainly, it often takes skills out of this world to follow the logic of modern tax return instructions.
Bureaucracy often has its place and reason. Laws protect exploitable minorities; procedures serve to avoid known problems; cross-referencing forms are filled in triplicate to allow for audit and protection against falsification. The problem is not the bureaucracy as such but rather the fact that it eventually outgrows any individual person’s ability to comprehend it. At that point, only dedicated specialists can understand the process and the rest of us have to offer sacrifices to those acolytes in hopes of beneficial results.
Enter computers. It turns out that computers can bring the complexity of information down, back within the reach of the non-specialist. The more bureaucratic a process, the better a computer can figure it out. What is a mind-numbing in-triplicate form to a human is, to the computer, a structured source of information with cross-checking redundancy.
This area of research is called “Natural Language Processing” - NLP. It is not an obscure field - any Google user has benefited from this type of research. Other applications of NLP include speech recognition and machine translation.
NLP is not a new branch of science. Back in the 1950s, software was being developed in the USA to translate from German into English. The translation quality of grammar-based systems was very poor. Nevertheless, even the possibility of machine translation was so impressive that about US$20 million was spent on the research before the enchantment fizzled out and fund allocations virtually stopped. NLP did not die at that point, but it certainly slowed down.
Statistical approaches to NLP have been around nearly as long as grammar-based ones. However, as they require large quantities of data, these did not become feasible until the mid-1990s. Once they did reach popularity, however, the research advanced rapidly, taking advantage of ever increasing computer speed and available storage. Statistical approaches do not rely on language comprehension. Instead, with sufficient amounts of text, common patterns can be established without understanding the rules of their formation.
A good example is Google’s new translation engine from Arabic to English. The engine won the NIST 2005 machine translation competition, even though its software developers did not know Arabic. Instead, they used existing parallel documents of the United Nations translated by professionals - some 200 billion words of content in total. It is perhaps symbolic that, even in such a deeply technical area, the Universal Declaration of Human Rights helps to ensure humans all over the world will be able to communicate with each other.
Standalone, however, a statistical approach is not a panacea either. Since there is no real understanding involved, a statistical NLP system has no way to recover from invalid conclusions.
There is more to the puzzle. Most real-world texts are about somebody or something. The entity could be a person, a company, or a committee. Sometimes, the name of that entity is very long. Documents of the United Nations are known for names that even a human would struggle with. “The Ad Hoc Committee on the Scope of Legal Protection under the Convention on the Safety of United Nations and Associated Personnel” would be one of those. Other large organisations have similar problems.
Currently, neither of the above approaches is sufficient on its own. Grammar-based systems break on complex names; statistical ones mark ‘The Committee’ as a completely separate entity, rather than a reference to the full name.
The ideal system that we’re working on would be able to identify the complex names using a combination of techniques. It would also be capable of using multiple appearances in different contexts to confirm the identification, including linking different forms of the same name. Once these goals are achieved, documents in legal and medical domains can get the full benefits from other, already available, research.
Soon, the day will come when computers understand what humans write or say. Hopefully, without needing the triplicates.
When the OpenNLP toolkit uses the MaxEnt parser, it has to read in about 25 MBytes of model files. The model reader uses a basic unbuffered FileInputStream. The result is an excessive number of system calls (and disk access calls) during the parser startup.
The fix is extremely simple:
- In maxent-2.4.0/src/java/opennlp/maxent/io/ObjectGISModelReader.java, replace new FileInputStream(f) with new BufferedInputStream(new FileInputStream(f), 1000000)
- Recompile the maxent library
- Deploy the new version of maxent-2.4.0.jar into OpenNLP’s lib directory
The comparison is striking (the numbers are file-access system calls, before -> after):
- build.bin.gz - 29830 -> 40
- chunk.bin.gz - 11853 -> 16
- tag.bin.gz - 11091 -> 14
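To see why the buffer makes such a difference, here is a small self-contained Java sketch. It is not the maxent code: it reads an in-memory byte array one byte at a time, and a hypothetical CountingInputStream stands in for the file so that each call reaching it plays the role of a system call. The 1000000-byte buffer size is the one from the fix above; everything else is an assumption made for illustration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

public class BufferedReadDemo {

    /** Passes reads through to the wrapped stream and counts how many calls reach it. */
    static class CountingInputStream extends InputStream {
        private final InputStream in;
        long calls = 0;

        CountingInputStream(InputStream in) {
            this.in = in;
        }

        @Override
        public int read() throws IOException {
            calls++;
            return in.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return in.read(b, off, len);
        }
    }

    /** Reads the data byte by byte and returns how many read calls hit the underlying source. */
    static long countReads(byte[] data, boolean buffered) {
        CountingInputStream source = new CountingInputStream(new ByteArrayInputStream(data));
        InputStream in = buffered ? new BufferedInputStream(source, 1000000) : source;
        try {
            while (in.read() != -1) {
                // Consume the stream the way a naive reader would: one byte per call.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        byte[] data = new byte[40000];
        System.out.println("unbuffered calls: " + countReads(data, false));
        System.out.println("buffered calls:   " + countReads(data, true));
    }
}
```

Without the buffer, every single-byte read goes all the way down to the source; with it, the source is only touched when the buffer needs refilling, which is the same effect as in the numbers above.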
I was not able to get the OpenNLP parser to work. There were no samples to play with, no command line tools to run. And I don’t even want to talk about documentation. That’s because there was not any. There was an attempt at a lame joke (at least that’s the only sense I can make of the what.html file), but no actual documentation.
Finally, I pinged my research colleague who did get the toolkit working (thanks Scott). It turns out there is a whole set of model files missing from the tool’s download. They are linked to from a separate page on the original website (not even in the download).
I am downloading the models now and hopefully will be on my way. But I can certainly see why this particular toolkit is mentioned much less frequently than Stanford’s or Bikel’s.
After the fact, I have also found a mini tutorial by Daniel McLaren explaining OpenNLP components and showing some sample code and output. Looks better than what’s bundled with OpenNLP itself. Maybe Daniel and Thomas Morton (author of OpenNLP) should talk.