Arabic numerals’ non-WYSIWYG

For another project of mine, I needed to process some Arabic text in an HTML file derived from an MS Word document.

Everything was going reasonably well, except that my regular expressions were not picking up section name/number sequences in all cases, which was causing a problem for the 6-language alignment algorithm.

Normally, I just examine the text . . . → Read More: Arabic numerals’ non-WYSIWYG

jQuery: Cycling between multiple classes with random start

I saw an interesting question on StackOverflow about how to cycle between 3 states for list items, with the initial state of each item potentially being different.

The random start position was the part of the problem that got me thinking, so I used it as an exercise to try some newish jQuery functions, such as . . . → Read More: jQuery: Cycling between multiple classes with random start
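
The excerpt above is cut off before the post's own solution, so what follows is only a minimal sketch of the general idea and not the approach the post takes. It assumes three hypothetical class names (`state-a`, `state-b`, `state-c`) and uses plain JavaScript for the cycling logic, so it runs without a page or jQuery.

```javascript
// Hypothetical state names; the class names from the original question
// are not shown in this excerpt.
const STATES = ["state-a", "state-b", "state-c"];

// Return the state that follows `current`, wrapping around at the end.
// An unknown or missing state falls back to the first state, because
// indexOf returns -1 and (-1 + 1) % length is 0.
function nextState(current, states = STATES) {
  const index = states.indexOf(current);
  return states[(index + 1) % states.length];
}

// Pick a starting state at random. `random` is injectable so the choice
// can be made deterministic when checking the logic.
function randomState(states = STATES, random = Math.random) {
  return states[Math.floor(random() * states.length)];
}

// Each item starts somewhere different, then cycles through all states.
let state = randomState();
for (let click = 0; click < STATES.length; click++) {
  state = nextState(state);
}
// After as many steps as there are states, the item is back where it started.
```

On a real page, a click handler could look up the item's current class and swap it with jQuery's `removeClass` and `addClass`, passing the result of `nextState` as the new class.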

A new camera

I got myself a new digital camera recently, a Canon T2i. It feels really nice and makes it quite hard to go back to point-and-shoots afterward. And it takes really good 18 megapixel shots.

Here is one of a South African warthog, displayed using Microsoft’s Zoom.it technology. Try zooming . . . → Read More: A new camera

jQuery for multilingual web development

I have (nearly) finished developing a mini-website in 6 languages (Arabic, Chinese, English, French, Russian, Spanish). The layout was the same for all of them, so ideally it would have been driven by a content management system. Unfortunately, that was not possible in this case, as I was not given enough time to set up the infrastructure.

As I know nearly nothing of at least . . . → Read More: jQuery for multilingual web development

Making up with ANTLR

I like ANTLR! It is a specialized tool that can really be applied to many difficult tasks when regular expressions get all Dust Puppy like. And I have used it in the past with great success.

But, every time I put this particular tool aside, I know that picking it back up will be like making up . . . → Read More: Making up with ANTLR

Conjunctions in named entities

A recent article on LingPipe discussed conjoined named entities such as Johnson and Johnson and Wallace and Gromit. They suggest that one way of treating such an entity may be as a frozen expression. I assume that means relying on statistical measures to see whether this Multi-Word-Expression repeats often enough to be treated as a unit.

In the United Nations . . . → Read More: Conjunctions in named entities

CiteULike Exhibit visualization

Homegrown visualization is not the only way to quickly navigate CiteULike references. There are other tools that display bibliographies in interesting ways.

One such tool is Exhibit, one of the graduates of the SIMILE project. It lets you build a very interactive webpage driven by just HTML+JavaScript, with no server-side component required. I really like SIMILE’s tools, even . . . → Read More: CiteULike Exhibit visualization

Visualizing CiteULike collections

I am collecting my reading and reference material in CiteULike. I like the service because it can capture details from multiple sources. It also lets me discover what other interesting people have collected, by navigating the graph of tags, people and bookmarks.

Nice as CiteULike is, it is fairly difficult to get an overall picture of one’s own . . . → Read More: Visualizing CiteULike collections

New mailing list to discuss junction of NLP and Software Engineering

Dr. René Witte has just created a new mailing list (SENLP) to discuss applying NLP techniques to Software Engineering, as well as general Software Engineering issues in developing NLP systems.

I am interested in both topics. I did 3 years as senior technical support at BEA and could see how applying NLP techniques on written notes . . . → Read More: New mailing list to discuss junction of NLP and Software Engineering

Where are all legal computational linguistics resources?

I am frustrated. I know my corpus (resolutions of the United Nations General Assembly) has a lot in common with the biomedical and legal domains. And I can find interesting articles in the biomedical domain dealing with similar issues of complex tokenization, long named entity mentions (though mine are much longer), etc. But I see nothing in legal . . . → Read More: Where are all legal computational linguistics resources?