Introduction to Apache Solr – presentation and code

I recently gave a presentation on Solr at a Bangkok meetup group. About 40 people attended, and I had great follow-up discussions.

As part of the presentation, I showed how Solr deals with the Thai language. Even though my knowledge of Thai is fairly rudimentary, I dug into existing resources and found at least a couple of ways to process Thai text.

One of the little things I wanted to show to a mixed Thai/English audience was how to map Thai text to Latin text automatically (e.g. for cross-language search). Learning how to do it led me to discover ICUTransformFilterFactory (a side benefit of compiling my own complete list of filters), which in turn led me to the fact that the factory accepts a complex transformation grammar. That led me to try magic invocations such as:

<filter class="solr.ICUTransformFilterFactory" id="Thai-Latin" />
<filter class="solr.ICUTransformFilterFactory" id="NFD; [:Nonspacing Mark:] Remove; NFC" />

Which, in English (human, not codepage), means:

  1. Convert Thai to Latin characters, which for this factory includes Latin characters with tone marks on them, producing really strange-looking text (ข้าว -> k̄ĥāw)
  2. Split the tone marks to be separate from characters
  3. Get rid of those tone marks
  4. Rebuild proper Unicode characters, which are just Latin characters by now
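
For anyone who wants to see steps 2 to 4 outside Solr, here is a minimal Python sketch that mimics the second filter using only the standard library's unicodedata module. It is an illustration, not what Solr runs internally: the function name strip_nonspacing_marks is made up for this example, and it assumes that ICU's [:Nonspacing Mark:] set corresponds to the Unicode general category "Mn". Step 1 (the Thai-Latin transform) needs ICU itself and is not reproduced here, so the input is the already-romanized string from the list above.

```python
import unicodedata


def strip_nonspacing_marks(text: str) -> str:
    """Rough equivalent of the ICU rule 'NFD; [:Nonspacing Mark:] Remove; NFC'.

    Hypothetical helper for illustration only; not part of Solr or ICU.
    """
    # Step 2: decompose, so each tone mark becomes a separate combining character.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 3: drop every character in general category Mn (nonspacing mark).
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Step 4: recompose whatever is left into canonical composed form.
    return unicodedata.normalize("NFC", without_marks)


# The romanized form of the Thai word from step 1, written with escapes:
# k + combining macron, h with circumflex, a with macron, w.
romanized = "k\u0304\u0125\u0101w"
print(strip_nonspacing_marks(romanized))  # prints "khaw"
```

Plain ASCII input passes through unchanged, since it contains no combining marks to remove.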

This was still far from perfect, but it was interesting enough that even a local Thai Solr consultancy (Inspica) was quite fascinated.

I have published my full configuration, data set and import scripts as a GitHub repository. Anyone more familiar with Thai than I am is welcome to push that particular issue further.