Leanpub recipe: versioned book backups

Leanpub is a platform for publishing books that gives authors control and tools beyond those available from traditional publishers. I used it to test an idea for a Solr Clients book and was able to validate its lack of traction without spending a cent.

One of the great features of Leanpub is that a book can be updated and the readers get notified of the new version. As a result, it is sometimes useful to keep the old versions around for the author and sometimes even for the readers.

This feature is not currently present in Leanpub, but we can add it with a bit of glue from Zapier, a web automation service. This is especially useful if you are using GitHub to edit your book with automatic preview and want a full end-to-end edit-preview experience without having to touch the Leanpub website.


Javadoc custom doclets – fun, frustration and forward motion

Javadoc is the default – and often the only – documentation for open source Java projects. It is generated automatically and can just be dumped on any public-facing server as a bunch of static files. Or even bundled with the distribution, if size is not an issue.

However, as a project grows, several issues with using Javadoc documentation become apparent. The main issue is that Javadoc (yes, even the JDK 8 one) uses frames and JavaScript for navigating the packages and classes, which breaks any sort of direct linking to the content as well as discoverability by search engines. Yes, there are NO FRAMES links, but then the navigation becomes really cumbersome. The second issue is that the generated Javadoc uses rather old HTML standards and is really not designed for Search Engine Optimization, which means that search engines usually end up discovering random entry points by following somebody’s old blog post.


Introduction to Apache Solr – presentation and code

I recently gave a presentation on Solr at a Bangkok meetup group. There were about 40 people and I had great follow-up discussions afterwards.

As part of the presentation, I showed how Solr deals with the Thai language. Even though my knowledge of Thai is fairly rudimentary, I dug into existing resources and found at least a couple of ways to process the Thai language.


Introducing Solr Start

The Solr Start website has been out for a little while in alpha form. However, it has now been formally relaunched (into beta) and I wanted to introduce it and explain the reasons behind it.

While doing the research for my Solr book, I realized how powerful Solr was. Yet, I was also frustrated that this power was hard to discover. I would read about a configurable aspect of Solr, such as UpdateRequestProcessor factories, and the official wiki would have a couple of them mentioned. But going to the Javadoc would show that there were quite a number of those factories available, if one would just spend 30 minutes clicking around the multi-level inheritance hierarchy. Which I did the first time I discovered the power.

But then, with each new version of Solr, other factories would be added and I would miss them. And, as was obvious from the mailing list and Stack Overflow discussions, so would many other people. Even some quite advanced ones. And, even if one could navigate JavaDocs, it would not always help. Lucene and Solr’s documentation is split into several modules and classes such as UIMAUpdateRequestProcessorFactory (in solr-uima module) link to the parent class (in solr-core module) but are not linked back to.

It wasn’t that the information was not available. It was just hard to find and even harder to comprehend quickly. Between browsing through classes, bad Solr Javadoc SEO on the web and obscure jars for non-core modules, it was clear that people were missing useful Solr functionality.

This problem was even more noticeable for beginner Solr users who were trying to up-skill to the intermediate level. Solr is very easy to start with, but there is a noticeable jump in effort required to get much beyond the collection1 level of understanding.

I tried to address that problem with my book and now also with the Solr Start website. Currently, it has two comprehensive collections of resources, bringing together their names, direct links to their JavaDocs and even the module jars they can be found in:

There are more resources like that and they will be added to the site soon.

Importantly, these collections are not the only examples of what Solr Start will have. I have accumulated quite a large ToDo list of diverse things that could be useful for beginner and intermediate Solr users. And I hope Solr Start will become a very prominent resource for users to quickly discover exactly what’s available and to save the unnecessary digging around.

There is also an associated mailing list, with sign-up on the home and the resources pages. I will be announcing new resources on that list. But I will also be doing some early previews well before the general audience sees the final version. And there will be rewards for the subscribers on my next commercial Solr project (a book of sorts).

So, please visit the website, join the mailing list and use and share the resource links to help make Solr mastery an easier goal to achieve for everyone.

Checking examples in “Solr Indexing” with Solr 4.7 under Windows – part 1

It’s been 9 months since my introductory Solr book came out. It was written for version 4.3. In the meantime, Solr kept marching on and is now at version 4.7. There have been quite a number of changes and new features. So I really wanted to recheck that the examples in the book still make sense. I also wanted to do the tests on Windows to see whether the *nix-centered instructions in the book caused any issues.

This first part covers the issues and supplementary material based on the review of the first five chapters. Later parts will be covered in other blog posts. So far, it seems that the examples survived without any serious issues.

Overall comments

  • Do not copy examples from the book PDF. Instead, use the examples published on GitHub. This avoids problems with content broken over multiple lines and having to delete page-end material introduced in the PDF.
  • Solr logs look somewhat funny in the DOS console, especially with some weird û characters, but mostly they are ok. It may be worth increasing the screen buffer size to see long exception traces.
  • Even though the book PDF has clickable links enabled on URLs (that took FOREVER), some – though not all – line breaks cause URLs to stop prematurely as well. So, if your browser results are very different from what the book suggests, recheck that you have the full URLs. Sometimes that will require some surgery if Solr redirects within the Admin interface erroneously.
  • Command line examples often mix URLs and local directories. On *nix/Mac, these all use forward slash (/), but on Windows the URLs still use forward slashes and local paths use backslashes (\). For example:
    java -Dauto -Durl=http://localhost:8983/solr/collection1/update -jar post.jar collection1\input1.csv
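
The separator rule from the last bullet can be sketched in Python as a small helper. This is only an illustration, not something from the book: the function name build_post_command is made up, and PureWindowsPath is used so that the local path gets backslashes no matter which platform runs the script, while the URL is passed through untouched.

```python
from pathlib import PureWindowsPath


def build_post_command(solr_url, *local_parts):
    """Build the post.jar command line as a list of arguments.

    build_post_command is a hypothetical helper name. The URL keeps its
    forward slashes; the local path is joined with Windows backslashes.
    """
    local_path = PureWindowsPath(*local_parts)
    return ["java", "-Dauto", f"-Durl={solr_url}", "-jar", "post.jar", str(local_path)]


if __name__ == "__main__":
    command = build_post_command(
        "http://localhost:8983/solr/collection1/update",
        "collection1",
        "input1.csv",
    )
    print(" ".join(command))
```

Keeping the arguments as a list (rather than one string) also avoids quoting surprises if the command is later handed to subprocess.run.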

Chapter 1: Creating your first collection

  • In solrconfig.xml, we have Lucene version set to LUCENE_43. This still works, though the latest version is LUCENE_47 and is what I am testing with. I also noticed that now it seems possible to use version number directly (e.g. 4.7).
  • Solr WebAdmin screens have rather different collection-specific sub-screens. The ones the book uses are all still there, but there has been quite a lot of change and progress overall, some of which is related to now being able to edit schema and configuration via the web UI.
  • Scripting Solr startup turned out to be slightly more complicated on Windows. Here is an example that works:

    SETLOCAL
    CD /D D:\SOLRPATH\solr-4.7.0\example
    java -Dsolr.solr.home=D:\SOLR-INDEXING -jar start.jar

    Here, SETLOCAL ensures that the prompt returns to the directory the script was started from, instead of switching to Solr’s example directory and staying there.

Chapter 2: Running several collections at once

As mentioned in the book, Solr now has core autodiscovery with radically different semantics. And as predicted, it took a couple of versions for it to mature. I think it works by now, but the book examples still work fine with the legacy solr.xml format.

Chapter 3: Importing multivalued fields

Similar to scripting in chapter 1, both copying and deleting directories turned out to be slightly non-trivial in DOS. The relevant example commands are:

xcopy /SI collection1 multivalued
rmdir /s multivalued\data

Chapter 4: Using Solr’s XML format

If you practice delete commands at the end of the chapter, you need to rerun the import before moving to the next chapter. The full populated index is reused later.

Chapter 5: Indexing text

  • Be careful with step 1, as it combines several instructions (trying to save space). Do not delete the data directory and make sure to restart Solr after modifying solr.xml.
  • Some of the examples have ‘?’, ‘;’, and ‘.’ characters stuck at the end of URLs. They don’t actually change the results (no match), but could be quite confusing, especially since the URLs also contain URL-encoded values.
  • Be careful if you are following this chapter by copying config files from the GitHub repository. The same several files are changed multiple times during the chapter and the GitHub repository represents only the final state. It might be better to copy selectively or even do this one mostly by typing.

Summary

It seems that the book has survived so far, at least for basic examples. And it works on Windows without too many changes. So, if you are a beginner or early intermediate Solr user, it is still good value.


If you enjoyed this article, you may also benefit from other information resources available at solr-start.com.

Book review: Apache Solr 4 Cookbook

As an author of an introductory Solr book, I have quite a bit of curiosity in how other books cover similar and more advanced material. So I was quite interested when Packt Publishing asked me to review one of their other Solr books: Apache Solr 4 Cookbook by Rafał Kuć.

Apache Solr 4 Cookbook is actually not a new book; even the second edition is exactly a year old (published in January 2013). As Solr has been moving rather quickly, things have changed between version 4.0 that the book covers and version 4.6.1 that was just released. Several recipes have become out of date because of new logging library requirements, library file name changes, and ongoing improvements for schema-less configuration. I would not be surprised if some of the examples do not quite work, though usually a quick search will fix it. The Solr user mailing list is a good forum to get clarifications on possible changes and the book author is one of the participants. Also, the author’s own website provides additional useful material, often on quite recent features of Solr.

The book is structured – and not just called – as a cookbook. There is a large number of small examples targeting specific use cases with instructions and explanations on how to achieve the goals. As each recipe is standalone, there is no sense of progression through the book. So it is not worth trying to read it in any particular sequence. Unfortunately, it also means that the issues are not explored in any great depth and they rely on pre-existing knowledge that the reader may or may not have. It also jumps between very basic topics and topics more suitable for high-intermediate readers. Therefore, I would strongly recommend against this book as a first book on Solr. There are a couple of other books on the market specifically targeting beginners and introducing material in an incremental fashion. Mine is one of them and I will be reviewing another one soon: Apache Solr Beginner’s Guide by Alfredo Serafini, also from Packt.

Despite the shortcomings listed above, the book does have some value. Skimming the table of contents shows a number of issues people often run into. So, once somebody has acquired reasonable background knowledge of Solr, this book becomes a more valuable quick reference. Most of the same information exists in other sources, often for free; however, the book provides complete mini examples with explanations. That’s often easier to understand than jumping between 4 or 5 pages of the online wiki and random blog posts.

In fact, the later chapters of the book have material that’s quite hard to find and distill. I particularly liked chapters 6 (Improving Solr Performance), 8 (Using Additional Solr Functionalities) and – surprisingly – the appendix (Real-life Situations). I would recommend reading those chapters in depth and trying to experiment with the configurations provided there.

I do wish the examples were more comprehensive, but that’s the limitation of printed books that every author struggles with. It’s just that it is a little hard to really understand boosting examples when experimenting on only two records. Perhaps the next edition could improve on that as a part of book marketing – provide shorter examples in the book and the longer examples online or as part of code download.

In summary, despite the book Apache Solr 4 Cookbook being somewhat uneven and out of date, it still has quite a bit of value as a reference for an intermediate Solr user. Just make sure you read something more introductory first.

Wrap-up of the Solr Usability Contest

The Solr Usability Contest has finished. It ran for four weeks and received 29 suggestions, 113 votes and more than 300 visits. People from several different Solr communities participated.

The final list of suggestions (sorted by votes) is:

  1. Better documentation (13 votes)
  2. Make atomic updates really atomic (11 votes)
  3. Automatically redistribute documents across shards when more shards are added to a collection (11 votes)
  4. Make dashboard more interactive and configurable (9 votes)
  5. Make it easy to visualize Solr configuration (8 votes)
  6. Admin tool for testing relevance (6 votes)
  7. Add scripting capability (5 votes)
  8. A tool for analyzing search components (4 votes)
  9. An aggregator for Solr consultancy companies (4 votes)
  10. A directory of tools/libraries/frameworks that work with Solr (4 votes)
  11. Indexing and search of multilingual document (4 votes)
  12. Evolving Solr for Recommendation engine side. (3 votes)
  13. More complex example texts to allow examples beyond what is already provided. (3 votes)
  14. A list of all UpdateRequestProcessors (3 votes)
  15. Puppet/Chef configuration to automatically setup Solr under configuration management (3 votes)
  16. A troubleshooting Solr book/tutorial (3 votes)
  17. Solr Lint – a tool to check Solr configuration and detect issues (3 votes)
  18. User/group ACLs + Authentication on Solr’s API calls (2 votes)
  19. A paste service with Solr awareness (2 votes)
  20. Public instance of Solr with example configs (2 votes)
  21. Solr and Tika integration to index pdf/doc/odf files (2 votes)
  22. Prediction using SOLR documents (1 vote)
  23. More examples for Solr with Ajax (1 vote)
  24. Solr multi-book index (1 vote)
  25. Mailing list for Solr integrators (owners of 3rd party Solr clients) (1 vote)
  26. A Solr learning virtual machine (1 vote)
  27. Hello Solr example in all the 3rd party libraries (1 vote)
  28. Interactive builder for schema files (1 vote)
  29. Dictionaries for DictionaryCompoundWordTokenFilterFactory (1 vote)

The following five people’s suggestions received the most votes and – therefore – they won an electronic copy of my book Instant Apache Solr for Indexing Data How-to:

What happens next? I am about to move countries (Canada to Thailand). Once I am settled, I will start going through the suggestions one by one and documenting what resources are available and what is the best way to move the issues forward. I will create JIRA support requests where required.

If you have any additional information or would like to write in depth yourself on any of the topics above, please feel free to contact me and I will link to your post. My goal here is an improved Solr experience and many hands make light work. In the meantime, follow me on Twitter to get updates.

Solr Usability Contest – one week in

It has been just over a week since launching the Solr Usability Contest. It is doing well. There are 25 suggestions, more than 150 visits, and quite a number of votes.

The most popular suggestion so far is Better Documentation. This is both easy to predict and a bit sad. From my own experience, there is quite a bit of documentation about Solr on the web, but it might be a little hard to find. Some of the best stuff actually hides in videos and slideshows, so may not even be easily visible to Google. And then, there are books – also not indexable (see relevant suggestion). Some of those books are quite comprehensive. And, of course, for the advanced material, the source code is self-documenting on the most basic level, yet again not something Google can search effectively (use other search engines for that).

Still, you would think there would be an easier way to find good documentation, given that Solr is all about searching and finding things. Perhaps there is an opportunity to innovate in this space and use Solr to help people with finding information about Solr. Dogfooding and all that.

The contest runs for another 3 weeks, so there is enough time to add more suggestions, vote for the most interesting ones or – even – suggest solutions to the ones that already got a lot of attention. Please join in. And remember to log in if you are making a suggestion and want to win a copy of my book.

Announcing Solr Usability contest

In collaboration with Packt Publishing and to celebrate the release of my new book Instant Apache Solr for Indexing Data How-to, we are organizing a contest to collect Solr Usability ideas.

I have written about the reasons behind the book before and the contest builds on that idea. Basically, I feel that a lot of people are able to start with Solr and get a basic setup running, either directly or as part of other projects Solr is in. But then, they get stuck at a local maximum of their understanding and have difficulty moving forward because they don’t fully comprehend how their configuration actually works or which of the parameters can be tuned to get results. And the difficulty is even greater when the initial Solr configuration is generated automatically behind the scenes by an external system, such as Nutch, Drupal or SiteCore.

The contest will run for 4 weeks (until mid-August 2013) and the people suggesting the five ideas with the most votes will get free electronic copies of my book. Of course, if you want to get the book now, feel free. I’ll make sure you get rewarded in some other way, such as through advance access to upcoming Solr tools like SolrLint.

The results of the contest will be analyzed and fed into Solr improvement by better documentation, focused articles or feature requests on issue trackers. The end goal is not to give away a couple of books. There are much easier ways to do that. The goal is to improve Solr with specific focus on learning curve and easy adoption and integration.

How does the contest compare to other Solr resources?

  • Solr User mailing list – The mailing list is a fantastic resource. Core contributors hang around and the discussions range from philosophical to deeply esoteric. At the same time, it is a high-volume mailing list specializing in Solr itself. While beginners are very welcome, they are not the focus and it is easy to get overwhelmed. There is an expectation that somebody will at least try to read the available resources and integrate the knowledge on their own first.
  • Stack Overflow – Stack Overflow is designed for Questions and Answers on specific problems that will be useful to the wider community. And while beginners often have questions that other beginners will also have, those themes are hard to extrapolate by looking at individual SO questions. Additionally, the SO Solr community is pretty small and most of the answers are provided by only a handful of people.
  • Framework-specific communities (e.g. for SolrNet or for Project Blacklight) are great for questions on the specific framework and how Solr is expressed through those frameworks. But there is an expectation of general understanding of Solr and intermediate Solr questions are often sent to the Solr User mailing list. This, in some cases, can create a gap of support where particular types of questions cannot be answered by either community. Additionally, the themes that show up in one community might be similar to themes in another community, but there is no real common ground to discover those. I am hoping that members of different communities will meet at the contest page and that ideas will emerge that, in retrospect, turn out to be obvious barriers to Solr adoption.
  • Issue trackers – while anybody and everybody is encouraged to create issues for Solr or other related projects, these are often only used by advanced users and/or developers. There are etiquette and informal rules around such systems, which often come with a learning curve of their own. Beginners often don’t have real bugs, as much as a lack of understanding of how to put working features together for the best results.
  • Books, Wikis and online tutorials – All of these are fantastic resources for people to study on their own, but there are no real study-groups around them, so no communal learning/reinforcement happens.

So, if you have been using Solr for a while and the memories of times you got stuck are fresh in your mind, participate in the contest and make it easier for others to adopt Solr too. Oh, and follow me on Twitter to get the short status updates during the contest.

Updated:

  1. First progress update after one week (August 1st)

Setting up Apache Solr on Windows as a service

I had to set up Apache Solr 4 on Windows as a service using the Jetty container. The following is the documentation on how to do it. I am not saying that this is the best way to get it to work. But it is one way that works and seems to be more recent and more comprehensive than the other approaches I found.

Initial setup

  1. Ensure Java version 1.6 or higher is present on the machine. If not, download it from http://www.java.com/ (Java 7 is actually also known as 1.7).
  2. Download the latest Solr (here Solr 4.3.1) and unpack it somewhere.
  3. Create a directory where your Solr configuration and Jetty will live (e.g. C:\Services\SolrService, which we will refer to as <_SOLRDIR_>).
  4. If you don’t have a Solr configuration defined yet, you can just copy Solr’s own example\solr directory to <_SOLRDIR_>, such that solr.xml is directly inside the <_SOLRDIR_> directory.
  5. Under <_SOLRDIR_>, create a subdirectory called jetty.
  6. Copy the following directories and files into the jetty subdirectory from Solr’s example directory:
    • contexts
    • etc
    • lib
    • logs
    • resources (Solr 4.3+)
    • webapps
    • start.jar
  7. From the command line in the jetty directory, run the Solr server to make sure it works:
    1. java -Dsolr.solr.home=.. -jar start.jar
    2. Ensure that Solr is accessible at: http://localhost:8983/solr/
    3. Stop the server (Ctrl-C)

Service setup for availability

  1. Download Apache Commons’ Daemon package for Windows (make sure to get the Windows package that includes .exe files).
  2. Copy the platform-appropriate prunsrv.exe into the jetty directory and rename it to SolrService.exe – this is the actual Windows service executable.
  3. Copy prunmgr.exe into a new subdirectory jetty\serviceui and rename it also to SolrService.exe (to provide the default service name) – this is the UI that lets you correct some of the parameters of the service after it has been set up via the command line.
  4. Make sure that the computer’s environment variables have JAVA_HOME defined as a system variable, not just as a user variable (run sysdm.cpl as Administrator). The variable value should look something like C:\Progra~1\Java\jre7 . Otherwise, attempting to start the service will produce the following message in the daemon/service logs: Unable to find Java Runtime Environment.
  5. Run the service registration (with the renamed prunsrv) as the Administrator on the command prompt after replacing <_SOLRDIR_> with the actual path to the Solr installation as set up above. The command below should be all on a single line (with spaces instead of line breaks):
     
    SolrService.exe //IS//SolrService
    --DisplayName="Solr Service"
    --Install=<_SOLRDIR_>\jetty\SolrService.exe
    --LogPath=<_SOLRDIR_>\jetty\logs
    --LogLevel=Debug
    --StdOutput=auto
    --StdError=auto
    --StartMode=java
    --StopMode=java
    --Jvm=auto
    ++JvmOptions=-Djetty.home=<_SOLRDIR_>\jetty
    ++JvmOptions=-DSTOP.PORT=8087
    ++JvmOptions=-DSTOP.KEY=stopsolr
    ++JvmOptions=-Djetty.logs=<_SOLRDIR_>\jetty\logs
    ++JvmOptions=-Dorg.eclipse.jetty.util.log.SOURCE=true
    ++JvmOptions=-XX:MaxPermSize=128M
    --Classpath=<_SOLRDIR_>\jetty\start.jar
    --StartClass=org.eclipse.jetty.start.Main
    ++StartParams=OPTION=ALL
    ++StartParams=<_SOLRDIR_>\jetty\etc\jetty.xml
    --StopClass=org.eclipse.jetty.start.Main
    ++StopParams=--stop
    ++JvmOptions=-Dsolr.solr.home=<_SOLRDIR_>
    --StartPath=<_SOLRDIR_>\jetty
  6. Run the service:
    1. SolrService.exe start
    2. Check the log files in the logs subdirectory (usually commons-daemon.<date>.log) for messages.
    3. If there is an error message, run serviceui\SolrService.exe and try to correct possible problems. The account that runs the service may need to be switched to one with full access to the directories (e.g. Administrator).
    4. Run SolrService.exe stop to stop the service and try starting again. If the error log shows problems with stopping and the Services control panel shows the service as ‘starting’, see the troubleshooting section below on how to kill a run-away service.
    5. The success criterion here is being able to access Solr’s Web Admin UI from http://localhost:8983/solr (again).
  7. In the Services control panel, change the startup type of the service Solr Service from manual to automatic to ensure it survives a restart. Try stopping and starting the service from the Services control panel to test that it works. Use Solr’s Web Admin UI to verify that it has stopped/started correctly.
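
Since the registration command in step 5 has to go on a single line, one way to avoid hand-editing mistakes is to keep the multi-line listing in a text file or string and let a small script join it. The following is a minimal sketch, assuming Python is available on the machine; the function name to_single_line and the shortened EXCERPT are illustrative only, and a <_SOLRDIR_> path containing spaces would need extra quoting that this sketch does not attempt.

```python
def to_single_line(multiline_command: str, solr_dir: str) -> str:
    """Join a one-option-per-line listing into a single command line.

    Blank lines and surrounding whitespace are dropped, the remaining lines
    are joined with single spaces, and the <_SOLRDIR_> placeholder is
    replaced with the real directory.
    """
    parts = [line.strip() for line in multiline_command.splitlines() if line.strip()]
    return " ".join(parts).replace("<_SOLRDIR_>", solr_dir)


# Shortened excerpt of the listing from step 5, for illustration only;
# paste the full listing here when using this for real.
EXCERPT = r"""
    SolrService.exe //IS//SolrService
    --DisplayName="Solr Service"
    --Install=<_SOLRDIR_>\jetty\SolrService.exe
    --LogPath=<_SOLRDIR_>\jetty\logs
"""

if __name__ == "__main__":
    print(to_single_line(EXCERPT, r"C:\Services\SolrService"))
```

The printed line can then be pasted into an Administrator command prompt opened in the jetty directory.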

Updating Solr configuration

  1. An update usually involves changes to schema.xml and – much more rarely – to solrconfig.xml.
  2. Most of the time, the core can be reloaded from Solr’s Web Admin UI to account for the new configuration.
  3. If the changes are significant (e.g. new collection or some deep server settings), the service itself may need to be restarted.

Upgrading Solr

  1. Download and unpack the latest 4.x version of Solr.
  2. Stop the Solr service.
  3. Delete and re-copy the directories into the jetty subdirectory as per the initial-installation instructions above, with two exceptions:
    1. Keep the logs directory, which contains production logs.
    2. Delete the solr-webapp directory, which is an expanded version of the – now obsolete – webapps/solr.war. If this is not done, you will get a version mismatch between old and new content. The solr-webapp directory will be regenerated on the next Jetty restart.
  4. If you need to update the schema or config, do it now while the service is down.
  5. Restart the service.
  6. If you are running some persistent client, you may need to restart that as well.
  7. Test from the Web Admin UI that the service is running.
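
Step 3 above is easy to get wrong by hand, so here is a hedged sketch of it as a Python script. It is not part of the original procedure: the function name refresh_jetty is made up, the list of items mirrors the initial setup list minus logs, and it assumes the service has already been stopped. Items missing from the new Solr download are skipped rather than deleted.

```python
import shutil
from pathlib import Path

# Items copied from Solr's example directory during initial setup.
# "logs" is deliberately left out so production logs survive the upgrade.
COPIED_ITEMS = ["contexts", "etc", "lib", "resources", "webapps", "start.jar"]


def refresh_jetty(new_example: Path, jetty_dir: Path) -> None:
    """Replace the copied Jetty items with those from a newer Solr example directory."""
    for name in COPIED_ITEMS:
        source = new_example / name
        target = jetty_dir / name
        if not source.exists():
            continue  # nothing to copy, so leave any existing item alone
        # Delete the old copy first so no stale files are left behind.
        if target.is_dir():
            shutil.rmtree(target)
        elif target.exists():
            target.unlink()
        if source.is_dir():
            shutil.copytree(source, target)
        else:
            shutil.copy2(source, target)
    # Remove the expanded webapp; Jetty regenerates it on the next start.
    expanded = jetty_dir / "solr-webapp"
    if expanded.exists():
        shutil.rmtree(expanded)
```

A call would look like refresh_jetty(Path(r"D:\downloads\solr-4.x\example"), Path(r"C:\Services\SolrService\jetty")), where both paths are placeholders for your own locations.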

Troubleshooting service setup

If a service is misconfigured, the Java runtime is not found or something else goes wrong, you may get a run-away service that refuses to start and stop properly from the control panel. In that case, you may need to use command line tools (possibly as Administrator) to find and kill the stuck service:

  1. sc queryex
  2. taskkill /PID <PID> /F

Here, <PID> is the process ID of the stuck Solr service.
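
The two commands above can be tied together with a little parsing. The sketch below is an assumption-laden illustration rather than a tested tool: the helper names are made up, the sample text only imitates the general layout of sc queryex output (a "PID : <number>" line per service), and the service name SolrService is the one used in the registration command earlier. A PID of 0 is treated as "no running process".

```python
import re
from typing import List, Optional

# Matches a line such as "        PID                : 4321"
PID_LINE = re.compile(r"^[ \t]*PID[ \t]*:[ \t]*(\d+)[ \t]*$", re.MULTILINE)


def find_pid(sc_output: str) -> Optional[int]:
    """Return the first non-zero PID found in sc queryex style output, if any."""
    match = PID_LINE.search(sc_output)
    if match is None:
        return None
    pid = int(match.group(1))
    return pid if pid != 0 else None


def taskkill_command(pid: int) -> List[str]:
    """Build the forced taskkill command for the given process ID."""
    return ["taskkill", "/PID", str(pid), "/F"]


# Illustrative sample only; real output has more fields.
SAMPLE_OUTPUT = """\
SERVICE_NAME: SolrService
        TYPE               : 10  WIN32_OWN_PROCESS
        STATE              : 2  START_PENDING
        PID                : 4321
        FLAGS              :
"""

if __name__ == "__main__":
    pid = find_pid(SAMPLE_OUTPUT)
    if pid is not None:
        print(" ".join(taskkill_command(pid)))
```

On a real machine, the text would come from running sc queryex SolrService (for example via subprocess.run with capture_output=True) instead of the sample string.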


If you enjoyed this article, you may also like my Solr book: Instant Apache Solr for Indexing Data How-to.

You may also benefit from other information resources available at solr-start.com.

 

> From inner thoughts to outer limits of Alexandre Rafalovitch