Solr 5 puzzle: Magic date – answer

This is the answer and explanation to the Solr puzzle on what happens during indexing, using a date as an example. In this blog post, we will dig into the complex and fascinating details of what those three simple commands cause behind the scenes.

So, let’s start by getting the server up and running. This example is based on Solr 5.5, though it should work the same way in 6.0.

$ bin/solr start

This starts the server. As we are creating our own core, we are just starting a blank server with home in server/solr. If you are not clear where different start commands put Solr’s home and core directories, I have written about that before.

Now, for the first command:

$ bin/solr create_core -c puzzle_date

This should create a core. But with what configuration? Or will it crash and burn because we did not provide one, as in answer option 1? Let’s look at the documentation:

$ bin/solr create_core -help

Usage: solr create_core [-c core] [-d confdir] [-p port]

  -c <core>     Name of core to create

  -d <confdir>  Configuration directory to copy when creating the new core, built-in options are:

      basic_configs: Minimal Solr configuration
      data_driven_schema_configs: Managed schema with field-guessing support enabled
      sample_techproducts_configs: Example configuration with many optional features enabled to demonstrate the full power of Solr

If not specified, default is: data_driven_schema_configs

So, answer option 1 is incorrect and our new core gets data_driven_schema_configs as a baseline. That configuration is ‘schemaless’: fields that are not already defined in the schema will be auto-defined. You can check the definitions in your own distribution or, for those away from their Solr setup, there is always the source on GitHub.

Now, let’s do the indexing:

bin/post -c puzzle_date -type text/csv -d $'today\n2016-04-08'

To unpack that: we are indexing content in CSV format, and that content – provided inline – consists of only one record with a single field today, which has the value 2016-04-08.
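To see exactly what a CSV parser receives from that inline payload, here is a minimal Python sketch. Python is used purely for illustration; Solr does its own CSV handling, and the variable names are mine.

```python
import csv
import io

# The string that bash's $'today\n2016-04-08' expands to:
# a header line, a newline, and one data line.
payload = "today\n2016-04-08"

# DictReader treats the first line as the field names.
records = list(csv.DictReader(io.StringIO(payload)))

print(records)
```

One header, one row: a single record whose only field is today.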

Should this complain due to a missing uniqueKey (answer option 2) or due to a bad date format (answer option 3)? It would in a static schema. But ours is – as we now know – a managed ‘schemaless’ one. So, it all depends on how the recognition patterns are configured. For our example, that is in the UpdateRequestProcessor chain add-unknown-fields-to-the-schema, starting on line 1316 of the relevant solrconfig.xml.

We can see that the very first step in the URP chain is UUIDUpdateProcessorFactory, which will generate an ID for us if one is not provided. So, answer option 2 is not correct and we will be able to proceed even without an explicit ID.
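Conceptually, that first step behaves something like the sketch below. This is not Solr’s code (which is Java); ensure_id and its parameter are hypothetical names, and the only behaviour assumed is the one described above: add a generated ID when none is supplied.

```python
import uuid

def ensure_id(doc, id_field="id"):
    """Return a copy of doc, adding a random UUID under id_field if it is missing."""
    with_id = dict(doc)
    if id_field not in with_id:
        with_id[id_field] = str(uuid.uuid4())
    return with_id

doc = ensure_id({"today": "2016-04-08"})
print(doc)
```

A document that already carries an id passes through unchanged, which matters later when the same record is posted twice without one.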

Onto the date. Date parsing is done by the last of the four explicit parsers, which look for booleans, longs, doubles, and dates in that order. All of the parsers can take parameters (including complex field selection criteria inherited from the parent class), but only the date parser requires them explicitly. So, we have a long list of Java date formats we can recognize. 2016-04-08 would match yyyy-MM-dd on line 1348. So, answer option 3 is also incorrect and we will have successfully indexed our record with the multiValued date field created for today. And in fact, doing a *:* query will return us:

  "today": [
    "2016-04-08T00:00:00Z"
  ],
  "id": "846af0eb-f2f5-43d8-8774-eba65f270e43",
  "_version_": 1532175408279060500

(If you did not get this, check the single quotes in your indexing command. Sometimes they get mangled into smart quotes and make commands fail in mysterious ways.)
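The parser order described above (booleans, longs, doubles, then dates) can be sketched as a chain of attempts. This is a simplified illustration, not the Solr implementation: guess_value is a hypothetical name, the regular expressions are my own simplification, and the date list holds only a few example patterns in Python syntax, whereas the real configuration lists many more in Java syntax.

```python
import re
from datetime import datetime

# Illustrative patterns only; "%Y-%m-%d" plays the role of yyyy-MM-dd.
DATE_FORMATS = ["%Y-%m-%dT%H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d"]

def guess_value(text):
    """Try boolean, long, double, then date; fall back to the raw string."""
    if text.lower() in ("true", "false"):
        return text.lower() == "true"
    if re.fullmatch(r"[+-]?\d+", text):
        return int(text)
    if re.fullmatch(r"[+-]?\d*\.\d+", text):
        return float(text)
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # this pattern did not match, try the next one
    return text

parsed = guess_value("2016-04-08")
print(type(parsed).__name__, parsed)
```

The point to notice is that the value stops being a string as soon as one of the parsers accepts it.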

So, now for the curveball: a query for the value Fri, which does not appear anywhere in either our submitted value or the displayed parsed value.

curl http://localhost:8983/solr/puzzle_date/select?q=Fri

Do we have no result (answer option 4) or get the record back against all odds (answer option 5)?

PAUSE a bit here if you haven’t figured it out already. We have already discarded 3 out of 5 options. Now, faced with a binary choice, can you figure out the answer and – more importantly – WHY? If you cannot, run the actual commands above and try to figure it out from the information available in the schema, solrconfig.xml, and various Admin UI screens.

And when you are sure, read the rest of the explanation.

Solr 5 puzzle: Magic date – answer (part 2)

(This is the final part of the explanation for the Solr puzzle that brings together schemaless mode, dates and other automagical parts of Solr. See the puzzle post for the setup and the first part of the answer for explanations of why 3 out of 5 answer choices were not valid.)

By now, we are facing a basic binary choice. In summary, given a schemaless configuration, we indexed a new field today with a value 2016-04-08, which got parsed as a date 2016-04-08T00:00:00Z. We now want to search with the query Fri and see whether or not that record matches.

There are two ways to look at it, forward and backward. We will look at this backward from the search:

curl http://localhost:8983/solr/puzzle_date/select?q=Fri

What do we actually search here? It is not a field today, which did not exist at the start of the puzzle. Instead, we are searching a default field.

What’s our default field? Well, it is not defined explicitly in our query with a df parameter, so it must be in the definition of the /select handler, which starts at line 773 in the Solr 5.5 distribution. Except it is not there either, but in the initParams section on line 853. Finally, we find out that we are searching the field _text_. We could also have discovered this by using echoParams=all in our query, such as (escaping the & symbols for Unix command lines):

curl http://localhost:8983/solr/puzzle_date/select?q=Fri\&echoParams=all\&indent=true
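If the shell escaping gets in the way, the same URL can be assembled programmatically. A minimal Python sketch, assuming the local address and core name used in this post (select_url is a hypothetical helper):

```python
from urllib.parse import urlencode

# Assumed local Solr address and core name, taken from the commands in this post.
BASE = "http://localhost:8983/solr/puzzle_date/select"

def select_url(**params):
    """Build a /select URL; urlencode adds the & separators and any escaping."""
    return BASE + "?" + urlencode(params)

url = select_url(q="Fri", echoParams="all", indent="true")
print(url)
```

Against a running server, fetching that URL (for example with urllib.request.urlopen) should return a response whose echoed parameters include the df value.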

So, why did we not see the _text_ field and what is its content? The first one is a bit easier to answer. If you look in the original managed-schema file on line 123 (line ~401 in your own rewritten post-today schema file), you can see that the field is defined as multiValued, indexed, and NOT stored. Since it is not stored, it does not display when we run a query, but it is indexed. And right on the next line, 124, we can see the copyField statement that copies the content of ALL the fields to _text_. This would include the field today after it gets created.

You can also discover the same field definition and copyField information in the Schema Browser screen of the Admin UI. More importantly, on the same screen you can Load Term Info for the field and see what indexed tokens it contains. Even if the field is not stored, we can look at its content without resorting to something like Luke.

Puzzle Date - Schema - screenshot

And suddenly, we can see the term Fri showing up in the _text_ field. Which means we can search for it and our single record will show up. That makes answer choice 5 the correct one.

But where did that token come from? Well, 8 April 2016 happens to be a Friday, but that does not explain how it gets there. Let’s make this a tiny bit simpler by making this invisible content easier to see. We could do it by redefining _text_ as stored and rerunning the indexing, but instead we will just add another copyField instruction.

For this, we need to switch to the new Admin UI, which provides new buttons to control managed schemas in its version of the Schema (browser) screen. Let’s add a copyField from all fields to a field text_ss. (If you don’t know why this would work, I am leaving it as homework 🙂 ).

Puzzle Date - Schema - copyField
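For those who prefer not to click through the UI, the same change can probably be made through the Schema API, which accepts an add-copy-field command on a managed schema. The sketch below is an assumption-laden alternative, not what this post actually did: the URL and core name come from the commands above, the helper names are hypothetical, and I have not verified it against 5.5.

```python
import json
import urllib.request

# Assumed local Solr address and core name, as used elsewhere in this post.
SCHEMA_URL = "http://localhost:8983/solr/puzzle_date/schema"

def copy_field_command(source, dest):
    """Build the JSON body for a Schema API add-copy-field command."""
    return json.dumps({"add-copy-field": {"source": source, "dest": dest}})

def send_schema_command(body, url=SCHEMA_URL):
    """POST the command to Solr. Needs a running server, so it is not called here."""
    request = urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))

body = copy_field_command("*", "text_ss")
print(body)
```

Calling send_schema_command(body) with the server up would submit the command; either way, a reindex is still needed afterwards.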

Now, we need to reindex, so let’s rerun the indexing command:

bin/post -c puzzle_date -type text/csv -d $'today\n2016-04-08'

Now, if you rerun the search, one of the two records should now include the field text_ss, with several values, including Fri Apr 08 00:00:00 UTC 2016. (Are you surprised about having two records? If so, reread the first part of the answer even more carefully.)

So, we know where the Fri came from in the _text_ field and therefore why we get the record match during the search. But now we have THREE different date formats: what we indexed, what we displayed, and what we actually searched. Why?

For that, we have to go back to the schemaless mode magic and understand it at the level below one big black-box UpdateRequestProcessor chain.

Let’s look again, starting from line 1316 of the solrconfig.xml. The date value mapping actually happens in two steps. First, we detect and parse the text value as a date with the ParseDateFieldUpdateProcessorFactory on lines 1330-1350. That mutates the text value into a java.util.Date, as per the parser’s documentation. At this point, we have lost our original string representation and are actually carrying a different object type around.

The next step is AddSchemaFieldsUpdateProcessorFactory, which looks at those object types and maps them to field types. Specifically, on lines 1357-1360, we map all new fields carrying an object of type java.util.Date to the type tdates, which is configured as TrieDateField. TrieDateField knows what to do with a java.util.Date, so we are all done.
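The class-to-type lookup can be pictured as an ordered list of mappings with a default. In this Python sketch only the date-to-tdates entry comes from the text above; the other entries and the default are my recollection of the stock configuration and should be treated as assumptions to check against your own solrconfig.xml. Python classes stand in for the Java ones.

```python
from datetime import datetime

# Only datetime -> "tdates" is taken from the text above. The remaining
# entries and the default are assumptions to verify in solrconfig.xml.
TYPE_MAPPINGS = [
    (bool, "booleans"),   # listed before int because bool is a subclass of int
    (datetime, "tdates"),
    (int, "tlongs"),
    (float, "tdoubles"),
]
DEFAULT_FIELD_TYPE = "strings"

def field_type_for(value):
    """Return the field type of the first mapping whose class matches the value."""
    for value_class, field_type in TYPE_MAPPINGS:
        if isinstance(value, value_class):
            return field_type
    return DEFAULT_FIELD_TYPE

print(field_type_for(datetime(2016, 4, 8)))
```

The decision is driven by the object’s class, not by the text we originally sent.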

Except, we don’t care what TrieDateField is doing in our curveball. We are looking at the copyField instruction, which takes the final object representation and copies it for processing according to its target definition, which – for us – is the _text_ field of type text_general. That is most definitely not a type that knows what to do with anything but plain strings as input.

And so, somewhere within Solr’s bowels, a non-string java.util.Date object gets serialized back into a string so it can be copied into a text field and then tokenized, lowercased, and so on as per the type definition. And that serialization, obviously, uses a format different from the one used to display the proper date field in a Solr query response. Which is how we get tokens that were neither in the input nor in the output, yet still affect the search.
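The three representations can be put side by side in a small Python sketch. It only mimics the layouts seen in this post: as_copied_text reproduces the string observed in text_ss, and rough_tokens is a crude stand-in for the text_general analysis (split on non-alphanumerics, lowercase), not the real tokenizer. All function names are hypothetical.

```python
import re
from datetime import datetime, timezone

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

def as_indexed_input(d):
    """The form we sent in the CSV."""
    return d.strftime("%Y-%m-%d")

def as_displayed(d):
    """The form the date field shows in query results."""
    return d.strftime("%Y-%m-%dT%H:%M:%SZ")

def as_copied_text(d):
    """Mimic the layout observed in text_ss, e.g. 'Fri Apr 08 00:00:00 UTC 2016'."""
    return "{} {} {:02d} {:02d}:{:02d}:{:02d} UTC {}".format(
        DAYS[d.weekday()], MONTHS[d.month - 1], d.day,
        d.hour, d.minute, d.second, d.year)

def rough_tokens(text):
    """Crude approximation of analysis: split on non-alphanumerics, lowercase."""
    return [t.lower() for t in re.split(r"[^0-9A-Za-z]+", text) if t]

today = datetime(2016, 4, 8, tzinfo=timezone.utc)
copied = as_copied_text(today)
print(as_indexed_input(today), as_displayed(today), copied, sep="\n")
print(rough_tokens(copied))
```

Only the third form contains a day name, so only the copy in _text_ can produce a fri token for the (also lowercased) query Fri to match.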

And we are not even going to talk about how this impacts the relevancy!


If you liked this Solr puzzle and learned anything from it, please share the original question link via twitter, email, or any other means.  You can also share the answer post’s first half, but please do not share this (second half of the answer) post.

Solr 5 puzzle: Magic Date

From the SolrStart newsletter issue #30:

Given the following sequence of commands:

1. bin/solr create_core -c puzzle_date
2. bin/post -c puzzle_date -type text/csv -d $'today\n2016-04-08'
3. curl http://localhost:8983/solr/puzzle_date/select?q=Fri

Would the result be:

  1. Error in the command 1 for not providing a configuration directory
  2. Error in the command 2 for missing a uniqueKey field
  3. Error in the command 2 due to an incorrect date format
  4. No records in the command 3 output
  5. One record in the command 3 output

For answer and detailed explanation, subscribe to the SolrStart newsletter.

> From inner thoughts to outer limits of Alexandre Rafalovitch