Spock announces an Entity Resolution competition

April 16, 2007

The Netflix Prize must be doing well, because other companies are now willing to tap into the web’s researchers and their deep knowledge of computational techniques.

The latest is Spock, a company so new that you have to read third-party sites to figure out exactly what it does. Even to try the service, you have to sign up for an account.

Basically, Spock is about search with a strong focus on mentions of people. Similar to clustering search engines that distinguish Jaguar the car from Jaguar the animal, Spock will distinguish Eric Schmidt of (currently) Google from all the other Eric Schmidts of the world. It is a bit like the previously described FreshNotes/Knover, but with a single strong focus.

I am sure Spock has strong scientists on board, but they are obviously not above leveraging the community as much as possible.

They have just announced a (nearly) worldwide competition (via O’Reilly Radar), in which they provide a corpus of web documents and the author of the best entity resolution algorithm wins $50,000 (US). Now, 50 thousand is not quite Netflix’s cool million dollars, but it is nothing to sneer at either.

The website is currently very bare and basically contains one self-contained download file. The file holds the instructions, the training and testing corpora, and the ground-truth results for the training corpus. There is even a Python evaluation script. A big warning: the download is over a gigabyte, and it unpacks into 97,000 files that take up just over 9 gigabytes of space. I hope the company will provide a smaller file for all those who want to look the data over before fully committing to working with it.

I am also quite curious about the data itself. The files in the archive are basically HTML scraped from the web with seemingly no post-processing. In fact, there are image links still pointing to the original websites. I wonder how long it will take for one of those companies to notice that they have a number of referrer fields pointing at local files with names like SCI.1.10317256.html. It also feels a little silly to make every contestant reinvent the HTML cleanup step. I would have at least run all the files through a tool like HTMLTidy, removing extra whitespace and obviously irrelevant markup in the process.
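To give a feel for the cleanup step every contestant would have to write, here is a minimal sketch using only Python's standard library. It is not what Spock or HTMLTidy does; the names TextExtractor and html_to_text are made up for this illustration, and it only drops tags, scripts and styles and collapses whitespace.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, skipping script/style bodies."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a <script> or <style> element
        self._chunks = []      # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Join the chunks and collapse every run of whitespace to one space.
        return re.sub(r"\s+", " ", " ".join(self._chunks)).strip()


def html_to_text(html):
    """Return the visible text of an HTML string with whitespace normalised."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Calling html_to_text on the contents of a file such as SCI.1.10317256.html would give plain text with the image links and markup gone. Real scraped pages are messier than this handles (navigation menus, boilerplate footers, broken encodings), so treat it as a starting point only.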

From a technical point of view, the requirements are straightforward: parse multiple documents and extract the coreference chains. The output is a file in which each line lists space-separated file names that are supposed to refer to the same entity. Evaluation uses an F-measure with precision favoured three times as much as recall.
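The output format and the scoring can be sketched in a few lines of Python. Two assumptions here are mine, not the competition's: that "precision favoured three times as much as recall" means the standard F-beta measure with beta = 1/3, and that precision and recall are counted over pairs of files placed in the same cluster. The bundled evaluation script is the authority on both points, and all function names below are hypothetical.

```python
from itertools import combinations


def format_clusters(clusters):
    """One line per entity: space-separated file names (sorted for stability)."""
    return "\n".join(" ".join(sorted(cluster)) for cluster in clusters)


def same_entity_pairs(clusters):
    """All unordered pairs of files that share a cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(combinations(sorted(cluster), 2))
    return pairs


def f_beta(precision, recall, beta=1 / 3):
    """Standard F-beta; beta < 1 weights precision more heavily than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def score(predicted, truth, beta=1 / 3):
    """Pairwise precision, recall and F-beta (an assumed scoring scheme)."""
    pred_pairs = same_entity_pairs(predicted)
    true_pairs = same_entity_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(true_pairs) if true_pairs else 0.0
    return precision, recall, f_beta(precision, recall, beta)
```

With beta below 1, wrongly merging two different people costs more than failing to merge two mentions of the same person, so a cautious algorithm that under-clusters should do better than an aggressive one under this kind of measure.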

It is good to see computational linguistics techniques being used by real (or at least funding-worthy) companies. Seeing (nearly) the whole world invited to improve the algorithms is in some ways even cooler. It certainly puts a new spin on the term User-Generated Content.