Spock announces an Entity Resolution competition

The Netflix Prize must be doing well, as other companies are now willing to tap into the web’s researchers with deep knowledge of computational techniques.

The latest is Spock, a company so new that you have to read third-party sites to figure out exactly what it does. Even to use it, one has to sign up for an account.

Basically, Spock is a search engine with a strong focus on mentions of people. Similar to clustering search engines that distinguish Jaguar the car from Jaguar the animal, Spock will distinguish Eric Schmidt of (currently) Google from all the other Eric Schmidts of the world. It is a bit like the previously described FreshNotes/Knover, but with a single strong focus.

I am sure Spock has strong scientists on board, but they are obviously not above trying to leverage the community as much as possible.

They have just announced a (nearly) worldwide competition (via O’Reilly Radar), in which they provide a corpus of web documents and the author of the best entity resolution algorithm wins $50,000 (US). Now, 50 thousand is not quite Netflix’s cool million dollars, but it is nothing to sneer at either.

The website is currently very bare and basically contains one self-contained download file. The file holds instructions, the training and testing corpora, and ground-truth results for the training corpus. There is even a Python evaluation script. A big warning: the download file is over a gigabyte, and it unpacks into 97,000 files that take just over 9 gigabytes of space. I hope the company will provide a smaller file for all those who want to check the data over before fully committing to working with it.

I am also quite curious about the data itself. The files contained in the archive are basically HTML scraped from the web with seemingly no post-processing. In fact, there are image links still pointing to the original websites. I wonder how long it will take for one of those companies to notice that they have a number of referrer fields pointing at local files with names like SCI.1.10317256.html. It also feels a little silly to make every contestant reinvent the HTML cleanup step. I would have at least run all the files through a tool like HTMLTidy, removing extra whitespace and obviously irrelevant markup in the process.
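To give an idea of the kind of cleanup every contestant will end up writing, here is a minimal sketch using only the Python standard library. It is not HTMLTidy and not anything Spock provides; the names TextExtractor and html_to_text are my own, and the choice to drop script and style content is an assumption about what counts as non-relevant markup.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring tags and script/style content."""

    SKIP = {"script", "style"}  # assumption: these never carry useful text

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Tags such as <img> are dropped, so links to the original
        # sites never reach the output.
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text of an HTML page with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words in adjacent elements apart,
    # at the cost of splitting a word that is broken up by inline tags.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()
```

Calling html_to_text on the contents of one of the scraped files would give a single line of plain text to feed into whatever comes next. Real pages are messier than this handles (broken markup, navigation menus, odd encodings), which is exactly why a shared cleanup pass would have been welcome.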

From a technical point of view, the requirements are straightforward: parse multiple documents and extract the coreference chains across them. The output is a file in which each line lists the space-separated names of the files that are supposed to refer to the same entity. Evaluation is an F-measure with precision favoured 3 times as much as recall.
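To make the scoring concrete, here is a rough sketch of how such an output file could be evaluated. This is not Spock's evaluation script. I am assuming that precision and recall are counted over pairs of files placed in the same cluster, and that "precision favoured 3 times" corresponds to the standard F-beta formula with beta = 1/3; the official script may well define both differently. All function names are my own.

```python
from itertools import combinations


def parse_clusters(text):
    """Each non-empty line lists space-separated file names for one entity."""
    return [line.split() for line in text.splitlines() if line.strip()]


def coreferent_pairs(clusters):
    """All unordered pairs of files that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def f_beta(precision, recall, beta):
    """Standard F-beta; beta < 1 weights precision more than recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def pairwise_score(predicted_text, truth_text, beta=1 / 3):
    """Return (precision, recall, F) under the pairwise assumption."""
    predicted = coreferent_pairs(parse_clusters(predicted_text))
    truth = coreferent_pairs(parse_clusters(truth_text))
    correct = len(predicted & truth)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(truth) if truth else 0.0
    return precision, recall, f_beta(precision, recall, beta)
```

Under these assumptions, a system with precision 1.0 and recall 0.5 scores noticeably higher than one with precision 0.5 and recall 1.0, so wrongly merging two different people costs more than failing to link two pages about the same person.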

It is good to see computational linguistics techniques being used by real (or at least funding-worthy) companies. Seeing (nearly) the whole world invited to improve the algorithms is in some ways even cooler. It certainly puts a new spin on the term User-Generated Content.

4 thoughts on “Spock announces an Entity Resolution competition”

  1. Thanks for telling us what’s in the download file, without having to download it! A group of European researchers from Leeds and Trento have launched CLEANEVAL, a contest to build a tidy tool for web-as-corpus research, see http://cleaneval.sigwac.org.uk/ ; this could be a useful first step for anyone trying the Spock challenge

  2. Eric,
    Glad you find this useful.

    CLEANEVAL certainly looks like something to keep in mind. In fact, one might as well enter the CLEANEVAL competition if they want to do the Spock one, as the HTML cleanup would have to be done one way or another.

    This may even be something useful to mention to Spock directly (in the forums).

  3. I quickly read their rules, and this part is interesting:

    “Spock may use all residuals it obtains from any Software Submissions for any purpose and in any manner it desires. The term “residuals” means information in intangible form, which is retained in memory by persons who have had access to the Software Submissions, including ideas, concepts, know-how, or techniques contained in the Software Submissions. Spock shall not have any obligation to pay royalties or otherwise for any work resulting from the use of residuals.”

    I wonder what happens if they try to get a patent based on one of the algorithms submitted.
