On the uselessness of pretending to be somebody else

January 25, 2008

While reading the Weka Data Mining book, I came across an impressive example of using machine learning to confirm a person’s authorship (p. 358).

In the 19th century, there lived a famous rabbinic scholar, Ben Ish Chai, whose writings included two collections of letters. Ben Ish Chai claimed that only one collection was his and that the other was somebody else’s, which he had merely found. Modern scholars thought both collections were his, but could not prove it conclusively because the writing styles were different.

Machine Learning to the rescue! In 2004, Moshe Koppel and Jonathan Schler discovered that it may help to look not at the differences in writing style (as the style may have been faked), but rather at how deep those differences run. For example, an author could fake a stylistic mismatch by consciously avoiding favorite words, but would still write long, run-on sentences, use more passive verb forms or display many other measurable behaviours.

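So, if the most obvious differences are removed one by one, the speed at which the remaining features start to look identical can be a good indicator: texts by the same author become indistinguishable quickly, while texts by different authors stay distinguishable much longer. They called this technique unmasking, and the mystery of Ben Ish Chai was solved for good.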
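
To make the idea a bit more concrete, here is a minimal sketch in Python of how such a "how fast do the differences disappear" curve could be computed. It is not Koppel and Schler's implementation. I am assuming plain word frequencies as features and a simple nearest-centroid classifier as a stand-in for a real learner, and every name in it (unmasking_curve, drop_per_round and so on) is made up for illustration.

```python
from collections import Counter


def word_freqs(chunk, vocab):
    """Relative frequency of each vocabulary word in one chunk of text."""
    words = chunk.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in vocab]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def sq_dist(u, v, active):
    """Squared distance between u and v over the still-active features."""
    return sum((u[i] - v[i]) ** 2 for i in active)


def loo_accuracy(a_vecs, b_vecs, active):
    """Leave-one-out nearest-centroid accuracy (assumed stand-in classifier).

    A tie between the two centroids counts as half a hit, so texts that
    cannot be told apart score 0.5, i.e. chance level.
    """
    score = 0.0
    for own, other in ((a_vecs, b_vecs), (b_vecs, a_vecs)):
        for k, vec in enumerate(own):
            rest = own[:k] + own[k + 1:]  # hold the current chunk out
            d_own = sq_dist(vec, centroid(rest), active)
            d_other = sq_dist(vec, centroid(other), active)
            if d_own < d_other:
                score += 1.0
            elif d_own == d_other:
                score += 0.5
    return score / (len(a_vecs) + len(b_vecs))


def unmasking_curve(chunks_a, chunks_b, n_features=50, drop_per_round=3, rounds=10):
    """Accuracy after repeatedly removing the most telling features.

    Each collection needs at least two chunks. A curve that quickly falls
    towards 0.5 hints at a single author.
    """
    all_words = Counter(w for c in chunks_a + chunks_b for w in c.lower().split())
    vocab = [w for w, _ in all_words.most_common(n_features)]
    a_vecs = [word_freqs(c, vocab) for c in chunks_a]
    b_vecs = [word_freqs(c, vocab) for c in chunks_b]

    active = set(range(len(vocab)))
    curve = []
    for _ in range(rounds):
        if not active:
            break
        curve.append(loo_accuracy(a_vecs, b_vecs, active))
        # Drop the features whose class means differ the most.
        ca, cb = centroid(a_vecs), centroid(b_vecs)
        ranked = sorted(active, key=lambda i: abs(ca[i] - cb[i]), reverse=True)
        active -= set(ranked[:drop_per_round])
    return curve


if __name__ == "__main__":
    # Toy data: the two "collections" differ only in one favorite word each.
    a = ["alpha x y", "alpha x y"]
    b = ["beta x y", "beta x y"]
    print(unmasking_curve(a, b, n_features=4, drop_per_round=2, rounds=2))
    # prints [1.0, 0.5]
```

In the toy run, the two collections are perfectly separable at first, but as soon as the two telltale words are removed the accuracy drops to chance. That sudden collapse is the pattern one would expect when the same person wrote both texts and the differences were only skin-deep.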

I think what impressed me here was not the clever math. The whole field of determining authorship is based on clever math. It is rather the fact that the math was looking at hints within the hints of the language - the invisible aspects that become noticeable only after the eye learns to see beyond what the most obvious reality offers. I cannot explain it better, but to me it has a special elegance that just counting the words and sentence lengths does not offer.