Authorship Attribution Data

Last week I participated in the Authorship Attribution Workshop at Brooklyn Law School, funded by NSF and hosted by Larry Solan. It was great to meet people whose work I’ve followed and with whom I’ve communicated only through email (Moshe Koppel, Carl Vogel, Statis Stamatatos) and to see again people I’ve met at conferences or other venues over the years (Shlomo Argamon, Patrick Juola, Steve Feinberg, Bob Carpenter, Ed Cheng, Tim Grant, Malcolm Coulthard) as well as new acquaintances (Nuria Gavalda, Ed Finegan, Jay Koehler) and old friends (Ron Butters, Michael Saks, Michael Risinger). It was also great to see, in the audience, Tammy Gales (who brought along students) and her colleague Benji Wald, Janet Randall and Ben Zimmer. I wish that Robert Leonard, Roger Shuy and especially Gerald McMenamin had attended. McMenamin’s publications in forensic stylistics have been so important in explaining the stylistics method. Even though I disagree with stylistics as a forensic technique, Jerry’s work has been really important in getting these issues into scholarly debate. It was pretty amazing when Malcolm Coulthard, a well-known proponent of forensic stylistics, said in his presentation that he thought the future lies in the computational approach.

At the end of the workshop we had a discussion about where the field should go. One point that we all agreed on was the value of sharing datasets. I have made the Chaski Writer Sample database available to other researchers under a confidentiality agreement. My dataset was collected under a confidentiality agreement, so it is not something I can just post on a website for download, but I can make it available for research. Moshe Koppel et al. have made their Blogger Database available. Since the Koppel dataset was scraped from the web, there is perhaps no issue of human subject research or protecting privacy, and the data can be downloaded.

The basic idea is that we need ground truth data that we can use for developing and testing our methods. If you have datasets that you can share, please let me know so that we can arrange the best way to protect human subjects while also enabling research to move forward.

Finally, to quote from one of my slides, I think a goal for the field should be a set of complementary computational methods with known data requirements, so that we have reliable methods for lots of different circumstances that arise in actual cases.