I am reading Tony McEnery and Andrew Hardie’s new textbook, Corpus Linguistics, published by Cambridge University Press, as I’ll be reviewing it for a journal.

I am so impressed with one of their chapters I just have to mention it now. McEnery and Hardie address the issue of ethics in collecting data from the Web, and even mention forensic linguistics.

Cheers for these scholars for including this issue!

I’d like to add a bit to their discussion. In one case where I was asked to review a forensic linguistics expert’s report, the expert had used the Web to determine how frequent or rare a phrase was. This method is tricky because:

(1) web postings change and cannot always be found after the data is collected, so it is important to list the date and time of any search;

(2) most of what is retrieved is not relevant to what is searched for, especially when exact phrases are the keywords.

But in the particular case report I was reviewing, to my astonishment, the expert had slightly changed the phrase from the questioned document to another variation. When I tested the actual phrase from the questioned document, it garnered thousands more hits than the concocted phrase the expert had used. Now of course, my result had to be treated just as gingerly as the expert’s variant-phrase results as an indication of real frequency. But it was the difference, rather than the actual number, that mattered, because it showed that fudging the data had produced results different from what the actual phrase got for hits on the Web.
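For readers who keep their search logs in code, here is a minimal sketch of the record-keeping I have in mind: each search is stored with its date and time, the phrase exactly as it appears in the questioned document, and the phrase actually searched, so that any alteration is visible. The class and field names (`WebSearchRecord`, `phrase_altered`, `hit_difference`) are my own illustrative choices, not anything from McEnery and Hardie, and the hit counts are assumed to be copied by hand from whatever search engine was used.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class WebSearchRecord:
    """One web search, logged so it can be reported and re-checked later."""
    questioned_phrase: str  # phrase exactly as it appears in the questioned document
    searched_phrase: str    # phrase actually entered into the search engine
    engine: str             # name of the search engine used
    hit_count: int          # hit count reported by the engine, copied by hand
    # Date and time of the search (UTC), filled in when the record is created.
    searched_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    @property
    def phrase_altered(self) -> bool:
        # True when the searched phrase is not the questioned phrase verbatim.
        return self.searched_phrase != self.questioned_phrase


def hit_difference(a: WebSearchRecord, b: WebSearchRecord) -> int:
    """Difference in reported hits between two logged searches (a minus b)."""
    return a.hit_count - b.hit_count
```

A record whose `phrase_altered` is true is exactly the situation described above, and the report should say so openly. The difference between two records is only a rough signal, since both counts are unreliable as measures of real frequency.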

So the ethical issues that arise in collecting data from the Web include all of those discussed by McEnery and Hardie, and also ones that relate more to scientific methodology. Sleight of hand, shift of phrase, fudging the data: this kind of behavior is never warranted in any scientific work, and, as McEnery and Hardie point out, especially not in forensic linguistics, where a person’s life could be on the line.

Just as I am always disturbed when I see experts putting forth analyses that fall apart when one actually looks at the data, I am always delighted when I see linguists of such intelligence and competence as McEnery and Hardie putting forth ethical standards that a field can be proud of. Hooray for Corpus Linguistics and its authors!