Monday, October 10, 2005

Lies, damn lies and text mining statistics

LA Times writer Brendan Buhler took TV newswriters to task Sunday for their overuse of the phrase "get a handle". OK. I hadn't noticed such a growing phenomenon, but no matter.

He proved his claim by running a LexisNexis search on the phrase over five years and said its use is rising every year. ("It was in 3,504 stories in 2004, nearly 700 more than 2000.")

To me, this is manufacturing a truth where none exists through a quick-and-dirty use of text mining. I have two concerns:

1) What was the context of these references? I searched "get a handle" in Factiva and found several mentions of that phrase in an oft-cited direct quote ("Firefighters were able to get a handle on this early on," said Capt. Jason Neuman of the California Department of Forestry and Fire Protection.) Does that make the phrase more common, or is it just a function of the phrase being replicated through the distribution of AP wire copy?

2) Did Mr. Buhler account for any changes in the universe of publications and/or documents over that time period? The number of mentions in one year versus another needs to be compared to the total documents in each year. When I ran the "get a handle" search in Factiva's top 50 U.S. Newspapers (a more controlled group) and then compared it to all documents each year in that group, I found the rate of mentions of the phrase rather flat year on year.
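
The second point is simple arithmetic, and a minimal Python sketch may make it concrete. The figures below are invented purely for illustration; they are not the actual LexisNexis or Factiva counts, and the function name `mentions_per_10k` is my own. The idea is only that a raw count can climb by 700 while the rate per document stays flat, if the pool of documents grows at the same pace.

```python
def mentions_per_10k(mentions, total_docs):
    """Return phrase mentions per 10,000 documents for each year."""
    rates = {}
    for year, count in mentions.items():
        # Normalize the raw count by the size of that year's document pool.
        rates[year] = 10_000 * count / total_docs[year]
    return rates


# Hypothetical figures for illustration only (assumed, not real search results).
mentions = {2000: 2_800, 2004: 3_500}
total_docs = {2000: 4_000_000, 2004: 5_000_000}

rates = mentions_per_10k(mentions, total_docs)
for year in sorted(rates):
    print(year, mentions[year], round(rates[year], 2))
```

With these made-up inputs, the raw count rises from 2,800 to 3,500 while the rate works out the same in both years. Whether the real data behaves this way depends entirely on the true document totals, which is exactly the number the original article left out.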

Ah. Lies, damn lies and statistics.
