Data mining and bad math

There’s a lot of speculation that the warrantless spying authorized by the Bush administration is using some kind of TIA-like, Echelon-like massive data gathering and data mining operation.
That would explain why the administration couldn’t get FISA warrants. If that’s what they’re doing, it’s arguably a bad idea even if it were legal (which right now it pretty clearly isn’t).
You can get warrants if you are spying on one, or five, or twenty people. You can’t get warrants if you are spying on 100,000 people, or 1 million people.
It’s also why they couldn’t use the “after the fact” exemption in FISA. Under FISA, the government can start spying immediately, and ask for the warrant up to 72 hours later. But if you’ve amassed petabytes of data on millions of people, the analysts won’t have analyzed it all in 72 hours. Maybe they go back and look for a pattern months after the fact.
Even if it were legal, though, it would arguably be a bad idea. Bruce Schneier makes the best argument that data mining is in many cases less effective than traditional, lead-based investigative work.
When you’re looking for a needle in a haystack, data mining is bad math. It’s very different from the use of data mining to detect credit risk patterns. In the US, there are probably tens of millions of people who are iffy credit risks, and there are different probabilities of default. It’s reasonable to use math to assign a credit rating based on probability. And there’s a competitive market for credit. If an individual gets turned down by one provider, they might get credit from another. It’s not a binary thing.
But what about looking for terrorist sympathizers? Islamist terrorists in the US are rare. How many potential terrorists in the US are willing to kill innocent civilians? Maybe 100 or 200. Not that many. How big is their network of sympathizers and supporters? Maybe a few thousand? By contrast, how many people are there who are news buffs, ordinary Muslims, and ordinary, never-violent political activists? Many millions.
So a data mining operation that looked for keywords would find many, many more innocent people than potential terrorists. The government would waste its time reading this blog post and menus for mosque community dinners.
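To put rough numbers on the keyword-matching argument above, here is a back-of-the-envelope sketch in Python. Everything in it is an assumption for illustration: 3,000 stands in for "a few thousand" sympathizers, 10 million for "many millions" of innocent people, and the 99% hit rate and 1% false alarm rate describe a hypothetical filter that is far better than any real keyword search is likely to be. The function name is made up too.

```python
def flagged_breakdown(targets, innocents, hit_rate, false_alarm_rate):
    """Estimate who ends up on the flagged list.

    targets          -- people the search is actually meant to find
    innocents        -- people whose traffic merely looks similar
    hit_rate         -- fraction of targets the filter flags
    false_alarm_rate -- fraction of innocents the filter flags by mistake
    Returns (true_hits, false_alarms, share_of_flags_that_are_real).
    """
    true_hits = targets * hit_rate
    false_alarms = innocents * false_alarm_rate
    share_real = true_hits / (true_hits + false_alarms)
    return true_hits, false_alarms, share_real


# Illustrative guesses only, not real figures.
true_hits, false_alarms, share_real = flagged_breakdown(
    targets=3_000,
    innocents=10_000_000,
    hit_rate=0.99,
    false_alarm_rate=0.01,
)

print(f"real targets flagged:    {true_hits:,.0f}")
print(f"innocent people flagged: {false_alarms:,.0f}")
print(f"share of flags that are real: {share_real:.1%}")
```

With these made-up inputs, the filter flags about 100,000 innocent people alongside fewer than 3,000 real targets, so under 3% of the names on the list are worth an analyst's time. Make the filter worse, or the innocent population bigger, and the share drops further.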
When you are looking to assess a credit rating, being about right is OK. If someone pays a rate of 15% instead of 14%, not that much harm is done. But when you are looking for a terrorist, you want to be 100% right. It doesn’t help if you miss a killer and abduct Uncle Abdul the hardware store owner.
The government would be much better off doing the traditional job of finding leads, getting warrants, trailing those people, and finding their contacts. That sort of hard work actually has a higher probability of success than the data mining approach.

