Big Data and Data Science Piss Me Off

Get off my lawn!

I don’t talk about this much, but I actually trained in statistics, not in computer science, and I’ve been getting slowly but progressively weirded out by the whole “big data” / “data science” thing. Because so much of it is bogus, or boys-with-toys or something.

Basically, my objections to the big data thing are the usual: probably your data is not big. It really isn’t, and there are some great blog posts all about that.

So that’s point number one: most people blabbing on about big data can fit their problem onto one big, vertically scaled machine and analyze it to their heart’s content in R or something.

Point number two is less frequently touched upon: sure, you have 2 trillion records, but why do you need to look at all of them? The whole point of an education in statistics is to learn how to reason about a population using a random sample. So why are all these alleged “data scientists” firing up massive compute clusters to summarize every single record in their collections?

I’m guessing it’s the usual reason: because they can. And because the current meme is that they should. They should stand up a 100-node cluster on AWS and bloody well count all 2 trillion of them. Because: CPUs.

But honestly, if you want to know the age distribution of people buying red socks, draw a random sample of a couple hundred thousand records, and you’ll find out to within a fraction of a percentage point, 19 times out of 20. After all, you’re a freaking “data scientist”, right?
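
For anyone who wants to check the arithmetic, here is a rough sketch in Python. It assumes a simple random sample and the textbook normal approximation for a proportion (say, the share of red-sock buyers in one age bracket), and it uses the worst case p = 0.5. The function names and the simulated 30% share are made up for illustration; they are not from any real dataset. Note that the size of the full collection never appears in the formula.

```python
import math
import random

Z_95 = 1.96  # two-sided 95% normal quantile, i.e. "19 times out of 20"


def margin_of_error(n, p=0.5, z=Z_95):
    """Half-width of the normal-approximation interval for a proportion.

    p=0.5 is the worst case, so this is a conservative bound.
    The population size (2 trillion or otherwise) does not enter.
    """
    return z * math.sqrt(p * (1 - p) / n)


def simulate_coverage(true_share, n, trials, seed=0):
    """Fraction of simulated samples whose estimate lands within the margin.

    true_share is a hypothetical population proportion chosen for the demo.
    """
    rng = random.Random(seed)
    moe = margin_of_error(n)
    hits = 0
    for _ in range(trials):
        # Each record independently falls in the bracket with prob. true_share.
        count = sum(rng.random() < true_share for _ in range(n))
        if abs(count / n - true_share) <= moe:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    n = 200_000
    print(f"n = {n}: margin of error is about "
          f"{100 * margin_of_error(n):.2f} percentage points")
    # Smaller n here only to keep the simulation quick.
    print("simulated coverage:", simulate_coverage(0.30, n=2_000, trials=300))
```

With 200,000 records the worst-case margin comes out to roughly 0.22 percentage points, which is the "fraction of a percentage point" above. The simulation is only a sanity check on a smaller sample; because the margin uses p = 0.5, the observed coverage should sit at or a little above 95%.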