Big Data and Data Science Piss Me Off
11 Aug 2015
Get off my lawn!
I don’t talk about this much, but I actually trained in statistics, not in computer science, and I’ve been getting slowly but progressively weirded out by the whole “big data” / “data science” thing. Because so much of it is bogus, or boys-with-toys or something.
So that’s point number one: most people blabbing on about big data can fit their problem onto one big, vertically scaled machine and analyze it to their heart’s content in R or something.
Point number two is less frequently touched upon: sure, you have 2 trillion records, but why do you need to look at all of them? The whole point of an education in statistics is to learn how to reason about a population using a random sample. So why are all these alleged “data scientists” firing up massive compute clusters to summarize every single record in their collections?
I’m guessing it’s the usual reason: because they can. And because the current meme is that they should. They should stand up a 100-node cluster on AWS and bloody well count all 2 trillion of them. Because: CPUs.
But honestly, if you want to know the age distribution of people buying red socks, draw a sample of a couple hundred thousand records, and find out to within a fraction of a percentage point 19-times-out-of-20. After all, you’re a freaking “data scientist”, right?
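For the skeptics, here is a rough sketch of the arithmetic in Python. Everything in it is made up for illustration: the synthetic population of 2 million uniformly distributed ages, the "under 30" bucket, and the helper names are my assumptions, not real sales data. It uses the textbook normal-approximation margin of error for a proportion, with p = 0.5 as the worst case. For a sample of 200,000 that works out to about 0.22 percentage points at the 95% level, which is the 19-times-out-of-20 part.

```python
import math
import random


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion estimated from n records.

    Normal approximation; p=0.5 is the worst case (widest interval).
    """
    return z * math.sqrt(p * (1.0 - p) / n)


def share_under_30(ages):
    """Fraction of the given ages that fall below 30 (an arbitrary example bucket)."""
    return sum(1 for a in ages if a < 30) / len(ages)


# Hypothetical stand-in for the full table of red-sock buyers.
rng = random.Random(2015)
population = [rng.randint(18, 80) for _ in range(2_000_000)]

# Simple random sample without replacement: "a couple hundred thousand records".
sample = rng.sample(population, 200_000)

true_share = share_under_30(population)
sample_share = share_under_30(sample)
moe = margin_of_error(len(sample))

print(f"full scan:       {true_share:.4%}")
print(f"sample estimate: {sample_share:.4%}")
print(f"margin of error: +/- {moe * 100:.2f} percentage points")
```

Note that the margin of error depends on the sample size, not on how big the population is, so the same 200,000 records do the job whether you are sitting on 2 million rows or 2 trillion.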