Big Data and Data Science Piss Me Off
11 Aug 2015
Get off my lawn!
.@galvanize bringing its 12-week Data Science boot camp to Denver.
You still stoked about studying for the GRE ? pic.twitter.com/EMFgUq0OFV
— Brian Timoney (@briantimoney) August 11, 2015
I don’t talk about this much, but I actually trained in statistics, not in computer science, and I’ve been getting slowly but progressively weirded out by the whole “big data” / “data science” thing. Because so much of it is bogus, or boys-with-toys or something.
Basically, my objections to the big data thing are the usual: probably your data is not big. It really isn’t, and there are some great blog posts all about that.
So that’s point number one: most people blabbing on about big data can fit their problem onto a single big, vertically scaled machine and analyze it to their heart’s content in R or something.
Point number two is less frequently touched upon: sure, you have 2 trillion records, but why do you need to look at all of them? The whole point of an education in statistics is to learn how to reason about a population using a random sample. So why are all these alleged “data scientists” firing up massive compute clusters to summarize every single record in their collections?
I’m guessing it’s the usual reason: because they can. And because the current meme is that they should. They should stand up a 100 node cluster on AWS and bloody well count all 2 trillion of them. Because: CPUs.
But honestly, if you want to know the age distribution of people buying red socks, draw a random sample of a couple hundred thousand records, and find out to within a fraction of a percentage point, 19 times out of 20. After all, you’re a freaking “data scientist”, right?
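If you want to check the arithmetic behind that "fraction of a percentage point", here is a rough sketch in Python. It assumes a simple random sample and uses the textbook normal-approximation margin of error for a proportion; the function name, the 200,000 sample size, and the made-up 30% "true share" in the simulation are my own illustrative choices, not real red sock data.

```python
import math
import random


def margin_of_error(n, p=0.5, z=1.96):
    """Half-width of a ~95% (19-times-out-of-20) interval for a proportion.

    Normal approximation, simple random sample of size n.
    p=0.5 is the worst case, so this is an upper bound on the margin.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Worst-case margin for a sample of 200,000 records.
moe = margin_of_error(200_000)
print(f"margin of error: {100 * moe:.2f} percentage points")  # -> 0.22

# Quick simulated sanity check with a hypothetical population in which
# 30% of red sock buyers fall into some age bracket (an invented number).
rng = random.Random(42)
true_share = 0.30
n = 200_000
hits = sum(rng.random() < true_share for _ in range(n))
estimate = hits / n
print(f"true share: {true_share:.4f}, sample estimate: {estimate:.4f}")
```

Notice that the size of the population never appears in the formula. Whether the collection holds 2 million or 2 trillion records, the precision depends on the sample size, provided the sample really is random.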