We've written about BIG data before, and while some reckon it's sexy, you'd better roll up your sleeves, because you'll invariably need to do a lot of 'janitorial' (a.k.a. shit) work first!
Ron Sandland recently wrote about the new phenomenon of 'big data' - weighing up the benefits and concerns. Terry Speed reflected on the same issue in a talk earlier this year in Gothenburg, Sweden, noting that this is nothing new to statisticians. So what's all the fuss about? Here's another take on the 'big data' bandwagon.
To repeat a quote (attributed to Dan Ariely) from Terry Speed's talk:
"Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it... "
It was Hal Varian, Google's Chief Economist, who pronounced that "the sexy job in the next ten years will be statisticians". His affirmation of the importance of statistical science no doubt reflects Google's massive data capturing and processing aspirations. I share Ron's sentiments that 'big data' is potentially both a boon and a curse.

However, I have two concerns: (i) the 'sexy' label may be turning this into a fad, reminiscent of how Statistical Process Control (SPC) was hijacked some 25 years ago by non-statisticians who preferred jargon (JIT - Just in Time; TQM - Total Quality Management; Six Sigma etc.) to rigorous science; and (ii) the importance of "little data" is being overlooked or ignored.

On this latter point, I work with ecotoxicologists both here and abroad. One of our primary tasks is to set thresholds on concentrations of contaminants in water bodies such that some high fraction (e.g. 99%) of all species in an ecosystem will be protected. The process is highly statistical and, unlike Hal Varian and Google, we do not have terabytes of data. For us, a sample of size 10 is a luxury! So before we all jump on the 'big data' bandwagon because it's suddenly become fashionable (even 'sexy'), we need to be mindful that the information content of a data set is not always in direct proportion to n!
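To make that last point concrete, here is a minimal sketch of one way information can fall short of n. It assumes, purely for illustration, that the observations follow a stationary AR(1) process with lag-one correlation `rho`, and computes the number of independent observations that would estimate the mean equally well. The function name and the values of `n` and `rho` are hypothetical and are not taken from any of the analyses mentioned above.

```python
def effective_sample_size(n: int, rho: float) -> float:
    """Number of independent observations that give the same variance for
    the sample mean as n observations from a stationary AR(1) process
    with lag-one correlation rho (an assumed, illustrative model)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    # Var(mean) = (sigma^2 / n) * [1 + 2 * sum_{k=1}^{n-1} (1 - k/n) * rho^k]
    inflation = 1.0 + 2.0 * sum((1.0 - k / n) * rho**k for k in range(1, n))
    return n / inflation


if __name__ == "__main__":
    # Hypothetical sample sizes and correlations, chosen only for illustration.
    for n in (10, 1_000, 100_000):
        for rho in (0.0, 0.5, 0.9):
            n_eff = effective_sample_size(n, rho)
            print(f"n = {n:>7}, rho = {rho:.1f}: effective n = {n_eff:,.1f}")
```

With `rho = 0` the effective sample size equals n; as `rho` grows, each extra observation adds less and less, so a big but heavily correlated data set can carry far less information than its row count suggests.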