We've written about BIG data before and while some reckon it's sexy, you'd better roll up your sleeves because you'll invariably need to do a lot of 'janitorial' (a.k.a. shit) work first!
Ron Sandland recently wrote about the new phenomenon of 'big data' - weighing up the benefits and concerns. Terry Speed reflected on the same issue in a talk earlier this year in Gothenburg, Sweden, noting that this is nothing new to statisticians. So what's all the fuss about? Here's another take on the 'big data' bandwagon.
A recent article in the New York Times looks at the pre-processing statisticians face when organising data to get it into shape for more sophisticated analyses - a process variously referred to as 'data wrangling', 'data munging' and 'data janitor work'.
While not a very inspiring task, it is one that is both essential and often incredibly time-consuming. The time spent getting your data into the necessary format can be reduced considerably by some pre-planning and discussion with the data providers.
We often have to deal with the processing of huge volumes of data generated by autonomous loggers. One of the things that really throws a spanner in the works is when the data supplier decides to make changes without telling us. Little things that seem incidental to the data gatherer such as renaming the variables in a spreadsheet, changing the column ordering, or changing the way missing data are recorded will invariably cause automated scripts in R to crash.
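One cheap defence against these silent format changes is to validate the file layout up front and fail loudly, rather than letting a script limp along on misread columns. Here is a minimal sketch of the idea in Python (the scripts mentioned above are in R, but the principle is language-agnostic); the column names and the 'NA' missing-data code are illustrative assumptions, standing in for whatever you've agreed with your data provider.

```python
import csv

# Illustrative assumptions: the layout agreed with the data provider.
EXPECTED_COLUMNS = ["site", "timestamp", "temperature"]
MISSING_CODE = "NA"  # the agreed missing-data marker

def load_logger_file(path):
    """Read a logger CSV, failing loudly if the supplier changed the format."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        # Check names AND ordering - a renamed or reordered column
        # should stop the pipeline, not silently corrupt the analysis.
        if header != EXPECTED_COLUMNS:
            raise ValueError(
                f"Unexpected columns {header!r}; expected {EXPECTED_COLUMNS!r}. "
                "Has the data supplier changed the file layout?"
            )
        rows = []
        for row in reader:
            # Map only the agreed missing-data code to None; any other
            # surprise value will surface downstream instead of being guessed at.
            rows.append([None if value == MISSING_CODE else value for value in row])
        return rows
```

The same check is easy to express in R (e.g. comparing `names(read.csv(...))` against an expected vector with `stopifnot`); the point is that the expected layout lives in one place in the script, so a change by the data gatherer produces an immediate, informative error.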
So - the take-home message is: if you want to minimise the cost of data preparation and organisation, have a PLAN and stick to it! If changes need to be made, do it by consensus and discuss the necessary changes with all concerned BEFORE you do anything.