We've written about BIG data before and, while some reckon it's sexy, you'd better roll up your sleeves because you'll invariably need to do a lot of 'janitorial' (a.k.a. shit) work first!
Ron Sandland recently wrote about the new phenomenon of 'big data' - weighing up the benefits and concerns. Terry Speed reflected on the same issue in a talk earlier this year in Gothenburg, Sweden, noting that this is nothing new to statisticians. So what's all the fuss about? Here's another take on the 'big data' bandwagon.
R - the Wikipedia of statistical software? August 20, 2014
The R computing environment is feature-rich, incredibly powerful, and best of all - free! But to what extent can we trust user-contributed packages?
There was a time when I believed the 'R-learning curve' was too steep and the investment of time and effort not warranted by virtue of: (a) having software that did most if not all of what I needed to do; and (b) undertaking analyses which tended to be 'one-off' and not requiring code which could be recycled later on.
I no longer subscribe to that view and, having written over 5,000 lines of R-code and taught a number of courses in both elementary and advanced uses of R, I think I can say I'm 'up to speed' on R.
However, before my epiphany, I used to vigorously defend my inefficient approach to statistical analysis, arguing that at least by using commercially-produced software I had some level of assurance that the algorithms had undergone rigorous QA/QC and that there was some level of customer support if problems arose.
During this period, I recall on more than one occasion having 'robust' dinner/bar conversations about R's QA/QC processes with some of my statistical colleagues who, it is true to say, are R aficionados and who had seen the 'R-light' many years before me. In response to my question "so how can you be sure that an R package gives the correct results?", one of these colleagues wryly replied "you get what you paid for". Touché! Although I'm well down the R path and fully immersed in this computing paradigm, I have not been able to extinguish the nagging doubt about this fundamental quality assurance issue.
More recently I came across an R promotional video produced by Revolution Analytics (http://goo.gl/u5t5ph) which spoke to my unresolved issue with its reference to R's "crowd-sourced quality validation and support from the most recognised industry leaders in every field that uses R".
But is this sufficient? After all, given that no money has changed hands, there are no consumer warranties and hence no commercial and/or legal imperatives to fix that which is not right, or even that which is known not to be right.
As a recent example of this, I was using the R-package 'lpSolve' to find a solution to a large (strictly) integer programming problem. After 12+ hours of grinding away, the software did indeed identify a solution. Pleased with the outcome, I proceeded to use the lpSolve variable that holds the computed optimal value of the objective function (which should have been of class integer) as an array dimension. It was at this point that my own R-code crashed. A quick check revealed that this variable was declared as double precision and had a value equal to the correct integer solution plus delta, where delta was of the order 10^(-14). Clearly no big deal and easily fixed, but what I found more intriguing was the following in the lpSolve documentation with respect to setting the use.rw argument of the lp() function to TRUE (page 3): "This is just to defeat a bug somewhere. Although the default is FALSE, we recommend you set this to TRUE if you need num.bin.solns > 1, until the bug is found". How long has that disclaimer been there, I wondered, and what is this mysterious bug and when will it be fixed? Given that I was looking for multiple optimal solutions, I did need to set num.bin.solns to something greater than unity. Re-running lpSolve with this option invoked did work, but in an unexpected way.
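To make the fix concrete, here is a minimal sketch of the kind of defensive coercion involved. The toy objective and constraints are invented purely for illustration (the real problem was far larger); the point is simply that the value lp() reports for the objective is stored as a double and may need rounding before it can safely be used as an array dimension.

    library(lpSolve)

    # Toy integer program standing in for the (much larger) real problem:
    # the objective and constraints are purely illustrative.
    obj <- c(3, 2, 4)
    con <- matrix(c(1, 1, 1,
                    2, 1, 0), nrow = 2, byrow = TRUE)
    res <- lp(direction = "max", objective.in = obj,
              const.mat = con, const.dir = c("<=", "<="),
              const.rhs = c(10, 8), all.int = TRUE)

    # The reported optimum is stored as double precision and can come back as
    # the true integer solution plus something of order 10^(-14), so coerce it
    # explicitly before using it as an array dimension.
    k <- as.integer(round(res$objval))
    results <- array(NA_real_, dim = c(k, length(obj)))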
Conventionally, but not always, all the necessary output from an R routine can be accessed either by typing the name of the object to which the output has been assigned, or alternatively by typing summary(object). Not so the lpSolve package. Typing the name of the object simply gives: "Success: the objective function is ...". Applying the summary command to the object simply identifies the components of this object (which is an R list). Drilling down into this list, one can extract the solution vector. Doing this, I was surprised to find the values of my decision variables for the multiple optimal solutions simply concatenated into one long, unstructured vector whose length was equal to the number of decision variables multiplied by the number of solutions, plus one! What's the extra component, I wondered? There was nothing in the documentation about that, so I shot off an email to the maintainer of the lpSolve package. I received a prompt response which suggested that the solution vector length issue was a quick fix to an inconsistency between R and C, together with an admission that both the documentation and the output needed improvement.
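For the curious, here is a sketch of how one might unpack that concatenated vector. It uses a deliberately small all-binary toy problem (the documented use case for num.bin.solns; my strictly integer problem was analogous), and it assumes the extra element sits at the end of the vector, which is worth checking against your own output since the documentation is silent on the point.

    library(lpSolve)

    # Small all-binary toy problem with several equally good solutions,
    # purely for illustration. The documentation recommends use.rw = TRUE
    # whenever num.bin.solns > 1.
    obj <- c(1, 1, 1)
    con <- matrix(c(1, 1, 1), nrow = 1)
    res <- lp(direction = "max", objective.in = obj,
              const.mat = con, const.dir = "<=", const.rhs = 2,
              all.bin = TRUE, num.bin.solns = 3, use.rw = TRUE)

    # res$solution comes back as one flat vector of length
    # (number of decision variables * number of solutions) + 1.
    # Assuming the extra element is the trailing one, drop it and reshape
    # so that each row holds one candidate solution.
    n.vars  <- length(obj)
    n.solns <- 3                      # the number of solutions requested
    solns   <- matrix(res$solution[seq_len(n.vars * n.solns)],
                      nrow = n.solns, ncol = n.vars, byrow = TRUE)
    solns

Whether the solver actually returns as many distinct optima as you asked for will, of course, depend on the problem.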
While I appreciate the honesty, it does go to show that with R it's caveat emptor or, as my colleague Andrew Robinson at the University of Melbourne prefers, caveat computator!