We've written about BIG data before, and while some reckon it's sexy, you'd better roll up your sleeves because you'll invariably need to do a lot of 'janitorial' (a.k.a. shit) work first!
Ron Sandland recently wrote about the new phenomenon of 'big data', weighing up the benefits and concerns. Terry Speed reflected on the same issue in a talk earlier this year in Gothenburg, Sweden, noting that this is nothing new to statisticians. So what's all the fuss about? Here's another take on the 'big data' bandwagon.
Probability Weighted Indices for Improved Ecosystem Report Card Scoring May 09, 2014
A new way of calculating an environmental index is described in the paper "Probability Weighted Indices for improved ecosystem report card scoring", which has been published in Environmetrics.
Assessing the state of the environment, or even a small part of it, is a challenging task. We can observe it, measure it, and monitor it, but a more fundamental question is "how?". A related question is what "it" is: everything? just the important things? the threatened things? the sentinels? or only the things that directly affect us? These are difficult questions to answer, but they are ones that environmental scientists, natural resource managers and governments have to address on a regular basis. Clearly we can't measure everything, so the task is reduced to identifying a set of variables or parameters that are either measurable or observable and which, either separately or together, provide useful information about current status and trends in environmental condition over space and time.
Notwithstanding the difficulties listed above, additional consideration needs to be given to (a) the method(s) and metrics by which we assess change in condition; and (b) the development of an appropriate aggregation or 'pooling' scheme to combine many measures into a single 'index'. To give this some relevance, suppose we have decided to assess the condition of a water body by taking measurements on two parameters: dissolved oxygen (DO) and total suspended sediments (TSS). The first question that must be addressed is what are the threshold or benchmark values of DO and TSS that (individually) represent minimum desired outcomes? Assuming this question can be addressed we then need to combine the results for DO and TSS into a single assessment - usually either an index between 0 and 100, a 'traffic light' colour (red, amber, green), or a label (good, fair, poor). This raises further questions. For example, should the final description assign equal weighting to both DO and TSS? Perhaps it's more important to keep DO levels high and so we might weight this more heavily in our assessment process.
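As a rough illustration of the aggregation step, the sketch below scores DO and TSS against benchmarks and pools the two scores into a 0 to 100 index with a 'traffic light' colour. The benchmark values, readings, weights, colour cut-offs and function names are all hypothetical, chosen only to make the example concrete; none of them comes from the paper.

```python
def score(value, benchmark, higher_is_better):
    """Score one reading in [0, 1]; 1.0 means the benchmark is met."""
    if higher_is_better:
        ratio = value / benchmark
    else:
        # Lower is better (e.g. TSS); a zero reading trivially meets the benchmark.
        ratio = benchmark / value if value > 0 else 1.0
    return max(0.0, min(1.0, ratio))


def pooled_index(scores, weights):
    """Weighted average of the parameter scores, rescaled to 0-100."""
    total = sum(weights[name] for name in scores)
    return 100.0 * sum(weights[name] * scores[name] for name in scores) / total


def traffic_light(index):
    """Map a 0-100 index to a colour using hypothetical cut-offs."""
    if index >= 80:
        return "green"
    if index >= 50:
        return "amber"
    return "red"


# Hypothetical readings and benchmarks (mg/L).
scores = {
    "DO": score(4.5, benchmark=6.0, higher_is_better=True),
    "TSS": score(40.0, benchmark=25.0, higher_is_better=False),
}

equal = pooled_index(scores, {"DO": 1.0, "TSS": 1.0})
do_heavy = pooled_index(scores, {"DO": 2.0, "TSS": 1.0})
print(equal, traffic_light(equal))
print(do_heavy, traffic_light(do_heavy))
```

Changing the weights changes the index even though the underlying readings are identical, which is why the choice of weighting has to be defended on ecological grounds.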
The present paper focuses attention on the first problem, namely, how do we decide if an individual result is good, bad, or unchanged? The way this is usually done is to compare the current reading with an average value that has been computed from an unimpacted system, or one that is otherwise regarded as 'healthy'. While this may be intuitively plausible, it suffers from one significant drawback: the average value of an environmental parameter is rarely a good benchmark. The reason for this is that ecosystems tend to respond to extreme conditions rather than 'average' conditions. Thus, for example, it is not the average concentration of a pollutant that is toxic or the average level of dissolved oxygen that leads to fish dying. Just as engineers design bridges to withstand the heaviest load or dams to withstand the 1-in-100-year flood, environmental scientists similarly need to focus on extreme outcomes.
The insensitivity of the arithmetic mean to important changes in environmental condition reminds one of the joke about the statistician who has his head in a freezer and his feet in an oven yet claims that, on average, he feels fine! Similarly, if we measured dissolved oxygen on two occasions and found one was three standard deviations below the benchmark and the other was three standard deviations above the benchmark we would erroneously conclude, because the average is equal to the benchmark, that the water quality was totally acceptable. In fact what we are likely to see is dead fish floating on the surface as a result of the extremely low DO event.
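The dissolved oxygen example can be checked with a few lines of arithmetic. The benchmark and standard deviation below are hypothetical values picked for illustration; only the "three standard deviations either side" structure comes from the text.

```python
from statistics import mean

benchmark = 6.0  # hypothetical DO benchmark, mg/L
sd = 1.5         # hypothetical standard deviation, mg/L

# One reading three standard deviations below, one three above.
readings = [benchmark - 3 * sd, benchmark + 3 * sd]

average = mean(readings)
worst = min(readings)

print("average:", average)  # equals the benchmark
print("lowest reading:", worst)
```

The average sits exactly on the benchmark, while the lowest reading is far below it; a comparison based on the mean alone cannot see the low DO event.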
The present paper takes a slightly more sophisticated approach to this problem by providing an additional 'knob' which the environmental scientist can use to fine-tune the assessment process so that it focuses on the more ecologically relevant portion of the response distribution. It does this in a way that still retains all the monitoring data but, crucially, gives some values more weight than others in the final determination.
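To give a feel for what such a 'knob' might look like, here is a minimal sketch of a tail-weighted average. It is not the formula from the paper: the weighting function, the parameter `k`, the function name and the readings are all assumptions made for illustration. Every reading is kept, but readings in the lower tail receive more weight as `k` increases, and `k = 0` recovers the ordinary arithmetic mean.

```python
def tail_weighted_mean(values, k=0.0):
    """Weighted mean that emphasises low values as k grows (illustrative only)."""
    ordered = sorted(values)
    n = len(ordered)
    # Plotting-position estimate of each value's cumulative probability.
    probs = [(i + 0.5) / n for i in range(n)]
    # Hypothetical weighting: low-probability (small) values get larger weights.
    weights = [(1.0 - p) ** k for p in probs]
    return sum(w * v for w, v in zip(weights, ordered)) / sum(weights)


# Hypothetical DO readings (mg/L), including one very low event.
do_readings = [1.5, 5.5, 6.0, 6.5, 10.5]

plain = tail_weighted_mean(do_readings, k=0.0)        # ordinary mean
low_focused = tail_weighted_mean(do_readings, k=4.0)  # pulled toward the low tail

print(plain, low_focused)
```

With these made-up readings the ordinary mean is 6.0, whereas the low-focused value drops to roughly 2.8, so the low DO event is no longer hidden by the high reading.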