84.9% of us have heard that you can lie with statistics, yet another case of perfectly fine math being corrupted, misused or misunderstood when it encounters the real world (actually, I made up the 84.9% bit, but certainly LOTS of you have heard that…). Who better to turn to for an explanation than a statistician: Nate Silver, well known for his predictions of election results? In his recent, very readable book, The Signal and the Noise, Silver discusses many of the sources of error people encounter when doing statistics (the examples and data in this post are from the book).
One of his examples is why weather forecasting has improved dramatically in 50 years, while earthquake forecasting has not improved at all. According to Silver, weather forecasters have cut their forecast errors by about half, in large part because ever more capable supercomputers allow ever more precision in modeling weather patterns. The average error in a high temperature forecast made three days in advance was 6 degrees in 1970 and only 3.8 degrees in 2010. The chance of an American being killed by lightning has dropped from 1 in 400,000 in 1940 to 1 in 11 million today, thanks partly to improved forecasts, and the average error in the three-day advance forecast of where a hurricane will make landfall has declined from 350 miles to 100 miles in just twenty-five years.
Earthquakes, by comparison, are not as well understood. A major finding in 1944, the Gutenberg-Richter Law, established that earthquake sizes follow a power law distribution, so the frequency of large earthquakes in an area can be predicted directly from the frequency of small earthquakes there. Then came plate tectonics in the 1950s and 1960s, which provided a basis for the location and mechanism of earthquakes; since then, virtually nothing has stood the test of time. The problem is that, although we can predict the frequency of big earthquakes on a given fault, we do not know (and may never know) how to predict the timing of a big earthquake well enough to be useful. To say that a large earthquake will likely happen in Seattle in the next 150 years is not very useful, and it does not even imply that fifty years from now we will be able to say a powerful quake is likely there within the following 100 years. About half the time major earthquakes are preceded by clusters of smaller temblors, but that means about half the time they are not. Similarly, the fact that a cluster of smaller quakes has occurred is not very useful in determining whether a larger quake will follow.
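To make the Gutenberg-Richter relationship concrete, here is a rough Python sketch. The law is commonly written as log10(N) = a - b*M, where N is the number of quakes per year at or above magnitude M. The values of a and b below are made up for illustration (they are not from the book, though b near 1 is typical), and treating quakes as a memoryless Poisson process is a simplifying assumption of mine, not a claim about how any real fault behaves.

```python
import math

# Illustrative constants only: a and b are hypothetical values for an
# imaginary region, not measurements from any real fault.
def annual_rate(magnitude, a=4.0, b=1.0):
    """Expected quakes per year at or above `magnitude` (Gutenberg-Richter)."""
    return 10 ** (a - b * magnitude)

def return_period_years(magnitude, a=4.0, b=1.0):
    """Average number of years between quakes at or above `magnitude`."""
    return 1.0 / annual_rate(magnitude, a, b)

def prob_at_least_one(magnitude, years, a=4.0, b=1.0):
    """Chance of at least one such quake in `years`, ASSUMING a Poisson process."""
    return 1.0 - math.exp(-annual_rate(magnitude, a, b) * years)

for m in (4.0, 5.0, 6.0, 7.0):
    print(f"M{m:.0f}+: {annual_rate(m):.4f} per year, "
          f"about one every {return_period_years(m):.0f} years, "
          f"{prob_at_least_one(m, 150):.0%} chance in 150 years")
```

Each step up in magnitude makes quakes ten times rarer when b is 1, which is how counting small quakes tells you the frequency of big ones. Under the memoryless assumption, though, the chance of a big quake in the next 150 years is the same whether you ask today or fifty years from now, which is one way to see why knowing the frequency does not pin down the timing.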
One of the problems in applying statistics to the real world is that statistics generally describe properties of populations as a whole, not the individuals or individual events within the population that we wish they described. The list of pitfalls doesn’t stop there, however, and in fact it is very long, including sampling bias (the sample you pick to measure has to accurately represent the larger population), extrapolation (assuming that current trends will continue), over-fitting (beware of anyone who claims to use hundreds of different variables in their predictions), “correlation is not causality”, and so forth. Furthermore, the systems that people measure statistically are often very dynamic and non-linear, so what you think you understand today may not actually apply tomorrow. More on these topics next week when “Where Math Breaks” explores Big Data and Analytics.
Finally, there is the question of lying with statistics. Back to our weather example: data show that the National Weather Service predictions of rain are very good. Measured over enough days, it really does rain 60% of the time when they give a Forecast Probability of 60%, and this is true for the full range of Forecast Probabilities all the way from 0% to 100%. The Weather Channel, on the other hand, has a designed-in “bias”: they consistently and deliberately “lie” with statistics at the low end of the probability range. They have found that The Weather Channel audience is much more forgiving if they forecast a 20% chance of rain when they know (from the National Weather Service) that it’s really only 5% than if they err in the other direction. If rain is forecast and it turns out sunny, nobody complains; if they guarantee a sunny day and a rain shower pops up, they hear about it and ratings suffer. But don’t be too hard on The Weather Channel…they do much better and are much more consistent than local TV weather forecasts, where entertainment takes precedence over statistical accuracy.
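The “it really does rain 60% of the time” test is called calibration, and it is easy to sketch in Python. Everything below is simulated: the probability levels, the number of days, and the `wet_bias` rule (never state less than a 20% chance) are hypothetical stand-ins I chose to mimic the kind of bias described above, not actual National Weather Service or Weather Channel data or methods.

```python
import random
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """Map each stated probability to the fraction of those days it rained."""
    counts = defaultdict(lambda: [0, 0])  # stated prob -> [rainy days, total days]
    for p, rained in zip(forecasts, outcomes):
        counts[p][0] += int(rained)
        counts[p][1] += 1
    return {p: rainy / total for p, (rainy, total) in sorted(counts.items())}

def wet_bias(p, floor=0.2):
    """Hypothetical bias: never state a chance of rain below `floor`."""
    return max(p, floor)

# Simulated weather: each day has a true chance of rain drawn from a few levels.
rng = random.Random(0)
levels = [0.05, 0.2, 0.4, 0.6, 0.8]
true_probs = [rng.choice(levels) for _ in range(50_000)]
outcomes = [rng.random() < p for p in true_probs]

honest = calibration_table(true_probs, outcomes)
biased = calibration_table([wet_bias(p) for p in true_probs], outcomes)

for p in sorted(set(honest) | set(biased)):
    h = f"{honest[p]:.3f}" if p in honest else "  -  "
    b = f"{biased[p]:.3f}" if p in biased else "  -  "
    print(f"stated {p:.2f}: honest observed {h}, biased observed {b}")
```

In this simulation the honest forecaster's observed frequencies track the stated probabilities, while the biased forecaster's “20%” days turn out rainy noticeably less often than 20% of the time, because that bucket now also holds the days that were really 5%.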