Last week’s (22 October) post for “Where Math Breaks” addressed Statistics. Big Data and Analytics are Statistics on steroids with a whole lotta marketing hype mixed in, all designed to influence your social behavior without your having to do anything. A volatile mix to be sure, and an easy target for wholesale math breakage.
First, to the Big Data piece. Roughly speaking, there are only five places in the U.S. that have actual Big Data: the Federal Government, Google, Amazon, Netflix, and the New York Stock Exchange. Everyone else, by definition, has not-so-Big Data. The now-classic definition of Big Data (volume, velocity, variety) is descriptive rather than practical, as is the much more relativistic “big data is anything your current tools can’t handle.” One practical metric, akin to how many rows your spreadsheet can handle, is whether you can fit all your data into memory at once in order to run normal statistics against it. The current affordable limit is on the order of a couple hundred gigabytes, which is really a whole bunch of data.
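To make the “fits in memory” test concrete, here is a back-of-the-envelope sketch. The column count, value size, and 200 GB budget are illustrative assumptions, not hard numbers:

```python
# Rough sketch: can a tabular dataset fit in memory for in-core statistics?
# Assumes (hypothetically) 100 numeric columns of 8-byte floats and a
# ~200 GB RAM budget, the "affordable" ceiling mentioned above.
def fits_in_memory(n_rows, n_cols=100, bytes_per_value=8,
                   ram_bytes=200 * 10**9):
    return n_rows * n_cols * bytes_per_value <= ram_bytes

print(fits_in_memory(10**8))   # True: 100 million rows is about 80 GB
print(fits_in_memory(10**10))  # False: 10 billion rows is about 8 TB
```

By this yardstick, even a hundred million rows is still comfortably not-so-Big Data.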
Alternatively, some groups, such as State government social service providers, are combining individual data holdings to dramatically increase the efficiency and efficacy of providing services to citizens; this probably qualifies more as “open data” than “big data,” even though some of the analytics are similar. All in all, a good working definition for Big Data is:
- a collection of data or data sets;
- that can be analyzed or processed in value-added ways not inherent to the primary reason for collecting the data;
- resulting in new insights, better delivery of services, or improved outcomes for various constituencies.
What really matters is the analytics, and this is where the math can break. First, all the problems inherent in more general statistics (overfitting, for example) apply to Big Data analytics, and the problems scale accordingly. Used to look at six variables to make projections, and now you can look at one hundred and six? There is a good chance your projections get worse, not better. Having a lot of data can lead practitioners into a second trap: believing that adding even more data will make the outcomes more convincing or more insightful. Jeff Jonas, an IBM Fellow and Chief Scientist of the IBM Entity Analytics Group, writes an occasional blog called Fantasy Analytics about analytics that people want to perform against their data holdings that have no basis in the content of the data.
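The six-versus-one-hundred-six trap is easy to demonstrate. The sketch below is a toy with synthetic data and ordinary least squares via NumPy, not anyone’s production analytics: padding a small training set with 100 pure-noise features makes held-out error worse, not better.

```python
import numpy as np

# Synthetic setup: 50 training rows, 6 genuinely informative features.
rng = np.random.default_rng(0)
n_train, n_test = 50, 1000
X_train = rng.normal(size=(n_train, 6))
X_test = rng.normal(size=(n_test, 6))
true_w = rng.normal(size=6)
y_train = X_train @ true_w + rng.normal(scale=0.5, size=n_train)
y_test = X_test @ true_w + rng.normal(scale=0.5, size=n_test)

def holdout_mse(Xtr, Xte):
    # Least-squares fit on training data, error measured on held-out data.
    w, *_ = np.linalg.lstsq(Xtr, y_train, rcond=None)
    return float(np.mean((Xte @ w - y_test) ** 2))

# Pad in 100 columns of pure noise: 106 features, still only 50 rows.
noise_train = rng.normal(size=(n_train, 100))
noise_test = rng.normal(size=(n_test, 100))

mse_6 = holdout_mse(X_train, X_test)
mse_106 = holdout_mse(np.hstack([X_train, noise_train]),
                      np.hstack([X_test, noise_test]))
print(mse_6, mse_106)  # the 106-feature fit generalizes worse
```

With more features than rows, the fit can match the training data perfectly while absorbing the true signal into the noise columns, so the projections onto new data degrade.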
A second concern is the validity of the data. There are longstanding issues with data quality, especially when merging multiple datasets that may each hold several entries for an individual under slightly different name permutations. Are these the same person? How do you know? Are both datasets equally accurate or valid? Some of us more insidious types engage in “behavioral fuzzing” (I have met at least one other person who admits to doing this routinely) to deliberately make the analytic outcomes less precise for us as individuals. Swap credit cards with your spouse for online holiday shopping, and pretty soon you will both be getting spam for male enhancement and for high-heel shoes with red soles, indicating that the marketing groups are now very confused. Big Data analytics rely heavily on regular patterns in particular domains, but people do in fact have some control over their behaviors. No data analytics tool can predict when you will start a new behavior pattern, or what it will contain, even though some can quickly combine data to fit your new behaviors into a new predictable pattern.
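The “are these the same person?” problem can be sketched with a naive string-similarity check. Everything here is illustrative: the names are made up, the 0.8 threshold is arbitrary, and real entity-resolution systems are far more sophisticated than a character-level ratio.

```python
from difflib import SequenceMatcher

def normalize(name):
    # Lowercase, drop periods, collapse whitespace before comparing.
    return " ".join(name.lower().replace(".", "").split())

def likely_same(a, b, threshold=0.8):
    # Character-level similarity ratio between normalized names;
    # the 0.8 cutoff is an arbitrary illustration.
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio() >= threshold

print(likely_same("Robert J. Smith", "robert j smith"))  # True: identical after normalization
print(likely_same("Robert J. Smith", "Jane Doe"))        # False: clearly different
```

The hard cases are the ones in between, where a single threshold quietly decides whether two records are merged or kept apart, and that decision propagates into every downstream analytic.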
Finally, there is the ontological question of whether the mathematics of Big Data tells us things we really want to know. Writing a critique of Alex Pentland’s book Social Physics in the April issue of MIT Technology Review, Nicholas Carr (see a review of his new book The Glass Cage) sums it up nicely: “What big data can’t account for is what’s most unpredictable, and most interesting, about us.”