Medium Sized Data

When I was in second grade, we nerdly types demonstrated our math prowess by playing “Who Can Name The Biggest Number”. Went something like this: “a thousand”; “a thousand one”; “a million”; “a billion”… very few could follow along to trillions and beyond. But lots of us knew the biggest countable number of all, and used it as our secret weapon: a Googleplex. No “Googleplex plus one” for this crowd! And of course even second grade mathematicians know that “infinity” is a fraught concept (if only we’d known then about orders of infinity…).

Seems like we’re about at this second grade level when it comes to discussions of Big Data. Three hundred Terabytes! 28 Petabytes!! Zettabytes by 2016!!! At least we know now that the Googleplex is that humongous data center on the bottom of San Francisco Bay, accessed via that mysterious Google barge.

A modest proposal… let’s agree that there are only five places in the U.S. that have actual Big Data: the Federal Government, Google, Amazon, Netflix, and the New York Stock Exchange. Everyone else, by definition, has medium-sized data. Next time one of those big storage sales folks calls you, just tell them you only have medium-sized data, and ask about spreadsheets. Of course we would have to rethink the now-classic definition of Big Data (volume, velocity, variety) or the much more relativistic “big data is anything that your current tools can’t handle”. If you absolutely must have Big Data, rest assured, solutions exist.

So what’s the point? The point is that maybe size matters some, but what you do with what you’ve got is much more important. The name of the game is analytics, and more specifically analytics that help you deliver better outcomes for your constituencies. A good working definition for medium-sized data is:

  • a collection of data or data sets;
  • that can be analyzed or processed in value-added ways not inherent to the primary reason for collecting the data;
  • resulting in new insights, better delivery of services, or improved outcomes for your constituencies.

One place that is heading this way is the Commonwealth of Virginia (try “data.virginia” in your browser). First, they have separated the concept of Open Data from Big Data (well, maybe it’s only medium-sized data), thereby providing an explicit imperative to address the difference between public data and data that requires limited access to protect privacy. They are starting to build a successful set of case studies in education, health care, workforce management, and elsewhere. They are asking the question “how can we improve outcomes for citizens?” as opposed to the more typical “how can we exploit personal data for increased revenue?”. This is a work in progress, but certainly one example that bears watching.

There are of course plenty of other examples of people working with large data sets. But in all cases the successes will come from people asking interesting new questions and doing great things with whatever data sets they have available. Much better to be a wizard with medium-sized data than the lonely guard of a warehouse with Exabytes of write-only data.
