Look, I’m certainly no Nate Silver. But I am just about ready to reject the null hypothesis that everyone who talks a big game about big data actually understands it. In business, as in Moneyball, it is often the more overlooked data that can lead to the most profound insights.
Why so confident about my own confidence interval? The answer is sampling. Big data today has a sampling problem, which is only compounded when extrapolations are performed and conclusions are drawn from data that is scattered and incomplete. No big data analytics “tool” can remedy the subpar quality of a data sample. “Garbage in…” well, I’m sure you know the rest.
I’ve conducted no proper sampling of my own (at least not formally, on this issue), but after a few years of watching the “Big Data” industry evolve, it’s clear that there are some gaps in knowledge. The business press can be one of the biggest offenders: much of the coverage tries to compact big data into a homogeneous concept, which is antithetical to its true nature. Sometimes it seems everyone is using the same words to talk about different things, and the errors inherent in obtaining samples of data are completely lost in the fray.
The marketing language surrounding big data is one of the reasons we’ve rightfully become suspicious of its claims. As a certain analyst firm might put it, we seem to be in a trough of disillusionment. We’ve seen the tools, and they’ve failed to deliver. Perhaps, though, the problem isn’t the analysis tools themselves: it’s the entire architecture underlying our information governance practices. For samples to be useful, they have to be representative of the population in question. Before data analysis must come data management.
A sampling approach requires pulling handfuls of data from existing pools and feeding them into processing tools for extrapolation. It has certain strengths, but it is prone to bias and to severe compounding of any errors. A population approach to data, on the other hand, requires a complete data set for the group in question.
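To make the contrast concrete, here is a toy sketch (the file counts and sizes below are invented purely for illustration) of what can happen when a convenience sample is extrapolated to a population total, versus simply computing over the complete data set:

```python
import random

random.seed(42)

# Hypothetical population: 100,000 files; most are small, a minority are huge (sizes in MB).
population = [random.choice([5, 10, 20]) for _ in range(95_000)] + \
             [random.choice([2_000, 5_000]) for _ in range(5_000)]

# Population approach: the complete data set, so the answer is exact.
true_total = sum(population)
print(f"True total storage:       {true_total / 1_000:.0f} GB")

# Sampling approach: a "convenience" sample that happens to miss the large files
# (say, the silo holding the video archives was never connected).
convenience_sample = random.sample(population[:95_000], 1_000)
estimated_total = (sum(convenience_sample) / len(convenience_sample)) * len(population)
print(f"Extrapolated from sample: {estimated_total / 1_000:.0f} GB")

# The per-file error looks modest, but scaling it up to 100,000 files
# turns a skewed sample into a wildly wrong population estimate.
```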
Not all data sets are equally suitable for analysis, and this is where sampling problems arise. Cleaning, de-duplication, and identification of any missing information are critical before doing large-scale manipulation, especially when analyzing across multiple data types. We’re feeling growing pains as the big data industry matures, and businesses are starting to realize that the narrow “tool”-based approach constrains the purported power of big data. Businesses are just beginning to shift their gaze from sampling-based tools to a more holistic data population approach, but our language and architectural approaches may be insufficient to support it.
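As a minimal sketch of what that pre-analysis hygiene can look like (the record fields and the choice of a content hash are my own illustration, not any particular product’s method), duplicates can be collapsed and known gaps flagged before any analytics tool ever sees the data:

```python
import hashlib

# Hypothetical mini-corpus: each record is a document pulled from some silo.
corpus = [
    {"id": 1, "owner": "alice", "body": "Q3 revenue forecast ..."},
    {"id": 2, "owner": "bob",   "body": "Q3 revenue forecast ..."},   # same content, different silo
    {"id": 3, "owner": None,    "body": "Meeting notes, 14 May ..."}, # metadata is missing
]

seen_hashes = set()
clean, duplicates, incomplete = [], [], []

for doc in corpus:
    digest = hashlib.sha256(doc["body"].encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        duplicates.append(doc)      # already counted this content once
    elif doc["owner"] is None:
        incomplete.append(doc)      # a known gap: flag it rather than silently guessing
    else:
        seen_hashes.add(digest)
        clean.append(doc)

print(f"clean: {len(clean)}, duplicates: {len(duplicates)}, incomplete: {len(incomplete)}")
```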
Population studies are not without their pitfalls. Applying analytics to an enterprise data corpus without removing duplicate copies is like trying to conduct the US census while excluding some people and double-counting others. Is the census perfect as-is? Perhaps not, but it was designed to be as close as possible, given the barriers to contacting and surveying 300 million-plus individuals.
Take two very different hypothetical approaches to the same census problem.

Census Methodology 1: The Traditional Population Census
It’s slow and tedious, but every effort is made to obtain the individual responses of every last citizen. Every person is given the same form via the same medium. It’s a mind-boggling logistical effort. Yes, some will be missed, but the percentage is extremely low. Some data substitution via imputation does occur, but the sheer volume of data being collected ensures that relatively accurate predictions can be made at the granular neighborhood level wherever individuals are suspected to be missing.

Census Methodology 2: A Haphazard Sampling Approach
Samples are taken based on convenience, with response forms being sent out via multiple methods. Questions may vary by region or state. There is no centralized control or administration. Some people are more likely or able to respond than others, whether due to cultural differences, job role, income, access to technology, or other demographic factors. No penalty is enforced for failure to respond. Some people respond multiple times, while others are never contacted. The suspected missing responses are then “counted” as the average of the entire response pool, severely compounding any bias introduced by the sampling.
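A quick simulation (the groups, response rates, and values below are invented solely to show the mechanism) makes the compounding effect visible: imputing every non-respondent with the average of a skewed response pool stamps that bias onto the vast majority of the “census”:

```python
import random

random.seed(1)

# Hypothetical population of 10,000 people split into two equal groups,
# where the surveyed quantity averages 50 in one group and 100 in the other.
group_a = [random.gauss(50, 5) for _ in range(5_000)]
group_b = [random.gauss(100, 5) for _ in range(5_000)]
population = group_a + group_b
true_mean = sum(population) / len(population)            # ~75

# Haphazard sampling: group B is far less likely to respond.
responses = random.sample(group_a, 900) + random.sample(group_b, 100)
response_mean = sum(responses) / len(responses)          # ~55, already biased

# "Counting" the 9,000 non-respondents as the average respondent
# reproduces that biased value for 90% of the population.
imputed = responses + [response_mean] * (len(population) - len(responses))
imputed_mean = sum(imputed) / len(imputed)

print(f"True population mean:          {true_mean:.1f}")
print(f"Mean after average-imputation: {imputed_mean:.1f}")
```

The imputed figure simply restates the response pool’s bias, but it now carries the authority of a complete count.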
In these overly simplified scenarios, it’s clear that most businesses are still treating unstructured data with an approach much closer to the second: sampling from whatever they can obtain. Siloed architectures are everywhere, so for the most part companies pull data from whatever happens to be available, regardless of duplicates or missing information. And to be fair, for a long time that was all that was realistically possible. Some unstructured content formats were controlled more rigorously, resulting in many duplicates of relatively important items, while others went completely unmanaged. Even confining analysis to one data type, such as email, might yield skewed results if duplicate copies and missing items were not cleaned out: a monumental task in itself.
But the good news is that enterprise data is NOT quite analogous to a human population. It’s much easier to define the population of enterprise content than it is to define and count an entire, constantly shifting human population. Ownership, as in what data “belongs” to the business, is defined by the systems implemented and the work policies defining what is allowed. And while changes in a data ecosystem are measured in milliseconds compared to the glacial plod of human population dynamics, enterprise data has a massive differentiator: unstructured data has a point of creation within the business, which makes it relatively easy to capture.
As big data continues to grow and unstructured analytics tools continue to advance, it’s clear we need to take a more “population”-style approach to data management and analytics. Samples are fine, but only when they are an accurate representation of the population being studied: an impossibility in many of today’s IT environments, where silos, duplicates, and missing information are rife. Centralization, deduplication, and consistent policy control for all data are what allow even “junk” to become useful from a holistic business strategy perspective.
So while population research in the human world is rarely possible, the enterprise has a much more clearly defined population of unstructured content. Samples were historically a way to make the best of missing information… just ask Mr. “Student” of t-test fame. But why settle when near-perfect data management is possible? Population-based data governance, combined with centralized control and de-duplication, can reveal much more about the trends in content, workflows, and the human “pulse” of a business organization.
Don’t wait 10 years to conduct your own business data census: the capacity to leverage entire data populations begins now. But it doesn’t start, or even end, with an out-of-the-box analysis tool: the architecture underlying the data is what makes large-scale analysis possible.