In math, multiplying two negative numbers together yields a positive number. In the world of data, multiplying a huge number of negative data points, worthless in isolation, can yield highly positive insights, according to new research published by Flemish researchers in the journal Big Data.
As the authors conclude, “when predictive models are built from sparse, fine-grained data—such as data on low-level human behavior—we continue to see marginal increases in predictive performance even to very large scale.”
The trick is figuring out the right questions to ask and finding the right people to interpret the data. This turns out to be a big hurdle.
Lots Of Junk Data Equals Good Data
While traditional data analysis tends to focus on data that has intrinsic, meaningful value, Big Data allows us to aggregate otherwise meaningless data and find insight in the group. By way of analogy, watching an individual ant's meanderings probably won't tell us much, but observe the colony as a whole and patterns emerge.
Critically, however, the researchers suggest that we can’t find patterns in aggregated data without focusing on “fine-grained feature data, such as that derived from the observation of the behaviors of individuals.”
This isn’t summary data, such as “the ant tends to head north.” It’s the little details of behavior that are important.
The online world has already shown how such data can be useful. As the authors note, data on individuals’ visits to massive numbers of specific web pages are used in predictive analytics for targeting online display advertisements, just as data on individual geographic locations can be used for targeting mobile advertisements.
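To make the idea concrete, here is a minimal, stdlib-only sketch of predicting ad clicks from sparse page-visit data. The page IDs, users, and labels are all made up for illustration, and the per-page click-rate "model" is a deliberately crude stand-in for the predictive models the researchers describe; the point is only that each individual visit is nearly meaningless, while the aggregate carries signal.

```python
from collections import defaultdict

# Hypothetical training data: each user is a sparse set of visited
# page IDs (the "fine-grained features"), labeled 1 if they later
# clicked the ad, 0 if not.
users = [
    ({"p1", "p7", "p9"}, 1),
    ({"p2", "p7"}, 1),
    ({"p3", "p4"}, 0),
    ({"p4", "p9"}, 0),
]

# Aggregate per-page click rates; no single visit means much, but the
# counts across users act as a crude linear model over sparse features.
clicks, visits = defaultdict(int), defaultdict(int)
for pages, label in users:
    for p in pages:
        visits[p] += 1
        clicks[p] += label

def score(pages):
    """Average click rate of the pages a user visited (0.5 if all unseen)."""
    rates = [clicks[p] / visits[p] for p in pages if p in visits]
    return sum(rates) / len(rates) if rates else 0.5
```

A new visitor of p7, a page seen only among clickers, would score high; a visitor of p4, seen only among non-clickers, would score low. Real systems do this over millions of pages and users, which is exactly where scale starts to pay off.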
If you’re seeing a pattern here, it’s what Jeff Hammerbacher, chief scientist at Big Data company Cloudera, pointed out some time ago: “The best minds of my generation are thinking about how to make people click ads.” Sad, but true.
Regardless of the social value of the work, what the researchers discovered is that “certain telling behaviors may not even be observed in sufficient numbers without massive data.” While we can guess at patterns based on limited data, it’s only when we collect data in huge quantities that true patterns emerge.
Which Data Is Best?
As the authors argue, certain kinds of data won’t yield “truth”:
[M]ore data do not necessarily lead to better predictive performance. It has been argued that sampling (reducing the number of instances) or transformation of the data to lower dimensional spaces (reducing the number of features) is beneficial, whereas others have argued that massive data can lead to lower estimation variance and therefore better predictive performance. The bottom line is that, not unexpectedly, the answer depends on the type of data, the distribution of the signal (the information on the target variable) across the features, as well as the signal-to-noise ratio.
So what to do? The authors argue that “gathering more data over more behaviors or individuals…[delivers] better predictions.” The key is not to try to pre-analyze the data and guess what one will need, but simply to gather more, broader data at the outset.
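The authors' claim can be illustrated with a small simulation. This is a toy sketch, not their method: I assume a hidden label that leaves only a faint trace (weight 0.1) in each of a thousand noisy behavioral features, mimicking signal spread thinly across many low-value features, and then predict from the first k features.

```python
import random

random.seed(1)
n, d = 200, 1000  # individuals, fine-grained features

# Each feature carries only a faint trace (0.1) of the hidden label,
# buried in unit-variance noise: worthless alone, useful in bulk.
labels = [random.choice([-1, 1]) for _ in range(n)]
features = [[0.1 * y + random.gauss(0, 1) for _ in range(d)]
            for y in labels]

def accuracy(k):
    """Predict each label by the sign of the mean of the first k features."""
    correct = 0
    for y, row in zip(labels, features):
        mean = sum(row[:k]) / k
        correct += (mean > 0) == (y > 0)
    return correct / n

acc_few, acc_many = accuracy(1), accuracy(1000)
```

With one feature the predictor barely beats a coin flip; averaging over all thousand, it is nearly perfect. That is the "marginal increases in predictive performance even to very large scale" in miniature.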
This sounds great in theory, but it’s actually quite hard in practice.
Not the data gathering itself; that's relatively easy. The issue is how to determine whether data with a low signal-to-noise ratio will be useful in aggregate. It's fine to say "collect it anyway," but as statistician Nate Silver has argued, increasing the volume of data we collect may simply point us to spurious correlations in the data.
At some point, a human being is needed to interpret the data, and that job may actually get harder at scale.
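Silver's worry is easy to demonstrate. In this stdlib-only sketch (pure noise, invented numbers), thousands of features that have nothing to do with the target are correlated against it; the more features you scan, the stronger the best spurious correlation looks.

```python
import random
import statistics

def pearson(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

random.seed(0)
n = 30  # observations per feature
target = [random.gauss(0, 1) for _ in range(n)]

# 10,000 features of pure noise, correlated against the target.
corrs = [abs(pearson(target, [random.gauss(0, 1) for _ in range(n)]))
         for _ in range(10_000)]

# The best correlation among 10 noise features vs. among 10,000.
few, many = max(corrs[:10]), max(corrs)
```

Every feature here is noise, yet scanning ten thousand of them reliably turns up a far more impressive "correlation" than scanning ten. That is why the interpreter's job gets harder, not easier, as the data grows.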
Still, some data makes intuitive sense to gather in greater quantities or variety. For example, for an online advertising company that collects data on the websites users visit, getting even more data on website visits and click-throughs should be helpful, as would gathering users' mobile location data.
Impossible To Get Rid Of The Interpreter
While the researchers suggest that “for predictive analytics with sparse, fine-grained data, even with simple models it indeed can be valuable to scale up to millions of instances and millions of features,” they also caution that “even the best data scientists in the world do not yet know how best to build complex predictive models from this sort of massive data.”
In other words, more data is better for predictive analysis, but good luck interpreting it all.