Big data is an exciting idea, but I find little agreement about what the phrase means in concrete terms, and how to make use of it. Could big data be the next nanotech, a field that was hyped enormously in the last decade but produced disappointing practical value? (I concluded that nanotech is basically the 21st century name for the mature science of chemistry.) Or is big data more like the PC or the smartphone, platform technologies that unlocked huge user value and a stream of valuable innovations?
First, what does big data mean? Here’s my try at a definition. “Big data” is a set of technologies that enable the collection and analysis of very large data sets to yield valuable results, commercially or otherwise.
Web search was an early big data breakthrough. Various search providers, Google most successfully, developed techniques for building an index of a large fraction of the web (web crawlers), for rapidly and efficiently processing and searching that index (the then-new MapReduce parallel-processing framework plus Google’s massively parallel commodity server architecture), and for producing valuable results (the Page/Brin algorithm, PageRank). Google also found a way to monetize its service: search advertising, an idea it borrowed from another web company (Overture) and built into a titanic business.
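To make the map/reduce idea mentioned above a little more concrete, here is a toy, single-process sketch in Python of how an index of web pages might be built and searched. It is only an illustration of the pattern, not Google's implementation: the page names, the page text, and every function name are invented for this example, and a real system would spread the map and reduce steps across thousands of machines.

```python
from collections import defaultdict

# Hypothetical toy "web": page name -> page text (invented for illustration).
PAGES = {
    "a.html": "big data is big",
    "b.html": "data beats opinion",
    "c.html": "big ideas need data",
}

def map_page(url, text):
    """Map step: emit one (word, url) pair per distinct word on a page."""
    return [(word, url) for word in set(text.lower().split())]

def shuffle(pairs):
    """Group emitted urls by word, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for word, url in pairs:
        grouped[word].append(url)
    return grouped

def reduce_word(word, urls):
    """Reduce step: collapse the urls for one word into a sorted posting list."""
    return word, sorted(urls)

def build_index(pages):
    """Run map over every page, then shuffle and reduce into word -> pages."""
    pairs = [pair for url, text in pages.items() for pair in map_page(url, text)]
    return dict(reduce_word(word, urls) for word, urls in shuffle(pairs).items())

def search(index, query):
    """Return the pages that contain every word in the query."""
    postings = [set(index.get(word, [])) for word in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

INDEX = build_index(PAGES)
print(search(INDEX, "big data"))
```

The point of the split is that each map call looks at one page and each reduce call looks at one word, so both steps can run in parallel on cheap hardware. Ranking the matching pages is a separate problem that this sketch does not attempt.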
Collecting massive amounts of data and analyzing it is a pillar of Google’s business strategy: generally the objective is to learn more about users to be able to present ads that have a high probability of leading to a purchase (which creates demand for the ads). Gmail, for example, is a feature-rich service that Google provides for free to consumers. The value for Google is the data it collects about users by analyzing their email traffic. (Recall the web business saying: “If you don’t see the business model, then you are the product.”) Google Voice has the same strategic rationale.
One of the hard problems in building a big dataset is accessing data from diverse sources and rendering it in a form that computers can analyze. Palantir, a heavily funded start-up, has built its business on its ability to make diverse datasets analyzable by computers. Its major customers are commonly believed to be the “three letter agencies” of the U.S. defense community. Rhiza, a CMU spin-out in Pittsburgh, is doing something similar for commercial purposes: tapping into diverse public and commercial databases to provide advertisers deep and granular information on which half of their advertising spend is wasted (as the saying goes: “I know half my advertising spend is wasted, but I don’t know which half!”).
Google has pioneered other big data technologies, e.g., Google Translate. “Machine learning” technology analyzes massive amounts of text that have been translated by humans. The computer “learns” to recognize patterns of correspondence between text in one language and another, e.g., the phrase in French “il faut frapper tant que le fer est chaud” is usually rendered as “you have to strike while the iron is hot” in English (the same saying, but not a word-for-word translation).
A key potential of big data technology is finding valuable needles in large haystacks. A small amount of “signal” (valuable information) is often buried in a large amount of noise (random, irrelevant information). At first the signal is hard to discern; it looks just like the noise. But, if you can analyze enough data, the random noise remains random, and the signal will identify itself by revealing a consistent pattern. Google has used this technique to provide early warning of flu epidemics by spotting searches related to flu that rise above the normal background noise.
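One simple way to picture "rising above the background noise" is to measure how much a quantity normally wobbles and then flag anything well outside that range. The Python sketch below does this for weekly counts of a search term. It is a minimal illustration under my own assumptions, not a description of how Google's flu detection worked: the weekly numbers are made up, and the function name, the baseline length, and the three-standard-deviation threshold are all arbitrary choices for the example.

```python
from statistics import mean, stdev

def flag_spikes(counts, baseline_len, threshold=3.0):
    """Return the indexes of weeks whose count exceeds the baseline mean
    by more than `threshold` baseline standard deviations.

    The first `baseline_len` entries are treated as "normal" background.
    """
    baseline = counts[:baseline_len]
    mu, sigma = mean(baseline), stdev(baseline)
    cutoff = mu + threshold * sigma
    return [
        week
        for week, count in enumerate(counts[baseline_len:], start=baseline_len)
        if count > cutoff
    ]

# Hypothetical weekly counts of flu-related searches (invented numbers):
# eight quiet weeks, one slightly high week, then a clear rise.
WEEKLY_SEARCHES = [100, 104, 98, 101, 97, 103, 99, 102, 105, 140, 180]

print(flag_spikes(WEEKLY_SEARCHES, baseline_len=8))
```

With these numbers the week at 105 stays inside the normal wobble, while the last two weeks stand out. The more history you have, the better the estimate of what "normal" looks like, which is where the large dataset earns its keep.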
Detecting consistent patterns in large datasets also enables predictions. In the 1990s the major airlines learned to analyze traffic data to predict how many seats in each fare class on each flight would be sold; insiders tell me they usually know this with confidence two weeks in advance. Airlines use this information to change pricing and make special offers, with the objective (in the industry argot) of “putting a butt in every seat”. This “yield management” technology has helped U.S. airlines raise their average “load factor” (the percentage of seats occupied by paying passengers) from ~65% in the 1970s to 80%-85% in recent years. More recently, Amazon and Netflix have analyzed their large customer behavior datasets to help predict what products or videos customers are likely to want. This results in suggestions (“You might like this product or video.”) that have proven quite effective.
A final example is using data science to stitch together partial datasets and make reliable predictions of what the missing data is. This technique is being used by several companies to move the basis of consumer market research from those annoying dinner-hour phone calls (which are tolerated less and less by digital natives) to the web. The challenge with web-based research is the brief encounters that typically occur on the web: you can only ask a few questions. But, using tracking cookies and data science, these brief encounters can be stitched together into a large and deeply profiled “virtual panel” that produces reliable results. Practitioners here include CivicScience* and, you guessed it, Google.
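The suggestion idea can be illustrated with a very small Python sketch that counts which items show up together in the same customer's purchase history and recommends the most frequent companions. This is an assumption-laden toy, not how Amazon or Netflix actually build recommendations: the purchase histories are invented, the function names are mine, and real systems use far richer signals than simple co-occurrence counts.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories, one set of items per customer (invented).
BASKETS = [
    {"tent", "stove", "lantern"},
    {"tent", "stove"},
    {"tent", "lantern"},
    {"stove", "kettle"},
    {"tent", "stove", "kettle"},
]

def co_counts(baskets):
    """Count, for every ordered pair (a, b), how many baskets contain both."""
    counts = Counter()
    for basket in baskets:
        for a, b in permutations(sorted(basket), 2):
            counts[(a, b)] += 1
    return counts

def recommend(item, baskets, top_n=2):
    """Suggest the items most often bought alongside `item`.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = co_counts(baskets)
    scored = [(n, other) for (a, other), n in counts.items() if a == item]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_n]]

print(recommend("tent", BASKETS))
```

With five baskets the counts are nearly meaningless, but the same tally over millions of customers starts to reveal the consistent patterns described above.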
It does not take much of a data scientist to notice Google provides many of my examples. That should not be a surprise. Many talk about big data, but Google has gone much further than any company I can think of to capitalize on it in the commercial world, and the NSA probably holds a similar position in the non-commercial world (but they would have to shoot me if I published those examples, or even knew any details).
Will “big data” turn out to be another empty buzzword, the 21st century name for the mature science of statistics? I think not. Moore’s law, cheap storage, the cloud, digital lifestyle, voice recognition, and advances in data science have converged to create a powerful new platform. The trail has been blazed and many companies are working on applications of big data. Great value has been delivered already, and I expect a great deal more.
*New Atlantic Ventures, a venture fund in which I am a partner, is an investor in CivicScience.