Of the famous big data Vs, it’s the variety in data that holds the most potential for exploitation. While not everybody has the huge problems of volume and velocity that a Facebook or a high-frequency trader has, even the smallest business has multiple data sources it can benefit from combining. Straightforward access to a broad variety of data is a key part of a platform for driving innovation and efficiency.
One common response from businesspeople to the term “big data” is to think that they simply don’t have that problem—but this is to ignore the variety of data. The notion of variety in data encompasses the idea of using multiple sources of data to help understand a problem. It’s forgivable to overlook this potential: as an abstract concept, it’s harder to grasp than “bigger” or “faster”.
Notwithstanding the difficulties inherent in grasping the concept, the ability of an additional data set to shed light on observed phenomena is profound. Consider, for instance, the addition of weather, geographical and social media data to the daily sales figures for a retail chain. It is easy to conceive that correlations with peaks and troughs in sales could be elicited: perhaps with good weather, word-of-mouth trends or road accessibility. With sufficient data, some of these events might even be found to be predictive of sales.
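To make the retail example concrete, here is a minimal sketch of combining two sources on a shared key and measuring how they move together. Everything in it is an assumption made for illustration: the figures are invented, and the names `daily_sales`, `daily_high_temp_c`, `join_on_date` and `pearson` are hypothetical. It shows the shape of the exercise, not a finding.

```python
from math import sqrt

# Hypothetical daily sales totals keyed by ISO date (invented figures).
daily_sales = {
    "2014-01-06": 1200.0,
    "2014-01-07": 1450.0,
    "2014-01-08": 950.0,
    "2014-01-09": 1700.0,
    "2014-01-10": 2100.0,
}

# Hypothetical second source: daily high temperature in Celsius.
daily_high_temp_c = {
    "2014-01-06": 4.0,
    "2014-01-07": 6.0,
    "2014-01-08": 2.0,
    "2014-01-09": 8.0,
    "2014-01-10": 10.0,
    "2014-01-11": 7.0,  # no matching sales record, so it drops out of the join
}


def join_on_date(left, right):
    """Return (date, left_value, right_value) for dates present in both sources."""
    shared = sorted(set(left) & set(right))
    return [(d, left[d], right[d]) for d in shared]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


joined = join_on_date(daily_sales, daily_high_temp_c)
sales = [row[1] for row in joined]
temps = [row[2] for row in joined]
r = pearson(sales, temps)
print(f"{len(joined)} shared days, correlation = {r:.3f}")
```

Five invented data points prove nothing, and a correlation is not a prediction. The point is that the join itself is trivial once both sources share a clean key such as the date. Arriving at that clean key is where the real effort goes.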
While such examples may seem well-worn in today’s marketing-driven environment, the reality of taking advantage of such variety is less straightforward. In general, data systems are geared up to expect clean, tabular data of the sort that flows into relational database systems and data warehouses.
Handling diverse and messy data requires a lot of cleanup and preparation. Four years into the era of data scientists, most practitioners report that their primary occupation is still obtaining and cleaning data sets. By common estimate, this accounts for as much as 80% of the work required before the much-publicized investigational skill of the data scientist can be put to use.
Understanding and identifying
Even to focus on the problems of cleaning data is to ignore the primary problem, however. A chief obstacle for many business and research endeavors is simply locating, identifying and understanding data sources in the first place, either internal or external to an organization.
Finding and making sense of data is complicated not only technically, but often politically, legally or logistically as well.
While the data scientist may not be able to directly bring about organizational or political change, they are able to materially affect the accessibility and comprehensibility of data. It’s not glamorous, but it’s powerful: they must document and describe their data. Documenting and describing datasets with metadata (data about data) makes them easier to discover and use for both current and future applications, and provides a platform for the vital function of tracking data provenance.
Though “metadata” has long and somewhat unfairly been regarded as a slightly dull topic, of interest primarily to librarians, the news coverage of national security agencies tracking metadata has served to educate the public about metadata’s importance in understanding and exploiting information and behavior. Metadata within data infrastructures enables us to locate and combine data, and to analyze its lifecycle and history.
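As a rough illustration of what documenting and describing data can look like in practice, the sketch below keeps a small descriptive record per dataset and uses it for discovery and provenance. The `DatasetRecord` fields, the `Catalog` class and the sample datasets are all hypothetical, invented for this example. They do not follow any particular metadata standard or product.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class DatasetRecord:
    """Descriptive metadata for one dataset (field names are illustrative)."""
    name: str
    description: str
    owner: str
    source: str
    columns: dict  # column name -> human-readable meaning
    tags: list = field(default_factory=list)
    derived_from: list = field(default_factory=list)  # names of parent datasets


class Catalog:
    """A tiny in-memory catalog supporting discovery and provenance lookups."""

    def __init__(self):
        self._records = {}

    def register(self, record):
        self._records[record.name] = record

    def search(self, term):
        """Return names of datasets whose name, description or tags mention term."""
        term = term.lower()
        hits = []
        for rec in self._records.values():
            haystack = " ".join([rec.name, rec.description, *rec.tags]).lower()
            if term in haystack:
                hits.append(rec.name)
        return sorted(hits)

    def lineage(self, name):
        """Return every ancestor of a dataset, nearest first, without duplicates."""
        seen = []
        queue = list(self._records[name].derived_from)
        while queue:
            parent = queue.pop(0)
            if parent in seen:
                continue
            seen.append(parent)
            # Undocumented parents are still reported, but cannot be traced further.
            if parent in self._records:
                queue.extend(self._records[parent].derived_from)
        return seen

    def to_json(self):
        """Serialize all records so the catalog can be shared or archived."""
        return json.dumps([asdict(r) for r in self._records.values()], indent=2)


catalog = Catalog()
catalog.register(DatasetRecord(
    name="daily_sales",
    description="Daily sales totals per store",
    owner="finance",
    source="internal point-of-sale export",
    columns={"date": "ISO date", "store_id": "store identifier", "total": "sales total"},
    tags=["retail"],
    derived_from=["pos_transactions"],
))
catalog.register(DatasetRecord(
    name="weather_obs",
    description="Daily observations near each store",
    owner="analytics",
    source="external provider",
    columns={"date": "ISO date", "high_c": "daily high in Celsius"},
    tags=["weather"],
))
catalog.register(DatasetRecord(
    name="sales_vs_weather",
    description="Sales joined to local conditions by date",
    owner="analytics",
    source="derived",
    columns={"date": "ISO date", "total": "sales total", "high_c": "daily high in Celsius"},
    derived_from=["daily_sales", "weather_obs"],
))

print(catalog.search("weather"))
print(catalog.lineage("sales_vs_weather"))
```

Note that `pos_transactions` appears in the lineage even though nobody has registered it. A real catalog would flag that gap, since an undocumented ancestor is exactly where provenance is lost.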
In the last two decades, the problem of data description and discovery has been tackled to an extent within the data warehousing world, but that approach is expensive to deploy even within a single organization, and harder still to scale across multiple data platforms or the web.
In the same way we looked to the web for big data technologies, there are seeds of possible routes forward for data description and discovery out on the web. While we hear much trumpeting of open data for government, today’s enterprises are much further behind. It’s hard for employees to share data with each other, never mind third parties. Perhaps open data technology can be brought inside organizations. The work of the Open Knowledge Foundation in promoting public open data has led to technical developments in data sharing, many of which have potential within organizations as well as for their intended public purpose. Notable among these is CKAN, an open source data portal platform.
The output of one step of data processing necessarily becomes the input of the next. To process data and exploit only the result of the calculation is short-sighted. The practices and tools of big data and data science do not stand alone in the data ecosystem. They rely on the usability of data, and we will all gain from ensuring that our results are able to form a platform for future discovery and innovation. As big data tools grow in maturity and adoption over 2014, we will see the rising importance of the need to support this kind of exchange and collaboration around enterprise data.