Why External Web Data Is Getting Vastly More Valuable

Author

Dan Woods, Contributor

March 6, 2015

New forms of automation are changing the economics of external web data, making it a far more promising source of insight. Companies that use data and companies that sell it are both finding that such data is easier than ever to access and put to work. In other words, there is a world of web data that may help your business, and because the cost of using it is lower than ever, it is time to go find it.

As I’ve pointed out in the past, the world seems to suffer from a “data not invented here syndrome”, a bias toward using internal data that was created inside the four walls of a business. This syndrome affects both companies that use data and publishers that create products using data.

Two factors drive this bias. First, internal data is convenient to use and familiar. Second, most of the time, this data is clean and carries a strong signal. Internal data can tell you a lot about what’s going on.

But to paraphrase Bill Joy’s comment about smart people, there is a lot more data that is potentially valuable outside your company than inside it. What has changed in recent years is that new types of automation have changed the economics of accessing this data.

The New World of Automated Web Data Capture

Since the World Wide Web was invented, programs have been crawling web sites to harvest the information on them. Google, Yahoo, and lots of search companies that no longer exist perfected the process of automatically reading everything published on web sites so that search engines could find it for us. This wholesale harvesting of web content continues to this day, and in fact, was the inspiration for the Hadoop system, which is powering the commercial exploitation of big data.

Targeted downloading of web sites is also as old as the web. When network connectivity was a problem, some people would download content they wanted to access in a speedy way and view it offline. Remember Freeloader?

Targeted harvesting of web content has evolved into advanced systems that use machine learning and other forms of automation to allow web data to be identified and then harvested systematically. I use the term web data rather than just web content because the result of harvesting the data from the page is not a web page that then needs further processing, but organized information. The inputs to this process are thousands of web pages, but the output is the equivalent of a collection of spreadsheets that contain the data you want. Here are some examples of web data:

- A financial information provider may monitor hundreds of thousands of websites for changes to deliver up-to-the-minute company and financial data.
- A supply-side online advertising platform gathers real-time data from more than 100 ad networks to reconcile internal billing processes.
- Electronics, appliance, and other consumer goods companies optimize pricing for hundreds of thousands of products by compiling, aggregating, and structuring competitor pricing data in real time.
- Government organizations keep up to date on new sanctions and regulations at the international, federal, and state levels to ensure compliance with rules against terrorist financing.
- A financial data and ratings service aggregates critical financial data from tens of thousands of private companies’ websites to support research and analytics delivered to institutional investors, banks, advisors, and wealth managers.
- A multi-sector job board automates its data collection process, identifying additions, deletions, and changes to job postings, to avoid reloading old data and to cost-effectively expand its services beyond national boundaries.
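To make the “pages in, spreadsheets out” idea concrete, here is a minimal sketch in Python of turning a product-listing page into rows of structured data, using only the standard library. The HTML snippet, class names, and fields are hypothetical; real harvesting systems of the kind this article describes handle far messier markup.

```python
from html.parser import HTMLParser

# Hypothetical product-listing markup; a real page would be far messier.
PAGE = """
<ul>
  <li class="product"><span class="name">Widget A</span><span class="price">19.99</span></li>
  <li class="product"><span class="name">Widget B</span><span class="price">24.50</span></li>
</ul>
"""

class ProductParser(HTMLParser):
    """Collects (name, price) rows from span.name / span.price cells."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (name, price) tuples
        self.field = None   # which field the next text chunk belongs to
        self.current = {}   # partially assembled row

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class")
        if tag == "span" and cls in ("name", "price"):
            self.field = cls

    def handle_data(self, data):
        if self.field:
            self.current[self.field] = data.strip()
            self.field = None
            if "name" in self.current and "price" in self.current:
                self.rows.append(
                    (self.current["name"], float(self.current["price"]))
                )
                self.current = {}

parser = ProductParser()
parser.feed(PAGE)
print(parser.rows)  # [('Widget A', 19.99), ('Widget B', 24.5)]
```

The output is no longer a web page but a table: exactly the “collection of spreadsheets” the harvesting process aims for.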

The economics of harvesting external data are crucial to its viability. When people first started harvesting web data, they used custom-coded programs. This approach works just fine, and is still in use, but such programs are expensive both to create and to maintain. If the structure of a site changes just a bit, the program needs to be changed. One remedy for that is cheap manpower. The rise of low-cost offshore labor and various forms of structured crowdsourcing have driven people in some cases to use large teams of people to maintain brittle ways of harvesting data. For high-value data, both of these approaches work, even though they are not cheap or easy, because this equation holds true:

(Value of data extracted) > (Cost of extraction and maintenance)

So if it is possible to reduce the cost of extraction through automation, then a huge amount of data opens up and becomes valuable. That’s exactly what’s happening.
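As an illustration with made-up numbers: a dataset worth $50,000 a year is uneconomical when hand-coded scrapers cost $80,000 a year to build and maintain, but the same data becomes worth harvesting once automation cuts the total cost to $10,000. A tiny Python sketch of the inequality:

```python
def worth_harvesting(value, extraction_cost, maintenance_cost):
    """The article's inequality: value extracted must exceed total cost."""
    return value > extraction_cost + maintenance_cost

# Illustrative (made-up) numbers, in dollars per year.
value = 50_000
print(worth_harvesting(value, 60_000, 20_000))  # hand-coded scrapers: False
print(worth_harvesting(value, 5_000, 5_000))    # automated harvesting: True
```

Nothing about the data changed; only the right-hand side of the inequality moved.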

Connotate, a New Jersey company, is one of the leaders at this form of automation. Their system is used by major publishers to extract web data from thousands of web sites. While publishers use external data to create and enhance various types of information products, businesses large and small harvest data to keep track of competitors, assemble data to provide more context for internal data, and create special purpose databases. I’ll be explaining more about how this works in a webinar in a couple of weeks based on a white paper we just created (“From Data to Dollars: How Website Content Can Power Your Data Business”) that explains how to monetize web data.

Connotate is one of several companies, including Mozenda, import.io, and BrightPlanet, that attempt to make the cost of harvesting web data as low as possible. Each of these companies has a unique approach, but all of them have to tackle the following hard problems:

- Making it easy to identify the information on a web page or collection of web pages and assemble that information into a useful structure.
- Allowing the data to still be harvested correctly, even if the page changes in some way.
- Recognizing when new information has arrived.
- Harvesting data on a regular schedule.
- Managing and performing quality control on thousands of agents.
- Handling complex ways of creating pages, such as responsive design.
- Integrating harvested data into a data warehouse or other repository.

It’s not enough just to solve these problems. Web data harvesting systems must make it easy for a non-programmer with basic skills to set up agents to harvest data, and they must support many people working together on the process. The economics of external data have changed because this latest generation of software makes these problems far easier to solve than they have been in the past. That means the cost of acquiring web data is at a historic low.

Time for an External Data Reporter

Publishers, whose entire business is data, have been using systems like Connotate to create vast supply chains that harvest web data and create or supplement advanced information products. These firms have dedicated people whose job it is to seek out valuable sources of data. It is time for this practice to spread beyond information publishers.

In other words, it is time for Chief Data Officers to create the position of External Data Researcher, or better yet, External Data Reporter. This job would be to boldly seek out new sources of data on the web, either on web sites or from other sources, and then suggest how that data could add to the company’s existing data assets.

I’ve been asking around about this for a while, and I have yet to find someone with this job description. Given the new, attractive economics of external data, it is time for this job to be staffed.

Dan Woods is on a mission to help people find the technology they need to succeed. Users of technology should visit CITO Research, a publication where early adopters find technology that matters. Vendors should visit Evolved Media for advice about how to find the right buyers. See list of Dan’s clients on this page.

This article was written by Dan Woods from Forbes and was legally licensed through the NewsCred publisher network.
