Panama Papers: Digging For Dirt In The Data

By Meta S. Brown

April 15, 2016

No screenwriter could have packed more intrigue into just a few words.

The mammoth investigative journalism effort we know as the Panama Papers began with a terse online chat. A person identified as John Doe asked a reporter at German newspaper Süddeutsche Zeitung if he was interested in data. The reporter replied, “We’re very interested.”

[John Doe] There are a couple of conditions. My life is in danger. We will only chat over encrypted files. No meeting, ever. The choice of stories is obviously up to you.

How much data?

[John Doe] More than anything you have ever seen.

When the reporter gets the data, it really is more than he’s ever seen. It’s the largest information leak of all time: 2.6 terabytes, representing 11.5 million documents, all of them from the Panama-based law firm Mossack Fonseca, known for its work helping clients establish offshore corporations in tax havens.

It’s a setup worthy of the finest spy thriller, and one day that dialogue will surely play out on the big screen. What happens next is another matter.

The material that John Doe leaked to Süddeutsche Zeitung included nearly five million emails, three million database records (each record, often called a “row” of data, is a small grouping of related facts, such as the details of a transaction or an individual’s contact information) and 3.5 million files of images, PDFs and other formats, most of them scanned paper documents. What’s the reporter going to do now? He can’t read and make sense of 11.5 million documents.

Volume is not the only issue. The database records don’t mean anything without their schema, a guide that defines how the material is organized, but that was not included with the leak. The documents are in many languages. It will require a team of technical experts equipped with significant software and computer hardware resources just to make the data accessible.

Real-life investigation is not for loners. No one person, and no single media outlet, could handle this leak alone. Süddeutsche Zeitung turned to the International Consortium of Investigative Journalists (ICIJ), which provided resources for data sharing and analysis and coordinated a team that grew to involve hundreds of journalists over the course of the past year.

ICIJ, which has a network of journalists spanning 65 countries, has coordinated investigation of four major leaks involving offshore corporations over the past few years. Recognizing the growing technical complexity of reporting, it has been developing expertise and resources to handle complex data investigation.

Mar Cabra, a Columbia University trained data journalist, is head of ICIJ’s Data and Research Unit. She and her team enable the data sharing and analytics processes behind the Panama Papers revelations.

How does this process compare with a well-run analytics process in an ordinary business application? One difference is fundamental. While business analytics projects normally begin by specifying a business problem and goals for addressing it, Cabra explained that the Panama Papers journalists had no way of knowing exactly what information they were seeking. Nor did they understand what data they had. You might say that their initial analytics goal was to determine whether they had information worth analyzing. So the process had to begin by injecting structure into the data.

While the scope and complexity of the Panama Papers is remarkable, it is not entirely unique as an analytics challenge. Dave Lewis, a consulting computer scientist and pioneering information retrieval researcher, provides some perspective. He explains that this data volume is comparable to a “medium” corporate fraud investigation. Litigants have advantages that the Panama Papers journalists do not, including access to database schemas and to people who understand the source data. But, Lewis points out, the intelligence community routinely experiences the same issues that come with leaked data, and in far greater volume.

What’s remarkable today will be commonplace tomorrow. The analytics process that ICIJ applies to the Panama Papers may become the model for complex business analytics applications in the years to come.

Cabra put one specialist to work rehabilitating the database records. The specialist explored the data for elements that gave clues to its intended organization, pieced together the structure and used that information to transform the disorganized data into a searchable database.
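The article doesn't describe the specialist's exact method, but the core idea of schema reconstruction can be sketched: inspect sample values in each column for clues to their intended meaning. This toy example (invented field values, far simpler than the real work) infers a type for each column of raw delimited records:

```python
import re

def infer_column_types(rows):
    """Guess a type for each column from sample records.

    A toy illustration of schema reconstruction: inspect the values
    in each column for clues (all digits? date-shaped?) and label
    the column accordingly.
    """
    date_pat = re.compile(r"^\d{4}-\d{2}-\d{2}$")
    columns = list(zip(*rows))  # transpose rows into columns
    types = []
    for col in columns:
        if all(v.isdigit() for v in col):
            types.append("integer")
        elif all(date_pat.match(v) for v in col):
            types.append("date")
        else:
            types.append("text")
    return types

# Hypothetical records resembling leaked rows with no schema attached
rows = [
    ("10001", "2007-03-15", "Acme Holdings Ltd"),
    ("10002", "2009-11-02", "Blue Water Trading SA"),
]
print(infer_column_types(rows))  # ['integer', 'date', 'text']
```

Once every column has an inferred type and meaning, the records can be loaded into a conventional, searchable database.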

Then there were images of scanned documents to be dealt with. While these could be read individually by humans, they could not be searched. So the data team used optical character recognition (OCR) to scan these files and identify text within them. The resulting text was transferred into a database. This process meant that reporters would now be able to search for specific documents of interest.

OCR is a resource-intensive process. The computer hardware available within ICIJ was not nearly sufficient to do all the conversion in a reasonable period of time. So the data team used virtual servers (essentially rented computing power accessed online). This outside resource provided the equivalent of 30 to 40 powerful computers to process images in parallel.
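The fan-out pattern the data team used can be sketched in a few lines. This is not ICIJ's code: the `ocr_page` function below is a stand-in for a real OCR call (such as invoking Tesseract on an image), and the pool of workers stands in, on one machine, for the 30 to 40 rented servers:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(path):
    """Stand-in for a real OCR call on an image file.

    Fabricates text so the parallel pattern runs anywhere; a real
    pipeline would invoke an OCR engine such as Tesseract here.
    """
    return f"extracted text from {path}"

def ocr_all(paths, workers=4):
    # Fan the documents out across workers. CPU-bound OCR at scale
    # would use separate processes or separate machines rather than
    # threads, but the division of labor is the same.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(paths, pool.map(ocr_page, paths)))

pages = [f"scan_{i:04d}.tif" for i in range(8)]
results = ocr_all(pages)
print(results["scan_0000.tif"])  # extracted text from scan_0000.tif
```

With millions of images, the win comes entirely from running many such workers at once: each document is independent, so the job parallelizes cleanly.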

Converting images to text greatly reduces the volume of the data. Cabra estimates that the database containing extracted text, rather than original images, is about 130 gigabytes, roughly 5% of the initial size. Reducing the size of the database reduces the complexity of other operations down the line.
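The figures quoted are consistent with each other, as a back-of-envelope check shows:

```python
# Size reduction described by Cabra: 2.6 TB of leaked files
# versus ~130 GB of extracted text.
leak_gb = 2.6 * 1000      # the 2.6 TB leak, expressed in gigabytes
text_db_gb = 130          # text-only database after OCR extraction
share = text_db_gb / leak_gb
print(f"{share:.0%}")     # 5%
```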

ICIJ’s Data and Research Unit is a multidisciplinary team with expertise in fields including journalism, data analysis and programming. These specialists can develop databases and applications, perform data analysis and support reporters, but they can’t do it all. Like many analytics teams, the unit works in concert with subject matter experts, people who may not be involved in the technical side of analytics, but whose knowledge is invaluable for context and understanding of the data’s significance.

Unlike today’s typical business data analysis project, the data journalism process requires a great deal of hands-on work by subject experts. That’s largely because the bulk of the data is text such as email messages, contracts and other material that can’t be readily interpreted with statistical and automated analysis techniques.

In a typical business analytics project, only a handful of subject matter experts participate, perhaps just one or two. Ten subject matter experts would be a lot for a single project. But analyzing a data leak is different: about 400 journalists have been involved in the work over the past year.

The business world, too, is experiencing growing pressure to treat text sources as data, but costs and time pressure demand automation. Hence, the developing field of text analytics, which seeks to simplify and automate interpretation of text. ICIJ also takes advantage of text analytics. For example, ICIJ uses entity extraction, a method that enables it to automate some of the effort to detect names, places and other specifics within text documents.
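The article doesn't name the entity-extraction tool ICIJ used, and production systems rely on trained language models and curated name lists. But the underlying idea can be illustrated with a deliberately crude sketch: flag runs of capitalized words as candidate names for a human to review.

```python
import re

# Toy entity extractor: treats any run of two or more capitalized
# words as a candidate name or place. Real entity extraction uses
# trained NLP models; this only illustrates the concept of surfacing
# candidates automatically for reporters to verify.
CANDIDATE = re.compile(r"\b(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b")

def extract_candidates(text):
    return CANDIDATE.findall(text)

# Hypothetical sentence; the names are invented.
doc = ("The shares were transferred to Blue Water Trading "
       "via an agent in Panama City.")
print(extract_candidates(doc))  # ['Blue Water Trading', 'Panama City']
```

Even a crude filter like this turns millions of pages into a much shorter list of names and places worth a reporter's attention, which is the point of the technique.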

Still, Panama Papers research involves an exceptional number of people. To facilitate communication and cooperation among the participants, ICIJ built a social network specifically for the project. It did not have to start from scratch, as platforms for this type of development are available, but customization was needed to provide enhanced security.

Cabra sees an even more remarkable future for ICIJ and data journalism. She’s seeking ways to make disparate databases interact, so that linkages between information in ICIJ databases and other sources can be more easily identified. Also on her roadmap are greater automation for discovering names, places and other important elements in text, and user interfaces that will make it possible for a wider range of users to do data analysis.

More reading: Thomas Fox-Brewster’s “From Encrypted Drives To Amazon’s Cloud — The Amazing Flight Of The Panama Papers” outlines security issues surrounding the Panama Papers leak, and vulnerabilities that may have led to it.

Tools used in connection with the Panama Papers data analysis work:

  • Apache Tika – data and metadata extraction
  • Apache Solr – indexing
  • Blacklight (http://projectblacklight.org/) – user interface
  • Amazon Web Services cloud – virtual servers
  • Tesseract – optical character recognition (OCR)
  • VeraCrypt – hard drive encryption
  • Talend – data extraction, transformation, and loading (ETL)
  • Neo4j – data storage
  • Nuix – OCR, data indexing, visualization
  • Linkurious – user interface/visualization
  • Oxwall – social network development
  • PGP – secure communication
  • Hushmail – secure communication
  • Threema – secure communication
  • Signal – secure communication
  • Internally created tools, including customizations of other tools to enhance security
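Several of these tools (Neo4j for storage, Linkurious for visualization) model the data as a graph: people and companies are nodes, and relationships such as directorships are edges. A minimal pure-Python sketch of that model, with invented names and relationship labels, shows why the representation suits this investigation:

```python
from collections import defaultdict

# Hypothetical officer-to-company links, modeled as a graph the way
# a graph database presents such data. All names here are invented.
links = [
    ("Jane Roe", "SHAREHOLDER_OF", "Acme Holdings Ltd"),
    ("Jane Roe", "DIRECTOR_OF", "Blue Water Trading SA"),
    ("John Smith", "DIRECTOR_OF", "Acme Holdings Ltd"),
]

# Adjacency list: person -> list of (relationship, company) edges
graph = defaultdict(list)
for person, relation, company in links:
    graph[person].append((relation, company))

# "Which companies is this person connected to?" becomes a simple
# lookup; in a graph query language like Cypher it would be a
# pattern match over nodes and edges.
for relation, company in graph["Jane Roe"]:
    print("Jane Roe", relation, company)
```

The payoff of the graph model is that indirect connections, such as two people linked through a shared shell company, fall out of simple traversals rather than expensive table joins.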


This article was written by Meta S. Brown from Forbes and was legally licensed through the NewsCred publisher network.
