Eight unusual things you can find hidden in data

Author

Matthew Sparkes Deputy Head of Technology

December 9, 2014

As it emerges that the Met Police is using data mining to predict crimes before they even happen, we look at eight other ways that data mining is being used to squeeze meaning from the increasing amount of data we produce and store

The Met Police has been carrying out trials of software developed by Accenture which attempts to predict crimes before they happen – or before the computer knows they’ve happened, at least.

Because the system is still under trial, it has been using old data, going back around four years. A computer was told to trawl through the first three years of information, including social media posts, and predict who would commit crimes and what they would be. Its predictions were then compared with the final year’s data to evaluate how accurate they were.
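As a rough illustration of that kind of back-testing, the sketch below holds back the final year of a data set and scores predictions made from the earlier years. The records, the simple "seen at least twice before" rule and every name in it are invented for the example; nothing here reflects how Accenture's software actually works.

```python
from collections import Counter

# Hypothetical records: (year, subject_id). All values are made up.
RECORDS = [
    (2010, "a"), (2010, "b"), (2011, "a"), (2011, "c"),
    (2012, "a"), (2012, "b"), (2013, "a"), (2013, "d"),
]

def temporal_split(records, holdout_year):
    """Train on everything before holdout_year, test on that year only."""
    train = [r for r in records if r[0] < holdout_year]
    test = [r for r in records if r[0] == holdout_year]
    return train, test

def predict_repeaters(train, min_count=2):
    """Toy rule (an assumption): predict anyone seen min_count times or more."""
    counts = Counter(subject for _, subject in train)
    return {s for s, n in counts.items() if n >= min_count}

def precision_recall(predicted, actual):
    """Share of predictions that were right, and share of actual cases caught."""
    hits = len(predicted & actual)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(actual) if actual else 0.0
    return precision, recall

train, test = temporal_split(RECORDS, holdout_year=2013)
predicted = predict_repeaters(train)
actual = {subject for _, subject in test}
precision, recall = precision_recall(predicted, actual)
print(predicted, actual, precision, recall)
```

The important design point is that the model never sees the final year while it is making its predictions, which is what makes the comparison a fair test.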

Accenture told the Telegraph the results are not yet ready for release, and may still be operationally sensitive. But it raises the question: how else is data mining being used to predict the future, catch wrongdoing or extract new understanding? We look at eight unusual examples.

Find your soulmate

Most data mining systems are hidden from view: they are operated by trained experts, and the results are often analysed behind closed doors. But there’s one big application of the technique where the public interacts with it directly: online dating.

Sites like OKCupid are essentially data mining tools to predict the compatibility of any two people. You feed in enough data that it can get a sense of what you’re looking for, then it trawls the database for matches.
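To give a flavour of the idea, here is a minimal sketch that scores two people by the share of questions they answered the same way, then ranks a small database by that score. The questions, the profiles and the scoring rule are all assumptions made for illustration; real dating sites use their own, far more elaborate methods.

```python
def compatibility(a, b):
    """Fraction of shared questions on which two profiles give the same answer."""
    shared = a.keys() & b.keys()
    if not shared:
        return 0.0
    agree = sum(1 for q in shared if a[q] == b[q])
    return agree / len(shared)

def best_matches(user, database, top_n=2):
    """Rank every profile in the database by compatibility with the user."""
    scored = [(compatibility(user, profile), name)
              for name, profile in database.items()]
    # Highest score first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]

# Hypothetical profiles: question -> answer.
USER = {"smokes": "no", "pets": "dog", "travel": "yes"}
DATABASE = {
    "sam": {"smokes": "no", "pets": "dog", "travel": "no"},
    "alex": {"smokes": "no", "pets": "dog", "travel": "yes"},
    "kim": {"smokes": "yes", "pets": "cat"},
}

matches = best_matches(USER, DATABASE)
print(matches)
```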

How well does it work? Survey results show that one third of marriages in the US now stem from online dating. Of course, the ONS also shows that 42 per cent of marriages on these shores end in divorce…

Forensics

When forensic scientists find a fingerprint at a crime scene, it’s data mining that they turn to in order to identify who made it.

Computer software creates a table of prominent features and scans the database of fingerprint samples to look for a match. Without data mining it would take a very patient investigator thousands of years – and police resources are already stretched.

Similarly with DNA results, data mining is essential. But unlike the glossy computers in CSI, it’s likely to be running on a dusty server in a cupboard at Scotland Yard.

Terrorism

The collection of data by the NSA and GCHQ is a hot topic. Critics argue that having our every email, text message, phone call, website visit and online purchase tracked and analysed is a breach of privacy.

The counter-argument, of course, is that this is the data which is mined for evidence of terrorist activity. If someone is searching for flights to certain countries and also looking at a certain website, the agencies’ systems can pick up on the combination, which may point to an impending attack. The director of the NSA last year told a Senate hearing that data mining has helped prevent “dozens” of attacks.

Healthcare

Certain diseases are caused by complex combinations of interacting genetic variations, a phenomenon known as epistasis. There is no single cause which can be easily tested for. That makes identifying those at high risk problematic for a doctor.

But several studies have shown significant success in finding those people at high risk of developing breast cancer by using data mining to explore genetic tests and various lifestyle information such as alcohol and tobacco use. A human doctor can point to an overweight patient and see that they have a risk of heart disease, but only a computer can weigh up the complex interactions of potentially dozens or hundreds of genetic variations.

Musical genres

Few classification systems are as complex, fragmented and contentious as musical genres. Unfortunately for many companies, there’s a need to classify albums, artists and tracks by genre, otherwise catalogues are simply too unwieldy to browse. Using statistical analysis, several companies have been able to classify music automatically. You think that music is art, not science? Statisticians would disagree: any track can be deconstructed into “mel-frequency cepstral coefficients” and coldly analysed.

Credit card fraud

Most of us are pretty predictable creatures: we buy the same things over and over, from the same places, at the usual times. When we diverge from that, data mining software at the bank may flag the purchase as potentially fraudulent. When you treat yourself to those unusually expensive shoes and the cashier tells you that they need to phone the bank, or you go on holiday and your card gets stopped, that’s what’s at play.
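One simple way to picture this is a check on how far a new purchase sits from a customer's usual spending. The sketch below flags any amount more than three standard deviations from the average of past purchases; the threshold, the figures and the function name are assumptions for the example, and a real bank would weigh many more signals, such as location, merchant and time of day.

```python
from statistics import mean, stdev

def is_unusual(history, amount, threshold=3.0):
    """Flag an amount that lies far outside the spread of past purchases."""
    if len(history) < 2:
        return False  # Not enough history to judge what is normal.
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return amount != mu  # Every past purchase was identical.
    return abs(amount - mu) / sigma > threshold

# Hypothetical purchase history in pounds.
HISTORY = [12.5, 30.0, 22.0, 18.0, 25.0, 15.0]

print(is_unusual(HISTORY, 28.0))   # an everyday purchase
print(is_unusual(HISTORY, 450.0))  # those unusually expensive shoes
```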

Tackling hackers

This same approach can also be applied to stopping hackers in their tracks. By analysing a computer network, who uses it, what data they normally send where and by which means, security experts can build up a system that can monitor for anything out of the ordinary. Is a new computer trying to connect? Shut it down, and call for a human to look into it. Is an existing computer trying to send a large amount of data somewhere unusual? Shut it down. This approach has the advantage that no matter how unique, crafty and cunning a hacking attack is, if it isn’t normal, it gets stopped.
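In code, the idea might look something like the sketch below: record which machines normally talk to which destinations and how much they send, then block anything that falls outside that picture. The host names, the traffic figures and the "twice the largest transfer seen" rule are invented for illustration and do not describe any particular security product.

```python
def build_baseline(flows):
    """Learn known devices and the largest transfer seen on each route."""
    known_hosts = set()
    known_routes = {}
    for src, dst, size in flows:
        known_hosts.add(src)
        key = (src, dst)
        known_routes[key] = max(known_routes.get(key, 0), size)
    return known_hosts, known_routes

def check_flow(baseline, flow, slack=2.0):
    """Return 'allow', or a 'block: ...' reason for a human to look into."""
    hosts, routes = baseline
    src, dst, size = flow
    if src not in hosts:
        return "block: unknown device"
    if (src, dst) not in routes:
        return "block: unusual destination"
    if size > slack * routes[(src, dst)]:
        return "block: unusual volume"
    return "allow"

# Hypothetical normal traffic: (source, destination, bytes sent).
NORMAL = [
    ("desk-01", "mail-server", 40_000),
    ("desk-01", "file-server", 900_000),
    ("desk-02", "mail-server", 55_000),
]

BASELINE = build_baseline(NORMAL)
print(check_flow(BASELINE, ("desk-01", "mail-server", 30_000)))
print(check_flow(BASELINE, ("laptop-99", "mail-server", 1_000)))
print(check_flow(BASELINE, ("desk-02", "unknown-host", 5_000_000)))
```

The monitor never needs to know what an attack looks like. It only needs to know what normal looks like, which is why a novel attack is stopped just as readily as a familiar one.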

Catching tax evaders

Every time a tax office catches someone paying less than their fair share, they can feed those details into a data mining package. Eventually the software can start to spot recurring patterns in this data: perhaps tax evaders tend to report high profits at certain times of the year, but siphon income out of view towards the end of the period, for instance. Once it knows what to look for, it can search for the same patterns in fresh data, fed in from banks and other sources. If you get a letter informing you that you’re being audited, it may have been far from random. Several academic papers setting out systems to do just this have been published, so it’s a safe bet that governments around the world are using them: they are cheap, effective and would more than pay for themselves.
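A toy version of that learn-then-search loop is sketched below. It averages the figures from past cases into a "typical evader" and a "typical everyone else", then flags a fresh return if it sits closer to the former. The two features, the numbers and the nearest-average rule are assumptions chosen for illustration, not a description of any tax authority's system.

```python
from statistics import mean

def centroid(rows):
    """Average each feature across a group of returns."""
    return tuple(mean(col) for col in zip(*rows))

def train(labelled):
    """Build one average profile for caught evaders and one for everyone else."""
    evaders = [features for features, caught in labelled if caught]
    others = [features for features, caught in labelled if not caught]
    return centroid(evaders), centroid(others)

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def flag_for_audit(model, features):
    """Flag a return that looks more like past evaders than past honest filers."""
    evader_profile, other_profile = model
    return (squared_distance(features, evader_profile)
            < squared_distance(features, other_profile))

# Hypothetical past cases: ((final-quarter income ratio, cash share), caught?).
PAST_CASES = [
    ((0.3, 0.8), True),
    ((0.4, 0.7), True),
    ((1.0, 0.2), False),
    ((0.9, 0.3), False),
]

MODEL = train(PAST_CASES)
print(flag_for_audit(MODEL, (0.45, 0.6)))
print(flag_for_audit(MODEL, (1.1, 0.1)))
```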
