Data science hackathons create fresh ideas through fast thinking. Data Science Game 2016, held at Capgemini's Les Fontaines campus just outside Paris, shows how the power of a challenge can drive smarter thinking. Over 36 hours on September 10 and 11, 80 students in 20 teams from universities across the world, including Cambridge University, IIM Calcutta, Stanford University and Moscow State University, tackled a single challenge built on real-world data, with the chance to impress some of the most accomplished minds in the field of data science.
This year's challenge was set by AXA, which provided the dataset for the game. The challenge had to be a real-world problem that could plausibly be solved in a single weekend. It also had to differ sharply from the qualifying round, which had cut the original field of 143 contender universities down to 20. That contrast forced students to approach data science and its methodology in ways they may not have been used to, and it is what made this year's event such a challenge.
Qualifying for the challenge
To qualify for the challenge, university teams had to solve an image-recognition problem. This would test how well the teams could adapt to different issues and ways of thinking.
The qualification round asked the teams to discover how much solar energy could be produced in France based on satellite images of 80,000 buildings captured by the OpenSolarMap project. Volunteers at the project had categorized around 15,000 of the images according to their orientation and pitch, but could the identification of suitable roofs be automated by an algorithm?
Most of the teams that qualified relied on deep learning methods, which are known to be effective on computer vision and big data problems. The teams that made the top 20 achieved between 82% and 87% correct predictions, guaranteeing their place in the main challenge.
The weekend’s data science problem
These most successful teams made it to the Les Fontaines campus, where the AXA challenge was waiting. This was something entirely different: AXA presented the gathering with a dataset of car insurance quotes that had come from different brokers and comparison sites. The problem to be solved was simple to pose, but hard to answer: did the person who asked for the quote buy the policy?
“This was a real business problem, something that’s very close to a real business issue in a very difficult area,” says Olivier Auliard, Chief Data Scientist, Capgemini Group. “There is a very low penetration rate of the target, which means you need to do a lot of feature engineering to get around the problem. There were many requests in the data that were repetitions, because of the way data is collected through brokers and comparison sites; people use these in an automated way that makes each set very hard to distinguish.”
The majority of the teams used XGBoost to solve this challenge. The key differences between them lay in feature engineering and sampling techniques, as well as in fine-tuning their parameters. This required a very different approach from the qualification round, as Olivier Auliard explains:
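One way to handle a "very low penetration rate of the target" is to downsample the majority class before training. The sketch below is a minimal pure-Python illustration of that idea, not the teams' actual code; the function name, the 5:1 ratio, and the toy data are all assumptions for the example.

```python
import random

def downsample_negatives(rows, labels, ratio=5, seed=0):
    """Keep every rare positive and `ratio` randomly sampled negatives
    per positive -- a common trick for low-penetration binary targets."""
    rng = random.Random(seed)
    pos = [r for r, y in zip(rows, labels) if y == 1]
    neg = [r for r, y in zip(rows, labels) if y == 0]
    kept_neg = rng.sample(neg, min(len(neg), ratio * len(pos)))
    sampled = [(r, 1) for r in pos] + [(r, 0) for r in kept_neg]
    rng.shuffle(sampled)  # avoid ordering artifacts during training
    return sampled

# Toy data: 20 buyers out of 1,000 quote requests (2% positives)
quotes = list(range(1000))
labels = [1 if i < 20 else 0 for i in quotes]
balanced = downsample_negatives(quotes, labels)
print(len(balanced))  # 20 positives + 100 negatives = 120
```

Note that predicted probabilities from a model trained on the downsampled set are skewed and would need recalibration before scoring against a metric like log-loss.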
“What surprised me most was that, in 36 hours, teams that were far more expert in image recognition problems managed to show a great performance in this task. The thing that makes the difference is not just having the ideas, and the theory, but putting those things into practice and into code. To succeed, teams had to have this balance, a practical approach that was both explorative and could be pragmatic and realistic.”
Moscow team wins the challenge
At the end of the challenge, each team's work was ranked by log-loss, a measure of accuracy that also accounts for the confidence of each prediction: it rewards the quality of a prediction, in this case the predicted likelihood that a person who requested a quote would buy the policy.
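For the binary case used here, log-loss averages the negative log-probability assigned to the true outcome, so confident wrong predictions are punished far more heavily than cautious ones. A minimal sketch of the metric (the clipping epsilon is a standard implementation detail, not something specified by the competition):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Binary log-loss: mean negative log-likelihood of the true labels."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# A confident correct prediction costs little...
print(round(log_loss([1], [0.99]), 5))  # ~0.01
# ...while a confident wrong one costs a lot.
print(round(log_loss([1], [0.01]), 5))  # ~4.6
```

A winning score of 0.008344 therefore means the model assigned very high probability to the correct outcome on almost every quote.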
The winners of the challenge were Team Russian Data Mafia, from the Moscow Institute of Physics and Technology, Russia, who achieved a final log-loss score of 0.008344. They had qualified in fourth place, and were joined in the final awards by three other Russian teams, whose professionalism and efficiency impressed the judges, according to Auliard.
The greatest surprise, he suggests, came from the team that had won the initial qualification round. "None of the University Pierre and Marie Curie (UPMC) team members had ever used XGBoost before the challenge, yet they achieved an impressive fourth place in the rankings."
Connecting with students for the future
Throughout the weekend, teams were mentored by Capgemini data scientists from across Europe, who were on hand to provide guidance to the students. It’s an aspect of the challenge that brings benefits to both sides, helping data science professionals maintain strong links with students and academia, reinforcing R&D capability and keeping abreast of the latest methodologies.
“We are in a position in data science where we can solve business problems by breaking down the boundaries between pure R&D and business applications,” explains Manuel Sevilla, Global Head of Big Data, Data Science and MDM, Insights & Data, Capgemini Group. “These events help us to grow the future leaders in data science: they’re a good way to get people like us working closely with students, sharing the day-to-day experience of problem solving.”
Next year, the organizers hope to expand the challenge to bring in twice the number of teams, incorporating ever harder datasets to solve, and with an ever greater contrast between the qualifying problem and the weekend challenge. Tougher tasks, but with indisputably greater rewards — for students, data science, and the world.
By Ed Chipperfield
Ed Chipperfield is a British journalist specialising in science and technology, and has contributed stories to the BBC, Sunday Times and Men’s Health.