Data & Analytics CapGemini: Insights & Data Blog

Apache Spark: The Future of Big Data Science?

Author

Matt Thomson

July 18, 2016

Apache Spark is the go-to tool for Data Science at scale. It is an open source, distributed compute platform and the first tool in the Data Science toolbox to be built specifically with Data Science in mind. In this blog, I want to talk about why I think Spark is the future of Data Science at scale and why Capgemini are supporting the Spark London Meetup Group.

We all know that data volumes are growing at an alarming rate, and in order to get the best value out of these datasets businesses need to be able to analyse the full breadth and depth of this data. Traditionally, storing data at this scale has been achieved with the various NoSQL datastores like Hadoop, MongoDB, Elasticsearch and countless others. What has been lacking is the ability to process this data for analytics. Analytics has either been achieved by writing complex MapReduce jobs or by picking particular aspects to analyse with Python or R. This works well in a lot of use cases: typically a machine learning application only needs to be trained on a small part of the data, or the feature engineering and population work means this reduction happens naturally. However, when the need does arise to work with big datasets (and this is only likely to grow), data science has been at a bit of a loss. That is no longer true with Apache Spark.

I believe that Spark is different from the myriad other solutions to this problem because it allows Data Scientists to develop simple code to perform distributed computing, and the functionality available in Spark is growing at an incredible rate. Much has been made in the Data Science community of Spark’s ability to train Machine Learning models at scale, and this is a key benefit, but I think the real value comes from being able to put an entire analytics pipeline into Spark: from the data ingestion and ETL processes, through the data wrangling and feature engineering processes, to the training and execution of models. What’s more, with Spark Streaming and GraphX, Spark can provide a much more complete analytics solution.

Spark 2.0 is already available as a preview and a full release is imminent. It will represent a real step forward: with the unification of Datasets and DataFrames, everything you want to do analytically with DataFrames becomes much faster. This is also true for Spark Streaming, with the “unending dataframe”.

It is for this reason that here at Capgemini we are supporting the London Apache Spark Meetup group. We want to support the development of this key technology because it helps the community and it will help our clients. The meetup group is free for anyone to join and discusses all aspects of Spark: http://www.meetup.com/Spark-London/

If you want to learn more about some of the work we do here at Capgemini, please take a look at my previous blog on integrating machine learning with multiple analytical techniques (http://ow.ly/4ntB3D) and blogs from my colleagues on Machine Learning in the public sector (http://ow.ly/4nuEp0), Network Analytics at Scale (http://bit.ly/29FCkMq) and Data Mining techniques (http://bit.ly/29GrbvJ).

Finally, if you are interested in joining our innovative team, please see our job specs: Data Scientist (http://bit.ly/1UWmhwn), Big Data Analytics Architect (http://bit.ly/29RLhTB) and Big Data Engineer (http://bit.ly/1OpX5HV).

This article was written by Matt Thomson from CapGemini: Insights & Data Blog and was legally licensed through the NewsCred publisher network.

There is 1 comment

  • Antonio Carlos Pina - 07/18/2016 17:17
    Great article and insight, thanks for sharing, just wanted to note that Hadoop is not only a NoSQL database for it also manages distributed Storage (and Spark doesn't although it's faster than Hadoop's MapReduce). For the time being I believe they are better working together. Thanks
