By Mark Sellors, Mango UK
On Tuesday 5th of May, O’Reilly Media and Cloudera, a distributor of a Hadoop-based big data platform, brought their ‘Strata + Hadoop World’ conference to London. The conference features a mixture of Data Science, Big Data and business data strategy.
This plot was produced using a quick scoring algorithm, which we ran on data from the Strata session outlines published on their website. It clearly shows the most talked about technologies from the conference. Naturally, ‘Hadoop’ itself comes out on top (the ‘+ Hadoop World’ part of the title is not for nothing!), with Spark coming in a close second. Obviously this is far from perfect, as many talks focused on technology solutions even though those technologies were not mentioned in the abstract, but it’s a useful surrogate for now.
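The scoring algorithm itself isn’t described here, so the following is only a minimal sketch of one way such a score could be computed: count how many session outlines mention each technology at least once. The function name, the technology list and the sample outlines are all made up for illustration and are not the data behind the plot.

```python
import re
from collections import Counter

# Hypothetical list of technology names to look for (an assumption,
# not the list used to produce the plot).
TECHNOLOGIES = ["Hadoop", "Spark", "Kafka", "Hive"]


def score_mentions(outlines, technologies=TECHNOLOGIES):
    """Count how many outlines mention each technology at least once."""
    scores = Counter()
    for text in outlines:
        for tech in technologies:
            # Whole-word, case-insensitive match, so 'spark' counts
            # but 'sparkle' does not.
            pattern = r"\b" + re.escape(tech) + r"\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                scores[tech] += 1
    return scores


# Made-up outlines standing in for the published session abstracts.
sample_outlines = [
    "Running Spark on Hadoop clusters in production",
    "Streaming pipelines with Kafka and Spark",
    "Securing your Hadoop data lake",
]

if __name__ == "__main__":
    for tech, count in score_mentions(sample_outlines).most_common():
        print(tech, count)
```

Counting each outline once per technology, rather than counting every occurrence, stops a single talk that repeats a name many times from dominating the score. It shares the weakness noted above: a talk about a technology that never names it in the outline scores nothing.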
The first day involved software tutorials and deep dives, relating mostly to software in the Hadoop ecosystem (which is pretty large!), with many given by the software authors or contributors. This provides an excellent opportunity to take a closer look at a particular technology and ask in-depth questions of the people in the know.
On the second day the conference proper starts and, despite there being four Strata + Hadoop World conferences a year, offers a packed schedule of great speakers from many of the industry’s leading organisations. Speakers this year included people from Barclays Bank, Google, CERN, Accenture, Pivotal, Databricks, Dato, MapR, comparethemarket.com and a great many more.
Around six months ago we were lucky enough to see Cloudera Chief Strategy Officer, Mike Olson (@mikeolson), speak at an event in Bristol. At that event Olson said, ‘Big Data hasn’t happened yet’, when talking about the hype surrounding the term and the number of actual, in-production uses of the Hadoop technology stack. If the Strata conference is anything to go by, ‘Big Data’ may not have ‘happened’ yet, but it is well and truly happening all around us right now.
The number of organisations talking about their production platforms, or speaking about their proof-of-concept experiences or new integrations they’ve developed, was truly staggering. As a Data Science organisation, it’s been clear to Mango for a long time that things are changing; the complexity of analyses and the variety and volume of data sources, as well as the volume of data itself (the three V’s of Big Data), are increasing exponentially, and we see this every day with our own customers. What made Strata so interesting was seeing how all these technologies and techniques are being integrated and exploited by other organisations across the globe.
Using data from the published Strata talk outlines, we’ve plotted the above bubbles. On the left we have other technologies mentioned when the title or outline contains the word ‘Hadoop’. On the right we have other technologies mentioned in conjunction with Spark.
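For readers curious how co-mention counts like these could be gathered, here is a rough sketch under stated assumptions: keep only the outlines that mention an anchor technology, then count which other technologies appear alongside it. The helper names, technology list and outlines below are hypothetical and are not the data or code behind the bubbles.

```python
import re
from collections import Counter


def mentions(text, term):
    """True if `term` appears in `text` as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def co_mentions(outlines, anchor, technologies):
    """Count technologies named in the same outlines as `anchor`."""
    counts = Counter()
    for text in outlines:
        if not mentions(text, anchor):
            continue  # only outlines that mention the anchor are considered
        for tech in technologies:
            if tech.lower() != anchor.lower() and mentions(text, tech):
                counts[tech] += 1
    return counts


# Made-up technology list and outlines, for illustration only.
technologies = ["Hadoop", "Spark", "Kafka", "Hive"]
sample_outlines = [
    "Running Spark on Hadoop with Hive tables",
    "Kafka and Spark streaming",
    "Hadoop security and Hive",
]

if __name__ == "__main__":
    print("With Hadoop:", dict(co_mentions(sample_outlines, "Hadoop", technologies)))
    print("With Spark:", dict(co_mentions(sample_outlines, "Spark", technologies)))
```

Each count could then drive the size of one bubble, with one chart per anchor technology.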
If Hadoop is finally starting to make some headway inside organisations that aren’t your typical web-scale giants like Yahoo and Facebook, a reasonably recent entrant to the ecosystem seems to be making even more waves. That entrant is Spark, and it’s clear from the sheer number of talks about it, and the interest in those talks, that Spark is huge at the moment. Spark is an engine for large-scale data processing that seems to be in the process of replacing the incumbent (at least on Hadoop) MapReduce paradigm. It is capable of working with data stored inside a Hadoop cluster, can use data stored in Amazon’s S3 and can work with data stored locally, which means it’s really easy to experiment with, not something that can be said about MapReduce!
What seems to be driving Spark’s adoption at the moment is its raw speed. It claims speed increases of up to 100 times over MapReduce when working in memory and 10 times when working on disk. This completely changes the game when analysing large-scale data sets. Add to that its ability to work on streaming data and it’s not surprising the project is generating as much interest as it is.
One of the Strata talks was delivered by Patrick Wendell, co-founder of Databricks, a company set up by some of the creators of Spark. It was standing room only as Wendell gave us an overview of the current state of the Spark project and some highlights of where the project is heading. Most interesting, from our perspective at least, is what’s coming in Spark 1.4, due out later this year. Spark 1.4 will feature first-class support for, and integration with, R. Prior to this release, R users have had to use a separate project called SparkR from AMPLab at UC Berkeley (the same place where Spark itself comes from). With version 1.4, the SparkR project will be officially integrated, which means R will join Java, Scala and Python as a fully supported language. This is obviously great news for R users, and will allow a huge number of new users easy entry into data analysis with Spark.
It’s also worth noting that the current version of Spark, version 1.3, introduced a DataFrame API that, according to Wendell, was directly inspired by R’s data frames. In Spark, though, the DataFrame is built on top of Spark’s existing Resilient Distributed Dataset (RDD) abstraction. This allows users to work with an in-memory DataFrame, spread across multiple machines, thereby taking advantage of the available memory of the entire cluster.
With the inevitable rise of ‘Big Data’ and the processing challenges that accompany it, it’s clear the Strata conference is an invaluable resource in a fast-changing industry. For 2016, Strata + Hadoop World London is moving to an even bigger venue at ExCeL London and should hopefully continue to showcase the best of the constantly evolving data-centric landscape.