Over the past decade, internet companies such as Google, Yahoo, and Facebook have used the Hadoop MapReduce platform to process data at truly massive scale. Many enterprises with years of information gathered in silos across multiple systems have also adopted the platform to combine and process data at a scale that was once beyond the imagination of most computer scientists and IT professionals.
In 2013, a new project donated to the Apache Software Foundation began attracting the attention of Big Data practitioners. This was Apache Spark, and it has very quickly set the world of Big Data on fire! In contrast to MapReduce’s disk-based, two-stage processing paradigm, Spark gives developers the ability to access and process the same data in memory multiple times, avoiding costly disk I/O. As a result, Spark-based applications have delivered anywhere from 10 to 100 times the performance of comparable applications built on the MapReduce paradigm.
Though it works well on the problems Hadoop MapReduce was typically applied to, Spark was originally built to provide high-performance data processing for applications in machine learning, graph processing, and streaming analytics. Given how widely Spark has been adopted in place of MapReduce, people sometimes assume, incorrectly, that it was developed to replace MapReduce or even Hadoop as a whole.
Luckily for the development community, the creators of Spark took a complementary approach to the Hadoop stack. Spark was built to coexist and work alongside existing investments in Hadoop, including Hadoop’s data stores, file formats, data collection and management libraries, and its scheduling and resource management tools. That’s good news, especially for the enterprises that have poured millions into applications and infrastructure built on these technologies!
There are a few items to keep in mind as you look to implement new Big Data applications using Spark. Spark applications impose additional memory requirements on your cluster infrastructure in order to deliver those performance improvements at scale. And Spark may not bring much additional value for workloads that are not time sensitive and are better suited to a batch processing model.
While not perfect, Hadoop MapReduce is definitely the more mature platform at this time, and Spark will take some time to mature to the same level. On the flip side, it is highly likely that more development resources will shift to Spark, which will eventually slow the cycle of updates and improvements to Hadoop’s core platform.
With so much promise from Spark, we can all now hope that the weather folks will finally be able to get their predictions right!