Don’t look now, but Apache Spark is about to turn 10 years old. The project began quietly at UC Berkeley in 2009 before emerging as an open source project in 2010. For the past five years, Spark has been on an absolute tear, becoming one of the most widely used technologies in big data and AI. Let’s take a look at Spark’s remarkable run up to this point, and see where it might be headed next.
Apache Spark is best known as the in-memory replacement for MapReduce, the disk-based computational engine at the heart of early Hadoop clusters. That Spark kicked MapReduce out of the Hadoop nest was no fluke: Matei Zaharia specifically created Spark at Berkeley’s AMPLab because he heard his fellow computer science graduate students complaining about how horribly slow MapReduce was.
Spark’s original advantage over MapReduce was how it processed data across resilient distributed datasets (RDDs). Instead of MapReduce’s mapping, shuffling, and reducing phases, which require an abundance of computationally expensive trips to disk, Spark RDDs cut that I/O by keeping the working dataset in memory until the job completed. (For datasets that exceed RAM, Spark can spill over to disk.) This RDD approach made Spark dramatically faster than MapReduce: by the project’s own account, as much as 100 times faster for in-memory workloads.
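To make the in-memory idea concrete, here is a toy, pure-Python sketch. It is not Spark’s API: the names DiskSource, ToyDataset, and run_passes are hypothetical, invented only for illustration. It assumes a "disk" that simply counts how often it is read, and it shows how an iterative job that caches its working dataset touches that disk once instead of once per pass.

```python
from typing import Callable, Iterable, List, Optional


class DiskSource:
    """Hypothetical stand-in for a file on disk; counts how often it is read."""

    def __init__(self, records: Iterable[int]) -> None:
        self._records = list(records)
        self.reads = 0

    def read(self) -> List[int]:
        self.reads += 1
        return list(self._records)


class ToyDataset:
    """A lazily evaluated dataset that remembers how to recompute itself."""

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute
        self._use_cache = False
        self._cached: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "ToyDataset":
        # Nothing runs here; the new dataset only records the recipe.
        return ToyDataset(lambda: [fn(x) for x in self.collect()])

    def cache(self) -> "ToyDataset":
        # Ask for the first computed result to be kept in memory.
        self._use_cache = True
        return self

    def collect(self) -> List[int]:
        if self._use_cache:
            if self._cached is None:
                self._cached = self._compute()
            return list(self._cached)
        return self._compute()


def run_passes(dataset: ToyDataset, passes: int) -> List[int]:
    """Simulate an iterative job that scans the dataset once per pass."""
    return [sum(dataset.collect()) for _ in range(passes)]


if __name__ == "__main__":
    source = DiskSource(range(5))
    uncached = ToyDataset(source.read).map(lambda x: x * 2)
    run_passes(uncached, 3)
    print(source.reads)  # 3: every pass goes back to disk

    source = DiskSource(range(5))
    cached = ToyDataset(source.read).map(lambda x: x * 2).cache()
    run_passes(cached, 3)
    print(source.reads)  # 1: later passes are served from memory
```

The gap between the two counts grows with the number of passes, which is why iterative workloads stand to gain the most from keeping data in memory.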
Spark code is also much more efficient than MapReduce code, allowing developers to write concise routines in a variety of languages using APIs for Scala, Java, Python, and R. Spark’s productivity story was bolstered in 2015 with the introduction of DataFrames, which allowed data to be stored in table-like structures that are cached in memory, a feature that coincided with the release of Spark SQL (originally called Shark). A year later, with Spark 2.0, Zaharia and the Spark community added the concept of Datasets, a type-safe, object-oriented programming interface based on DataFrames. (The RDD API has been deprecated, but is still supported.)
Spark proved its Hadoop worth on a variety of workloads. It started out replacing MapReduce’s bread and butter: traditional batch-oriented extract, transform, and load (ETL) jobs. But Spark’s affinity for rapid iteration soon drew the attention of data scientists working to fine-tune machine learning algorithms. With the addition of a SQL layer, Spark would excel with interactive analytics and become a tool used by business analysts too.
Spark’s popularity started surging in 2013, and by 2014 the cat was clear of the bag. Cloudera was the first Hadoop distributor to recognize the impact that Spark was having, but Hortonworks (now part of Cloudera) and MapR Technologies were not far behind. Many vendors hawking shrink-wrapped analytics packages atop Hadoop also jumped on the Spark bandwagon, and the rush was on to replace MapReduce en masse with the simpler, faster, and superior Spark engine.
While Spark integrated with YARN, it wasn’t limited to running on Hadoop. Indeed, it was co-developed at AMPLab beside Apache Mesos, which Zaharia helped develop, and it continues to support that open source resource scheduler today. Customers can also run Spark on a stand-alone basis on a laptop computer or a dedicated server.
Spark’s incredible versatility is also evident on the storage front. Spark was originally designed to work with HDFS, and MapR also adopted it for its hybrid MapR File System. But Spark has also been adapted to work with Amazon S3, Apache Cassandra, OpenStack Swift, Alluxio, Cloudera’s Kudu, Elasticsearch, and MemSQL storage. It’s available as a processing engine in all public clouds, and today is the core engine powering Amazon’s popular Elastic MapReduce (EMR) service and an increasingly popular choice in Microsoft Azure. Google Cloud Platform (GCP) supports Spark too, and Spark is one of a handful of “runners” in Google’s high-level Apache Beam construct.
In addition to running just about anywhere, Spark offers a variety of specialized processing engines. The so-called Spark Core functions as a direct replacement for batch routines in MapReduce, but over the years the Spark community has added a SQL engine (Spark SQL), a real-time streaming analytics engine (Spark Streaming), a machine learning library (MLlib), and even a graph processing engine (GraphX).
Spark was written in Scala, which executes inside a Java Virtual Machine (JVM), but it doesn’t require developers to know Scala. APIs for Java, Python, R, and Scala put Spark within reach of a wide audience of developers, and they have embraced the software. Power users and engineers can also write Spark routines directly from a command line interface, while others create Spark applications through the comfort of a notebook interface like Jupyter, IBM’s Data Science Experience, or Cloudera’s Data Science Workbench.
Spark is considered to be the most popular open source project on the planet, with more than 1,000 contributors from 250-plus organizations, according to Databricks. The San Francisco, California company was founded by Zaharia and his two AMPLab advisors, Ali Ghodsi and Ion Stoica, along with fellow AMPLab student Reynold Xin, to deliver a Spark service in the cloud. Databricks, which raised $250 million in a Series E funding round last month, employs many of the key individuals responsible for developing Spark.
What’s remarkable is how Spark has retained its popularity despite Hadoop’s troubles. The merger of Cloudera and Hortonworks last fall led some people in the industry to question the future of Hadoop, especially in light of the technical complexity and operational challenges that have dogged many Hadoop users.
Spark’s popularity continued beyond peak Hadoop primarily because of its versatility, says Ghodsi, who was named one of Datanami’s People to Watch for 2019 and continues as an advisor to RISELab, the successor to AMPLab at UC Berkeley.
Spark’s “flexibility to go between these many types of use cases so easily is really the main reason I think it took off,” Ghodsi tells Datanami in the People to Watch interview. “It unified all of these different types of analytics under one framework, whereas Hadoop didn’t have, for instance, the machine learning, SQL, or other components, such as the real-time component which wasn’t there. So bringing what we call ‘unified analytics’ under one umbrella is what made it super powerful.”
It’s hard to overstate Spark’s impact on big data. Familiarity with Spark Core is a must-have skill for data engineers developing new data pipelines, and data scientists still turn to MLlib for machine learning development. Spark is too complex to turn over to non-power users, but its remarkable ubiquity among big data developers puts it in rarefied air.
The Big Three Oh
The future of Spark is bright, and resources are pouring into keeping Spark relevant for years to come. The software received a major update with last year’s release of Spark 2.3, which brought support for Kubernetes and true real-time processing in Spark Streaming. And soon, perhaps even later this year, the Apache Spark community will formally unveil Spark 3.0.
What will Spark 3.0 bring? That’s been the source of some speculation. One school of thought is that Spark 3.0 will focus on emerging AI technologies. Spark currently does not integrate cleanly with leading deep learning frameworks. While some folks are using frameworks like TensorFlow, MXNet, and PyTorch with Spark, the job schedulers have incompatibilities that raise serious operational issues. At the Spark+AI Summit last June, Reynold Xin, who’s the chief architect at Databricks, discussed ways the Spark community is working to resolve this dilemma, specifically Project Hydrogen and the development of a new “gang scheduler.”
Whether Spark 3.0 focuses on deep learning has yet to be seen. There are a number of other improvements that are reportedly being considered, including better online serving of machine learning models, fully deprecating the RDD API, improvements to the Scala API, support for additional data formats (potentially Apache Arrow), better support for different processor types like GPUs and FPGAs, support for Neo4j’s graph language Cypher, and making MLlib APIs type-safe.
Nobody knows for sure what will go into Spark 3.0, or even if it will ship this year. But one thing’s for certain: the future of Spark will be a major topic of discussion at the Spark+AI Summit, which takes place April 23-25 in San Francisco. Registration is open on the Databricks site.