Apache Spark is an open-source parallel processing framework for processing Big Data across clusters of computers. Spark can perform many computations much faster than Hadoop MapReduce, but the two are not mutually exclusive: Hadoop and Spark can be used together efficiently. Spark is written in Scala, which runs inside a Java Virtual Machine (JVM) and is considered the primary language for interacting with the Spark Core engine, but developers are not required to know Scala. APIs for Java, Python, R, and Scala put Spark within reach of a wide audience of developers, and they have embraced the software.
Spark uses a master-worker (master-slave) architecture built around a driver and a set of workers. The driver coordinates many distributed workers in order to execute tasks in a distributed manner, while a cluster manager handles the resource allocation needed to get the tasks done.
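To make the driver and worker roles concrete, here is a minimal sketch in plain Python. It is not Spark code: the names `toy_driver` and `run_task` are made up for illustration, and a local thread pool stands in for the distributed workers.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(partition):
    # Work done by one "worker" on its own slice of the data.
    return sum(x * x for x in partition)


def toy_driver(data, num_workers=4):
    # The "driver" splits the job into one task per partition...
    partitions = [data[i::num_workers] for i in range(num_workers)]
    # ...hands the tasks to a pool of "workers"...
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        partial_results = list(workers.map(run_task, partitions))
    # ...and combines the partial results into the final answer.
    return sum(partial_results)


print(toy_driver(list(range(10))))  # prints 285, the sum of squares of 0..9
```

In a real cluster the workers would be separate processes on separate machines, and a cluster manager would decide where they run. The sketch only shows the division of labour.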
The driver is where the application's main method runs. It converts the program into tasks and then schedules those tasks on the executors. The driver has three different ways of communicating with the executors: Broadcast, Take, and DAG. It controls the execution of a Spark application and maintains all of the state of the Spark cluster, including the state and tasks of the executors. The driver must interface with the cluster manager in order to get physical resources and launch executors. Put simply, the driver is just a process on a physical machine that is responsible for maintaining the state of the application running on the cluster.
- Broadcast Action: The driver transmits the necessary data to each executor. This works best for data sets under about a million records, or roughly 1 GB of data. Beyond that, broadcasting can become a very expensive task.
- Take Action: The driver takes data from all executors. This can be a very expensive and dangerous action, as the driver might run out of memory and the network could become overwhelmed.
- DAG (Directed Acyclic Graph) Action: This is the least expensive of the three. It transmits control flow logic from the driver to the executors.
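The relative cost of the three actions can be estimated with some simple arithmetic. The sketch below is a back-of-the-envelope model in plain Python, not Spark code: `ToyCluster` and the helper functions are hypothetical names, and the model counts only bytes moved over the network.

```python
from dataclasses import dataclass
from typing import List

GB = 1024 ** 3  # bytes in one gibibyte


@dataclass
class ToyCluster:
    """Hypothetical cluster description, used only for this estimate."""
    num_executors: int
    driver_memory_bytes: int


def broadcast_bytes(cluster: ToyCluster, data_bytes: int) -> int:
    # Broadcast: every executor ends up with a full copy of the data,
    # so total traffic grows with the number of executors.
    return data_bytes * cluster.num_executors


def take_bytes(partition_bytes: List[int]) -> int:
    # Take: every executor sends its partition back to the single driver.
    return sum(partition_bytes)


def take_fits_in_driver(cluster: ToyCluster, partition_bytes: List[int]) -> bool:
    # The driver is at risk of running out of memory when the combined
    # result is larger than the memory it has available.
    return take_bytes(partition_bytes) <= cluster.driver_memory_bytes


def dag_bytes(plan: List[str]) -> int:
    # DAG: only a description of the work travels, not the data itself.
    return sum(len(step.encode("utf-8")) for step in plan)


cluster = ToyCluster(num_executors=10, driver_memory_bytes=4 * GB)
print(broadcast_bytes(cluster, 1 * GB) // GB)      # prints 10
print(take_fits_in_driver(cluster, [1 * GB] * 10))  # prints False
print(dag_bytes(["read", "filter", "count"]))      # prints 15
```

Under these assumed numbers, broadcasting 1 GB to ten executors moves about 10 GB, taking ten 1 GB partitions would overflow a 4 GB driver, and shipping the plan costs a few bytes.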
Executors execute the tasks delegated by the driver within a JVM instance. They are launched at the beginning of a Spark application and normally run for the application’s whole life span. This allows data to persist in memory while different tasks are loaded in and out of execution throughout the application’s lifespan.
In stark contrast, the JVM worker environments in Hadoop MapReduce power down and power up for each task. As a consequence, Hadoop must perform reads and writes on disk at the start and end of every task.
The cluster manager is responsible for maintaining the cluster of machines that will run your Spark application. Cluster managers have their own ‘driver’ and ‘worker’ abstractions, but the difference is that these are tied to physical machines rather than processes.
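The difference between a long-lived executor and a worker restarted for every task can be seen in a toy model. This is plain Python, not Spark or Hadoop code: `ToyDisk` and both runner functions are hypothetical, and the model counts only disk reads, ignoring the writes at the end of each task.

```python
class ToyDisk:
    """Counts reads so the two execution styles can be compared."""

    def __init__(self, data):
        self._data = list(data)
        self.reads = 0

    def read(self):
        self.reads += 1
        return list(self._data)


def run_long_lived_executor(disk, tasks):
    # Spark style: one process lives for the whole application, so data
    # loaded once stays in memory for every following task.
    cached = disk.read()
    return [task(cached) for task in tasks]


def run_per_task_workers(disk, tasks):
    # MapReduce style: a fresh worker per task has no memory of earlier
    # work, so every task starts by reading its input from disk again.
    return [task(disk.read()) for task in tasks]


tasks = [sum, max, len]
spark_disk, mapreduce_disk = ToyDisk([1, 2, 3]), ToyDisk([1, 2, 3])
print(run_long_lived_executor(spark_disk, tasks), spark_disk.reads)    # prints [6, 3, 3] 1
print(run_per_task_workers(mapreduce_disk, tasks), mapreduce_disk.reads)  # prints [6, 3, 3] 3
```

Both styles compute the same results. The only difference in this model is how many times the input is fetched from disk.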
The Spark ecosystem consists of various components such as Spark SQL, Spark Streaming, MLlib, GraphX, and the Core API component.
- Spark Core
Spark Core is the base engine for large-scale parallel and distributed data processing. It is responsible for memory management and fault recovery; for scheduling, distributing, and monitoring jobs on a cluster; and for interacting with storage systems. Additional libraries built on top of the core support diverse workloads such as streaming, SQL, and machine learning.
- Spark Streaming
Spark Streaming is the component of Spark used to process real-time streaming data. It extends the core Spark API to enable high-throughput, fault-tolerant processing of live data streams.
- Spark SQL
Spark SQL is the Spark module that integrates relational processing with Spark’s functional programming API. It supports querying data either via SQL or via the Hive Query Language. For those familiar with RDBMSs, Spark SQL will be an easy transition from your earlier tools, one that lets you extend the boundaries of traditional relational data processing.
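The idea of expressing the same query relationally and functionally can be illustrated without Spark. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the SQL side and a plain generator expression for the functional side. The `people` table and its rows are invented for the example, and none of this is the Spark SQL API.

```python
import sqlite3

rows = [("alice", 34), ("bob", 19), ("carol", 45)]

# Relational style: declare the desired result with a SQL query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)
sql_result = [name for (name,) in conn.execute(
    "SELECT name FROM people WHERE age > 21 ORDER BY name")]
conn.close()

# Functional style: build the same result by chaining transformations.
functional_result = sorted(name for name, age in rows if age > 21)

print(sql_result)                       # prints ['alice', 'carol']
print(sql_result == functional_result)  # prints True
```

Spark SQL's appeal is that both styles operate on the same distributed data, so a team can mix them freely within one application.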
- GraphX
GraphX is the Spark API for graphs and graph-parallel computation. At a high level, it extends the Spark RDD abstraction by introducing the Resilient Distributed Property Graph: a directed multigraph with properties attached to each vertex and edge.
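A property graph is easier to picture with a small data structure. The following is a single-machine sketch in plain Python, not the GraphX API: the class and method names are hypothetical, and nothing here is distributed or resilient.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Edge:
    src: int
    dst: int
    props: Dict[str, Any]


@dataclass
class PropertyGraph:
    vertices: Dict[int, Dict[str, Any]] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_vertex(self, vid: int, **props: Any) -> None:
        # Properties are attached directly to the vertex.
        self.vertices[vid] = props

    def add_edge(self, src: int, dst: int, **props: Any) -> None:
        # Edges are directed (src -> dst). Keeping them in a list allows
        # parallel edges between the same pair, which makes this a multigraph.
        self.edges.append(Edge(src, dst, props))

    def out_degree(self, vid: int) -> int:
        return sum(1 for e in self.edges if e.src == vid)


g = PropertyGraph()
g.add_vertex(1, name="alice")
g.add_vertex(2, name="bob")
g.add_edge(1, 2, relation="follows")
g.add_edge(1, 2, relation="emails")  # second edge between the same vertices
print(g.out_degree(1), g.out_degree(2))  # prints 2 0
```

The two edges from vertex 1 to vertex 2 carry different properties, which is exactly what a simple graph could not represent.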
- MLlib (Machine Learning)
MLlib stands for Machine Learning Library. Like Mahout, it is a machine learning library; it is built on top of Spark and supports many machine learning algorithms. The key difference from Mahout is speed: MLlib is claimed to run up to 100 times faster than MapReduce. It is not yet as feature-rich as Mahout, but it is coming along well, even though it is still at an early stage of growth.
- SparkR
SparkR is an R package that provides a distributed data frame implementation. It supports operations such as selection, filtering, and aggregation, but on large data sets.
Advantages of Spark
- Spark provides a unified platform for batch processing, structured data handling, streaming, and much more.
- Compared with Hadoop MapReduce, Spark code is much easier to write and use.
- Most importantly, Spark abstracts away the parallel programming aspect: Spark Core hides the complexities of distributed storage, computation, and parallel programming.
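The last point, hiding parallelism behind a simple API, can be sketched in a few lines of plain Python. The helper `parallel_map_reduce` is hypothetical and is not part of Spark. It assumes a non-empty input and a reduce function that gives the same answer regardless of how the items are grouped or ordered.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def parallel_map_reduce(items, map_fn, reduce_fn, num_workers=4):
    # Hypothetical helper: partitioning, the worker pool, and the merging
    # of results are hidden here, so callers never touch threads directly.
    chunks = [items[i::num_workers] for i in range(num_workers)]
    chunks = [chunk for chunk in chunks if chunk]  # drop empty partitions

    def process(chunk):
        # Map every item in the chunk, then fold the chunk into one value.
        return reduce(reduce_fn, map(map_fn, chunk))

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(process, chunks))
    return reduce(reduce_fn, partials)


lines = ["spark is fast", "spark is simple", "hadoop is mature"]
counts = parallel_map_reduce(
    lines,
    map_fn=lambda line: Counter(line.split()),  # count words in one line
    reduce_fn=lambda a, b: a + b,               # merge two word counts
)
print(counts["spark"], counts["is"])  # prints 2 3
```

The caller supplies only the two small functions that describe the computation. How the work is split and recombined is the helper's concern, which is the same division of responsibility the bullet above attributes to Spark Core.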