Apache Spark
Apache Spark is an open-source parallel processing framework for storing and processing Big Data across clustered computers. Spark can be used to perform computations much faster than Hadoop can rather Hadoop and Spark can be used together efficiently. Spark is written in Scala, which is considered the primary language for interacting with the Spark Core engine, but it doesn’t require developers to know Scala, which executes inside a Java Virtual Machine (JVM). APIs for Java, Python, R, and Scala ensure Spark is within reach of a wide audience of developers, and they have embraced the software.
Structure
Spark uses a master-slave architecture that contains a driver and a worker. A driver coordinates many distributed workers in order to execute tasks in a distributed manner while a cluster manager deals with the resource allocation to get the tasks done.
Driver
The driver is where the Main method runs. It converts the program into tasks and then schedules the tasks to the executors. The driver has at its disposal 3 different ways of communicating with the executors; Broadcast, Take, and DAG. It controls the execution of a Spark application and maintains all of the states of the Spark cluster, which includes the state and tasks of the executors. The driver must interface with the cluster manager in order to get physical resources and launch executors. To put this in simple terms, this process is just a process on a physical machine that is responsible for maintaining the state of the application running on the cluster.
- Broadcast Action: The driver transmits the necessary data to each executor. This action is optimal for data sets under a million records, +- 1GB of data. This action can become a very expensive task.
- Take Action: Driver takes data from all Executors. This action can be a very expensive and dangerous action as the driver might run out of memory and the network could become overwhelmed.
- DAG(Direct Acyclic Graph) Action: This is the least expensive action out of the three. It transmits control flow logic from the driver to the executors.