About Spark

Information about Spark architecture and capabilities.

Apache Spark is a framework for analyzing large data sets across a cluster, and is enabled when you start an Analytics node. Spark runs locally on each node and executes in memory when possible. Spark uses multiple threads instead of multiple processes to achieve parallelism on a single node, avoiding the memory overhead of several JVMs.

The following sections describe how Apache Spark integrates with DataStax Enterprise.

Spark architecture

The software components for a single DataStax Enterprise analytics node are:
  • Spark Worker
  • DataStax Enterprise File System (DSEFS)
  • The database

A Spark Master acts purely as a resource manager for Spark applications. Spark Workers launch executors that are responsible for executing part of the job that is submitted to the Spark Master. Each application has its own set of executors. Spark architecture is described in the Apache documentation.

DSE Spark nodes use a different resource manager than standalone Spark nodes. The DSE Resource Manager simplifies integration between Spark and DSE. In a DSE Spark cluster, client applications use the CQL protocol to connect to any DSE node, and that node redirects the request to the Spark Master.
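
For example, an application JAR can be submitted from any node in the Analytics datacenter with dse spark-submit, which locates the Spark Master through the cluster for you. This is a minimal sketch; the class and JAR names are placeholders:

  dse spark-submit --class com.example.MyApp my-application.jar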

The communication between the Spark client application (or driver) and the Spark Master is secured in the same way as connections to DSE, which means that both plain password authentication and Kerberos authentication are supported, with or without SSL encryption. Encryption and authentication can be configured per application, rather than per cluster. Authentication and encryption between the Spark Master and Worker nodes can be enabled or disabled independently of the application settings.
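
As a sketch of per-application configuration, credentials can be passed at submit time through the standard Spark Cassandra Connector properties. The user name, password, class, and JAR below are placeholders, and the property names assume the standard connector settings:

  dse spark-submit \
    --conf spark.cassandra.auth.username=app_user \
    --conf spark.cassandra.auth.password=app_password \
    --class com.example.MyApp my-application.jar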

Spark supports multiple applications. A single application can spawn multiple jobs, and those jobs run in parallel. An application reserves resources on every node, and those resources are not freed until the application finishes. For example, every session of the Spark shell is an application that reserves resources. By default, the scheduler tries to allocate the application across the highest number of different nodes. For example, if the application declares that it needs four cores and there are ten servers, each offering two cores, the application most likely gets four executors, each on a different node, each consuming a single core. However, the application can also get two executors on two different nodes, each consuming two cores. You can configure the application scheduler.

The Spark Master and Spark Workers are part of the main DSE process. Workers spawn executor JVM processes, which do the actual work for a Spark application (or driver). Spark executors use native integration to access data in the local transactional nodes through the open source Spark Cassandra Connector. The memory settings for the executor JVMs are set by the user submitting the driver to DSE.
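
As a sketch of how these pieces fit together, the following Scala snippet reserves cores and executor memory for an application and reads a table through the Spark Cassandra Connector. The keyspace and table names are hypothetical:

  import org.apache.spark.{SparkConf, SparkContext}
  import com.datastax.spark.connector._

  // Reserve up to four cores across the cluster and 2 GB of memory per executor JVM.
  val conf = new SparkConf()
    .setAppName("ExampleApp")
    .set("spark.cores.max", "4")
    .set("spark.executor.memory", "2g")
  val sc = new SparkContext(conf)

  // Read rows from a local DSE table through the connector (keyspace and table are placeholders).
  val rows = sc.cassandraTable("example_ks", "users")
  println(rows.count())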

In a deployment, one node in each Analytics datacenter runs the Spark Master, and Spark Workers run on all of the nodes in that datacenter. The Spark Master has automatic high availability.

Figure 1. Spark integration with DataStax Enterprise

When you run Spark, you can access data in the Hadoop Distributed File System (HDFS) or the DataStax Enterprise File System (DSEFS) by using the URL for the respective file system.
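
For example, from the Spark shell (where a SparkSession named spark is available), a file can be read from either file system by its URL. The paths and host name below are placeholders:

  val fromDsefs = spark.read.text("dsefs:///example/data.txt")
  val fromHdfs  = spark.read.text("hdfs://namenode:8020/example/data.txt")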

Highly available Spark Master

The Spark Master High Availability mechanism uses a special table in the dse_analytics keyspace to store information required to recover Spark workers and the application. Reads to the recovery data in dse_analytics are always performed using the LOCAL_QUORUM consistency level. Writes are attempted first using LOCAL_QUORUM, and if that fails, the write is retried using LOCAL_ONE. Unlike the high availability mechanism mentioned in Spark documentation, DataStax Enterprise does not use ZooKeeper.

If the original Spark Master fails, a standby Spark Master automatically takes over. To find the current Spark Master, run:

dse client-tool spark leader-address

DataStax Enterprise provides Automatic Spark Master management.

Note: The Spark Master will not start until LOCAL_QUORUM is attainable for the dse_analytics keyspace.
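
If the keyspace replication does not allow LOCAL_QUORUM to be satisfied in the Analytics datacenter, it can be adjusted with CQL. This is a sketch only; the datacenter name and replication factor are examples:

  ALTER KEYSPACE dse_analytics
    WITH replication = {'class': 'NetworkTopologyStrategy', 'Analytics': 3};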

Unsupported features

The following Spark features and APIs are not supported:
  • Writing to blob columns from Spark

    Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays before serializing.
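
    As a minimal sketch of that conversion in Scala (independent of how the ByteBuffers are obtained, which depends on the connector version), a ByteBuffer can be copied into a byte array like this:

      import java.nio.ByteBuffer

      // Copy a ByteBuffer's readable bytes into an Array[Byte] without changing its position.
      def toByteArray(bb: ByteBuffer): Array[Byte] = {
        val copy = new Array[Byte](bb.remaining())
        bb.duplicate().get(copy)
        copy
      }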