About Spark

Information about Spark architecture and capabilities.

Spark is the default mode when you start an analytics node in a packaged installation. Spark runs locally on each node and executes in memory when possible.

Apache Spark integration with DataStax Enterprise includes:
  • Spark can cache intermediate RDDs in RAM, on disk, or both (see the sketch after this list).
  • Spark stores files for chained iteration in memory instead of using temporary storage in HDFS as Hadoop does.
  • Spark uses multiple threads instead of multiple processes to achieve parallelism on a single node, avoiding the memory overhead of several JVMs.
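
A minimal caching sketch, assuming the Spark shell (where sc is predefined); the input path and filter predicate are hypothetical:

  import org.apache.spark.storage.StorageLevel

  // Cache an intermediate RDD in RAM, spilling to disk if it does not fit,
  // so repeated actions reuse it instead of re-reading the source file.
  val lines  = sc.textFile("/datasets/events.log")
  val errors = lines.filter(_.contains("ERROR"))
  errors.persist(StorageLevel.MEMORY_AND_DISK)

  println(errors.count())  // first action computes and caches the RDD
  println(errors.count())  // second action reads from the cache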

Spark architecture 

The software components for a single DataStax Enterprise analytics node are:
  • Spark Worker, on all nodes
  • Cassandra File System (CFS)
  • Cassandra

The Spark Master controls the workflow, and each Spark Worker launches executors that execute a portion of the jobs submitted to the Spark Master. Spark architecture is described in the Apache Spark documentation.

Spark supports multiple applications. A single application can spawn multiple jobs, and the jobs run in parallel. An application reserves some resources on every node, and these resources are not freed until the application finishes. For example, every session of the Spark shell is an application that reserves resources.

By default, the scheduler tries to spread the application across as many different nodes as possible. For example, if the application declares that it needs four cores and there are ten servers, each offering two cores, the application most likely gets four executors, each on a different node and each consuming a single core. However, the application can also get two executors on two different nodes, each consuming two cores. You can configure the application scheduler.

Spark Workers and the Spark Master are spawned as separate processes and are very lightweight. Workers spawn other, memory-heavy processes that are dedicated to handling queries. Memory settings for those additional processes are fully controlled by the administrator.
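
As an illustration of per-application resource reservation, a minimal sketch using standard Spark configuration properties (spark.cores.max and spark.executor.memory); the application name and object name are hypothetical:

  import org.apache.spark.{SparkConf, SparkContext}

  object ResourceDemo {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf()
        .setAppName("resource-demo")
        .set("spark.cores.max", "4")        // total cores the application may reserve
        .set("spark.executor.memory", "1g") // memory for each executor process
      val sc = new SparkContext(conf)
      // ... run jobs; the reserved cores and executor memory are held
      // until the application finishes.
      sc.stop()
    }
  }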

In deployment, one analytics node runs the Spark Master, and Spark Workers run on each of the analytics nodes. The Spark Master comes with automatic high availability. Spark executors access data in the local Cassandra node through native integration with the open source Spark-Cassandra Connector.
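
For illustration, a minimal sketch of reading and writing Cassandra data through the connector, assuming the DSE Spark shell (where sc is preconfigured with the cluster's connection settings); the keyspace, table, and column names are hypothetical:

  import com.datastax.spark.connector._   // Spark-Cassandra Connector implicits

  // Read the (hypothetical) table test.words as an RDD of CassandraRow;
  // the connector assigns partitions to executors colocated with the
  // Cassandra replicas that own the data.
  val words = sc.cassandraTable("test", "words")
  println(words.count())

  // Transform and write the results back to Cassandra.
  words.map(row => (row.getString("word"), row.getInt("count") + 1))
       .saveToCassandra("test", "words", SomeColumns("word", "count"))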

Spark integration with DataStax Enterprise

As you run Spark, you can access data in the Hadoop Distributed File System (HDFS) or the Cassandra File System (CFS) by using the URL scheme for one or the other (hdfs:// or cfs://).
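
For example, in the Spark shell (sc is predefined; the paths and namenode address below are hypothetical):

  // The URL scheme selects the file system.
  val fromCfs  = sc.textFile("cfs:///user/demo/input.txt")                 // Cassandra File System
  val fromHdfs = sc.textFile("hdfs://namenode:8020/user/demo/input.txt")   // external HDFS
  println(fromCfs.count() + fromHdfs.count())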

Highly available Spark Master 

The Spark Master high availability mechanism uses a special table in the dse_system keyspace to store the information required to recover Spark Workers and the application. Unlike the high availability mechanism described in the Spark documentation, DataStax Enterprise does not use ZooKeeper.

If the original Spark Master fails, the reserve Spark Master automatically takes over. To set the reserve Spark Master, use the dsetool setrjt command.

If you enable password authentication in Cassandra, DataStax Enterprise creates special users. The Spark Master process accesses Cassandra through these users, one per analytics node. The user names begin with the name of the node, followed by an encoded node identifier. The passwords are randomized. Do not remove these users or change their passwords.

In DataStax Enterprise, you manage the location of the Spark Master the same way you manage the location of the Hadoop Job Tracker. When a cluster runs in Spark plus Hadoop mode, the Job Tracker and the Spark Master always run on the same node. The dsetool commands that manage the Job Tracker also work for the Spark Master, except for dsetool jobtracker; the Spark Master equivalent is dsetool sparkmaster.

Unsupported features 

The following Spark features and APIs are not supported:
  • GraphX
  • Writing to blob columns from Spark

    Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays before serializing them, as shown in the sketch below.
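
As an illustration of the blob caveat, a minimal sketch assuming the blob values surface as java.nio.ByteBuffer; the helper name is hypothetical:

  import java.nio.ByteBuffer

  // Hypothetical helper: copy each blob (ByteBuffer) into a plain byte array
  // so the collection can be serialized safely.
  def toByteArrays(blobs: Seq[ByteBuffer]): Seq[Array[Byte]] =
    blobs.map { buf =>
      val bytes = new Array[Byte](buf.remaining())
      buf.duplicate().get(bytes)  // duplicate() leaves the original buffer's position untouched
      bytes
    }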