About Apache Spark™
Spark is the default mode when you start an analytics node in a packaged installation. Spark runs locally on each node and executes in memory when possible. Spark uses multiple threads instead of multiple processes to achieve parallelism on a single node, avoiding the memory overhead of several JVMs.
Apache Spark integration with DataStax Enterprise includes:
- Spark Cassandra Connector for accessing data stores in DSE
- DSE Resource Manager for managing Spark components in a DSE cluster
- Spark SQL support
- DataFrames API to manipulate data within Spark
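For example, a minimal sketch that combines several of these pieces, reading a table through the Spark Cassandra Connector and querying it with the DataFrames API and Spark SQL; the keyspace and table names below are placeholders, not part of DSE:

// Sketch only: "ks" and "users" are hypothetical keyspace and table names.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("dse-connector-example")
  .getOrCreate()

// Read a table through the Spark Cassandra Connector data source.
val users = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "ks", "table" -> "users"))
  .load()

// Query the same data with Spark SQL.
users.createOrReplaceTempView("users")
spark.sql("SELECT count(*) FROM users").show()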
Spark architecture
The software components for a single DataStax Enterprise analytics node are:
- Spark Worker
- DataStax Enterprise File System (DSEFS)
- Cassandra File System (CFS), deprecated as of DSE 5.1
- The database
A Spark Master acts purely as a resource manager for Spark applications. Spark Workers launch executors that are responsible for executing part of the job that is submitted to the Spark Master. Each application has its own set of executors. Spark architecture is described in the Apache documentation.
DSE Spark nodes use a different resource manager than standalone Spark nodes. The DSE Resource Manager simplifies integration between Spark and DSE. In a DSE Spark cluster, client applications use the CQL protocol to connect to any DSE node, and that node redirects the request to the Spark Master.
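As a minimal sketch of that redirection, a driver can use the dse:// master URL of the DSE Resource Manager and point it at any DSE node; the node address below is a placeholder:

// Sketch only: 10.10.1.1 stands for any reachable DSE node in the cluster.
// The contacted node forwards the registration to the current Spark Master.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("dse-resource-manager-example")
  .master("dse://10.10.1.1")
  .getOrCreate()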
Communication between the Spark client application (the driver) and the Spark Master is secured in the same way as connections to DSE: both plain password authentication and Kerberos authentication are supported, with or without SSL encryption. Encryption and authentication can be configured per application rather than per cluster. Authentication and encryption between the Spark Master and Worker nodes can be enabled or disabled independently of the application settings.
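Because encryption and authentication are configured per application, a driver can supply its own credentials, for example through standard Spark Cassandra Connector properties; the sketch below assumes plain password authentication with placeholder credentials and shows client-to-node SSL as an optional extra:

// Sketch only: the user name, password, and SSL toggle are placeholders,
// supplied per application rather than per cluster.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("secured-app")
  .set("spark.cassandra.auth.username", "app_user")
  .set("spark.cassandra.auth.password", "app_password")
  .set("spark.cassandra.connection.ssl.enabled", "true")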
Spark supports multiple applications. A single application can spawn multiple jobs, and those jobs run in parallel. An application reserves resources on every node, and these resources are not freed until the application finishes. For example, every Spark shell session is an application that reserves resources. By default, the scheduler tries to spread an application across as many different nodes as possible. For example, if an application declares that it needs four cores and there are ten servers, each offering two cores, the application most likely gets four executors, each on a different node and each consuming a single core. However, the application can also get two executors on two different nodes, each consuming two cores. You can configure the application scheduler.

Spark Workers and the Spark Master are part of the main DSE process. Workers spawn executor JVM processes, which do the actual work for a Spark application (or driver). Spark executors use native integration to access data in local transactional nodes through the Spark Cassandra Connector. The memory settings for the executor JVMs are set by the user submitting the driver to DSE.
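A hedged sketch of the four-core example above as an application-level configuration: spark.cores.max caps the total cores the application reserves across the cluster, and spark.executor.memory sets the executor JVM heap chosen by the submitter. The values are illustrative only:

// Illustrative values only: four cores in total across the cluster,
// with a 2 GB heap per executor JVM chosen by the user submitting the driver.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("four-core-app")
  .set("spark.cores.max", "4")
  .set("spark.executor.memory", "2g")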
In a deployment, one node in each Analytics datacenter runs the Spark Master, and a Spark Worker runs on each node. The Spark Master comes with automatic high availability.
As you run Spark, you can access data in the Hadoop Distributed File System (HDFS), the Cassandra File System (CFS), or the DataStax Enterprise File System (DSEFS) by using the URL for the respective file system.
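For example, from a Spark shell on an analytics node (where the spark session is predefined), each file system is addressed by its own URL scheme; the paths and the HDFS host name below are placeholders:

// Sketch only: the paths and the "namenode" host are placeholders.
val fromDsefs = spark.read.textFile("dsefs:///data/events.csv")
val fromCfs   = spark.read.textFile("cfs:///data/events.csv")    // CFS is deprecated as of DSE 5.1
val fromHdfs  = spark.read.textFile("hdfs://namenode:8020/data/events.csv")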
Highly available Spark Master
The Spark Master High Availability mechanism uses a special table in the spark_system keyspace to store the information required to recover Spark Workers and the application. Unlike the high availability mechanism described in the Spark documentation, DataStax Enterprise does not use ZooKeeper.
If the original Spark Master fails, a reserve Spark Master automatically takes over. To find the address of the current Spark Master, run:
dse client-tool spark master-address
DataStax Enterprise provides automatic Spark Master management.
Unsupported Spark features
Writing to blob columns from Spark is not supported.
Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays before serializing.
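As a minimal illustration, assuming the blobs in such a collection arrive as java.nio.ByteBuffer values, each buffer can be copied into a byte array before the data is serialized:

// Sketch only: copies a ByteBuffer into a plain byte array without
// disturbing the original buffer's position.
import java.nio.ByteBuffer

def toByteArray(buf: ByteBuffer): Array[Byte] = {
  val copy = buf.duplicate()
  val bytes = new Array[Byte](copy.remaining())
  copy.get(bytes)
  bytes
}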