Configuring Spark nodes 

Configure Spark nodes in data centers that are separate from nodes running other types of workloads, such as Cassandra real-time and DSE Search.

To isolate Spark traffic to a subset of dedicated nodes, follow the workload isolation guidelines. In separate data centers, you can run Spark and Shark alongside integrated Hadoop or BYOH (bring your own Hadoop). However, you cannot run BYOH and integrated Hadoop on the same node.

DataStax recommends using the default values of Spark environment variables unless you need to increase the memory settings due to an OutOfMemoryError condition or garbage collection taking too long. The configuration options that you might want to change are in the dse.yaml and spark-env.sh files.

The location of the dse.yaml file depends on the type of installation:
  • Installer-Services and Package installations: /etc/dse/dse.yaml
  • Installer-No Services and Tarball installations: install_location/resources/dse/conf/dse.yaml

The default location of the spark-env.sh file depends on the type of installation:
  • Installer-Services and Package installations: /etc/dse/spark/
  • Installer-No Services and Tarball installations: install_location/resources/spark/conf/

The following sections describe some of the options you can change to manage Spark performance or operations.

Spark directories 

After you start up a Spark cluster, DataStax Enterprise creates a Spark work directory for each Spark Worker on worker nodes. A worker node can have more than one worker, configured by the SPARK_WORKER_INSTANCES option in spark-env.sh. If SPARK_WORKER_INSTANCES is undefined, a single worker is started. The work directory contains the standard output and standard error of executors and other application-specific data stored by the Spark Worker and executors; it is writable only by the Cassandra user.

By default, the Spark parent work directory is /var/lib/spark/work, with each worker in a subdirectory named worker-n, where n starts at 0. To change the parent worker directory, configure SPARK_WORKER_DIR in the spark-env.sh file.

The Spark RDD directory is the directory where executors place RDDs that they decide to spill to disk. This directory might contain data from the database or the results of running Spark applications. If the data in the directory is confidential, prevent access by unauthorized users. Because the RDD directory might contain a significant amount of data, locate it on a fast disk. The directory is writable only by the Cassandra user. The default location of the Spark RDD directory is /var/lib/spark/rdd. To change the RDD directory, configure SPARK_LOCAL_DIRS in the spark-env.sh file.
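The two directory settings above can be overridden together in spark-env.sh. A minimal sketch, assuming hypothetical mount points /mnt/spark and /mnt/fast-ssd (the paths are examples, not defaults):

```shell
# Hypothetical spark-env.sh overrides; the paths are examples only.
export SPARK_WORKER_DIR=/mnt/spark/work    # parent of the per-worker work directories
export SPARK_LOCAL_DIRS=/mnt/fast-ssd/rdd  # spilled RDD blocks; put this on a fast disk
```

Restart the node for the new locations to take effect, and ensure the Cassandra user owns the new directories.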

Client-to-node SSL 

Ensure that the truststore entries in cassandra.yaml are present as described in Client-to-node encryption, even when client authentication is not enabled.

Log directories 

The Spark logging directory is the directory where the Spark components store individual log files. DataStax Enterprise places logs in the following locations:

  • Executor logs:
    • SPARK_WORKER_DIR/worker-n/application_id/executor_id/stderr
    • SPARK_WORKER_DIR/worker-n/application_id/executor_id/stdout
  • Spark Master/Worker logs:
    • Spark Master: the global system.log
    • Spark Worker: SPARK_WORKER_LOG_DIR/worker-n/worker.log

      SPARK_WORKER_LOG_DIR is set to /var/log/spark/worker by default.

  • Spark Shell and application logs: console

Configure logging options, such as log levels, in the following files:

  • Executors: logback-spark-executor.xml
  • Spark Master: logback.xml
  • Spark Worker: logback-spark-server.xml
  • Spark Shell, Spark applications: logback-spark.xml
The location of the logback.xml file depends on the type of installation:
  • Installer-Services and Package installations: /etc/dse/cassandra/conf/logback.xml
  • Installer-No Services and Tarball installations: install_location/resources/cassandra/conf/logback.xml

Log configuration files are located in the same directory as spark-env.sh.

Note: When user credentials are specified on the dse command line, as in dse -u username -p password, the credentials appear in the logs of Spark Workers when the driver is run in cluster mode. The Spark Master, Worker, Executor, and Driver logs might include sensitive information, such as passwords and digest authentication tokens for Kerberos authentication mode that are passed on the command line or in the Spark configuration. DataStax recommends accessing the Spark user interface only over safe communication channels such as a VPN or SSH.

Spark memory and cores 

Spark memory options affect different components of the Spark ecosystem:

  • Spark Worker memory

    The SPARK_WORKER_MEMORY option configures the total amount of memory that can be assigned to all executors run by a single Spark Worker on a particular node.

  • Application executor memory

    You can configure the amount of memory that each executor can consume for the application. The default is 512 MB. Use either the spark.executor.memory property, described in "Spark 1.2.1 Available Properties", or the --executor-memory <mem> argument to the dse spark command.

  • Application memory

    You can configure additional Java options that the worker applies when spawning an executor for the application. Use the spark.executor.extraJavaOptions property, described in "Spark 1.2.1 Available Properties". For example:

    spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
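The per-executor memory setting described above can also be passed on the command line when starting the Spark shell. A sketch, assuming a hypothetical 2 GB allocation (requires a running DSE installation):

```shell
# Sketch only: request 2 GB of memory per executor for this application.
dse spark --executor-memory 2G
```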

Management of cores

You can manage the number of cores by configuring these options.

  • Spark Worker cores

    The SPARK_WORKER_CORES option configures the number of cores offered by Spark Worker for use by executors. A single executor can borrow more than one core from the worker. The number of cores used by the executor relates to the number of parallel tasks the executor might perform. The number of cores offered by the cluster is the sum of cores offered by all the workers in the cluster.

  • Application cores

    In the Spark configuration object of your application, configure the number of cores that the application requests from the cluster using either the spark.cores.max configuration property or the --total-executor-cores <cores> argument to the dse spark command.
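The two ways of capping application cores can be sketched as follows; the core count of 8 is hypothetical, and the commands require a running DSE installation:

```shell
# Sketch only: cap an application at 8 cores across the cluster.
# Either pass the limit as a command-line argument...
dse spark --total-executor-cores 8

# ...or set the equivalent configuration property:
dse spark --conf spark.cores.max=8
```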

Refer to the Spark documentation for a detailed description of memory and core allocation.

DataStax Enterprise can control the memory and cores offered by particular Spark Workers in a semi-automatic fashion. The initial_spark_worker_resources parameter in the dse.yaml file specifies the fraction of system resources available to the Spark Worker. The available resources are calculated in the following way:
  • Spark Worker memory = initial_spark_worker_resources * (total system memory - memory assigned to Cassandra)
  • Spark Worker cores = initial_spark_worker_resources * total system cores

The lowest values you can assign to Spark Worker memory and cores are 64 MB and 1 core, respectively. If the calculated results are lower, no exception is thrown and the values are automatically raised to these minimums. The initial_spark_worker_resources value ranges from 0.01 to 1. If it is not specified, the default value of 0.7 is used.
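As a worked example of the formulas above, assume a hypothetical node with 65536 MB of system memory, 8192 MB assigned to Cassandra, 16 cores, and the default fraction of 0.7 (results are truncated to whole units):

```shell
# Hypothetical node: 65536 MB RAM, 8192 MB for Cassandra, 16 cores, fraction 0.7.
fraction=0.7
worker_mem_mb=$(awk -v f="$fraction" 'BEGIN { printf "%d", f * (65536 - 8192) }')
worker_cores=$(awk -v f="$fraction" 'BEGIN { printf "%d", f * 16 }')
echo "Spark Worker memory: ${worker_mem_mb} MB"   # 0.7 * 57344 = 40140 MB
echo "Spark Worker cores:  ${worker_cores}"       # 0.7 * 16 = 11
```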

This mechanism is used by default to set the Spark Worker memory and cores. To override the default, uncomment and edit one or both of the SPARK_WORKER_MEMORY and SPARK_WORKER_CORES options in the spark-env.sh file.
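A minimal sketch of such an override in spark-env.sh (the values are illustrative, not recommendations):

```shell
# Hypothetical spark-env.sh overrides; sizes are examples only.
export SPARK_WORKER_MEMORY=40g   # total memory assignable to this Worker's executors
export SPARK_WORKER_CORES=11     # number of cores this Worker offers to executors
```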

Deploying nodes for Spark jobs 

Before starting up nodes on a tarball installation, you need write access to the default Spark directory locations /var/lib/spark/rdd and /var/log/spark. Create the directories and change their ownership as follows:
$ sudo mkdir -p /var/lib/spark/rdd; sudo chmod a+w /var/lib/spark/rdd; sudo chown -R $USER:$GROUP /var/lib/spark
$ sudo mkdir -p /var/log/spark; sudo chown -R $USER:$GROUP /var/log/spark

In multiple data center clusters, use a virtual data center to isolate Spark jobs. Spark jobs consume resources that can affect latency and throughput. To isolate Spark traffic to a subset of dedicated nodes, follow the workload isolation guidelines.

DataStax Enterprise supports the use of Cassandra virtual nodes (vnodes) with Spark.