Spark configuration

Configure Spark nodes in data centers that are separate from nodes running other types of workloads, such as Cassandra real-time and DSE Search/Solr.

To isolate Spark traffic to a subset of dedicated nodes, follow the workload isolation guidelines. In separate data centers, you can run Spark and Shark alongside integrated Hadoop or BYOH (bring your own Hadoop). You cannot run BYOH and integrated Hadoop on the same node.

DataStax recommends using the default values of the Spark environment variables unless you need to increase memory settings because of an OutOfMemoryError condition or because garbage collection is taking too long. The configuration options that you might want to change are in the dse.yaml and spark-env.sh files, which are in the following locations.

  • The location of the dse.yaml file depends on the type of installation:
    Installer-Services and Package installations: /etc/dse/dse.yaml
    Installer-No Services and Tarball installations: install_location/resources/dse/conf/dse.yaml
  • The default location of the Spark configuration files depends on the type of installation:
    Installer-Services and Package installations: /etc/dse/spark/
    Installer-No Services and Tarball installations: install_location/resources/spark/conf/

Some of the options you can change to manage Spark performance or operations are:

Spark directories 

After you start up a Spark cluster, DataStax Enterprise creates the Spark work directory on worker nodes. The directory contains the standard output and standard error of executors, as well as other application-specific data stored by the Spark Worker and executors; it is writable only by the Cassandra user. By default, the Spark work directory is located in /var/lib/spark/work. To change the directory, configure SPARK_WORKER_DIR in the spark-env.sh file.

The Spark RDD directory is the directory where RDDs are written when executors decide to spill them to disk. This directory might contain data from the database or the results of running Spark applications, so if the data is confidential, prevent access by unauthorized users. Because the RDD directory can hold a significant amount of data, locate it on a fast disk. The directory is writable only by the Cassandra user. The default location of the Spark RDD directory is /var/lib/spark/rdd. To change the RDD directory, configure SPARK_RDD_DIR in the spark-env.sh file.

The following Spark temporary directories are created in /tmp/spark:
  • app/$USER - temporary files from user applications
  • repl/$USER - temporary files from Spark shell

To change the location of these directories, configure SPARK_TMP_DIR in the spark-env.sh file. The directory contains the temporary files of the Spark shell and Spark applications, and is writable by all users.
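
For example, a spark-env.sh excerpt that relocates the work, RDD, and temporary directories might look like the following sketch. The /data/spark/... paths are hypothetical; choose fast, appropriately secured local disks on your nodes:

    export SPARK_WORKER_DIR="/data/spark/work"   # stdout/stderr and other data from executors
    export SPARK_RDD_DIR="/data/spark/rdd"       # spilled RDDs; place on a fast disk
    export SPARK_TMP_DIR="/data/spark/tmp"       # temporary files of the Spark shell and applications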

In addition to the directories in /tmp/spark, these directories hold temporary files:
  • /tmp/spark/master

    Created on the node that runs the Spark Master; used by the Spark Master for storing temporary files

  • /tmp/spark/worker

    Used by the Spark Worker for storing temporary files

Client-to-node SSL 

Ensure that the truststore entries in cassandra.yaml are present as described in Client-to-node encryption, even when client authentication is not enabled. See Spark SSL encryption.
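
For reference, the relevant cassandra.yaml entries form a minimal sketch like the following; the keystore and truststore paths and passwords are hypothetical, and Client-to-node encryption describes the full set of options:

    client_encryption_options:
        enabled: true
        keystore: /etc/dse/conf/.keystore          # hypothetical path
        keystore_password: keystore_password
        require_client_auth: false
        truststore: /etc/dse/conf/.truststore      # hypothetical path
        truststore_password: truststore_password

Note that the truststore and truststore_password entries are present even though require_client_auth is false.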

Log directories 

The Spark logging directory is the directory where the Spark components store individual log files. DataStax Enterprise places logs in the following locations:

  • Executor logs:
    • SPARK_WORKER_DIR/application_id/executor_id/stderr
    • SPARK_WORKER_DIR/application_id/executor_id/stdout
  • Spark Master/Worker logs:
    • Spark Master: SPARK_LOG_DIR/master.log
    • Spark Worker: SPARK_LOG_DIR/worker.log

      SPARK_LOG_DIR is set to /var/log/spark by default.

  • Spark Shell and application logs: console
Note: When user credentials are specified on the dse command line, for example dse -u username -p password, the credentials appear in the logs of Spark Workers when the driver is run in cluster mode. The Spark Master, Worker, Executor, and Driver logs might include sensitive information, such as passwords and digest authentication tokens for Kerberos authentication mode that are passed on the command line or in the Spark configuration. DataStax recommends using only secure communication channels, such as VPN and SSH, to access the Spark user interface.

Configure logging options, such as log levels, in the following files:

  • Executors: log4j-executor.properties
  • Spark Master, Spark Worker: log4j-server.properties
  • Spark Shell, Spark applications: log4j.properties

Log configuration files are located in the same directory as spark-env.sh.
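
For example, to reduce the verbosity of Spark components in the Master and Worker logs, you might add or adjust a logger level in log4j-server.properties. This is a minimal sketch in standard log4j properties syntax; the existing root logger and appender definitions in the file are left unchanged:

    # Raise the log level for Spark classes to WARN
    log4j.logger.org.apache.spark=WARN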

Spark memory and cores 

Spark memory options affect different components of the Spark ecosystem:

  • Spark Worker memory

    The SPARK_WORKER_MEMORY option configures the total amount of memory that a single Spark Worker on a node can assign to all of the executors it runs (a combined example follows this list).

  • Application executor memory

    You can configure the amount of memory that each executor can consume for the application. Spark uses a 512MB default. Use either the spark.executor.memory option, described in "Spark 1.1.0 Available Properties", or the --executor-memory <mem> argument to the dse spark command.

  • Application memory

    You can configure additional Java options that the worker applies when it spawns an executor for the application. Use the spark.executor.extraJavaOptions property, described in "Spark 1.1.0 Available Properties". For example: spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
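
The following sketch combines the two levels of memory configuration described above; the 8g and 2G values are hypothetical and should be sized for your hardware and workload:

    # spark-env.sh: total memory a single Spark Worker can hand out to its executors
    export SPARK_WORKER_MEMORY=8g

    # dse spark command line: memory for each executor of this application
    dse spark --executor-memory 2G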

Management of cores

You can manage the number of cores by configuring these options.

  • Spark Worker cores

    The SPARK_WORKER_CORES option configures the number of cores that the Spark Worker offers to executors. A single executor can borrow more than one core from the worker. The number of cores an executor uses determines how many tasks it can run in parallel. The number of cores offered by the cluster is the sum of cores offered by all the workers in the cluster (a sketch follows this list).

  • Application cores

    You configure the number of cores that the application requests from the cluster using either the spark.cores.max configuration property in the Spark configuration object of your application or the --total-executor-cores <cores> argument to the dse spark command.

    The DEFAULT_PER_APP_CORES option provides the default value when your application does not configure the application cores. If the cluster does not offer a sufficient number of cores, the application fails to run. If you do not configure the number of cores in your application and DEFAULT_PER_APP_CORES is also unset, the application fails when there is not at least a single core available in the cluster.
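
As a sketch of the two levels of core configuration, again with hypothetical values:

    # spark-env.sh: cores a single Spark Worker offers to executors on this node
    export SPARK_WORKER_CORES=6

    # dse spark command line: total cores this application requests from the cluster
    dse spark --total-executor-cores 12

Setting spark.cores.max in the Spark configuration object of the application has the same effect as the --total-executor-cores argument.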

Refer to the Spark documentation for a detailed description of memory and core allocation.

DataStax Enterprise can control the memory and cores offered by particular Spark Workers in a semi-automatic fashion. The initial_spark_worker_resources parameter in the dse.yaml file specifies the fraction of system resources that is made available to the Spark Worker. The available resources are calculated in the following way:
  • Spark worker memory = initial_spark_worker_resources * (total system memory - memory assigned to Cassandra)
  • Spark worker cores = initial_spark_worker_resources * total system cores

The lowest values you can assign to Spark worker memory and cores are 64 MB and 1 core, respectively. If the calculated results are lower, no exception is thrown and the values are automatically adjusted to these minimums. The initial_spark_worker_resources value can range from 0.01 to 1. If the value is not specified, the default of 0.7 is used.

This mechanism is used by default to set the Spark worker memory and cores. To override the defaults, uncomment and edit one or both of the SPARK_WORKER_MEMORY and SPARK_WORKER_CORES options in the spark-env.sh file.
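
As a worked example of these formulas, assume a hypothetical node with 32 GB of system memory, 12 GB of it assigned to Cassandra, and 10 cores, with the default setting in dse.yaml:

    initial_spark_worker_resources: 0.7

With these numbers, Spark worker memory = 0.7 * (32 GB - 12 GB) = 14 GB and Spark worker cores = 0.7 * 10 = 7.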