Configuring Spark nodes

Configure Spark nodes in data centers that are separate from nodes running other types of workloads, such as Cassandra real-time and DSE Search.

Spark nodes must be configured in data centers that are separate from nodes running other types of workloads, such as Cassandra real-time and DSE Search. To isolate Spark traffic to a subset of dedicated nodes, follow the workload isolation guidelines. Within a separate data center, you can run Spark and Shark alongside integrated Hadoop or BYOH; however, you cannot run BYOH and integrated Hadoop on the same node.

DataStax recommends using the default values of Spark environment variables unless you need to increase the memory settings due to an OutOfMemoryError condition or garbage collection taking too long. The configuration options that you might want to change are in the dse.yaml and spark-env.sh files.

The location of the dse.yaml file depends on the type of installation:
  • Installer-Services and Package installations: /etc/dse/dse.yaml
  • Installer-No Services and Tarball installations: install_location/resources/dse/conf/dse.yaml
The default location of the spark-env.sh file depends on the type of installation:
  • Installer-Services and Package installations: /etc/dse/spark/spark-env.sh
  • Installer-No Services and Tarball installations: install_location/resources/spark/conf/spark-env.sh

Some of the options you can change to manage Spark performance or operations are:

Spark directories 

After you start up a Spark cluster, DataStax Enterprise creates a Spark work directory for each Spark Worker on worker nodes. A worker node can have more than one worker, configured by the SPARK_WORKER_INSTANCES option in spark-env.sh. If SPARK_WORKER_INSTANCES is undefined, a single worker is started. The work directory contains the standard output and standard error of executors, as well as other application-specific data stored by the Spark Worker and executors; the directory is writable only by the Cassandra user.

By default, the Spark parent work directory is located in /var/lib/spark/work, with each worker in a subdirectory named worker-number, where the number starts at 0. To change the parent work directory, configure SPARK_WORKER_DIR in the spark-env.sh file.
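
For example, a minimal spark-env.sh sketch that runs two workers per node under an alternate parent work directory; both values are hypothetical and only illustrate the variables described above:

  export SPARK_WORKER_INSTANCES=2               # hypothetical: run two Spark Workers on this node
  export SPARK_WORKER_DIR="/data/spark/work"    # hypothetical alternate parent work directory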

The Spark RDD directory is the directory where RDDs are placed when executors decide to spill them to disk. This directory might contain data from the database or the results of running Spark applications. If the data in the directory is confidential, prevent access by unauthorized users. Because the RDD directory might contain a significant amount of data, locate it on a fast disk. The directory is writable only by the Cassandra user. The default location of the Spark RDD directory is /var/lib/spark/rdd. To change the RDD directory, configure SPARK_LOCAL_DIRS in the spark-env.sh file.
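
As an illustration, to place spilled RDDs on a dedicated fast disk you might set the following in spark-env.sh; the path is hypothetical:

  export SPARK_LOCAL_DIRS="/mnt/fast-disk/spark/rdd"    # hypothetical location on a fast disk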

Client-to-node SSL 

Ensure that the truststore entries in cassandra.yaml are present as described in Client-to-node encryption, even when client authentication is not enabled.
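
For example, a minimal cassandra.yaml sketch with client encryption enabled and client authentication disabled; the paths and passwords are placeholders, not the settings shipped with your installation:

  client_encryption_options:
      enabled: true
      keystore: /path/to/.keystore
      keystore_password: <keystore_password>
      require_client_auth: false
      truststore: /path/to/.truststore
      truststore_password: <truststore_password>

Note that the truststore and truststore_password entries are present even though require_client_auth is false.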

Log directories 

The Spark logging directory is the directory where the Spark components store individual log files. DataStax Enterprise places logs in the following locations:

  • Executor logs:
    • SPARK_WORKER_DIR/worker-n/application_id/executor_id/stderr
    • SPARK_WORKER_DIR/worker-n/application_id/executor_id/stdout
  • Spark Master/Worker logs:
    • Spark Master: the global system.log
    • Spark Worker: SPARK_WORKER_LOG_DIR/worker-n/worker.log

      SPARK_WORKER_LOG_DIR is set to /var/log/spark/worker by default.

  • Spark Shell and application logs: console

Configure logging options, such as log levels, in the following files:

  • Executors: logback-spark-executor.xml
  • Spark Master: logback.xml
  • Spark Worker: logback-spark-server.xml
  • Spark Shell, Spark applications: logback-spark.xml
The location of the logback.xml file depends on the type of installation:
  • Installer-Services and Package installations: /etc/dse/cassandra/conf/logback.xml
  • Installer-No Services and Tarball installations: install_location/resources/cassandra/conf/logback.xml

Log configuration files are located in the same directory as spark-env.sh.
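
For example, to reduce executor log verbosity, you might add a logger element such as the following inside the configuration element of logback-spark-executor.xml; the logger name and level shown are only an illustration:

  <logger name="org.apache.spark" level="WARN"/>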

Note: When user credentials are specified on the dse command line, as in dse -u username -p password, the credentials are present in the logs of the Spark workers when the driver is run in cluster mode. The Spark Master, Worker, Executor, and Driver logs might include sensitive information, such as passwords and digest authentication tokens for Kerberos authentication mode that are passed on the command line or in the Spark configuration. DataStax recommends using only safe communication channels, such as VPN and SSH, to access the Spark user interface.

Spark memory and cores 

Spark memory options affect different components of the Spark ecosystem:

  • Spark Worker memory

    The SPARK_WORKER_MEMORY option configures the total amount of memory that a single Spark Worker on a particular node can assign to all of the executors it runs.

  • Application executor memory

    You can configure the amount of memory that each executor can consume for the application. Spark uses a 512 MB default. Use either the spark.executor.memory option, described in "Spark 1.2.1 Available Properties", or the --executor-memory <mem> argument to the dse spark command (see the example after this list).

  • Application memory

    You can configure additional Java options that the worker applies when spawning an executor for the application. Use the spark.executor.extraJavaOptions property, described in "Spark 1.2.1 Available Properties". For example:

    spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
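
As a hedged command-line illustration, the following raises per-executor memory for a single session; the 2G value is hypothetical:

  $ dse spark --executor-memory 2G

Setting the spark.executor.memory property instead applies the change to every run of the application rather than a single session.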

Management of cores

You can manage the number of cores by configuring these options.

  • Spark Worker cores

    The SPARK_WORKER_CORES option configures the number of cores offered by the Spark Worker for use by executors. A single executor can borrow more than one core from the worker. The number of cores used by an executor relates to the number of parallel tasks it might perform. The number of cores offered by the cluster is the sum of the cores offered by all the workers in the cluster.

  • Application cores

    Configure the number of cores that the application requests from the cluster using either the spark.cores.max property in the Spark configuration object of your application or the --total-executor-cores <cores> argument to the dse spark command.
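
For example, to cap the total number of cores an application takes from the cluster at launch time (the value shown is hypothetical):

  $ dse spark --total-executor-cores 4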

Refer to the Spark documentation for a detailed description of memory and core allocation.

DataStax Enterprise can control the memory and cores offered by particular Spark Workers in a semi-automatic fashion. The initial_spark_worker_resources parameter in the dse.yaml file specifies the fraction of system resources that is made available to the Spark Worker. The available resources are calculated in the following way:
  • Spark Worker memory = initial_spark_worker_resources * (total system memory - memory assigned to Cassandra)
  • Spark Worker cores = initial_spark_worker_resources * total system cores

The lowest values you can assign to Spark Worker memory and cores are 64 MB and 1 core, respectively. If the calculated results are lower, no exception is thrown and the values are automatically raised to these minimums. The range of the initial_spark_worker_resources value is 0.01 to 1. If the value is not specified, the default of 0.7 is used.
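
For example, on a hypothetical node with 64 GB of system memory, 8 GB of it assigned to Cassandra, and 16 cores, the default value of 0.7 yields roughly 0.7 * (64 - 8) = 39 GB of Spark Worker memory and 0.7 * 16 = 11 Spark Worker cores.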

This mechanism is used by default to set the Spark Worker memory and cores. To override the default, uncomment and edit one or both of the SPARK_WORKER_MEMORY and SPARK_WORKER_CORES options in the spark-env.sh file.
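
For example, to pin the worker to explicit limits instead of the calculated values, you might uncomment and set the following in spark-env.sh; the values are hypothetical:

  export SPARK_WORKER_MEMORY=10g    # hypothetical explicit memory limit for this worker
  export SPARK_WORKER_CORES=6       # hypothetical explicit core count for this worker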

Deploying nodes for Spark jobs 

Before starting up nodes on a tarball installation, you need permission to access the default Spark directory locations: /var/lib/spark/rdd and /var/log/spark. Change ownership of these directories as follows:
$ sudo mkdir -p /var/lib/spark/rdd
$ sudo chmod a+w /var/lib/spark/rdd
$ sudo chown -R $USER:$GROUP /var/lib/spark
$ sudo mkdir -p /var/log/spark
$ sudo chown -R $USER:$GROUP /var/log/spark

In multiple data center clusters, use a virtual data center to isolate Spark jobs. Spark jobs consume resources that can affect latency and throughput. To isolate Spark traffic to a subset of dedicated nodes, follow the workload isolation guidelines.

DataStax Enterprise supports the use of Cassandra virtual nodes (vnodes) with Spark.