Spark configuration

Configure Spark nodes in data centers that are separate from nodes running other types of workloads, such as Cassandra real-time and DSE Search/Solr.

To isolate Spark traffic to a subset of dedicated nodes, follow the workload isolation guidelines. In separate data centers, you can run Spark and Shark alongside integrated Hadoop or BYOH (bring your own Hadoop). You cannot run BYOH and integrated Hadoop on the same node.

DataStax recommends using the default values of the Spark environment variables unless you need to increase the memory settings because of an OutOfMemoryError condition or because garbage collection is taking too long. The configuration options you might want to change are in the dse.yaml and spark-env.sh files, in the following locations:

  • Installer-Services and Package installations:
    • /etc/dse/dse.yaml
    • /etc/dse/spark/spark-env.sh
  • Installer-No Services and Tarball installations:
    • install_location/resources/dse/conf/dse.yaml
    • install_location/resources/spark/conf/spark-env.sh

The following sections describe some of the options that you can change to manage Spark performance or operations.

Spark directories 

After you start up a Spark cluster, DataStax Enterprise creates the Spark work directory on each worker node. The directory contains the standard output and standard error of executors and other application-specific data stored by the Spark Worker and executors; it is writable only by the Cassandra user. By default, the Spark work directory is located in /var/lib/spark/work. To change the directory, configure SPARK_WORKER_DIR in the spark-env.sh file.

The Spark RDD directory is the directory where RDDs are placed when executors decide to spill them to disk. This directory might contain data from the database or the results of running Spark applications, so if that data is confidential, prevent access by unauthorized users. Because the RDD directory can hold a significant amount of data, locate it on a fast disk. The directory is writable only by the Cassandra user. The default location of the Spark RDD directory is /var/lib/spark/rdd. To change the RDD directory, configure SPARK_RDD_DIR in the spark-env.sh file.

The following Spark temporary directories are created in /tmp/spark:
  • app/$USER - temporary files from user applications
  • repl/$USER - temporary files from Spark shell

To change the location of these directories, configure SPARK_TMP_DIR in the spark-env.sh file. The directory contains the temporary files of the Spark shell and Spark applications, and is writable by all users.
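A minimal spark-env.sh sketch of relocating these directories follows. The paths are illustrative values only, not recommendations; the variable names are the ones described above:

    # Illustrative overrides in spark-env.sh; choose paths that suit your disks
    export SPARK_WORKER_DIR="/data/spark/work"   # Spark work directory (default /var/lib/spark/work)
    export SPARK_RDD_DIR="/data/spark/rdd"       # RDD spill directory (default /var/lib/spark/rdd); use a fast disk
    export SPARK_TMP_DIR="/data/spark/tmp"       # base for temporary directories (default /tmp/spark)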

In addition to the app and repl directories, the following directories in /tmp/spark hold temporary files:
  • /tmp/spark/master

    Created on the node that runs the Spark Master and used by the Spark Master to store temporary files

  • /tmp/spark/worker

    Used by the Spark Worker for storing temporary files

JVM options 

DataStax recommends that you do not modify the Java Virtual Machine (JVM) options. Spark automatically sets the following JVM options, so do not add them to the spark-env.sh file:
  • SPARK_COMMON_OPTS. Options common to all Spark processes
  • SPARK_MASTER_OPTS. Spark Master options
  • SPARK_WORKER_OPTS. Spark Worker options
  • SPARK_EXECUTOR_OPTS. Executor options
  • SPARK_REPL_OPTS. Spark shell options
  • SPARK_APP_OPTS. Options for Spark applications run by dse spark-class, such as the Spark demo

Log directories 

The Spark logging directory is the directory where the Spark components store individual log files. DataStax Enterprise places logs in the following locations:

  • Executor logs:
    • SPARK_WORKER_DIR/application_id/executor_id/stderr
    • SPARK_WORKER_DIR/application_id/executor_id/stdout
  • Spark Master/Worker logs:
    • Spark Master: SPARK_LOG_DIR/master.log
    • Spark Worker: SPARK_LOG_DIR/worker.log

      SPARK_LOG_DIR is set to /var/log/spark by default.

  • Spark Shell and application logs: console

Configure logging options, such as log levels, in the following files:

  • Executors: log4j-executor.properties
  • Spark Master, Spark Worker: log4j-server.properties
  • Spark Shell, Spark applications: log4j.properties

Log configuration files are located in the same directory as spark-env.sh.

Spark memory and cores 

Spark memory options affect different components of the Spark ecosystem:

  • Spark Worker memory

    The SPARK_WORKER_MEMORY option configures the total amount of memory that can be assigned to all the executors run by a single Spark Worker on that node.

  • Application executor memory

    The SPARK_MEM option configures the amount of memory that each executor consumes for the application. For example, an application whose executors are run by Spark Workers on three different nodes consumes the SPARK_MEM amount of memory on each of those nodes, and that memory is subtracted from the pool available to each of the related Spark Workers. By default, SPARK_MEM is commented out in the spark-env.sh file. Use the Spark configuration object in your application to set this value; if the application does not set it, Spark uses the commented-out default.

  • Application memory

    The JAVA_OPTS environment variable configures the amount of memory consumed by the application on the client machine. Configure JAVA_OPTS before starting your application. Your application typically does not need much memory, because the executors, not the client-side application, perform the real work.
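    For example, a minimal sketch of setting JAVA_OPTS before launching an application follows; the heap size and the application class name (com.example.MyAnalyticsApp) are hypothetical:

        # Give the client-side application JVM 512 MB of heap before launching it
        export JAVA_OPTS="-Xmx512m"
        # Run the application through dse spark-class (hypothetical class name)
        dse spark-class com.example.MyAnalyticsApp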

Management of cores

You can manage the number of cores by configuring these options.

  • Spark Worker cores

    The SPARK_WORKER_CORES option configures the number of cores offered by the Spark Worker for use by executors. A single executor can borrow more than one core from the Worker. The number of cores used by an executor determines the number of tasks it can run in parallel. The number of cores offered by the cluster is the sum of the cores offered by all the Workers in the cluster.

  • Application cores

    In the Spark configuration object of your application, you configure the number of cores that the application requests from the cluster. If your application does not configure this value, DEFAULT_PER_APP_CORES provides the default. If the cluster does not offer a sufficient number of cores, the application fails to run. If you do not configure the number of cores in your application and DEFAULT_PER_APP_CORES is also unset, the application fails when not even a single core is available in the cluster.
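    As an illustration only, and assuming that DEFAULT_PER_APP_CORES is set in the spark-env.sh file like the other variables described on this page (verify the location for your installation), a cluster-wide default could be provided as follows:

        # Assumption: DEFAULT_PER_APP_CORES is defined in spark-env.sh
        # Applications that do not request cores get 2 cores by default (illustrative value)
        export DEFAULT_PER_APP_CORES="2"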

Refer to the Spark documentation for a detailed description of memory and core allocation.

DataStax Enterprise 4.5.2 and later can control the memory and cores offered by particular Spark Workers in a semi-automatic fashion. The initial_spark_worker_resources parameter in the dse.yaml file specifies the fraction of system resources made available to the Spark Worker. The available resources are calculated in the following way:
  • Spark worker memory = initial_spark_worker_resources * (total system memory - memory assigned to Cassandra)
  • Spark worker cores = initial_spark_worker_resources * total system cores
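For example (illustrative numbers), on a node with 64 GB of total system memory, 16 GB of which is assigned to Cassandra, and 10 cores, the default value of 0.7 yields 0.7 * (64 GB - 16 GB) = 33.6 GB of Spark worker memory and 0.7 * 10 = 7 Spark worker cores.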

The lowest values you can assign to Spark worker memory and cores are 64 MB and 1 core, respectively. If the calculated results are lower, no exception is thrown and the values are automatically raised to these minimums. The valid range of the initial_spark_worker_resources value is 0.01 to 1. If the value is not specified, the default of 0.7 is used.

This mechanism is used by default to set the Spark worker memory and cores. To override the defaults, uncomment and edit one or both of the SPARK_WORKER_MEMORY and SPARK_WORKER_CORES options in the spark-env.sh file.
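For example, a minimal spark-env.sh sketch of such an override follows; the values are illustrative, not recommendations:

    # Fix the Worker's resources explicitly instead of deriving them from
    # initial_spark_worker_resources (illustrative values)
    export SPARK_WORKER_MEMORY="8g"
    export SPARK_WORKER_CORES="4"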