Enabling Spark apps in cluster mode when authentication is enabled

Configuration steps to enable Spark applications in cluster mode when JAR files are on the Cassandra file system (CFS) and authentication is enabled.

When an application is submitted in cluster mode and its JAR files are on the Cassandra File System (CFS), the Spark Worker process is responsible for retrieving the required JAR files. If authentication is enabled, the Spark Worker therefore needs credentials for CFS. Because the Spark Worker starts executors for unrelated Spark jobs, granting credentials to the Spark Worker process allows all future Spark jobs to pull their JAR file dependencies from CFS. Credentials granted to the Spark Worker must be considered shared among all submitted applications, regardless of the submitting user. These shared credentials do not apply to CFS access from the application code itself.
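
For context, a cluster-mode submission that keeps its JAR on CFS might look like the following sketch; the class name com.example.MyApp and the path cfs:///apps/my-app.jar are hypothetical placeholders:

  # Hypothetical cluster-mode submission with the application JAR on CFS.
  # The Spark Worker, not the submitting client, fetches my-app.jar from
  # CFS, which is why the Worker needs CFS credentials when authentication
  # is enabled.
  dse spark-submit --deploy-mode cluster \
        --class com.example.MyApp \
        cfs:///apps/my-app.jar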
The default location of the spark-env.sh file depends on the type of installation:
  Installer-Services and Package installations: /etc/dse/spark/spark-env.sh
  Installer-No Services and Tarball installations: install_location/resources/spark/conf/spark-env.sh

Procedure

  1. To enable Spark applications in cluster mode when JAR files are on CFS and authentication is enabled, do one of the following:
    • Add this statement to the spark-env.sh file on every DataStax Enterprise node, replacing username and password with the Spark Worker credentials:
      SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.hadoop.cassandra.username=username \
            -Dspark.hadoop.cassandra.password=password"
    • Before you start the DataStax Enterprise server process, set the SPARK_WORKER_OPTS environment variable so that it is visible to the DataStax Enterprise server process; see the sketch after this list.

      This environment variable does not need to be passed to applications that are submitted with the dse spark or dse spark-submit commands.
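
      For the second option, a minimal sketch for a tarball installation follows; install_location, username, and password are placeholders:

      # Export the variable in the shell that starts DataStax Enterprise so
      # that the Spark Worker process inherits it (placeholders:
      # install_location, username, password).
      export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS \
            -Dspark.hadoop.cassandra.username=username \
            -Dspark.hadoop.cassandra.password=password"
      # Start DataStax Enterprise in Spark (Analytics) mode.
      install_location/bin/dse cassandra -k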

  2. Follow these best practices:
    • Create a dedicated user whose privileges are limited to CFS (access to the CFS keyspace), and use that user's credentials for Spark Worker authentication. This practice limits how much protected information in the Cassandra database is accessible to user-submitted Spark jobs without explicit permission; a sketch of the corresponding CQL statements follows this list.
    • Create a dedicated CFS directory and restrict its access privileges to read-only.
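
A sketch of the CQL statements for the first best practice, run through cqlsh as a superuser, is shown below; the user name spark_worker, its password, and the keyspace name cfs are assumptions to adapt to your cluster:

      # Create a dedicated user for the Spark Worker and grant it read
      # access to the CFS keyspace only (names and password are
      # assumptions).
      cqlsh -u cassandra -p cassandra -e "
        CREATE USER spark_worker WITH PASSWORD 'password' NOSUPERUSER;
        GRANT SELECT ON KEYSPACE cfs TO spark_worker;"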