Enabling Spark apps in cluster mode when authentication is enabled
You must enable Spark applications in cluster mode when the application JAR files are on the Cassandra File System (CFS) and authentication is enabled. When an application is submitted in cluster mode and its JAR files are on CFS, the Spark Worker process is responsible for fetching the required JAR files. When authentication is required, the Spark Worker process needs authentication credentials for CFS. Because the Spark Worker starts executors for unrelated Spark jobs, giving the Spark Worker process credentials enables all future Spark jobs to pull their JAR file dependencies from CFS. Credentials granted to the Spark Worker must therefore be considered shared among all submitted applications, regardless of the submitting user. These shared credentials do not apply to accessing CFS from application code.
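For context, a cluster-mode submission whose JAR lives on CFS might look like the following sketch; the application class name and the CFS path are hypothetical placeholders, not values from this document.

```bash
# Hypothetical cluster-mode submission with the application JAR on CFS.
# com.example.MyApp and cfs:///apps/my-app.jar are placeholders.
dse spark-submit --deploy-mode cluster --class com.example.MyApp cfs:///apps/my-app.jar
```

In this scenario, the Spark Worker on each node, not the submitting user's shell, is what fetches the JAR from CFS, which is why the Worker itself needs credentials.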
Where is the spark-env.sh file?
The default location of the spark-env.sh file depends on the type of installation:
Installation Type | Location
---|---
Package installations + Installer-Services installations | /etc/dse/spark/spark-env.sh
Tarball installations + Installer-No Services installations | installation_location/resources/spark/conf/spark-env.sh
Procedure
- To enable Spark applications in cluster mode when JAR files are on CFS and authentication is enabled, do one of the following (a sketch of the second option follows this step):
  - Add this statement to the spark-env.sh file on every DataStax Enterprise node:

    ```bash
    SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.hadoop.cassandra.username=username -Dspark.hadoop.cassandra.password=password"
    ```

  - Before you start the DataStax Enterprise server process, set the SPARK_WORKER_OPTS environment variable in a way that guarantees visibility to DataStax Enterprise server processes. This environment variable does not need to be passed to applications that are submitted with the dse spark or dse spark-submit commands.
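If you choose the second option, one way to guarantee visibility is to export the variable in the same shell that starts the node. The following is a minimal sketch for a tarball installation; installation_location, the spark_worker_cfs username, and the REPLACE_ME password are placeholders.

```bash
# Minimal sketch: make SPARK_WORKER_OPTS visible to the DSE server process
# by exporting it in the shell that starts the node (tarball installation).
# spark_worker_cfs and REPLACE_ME are placeholder credentials.
export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS \
  -Dspark.hadoop.cassandra.username=spark_worker_cfs \
  -Dspark.hadoop.cassandra.password=REPLACE_ME"

# Start the node with Spark (Analytics) enabled; -k enables Spark.
installation_location/bin/dse cassandra -k
```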
- Follow these best practices (see the sketch after this list):
  - Create a unique user with privileges only on CFS (access to the related CFS keyspace), and then use that user's credentials for Spark Worker authentication. This limits how much protected information in the database is accessible to user-submitted Spark jobs without explicit permission.
  - Create a distinct CFS directory and limit its access privileges to read-only.
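As a sketch of the first best practice, the following creates a dedicated role through cqlsh and grants it read access only to the CFS keyspace. It assumes a DSE version with role-based CQL (CREATE ROLE), that CFS data lives in a keyspace named cfs, and placeholder credentials; adjust all of these for your cluster and authorization setup.

```bash
# Hypothetical sketch: create a dedicated role for the Spark Worker and
# grant it read-only access to the CFS keyspace (assumed to be named "cfs").
# Requires authorization to be enabled on the cluster.
cqlsh -u cassandra -p cassandra -e "
  CREATE ROLE IF NOT EXISTS spark_worker_cfs
    WITH PASSWORD = 'REPLACE_ME' AND LOGIN = true;
  GRANT SELECT ON KEYSPACE cfs TO spark_worker_cfs;
"
```

Granting only SELECT keeps the Worker's shared credentials read-only, in line with the second best practice above.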