Enabling Spark apps in cluster mode when authentication is enabled

When an application is submitted in cluster mode and its JAR files are on the Cassandra File System (CFS), the Spark Worker process is responsible for obtaining the required JAR files. If authentication is enabled, the Spark Worker therefore needs authentication credentials for CFS. Because the Spark Worker starts executors for unrelated Spark jobs, granting it credentials enables all future Spark jobs to pull their JAR dependencies from CFS. Any credentials granted to the Spark Worker must be considered "shared" among all submitted applications, regardless of the submitting user. These shared credentials do not apply to accessing CFS from application code.
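
For example, a minimal sketch of a cluster-mode submission whose JAR files are on CFS; the class name, CFS path, host, and credentials below are placeholder assumptions, not values from this page:

    # Submit the application in cluster mode; the Spark Worker on the
    # selected node fetches the JAR from CFS, which is why the Worker
    # needs its own CFS credentials.
    dse -u username -p password spark-submit \
      --deploy-mode cluster \
      --class com.example.MyApp \
      cfs://node_address/user/jars/my-app.jar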

Where is the spark-env.sh file?

The default location of the spark-env.sh file depends on the type of installation:

Installation Type                                  Location
Package and Installer-Services installations       /etc/dse/spark/spark-env.sh
Tarball and Installer-No Services installations    <installation_location>/resources/spark/conf/spark-env.sh

Procedure

  1. To enable Spark applications in cluster mode when JAR files are on CFS and authentication is enabled, do one of the following:

    • Add the following to the spark-env.sh file on every DataStax Enterprise node, replacing username and password with the shared CFS credentials:

      SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS -Dspark.hadoop.cassandra.username=username
            -Dspark.hadoop.cassandra.password=password"
    • Before you start the DataStax Enterprise server process, set the SPARK_WORKER_OPTS environment variable in a way that guarantees it is visible to the DataStax Enterprise server processes, as in the sketch below. This environment variable does not need to be passed to applications that are submitted with the dse spark or dse spark-submit commands.
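
      For example, with a tarball installation, a minimal sketch of the second option is to export the variable in the same shell that starts the server; the credentials and installation path are placeholders:

      # Export the shared CFS credentials so that the DSE server process,
      # and therefore the Spark Worker it starts, inherits them.
      export SPARK_WORKER_OPTS="$SPARK_WORKER_OPTS \
        -Dspark.hadoop.cassandra.username=username \
        -Dspark.hadoop.cassandra.password=password"
      # Start the node with Spark enabled (-k) from the same shell.
      <installation_location>/bin/dse cassandra -k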

  2. Follow these best practices:

    • Create a unique user with privileges only on CFS (access to the related CFS keyspace), and then use that user's credentials for Spark Worker authentication. This practice limits the amount of protected information in the database that user Spark jobs can access without explicit permission; a sketch of such a user is shown after this list.

    • Create a distinct CFS directory and limit the directory's access privileges to read-only.
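
    For example, a minimal sketch of the first practice using cqlsh; the role name spark_worker, the admin credentials, and the password are placeholders, and cfs is assumed to be the keyspace that backs CFS:

      # Create a login-enabled role with no other privileges, grant it
      # read access to the CFS keyspace only, and then use its credentials
      # in SPARK_WORKER_OPTS.
      cqlsh -u admin_username -p admin_password -e "
        CREATE ROLE spark_worker WITH PASSWORD = 'password' AND LOGIN = true;
        GRANT SELECT ON KEYSPACE cfs TO spark_worker;"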
