Running Apache Spark™ commands against a remote cluster

To run Spark commands against a remote cluster, you must export the DSE configuration from one of the remote nodes to the local client machine.

To run a driver application remotely, there must be full public network communication between the remote nodes and the client machine.

Prerequisites

The Spark driver ports on the local client machine must be accessible to the remote DSE cluster nodes. This might require configuring the firewall on both the client machine and the remote DSE cluster nodes to allow communication between them.

Spark selects ports for internal communication dynamically unless the ports are set manually. To use dynamically chosen ports, the client firewall must allow incoming connections on all ports from the remote cluster nodes.

To set the ports manually, set the corresponding properties in spark-defaults.conf, as shown in this example:

spark.blockManager.port 38000
spark.broadcast.port 38001
spark.driver.port 38002
spark.executor.port 38003
spark.fileserver.port 38004
spark.replClassServer.port 38005

For a full list of ports used by DSE, see Securing DataStax Enterprise ports.
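As an illustration only, on a Linux client that uses iptables, inbound access to the fixed ports above could be granted with a rule like the following. The cluster subnet 10.0.0.0/24 is a placeholder; the port range matches the example configuration:

```shell
# Placeholder subnet for the remote DSE cluster nodes; substitute the real one.
# Allows inbound TCP connections to the Spark driver ports 38000-38005
# configured in spark-defaults.conf above.
iptables -A INPUT -p tcp -s 10.0.0.0/24 -m multiport --dports 38000:38005 -j ACCEPT
```

Adjust the rule to your firewall tooling (for example, firewalld or ufw) and to the actual ports you configured.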

Where is the spark-defaults.conf file?

The location of the spark-defaults.conf file depends on the type of installation:

Package installations and Installer-Services installations:

  /etc/dse/spark/spark-defaults.conf

Tarball installations and Installer-No Services installations:

  <installation_location>/resources/spark/conf/spark-defaults.conf

Procedure

  1. Export the DataStax Enterprise client configuration from the remote node to the client node:

    1. On the remote node:

      dse client-tool configuration export dse-config.jar
    2. Copy the exported JAR to the client nodes.

      scp dse-config.jar user@clientnode1.example.com:
    3. On the client node:

      dse client-tool configuration import dse-config.jar
  2. Run the Spark command against the remote node.

    dse spark-submit <submit_options> myApplication.jar

    To set the driver host to a publicly accessible IP address, pass in the spark.driver.host option.

    dse spark-submit --conf spark.driver.host=<IP_address> myApplication.jar
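When scripting the submission, the driver host can be held in a variable and substituted into the command. This is a sketch: the address 203.0.113.10 and the application name myApplication.jar are placeholders, not values from your cluster.

```shell
# Placeholder public IP address for the client machine; substitute the
# address that the remote DSE cluster nodes can reach.
DRIVER_HOST=203.0.113.10

# Build the submit command with the driver host pinned to that address.
SUBMIT_CMD="dse spark-submit --conf spark.driver.host=${DRIVER_HOST} myApplication.jar"
echo "$SUBMIT_CMD"
```

If you also set fixed ports in spark-defaults.conf, the same properties can be passed per invocation with additional --conf flags instead of editing the file.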

