Starting Apache Spark™

Before you start Spark, configure RPC permission for CQL execution for the DseClientTool object, as described in Authorizing remote procedure calls for CQL execution.

RPC permission for the DseClientTool object is required to run Spark because the DseClientTool object is called implicitly by the Spark launcher.
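
For example, a minimal sketch of granting this permission with cqlsh (the role name spark_user and the admin credentials are placeholders, and the exact GRANT statement can vary by DSE version):

cqlsh -u admin -p admin_password -e "GRANT EXECUTE ON REMOTE OBJECT DseClientTool TO spark_user;"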

By default, DSEFS is required to execute Spark applications, and DSEFS should not be disabled when Spark is enabled on a DSE node. If there is a strong reason not to use DSEFS as the default file system, reconfigure Spark to use a different file system. For example, to use a local file system, set the following properties in spark-daemon-defaults.conf:

spark.hadoop.fs.defaultFS=file:///
spark.hadoop.hive.metastore.warehouse.dir=file:///tmp/warehouse
Where is the spark-daemon-defaults.conf file?

The location of the spark-daemon-defaults.conf file depends on the type of installation:

  • Package installations and Installer-Services installations: /etc/dse/spark/spark-daemon-defaults.conf

  • Tarball installations and Installer-No Services installations: <installation_location>/resources/spark/conf/spark-daemon-defaults.conf

How you start Spark depends on the type of installation and whether you want to run in Spark mode or SearchAnalytics mode:

Package and Installer-Services installations

To start the Spark trackers on a cluster of analytics nodes, edit the /etc/default/dse file to set SPARK_ENABLED to 1.

When you start DataStax Enterprise as a service, the node is launched as a Spark node. You can enable additional components, as listed below; a combined example follows.

  • Spark mode: set SPARK_ENABLED=1 in /etc/default/dse to start the node in Spark mode.

  • SearchAnalytics mode: set SPARK_ENABLED=1 and SEARCH_ENABLED=1 in /etc/default/dse. SearchAnalytics mode requires testing in your environment before it is used in production clusters. In dse.yaml, cql_solr_query_paging: driver is required.

Where is the dse.yaml file?

The location of the dse.yaml file depends on the type of installation:

  • Package installations and Installer-Services installations: /etc/dse/dse.yaml

  • Tarball installations and Installer-No Services installations: <installation_location>/resources/dse/conf/dse.yaml
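
For example, a minimal sketch for a package installation (assumes the standard dse service name used by package installs; adjust for your init system):

# In /etc/default/dse, enable Spark (and optionally Search for SearchAnalytics mode):
SPARK_ENABLED=1
# SEARCH_ENABLED=1

# Start (or restart) DataStax Enterprise as a service:
sudo service dse start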

Tarball and Installer-No Services installations

To start the Spark trackers on a cluster of analytics nodes, use the -k option:

<installation_location>/bin/dse cassandra -k

Nodes started with -k are automatically assigned to the default Analytics datacenter if you do not configure a datacenter in the snitch property file.

You can enable additional components:

  • Spark mode: use the -k option to start the node in Spark mode.

  • SearchAnalytics mode: use the -k and -s options. In dse.yaml, cql_solr_query_paging: driver is required.

For example, to start a node in SearchAnalytics mode, use the -k and -s options:

<installation_location>/bin/dse cassandra -k -s

Starting the node with the Spark option starts a node that is designated as the master, as shown by the Analytics(SM) workload in the output of the dsetool ring command:

dsetool ring
Address          DC                   Rack         Workload             Graph  Status  State    Load             Owns                 Token                                        Health [0,1]
                                                                                                                                      0
10.200.175.149   Analytics            rack1        Analytics(SM)        no     Up      Normal   185 KiB          ?                    -9223372036854775808                         0.90
10.200.175.148   Analytics            rack1        Analytics(SW)        no     Up      Normal   194.5 KiB        ?                    0                                            0.90
Note: you must specify a keyspace to get ownership information.

Launching Spark

After starting a Spark node, use dse commands to launch Spark.

Spark location
  • Package installations and Installer-Services installations: dse spark

  • Tarball installations and Installer-No Services installations: <installation_location>/bin/dse spark

You can use Cassandra-specific properties to start Spark. Spark binds to the listen_address that is specified in cassandra.yaml; see the example excerpt below.

Where is the cassandra.yaml file?

The location of the cassandra.yaml file depends on the type of installation:

  • Package installations and Installer-Services installations: /etc/dse/cassandra/cassandra.yaml

  • Tarball installations and Installer-No Services installations: <installation_location>/resources/cassandra/conf/cassandra.yaml
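
For example, a minimal excerpt of the relevant cassandra.yaml setting (the address shown is a placeholder for your node's IP address):

# cassandra.yaml excerpt: Spark binds to this address
listen_address: 10.200.175.149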

DataStax Enterprise supports these commands for launching Spark on the DataStax Enterprise command line:

dse spark

Enters the interactive Spark shell and offers basic auto-completion.

dse spark-submit

Launches applications on a cluster, like spark-submit. With this interface you can use the Spark cluster managers without needing a separate configuration for each application. The syntax for Package and Installer-Services installations is:

dse spark-submit --class class_name jar_file other_options

For example, if you write a class that defines an option named d, enter the command as follows:

dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES

The JAR file can be located in a DSEFS directory. If the DSEFS cluster is secured, provide authentication credentials as described in DSEFS authentication.
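
For example, a sketch of submitting the same application from a DSEFS path (the dsefs:/// path shown is hypothetical):

dse spark-submit --class com.datastax.HttpSparkStream dsefs:///jars/HttpSparkStream.jar -d $NUM_SPARK_NODES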

The dse spark-submit command supports the same options as Apache Spark’s spark-submit. For example, to submit an application in cluster deploy mode, using the supervise option to restart the driver in case of failure:

dse spark-submit --deploy-mode cluster --supervise --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES

Unlike the standard behavior of the Spark status and kill options, in DSE deployments these options do not require the Spark Master IP address:

  • dse spark-submit --kill driver_id [--master master_ip_address]

  • dse spark-submit --status driver_id [--master master_ip_address]

For example, to kill a driver of a Spark application running in the DSE cluster:

dse spark-submit --kill driver-20180726160353-0019

To get the status of a Spark application running in the DSE cluster:

dse spark-submit --status driver-20180726160353-0019

The directory in which you run the dse Spark commands must be writable by the current user.

Internal authentication is supported.

Use the optional environment variables DSE_USERNAME and DSE_PASSWORD to increase security and prevent the user name and password from appearing in the Spark log files or in the process list on the Spark Web UI. To specify a user name and password using environment variables, add the following to your Bash .profile or .bash_profile:

export DSE_USERNAME=user
export DSE_PASSWORD=secret

These environment variables are supported for all Spark and dse client-tool commands.

DataStax recommends using the environment variables instead of passing user credentials on the command line.

Authentication credentials can be provided in several ways; see Connecting to authentication enabled clusters.
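
For completeness, a sketch of passing credentials directly on the command line (the user name and password are placeholders; prefer the environment-variable approach above):

dse -u user -p secret spark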

Specifying Spark URLs

You do not need to specify the Spark Master address when starting Spark jobs with DSE. If you connect to any Spark node in a datacenter, DSE automatically discovers the Master address and connects the client to the Master.

Specify the URL for any Spark node using the following format:

dse://[Spark node address[:port number]]?[parameter name=parameter value;]...

By default the URL is dse://?, which is equivalent to dse://localhost:9042. Any parameters you set in the URL override the configuration read from DSE’s Spark configuration settings. Valid parameters are CassandraConnectorConf settings with the spark.cassandra. prefix stripped. For example, you can set the spark.cassandra.connection.local_dc option to dc2 by specifying dse://?connection.local_dc=dc2.

Or, to specify multiple spark.cassandra.connection.host addresses for high availability in case the specified connection point is down: dse://1.1.1.1:123?connection.host=1.1.2.2,1.1.3.3.

If the connection.host parameter is specified, the host provided in the standard URL is prepended to the list of hosts set in connection.host. If the port is specified in the standard URL, it overrides the port number set in the connection.port parameter.

Connection options when using dse spark-submit are retrieved in the following order: from the Master URL, then the Spark Cassandra Connector options, then the DSE configuration files.
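
For example, a sketch that combines these URL parameters in a single submission (the addresses, class name, and JAR file are placeholders):

dse spark-submit --master "dse://10.10.1.1:9042?connection.local_dc=dc2;connection.host=10.10.1.2,10.10.1.3" --class com.example.MyApp myapp.jar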

Detecting Spark application failures

DSE has a failure detector for Spark applications, which detects whether a running Spark application is dead or alive. If the application has failed, the application is removed from the DSE Spark Resource Manager.

The failure detector works by keeping an open TCP connection from a DSE Spark node to the Spark Driver in the application. No data is exchanged, but regular TCP keep-alive control messages are sent and received. When the connection is interrupted, the failure detector attempts to reacquire the connection every second for the duration of the appReconnectionTimeoutSeconds timeout value (5 seconds by default). If it fails to reacquire the connection during that time, the application is removed.

Specify a custom timeout value by adding appReconnectionTimeoutSeconds=value to the master URI when submitting the application. For example, to set the timeout value to 10 seconds:

dse spark --master dse://?appReconnectionTimeoutSeconds=10
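
The same parameter can be added to the master URI when submitting an application with dse spark-submit; for example (the class name and JAR file are placeholders):

dse spark-submit --master dse://?appReconnectionTimeoutSeconds=10 --class com.example.MyApp myapp.jar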
