Starting Spark

Before you start Spark, configure RPC permission for the DseClientTool object, as described in Authorizing remote procedure calls (RPC).

RPC permission for the DseClientTool object is required to run Spark because the DseClientTool object is called implicitly by the Spark launcher.

By default, DSEFS is required to execute Spark applications. Do not disable DSEFS when Spark is enabled on a DSE node. If there is a strong reason not to use DSEFS as the default file system, reconfigure Spark to use a different file system. For example, to use a local file system, set the following properties in spark-daemon-defaults.conf:

spark.hadoop.fs.defaultFS=file:///
spark.hadoop.hive.metastore.warehouse.dir=file:///tmp/warehouse

How you start Spark depends on the installation type and on whether you want to run in Spark mode or SearchAnalytics mode:

Package installations:

To start the Spark trackers on a cluster of analytics nodes, edit the /etc/default/dse file to set SPARK_ENABLED to 1.

When you start DataStax Enterprise as a service, the node is launched as a Spark node. You can enable additional components.

| Mode | Option in /etc/default/dse | Description |
| --- | --- | --- |
| Spark | SPARK_ENABLED=1 | Start the node in Spark mode. |
| SearchAnalytics mode | SPARK_ENABLED=1 and SEARCH_ENABLED=1 | SearchAnalytics mode requires testing in your environment before it is used in production clusters. In dse.yaml, cql_solr_query_paging: driver is required. |
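To check which mode a package installation will start in, the flags in /etc/default/dse can be read programmatically. The following Python sketch is illustrative only: the helper names parse_flags and node_mode are hypothetical, and it assumes the file contains plain KEY=value lines with optional # comments.

```python
def parse_flags(text):
    """Collect KEY=value pairs, ignoring blank lines and # comments."""
    flags = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        flags[key.strip()] = value.strip().strip("\"'")
    return flags


def node_mode(text):
    """Map the flags described in the table to a mode name (hypothetical helper)."""
    flags = parse_flags(text)
    spark = flags.get("SPARK_ENABLED") == "1"
    search = flags.get("SEARCH_ENABLED") == "1"
    if spark and search:
        return "SearchAnalytics"
    if spark:
        return "Spark"
    return "not a Spark node"
```

For example, node_mode(open("/etc/default/dse").read()) would report the mode configured on the local node, assuming the file is readable by the current user.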

Tarball installations:

To start the Spark trackers on a cluster of analytics nodes, use the -k option:

<installation_location>/bin/dse cassandra -k

Nodes started with -k are automatically assigned to the default Analytics datacenter if you do not configure a datacenter in the snitch property file.

You can enable additional components:

| Mode | Option | Description |
| --- | --- | --- |
| Spark | -k | Start the node in Spark mode. |
| SearchAnalytics mode | -k -s | In dse.yaml, cql_solr_query_paging: driver is required. |

For example, to start a node in SearchAnalytics mode, use the -k and -s options:

<installation_location>/bin/dse cassandra -k -s
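The option table above maps directly onto a command line. As a rough sketch, the hypothetical Python helper below builds the argument list for a tarball installation. The function name and the install_dir parameter are assumptions made for illustration; only the -k and -s options and the bin/dse cassandra command come from this page.

```python
# Options per mode, as listed in the tarball installation table.
MODE_OPTIONS = {
    "Spark": ["-k"],
    "SearchAnalytics": ["-k", "-s"],
}


def start_command(install_dir, mode="Spark"):
    """Return the argv list that starts a node in the given mode."""
    if mode not in MODE_OPTIONS:
        raise ValueError(f"unknown mode: {mode!r}")
    dse_binary = f"{install_dir.rstrip('/')}/bin/dse"
    return [dse_binary, "cassandra", *MODE_OPTIONS[mode]]
```

The returned list can be passed to subprocess.run as is, which avoids shell quoting issues.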

Starting the node with the Spark option starts a node that is designated as the master, as shown by the Analytics(SM) workload in the output of the dsetool ring command:

dsetool ring
Address          DC                   Rack         Workload             Graph  Status  State    Load             Owns                 Token                                        Health [0,1]
                                                                                                                                      0
10.200.175.149   Analytics            rack1        Analytics(SM)        no     Up      Normal   185 KiB          ?                    -9223372036854775808                         0.90
10.200.175.148   Analytics            rack1        Analytics(SW)        no     Up      Normal   194.5 KiB        ?                    0                                            0.90
NOTE: you must specify a keyspace to get ownership information.
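If you need to locate the master from a script, the Workload column of the output above can be parsed. The following Python sketch is one possible approach, not a supported interface: the function names are hypothetical, and it assumes the first four whitespace-separated columns are always Address, DC, Rack, and Workload, as in the sample.

```python
import ipaddress

SAMPLE_RING_OUTPUT = """\
Address          DC         Rack   Workload       Graph  Status  State   Load       Owns  Token                 Health [0,1]
                                                                                          0
10.200.175.149   Analytics  rack1  Analytics(SM)  no     Up      Normal  185 KiB    ?     -9223372036854775808  0.90
10.200.175.148   Analytics  rack1  Analytics(SW)  no     Up      Normal  194.5 KiB  ?     0                     0.90
NOTE: you must specify a keyspace to get ownership information.
"""


def workloads_by_address(ring_output):
    """Return {address: workload} for every row that starts with an IP address."""
    workloads = {}
    for line in ring_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank lines and the lone token line
        try:
            ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # header and NOTE lines
        workloads[fields[0]] = fields[3]
    return workloads


def spark_master_address(ring_output):
    """Return the address whose workload is Analytics(SM), or None."""
    for address, workload in workloads_by_address(ring_output).items():
        if workload == "Analytics(SM)":
            return address
    return None
```

In practice the input would come from running dsetool ring, for example with subprocess.run(["dsetool", "ring"], capture_output=True, text=True).stdout.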

Launching Spark

After starting a Spark node, use dse commands to launch Spark.

Usage:

Package installations: dse spark

Tarball installations: <installation_location>/bin/dse spark

You can use Cassandra-specific properties to start Spark. Spark binds to the listen_address that is specified in cassandra.yaml.

DataStax Enterprise supports these commands for launching Spark on the DataStax Enterprise command line:

dse spark

Enters the interactive Spark shell, which offers basic auto-completion.

Package installations: dse spark

Tarball installations: <installation_location>/bin/dse spark

dse spark-submit

Launches applications on a cluster, like Apache Spark’s spark-submit. With this interface, you can use Spark cluster managers without a separate configuration for each application. The syntax for package installations is:

dse spark-submit --class <class_name> <jar_file> <other_options>

For example, if you write a class that defines an option named d, enter the command as follows:

dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES

The JAR file can be located in a DSEFS directory. If the DSEFS cluster is secured, provide authentication credentials as described in DSEFS authentication.

The dse spark-submit command supports the same options as Apache Spark’s spark-submit. For example, to submit an application in cluster mode with the --supervise option, so that it is restarted in case of failure:

dse spark-submit --deploy-mode cluster --supervise --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES
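When submissions are scripted, it can help to assemble the command as an argument list instead of a shell string. The Python sketch below is a minimal illustration: the function name and its parameters are hypothetical, and it uses only the options shown in the examples above (--class, --deploy-mode cluster, and --supervise).

```python
import shlex


def spark_submit_command(main_class, jar_file, app_args=(), cluster=False, supervise=False):
    """Build the argv list for dse spark-submit (package installation)."""
    command = ["dse", "spark-submit"]
    if cluster:
        command += ["--deploy-mode", "cluster"]
    if supervise:
        command.append("--supervise")
    command += ["--class", main_class, jar_file]
    command += [str(arg) for arg in app_args]  # application options follow the JAR
    return command


if __name__ == "__main__":
    argv = spark_submit_command(
        "com.datastax.HttpSparkStream",
        "target/HttpSparkStream.jar",
        app_args=["-d", 3],
        cluster=True,
        supervise=True,
    )
    print(shlex.join(argv))  # shell-quoted form, useful for logging
```

Passing the list to subprocess.run(argv, check=True) would launch the job; the value 3 stands in for $NUM_SPARK_NODES and is made up.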

The directory in which you run the dse Spark commands must be writable by the current user.

Internal authentication is supported.

Use the optional environment variables DSE_USERNAME and DSE_PASSWORD to increase security and prevent the user name and password from appearing in the Spark log files or in the process list on the Spark Web UI. To specify a user name and password using environment variables, add the following to your Bash .profile or .bash_profile:

export DSE_USERNAME=user
export DSE_PASSWORD=secret

These environment variables are supported for all Spark and dse client-tool commands.

DataStax recommends using the environment variables instead of passing user credentials on the command line.
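The same recommendation applies when a launcher script starts dse commands: pass the credentials through the child process environment, not through its argument list. A minimal Python sketch, assuming a hypothetical helper named dse_environment; only the DSE_USERNAME and DSE_PASSWORD variable names come from this page.

```python
import os


def dse_environment(username, password, base_env=None):
    """Return a copy of the environment with DSE credentials added.

    The caller's environment (or base_env) is not modified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DSE_USERNAME"] = username
    env["DSE_PASSWORD"] = password
    return env


# Illustrative use (not executed here):
#   subprocess.run(["dse", "spark"], env=dse_environment("user", "secret"))
```

Because the credentials never appear in the argv list, they do not show up in the process list. In a real script, read the password from a protected source; do not hard-code it.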

You can provide authentication credentials in several ways; see Credentials for authentication.

Specifying Spark URLs

You do not need to specify the Spark Master address when starting Spark jobs with DSE. If you connect to any Spark node in a datacenter, DSE will automatically discover the Master address and connect the client to the Master.

Specify the URL for any Spark node using the following format:

dse://[<Spark node address>[:<port number>]]?[<parameter name>=<parameter value>;]<...>

By default, the URL is dse://?, which is equivalent to dse://localhost:9042. Any parameters you set in the URL override the values read from DSE’s Spark configuration settings.

You can specify the work pool in which the application runs by adding workpool=<work pool name> as a URL parameter. For example, dse://1.1.1.1:123?workpool=workpool2.

Valid parameters are CassandraConnectorConf settings with the spark.cassandra. prefix stripped. For example, you can set the spark.cassandra.connection.local_dc option to dc2 by specifying dse://?connection.local_dc=dc2.

To provide multiple spark.cassandra.connection.host addresses for high availability in case the specified connection point is down, list them in the connection.host parameter: dse://1.1.1.1:123?connection.host=1.1.2.2,1.1.3.3.

If the connection.host parameter is specified, the host provided in the standard URL is prepended to the list of hosts set in connection.host. If the port is specified in the standard URL, it overrides the port number set in the connection.port parameter.
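The URL format and the host and port rules described above can be illustrated with a short Python sketch. This is not DSE's own parser: the function names are hypothetical, parameters are assumed to be joined with semicolons and no trailing separator, and IPv6 addresses are not handled.

```python
def build_dse_url(host="", port=None, params=None):
    """Build a dse:// URL; with no arguments this returns the default dse://?"""
    authority = host
    if port is not None:
        if not host:
            raise ValueError("a port requires a host")
        authority += f":{port}"
    query = ";".join(f"{name}={value}" for name, value in (params or {}).items())
    return f"dse://{authority}?{query}"


def contact_points(url):
    """Return (hosts, port) after applying the rules described in the text.

    The URL host is prepended to connection.host, and a port given in the URL
    takes priority over connection.port.
    """
    prefix = "dse://"
    if not url.startswith(prefix):
        raise ValueError("not a dse:// URL")
    authority, _, query = url[len(prefix):].partition("?")
    url_host, _, url_port = authority.partition(":")
    params = dict(item.split("=", 1) for item in query.split(";") if "=" in item)
    hosts = [h for h in params.get("connection.host", "").split(",") if h]
    if url_host:
        hosts.insert(0, url_host)
    return hosts, (url_port or params.get("connection.port"))
```

For instance, contact_points("dse://1.1.1.1:123?connection.host=1.1.2.2,1.1.3.3") yields the three hosts in order with port "123".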

Connection options when using dse spark-submit are retrieved in the following order: from the Master URL, then the Spark Cassandra Connector options, then the DSE configuration files.
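One way to picture this lookup order is as a chain of dictionaries searched front to back. The Python sketch below is a simplified model under two assumptions: that the order listed above is also the precedence order (the first source that defines an option wins), and that all three sources have already been normalized to the same key names. The function name and all values are made up for illustration.

```python
from collections import ChainMap


def effective_options(master_url_params, connector_options, dse_config):
    """Look up options in the Master URL first, then connector options, then DSE config."""
    return ChainMap(master_url_params, connector_options, dse_config)


if __name__ == "__main__":
    options = effective_options(
        {"connection.local_dc": "dc2"},                               # from the Master URL
        {"connection.local_dc": "dc1", "connection.port": "9042"},    # connector options
        {"connection.port": "9043", "connection.host": "10.0.0.1"},   # DSE configuration files
    )
    for name in sorted(options):
        print(name, "=", options[name])
```

ChainMap keeps the three sources separate, so it stays possible to inspect which layer supplied a given value.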

Detecting Spark application failures

DSE has a failure detector for Spark applications, which detects whether a running Spark application is dead or alive. If the application has failed, it is removed from the DSE Spark Resource Manager.

The failure detector works by keeping an open TCP connection from a DSE Spark node to the Spark Driver in the application. No data is exchanged, but regular TCP connection keep-alive control messages are sent and received. When the connection is interrupted, the failure detector will attempt to reacquire the connection every 1 second for the duration of the appReconnectionTimeoutSeconds timeout value (5 seconds by default). If it fails to reacquire the connection during that time, the application is removed.

To set a custom timeout value, add appReconnectionTimeoutSeconds=<value> to the master URL when submitting the application. For example, to set the timeout value to 10 seconds:

dse spark --master dse://?appReconnectionTimeoutSeconds=10
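The reconnection behavior described in this section can be modeled as a bounded retry loop. The Python sketch below is a conceptual model only, not DSE's implementation: the function name, the try_connect callback, and the injectable sleep function are all assumptions, and the real detector relies on TCP keep-alive, which is not modeled here.

```python
import time


def reacquire_connection(try_connect, timeout_seconds=5, interval_seconds=1, sleep=time.sleep):
    """Retry try_connect() once per interval until it succeeds or the timeout elapses.

    Returns True if the connection was reacquired, or False if the application
    would be removed. The defaults mirror the 1-second retry interval and the
    5-second appReconnectionTimeoutSeconds default described in the text.
    """
    attempts = max(1, int(timeout_seconds // interval_seconds))
    for attempt in range(attempts):
        if try_connect():
            return True
        if attempt < attempts - 1:
            sleep(interval_seconds)  # wait before the next attempt
    return False
```

Raising appReconnectionTimeoutSeconds to 10, as in the command above, corresponds to calling reacquire_connection(try_connect, timeout_seconds=10) in this model, which allows ten attempts, not five.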
