Start Apache Spark

Before you start Apache Spark™, configure RPC permissions for the DseClientTool object.

RPC permission for the DseClientTool object is required to run Apache Spark because the DseClientTool object is called implicitly by the Spark launcher.

By default, DSEFS is required to execute Spark applications. Do not disable DSEFS when you enable Apache Spark on a DataStax Enterprise (DSE) node. If there is a strong reason not to use DSEFS as the default file system, reconfigure Apache Spark to use a different file system. For example, to use a local file system, set the following properties in spark-daemon-defaults.conf:

spark.hadoop.fs.defaultFS=file:///
spark.hadoop.hive.metastore.warehouse.dir=file:///tmp/warehouse

How you start Apache Spark depends on the installation type and on whether you want to run in Spark mode or SearchAnalytics mode:

Package installations:

To start the Spark trackers on a cluster of analytics nodes, edit the /etc/default/dse file to set SPARK_ENABLED to 1.

When you start DataStax Enterprise as a service, the node is launched as a Spark node. You can enable additional components.

Mode	Option in /etc/default/dse	Description
Spark	SPARK_ENABLED=1	Start the node in Spark mode.
SearchAnalytics mode	SPARK_ENABLED=1 SEARCH_ENABLED=1	SearchAnalytics mode requires testing in your environment before it is used in production clusters. In `dse.yaml`, cql_solr_query_paging: driver is required.
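The edit to /etc/default/dse described above can also be scripted. The following is a minimal sketch, assuming the file consists of shell-style KEY=VALUE lines. The helper name set_option is hypothetical and not part of DSE. It works on the file contents as a string, so it can be tried without touching a real node.

```python
import re


def set_option(config_text: str, key: str, value: str) -> str:
    """Return config_text with KEY=value set.

    An existing uncommented KEY=... line is replaced in place;
    otherwise the setting is appended at the end.
    """
    line = f"{key}={value}"
    pattern = re.compile(rf"^{re.escape(key)}=.*$", re.MULTILINE)
    if pattern.search(config_text):
        # Use a function so the replacement is never treated as a regex template.
        return pattern.sub(lambda _match: line, config_text)
    separator = "" if config_text.endswith("\n") or not config_text else "\n"
    return f"{config_text}{separator}{line}\n"


# Hypothetical file contents, used only for illustration.
sample = "SPARK_ENABLED=0\nSEARCH_ENABLED=0\n"

# Spark mode.
spark_mode = set_option(sample, "SPARK_ENABLED", "1")

# SearchAnalytics mode: both options set to 1.
search_analytics_mode = set_option(spark_mode, "SEARCH_ENABLED", "1")
```

To apply the result, write the returned text back to /etc/default/dse with suitable privileges and restart the DSE service.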

Tarball installations:

To start the Spark trackers on a cluster of analytics nodes, use the -k option:

<installation_location>/bin/dse cassandra -k

Nodes started with -k are automatically assigned to the default Analytics datacenter if you do not configure a datacenter in the snitch property file.

You can enable additional components:

Mode	Option	Description
Spark	-k	Start the node in Spark mode.
SearchAnalytics mode	-k -s	In `dse.yaml`, cql_solr_query_paging: driver is required.

For example, to start a node in SearchAnalytics mode, use the -k and -s options:

<installation_location>/bin/dse cassandra -k -s

Starting the node with the Spark option starts a node that is designated as the master, as shown by the Analytics(SM) workload in the output of the dsetool ring command:

dsetool ring

Address          DC                   Rack         Workload             Graph  Status  State    Load             Owns                 Token                                        Health [0,1]
                                                                                                                                      0
10.200.175.149   Analytics            rack1        Analytics(SM)        no     Up      Normal   185 KiB          ?                    -9223372036854775808                         0.90
10.200.175.148   Analytics            rack1        Analytics(SW)        no     Up      Normal   194.5 KiB        ?                    0                                            0.90
NOTE: you must specify a keyspace to get ownership information.

Launch Apache Spark

After starting a Spark node, use dse commands to launch Apache Spark.

Usage:

Package installations: dse spark

Tarball installations: <installation_location>/bin/dse spark

You can use Apache Cassandra-specific properties to start Apache Spark. Apache Spark binds to the listen_address that is specified in cassandra.yaml.

DSE supports these commands for launching Apache Spark on the DSE command line:

dse spark

Enters the interactive Spark shell, which offers basic auto-completion.

Package installations: dse spark

Tarball installations: <installation_location>/bin/dse spark

dse spark-submit

Launches applications on a cluster, like Apache Spark's spark-submit. With this interface, you can use Spark cluster managers without separate configurations for each application. The syntax for package installations is:

dse spark-submit --class <class_name> <jar_file> <other_options>

For example, if you write a class that defines an option named d, enter the command as follows:

dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES

The JAR file can be located in a DSEFS directory. If the DSEFS cluster is secured, provide authentication credentials as described in DSEFS authentication.

The dse spark-submit command supports the same options as Apache Spark’s spark-submit. For example, to submit an application in cluster mode with the supervise option, which restarts the application in case of failure:

dse spark-submit --deploy-mode cluster --supervise --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES

The directory in which you run the dse Spark commands must be writable by the current user.

Internal authentication is supported.

Use the optional environment variables DSE_USERNAME and DSE_PASSWORD to increase security and prevent the user name and password from appearing in the Spark log files or in the process list on the Spark Web UI. To specify a user name and password using environment variables, add the following to your Bash .profile or .bash_profile:

export DSE_USERNAME=user
export DSE_PASSWORD=secret

These environment variables are supported for all Apache Spark and dse client-tool commands.

DataStax recommends using the environment variables instead of passing user credentials on the command line.
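When dse spark-submit is launched from a script rather than an interactive shell, the same recommendation can be followed by setting the variables on the child process only. The following is a minimal sketch. The function name spark_submit_invocation is hypothetical, and the class and JAR names are taken from the example earlier in this section. It only builds the argument list and environment, so the credentials never appear among the command-line arguments.

```python
import os
from typing import Dict, List, Mapping, Optional, Sequence, Tuple


def spark_submit_invocation(
    class_name: str,
    jar_file: str,
    app_args: Sequence[str],
    username: str,
    password: str,
    base_env: Optional[Mapping[str, str]] = None,
) -> Tuple[List[str], Dict[str, str]]:
    """Build the argv and environment for a dse spark-submit call."""
    # Copy the environment so the current process is left unchanged.
    env = dict(os.environ if base_env is None else base_env)
    env["DSE_USERNAME"] = username
    env["DSE_PASSWORD"] = password
    # Credentials are deliberately absent from the argument list.
    argv = ["dse", "spark-submit", "--class", class_name, jar_file, *app_args]
    return argv, env


argv, env = spark_submit_invocation(
    "com.datastax.HttpSparkStream",
    "target/HttpSparkStream.jar",
    ["-d", "3"],
    username="user",
    password="secret",
    base_env={},
)
```

On a node where the dse command is available, the pair can be passed to subprocess.run(argv, env=env, check=True). In practice, read the password from a secrets store or a prompt rather than hard-coding it as in this illustration.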

You can provide authentication credentials in several ways; see Credentials for authentication.

Specify Spark URLs

You do not need to specify the Spark Master address when starting Spark jobs with DSE. If you connect to any Spark node in a datacenter, DSE will automatically discover the Master address and connect the client to the Master.

Specify the URL for any Spark node using the following format:

dse://[<Spark node address>[:<port number>]]?[<parameter name>=<parameter value>;]<...>

By default, the URL is dse://?, which is equivalent to dse://localhost:9042. Any parameters you set in the URL override the configuration read from DSE’s Apache Spark configuration settings.

You can specify the work pool in which the application will be run by adding the workpool=<work pool name> as a URL parameter. For example, dse://1.1.1.1:123?workpool=workpool2.
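The URL format above can be assembled programmatically. The following is a minimal sketch. The function name build_dse_url is hypothetical, and joining multiple parameters with a semicolon is an assumption read from the format line. Values are inserted as given, without escaping.

```python
from typing import Mapping, Optional


def build_dse_url(
    host: Optional[str] = None,
    port: Optional[int] = None,
    params: Optional[Mapping[str, str]] = None,
) -> str:
    """Assemble a dse:// URL from an optional host, port, and parameters."""
    if port is not None and not host:
        raise ValueError("a port can only be given together with a host")
    authority = ""
    if host:
        authority = host if port is None else f"{host}:{port}"
    # Assumption: parameters are separated by ';' as the format line suggests.
    query = ";".join(f"{name}={value}" for name, value in (params or {}).items())
    return f"dse://{authority}?{query}"


print(build_dse_url())  # dse://?
print(build_dse_url("1.1.1.1", 123, {"workpool": "workpool2"}))
# dse://1.1.1.1:123?workpool=workpool2
```

The resulting string can be passed as the value of the --master option, for example dse spark --master followed by the URL.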

Valid parameters are CassandraConnectorConf settings without the spark.cassandra. prefix. For example, you can set the spark.cassandra.connection.local_dc option to dc2 by specifying dse://?connection.local_dc=dc2.

To specify multiple spark.cassandra.connection.host addresses for high availability in case the specified connection point is down, list them in the connection.host parameter. For example, dse://1.1.1.1:123?connection.host=1.1.2.2,1.1.3.3.

If the connection.host parameter is specified, the host provided in the standard URL is prepended to the list of hosts set in connection.host. If the port is specified in the standard URL, it overrides the port number set in the connection.port parameter.
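The merge rule above can be illustrated with a small parser. This is a sketch of the rule as described here, not DSE's own URL handling. The function name effective_contact_points is hypothetical, parameters are assumed to be separated by semicolons, and IPv6 addresses and the localhost:9042 default are deliberately not handled.

```python
from typing import List, Optional, Tuple


def effective_contact_points(url: str) -> Tuple[List[str], Optional[str]]:
    """Return (hosts, port) implied by a dse:// URL."""
    prefix = "dse://"
    if not url.startswith(prefix):
        raise ValueError("expected a URL starting with dse://")
    authority, _, query = url[len(prefix):].partition("?")

    # Collect name=value pairs from the query part.
    params = {}
    for pair in query.split(";"):
        if pair:
            name, _, value = pair.partition("=")
            params[name] = value

    url_host, _, url_port = authority.partition(":")
    hosts = [h for h in params.get("connection.host", "").split(",") if h]
    if url_host:
        # The host from the standard URL is prepended to connection.host.
        hosts.insert(0, url_host)
    # A port in the standard URL overrides connection.port.
    port = url_port or params.get("connection.port")
    return hosts, port


hosts, port = effective_contact_points(
    "dse://1.1.1.1:123?connection.host=1.1.2.2,1.1.3.3;connection.port=9999"
)
print(hosts, port)  # ['1.1.1.1', '1.1.2.2', '1.1.3.3'] 123
```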

Connection options when using dse spark-submit are retrieved in the following order: from the Master URL, then the Cassandra Spark connector options, then the DSE configuration files.
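One way to picture this lookup order is as a chain of dictionaries searched from first to last. The following is a minimal sketch. It assumes that a source earlier in the order wins when the same option appears more than once, and that all keys have already been normalized to the form without the spark.cassandra. prefix. The function name and sample values are hypothetical.

```python
from collections import ChainMap
from typing import Dict, Mapping


def resolve_connection_options(
    master_url_params: Mapping[str, str],
    connector_options: Mapping[str, str],
    dse_config: Mapping[str, str],
) -> Dict[str, str]:
    """Merge the three sources; earlier sources take precedence."""
    # ChainMap searches its mappings in the order given.
    return dict(ChainMap(master_url_params, connector_options, dse_config))


resolved = resolve_connection_options(
    master_url_params={"connection.local_dc": "dc2"},
    connector_options={"connection.local_dc": "dc1", "connection.port": "9042"},
    dse_config={"connection.host": "10.0.0.1", "connection.port": "9142"},
)
```

In this illustration, connection.local_dc resolves to dc2 from the URL, connection.port to 9042 from the connector options, and connection.host to 10.0.0.1 from the configuration files.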

Detect Spark application failures

DSE has a failure detector for Spark applications, which detects whether a running Spark application is dead or alive. If the application has failed, the application will be removed from the DSE Spark Resource Manager.

The failure detector works by keeping an open TCP connection from a DSE Spark node to the Spark Driver in the application. No data is exchanged, but regular TCP connection keep-alive control messages are sent and received. When the connection is interrupted, the failure detector will attempt to reacquire the connection every 1 second for the duration of the appReconnectionTimeoutSeconds timeout value (5 seconds by default). If it fails to reacquire the connection during that time, the application is removed.

To specify a custom timeout value, add appReconnectionTimeoutSeconds=<value> to the master URI when submitting the application. For example, to set the timeout value to 10 seconds:

dse spark --master dse://?appReconnectionTimeoutSeconds=10
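The reconnection behavior can be pictured as a bounded retry loop. The following sketch is illustrative only and is not DSE's implementation, which relies on TCP keep-alive messages. The try_connect callback is a hypothetical stand-in for reacquiring the connection to the Spark Driver, and timeout_seconds plays the role of appReconnectionTimeoutSeconds.

```python
import time
from typing import Callable


def reacquire_connection(
    try_connect: Callable[[], bool],
    timeout_seconds: float = 5.0,
    interval_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Retry try_connect at a fixed interval until it succeeds or time runs out.

    Returns True if the connection was reacquired, or False if the timeout
    elapsed, in which case the application would be removed.
    """
    deadline = clock() + timeout_seconds
    while True:
        if try_connect():
            return True
        # Stop if the next attempt would fall after the deadline.
        if clock() + interval_seconds > deadline:
            return False
        sleep(interval_seconds)
```

The sleep and clock parameters exist only so the loop can be exercised with a simulated clock instead of waiting in real time.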
