Starting Spark
Before you start Spark, configure RPC permissions for the DseClientTool
object.
By default, DSEFS is required to execute Spark applications.
Do not disable DSEFS when Spark is enabled on a DSE node.
If there is a strong reason not to use DSEFS as the default file system, reconfigure Spark to use a different file system.
For example, to use a local file system, set the appropriate properties in the Spark configuration file.
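As a sketch only, such an override is typically expressed as a Spark property that passes a Hadoop file system setting through to the application. The property below follows the standard spark.hadoop. passthrough convention and is illustrative, not a verified DSE setting; consult the DSE documentation for the exact configuration file and property names.

```
# Illustrative sketch: standard Spark convention for passing a Hadoop
# file system setting through to the application. Verify the exact file
# and property names in the DSE documentation before using this.
spark.hadoop.fs.defaultFS=file:///
```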
How you start Spark depends on the installation type and whether you want to run in Spark mode or SearchAnalytics mode:
- Package installations:
-
To start the Spark trackers on a cluster of analytics nodes, edit the
/etc/default/dse
file to set SPARK_ENABLED to 1. When you start DataStax Enterprise as a service, the node is launched as a Spark node. You can enable additional components:
Spark mode: set SPARK_ENABLED=1. Starts the node in Spark mode.
SearchAnalytics mode: set SPARK_ENABLED=1 and SEARCH_ENABLED=1. Starts the node in SearchAnalytics mode. SearchAnalytics mode requires testing in your environment before it is used in production clusters. In
dse.yaml
, cql_solr_query_paging: driver is required.
- Tarball installations:
-
To start the Spark trackers on a cluster of analytics nodes, use the -k option:
<installation_location>/bin/dse cassandra -k
Nodes started with -k are automatically assigned to the default Analytics datacenter if you do not configure a datacenter in the snitch property file.
You can enable additional components:
Spark mode: use the -k option. Starts the node in Spark mode.
SearchAnalytics mode: use the -k and -s options. Starts the node in SearchAnalytics mode. In
dse.yaml
, cql_solr_query_paging: driver is required.
For example, to start a node in SearchAnalytics mode:
<installation_location>/bin/dse cassandra -k -s
Starting the node with the Spark option starts a node that is designated as the master, as shown by the Analytics(SM) workload in the output of the dsetool ring command:
dsetool ring
Address         DC         Rack   Workload       Graph  Status  State   Load       Owns  Token                 Health [0,1]
10.200.175.149  Analytics  rack1  Analytics(SM)  no     Up      Normal  185 KiB    ?     -9223372036854775808  0.90
10.200.175.148  Analytics  rack1  Analytics(SW)  no     Up      Normal  194.5 KiB  ?     0                     0.90
Note: You must specify a keyspace to get ownership information.
Launching Spark
After starting a Spark node, use dse
commands to launch Spark.
Usage:
Package installations: dse spark
Tarball installations: <installation_location>/bin/dse spark
You can use Cassandra-specific properties to start Spark.
Spark binds to the listen_address that is specified in cassandra.yaml.
DataStax Enterprise supports these commands for launching Spark on the DataStax Enterprise command line:
- dse spark
-
Starts the interactive Spark shell, which offers basic auto-completion.
Package installations:
dse spark
Tarball installations:
<installation_location>/bin/dse spark
- dse spark-submit
-
Launches applications on a cluster, like Spark's spark-submit. This interface allows you to use Spark cluster managers without separate configuration for each application. The syntax for package installations is:
dse spark-submit --class <class_name> <jar_file> <other_options>
For example, if you write a class that defines an option named d, enter the command as follows:
dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES
The JAR file can be located in a DSEFS directory. If the DSEFS cluster is secured, provide authentication credentials as described in DSEFS authentication.
The dse spark-submit command supports the same options as Apache Spark's spark-submit. For example, to submit an application in cluster mode with the supervise option, which restarts the application if it fails:
dse spark-submit --deploy-mode cluster --supervise --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES
Internal authentication is supported.
Use the optional environment variables DSE_USERNAME and DSE_PASSWORD to increase security and prevent the user name and password from appearing in the Spark log files or in the process list on the Spark Web UI.
To specify a user name and password using environment variables, add the following to your Bash .profile or .bash_profile:
export DSE_USERNAME=user
export DSE_PASSWORD=secret
These environment variables are supported for all Spark and dse client-tool commands.
DataStax recommends using the environment variables instead of passing user credentials on the command line.
You can provide authentication credentials in several ways; see Credentials for authentication.
Specifying Spark URLs
You do not need to specify the Spark Master address when starting Spark jobs with DSE. If you connect to any Spark node in a datacenter, DSE will automatically discover the Master address and connect the client to the Master.
Specify the URL for any Spark node using the following format:
dse://[<Spark node address>[:<port number>]]?[<parameter name>=<parameter value>;]<...>
By default the URL is dse://?, which is equivalent to dse://localhost:9042.
Any parameters you set in the URL will override the configuration read from DSE’s Spark configuration settings.
You can specify the work pool in which the application will run by adding workpool=<work pool name> as a URL parameter. For example: dse://1.1.1.1:123?workpool=workpool2.
Valid parameters are CassandraConnectorConf settings with the spark.cassandra. prefix stripped.
For example, you can set the spark.cassandra.connection.local_dc option to dc2 by specifying dse://?connection.local_dc=dc2.
To specify multiple spark.cassandra.connection.host addresses for high availability in case the specified connection point is down: dse://1.1.1.1:123?connection.host=1.1.2.2,1.1.3.3.
If the connection.host parameter is specified, the host provided in the standard URL is prepended to the list of hosts set in connection.host.
If the port is specified in the standard URL, it overrides the port number set in the connection.port parameter.
Connection options when using dse spark-submit are retrieved in the following order: from the Master URL, then the Spark Cassandra Connector options, then the DSE configuration files.
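The host-prepending and port-override rules above can be illustrated with a small parser. This is a hypothetical sketch, not DSE's actual resolver; the function name resolve_contact_points and its behavior for unlisted parameters are invented for illustration.

```python
from urllib.parse import urlparse

def resolve_contact_points(master_url, default_port=9042):
    """Illustrative sketch of the dse:// URL precedence rules:
    the URL host is prepended to any connection.host list, and a
    port in the URL overrides connection.port. (Hypothetical helper;
    not DSE's actual implementation.)"""
    parsed = urlparse(master_url)
    # dse:// URLs separate parameters with ';' rather than '&'
    params = dict(p.split("=", 1) for p in parsed.query.split(";") if "=" in p)
    hosts = [h for h in params.get("connection.host", "").split(",") if h]
    if parsed.hostname:
        hosts = [parsed.hostname] + hosts   # URL host is prepended
    if not hosts:
        hosts = ["localhost"]               # dse://? -> dse://localhost:9042
    port = parsed.port or int(params.get("connection.port", default_port))
    return hosts, port

print(resolve_contact_points("dse://1.1.1.1:123?connection.host=1.1.2.2,1.1.3.3"))
# -> (['1.1.1.1', '1.1.2.2', '1.1.3.3'], 123)
```

The example reproduces the documented behavior: the explicit host 1.1.1.1 lands first in the contact list, and port 123 from the URL wins over any connection.port parameter.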
Detecting Spark application failures
DSE has a failure detector for Spark applications, which detects whether a running Spark application is alive. If the application has failed, it is removed from the DSE Spark Resource Manager.
The failure detector works by keeping an open TCP connection from a DSE Spark node to the Spark Driver in the application.
No data is exchanged, but regular TCP connection keep-alive control messages are sent and received.
When the connection is interrupted, the failure detector attempts to reacquire the connection every second for the duration of the appReconnectionTimeoutSeconds timeout value (5 seconds by default).
If it fails to reacquire the connection during that time, the application is removed.
A custom timeout value is specified by adding appReconnectionTimeoutSeconds=<value> to the master URI when submitting the application.
For example, to set the timeout value to 10 seconds:
dse spark --master dse://?appReconnectionTimeoutSeconds=10
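The reconnect loop described above can be modeled in a few lines. This is an illustrative sketch of the retry-until-timeout behavior, not DSE's implementation; the function name application_alive, the probe callable, and the interval parameter are invented for this example.

```python
import time

def application_alive(probe, timeout_seconds=5, interval_seconds=1.0):
    """Illustrative model of the failure detector's reconnect loop.

    `probe` is a callable that returns True if a TCP connection to the
    Spark Driver can be re-established. The loop retries once per
    interval until `timeout_seconds` elapses; if no attempt succeeds,
    the application is considered dead and would be removed.
    (Hypothetical sketch; not DSE's actual implementation.)
    """
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if probe():
            return True             # connection reacquired: app is alive
        time.sleep(interval_seconds)
    return False                    # timeout exceeded: remove application
```

With the default values this mirrors the documented behavior: roughly one reconnection attempt per second for 5 seconds (or the custom appReconnectionTimeoutSeconds value) before the application is declared dead.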