Starting Apache Spark™
Before you start Spark, configure authorization for remote procedure calls (RPC) for CQL execution on the DseClientTool
object, as described in Authorizing remote procedure calls for CQL execution.
RPC permission for the DseClientTool object is required.
By default, DSEFS is required to execute Spark applications.
Do not disable DSEFS while Spark is enabled on a DSE node.
If there is a strong reason not to use DSEFS as the default file system, reconfigure Spark to use a different file system.
For example, to use a local file system, set the file system properties in spark-daemon-defaults.conf.
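A minimal sketch of such a change, assuming Spark's standard convention of forwarding Hadoop options through the spark.hadoop. prefix (verify the exact property names for your DSE version):

```
# spark-daemon-defaults.conf -- assumed example, not a definitive setting.
# Point the default file system at the node-local disk instead of DSEFS:
spark.hadoop.fs.defaultFS=file:///
```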
Where is the spark-daemon-defaults.conf file?
The location of the spark-daemon-defaults.conf file depends on the type of installation:

| Installation Type | Location |
|---|---|
| Package installations + Installer-Services installations | /etc/dse/spark/spark-daemon-defaults.conf |
| Tarball installations + Installer-No Services installations | <installation_location>/resources/spark/conf/spark-daemon-defaults.conf |
How you start Spark depends on the type of installation and on whether you want to run the node in Spark mode or SearchAnalytics mode:
- Package and Installer-Services installations
-
To start the Spark trackers on a cluster of analytics nodes, edit the /etc/default/dse file and set SPARK_ENABLED to 1. When you start DataStax Enterprise as a service, the node is launched as a Spark node. You can enable additional components; a sketch of the service workflow follows this list.

| Mode | Option in /etc/default/dse | Description |
|---|---|---|
| Spark | SPARK_ENABLED=1 | Start the node in Spark mode. |
| SearchAnalytics mode | SPARK_ENABLED=1 and SEARCH_ENABLED=1 | Start the node in SearchAnalytics mode. In dse.yaml, cql_solr_query_paging: driver is required. |

SearchAnalytics mode requires testing in your environment before it is used in production clusters.

Where is the dse.yaml file?
The location of the dse.yaml file depends on the type of installation:

| Installation Type | Location |
|---|---|
| Package installations + Installer-Services installations | /etc/dse/dse.yaml |
| Tarball installations + Installer-No Services installations | <installation_location>/resources/dse/conf/dse.yaml |
- Tarball and Installer-No Services installations
-
To start the Spark trackers on a cluster of analytics nodes, use the -k option:

```
<installation_location>/bin/dse cassandra -k
```

Nodes started with -k are automatically assigned to the default Analytics datacenter if you do not configure a datacenter in the snitch property file. You can enable additional components:

| Mode | Option | Description |
|---|---|---|
| Spark | -k | Start the node in Spark mode. |
| SearchAnalytics mode | -k -s | Start the node in SearchAnalytics mode. In dse.yaml, cql_solr_query_paging: driver is required. |
For example, to start a node in SearchAnalytics mode, use the -k and -s options:

```
<installation_location>/bin/dse cassandra -k -s
```
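For package installations, a minimal sketch of the service workflow referenced above, assuming the default /etc/default/dse location and a service-managed DSE:

```
# Enable Spark mode in /etc/default/dse (add SEARCH_ENABLED=1 for SearchAnalytics):
echo 'SPARK_ENABLED=1' | sudo tee -a /etc/default/dse
# Restart DSE as a service so the node relaunches as a Spark node:
sudo service dse restart
```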
Starting the node with the Spark option starts a node that is designated as the master, as shown by the Analytics(SM) workload in the output of the dsetool ring command:

```
dsetool ring
Address         DC         Rack   Workload       Graph  Status  State   Load       Owns  Token                 Health [0,1]
10.200.175.149  Analytics  rack1  Analytics(SM)  no     Up      Normal  185 KiB    ?     -9223372036854775808  0.90
10.200.175.148  Analytics  rack1  Analytics(SW)  no     Up      Normal  194.5 KiB  ?     0                     0.90
Note: you must specify a keyspace to get ownership information.
```
Launching Spark
After starting a Spark node, use dse
commands to launch Spark.
You can use Cassandra-specific properties to start Spark.
Spark binds to the listen_address that is specified in cassandra.yaml.
Where is the cassandra.yaml file?
The location of the cassandra.yaml file depends on the type of installation:

| Installation Type | Location |
|---|---|
| Package installations + Installer-Services installations | /etc/dse/cassandra/cassandra.yaml |
| Tarball installations + Installer-No Services installations | <installation_location>/resources/cassandra/conf/cassandra.yaml |
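For example, a minimal cassandra.yaml excerpt (the address is hypothetical); Spark on this node binds to whatever listen_address is set here:

```
# cassandra.yaml -- hypothetical address, shown for illustration only
listen_address: 10.200.175.149
```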
DataStax Enterprise supports these commands for launching Spark on the DataStax Enterprise command line:
dse spark
-
Enters the interactive Spark shell and offers basic auto-completion (see the shell sketch after this list).
dse spark-submit
-
Launches applications on a cluster, like Spark's spark-submit command. With this interface you can use Spark cluster managers without configuring each application separately. The syntax for Package and Installer-Services installations is:

```
dse spark-submit --class class_name jar_file other_options
```

For example, if you write a class that defines an option named d, enter the command as follows:

```
dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES
```

The JAR file can be located in a DSEFS directory. If the DSEFS cluster is secured, provide authentication credentials as described in DSEFS authentication.
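As a sketch, a submission that loads the application JAR directly from DSEFS; the dsefs:/// path and JAR location are hypothetical:

```
dse spark-submit --class com.datastax.HttpSparkStream dsefs:///jars/HttpSparkStream.jar -d $NUM_SPARK_NODES
```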
The dse spark-submit command supports the same options as Apache Spark's spark-submit. For example, to submit an application in cluster mode with the supervise option, which restarts the driver in case of failure:

```
dse spark-submit --deploy-mode cluster --supervise --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES
```
Unlike the standard behavior of Spark's status and kill options, in DSE deployments these options do not require the Spark Master IP address:

```
dse spark-submit --kill driver_id [--master master_ip_address]
dse spark-submit --status driver_id [--master master_ip_address]
```

For example, to kill the driver of a Spark application running in the DSE cluster:

```
dse spark-submit --kill driver-20180726160353-0019
```

To get the status of a Spark application running in the DSE cluster:

```
dse spark-submit --status driver-20180726160353-0019
```
The directory in which you run the dse Spark commands must be writable by the current user.
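As referenced in the dse spark entry above, a minimal interactive session sketch; the keyspace and table names are hypothetical, and the shell is assumed to preload a Cassandra-aware SparkContext as sc:

```
dse spark
scala> val rdd = sc.cassandraTable("my_keyspace", "my_table")  // hypothetical keyspace and table
scala> rdd.count                                               // distributed row count over the table
```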
Internal authentication is supported.
Use the optional environment variables DSE_USERNAME and DSE_PASSWORD to increase security and to prevent the user name and password from appearing in the Spark log files or in the process list on the Spark Web UI.
To specify a user name and password using environment variables, add the following to your Bash .profile or .bash_profile:

```
export DSE_USERNAME=user
export DSE_PASSWORD=secret
```
These environment variables are supported for all Spark and dse client-tool
commands.
DataStax recommends using the environment variables instead of passing user credentials on the command line.
Authentication credentials can be provided in several ways; see Connecting to authentication enabled clusters.
Specifying Spark URLs
You do not need to specify the Spark Master address when starting Spark jobs with DSE. If you connect to any Spark node in a datacenter, DSE automatically discovers the Master address and connects the client to the Master.
Specify the URL for any Spark node using the following format:

```
dse://[Spark node address[:port number]]?[parameter name=parameter value;]...
```
By default the URL is dse://?
, which is equivalent to dse://localhost:9042
.
Any parameters you set in the URL override the configuration read from DSE’s Spark configuration settings.
Valid parameters are CassandraConnectorConf
settings with the spark.cassandra.
prefix stripped.
For example, you can set the spark.cassandra.connection.local_dc
option to dc2
by specifying dse://?connection.local_dc=dc2
.
Or, to specify multiple spark.cassandra.connection.host addresses for high availability in case the specified connection point is down: dse://1.1.1.1:123?connection.host=1.1.2.2,1.1.3.3.
If the connection.host
parameter is specified, the host provided in the standard URL is prepended to the list of hosts set in connection.host
.
If the port is specified in the standard URL, it overrides the port number set in the connection.port
parameter.
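A hypothetical URL that illustrates both rules (all addresses and ports are made up):

```
# Contact points become 1.1.1.1, 1.1.2.2, and 1.1.3.3, all on port 9043:
# the URL host is prepended to the connection.host list, and the URL
# port 9043 overrides the connection.port parameter.
dse spark --master dse://1.1.1.1:9043?connection.host=1.1.2.2,1.1.3.3;connection.port=9042
```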
Connection options specified with dse spark-submit are retrieved in the following order: first from the Master URL, then from the Spark Cassandra Connector options, and finally from the DSE configuration files.
Detecting Spark application failures
DSE includes a failure detector for Spark applications, which determines whether a running Spark application is still alive. If the application has failed, it is removed from the DSE Spark Resource Manager.
The failure detector works by keeping an open TCP connection from a DSE Spark node to the Spark Driver in the application.
No data is exchanged, but regular TCP keep-alive control messages are sent and received.
When the connection is interrupted, the failure detector tries to reacquire it every second for the duration of the appReconnectionTimeoutSeconds timeout value (5 seconds by default).
If it cannot reacquire the connection during that time, the application is removed.
Specify a custom timeout value by adding appReconnectionTimeoutSeconds=value to the master URI when submitting the application. For example, to set the timeout value to 10 seconds:

```
dse spark --master dse://?appReconnectionTimeoutSeconds=10
```