Starting Spark

How you start Spark depends on the installation type and whether you want to run in Spark mode, Spark and Hadoop mode, or SearchAnalytics mode:

Installer-Services and Package installations
To start the Spark trackers on a cluster of analytics nodes, edit the /etc/default/dse file to set SPARK_ENABLED to 1.

When you start DataStax Enterprise as a service, the node is launched as a Spark node. You can enable additional components.

Mode               Option in /etc/default/dse    Description
Spark              SPARK_ENABLED=1               Start the node in Spark mode.
SearchAnalytics    SPARK_ENABLED=1               In dse.yaml, cql_solr_query_paging: driver is required.
                   SEARCH_ENABLED=1
Spark and Hadoop   SPARK_ENABLED=1               Spark and Hadoop mode should be used only for development purposes.
                   HADOOP_ENABLED=1
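For example, to run a node in Spark mode only on a package installation, the relevant entries in /etc/default/dse might look like the following (a minimal sketch; other entries in the file are left unchanged):

# /etc/default/dse (excerpt)
SPARK_ENABLED=1
HADOOP_ENABLED=0
SEARCH_ENABLED=0

Then start DataStax Enterprise as a service, for example with sudo service dse start.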
Installer-No Services and Tarball installations:
To start the Spark trackers on a cluster of analytics nodes, use the -k option:
dse cassandra -k
Note:

Nodes started with -t or -k are automatically assigned to the default Analytics datacenter if you do not configure a datacenter in the snitch property file.
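For example, with the GossipingPropertyFileSnitch the datacenter and rack are set in cassandra-rackdc.properties (the file and property names depend on the snitch you use; these values are illustrative):

dc=Analytics
rack=rack1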

You can enable additional components:
Mode               Option    Description
Spark              -k        Start the node in Spark mode.
SearchAnalytics    -k -s     In dse.yaml, cql_solr_query_paging: driver is required.
Spark and Hadoop   -k -t     Spark and Hadoop mode should be used only for development purposes.
For example:

To start a node in SearchAnalytics mode, use the -k -s options.

dse cassandra -k -s

SearchAnalytics mode is experimental and is not recommended for production clusters.

To start a node in Spark and Hadoop mode, use the -k -t options:
dse cassandra -k -t

Spark and Hadoop mode should only be used for development purposes.
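For SearchAnalytics nodes, the required dse.yaml setting mentioned above looks like the following excerpt (the rest of the file is unchanged):

cql_solr_query_paging: driver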

Starting the node with the Spark or Hadoop option starts a node that is designated as the Job Tracker, as shown by the Analytics(JT) workload in the output of the dsetool ring command:

dsetool ring

Note: Ownership information does not include topology, please specify a keyspace. 
Address          DC           Rack   Workload      Status  State    Load      Owns   Token                       
10.160.137.165   Analytics    rack1  Analytics(JT)    Up   Normal   87.04 KB  33.33% -9223372036854775808                        
10.168.193.41    Analytics    rack1  Analytics(TT)    Up   Normal   92.91 KB  33.33% -3074457345618258603                        
10.176.83.32     Analytics    rack1  Analytics(TT)    Up   Normal   94.9 KB   33.33% 3074457345618258602
The default location of the dsetool command depends on the type of installation:
Package installations /usr/bin/dsetool
Installer-Services installations /usr/bin/dsetool
Installer-No Services and Tarball installations install_location/bin/dsetool
If you use sudo to start DataStax Enterprise, remove the ~/.spark directory before you restart the cluster:
sudo rm -r ~/.spark

Launching Spark 

After starting a Spark node, use dse commands to launch Spark.

The default location of the dse tool depends on the type of installation:
Package installations /usr/bin/dse
Installer-Services installations /usr/bin/dse
Installer-No Services and Tarball installations install_location/bin/dse

You can use Cassandra-specific properties to start Spark. Spark binds to the listen_address that is specified in cassandra.yaml.

The location of the cassandra.yaml file depends on the type of installation:
Package installations /etc/dse/cassandra/cassandra.yaml
Tarball installations install_location/resources/cassandra/conf/cassandra.yaml
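For example, the cassandra.yaml entry that Spark binds to might look like the following (the address is illustrative):

listen_address: 10.160.137.165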

DataStax Enterprise supports these commands for launching Spark on the DataStax Enterprise command line:

dse spark
Enters the interactive Spark shell and offers basic autocompletion. The syntax is:
dse spark
dse spark-submit
Launches applications on a cluster, like the Spark spark-submit command, and replaces the deprecated dse spark-class command. With this interface you can use Spark cluster managers without the need for a separate configuration for each application. The syntax is:
dse spark-submit --class class_name jar_file other_options
For example, if you write a class that defines an option named d, enter the command as follows:
dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES
Note: The directory in which you run the dse Spark commands must be writable by the current user.

Internal authentication is supported.

You can use environment variables to increase security and prevent the user name and password from appearing in the Spark log files or in the process list on the Spark Web UI. To specify a user name and password using environment variables, use the following syntax:
DSE_USERNAME=user DSE_PASSWORD=secret dse spark[-submit]
These environment variables are supported for all Spark commands.
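For example, to launch the interactive Spark shell with credentials supplied through environment variables (the user name and password are illustrative):

DSE_USERNAME=cassandra DSE_PASSWORD=secret dse spark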
To specify a user name and password to run an application, use the following syntax:
$ dse [-f config_file] [-u username -p password] [-a jmx_username -b jmx_password]  spark[-submit] 
where:
  • -f config_file is the path to a configuration file that stores credentials. If not specified, ~/.dserc is used if it exists.
    The configuration file can contain Cassandra and JMX login credentials. For example:
    username=cassandra
    password=cassandra
    jmx_username=cassandra
    jmx_password=jmx
    The credentials in the configuration file are stored in clear text. DataStax recommends restricting access to this file only to the specific user.
  • --ssl enables SSL encryption.
  • dse -u username is the user name to authenticate with the dse command against the configured Cassandra authentication.
  • dsetool -l username is the user name to authenticate with the dsetool command against the configured Cassandra authentication.
  • -p password is the password to authenticate against the configured Cassandra user. If you do not provide a password on the command line, you are prompted to enter one.
  • -a jmx_username is the user name for authenticating with secure JMX.
  • -b jmx_password is the password for authenticating with secure JMX. If you do not provide a password on the command line, you are prompted to enter one.
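For example, to submit the application from the earlier example with Cassandra and JMX credentials passed on the command line (all credential values are illustrative):

dse -u cassandra -p cassandra -a cassandra -b jmx spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar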
Note: To increase security and prevent the user name and password from appearing in the Spark log files or in the process list on the Spark Web UI, DataStax recommends using the environment variables instead of passing user credentials on the command line.
The location of the dse.yaml file depends on the type of installation:
Installer-Services /etc/dse/dse.yaml
Package installations /etc/dse/dse.yaml
Installer-No Services install_location/resources/dse/conf/dse.yaml
Tarball installations install_location/resources/dse/conf/dse.yaml