Starting Spark

How you start Spark depends on the installation type and whether you want to run in Spark mode, Spark and Hadoop mode, or SearchAnalytics mode:

Installer-Services and Package installations
To start the Spark trackers on a cluster of analytics nodes, edit the /etc/default/dse file to set SPARK_ENABLED to 1.

When you start DataStax Enterprise as a service, the node is launched as a Spark node. You can enable additional components.

Mode               Option in /etc/default/dse    Description
Spark              SPARK_ENABLED=1               Start the node in Spark mode.
SearchAnalytics    SPARK_ENABLED=1               In dse.yaml, cql_solr_query_paging: driver is required.
                   SEARCH_ENABLED=1
Spark and Hadoop   SPARK_ENABLED=1               Spark and Hadoop mode should be used only for development purposes.
                   HADOOP_ENABLED=1
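For example, to run a node in Spark mode only on a package installation, the relevant entries in /etc/default/dse might look like the following (a minimal sketch; other entries in the file are left unchanged):

# /etc/default/dse (excerpt)
SPARK_ENABLED=1
HADOOP_ENABLED=0
SEARCH_ENABLED=0

Then start DataStax Enterprise as a service, for example with sudo service dse start.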
Installer-No Services and Tarball installations:
To start the Spark trackers on a cluster of analytics nodes, use the -k option:
dse cassandra -k
Note:

Nodes started with -t or -k are automatically assigned to the default Analytics datacenter if you do not configure a datacenter in the snitch property file.
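For example, with the GossipingPropertyFileSnitch the datacenter and rack are set in cassandra-rackdc.properties (the file and property names depend on the snitch you use; these values are illustrative):

dc=Analytics
rack=rack1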

You can enable additional components:
Mode               Option    Description
Spark              -k        Start the node in Spark mode.
SearchAnalytics    -k -s     In dse.yaml, cql_solr_query_paging: driver is required.
Spark and Hadoop   -k -t     Spark and Hadoop mode should be used only for development purposes.
For example:

To start a node in SearchAnalytics mode, use the -k -s options.

dse cassandra -k -s

SearchAnalytics mode is experimental and is not recommended for production clusters.

To start a node in Spark and Hadoop mode, use the -k -t options:
dse cassandra -k -t

Spark and Hadoop mode should only be used for development purposes.
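For SearchAnalytics nodes, the required dse.yaml setting mentioned above looks like the following excerpt (the rest of the file is unchanged):

cql_solr_query_paging: driver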

Starting the node with the Spark or Hadoop option starts a node that is designated as the Job Tracker, as shown by the Analytics(JT) workload in the output of the dsetool ring command:

dsetool ring

Note: Ownership information does not include topology, please specify a keyspace. 
Address          DC           Rack   Workload      Status  State    Load      Owns   Token                       
10.160.137.165   Analytics    rack1  Analytics(JT)    Up   Normal   87.04 KB  33.33% -9223372036854775808                        
10.168.193.41    Analytics    rack1  Analytics(TT)    Up   Normal   92.91 KB  33.33% -3074457345618258603                        
10.176.83.32     Analytics    rack1  Analytics(TT)    Up   Normal   94.9 KB   33.33% 3074457345618258602
The default location of the dsetool command depends on the type of installation:
Package installations /usr/bin/dsetool
Installer-Services installations /usr/bin/dsetool
Installer-No Services and Tarball installations install_location/bin/dsetool
If you use sudo to start DataStax Enterprise, remove the ~/.spark directory before you restart the cluster:
sudo rm -r ~/.spark

Launching Spark 

After starting a Spark node, use dse commands to launch Spark.

The default location of the dse tool depends on the type of installation:
Package installations /usr/bin/dse
Installer-Services installations /usr/bin/dse
Installer-No Services and Tarball installations install_location/bin/dse

You can use Cassandra-specific properties to start Spark. Spark binds to the listen_address that is specified in cassandra.yaml.

The location of the cassandra.yaml file depends on the type of installation:
Package installations /etc/dse/cassandra/cassandra.yaml
Tarball installations install_location/resources/cassandra/conf/cassandra.yaml
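For example, the cassandra.yaml entry that Spark binds to might look like the following (the address is illustrative):

listen_address: 10.160.137.165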

DataStax Enterprise supports these commands for launching Spark on the DataStax Enterprise command line:

dse spark
Enters the interactive Spark shell and offers basic autocompletion. The syntax is:
dse spark
dse spark-submit
Launches applications on a cluster, like the Spark spark-submit command, and replaces the deprecated dse spark-class command. With this interface you can use Spark cluster managers without the need for a separate configuration for each application. The syntax is:
dse spark-submit --class class_name jar_file other_options
For example, if you write a class that defines an option named d, enter the command as follows:
dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES
Note: The directory in which you run the dse Spark commands must be writable by the current user.

Internal authentication is supported.

You can use environment variables to increase security and prevent the user name and password from appearing in the Spark log files or in the process list on the Spark Web UI. To specify a user name and password using environment variables, use the following syntax:
DSE_USERNAME=user DSE_PASSWORD=secret dse spark[-submit]
These environment variables are supported for all Spark commands.
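For example, to launch the interactive Spark shell with credentials supplied through environment variables (the user name and password are illustrative):

DSE_USERNAME=cassandra DSE_PASSWORD=secret dse spark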
To specify a user name and password to run an application, use the following syntax:
$ dse [-f config_file] [-u username -p password] [-a jmx_username -b jmx_password]  spark[-submit] 
where:
  • -f config_file is the path to a configuration file that stores credentials. If not specified, ~/.dserc is used if it exists.
    The configuration file can contain Cassandra and JMX login credentials. For example:
    username=cassandra
    password=cassandra
    jmx_username=cassandra
    jmx_password=jmx
    The credentials in the configuration file are stored in clear text. DataStax recommends restricting access to this file only to the specific user.
  • --ssl enables SSL encryption.
  • dse -u username is the user name to authenticate with the dse command against the configured Cassandra authentication.
  • dsetool -l username is the user name to authenticate with the dsetool command against the configured Cassandra authentication.
  • -p password is the password to authenticate against the configured Cassandra user. If you do not provide a password on the command line, you are prompted to enter one.
  • -a jmx_username is the user name for authenticating with secure JMX.
  • -b jmx_password is the password for authenticating with secure JMX. If you do not provide a password on the command line, you are prompted to enter one.
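For example, to submit the application from the earlier example with Cassandra and JMX credentials passed on the command line (all credential values are illustrative):

dse -u cassandra -p cassandra -a cassandra -b jmx spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar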
Note: To increase security and prevent the user name and password from appearing in the Spark log files or in the process list on the Spark Web UI, DataStax recommends using the environment variables instead of passing user credentials on the command line.
The location of the dse.yaml file depends on the type of installation:
Installer-Services /etc/dse/dse.yaml
Package installations /etc/dse/dse.yaml
Installer-No Services install_location/resources/dse/conf/dse.yaml
Tarball installations install_location/resources/dse/conf/dse.yaml