Starting Spark

How you start Spark depends on the installation type and whether you want to run in Hadoop mode:

  • Installer-Services and Package installations: To start the Spark trackers on a cluster of Analytics nodes, edit the /etc/default/dse file to set SPARK_ENABLED to 1.

    When you start DataStax Enterprise as a service, the node is launched as a Spark node.

    To start a node in Spark and Hadoop mode, edit the /etc/default/dse file to set HADOOP_ENABLED and SPARK_ENABLED to 1 (see the example settings after this list).

  • Installer-No Services and Tarball installations: To start the Spark trackers on a cluster of Analytics nodes, use the -k option:
    $ dse cassandra -k
    To start a node in Spark and Hadoop mode, use the -k and -t options:
    $ dse cassandra -k -t

    Nodes started with either -t or -k are automatically assigned to the default Analytics data center if you do not configure a data center in the snitch property file.
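
For Installer-Services and Package installations, the relevant lines in /etc/default/dse might look like the following after enabling both Spark and Hadoop mode. This is a minimal sketch; any other settings in the file are left unchanged:

HADOOP_ENABLED=1
SPARK_ENABLED=1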

Starting the node with the Spark or Hadoop options starts a node designated as the job tracker, as shown by the Analytics(JT) workload in the output of the dsetool ring command:

$ dsetool ring
Note: Ownership information does not include topology, please specify a keyspace. 
Address          DC           Rack   Workload      Status  State    Load      Owns   Token                       
10.160.137.165   Analytics    rack1  Analytics(JT)    Up   Normal   87.04 KB  33.33% -9223372036854775808                        
10.168.193.41    Analytics    rack1  Analytics(TT)    Up   Normal   92.91 KB  33.33% -3074457345618258603                        
10.176.83.32     Analytics    rack1  Analytics(TT)    Up   Normal   94.9 KB   33.33% 3074457345618258602
If you use sudo to start DataStax Enterprise, remove the ~/.spark directory before restarting the cluster:
$ sudo rm -r ~/.spark

Launching Spark 

After starting a Spark node, use dse commands to launch Spark. For example, on Linux, run the commands from the installation directory using the following syntax:

$ bin/<dse command>

You can use Cassandra-specific properties to start Spark.

DataStax Enterprise supports these commands for launching Spark on the DataStax Enterprise command line:

dse spark
Enters the interactive Spark shell, which offers basic autocompletion.
dse spark-submit
Launches applications on a cluster, like spark-submit. Replaces the deprecated dse spark-class command. Using this interface, you can use the Spark cluster managers without the need for separate configuration for each application. The syntax is:
$ dse spark-submit --class <class name> <jar file> <other_options>
For example, if you write a class that defines an option named d, enter the command as follows (a minimal application sketch appears at the end of this section):
$ dse spark-submit --class com.datastax.HttpSparkStream target/HttpSparkStream.jar -d $NUM_SPARK_NODES
dse spark-submit-with-cc
Launches a Spark program in batch mode and generates the Cassandra context. Replaces the deprecated dse spark-class-with-cc command. You can pass configuration arguments to Spark using this command.
dse spark-with-cc
Enters the interactive Spark shell and generates the Cassandra context. This feature is deprecated and might be modified or removed in the future. You can pass configuration arguments to Spark using this command.
dse spark-schema
Generates a Cassandra context JAR. This feature is deprecated and might be modified or removed in the future.
Usage:
$ export SPARK_CASSANDRA_CONTEXT_DIR=<some directory>; dse spark-schema

To use a user name and password to run an application, use the following syntax:

$ dse -u <username> -p <password> spark[-submit] 
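
For illustration, a minimal Spark application that could be launched with dse spark-submit might look like the following Scala sketch. The package, object name, and the handling of the -d option are assumptions for this example; this is not the source of the com.datastax.HttpSparkStream class shown above.

package com.example

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example application; names and option handling are assumptions.
object SimpleJob {
  def main(args: Array[String]): Unit = {
    // Read the value passed after -d on the dse spark-submit command line.
    val numNodes = args.sliding(2).collectFirst {
      case Array("-d", value) => value.toInt
    }.getOrElse(1)

    val sc = new SparkContext(new SparkConf().setAppName("SimpleJob"))

    // Run a trivial distributed computation as a smoke test.
    val count = sc.parallelize(1 to 1000, numNodes).map(_ * 2).count()
    println(s"Processed $count elements across $numNodes partitions")

    sc.stop()
  }
}

Build the object into a JAR and launch it with dse spark-submit --class com.example.SimpleJob <jar file> -d <number of nodes>.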

Generating a Cassandra context from a file 

You can specify the following additional options when using dse spark-schema:

  • --force

    Forces recompilation of all the sources in the Cassandra context.

  • --output=...

    The path to the output directory where the Cassandra context is generated. If not specified, the SPARK_CASSANDRA_CONTEXT_DIR environment variable is used.

  • --script=...

    The path to a CQL script. If specified, the context classes are generated from the schema in that CQL file rather than from the current schema in Cassandra; a running Cassandra instance is not required.

Using the dse spark-schema command, you can generate the Cassandra context to a specified directory. You can base the context on a script that contains arbitrary CQL statements and comments (see the example script below). However, only CREATE TABLE and USE statements are processed; other statements are ignored and generate a warning message.
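
For example, a CQL script passed with the --script option might look like the following sketch. The keyspace and table names are placeholders, not part of any real schema:

-- Hypothetical schema used only to illustrate the --script option.
USE demo_ks;

CREATE TABLE users (
  user_id uuid PRIMARY KEY,
  name text,
  email text
);

-- Any other statement types in the script, such as INSERT or CREATE INDEX,
-- are ignored and generate a warning when the context is generated.

To generate the context from such a script, combine the options described above:

$ dse spark-schema --output=<some directory> --script=<path to the script>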