Starting Spark and Shark

How you start Spark and Shark depends on the type of installation and on whether you want to run in Hadoop mode:

  • Installer-Services and Package installations: To start the Spark trackers on a cluster of Analytics nodes, edit the /etc/default/dse file to set SPARK_ENABLED to 1 (see the file excerpt after this list).

    When you start DataStax Enterprise as a service, the node is launched as a Spark node.

    To start a node in Spark and Hadoop mode, edit the /etc/default/dse file to set HADOOP_ENABLED and SPARK_ENABLED to 1.

  • Installer-No Services and Tarball installations: To start the Spark trackers on a cluster of Analytics nodes, use the -k option:
    $ dse cassandra -k
    To start a node in Spark and Hadoop mode, use the -k and -t options:
    $ dse cassandra -k -t

    Nodes started with either -t or -k are automatically assigned to the default Analytics data center if you do not configure a data center in the snitch property file.
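
For example, in a package installation, the relevant lines of /etc/default/dse for a node running in both Spark and Hadoop mode might look like this (a minimal excerpt; the rest of the file is omitted):

# /etc/default/dse (excerpt)
HADOOP_ENABLED=1
SPARK_ENABLED=1

Leave HADOOP_ENABLED at 0 for a Spark-only node.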

Starting a node with the Spark or Hadoop option starts a node designated as the job tracker, as shown by the Analytics(JT) workload in the output of the dsetool ring command:

$ dsetool ring
Note: Ownership information does not include topology, please specify a keyspace. 
Address          DC           Rack   Workload      Status  State    Load      Owns   Token                       
10.160.137.165   Analytics    rack1  Analytics(JT)    Up   Normal   87.04 KB  33.33% -9223372036854775808                        
10.168.193.41    Analytics    rack1  Analytics(TT)    Up   Normal   92.91 KB  33.33% -3074457345618258603                        
10.176.83.32     Analytics    rack1  Analytics(TT)    Up   Normal   94.9 KB   33.33% 3074457345618258602
If you use sudo to start DataStax Enterprise, remove the ~/.spark directory before restarting the cluster:
$ sudo rm -r ~/.spark

Launching Spark/Shark 

After starting a Spark node, use dse commands to launch Spark or Shark. On Linux, for example, use the following syntax from the installation directory:

$ bin/<dse command>

You can use Cassandra-specific properties (-Dname=value) when starting Spark and Shark.
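
For example, a property can be passed on the command line like this (the property name and value here are purely illustrative, not a documented setting):

$ dse spark -Dcassandra.connection.timeout=5000    # hypothetical property, for illustration only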

DataStax Enterprise supports these commands for launching Spark and Shark on the DataStax Enterprise command line:

dse spark
Enters the interactive Spark shell, which offers basic autocompletion.
dse spark-class
Launches a Spark program in batch mode, similar to running the hadoop jar command to launch a MapReduce program (see the example after this list).
dse spark-with-cc
Enters the interactive Spark shell and generates the Cassandra context. This feature is deprecated and might be modified or removed in the future.
dse spark-class-with-cc
Launches a Spark program in batch mode and generates the Cassandra context. This feature is deprecated and might be modified or removed in the future.
dse shark
Launches the Shark shell.
dse shark --service sharkserver -p <port>
Launches the Shark server.
dse spark-schema
Generates a Cassandra context JAR. This feature is deprecated and might be modified or removed in the future.
Usage:
$ export SPARK_CASSANDRA_CONTEXT_DIR=<some directory>; dse spark-schema
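
For example, a batch Spark program might be launched with dse spark-class like this (the application class name is hypothetical and assumed to be on the classpath):

$ dse spark-class com.example.WordCount    # hypothetical application class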

Generating a Cassandra context from a file 

You can specify the following additional options when using dse spark-schema:

  • --force

Forces recompilation of all the sources in the Cassandra context.

  • --output=...

Path to the output directory where the Cassandra context is generated. If not specified, the SPARK_CASSANDRA_CONTEXT_DIR environment variable is used.

  • --script=...

Path to a CQL script. If specified, the context classes are generated from the schema in that CQL file rather than from the current schema in Cassandra, so a running Cassandra instance is not required.

Using the dse spark-schema command, you can generate the Cassandra context to a specified directory. You can base the context on a script that contains arbitrary CQL statements and comments. However, only CREATE TABLE and USE statements are processed. Other statements are ignored and generate a warning message.
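
For example, a schema file and invocation might look like this (the file name, keyspace, table, and output path are illustrative):

-- schema.cql (hypothetical)
USE demo;
CREATE TABLE users (
  id int PRIMARY KEY,
  name text
);

$ dse spark-schema --output=/tmp/cassandra-context --script=schema.cql

Because the schema comes from the script, the command runs without a live Cassandra instance.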

Starting and stopping a Shark client 

If you do not need to keep Shark memory tables persistent between sessions, start a standalone Shark client using the dse command. On Ubuntu, for example:

$ dse shark

Use the -skipRddReload flag to skip reloading data into memory tables when you start Shark.
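
For example:

$ dse shark -skipRddReload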

The Shark command line prompt appears:

Starting the Shark Command Line Client

shark>

To stop the Shark client:

shark> exit;

You can also start Shark as a server to provide Shark service to clients.

Configuring the Shark server 

To prevent scheduler.ShuffleMapTask and scheduler.ResultTask metadata objects from accumulating and ultimately causing an out-of-memory condition on the Shark server, set the spark.cleaner.ttl property in the shark-env.sh file. The default is 12 hours.
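
For example, one way to set the property, assuming the shark-env.sh in your installation passes Java system properties through SPARK_JAVA_OPTS (the value shown is the 12-hour default, expressed in seconds):

export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.cleaner.ttl=43200"    # 43200 seconds = 12 hours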

Starting the Shark server 

You can keep Shark memory tables persistent and run applications between sessions if you use the Shark server instead of the client. To start the Shark server:

$ dse shark --service sharkserver -p <port number>

For example:

$ dse shark --service sharkserver -p 10000

Connect a Shark client to the server:

$ dse shark -h localhost -p 10000
    [localhost:10000] shark>