Starting Spark and Shark
How you start Spark and Shark depends on the installation type and whether you want to run in Hadoop mode:
- Installer-Services and Package installations: To start the Spark trackers on a cluster of
Analytics nodes, edit the /etc/default/dse file to set SPARK_ENABLED to 1.
When you start DataStax Enterprise as a service, the node is launched as a Spark node.
To start a node in Spark and Hadoop mode, edit the /etc/default/dse file to set HADOOP_ENABLED and SPARK_ENABLED to 1.
- Installer-No Services and Tarball installations: To start the Spark trackers on a cluster of
Analytics nodes, use the -k option:
$ dse cassandra -k
To start a node in Spark and Hadoop mode, use the -k and -t options:
$ dse cassandra -k -t
Nodes started with either -t or -k are automatically assigned to the default Analytics data center if you do not configure a data center in the snitch property file.
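For service and package installations, the edit to /etc/default/dse described above can be scripted. The following is a minimal sketch, assuming the file uses plain KEY=VALUE lines such as SPARK_ENABLED=0; set_flag is a hypothetical helper, not a DataStax Enterprise command.

```shell
#!/bin/sh
# Sketch: set KEY=1 in a shell-style defaults file (assumed KEY=VALUE layout).
# set_flag is a hypothetical helper, not part of DataStax Enterprise.
set_flag() {
    file=$1
    key=$2
    tmp=$(mktemp) || return 1
    if grep -q "^${key}=" "$file"; then
        # The key exists: rewrite its assignment to 1.
        sed "s/^${key}=.*/${key}=1/" "$file" > "$tmp"
    else
        # The key is missing: keep the file contents and append the assignment.
        cat "$file" > "$tmp"
        echo "${key}=1" >> "$tmp"
    fi
    # Copy the contents back so the original file keeps its owner and mode.
    cat "$tmp" > "$file" && rm -f "$tmp"
}

# On a real node (as root), Spark and Hadoop mode would be:
#   set_flag /etc/default/dse SPARK_ENABLED
#   set_flag /etc/default/dse HADOOP_ENABLED
```

Try it on a scratch copy of the file first, then restart the DataStax Enterprise service for the change to take effect.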
Starting the node with the Spark or Hadoop options starts a node designated as the job tracker, as shown by the Analytics(JT) workload in the output of the dsetool ring command:
$ dsetool ring
Note: Ownership information does not include topology, please specify a keyspace.
Address         DC         Rack   Workload       Status  State   Load      Owns    Token
10.160.137.165  Analytics  rack1  Analytics(JT)  Up      Normal  87.04 KB  33.33%  -9223372036854775808
10.168.193.41   Analytics  rack1  Analytics(TT)  Up      Normal  92.91 KB  33.33%  -3074457345618258603
10.176.83.32    Analytics  rack1  Analytics(TT)  Up      Normal  94.9 KB   33.33%  3074457345618258602
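To pick the job tracker out of the ring output in a script, you can filter on the Workload column. This is a sketch only: find_job_tracker is a hypothetical helper, and it assumes the whitespace-separated column order shown in the sample output (Address first, Workload fourth).

```shell
#!/bin/sh
# Sketch: print the address of every node whose Workload column is Analytics(JT).
# find_job_tracker is a hypothetical helper; it assumes the column order
# Address, DC, Rack, Workload, ... from the sample dsetool ring output.
find_job_tracker() {
    awk '$4 == "Analytics(JT)" { print $1 }'
}

# Typical use on a running cluster:
#   dsetool ring | find_job_tracker
```

The filter reads standard input, so it can also be run against saved output.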
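If you use sudo to start DataStax Enterprise, remove the ~/.spark directory before you restart the cluster:
$ sudo rm -r ~/.spark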
Launching Spark/Shark
After starting a Spark node, use dse commands to launch Spark or Shark. For example, on Linux from the installation directory use the following syntax:
$ bin/<dse command>
You can use the Cassandra-specific properties (-Dname=value) to start Spark and Shark.
DataStax Enterprise supports these commands for launching Spark and Shark on the DataStax Enterprise command line:
- dse spark
Enters the interactive Spark shell, which offers basic autocompletion.
- dse spark-class
Launches a Spark program in batch mode, similar to running a hadoop jar command to launch a MapReduce program.
- dse spark-with-cc
Enters the interactive Spark shell and generates the Cassandra context. This feature is deprecated and might be modified or removed in the future.
- dse spark-class-with-cc
Launches a Spark program in batch mode and generates the Cassandra context. This feature is deprecated and might be modified or removed in the future.
- dse shark
Launches the Shark shell.
- dse shark --service sharkserver -p <port>
Launches the Shark server.
- dse spark-schema
Generates a Cassandra context JAR. This feature is deprecated and might be modified or removed in the future.
Generating a Cassandra context from a file
You can specify the following additional options when using dse spark-schema:
- --force
Forces recompilation of all the sources in the Cassandra context.
- --output=...
Path to the output directory where the Cassandra context is generated. If not specified, the SPARK_CASSANDRA_CONTEXT_DIR environment variable is used.
- --script=...
Path to a CQL script. If specified, the context classes are generated from the schema provided in that CQL file rather than from the current schema in Cassandra. A running Cassandra instance is not required.
Using the dse spark-schema command, you can generate the Cassandra context to a specified directory. You can base the context on a script that contains arbitrary CQL statements and comments. However, only CREATE TABLE and USE statements are processed. Other statements are ignored and generate a warning message.
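As an illustration of the script-based workflow, the sketch below writes a small CQL file and passes it to dse spark-schema with the options described above. The keyspace demo, the table users, and the scratch paths are hypothetical, and the script only prints the command when dse is not on the PATH.

```shell
#!/bin/sh
# Sketch: generate a Cassandra context from a CQL file instead of the live schema.
# The keyspace "demo", the table "users", and both paths are hypothetical.
workdir=$(mktemp -d)
script="$workdir/schema.cql"
outdir="$workdir/context"
mkdir -p "$outdir"

# Only USE and CREATE TABLE statements are processed; comments are allowed.
cat > "$script" <<'EOF'
-- Schema used to generate the context classes.
USE demo;
CREATE TABLE users (id int PRIMARY KEY, name text);
EOF

if command -v dse >/dev/null 2>&1; then
    # Options as listed above: --script, --output, --force.
    dse spark-schema --script="$script" --output="$outdir" --force
else
    echo "dse not on PATH; would run: dse spark-schema --script=$script --output=$outdir --force"
fi
```

Because the context is built from the file, this can run on a machine where Cassandra is not started.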
Starting and stopping a Shark client
If you do not need to keep Shark memory tables persistent between sessions, start a standalone Shark client by using this dse command on the command line. On Ubuntu, for example:
$ dse shark
Starting the Shark Command Line Client
shark>
Use the -skipRddReload flag to skip reloading data into memory tables when you start Shark.
To stop the Shark client:
shark> exit;
You can also start Shark as a server to provide Shark service to clients.
Configuring the Shark server
To prevent accumulation of scheduler.ShuffleMapTasks and scheduler.ResultTask metadata objects, which ultimately cause an out-of-memory condition on the Shark server, set the spark.cleaner.ttl property in the shark-env file. The default is 12 hours.
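A shorter time-to-live might be set along the following lines. This is a sketch under two assumptions that you should verify against your installation: that the shark-env file is a shell script that passes JVM options through SPARK_JAVA_OPTS, and that spark.cleaner.ttl is expressed in seconds. The 6-hour value is only an example.

```shell
# Sketch for the shark-env file: expire scheduler metadata after 6 hours.
# Assumptions: JVM options are passed through SPARK_JAVA_OPTS, and
# spark.cleaner.ttl takes a value in seconds.
TTL_HOURS=6
TTL_SECONDS=$((TTL_HOURS * 3600))
SPARK_JAVA_OPTS="${SPARK_JAVA_OPTS:-} -Dspark.cleaner.ttl=${TTL_SECONDS}"
export SPARK_JAVA_OPTS
```

Choose a value longer than your longest-running job, since metadata older than the time-to-live is discarded.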
Starting the Shark server
You can keep Shark memory tables persistent and run applications between sessions if you use the Shark server instead of the client. To start the Shark server:
$ dse shark --service sharkserver -p <port number>
For example:
$ dse shark --service sharkserver -p 10000
Connect a Shark client to the server:
$ dse shark -h localhost -p 10000
[localhost:10000] shark>