Setting Cassandra Spark connector-specific properties
Spark integration uses the Apache Cassandra® Spark connector under the hood.
You can use the configuration options defined in that project to configure DataStax Enterprise Spark.
Spark recognizes system properties that have the spark. prefix and implicitly adds them to the configuration object when it is created.
You can avoid adding system properties to the configuration object by passing false for the loadDefaults parameter in the SparkConf constructor.
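For example, a minimal Scala sketch that ignores spark.-prefixed system properties by passing false for loadDefaults and then sets connector options explicitly (the application name, host, and values are illustrative):

import org.apache.spark.SparkConf

// loadDefaults = false: do not pick up spark.* system properties implicitly
val conf = new SparkConf(false)
  .setAppName("connector-config-example")              // hypothetical application name
  .set("spark.cassandra.connection.host", "10.0.0.2")  // illustrative address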
For the complete list of parameters, see the Apache Cassandra® Spark connector reference documentation.
You pass settings for Spark, Spark Shell, and other DataStax Enterprise Spark built-in applications using the intermediate application spark-submit, described in the Spark documentation.
Configuring the Spark shell
Pass Spark configuration arguments using the following syntax:
dse spark [submission_arguments] [application_arguments]
where submission_arguments are:
- --properties-file path_to_properties_file: The location of the properties file that has the configuration settings. By default, Spark loads the settings from spark-defaults.conf.

  The location of the spark-defaults.conf file depends on the type of installation:

  | Installation Type | Location |
  |---|---|
  | Package installations + Installer-Services installations | /etc/dse/spark/spark-defaults.conf |
  | Tarball installations + Installer-No Services installations | <installation_location>/resources/spark/conf/spark-defaults.conf |

- --executor-memory memory: How much memory to allocate on each machine for the application. You can provide the memory argument in JVM format using either the k, m, or g suffix.
- --total-executor-cores cores: The total number of cores the application uses.
- --conf name=value: An arbitrary Spark option to the Spark configuration, prefixed by spark.
- --help: Shows a help message that displays all options except DataStax Enterprise Spark shell options.
- --jars <additional-jars>: A comma-separated list of paths to additional JAR files.
- --verbose: Displays which arguments are recognized as Spark configuration options and which arguments are forwarded to the Spark shell.
Spark shell application arguments:
- -i file: Runs a script from the specified file.
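For example, a hypothetical invocation that starts the Spark shell with explicit executor memory, a total core limit, and one connector option set on the command line (all values are illustrative, not recommendations):

dse spark --executor-memory 2g --total-executor-cores 4 --conf spark.cassandra.input.split.size=50000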
Configuring Spark applications
You pass the Spark submission arguments using the following syntax:
dse spark-submit [submission_arguments] application_file [application_arguments]
where submission_arguments include all of the submission_arguments listed above, plus these additional spark-submit arguments:
- --class class_name: The full name of the application main class.
- --name name: The application name as displayed in the Spark web application.
- --py-files files: A comma-separated list of .zip, .egg, or .py files that are set on PYTHONPATH for Python applications.
- --files files: A comma-separated list of files that are distributed among the executors and available for the application.
In general, Spark submission arguments are translated into system properties -Dname=value and other VM parameters like classpath.
The application arguments are passed directly to the application.
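For example, a hypothetical submission that sets the main class and application name and passes one argument through to the application (the class, JAR, and path are illustrative):

dse spark-submit --class com.example.MyApp --name my-analytics-job myApplication.jar /path/to/input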
Property list
When you run dse spark-submit on a node in your Analytics cluster, all the following properties are set automatically, and the Spark Master is automatically detected.
Only set the following properties if you need to override the automatically managed properties.
- spark.cassandra.connection.native.port: Default = 9042. The port for native client protocol connections.
- spark.cassandra.connection.rpc.port: Default = 9160. The port for Thrift connections.
- spark.cassandra.connection.host: The host name or IP address to which the Thrift RPC service and native transport are bound. The rpc_address property in cassandra.yaml, which is localhost by default, determines the default value of this property.
The location of the cassandra.yaml file depends on the type of installation:

| Installation Type | Location |
|---|---|
| Package installations + Installer-Services installations | /etc/dse/cassandra/cassandra.yaml |
| Tarball installations + Installer-No Services installations | <installation_location>/resources/cassandra/conf/cassandra.yaml |
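For example, a hypothetical submission that overrides the connection host and native port on the command line (the address, port, and JAR name are illustrative):

dse spark-submit --conf spark.cassandra.connection.host=10.0.0.3 --conf spark.cassandra.connection.native.port=9043 myApplication.jar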
You can explicitly set the Spark Master address using the --master master_address parameter to dse spark-submit:
dse spark-submit --master master_address application_jar_file
For example, if the Spark node is at 10.0.0.2:
dse spark-submit --master dse://10.0.0.2? myApplication.jar
The following properties can be overridden for performance or availability:
Read properties
- spark.cassandra.input.split.size: Default = 100000. The approximate number of rows in a single Spark partition. The higher the value, the fewer Spark tasks are created. Increasing the value too much may limit the parallelism level.
- spark.cassandra.input.fetch.size_in_rows: Default = 1000. The number of rows fetched per round-trip to the database. Increasing this value increases memory consumption. Decreasing the value increases the number of round-trips. In earlier releases, this property was named spark.cassandra.input.page.row.size.
- spark.cassandra.input.consistency.level: Default = LOCAL_ONE. The consistency level to use when reading.
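For example, a minimal Scala sketch of overriding the read properties in SparkConf before creating the Spark context; the values are illustrative, not tuning recommendations:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.cassandra.input.split.size", "50000")        // fewer rows per partition, so more tasks
  .set("spark.cassandra.input.fetch.size_in_rows", "500")  // smaller pages: less memory, more round-trips
  .set("spark.cassandra.input.consistency.level", "LOCAL_QUORUM")
val sc = new SparkContext(conf)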
Write properties
You can set the following properties in SparkConf to fine-tune the saving process; see the example after this list.
- spark.cassandra.output.batch.size.bytes: Default = 1024. The maximum total size of a single batch, in bytes.
- spark.cassandra.output.consistency.level: Default = LOCAL_QUORUM. The consistency level to use when writing.
- spark.cassandra.output.concurrent.writes: Default = 5. The maximum number of batches executed in parallel by a single Spark task.
- spark.cassandra.output.batch.size.rows: Default = None. The number of rows per single batch. The default is unset, which means the connector adjusts the number of rows based on the amount of data in each row.
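For example, a minimal Scala sketch of tuning the write path through SparkConf; the keyspace, table, column names, and values are illustrative, and the example assumes the connector import that provides saveToCassandra is available on the classpath:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // provides saveToCassandra on RDDs

val conf = new SparkConf()
  .set("spark.cassandra.output.batch.size.rows", "100")    // fixed rows per batch instead of automatic sizing
  .set("spark.cassandra.output.concurrent.writes", "10")   // more batches in flight per task
  .set("spark.cassandra.output.consistency.level", "LOCAL_ONE")
val sc = new SparkContext(conf)

// Hypothetical keyspace and table; the column list must match the tuple arity.
sc.parallelize(Seq((1, "a"), (2, "b"))).saveToCassandra("my_ks", "my_table", SomeColumns("id", "value"))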
For information about additional, low-level properties, see the Apache Cassandra® Spark connector reference documentation.