Setting Spark Cassandra Connector-specific properties
Use the Spark Cassandra Connector options to configure DataStax Enterprise Spark.
spark-defaults.conf
The default location of the spark-defaults.conf file depends on the type of installation:
- Package installations: /etc/dse/spark/spark-defaults.conf
- Tarball installations: installation_location/resources/spark/conf/spark-defaults.conf
cassandra.yaml
The location of the cassandra.yaml file depends on the type of installation:
- Package installations: /etc/dse/cassandra/cassandra.yaml
- Tarball installations: installation_location/resources/cassandra/conf/cassandra.yaml
Spark integration uses the Spark Cassandra Connector under the hood. You can use the configuration options defined in that project to configure DataStax Enterprise Spark. Spark recognizes system properties that have the spark. prefix and adds them to the configuration object implicitly upon creation. You can avoid adding system properties to the configuration object by passing false for the loadDefaults parameter in the SparkConf constructor.
The full list of parameters is included in the Spark Cassandra Connector documentation.
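A minimal Scala sketch of the loadDefaults behavior described above (the application name and connection host are illustrative placeholders):

import org.apache.spark.SparkConf

// Pass loadDefaults = false to skip the implicit pickup of spark.*
// system properties; every option must then be set explicitly.
val conf = new SparkConf(loadDefaults = false)
  .setAppName("example-app")
  .set("spark.cassandra.connection.host", "10.0.0.2")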
You pass settings for Spark, Spark Shell, and other DataStax Enterprise Spark built-in applications using the intermediate application spark-submit, described in the Spark documentation.
Configuring the Spark shell
dse spark [submission_arguments] [application_arguments]
where submission_arguments are:
[--help] [--verbose]
[--conf name=spark.value|sparkproperties.conf]
[--executor-memory memory]
[--jars path_to_additional_jars]
[--master dse://?appReconnectionTimeoutSeconds=secs]
[--properties-file path_to_properties_file]
[--total-executor-cores cores]
- --conf name=spark.value|sparkproperties.conf
- An arbitrary Spark option for the Spark configuration, prefixed by spark. Specify either a single name=spark.value pair or sparkproperties.conf, a configuration file of Spark options (see the sketch after this list).
- --executor-memory memory
- The amount of memory that each executor can consume for the application. Spark uses a 512 MB default. Specify the memory argument in JVM format using the k, m, or g suffix.
- --help
- Shows a help message that displays all options except DataStax Enterprise Spark shell options.
- --jars path_to_additional_jars
- A comma-separated list of paths to additional JAR files.
- --properties-file path_to_properties_file
- The location of the properties file that has the configuration settings. By default, Spark loads the settings from spark-defaults.conf.
- --total-executor-cores cores
- The total number of cores the application uses.
- --verbose
- Displays which arguments are recognized as Spark configuration options and which arguments are forwarded to the Spark shell.
- -i app_script_file
- Spark shell application argument that runs a script from the specified file.
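As a rough illustration, assuming the shell was started with dse spark --conf spark.cassandra.input.split.size=50000, the option is then visible on the shell's preconfigured SparkContext:

// Inside the dse spark REPL; sc is provided by the shell.
sc.getConf.get("spark.cassandra.input.split.size")
// res0: String = 50000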
Configuring Spark applications
dse spark-submit [submission_arguments] application_file [application_arguments]
where spark-submit submission_arguments are:
- --class class_name
- The full name of the application main class.
- --name appname
- The application name as displayed in the Spark web application.
- --py-files files
- A comma-separated list of the .zip, .egg, or .py files that are set on PYTHONPATH for Python applications.
- --files files
- A comma-separated list of files that are distributed among the executors and available for the application.
In general, Spark submission arguments are translated into system properties (-Dname=value) and other VM parameters like the classpath. The application arguments are passed directly to the application.
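For reference, a minimal main class that such a submission could target is sketched below. The package, class name, and argument handling are hypothetical, not part of DataStax Enterprise:

package com.example

import org.apache.spark.sql.SparkSession

// Submitted with, for example:
// dse spark-submit --class com.example.MyApp myApplication.jar some_argument
object MyApp {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("MyApp").getOrCreate()
    // application_arguments arrive here untouched
    println(s"Application arguments: ${args.mkString(", ")}")
    spark.stop()
  }
}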
Property list
When you run dse spark-submit on a node in your Analytics cluster, all of the following properties are set automatically, and the Spark Master is automatically detected. Only set the following properties if you need to override the automatically managed properties.
- spark.cassandra.connection.native.port
- Default = 9042. Port for native client protocol connections.
- spark.cassandra.connection.rpc.port
- Default = 9160. Port for Thrift connections.
- spark.cassandra.connection.host
- The host name or IP address to which the Thrift RPC service and native transport are bound. The native_transport_address property in cassandra.yaml, which is localhost by default, determines the default value of this property.
You can explicitly set the Spark Master address using the --master master_address parameter to dse spark-submit:
dse spark-submit --master master_address application_jar_file
dse spark-submit --master dse://10.0.0.2? myApplication.jar
The following properties can be overridden for performance or availability:
Connection properties
- spark.cassandra.session.consistency.level
- Default = LOCAL_ONE. The default consistency level for sessions that are accessed from the CassandraConnector object, as in CassandraConnector.withSessionDo. Note: This property does not affect the consistency level of DataFrame and RDD read and write operations. Use spark.cassandra.input.consistency.level for read operations and spark.cassandra.output.consistency.level for write operations.
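A brief Scala sketch of obtaining such a session (the query is a placeholder, and sc is an existing SparkContext):

import com.datastax.spark.connector.cql.CassandraConnector

// Sessions created through CassandraConnector use
// spark.cassandra.session.consistency.level by default.
CassandraConnector(sc.getConf).withSessionDo { session =>
  session.execute("SELECT release_version FROM system.local")
}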
Read properties
- spark.cassandra.input.split.size
- Default = 100000. Approximate number of rows in a single Spark partition. The higher the value, the fewer Spark tasks are created. Increasing the value too much may limit the parallelism level.
- spark.cassandra.input.fetch.size_in_rows
- Default = 1000. Number of rows being fetched per round-trip to the database. Increasing this value increases memory consumption. Decreasing the value increases the number of round-trips. In earlier releases, this property was spark.cassandra.input.page.row.size.
- spark.cassandra.input.consistency.level
- Default = LOCAL_ONE. Consistency level to use when reading.
- spark.cassandra.input.throughputMBPerSec
- Default = Unlimited. Threshold in MB per second to set a read throttle per task. This threshold helps manage resources when multiple jobs are running in parallel.
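For example, read tuning values such as these are commonly set on the SparkConf before the context is created; the values shown are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .set("spark.cassandra.input.split.size", "200000")       // fewer, larger Spark partitions
  .set("spark.cassandra.input.fetch.size_in_rows", "500")  // smaller pages, more round-trips
val sc = new SparkContext(conf)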
Write properties
You can set the following properties in SparkConf to fine-tune the saving process.
- spark.cassandra.output.batch.size.bytes
- Default = 1024. Maximum total size of a single batch in bytes.
- spark.cassandra.output.consistency.level
- Default = LOCAL_QUORUM. Consistency level to use when writing.
- spark.cassandra.output.concurrent.writes
- Default = 100. Maximum number of batches executed in parallel by a single Spark task.
- spark.cassandra.output.batch.size.rows
- Default = None. Number of rows per single batch. The default is unset, which means the connector adjusts the number of rows based on the amount of data in each row.
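As a sketch, write tuning plus a save might look like the following. The keyspace, table, and column names are placeholders, the table must already exist, and the values are illustrative only:

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

val conf = new SparkConf()
  .set("spark.cassandra.output.batch.size.rows", "50")            // cap rows per batch
  .set("spark.cassandra.output.consistency.level", "EACH_QUORUM") // stronger write consistency
val sc = new SparkContext(conf)

// Save (id, val) pairs to the placeholder table ks.tbl.
sc.parallelize(Seq((1, "a"), (2, "b")))
  .saveToCassandra("ks", "tbl", SomeColumns("id", "val"))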
See the Spark Cassandra Connector documentation for details on additional, low-level properties.