Setting Spark Cassandra Connector-specific properties

Spark integration uses the Spark Cassandra Connector under the hood. You can use the configuration options defined in that project to configure DataStax Enterprise Spark. Spark recognizes system properties that have the spark. prefix and adds the properties to the configuration object implicitly upon creation. You can avoid adding system properties to the configuration object by passing false for the loadDefaults parameter in the SparkConf constructor.
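
For example, a minimal Scala sketch that builds a configuration with loadDefaults set to false, so spark.-prefixed system properties are not picked up implicitly (the application name and node address are illustrative):

import org.apache.spark.SparkConf

// Pass loadDefaults = false so spark.* system properties are NOT added implicitly.
val conf = new SparkConf(false)
  .setAppName("example-app")                            // illustrative application name
  .set("spark.cassandra.connection.host", "10.0.0.2")   // illustrative node address
  .set("spark.cassandra.input.split.size", "100000")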

You pass settings for Spark, the Spark shell, and other DataStax Enterprise Spark built-in applications through the intermediate application spark-submit, described in the Spark documentation.

Configuring the Spark shell

Pass Spark configuration arguments using the following syntax:

dse spark [<submission_arguments>] [<application_arguments>]

where submission_arguments are:

[--help] [--verbose]
[--conf name=spark.value|<sparkproperties.conf>]
[--executor-memory <memory>]
[--jars <additional-jars>]
[--master dse://?appReconnectionTimeoutSeconds=<secs>]
[--properties-file <path_to_properties_file>]
[--total-executor-cores <cores>]
--conf name=spark.value|sparkproperties.conf

An arbitrary Spark option to add to the Spark configuration, prefixed by spark. Pass either:

  • name=spark.value - a single configuration property as a name-value pair

  • sparkproperties.conf - a configuration file that contains Spark properties

--executor-memory mem

The amount of memory that each executor can consume for the application. Spark uses a 512 MB default. Specify the memory argument in JVM format using the k, m, or g suffix.

--help

Shows a help message that displays all options except DataStax Enterprise Spark shell options.

--jars path_to_additional_jars

A comma-separated list of paths to additional JAR files.

--properties-file path_to_properties_file

The location of the properties file that has the configuration settings. By default, Spark loads the settings from spark-defaults.conf.

--total-executor-cores cores

The total number of cores the application uses.

--verbose

Displays which arguments are recognized as Spark configuration options and which arguments are forwarded to the Spark shell.

Spark shell application arguments:

-i app_script_file

Spark shell application argument that runs a script from the specified file.
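
For example, a hypothetical invocation that reserves 2 GB of memory per executor, overrides one connector property, and runs a local script (the property value and script path are illustrative):

dse spark --executor-memory 2g --conf spark.cassandra.input.split.size=50000 -i ./myScript.scala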

Configuring Spark applications

You pass the Spark submission arguments using the following syntax:

dse spark-submit [<submission_arguments>] <application_file> [<application_arguments>]

dse spark-submit accepts all of the submission_arguments listed above, plus these additional submission arguments:

--class class_name

The fully qualified name of the application's main class.

--name appname

The application name as displayed in the Spark web application.

--py-files files

A comma-separated list of the .zip, .egg, or .py files that are set on PYTHONPATH for Python applications.

--files files

A comma-separated list of files that are distributed among the executors and available for the application.

In general, Spark submission arguments are translated into system properties (-Dname=value) and other JVM parameters, such as the classpath. The application arguments are passed directly to the application.
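
For example, a hypothetical submission that names the main class, limits the executor cores, and passes one argument to the application (the class, JAR, and argument are illustrative):

dse spark-submit --class com.example.MyApp --total-executor-cores 4 myApplication.jar /path/to/input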

Property list

When you run dse spark-submit on a node in your Analytics cluster, all the following properties are set automatically, and the Spark Master is automatically detected. Only set the following properties if you need to override the automatically managed properties.

spark.cassandra.connection.native.port

Default = 9042. Port for native client protocol connections.

spark.cassandra.connection.rpc.port

Default = 9160. Port for Thrift connections.

spark.cassandra.connection.host

The host name or IP address to which the Thrift RPC service and native transport are bound. The native_transport_address property in the cassandra.yaml file, which is localhost by default, determines the default value of this property.

You can explicitly set the Spark Master address using the --master <master address> parameter to dse spark-submit:

dse spark-submit --master <master address> <application JAR file>

For example, if the Spark node is at 10.0.0.2:

dse spark-submit --master dse://10.0.0.2? myApplication.jar

The following properties can be overridden for performance or availability:

Connection properties

spark.cassandra.session.consistency.level

Default = LOCAL_ONE. The default consistency level for sessions that are obtained from the CassandraConnector object, as in CassandraConnector.withSessionDo (see the sketch after this list of connection properties).

This property does not affect the consistency level of DataFrame and RDD read and write operations. Use spark.cassandra.input.consistency.level for read operations and spark.cassandra.output.consistency.level for write operations.

spark.cassandra.connection.quietPeriodBeforeCloseMS

Default = 0. The time in seconds that must pass without any additional requests after requesting a connection close.

spark.cassandra.connection.timeoutBeforeCloseMS

Default = 15. The time in seconds for all in-flight connections to finish after requesting a connection close.
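
As an illustration of spark.cassandra.session.consistency.level, the following Scala sketch (the node address, keyspace, and table are hypothetical) executes a CQL statement through a session obtained from CassandraConnector.withSessionDo; the statement runs at the configured session consistency level:

import org.apache.spark.SparkConf
import com.datastax.spark.connector.cql.CassandraConnector

val conf = new SparkConf()
  .set("spark.cassandra.connection.host", "10.0.0.2")          // hypothetical node address
  .set("spark.cassandra.session.consistency.level", "QUORUM")  // applies to withSessionDo sessions

CassandraConnector(conf).withSessionDo { session =>
  // Runs at the session consistency level configured above, not the input/output levels.
  session.execute("SELECT * FROM my_keyspace.my_table LIMIT 10")  // hypothetical keyspace and table
}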

Read properties

spark.cassandra.input.split.size

Default = 100000. Approximate number of rows in a single Spark partition. The higher the value, the fewer Spark tasks are created. Increasing the value too much may limit the parallelism level.

spark.cassandra.input.fetch.size_in_rows

Default = 1000. Number of rows being fetched per round-trip to the database. Increasing this value increases memory consumption. Decreasing the value increases the number of round-trips. In earlier releases, this property was spark.cassandra.input.page.row.size.

spark.cassandra.input.consistency.level

Default = LOCAL_ONE. Consistency level to use when reading.

spark.cassandra.input.throughputMBPerSec

Default = Unlimited. Threshold in MB per second to set a read throttle per task. This threshold helps manage resources when multiple jobs are running in parallel.
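
A minimal Scala sketch that overrides the read properties above before loading a table as an RDD (the keyspace and table are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds cassandraTable to SparkContext

val conf = new SparkConf()
  .set("spark.cassandra.input.split.size", "50000")        // smaller Spark partitions, more tasks
  .set("spark.cassandra.input.fetch.size_in_rows", "500")  // fewer rows per round-trip
  .set("spark.cassandra.input.consistency.level", "ONE")

val sc = new SparkContext(conf)
val rdd = sc.cassandraTable("my_keyspace", "my_table")     // hypothetical keyspace and table
println(rdd.count())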

Write properties

You can set the following properties in SparkConf to fine-tune the saving process; a short sketch follows this list.

spark.cassandra.output.batch.size.bytes

Default = 1024. Maximum total size of a single batch in bytes.

spark.cassandra.output.consistency.level

Default = LOCAL_QUORUM. Consistency level to use when writing.

spark.cassandra.output.concurrent.writes

Default = 100. Maximum number of batches executed in parallel by a single Spark task.

spark.cassandra.output.batch.size.rows

Default = None. Number of rows per single batch. The default is unset, which means the connector will adjust the number of rows based on the amount of data in each row.
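
A minimal Scala sketch that sets the write properties above in SparkConf and then saves an RDD (the keyspace, table, and columns are hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._   // adds saveToCassandra to RDDs

val conf = new SparkConf()
  .set("spark.cassandra.output.batch.size.bytes", "2048")
  .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
  .set("spark.cassandra.output.concurrent.writes", "50")

val sc = new SparkContext(conf)
sc.parallelize(Seq((1, "first"), (2, "second")))
  .saveToCassandra("my_keyspace", "my_table", SomeColumns("id", "value"))  // hypothetical schema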

See the Spark Cassandra Connector documentation for details on additional, low-level properties.
