Setting Cassandra-specific properties

Use the Spark Cassandra Connector options to configure DataStax Enterprise Spark.

DataStax Enterprise Spark integration uses the Spark Cassandra Connector under the hood, so you can use the configuration options defined in that project to configure DataStax Enterprise Spark. Spark recognizes system properties that have the spark. prefix and implicitly adds them to the configuration object when it is created. To prevent system properties from being added to the configuration object, pass false for the loadDefaults parameter in the SparkConf constructor.
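For example, here is a minimal sketch, assuming the Spark Cassandra Connector is on the classpath (the application name and host value are illustrative), of building a configuration that skips the spark. system properties and sets a connector option explicitly:

import org.apache.spark.SparkConf

// Pass loadDefaults = false to keep spark.* system properties out of the configuration
val conf = new SparkConf(loadDefaults = false)
  .setAppName("connector-config-example")               // illustrative application name
  .set("spark.cassandra.connection.host", "127.0.0.1")  // connector option set explicitly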

The full list of parameters is included in the Spark Cassandra Connector documentation.

You pass settings for Spark, the Spark shell, and other DataStax Enterprise Spark built-in applications through the intermediate application spark-submit, which is described in the Spark documentation.

Configuring the Spark shell

Pass Spark configuration arguments using the following syntax:
dse spark [submission_arguments] [application_arguments]
where submission_arguments are:
  • --properties-file path_to_properties_file

    The location of the properties file that has the configuration settings. By default, Spark loads the settings from conf/spark-defaults.conf.

  • --executor-memory memory

    How much memory to allocate on each machine for the application. You can provide the memory argument in JVM format using either the k, m, or g suffix.

  • --total-executor-cores cores

    The total number of cores the application uses.

  • --conf name=value

    An arbitrary Spark configuration option in name=value format. The option name must start with the spark. prefix.

  • --help

    Shows a help message that displays all options except DataStax Enterprise Spark shell options.

  • --jars additional_jars

    A comma-separated list of paths to additional JAR files.

  • --verbose

    Displays which arguments are recognized as Spark configuration options and which arguments are forwarded to the Spark shell.

Spark shell application arguments:
  • -i file

    Runs a script from the specified file.
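For example, the following command starts the Spark shell with 2 GB of executor memory and an explicit properties file (the memory value and file path are illustrative):

dse spark --executor-memory 2g --properties-file /path/to/spark-defaults.conf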

Configuring Spark applications

You pass the Spark submission arguments using the following syntax:
dse spark-submit [submission_arguments] application_file [application_arguments]
where submission_arguments include all of the submission_arguments described above, plus the following spark-submit-specific arguments:
  • --class class_name

    The full name of the application main class.

  • --name name

    The application name as displayed in the Spark web application.

  • --py-files files

    A comma-separated list of the .zip, .egg, or .py files that are set on PYTHONPATH for Python applications.

  • --files files

    A comma-separated list of files that are distributed among the executors and available for the application.

  • --master master_URL

    The URL of the Spark Master.

    If you run dse spark-submit from a node in your Analytics cluster, the master URL is set automatically. If you run dse spark-submit from a remote node, set the spark.cassandra.connection.host property and the master URL is set automatically.
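For example, a typical submission looks like the following (the class name, JAR file, and application arguments are illustrative):

dse spark-submit --class com.example.WordCount --executor-memory 1g wordcount.jar /input/path /output/path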

In general, Spark submission arguments are translated into system properties (-Dname=value) and other JVM parameters, such as the classpath. The application arguments are passed directly to the application.

Property list 

When you run dse spark-submit on a node in your Analytics cluster, all the properties are set automatically. Only set the following properties if you need to override the automatically managed properties.

spark.cassandra.connection.native.port
Default = 9042. Port for native client protocol connections.
spark.cassandra.connection.rpc.port
Default = 9160. Port for Thrift connections.
spark.cassandra.connection.host
The host name or IP address to which the Thrift RPC service and native transport are bound. The rpc_address property in the cassandra.yaml file, which is localhost by default, determines the default value of this property.
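For example, when submitting from a remote node, one way to point the application at the cluster is to override the connection host on the command line (the address, class, and JAR are illustrative):

dse spark-submit --conf spark.cassandra.connection.host=10.10.1.5 --class com.example.WordCount wordcount.jar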

The following Cassandra-specific properties can be overridden for performance or availability:

spark.cassandra.keyspace
The default keyspace for Spark SQL.

Read properties

spark.cassandra.input.split.size
Default = 100000. Approximate number of rows in a single Spark partition. The higher the value, the fewer Spark tasks are created. Increasing the value too much may limit the parallelism level.
spark.cassandra.input.fetch.size_in_rows
Default = 1000. Number of rows being fetched per round-trip to Cassandra. Increasing this value increases memory consumption. Decreasing the value increases the number of round-trips. In earlier releases, this property was spark.cassandra.input.page.row.size.
spark.cassandra.input.consistency.level
Default = LOCAL_ONE. Consistency level to use when reading.
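The following sketch shows one way to apply these read settings in SparkConf (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

// Illustrative read-tuning values; adjust for your data volume and cluster
val readConf = new SparkConf()
  .set("spark.cassandra.input.split.size", "200000")      // fewer, larger Spark partitions
  .set("spark.cassandra.input.fetch.size_in_rows", "500")  // smaller pages, more round-trips
  .set("spark.cassandra.input.consistency.level", "ONE")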

Write properties

You can set the following properties in SparkConf to fine-tune the saving process; a short sketch follows this list.

spark.cassandra.output.batch.size.bytes
Default = 64K. The maximum total size of the batch in bytes.
spark.cassandra.output.consistency.level
Default = LOCAL_ONE. Consistency level to use when writing.
spark.cassandra.output.concurrent.writes
Default = 5. Maximum number of batches executed in parallel by a single Spark task.
spark.cassandra.output.batch.size.rows
Default = auto. The number of rows per single batch. The default, auto, means the connector adjusts the number of rows based on the amount of data.
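A minimal sketch of setting these write properties in SparkConf (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

// Illustrative write-tuning values; adjust for your workload
val writeConf = new SparkConf()
  .set("spark.cassandra.output.consistency.level", "LOCAL_QUORUM")
  .set("spark.cassandra.output.concurrent.writes", "10")
  .set("spark.cassandra.output.batch.size.rows", "500")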

See the Spark Cassandra Connector documentation for details on additional, low-level properties.

The location of the cassandra.yaml file depends on the type of installation:
  • Installer-Services and Package installations: /etc/dse/cassandra/cassandra.yaml
  • Installer-No Services and Tarball installations: install_location/resources/cassandra/conf/cassandra.yaml