Bulk saving data from Spark RDD to Cassandra

Bulk saving data from Spark RDD to Cassandra bypasses the standard Cassandra write-path.

Bulk saving data from a Spark resilient distributed dataset (RDD) to the Cassandra database writes rows directly to SSTables that are created in a local temporary directory on each Spark executor. Bulk saving improves performance by bypassing the standard Cassandra write path.

The standard saveToCassandra method sends rows through the Java driver to a Cassandra node, which then orders the rows and flushes them to SSTables. With the standard Cassandra write path, much of the work is pushed onto the Cassandra cluster.

Bulk saving uses the bulkSaveToCassandra method, which has the same semantics as saveToCassandra but writes rows directly to SSTables that are created in a local temporary directory on each Spark executor. The bulkSaveToCassandra method then streams the SSTables to Cassandra nodes in a DataStax Enterprise cluster. Performance improves because the data bypasses a number of stages in the Cassandra write path, which reduces the load on the server side.

Example of bulk saving from Spark RDD 

Use the SparkContext to create the RDD that holds the data. The following example shows how to bulk save data from a Spark RDD to a Cassandra database.

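import org.apache.spark.rdd.RDD
import com.datastax.bdp.spark.writer.BulkTableWriter._ // brings bulkSaveToCassandra into scope

val rdd: RDD[SomeType] = ... // create some RDD to save
rdd.bulkSaveToCassandra(keyspace, table) // keyspace and table are the target names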

Performance tuning  

BulkTableWriter generates at least one local SSTable per Spark partition. Use these guidelines to tune performance:
  • Ensure that each partition of the RDD is at least several tens of megabytes in size to minimize the cost of compaction on the server side.
  • Increase the buffer size for each task to generate larger SSTables.

    By default, a 64 MB buffer is reserved for each task. Bulk writing requires a significant amount of memory on the client. To control the buffer size, pass a custom writeConf object that sets the bufferSizeInMB property in the bulkSaveToCassandra method call. When the RDD partitions are large enough, increase the size of this buffer to generate larger SSTables. There is no point in setting this parameter to a value much larger than the RDD partition size.

Note: If your keyspace or table names use mixed case, specify the keyspace and table name in all lower case when calling bulkSaveToCassandra. For example, if your keyspace name is myKeyspace and your table name is myTable, call bulkSaveToCassandra("mykeyspace", "mytable").
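
The sizing guidelines and the lower-case naming rule on this page can be sketched as a few plain Scala helpers. This is only an illustration under stated assumptions: the object BulkSaveSizing and its methods are hypothetical names and are not part of the DataStax Enterprise or Spark APIs. The target partition size and the upper buffer limit are values you choose for your own executors. Keeping the 64 MB default as a lower bound is also an assumption.

```scala
import java.util.Locale

// Hypothetical helpers for illustration; not part of the DSE or Spark APIs.
object BulkSaveSizing {
  // Default per-task buffer size stated in the text above.
  val DefaultBufferMB = 64

  // Number of partitions so that each holds roughly targetPartitionMB of data.
  def partitionCount(totalMB: Long, targetPartitionMB: Int): Int = {
    require(totalMB >= 0 && targetPartitionMB > 0)
    math.max(1L, totalMB / targetPartitionMB).toInt
  }

  // Buffer size that follows the partition size, kept between the default
  // and a caller-chosen upper limit (assumption: the limit reflects the
  // memory available to each task on the client).
  def bufferSizeMB(partitionMB: Int, maxBufferMB: Int): Int = {
    require(partitionMB > 0 && maxBufferMB >= DefaultBufferMB)
    math.min(math.max(partitionMB, DefaultBufferMB), maxBufferMB)
  }

  // Lower-case a keyspace or table name before passing it to bulkSaveToCassandra.
  def lowerCaseName(name: String): String = name.toLowerCase(Locale.ROOT)
}
```

For example, with about 10 GB of data and a 128 MB target, partitionCount(10240L, 128) gives 80 partitions, which could be applied with the standard rdd.repartition method before the bulk save. The result of bufferSizeMB would be the value to set for bufferSizeInMB in the writeConf object.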