Using the DataFrames API

The Spark DataFrames API encapsulates data sources, including DataStax Enterprise data, organized into named columns.

The Spark Cassandra Connector provides an integrated DataSource that simplifies creating DataFrames. For more technical details, see the Spark Cassandra Connector documentation maintained by DataStax and the Cassandra and PySpark DataFrames post.

Examples of using the DataFrames API

This Python example shows how to use the DataFrames API to read from the table ks.kv and insert the results into a different table, ks.othertable.

dse pyspark
table1 = spark.read.format("org.apache.spark.sql.cassandra") \
  .options(table="kv", keyspace="ks") \
  .load()
table1.write.format("org.apache.spark.sql.cassandra") \
  .options(table="othertable", keyspace="ks") \
  .save(mode="append")
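Once loaded, the DataFrame supports the standard Spark transformations, and the connector can push simple predicates down to the server. A minimal sketch, assuming a running dse pyspark session and that the example table ks.kv has columns named key and value:

```python
# Assumes a running `dse pyspark` session with a `spark` SparkSession
# and the ks.kv table from the example above.
table1 = spark.read.format("org.apache.spark.sql.cassandra") \
  .options(table="kv", keyspace="ks") \
  .load()

# Standard DataFrame operations apply; simple comparison filters like
# this one are eligible for pushdown to the data source.
filtered = table1.select("key", "value").filter(table1.value > 10)
filtered.show()
```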

Using the DSE Spark console, the following Scala example shows how to create a DataFrame object from one table and save it to another.

dse spark
val table1 = spark.read.format("org.apache.spark.sql.cassandra")
  .options(Map( "table" -> "words", "keyspace" -> "test"))
  .load()
table1.createCassandraTable("test", "otherwords", partitionKeyColumns = Some(Seq("word")), clusteringKeyColumns = Some(Seq("count")))
table1.write.cassandraFormat("otherwords", "test").save()

The write operation uses one of the helper methods, cassandraFormat, included in the Spark Cassandra Connector. It is a simplified way of setting the format and options for a standard DataFrame operation. The following command is equivalent to the write operation using cassandraFormat:

table1.write.format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "otherwords", "keyspace" -> "test"))
  .save()
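The connector also provides a matching helper on the read side. A sketch, assuming a dse spark session and the same test.words table used above:

// Assumes a running `dse spark` session; cassandraFormat on the reader
// is the read-side counterpart of the writer helper shown above.
val words = spark.read.cassandraFormat("words", "test").load()
words.show()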