Using the DataFrames API
The Apache Spark™ DataFrames API organizes data from many sources, including DataStax Enterprise tables, into named columns.
The Cassandra Spark connector provides an integrated DataSource to simplify creating DataFrames. For more technical details, see Apache Cassandra® Spark connector data frames documentation and Cassandra and PySpark DataFrames Revisited.
Examples of using the DataFrames API
This Python example shows how to use the DataFrames API to read from the table ks.kv and insert the results into a different table, ks.othertable.
dse pyspark
table1 = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="ks") \
    .load()
table1.write.format("org.apache.spark.sql.cassandra") \
    .options(table="othertable", keyspace="ks") \
    .save(mode="append")
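Because the loaded table is a standard Spark DataFrame, the usual transformations can be applied before writing it back. The following sketch assumes a running DSE cluster and uses hypothetical column names ("key" and "value") for illustration; where possible, the connector pushes eligible filters and column projections down to Cassandra rather than filtering in Spark.

```python
# Sketch only: requires a running DSE cluster reachable from the pyspark shell.
# The column names "key" and "value" are assumptions for illustration.
table1 = spark.read.format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="ks") \
    .load()

# Filters and column projections are pushed down to Cassandra where possible.
subset = table1.select("key", "value").filter("value > 10")

subset.write.format("org.apache.spark.sql.cassandra") \
    .options(table="othertable", keyspace="ks") \
    .save(mode="append")
```

Using mode="append" adds the rows to the target table; other Spark save modes, such as "overwrite", behave as they do for any DataFrame write.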
Using the DSE Spark console, the following Scala example shows how to create a DataFrame object from one table and save it to another.
dse spark
val table1 = spark.read.format("org.apache.spark.sql.cassandra")
.options(Map( "table" -> "words", "keyspace" -> "test"))
.load()
table1.createCassandraTable("test", "otherwords", partitionKeyColumns = Some(Seq("word")), clusteringKeyColumns = Some(Seq("count")))
table1.write.cassandraFormat("otherwords", "test").save()
The write operation uses one of the helper methods, cassandraFormat, included in the Cassandra Spark connector. This is a simplified way of setting the format and options for a standard DataFrame operation. The following command is equivalent to the write operation using cassandraFormat:
table1.write.format("org.apache.spark.sql.cassandra")
.options(Map("table" -> "othertable", "keyspace" -> "test"))
.save()