Using the DataFrames API with PySpark
The DataStax Enterprise integration with PySpark supports the DataFrames API for manipulating data within Spark; the DSE PySpark Scala wrappers have been removed. The Spark Cassandra Connector provides an integrated DataSource that simplifies creating DataFrames backed by Cassandra tables. For more technical details, see the Spark Cassandra Connector documentation maintained by DataStax and the Cassandra and PySpark DataFrames post.
PySpark prerequisites
- Python 2.6 or later
- Start a DataStax Enterprise node in Spark mode.
Examples of using DataFrames API
This example uses the DataFrames API to read from the Cassandra table ks.kv and insert the results into a different Cassandra table, ks.othertable.
table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()
table1.write.format("org.apache.spark.sql.cassandra").options(table="othertable", keyspace="ks").save(mode="append")
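A transformation can be applied between the read and the write. As a sketch, run inside a dse pyspark session against the same tables, and assuming ks.kv has an integer value column (a hypothetical schema), rows could be filtered before the save:

```python
# Read ks.kv into a DataFrame (same as above)
table1 = sqlContext.read.format("org.apache.spark.sql.cassandra") \
    .options(table="kv", keyspace="ks").load()

# Keep only rows whose value exceeds 10; the connector can push
# simple predicates down to Cassandra where possible.
# The "value" column is an assumed part of the ks.kv schema.
filtered = table1.filter(table1.value > 10)

# Append the filtered rows to ks.othertable
filtered.write.format("org.apache.spark.sql.cassandra") \
    .options(table="othertable", keyspace="ks").save(mode="append")
```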
Run a Python script using dse spark-submit and DataFrames
Run a Python script with the dse spark-submit command. For example, create the following file and save it as standalone.py:
#standalone.py
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Configure and create the Spark context and SQL context
conf = SparkConf().setAppName("Stand Alone Python Script")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the Cassandra table ks.kv into a DataFrame and print its contents
sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load().show()
DataStax Enterprise sets the cassandra.connection.host environment variable, eliminating the need to set it in the Python file. On Linux, for example, execute standalone.py from the installation directory as follows:
$ bin/dse spark-submit /path/standalone.py
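Standard spark-submit options can also be passed through dse spark-submit to size the job. A sketch, where the memory and core values are illustrative assumptions rather than recommended settings:

```shell
# Submit standalone.py with explicit executor resources
# (2G of executor memory and 4 total cores are example values)
bin/dse spark-submit \
  --executor-memory 2G \
  --total-executor-cores 4 \
  /path/standalone.py
```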