Using the DataFrames API with PySpark

In DataStax Enterprise 4.8, PySpark integration with Spark 1.4.1 works most efficiently through the DataFrames API.
Note: The DataStax Enterprise wrappers for Spark Cassandra Connector Scala functions are deprecated.

PySpark and DSE PySpark both support the more efficient DataFrames API for manipulating data within Spark. The Spark Cassandra Connector provides an integrated data source that simplifies creating Cassandra DataFrames. For more technical details, see the Spark Cassandra Connector documentation maintained by DataStax and the Cassandra and PySpark DataFrames post.
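For example, the integrated data source loads a Cassandra table into a DataFrame in a single call, and the result can be queried with Spark SQL. The following is a minimal sketch, assuming the keyspace ks and table kv used in the examples below already exist:

df = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()
df.registerTempTable("kv")        # expose the DataFrame to Spark SQL
sqlContext.sql("SELECT * FROM kv").show()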

PySpark prerequisites 

Before starting PySpark, ensure that the prerequisites for running Spark in DataStax Enterprise are met.

Examples of using the DataFrames API

The following example uses the DataFrames API to read from the Cassandra table ks.kv and append the rows to a different Cassandra table, ks.othertable.

table1 = sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load()
table1.write.format("org.apache.spark.sql.cassandra").options(table="othertable", keyspace="ks").save(mode="append")
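Any DataFrame transformation can be applied between the read and the write. As a minimal sketch, assuming kv has an integer column named value (a hypothetical schema), you could append only a filtered subset:

# Hypothetical: assumes kv has an integer column named "value".
filtered = table1.filter(table1.value > 10)
filtered.write.format("org.apache.spark.sql.cassandra").options(table="othertable", keyspace="ks").save(mode="append")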

Run a Python script using dse spark-submit and DataFrames 

You run a Python script using the dse spark-submit command. For example, create the following file and save it as standalone.py:

# standalone.py

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext

# Configure and create the Spark context.
conf = SparkConf().setAppName("Stand Alone Python Script")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)

# Read the Cassandra table ks.kv into a DataFrame and print its contents.
sqlContext.read.format("org.apache.spark.sql.cassandra").options(table="kv", keyspace="ks").load().show()

DataStax Enterprise sets the cassandra.connection.host environment variable, so you do not need to set it in the Python file. On Linux, for example, execute standalone.py from the installation directory as follows:

$ bin/dse spark-submit /path/standalone.py
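If you submit the script with plain spark-submit rather than dse spark-submit, the connection host is not set for you. A minimal sketch, assuming a node reachable at 127.0.0.1, sets the Spark Cassandra Connector's spark.cassandra.connection.host property on the SparkConf:

# Only needed outside of dse spark-submit; 127.0.0.1 is an assumed node address.
conf = SparkConf().setAppName("Stand Alone Python Script").set("spark.cassandra.connection.host", "127.0.0.1")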