Querying Cassandra data using Spark SQL in Java

You can execute Spark SQL queries in Java applications that traverse over Cassandra column families. Such applications first require a Spark configuration instance and a Spark context instance.

The default location of the dse-spark-version.jar file depends on the type of installation:
Installer-Services and Package installations: /usr/share/dse/dse-spark-version.jar
Installer-No Services and Tarball installations: install_location/lib/dse-spark-version.jar

The Spark context object is used to create a Cassandra-aware Spark SQL context object that connects to Cassandra. We recommend using a HiveContext instance, as HiveContext is a superset of SQLContext and allows you to write more complicated queries using HiveQL. Create an instance of org.apache.spark.sql.hive.HiveContext from the JavaSparkContext object's underlying SparkContext.

Create the Spark configuration object and Spark context:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.hive.HiveContext;

// create a new configuration
SparkConf conf = new SparkConf()
                .setAppName("My application");
// create a Spark context
JavaSparkContext sc = new JavaSparkContext(conf);
// create a Cassandra-aware Spark SQL context from the underlying SparkContext
HiveContext hiveContext = new HiveContext(JavaSparkContext.toSparkContext(sc));

After the Spark SQL context is created, you can use it to create a DataFrame instance from a query. Queries are executed by calling the HiveContext.sql method.

import org.apache.spark.sql.DataFrame;

// query a Cassandra table and register the result as a temporary table
DataFrame employees = hiveContext.sql("SELECT * FROM company.employees");
employees.registerTempTable("employees");
DataFrame managers = hiveContext.sql("SELECT name FROM employees WHERE role = 'Manager'");
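
As an illustration of the more complicated HiveQL queries mentioned above, here is a minimal sketch of an aggregate over the registered temporary table, assuming the role column shown in the previous query:

// count employees per role with a HiveQL aggregate
DataFrame roleCounts = hiveContext.sql(
        "SELECT role, COUNT(*) AS headcount FROM employees GROUP BY role");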

The returned DataFrame object supports the standard Spark operations.

// pull all rows of the employees DataFrame back to the driver
employees.collect();
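
For example, a minimal sketch of a few standard DataFrame operations, assuming the name and role columns used in the queries above:

import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.Row;

// count the rows in the DataFrame
long total = employees.count();
// print the first 20 rows to standard output
employees.show();
// filter with the DataFrame API instead of SQL and keep only the name column
DataFrame managerNames = employees
        .filter(employees.col("role").equalTo("Manager"))
        .select(employees.col("name"));
// bring the filtered results back to the driver as Row objects
Row[] rows = managerNames.collect();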