Using Spark SQL to query data

Spark SQL allows you to execute Spark queries using a variation of the SQL language. Spark SQL includes APIs for Scala and Java.

Use Spark SQL to query data that is stored in Cassandra clusters; the queries are executed by Spark. Typically, queries run faster in Spark SQL than in Hive.

Spark SQL basics

In DataStax Enterprise, Spark SQL allows you to perform relational queries over data stored in Cassandra clusters; the queries are executed using Spark. Spark SQL is a unified relational query language for traversing Spark Resilient Distributed Datasets (RDDs), and supports a variation of the SQL language used in relational databases. Spark SQL is intended as a replacement for Shark and Hive, and includes the ability to run HiveQL queries over RDDs. You can use traditional Spark applications in conjunction with Spark SQL queries to analyze large data sets.

The SQLContext class and its subclasses are the entry point for running relational queries in Spark. SQLContext instances are created from a SparkContext instance. The CassandraSQLContext class is a subclass of SQLContext and allows you to run these queries against a Cassandra data source.
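
As a minimal sketch of this relationship, the following Scala application creates a CassandraSQLContext from a SparkContext and runs a query through it. It assumes a Spark 1.1 or 1.2 era release with the Spark Cassandra Connector on the classpath, and that the application is submitted through DSE so that the Spark master and the Cassandra connection are already configured. The keyspace my_keyspace, the table users, and the application name are hypothetical placeholders. Note that the Spark class name is spelled SQLContext.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.cassandra.CassandraSQLContext

object CassandraSqlSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the Spark master and Cassandra host come from the
    // launcher (for example, a DSE submit command), so only the
    // application name is set here.
    val conf = new SparkConf().setAppName("CassandraSqlSketch")
    val sc = new SparkContext(conf)

    // The SQL context is created from the existing SparkContext.
    val cc = new CassandraSQLContext(sc)

    // Hypothetical keyspace and table; replace with names from your schema.
    val rows = cc.sql("SELECT * FROM my_keyspace.users")

    // Bring a few rows back to the driver and print them.
    rows.take(10).foreach(println)

    sc.stop()
  }
}
```

The fully qualified keyspace.table form is used so that the query does not depend on any default keyspace setting.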

Spark SQL uses a special type of RDD called SchemaRDD, which is similar to a table in a traditional relational database. A SchemaRDD consists of object data and a schema that describes the data types of the objects. You can create SchemaRDD instances from existing Spark RDDs. Once a SchemaRDD has been applied to a SQLContext, it can be registered as a table, and SQL queries can be run against it.
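
The steps just described (create a SchemaRDD from an existing RDD, register it as a table, and query it) can be sketched in Scala as follows. This is an illustration only, assuming a Spark 1.1 or 1.2 release in which SchemaRDD, the createSchemaRDD implicit conversion, and registerTempTable are available. The Person case class, the people table name, and the sample data are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record type; its fields become the columns of the schema.
case class Person(name: String, age: Int)

object SchemaRddSketch {
  def main(args: Array[String]): Unit = {
    // Assumption: the Spark master is supplied by the launcher.
    val sc = new SparkContext(new SparkConf().setAppName("SchemaRddSketch"))
    val sqlContext = new SQLContext(sc)

    // Implicit conversion from an RDD of case classes to a SchemaRDD.
    import sqlContext.createSchemaRDD

    // An ordinary Spark RDD built from in-memory sample data.
    val people = sc.parallelize(Seq(Person("Alice", 34), Person("Bob", 19)))

    // The implicit conversion turns the RDD into a SchemaRDD, which is
    // then registered under a table name for use in SQL.
    people.registerTempTable("people")

    // The query result is itself a SchemaRDD whose elements are rows.
    val adults = sqlContext.sql("SELECT name FROM people WHERE age >= 21")
    adults.map(row => row.getString(0)).collect().foreach(println)

    sc.stop()
  }
}
```

The schema is inferred from the case class fields by reflection, which is why Person is declared at the top level rather than inside the main method.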

Starting the Spark SQL shell

The Spark SQL shell allows you to interactively perform Spark SQL queries. To start the shell, run dse spark-sql:

dse spark-sql

For more information on Spark SQL, see the migration information in the Spark documentation.