Using Apache Spark™ SQL to query data

Spark SQL allows you to execute Spark queries using a variation of the SQL language. Spark SQL includes APIs for returning Spark Datasets in Scala and Java, and interactively using a SQL shell.

Spark SQL basics

In DSE, Spark SQL allows you to perform relational queries over data stored in DSE clusters, and executed using Spark. Spark SQL is a unified relational query language for traversing over distributed collections of data, and supports a variation of the SQL language used in relational databases. Spark SQL is intended as a replacement for Shark and Hive, including the ability to run SQL queries over Spark data sets. You can use traditional Spark applications in conjunction with Spark SQL queries to analyze large data sets.

The SparkSession class and its subclasses are the entry point for running relational queries in Spark.

DataFrames are Spark Datasets organized into named columns, and are similar to tables in a traditional relational database. You can create DataFrame instances from any Spark data source, like CSV files, Spark RDDs, or, for DSE, tables in the database. In DSE, when you access a Spark SQL table from the data in DSE transactional cluster, it registers that table to the Hive metastore so SQL queries can be run against it.

Any tables you create or destroy, and any table data you delete, in a Spark SQL session will not be reflected in the underlying DSE database, but only in that session’s metastore.

Starting the Spark SQL shell

The Spark SQL shell allows you to interactively perform Spark SQL queries. To start the shell, run dse spark-sql:

dse spark-sql

The Spark SQL shell in DSE automatically creates a Spark session and connects to the Spark SQL Thrift server to handle the underlying JDBC connections.

Spark SQL limitations

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com