Using Spark SQL to query data
Spark SQL allows you to execute Spark queries using a variation of the SQL language.
Spark SQL allows you to execute Spark queries using a variation of the SQL language. Spark SQL
includes APIs for returning Spark Dataset
s in Scala and Java, and interactively
using a SQL shell.
Spark SQL basics
In DSE, Spark SQL allows you to perform relational queries over data stored in DSE clusters, and executed using Spark. Spark SQL is a unified relational query language for traversing over distributed collections of data, and supports a variation of the SQL language used in relational databases. Spark SQL is intended as a replacement for Shark and Hive, including the ability to run SQL queries over Spark data sets. You can use traditional Spark applications in conjunction with Spark SQL queries to analyze large data sets.
The SparkSession class and its subclasses are the entry point for running relational queries in Spark.
DataFrames are Spark Dataset
s organized into named
columns, and are similar to tables in a traditional relational database. You can create
DataFrame instances from any Spark data source, like CSV files, Spark RDDs,
or, for DSE, tables in the database. In DSE, when you access a Spark SQL table from the data in
DSE transactional cluster, it registers that table to the Hive metastore so SQL queries can be
run against it.
Starting the Spark SQL shell
The Spark SQL shell allows you to interactively perform Spark SQL queries. To start the shell,
run dse spark-sql
:
dse spark-sql
The Spark SQL shell in DSE automatically creates a Spark session and connects to the Spark SQL Thrift server to handle the underlying JDBC connections.
If the schema changes in the underlying database table during a Spark SQL session (for example, a column was added using CQL), drop the table and then refresh the metastore to continue querying the table with the correct schema.
DROP TABLE tablename; SHOW TABLES;
Queries to a table whose schema has been modified cause a runtime exception.
Spark SQL limitations
- You cannot load data from one file system to a table in a different file
system.
CREATE TABLE IF NOT EXISTS test (id INT, color STRING) PARTITIONED BY (ds STRING); LOAD DATA INPATH 'hdfs2://localhost/colors.txt' OVERWRITE INTO TABLE test PARTITION (ds ='2008-08-15');
The first line creates a table on the default file system. The second line attempts to load data into that table from a path on a different file system, and will fail.