Using Apache Spark™ SQL to query data
Spark SQL allows you to execute Spark queries using a variation of the SQL language.
Spark SQL includes APIs for returning Spark Datasets in Scala and Java, and for running queries interactively in a SQL shell.
In DSE, Spark SQL allows you to perform relational queries over data stored in DSE clusters and executed using Spark. Spark SQL is a unified relational query language for traversing distributed collections of data, and it supports a variation of the SQL used in relational databases. Spark SQL is intended as a replacement for Shark and Hive, including the ability to run SQL queries over Spark data sets. You can use traditional Spark applications in conjunction with Spark SQL queries to analyze large data sets.
The SparkSession class and its subclasses are the entry point for running relational queries in Spark.
DataFrames are Spark Datasets organized into named columns, similar to tables in a traditional relational database.
You can create DataFrame instances from any Spark data source, such as CSV files, Spark RDDs, or, in DSE, tables in the database.
In DSE, when you access a Spark SQL table backed by data in a DSE transactional cluster, DSE registers that table in the Hive metastore so that SQL queries can be run against it.
Tables you create or drop, and table data you delete, in a Spark SQL session are reflected only in that session's metastore, not in the underlying DSE database.
The Spark SQL shell allows you to interactively perform Spark SQL queries.
To start the shell, run the dse spark-sql command.
The Spark SQL shell in DSE automatically creates a Spark session and connects to the Spark SQL Thrift server to handle the underlying JDBC connections.
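For example, assuming DSE is installed and the dse command is on your PATH, starting the shell and running a query might look like the following; the keyspace and table names are illustrative:

```
$ dse spark-sql
spark-sql> SELECT id, color FROM mykeyspace.test LIMIT 10;
```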
You cannot load data from one file system to a table in a different file system.
CREATE TABLE IF NOT EXISTS test (id INT, color STRING) PARTITIONED BY (ds STRING);
LOAD DATA INPATH 'hdfs2://localhost/colors.txt' OVERWRITE INTO TABLE test PARTITION (ds='2008-08-15');
The first statement creates a table on the default file system. The second statement fails because it attempts to load data into that table from a path on a different file system.
- Querying database data using Spark SQL in Scala
You can execute Spark SQL queries in Scala by starting the Spark shell. When you start Spark, DataStax Enterprise creates a Spark session instance to allow you to run Spark SQL queries against database tables.
- Querying database data using Spark SQL in Java
You can execute Spark SQL queries in Java applications that query tables. Java applications that query table data using Spark SQL require a Spark session instance.
- Querying DSE Graph vertices and edges with Spark SQL
Spark SQL can query DSE Graph vertex and edge tables.
- Supported syntax of Spark SQL
Spark SQL supports a subset of the SQL-92 language.
- Inserting data into tables with static columns using Spark SQL
Static columns are mapped to different columns in Spark SQL and require special handling.
- Running HiveQL queries using Spark SQL
Spark SQL supports queries written in HiveQL, a SQL-like language whose queries are converted to Spark jobs.
- Using the DataFrames API
The Spark DataFrame API encapsulates data sources, including DataStax Enterprise data, organized into named columns.
- Using the Spark SQL Thrift server
The Spark SQL Thrift server provides JDBC and ODBC interfaces for client connections to DSE.
- Enabling SSL for the Spark SQL Thrift Server
Communication with the Spark SQL Thrift Server can be encrypted using SSL.
- Accessing the Spark SQL Thrift Server with the Simba JDBC driver
The Simba JDBC driver allows you to access the Spark SQL Thrift Server.
- Simba ODBC Driver for Apache Spark (Windows)
The Simba ODBC Driver for Spark allows you to connect to the Spark SQL Thrift Server from Windows.
- Simba ODBC Driver for Apache Spark (Linux)
The Simba ODBC Driver for Spark allows you to connect to the Spark SQL Thrift Server from Linux.
- Connecting to the Spark SQL Thrift server using Beeline
Use the Beeline client to test the Spark SQL Thrift server.
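As a sketch of the Scala workflow described above, a spark-shell session in DSE might run a query through the automatically created spark session; the keyspace and table names here are assumptions, not part of any default schema:

```
$ dse spark
scala> val df = spark.sql("SELECT id, color FROM mykeyspace.test")
scala> df.show(10)
```

The spark value is the SparkSession instance that DSE creates when the shell starts, so no additional setup is needed before issuing queries.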
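To try the Thrift server from Beeline, a connection might look like the following; localhost and port 10000 are the common HiveServer2 defaults and may differ in your deployment:

```
$ dse beeline
beeline> !connect jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> SHOW TABLES;
```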