Querying DSE Graph vertices and edges with Apache Spark™ SQL
Spark SQL can query DSE Graph vertex and edge tables.
The com.datastax.bdp.graph.spark.sql.vertex and com.datastax.bdp.graph.spark.sql.edge data sources are used to register vertex and edge tables in Spark SQL. For example:
spark-sql> CREATE DATABASE graph_example;
spark-sql> USE graph_example;
spark-sql> CREATE TABLE vertices USING com.datastax.bdp.graph.spark.sql.vertex OPTIONS (graph 'example');
spark-sql> CREATE TABLE edges USING com.datastax.bdp.graph.spark.sql.edge OPTIONS (graph 'example');
If you have properties that are spelled the same but with different capitalization (for example, id and Id), start Spark SQL with the --conf spark.sql.caseSensitive=true option.
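For example, the following command starts the shell with case sensitivity enabled (a sketch that assumes the dse spark-sql launcher shown in the procedure below passes --conf options through to Spark):
dse spark-sql --conf spark.sql.caseSensitive=true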
Prerequisites
Start your cluster with both Graph and Spark enabled.
Procedure
- Start the Spark SQL shell:
  dse spark-sql
- Register the vertex and edge tables for your graph using CREATE TABLE:
  CREATE DATABASE graph_gods;
  USE graph_gods;
  CREATE TABLE vertices USING com.datastax.bdp.graph.spark.sql.vertex OPTIONS (graph 'gods');
  CREATE TABLE edges USING com.datastax.bdp.graph.spark.sql.edge OPTIONS (graph 'gods');
- Query the vertices and edges using SELECT statements:
  SELECT * FROM vertices WHERE name = 'Zeus';
- Join the vertices and edges in a query.
  Vertices are identified by the id column. Edge tables have src and dst columns that identify the from and to vertices, respectively. A join can be used to traverse the graph. For example, to find all vertex ids reached by outgoing edges:
  SELECT edges.dst FROM vertices JOIN edges ON vertices.id = edges.src;
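As a further sketch of traversing by join, the following hypothetical query returns the names at both ends of every edge. It assumes the vertices expose a name property column, as in the Zeus query above:
  SELECT src_v.name AS src_name, dst_v.name AS dst_name
  FROM edges
  JOIN vertices src_v ON edges.src = src_v.id
  JOIN vertices dst_v ON edges.dst = dst_v.id;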
What’s next
The same steps work from the Spark shell using spark.sql() to run the query statements, or using the JDBC/ODBC driver and the Spark SQL Thrift Server.
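For example, a minimal sketch from the interactive shell started with dse spark, assuming the graph_gods tables registered above and that the session is available as spark:
  // Run the same SQL statements through the SparkSession
  spark.sql("USE graph_gods")
  spark.sql("SELECT edges.dst FROM vertices JOIN edges ON vertices.id = edges.src").show()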