Querying DSE Graph vertices and edges with Spark SQL

Spark SQL can query DSE Graph vertex and edge tables.

The com.datastax.bdp.graph.spark.sql.vertex and com.datastax.bdp.graph.spark.sql.edge data sources register the vertex and edge tables of a graph in Spark SQL. For example:
spark-sql> CREATE DATABASE graph_example;
spark-sql> USE graph_example;
spark-sql> CREATE TABLE vertices USING com.datastax.bdp.graph.spark.sql.vertex OPTIONS (graph 'example');
spark-sql> CREATE TABLE edges USING com.datastax.bdp.graph.spark.sql.edge OPTIONS (graph 'example');
If a graph has property names that differ only in capitalization (for example, id and Id), start Spark SQL with the --conf spark.sql.caseSensitive=true option.
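For example, a minimal sketch of starting the shell with case sensitivity enabled, assuming the dse spark-sql launcher used in the procedure below accepts Spark configuration flags:

    dse spark-sql --conf spark.sql.caseSensitive=true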

Prerequisites

Start your cluster with both Graph and Spark enabled.

Procedure

  1. Start the Spark SQL shell.
    dse spark-sql
  2. Register the vertex and edge tables for your graph using CREATE TABLE.
    CREATE DATABASE graph_gods;
    USE graph_gods;
    CREATE TABLE vertices USING com.datastax.bdp.graph.spark.sql.vertex OPTIONS (graph 'gods');
    CREATE TABLE edges USING com.datastax.bdp.graph.spark.sql.edge OPTIONS (graph 'gods');
  3. Query the vertices and edges using SELECT statements.
    SELECT * FROM vertices WHERE name = 'Zeus';
  4. Join the vertices and edges in a query.

    Vertices are identified by the id column. The edges table has src and dst columns that identify the from and to vertices, respectively. A join between the two tables traverses the graph. For example, to find the IDs of all vertices reached over outgoing edges (a sketch that also resolves vertex names appears after this query):

    SELECT edges.dst FROM vertices JOIN edges ON vertices.id = edges.src;
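
    As an illustrative extension, the edges table can be joined to the vertices table twice to resolve both endpoints to their name property, assuming the graph's vertices carry a name column as in the earlier query; the src_v and dst_v aliases are hypothetical names used only for this sketch:

    -- src_v and dst_v are illustrative aliases for the two endpoints of each edge
    SELECT src_v.name AS src_name, dst_v.name AS dst_name
    FROM edges
    JOIN vertices src_v ON edges.src = src_v.id
    JOIN vertices dst_v ON edges.dst = dst_v.id;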

What's next

The same steps work from the Spark shell by running the statements with spark.sql(), or through the JDBC/ODBC driver and the Spark SQL Thrift Server.
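
For example, a minimal sketch of running the join query from the Scala Spark shell (dse spark), assuming the graph_gods database created in the procedure above and the spark session object provided by the shell:

    // Switch to the database that holds the registered vertex and edge tables.
    spark.sql("USE graph_gods")
    // Find all vertex ids reached over outgoing edges, as in the Spark SQL example above.
    spark.sql("SELECT edges.dst FROM vertices JOIN edges ON vertices.id = edges.src").show()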