Querying DSE Graph vertices and edges with Apache Spark™ SQL

Spark SQL can query DSE Graph vertex and edge tables. The com.datastax.bdp.graph.spark.sql.vertex and com.datastax.bdp.graph.spark.sql.edge data sources are used to specify vertex and edge tables in Spark SQL.

spark-sql> CREATE DATABASE graph_example;
spark-sql> USE graph_example;
spark-sql> CREATE TABLE vertices USING com.datastax.bdp.graph.spark.sql.vertex OPTIONS (graph 'example');
spark-sql> CREATE TABLE edges USING com.datastax.bdp.graph.spark.sql.edge OPTIONS (graph 'example');

If you have properties that are spelled the same but with different capitalizations (for example, id and Id), start Spark SQL with the --conf spark.sql.caseSensitive=true option.

Prerequisites

Start your cluster with both Graph and Spark enabled.

Procedure

  1. Start the Spark SQL shell.

    dse spark-sql
  2. Register the vertex and edge tables for your graph using CREATE TABLE.

    CREATE DATABASE graph_gods;
    USE graph_gods;
    CREATE TABLE vertices USING com.datastax.bdp.graph.spark.sql.vertex OPTIONS (graph 'gods');
    CREATE TABLE edges USING com.datastax.bdp.graph.spark.sql.edge OPTIONS (graph 'gods');
  3. Query the vertices and edges using SELECT statements.

    SELECT * FROM vertices where name = 'Zeus';
  4. Join the vertices and edges in a query.

    Vertices are identified by id columns. Edges tables have src and dst columns that identify the from and to vertices, respectively. A join can be used to traverse the graph. For example to find all vertex ids that are reached by the out edges:

    SELECT edges.dst FROM vertices JOIN edges ON vertices.id = edges.src;

What’s next

The same steps work from the Spark shell using spark.sql() to run the query statements, or using the JDBC/ODBC driver and the Spark SQL Thrift Server.

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com