Querying database data using Apache Spark™ SQL in Java

Java applications that query table data using Spark SQL first need an instance of org.apache.spark.sql.SparkSession.

The default location of the dse-spark-version.jar file depends on the type of installation:

Installation type                                              Location
Package installations and Installer-Services installations     /usr/share/dse/dse-spark-version.jar
Tarball installations and Installer-No Services installations  <installation_location>/lib/dse-spark-version.jar
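The default locations above can be captured in a small lookup, which may help when a build or launch script needs to put the JAR on the classpath. This is a minimal sketch and not part of DataStax Enterprise: the class DseSparkJarLocator, the InstallType enum, and the example version string 6.8.0 are hypothetical, and the sketch assumes that "version" in the file name is replaced directly by the release number.

```java
// Hypothetical helper: maps an installation type to the default JAR path listed above.
final class DseSparkJarLocator {

    enum InstallType { PACKAGE, TARBALL }

    // installLocation is only used for TARBALL installs. The version argument
    // replaces the "version" placeholder in dse-spark-version.jar (an assumption
    // about the naming scheme, not a documented rule).
    static String defaultJarPath(InstallType type, String installLocation, String version) {
        String jarName = "dse-spark-" + version + ".jar";
        switch (type) {
            case PACKAGE:
                return "/usr/share/dse/" + jarName;
            case TARBALL:
                if (installLocation == null || installLocation.isEmpty()) {
                    throw new IllegalArgumentException(
                            "installLocation is required for tarball installs");
                }
                // Strip one trailing slash so the result has a single separator.
                String base = installLocation.endsWith("/")
                        ? installLocation.substring(0, installLocation.length() - 1)
                        : installLocation;
                return base + "/lib/" + jarName;
            default:
                throw new IllegalArgumentException("Unknown installation type: " + type);
        }
    }

    public static void main(String[] args) {
        System.out.println(defaultJarPath(InstallType.PACKAGE, null, "6.8.0"));
        System.out.println(defaultJarPath(InstallType.TARBALL, "/opt/dse", "6.8.0"));
    }
}
```

Check the resulting path against the actual file on disk before relying on it, since the exact file name depends on the installed release.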

The Spark session object is used to connect to DataStax Enterprise.

Create the Spark session instance using the builder interface:

SparkSession spark = SparkSession
    .builder()
    .appName("My application name")
    .config("option name", "option value")
    .master("dse://1.1.1.1?connection.host=1.1.2.2,1.1.3.3")
    .getOrCreate();
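The master URL in the builder call above has the shape dse://<host>?connection.host=<host>,<host>. When the node addresses come from configuration rather than a literal, the string can be assembled programmatically. The following is a minimal sketch that only reproduces the format shown in the example. The class DseMasterUrl and its build method are hypothetical names, not part of the DSE or Spark APIs, and the sketch does not validate host names or handle other URL parameters.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper that assembles a dse:// master URL in the format used above.
final class DseMasterUrl {

    // primaryHost goes directly after dse://. additionalHosts, if any, are joined
    // with commas and passed in the connection.host parameter.
    static String build(String primaryHost, List<String> additionalHosts) {
        if (primaryHost == null || primaryHost.isEmpty()) {
            throw new IllegalArgumentException("primaryHost must not be empty");
        }
        StringBuilder url = new StringBuilder("dse://").append(primaryHost);
        if (additionalHosts != null && !additionalHosts.isEmpty()) {
            url.append("?connection.host=").append(String.join(",", additionalHosts));
        }
        return url.toString();
    }

    public static void main(String[] args) {
        // Prints the same URL as the builder example above.
        System.out.println(build("1.1.1.1", Arrays.asList("1.1.2.2", "1.1.3.3")));
    }
}
```

The returned string can then be passed to the builder's master method in place of the literal URL.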

After the Spark session is created, use it to run queries. Calling the SparkSession.sql method executes a query and returns the result as a DataFrame, which the Java API represents as the type Dataset<Row>.

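// Requires org.apache.spark.sql.Dataset and org.apache.spark.sql.Row.
Dataset<Row> employees = spark.sql("SELECT * FROM company.employees");
// createOrReplaceTempView supersedes registerTempTable, which is deprecated since Spark 2.0.
employees.createOrReplaceTempView("employees");
Dataset<Row> managers = spark.sql("SELECT name FROM employees WHERE role = 'Manager'");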

The returned DataFrame object supports the standard Spark operations.

employees.collect();
