Querying database data using Spark SQL in Java

You can execute Spark SQL queries in Java applications that work with table data. A Java application that queries table data using Spark SQL first needs an instance of org.apache.spark.sql.SparkSession.

dse-spark-version.jar

The default location of the dse-spark-version.jar file depends on the type of installation:

Package installations and Installer-Services installations:

    /usr/share/dse/dse-spark-version.jar

Tarball installations and Installer-No Services installations:

    installation_location/lib/dse-spark-version.jar

The Spark session object is used to connect to DataStax Enterprise.

Create the Spark session instance using the builder interface:

SparkSession spark = SparkSession
    .builder()
    .appName("My application name")
    .config("option name", "option value")
    .master("dse://1.1.1.1?connection.host=1.1.2.2,1.1.3.3")
    .getOrCreate();

After the Spark session instance is created, you can use it to create a DataFrame from a query. Queries are executed by calling the SparkSession.sql method, which in the Java API returns a Dataset<Row> (the Java representation of a DataFrame).

Dataset<Row> employees = spark.sql("SELECT * FROM company.employees");
employees.createOrReplaceTempView("employees");
Dataset<Row> managers = spark.sql("SELECT name FROM employees WHERE role = 'Manager'");

The returned object supports the standard Spark DataFrame operations.

employees.collect();
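As a minimal, self-contained sketch of those standard operations, the example below uses a local Spark session and an in-memory Dataset standing in for the company.employees table (the rows, column names, and local[*] master are illustrative assumptions; against DataStax Enterprise you would use the dse:// master URL and query real tables as shown above):

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SparkSqlSketch {
    public static void main(String[] args) {
        // Local session stands in for a dse:// session; the operations are the same.
        SparkSession spark = SparkSession.builder()
            .appName("Spark SQL sketch")
            .master("local[*]")
            .getOrCreate();

        // Hypothetical in-memory rows standing in for company.employees.
        StructType schema = new StructType()
            .add("name", DataTypes.StringType)
            .add("role", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
            RowFactory.create("Alice", "Manager"),
            RowFactory.create("Bob", "Engineer"));
        Dataset<Row> employees = spark.createDataFrame(rows, schema);
        employees.createOrReplaceTempView("employees");

        // Query the temporary view, then apply standard operations to the result.
        Dataset<Row> managers =
            spark.sql("SELECT name FROM employees WHERE role = 'Manager'");
        long managerCount = managers.count();          // number of matching rows
        managers.show();                               // tabular preview on stdout
        List<Row> collected = managers.collectAsList(); // materialize on the driver

        System.out.println(managerCount);
        spark.stop();
    }
}
```

collectAsList is used here rather than collect because it returns a typed List<Row> in Java; collect, as in the earlier example, also works but returns an untyped array.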