Connect with Apache Spark

Apache Spark with Scala in spark-shell connects to and reads your HCD tables for advanced data analysis. With this approach, you can do the following:

  • Directly execute SQL and CQL queries to interact with your data.

  • Employ Spark DataFrames and RDDs for sophisticated data manipulation and analysis.

Spark's distributed processing engine extends the analysis you can run on data stored in HCD well beyond what CQL queries alone provide.
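For example, after you complete the connection steps below, a table loaded through the connector behaves like any other Spark DataFrame. This is a minimal sketch; the my_keyspace keyspace, the users table, and its age and country columns are hypothetical placeholders:

    import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql.functions.col

    // Load a hypothetical HCD table into a Spark DataFrame through the connector.
    val users = spark.read.cassandraFormat("users", "my_keyspace").load()

    // Analyze it with standard DataFrame operations.
    users.filter(col("age") > 30).groupBy(col("country")).count().show()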

HCD connects to Spark through the Spark Cassandra Connector (SCC), which also provides better support for container orchestration platforms. For more information, see Advanced Apache Cassandra Analytics Now Open for All.

Prerequisites

  • A running HCD database and the credentials for a superuser role.

  • The Apache Spark package downloaded to your machine.

Connect to an HCD database with Apache Spark

  1. Extract the Apache Spark package into a directory.

    The following steps use SPARK_HOME as a placeholder for the path to your Apache Spark directory.

  2. Add the following lines to the end of the SPARK_HOME/conf/spark-defaults.conf file. If the file doesn't exist, copy SPARK_HOME/conf/spark-defaults.conf.template to SPARK_HOME/conf/spark-defaults.conf, and then add the lines to the copy.

    spark.cassandra.auth.username SUPERUSER_USERNAME
    spark.cassandra.auth.password SUPERUSER_PASSWORD
    spark.dse.continuousPagingEnabled false

    Replace SUPERUSER_USERNAME and SUPERUSER_PASSWORD with the credentials for your superuser role.
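
    If spark-shell doesn't run on the same host as an HCD node, you typically also need to tell the connector where to find your database. The following line uses the standard SCC spark.cassandra.connection.host option; DATABASE_HOST is a placeholder for the address of one of your HCD nodes:

    spark.cassandra.connection.host DATABASE_HOST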

  3. Launch spark-shell from the root directory of your Spark installation.

    bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_SCALA_VERSION:SCC_VERSION

    Replace SCALA_VERSION with your Scala version, and replace SCC_VERSION with your Spark Cassandra Connector version.
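
    For example, assuming Scala 2.12 and SCC version 3.0.1, a pairing compatible with the Spark 3.0.1 session shown below, the command would be:

    bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1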

    Result
    $ bin/spark-shell
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://localhost:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
          /_/
    
    Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala>
  4. Run the following Scala commands to connect Apache Spark with your database through the SCC:

    // Import the Spark Cassandra Connector API.
    import com.datastax.spark.connector._
    import org.apache.spark.sql.cassandra._

    // Count the rows in system_schema.tables to verify the connection.
    spark.read.cassandraFormat("tables", "system_schema").load().count()
    Result
    scala> import com.datastax.spark.connector._
    import com.datastax.spark.connector._
    
    scala> import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql.cassandra._
    
    scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
    res0: Long = 25
    
    scala> :quit
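
You can also run plain SQL against HCD tables by registering the connector as a Spark SQL catalog. This sketch assumes SCC 3.x, which ships the com.datastax.spark.connector.datasource.CassandraCatalog implementation; the catalog name hcd is an arbitrary choice:

    // Register the SCC catalog under an arbitrary name ("hcd").
    spark.conf.set("spark.sql.catalog.hcd",
      "com.datastax.spark.connector.datasource.CassandraCatalog")

    // Query tables as catalog.keyspace.table with plain Spark SQL.
    spark.sql("SELECT keyspace_name, table_name FROM hcd.system_schema.tables").show()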
