Connect with Apache Spark
When you use Apache Spark in local mode, you can connect only to DSE databases.
Apache Spark with Scala in `spark-shell` connects to and accesses your DataStax Enterprise (DSE) tables for advanced data analysis.
With this approach, you can do the following:

- Directly execute SQL and CQL queries to interact with your data.
- Employ Spark DataFrames and RDDs for sophisticated data manipulation and analysis.
Spark's comprehensive feature set can boost your data analysis and processing capabilities with DSE.
DSE users can use the Spark Cassandra Connector (SCC), which provides better support for container orchestration platforms. For more information, see Advanced Apache Cassandra Analytics Now Open for All.
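For example, here is a minimal sketch of both access styles inside `spark-shell`, assuming the SCC is on the classpath and using hypothetical keyspace and table names:

```scala
import com.datastax.spark.connector._    // enables the RDD API (sc.cassandraTable)
import org.apache.spark.sql.cassandra._  // enables the DataFrame API (cassandraFormat)

// DataFrame read; my_keyspace and my_table are hypothetical placeholders
val df = spark.read.cassandraFormat("my_table", "my_keyspace").load()
df.show(5)

// RDD read of the same table through the connector
val rdd = sc.cassandraTable("my_keyspace", "my_table")
println(rdd.count())
```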
Prerequisites
- A running DSE database
- Download Apache Spark pre-built for Apache Hadoop and Scala. DataStax recommends the latest versions.
- Download and install Scala 2.x.
- Download a compatible version of the Spark Cassandra Connector (SCC) from the Maven central repository.
- Install a compatible Java version and set it as the default Java version, as shown in the example below.
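  For example, on Linux or macOS you can make a compatible JDK the default for your shell session by exporting JAVA_HOME. This is a minimal sketch; the JDK path is a hypothetical example and varies by installation:

  ```bash
  # Hypothetical JDK location; adjust to your installation
  export JAVA_HOME=/usr/lib/jvm/java-11-openjdk
  export PATH="$JAVA_HOME/bin:$PATH"

  # Verify which Java version is now the default
  java -version
  ```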
Connect to a DSE database with Apache Spark
- Extract the Apache Spark package into a directory.

  The following steps use `SPARK_HOME` as a placeholder for the path to your Apache Spark directory.
- Add the following lines to the end of the `spark-defaults.conf` file located at `SPARK_HOME/conf/spark-defaults.conf`. If no such file exists, look for a template in the `SPARK_HOME/conf` directory.

  ```
  spark.cassandra.auth.username SUPERUSER_USERNAME
  spark.cassandra.auth.password SUPERUSER_PASSWORD
  spark.dse.continuousPagingEnabled false
  ```

  Replace SUPERUSER_USERNAME and SUPERUSER_PASSWORD with your superuser credentials.
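  Alternatively, if you prefer not to edit `spark-defaults.conf`, the same properties can be passed on the command line with Spark's standard `--conf` flags. A minimal sketch:

  ```bash
  # One-off configuration via --conf flags instead of spark-defaults.conf;
  # SUPERUSER_USERNAME and SUPERUSER_PASSWORD are your superuser credentials
  bin/spark-shell \
    --conf spark.cassandra.auth.username=SUPERUSER_USERNAME \
    --conf spark.cassandra.auth.password=SUPERUSER_PASSWORD \
    --conf spark.dse.continuousPagingEnabled=false
  ```

  Note that you still need the `--packages` option from the next step to pull in the connector itself.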
- Launch `spark-shell` from the root directory of your Spark installation:

  ```bash
  bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_SCALA_VERSION:SCC_VERSION
  ```

  Replace SCALA_VERSION with your Scala version, and replace SCC_VERSION with your Spark Cassandra Connector version.
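  For example, assuming Scala 2.12 and SCC 3.0.1 (the versions shown in the sample output below), the command would be:

  ```bash
  # Assumes Scala 2.12 and Spark Cassandra Connector 3.0.1
  bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.0.1
  ```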
  Result:

  ```
  $ bin/spark-shell
  Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
  Setting default log level to "WARN".
  To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
  Spark context Web UI available at http://localhost:4040
  Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
  Spark session available as 'spark'.
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
        /_/

  Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
  Type in expressions to have them evaluated.
  Type :help for more information.

  scala>
  ```
- Run the following Scala commands to connect Apache Spark to your database through the SCC:

  ```scala
  import com.datastax.spark.connector._
  import org.apache.spark.sql.cassandra._

  spark.read.cassandraFormat("tables", "system_schema").load().count()
  ```
  Result:

  ```
  scala> import com.datastax.spark.connector._
  import com.datastax.spark.connector._

  scala> import org.apache.spark.sql.cassandra._
  import org.apache.spark.sql.cassandra._

  scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
  res0: Long = 25

  scala> :quit
  ```
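  From here you can query your own tables the same way. A minimal follow-on sketch, assuming a hypothetical keyspace `my_keyspace` and table `my_table`:

  ```scala
  // my_keyspace and my_table are hypothetical placeholders
  val df = spark.read.cassandraFormat("my_table", "my_keyspace").load()
  df.printSchema()

  // Register the DataFrame as a temporary view to query it with Spark SQL
  df.createOrReplaceTempView("my_table_view")
  spark.sql("SELECT COUNT(*) FROM my_table_view").show()
  ```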