Connect with Apache Spark
Using Apache Spark in local mode, you can connect to Serverless (Non-Vector) databases only.
With Apache Spark and Scala in spark-shell, you can connect to and query your Astra DB Serverless tables for advanced data analysis. This approach lets you do the following:
- Directly execute SQL and CQL queries to interact with your data.
- Employ Spark DataFrames and RDDs for sophisticated data manipulation and analysis.
Spark's comprehensive processing features can expand your data analysis capabilities with Astra DB Serverless.
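For example, once you configure a connection (as described in the steps below), you can load a table into a DataFrame and query it with Spark SQL. The following is a minimal sketch in which the keyspace and table names (`my_keyspace`, `my_table`) are hypothetical placeholders for your own schema:

```scala
import org.apache.spark.sql.cassandra._

// Load a hypothetical Astra DB table as a Spark DataFrame.
val df = spark.read.cassandraFormat("my_table", "my_keyspace").load()

// Register the DataFrame as a temporary view and query it with Spark SQL.
df.createOrReplaceTempView("my_table")
spark.sql("SELECT COUNT(*) FROM my_table").show()
```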
All Apache Cassandra® users, including Astra DB users, can use the Spark Cassandra Connector (SCC). The SCC allows for better support of container orchestration platforms. For more information, see Advanced Apache Cassandra Analytics Now Open for All.
Prerequisites
Before you can connect to your Serverless (Non-Vector) database with Apache Spark, you need to complete the following prerequisites:
- Have an active Astra account.
- Create a Serverless (Non-Vector) database in the Astra Portal.
- Create an application token with the required roles. The example in this document requires the Read Only User role.
- Download a Secure Connect Bundle (SCB) to connect to your database using a CQL driver or Spark.
- Download Apache Spark pre-built for Apache Hadoop and Scala. DataStax recommends the latest versions.
- Download and install Scala 2.x.
- Download a compatible version of the Spark Cassandra Connector (SCC) from the Maven Central repository.
- Install a compatible Java version and set it as the default Java version. You can verify your installed versions with the commands after this list.
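As a quick check, the following commands print the versions currently on your PATH so you can confirm compatibility before continuing. Exact output varies by machine; consult the Spark Cassandra Connector compatibility documentation for supported combinations:

```
java -version          # Default Java version
scala -version         # Installed Scala version
spark-shell --version  # Spark version and the Scala version it was built against
```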
Connect to a Serverless (Non-Vector) database with Apache Spark
- Extract the Apache Spark package into a directory. The following steps use SPARK_HOME as a placeholder for the path to your Apache Spark directory.
- Add the following lines to the end of the spark-defaults.conf file located at SPARK_HOME/conf/spark-defaults.conf. If no such file exists, look for a template in the SPARK_HOME/conf directory.

```
spark.files                                   SECURE_CONNECT_BUNDLE_PATH/SECURE_CONNECT_BUNDLE.zip
spark.cassandra.connection.config.cloud.path  SECURE_CONNECT_BUNDLE.zip
spark.cassandra.auth.username                 CLIENT_ID
spark.cassandra.auth.password                 CLIENT_SECRET
spark.dse.continuousPagingEnabled             false
```
Replace the following placeholders:

- SECURE_CONNECT_BUNDLE_PATH: The path to the directory that contains your SCB.
- SECURE_CONNECT_BUNDLE: The file name of your SCB zip file.
- CLIENT_ID and CLIENT_SECRET: The credentials from your application token.
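For example, with a hypothetical bundle file secure-connect-mydb.zip stored in /home/user/scb, the added lines might look like the following. The credential values here are invented placeholders:

```
spark.files                                   /home/user/scb/secure-connect-mydb.zip
spark.cassandra.connection.config.cloud.path  secure-connect-mydb.zip
spark.cassandra.auth.username                 myClientId
spark.cassandra.auth.password                 myClientSecret
spark.dse.continuousPagingEnabled             false
```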
- Launch spark-shell from the root directory of your Spark installation:

```
$ bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_SCALA_VERSION:SCC_VERSION
```

Replace SCALA_VERSION with your Scala version, and replace SCC_VERSION with your Spark Cassandra Connector version.

Result:
```
$ bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
```
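As an alternative to editing spark-defaults.conf, you can pass the same settings at launch with Spark's standard --files and --conf options. This sketch is equivalent to the configuration above and uses the same placeholders:

```
$ bin/spark-shell \
  --packages com.datastax.spark:spark-cassandra-connector_SCALA_VERSION:SCC_VERSION \
  --files SECURE_CONNECT_BUNDLE_PATH/SECURE_CONNECT_BUNDLE.zip \
  --conf spark.cassandra.connection.config.cloud.path=SECURE_CONNECT_BUNDLE.zip \
  --conf spark.cassandra.auth.username=CLIENT_ID \
  --conf spark.cassandra.auth.password=CLIENT_SECRET \
  --conf spark.dse.continuousPagingEnabled=false
```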
- Run the following Scala commands to connect Apache Spark to your database through the SCC:

```scala
import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._

spark.read.cassandraFormat("tables", "system_schema").load().count()
```
Result:

```
scala> import com.datastax.spark.connector._
import com.datastax.spark.connector._

scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._

scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
res0: Long = 25

scala> :quit
```
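Beyond DataFrames, you can also run CQL statements directly from the same session. The following sketch uses the connector's CassandraConnector helper; the keyspace name `my_keyspace` is a hypothetical placeholder for your own schema:

```scala
import com.datastax.spark.connector.cql.CassandraConnector

// Borrow a pooled CQL session from the connector and execute a statement directly.
val connector = CassandraConnector(spark.sparkContext.getConf)
connector.withSessionDo { session =>
  // List the tables in a hypothetical keyspace; substitute your own CQL.
  val rs = session.execute(
    "SELECT table_name FROM system_schema.tables WHERE keyspace_name = 'my_keyspace'")
  rs.forEach(row => println(row.getString("table_name")))
}
```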