Connect with Apache Spark
Using Apache Spark in local mode, you can connect to Serverless (Non-Vector) databases only.
Apache Spark’s comprehensive processing features can boost your data analysis capabilities with Astra DB Serverless.
Apache Spark with Scala in spark-shell seamlessly connects to and accesses your Astra DB Serverless tables for advanced data analysis. You can directly execute SQL and CQL queries to interact with your data, and you can employ Spark DataFrames and RDDs for sophisticated data manipulation and analysis.
Astra DB is compatible with the Spark Cassandra Connector (SCC), which allows for better support of container orchestration platforms. For more information, see Advanced Apache Cassandra Analytics Now Open for All.
Prerequisites
- An active Astra account
- An active Serverless (Non-Vector) database
- An application token and Secure Connect Bundle (SCB) for your database
Prepare packages and dependencies
The following steps assume you will install the latest version of the Spark Cassandra Connector (SCC). If you want to use a different version, you must use Spark, Java, and Scala versions compatible with your chosen SCC version. For more information, see SCC version compatibility.
- Download Apache Spark pre-built for Apache Hadoop and Scala. DataStax recommends the latest version.
- Download the latest SCC package.
- Install Java version 8 or later, and then set it as the default Java version.
- Install Scala version 2.12 or 2.13.
Connect to a Serverless (Non-Vector) database with Apache Spark
- Extract the Apache Spark package into a directory. The following steps use `SPARK_HOME` as a placeholder for the path to your Apache Spark directory.
- Add the following lines to the end of the `spark-defaults.conf` file located at `SPARK_HOME/conf/spark-defaults.conf`. If no such file exists, look for a template in the `SPARK_HOME/conf` directory.

  ```
  spark.files SECURE_CONNECT_BUNDLE_PATH/SECURE_CONNECT_BUNDLE.zip
  spark.cassandra.connection.config.cloud.path SECURE_CONNECT_BUNDLE.zip
  spark.cassandra.auth.username token
  spark.cassandra.auth.password ASTRA_DB_APPLICATION_TOKEN
  spark.dse.continuousPagingEnabled false
  ```
  Replace the following:

  - `SECURE_CONNECT_BUNDLE_PATH`: The path to the directory where you stored your database’s SCB.
  - `SECURE_CONNECT_BUNDLE`: The name of your SCB ZIP file.
  - `ASTRA_DB_APPLICATION_TOKEN`: Your application token.
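  For illustration only, a filled-in block might look like the following. The bundle path, file name, and token shown here are hypothetical placeholders, not real values:

  ```
  # Hypothetical example values; substitute your own SCB path, file name, and token.
  spark.files /path/to/bundle/secure-connect-mydb.zip
  spark.cassandra.connection.config.cloud.path secure-connect-mydb.zip
  spark.cassandra.auth.username token
  spark.cassandra.auth.password AstraCS:xxxxxx:yyyyyy
  spark.dse.continuousPagingEnabled false
  ```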
- Launch `spark-shell` from the root directory of your Spark installation:

  ```
  bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_SCALA_VERSION:SCC_VERSION
  ```
  Replace `SCALA_VERSION` with your Scala version, and replace `SCC_VERSION` with your Spark Cassandra Connector version.

  Result:
  ```
  $ bin/spark-shell
  Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
  Setting default log level to "WARN".
  To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
  Spark context Web UI available at http://localhost:4040
  Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
  Spark session available as 'spark'.
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/ '_/
     /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
        /_/

  Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
  Type in expressions to have them evaluated.
  Type :help for more information.

  scala>
  ```
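  As a concrete illustration, the filled-in command might look like this. The versions shown (Scala 2.12 with SCC 3.5.1) are assumptions for the example; check the SCC version compatibility table for the right pair for your installation:

  ```shell
  # Hypothetical version choice: Scala 2.12 build of SCC 3.5.1.
  bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.1
  ```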
- Run the following Scala commands to connect Apache Spark with your database through the SCC:

  ```scala
  import com.datastax.spark.connector._
  import org.apache.spark.sql.cassandra._

  spark.read.cassandraFormat("tables", "system_schema").load().count()
  ```
  Result:

  ```
  scala> import com.datastax.spark.connector._
  import com.datastax.spark.connector._

  scala> import org.apache.spark.sql.cassandra._
  import org.apache.spark.sql.cassandra._

  scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
  res0: Long = 25

  scala> :quit
  ```
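Beyond counting system tables, the same session supports full DataFrame reads and writes, as well as raw CQL through the connector-managed session. The following is a hedged sketch to run inside `spark-shell`, assuming a hypothetical `demo_keyspace.movies` table (with a `year` column) and a `demo_keyspace.movies_copy` table already exist in your database:

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra._

// Hypothetical keyspace and table names; substitute ones that exist in your database.
val df = spark.read.cassandraFormat("movies", "demo_keyspace").load()

// Standard DataFrame operations work on connector-backed data.
df.filter(df("year") > 2000).show(5)

// Append the rows to another existing table through the same connector.
df.write.cassandraFormat("movies_copy", "demo_keyspace").mode("append").save()

// Raw CQL runs through the connector-managed session.
CassandraConnector(spark.sparkContext.getConf).withSessionDo { session =>
  println(session.execute("SELECT release_version FROM system.local").one().getString(0))
}
```

The DataFrame write path goes through the same `spark-defaults.conf` credentials as the read path, so no extra connection setup is needed once the shell is configured.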