Connect to Astra Managed Clusters with the Apache Cassandra Spark Connector
The comprehensive processing features in Apache Spark™ can boost your data analysis and processing capabilities with Astra Managed Clusters.
Apache Spark with Scala in spark-shell seamlessly integrates with tables in your Astra Managed Cluster databases for advanced data analysis. You can run SQL and CQL queries to interact with your data. You can also use Spark DataFrames and RDDs for sophisticated data manipulation and analysis.
Managed Cluster databases are compatible with the Apache Cassandra Spark Connector, which provides improved support for container orchestration platforms.
Prerequisites
- An active Managed Cluster database
- An application token and Secure Connect Bundle (SCB) for your database
Prepare packages and dependencies
This guide recommends the latest version of the Spark Connector. If you want to use a different version, you must use Spark, Java, and Scala versions compatible with your chosen connector version. For more information, see Spark Connector version compatibility.
- Download Apache Spark pre-built for Apache Hadoop® and Scala. DataStax recommends the latest version.
- Download the latest `cassandra-spark-connector` package.
- Install Java version 8 or later, and then set it as the default Java version.
- Install Scala version 2.12 or 2.13.
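The connector artifact you download (and later pass to `--packages`) is identified by a Maven coordinate built from your Scala binary version and your connector version, which is why these versions must be compatible. A minimal sketch of how the coordinate is assembled; the versions below are hypothetical examples, not recommendations:

```shell
# Hypothetical example versions -- check the Spark Connector compatibility
# information for the versions that match your Spark, Java, and Scala installs
SCALA_VERSION="2.12"
CONNECTOR_VERSION="3.5.1"

# The artifact name embeds the Scala binary version
echo "com.datastax.spark:spark-cassandra-connector_${SCALA_VERSION}:${CONNECTOR_VERSION}"
# → com.datastax.spark:spark-cassandra-connector_2.12:3.5.1
```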
Connect to Astra DB with Spark
- Extract the Apache Spark package into a directory.

  The following steps use `SPARK_HOME` as a placeholder for the path to your Spark directory.

- Add the following lines to the end of the `spark-defaults.conf` file located at `SPARK_HOME/conf/spark-defaults.conf`. If no such file exists, look for a template in the `SPARK_HOME/conf` directory.

  ```
  spark.files PATH/TO/SCB/DIR
  spark.cassandra.connection.config.cloud.path SCB.zip
  spark.cassandra.auth.username token
  spark.cassandra.auth.password APPLICATION_TOKEN
  spark.dse.continuousPagingEnabled false
  ```

  Replace the following:

  - `PATH/TO/SCB/DIR`: The path to the directory where you stored your database's SCB.
  - `SCB.zip`: The name of your SCB zip file.
  - `APPLICATION_TOKEN`: Your application token.

- Launch `spark-shell` from the root directory of your Spark installation:

  ```
  bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_SCALA_VERSION:CONNECTOR_VERSION
  ```

  Replace `SCALA_VERSION` with your Scala version, and replace `CONNECTOR_VERSION` with your Spark Connector version.

  Result:

  ```
  Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
  Setting default log level to "WARN".
  To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
  Spark context Web UI available at http://localhost:4040
  Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
  Spark session available as 'spark'.
  Welcome to
        ____              __
       / __/__  ___ _____/ /__
      _\ \/ _ \/ _ `/ __/  '_/
     /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
        /_/

  Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
  Type in expressions to have them evaluated.
  Type :help for more information.

  scala>
  ```

- Run the following Scala commands to connect Spark to your database through the connector:

  ```scala
  import com.datastax.spark.connector._
  import org.apache.spark.sql.cassandra._

  spark.read.cassandraFormat("tables", "system_schema").load().count()
  ```

  Result:

  ```
  scala> import com.datastax.spark.connector._
  import com.datastax.spark.connector._

  scala> import org.apache.spark.sql.cassandra._
  import org.apache.spark.sql.cassandra._

  scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
  res0: Long = 25

  scala> :quit
  ```
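For orientation, here is what the `spark-defaults.conf` additions from the configuration step might look like once filled in. Every path, filename, and token value below is a hypothetical placeholder; note that the username stays the literal string `token` when you authenticate with an application token.

```
spark.files /home/alice/astra-scb
spark.cassandra.connection.config.cloud.path secure-connect-mydb.zip
spark.cassandra.auth.username token
spark.cassandra.auth.password AstraCS:EXAMPLE_TOKEN_VALUE
spark.dse.continuousPagingEnabled false
```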
Next steps
To learn more about using the Spark Connector, see the Apache Cassandra Spark Connector documentation.