Using Apache Spark to connect to your database

Use Apache Spark to connect to your Astra DB database and access its tables from Scala in spark-shell. Once Spark is connected, you can run SQL statements, work with Spark DataFrames and RDDs, or run CQL statements directly.

Prerequisites

  1. Click Download Bundle (listed as "Connect using a native driver or Spark" under "Integrate with other tools") to download the secure connect bundle that holds the connection credentials for your database. For more, see Downloading secure connect bundle.

  2. Download Apache Spark pre-built for Apache Hadoop 2.7.

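  3. Create an application token with an appropriate role. The example in this procedure only needs a read-only role. Note the token's Client ID and Client Secret, which you enter in the Spark configuration below.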

  4. Download the Spark Cassandra Connector (SCC) that matches your Apache Spark and Scala versions from the Maven Central repository. To find the right SCC version, check the SCC version compatibility information.
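
The version matching in step 4 can be double-checked from a Scala prompt. The following is a minimal sketch, assuming the connector is published under coordinates of the form com.datastax.spark:spark-cassandra-connector_<Scala binary version>:<SCC version>. The helper name sccCoordinate is hypothetical, and the SCC version "3.0.0" is only a placeholder that you should confirm against the SCC compatibility information.

```scala
// Hypothetical helper (not part of Spark or the SCC): builds the Maven
// coordinate of the connector from the full Scala version (for example
// "2.12.10") and an SCC version taken from the compatibility information.
def sccCoordinate(scalaVersion: String, sccVersion: String): String = {
  // The artifact suffix uses the Scala binary version: major.minor only.
  val scalaBinary = scalaVersion.split('.').take(2).mkString(".")
  s"com.datastax.spark:spark-cassandra-connector_$scalaBinary:$sccVersion"
}

// Scala version of the running shell. Inside spark-shell, `spark.version`
// reports the matching Spark version.
val scalaVersion = scala.util.Properties.versionNumberString

// "3.0.0" is a placeholder, not a recommendation.
println(sccCoordinate(scalaVersion, "3.0.0"))
```

If the machine running spark-shell has network access, the printed coordinate can usually be passed to the --packages option of spark-shell instead of supplying the downloaded jar yourself.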

Procedure

Use the following steps if you are using Apache Spark in local mode.

  1. Expand the downloaded Apache Spark package into a directory, set $SPARK_HOME to that directory, and change into it (cd $SPARK_HOME).

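  2. Append the following lines to the end of $SPARK_HOME/conf/spark-defaults.conf. If the file does not exist, create it from spark-defaults.conf.template in the $SPARK_HOME/conf directory.

  3. In the first four lines, replace the second column (the value) with your own settings: the path and file name of your downloaded secure connect bundle (shown here as secure-connect-{{safeName}}.zip), and the Client ID and Client Secret of your application token.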

spark.files $SECURE_CONNECT_BUNDLE_FILE_PATH/secure-connect-{{safeName}}.zip
spark.cassandra.connection.config.cloud.path secure-connect-{{safeName}}.zip
spark.cassandra.auth.username <<CLIENT ID>>
spark.cassandra.auth.password <<CLIENT SECRET>>
spark.dse.continuousPagingEnabled false
  4. Launch spark-shell and enter the following Scala commands:

import com.datastax.spark.connector._
import org.apache.spark.sql.cassandra._
spark.read.cassandraFormat("tables", "system_schema").load().count()

Output similar to the following appears (the application ID, versions, and row count depend on your environment):

$ bin/spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://localhost:4040
Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
Type in expressions to have them evaluated.
Type :help for more information.

scala> import com.datastax.spark.connector._
import com.datastax.spark.connector._

scala> import org.apache.spark.sql.cassandra._
import org.apache.spark.sql.cassandra._

scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
res0: Long = 25

scala> :quit
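
The same spark-shell session can also read your own tables. The following is a minimal sketch of the DataFrame, SQL, RDD, and CQL access paths mentioned in the introduction. It assumes a keyspace named my_keyspace and a table named my_table, both hypothetical, so substitute names that exist in your database. With a read-only token, only read statements are expected to succeed.

```scala
import com.datastax.spark.connector._
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.cassandra._

// Hypothetical names: replace with a keyspace and table from your database.
val keyspace = "my_keyspace"
val table = "my_table"

// DataFrame: load the table through the connector's data source.
val df = spark.read.cassandraFormat(table, keyspace).load()
df.printSchema()
df.show(5)

// SQL: register the DataFrame as a temporary view and query it with Spark SQL.
df.createOrReplaceTempView("my_table_view")
spark.sql("SELECT COUNT(*) AS row_count FROM my_table_view").show()

// RDD: read the same table as an RDD of CassandraRow.
val rdd = sc.cassandraTable(keyspace, table)
println(rdd.count())

// CQL: run a statement directly on a driver session managed by the connector.
CassandraConnector(sc.getConf).withSessionDo { session =>
  val firstRow = session.execute(s"SELECT * FROM $keyspace.$table LIMIT 1").one()
  println(firstRow)
}
```

Note that the count operations scan the whole table, so try them on a small table first.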

The Spark Cassandra Connector (SCC) is available for any Cassandra user, including Astra users. The SCC allows for better support of container orchestration platforms. For more, read Advanced Apache Cassandra Analytics Now Open for All.
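
Outside spark-shell, for example in a packaged application, the same properties from spark-defaults.conf can be set in code. The following is a sketch under stated assumptions: the object name AstraSparkApp, the helper astraSettings, and the environment variables ASTRA_BUNDLE_PATH, ASTRA_CLIENT_ID, and ASTRA_CLIENT_SECRET are all hypothetical, and the master URL is expected to come from spark-submit.

```scala
import java.nio.file.Paths

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.cassandra._

object AstraSparkApp {

  // Hypothetical helper: the same five settings as the spark-defaults.conf
  // lines in the procedure, expressed as key/value pairs.
  def astraSettings(bundlePath: String, clientId: String, clientSecret: String): Map[String, String] = {
    // spark.files ships the bundle with the job, so the connector is pointed
    // at the bare file name rather than the full local path.
    val bundleName = Paths.get(bundlePath).getFileName.toString
    Map(
      "spark.files" -> bundlePath,
      "spark.cassandra.connection.config.cloud.path" -> bundleName,
      "spark.cassandra.auth.username" -> clientId,
      "spark.cassandra.auth.password" -> clientSecret,
      "spark.dse.continuousPagingEnabled" -> "false"
    )
  }

  def main(args: Array[String]): Unit = {
    // Assumed environment variables; sys.env throws if one is missing.
    val settings = astraSettings(
      sys.env("ASTRA_BUNDLE_PATH"),
      sys.env("ASTRA_CLIENT_ID"),
      sys.env("ASTRA_CLIENT_SECRET")
    )

    // Apply every setting to the session builder before creating the session.
    val spark = settings
      .foldLeft(SparkSession.builder().appName("astra-spark-example")) {
        case (builder, (key, value)) => builder.config(key, value)
      }
      .getOrCreate()

    // Same connectivity check as in the spark-shell procedure.
    val tableCount = spark.read.cassandraFormat("tables", "system_schema").load().count()
    println(s"system_schema.tables rows: $tableCount")

    spark.stop()
  }
}
```

Reading the credentials from the environment keeps the Client Secret out of source control. The configuration file approach in the procedure remains the simpler choice for interactive use.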

© 2024 DataStax | Privacy policy | Terms of use
