Use Apache Spark to connect your database

Use Apache Spark to connect to your database and begin accessing your Astra DB tables using Scala in spark-shell. Connect Spark to Astra DB, run SQL statements, interact with Spark DataFrames/RDDs, or even run CQL statements directly.

Prerequisites

  1. Download your database’s secure connect bundle

  2. Download Apache Spark pre-built for Apache Hadoop 2.7

  3. Create an application token with, at minimum, a read-only role

  4. Download the Spark Cassandra Connector (SCC) version that is compatible with your Apache Spark and Scala version

Procedure

Use the following steps if you are using Apache Spark in local mode:

  1. Expand the downloaded Apache Spark package into a directory and assign the directory name to \$SPARK_HOME (cd \$SPARK_HOME).

  2. Append the following lines at the end of a file, $SPARK_HOME/conf/spark-defaults.conf. If necessary, look for a template under the $SPARK_HOME/conf directory.

  3. Replace the second column (value) with the first four lines:

    spark.files $SECURE_CONNECT_BUNDLE_FILE_PATH/secure-connect-{{safeName}}.zip
    spark.cassandra.connection.config.cloud.path secure-connect-{{safeName}}.zip
    spark.cassandra.auth.username <<CLIENT ID>>
    spark.cassandra.auth.password <<CLIENT SECRET>>
    spark.dse.continuousPagingEnabled false
  4. Launch spark-shell and enter the following scala commands:

    import com.datastax.spark.connector._
    import org.apache.spark.sql.cassandra._
    spark.read.cassandraFormat("tables", "system_schema").load().count()
    Response
    $ bin/spark-shell
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://localhost:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
          /_/
    
    Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala> import com.datastax.spark.connector._
    import com.datastax.spark.connector._
    
    scala> import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql.cassandra._
    
    scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
    res0: Long = 25
    
    scala> :quit

The Spark Cassandra Connector (SCC) is available for any Apache Cassandra® user, including Astra DB users. The SCC allows for better support of container orchestration platforms. For more information, see Advanced Apache Cassandra Analytics Now Open for All.

Was this helpful?

Give Feedback

How can we improve the documentation?

© 2024 DataStax | Privacy policy | Terms of use

Apache, Apache Cassandra, Cassandra, Apache Tomcat, Tomcat, Apache Lucene, Apache Solr, Apache Hadoop, Hadoop, Apache Pulsar, Pulsar, Apache Spark, Spark, Apache TinkerPop, TinkerPop, Apache Kafka and Kafka are either registered trademarks or trademarks of the Apache Software Foundation or its subsidiaries in Canada, the United States and/or other countries. Kubernetes is the registered trademark of the Linux Foundation.

General Inquiries: +1 (650) 389-6000, info@datastax.com