Connect with Apache Spark

Using Apache Spark in local mode, you can connect to Serverless (Non-Vector) databases only.

Spark’s distributed processing engine can extend your data analysis and processing capabilities when paired with Astra DB Serverless.

Apache Spark with Scala in spark-shell connects to your Astra DB Serverless tables for advanced data analysis. You can execute SQL and CQL queries directly against your data, and you can use Spark DataFrames and RDDs for more sophisticated data manipulation and analysis.

Astra DB is compatible with the Spark Cassandra Connector (SCC), which allows for better support of container orchestration platforms. For more information, see Advanced Apache Cassandra Analytics Now Open for All.

Prepare packages and dependencies

The following steps assume you will install the latest version of the Spark Cassandra Connector (SCC). If you want to use a different version, you must use Spark, Java, and Scala versions compatible with your chosen SCC version. For more information, see SCC version compatibility.

  1. Download Apache Spark pre-built for Apache Hadoop and Scala. DataStax recommends the latest version.

  2. Download the latest SCC package.

  3. Install Java version 8 or later, and then set it as the default Java version.

  4. Install Scala version 2.12 or 2.13.

Connect to a Serverless (Non-Vector) database with Apache Spark

  1. Extract the Apache Spark package into a directory.

    The following steps use SPARK_HOME as a placeholder for the path to your Apache Spark directory.

  2. Add the following lines to the end of the spark-defaults.conf file located at SPARK_HOME/conf/spark-defaults.conf. If the file doesn't exist, copy the spark-defaults.conf.template file in the SPARK_HOME/conf directory to spark-defaults.conf.

    spark.files SECURE_CONNECT_BUNDLE_PATH/SECURE_CONNECT_BUNDLE.zip
    spark.cassandra.connection.config.cloud.path SECURE_CONNECT_BUNDLE.zip
    spark.cassandra.auth.username token
    spark.cassandra.auth.password ASTRA_DB_APPLICATION_TOKEN
    spark.dse.continuousPagingEnabled false

    Replace the following:

    • SECURE_CONNECT_BUNDLE_PATH: The path to the directory where you stored your database’s SCB.

    • SECURE_CONNECT_BUNDLE: The name of your SCB ZIP file.

    • ASTRA_DB_APPLICATION_TOKEN: Your application token.
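    If you prefer to script this edit, the following sketch appends the five settings to spark-defaults.conf. Every value here is a hypothetical placeholder (bundle directory, file name, token, and the SPARK_HOME fallback); substitute your own.

    ```shell
    # Append the connector settings to spark-defaults.conf.
    # All values below are hypothetical placeholders -- substitute your own.
    SPARK_HOME="${SPARK_HOME:-$HOME/spark}"
    SCB_DIR="$HOME/astra"                 # directory holding your SCB
    SCB_FILE="secure-connect-mydb.zip"    # your SCB file name
    ASTRA_TOKEN="AstraCS:example"         # your application token

    mkdir -p "$SPARK_HOME/conf"
    cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<EOF
    spark.files $SCB_DIR/$SCB_FILE
    spark.cassandra.connection.config.cloud.path $SCB_FILE
    spark.cassandra.auth.username token
    spark.cassandra.auth.password $ASTRA_TOKEN
    spark.dse.continuousPagingEnabled false
    EOF
    ```

    Because the here-document is unquoted, the shell expands the variables before writing, so the file contains literal paths and credentials rather than placeholders.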

  3. Launch spark-shell from the root directory of your Spark installation.

    $ bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_SCALA_VERSION:SCC_VERSION

    Replace SCALA_VERSION with your Scala binary version (for example, 2.12), and replace SCC_VERSION with your Spark Cassandra Connector version.

    Result
    $ bin/spark-shell
    Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
    Setting default log level to "WARN".
    To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
    Spark context Web UI available at http://localhost:4040
    Spark context available as 'sc' (master = local[*], app id = local-1608781805157).
    Spark session available as 'spark'.
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /___/ .__/\_,_/_/ /_/\_\   version 3.0.1
          /_/
    
    Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 11.0.9.1)
    Type in expressions to have them evaluated.
    Type :help for more information.
    
    scala>
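    As a concrete illustration of the --packages coordinate, the following sketch builds it from example values. The Scala 2.12 build and SCC 3.5.1 version are assumptions; check Maven Central for the versions you actually installed.

    ```shell
    # Build the Maven coordinate passed to --packages.
    SCALA_BINARY_VERSION="2.12"   # assumption: the Scala build you installed
    SCC_VERSION="3.5.1"           # assumption: your Spark Cassandra Connector version
    PACKAGES="com.datastax.spark:spark-cassandra-connector_${SCALA_BINARY_VERSION}:${SCC_VERSION}"
    echo "$PACKAGES"
    # Launch (not run here): bin/spark-shell --packages "$PACKAGES"
    ```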
  4. Run the following Scala commands to connect Apache Spark with your database through the SCC:

    import com.datastax.spark.connector._
    import org.apache.spark.sql.cassandra._
    spark.read.cassandraFormat("tables", "system_schema").load().count()
    Result
    scala> import com.datastax.spark.connector._
    import com.datastax.spark.connector._
    
    scala> import org.apache.spark.sql.cassandra._
    import org.apache.spark.sql.cassandra._
    
    scala> spark.read.cassandraFormat("tables", "system_schema").load().count()
    res0: Long = 25
    
    scala> :quit
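    The interactive check above can also be saved to a file and preloaded into spark-shell with the -i option, which is useful for repeatable smoke tests. A minimal sketch; the script file name and the Scala/SCC versions in the commented launch command are assumptions.

    ```shell
    # Write the connectivity check to a Scala script file (name is hypothetical).
    cat > astra-check.scala <<'EOF'
    import com.datastax.spark.connector._
    import org.apache.spark.sql.cassandra._
    println(spark.read.cassandraFormat("tables", "system_schema").load().count())
    EOF

    # Then preload it into the shell (hypothetical versions):
    # bin/spark-shell --packages com.datastax.spark:spark-cassandra-connector_2.12:3.5.1 -i astra-check.scala
    ```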


© 2024 DataStax | Privacy policy | Terms of use
