Get started with the Apache Cassandra® Spark connector Java API
The Apache Cassandra® Spark connector Java API allows you to create Java applications that use Spark to analyze database data.
For the current version of the Cassandra Spark connector used by DataStax Enterprise, see the component versions.
For example project templates, see the SparkBuildExamples repository.
Add Cassandra Spark connector dependencies
Connector versions 2.0 and later include the Java API by default. No additional module dependencies are required.
For connector versions earlier than 2.0, you must add the Java API dependencies to your project.
- Use the Java API in SBT build files
- Use the Java API in Maven build files
For connector versions earlier than 2.0, add the following library dependency to the build.sbt or other SBT build file:
libraryDependencies ++= Seq(
  "com.datastax.dse" % "dse-spark-dependencies" % dseVersion % "provided" excludeAll (
    ExclusionRule("com.datastax.dse", "dse-java-driver-core"),
    ExclusionRule("org.apache.solr", "solr-solrj")
  ),
  "com.datastax.dse" % "dse-java-driver-core" % "1.2.3",
  "org.apache.solr" % "solr-solrj" % "6.0.1"
)
Add the following dependency to the pom.xml file:
<dependency>
  <groupId>com.datastax.dse</groupId>
  <artifactId>dse-spark-dependencies</artifactId>
  <version>${dse.version}</version>
  <scope>provided</scope>
  <exclusions>
    <exclusion>
      <groupId>com.datastax.dse</groupId>
      <artifactId>dse-java-driver-core</artifactId>
    </exclusion>
    <exclusion>
      <groupId>org.apache.solr</groupId>
      <artifactId>solr-solrj</artifactId>
    </exclusion>
  </exclusions>
</dependency>
Then add the DataStax repository:
<repositories>
  <repository>
    <id>DataStax-Repo</id>
    <url>https://datastax.artifactoryonline.com/datastax/public-repos/</url>
  </repository>
</repositories>
Locate the dse-spark-version.jar
The default location of the dse-spark-version.jar file depends on the type of installation:
- Package installations and installer-services installations: /usr/share/dse/dse-spark-version.jar
- Tarball installations and installer-no services installations: installation_location/lib/dse-spark-version.jar
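The two default locations above can be expressed as a small lookup. This is only a sketch: the DseSparkJar class, the InstallType enum, and the defaultPath method are hypothetical names made up for illustration, and the version and installation location must be supplied by the caller because the page leaves them as placeholders.

```java
// Hypothetical helper, not part of DSE or the connector.
// It only restates the two default jar locations listed above.
final class DseSparkJar {
    // PACKAGE covers package and installer-services installations;
    // TARBALL covers tarball and installer-no services installations.
    enum InstallType { PACKAGE, TARBALL }

    private DseSparkJar() { }

    // version and installLocation stand in for the "version" and
    // "installation_location" placeholders used in the text.
    static String defaultPath(InstallType type, String installLocation, String version) {
        String jarName = "dse-spark-" + version + ".jar";
        switch (type) {
            case PACKAGE:
                return "/usr/share/dse/" + jarName;
            case TARBALL:
                return installLocation + "/lib/" + jarName;
            default:
                throw new IllegalArgumentException("Unknown installation type: " + type);
        }
    }
}
```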
Access database data in Scala applications
To perform Spark actions on table data, you need an RDD object.
To create the RDD object, create a Spark configuration object, use it to create a Spark context object, and then call the cassandraTable method on the context with the keyspace and table name:
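import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

// Point the connector at a database node.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "127.0.0.1")
// Create the context, then read the table as an RDD.
val sc = new SparkContext("dse://127.0.0.1:7077", "test", conf)
val rdd = sc.cassandraTable("my_keyspace", "my_table")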
To save data to the database in Scala applications, use the saveToCassandra method, passing in the keyspace, table, and mapping information:
val collection = sc.parallelize(Seq(("key3", 3), ("key4", 4)))
collection.saveToCassandra("my_keyspace", "my_table", SomeColumns("key", "value"))
To perform DSE Graph queries in a Scala application, you can cast a CassandraConnector session to a com.datastax.driver.dse.DseSession and then use the executeGraph method to run graph statements:
val session = CassandraConnector(sc.getConf).withSessionDo(session => session.asInstanceOf[DseSession])
session.executeGraph(graph statement)
Access database data in Java applications
To perform Spark actions on table data, you need a CassandraJavaRDD object, which is a subclass of the JavaRDD class.
The CassandraJavaRDD is the Java language equivalent of the CassandraRDD object used in Scala applications.
To create the CassandraJavaRDD object, create a Spark configuration object, which is then used to create a Spark context object:
import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;

SparkConf conf = new SparkConf()
    .setAppName("My application");
SparkContext sc = new SparkContext(conf);
Use the static methods of the com.datastax.spark.connector.japi.CassandraJavaUtil class to get and manipulate CassandraJavaRDD instances.
To get a new CassandraJavaRDD instance, call one of the javaFunctions methods in CassandraJavaUtil, passing in a context object, and then call the cassandraTable method, passing in the keyspace, table name, and mapping class:
JavaRDD<String> cassandraRdd = CassandraJavaUtil.javaFunctions(sc)
    .cassandraTable("my_keyspace", "my_table", mapColumnTo(String.class))
    .select("my_column");
Map column data to Java types
You can specify the Java type of a single column from a table row by declaring that type when creating the CassandraJavaRDD<T> instance and passing it to the mapColumnTo method.
Then call the select method to set the column name.
JavaRDD<Integer> cassandraRdd = CassandraJavaUtil.javaFunctions(sc)
    .cassandraTable("my_keyspace", "my_table", mapColumnTo(Integer.class))
    .select("column1");
JavaBeans classes can be mapped using the mapRowTo method.
The JavaBeans property names should correspond to the column names following the default mapping rules.
For example, the firstName property maps by default to the first_name column name.
JavaRDD<Person> personRdd = CassandraJavaUtil.javaFunctions(sc)
.cassandraTable("my_keyspace", "my_table", mapRowTo(Person.class));
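The Person class used with mapRowTo is not shown on this page. A minimal sketch of such a JavaBean might look like the following, assuming id, first_name, last_name, and email columns (the column names come from the select call in a later example; the Java types are assumptions, and birthdate is left out because its column type is not given). The ColumnNames helper is hypothetical and only illustrates the camelCase-to-underscore convention described above; it is not the connector's own mapping code.

```java
import java.io.Serializable;

// Hypothetical JavaBean for rows of my_keyspace.my_table.
// In a real project this would usually be a public class in Person.java.
class Person implements Serializable {
    private Integer id;         // assumed type for the id column
    private String firstName;   // expected to map to first_name
    private String lastName;    // expected to map to last_name
    private String email;       // name already matches the column

    // JavaBeans provide a public no-argument constructor.
    public Person() { }

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }

    public String getFirstName() { return firstName; }
    public void setFirstName(String firstName) { this.firstName = firstName; }

    public String getLastName() { return lastName; }
    public void setLastName(String lastName) { this.lastName = lastName; }

    public String getEmail() { return email; }
    public void setEmail(String email) { this.email = email; }
}

// Illustration only: converts a camelCase property name to the
// underscore-separated column name it is expected to pair with.
final class ColumnNames {
    private ColumnNames() { }

    static String toColumnName(String propertyName) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < propertyName.length(); i++) {
            char c = propertyName.charAt(i);
            // Insert an underscore before each interior uppercase letter.
            if (Character.isUpperCase(c) && i > 0) {
                sb.append('_');
            }
            sb.append(Character.toLowerCase(c));
        }
        return sb.toString();
    }
}
```

With this sketch, ColumnNames.toColumnName("firstName") yields first_name, which is the pairing the text describes for the firstName property.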
CassandraJavaPairRDD<K, V> instances are extensions of the JavaPairRDD class, and have mapping readers for rows and columns similar to the previous examples.
These pair RDDs typically are used for key/value pairs, where the first type is the key and the second type is the value.
When mapping one column to the key and one column to the value, call mapColumnTo once with the key type and once with the value type, then call the select method, passing in the key and value column names.
CassandraJavaPairRDD<Integer, String> pairRdd = CassandraJavaUtil.javaFunctions(sc)
    .cassandraTable("my_keyspace", "my_table", mapColumnTo(Integer.class), mapColumnTo(String.class))
    .select("id", "first_name");
Use the mapRowTo method to map row data to a Java type.
For example, to create a pair RDD instance with the primary key and then a JavaBeans object:
CassandraJavaPairRDD<Integer, Person> idPersonRdd = CassandraJavaUtil.javaFunctions(sc)
.cassandraTable("my_keyspace", "my_table", mapColumnTo(Integer.class), mapRowTo(Person.class))
.select("id", "first_name", "last_name", "birthdate", "email");
Save data to the database in Java applications
To save data from an RDD to the database, wrap the RDD with CassandraJavaUtil.javaFunctions and call the writerBuilder method, passing in the keyspace, table name, and optionally type mapping information for the column or row. Then call saveToCassandra on the resulting builder:
CassandraJavaUtil.javaFunctions(personRdd)
    .writerBuilder("my_keyspace", "my_table", mapToRow(Person.class))
    .saveToCassandra();