Getting started with Spark Streaming
Spark Streaming allows you to consume live data streams from sources, including Akka, Kafka, and Twitter. This data can then be analyzed by Spark applications, and the data can be stored in Cassandra. This example uses Scala.
You use Spark Streaming by creating a org.apache.spark.streaming.StreamingContext instance based on your Spark configuration. You then create a DStream instance, or a discretionized stream, an object that represents an input stream. DStream objects are created by calling one of the methods of StreamingContext, or using a utility class from external libraries to connect to other sources like Twitter.
The data you consume and analyze is saved to Cassandra by calling one of the saveToCassandra methods on the stream object, passing in the keyspace name, the table name, and optionally the column names and batch size.
The following Scala example demonstrates how to connect to a text input stream at a particular IP address and port, count the words in the stream, and saves the results to Cassandra.
Procedure
Example
In the following example, you will start a service using the nc utility that repeats strings, then consume the output of that service using Spark Streaming.
In a terminal window, enter the following command to start the service:
$ nc -lk 9999 one two two three three three four four four four someword
In a different terminal start a Spark shell.
$ bin/dse spark
In the Spark shell enter the following:
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._
import com.datastax.spark.connector.streaming._
import com.datastax.spark.connector.cql.CassandraConnector
val conf = new SparkConf().setMaster( "local[2]").setAppName( "NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))
val lines = ssc.socketTextStream( "localhost", 9999)
val words = lines.flatMap(_.split( " "))
val pairs = words.map(word => (word, 1))
CassandraConnector(conf).withSessionDo { session =>
| session.execute(s "CREATE KEYSPACE IF NOT EXISTS streaming_test WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1 }")
| session.execute(s "CREATE TABLE IF NOT EXISTS streaming_test.words_table (word TEXT PRIMARY KEY, count COUNTER)")
| session.execute(s "TRUNCATE streaming_test.words_table")
| }
val wordCounts = pairs.reduceByKey(_ + _)
wordCounts.saveToCassandra( "streaming_test", "words_table", SomeColumns( "word", "count"))
wordCounts.print()
ssc.start()
ssc.awaitTermination()
exit()
Using cqlsh connect to the streaming_test
keyspace and run a query to show the results.
$ cqlsh -k streaming_test cqlsh:streaming_test> select * from words_table; word | count ---------+------- three | 3 one | 1 two | 2 four | 4 someword | 1
What's next
See the Spark Streaming Programming Guide for more information, API documentation, and examples.