Loading external HDFS data into Cassandra using Spark
Data in an external Hadoop HDFS cluster can be accessed from DataStax Enterprise Analytics nodes and saved to Cassandra tables using Spark.
This task demonstrates how to read that Hadoop data and write it to Cassandra using Spark on DSE Analytics nodes.
To simplify access to the Hadoop data, this task uses WebHDFS, a REST API for interacting with an HDFS cluster over HTTP. WebHDFS answers data requests with redirects to the data nodes, so every DSE Analytics node must be able to resolve and route to every HDFS data node by its hostname.
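To see the redirect behavior, you can issue a read request directly against the WebHDFS REST API with curl. The host, port, and file path below are placeholders (the default WebHDFS port is 50070 on Hadoop 2.x and 9870 on Hadoop 3.x):

    # Hypothetical NameNode host and file path.
    # The NameNode does not return the file contents itself; it answers with an
    # HTTP 307 redirect whose Location header names a data node, which the client
    # must then be able to resolve and reach to fetch the actual bytes.
    curl -i "http://hadoop-node.example.com:50070/webhdfs/v1/user/hadoop/weather/2014.csv?op=OPEN"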
These instructions use example weather data, but the principles can be applied to any kind of Hadoop data that can be stored in Cassandra.
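As a preview of the end result, the following is a minimal sketch of the kind of Spark job this task builds, run from the Spark shell started with dse spark. The WebHDFS URI, CSV layout, keyspace, table, and column names are placeholders, and it assumes the Spark Cassandra Connector bundled with DSE Analytics:

    // Run inside the Spark shell started with `dse spark`, where `sc` is predefined.
    import com.datastax.spark.connector._   // brings saveToCassandra into scope

    // Hypothetical WebHDFS URI; the host, port, and path are placeholders for your cluster.
    val lines = sc.textFile("webhdfs://hadoop-node.example.com:50070/user/hadoop/weather/2014.csv")

    // Hypothetical CSV layout: station id, date, temperature.
    val rows = lines.map(_.split(",")).map(f => (f(0), f(1), f(2).toDouble))

    // Hypothetical keyspace and table, created beforehand with matching column names and types.
    rows.saveToCassandra("weather", "daily_temperatures",
      SomeColumns("station_id", "date", "temperature"))

The Cassandra table must exist before the save; the connector writes the tuple fields to the listed columns in order.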
Prerequisites
You will need:
- A working Hadoop installation with HDFS and WebHDFS enabled and running. You will need the hostname of the node serving WebHDFS (typically the NameNode), and the Hadoop cluster must be reachable from the DSE Analytics nodes in your DataStax Enterprise cluster; a quick connectivity check is sketched after this list.
- A running DataStax Enterprise cluster with DSE Analytics nodes enabled.
- Git installed on a DSE Analytics node.
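Before starting, it is worth confirming from each DSE Analytics node that the WebHDFS endpoint is reachable. A minimal check, using a placeholder host, port, and directory:

    # Hypothetical host and path; LISTSTATUS is served directly by the NameNode,
    # so a JSON directory listing confirms basic WebHDFS connectivity.
    curl "http://hadoop-node.example.com:50070/webhdfs/v1/user/hadoop/weather?op=LISTSTATUS"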