Getting started with Analytics and Hadoop in DataStax Enterprise
In DataStax Enterprise, you can run analytics on your Cassandra data through the platform's built-in Hadoop integration. The Hadoop component in DataStax Enterprise is not meant to be a full Hadoop distribution; rather, it enables analytics to run across DataStax Enterprise's distributed, shared-nothing architecture. Instead of using the Hadoop Distributed File System (HDFS), DataStax Enterprise uses Cassandra File System (CFS) keyspaces for the underlying storage layer. This design provides replication and data location awareness, and takes full advantage of Cassandra's peer-to-peer architecture.
DataStax Enterprise supports running analytics on Cassandra data with the following Hadoop components:
- MapReduce
- Hive for running HiveQL queries on Cassandra data
- Pig for exploring very large data sets
- Apache Mahout for machine learning applications
Before starting an analytics/Hadoop node on a production cluster or data center, disable the virtual node configuration. You can skip this step if you only want to run the Hadoop getting started tutorial.
DataStax recommends using virtual nodes only in data centers running Cassandra real-time workloads. If you have enabled virtual nodes on Hadoop nodes, see Disabling virtual nodes.
Starting a DSE Analytics node
The way you start up a DSE Analytics node depends on the type of installation:
- Tarball installs:
From the installation directory:
$ bin/dse cassandra -t
- Packaged installation:
- Enable Hadoop mode by setting this option in /etc/default/dse:
HADOOP_ENABLED=1
- Use this command to start the service:
$ sudo service dse start
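For packaged installations, the edit to /etc/default/dse can be scripted. The following is a minimal sketch, not an official DataStax tool: the function name enable_hadoop_mode is hypothetical, and it assumes the file consists of plain KEY=VALUE lines like the HADOOP_ENABLED=1 setting shown above.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSE): make sure HADOOP_ENABLED=1 is set
# in a defaults file such as /etc/default/dse.
# Assumes the file contains simple KEY=VALUE lines.
enable_hadoop_mode() {
    file=$1
    tmp=$(mktemp) || return 1
    if grep -q '^HADOOP_ENABLED=' "$file"; then
        # Rewrite the existing setting, leaving every other line untouched.
        sed 's/^HADOOP_ENABLED=.*/HADOOP_ENABLED=1/' "$file" > "$tmp"
    else
        # No setting present yet: copy the file and append one.
        awk '{ print } END { print "HADOOP_ENABLED=1" }' "$file" > "$tmp"
    fi
    # Write back through cat so the file keeps its owner and permissions.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

On a packaged installation, the function would be called from a root shell as enable_hadoop_mode /etc/default/dse, followed by the service start command.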
Stopping a DSE Analytics node
The way you stop a DSE Analytics node depends on the type of installation:
- Tarball installs:
- From the install directory:
$ bin/dse cassandra-stop
- Check that the dse process has stopped.
$ ps auwx | grep dse
If the dse process has stopped, the output shows only the grep command itself, for example:
jdoe 12390 0.0 0.0 2432768 620 s000 R+ 2:17PM 0:00.00 grep dse
If the output indicates that the dse process has not stopped, rerun the cassandra-stop command using the process ID (PID) from the top of the output.
$ bin/dse cassandra-stop PID
- Packaged installation:
$ sudo service dse stop
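The tarball stop check above can be scripted so that the grep command itself is never mistaken for a running DSE process. This is a minimal sketch under stated assumptions: the function name list_dse_pids is hypothetical, and it relies on the ps auwx column layout shown in the sample output, where the PID is the second field.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSE): read `ps auwx` output on stdin and
# print the PID of every line that mentions dse.
# Assumes the PID is the second column, as in the sample output above.
list_dse_pids() {
    # The [d]se pattern matches "dse" but not this awk command's own
    # command line; lines from a grep process are skipped as well.
    awk '/[d]se/ && !/grep/ { print $2 }'
}
```

Running ps auwx | list_dse_pids should print nothing once the node has stopped, provided no unrelated process has dse in its command line. Any PID it does print can be supplied to the cassandra-stop command as described in the tarball steps.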