Getting started with Analytics and Hadoop in DataStax Enterprise

In DataStax Enterprise, you can run analytics on your Cassandra data through the platform's built-in Hadoop integration. The Hadoop component is not meant to be a full Hadoop distribution; rather, it enables analytics to run across DataStax Enterprise's distributed, shared-nothing architecture. Instead of the Hadoop Distributed File System (HDFS), DataStax Enterprise uses Cassandra File System (CFS) keyspaces for the underlying storage layer. This design provides replication and data location awareness, and takes full advantage of Cassandra's peer-to-peer architecture.

DataStax Enterprise supports running analytics on Cassandra data with the following Hadoop components:

  • MapReduce
  • Hive for running HiveQL queries on Cassandra data
  • Pig for exploring very large data sets
  • Apache Mahout for machine learning applications

Before starting an analytics/Hadoop node in a production cluster or data center, disable virtual nodes on that node. You can skip this step if you only want to run the Hadoop getting started tutorial.

DataStax recommends using virtual nodes only in data centers running Cassandra real-time workloads. If you have enabled virtual nodes on Hadoop nodes, see Disabling virtual nodes.
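
To find out whether a node currently has virtual nodes enabled, a check along the following lines may help. This is a minimal sketch rather than a DataStax tool: it assumes that virtual nodes are controlled by the num_tokens setting in cassandra.yaml and that a value greater than 1 means they are enabled. The function name vnodes_enabled is hypothetical, and the location of cassandra.yaml depends on your installation, so the path is passed as an argument.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSE): succeeds (exit status 0) when the
# given cassandra.yaml has an uncommented num_tokens setting greater than 1.
# Assumption: num_tokens > 1 is what enables virtual nodes on the node.
vnodes_enabled() {
    yaml="$1"    # path to cassandra.yaml; the location varies by install type
    # Extract the numeric value of the first uncommented num_tokens line.
    tokens=$(sed -n 's/^[[:space:]]*num_tokens:[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$yaml" | head -n 1)
    [ -n "$tokens" ] && [ "$tokens" -gt 1 ]
}

# Example:
#   if vnodes_enabled /path/to/cassandra.yaml; then
#       echo "virtual nodes are enabled; see Disabling virtual nodes"
#   fi
```

A commented-out num_tokens line is treated as disabled, because the pattern only matches lines that begin with the setting name.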

Starting a DSE Analytics node 

The way you start up a DSE Analytics node depends on the type of installation:

  • Tarball installs:
    From the installation directory:
      $ bin/dse cassandra -t
  • Packaged installation:
    1. Enable Hadoop mode by setting this option in /etc/default/dse:
      HADOOP_ENABLED=1
    2. Use this command to start the service:
      $ sudo service dse start
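
For packaged installations, the two steps above can be scripted. The following is a minimal sketch, not an official DataStax script: the function name enable_hadoop_mode is hypothetical, and it assumes the defaults file uses plain KEY=value lines, as the HADOOP_ENABLED=1 setting above suggests. It rewrites an existing HADOOP_ENABLED line or appends one if none is present.

```shell
#!/bin/sh
# Hypothetical helper (not part of DSE): make sure the given defaults file
# contains HADOOP_ENABLED=1, rewriting an existing setting or appending one.
enable_hadoop_mode() {
    dse_defaults="$1"    # on a packaged install: /etc/default/dse
    if grep -q '^[[:space:]]*HADOOP_ENABLED=' "$dse_defaults"; then
        hadoop_tmp=$(mktemp) || return 1
        # Rewrite through a temporary file, then copy the contents back so
        # the original file keeps its owner and permissions.
        sed 's/^[[:space:]]*HADOOP_ENABLED=.*/HADOOP_ENABLED=1/' "$dse_defaults" > "$hadoop_tmp" &&
            cat "$hadoop_tmp" > "$dse_defaults"
        hadoop_status=$?
        rm -f "$hadoop_tmp"
        return "$hadoop_status"
    else
        # No existing setting: append one.
        printf 'HADOOP_ENABLED=1\n' >> "$dse_defaults"
    fi
}

# Example (run as root on a packaged install):
#   enable_hadoop_mode /etc/default/dse && service dse start
```

Taking the file path as an argument lets you try the function on a copy of the defaults file before touching the real one.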

Stopping a DSE Analytics node 

The way you stop a DSE Analytics node depends on the type of installation:

  • Tarball installs:
    1. From the installation directory:
      $ bin/dse cassandra-stop
    2. Check that the dse process has stopped.
      $ ps auwx | grep dse
      If the dse process stopped, the output should be minimal, for example:
      jdoe  12390 0.0 0.0  2432768   620 s000  R+ 2:17PM   0:00.00 grep dse

      If the output shows that the dse process is still running, rerun the cassandra-stop command with the process ID (PID) of that process, which appears in the second column of the ps output:

      $ bin/dse cassandra-stop PID
  • Packaged installation:
    $ sudo service dse stop
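
The tarball stop-and-verify steps above can be combined into one script. This is a minimal sketch under stated assumptions: the function names dse_pids and stop_dse_tarball are hypothetical, the five-second wait is an arbitrary choice, and the match on "dse" is as broad as the grep in the manual steps, so it can also pick up unrelated processes whose command line or user name contains that string.

```shell
#!/bin/sh
# Hypothetical helpers (not part of DSE).

# Reads `ps auwx` output on stdin and prints the PID (second column) of
# every line that mentions dse, skipping the `grep dse` process itself.
dse_pids() {
    awk '/dse/ && !/grep dse/ { print $2 }'
}

# Run from the installation directory of a tarball install.
stop_dse_tarball() {
    bin/dse cassandra-stop
    sleep 5    # assumption: give the process a few seconds to exit
    # Retry with an explicit PID for anything that is still running.
    for pid in $(ps auwx | dse_pids); do
        bin/dse cassandra-stop "$pid"
    done
}

# Example:
#   cd /path/to/dse-install && stop_dse_tarball
```

Keeping the filter in its own function means it can be checked against saved ps output without stopping anything.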