Getting started with Analytics and Hadoop in DataStax Enterprise

You can run analytics on your Cassandra data using Hadoop, which is integrated into DataStax Enterprise. The Hadoop component in DataStax Enterprise is not meant to be a full Hadoop distribution; rather, it enables analytics to be run across DataStax Enterprise's distributed, shared-nothing architecture. Instead of using the Hadoop Distributed File System (HDFS), DataStax Enterprise uses Cassandra File System (CassandraFS, or CFS) keyspaces for the underlying storage layer. This design provides replication and data location awareness, and it takes full advantage of Cassandra's peer-to-peer architecture.

DataStax Enterprise 4.0.3 and later supports internal authentication for running hadoop, hive, pig, and sqoop commands.

DataStax Enterprise supports running analytics on Cassandra data with the following Hadoop components:

  • MapReduce
  • Hive for running HiveQL queries on Cassandra data
  • Pig for exploring very large data sets
  • Apache Mahout for machine learning applications

To get started using DSE Analytics with Hadoop, run the Portfolio Manager demo.

In DataStax Enterprise 4.0, virtual nodes (vnodes) are turned off by default. DataStax does not recommend turning on vnodes for Hadoop or Solr nodes, but you can use vnodes in any Cassandra-only cluster, or in a Cassandra-only data center of a mixed Hadoop/Solr/Cassandra deployment. Before you run Hadoop on a node, make sure that vnodes are disabled on that node.
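
Before starting a node in Hadoop mode, it can help to confirm that vnodes are off. The following is a minimal sketch, not a DataStax tool. It assumes that vnodes are enabled by an uncommented num_tokens setting in cassandra.yaml, and the function name vnodes_enabled is hypothetical. The location of cassandra.yaml depends on the installation type, so the path is passed in as an argument.

```shell
#!/bin/sh
# Sketch: report whether vnodes look enabled in a cassandra.yaml file.
# Assumption: vnodes are enabled by an uncommented "num_tokens:" line;
# a commented-out line (starting with "#") means vnodes are off.

vnodes_enabled() {
    # $1: path to cassandra.yaml (varies by installation type)
    # Exit status 0 if an uncommented num_tokens setting is present.
    grep -Eq '^[[:space:]]*num_tokens[[:space:]]*:' "$1"
}
```

After defining the function, a call such as vnodes_enabled /path/to/cassandra.yaml returns success when the setting is present, which suggests that vnodes should be disabled before the node is used for Hadoop.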

Performance enhancement 

DataStax Enterprise 4.0 improves performance when reading MapReduce files in the CassandraFS. These files are now stored in the page cache, which makes them readily available on the next read.

Starting a DSE Analytics node 

The way you start up a DSE Analytics node depends on the type of installation:

  • Tarball installs:

    From the installation directory:

    $ bin/dse cassandra -t
  • Packaged installation:
    1. Enable Hadoop mode by setting this option in /etc/default/dse:
      HADOOP_ENABLED=1
    2. Use this command to start the service:
      $ sudo service dse start
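
For packaged installations, step 1 above can be scripted. The following is a minimal sketch under stated assumptions: the defaults file holds at most one uncommented HADOOP_ENABLED line, and the function name enable_hadoop is hypothetical (it is not part of DataStax Enterprise). The file path is an argument so that the function can be tried on a copy first.

```shell
#!/bin/sh
# Sketch: set HADOOP_ENABLED=1 in a defaults file such as /etc/default/dse.
# enable_hadoop is a hypothetical helper name, not a DSE command.

enable_hadoop() {
    # $1: path to the defaults file
    file=$1
    tmp=$(mktemp) || return 1
    if grep -q '^HADOOP_ENABLED=' "$file"; then
        # The setting exists: rewrite its value to 1.
        sed 's/^HADOOP_ENABLED=.*/HADOOP_ENABLED=1/' "$file" > "$tmp"
    else
        # The setting is missing: keep the file contents and append it.
        cat "$file" > "$tmp"
        echo 'HADOOP_ENABLED=1' >> "$tmp"
    fi
    # Overwrite the contents rather than the file itself,
    # so the original owner and permissions are kept.
    cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Editing the real /etc/default/dse typically requires root privileges. After the change, start the service with sudo service dse start as described in step 2.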

Stopping a DSE Analytics node 

The way you stop a DSE Analytics node depends on the type of installation:

  • Tarball installs:
    1. From the install directory:
      $ bin/dse cassandra-stop
    2. Check that the dse process has stopped.
      $ ps auwx | grep dse
      If the dse process has stopped, the output shows only the grep command itself, for example:
      jdoe  12390 0.0 0.0  2432768   620 s000  R+ 2:17PM   0:00.00 grep dse

      If the output shows that the dse process is still running, rerun the cassandra-stop command using the process ID (PID) from the top of the output.

      $ bin/dse cassandra-stop PID
  • Packaged installation:
    $ sudo service dse stop
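
The process check in the tarball steps above can be made less error-prone by filtering out the grep line automatically. The following is a minimal sketch: the helper name dse_pids is hypothetical, and it assumes the ps auwx column layout shown in the example output, where the PID is the second column. Like the grep in the steps above, it matches any line that contains the text dse, so it can over-match on systems where that text appears in unrelated commands or user names.

```shell
#!/bin/sh
# Sketch: print the PIDs of processes whose ps line mentions "dse".
# dse_pids is a hypothetical helper name, not a DSE command.

dse_pids() {
    # Reads `ps auwx` output on stdin.
    # /[d]se/ matches the text "dse" without matching this awk command's
    # own entry in the process list; !/grep/ skips any grep processes.
    # The PID is assumed to be the second column.
    awk '/[d]se/ && !/grep/ { print $2 }'
}
```

Running ps auwx | dse_pids prints nothing when no matching process remains. Any PID that it does print can be passed to the cassandra-stop command as described in step 2.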