About DSE Hadoop

You can run analytics on Cassandra data using Hadoop, which is integrated into DataStax Enterprise. The Hadoop component in DataStax Enterprise is not meant to be a full Hadoop distribution; rather, it enables analytics to run across DataStax Enterprise's distributed, shared-nothing architecture. Instead of using the Hadoop Distributed File System (HDFS), DataStax Enterprise uses Cassandra File System (CFS) keyspaces for the underlying storage layer. This provides replication and data location awareness, and takes full advantage of Cassandra's peer-to-peer architecture. DSE Hadoop uses an embedded Apache Hadoop 1.0.4, which eliminates the need to install a separate Hadoop cluster. This is the fastest and easiest way to analyze Cassandra data using Hadoop.
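
Because CFS replaces HDFS, the standard Hadoop filesystem commands work against it through the dse wrapper. The following is a minimal sketch; the directory and file names are hypothetical, and a tarball installation would prefix the commands with bin/:

    $ dse hadoop fs -mkdir /user/jdoe/input            # create a directory in CFS
    $ dse hadoop fs -put mydata.csv /user/jdoe/input   # copy a local file into CFS
    $ dse hadoop fs -ls cfs:///user/jdoe/input         # list it using the cfs:// scheme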

Unless you use DSE Analytics and Search integration, DSE Hadoop workloads are isolated from the other workloads that might run in your cluster (Cassandra and Search) and never access nodes outside of the Analytics datacenter. Therefore, you can run heavy data analysis without affecting the performance of your real-time transactional system.

DataStax Enterprise supports internal authentication when you analyze data using the following Hadoop components:

  • MapReduce
  • Hive for running HiveQL queries on Cassandra data
  • Pig for exploring very large data sets
  • Apache Mahout for machine learning applications
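
For example, the Hive component can query Cassandra data directly from a HiveQL shell. The following is a minimal sketch, assuming a hypothetical keyspace named portfolio_demo containing a stocks table, and assuming that DSE Hadoop exposes Cassandra keyspaces to Hive as databases; use bin/dse hive from a tarball installation directory:

    $ dse hive
    hive> SHOW DATABASES;                             -- Cassandra keyspaces listed as Hive databases
    hive> USE portfolio_demo;                         -- hypothetical keyspace
    hive> SELECT stock, price FROM stocks LIMIT 10;   -- hypothetical table and columns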

To get started using DSE Hadoop, run the Portfolio Manager demo.

DataStax Enterprise turns off virtual nodes (vnodes) by default. Before turning vnodes on, understand the implications of doing so.
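
If you keep vnodes off, each Analytics node is typically assigned a single token in cassandra.yaml. The following is a minimal sketch only; the token value is a placeholder, and you should compute evenly spaced tokens for your own cluster:

    # cassandra.yaml (sketch)
    num_tokens: 1        # one token per node; vnodes remain disabled
    initial_token: 0     # placeholder; assign each node its own token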

Performance enhancement 

DataStax Enterprise optimizes the performance of reading MapReduce files in the Cassandra File System (CFS) by storing the files in the page cache, making them available for the next read.

Starting a DSE Hadoop node 

The way you start up a DSE Hadoop node depends on the type of installation:

  • Installer-Services and Package installations:
    1. Enable Hadoop mode by setting HADOOP_ENABLED=1 in /etc/default/dse.
    2. Use this command to start the service:
      $ sudo service dse start
  • Installer-No Services and Tarball installations:

    From the installation directory:

    $ bin/dse cassandra -t
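
With either installation type, you can check that the node came up as an Analytics node. The following is a minimal sketch assuming a tarball installation directory; package installations can run the same commands without the bin/ prefix:

    $ bin/dsetool ring          # the Workload column should report Analytics for Hadoop nodes
    $ bin/dse hadoop fs -ls /   # quick check that the Cassandra File System is reachable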

Stopping a DSE Hadoop node 

The way you stop a DSE Hadoop node depends on the type of installation:

  • Installer-No Services and Tarball installations:
    1. From the install directory:
      $ bin/dse cassandra-stop
    2. Check that the dse process has stopped.
      $ ps auwx | grep dse
      If the dse process stopped, the output should be minimal, for example:
      jdoe  12390 0.0 0.0  2432768   620 s000  R+ 2:17PM   0:00.00 grep dse

      If the output indicates that the dse process is not stopped, rerun the cassandra-stop command using the process ID (PID) from the top of the output.

      $ bin/dse cassandra-stop PID
  • Installer-Services and Package installations:
    $ sudo service dse stop