Hadoop getting started tutorial

A tutorial about using Apache Hadoop embedded in DataStax Enterprise.

In this tutorial, you download a text file containing a State of the Union speech and run a classic MapReduce job that counts the words in the file and creates a sorted list of word/count pairs as output. The mapper and reducer are provided in a JAR file. Download the State of the Union speech now.
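
To get a feel for what the word count job computes before running it on the cluster, here is a rough local equivalent built from standard Unix tools. It is only a sketch and does not involve Hadoop or the CassandraFS. It assumes that words are split on whitespace and listed in byte order, which matches the sample output at the end of this tutorial. The helper name local_wordcount and the sample file path are hypothetical.

```shell
#!/bin/sh
# Rough local equivalent of the word count job (illustration only; not Hadoop).
# local_wordcount is a hypothetical helper name, not part of DataStax Enterprise.
local_wordcount() {
  # Put each whitespace-separated word on its own line, drop empty lines,
  # sort so identical words are adjacent, count each run, and print
  # "word<TAB>count" pairs.
  tr -s '[:space:]' '\n' < "$1" |
    sed '/^$/d' |
    LC_ALL=C sort |
    uniq -c |
    awk '{ printf "%s\t%s\n", $2, $1 }'
}

# Example with a tiny sample file standing in for the speech:
printf 'the union is strong\nthe state of the union\n' > /tmp/wordcount-sample.txt
local_wordcount /tmp/wordcount-sample.txt
```

Like the job output, the result keeps punctuation attached to words because only whitespace separates them.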

This tutorial assumes you started an analytics node on Linux. It also assumes that you have permission to perform Hadoop and other DataStax Enterprise operations, or that you preface commands with sudo if necessary.

Procedure

  1. Unzip the downloaded obama.txt.zip file into a directory of your choice on your file system.

    This file will be the input for the MapReduce job.

  2. Create a directory in the CassandraFS for the input file using the dse command version of the familiar hadoop fs command. On a Linux tarball installation, for example:
    $ cd install_location
    $ bin/dse hadoop fs -mkdir /user/hadoop/wordcount/input
  3. Copy the input file that you downloaded to the CassandraFS.
    $ bin/dse hadoop fs -copyFromLocal \
      path/obama.txt \
      /user/hadoop/wordcount/input
  4. Check the version number of the hadoop-examples-version.jar. On tarball installations, the JAR is in the DataStax Enterprise resources directory. On packaged and AMI installations, the JAR is in the /usr/share/dse/hadoop/lib directory.
  5. Get usage information about how to run the MapReduce job from the JAR file. Substitute the version number that you found in the previous step for the one shown here.
    $ bin/dse hadoop jar /install_location/resources/hadoop/hadoop-examples-1.0.4.9.jar wordcount

    The output is:

    2013-10-02 12:40:16.983 java[9505:1703] Unable to load realm info from SCDynamicStore
    Usage: wordcount <in> <out>

    If you see the SCDynamicStore message, ignore it. The usage line that follows it is the information you need.

  6. Run the Hadoop word count example in the JAR.
    $ bin/dse hadoop jar \
      /install_location/resources/hadoop/hadoop-examples-1.0.4.9.jar wordcount \
      /user/hadoop/wordcount/input \
      /user/hadoop/wordcount/output

    The output is:

    13/10/02 12:40:36 INFO input.FileInputFormat: Total input paths to process : 1
    13/10/02 12:40:36 INFO mapred.JobClient: Running job: job_201310020848_0002
    13/10/02 12:40:37 INFO mapred.JobClient:  map 0% reduce 0%
    . . .
    13/10/02 12:40:55 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=19164
    13/10/02 12:40:55 INFO mapred.JobClient:   Map-Reduce Framework
  7. List the contents of the output directory on the CassandraFS.
    $ bin/dse hadoop fs -ls /user/hadoop/wordcount/output

    The output looks something like this:

    Found 3 items
    -rwxrwxrwx   1 root wheel      0 2013-10-02 12:58 /user/hadoop/wordcount/output/_SUCCESS
    drwxrwxrwx   - root wheel      0 2013-10-02 12:57 /user/hadoop/wordcount/output/_logs
    -rwxrwxrwx   1 root wheel  24528 2013-10-02 12:58 /user/hadoop/wordcount/output/part-r-00000
  8. Using the output file name from the directory listing, get more information using the dsetool utility.
    $ bin/dsetool checkcfs /user/hadoop/wordcount/output/part-r-00000

    The output is:

    Path: cfs://127.0.0.1/user/hadoop/wordcount/output/part-r-00000
      INode header:
        File type: FILE
        User: root
        Group: wheel
        Permissions: rwxrwxrwx (777)
        Block size: 67108864
        Compressed: true
        First save: true
        Modification time: Wed Oct 02 12:58:05 PDT 2013
      INode:
        Block count: 1
        Blocks:                               subblocks   length   start     end
          (B) f2fa9d90-2b9c-11e3-9ccb-73ded3cb6170:   1    24528       0   24528
              f3030200-2b9c-11e3-9ccb-73ded3cb6170:        24528       0   24528
        Block locations:
        f2fa9d90-2b9c-11e3-9ccb-73ded3cb6170: [localhost]
      Data:
        All data blocks ok.
  9. Finally, view the output of the MapReduce job, the list of word/count pairs, using the familiar hadoop fs -cat command.
    $ bin/dse hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000

    The output is:

    "D."  1
    "Don't  1
    "I  4
    . . .
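
The part file lists words in key order, not by frequency. As a small follow-up, the output of the -cat command can be piped through standard Unix tools to show the most frequent words. The sketch below assumes that each line of the part file is a word and a count separated by a tab. The helper name top_words is hypothetical, and the sample data stands in for the real job output.

```shell
#!/bin/sh
# top_words is a hypothetical helper: it reads "word<TAB>count" lines on stdin
# and prints the N lines with the highest counts (default 10).
top_words() {
  # Sort numerically, descending, on the second tab-separated field;
  # break ties by word in byte order so the result is deterministic.
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "${1:-10}"
}

# Example with sample data standing in for the real job output:
printf 'is\t1\nthe\t3\nunion\t2\nof\t1\n' | top_words 3
```

On the cluster, the same helper could be fed from the command in the last step, for example:

    $ bin/dse hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000 | top_words 20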