A tutorial on using DSE Hadoop, which is embedded in DataStax Enterprise.
In this tutorial, you download a text file containing a State of the Union speech and
run a classic MapReduce job that counts the words in the file and creates a sorted
list of word/count pairs as output. The mapper and reducer are provided in a JAR
file. Download the State of the Union speech now.
This tutorial assumes that you started an analytics
node on Linux. It also assumes that you have permission to perform
Hadoop and other DataStax Enterprise operations, or that you preface
commands with sudo if necessary.
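To make the logic of the job concrete before running it, the word count can be approximated locally with standard Unix tools. This is only a sketch and does not use DSE Hadoop or the provided JAR. The function name wordcount_local is hypothetical, and the sketch assumes that words are split on whitespace and that the result is sorted in byte order.

```shell
# Local sketch of the word count logic; it does not use DSE Hadoop.
# wordcount_local is a hypothetical helper name, not part of DataStax Enterprise.
wordcount_local() {
  # "Map": split the input file ($1) on whitespace, one word per line.
  # "Shuffle": sort in byte order so identical words become adjacent.
  # "Reduce": count each run of identical words, then print word<TAB>count.
  tr -s '[:space:]' '\n' < "$1" \
    | grep -v '^$' \
    | LC_ALL=C sort \
    | LC_ALL=C uniq -c \
    | awk '{ print $2 "\t" $1 }'
}
```

For example, running wordcount_local path/obama.txt on the unzipped input file prints one word/count pair per line, which can be compared informally with the output of the MapReduce job.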
Procedure
-
Unzip the downloaded obama.txt.zip file into a directory
of your choice on your file system.
This file will be the input for the MapReduce job.
-
Create a directory in the Cassandra File System (CFS) for the input file using
the dse version of the familiar hadoop fs command.
For example, on Installer-No Services and Tarball installations:
$ cd install_location
$ bin/dse hadoop fs -mkdir /user/hadoop/wordcount/input
-
Copy the input file that you downloaded to the CFS.
$ bin/dse hadoop fs -copyFromLocal path/obama.txt /user/hadoop/wordcount/input
-
Check the version number of the
hadoop-examples-version.jar
file, located in:
- Installer-Services installations: /usr/share/dse/hadoop/lib
- Installer-No Services installations:
install_location/resources/hadoop
- Package installations: /usr/share/dse/hadoop/lib
- Tarball installations:
install_location/resources/hadoop
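As a convenience, the version can be read from the JAR file name with a small shell function. This is a sketch that assumes the file follows the hadoop-examples-version.jar naming pattern shown in this step. The function name examples_jar_version is hypothetical.

```shell
# examples_jar_version is a hypothetical helper, not a DataStax Enterprise command.
# Pass the directory that matches your installation type.
examples_jar_version() {
  for jar in "$1"/hadoop-examples-*.jar; do
    # Skip the unexpanded pattern when no file matches.
    [ -e "$jar" ] || continue
    name=${jar##*/}                   # strip the directory part
    version=${name#hadoop-examples-}  # strip the file name prefix
    echo "${version%.jar}"            # strip the .jar suffix
  done
}
```

For example, on a Package installation, examples_jar_version /usr/share/dse/hadoop/lib prints the version string to use in the JAR path in the following steps.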
-
Get usage information about how to run the MapReduce job from the jar
file.
$ bin/dse hadoop jar /install_location/resources/hadoop/hadoop-examples-1.0.4.13.jar wordcount
The output is:
2013-10-02 12:40:16.983 java[9505:1703] Unable to load realm info from SCDynamicStore
Usage: wordcount <in> <out>
If you see the SCDynamicStore message, you can ignore it. Information about
the message is available on the internet.
-
Run the Hadoop word count example in the JAR.
$ bin/dse hadoop jar /install_location/resources/hadoop/hadoop-examples-1.0.4.13.jar wordcount /user/hadoop/wordcount/input /user/hadoop/wordcount/output
The output is:
13/10/02 12:40:36 INFO input.FileInputFormat: Total input paths to process : 1
13/10/02 12:40:36 INFO mapred.JobClient: Running job: job_201310020848_0002
13/10/02 12:40:37 INFO mapred.JobClient: map 0% reduce 0%
. . .
13/10/02 12:40:55 INFO mapred.JobClient: FILE_BYTES_WRITTEN=19164
13/10/02 12:40:55 INFO mapred.JobClient: Map-Reduce Framework
-
List the contents of the output directory on the CFS.
$ bin/dse hadoop fs -ls /user/hadoop/wordcount/output
The output looks something like this:
Found 3 items
-rwxrwxrwx 1 root wheel 0 2013-10-02 12:58 /user/hadoop/wordcount/output/_SUCCESS
drwxrwxrwx - root wheel 0 2013-10-02 12:57 /user/hadoop/wordcount/output/_logs
-rwxrwxrwx 1 root wheel 24528 2013-10-02 12:58 /user/hadoop/wordcount/output/part-r-00000
-
Using the output file name from the directory listing, get more information
using the dsetool utility.
$ bin/dsetool checkcfs /user/hadoop/wordcount/output/part-r-00000
The output is:
Path: cfs://127.0.0.1/user/hadoop/wordcount/output/part-r-00000
INode header:
File type: FILE
User: root
Group: wheel
Permissions: rwxrwxrwx (777)
Block size: 67108864
Compressed: true
First save: true
Modification time: Wed Oct 02 12:58:05 PDT 2013
INode:
Block count: 1
Blocks:                                     subblocks  length  start    end
(B) f2fa9d90-2b9c-11e3-9ccb-73ded3cb6170:           1   24528      0  24528
    f3030200-2b9c-11e3-9ccb-73ded3cb6170:               24528      0  24528
Block locations:
f2fa9d90-2b9c-11e3-9ccb-73ded3cb6170: [localhost]
Data:
All data blocks ok.
-
Finally, look at the output of the MapReduce job, the list of word/count pairs,
using a familiar Hadoop command.
$ bin/dse hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000
The output is:
"D." 1
"Don't 1
"I 4
. . .
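The job output is sorted by word, not by count. To see the most frequent words first, the output can be piped through a small filter. This is a sketch that assumes each output line is a word and a count separated by a tab character. The function name top_words is hypothetical.

```shell
# top_words is a hypothetical helper, not a Hadoop or DataStax Enterprise command.
# It reads word<TAB>count lines on standard input and keeps the $1 highest counts.
top_words() {
  # Sort by the second field as a number, descending; break ties by word.
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "$1"
}
```

For example, to list the ten most frequent words:
$ bin/dse hadoop fs -cat /user/hadoop/wordcount/output/part-r-00000 | top_words 10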