Describes the steps to run the Mahout demo included with DSE on an external
installation of Mahout.
The DataStax Enterprise installation includes a Mahout demo. The demo determines, with
some percentage of certainty, which entries in the input data remained statistically
in control and which did not. The input data is historical time series data. Using
the Mahout algorithms, the demo classifies the data into categories based on whether
it exhibited relatively stable behavior over a period of time. The demo produces a
file of classified results. This procedure describes how to run the Mahout demo.
Procedure
-
Go to the Hadoop home directory and make the test data directory.
$ cd <Hadoop home>
$ bin/hadoop fs -mkdir testdata
-
Add the data from the demo directory to the test data directory, where Mahout reads its input.
$ bin/hadoop fs -put <DSE home>/demos/mahout/synthetic_control.data testdata
-
Go to the DSE home directory and run the demo's analysis job using byoh.
$ bin/byoh mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
The job takes some time to complete. If OpsCenter is installed, you can use it to
monitor the progress of the job.
-
When the job completes, write the classified data to a file in a temporary
location.
$ bin/byoh mahout clusterdump --input output/clusters-0-final --pointsDir output/clusteredPoints --output /tmp/clusteranalyze.txt
-
Open the /tmp/clusteranalyze.txt output data file and review the results.
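The output file can be large, so a quick check before opening it in an editor may help. The following is a minimal sketch, not part of DSE or Mahout: summarize_dump is a hypothetical helper name, and the function assumes only that the dump is a plain text file. It reports the line count and prints the start of the first few lines.

```shell
#!/bin/sh
# summarize_dump is a hypothetical helper, not a DSE or Mahout command.
# It makes no assumption about the layout of the clusterdump output
# beyond it being plain text.
summarize_dump() {
    dump_file="$1"

    # Fail early if the file is missing or empty, for example because
    # the clusterdump step did not complete.
    if [ ! -s "$dump_file" ]; then
        echo "missing or empty: $dump_file" >&2
        return 1
    fi

    # Total number of lines in the dump (tr strips padding that some
    # wc implementations add).
    echo "lines: $(wc -l < "$dump_file" | tr -d ' ')"

    # Show the first five lines, each cut to 120 characters because
    # individual lines can be very long.
    head -n 5 "$dump_file" | cut -c 1-120
}
```

With the path used in this procedure, the call would be: summarize_dump /tmp/clusteranalyze.txt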