Running the Spark MLlib demo application
The Spark MLlib demo application shows how to run machine-learning analytic jobs using Spark and DataStax Enterprise. The demo solves the classic iris flower classification problem: it uses the iris flower data set to build a Naive Bayes classifier that recognizes a flower based on four feature measurements.
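To make the classification idea concrete, the following is a minimal, self-contained sketch of a Naive Bayes classifier in plain Python. It is not the demo's code and does not use Spark. It assumes a multinomial model with additive smoothing, and the function names (`train_multinomial_nb`, `predict`) and the six sample rows are illustrative only.

```python
import math
from collections import defaultdict

def train_multinomial_nb(rows, labels, smoothing=1.0):
    """Return (log_priors, log_likelihoods), each keyed by class label.

    Assumption: a multinomial model with additive smoothing; the demo's
    actual model settings are not shown in this document.
    """
    n_features = len(rows[0])
    feature_sums = defaultdict(lambda: [0.0] * n_features)
    class_counts = defaultdict(int)
    for features, label in zip(rows, labels):
        class_counts[label] += 1
        for j, value in enumerate(features):
            feature_sums[label][j] += value

    log_priors = {}
    log_likelihoods = {}
    for label, sums in feature_sums.items():
        # Prior: fraction of training rows that belong to this class.
        log_priors[label] = math.log(class_counts[label] / len(rows))
        # Smoothed share of each feature within the class total.
        denom = sum(sums) + smoothing * n_features
        log_likelihoods[label] = [math.log((s + smoothing) / denom) for s in sums]
    return log_priors, log_likelihoods

def predict(model, features):
    """Return the label with the highest log-posterior score."""
    log_priors, log_likelihoods = model

    def score(label):
        weights = log_likelihoods[label]
        return log_priors[label] + sum(x * w for x, w in zip(features, weights))

    return max(log_priors, key=score)

# Illustrative rows: sepal length, sepal width, petal length, petal width (cm).
rows = [
    (5.1, 3.5, 1.4, 0.2), (4.9, 3.0, 1.4, 0.2),
    (7.0, 3.2, 4.7, 1.4), (6.4, 3.2, 4.5, 1.5),
    (6.3, 3.3, 6.0, 2.5), (5.8, 2.7, 5.1, 1.9),
]
labels = ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"]

model = train_multinomial_nb(rows, labels)
print(predict(model, (5.0, 3.4, 1.5, 0.2)))
```

The four measurements are treated as conditionally independent given the class, which is the "naive" assumption that keeps training to a single pass over the data.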
We strongly recommend that you install the BLAS library on your machines before running Spark MLlib jobs. For instructions on installing the BLAS library on your platform, see https://github.com/fommil/netlib-java/blob/master/README.md#machine-optimised-system-libraries.
The BLAS library is not distributed with DSE due to licensing restrictions, but it significantly improves MLlib performance.
You must have the Gradle build tool installed to build the demo. See the Gradle website for details on installing Gradle on your OS.
Start the nodes in Analytics mode.
Package and Installer-Services installations: See Starting DataStax Enterprise as a service.
Tarball and Installer-No Services installations: See Starting DataStax Enterprise as a stand-alone process.
In a terminal, go to the spark-mllib directory located in the Spark demo directory.
The default location of the Spark demo depends on the type of installation:
Package installations and Installer-Services:
Tarball installations and Installer-No Services:
Build the application using the Gradle build tool.
Use spark-submit to submit the application JAR.
The Spark MLlib demo application reads the Spark demo directory/spark-mllib/iris.csv file on each node. This file must be accessible in the same location on each node. If some nodes do not have the same local file path, set up a shared network location accessible to all the nodes in the cluster.
To run the application where each node has access to the same local location of iris.csv:
$ dse spark-submit NaiveBayesDemo.jar
To specify a shared location of iris.csv:
$ dse spark-submit NaiveBayesDemo.jar /mnt/shared/iris.csv
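The two invocations above differ only in the optional path argument. How the demo handles that argument internally is not shown here; the following Python sketch illustrates one plausible pattern, assuming the first positional argument overrides a built-in default. The names `DEFAULT_IRIS_PATH` and `resolve_iris_path` are hypothetical, and the default value is a placeholder rather than the real installation path.

```python
# Hypothetical placeholder; the real default depends on the installation type.
DEFAULT_IRIS_PATH = "spark-mllib/iris.csv"

def resolve_iris_path(args, default=DEFAULT_IRIS_PATH):
    """Return the first positional argument if one was given, else the default."""
    return args[0] if args else default

# No argument: every node is expected to have the file at the default path.
print(resolve_iris_path([]))
# One argument: every node reads the file from the shared location.
print(resolve_iris_path(["/mnt/shared/iris.csv"]))
```

Whichever path is chosen, it must resolve to the same file on every node, because each node reads the file itself.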