Running the Spark MLlib demo application
The Spark MLlib demo application demonstrates how to run machine-learning analytic jobs using Spark and DataStax Enterprise.
The demo solves the classic iris flower classification problem. The application uses the iris flower data set to build a Naive Bayes classifier that recognizes a flower based on four feature measurements.
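As a brief reminder of the technique (a general statement of Naive Bayes, not taken from the demo source), the classifier assigns a flower with feature measurements x_1, ..., x_4 to the class c that maximizes the class prior times the product of the per-feature likelihoods, assuming the features are conditionally independent given the class:

```latex
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{4} P(x_i \mid c)
```

How the demo estimates P(c) and P(x_i | c) from the data set is not specified here.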
Prerequisites
We strongly recommend that you install the BLAS library on your machines before running Spark MLlib jobs. For instructions on installing the BLAS library on your platform, see https://github.com/fommil/netlib-java/blob/master/README.md#machine-optimised-system-libraries.
The BLAS library is not distributed with DSE due to licensing restrictions, but it improves MLlib performance significantly.
You must have the Gradle build tool installed to build the demo. See https://gradle.org/ for details on installing Gradle on your OS.
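Before starting the procedure, it can help to confirm that Gradle is on the PATH. The following is a minimal sketch only; have_tool is a hypothetical helper name introduced for illustration, not part of DSE or Gradle, and the check does not cover the BLAS library.

```shell
#!/bin/sh
# have_tool is a hypothetical helper (not part of DSE or Gradle).
# It succeeds when the named command can be found on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report whether the gradle command is available before building the demo.
if have_tool gradle; then
    echo "gradle found: $(command -v gradle)"
else
    echo "gradle not found on PATH; install it before building the demo"
fi
```

The check only reports what it finds; it does not install anything or stop the shell.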
Procedure
- Start the nodes in Analytics mode.
  - Package installations: See Starting DataStax Enterprise as a service.
  - Tarball installations: See Starting DataStax Enterprise as a stand-alone process.
- In a terminal, go to the spark-mllib directory located in the Spark demo directory. The default location of the Spark demo depends on the type of installation:
  - Package installations: /usr/share/dse/demos/portfolio_manager
  - Tarball installations: installation_location/demos/portfolio_manager
- Build the application using the Gradle build tool:
gradle
- Use spark-submit to submit the application JAR. The Spark MLlib demo application reads the spark-mllib/iris.csv file under the Spark demo directory on each node. This file must be accessible in the same location on each node. If some nodes do not have the same local file path, set up a shared network location accessible to all the nodes in the cluster.
To run the application when every node can read iris.csv from the same local path:
dse spark-submit NaiveBayesDemo.jar
To specify a shared location of iris.csv:
dse spark-submit NaiveBayesDemo.jar /mnt/shared/iris.csv
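The two invocations above can be wrapped in a small helper that picks between them. This is a sketch under stated assumptions: submit_command is a hypothetical function name, it only prints the command rather than running it, and its readability check applies to the node where the script runs, not to every node in the cluster.

```shell
#!/bin/sh
# submit_command is a hypothetical helper; it prints the spark-submit
# command shown in this procedure instead of executing it.
#   no argument  -> the demo reads iris.csv from its default local path
#   one argument -> that path is passed to the application, provided the
#                   file is readable on this node (other nodes are not checked)
submit_command() {
    if [ "$#" -eq 0 ]; then
        printf '%s\n' "dse spark-submit NaiveBayesDemo.jar"
    elif [ -r "$1" ]; then
        printf '%s\n' "dse spark-submit NaiveBayesDemo.jar $1"
    else
        printf '%s\n' "iris.csv not readable at: $1" >&2
        return 1
    fi
}
```

For example, submit_command /mnt/shared/iris.csv prints the shared-location command if that file is readable locally, and returns a non-zero status otherwise.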