Running the Spark MLlib demo application

The Spark MLlib demo application demonstrates how to run machine-learning analytic jobs using Spark and Cassandra.

The demo solves the classic iris flower classification problem. The application uses the iris flower data set to build a Naive Bayes classifier that identifies a flower from four feature measurements.
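To make the idea concrete, the following is a minimal, standalone Python sketch of a multinomial Naive Bayes classifier trained on six rows of the iris data set. It is not the demo's code and does not use Spark. The function names `train` and `predict`, the choice of the multinomial variant, and the smoothing value of 1.0 are assumptions made for illustration; the demo's actual model settings are not described on this page.

```python
import math

# Six rows from the classic iris data set: sepal length, sepal width,
# petal length, petal width (all in cm), followed by the species label.
TRAINING_ROWS = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def train(rows, smoothing=1.0):
    """Fit a multinomial Naive Bayes model.

    Returns {label: (log_prior, [log_theta per feature])}.
    The smoothing default is an illustrative assumption.
    """
    sums = {}    # label -> per-feature totals
    counts = {}  # label -> number of training rows
    for features, label in rows:
        counts[label] = counts.get(label, 0) + 1
        totals = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            totals[i] += value

    model = {}
    for label, totals in sums.items():
        # Additive smoothing keeps every feature weight strictly positive.
        denominator = sum(totals) + smoothing * len(totals)
        log_theta = [math.log((t + smoothing) / denominator) for t in totals]
        model[label] = (math.log(counts[label] / len(rows)), log_theta)
    return model


def predict(model, features):
    """Return the label with the highest log-posterior score."""
    def score(label):
        log_prior, log_theta = model[label]
        return log_prior + sum(x * w for x, w in zip(features, log_theta))
    return max(model, key=score)


model = train(TRAINING_ROWS)
print(predict(model, (5.0, 3.4, 1.5, 0.2)))  # prints "setosa"
```

Each class gets one weight per feature, so classifying a flower amounts to adding a class prior to a weighted sum of its four measurements and picking the class with the highest score.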

Prerequisites

We strongly recommend that you install the BLAS library on your machines before running Spark MLlib jobs. For instructions on installing the BLAS library on your platform, see https://github.com/fommil/netlib-java/blob/master/README.md#machine-optimised-system-libraries.

The BLAS library is not distributed with DataStax Enterprise due to licensing restrictions, but it significantly improves MLlib performance.

You must have the Gradle build tool installed to build the demo. See https://gradle.org/ for details on installing Gradle on your OS.

Procedure

Start the nodes in Analytics mode.
- Installer-Services and Package installations: See Starting DataStax Enterprise as a service.
- Installer-No Services and Tarball installations: See Starting DataStax Enterprise as a stand-alone process.

In a terminal, go to the spark-mllib directory located in the Spark demo directory.

The default location of the Spark demo depends on the type of installation:

- Installer-Services and Package installations: /usr/share/dse/demos/spark
- Installer-No Services and Tarball installations: `install_location`/demos/spark

Build the application using the Gradle build tool.
```
gradle
```
Use spark-submit to submit the application JAR.

The Spark MLlib demo application reads the iris.csv file from the spark-mllib directory of the Spark demo directory on each node. This file must be accessible at the same path on every node. If some nodes do not have the same local file path, set up a shared network location accessible to all the nodes in the cluster.
To run the application when each node has iris.csv at the same local path:
```
dse spark-submit NaiveBayesDemo.jar
```
To specify a shared location of iris.csv:
```
dse spark-submit NaiveBayesDemo.jar /mnt/shared/iris.csv
```
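Because iris.csv must be readable at the same path on every node, it can help to confirm that before submitting the job. The following is a small sketch, not part of the demo, that prints the size and SHA-256 digest of each file path passed on the command line. The function name `describe_file` and the script name in the comment are hypothetical. Run it on each node with the path you intend to use and compare the output.

```python
import hashlib
import os
import sys


def describe_file(path):
    """Return (size_in_bytes, sha256_hex) for path.

    Raises OSError if the file is missing or unreadable on this node.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return os.path.getsize(path), digest.hexdigest()


if __name__ == "__main__":
    # Hypothetical usage: python check_iris.py /mnt/shared/iris.csv
    for path in sys.argv[1:]:
        size, sha256_hex = describe_file(path)
        print(f"{path}\t{size} bytes\tsha256={sha256_hex}")
```

If a node reports an error or a different digest, the file is missing or differs on that node, and the shared-location form of the command is the safer choice.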