Running the Spark MLlib demo application

The Spark MLlib demo application demonstrates how to run machine-learning analytic jobs using Spark and DataStax Enterprise. The demo solves the classic iris flower classification problem: it uses the iris flower data set to build a Naive Bayes classifier that identifies a flower from four feature measurements.
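To make the idea of a Naive Bayes classifier over four measurements concrete, the following is a minimal sketch in plain Python. It is not the demo's code and does not use Spark or MLlib. It assumes a Gaussian variant of Naive Bayes, which may differ from the model the demo builds, and the function names `train` and `predict` and the handful of sample rows are illustrative only.

```python
import math
from collections import defaultdict

# Illustrative rows only: (sepal length, sepal width, petal length, petal width), label.
SAMPLE_ROWS = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((5.4, 3.9, 1.7, 0.4), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
    ((7.1, 3.0, 5.9, 2.1), "virginica"),
]

def train(rows):
    """Estimate a prior plus per-feature mean and variance for each label."""
    by_label = defaultdict(list)
    for features, label in rows:
        by_label[label].append(features)
    model = {}
    for label, samples in by_label.items():
        n = len(samples)
        columns = list(zip(*samples))
        means = [sum(col) / n for col in columns]
        # Floor the variance so a constant feature cannot cause division by zero.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, 1e-6)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(rows), means, variances)
    return model

def predict(model, features):
    """Return the label with the highest log posterior (up to a constant)."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for x, m, v in zip(features, means, variances):
            # Log of the Gaussian density; features are treated as independent.
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = train(SAMPLE_ROWS)
```

With this sketch, `predict(model, (5.0, 3.4, 1.5, 0.2))` classifies a flower with short, narrow petals. The "naive" part is the assumption that the four measurements are independent given the species, which is why the per-feature log densities are simply added.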

We strongly recommend that you install the BLAS library on your machines before running Spark MLlib jobs. For instructions on installing the BLAS library on your platform, see https://github.com/fommil/netlib-java/blob/master/README.md#machine-optimised-system-libraries.

The BLAS library is not distributed with DSE due to licensing restrictions, but it significantly improves MLlib performance.
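As a rough way to see whether a system BLAS is visible on a node, the sketch below asks the platform's library lookup for a few common library names using Python's standard `ctypes.util.find_library`. The candidate names and the function name `find_system_blas` are assumptions chosen for illustration. Finding a library this way does not guarantee that netlib-java will load it, so treat the result as a hint and follow the linked instructions for your platform.

```python
from ctypes.util import find_library

# Assumed candidate names; the library actually used depends on your platform.
CANDIDATE_NAMES = ("blas", "openblas", "atlas")

def find_system_blas(names=CANDIDATE_NAMES):
    """Map each candidate name to the library the system lookup finds, or None."""
    return {name: find_library(name) for name in names}

if not any(find_system_blas().values()):
    print("No system BLAS found by this check")
```

Running the same check on every node helps spot machines where the library was not installed.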

You must have the Gradle build tool installed to build the demo. See the Gradle website for details on installing Gradle on your OS.

  1. Start the nodes in Analytics mode.

  2. In a terminal, go to the spark-mllib directory located in the Spark demo directory.

    The default location of the Spark demo depends on the type of installation:

    • Package installations and Installer-Services: /usr/share/dse/demos/portfolio_manager

    • Tarball installations and Installer-No Services: installation_location/demos/portfolio_manager

  3. Build the application using the Gradle build tool.

    $ gradle

  4. Use spark-submit to submit the application JAR.

    The Spark MLlib demo application reads the Spark demo directory/spark-mllib/iris.csv file on each node, so the file must be accessible at the same location on every node. If the nodes do not all have the same local file path, set up a shared network location that every node in the cluster can access.

    To run the application when each node has iris.csv at the same local path:

    $ dse spark-submit NaiveBayesDemo.jar

    To specify a shared location of iris.csv:

    $ dse spark-submit NaiveBayesDemo.jar /mnt/shared/iris.csv
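Because the job fails or gives inconsistent results if a node cannot read the same iris.csv, it can help to verify the file on each node before submitting. The sketch below is one hedged way to do that in Python: it picks the path from an optional argument, mirroring the two commands above, and prints nothing itself but returns a SHA-256 fingerprint you can compare across nodes. The default path and the function names `resolve_iris_path` and `file_fingerprint` are hypothetical, and the demo's own argument handling may differ.

```python
import hashlib
import os

# Hypothetical default; substitute the iris.csv path from your own installation.
DEFAULT_IRIS_PATH = "iris.csv"

def resolve_iris_path(args, default=DEFAULT_IRIS_PATH):
    """Use the first argument as the data path if given, else the default."""
    return args[0] if args else default

def file_fingerprint(path):
    """Return the SHA-256 hex digest of a readable file, or None if unreadable."""
    if not (os.path.isfile(path) and os.access(path, os.R_OK)):
        return None
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, calling `file_fingerprint(resolve_iris_path(["/mnt/shared/iris.csv"]))` on each node should return the same digest everywhere. A `None` result means the file is missing or unreadable on that node.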

