Configuring the Spark history server

The Spark history server provides a way to load the event logs from Spark jobs that were run with event logging enabled. The history server can display an application's history only if its event log files were not flushed before the Spark Master attempted to build the history user interface.

Procedure

To enable the Spark history server:

  1. Create a directory for event logs in the DSEFS file system:

    dse fs 'mkdir -p /spark/events'
  2. On each node in the cluster, edit the spark-defaults.conf file to enable event logging and specify the directory for event logs:

    # Turns on logging for applications submitted from this machine
    spark.eventLog.dir dsefs:///spark/events
    spark.eventLog.enabled true
    # Sets the logging directory for the history server
    spark.history.fs.logDirectory dsefs:///spark/events
    # Optional property that changes permissions set to event log files
    # spark.eventLog.permissions=777
  3. Start the Spark history server on one of the nodes in the cluster:

    The Spark history server is a front-end application that displays logging data from all nodes in the Spark cluster. It can be started from any node in the cluster.

    If you have enabled authentication, set the authentication method and credentials in a properties file and pass that file to the dse command. For example, for basic authentication:

    spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.username=<role name>
    spark.hadoop.com.datastax.bdp.fs.client.authentication.basic.password=<password>

    If you set the event log location in spark-defaults.conf, also set the spark.history.fs.logDirectory property to the same location in your properties file:

    spark.history.fs.logDirectory=dsefs:///spark/events

    Start the history server. Without a properties file:

    dse spark-history-server start

    With a properties file:

    dse spark-history-server start --properties-file <properties file>

    If you specify a properties file, none of the configuration in spark-defaults.conf is used. The properties file should contain all the required configuration properties.

    After the history server starts, view it by opening a browser to http://<node hostname>:18080.

    The Spark Master web UI does not show the historical logs. To work around this known issue, access the history from port 18080.

  4. When event logging is enabled, all logs are saved by default, so storage use grows over time. To enable automated cleanup, edit spark-defaults.conf and set the following options:

    spark.history.fs.cleaner.enabled true
    spark.history.fs.cleaner.interval 1d
    spark.history.fs.cleaner.maxAge 7d

    With these settings, automated cleanup is enabled, the cleanup runs once a day, and logs older than seven days are deleted.
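Because a properties file replaces spark-defaults.conf entirely, it can help to keep every history server setting from the steps above in one file. The following is a minimal sketch, not a definitive setup: the file name history-server.properties and the PROPS_FILE variable are hypothetical, and the sketch assumes authentication is not enabled (if it is, add the authentication properties from step 3 to the file). The script starts the server only when the dse command is found on the machine.

```shell
#!/bin/sh
# Sketch: assemble one properties file for the Spark history server.
# PROPS_FILE and its default name are assumptions made for illustration.
PROPS_FILE="${PROPS_FILE:-./history-server.properties}"

# Quoted heredoc delimiter: the contents are written literally, without expansion.
cat > "$PROPS_FILE" <<'EOF'
# Directory the history server reads event logs from
spark.history.fs.logDirectory=dsefs:///spark/events
# Automated cleanup: run daily and delete logs older than seven days
spark.history.fs.cleaner.enabled=true
spark.history.fs.cleaner.interval=1d
spark.history.fs.cleaner.maxAge=7d
EOF

# Start the server only if the dse command is available on this machine.
if command -v dse >/dev/null 2>&1; then
    dse spark-history-server start --properties-file "$PROPS_FILE"
else
    echo "dse not found; wrote $PROPS_FILE without starting the server"
fi
```

Keeping the cleanup options in the same file matters here because, when a properties file is passed, the values in spark-defaults.conf are not read.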


© 2024 DataStax
