Analytics node configuration

Steps to configure analytic nodes for DSE Hadoop.

Important configuration changes, excluding those related to the job tracker, are covered in the sections below: setting the replication factor, configuring the verbosity of log messages, and connecting to a non-standard Cassandra native port.

Advanced users can also configure DataStax Enterprise to run jobs remotely.

DataStax Enterprise turns off virtual nodes (vnodes) by default because using vnodes causes a sharp increase in Hadoop task scheduling latency. This increase is due to the number of Hadoop splits, which cannot be lower than the number of vnodes in the analytics data center. With vnodes, instead of N splits for tiny data you have, for example, 256 * N splits, where N is the number of physical nodes in the cluster. This can raise job latency from tens of seconds to several minutes or even tens of minutes. The increase is relatively insignificant when running jobs for hours to analyze huge quantities of data, which inherently have many splits anyway; for such workloads, vnodes are perfectly fine.

DataStax does not recommend turning on vnodes for other Hadoop use cases or for Solr nodes, but you can use vnodes for any Cassandra-only cluster, or for a Cassandra-only data center in a mixed Hadoop/Solr/Cassandra deployment. If you have enabled virtual nodes on Hadoop nodes, disable them before using the cluster.
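
As a rough sketch, you can check the token-related settings on each analytics node; the path shown assumes a package installation and may differ for your installation type:

# On an analytics (Hadoop) node, num_tokens should be commented out (or set to 1)
# and initial_token should be set to one explicitly assigned token.
$ grep -nE 'num_tokens|initial_token' /etc/dse/cassandra/cassandra.yaml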

Setting the replication factor 

The default replication factor for the HiveMetaStore, cfs, and cfs_archive system keyspaces is 1. A replication factor of 1, using the default Analytics data center, is suitable for development and testing on a single node, not for a production environment. For production clusters, increase the replication factor to at least 3. How high you increase it depends on the number of nodes in the cluster, as discussed in "Choosing keyspace replication options". The default consistency level for CFS operations is QUORUM, so if a node fails in a cluster with a replication factor of 2, Hadoop jobs fail. A higher replication factor ensures resilience to single-node failures. To change the replication factors of these keyspaces:
  1. Change the replication of the cfs and cfs_archive keyspaces from 1 to 3, for example:
    ALTER KEYSPACE cfs
      WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
    ALTER KEYSPACE cfs_archive
      WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
  2. If you use Hive, update the HiveMetaStore keyspace to increase the replication from 1 to 3, for example:
    ALTER KEYSPACE HiveMetaStore
      WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
  3. Run nodetool repair to avoid missing data problems or data unavailable exceptions.
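
A minimal sketch of that repair, limited to the keyspaces changed above (run it on each affected node, or fold it into your normal repair routine):

$ nodetool repair cfs
$ nodetool repair cfs_archive
$ nodetool repair HiveMetaStore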

Configuring the verbosity of log messages 

To adjust the verbosity of log messages for Hadoop map/reduce tasks, add the following settings to the log4j.properties file on each analytic node:

log4j.logger.org.apache.hadoop.mapred=WARN 
log4j.logger.org.apache.hadoop.filecache=WARN

Default Hadoop log4j.properties locations:
  • Installer-Services and Package installations: /etc/dse/hadoop/
  • Installer-No Services and Tarball installations: install_location/resources/hadoop/conf/
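
As a sketch, assuming a package installation (adjust the path for other installation types), you could append the two settings to that file on each analytic node:

$ cat >> /etc/dse/hadoop/log4j.properties <<'EOF'
log4j.logger.org.apache.hadoop.mapred=WARN
log4j.logger.org.apache.hadoop.filecache=WARN
EOF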

Connecting to a non-standard Cassandra native port

If the Cassandra native port was changed to a port other than the default 9042, you must change the cassandra.input.native.port configuration setting so that Hive and Hadoop use the non-default port. The following examples change the Cassandra native protocol connections to use port 9999.
  • Inside the Hive shell, set the port after starting DSE Hive:
    $ dse hive
    hive> set cassandra.input.native.port=9999; 
  • To set the port for all Hive sessions, add cassandra.input.native.port to the hive-site.xml file:
    <property> 
        <name>cassandra.input.native.port</name>
        <value>9999</value> 
    </property> 
  • For Hadoop, add cassandra.input.native.port to the core-site.xml file:
    <property> 
        <name>cassandra.input.native.port</name> 
        <value>9999</value> 
    </property>
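
For reference, the native port itself is controlled by native_transport_port in cassandra.yaml. A quick sketch for confirming which port a node actually uses (the file path assumes a package installation, and node_hostname is a placeholder):

$ grep native_transport_port /etc/dse/cassandra/cassandra.yaml
$ cqlsh node_hostname 9999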

Configuration for running jobs on a remote cluster 

This information is intended for advanced users.

Procedure

To connect to external addresses:

  1. Make sure that hostname resolution for the remote cluster nodes works properly on the local host.
  2. Copy the dse-core-default.xml and dse-mapred-default.xml files from any working remote cluster node to your local Hadoop conf directory.
  3. Run the job using dse hadoop.
  4. If you need to override the job tracker (JT) location, or if DataStax Enterprise cannot detect it automatically, define the HADOOP_JT environment variable before running the job:
    $ export HADOOP_JT=jobtracker_host:jobtracker_port
    $ dse hadoop jar ....
  5. If you need to connect to many different remote clusters from the same host:
    1. Before starting the job, copy the remote Hadoop conf directories fully to the local node (into different locations).
    2. Select the appropriate location by defining HADOOP_CONF_DIR.
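
As a rough end-to-end sketch (hostnames, paths, the job jar, and the main class are placeholders; the conf paths assume package installations on both sides, and the job tracker port should be adjusted to match your cluster):

# Copy the Hadoop config from a working remote node to the local conf directory.
$ scp remote-node:/etc/dse/hadoop/dse-core-default.xml \
      remote-node:/etc/dse/hadoop/dse-mapred-default.xml \
      /etc/dse/hadoop/
# Point at the remote job tracker if it is not detected automatically, then submit the job.
$ export HADOOP_JT=remote-jobtracker:8012
$ dse hadoop jar my-analytics-job.jar com.example.MyJob
# When working with several remote clusters, keep each cluster's conf in its own
# directory and switch between them with HADOOP_CONF_DIR before submitting.
$ export HADOOP_CONF_DIR=/path/to/cluster2-conf
$ dse hadoop jar my-analytics-job.jar com.example.MyJob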