Analytics node configuration

Steps to configure analytics (Hadoop) nodes.

Important configuration changes, excluding those related to the job tracker, are described in the following sections: disabling virtual nodes, setting the replication factor, and configuring the verbosity of log messages.

Advanced users can also configure DataStax Enterprise to run jobs remotely.

Disabling virtual nodes 

DataStax recommends using virtual nodes only on data centers running Cassandra real-time workloads. You should disable virtual nodes on data centers running either Hadoop or Solr workloads.

To disable virtual nodes:

  1. In the cassandra.yaml file, set num_tokens to 1.
    num_tokens: 1
  2. Uncomment the initial_token property and set it to 1, or to the value of a generated token for a multi-node cluster, as shown in the example below.
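For a single node, the relevant cassandra.yaml settings look like the following sketch; the initial_token value shown is only illustrative, and on a multi-node cluster each node would use its own generated token instead:

    num_tokens: 1
    initial_token: 1    # single-node example; use a generated token per node in a multi-node cluster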

Setting the replication factor 

The default replication factor for the HiveMetaStore, cfs, and cfs_archive system keyspaces is 1. A replication factor of 1, using the default Analytics data center, is suitable only for development and testing on a single node, not for a production environment. For production clusters, increase the replication factor to at least 2; the higher replication factor ensures resilience to single-node failures. For example:

ALTER KEYSPACE cfs
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
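
The other system keyspaces named above can be updated with the same statement; the data center name dc1 and replication factor 3 are carried over from the example and should match your cluster (HiveMetaStore is quoted because the keyspace name is mixed case):

ALTER KEYSPACE "HiveMetaStore"
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};
ALTER KEYSPACE cfs_archive
  WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'dc1' : 3};

After increasing the replication factor, run nodetool repair on the affected keyspaces so that existing data is copied to the new replicas.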

Configuring the verbosity of log messages 

To adjust the verbosity of log messages for Hadoop map/reduce tasks, add the following settings to the log4j.properties file on each analytic node:
log4j.logger.org.apache.hadoop.mapred=WARN
log4j.logger.org.apache.hadoop.filecache=WARN

The location of the log4j.properties file depends on the type of installation:

  • Installer-Services and Package installations: /etc/dse/cassandra/
  • Installer-No Services and Tarball installations: install_location/resources/cassandra/conf/

Configuration for running jobs on a remote cluster 

This information is intended for advanced users.

Procedure

To connect to external addresses:

  1. Make sure that hostname resolution for the remote cluster nodes works properly on the local host.
  2. Copy the dse-core-default.xml and dse-mapred-default.xml files from any working remote cluster node to your local Hadoop conf directory.
  3. Run the job using dse hadoop.
  4. If you need to override the job tracker location, or if DataStax Enterprise cannot detect it automatically, define the HADOOP_JT environment variable before running the job:
    $ export HADOOP_JT=<jobtracker host>:<jobtracker port>
    $ dse hadoop jar ....
  5. If you need to connect to several different remote clusters from the same host:
    1. Before starting the job, copy each remote cluster's Hadoop conf directory in full to the local node, each into a different location.
    2. Select the appropriate configuration by setting HADOOP_CONF_DIR, as in the sketch after these steps.
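
For example, the following sketch switches between two remote clusters; the directory paths and job JAR name are illustrative only and should be replaced with the locations you copied the conf directories into and with your own job:

$ export HADOOP_CONF_DIR=/etc/dse/hadoop-conf/cluster-a   # conf copied from cluster A
$ dse hadoop jar myjob.jar

$ export HADOOP_CONF_DIR=/etc/dse/hadoop-conf/cluster-b   # conf copied from cluster B
$ dse hadoop jar myjob.jar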