Analytics node configuration for DSE Hadoop

Steps to configure analytic nodes for DSE Hadoop.

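Important configuration changes, excluding those related to the Job Tracker, are: disabling virtual nodes, setting the replication factor, configuring the verbosity of log messages, and connecting to a non-standard Cassandra native port.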

Advanced users can also configure DataStax Enterprise to run jobs remotely.

DataStax Enterprise turns off virtual nodes (vnodes) by default because using vnodes causes a sharp increase in Hadoop task scheduling latency. The increase is due to the number of Hadoop splits, which cannot be lower than the number of vnodes in the analytics datacenter. With vnodes, instead of N splits for tiny data, you have, for example, 256 * N splits, where N is the number of physical nodes in the cluster. This may raise job latency from tens of seconds to minutes or even tens of minutes. The added latency is relatively insignificant for jobs that run for hours to analyze huge quantities of data, which inherently have many splits anyway; in that case, vnodes are perfectly fine. You can use vnodes for any Cassandra-only cluster, a Cassandra-only datacenter, a Spark datacenter, or a Search-only datacenter in a mixed Hadoop/Search/Cassandra deployment.
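
To make the arithmetic concrete, the following shell sketch compares the minimum number of splits with and without vnodes. The node count of 6 is a hypothetical value chosen for illustration, and 256 is the vnode count used in the example above; neither is read from a real cluster.

```shell
#!/bin/sh
# Minimum number of Hadoop splits for a job over a small table.
# Both arguments are illustrative values, not settings read from a cluster.
min_splits() {
    nodes=$1
    vnodes_per_node=$2
    echo $(( nodes * vnodes_per_node ))
}

echo "without vnodes: $(min_splits 6 1) splits"
echo "with vnodes:    $(min_splits 6 256) splits"
```

Each split becomes at least one task to schedule, which is why the task count, and with it the scheduling overhead, grows with the number of vnodes even when the data itself is tiny.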

Attention: DataStax Enterprise turns off virtual nodes (vnodes) by default. DataStax does not recommend turning on vnodes for DSE Hadoop or BYOH nodes. Before turning vnodes on for Hadoop, understand the implications. DataStax Enterprise does support turning on vnodes for Spark nodes.

Setting the replication factor

Change the default replication factor to a production-appropriate value of at least 3.
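
As a sketch of what this change can look like, the helper below prints ALTER KEYSPACE statements that set the replication factor for one datacenter. The keyspace names (cfs, cfs_archive, HiveMetaStore) and the datacenter name Analytics are assumptions about a typical DSE Hadoop deployment, and the function name is hypothetical; check the keyspaces and datacenter names on your own cluster before running anything.

```shell
#!/bin/sh
# Print ALTER KEYSPACE statements that set the replication factor.
# Keyspace names and the datacenter name are assumptions, not values from this page.
alter_rf_statements() {
    datacenter=$1
    rf=$2
    shift 2
    for keyspace in "$@"; do
        printf "ALTER KEYSPACE \"%s\" WITH replication = {'class': 'NetworkTopologyStrategy', '%s': %s};\n" \
            "$keyspace" "$datacenter" "$rf"
    done
}

alter_rf_statements Analytics 3 cfs cfs_archive HiveMetaStore
```

The printed statements can be reviewed and then run in cqlsh. After raising the replication factor, run nodetool repair so that existing data is replicated to the additional replicas.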

Configuring the verbosity of log messages

To adjust the verbosity of log messages for Hadoop map/reduce tasks, add the following logger elements inside the <configuration> element of the logback.xml file on each analytic node:

<logger name="org.apache.hadoop.mapred" level="WARN"/>
<logger name="org.apache.hadoop.filecache" level="WARN"/>

The location of the logback.xml file depends on the type of installation:

Installer-Services and Package installations	/etc/dse/cassandra/logback.xml
Installer-No Services and Tarball installations	`install_location`/resources/cassandra/logback.xml

Connecting to non-standard Cassandra native port

If the Cassandra native port was changed from the default port 9042, you must set the cassandra.input.native.port configuration setting for Hive and Hadoop to the non-default port. The following examples configure native protocol connections to use port 9999.

Inside the Hive shell, set the port after starting the DataStax Enterprise Hive shell:
```
dse hive
hive> set cassandra.input.native.port=9999; 
```

For general Hive use, add cassandra.input.native.port to the hive-site.xml file:

There are two instances of the hive-site.xml file.

For use with Spark, the default location of the hive-site.xml file is:

Installer-Services and Package installations	/etc/dse/spark/hive-site.xml
Installer-No Services and Tarball installations	`install_location`/resources/spark/conf/hive-site.xml

For use with Hive, the default location of the hive-site.xml file is:

Installer-Services and Package installations	/etc/dse/hive/hive-site.xml
Installer-No Services and Tarball installations	`install_location`/resources/hive/conf/hive-site.xml

<property> 
    <name>cassandra.input.native.port</name>
    <value>9999</value> 
</property>

For Hadoop, add cassandra.input.native.port to the core-site.xml file:

The default location of the core-site.xml file depends on the type of installation:

Installer-Services and Package installations	/etc/dse/hadoop/conf/core-site.xml
Installer-No Services and Tarball installations	`install_location`/resources/hadoop/conf/core-site.xml

<property> 
    <name>cassandra.input.native.port</name> 
    <value>9999</value> 
</property>
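
Once the property is in place, a quick check can confirm that a given site file actually carries the expected port. The helper below is a minimal sketch built on grep and sed; it assumes the <name> and <value> elements sit on adjacent lines, as in the snippets above, and it is not a real XML parser. The function name is hypothetical.

```shell
#!/bin/sh
# Print the value configured for cassandra.input.native.port in a site file.
# Assumes <value> is on the line right after <name>; prints nothing otherwise.
native_port_in() {
    site_file=$1
    grep -A1 '<name>cassandra.input.native.port</name>' "$site_file" \
        | sed -n 's:.*<value>\([0-9]*\)</value>.*:\1:p'
}
```

Calling native_port_in with the path of a hive-site.xml or core-site.xml file prints the configured port, or nothing if the property is missing from that file.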

Configuration for running jobs on a remote cluster

This information is intended for advanced users.

Procedure

To connect to external addresses:

1. Make sure that hostname resolution for the remote cluster nodes works properly on the local host.
2. Copy the dse-core-default.xml and dse-mapred-default.xml files from any working remote cluster node to your local Hadoop conf directory.
3. Run the job using dse hadoop.
4. To override the Job Tracker location, or if DataStax Enterprise cannot automatically detect the Job Tracker location, define the HADOOP_JT environment variable before running the job:
```
$ export HADOOP_JT=jobtracker_host:jobtracker_port
$ dse hadoop jar ....
```
If you need to connect to many different remote clusters from the same host:
1. Before starting the job, copy the remote Hadoop conf directories fully to the local node (into different locations).
2. Select the appropriate location by defining HADOOP_CONF_DIR.
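
The two steps above can be sketched as a small helper that points HADOOP_CONF_DIR at one of several copied conf directories. The base directory ~/dse-remote-conf, the one-subdirectory-per-cluster layout, and the function name are hypothetical; use whatever locations you copied the remote conf directories to.

```shell
#!/bin/sh
# Select the conf directory of one remote cluster before submitting a job.
# CONF_BASE and the per-cluster subdirectory layout are illustrative assumptions.
CONF_BASE=${CONF_BASE:-"$HOME/dse-remote-conf"}

use_cluster() {
    conf_dir="$CONF_BASE/$1"
    if [ ! -d "$conf_dir" ]; then
        echo "no conf directory for cluster '$1' at $conf_dir" >&2
        return 1
    fi
    HADOOP_CONF_DIR=$conf_dir
    export HADOOP_CONF_DIR
}
```

After a call such as use_cluster cluster-a succeeds, a dse hadoop command started from the same shell inherits HADOOP_CONF_DIR. Failing early on a missing directory avoids silently submitting a job with the wrong configuration.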