Analytics node configuration for DSE Hadoop
Steps to configure analytic nodes for DSE Hadoop.
Important configuration changes, excluding those related to the Job Tracker, are:
- Disabling virtual nodes
- Setting the replication factor
- Configuring the verbosity of log messages
- Connecting to non-standard Cassandra native port
Advanced users can also configure DataStax Enterprise to run jobs remotely.
Disabling virtual nodes
DataStax Enterprise turns off virtual nodes (vnodes) by default because using vnodes causes a sharp increase in Hadoop task scheduling latency. The increase is due to the number of Hadoop splits, which cannot be lower than the number of vnodes in the analytics datacenter. With vnodes, instead of N splits for tiny data, you have, for example, 256 * N splits, where N is the number of physical nodes in the cluster. This can raise job latency from tens of seconds to minutes or even tens of minutes. The added latency is relatively insignificant when jobs run for hours to analyze huge quantities of data that inherently have many splits anyway; in that case, vnodes are perfectly fine. You can use vnodes for any Cassandra-only cluster, a Cassandra-only datacenter, a Spark datacenter, or a Search-only datacenter in a mixed Hadoop/Search/Cassandra deployment.
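To make the split arithmetic concrete, the following is a minimal sketch that computes the lower bound on Hadoop splits with and without vnodes. The function name min_splits and the six-node cluster are hypothetical, and the sketch assumes only what the text states: the number of splits cannot be lower than the number of vnodes.

```python
# Rough lower bound on Hadoop input splits in an analytics datacenter.
# Assumption: every vnode (token range) produces at least one split.


def min_splits(physical_nodes: int, vnodes_per_node: int = 1) -> int:
    """Return the minimum number of splits for the datacenter.

    vnodes_per_node=1 models a node with vnodes disabled (single token).
    """
    if physical_nodes < 1 or vnodes_per_node < 1:
        raise ValueError("node and vnode counts must be positive")
    return physical_nodes * vnodes_per_node


if __name__ == "__main__":
    nodes = 6  # hypothetical cluster size
    print("without vnodes:", min_splits(nodes))
    print("with 256 vnodes per node:", min_splits(nodes, 256))
```

For a job over tiny data, each of those extra splits still has to be scheduled as a task, which is where the added latency comes from.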
Setting the replication factor
Change the default replication factor to a production-appropriate value of at least 3.
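As an illustration only, the sketch below builds a CQL ALTER KEYSPACE statement that sets a replication factor of 3. The keyspace name my_keyspace, the datacenter name Analytics, and the choice of NetworkTopologyStrategy are assumptions made for the example; this page does not say which keyspaces or replication strategy apply to your deployment.

```python
# Build an ALTER KEYSPACE statement for a given replication factor.
# Assumptions (not from this page): the keyspace and datacenter names are
# placeholders, and NetworkTopologyStrategy is the strategy in use.


def alter_replication_cql(keyspace: str, datacenter: str, rf: int = 3) -> str:
    """Return a CQL statement that sets the replication factor for one datacenter."""
    if rf < 1:
        raise ValueError("replication factor must be at least 1")
    return (
        f"ALTER KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'NetworkTopologyStrategy', '{datacenter}': {rf}}};"
    )


if __name__ == "__main__":
    # Paste the printed statement into cqlsh after replacing the placeholder names.
    print(alter_replication_cql("my_keyspace", "Analytics"))
```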
Configuring the verbosity of log messages
To adjust the verbosity of log messages for Hadoop map/reduce tasks, add the following logger settings to the logback.xml file on each analytic node:
<logger name="org.apache.hadoop.mapred" level="WARN"/>
<logger name="org.apache.hadoop.filecache" level="WARN"/>
The location of the logback.xml file depends on the type of installation:
Installer-Services and Package installations | /etc/dse/cassandra/logback.xml |
Installer-No Services and Tarball installations | install_location/resources/cassandra/logback.xml |
Connecting to non-standard Cassandra native port
If the Cassandra native port was changed from the default port 9042, you must change the cassandra.input.native.port configuration setting for Hive and Hadoop to use the non-default port. The following examples configure Cassandra native protocol connections to use port 9999.
- Inside the Hive shell, set the port after starting the DataStax Enterprise Hive shell:
  dse hive
  hive> set cassandra.input.native.port=9999;
- For general Hive use, add cassandra.input.native.port to the hive-site.xml file:
  There are two instances of the hive-site.xml file.
  For use with Spark, the default location of the hive-site.xml file is:
  Installer-Services and Package installations | /etc/dse/spark/hive-site.xml |
  Installer-No Services and Tarball installations | install_location/resources/spark/conf/hive-site.xml |
  For use with Hive, the default location of the hive-site.xml file is:
  Installer-Services and Package installations | /etc/dse/hive/hive-site.xml |
  Installer-No Services and Tarball installations | install_location/resources/hive/conf/hive-site.xml |
  <property>
    <name>cassandra.input.native.port</name>
    <value>9999</value>
  </property>
- For Hadoop, add cassandra.input.native.port to the core-site.xml file:
  The default location of the core-site.xml file depends on the type of installation:
  Installer-Services and Package installations | /etc/dse/hadoop/conf/core-site.xml |
  Installer-No Services and Tarball installations | install_location/resources/hadoop/conf/core-site.xml |
  <property>
    <name>cassandra.input.native.port</name>
    <value>9999</value>
  </property>
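If you prefer to script the hive-site.xml or core-site.xml change rather than edit the files by hand, the following is a minimal sketch using Python's standard xml.etree.ElementTree module. The helper name set_property is hypothetical, and the sketch assumes the file uses the usual Hadoop-style layout of property elements inside a single configuration root element.

```python
import xml.etree.ElementTree as ET


def set_property(config_xml: str, name: str, value: str) -> str:
    """Set or add a <property> in a Hadoop-style <configuration> document.

    Assumes <property> elements are direct children of the root element.
    """
    root = ET.fromstring(config_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            # Property already exists: update its value in place.
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # Property not found: append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = "<configuration></configuration>"  # stand-in for the real file contents
    print(set_property(sample, "cassandra.input.native.port", "9999"))
```

Note that ElementTree's default parser discards XML comments, so back up the original file before writing the result over it.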
Configuration for running jobs on a remote cluster
This information is intended for advanced users.
Procedure
To connect to external addresses: