MapReduce performance tuning

DataStax Enterprise includes a Cassandra-enabled Hive MapReduce client. You can change its settings to improve MapReduce performance.

You can change performance settings in the following ways:

  • In an external table definition, using the TBLPROPERTIES or SERDEPROPERTIES clauses.
  • Using the Hive SET command. For example: SET mapred.reduce.tasks=32;
  • In the mapred-site.xml file:
    • Packaged installs: /etc/dse/hadoop/mapred-site.xml
    • Tarball installs: install_location/resources/hadoop/conf/mapred-site.xml
    Note: Settings in this file are system settings, so you must restart the analytics nodes after changing them.

Speeding up MapReduce jobs

Increase your mappers to one per CPU core by setting mapred.tasktracker.map.tasks.maximum in mapred-site.xml.
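
As a minimal sketch, the property could be set in mapred-site.xml as follows. The value 8 is an assumption for illustration only and stands for a node with 8 CPU cores; substitute the core count of your own analytics nodes.

```xml
<!-- Maximum number of map tasks that one task tracker runs at the same time. -->
<!-- The value 8 is a hypothetical example for an 8-core node. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
```

Because this is a change to mapred-site.xml, restart the analytics nodes for it to take effect.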

Increasing the number of map tasks to maximize performance

You can increase the number of map tasks in these ways:
  • Turn off map output compression in the mapred-site.xml file to lower memory usage.
  • The cassandra.input.split.size property specifies the number of rows processed per mapper. The default is 64k rows per split. Decrease the split size to create more mappers.

Out of memory errors

When your map or reduce tasks fail and report Out of Memory (OOM) errors, set the mapred.child.java.opts property in Hive as follows:

SET mapred.child.java.opts="-server -Xmx512M";

You can also lower memory usage by turning off map output compression in mapred-site.xml.
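
A sketch of the corresponding mapred-site.xml entry is shown here. It assumes the Hadoop 1.x property name mapred.compress.map.output; check the property names already present in your own mapred-site.xml before relying on it.

```xml
<!-- Turn off compression of intermediate map output to lower memory usage. -->
<!-- Property name assumed from Hadoop 1.x; verify it against your installation. -->
<property>
  <name>mapred.compress.map.output</name>
  <value>false</value>
</property>
```

As with other changes to mapred-site.xml, restart the analytics nodes afterward.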

Using the Fair Scheduler 

The Hadoop Fair Scheduler assigns resources to jobs to balance the load, so that each job gets roughly the same amount of CPU time. The fair-scheduler.xml file is located in the resources/hadoop/conf directory of the DataStax Enterprise installation.

To enable the Fair Scheduler, uncomment the section in the mapred-site.xml file that looks something like this:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
. . .
  <value>dse-3.0.2/dse/resources/hadoop/conf/fair-scheduler.xml</value>
</property>

You might need to change the value element shown here. If the Fair Scheduler file has a different name, change the name of the file to fair-scheduler.xml. Specify the absolute path to the file.

DataStax Enterprise also supports the Capacity Scheduler.