MapReduce performance tuning

DataStax Enterprise includes a Cassandra-enabled Hive MapReduce client. You can change settings to improve MapReduce performance.

You can change performance settings in the following ways:

  • In an external table definition, using the TBLPROPERTIES or SERDEPROPERTIES clauses.
  • Using the Hive SET command. For example: SET mapred.reduce.tasks=32;
  • In the mapred-site.xml file.
    Note: Restart the analytics nodes after you make changes to mapred-site.xml.
The default location of the mapred-site.xml file depends on the type of installation:

  • Installer-Services and Package installations: /etc/dse/hadoop/mapred-site.xml
  • Installer-No Services and Tarball installations: install_location/resources/hadoop/conf/mapred-site.xml

Performance changes using mapred-site.xml

Speeding up MapReduce jobs
Increase the number of mappers to one per CPU core by setting the mapred.tasktracker.map.tasks.maximum property.
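As a minimal sketch, the entry in mapred-site.xml for a hypothetical analytics node with 8 CPU cores might look like the following. The value 8 is an assumption for illustration only; substitute the core count of your own nodes.

```xml
<!-- Goes inside the <configuration> element of mapred-site.xml -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <!-- Assumed example value: one map slot per core on an 8-core node -->
  <value>8</value>
</property>
```

Restart the analytics nodes for the change to take effect.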
Increasing the number of map tasks to maximize performance 
You can increase the number of map tasks in these ways:
  • Turn off map output compression in the mapred-site.xml file to lower memory usage.
  • Decrease the split size to create more mappers. The cassandra.input.split.size property specifies the number of rows processed per mapper. The default is 64k rows per split.
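Turning off map output compression could be sketched as follows. This assumes the Hadoop 1.x property name mapred.compress.map.output; check the mapred-site.xml file shipped with your installation to confirm that the property is present under this name.

```xml
<!-- Goes inside the <configuration> element of mapred-site.xml -->
<property>
  <!-- Assumed property name (Hadoop 1.x); verify against your own file -->
  <name>mapred.compress.map.output</name>
  <!-- false disables compression of intermediate map output -->
  <value>false</value>
</property>
```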
Out of memory errors
When mapper or reducer tasks fail with Out of Memory (OOM) errors, set the mapred.child.java.opts property in Hive as follows:
SET mapred.child.java.opts="-server -Xmx512M";
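The Hive SET command applies only to the current session. If you want the same JVM options to apply to every job, one possible approach is to put the property in mapred-site.xml instead. The following is a sketch under that assumption; the value is the one from the SET command above and might need adjusting for your workload.

```xml
<!-- Goes inside the <configuration> element of mapred-site.xml -->
<property>
  <name>mapred.child.java.opts</name>
  <!-- JVM options passed to each map and reduce child process -->
  <value>-server -Xmx512M</value>
</property>
```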

Load balancing using the Fair Scheduler

The Hadoop Fair Scheduler assigns resources to jobs to balance the load, so that each job gets roughly the same amount of CPU time.

To enable the Fair Scheduler, uncomment the section in the mapred-site.xml file that looks similar to this:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
. . .
  <value>dse-3.0.2/dse/resources/hadoop/conf/fair-scheduler.xml</value>
</property>

You might need to change the value element shown here so that it specifies the absolute path to the Fair Scheduler file. If the file has a different name, rename it to fair-scheduler.xml.

The default location of the fair-scheduler.xml file depends on the type of installation:

  • Installer-Services and Package installations: /etc/dse/hadoop/fair-scheduler.xml
  • Installer-No Services and Tarball installations: install_location/resources/hadoop/conf/fair-scheduler.xml
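For orientation, a fair-scheduler.xml file in the Hadoop 1.x allocation format might look like the sketch below. The pool name and all numeric values are hypothetical, chosen only to show the shape of the file; the file shipped with your installation might differ.

```xml
<?xml version="1.0"?>
<allocations>
  <!-- "analytics" is a hypothetical pool name used for illustration -->
  <pool name="analytics">
    <!-- Minimum map and reduce slots guaranteed to the pool (assumed values) -->
    <minMaps>4</minMaps>
    <minReduces>2</minReduces>
    <!-- Limit on concurrently running jobs in the pool (assumed value) -->
    <maxRunningJobs>5</maxRunningJobs>
  </pool>
  <!-- Default limit on running jobs per user (assumed value) -->
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>
```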

DataStax Enterprise also supports the Capacity Scheduler.