MapReduce performance tuning
DataStax Enterprise includes a Cassandra-enabled Hive MapReduce client. You can change settings to improve MapReduce performance.
You can change performance settings in the following ways:
- In an external table definition, using the TBLPROPERTIES or SERDEPROPERTIES clauses.
- Using the Hive SET command. For example:
SET mapred.reduce.tasks=32;
- In the mapred-site.xml file.
Note: Restart the analytics nodes after you make changes to mapred-site.xml.
The location of the mapred-site.xml file depends on the type of installation:
Installer-Services and Package installations | /etc/dse/hadoop/mapred-site.xml |
Installer-No Services and Tarball installations | install_location/resources/hadoop/conf/mapred-site.xml |
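As a rough illustration of the mapred-site.xml option, the property from the SET example could instead be given a default value in the file. This is a minimal sketch: the value 32 only mirrors the SET example and is not a recommendation, and a SET issued in a Hive session would normally override this default for that session.

```xml
<!-- Sketch only: place inside the <configuration> element of mapred-site.xml. -->
<!-- The value 32 is illustrative, copied from the SET example; tune it for your cluster. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>32</value>
</property>
```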
Performance changes using mapred-site.xml
- Speeding up MapReduce jobs
  - Increase your mappers to one per CPU core by setting mapred.tasktracker.map.tasks.maximum.
- Increasing the number of map tasks to maximize performance
  - You can increase the number of map tasks in these ways:
    - Turn off map output compression in the mapred-site.xml file to lower memory usage.
    - Decrease the cassandra.input.split.size property, which specifies the number of rows processed per mapper. The default is 64k rows per split; a smaller split size creates more mappers.
- Out of Memory Errors
  - When your map or reduce tasks fail with Out of Memory (OOM) errors, change the mapred.child.java.opts setting in Hive to:
SET mapred.child.java.opts="-server -Xmx512M"
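The mapred-site.xml changes described in this list might look like the following fragment. It is a sketch under stated assumptions: the value 8 assumes a node with 8 CPU cores, and mapred.compress.map.output is the Hadoop 1.x property name for map output compression, so confirm it against the Hadoop version bundled with your installation.

```xml
<!-- Sketch only: place inside the <configuration> element of mapred-site.xml. -->

<!-- One map slot per CPU core. The value 8 assumes an 8-core node. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>

<!-- Turn off map output compression (Hadoop 1.x property name, assumed here). -->
<property>
  <name>mapred.compress.map.output</name>
  <value>false</value>
</property>
```

Restart the analytics nodes after editing the file so that the new values take effect.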
Load balancing using the Fair Scheduler
The Hadoop Fair Scheduler assigns resources to jobs to balance the load, so that each job gets roughly the same amount of CPU time.
To enable the Fair Scheduler, uncomment the section in the mapred-site.xml file that looks similar to this:
<property>
<name>mapred.jobtracker.taskScheduler</name>
<value>org.apache.hadoop.mapred.FairScheduler</value>
</property>
. . .
<value>dse-3.0.2/dse/resources/hadoop/conf/fair-scheduler.xml</value>
</property>
You might need to change the value element shown here. If the Fair Scheduler file has a different name, change the name of the file to fair-scheduler.xml. Specify the absolute path to the file.
The location of the fair-scheduler.xml file depends on the type of installation:
Installer-Services and Package installations | /etc/dse/hadoop/fair-scheduler.xml |
Installer-No Services and Tarball installations | install_location/resources/hadoop/conf/fair-scheduler.xml |
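For orientation, a fair-scheduler.xml file might be laid out roughly as follows. This sketch assumes the allocation file format of the Hadoop 1.x Fair Scheduler. The pool name, user name, and all numeric values are hypothetical placeholders, not defaults shipped with DataStax Enterprise.

```xml
<?xml version="1.0"?>
<!-- Sketch only: Hadoop 1.x Fair Scheduler allocation file format is assumed. -->
<!-- "sample_pool", "sample_user", and every number below are hypothetical. -->
<allocations>
  <pool name="sample_pool">
    <!-- Minimum map and reduce slots guaranteed to the pool. -->
    <minMaps>5</minMaps>
    <minReduces>5</minReduces>
    <!-- A pool with weight 2.0 gets about twice the share of a pool with weight 1.0. -->
    <weight>2.0</weight>
  </pool>
  <user name="sample_user">
    <!-- Limit on concurrently running jobs for this user. -->
    <maxRunningJobs>6</maxRunningJobs>
  </user>
  <!-- Running-job limit for users without an explicit entry. -->
  <userMaxJobsDefault>3</userMaxJobsDefault>
</allocations>
```

Jobs in pools with no explicit entry share the remaining capacity evenly, which matches the default balancing behavior described above.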
DataStax Enterprise also supports the Capacity Scheduler.