Operating system performance metrics

Tracking operating system metrics on Cassandra nodes to watch for disk I/O, network, memory and CPU utilization trends helps identify and troubleshoot hardware-related performance problems.

As with any database system, Cassandra performance depends greatly on the underlying systems on which it runs. Monitoring Cassandra nodes for increasing disk and CPU utilization can help identify and remedy issues before performance degrades to unacceptable levels. The graphs in OpsCenter provide a quick way to view variations in OS metrics at a glance and to drill down to specific data points. Monitoring disk space is also important, especially in systems with heavy write loads, because it allows expansion to be planned in advance while there is still adequate capacity to handle expansion and rebalancing operations.

The names of system metrics are prefixed with "OS:".

OS: Memory

Shows memory usage metrics in megabytes.
  • Linux - Shows how much total system memory is currently used, cached, buffered or free.
  • Windows - Shows the available physical memory, the cached operating system code, and the allocated pool-paged-resident and pool-nonpaged memory.
  • Mac OS X - Shows free and used system memory.
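
As a rough illustration of the Linux breakdown above, the following sketch parses text in the format of /proc/meminfo and reports used, cached, buffered, and free memory in megabytes. It is not how OpsCenter collects the metric. The function name memory_breakdown_mb and the definition of "used" as total minus free, buffers, and cached are assumptions made for this example.

```python
def memory_breakdown_mb(meminfo_text):
    """Return used, cached, buffered and free memory in MB.

    Expects text in the format of /proc/meminfo, where values are in kB.
    """
    kb = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            kb[name.strip()] = int(fields[0])

    free = kb["MemFree"]
    buffers = kb["Buffers"]
    cached = kb["Cached"]
    # Assumption for this sketch: "used" excludes buffers and page cache.
    used = kb["MemTotal"] - free - buffers - cached

    return {
        "used": used / 1024,
        "cached": cached / 1024,
        "buffers": buffers / 1024,
        "free": free / 1024,
    }
```

On a Linux node, the function could be fed the contents of /proc/meminfo, for example with open("/proc/meminfo").read().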

OS: CPU

Shows average percentages for CPU utilization metrics. CPU utilization is 100 percent minus the percentage of time the CPU was idle. CPU metrics can be useful for determining the origin of CPU performance reduction.
  • Linux - Shows how much time the CPU devotes to system and user processes, to processing nice tasks, and to waiting for I/O to complete, as well as the time stolen by the hypervisor to serve other virtual machines. High percentages of nice might indicate that other processes are crowding out Cassandra processes, while high percentages of iowait might indicate I/O contention. In fully virtualized environments such as Amazon EC2, a Cassandra cluster under load might show high steal values while other virtual processors use the available system resources.
  • Windows and Mac OS X - Shows how much time the CPU spends on user processes and system processes.
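
The relationship between idle time and utilization can be sketched for Linux using two samples of the aggregate "cpu" line in /proc/stat. This is an illustration only, not the OpsCenter agent's implementation. The names FIELDS, parse_cpu_line, and cpu_percentages are hypothetical, and the sketch assumes the line carries at least eight counters (user through steal).

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    values = [int(v) for v in line.split()[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def cpu_percentages(before, after):
    """Return the share of time spent in each state between two samples,
    plus overall utilization computed as 100 minus the idle percentage."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        raise ValueError("samples are identical; take them further apart")
    pct = {name: 100.0 * d / total for name, d in deltas.items()}
    pct["utilization"] = 100.0 - pct["idle"]
    return pct
```

Taking the two samples a few seconds apart and inspecting the nice, iowait, and steal entries gives the same kind of breakdown that the Linux graph displays.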

OS: Load

The amount of work that a computer system performs. An idle computer has a load number of 0, and each process using or waiting for CPU time increments the load number by 1. On a single-CPU machine, any value above 1 indicates that the machine was temporarily overloaded and some processes were required to wait; on a multi-core machine, the equivalent threshold is the number of cores. Shows minimum, average, and maximum OS load expressed as an integer.
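
The load number is easier to interpret relative to the number of CPU cores, because a load of 4 on a four-core machine leaves no process waiting. The sketch below normalizes a load value by core count. The function names are hypothetical, and the threshold of 1.0 per core is a common rule of thumb rather than an OpsCenter setting.

```python
import os


def load_per_core(load, cores):
    """Normalize a load value by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def is_overloaded(load, cores):
    """True when more processes want CPU time than there are cores."""
    return load_per_core(load, cores) > 1.0


def current_load_per_core():
    """One-minute load average per core on this machine (Unix only)."""
    one_minute, _, _ = os.getloadavg()
    return load_per_core(one_minute, os.cpu_count() or 1)
```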

OS: Disk usage (GB)

Tracks growth or reduction in the amount of disk space used. If this metric shows a growth trend leading toward high or total disk space usage, consider strategies to relieve it, such as adding capacity to the cluster. DataStax recommends leaving 30-50% of disk space free for optimal repair and compaction operations.
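
A simple check against the free-space recommendation might look like the following sketch. The function names and the default threshold of 30 percent (the lower bound of the recommended range) are assumptions for illustration; the path passed to check_path would be the volume that holds the Cassandra data directory.

```python
import shutil


def free_space_percent(total_bytes, free_bytes):
    """Percentage of the volume that is still free."""
    return 100.0 * free_bytes / total_bytes


def below_recommended_free(total_bytes, free_bytes, min_free_percent=30.0):
    """True when free space has dropped under the chosen threshold."""
    return free_space_percent(total_bytes, free_bytes) < min_free_percent


def check_path(path, min_free_percent=30.0):
    """Apply the check to the volume that contains the given path."""
    usage = shutil.disk_usage(path)
    return below_recommended_free(usage.total, usage.free, min_free_percent)
```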

OS: Disk Usage (percentage)

The percentage of disk space that is being used by Cassandra at a given time. When Cassandra is reading from and writing to disk heavily, or building SSTables as the final product of compaction processes, disk usage values may be temporarily higher than expected.

OS: Disk Throughput

The average disk throughput for read and write operations, measured in megabytes per second. Exceptionally high disk throughput values may indicate I/O contention. This is typically caused by numerous compaction processes competing with read operations. Reducing the frequency of memtable flushing can relieve I/O contention.
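
On Linux, read and write throughput in megabytes per second can be approximated from two samples of a device line in /proc/diskstats, which counts sectors in 512-byte units. The sketch below is illustrative only and is not taken from OpsCenter. The function names are hypothetical, and a megabyte is taken to be 1024 * 1024 bytes.

```python
SECTOR_BYTES = 512  # /proc/diskstats reports sectors in 512-byte units


def parse_diskstats_line(line):
    """Extract the device name and sector counters from one /proc/diskstats line."""
    parts = line.split()
    return {
        "device": parts[2],
        "sectors_read": int(parts[5]),
        "sectors_written": int(parts[9]),
    }


def throughput_mb_per_s(before, after, interval_s):
    """Return (read, write) throughput in MB/s between two samples."""
    mb = 1024 * 1024
    read = (after["sectors_read"] - before["sectors_read"]) * SECTOR_BYTES / mb / interval_s
    write = (after["sectors_written"] - before["sectors_written"]) * SECTOR_BYTES / mb / interval_s
    return read, write
```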

OS: Disk Rates

  • Linux and Windows - Averaged disk speed for read and write operations.
  • Mac OS X - Not supported.

OS: Disk Latency

  • Linux and Windows - Measures the average time consumed by disk seeks in milliseconds. Disk latency is among the higher-level metrics that may be useful to monitor on an ongoing basis by keeping this graph posted on your OpsCenter performance console. Consistently high disk latency may be a signal to investigate causes, such as I/O contention from compactions or read/write loads that call for expanded capacity.
  • Mac OS X - Not supported.

OS: Disk Request Size

  • Linux and Windows - The average size in sectors of requests issued to the disk.
  • Mac OS X - Not supported.

OS: Disk Queue Size

  • Linux and Windows - The average number of requests queued due to disk latency issues.
  • Mac OS X - Not supported.

OS: Disk Utilization

  • Linux and Windows - The percentage of CPU time consumed by disk I/O.
  • Mac OS X - Not supported.
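
The latency, request size, queue size, and utilization figures above resemble the values that tools such as iostat derive from kernel disk counters. The sketch below shows those derivations under stated assumptions. The input dictionary and its keys are hypothetical names for differences between two counter samples, the latency figure includes time spent queued and not only seek time, and OpsCenter may compute its metrics differently.

```python
def derived_disk_metrics(delta, interval_ms):
    """Derive iostat-style figures from counter deltas over one interval.

    delta holds differences between two samples of a device's counters:
      ios         completed reads plus writes
      sectors     sectors read plus written
      ms_in_io    milliseconds spent reading plus writing
      ms_busy     milliseconds during which the device had I/O in flight
      weighted_ms milliseconds of I/O time weighted by requests in flight
    """
    ios = delta["ios"]
    return {
        # Average time per completed request, in milliseconds.
        "latency_ms": delta["ms_in_io"] / ios if ios else 0.0,
        # Average request size, in sectors.
        "request_size_sectors": delta["sectors"] / ios if ios else 0.0,
        # Average number of requests outstanding during the interval.
        "queue_size": delta["weighted_ms"] / interval_ms,
        # Share of the interval during which the device was busy.
        "utilization_pct": 100.0 * delta["ms_busy"] / interval_ms,
    }
```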