Cluster performance metrics

Cluster metrics are aggregated across all nodes in the cluster and provide a good way to monitor cluster performance at a high level. OpsCenter tracks a number of cluster-wide metrics for read performance, write performance, memory, and capacity.

Watching for variations in cluster performance can reveal potential issues that require further investigation. For general performance monitoring, watch for spikes in read and write latency, along with an accumulation of pending operations. Drilling down on high-demand column families can further pinpoint the source of performance issues in your application.

Write requests 

The number of write requests per second on the coordinator nodes, analogous to client writes. Monitoring the number of requests over a given time period can give you an idea of system write workload and usage patterns.

Write request latency 

The response time (in milliseconds) for successful write requests. The time period starts when a node receives a client write request and ends when the node responds to the client. Optimal or acceptable levels of write latency vary widely according to your hardware, your network, and the nature of your write load. For example, the performance of a write load consisting largely of granular data at low consistency levels would be evaluated differently from a load of large strings written at high consistency levels.
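
As a rough illustration of how these coordinator-level figures can be pulled outside of OpsCenter, the following minimal sketch reads the write-latency timer over JMX. It assumes the default Cassandra JMX port (7199), no JMX authentication, and the ClientRequest metrics MBean name used by recent Cassandra releases; older versions expose latency under different MBeans, so treat the object name and the latency units as assumptions to verify against your version.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class WriteLatencySample {
    public static void main(String[] args) throws Exception {
        // 7199 is Cassandra's default JMX port; adjust host/port for your node.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();

            // Coordinator (client) write metrics. This object name is an assumption
            // based on the metrics MBeans of recent Cassandra versions.
            ObjectName writeLatency = new ObjectName(
                "org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency");

            Object count = mbs.getAttribute(writeLatency, "Count"); // total write requests
            Object mean  = mbs.getAttribute(writeLatency, "Mean");  // mean latency (units vary by version)
            System.out.println("writes=" + count + ", mean latency=" + mean);
        } finally {
            connector.close();
        }
    }
}
```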

Read requests 

The number of read requests per second on the coordinator nodes, analogous to client reads. Monitoring the number of requests over a given time period can give you an idea of system read workload and usage patterns.

Read request latency 

The response time (in milliseconds) for successful read requests. The time period starts when a node receives a client read request and ends when the node responds to the client. Optimal or acceptable levels of read latency vary widely according to your hardware, your network, and the nature of your application's read patterns. For example, the use of secondary indexes, the size of the data being requested, and the consistency level required by the client can all impact read latency. An increase in read latency can signal I/O contention. Reads can also slow down when rows are fragmented across many SSTables and compaction cannot keep up with the write load.
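
To make the consistency-level point concrete, here is a small client-side sketch using the DataStax Java driver (a 3.x-style API is assumed) against a hypothetical users table in a hypothetical my_keyspace: the same read issued at ONE waits for only one replica, while the QUORUM read must wait for a majority of replicas, so its latency is bounded by the slowest replica in that majority.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;

public class ReadConsistencyExample {
    public static void main(String[] args) {
        try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
             Session session = cluster.connect("my_keyspace")) {

            // Read at ONE: the coordinator returns as soon as the first replica answers.
            SimpleStatement relaxed = new SimpleStatement("SELECT * FROM users WHERE id = 42");
            relaxed.setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(relaxed);

            // Read at QUORUM: the coordinator waits for a majority of replicas,
            // so the observed latency tracks the slowest replica in that majority.
            SimpleStatement strict = new SimpleStatement("SELECT * FROM users WHERE id = 42");
            strict.setConsistencyLevel(ConsistencyLevel.QUORUM);
            session.execute(strict);
        }
    }
}
```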

Cassandra JVM memory usage 

The average amount of Java heap memory (in megabytes) being used by Cassandra processes. By default, Cassandra opens the JVM with a heap size of half the available system memory, which leaves the remainder for the OS disk cache. You may need to increase the heap size if you have increased column family memtable or cache sizes and are getting out-of-memory errors. If you monitor Cassandra Java processes with an OS tool such as top, you may notice that the total amount of memory in use exceeds the maximum specified for the Java heap. This is because Java allocates memory for things other than the heap, so it is not unusual for the total memory consumption of the JVM to exceed the heap maximum.
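
The heap versus non-heap split is visible directly from the JVM's standard memory MBean. The sketch below (assuming the default JMX port 7199 and no JMX authentication) reads both values from a Cassandra node; the non-heap figure, plus native allocations that JMX does not report, is why top shows more memory in use than the configured heap maximum.

```java
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.openmbean.CompositeData;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HeapUsageSample {
    public static void main(String[] args) throws Exception {
        // 7199 is Cassandra's default JMX port; adjust for your cluster.
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            ObjectName memory = new ObjectName("java.lang:type=Memory");

            // Heap usage is what this metric reports; non-heap memory (and native
            // allocations not visible over JMX) is why top shows a larger total.
            MemoryUsage heap = MemoryUsage.from(
                (CompositeData) mbs.getAttribute(memory, "HeapMemoryUsage"));
            MemoryUsage nonHeap = MemoryUsage.from(
                (CompositeData) mbs.getAttribute(memory, "NonHeapMemoryUsage"));

            System.out.printf("heap: %d MB used of %d MB max%n",
                heap.getUsed() >> 20, heap.getMax() >> 20);
            System.out.printf("non-heap: %d MB used%n", nonHeap.getUsed() >> 20);
        } finally {
            connector.close();
        }
    }
}
```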

JVM CMS collection count 

The number of concurrent mark-sweep (CMS) garbage collections performed by the JVM per second. These are large, resource-intensive collections. Typically, the collections occur every 5 to 30 seconds.

JVM CMS collection time 

The time spent collecting CMS garbage in milliseconds per second (ms/sec).

Note: The ms/sec unit expresses how many milliseconds are spent on garbage collection during each second of elapsed time. For example, a value of 1 ms/sec (.001 seconds of collection per second) means 0.1% of the time is spent on garbage collection.

JVM ParNew collection count 

The number of parallel new-generation garbage collections performed by the JVM per second. These are small and not resource intensive. Normally, these collections occur several times per second under load.

JVM ParNew collection time 

The time spent performing ParNew garbage collections in ms/sec. Application threads are paused while a ParNew collection runs, so spending a significant fraction of time on ParNew collections can cause a serious performance hit.
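
The collection counts and times above come from the JVM's standard garbage-collector MBeans. The sketch below (assuming the default CMS and ParNew collectors and the default JMX port 7199) samples the ParNew counters twice, ten seconds apart, to derive the per-second rates charted here; for example, 100 ms of ParNew work during a 10-second window is 10 ms/sec, or 1% of elapsed time. The same attributes on the CMS bean give the CMS figures.

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class GcSample {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
            new JMXServiceURL("service:jmx:rmi:///jndi/rmi://127.0.0.1:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbs = connector.getMBeanServerConnection();
            // These collector names apply when Cassandra runs with the default
            // CMS + ParNew garbage collectors.
            ObjectName parNew = new ObjectName(
                "java.lang:type=GarbageCollector,name=ParNew");
            ObjectName cms = new ObjectName(
                "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep");

            long t0 = (Long) mbs.getAttribute(parNew, "CollectionTime");  // ms since JVM start
            long c0 = (Long) mbs.getAttribute(parNew, "CollectionCount");
            Thread.sleep(10_000);
            long t1 = (Long) mbs.getAttribute(parNew, "CollectionTime");
            long c1 = (Long) mbs.getAttribute(parNew, "CollectionCount");

            // Convert the deltas over the 10-second window into per-second rates:
            // e.g. 100 ms of ParNew work in 10 s = 10 ms/sec = 1% of elapsed time.
            double msPerSec = (t1 - t0) / 10.0;
            double collectionsPerSec = (c1 - c0) / 10.0;
            System.out.printf("ParNew: %.1f collections/sec, %.1f ms/sec%n",
                collectionsPerSec, msPerSec);

            long cmsCount = (Long) mbs.getAttribute(cms, "CollectionCount");
            long cmsTime  = (Long) mbs.getAttribute(cms, "CollectionTime");
            System.out.println("CMS since JVM start: " + cmsCount
                + " collections, " + cmsTime + " ms total");
        } finally {
            connector.close();
        }
    }
}
```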

Data size 

The size of column family data (in gigabytes) that has been loaded/inserted into Cassandra, including any storage overhead and system metadata. DataStax recommends that data size not exceed 70 percent of total disk capacity to allow free space for maintenance operations such as compaction and repair.

Total bytes compacted 

The amount of SSTable data compacted, in bytes per second.

Total compactions 

The number of compactions (minor or major) performed per second.