Important Metrics and Alerts

Monitoring OSS Apache Cassandra® clusters lets you identify problems early and react quickly to mitigate them.

OSS Cassandra exposes metrics for observation and analysis. It does so through Java Management Extensions (JMX), a Java technology that provides tools for managing and monitoring applications. Cassandra uses JMX to do the following:

Expose metrics
Make temporary configuration changes, such as changing the compaction throughput
Run operations, such as compaction

JMX is also used by nodetool and other Cassandra tools.

There are approximately 40 metrics per keyspace, 60 to 70 metrics per individual table, and even more metrics for different subsystems. The remainder of this topic focuses on the metrics that are typically considered the most valuable and provides guidance for understanding them. For more information about all exposed metrics, see the Cassandra documentation.
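As an illustration of how individual metrics are addressed over JMX, the following sketch builds MBean object names using the naming pattern described in the Cassandra metrics documentation (`org.apache.cassandra.metrics:type=...`). The helper function names and the sample keyspace and table names are hypothetical. Verify the exact metric names against your Cassandra version, for example with JConsole or jmxterm.

```python
# Sketch: build JMX object names for Cassandra metrics.
# The naming pattern follows the Cassandra metrics documentation; the helper
# names and the sample keyspace/table below are illustrative assumptions.

DOMAIN = "org.apache.cassandra.metrics"


def client_request_mbean(request_type: str, metric: str) -> str:
    """Coordinator-level metric, e.g. request_type='Read', metric='Latency'."""
    return f"{DOMAIN}:type=ClientRequest,scope={request_type},name={metric}"


def table_mbean(keyspace: str, table: str, metric: str) -> str:
    """Per-table metric, e.g. metric='SSTablesPerReadHistogram'."""
    return f"{DOMAIN}:type=Table,keyspace={keyspace},scope={table},name={metric}"


def thread_pool_mbean(path: str, pool: str, metric: str) -> str:
    """Thread pool metric, e.g. path='internal', pool='CompactionExecutor'."""
    return f"{DOMAIN}:type=ThreadPools,path={path},scope={pool},name={metric}"


if __name__ == "__main__":
    print(client_request_mbean("Read", "Latency"))
    print(table_mbean("my_keyspace", "my_table", "SSTablesPerReadHistogram"))
    print(thread_pool_mbean("internal", "CompactionExecutor", "PendingTasks"))
```

You can paste the resulting names into a JMX client to inspect the corresponding attributes on a running node.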

Observability tools

You can use the following tools for collection and analysis of metrics:

Tools for one-off analysis with JMX, including JConsole, jmxterm, and nodetool sjk.
DSE OpsCenter collects metrics using JMX, stores them in DSE, and uses them for visualization and alerts. Metrics collection requires that the DataStax Agent is running on your DSE nodes.
DSE Metrics Collector collects metrics from DSE and other entities, such as CPU and disks, using collectd.

The DSE Metrics Collector also enables integration with different monitoring systems using collectd plugins. For example, you can expose data to Prometheus with visualization via Grafana using predefined dashboards. Because metrics are exposed directly, you do not need the DataStax Agent running on your nodes.
The Metrics Collector for Apache Cassandra, together with Prometheus and Grafana (also with predefined dashboards), provides open-source Cassandra clusters with the same functionality as DSE Metrics Collector.
External tools that integrate with monitoring systems, such as JMX Exporter for Prometheus, and other monitoring tools may require additional tuning and dashboard creation.

What do you need to monitor?

The important metrics that require monitoring fall into several groups:

Metrics related to client requests

These metrics show how the system performs from the point of view of the client application.

Coordinator-level latency for read and write operations, especially at the 95th and 99th percentiles.
Number of client connections.
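To illustrate why the 95th and 99th percentiles deserve attention, the following sketch computes nearest-rank percentiles over a list of latency samples and compares them with the mean. The `percentile` helper and the sample values are hypothetical and are not taken from Cassandra itself; Cassandra already exposes percentile attributes on its latency metrics, so this is only a way to reason about what those numbers mean.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


if __name__ == "__main__":
    # Hypothetical coordinator read latencies in milliseconds:
    # most requests are fast, a few are slow.
    latencies_ms = [2.0] * 95 + [40.0] * 4 + [250.0]
    print("mean:", sum(latencies_ms) / len(latencies_ms))
    print("p95: ", percentile(latencies_ms, 95))
    print("p99: ", percentile(latencies_ms, 99))
```

In this made-up data set the mean stays low while the 99th percentile reveals the slow requests that some clients actually experience.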

Metrics related to thread pools that process data and execute different tasks

Examples include compaction and flushing of data.

How many threads are in the blocked state, for example, in the memtable flush writer or memtable pool allocations.
How many threads are in the aborted state, like aborted compactions.
How many threads are in the pending state, such as pending compactions and pending flushes.
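One way to act on pending counts is to distinguish a short spike from a backlog that keeps growing. The following sketch assumes that you already collect periodic readings of a pending count per thread pool (for example, pending compactions). The `sustained_backlog` helper, the window size, and the sample readings are hypothetical.

```python
from typing import Sequence


def sustained_backlog(pending_samples: Sequence[int], min_samples: int = 3) -> bool:
    """Return True if the last `min_samples` readings are all above zero
    and never decrease, which suggests the pool is not keeping up."""
    if len(pending_samples) < min_samples:
        return False
    window = list(pending_samples[-min_samples:])
    all_positive = all(value > 0 for value in window)
    non_decreasing = all(b >= a for a, b in zip(window, window[1:]))
    return all_positive and non_decreasing


if __name__ == "__main__":
    # Hypothetical readings, oldest first.
    readings = {
        "CompactionExecutor": [0, 4, 9, 15],
        "MemtableFlushWriter": [2, 0, 1, 0],
    }
    for pool, samples in readings.items():
        print(pool, "backlog" if sustained_backlog(samples) else "ok")
```

The window size is a tuning choice: a larger window reduces false alarms at the cost of slower detection.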

Metrics related to Thread-per-Core (TPC)

Applies to DSE 6.0 and later only.

Metrics related to individual tables

Track these metrics for your most important tables to make sure that SLAs are met and to avoid problems.

Partition size.
Number of SSTables overall.
Number of SSTables read per request.
Number of tombstones scanned during a read request.
Coordinator-level read and write latencies.

Metrics related to intra-cluster communication

These metrics show how data is exchanged within the cluster, including replication and hinted handoff:

Number of dropped mutations and other dropped messages.
Total number of timeouts and timeouts per host.
Cross-datacenter latency.
Number of hints on disk.
Hint replay (number of failed and timed-out hint messages).

Metrics related to the Java Virtual Machine (JVM)

Amount of memory used.
Duration of garbage collection pauses.

Metrics related to operating system and hardware

CPU usage on the node.
Amount of disk space available.
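As a minimal sketch of an operating-system-level check, the following code uses Python's standard `shutil.disk_usage` to compute the fraction of free disk space for a path. The function names, the data directory path, and the 20% threshold are assumptions for illustration only; use the data directory and the threshold that match your installation and workload.

```python
import shutil


def disk_free_fraction(path: str) -> float:
    """Fraction of the file system containing `path` that is still free."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        raise ValueError(f"file system for {path!r} reports zero capacity")
    return usage.free / usage.total


def disk_space_low(path: str, min_free_fraction: float = 0.2) -> bool:
    """True if free space is below the given fraction (placeholder default)."""
    return disk_free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # Assumed data directory; replace with the path used on your nodes.
    data_dir = "/var/lib/cassandra/data"
    try:
        print(f"free: {disk_free_fraction(data_dir):.1%}")
        print("low disk space" if disk_space_low(data_dir) else "ok")
    except FileNotFoundError:
        print(f"{data_dir} not found; adjust data_dir for this host")
```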

Important metrics exposed via JMX

DataStax recommends monitoring the following metrics and generating alerts when they cross their threshold settings.

These are general recommendations. You might need to increase or decrease the target values depending on your workloads.
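Because target values depend on the workload, it can help to keep thresholds as data that you override per cluster rather than hard-coding them. The following sketch shows one possible shape for this. The metric keys, the `Threshold` class, the `evaluate` function, and all numeric values are placeholders invented for this example, not DataStax recommendations.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Threshold:
    warn: float
    critical: float


# Placeholder defaults for illustration only; tune them for your workload.
DEFAULTS: Dict[str, Threshold] = {
    "read_latency_p99_ms": Threshold(warn=50.0, critical=200.0),
    "pending_compactions": Threshold(warn=20.0, critical=100.0),
}


def evaluate(
    metric: str,
    value: float,
    overrides: Optional[Dict[str, Threshold]] = None,
) -> str:
    """Return 'ok', 'warning', or 'critical' for a metric reading.
    Per-cluster overrides take precedence over the defaults."""
    thresholds = {**DEFAULTS, **(overrides or {})}
    threshold = thresholds[metric]
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    print(evaluate("read_latency_p99_ms", 75.0))
    # A cluster serving large analytical reads might tolerate higher latency.
    relaxed = {"read_latency_p99_ms": Threshold(warn=100.0, critical=500.0)}
    print(evaluate("read_latency_p99_ms", 75.0, overrides=relaxed))
```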

Read and write latencies at the coordinator level

Total and per keyspace/table.