Important Metrics and Alerts

Monitoring Apache Cassandra™ and DataStax Enterprise (DSE) clusters lets you identify problems early and react quickly to mitigate them.

Both Apache Cassandra and DSE expose metrics for observation and analysis. Cassandra uses Java Management Extensions (JMX) to expose various metrics; to allow temporary configuration changes, such as changing the compaction throughput; and to provide the ability to execute actions, such as triggering compaction. JMX is also used by nodetool and other Cassandra tools. The different types of exposed metrics are described in the Cassandra documentation.

JMX is a technology within Java that provides tools for managing and monitoring applications.

You can use the following tools to collect metrics for analysis:

Tools for one-off analysis, including JConsole, jmxterm, and nodetool sjk, use JMX and are described in Tools for work with JMX.
DSE OpsCenter collects metrics using JMX, stores them in DSE, and uses them for visualization and alerts. Metrics collection requires that the DataStax Agent is running on your DSE nodes.
DSE Metrics Collector collects metrics from DSE and other entities, such as CPU and disks, using collectd.

The DSE Metrics Collector also enables integration with different monitoring systems using collectd plugins. For example, you can expose data to Prometheus with visualization via Grafana using predefined dashboards. Because metrics are exposed directly, you do not need the DataStax Agent running on your nodes.
The Metrics Collector for Apache Cassandra, together with Prometheus and Grafana (also with predefined dashboards), provides the same functionality as the DSE Metrics Collector.
External tools for integration with monitoring systems like Prometheus (via JMX Exporter for Prometheus) and other monitoring tools may require additional tuning and dashboard creation.

When using any of these methods, you get a lot of information. There are approximately 40 metrics per keyspace, 60 to 70 metrics per individual table, and even more metrics for different subsystems. The remainder of this topic provides guidance for understanding the most important metrics.
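
To give a feel for the scale, the following sketch turns the approximate per-keyspace and per-table counts above into a rough estimate for one cluster. The function name and the example cluster size (5 keyspaces, 50 tables) are assumptions made for illustration, and the estimate deliberately ignores the additional subsystem metrics.

```python
from typing import Tuple

# Approximate counts quoted in the text above; real numbers vary by version.
METRICS_PER_KEYSPACE = 40
METRICS_PER_TABLE_LOW = 60
METRICS_PER_TABLE_HIGH = 70


def estimate_metric_count(keyspaces: int, tables: int) -> Tuple[int, int]:
    """Return a (low, high) estimate of keyspace-level plus table-level metrics."""
    keyspace_metrics = keyspaces * METRICS_PER_KEYSPACE
    low = keyspace_metrics + tables * METRICS_PER_TABLE_LOW
    high = keyspace_metrics + tables * METRICS_PER_TABLE_HIGH
    return low, high


if __name__ == "__main__":
    # Hypothetical cluster: 5 keyspaces holding 50 tables in total.
    print(estimate_metric_count(keyspaces=5, tables=50))  # (3200, 3700)
```

Even a modest schema yields thousands of metrics, which is why it helps to focus on a short list of the most important ones.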

What do you need to monitor?

The important metrics that require monitoring can be split into several groups:

Metrics related to client requests:

These metrics show how the system performs from the point of view of the client application.
- Coordinator-level latency for read and write operations, especially the 95th and 99th percentiles.
- Number of client connections.
Metrics related to threadpools that process data and execute different tasks:

Examples include compaction and flushing of data.
- How many threads are in the blocked state, for example, memtable flush writers and memtable pool allocations.
- How many threads are in the aborted state, such as aborted compactions.
- How many threads are in the pending state, such as pending compactions and pending flushes.
Metrics related to Thread-per-Core (TPC):

Applies only to DSE 6.0 and later.
Metrics related to individual tables:

It is useful to track such metrics for your most important tables to make sure that SLAs are met and avoid problems.
- Partition size.
- Number of SSTables overall.
- Number of SSTables read per request.
- Number of tombstones scanned per read request.
- Coordinator-level read and write latencies.
Metrics related to internode communication:

These metrics provide information on how data exchange happens in the cluster, including replication, hinted handoff, and so on.
- Number of dropped mutations and other messages.
- Total number of timeouts and timeouts per host.
- Cross-datacenter latency.
- Number of hints on disk.
- Hint replay (number of failed and timed out hint messages).
Metrics related to the Java Virtual Machine (JVM):
- Amount of memory used.
- Duration of garbage collection pauses.
Metrics related to operating system and hardware:
- CPU usage on the node.
- Amount of disk space available.
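
As a rough illustration of the last item, the following sketch checks the free disk space on the filesystem that holds a given directory, using only the Python standard library. The function names and the 20% threshold are assumptions chosen for the example, not DataStax recommendations, and the path you check would be your own data directory.

```python
import shutil


def disk_free_fraction(path: str) -> float:
    """Return the free fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return usage.free / usage.total


def low_disk_space(path: str, min_free_fraction: float = 0.2) -> bool:
    """Return True when free space falls below the given fraction.

    The 0.2 default is a placeholder; pick a value that fits your workload.
    """
    return disk_free_fraction(path) < min_free_fraction


if __name__ == "__main__":
    # "." stands in for the node's data directory in this sketch.
    print(f"free: {disk_free_fraction('.'):.1%}, alert: {low_disk_space('.')}")
```

A monitoring agent such as collectd normally gathers this value for you; the sketch only shows what the check amounts to.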

Important metrics exposed via JMX

DataStax recommends monitoring the following metrics and generating alerts when they cross their threshold settings.

Some values, such as latency, are general recommendations and could be lower or higher depending on your requirements.

Read and write latencies (at the coordinator level): Total and per keyspace/table.

JMX in MBean org.apache.cassandra.metrics:

type=ClientRequest,scope=Write,name=Latency
type=ClientRequest,scope=Read,name=Latency
type=Table,keyspace=ks,scope=,name=ReadLatency
type=Table,keyspace=ks,scope=,name=WriteLatency

Alerting condition: 99th percentile greater than 200 ms for more than 1 minute.
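
A condition of this kind (a value above a threshold for longer than some duration) can be sketched as follows. This is a minimal illustration, not how OpsCenter or any particular monitoring system evaluates alerts; the function name and the sample format, a time-ordered sequence of (timestamp in seconds, 99th percentile latency in milliseconds) pairs, are assumptions.

```python
from typing import Iterable, Tuple


def sustained_breach(
    samples: Iterable[Tuple[float, float]],
    threshold_ms: float = 200.0,
    min_duration_s: float = 60.0,
) -> bool:
    """Return True if the value stays above the threshold for longer than
    `min_duration_s` without dropping back to or below it in between."""
    breach_start = None  # timestamp of the first sample in the current breach
    for timestamp, p99_ms in samples:
        if p99_ms > threshold_ms:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start > min_duration_s:
                return True
        else:
            breach_start = None  # any recovery resets the clock
    return False


if __name__ == "__main__":
    # Hypothetical samples collected every 15 seconds.
    spike = [(0, 250.0), (15, 180.0), (30, 260.0), (45, 190.0)]
    sustained = [(t, 240.0) for t in range(0, 90, 15)]
    print(sustained_breach(spike), sustained_breach(sustained))
```

Requiring the breach to persist avoids alerting on short latency spikes, at the cost of reacting a little later to a real problem.
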

Overall internode latency:

JMX in MBean org.apache.cassandra.metrics:

type=Messaging,name=CrossNodeLatency

Internode latency for the datacenter named DC-Name:

JMX in MBean org.apache.cassandra.metrics:

type=Messaging,name=<DC-Name>-Latency
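
A full JMX object name combines the domain (here, org.apache.cassandra.metrics) with the comma-separated key=value properties shown above, separated by a colon. The following sketch assembles the two internode latency names as plain strings, for example to pass to a JMX client or an exporter configuration. The helper names and the datacenter name DC1 are hypothetical.

```python
DOMAIN = "org.apache.cassandra.metrics"


def object_name(**properties: str) -> str:
    """Join the domain and key=value properties into a JMX object name string."""
    joined = ",".join(f"{key}={value}" for key, value in properties.items())
    return f"{DOMAIN}:{joined}"


def cross_node_latency_mbean() -> str:
    """Object name for the overall internode latency metric."""
    return object_name(type="Messaging", name="CrossNodeLatency")


def dc_latency_mbean(dc_name: str) -> str:
    """Object name for the internode latency of one datacenter."""
    return object_name(type="Messaging", name=f"{dc_name}-Latency")


if __name__ == "__main__":
    print(cross_node_latency_mbean())
    print(dc_latency_mbean("DC1"))  # "DC1" is a made-up datacenter name
```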

Number of pending compaction tasks

Total per node, and/or for tables in a specific keyspace.