Important metrics and alerts
Information about various tools for monitoring clusters and which metrics are most useful.
cassandra.yaml
The location of the cassandra.yaml file depends on the type of installation:
- Package installations: /etc/dse/cassandra/cassandra.yaml
- Tarball installations: installation_location/resources/cassandra/conf/cassandra.yaml
Monitoring Apache Cassandra® and DataStax Enterprise (DSE) clusters lets you identify problems in your clusters early and react faster to mitigate them.
You can use the following tools to collect metrics for analysis:
- Tools for one-off analysis, including JConsole, jmxterm, and nodetool sjk, use JMX and are described below.
- DSE OpsCenter collects metrics using JMX, stores them in DSE, and uses them for visualization and alerts. Metrics collection requires that the DataStax Agent is running on your DSE nodes.
- DSE Metrics Collector collects metrics from DSE and other entities, such as CPU and disks, using collectd.
The DSE Metrics Collector also enables integration with different monitoring systems using collectd plugins. For example, you can expose data to Prometheus with visualization via Grafana using predefined dashboards. Because metrics are exposed directly, you don’t need the DataStax Agent running on your nodes.
- The Metrics Collector for Apache Cassandra together with Prometheus and Grafana (also with predefined dashboards) provides the same functionality as DSE Metrics Collector.
- External tools for integration with monitoring systems like Prometheus (via JMX Exporter for Prometheus) and other monitoring tools may require additional tuning and dashboard creation.
When using any of these methods, you get a lot of information. There are approximately 40 metrics per keyspace, 60 to 70 metrics per individual table, and even more metrics for different subsystems. The remainder of this topic provides guidance for understanding the most important metrics.
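As a rough illustration of that volume, the per-keyspace and per-table figures above can be turned into a back-of-the-envelope estimate. The cluster shape used here (5 keyspaces, 50 tables) is a made-up example, and subsystem metrics are left out of the count.

```python
from typing import Tuple

# Approximate figures quoted in the text: ~40 metrics per keyspace and
# 60 to 70 metrics per table. Subsystem metrics are not counted here.
METRICS_PER_KEYSPACE = 40
METRICS_PER_TABLE_RANGE = (60, 70)


def estimate_metric_count(keyspaces: int, tables: int) -> Tuple[int, int]:
    """Return a (low, high) estimate of keyspace- and table-level metrics."""
    base = keyspaces * METRICS_PER_KEYSPACE
    low = base + tables * METRICS_PER_TABLE_RANGE[0]
    high = base + tables * METRICS_PER_TABLE_RANGE[1]
    return low, high


# Hypothetical cluster: 5 keyspaces holding 50 tables in total.
low, high = estimate_metric_count(keyspaces=5, tables=50)
print(f"between {low} and {high} metrics")  # between 3200 and 3700 metrics
```

Even a modest schema yields thousands of values per node, which is why the rest of this topic narrows the list down.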
What do you need to monitor?
The important metrics that require monitoring can be split into several groups:
- Metrics related to client requests:
How the system performs from the point of view of the client application.
- Coordinator level latency for read and write operations, especially for 95/99th percentiles.
- Number of client connections.
- Metrics related to threadpools that process data and execute different tasks:
Examples include compaction and flushing of data.
- How many threads are in the blocked state. For example, memtable flush writer, memtable pool allocations, and so on.
- How many threads are in the aborted state, like aborted compactions.
- How many threads are in the pending state, such as pending compactions and pending flushes.
- Metrics related to Thread-per-Core (TPC).
Applies only to DSE 6.0 and later.
- Metrics related to individual tables:
It’s useful to track such metrics for your most important tables to make sure that SLAs are met and avoid problems.
- Partition size.
- Number of SSTables overall.
- Number of SSTables read per request.
- Number of tombstones scanned during read request.
- Coordinator-level read and write latencies.
- Metrics related to inter-cluster communication:
These metrics provide information on how data exchange happens in the cluster: replication, hinted handoff, and so on:
- Number of dropped mutations, and other messages.
- Total number of timeouts and timeouts per host.
- Cross-data center latency.
- Number of hints on disk.
- Hint replay (number of failed and timed out hint messages).
- Metrics related to the Java Virtual Machine (JVM):
- Amount of memory used.
- Duration of garbage collection pauses.
- Metrics related to operating system and hardware:
- CPU usage on the node.
- Amount of disk space available.
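The emphasis on 95th/99th percentile latency in the first group is easier to see with numbers. The sketch below uses the nearest-rank definition of a percentile and a made-up set of latency samples; it is only an illustration and not how Cassandra, DSE, or OpsCenter compute their percentiles.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical coordinator read latencies in milliseconds:
# 90 fast requests, 8 slower ones, and 2 very slow ones.
latencies_ms = [2.0] * 90 + [40.0] * 8 + [250.0] * 2

mean = sum(latencies_ms) / len(latencies_ms)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"mean={mean} p95={p95} p99={p99}")  # mean=10.0 p95=40.0 p99=250.0
```

The mean hides the slow requests, while the high percentiles show what the slowest clients actually experience.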
Important metrics exposed via JMX
DataStax recommends monitoring the following metrics and generating alerts when they cross their thresholds. Note that some values, such as latency, are general recommendations and could be lower or higher depending on your requirements.
- Read and write latencies (at the coordinator level)
- Total and per keyspace/table.
- Overall internode latency
- JMX in MBean
org.apache.cassandra.metrics:
type=Messaging,name=CrossNodeLatency
- Internode latency for datacenter with name DC-Name
- JMX in MBean
org.apache.cassandra.metrics:
type=Messaging,name=<DC-Name>-Latency
- Number of pending compaction tasks
- Total per node, and/or for tables in specific keyspace.
- Number of dropped mutations
- Total and/or per table in a given keyspace.
- Number of timeouts occurring on a specific node
- JMX in MBean
org.apache.cassandra.metrics:
type=MessagingService,name=TotalTimeouts
type=MessagingService,name=TimeoutsPerHost
- Maximum partition size in bytes
- JMX in MBean
org.apache.cassandra.metrics:
type=Table,name=MaxPartitionSize
type=Table,keyspace=ks,scope=*,name=MaxPartitionSize
- Total number of SSTables in system and per table
- JMX in MBean
org.apache.cassandra.metrics:
type=Table,name=LiveSSTableCount
type=Table,keyspace=ks,scope=*,name=LiveSSTableCount
- Number of hints stored on individual node
- JMX in MBean
org.apache.cassandra.metrics:
type=Storage,name=TotalHints
- Hint Replay Success/Failure/Timeout Rate
- JMX in MBean
org.apache.cassandra.metrics:
type=HintsService,name=HintsSucceeded
type=HintsService,name=HintsFailed
type=HintsService,name=HintsTimedOut
- Number of threads blocked on memtable allocation
- JMX in MBean
org.apache.cassandra.metrics:
type=MemtablePool,name=BlockedOnAllocation
- Number of blocked memtable flush writer tasks
- Note: This condition could lead to heavy write performance degradation.
- Number of blocked compaction tasks
- JMX in MBean
org.apache.cassandra.metrics:
type=ThreadPools,path=internal,scope=CompactionExecutor,name=CurrentlyBlockedTasks
- Number of aborted compaction tasks
- JMX in MBean
org.apache.cassandra.metrics:
name=CompactionsAborted,type=Compaction
- Information about Java's garbage collection
- For example, Max GC Elapsed and similar.
- Number of segments waiting on commit
- JMX in MBean
org.apache.cassandra.metrics:
type=CommitLog,name=WaitingOnCommit (attribute Count)
- 99th percentile of time spent waiting on commit
- JMX in MBean
org.apache.cassandra.metrics:
type=CommitLog,name=WaitingOnCommit (attribute 99thPercentile)
- Number of pending flushes
- JMX in MBean
org.apache.cassandra.metrics:
type=Table,name=PendingFlushes
- Hit ratio for key cache
- Applies only to Cassandra and to DSE versions prior to 6.0.
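Several entries in the list above come in two forms: a node-wide MBean and a per-keyspace form with scope=* as a wildcard for the table. The helper below is a small sketch of how those object names can be assembled when scripting against JMX. The function name is hypothetical, the keyspace ks is the placeholder used above, and the table name users is invented for the example.

```python
from typing import Optional

DOMAIN = "org.apache.cassandra.metrics"


def table_metric_mbean(name: str, keyspace: Optional[str] = None,
                       table: str = "*") -> str:
    """Build the object name for a table-level metric.

    Without a keyspace, return the node-wide form (type=Table,name=...).
    With a keyspace, return the per-table form; table="*" is a pattern
    that matches every table in that keyspace.
    """
    if keyspace is None:
        return f"{DOMAIN}:type=Table,name={name}"
    return f"{DOMAIN}:type=Table,keyspace={keyspace},scope={table},name={name}"


# Node-wide SSTable count, then the same metric for each table in "ks",
# then for one (hypothetical) table.
print(table_metric_mbean("LiveSSTableCount"))
print(table_metric_mbean("LiveSSTableCount", keyspace="ks"))
print(table_metric_mbean("MaxPartitionSize", keyspace="ks", table="users"))
```

Keeping the name construction in one place makes it harder to drop a key such as type= when copying names between tools.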
Visualizing important metrics in OpsCenter
OpsCenter collects metrics from all nodes in a cluster and stores the original data, together with rollups, in the DSE cluster. This data is then used to create graphs and alerts. When using OpsCenter for monitoring, the following metrics and graphs are useful for setting up effective monitoring of the cluster:
- Active Alerts
- Cluster Health
- Storage Capacity
- Read and Write Request Latency
- Read and Write Requests
- Data Size
- Compactions Pending
- Dropped Messages: Mutations
- Dropped Messages: Reads
- Native Clients
- For specific tables (set up for your most important tables):
- TBL: SSTables per read (percentiles)
- TBL: Tombstones per read (percentiles)
- TBL: Partition size (percentiles)
- Related to hinted handoff:
- Hints on Disk
- TP: Hint Dispatcher Active
- TP: Hint Dispatcher Completed
- Dropped Messages: Hinted Handoff
- Related to operating system:
- OS: Disk Latency
- OS: Load
- OS: CPU Iowait
- OS: Memory Free
- Related to the Java Virtual Machine:
- Heap Used
- JVM G1 Old Collection Count and Time
- JVM G1 Young Collection Count and Time
- When DSE Search is enabled:
- Search: Core Size
- Search: Read Latency
- Search: Timeouts
- When NodeSync is enabled:
- TP: Read Range NodeSync Active
- NodeSync: Uncompleted Pages, Failed Pages
Alerts in OpsCenter
OpsCenter alerts notify the operator when certain conditions are met. Examples include a node going down or latency staying too high for a long period of time. OpsCenter can deliver alerts by email, SNMP, and HTTP requests.
Configure the following alerts to react promptly to problems in DSE clusters.
- Node Down
- When a node is marked as down by OpsCenter.
- Agent Issue
- When OpsCenter detects that a DataStax Agent has issues.
- CPU Usage
- The percentage of time the CPU was busy.
- Load
- The overall amount of work that a computer system performs.
- Write Request Latency (percentiles)
- The response time (in milliseconds) for successful write operations.
- Read Request Latency (percentiles)
- The response time (in milliseconds) for successful read operations.
- Advanced -> System -> Disk Usage(%)
- The percentage of disk being used for a particular disk partition.
- Advanced -> Tables -> TBL: SSTables per Read (percentiles)
- For a specified percentile, how many SSTables are accessed during a read.
- Advanced -> Tables -> TBL: Tombstones per Read (percentiles)
- For a specified percentile, how many tombstones are accessed during a read.
- Advanced -> Tables -> TBL: Partition Size (percentiles)
- For a specified percentile, what is the size (in bytes) of partitions of this table.
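Conditions such as "latency is too high for a long period of time" combine a threshold with a duration, so a single spike does not page anyone. The sketch below shows that idea in isolation. It is not OpsCenter's implementation, and the 100 ms threshold, the one-sample-per-minute interval, and the function name are all assumptions made for the example.

```python
from typing import Sequence


def breached_for_window(samples: Sequence[float], threshold: float,
                        window: int) -> bool:
    """Return True if the most recent `window` samples all exceed `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return False  # not enough data yet to call the breach sustained
    return all(value > threshold for value in samples[-window:])


# Hypothetical p99 write latency in milliseconds, one sample per minute.
single_spike = [12, 15, 240, 14, 13]
sustained = [12, 150, 180, 210, 190]

print(breached_for_window(single_spike, threshold=100, window=3))  # False
print(breached_for_window(sustained, threshold=100, window=3))     # True
```

Choosing the window is a trade-off: a longer window reduces noise but delays the alert.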
Tools for working with JMX
A number of tools exist for one-off analysis of specific metrics. Usually you only use these tools for debugging, as they aren’t designed to replace monitoring solutions. These tools primarily provide access to individual metrics at a given moment on a specific node. They do not generate a view of what happens over time or provide multiple metrics on multiple nodes.
JConsole

To access metrics, JConsole either needs to run on the server itself (usually requiring installation of the GUI libraries) or needs network access to JMX, which exposes JMX externally and may have a security impact.
jmxterm
$>open localhost:7199
#Connection to localhost:7199 is opened
$>get -b org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size Value
#mbean = org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size:
Value = 0;
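When a session like the one above is captured from a script, the attribute values have to be pulled out of the text. The parser below is a minimal sketch that assumes the output always looks like the lines shown here (comment lines starting with #, then Name = value; lines). The function name is hypothetical and composite or multi-line values are not handled.

```python
import re
from typing import Dict

# Matches lines such as "Value = 0;" and captures the name and the value.
_ATTRIBUTE_LINE = re.compile(r"^\s*(\w+)\s*=\s*(.*?);\s*$")


def parse_jmxterm_get(output: str) -> Dict[str, str]:
    """Extract attribute values from the text printed by a jmxterm `get`.

    Lines starting with '#' (status and mbean echo lines) are skipped.
    Values are returned as strings; converting them is left to the caller.
    """
    values: Dict[str, str] = {}
    for line in output.splitlines():
        if line.lstrip().startswith("#"):
            continue
        match = _ATTRIBUTE_LINE.match(line)
        if match:
            values[match.group(1)] = match.group(2)
    return values


session = (
    "#mbean = org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size:\n"
    "Value = 0;\n"
)
print(parse_jmxterm_get(session))  # {'Value': '0'}
```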
To get a full list of supported commands, run help inside the interactive console. Besides getting the values of attributes, you can also set values for attributes (if they are settable) or call functions, which you can use to temporarily modify the behavior of Cassandra (the same way as nodetool commands).
nodetool sjk (DSE and Cassandra 4.0)
DSE and Cassandra 4.0 provide nodetool sjk, which is a wrapper for the well-known library called Swiss Java Knife (SJK). This subcommand is really handy, as you don’t need to specify the DSE process PID or other parameters; you just provide the necessary flags. For example, to get the size of the key cache, use the following command, where the -b flag specifies the name of the bean and -f specifies the field:
nodetool sjk mx -b "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size" -mg -f Value
Similarly to jmxterm, you can use this command to set values (when settable) or call functions.
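For ad hoc scripting, the nodetool sjk mx invocation shown above can be wrapped in a small helper. This is a sketch under the assumption that nodetool is on the PATH of the node where the script runs. The helper names are hypothetical, only the flags shown above are used, and the raw output is returned unparsed because its exact format is not shown in this topic.

```python
import subprocess
from typing import List


def sjk_get_command(bean: str, field: str = "Value") -> List[str]:
    """Build the argument list for: nodetool sjk mx -b <bean> -mg -f <field>."""
    return ["nodetool", "sjk", "mx", "-b", bean, "-mg", "-f", field]


def read_metric(bean: str, field: str = "Value") -> str:
    """Run the command on the local node and return its raw output.

    Raises subprocess.CalledProcessError if nodetool exits with an error.
    """
    result = subprocess.run(
        sjk_get_command(bean, field),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


KEY_CACHE_SIZE = "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size"
print(sjk_get_command(KEY_CACHE_SIZE))
```

Passing the arguments as a list avoids shell quoting problems with the commas and equals signs in the bean name. Calling read_metric(KEY_CACHE_SIZE) on a node would execute the command.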