Important Metrics and Alerts

Monitoring Apache Cassandra™ and DataStax Enterprise (DSE) clusters lets you identify problems in your clusters early and react faster to mitigate them.

Both Apache Cassandra and DSE expose metrics for observation and analysis. Cassandra uses Java Management Extensions (JMX) to expose various metrics; allow temporary configuration changes, such as changing the compaction throughput; and provide the ability to execute actions, such as triggering compaction. JMX is also used by nodetool and other Cassandra tools. The different types of the exposed metrics are described in the Cassandra documentation.

JMX is a technology within Java that provides tools for managing and monitoring applications.

You can use the following tools to collect metrics for analysis:

  • Tools for one-off analysis, including JConsole, jmxterm, and nodetool sjk, use JMX and are described in Tools for work with JMX.

  • DSE OpsCenter collects metrics using JMX, stores them in DSE, and uses them for visualization and alerts. Metrics collection requires that the DataStax Agent is running on your DSE nodes.

  • DSE Metrics Collector collects metrics from DSE and other entities, such as CPU and disks, using collectd.

    The DSE Metrics Collector also enables integration with different monitoring systems using collectd plugins. For example, you can expose data to Prometheus with visualization via Grafana using predefined dashboards. Because metrics are exposed directly, you do not need the DataStax Agent running on your nodes.

  • The Metrics Collector for Apache Cassandra, together with Prometheus and Grafana (also with predefined dashboards), provides the same functionality as DSE Metrics Collector.

  • External tools for integration with monitoring systems like Prometheus (via JMX Exporter for Prometheus) and other monitoring tools may require additional tuning and dashboard creation.

When using any of these methods, you get a lot of information. There are approximately 40 metrics per keyspace, 60 to 70 metrics per individual table, and even more metrics for different subsystems. The remainder of this topic provides guidance for understanding the most important metrics.

What Do You Need to Monitor?

The important metrics that require monitoring can be split into several groups:

  • Metrics related to client requests:

    How the system performs from the point of view of the client application.

    • Coordinator level latency for read and write operations, especially for 95/99th percentiles.

    • Number of client connections.

  • Metrics related to threadpools that process data and execute different tasks:

    Examples include compaction and flushing of data.

    • How many threads are in the blocked state. For example, memtable flush writer, memtable pool allocations, and so on.

    • How many threads are in the aborted state, like aborted compactions.

    • How many threads are in the pending state, such as pending compactions and pending flushes.

  • Metrics related to Thread-per-Core (TPC).

    Applies only to DSE 6.0 and later.

  • Metrics related to individual tables:

    It is useful to track such metrics for your most important tables to make sure that SLAs are met and avoid problems.

    • Partition size.

    • Number of SSTables overall.

    • Number of SSTables read per request.

    • Number of tombstones scanned during read request.

    • Coordinator-level read and write latencies.

  • Metrics related to internode communication within the cluster:

    These metrics provide information on how data exchange happens in the cluster: replication, hinted handoff, and so on:

    • Number of dropped mutations, and other messages.

    • Total number of timeouts and timeouts per host.

    • Cross-data center latency.

    • Number of hints on disk.

    • Hint replay (number of failed and timed out hint messages).

  • Metrics related to the Java Virtual Machine (JVM):

    • Amount of memory used.

    • Duration of garbage collection pauses.

  • Metrics related to operating system and hardware:

    • CPU usage on the node.

    • Amount of disk space available.

Important metrics exposed via JMX

DataStax recommends monitoring the following metrics and generating alerts when they cross their thresholds.

Some values, such as latency, are general recommendations and could be lower or higher depending on your requirements.

Read and write latencies (at the coordinator level)

Total and per keyspace/table.

JMX in MBean org.apache.cassandra.metrics:

type=ClientRequest,scope=Write,name=Latency
type=ClientRequest,scope=Read,name=Latency
type=Table,keyspace=ks,scope=*,name=ReadLatency
type=Table,keyspace=ks,scope=*,name=WriteLatency

Alerting condition: 99th percentile greater than 200 ms for more than 1 minute.
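The alerting condition above combines a threshold with a duration. The following Python sketch shows one way to evaluate such a condition, assuming the 99th-percentile values have already been collected as (timestamp in seconds, latency in milliseconds) pairs. The function name and the sample layout are illustrative assumptions, not part of any Cassandra or OpsCenter API.

```python
from typing import Iterable, Tuple


def breached_for(samples: Iterable[Tuple[float, float]],
                 threshold: float,
                 min_duration_s: float) -> bool:
    """Return True if an unbroken run of samples stays above `threshold`
    and that run spans at least `min_duration_s` seconds.

    `samples` are (timestamp_seconds, value) pairs in ascending time order.
    """
    run_start = None  # timestamp of the first sample in the current breach run
    for ts, value in samples:
        if value > threshold:
            if run_start is None:
                run_start = ts
            if ts - run_start >= min_duration_s:
                return True
        else:
            run_start = None  # the breach was interrupted; start over
    return False


# Hypothetical p99 read latency samples taken every 20 seconds (milliseconds).
p99_samples = [(0, 150.0), (20, 230.0), (40, 250.0), (60, 240.0), (80, 260.0)]
print(breached_for(p99_samples, threshold=200.0, min_duration_s=60))
```

The same check fits the other threshold-plus-duration conditions in this list, such as pending compactions above 30 for more than 15 minutes.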

Overall internode latency

JMX in MBean org.apache.cassandra.metrics:

type=Messaging,name=CrossNodeLatency

Internode latency for datacenter with name DC-Name

JMX in MBean org.apache.cassandra.metrics:

type=Messaging,name=<DC-Name>-Latency

Number of pending compaction tasks

Total per node, and/or for tables in specific keyspace.

JMX in MBean org.apache.cassandra.metrics:

type=Compaction,name=PendingTasks
type=Table,keyspace=ks,scope=*,name=PendingCompactions

Alerting condition: more than 30 for more than 15 minutes.

Number of dropped mutations

Total and/or per table in a given keyspace.

JMX in MBean org.apache.cassandra.metrics:

type=Table,name=DroppedMutations
type=Table,keyspace=ks,scope=*,name=DroppedMutations

Alerting condition: Non-zero value.

Number of timeouts occurring on a specific node

JMX in MBean org.apache.cassandra.metrics:

type=MessagingService,name=TotalTimeouts
type=MessagingService,name=TimeoutsPerHost

Alerting condition: Heavy number increase during the last 5 to 15 minutes.

Required reaction: Investigate. A sharp increase can be a sign of network problems or similar issues.
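A condition such as "heavy increase during the last 5 to 15 minutes" compares readings over a time window rather than against a fixed value. The sketch below assumes the timeout metric is read periodically as a cumulative count. The readings and the threshold of 100 are made-up values for illustration only.

```python
from typing import List, Tuple


def increase_over_window(samples: List[Tuple[float, int]],
                         now: float,
                         window_s: float) -> int:
    """Return how much a cumulative counter grew during the last `window_s` seconds.

    `samples` are (timestamp_seconds, counter_value) pairs in ascending time order.
    """
    in_window = [value for ts, value in samples if now - window_s <= ts <= now]
    if len(in_window) < 2:
        return 0  # not enough data points to compute a difference
    return in_window[-1] - in_window[0]


# Hypothetical TotalTimeouts readings taken every 5 minutes.
readings = [(0, 100), (300, 102), (600, 105), (900, 480)]
growth = increase_over_window(readings, now=900, window_s=600)

MAX_INCREASE = 100  # assumed threshold; tune it to the cluster's normal behavior
print(growth, growth > MAX_INCREASE)
```

A counter that restarts from zero (for example, after a node restart) would produce a negative difference here, so a production check would need to handle resets.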

Maximum partition size in bytes

JMX in MBean org.apache.cassandra.metrics:

type=Table,name=MaxPartitionSize
type=Table,keyspace=ks,scope=*,name=MaxPartitionSize

Alerting condition: Partitions greater than 100 MB.

Required reaction: Alert development team, as this indicates problems with the data model.

Total number of SSTables in system and per table

JMX in MBean org.apache.cassandra.metrics:

type=Table,name=LiveSSTableCount
type=Table,keyspace=ks,scope=*,name=LiveSSTableCount

Alerting condition: More than 200 per individual table (depends on the compaction strategy in use).

Required reaction: Investigate. Too many SSTables lead to performance degradation.

Number of hints stored on individual node

JMX in MBean org.apache.cassandra.metrics:

type=Storage,name=TotalHints

Alerting condition: Value greater than zero indicates that some nodes are not reachable.

Hint Replay Success/Failure/Timeout Rate

JMX in MBean org.apache.cassandra.metrics:

type=HintsService,name=HintsSucceeded
type=HintsService,name=HintsFailed
type=HintsService,name=HintsTimedOut

Number of threads blocked on memtable allocation

JMX in MBean org.apache.cassandra.metrics:

type=MemtablePool,name=BlockedOnAllocation

Alerting condition: Non-zero value.

Number of blocked memtable flush writer tasks

This condition could lead to heavy write performance degradation.

JMX in MBean org.apache.cassandra.metrics:

type=ThreadPools,path=internal,scope=MemtableFlushWriter,name=CurrentlyBlockedTasks

Alerting condition: Non-zero value.

Required reaction: Investigate. This condition is caused by failing disks, excessive disk operations, and so on.

Number of blocked compaction tasks

JMX in MBean org.apache.cassandra.metrics:

type=ThreadPools,path=internal,scope=CompactionExecutor,name=CurrentlyBlockedTasks

Alerting condition: Non-zero value.

Number of aborted compaction tasks

JMX in MBean org.apache.cassandra.metrics:

type=Compaction,name=CompactionsAborted

Alerting condition: Non-zero value.

Information about Java’s garbage collection

Attributes such as Max GC Elapsed and similar.

JMX in MBean org.apache.cassandra.metrics:

type=GCInspector

Number of segments waiting on commit

JMX in MBean org.apache.cassandra.metrics:

type=CommitLog,name=WaitingOnCommit (attribute Count)

Alerting condition: High count during last minute.

99th percentile of time spent waiting on commit

JMX in MBean org.apache.cassandra.metrics:

type=CommitLog,name=WaitingOnCommit (attribute 99thPercentile)

Number of pending flushes

JMX in MBean org.apache.cassandra.metrics:

type=Table,name=PendingFlushes

Hit ratio for key cache

Applies only to Cassandra and to DSE versions prior to 6.0.

JMX in MBean org.apache.cassandra.metrics:

type=Cache,scope=KeyCache,name=HitRate

Alerting condition: Hit ratio is lower than 0.9.

Required reaction: If the cache is full (capacity is equal to size), increase the size of the key cache.
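The key cache rule above can be read as a small decision procedure. The sketch below assumes that the hit rate, size, and capacity values have already been read from the Cache MBean. The function name and the returned labels are hypothetical. The text prescribes a reaction only for a full cache, so the remaining case is simply reported.

```python
def key_cache_action(hit_rate: float, size: int, capacity: int,
                     min_hit_rate: float = 0.9) -> str:
    """Map key cache readings to the reaction described in the text."""
    if hit_rate >= min_hit_rate:
        return "ok"
    if size >= capacity:
        # Low hit ratio and the cache is full: the documented reaction applies.
        return "increase key cache size"
    # Low hit ratio but the cache still has room: no reaction is prescribed.
    return "low hit rate, cache not full"


# Made-up readings for illustration.
print(key_cache_action(hit_rate=0.95, size=50, capacity=100))
print(key_cache_action(hit_rate=0.70, size=100, capacity=100))
print(key_cache_action(hit_rate=0.70, size=40, capacity=100))
```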

Visualizing Important Metrics in OpsCenter

OpsCenter collects metrics (see the OpsCenter 6.8, 6.7, 6.5, and 6.1 documentation) from all nodes in a cluster and stores the original data, together with rollups, in the DSE cluster. This data is then used to create graphs and alerts. When you use OpsCenter for monitoring, the following metrics and graphs are useful for setting up effective monitoring of the cluster:

  • Active Alerts

  • Cluster Health

  • Storage Capacity

  • Read and Write Request Latency

  • Read and Write Requests

  • Data Size

  • Compactions Pending

  • Dropped Messages: Mutations

  • Dropped Messages: Reads

  • Native Clients

  • For specific tables (setup for most important tables):

    • TBL: SSTables per read (percentiles)

    • TBL: Tombstones per read (percentiles)

    • TBL: Partition size (percentiles)

  • Related to hinted handoff:

    • Hints on Disk

    • TP: Hint Dispatcher Active

    • TP: Hint Dispatcher Completed

    • Dropped Messages: Hinted Handoff

  • Related to operating system:

    • OS: Disk Latency

    • OS: Load

    • OS: CPU Iowait

    • OS: Memory Free

  • Related to Java Virtual Machine:

    • Heap Used

    • JVM G1 Old Collection Count and Time

    • JVM G1 Young Collection Count and Time

  • When DSE Search is enabled:

    • Search: Core Size

    • Search: Read Latency

    • Search: Timeouts

  • When NodeSync is enabled:

    • TP: Read Range NodeSync Active

    • NodeSync: Uncompleted Pages, Failed Pages

Alerts in OpsCenter

OpsCenter alerts the operator when certain conditions are met (see the OpsCenter 6.8, 6.7, 6.5, and 6.1 documentation), for example, when a node is down or when latency is too high for a long period of time. OpsCenter can deliver alerts by email, SNMP, and HTTP requests.

Configure the following alerts to react promptly to problems in DSE clusters.

Node Down

When a node is marked as down by OpsCenter.

Condition: “<event>” for more than X <minutes|hours|days>

Recommendation: X: immediately or 1 minute (depending on whether some level of tolerance of the event is possible)

Criticality/Notification frequency: Urgent

Agent Issue

When a monitored DataStax Agent is having issues.

Condition: “<event>” for more than X <minutes|hours|days>

Recommendation: X: 30 minutes

Criticality/Notification frequency: Low

CPU Usage

The percentage of time that the CPU was busy.

Condition: “<event>” is above X for more than Y <minutes|hours|days>

Recommendation: X: 100, Y: 1 hour

Criticality/Notification frequency: Low

Load

The overall amount of work that a computer system performs.

Condition: “<event>” is above X for more than Y <minutes|hours|days>

Recommendation: X: 0.7 x total number of CPU cores, Y: 1 hour

Criticality/Notification frequency: High
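The recommended Load threshold depends on the number of CPU cores on each node. As a rough sketch, the value of X can be derived on a node with Python's standard library. The function name is an illustrative assumption. The script checks only a single reading, whereas the alert itself also requires the condition to hold for the duration Y.

```python
import os


def load_alert_threshold(cpu_cores: int, factor: float = 0.7) -> float:
    """Load-average level above which the alert condition is met."""
    return factor * cpu_cores


cores = os.cpu_count() or 1  # fall back to 1 if the count cannot be determined
threshold = load_alert_threshold(cores)
print(f"cores={cores} threshold={threshold:.1f}")

if hasattr(os, "getloadavg"):  # not available on every platform
    one_min, five_min, fifteen_min = os.getloadavg()
    state = "above" if fifteen_min > threshold else "within"
    print(f"15-minute load average {fifteen_min:.2f} is {state} the threshold")
```

For example, a node with 16 cores gives a threshold of 11.2.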

Write Request Latency (percentiles)

The response time (in milliseconds) for successful write operations.

Condition: “<event>” is above X ms/op for more than Y <minutes|hours|days> for selected Z percentile

Recommendation: X: <depending on application SLA>, Y: 4 hours, Z: 99

Criticality/Notification frequency: Medium

Read Request Latency (percentiles)

The response time (in milliseconds) for successful read operations.

Condition: “<event>” is above X ms/op for more than Y <minutes|hours|days> for selected Z percentile

Recommendation: X: <depending on application SLA>, Y: 4 hours, Z: 99

Criticality/Notification frequency: Medium

Advanced → System → Disk Usage(%)

The percentage of disk being used for a particular disk partition.

Condition: “<event>” is above X % for more than Y <minutes|hours|days>

Recommendation: X: 50, Y: 4 hours

Criticality/Notification frequency: Medium

Advanced → Tables → TBL: SSTables per Read (percentiles)

For a specified percentile, how many SSTables are accessed during a read.

Condition: “<event>” is above X SSTables for more than Y <minutes|hours|days> for W table at Z percentile.

Recommendation: X: 10, Y: 1 day, W: <table of interest>, Z: 99

Criticality/Notification frequency: Low

Advanced → Tables → TBL: Tombstones per Read (percentiles)

For a specified percentile, how many tombstones are accessed during a read.

Condition: “<event>” is above X tombstones for more than Y <minutes|hours|days> for W table at Z percentile.

Recommendation: X: tombstone_warn_threshold in cassandra.yaml, Y: 1 day, W: <table of interest>, Z: 99

Criticality/Notification frequency: Low

Advanced → Tables → TBL: Partition Size (percentiles)

For a specified percentile, what is the size (in bytes) of partitions of this table.

Condition: “<event>” is above X for more than Y <minutes|hours|days> for W table at Z percentile.

Recommendation: X: 200 MB (in bytes), Y: 1 day, W: <table of interest>, Z: 99

Criticality/Notification frequency: Low

Tools for work with JMX

A number of tools exist for one-off analysis of specific metrics. Usually you use these tools only for debugging, because they are not designed to replace monitoring solutions. These tools primarily provide access to individual metrics at a given moment on a specific node. They do not generate a view of what happens over time or provide multiple metrics on multiple nodes.

JConsole

JConsole is a GUI tool included in a Java distribution, such as OpenJDK. JConsole allows for easy browsing of the metrics and inspection of their values. It can also graph them over time.

Many more tools exist than are listed here.

Figure: Java Monitoring & Management Console

To access metrics, either JConsole needs to run on the server itself (usually requiring installation of the GUI libraries), or JMX needs to be accessible via the network, which exposes JMX externally and may have a security impact.

jmxterm

jmxterm is a popular command-line JMX tool. After downloading, it is easy to run and connect to a local node, or to other nodes when JMX is exposed externally:

$>open localhost:7199

#Connection to localhost:7199 is opened

You can access specific metrics with commands such as:

$>get -b org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size Value
#mbean = org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size:
Value = 0;
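When jmxterm output is captured by a script, the attribute lines can be extracted with a small parser. The following Python sketch is based only on the output shape shown above ("#" status lines followed by "Name = value;" lines). The function name is hypothetical, and composite values that span several lines are not handled.

```python
import re
from typing import Dict

# Matches lines such as "Value = 0;" and captures the name and the value.
_ATTRIBUTE_LINE = re.compile(r"^\s*(\w+)\s*=\s*(.*?);\s*$")


def parse_jmxterm_attributes(output: str) -> Dict[str, str]:
    """Extract attribute name/value pairs from captured jmxterm `get` output."""
    attributes: Dict[str, str] = {}
    for line in output.splitlines():
        if line.lstrip().startswith("#"):
            continue  # status lines such as "#mbean = ..."
        match = _ATTRIBUTE_LINE.match(line)
        if match:
            attributes[match.group(1)] = match.group(2)
    return attributes


sample = """#mbean = org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size:
Value = 0;
"""
print(parse_jmxterm_attributes(sample))
```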

To get a full list of supported commands, run help inside the interactive console. In addition to reading attribute values, you can set attributes (if they are settable) or invoke operations, which lets you temporarily modify the behavior of Cassandra (in the same way as with nodetool commands).

nodetool sjk (DSE and Cassandra 4.0)

DSE provides nodetool sjk (documented for DSE 6.8, 6.7, 6.0, and 5.1), a wrapper for the well-known Swiss Java Knife (SJK) library. This subcommand is convenient because you do not need to specify the DSE process PID or other parameters; you provide only the necessary flags. For example, to get the size of the key cache, use the following command, where the -b flag specifies the name of the bean and the -f flag specifies the field:

nodetool sjk mx -b "org.apache.cassandra.metrics:type=Cache,scope=KeyCache,name=Size" -mg -f Value

Similar to jmxterm, you can use this command to set values (when settable), or to call functions.

The scope of the SJK library is not limited to JMX. You can use it to get the thread dump, information about threads, and other functionality.
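To query several metrics from a script, the nodetool sjk mx invocation shown earlier can be assembled programmatically. The Python sketch below builds the full bean name from the org.apache.cassandra.metrics domain and the key properties used throughout this topic. The helper names are hypothetical, and the flags are exactly those from the example command.

```python
from typing import List

METRICS_DOMAIN = "org.apache.cassandra.metrics"


def metric_bean(**properties: str) -> str:
    """Build a full MBean name from key properties, keeping their order."""
    pairs = ",".join(f"{key}={value}" for key, value in properties.items())
    return f"{METRICS_DOMAIN}:{pairs}"


def sjk_get_command(bean: str, field: str) -> List[str]:
    """Argument list equivalent to: nodetool sjk mx -b <bean> -mg -f <field>"""
    return ["nodetool", "sjk", "mx", "-b", bean, "-mg", "-f", field]


bean = metric_bean(type="Cache", scope="KeyCache", name="Size")
command = sjk_get_command(bean, "Value")
print(" ".join(command))

# On a node where nodetool is on the PATH, the list could be passed to
# subprocess.run(command, capture_output=True, text=True) to read the value.
```

Passing the arguments as a list avoids shell quoting problems with the commas and equals signs in the bean name.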


© 2024 DataStax