Monitoring CDC for Cassandra
Change Agent Metrics
The change agent is a JVM agent running in Apache Cassandra® nodes and provides a dedicated MBean type=CdcAgent
with the following metrics:
Metric | Type | Description |
---|---|---|
SentMutations |
Counter |
Number of Cassandra mutations successfully sent to the streaming platform. |
SentErrors |
Counter |
Number of errors when sending mutations to the streaming platform. |
CommitLogReadErrors |
Counter |
Number of unrecoverable commitlog file reads. |
SkippedMutations |
Counter |
Number of ignored mutations because the primary key has an unsupported column type. |
ExecutedTasks |
Counter |
Number of executed tasks to process commitlog files. |
SubmittedTasks |
Gauge |
The current number of submitted tasks to the dedicated thread pool. |
MaxSubmittedTasks |
Gauge |
The maximum number of submitted tasks. |
PendingTasks |
Gauge |
The current number of pending tasks to re-process commitlog files. |
MaxPendingTasks |
Gauge |
The maximum number of pending tasks. |
UncleanedTasks |
Gauge |
The current number of tasks for which processed commitlog file have not yet been removed from the |
MaxUncleanedTasks |
Gauge |
The maximum number of uncleaned tasks. |
CDC for Cassandra stats
The CDC for Cassandra framework reports stats for each connector. You can view the stats for a connector like this:
pulsar-admin source stats --name cassandra-source-1
{
"numInstances" : 1,
"numRunning" : 0,
"instances" : [ {
"instanceId" : 0,
"status" : {
"running" : false,
"error" : "",
"numRestarts" : 0,
"numReceivedFromSource" : 0,
"numSystemExceptions" : 0,
"numSourceExceptions" : 0,
"numWritten" : 0,
"lastReceivedTime" : 0,
"workerId" : "pulsar-perf-aws-useast2-function-0"
}
} ]
}
The stats numReceivedFromSource
and numWritten
indicate how many events have been processed by the CDC for Cassandra.
If the connector has errors, the counts are shown.
A description of the last seen error is displayed in the error
field.
CDC for Cassandra metrics
CDC for Cassandra also publishes per message metrics:
Metric | Description |
---|---|
cache_hits |
Number of mutation cache hits. |
cache_misses |
Number of mutation cache misses. |
cache_evictions |
Number of mutation cache evictions. |
cache_size |
Number of entries in the mutation cache. |
query_latency |
The CQL query latency in milliseconds to fetch the updated row. This is 0 when hitting the memory cache. |
query_executors |
The number of threads available to execute the CQL queries. |
replication_latency |
The replication latency in milliseconds (the CDC for Cassandra processing time minus the Cassandra mutation writetime). |
Here an example of those user-defined metrics aggregated by Apache Pulsar™ when processing 2000 mutations:
curl http://localhost:8080/metrics/ 2>/dev/null | grep user
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# HELP pulsar_source_user_metric_ User defined metric.
# TYPE pulsar_source_user_metric_ summary
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.5",} 71683.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.9",} 99667.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.99",} 106717.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.999",} 106763.0
pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",} 20000.0
pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",} 1.3355407E9
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.5",} 1.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.9",} 1.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.99",} 1.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.999",} 1.0
pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",} 20000.0
pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",} 10000.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.5",} 2.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.9",} 9.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.99",} 104.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.999",} 1035.0
pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",} 20000.0
pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",} 83886.0
Monitoring and Alerting resources
-
The change agent exposes metrics with JMX, a technology within Java that provides tools for managing and monitoring applications.
-
DSE Ops Center can collect these exposed metrics for visualization and alerts, and pass them on to DSE Metrics Collector for additional integration with Prometheus and Grafana.
-
The Metrics Collector for Apache Cassandra with Prometheus and Grafana dashboards provides the same functionality as DSE Metrics Collector, built on the well-supported collectd agent.
-
Other monitoring tools like JMX Exporter by Prometheus are available, but may require additional tuning.