Monitoring CDC for Cassandra Change Agent Metrics The change agent is a JVM agent running in Apache Cassandra® nodes and provides a dedicated MBean type=CdcAgent with the following metrics: Metric Type Description SentMutations Counter Number of Cassandra mutations successfully sent to the streaming platform. SentErrors Counter Number of errors when sending mutations to the streaming platform. CommitLogReadErrors Counter Number of unrecoverable commitlog file reads. SkippedMutations Counter Number of ignored mutations because the primary key has an unsupported column type. ExecutedTasks Counter Number of executed tasks to process commitlog files. SubmittedTasks Gauge The current number of submitted tasks to the dedicated thread pool. MaxSubmittedTasks Gauge The maximum number of submitted tasks. PendingTasks Gauge The current number of pending tasks to re-process commitlog files. MaxPendingTasks Gauge The maximum number of pending tasks. UncleanedTasks Gauge The current number of tasks for which processed commitlog file have not yet been removed from the cdc_raw directory. MaxUncleanedTasks Gauge The maximum number of uncleaned tasks. CDC for Cassandra stats The CDC for Cassandra framework reports stats for each connector. You can view the stats for a connector like this: pulsar-admin source stats --name cassandra-source-1 { "numInstances" : 1, "numRunning" : 0, "instances" : [ { "instanceId" : 0, "status" : { "running" : false, "error" : "", "numRestarts" : 0, "numReceivedFromSource" : 0, "numSystemExceptions" : 0, "numSourceExceptions" : 0, "numWritten" : 0, "lastReceivedTime" : 0, "workerId" : "pulsar-perf-aws-useast2-function-0" } } ] } The stats numReceivedFromSource and numWritten indicate how many events have been processed by the CDC for Cassandra. If the connector has errors, the counts are shown. A description of the last seen error is displayed in the error field. CDC for Cassandra metrics CDC for Cassandra also publishes per message metrics: Metric Description cache_hits Number of mutation cache hits. cache_misses Number of mutation cache misses. cache_evictions Number of mutation cache evictions. cache_size Number of entries in the mutation cache. query_latency The CQL query latency in milliseconds to fetch the updated row. This is 0 when hitting the memory cache. query_executors The number of threads available to execute the CQL queries. replication_latency The replication latency in milliseconds (the CDC for Cassandra processing time minus the Cassandra mutation writetime). Here an example of those user-defined metrics aggregated by Apache Pulsar™ when processing 2000 mutations: curl http://localhost:8080/metrics/ 2>/dev/null | grep user # HELP process_cpu_seconds_total Total user and system CPU time spent in seconds. # HELP pulsar_source_user_metric_ User defined metric. # TYPE pulsar_source_user_metric_ summary pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.5",} 71683.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.9",} 99667.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.99",} 106717.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.999",} 106763.0 pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",} 20000.0 pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",} 1.3355407E9 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.5",} 1.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.9",} 1.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.99",} 1.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.999",} 1.0 pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",} 20000.0 pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",} 10000.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.5",} 2.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.9",} 9.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.99",} 104.0 pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.999",} 1035.0 pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",} 20000.0 pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",} 83886.0 Monitoring and Alerting resources The change agent exposes metrics with JMX, a technology within Java that provides tools for managing and monitoring applications. DSE Ops Center can collect these exposed metrics for visualization and alerts, and pass them on to DSE Metrics Collector for additional integration with Prometheus and Grafana. The Metrics Collector for Apache Cassandra with Prometheus and Grafana dashboards provides the same functionality as DSE Metrics Collector, built on the well-supported collectd agent. Other monitoring tools like JMX Exporter by Prometheus are available, but may require additional tuning.