Monitoring CDC for Cassandra

Change Agent Metrics

The change agent is a JVM agent that runs inside Apache Cassandra® nodes. It exposes a dedicated MBean, type=CdcAgent, with the following metrics:

Metric               Type     Description
SentMutations        Counter  Number of Cassandra mutations successfully sent to the streaming platform.
SentErrors           Counter  Number of errors when sending mutations to the streaming platform.
CommitLogReadErrors  Counter  Number of unrecoverable errors when reading commitlog files.
SkippedMutations     Counter  Number of mutations ignored because the primary key has an unsupported column type.
ExecutedTasks        Counter  Number of tasks executed to process commitlog files.
SubmittedTasks       Gauge    The current number of tasks submitted to the dedicated thread pool.
MaxSubmittedTasks    Gauge    The maximum number of submitted tasks.
PendingTasks         Gauge    The current number of pending tasks to re-process commitlog files.
MaxPendingTasks      Gauge    The maximum number of pending tasks.
UncleanedTasks       Gauge    The current number of tasks for which processed commitlog files have not yet been removed from the cdc_raw directory.
MaxUncleanedTasks    Gauge    The maximum number of uncleaned tasks.
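
As an illustration of how these metrics might be turned into alerts, the following Python sketch checks one reading of the agent metrics against a few thresholds. It assumes the values have already been collected by some JMX-capable tool and passed in as plain numbers. The AgentSnapshot class, the function names, and the threshold values are hypothetical and not part of CDC for Cassandra. Counters are cumulative, so a real check would more likely compare two successive readings.

```python
from dataclasses import dataclass


@dataclass
class AgentSnapshot:
    """One reading of selected CdcAgent metrics, however it was collected."""
    sent_mutations: int = 0          # SentMutations
    sent_errors: int = 0             # SentErrors
    commit_log_read_errors: int = 0  # CommitLogReadErrors
    skipped_mutations: int = 0       # SkippedMutations
    pending_tasks: int = 0           # PendingTasks
    uncleaned_tasks: int = 0         # UncleanedTasks


def send_error_ratio(snap: AgentSnapshot) -> float:
    """Approximate share of send attempts that failed (0.0 if none were made)."""
    attempts = snap.sent_mutations + snap.sent_errors
    return snap.sent_errors / attempts if attempts else 0.0


def agent_warnings(snap, max_error_ratio=0.01, max_pending=10, max_uncleaned=10):
    """Return a list of warnings; all thresholds are illustrative assumptions."""
    warnings = []
    if send_error_ratio(snap) > max_error_ratio:
        warnings.append("send error ratio above threshold")
    if snap.commit_log_read_errors > 0:
        warnings.append("unrecoverable commitlog read errors reported")
    if snap.skipped_mutations > 0:
        warnings.append("mutations skipped (unsupported primary key column type)")
    if snap.pending_tasks > max_pending:
        warnings.append("commitlog re-processing backlog is growing")
    if snap.uncleaned_tasks > max_uncleaned:
        warnings.append("processed commitlog files are accumulating in cdc_raw")
    return warnings


if __name__ == "__main__":
    healthy = AgentSnapshot(sent_mutations=20000)
    degraded = AgentSnapshot(sent_mutations=900, sent_errors=100, pending_tasks=25)
    print(agent_warnings(healthy))
    print(agent_warnings(degraded))
```

The ratio is only approximate: if a failed send is retried, the same mutation may count once in SentErrors and later in SentMutations.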

CDC for Cassandra stats

The CDC for Cassandra framework reports stats for each connector. You can view the stats for a connector like this:

pulsar-admin source stats --name cassandra-source-1

{
  "numInstances" : 1,
  "numRunning" : 0,
  "instances" : [ {
    "instanceId" : 0,
    "status" : {
      "running" : false,
      "error" : "",
      "numRestarts" : 0,
      "numReceivedFromSource" : 0,
      "numSystemExceptions" : 0,
      "numSourceExceptions" : 0,
      "numWritten" : 0,
      "lastReceivedTime" : 0,
      "workerId" : "pulsar-perf-aws-useast2-function-0"
    }
  } ]
}

The numReceivedFromSource and numWritten stats indicate how many events CDC for Cassandra has processed. If the connector has encountered errors, the exception counters (numSystemExceptions and numSourceExceptions) show how many, and the error field contains a description of the last error seen.
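
To check these stats from a script, the JSON output can be parsed and the per-instance fields added up. The following Python sketch does this for the sample output above. The summarize_source_stats function is a hypothetical helper, and the sketch assumes the JSON has exactly the shape shown in the sample.

```python
import json

# Sample copied from the pulsar-admin output shown above.
SAMPLE_STATS = """
{
  "numInstances" : 1,
  "numRunning" : 0,
  "instances" : [ {
    "instanceId" : 0,
    "status" : {
      "running" : false,
      "error" : "",
      "numRestarts" : 0,
      "numReceivedFromSource" : 0,
      "numSystemExceptions" : 0,
      "numSourceExceptions" : 0,
      "numWritten" : 0,
      "lastReceivedTime" : 0,
      "workerId" : "pulsar-perf-aws-useast2-function-0"
    }
  } ]
}
"""


def summarize_source_stats(stats: dict) -> dict:
    """Aggregate the per-instance status fields of one connector."""
    statuses = [inst.get("status", {}) for inst in stats.get("instances", [])]
    return {
        "instances": stats.get("numInstances", len(statuses)),
        "running": stats.get("numRunning", 0),
        "received": sum(s.get("numReceivedFromSource", 0) for s in statuses),
        "written": sum(s.get("numWritten", 0) for s in statuses),
        "exceptions": sum(
            s.get("numSystemExceptions", 0) + s.get("numSourceExceptions", 0)
            for s in statuses
        ),
        # Keep only non-empty error descriptions.
        "errors": [s["error"] for s in statuses if s.get("error")],
    }


if __name__ == "__main__":
    summary = summarize_source_stats(json.loads(SAMPLE_STATS))
    print(summary)
    if summary["running"] < summary["instances"]:
        print("warning: not all connector instances are running")
```

For the sample above, the summary reports one instance, none running, and no events received or written, so the warning is printed.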

CDC for Cassandra metrics

CDC for Cassandra also publishes the following per-message metrics:

Metric               Description
cache_hits           Number of mutation cache hits.
cache_misses         Number of mutation cache misses.
cache_evictions      Number of mutation cache evictions.
cache_size           Number of entries in the mutation cache.
query_latency        The CQL query latency in milliseconds to fetch the updated row. This is 0 when hitting the memory cache.
query_executors      The number of threads available to execute the CQL queries.
replication_latency  The replication latency in milliseconds (the CDC for Cassandra processing time minus the Cassandra mutation writetime).

Here is an example of those user-defined metrics, aggregated by Apache Pulsar™ when processing 20000 mutations:

curl http://localhost:8080/metrics/ 2>/dev/null | grep user
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# HELP pulsar_source_user_metric_ User defined metric.
# TYPE pulsar_source_user_metric_ summary
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.5",} 71683.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.9",} 99667.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.99",} 106717.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",quantile="0.999",} 106763.0
pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",} 20000.0
pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",} 1.3355407E9
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.5",} 1.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.9",} 1.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.99",} 1.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",quantile="0.999",} 1.0
pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",} 20000.0
pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",} 10000.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.5",} 2.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.9",} 9.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.99",} 104.0
pulsar_source_user_metric_{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",quantile="0.999",} 1035.0
pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",} 20000.0
pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",} 83886.0
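
Each summary in the output above carries a _sum and a _count series, so the mean value per message is the sum divided by the count. The following Python sketch extracts those two series from the metrics text and computes the means. The mean_per_message function and its regular expressions are hypothetical helpers written for this output format only, not a general Prometheus parser.

```python
import re
from collections import defaultdict

# The _count and _sum lines copied from the output above.
SAMPLE_METRICS = """\
pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",} 20000.0
pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="replication_latency",} 1.3355407E9
pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",} 20000.0
pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="cache_hit",} 10000.0
pulsar_source_user_metric__count{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",} 20000.0
pulsar_source_user_metric__sum{tenant="public",namespace="public/default",name="data-table1",instance_id="0",cluster="standalone",fqfn="public/default/data-table1",metric="query_latency",} 83886.0
"""

SERIES = re.compile(r'^pulsar_source_user_metric__(count|sum)\{(.*)\}\s+(\S+)$')
METRIC_LABEL = re.compile(r'\bmetric="([^"]*)"')


def mean_per_message(text: str) -> dict:
    """Return {metric name: sum / count}, adding up all matching series."""
    totals = defaultdict(lambda: {"sum": 0.0, "count": 0.0})
    for line in text.splitlines():
        series = SERIES.match(line.strip())
        if series is None:
            continue  # comments, quantile lines, unrelated metrics
        kind, labels, value = series.groups()
        label = METRIC_LABEL.search(labels)
        if label is None:
            continue
        totals[label.group(1)][kind] += float(value)
    # Skip metrics with a zero count to avoid dividing by zero.
    return {name: t["sum"] / t["count"] for name, t in totals.items() if t["count"]}


if __name__ == "__main__":
    for name, mean in mean_per_message(SAMPLE_METRICS).items():
        print(f"{name}: {mean:.4f}")
```

For this sample, the mean replication latency is about 66777 ms, the mean query latency is about 4.19 ms, and the mean of cache_hit is 0.5. If cache_hit is recorded as 1 for a hit and 0 for a miss (an assumption; the output does not say), that last value means half of the messages were served from the mutation cache.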

Monitoring and Alerting resources

  • The change agent exposes metrics with JMX, a technology within Java that provides tools for managing and monitoring applications.

  • DSE OpsCenter can collect these exposed metrics for visualization and alerts, and pass them on to DSE Metrics Collector for additional integration with Prometheus and Grafana.

  • The Metrics Collector for Apache Cassandra with Prometheus and Grafana dashboards provides the same functionality as DSE Metrics Collector, built on the well-supported collectd agent.

  • Other monitoring tools, such as the Prometheus JMX Exporter, are also available, but may require additional tuning.