CDC for Cassandra FAQs

The following are frequently asked questions about DataStax CDC for Apache Cassandra® and its features.

What is CDC for Cassandra?

CDC for Cassandra is an open-source product from DataStax.

With CDC for Cassandra, updates to data in Apache Cassandra or DataStax Enterprise (DSE) are published to a Pulsar topic, from which the data can be written to external targets such as Elasticsearch, Snowflake, and other platforms. The DataStax Cassandra Source Connector (CSC) for Apache Pulsar™ component maintains a one-to-one correspondence between a Cassandra table and a single Pulsar topic.

Is CDC for Cassandra an open-source project? Where can I find the repository?

Yes, CDC for Cassandra is open source software under the Apache 2.0 license. You can find the source code on the DataStax CDC for Cassandra GitHub repository.

What does CDC for Cassandra provide that I cannot get with open-source Apache Pulsar™?

In effect, CDC for Cassandra implements the reverse of the Apache Pulsar or DataStax Cassandra Sink Connector. With those sink connectors, data is taken from a Pulsar topic and written to Cassandra. With CDC for Cassandra, updates to a Cassandra table are converted into events and published to a data topic. From there, the data can be published to external platforms such as Elasticsearch and Snowflake.

How do I install CDC for Cassandra?

Follow the installation instructions.

What are the requirements for CDC for Cassandra?

See the installation instructions.

I have multiple Cassandra datacenters. How do I configure CDC for Cassandra?

See Deploy multiple Cassandra datacenters.

What is the impact of the Cassandra CDC solution on the existing Cassandra cluster?

For each CDC-enabled Cassandra table, Cassandra needs extra processing cycles and storage to process the CDC commit logs. The impact of a single CDC-enabled table is small, but it grows as the number of CDC-enabled tables increases. The performance impact occurs within Cassandra itself, not in the Cassandra CDC solution with Pulsar.

The Change Agent for Cassandra is started as a JVM agent of the Cassandra process, so it shares the hardware resources of its Cassandra node. However, the only job of the Change Agent for Cassandra is to scan the CDC commit log directory on a regular basis and send messages to the Pulsar cluster. This is a lightweight process when launched on a single thread, but the Change Agent for Cassandra can be launched with multiple threads. As more threads are launched, more resources are consumed.

For each Cassandra write operation (one detected change-event), the Pulsar CSC connector performs a primary key-based Cassandra read to get the most complete, up-to-date information of that particular Cassandra row.

In a worst-case scenario, where a CDC-enabled Cassandra table has a 100% write workload, the CDC solution doubles the workload by adding an equal amount of read workload to that table. Because each Cassandra read is primary key-based, it is efficient.
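The worst case described above can be made concrete with a little arithmetic. The following Python sketch is illustrative only: the function name and its parameters are hypothetical, and it assumes exactly one primary key read per detected change, which ignores any reduction from the connector's cache settings.

```python
def estimate_cdc_load(writes_per_sec: float, cdc_write_fraction: float = 1.0) -> dict:
    """Estimate Cassandra operations per second once CDC read-backs are added.

    Assumes one primary key read per write to a CDC-enabled table, so the
    result is an upper bound, not a measurement.
    """
    if writes_per_sec < 0:
        raise ValueError("writes_per_sec must be non-negative")
    if not 0.0 <= cdc_write_fraction <= 1.0:
        raise ValueError("cdc_write_fraction must be between 0 and 1")
    cdc_reads = writes_per_sec * cdc_write_fraction
    return {
        "writes_per_sec": writes_per_sec,
        "cdc_reads_per_sec": cdc_reads,
        "total_ops_per_sec": writes_per_sec + cdc_reads,
    }


# Worst case: every write targets a CDC-enabled table, so the load doubles.
worst_case = estimate_cdc_load(10_000)

# Milder case: only a quarter of the writes hit CDC-enabled tables.
partial = estimate_cdc_load(10_000, cdc_write_fraction=0.25)
```

Treat the result as a planning ceiling. Real read volume depends on how many tables have CDC enabled and on how the connector is configured.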

What are the CDC for Cassandra limitations?

See CDC for Cassandra limitations.

What happens if the Apache Pulsar service is unavailable?

If the Pulsar cluster is down, the Change Agent for Cassandra on each Cassandra node periodically attempts to send the mutations, and it keeps the CDC commitlog segments on disk until the data is sent successfully.

The Change Agent for Cassandra keeps track of the CDC commitlog segment offsets, so the Change Agent for Cassandra knows where to resume sending the mutation messages when the Pulsar cluster is back online.
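The resume behavior described above can be pictured with a toy model. This Python sketch is not the Change Agent's actual implementation: the class names, the per-segment offset dictionary, and the boolean `send` callback are all assumptions made for illustration.

```python
from typing import Callable, Dict, List


class SegmentReplayer:
    """Toy model of offset tracking; not the Change Agent's real code."""

    def __init__(self, send: Callable[[str], bool]) -> None:
        # send() stands in for publishing to Pulsar and returns True on success.
        self._send = send
        # Maps a commitlog segment name to the index of the next unsent mutation.
        self.offsets: Dict[str, int] = {}

    def drain(self, segment: str, mutations: List[str]) -> bool:
        """Send unsent mutations in order and stop at the first failure.

        Returns True when the whole segment was delivered, meaning it no
        longer needs to be kept on disk.
        """
        position = self.offsets.get(segment, 0)
        while position < len(mutations):
            if not self._send(mutations[position]):
                break  # Pulsar is unavailable; retry later from this offset.
            position += 1
        self.offsets[segment] = position
        return position == len(mutations)


class FakeBroker:
    """Hypothetical stand-in for a Pulsar cluster that can be switched off."""

    def __init__(self) -> None:
        self.up = True
        self.delivered: List[str] = []

    def send(self, mutation: str) -> bool:
        if not self.up:
            return False
        self.delivered.append(mutation)
        return True


broker = FakeBroker()
replayer = SegmentReplayer(broker.send)
segment = ["m0", "m1", "m2", "m3"]

# Two mutations are delivered, then the broker goes down.
replayer.drain("segment-1", segment[:2])
broker.up = False
during_outage = replayer.drain("segment-1", segment)
offset_during_outage = replayer.offsets["segment-1"]

# When the broker returns, sending resumes at the saved offset.
broker.up = True
after_recovery = replayer.drain("segment-1", segment)
```

Because the offset is saved per segment, the retry continues at `m2` instead of resending `m0` and `m1`.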

DataStax recommends active monitoring of the disk space of the Cassandra nodes. If the Pulsar cluster is down, the change agent continues trying to send messages, and the CDC commitlog files accumulate on the Cassandra node. If the maximum CDC directory disk space is reached, future Cassandra writes to the CDC-enabled table will fail.

When the disk space used by the cdc_raw directory reaches your cdc_total_space_in_mb Cassandra setting (by default, the smaller of 4 GB and one eighth of the volume's total space), writes to CDC-enabled tables fail with a CDCWriteException. The following warning message is included in Cassandra logs:

WARN  [CoreThread-5] 2021-10-29 09:12:52,790  NoSpamLogger.java:98 - Rejecting Mutation containing CDC-enabled table. Free up space in /mnt/data/cdc_raw.

To avoid or recover from this situation, increase the cdc_total_space_in_mb setting and restart the node. To avoid hitting the new limit, increase the write throughput to your Pulsar cluster, or decrease the write throughput to your node.

Increasing the write throughput can involve tuning one or more of the following:

Change agent configuration: The number of allocated threads, the batching delay, the number of inflight messages
Pulsar cluster configuration: The number of partitions of your topics
CSC for Pulsar configuration: The query executors, batching and cache settings, connector parallelism

As a last resort, if losing data is acceptable in your CDC pipeline, remove commitlog files from the cdc_raw directory. Restarting the node isn’t needed in this case.
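The disk-space monitoring recommended above could be approximated with a small script like the following. It is a minimal sketch, not a DataStax tool: the function names and the 80% warning threshold are assumptions, and the limit is interpreted as cdc_total_space_in_mb multiplied by 1024 * 1024 bytes. Pass the cdc_raw path and limit that match your own cassandra.yaml.

```python
import os


def directory_size_bytes(path: str) -> int:
    """Sum the sizes of all files under path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
            except OSError:
                # A segment may be deleted between listing and stat; skip it.
                continue
    return total


def cdc_space_status(cdc_raw_dir: str, cdc_total_space_in_mb: int,
                     warn_ratio: float = 0.8) -> dict:
    """Compare cdc_raw usage with the configured limit.

    warn_ratio is a hypothetical alerting threshold, not a Cassandra setting.
    """
    if cdc_total_space_in_mb <= 0:
        raise ValueError("cdc_total_space_in_mb must be positive")
    used = directory_size_bytes(cdc_raw_dir)
    limit = cdc_total_space_in_mb * 1024 * 1024
    ratio = used / limit
    return {
        "used_bytes": used,
        "limit_bytes": limit,
        "ratio": ratio,
        "warn": ratio >= warn_ratio,
    }
```

For example, calling `cdc_space_status("/mnt/data/cdc_raw", 4096)` on a schedule and alerting when `warn` is true leaves time to act before writes start being rejected.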

How do I know if CDC is enabled on a table?

You can check the CDC status of a table by running the following CQL query:

SELECT * FROM system_distributed.cdc_local WHERE keyspace_name = 'keyspace_name' AND table_name = 'table_name';

There are three possible statuses:

Enabled

If the CDC status is enabled, then CDC is enabled on the table.

From this status, you can disable CDC on the table by running the following CQL query:

ALTER TABLE keyspace_name.table_name WITH cdc = false;

Disabled

If the CDC status is disabled, then CDC is disabled on the table.

From this status, you can enable CDC on the table by running the following CQL query:

ALTER TABLE keyspace_name.table_name WITH cdc = true;

Null

If the CDC status is null, then CDC isn’t enabled on the table.

From this status, you can enable CDC on the table by running the following CQL query:

ALTER TABLE keyspace_name.table_name WITH cdc = true;

How do I know if the Change Agent for Cassandra is running?

You can check the status of the Change Agent for Cassandra by running the following CQL query:

SELECT * FROM system_distributed.cdc_local WHERE keyspace_name = 'cdc' AND table_name = 'raw_cdc';

There are three possible statuses:

Running

If the status column is running, then the agent is running.

From this status, you can stop the agent by running the following CQL query:

ALTER TABLE cdc.raw_cdc WITH cdc = false;

Stopped

If the status column is stopped, then the agent isn’t running.

From this status, you can start the agent by running the following CQL query:

ALTER TABLE cdc.raw_cdc WITH cdc = true;

Null

If the status column is null, then the agent isn’t running.

From this status, you can start the agent by running the following CQL query:

ALTER TABLE cdc.raw_cdc WITH cdc = true;

What happens to unacknowledged event messages the Change Agent for Cassandra cannot deliver?

Unacknowledged messages mean the Change Agent for Cassandra couldn’t produce the event message in Pulsar. In this case, the table row mutation fails. The Cassandra client handles this as an exception. The data is committed to Cassandra and no event is created.

Another scenario is that the Pulsar broker is too busy to process messages and a backlog builds up. In this case, the Pulsar backlog policies take effect and event messages are handled accordingly. The data is committed to Cassandra, but the event message might be created with some additional latency.

The design of CDC in Cassandra assumes that when table changes are synchronized to the cdc_raw log, another process is draining that log. A maximum log size setting disables writes to the table when the threshold is reached. If the log can only be drained through a connection to the Pulsar cluster, and that cluster isn’t responsive, then the log begins to fill, which can impact a table’s write availability.

For more information, see Scaling up your CDC configuration.

Does the DataStax Cassandra Source Connector (CSC) for Apache Pulsar™ use a dead-letter topic?

A dead letter topic is used when a message cannot be delivered to a consumer. Possible causes include the following:

The message acknowledgment time expired before any consumer acknowledged receipt of the message
A consumer negatively acknowledged the message
A retry letter topic is in use and retries were exhausted

The CSC for Pulsar creates a consumer to receive new event messages from the Change Agent for Cassandra, but doesn’t configure a dead letter topic. It is assumed that parallel instances, broker compute, and function worker compute are sized to handle the workload.

How do I scale CDC to handle my production loads?

There are three areas of scalability to focus on:

Hosts in the Cassandra cluster: The Change Agent for Cassandra runs on each host as a JVM agent of the Cassandra process. If you administer your own Cassandra cluster, you can tune the JVM compute properties to handle the workload. If you use Cassandra in a serverless environment, the JVM is already set to handle significant load.
Number of CSC for Pulsar instances that are running: This is initially set when the Cassandra source connector is created, and it can be updated throughout the life of the running connector. Depending on your Pulsar configuration, an instance can be a process thread on the broker or a function worker. If you use Kubernetes, an instance can be a pod. Each calls for a different scaling strategy, such as increasing compute, adding workers, or adding Kubernetes nodes.
Broker backlog size and throughput tolerances: A large number of messages can be created, so you must ensure the Pulsar cluster is sized correctly. For more information, see Production Cluster Sizing.

How do I filter table data by column?

Transformation functions let you manipulate CDC messages with no code required. Deploy a transformation function inline so that it reads from the data topic and writes the filtered messages to a different topic. Give that topic a memorable name, such as filtered-data.
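To show what column filtering does to a message, here is a minimal Python sketch. It only illustrates the effect on a record represented as a plain dictionary. It is not the transformation function itself, which needs no code, and the function name, column names, and record shape are hypothetical.

```python
from typing import Any, Dict, Iterable


def drop_columns(record: Dict[str, Any], columns: Iterable[str]) -> Dict[str, Any]:
    """Return a copy of record without the named columns.

    The input record is left untouched, mirroring how the original message
    stays on the data topic while the filtered copy goes to another topic.
    """
    blocked = set(columns)
    return {name: value for name, value in record.items() if name not in blocked}


# Hypothetical change event value read from the data topic.
change_event = {"id": 42, "name": "Ada", "email": "ada@example.com", "ssn": "000-00-0000"}

# Copy that would be written to the filtered topic.
filtered = drop_columns(change_event, ["email", "ssn"])
```

Consumers that must not see the dropped columns would then subscribe to the filtered topic instead of the data topic.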

How do I configure multi-region CDC using the Cassandra sink?

One of the requirements for CDC is that the Cassandra and Pulsar clusters must be in the same cloud region or on-premises data center. If you are using geo-replication, you need the change data to be replicated across multiple clusters. The most manageable way to handle this is to use the Pulsar Cassandra sink to "watch" the CDC data topic and write the changes to a different Cassandra table in another organization.

The Cassandra sink requires the following provisions:

Use the CDC data topic as its source of messages
Provide a secure bundle (credentials) for the other Cassandra cluster
Map message values to a specific table in the other cluster
Use the Pulsar delivery guarantee to ensure success
Use the Pulsar connector health metrics to monitor failures
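The mapping step in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not the sink's configuration format: the function name, the `key.` and `value.` prefixes for addressing message fields, and the sample columns are all hypothetical.

```python
from typing import Any, Dict


def map_message_to_row(key: Dict[str, Any], value: Dict[str, Any],
                       mapping: Dict[str, str]) -> Dict[str, Any]:
    """Build a target-table row from a message key and value.

    mapping maps each target column to a source field, written as
    "key.<field>" or "value.<field>".
    """
    fields = {f"key.{name}": item for name, item in key.items()}
    fields.update({f"value.{name}": item for name, item in value.items()})
    missing = sorted(source for source in mapping.values() if source not in fields)
    if missing:
        # Fail loudly instead of writing a partial row to the other cluster.
        raise KeyError(f"message is missing mapped fields: {missing}")
    return {column: fields[source] for column, source in mapping.items()}


# Hypothetical message from the CDC data topic and a column mapping.
row = map_message_to_row(
    key={"id": 42},
    value={"name": "Ada", "email": "ada@example.com"},
    mapping={"user_id": "key.id", "full_name": "value.name"},
)
```

Unmapped fields, such as `email` here, are simply not written to the target table, so the same idea applies when column names differ between the source and target tables.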

How do I migrate table data using CDC?

Migrating data between tables solves quite a few different challenges. The basic approach is to use a Cassandra sink to watch the Cassandra source and write to another table while mapping columns appropriately. As the original table is phased out, the number of messages decreases to none, while consumers are watching the new table’s CDC data topic. For more information, see "How do I configure multi-region CDC using the Cassandra sink?"