CDC for Cassandra FAQs

If you are new to DataStax CDC for Apache Cassandra®, these frequently asked questions are for you.

Introduction

What is CDC for Cassandra?

The CDC for Cassandra is a an open-source product from DataStax.

With CDC for Cassandra, updates to data in Apache Cassandra are put into a Pulsar topic, which in turn can write the data to external targets such as Elasticsearch, Snowflake, and other platforms. The DataStax Cassandra Source Connector for Apache Pulsar™ component is simple, with a 1:1 correspondence between the Cassandra table and a single Pulsar topic.

What are the requirements for CDC for Cassandra?

Minimum requirements are:

  • Cassandra version 3.11+ or 4.0+, DSE 6.8.16+ for near real-time event streaming CDC

  • Cassandra version 3.0 to 3.10 for batch CDC

  • Luna Streaming 2.8.0+ or Apache Pulsar 2.8.1+

  • Additional memory and CPU available on all Cassandra nodes

Cassandra has supported batch CDC since Cassandra 3.0, but for near real-time event streaming, Cassandra 3.11+ or DSE 6.8.16+ are required.

Depending on the workloads of the CDC enabled C* tables, you may need to increase the CPU and memory specification of the C* nodes.

What is the impact of the C* CDC solution on the existing C* cluster?

For each CDC-enabled C* table, C* needs extra processing cycles and storage to process the CDC commit logs. The impact for dealing with a single CDC-enabled table is small, but when there are a large number of C* tables with CDC enabled, the impact within C* increases. The performance impact occurs within C* itself, not the C* CDC solution with Pulsar.

The CDC agent is started as a JVM agent of the C* process and it will share the same hardware resource of the same C* node. However, the only job that the CDC agent does is to scan the CDC commit log directory on a regular basis and send messages to the Pulsar cluster. This is a lightweight process when launched on a single thread, but the CDC agent can be launched with multiple threads. The more threads that are launched, the more resources will be consumed.

For each C* write operation (one detected change-event), the Pulsar CSC connector performs a primary key-based C* read to get the most complete, up-to-date information of that particular C* row.

In a worst-case scenario, where a CDC-enabled C* has 100% write workload, the CDC solution would double the workload by adding the same amount of read workload to C* table. Since the C* read is primary key-based, it will be efficient.

What are the CDC for Cassandra limitations?

CDC for Cassandra has the following limitations:

  • Does not manage table truncates.

  • Does not sync data available before starting the CDC agent.

  • Does not replay logged batches.

  • Does not manage time-to-live.

  • Does not support range deletes.

  • CQL column names must not match a Pulsar primitive type name (ex: INT32) below

Table Pulsar primitive types

Primitive type Description

BOOLEAN

A binary value

INT8

A 8-bit signed integer

INT16

A 16-bit signed integer

INT32

A 32-bit signed integer

INT64

A 64-bit signed integer

FLOAT

A single precision (32-bit) IEEE 754 floating-point number

DOUBLE

A double-precision (64-bit) IEEE 754 floating-point number

BYTES

A sequence of 8-bit unsigned bytes

STRING

A Unicode character sequence

TIMESTAMP (DATE, TIME)

A logic type represents a specific instant in time with millisecond precision.
It stores the number of milliseconds since January 1, 1970, 00:00:00 GMT as an INT64 value

What happens if Luna Streaming or Apache Pulsar is unavailable?

If the Pulsar cluster is down, the CDC agent on each C* node will periodically try to send the mutations, and will keep the CDC commitlog segments on disk until the data sending is successful.

The CDC agent keeps track of the CDC commitlog segment offsets, so the CDC agent knows where to resume sending the mutation messages when the Pulsar cluster is back online.

We recommend active monitoring of the disk space of the C* nodes. If the Pulsar cluster is down, the change agent will continue trying to send messages and the CDC commitlog files will accumulate on the C* node. If the maximum CDC directory disk space is reached, future C* writes to the CDC-enabled table will fail.

When the disk space of the cdc_raw directory reaches your cdc_total_space_in_mb Cassandra setting (less than 4GB by default), writes to CDC-enabled tables will fail with a CDCWriteException. The following warning message is included in Cassandra logs:

WARN  [CoreThread-5] 2021-10-29 09:12:52,790  NoSpamLogger.java:98 - Rejecting Mutation containing CDC-enabled table. Free up space in /mnt/data/cdc_raw.

To avoid or recover from this situation, increase the cdc_total_space_in_mb and restart the node. To prevent hitting this new limit, increase the write throughput to Luna Streaming or Apache Pulsar, or decrease the write throughput to your node.

Increasing the Luna Streaming or Apache Pulsar write throughput may involve tuning the change agent configuration (the number of allocated threads, the batching delay, the number of inflight messages), the Luna Streaming or Apache Pulsar configuration (the number of partitions of your topics), or the CDC for Cassandra configuration (query executors, batching and cache settings, connector parallelism).

As a last resort, if losing data is acceptable in your CDC pipeline, remove commitlog files from the cdc_raw directory. Restarting the node is not needed in this case.

I have multiple Cassandra datacenters. How do I configure CDC for Cassandra?

In a multi-datacenter Cassandra configuration, enable CDC and install the change agent in only one datacenter. To ensure the data sent to all datacenters are delivered to the data topic, make sure to configure replication to the datacenter that has CDC enabled on the table.

For example, given a Cassandra cluster with three datacenters (DC1, DC2, and DC3), you would enable CDC and install the change agent in only DC1. To ensure all updates in DC2 and DC3 are propagated to the data topic, configure the table’s keyspace to replicate data from DC2 and DC3 to DC1. For example, replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3, 'dc2': 3, 'dc3': 3}). The data replicated to DC1 will be processed by the change agent and eventually end up in the data topic.

Is CDC for Cassandra an open-source project?

Yes, CDC for Cassandra is open source using the Apache 2.0 license. You can find the source code on the GitHub repository datastax/cdc-apache-cassandra.

What does CDC for Cassandra provide that I cannot get with open-source Apache Pulsar?

In effect, the CDC for Cassandra implements the reverse of Apache Pulsar or DataStax Cassandra Sink Connector. With those sink connectors, data is taken from a Pulsar topic and put into Cassandra. With CDC for Cassandra, updates to a Cassandra table are converted into events and put into a data topic. From there, the data can be published to external platforms like Elasticsearch, Snowflake, and other platforms.

Where is the CDC for Cassandra public GitHub repository?

The source for this FAQs document is co-located with the CDC for Cassandra repository code. You can access the repository here.

How do I install CDC for Cassandra?

Follow the install instructions.

What is Prometheus?

Prometheus is an open-source tool to collect metrics on a running app, providing real-time monitoring and alerts.

What is Grafana?

Grafana is a visualization tool that helps you make sense of metrics and related data coming from your apps via Prometheus.