Cassandra Data Migrator

Use Cassandra Data Migrator to migrate and validate tables between origin and target Cassandra clusters, with available logging and reconciliation support.

Cassandra Data Migrator prerequisites

Read the prerequisites below before using the Cassandra Data Migrator.

  • Install or switch to Java 11. The Spark binaries are compiled with this version of Java.

  • Select a single VM to run this job and install Spark 3.5.1 there. No cluster is necessary.

  • Optionally, install Maven 3.9.x if you want to build the JAR for local development.

Run the following commands to install Apache Spark:

wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3-scala2.13.tgz

tar -xvzf spark-3.5.1-bin-hadoop3-scala2.13.tgz
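After extracting the archive, a quick sanity check along these lines confirms the Java and Spark versions before you run any jobs. This is a minimal sketch; the export path assumes you extracted the tarball into the current directory.

# Confirm that Java 11 is the active JVM (assumes java is already on your PATH).
java -version

# Make the Spark binaries available in this shell; adjust the path to your extraction location.
export SPARK_HOME="$PWD/spark-3.5.1-bin-hadoop3-scala2.13"
export PATH="$SPARK_HOME/bin:$PATH"

# Verify the Spark installation.
spark-submit --version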

Install Cassandra Data Migrator as a Container

Get the latest image that includes all dependencies from DockerHub.

All migration tools (cassandra-data-migrator, dsbulk, and cqlsh) are available in the /assets/ folder of the container.
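A minimal sketch of pulling and entering the container follows. It assumes the image is published as datastax/cassandra-data-migrator; confirm the exact image name and tag on DockerHub before use.

# Pull the latest image (image name assumed; verify on DockerHub).
docker pull datastax/cassandra-data-migrator:latest

# Start an interactive shell in the container; the migration tools are under /assets/.
docker run --rm -it datastax/cassandra-data-migrator:latest bash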

Install Cassandra Data Migrator as a JAR file

Download the latest JAR file from the latest release page of the Cassandra Data Migrator GitHub repo.

Version 4.x of Cassandra Data Migrator is not backward-compatible with *.properties files created in previous versions, and package names have changed. If you're starting a new migration, use the latest released version if possible.

Build Cassandra Data Migrator JAR for local development (optional)

Optionally, you can build the Cassandra Data Migrator JAR for local development. You’ll need Maven 3.9.x.

Example:

cd ~/github
git clone git@github.com:datastax/cassandra-data-migrator.git
cd cassandra-data-migrator
mvn clean package

The fat JAR file, cassandra-data-migrator-x.y.z.jar, should now be present in the target folder.
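As a quick check that the build succeeded, you can list the build output:

# Verify that the Maven build produced the fat JAR.
ls target/cassandra-data-migrator-*.jar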

Use Cassandra Data Migrator

  1. Configure the cdm*.properties file, provided in the Cassandra Data Migrator GitHub repo, for your environment. The file can have any name; it does not need to be cdm.properties or cdm-detailed.properties. In both versions, the spark-submit job processes only the parameters that aren't commented out; other parameters use their defaults or are ignored. See the descriptions and defaults in each file, and the minimal sketch after these steps. For more information, see the following:

    • The simplified sample properties configuration, cdm.properties. This file contains only those parameters that are commonly configured.

    • The complete sample properties configuration, cdm-detailed.properties, for the full set of configurable settings.

  2. Place the properties file that you chose and customized where it can be accessed when you run the job with spark-submit.

  3. Run the job using the spark-submit command:

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
  • The command redirects log output to a file, logfile_name_*.txt, instead of printing it to the console.

  • Update the memory options, driver and executor memory, based on your use case.
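The following is a minimal sketch of a cdm.properties file, referenced from step 1 above. The property names come from the reference tables later on this page; the hosts, credentials, and keyspace/table values are placeholders that you must replace for your environment.

spark.cdm.connect.origin.host        origin-host-or-ip
spark.cdm.connect.origin.port        9042
spark.cdm.connect.origin.username    cassandra
spark.cdm.connect.origin.password    cassandra

spark.cdm.connect.target.host        target-host-or-ip
spark.cdm.connect.target.port        9042
spark.cdm.connect.target.username    cassandra
spark.cdm.connect.target.password    cassandra

spark.cdm.schema.origin.keyspaceTable    <keyspacename>.<tablename>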

Run Cassandra Data Migrator in validation mode

To run your migration job with Cassandra Data Migrator in data validation mode, use the class option --class com.datastax.cdm.job.DiffData. Example:

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

The Cassandra Data Migrator validation job reports differences as ERROR entries in the log file. Example:

23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]

To get the list of missing or mismatched records, grep for all ERROR entries in the log files. Differences noted in the log file are listed by primary-key values.
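For example, a simple grep surfaces all of the recorded differences; the log file name matches whatever you used in the output redirection:

# List every validation difference recorded in the log files.
grep ERROR logfile_name_*.txt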

You can also run the Cassandra Data Migrator validation job in an AutoCorrect mode, which can:

  • Add any missing records from the origin to target cluster.

  • Update any mismatched records between the origin and target clusters; this action makes the target cluster the same as the origin cluster.

To enable or disable this feature, use one or both of the following settings in your *.properties configuration file.

spark.cdm.autocorrect.missing                     false|true
spark.cdm.autocorrect.mismatch                    false|true

The Cassandra Data Migrator validation job never deletes records from the target cluster. The job only adds or updates data on the target cluster.
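As an alternative to editing the properties file, the other examples on this page pass spark.cdm.* settings with --conf, so an equivalent AutoCorrect validation run might look like the following sketch (same JAR and properties file as above):

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.autocorrect.missing=true \
--conf spark.cdm.autocorrect.mismatch=true \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt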

Migrate or validate specific partition ranges

You can also use Cassandra Data Migrator to migrate or validate specific partition ranges. Create a partition file named ./<keyspacename>.<tablename>_partitions.csv in the current folder, and use the following format for the partition ranges. Example:

-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
2637884402540451982,4638499294009575633
798869613692279889,8699484505161403540

Each line in the CSV represents a partition-range (min,max).

Alternatively, you can pass the partition file with a command-line parameter. Example:

./spark-submit --properties-file cdm.properties \
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
 --conf spark.cdm.tokenrange.partitionFile.input="/<path-to-file>/<csv-input-filename>" \
 --master "local[*]" --driver-memory 25G --executor-memory 25G \
 --class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

This mode is especially useful for processing a subset of partition ranges that failed during a previous run.

When partition ranges fail during a run, the migration and validation jobs automatically generate a file named ./<keyspacename>.<tablename>_partitions.csv in the format shown above. The file contains the failed partition ranges; no file is created if there were no failed partitions. You can use this CSV as input to process the failed partitions in a subsequent run.
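For example, a follow-up run that processes only the previously failed ranges might look like the following sketch, which feeds the autogenerated CSV back in through the spark.cdm.tokenrange.partitionFile.input parameter shown above:

./spark-submit --properties-file cdm.properties \
 --conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
 --conf spark.cdm.tokenrange.partitionFile.input="./<keyspacename>.<tablename>_partitions.csv" \
 --master "local[*]" --driver-memory 25G --executor-memory 25G \
 --class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt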

Perform large-field guardrail violation checks

Use Cassandra Data Migrator to identify large fields in a table that may break your cluster guardrails. For example, Astra DB has a 10 MB limit for a single large field. Specify --class com.datastax.cdm.job.GuardrailCheck on the command. Example:

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Cassandra Data Migrator references

Common connection parameters for Origin and Target

  • spark.cdm.connect.origin.host (default: localhost): Hostname/IP address of the origin cluster. May be a comma-separated list, and can follow the <hostname>:<port> convention.

  • spark.cdm.connect.origin.port (default: 9042): Port number to use if not specified on spark.cdm.connect.origin.host.

  • spark.cdm.connect.origin.scb (default: not set): Secure Connect Bundle, used to connect to an Astra DB database. Example: file:///aaa/bbb/scb-enterprise.zip.

  • spark.cdm.connect.origin.username (default: cassandra): Username (or client_id value) used to authenticate.

  • spark.cdm.connect.origin.password (default: cassandra): Password (or client_secret value) used to authenticate.

  • spark.cdm.connect.target.host (default: localhost): Hostname/IP address of the target cluster. May be a comma-separated list, and can follow the <hostname>:<port> convention.

  • spark.cdm.connect.target.port (default: 9042): Port number to use if not specified on spark.cdm.connect.target.host.

  • spark.cdm.connect.target.scb (default: not set): Secure Connect Bundle, used to connect to an Astra DB database. Example if set: file:///aaa/bbb/my-scb.zip.

  • spark.cdm.connect.target.username (default: cassandra): Username (or client_id value) used to authenticate.

  • spark.cdm.connect.target.password (default: cassandra): Password (or client_secret value) used to authenticate.
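For example, a hedged properties sketch connecting from a self-managed origin cluster to an Astra DB target might look like the following; every value is a placeholder to replace for your environment.

spark.cdm.connect.origin.host        10.0.0.11,10.0.0.12
spark.cdm.connect.origin.port        9042
spark.cdm.connect.origin.username    cassandra
spark.cdm.connect.origin.password    cassandra

spark.cdm.connect.target.scb         file:///path/to/secure-connect-bundle.zip
spark.cdm.connect.target.username    <client_id>
spark.cdm.connect.target.password    <client_secret>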

Origin schema parameters

  • spark.cdm.schema.origin.keyspaceTable (required): The <keyspace>.<table_name> of the table to be migrated. The table must exist in the origin cluster.

  • spark.cdm.schema.origin.column.ttl.automatic (default: true): Default is true unless spark.cdm.schema.origin.column.ttl.names is specified. When true, the Time To Live (TTL) of the target record is determined by finding the maximum TTL of all origin columns that can have TTL set. This excludes partition key, clustering key, collection/UDT/tuple, and frozen columns. When false, and spark.cdm.schema.origin.column.ttl.names is not set, the target table's configuration determines the TTL of the target record.

  • spark.cdm.schema.origin.column.ttl.names (default: empty): Default is empty, meaning the names are determined automatically if spark.cdm.schema.origin.column.ttl.automatic is set. Otherwise, specify a subset of eligible columns that are used to calculate the TTL of the target record.

  • spark.cdm.schema.origin.column.writetime.automatic (default: true): Default is true unless spark.cdm.schema.origin.column.writetime.names is specified. When true, the WRITETIME of the target record is determined by finding the maximum WRITETIME of all origin columns that can have WRITETIME set. This excludes partition key, clustering key, collection/UDT/tuple, and frozen columns. When false, and spark.cdm.schema.origin.column.writetime.names is not set, the target table's configuration determines the target record's WRITETIME. The spark.cdm.transform.custom.writetime property, if set, overrides spark.cdm.schema.origin.column.writetime.names.

  • spark.cdm.schema.origin.column.writetime.names (default: empty): Default is empty, meaning the names are determined automatically if spark.cdm.schema.origin.column.writetime.automatic is set. Otherwise, specify a subset of eligible columns that are used to calculate the WRITETIME of the target record. Example: data_col1,data_col2,…

  • spark.cdm.schema.origin.column.names.to.target (default: empty): If column names differ between the origin and target clusters, this comma-separated list maps them using the format <origin_column_name>:<target_column_name>. Only renamed columns need to be listed.

For optimization reasons, Cassandra Data Migrator does not migrate TTL and writetime at the field level. Instead, Cassandra Data Migrator finds the field with the highest TTL and the field with the highest writetime within an origin table row, and uses those values on the entire target table row.
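For example, to compute TTL and WRITETIME from specific columns and to map a renamed column, a sketch with placeholder column names could look like this:

spark.cdm.schema.origin.column.ttl.names          data_col1,data_col2
spark.cdm.schema.origin.column.writetime.names    data_col1,data_col2
spark.cdm.schema.origin.column.names.to.target    old_col_name:new_col_name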

Target schema parameters

  • spark.cdm.schema.target.keyspaceTable (default: equal to spark.cdm.schema.origin.keyspaceTable): Commented out. The <keyspace>.<table_name> of the table to be migrated into the target. The table must exist in the target cluster.

Auto-correction parameters

Auto-correction parameters allow Cassandra Data Migrator to correct data differences found between the origin and target clusters when you run the DiffData program. Typically, these parameters are left disabled for "what if" migration testing, which generates a list of data discrepancies. You can then investigate the reasons for those discrepancies and, if necessary, enable the parameters below.

For information about invoking DiffData in a Cassandra Data Migrator command, see Run Cassandra Data Migrator in validation mode.

  • spark.cdm.autocorrect.missing (default: false): When true, data that is missing in the target cluster but found in the origin cluster is re-migrated to the target cluster.

  • spark.cdm.autocorrect.mismatch (default: false): When true, data that differs between the origin and target clusters is reconciled. The TIMESTAMP of records may have an effect: if the WRITETIME of the origin record, determined with .writetime.names, is earlier than the WRITETIME of the target record, the change does not appear in the target cluster. This comparative state can be particularly challenging to troubleshoot if individual columns or cells have been modified in the target cluster.

  • spark.cdm.autocorrect.missing.counter (default: false): Commented out. By default, counter tables are not copied when missing, unless explicitly set.

  • spark.tokenrange.partitionFile (default: ./<keyspace>.<tablename>_partitions.csv): Commented out. This CSV file is used as both input and output, when applicable. If the file exists, only the partition ranges in this file are migrated or validated. Similarly, if exceptions occur while migrating or validating, partition ranges with exceptions are logged to this file.

Performance and operations parameters

Performance and operations parameters that can affect migration throughput, error handling, and similar concerns.

  • spark.cdm.perfops.numParts (default: 10000): In standard operation, the full token range of -2^63 to 2^63-1 is divided into a number of parts, which are processed in parallel. Aim for each part to comprise approximately 1-10 GB of data to migrate. During initial testing, you may want this to be a small number, such as 1.

  • spark.cdm.perfops.batchSize (default: 5): The number of records put into an UNLOGGED batch when writing to the target cluster. Cassandra Data Migrator tends to work on one partition at a time, so if your partitions are large, this number can be increased. If this batch size often results in more than one partition per batch, reduce the value; ideally fewer than 1% of batches should contain more than one partition.

  • spark.cdm.perfops.ratelimit.origin (default: 20000): Concurrent number of operations across all parallel threads against the origin cluster. Adjust this value up or down depending on the amount of data and the processing capacity of the origin cluster.

  • spark.cdm.perfops.ratelimit.target (default: 40000): Concurrent number of operations across all parallel threads against the target cluster. Adjust this value up or down depending on the amount of data and the processing capacity of the target cluster.

  • spark.cdm.perfops.consistency.read (default: LOCAL_QUORUM): Commented out. Read consistency used on the origin cluster, and on the target cluster when records are read for comparison purposes. Valid values: ANY, ONE, TWO, THREE, QUORUM, LOCAL_ONE, EACH_QUORUM, LOCAL_QUORUM, SERIAL, LOCAL_SERIAL, ALL.

  • spark.cdm.perfops.consistency.write (default: LOCAL_QUORUM): Commented out. Write consistency used on the target cluster. Valid values: ANY, ONE, TWO, THREE, QUORUM, LOCAL_ONE, EACH_QUORUM, LOCAL_QUORUM, SERIAL, LOCAL_SERIAL, ALL.

  • spark.cdm.perfops.printStatsAfter (default: 100000): Commented out. Number of rows processed between progress log entries.

  • spark.cdm.perfops.fetchSizeInRows (default: 1000): Commented out. Affects the frequency of reads from the origin cluster and the frequency of flushes to the target cluster.

  • spark.cdm.perfops.errorLimit (default: 0): Commented out. Controls how many errors a thread may encounter during MigrateData and DiffData operations before failing. Recommendation: set this parameter to a non-zero value only when not doing a mutation-type operation, such as when running DiffData without .autocorrect.
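A hedged tuning sketch for a modest origin cluster might look like the following; the numbers are illustrative starting points, not recommendations for every environment.

spark.cdm.perfops.numParts            5000
spark.cdm.perfops.batchSize           5
spark.cdm.perfops.ratelimit.origin    10000
spark.cdm.perfops.ratelimit.target    20000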

Transformation parameters

Parameters to perform schema transformations between the origin and target clusters.

By default, these parameters are commented out.

  • spark.cdm.transform.missing.key.ts.replace.value (default: 1685577600000): Timestamp value in milliseconds. Partition and clustering columns cannot have null values. If such columns are added as part of a schema transformation between the origin and target clusters, the origin side may be null, in which case the Migrate data operation fails. This parameter supplies a crude constant value to use in its place, and is separate from the constant values feature.

  • spark.cdm.transform.custom.writetime (default: 0, disabled): Timestamp value in microseconds to use as the WRITETIME for the target record. This is useful when the WRITETIME of the record in the origin cluster cannot be determined, for example when the only non-key columns are collections. This parameter supplies a crude constant value to use in its place and overrides spark.cdm.schema.origin.column.writetime.names.

  • spark.cdm.transform.custom.writetime.incrementBy (default: 0): This is useful when you have a list that is not frozen and you are updating it using the autocorrect feature. Lists are not idempotent, and subsequent UPSERTs add duplicates to the list.

  • spark.cdm.transform.codecs (default: empty): A comma-separated list of additional codecs to enable:

    • INT_STRING: int stored in a string.

    • DOUBLE_STRING: double stored in a string.

    • BIGINT_STRING: bigint stored in a string.

    • DECIMAL_STRING: decimal stored in a string.

    • TIMESTAMP_STRING_MILLIS: timestamp stored in a string, as Epoch milliseconds.

    • TIMESTAMP_STRING_FORMAT: timestamp stored in a string with a custom format.

    Where there are multiple type pair options, such as TIMESTAMP_STRING_*, only one can be configured at a time with the spark.cdm.transform.codecs parameter.

  • spark.cdm.transform.codecs.timestamp.string.format (default: yyyyMMddHHmmss): Configuration for the CQL_TIMESTAMP_TO_STRING_FORMAT codec. The format must be valid for DateTimeFormatter.ofPattern(formatString).

  • spark.cdm.transform.codecs.timestamp.string.zone (default: UTC): Must be in ZoneRulesProvider.getAvailableZoneIds().
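For example, to read bigint values stored as strings and timestamps stored as formatted strings, a sketch could look like this; the format and zone values are illustrative.

spark.cdm.transform.codecs                            BIGINT_STRING,TIMESTAMP_STRING_FORMAT
spark.cdm.transform.codecs.timestamp.string.format    yyyyMMddHHmmss
spark.cdm.transform.codecs.timestamp.string.zone      UTC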

Cassandra filter parameters

Cassandra filters are applied on the coordinator node. Depending on the filter, the coordinator node may need to do a lot more work than is normal, notably because Cassandra Data Migrator specifies ALLOW FILTERING.

By default, these parameters are commented out.

  • spark.cdm.filter.cassandra.partition.min (default: -9223372036854775808): Default is 0 when using RandomPartitioner, and -9223372036854775808 (-2^63) otherwise. Lower bound of the partition range (inclusive).

  • spark.cdm.filter.cassandra.partition.max (default: 9223372036854775807): Default is 2^127-1 when using RandomPartitioner, and 9223372036854775807 (2^63-1) otherwise. Upper bound of the partition range (inclusive).

  • spark.cdm.filter.cassandra.whereCondition: CQL added to the WHERE clause of SELECT statements from the origin cluster.
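A sketch of a token-range filter combined with a WHERE-clause condition follows. The partition bounds, the column name, and the condition value are placeholders, and the exact form of the condition that gets appended to the WHERE clause should be confirmed against cdm-detailed.properties.

spark.cdm.filter.cassandra.partition.min     -9223372036854775808
spark.cdm.filter.cassandra.partition.max     9223372036854775807
spark.cdm.filter.cassandra.whereCondition    cluster_id = 'cluster1'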

Java filter parameters

Java filters are applied on the client node: data must be pulled from the origin cluster and then filtered. This option may have a lower impact on the production cluster than Cassandra filters, but it puts more load on the Cassandra Data Migrator processing node and pulls more data from Cassandra. Cassandra filters put load on the Cassandra nodes because Cassandra Data Migrator specifies ALLOW FILTERING, which can cause the coordinator node to perform significantly more work.

By default, these parameters are commented out.

  • spark.cdm.filter.java.token.percent (default: 100): Percentage, between 1 and 100, of the tokens in each split that are migrated. Use this property to do a wide and random sampling of the data. The percentage is applied to each split. Invalid percentages are treated as 100.

  • spark.cdm.filter.java.writetime.min (default: 0): The lowest (inclusive) writetime value to be migrated. Using the spark.cdm.filter.java.writetime.min and spark.cdm.filter.java.writetime.max thresholds, Cassandra Data Migrator can filter records based on their writetimes. The maximum writetime of the columns configured at spark.cdm.schema.origin.column.writetime.names is compared to the .min and .max thresholds, which must be in microseconds since the epoch. If spark.cdm.schema.origin.column.writetime.names is not specified, or the thresholds are null or otherwise invalid, the filter is ignored. Note that spark.cdm.perfops.batchSize is ignored when this filter is in place; a value of 1 is used instead.

  • spark.cdm.filter.java.writetime.max (default: 9223372036854775807): The highest (inclusive) writetime value to be migrated. The maximum timestamp of the columns specified by spark.cdm.schema.origin.column.writetime.names is compared to this threshold. If that property is not specified, or is for some reason null, the filter is ignored.

  • spark.cdm.filter.java.column.name: Filter rows based on matching a configured value. Specify the column name against which spark.cdm.filter.java.column.value is compared. Must be on the column list specified at spark.cdm.schema.origin.column.names. The column value is converted to a string, trimmed of whitespace on both ends, and compared.

  • spark.cdm.filter.java.column.value: String value to use as the comparison. Whitespace on the ends of spark.cdm.filter.java.column.value is trimmed.
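For example, to sample 10 percent of tokens and keep only rows whose writetime falls in a given microsecond-epoch window and whose status column equals a given value, a sketch could look like the following; the column name, value, and thresholds are placeholders.

spark.cdm.filter.java.token.percent     10
spark.cdm.filter.java.writetime.min     1672531200000000
spark.cdm.filter.java.writetime.max     1704067199999999
spark.cdm.filter.java.column.name       status
spark.cdm.filter.java.column.value      active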

Constant column feature parameters

The constant columns feature allows you to add constant columns to the target table. If used, the spark.cdm.feature.constantColumns.names, spark.cdm.feature.constantColumns.types, and spark.cdm.feature.constantColumns.values lists must all be the same length.

By default, these parameters are commented out.

  • spark.cdm.feature.constantColumns.names: A comma-separated list of column names, such as const1,const2.

  • spark.cdm.feature.constantColumns.types: A comma-separated list of column types.

  • spark.cdm.feature.constantColumns.values: A comma-separated list of hard-coded values. Provide each value as you would on the CQLSH command line. Examples: 'abcd' for a string, 1234 for an int, and so on.

  • spark.cdm.feature.constantColumns.splitRegex (default: ,): Defaults to a comma, but can be any regex character that works with String.split(regex). This option is needed because some data values contain commas, such as lists, maps, and sets.
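Putting the three lists together, a sketch that adds a constant text column and a constant int column might look like the following. The names and values are placeholders, and the type notation expected by .types (shown here as CQL type names) should be confirmed against cdm-detailed.properties.

# Constant columns sketch; verify the expected .types notation in cdm-detailed.properties.
spark.cdm.feature.constantColumns.names      const1,const2
spark.cdm.feature.constantColumns.types      text,int
spark.cdm.feature.constantColumns.values     'abcd',1234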

Explode map feature parameters

The explode map feature allows you to convert an origin table map into multiple target table records.

By default, these parameters are commented out.

  • spark.cdm.feature.explodeMap.origin.name: The name of the map column, such as my_map. Must be defined on spark.cdm.schema.origin.column.names, and the corresponding type on spark.cdm.schema.origin.column.types must be a map.

  • spark.cdm.feature.explodeMap.origin.name.key: The name of the column on the target table that holds the map key, such as my_map_key. This key must be present in the target primary key, spark.cdm.schema.target.column.id.names.

  • spark.cdm.feature.explodeMap.origin.value: The name of the column on the target table that holds the map value, such as my_map_value.
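A sketch using the example column names from the list above (my_map, my_map_key, my_map_value) follows; these names are examples only and must match your schema.

spark.cdm.feature.explodeMap.origin.name         my_map
spark.cdm.feature.explodeMap.origin.name.key     my_map_key
spark.cdm.feature.explodeMap.origin.value        my_map_value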

Guardrail feature parameter

The guardrail feature manages records that exceed guardrail checks. The guardrail job generates a report; other jobs skip records that exceed the guardrail limit.

By default, these parameters are commented out.

  • spark.cdm.feature.guardrail.colSizeInKB (default: 0): The default of 0 means the guardrail check is not done. If set, table records with one or more fields that exceed the column size in kB are flagged. Note that this is kB (base 10), not KiB (base 2).

TLS (SSL) connection parameters

These are TLS (SSL) connection parameters, if configured, for the origin and target clusters. Note that a secure connect bundle (SCB) embeds these details.

By default, these parameters are commented out.

  • spark.cdm.connect.origin.tls.enabled (default: false): If TLS is used, set to true.

  • spark.cdm.connect.origin.tls.trustStore.path: Path to the Java truststore file.

  • spark.cdm.connect.origin.tls.trustStore.password: Password needed to open the truststore.

  • spark.cdm.connect.origin.tls.trustStore.type (default: JKS)

  • spark.cdm.connect.origin.tls.keyStore.path: Path to the Java keystore file.

  • spark.cdm.connect.origin.tls.keyStore.password: Password needed to open the keystore.

  • spark.cdm.connect.origin.tls.enabledAlgorithms (default: TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA)

  • spark.cdm.connect.target.tls.enabled (default: false): If TLS is used, set to true.

  • spark.cdm.connect.target.tls.trustStore.path: Path to the Java truststore file.

  • spark.cdm.connect.target.tls.trustStore.password: Password needed to open the truststore.

  • spark.cdm.connect.target.tls.trustStore.type (default: JKS)

  • spark.cdm.connect.target.tls.keyStore.path: Path to the Java keystore file.

  • spark.cdm.connect.target.tls.keyStore.password: Password needed to open the keystore.

  • spark.cdm.connect.target.tls.enabledAlgorithms (default: TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA)
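For a cluster secured with TLS, a hedged origin-side sketch could look like the following; the paths and passwords are placeholders, and the same pattern applies to the spark.cdm.connect.target.tls.* parameters.

spark.cdm.connect.origin.tls.enabled                 true
spark.cdm.connect.origin.tls.trustStore.path         /path/to/truststore.jks
spark.cdm.connect.origin.tls.trustStore.password     <truststore-password>
spark.cdm.connect.origin.tls.trustStore.type         JKS
spark.cdm.connect.origin.tls.keyStore.path           /path/to/keystore.jks
spark.cdm.connect.origin.tls.keyStore.password       <keystore-password>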
