Cassandra Data Migrator

You can use Cassandra Data Migrator (CDM) to migrate and validate tables between the origin and target Cassandra clusters, with optional logging and reconciliation support.

CDM facilitates data transfer by creating multiple jobs that access the Cassandra cluster concurrently, making it an ideal choice for migrating large datasets. It offers extensive configuration options, including logging, reconciliation, performance optimization, and more.

Install Cassandra Data Migrator

DataStax recommends that you always install the latest version of CDM to get the latest features, dependencies, and bug fixes.

Install as a container

Get the latest cassandra-data-migrator image that includes all dependencies from DockerHub.

The container’s assets directory includes all required migration tools: cassandra-data-migrator, dsbulk, and cqlsh.

Install as a JAR file

Install Java 11 or later, which is required to run the Spark binaries.
Install Apache Spark™ version 3.5.x with Scala 2.13 and Hadoop 3.3 or later.
- Single VM
- Spark cluster
For one-off migrations, you can install the Spark binary on a single VM where you will run the CDM job.
Get the Spark tarball from the Apache Spark archive.

wget https://archive.apache.org/dist/spark/spark-3.5.PATCH/spark-3.5.PATCH-bin-hadoop3-scala2.13.tgz

Replace PATCH with your Spark patch version.

Change to the directory where you want to install Spark, and then extract the tarball:

tar -xvzf spark-3.5.PATCH-bin-hadoop3-scala2.13.tgz

Replace PATCH with your Spark patch version.
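
Because PATCH appears in both the download and extract commands, it can help to set it once in a shell variable. The following is a minimal sketch, not part of the CDM tooling: the variable names are illustrative, and the patch level 1 is only an example value. It prints the two commands for review rather than running them.

```shell
# Example patch level only; substitute the 3.5.x patch release you want.
SPARK_PATCH="1"
SPARK_DIST="spark-3.5.${SPARK_PATCH}-bin-hadoop3-scala2.13"
SPARK_URL="https://archive.apache.org/dist/spark/spark-3.5.${SPARK_PATCH}/${SPARK_DIST}.tgz"

# Print the commands instead of running them, so you can review them first.
echo "wget ${SPARK_URL}"
echo "tar -xvzf ${SPARK_DIST}.tgz"
```

After you confirm the printed URL, run the two commands from the directory where you want to install Spark.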
For large (several terabytes) migrations, complex migrations, and use of CDM as a long-term data transfer utility, DataStax recommends that you use a Spark cluster or Spark Serverless platform.

If you deploy CDM on a Spark cluster, you must modify your spark-submit commands as follows:
Replace --master "local[*]" with the host and port for your Spark cluster, as in --master "spark://MASTER_HOST:PORT".

Remove parameters related to single-VM installations, such as --driver-memory and --executor-memory.
Download the latest cassandra-data-migrator JAR file from the CDM repository.

Add the cassandra-data-migrator dependency to pom.xml:

<dependency>
  <groupId>datastax.cdm</groupId>
  <artifactId>cassandra-data-migrator</artifactId>
  <version>VERSION</version>
</dependency>

Replace VERSION with your CDM version.

Run mvn install.

If you need to build the JAR for local development or your environment only has Scala version 2.12.x, see the alternative installation instructions in the CDM README.

Configure CDM

Create a cdm.properties file.

If you use a different name, make sure you specify the correct filename in your spark-submit commands.
Configure the properties for your environment.

In the CDM repository, you can find a sample properties file with default values, as well as a fully annotated properties file.

CDM jobs process all uncommented parameters. Any parameters that are commented out are ignored or use default values.

If you want to reuse a properties file created for a previous CDM version, make sure it is compatible with the version you are currently using. Check the CDM release notes for possible breaking changes in interim releases. For example, the 4.x series of CDM isn’t backwards compatible with earlier properties files.
Store your properties file where it can be accessed while running CDM jobs using spark-submit.
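
As a rough illustration of the file format, the following sketch writes a minimal properties file to a scratch directory and lists its active parameters. The connection property names (spark.cdm.connect.*) are assumptions based on the naming scheme used in CDM 4.x and later, so confirm them against the sample properties file in the CDM repository. All hosts, credentials, and table names are placeholders.

```shell
# Write a minimal properties file to a scratch directory.
CDM_CONF_DIR=$(mktemp -d)
CDM_PROPS="$CDM_CONF_DIR/cdm.properties"

cat > "$CDM_PROPS" <<'EOF'
# Origin cluster connection (assumed property names; verify in the sample file)
spark.cdm.connect.origin.host          ORIGIN_HOST
spark.cdm.connect.origin.port          9042
spark.cdm.connect.origin.username      ORIGIN_USERNAME
spark.cdm.connect.origin.password      ORIGIN_PASSWORD

# Target cluster connection (assumed property names; verify in the sample file)
spark.cdm.connect.target.host          TARGET_HOST
spark.cdm.connect.target.port          9042
spark.cdm.connect.target.username      TARGET_USERNAME
spark.cdm.connect.target.password      TARGET_PASSWORD

# Table to migrate or validate
spark.cdm.schema.origin.keyspaceTable  KEYSPACE_NAME.TABLE_NAME

# Commented-out parameters are ignored or fall back to their defaults
#spark.cdm.autocorrect.missing         false
EOF

# List only the active (uncommented, non-blank) lines.
grep -v -e '^#' -e '^$' "$CDM_PROPS"

# Count the active parameters.
ACTIVE_COUNT=$(grep -c '^spark\.cdm\.' "$CDM_PROPS")
echo "active parameters: $ACTIVE_COUNT"
```

Listing the active lines before each run is a quick way to confirm which parameters a job will actually process.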

Run a CDM data migration job

The following spark-submit command migrates one table from the origin to the target cluster, using the configuration in your properties file. The migration job is specified in the --class argument.

Local installation
Spark cluster

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Replace or modify the following, if needed:

--properties-file cdm.properties: If your properties file has a different name, specify the actual name of your properties file.

Depending on where your properties file is stored, you might need to specify the full or relative file path.
KEYSPACE_NAME.TABLE_NAME: Specify the name of the table that you want to migrate and the keyspace that it belongs to.

You can also set spark.cdm.schema.origin.keyspaceTable in your properties file using the same format of KEYSPACE_NAME.TABLE_NAME.
--driver-memory and --executor-memory: For local installations, specify the appropriate memory settings for your environment.
VERSION: Specify the full CDM version that you installed, such as 5.2.1.

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--master "spark://MASTER_HOST:PORT" \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Replace or modify the following, if needed:

--properties-file cdm.properties: If your properties file has a different name, specify the actual name of your properties file.

Depending on where your properties file is stored, you might need to specify the full or relative file path.
KEYSPACE_NAME.TABLE_NAME: Specify the name of the table that you want to migrate and the keyspace that it belongs to.

You can also set spark.cdm.schema.origin.keyspaceTable in your properties file using the same format of KEYSPACE_NAME.TABLE_NAME.
--master: Provide the URL of your Spark cluster.
VERSION: Specify the full CDM version that you installed, such as 5.2.1.

This command generates a log file (logfile_name_TIMESTAMP.txt) instead of logging output to the console.

For additional modifications to this command, see Additional CDM options.
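
If you run the migration command repeatedly, a small wrapper can keep the replaceable values in one place. The following is a hedged sketch for the local installation case. The function name, variable names, and dry-run switch are hypothetical and not part of CDM. It uses > FILE 2>&1, which behaves like &> in Bash and also works in other POSIX shells.

```shell
# Values you would otherwise edit inline in the command.
CDM_VERSION="5.2.1"                        # the CDM version you installed
KEYSPACE_TABLE="KEYSPACE_NAME.TABLE_NAME"  # placeholder; use your keyspace and table
PROPS_FILE="cdm.properties"
LOG_FILE="logfile_name_$(date +%Y%m%d_%H_%M).txt"
DRY_RUN="1"                                # set to 0 to actually launch the job

run_cdm_migrate() {
  # Assemble the same arguments as the local-installation command.
  set -- ./spark-submit --properties-file "$PROPS_FILE" \
    --conf "spark.cdm.schema.origin.keyspaceTable=$KEYSPACE_TABLE" \
    --master "local[*]" --driver-memory 25G --executor-memory 25G \
    --class com.datastax.cdm.job.Migrate "cassandra-data-migrator-$CDM_VERSION.jar"
  if [ "$DRY_RUN" = "1" ]; then
    # Print the arguments for review (shell quoting is not preserved).
    echo "$*"
  else
    # Send both stdout and stderr to the log file.
    "$@" > "$LOG_FILE" 2>&1
  fi
}

CDM_COMMAND=$(run_cdm_migrate)
echo "$CDM_COMMAND"
```

The dry-run output is meant for checking the arguments, not for copying and pasting, because the quotes around values such as local[*] are dropped when printed.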

Run a CDM data validation job

After you migrate data, you can use CDM’s data validation mode to find inconsistencies between the origin and target tables.

Use the following spark-submit command to run a data validation job using the configuration in your properties file. The data validation job is specified in the --class argument.
- Local installation
- Spark cluster

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Replace or modify the following, if needed:
--properties-file cdm.properties: If your properties file has a different name, specify the actual name of your properties file.

Depending on where your properties file is stored, you might need to specify the full or relative file path.

KEYSPACE_NAME.TABLE_NAME: Specify the name of the table that you want to validate and the keyspace that it belongs to.

You can also set spark.cdm.schema.origin.keyspaceTable in your properties file using the same format of KEYSPACE_NAME.TABLE_NAME.

--driver-memory and --executor-memory: For local installations, specify the appropriate memory settings for your environment.

VERSION: Specify the full CDM version that you installed, such as 5.2.1.

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--master "spark://MASTER_HOST:PORT" \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Replace or modify the following, if needed:
--properties-file cdm.properties: If your properties file has a different name, specify the actual name of your properties file.

Depending on where your properties file is stored, you might need to specify the full or relative file path.

KEYSPACE_NAME.TABLE_NAME: Specify the name of the table that you want to validate and the keyspace that it belongs to.

You can also set spark.cdm.schema.origin.keyspaceTable in your properties file using the same format of KEYSPACE_NAME.TABLE_NAME.

--master: Provide the URL of your Spark cluster.

VERSION: Specify the full CDM version that you installed, such as 5.2.1.
Allow the command some time to run, and then open the log file (logfile_name_TIMESTAMP.txt) and look for ERROR entries.

The CDM validation job records differences as ERROR entries in the log file, listed by primary key values. For example:
```
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
```
When validating large datasets or multiple tables, you might want to extract the complete list of missing or mismatched records. There are many ways to do this. For example, you can grep for all ERROR entries in your CDM log files or use the log4j2 example provided in the CDM repository.
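
As one possible approach to the grep option, the following sketch counts mismatched and missing rows and extracts the affected primary keys. It assumes the log line format matches the example entries above, which it copies into a scratch file so that the commands can be tried without a real CDM log. The variable names are illustrative.

```shell
# Scratch log containing the example entries; point CDM_LOG at your real log file instead.
CDM_LOG=$(mktemp)
cat > "$CDM_LOG" <<'EOF'
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
EOF

# Count each kind of discrepancy.
MISMATCH_COUNT=$(grep -c 'ERROR DiffJobSession: Mismatch row found' "$CDM_LOG")
MISSING_COUNT=$(grep -c 'ERROR DiffJobSession: Missing target row found' "$CDM_LOG")

# Extract the bracketed key from every "row found for key" entry, without duplicates.
AFFECTED_KEYS=$(sed -n 's/.*row found for key: \(\[[^]]*]\).*/\1/p' "$CDM_LOG" | sort -u)

echo "mismatched rows: $MISMATCH_COUNT"
echo "missing rows:    $MISSING_COUNT"
echo "$AFFECTED_KEYS"
```

The "Corrected" and "Inserted" entries are skipped on purpose, because they report AutoCorrect actions rather than new discrepancies.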

Run a validation job in AutoCorrect mode

Optionally, you can run CDM validation jobs in AutoCorrect mode, which offers the following functions:

autocorrect.missing: Add any missing records in the target with the value from the origin.

autocorrect.mismatch: Reconcile any mismatched records between the origin and target by replacing the target value with the origin value.

TIMESTAMP has an effect on this function.

If the WRITETIME of the origin record (determined with .writetime.names) is earlier than the WRITETIME of the target record, then the change doesn’t appear in the target cluster. This comparative state can be challenging to troubleshoot if individual columns or cells were modified in the target cluster.

autocorrect.missing.counter: Add any missing counter table records in the target. By default, missing counter table records are not copied unless you explicitly enable this function.

In your cdm.properties file, use the following properties to enable (true) or disable (false) autocorrect functions:

spark.cdm.autocorrect.missing                     false|true
spark.cdm.autocorrect.mismatch                    false|true
spark.cdm.autocorrect.missing.counter             false|true

The CDM validation job never deletes records from either the origin or target. Data validation only inserts or updates data on the target.

For an initial data validation, consider disabling AutoCorrect so that you can generate a list of data discrepancies, investigate those discrepancies, and then decide whether you want to rerun the validation with AutoCorrect enabled.
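
The two-pass approach can be sketched with --conf overrides instead of editing the properties file between runs. This is a minimal illustration: the helper function and its report and fix modes are hypothetical names, while the property names are the autocorrect properties listed above.

```shell
# Emit --conf flags for a validation pass.
# "report" only logs discrepancies; "fix" also repairs the target.
autocorrect_conf() {
  case "$1" in
    report) value=false ;;
    fix)    value=true ;;
    *) echo "usage: autocorrect_conf report|fix" >&2; return 1 ;;
  esac
  printf '%s %s\n' \
    "--conf spark.cdm.autocorrect.missing=$value" \
    "--conf spark.cdm.autocorrect.mismatch=$value"
}

REPORT_FLAGS=$(autocorrect_conf report)
FIX_FLAGS=$(autocorrect_conf fix)
echo "first pass:  $REPORT_FLAGS"
echo "second pass: $FIX_FLAGS"
```

Append the printed flags to the validation spark-submit command for the corresponding pass. The sketch leaves spark.cdm.autocorrect.missing.counter untouched, so counter tables keep the behavior set in your properties file.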

Additional CDM options

You can modify your properties file or append additional --conf arguments to your spark-submit commands to customize your CDM jobs. For example, you can do the following:

Check for large field guardrail violations before migrating.
Use the partition.min and partition.max parameters to migrate or validate specific token ranges.
Use the track-run feature to monitor progress and rerun a failed migration or validation job from the point of failure.

For all options, see the CDM repository. Specifically, see the fully annotated properties file.

Troubleshoot CDM

Java NoSuchMethodError

If you installed CDM as a JAR file, and your Spark and Scala versions aren’t compatible with your installed version of CDM, CDM jobs can throw exceptions such as the following:

Exception in thread "main" java.lang.NoSuchMethodError: 'void scala.runtime.Statics.releaseFence()'

Make sure that your Spark binary is compatible with your CDM version. If you installed an earlier version of CDM, you might need to install an earlier Spark binary.

Rerun a failed or partially completed job

You can use the track-run feature to track the progress of a migration or validation, and then, if necessary, use the run-id to rerun a failed job from the last successful migration or validation point.

For more information, see the CDM repository and the fully annotated properties file.
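
As a hedged sketch of how a rerun might be parameterized, the following helper prints the extra --conf flags for a tracked run. The property names spark.cdm.trackRun and spark.cdm.trackRun.previousRunId are assumptions that are not stated on this page, so confirm them in the fully annotated properties file for your CDM version. The function name and the run ID are hypothetical.

```shell
# Emit the extra flags for a tracked run.
# With a run ID argument, also request a resume from that earlier run.
track_run_conf() {
  if [ -n "${1:-}" ]; then
    printf '%s %s\n' \
      "--conf spark.cdm.trackRun=true" \
      "--conf spark.cdm.trackRun.previousRunId=$1"   # assumed property name
  else
    printf '%s\n' "--conf spark.cdm.trackRun=true"   # assumed property name
  fi
}

PREVIOUS_RUN_ID="1234567890"   # hypothetical run ID taken from an earlier job
FIRST_RUN_FLAGS=$(track_run_conf)
RERUN_FLAGS=$(track_run_conf "$PREVIOUS_RUN_ID")
echo "initial run: $FIRST_RUN_FLAGS"
echo "rerun:       $RERUN_FLAGS"
```

Append the printed flags to the same migration or validation spark-submit command that you ran originally.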