Cassandra Data Migrator
You can use Cassandra Data Migrator (CDM) to migrate and validate tables between the origin and target Cassandra clusters, with optional logging and reconciliation support.
CDM facilitates data transfer by creating multiple jobs that access the Cassandra cluster concurrently, making it an ideal choice for migrating large datasets. It offers extensive configuration options, including logging, reconciliation, and performance tuning.
Install Cassandra Data Migrator
DataStax recommends that you always install the latest version of CDM to get the latest features, dependencies, and bug fixes.
- Install as a container
- Install as a JAR file
Get the latest cassandra-data-migrator image, which includes all dependencies, from DockerHub. The container's assets directory includes all required migration tools: cassandra-data-migrator, dsbulk, and cqlsh.
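For example, the following is a minimal sketch of pulling the image and opening a shell inside a container. The datastax/cassandra-data-migrator image name, the latest tag, and the ability to start bash directly are assumptions; confirm them on DockerHub.
# Pull the image (name and tag assumed; confirm on DockerHub)
docker pull datastax/cassandra-data-migrator:latest
# Start a container and open an interactive shell to reach the tools in the assets directory
docker run --name cdm -it datastax/cassandra-data-migrator:latest bash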
- Install Java 11 or later, which is required to run the Spark binaries.
- Install Apache Spark™ version 3.5.x with Scala 2.13 and Hadoop 3.3 or later.
- Single VM
- Spark cluster
For one-off migrations, you can install the Spark binary on a single VM where you will run the CDM job.
- Get the Spark tarball from the Apache Spark archive:
wget https://archive.apache.org/dist/spark/spark-3.5.PATCH/spark-3.5.PATCH-bin-hadoop3-scala2.13.tgz
Replace PATCH with your Spark patch version.
- Change to the directory where you want to install Spark, and then extract the tarball:
tar -xvzf spark-3.5.PATCH-bin-hadoop3-scala2.13.tgz
Replace PATCH with your Spark patch version.
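Optionally, you can put spark-submit on your PATH and verify the installation. This is a minimal sketch that assumes the default directory name from the tarball; adjust the path for your environment.
# Point SPARK_HOME at the extracted directory (replace PATCH as above)
export SPARK_HOME="$PWD/spark-3.5.PATCH-bin-hadoop3-scala2.13"
export PATH="$SPARK_HOME/bin:$PATH"
# Confirm the Spark and Scala versions match the CDM requirements
spark-submit --version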
For large migrations (several terabytes), complex migrations, and long-term use of CDM as a data transfer utility, DataStax recommends a Spark cluster or Spark Serverless platform.
If you deploy CDM on a Spark cluster, you must modify your spark-submit commands as follows:
- Replace --master "local[*]" with the host and port for your Spark cluster, as in --master "spark://MASTER_HOST:PORT".
- Remove parameters related to single-VM installations, such as --driver-memory and --executor-memory.
- Download the latest cassandra-data-migrator JAR file from the CDM repository.
- Add the cassandra-data-migrator dependency to pom.xml:
<dependency>
  <groupId>datastax.cdm</groupId>
  <artifactId>cassandra-data-migrator</artifactId>
  <version>VERSION</version>
</dependency>
Replace VERSION with your CDM version.
- Run mvn install.
If you need to build the JAR for local development or your environment only has Scala version 2.12.x, see the alternative installation instructions in the CDM README.
Configure CDM
- Create a cdm.properties file. If you use a different name, make sure you specify the correct filename in your spark-submit commands.
- Configure the properties for your environment. In the CDM repository, you can find a sample properties file with default values, as well as a fully annotated properties file. A minimal sketch of a properties file follows these steps.
CDM jobs process all uncommented parameters. Parameters that are commented out are ignored or fall back to their default values.
If you want to reuse a properties file created for a previous CDM version, make sure it is compatible with the version you are currently using. Check the CDM release notes for breaking changes in interim releases. For example, the 4.x series of CDM isn't backwards compatible with earlier properties files.
- Store your properties file where it can be accessed while running CDM jobs using spark-submit.
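The following is a minimal sketch of a cdm.properties file for a plain username/password connection. The property names follow the annotated sample file from recent CDM versions; verify them against the sample file for your installed version.
# Origin (source) cluster connection
spark.cdm.connect.origin.host  ORIGIN_HOST
spark.cdm.connect.origin.port  9042
spark.cdm.connect.origin.username  cassandra
spark.cdm.connect.origin.password  cassandra
# Target (destination) cluster connection
spark.cdm.connect.target.host  TARGET_HOST
spark.cdm.connect.target.port  9042
spark.cdm.connect.target.username  cassandra
spark.cdm.connect.target.password  cassandra
# Table to migrate or validate (can also be set with --conf on the command line)
spark.cdm.schema.origin.keyspaceTable  KEYSPACE_NAME.TABLE_NAME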
Run a CDM data migration job
The following spark-submit command migrates one table from the origin to the target cluster, using the configuration in your properties file. The migration job is specified in the --class argument.
- Local installation
- Spark cluster
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
Replace or modify the following, if needed:
- --properties-file cdm.properties: If your properties file has a different name, specify the actual name of your properties file. Depending on where your properties file is stored, you might need to specify the full or relative file path.
- KEYSPACE_NAME.TABLE_NAME: Specify the name of the table that you want to migrate and the keyspace that it belongs to. You can also set spark.cdm.schema.origin.keyspaceTable in your properties file using the same KEYSPACE_NAME.TABLE_NAME format.
- --driver-memory and --executor-memory: For local installations, specify the appropriate memory settings for your environment.
- VERSION: Specify the full CDM version that you installed, such as 5.2.1.
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--master "spark://MASTER_HOST:PORT" \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
Replace or modify the following, if needed:
- --properties-file cdm.properties: If your properties file has a different name, specify the actual name of your properties file. Depending on where your properties file is stored, you might need to specify the full or relative file path.
- KEYSPACE_NAME.TABLE_NAME: Specify the name of the table that you want to migrate and the keyspace that it belongs to. You can also set spark.cdm.schema.origin.keyspaceTable in your properties file using the same KEYSPACE_NAME.TABLE_NAME format.
- --master: Provide the URL of your Spark cluster.
- VERSION: Specify the full CDM version that you installed, such as 5.2.1.
This command generates a log file (logfile_name_TIMESTAMP.txt) instead of logging output to the console.
For additional modifications to this command, see Additional CDM options.
Run a CDM data validation job
After you migrate data, you can use CDM’s data validation mode to find inconsistencies between the origin and target tables.
- Use the following spark-submit command to run a data validation job using the configuration in your properties file. The data validation job is specified in the --class argument.
- Local installation
- Spark cluster
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
Replace or modify the following, if needed:
- --properties-file cdm.properties: If your properties file has a different name, specify the actual name of your properties file. Depending on where your properties file is stored, you might need to specify the full or relative file path.
- KEYSPACE_NAME.TABLE_NAME: Specify the name of the table that you want to validate and the keyspace that it belongs to. You can also set spark.cdm.schema.origin.keyspaceTable in your properties file using the same KEYSPACE_NAME.TABLE_NAME format.
- --driver-memory and --executor-memory: For local installations, specify the appropriate memory settings for your environment.
- VERSION: Specify the full CDM version that you installed, such as 5.2.1.
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--master "spark://MASTER_HOST:PORT" \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
Replace or modify the following, if needed:
- --properties-file cdm.properties: If your properties file has a different name, specify the actual name of your properties file. Depending on where your properties file is stored, you might need to specify the full or relative file path.
- KEYSPACE_NAME.TABLE_NAME: Specify the name of the table that you want to validate and the keyspace that it belongs to. You can also set spark.cdm.schema.origin.keyspaceTable in your properties file using the same KEYSPACE_NAME.TABLE_NAME format.
- --master: Provide the URL of your Spark cluster.
- VERSION: Specify the full CDM version that you installed, such as 5.2.1.
- Allow the command some time to run, and then open the log file (logfile_name_TIMESTAMP.txt) and look for ERROR entries. The CDM validation job records differences as ERROR entries in the log file, listed by primary key values. For example:
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
When validating large datasets or multiple tables, you might want to extract the complete list of missing or mismatched records. There are many ways to do this. For example, you can grep for all ERROR entries in your CDM log files, as shown in the sketch below, or use the log4j2 example provided in the CDM repository.
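As a minimal sketch, the following commands collect the validation discrepancies into one file; the log filename pattern assumes the commands shown above.
# Collect all DiffJobSession ERROR entries from CDM validation logs
grep 'ERROR DiffJobSession' logfile_name_*.txt > validation_discrepancies.txt
# Count missing versus mismatched rows
grep -c 'Missing target row' validation_discrepancies.txt
grep -c 'Mismatch row found' validation_discrepancies.txt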
Run a validation job in AutoCorrect mode
Optionally, you can run CDM validation jobs in AutoCorrect mode, which offers the following functions:
- autocorrect.missing: Add any missing records in the target with the value from the origin.
- autocorrect.mismatch: Reconcile any mismatched records between the origin and target by replacing the target value with the origin value. TIMESTAMP has an effect on this function: if the WRITETIME of the origin record (determined with .writetime.names) is earlier than the WRITETIME of the target record, then the change doesn't appear in the target cluster. This can be challenging to troubleshoot if individual columns or cells were modified in the target cluster.
- autocorrect.missing.counter: By default, missing records in counter tables are not copied unless you explicitly enable this option.
In your cdm.properties file, use the following properties to enable (true) or disable (false) the autocorrect functions:
spark.cdm.autocorrect.missing false|true
spark.cdm.autocorrect.mismatch false|true
spark.cdm.autocorrect.missing.counter false|true
The CDM validation job never deletes records from either the origin or target. Data validation only inserts or updates data on the target.
For an initial data validation, consider disabling AutoCorrect so that you can generate a list of data discrepancies, investigate those discrepancies, and then decide whether you want to rerun the validation with AutoCorrect enabled.
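For example, assuming your properties file leaves the autocorrect properties disabled, the following sketch reruns the validation job with AutoCorrect enabled by overriding them on the command line (the --conf override mechanism is described under Additional CDM options):
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--conf spark.cdm.autocorrect.missing=true \
--conf spark.cdm.autocorrect.mismatch=true \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt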
Additional CDM options
You can modify your properties file or append additional --conf arguments to your spark-submit commands to customize your CDM jobs. For example, you can do the following:
- Check for large field guardrail violations before migrating.
- Use the partition.min and partition.max parameters to migrate or validate specific token ranges (see the sketch after this list).
- Use the track-run feature to monitor progress and rerun a failed migration or validation job from the point of failure.
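For example, the following sketch migrates a single token range. The fully qualified property names (spark.cdm.filter.cassandra.partition.min and spark.cdm.filter.cassandra.partition.max) are taken from recent annotated sample files and are assumptions here; confirm them for your CDM version.
# Migrate only the tokens in the given range (property names assumed; verify for your version)
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="KEYSPACE_NAME.TABLE_NAME" \
--conf spark.cdm.filter.cassandra.partition.min=-9223372036854775808 \
--conf spark.cdm.filter.cassandra.partition.max=-4611686018427387904 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-VERSION.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt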
For all options, see the CDM repository. Specifically, see the fully annotated properties file.
Troubleshoot CDM
Java NoSuchMethodError
If you installed CDM as a JAR file, and your Spark and Scala versions aren't compatible with your installed version of CDM, CDM jobs can throw exceptions such as the following:
Exception in thread "main" java.lang.NoSuchMethodError: 'void scala.runtime.Statics.releaseFence()'
Make sure that your Spark binary is compatible with your CDM version. If you installed an earlier version of CDM, you might need to install an earlier Spark binary.
Rerun a failed or partially completed job
You can use the track-run feature to track the progress of a migration or validation job, and then, if necessary, use the run-id to rerun a failed job from the last successful migration or validation point.
For more information, see the CDM repository and the fully annotated properties file.
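As a sketch of how this might look on the command line, the following assumes the spark.cdm.trackRun and spark.cdm.trackRun.previousRunId property names from recent CDM versions; verify both against the annotated properties file for your version.
# First attempt: enable run tracking so CDM records progress per token range
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.trackRun=true \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-VERSION.jar
# Rerun from the point of failure by passing the previous run's ID (RUN_ID from the earlier job's output)
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.trackRun=true \
--conf spark.cdm.trackRun.previousRunId=RUN_ID \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-VERSION.jar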