Cassandra Data Migrator

Use Cassandra Data Migrator to migrate and validate tables between origin and target Cassandra clusters, with available logging and reconciliation support.

Cassandra Data Migrator prerequisites

Read the prerequisites below before using the Cassandra Data Migrator.

  • Install or switch to Java 11. The Spark binaries are compiled with this version of Java.

  • Select a single VM to run this job and install Spark 3.5.3 there. No cluster is necessary for most one-time migrations. However, Spark cluster mode is also supported for complex migrations.

  • Optionally, install Maven 3.9.x if you want to build the JAR for local development.

Run the following commands to install Apache Spark:

wget https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3-scala2.13.tgz

tar -xvzf spark-3.5.3-bin-hadoop3-scala2.13.tgz
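Before you run any job, it can help to confirm that the active Java runtime is version 11. The following is a minimal sketch rather than part of the tool: the function name java_major is hypothetical, and it assumes the usual version string formats (11.0.x for current releases, 1.8.0_x for Java 8).

```shell
# Hypothetical helper: extract the major version from a Java version string.
# Handles "11.0.20" (-> 11) and the legacy "1.8.0_292" form (-> 8).
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;
    *)   echo "$1" | cut -d. -f1 ;;
  esac
}

# If a java binary is on the PATH, read its version string and report it.
if command -v java >/dev/null 2>&1; then
  version=$(java -version 2>&1 | awk -F'"' '/version/ {print $2; exit}')
  if [ "$(java_major "$version")" = "11" ]; then
    echo "Java 11 detected"
  else
    echo "Java $version detected; switch to Java 11 before running spark-submit"
  fi
fi
```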

Install Cassandra Data Migrator as a Container

Get the latest image that includes all dependencies from DockerHub.

All migration tools (cassandra-data-migrator, dsbulk, and cqlsh) are available in the /assets/ folder of the container.

Install Cassandra Data Migrator as a JAR file

Download the latest JAR file from the Cassandra Data Migrator GitHub repo.

Version 4.x of Cassandra Data Migrator is not backward-compatible with *.properties files created in previous versions, and package names have changed. If you’re starting a new migration, use the latest released version where possible.

Build Cassandra Data Migrator JAR for local development (optional)

Optionally, you can build the Cassandra Data Migrator JAR for local development. You’ll need Maven 3.9.x.

Example:

cd ~/github
git clone git@github.com:datastax/cassandra-data-migrator.git
cd cassandra-data-migrator
mvn clean package

The fat JAR file, cassandra-data-migrator-x.y.z.jar, should now be present in the target folder.

Use Cassandra Data Migrator

  1. Configure the cdm*.properties file provided in the Cassandra Data Migrator GitHub repo for your environment. The file can have any name; it does not need to be cdm.properties or cdm-detailed.properties. With either file, the spark-submit job processes only the parameters that aren’t commented out. All other parameters use their default values or are ignored.

    See the descriptions and defaults in each file. The cdm-detailed.properties file contains the full set of configurable settings.

  2. Place the properties file that you customized where spark-submit can access it when the job runs.

  3. Run the job using the spark-submit command:

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

  • The command redirects all output to a log file named logfile_name_*.txt instead of printing it to the console.

  • Update the memory options, driver and executor memory, based on your use case.
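As an illustration of step 1, the sketch below writes a minimal properties file with a here-document. Every value is a placeholder, the file name my-migration.properties is hypothetical, and the parameter names are assumed to match the sample cdm.properties file in the repo; check them against cdm-detailed.properties before use.

```shell
# Write a minimal properties file. All values below are placeholders.
# Parameter names are assumed to match the sample cdm.properties in the repo.
PROPS_FILE="${PROPS_FILE:-my-migration.properties}"

cat > "$PROPS_FILE" <<'EOF'
spark.cdm.connect.origin.host          localhost
spark.cdm.connect.origin.port          9042
spark.cdm.connect.origin.username      cassandra
spark.cdm.connect.origin.password      cassandra

spark.cdm.connect.target.host          localhost
spark.cdm.connect.target.port          9042
spark.cdm.connect.target.username      cassandra
spark.cdm.connect.target.password      cassandra

spark.cdm.schema.origin.keyspaceTable  keyspace_name.table_name
EOF

echo "Wrote $PROPS_FILE"
```

Pass the resulting file to the job with --properties-file my-migration.properties. Parameters that you leave out keep their defaults.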

Use Cassandra Data Migrator in validation mode

To run your migration job with Cassandra Data Migrator in data validation mode, use the --class com.datastax.cdm.job.DiffData option. Example:

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
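The migration and validation commands differ only in the --class value, so a small wrapper can avoid retyping the rest. The following is a sketch under stated assumptions: the function name run_cdm, the CDM_* variables, and the DRY_RUN switch are all hypothetical and not part of Cassandra Data Migrator.

```shell
# Hypothetical wrapper: run_cdm <JobClass> <keyspace.table>
# CDM_JAR, CDM_PROPS, and CDM_MEM are placeholder settings for illustration.
CDM_JAR="${CDM_JAR:-cassandra-data-migrator-x.y.z.jar}"
CDM_PROPS="${CDM_PROPS:-cdm.properties}"
CDM_MEM="${CDM_MEM:-25G}"

run_cdm() {
  job="$1"
  table="$2"
  # Build the same argument list used in the examples above.
  set -- ./spark-submit --properties-file "$CDM_PROPS" \
    --conf "spark.cdm.schema.origin.keyspaceTable=$table" \
    --master "local[*]" --driver-memory "$CDM_MEM" --executor-memory "$CDM_MEM" \
    --class "com.datastax.cdm.job.$job" "$CDM_JAR"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it.
    echo "$*"
  else
    # Send stdout and stderr to a timestamped log file.
    "$@" > "logfile_${job}_$(date +%Y%m%d_%H_%M).txt" 2>&1
  fi
}
```

For example, run_cdm Migrate my_ks.my_table followed by run_cdm DiffData my_ks.my_table runs a migration and then validates it. Set DRY_RUN=1 to inspect the command first.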

The Cassandra Data Migrator validation job reports differences as ERROR entries in the log file. Example:

23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]

To get the list of missing or mismatched records, grep for all ERROR entries in the log files. Differences noted in the log file are listed by primary-key values.
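One way to reduce the log to a list of affected keys is sketched below. The function name list_diff_keys is hypothetical, and the patterns assume that log lines follow the format of the example above; adjust them if your log output differs.

```shell
# Hypothetical helper: list_diff_keys <logfile>
# Prints "Mismatch [key]" or "Missing [key]" for each difference reported.
# Assumes the DiffJobSession log line format shown in the example above.
list_diff_keys() {
  grep 'ERROR DiffJobSession' "$1" \
    | grep -E 'Mismatch row found|Missing target row found' \
    | sed -E 's/.*ERROR DiffJobSession: (Mismatch|Missing)[^:]*: (\[[^]]*\]).*/\1 \2/'
}
```

The second grep keeps only the lines that report a difference and skips the "Corrected" and "Inserted" lines that AutoCorrect writes.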

You can also run the Cassandra Data Migrator validation job in an AutoCorrect mode, which can:

  • Add any missing records from the origin to target cluster.

  • Update any mismatched records between the origin and target clusters; this action makes the target cluster the same as the origin cluster.

To enable or disable this feature, use one or both of the following settings in your *.properties configuration file.

spark.cdm.autocorrect.missing                     false|true
spark.cdm.autocorrect.mismatch                    false|true

The Cassandra Data Migrator validation job never deletes records from the origin or target clusters. The job only adds or updates data on the target cluster.
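To switch the AutoCorrect settings on or off without editing the properties file by hand, a small helper such as the following could be used. This is only a sketch: set_prop is a hypothetical function, and it assumes the properties file already exists and uses whitespace between each key and its value.

```shell
# Hypothetical helper: set_prop <file> <key> <value>
# Drops any existing uncommented line for the key, then appends the new value.
set_prop() {
  file="$1"
  key="$2"
  value="$3"
  tmp=$(mktemp)
  # Keep every line whose first field is not the key (comments are kept too).
  awk -v k="$key" '$1 != k' "$file" > "$tmp"
  printf '%s    %s\n' "$key" "$value" >> "$tmp"
  # Copy back in place so the original file permissions are preserved.
  cat "$tmp" > "$file"
  rm -f "$tmp"
}
```

For example, set_prop cdm.properties spark.cdm.autocorrect.missing true followed by set_prop cdm.properties spark.cdm.autocorrect.mismatch true enables both corrections for the next validation run.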

Migrate or validate specific partition ranges

You can also use Cassandra Data Migrator to migrate or validate specific partition ranges by passing the following additional parameters:

--conf spark.cdm.filter.cassandra.partition.min=<token-range-min>
--conf spark.cdm.filter.cassandra.partition.max=<token-range-max>

This mode is particularly useful when you need to process only a subset of partition ranges.
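If you want to process a range in several smaller runs, the bounds for each run can be computed up front. The sketch below is illustrative only: split_range is a hypothetical helper, the bounds in the usage note are placeholder values, and the slices are contiguous and non-overlapping on the assumption that both bounds are treated as inclusive.

```shell
# Hypothetical helper: split_range <min> <max> <slices>
# Prints one pair of --conf flags per slice. The step is computed as
# max/n - min/n rather than (max - min)/n, which keeps intermediate
# values small for wide ranges.
split_range() {
  min="$1"
  max="$2"
  n="$3"
  step=$(( max / n - min / n ))
  start="$min"
  i=1
  while [ "$i" -le "$n" ]; do
    if [ "$i" -lt "$n" ]; then
      end=$(( start + (step - 1) ))
    else
      end="$max"   # the last slice absorbs any remainder
    fi
    echo "--conf spark.cdm.filter.cassandra.partition.min=$start --conf spark.cdm.filter.cassandra.partition.max=$end"
    if [ "$i" -lt "$n" ]; then
      start=$(( end + 1 ))
    fi
    i=$(( i + 1 ))
  done
}
```

For example, split_range -100 100 4 prints four lines of flags, one per slice, that you can add to separate spark-submit commands.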

Perform large-field guardrail violation checks

Use Cassandra Data Migrator to identify large fields in a table that may break your cluster guardrails. For example, Astra DB has a 10 MB limit for a single large field. Specify --class com.datastax.cdm.job.GuardrailCheck on the spark-submit command, and set the size threshold with spark.cdm.feature.guardrail.colSizeInKB. Example:

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

Next steps

For advanced operations, see the documentation in the Cassandra Data Migrator GitHub repository.

