Cassandra Data Migrator

Use Cassandra Data Migrator to migrate and validate tables between the origin and target Cassandra clusters, with available logging and reconciliation support.

Use Cassandra Data Migrator

  1. Configure the cdm*.properties file that’s provided in the Cassandra Data Migrator GitHub repo for your environment. The file can have any name; it does not need to be cdm.properties or cdm-detailed.properties. In both versions, the spark-submit job processes only the parameters that aren’t commented out. Commented-out parameters either fall back to their defaults or are ignored.

    See the descriptions and defaults in each file. For the full set of configurable settings, see cdm-detailed.properties.

  2. Place the properties file that you customized in a location that spark-submit can read when the job runs.

  3. Run the job using the spark-submit command:

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
  • The command redirects all job output to a log file named logfile_name_*.txt instead of printing it to the console.

  • Adjust the memory options (--driver-memory and --executor-memory) based on your use case.
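The memory options can be kept in one place by generating the command from a small shell function. The following is a minimal sketch, not part of Cassandra Data Migrator: the function name print_cdm_command and its arguments are hypothetical, the defaults simply mirror the example above, and the function only prints the command so that you can review it before running it.

```shell
# Hypothetical helper, not part of Cassandra Data Migrator: prints the
# spark-submit command for a job class so it can be reviewed before running.
# Usage: print_cdm_command <job-class> <keyspace.table> [driver-mem] [executor-mem]
print_cdm_command() {
  local job_class="$1"
  local keyspace_table="$2"
  local driver_mem="${3:-25G}"    # defaults mirror the example command
  local executor_mem="${4:-25G}"

  # echo joins its arguments with single spaces, producing one command line.
  echo "./spark-submit --properties-file cdm.properties" \
    "--conf spark.cdm.schema.origin.keyspaceTable=\"${keyspace_table}\"" \
    "--master \"local[*]\" --driver-memory ${driver_mem} --executor-memory ${executor_mem}" \
    "--class ${job_class} cassandra-data-migrator-x.y.z.jar"
}
```

For example, print_cdm_command com.datastax.cdm.job.Migrate my_ks.my_table 8G 8G prints the Migrate command with 8G for both memory options. Append the log redirection from the example above when you run the printed command.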

Use Cassandra Data Migrator in validation mode

To run your migration job with Cassandra Data Migrator in data validation mode, use the class option --class com.datastax.cdm.job.DiffData. Example:

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt

The Cassandra Data Migrator validation job reports differences as ERROR entries in the log file. Example:

23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]

To get the list of missing or mismatched records, grep for all ERROR entries in the log files. Differences noted in the log file are listed by primary-key values.
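As a rough sketch of that search, the functions below assume that the log lines follow the format of the sample entries above. The function names are hypothetical; pass them the log files written by your own run, for example logfile_name_*.txt.

```shell
# List every difference-related entry written by the validation job.
# Assumes log lines look like the sample above ("ERROR DiffJobSession: ...").
list_differences() {
  grep -h 'ERROR DiffJobSession' "$@"
}

# Print only the keys of rows that are missing in the target.
list_missing_keys() {
  grep -h 'Missing target row found for key:' "$@" \
    | sed 's/.*found for key: //'
}

# Print only the keys of rows whose values differ between origin and target.
# The sed expression keeps the first bracketed group after "found for key: ".
list_mismatched_keys() {
  grep -h 'Mismatch row found for key:' "$@" \
    | sed 's/.*found for key: \(\[[^]]*\]\).*/\1/'
}
```

For example, list_missing_keys logfile_name_*.txt prints one bracketed primary-key value per missing row.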

You can also run the Cassandra Data Migrator validation job in an AutoCorrect mode, which can:

  • Add any missing records from the origin to target cluster.

  • Update any records in the target cluster that do not match the origin cluster, so that the target data becomes the same as the origin data.

To enable or disable this feature, use one or both of the following settings in your *.properties configuration file.

spark.cdm.autocorrect.missing                     false|true
spark.cdm.autocorrect.mismatch                    false|true

The Cassandra Data Migrator validation job never deletes records from the origin or target clusters. The job only adds or updates data on the target cluster.
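Because only uncommented parameters are processed and AutoCorrect writes to the target cluster, it can help to check which of the two settings are active in a properties file before starting a validation run. The sketch below is illustrative only: the function name show_autocorrect_settings is hypothetical, and it assumes each setting is written on its own line in the layout shown above.

```shell
# Show the AutoCorrect settings that are active (not commented out) in a
# properties file. Lines that start with '#' do not match the pattern.
# Usage: show_autocorrect_settings <properties-file>
show_autocorrect_settings() {
  grep -E '^[[:space:]]*spark\.cdm\.autocorrect\.(missing|mismatch)[[:space:]=]' "$1"
}
```

If the function prints nothing, neither setting is set explicitly in that file, and the job uses the defaults documented in the properties file.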

Migrate or validate specific partition ranges

You can also use Cassandra Data Migrator to migrate or validate specific partition ranges by passing the following additional parameters.

--conf spark.cdm.filter.cassandra.partition.min=<token-range-min>
--conf spark.cdm.filter.cassandra.partition.max=<token-range-max>

This mode is useful when you need to process only a subset of partition ranges.
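One way to produce such ranges is to divide the full token range into contiguous slices and run one job per slice. The following is a minimal sketch under stated assumptions: it assumes the origin cluster uses Murmur3Partitioner, whose tokens span -9223372036854775808 to 9223372036854775807, and the function name print_partition_slices is hypothetical. It only prints the two --conf flags for each slice.

```shell
# Hypothetical helper: split the full Murmur3Partitioner token range into N
# contiguous slices (N >= 2) and print the two --conf flags for each slice.
# Usage: print_partition_slices <N>
print_partition_slices() {
  local slices="$1"
  if [ "$slices" -lt 2 ]; then
    echo "slice count must be 2 or more" >&2
    return 1
  fi
  # -2^63 is written as an expression so no out-of-range literal is parsed.
  local min=$(( -9223372036854775807 - 1 ))
  local max=9223372036854775807
  # Dividing before subtracting keeps every intermediate value within 64 bits.
  local step=$(( max / slices - min / slices ))
  local start=$min
  local end
  local i=1
  while [ "$i" -lt "$slices" ]; do
    end=$(( start + step - 1 ))
    echo "--conf spark.cdm.filter.cassandra.partition.min=${start} --conf spark.cdm.filter.cassandra.partition.max=${end}"
    start=$(( end + 1 ))
    i=$(( i + 1 ))
  done
  # The last slice runs to the maximum token and absorbs any rounding remainder.
  echo "--conf spark.cdm.filter.cassandra.partition.min=${start} --conf spark.cdm.filter.cassandra.partition.max=${max}"
}
```

Each printed line can be added to a spark-submit command like the ones shown earlier so that a run processes only that slice.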

Perform large-field guardrail violation checks

Use Cassandra Data Migrator to identify large fields in a table that may break your cluster guardrails. For example, Astra DB has a 10 MB limit for a single large field. Specify --class com.datastax.cdm.job.GuardrailCheck in the command. Example:

./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
