Cassandra Data Migrator
Use Cassandra Data Migrator to migrate and validate tables between the origin and target Cassandra clusters, with available logging and reconciliation support.
Use Cassandra Data Migrator
-
Configure the cdm*.properties file that’s provided in the Cassandra Data Migrator GitHub repo for your environment. The file can have any name; it does not need to be cdm.properties or cdm-detailed.properties. In both versions, the spark-submit job processes only the parameters that aren’t commented out. Other parameters use their default values or are ignored. See the descriptions and defaults in each file. For more information about the sample properties configuration, see cdm-detailed.properties, which contains the full set of configurable settings.
-
Place the properties file that you customized where it can be accessed while running the job with spark-submit.
-
Run the job using the spark-submit command:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
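Because the job processes only the parameters that aren’t commented out, it can help to list the active settings in your properties file before you submit the job. The following is a minimal sketch and is not part of Cassandra Data Migrator. The sample file contents and the my_keyspace.my_table value are placeholders made up for illustration.

```shell
# Create a small sample properties file (placeholder values) for illustration.
props=$(mktemp)
cat > "$props" <<'EOF'
# Commented-out parameters are not processed by the job.
spark.cdm.schema.origin.keyspaceTable   my_keyspace.my_table
#spark.cdm.autocorrect.missing          true
spark.cdm.autocorrect.mismatch          false
EOF

# Keep only active lines: drop blank lines and lines that start with '#'.
active=$(grep -Ev '^[[:space:]]*(#|$)' "$props")
printf '%s\n' "$active"

rm -f "$props"
```

To check your own file, point the grep command at it instead of the temporary sample.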
Use Cassandra Data Migrator steps in validation mode
To run your migration job with Cassandra Data Migrator in data validation mode, use the class option --class com.datastax.cdm.job.DiffData.
Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
The Cassandra Data Migrator validation job reports differences as ERROR entries in the log file.
Example:
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
To get the list of missing or mismatched records, grep for all ERROR entries in the log file.
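As a sketch of how the log can be filtered, the following writes the sample ERROR entries shown above to a temporary file, selects them with grep, and extracts the keys that were reported as mismatched or missing. The temporary file stands in for your real log file, and the sed pattern assumes the message wording shown in the sample entries.

```shell
# Sample log built from the example entries above; use your real log file instead.
log=$(mktemp)
cat > "$log" <<'EOF'
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
EOF

# All ERROR entries written by the validation job.
errors=$(grep 'ERROR DiffJobSession' "$log")

# Keys of rows reported as mismatched or missing ("... found for key: [..]").
keys=$(printf '%s\n' "$errors" | sed -n 's/.*for key: \(\[[^]]*\]\).*/\1/p')
printf '%s\n' "$keys"

rm -f "$log"
```

The "Corrected" and "Inserted" entries do not contain the phrase "for key:", so only the rows that were originally different are listed.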
You can also run the Cassandra Data Migrator validation job in an AutoCorrect mode, which can:
-
Add any missing records from the origin to target cluster.
-
Update any mismatched records between the origin and target clusters; this action makes the target cluster the same as the origin cluster.
To enable or disable this feature, use one or both of the following settings in your *.properties configuration file:
spark.cdm.autocorrect.missing false|true
spark.cdm.autocorrect.mismatch false|true
The Cassandra Data Migrator validation job never deletes records from the origin or target clusters. The job only adds or updates data on the target cluster.
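As an alternative to editing the properties file, the same settings can probably be supplied on the command line, in the same way the examples on this page pass spark.cdm.schema.origin.keyspaceTable with --conf. The following is a dry-run sketch under that assumption: it assembles and prints the command without executing it. The table name, jar version, and memory sizes are placeholders.

```shell
# Dry run: assemble the validation command with both AutoCorrect settings enabled.
# Nothing is executed here; the command is only printed for review.
set -- ./spark-submit --properties-file cdm.properties \
  --conf spark.cdm.schema.origin.keyspaceTable="my_keyspace.my_table" \
  --conf spark.cdm.autocorrect.missing=true \
  --conf spark.cdm.autocorrect.mismatch=true \
  --master "local[*]" --driver-memory 25G --executor-memory 25G \
  --class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar

cmd="$*"
echo "$cmd"
```

When you run the printed command yourself, keep the quotes around local[*] so that the shell does not expand it as a filename pattern.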
Migrate or validate specific partition ranges
You can also use Cassandra Data Migrator to migrate or validate specific partition ranges by passing the following additional parameters:
--conf spark.cdm.filter.cassandra.partition.min=<token-range-min>
--conf spark.cdm.filter.cassandra.partition.max=<token-range-max>
This mode is useful when you need to process only a subset of partition ranges.
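To illustrate, the following dry-run sketch prints one migration command per token range. It assumes the cluster uses the Murmur3 partitioner, whose tokens span -9223372036854775808 to 9223372036854775807, and that both bounds are treated as inclusive. The two-way split, the table name, and the jar version are placeholders chosen for the example.

```shell
# Hypothetical split of the Murmur3 token range into two halves,
# written as one "min max" pair per line.
ranges='-9223372036854775808 -1
0 9223372036854775807'

# Dry run: print one spark-submit command per range instead of executing it.
commands=$(printf '%s\n' "$ranges" | while read -r min max; do
  echo "./spark-submit --properties-file cdm.properties" \
       "--conf spark.cdm.schema.origin.keyspaceTable=\"my_keyspace.my_table\"" \
       "--conf spark.cdm.filter.cassandra.partition.min=$min" \
       "--conf spark.cdm.filter.cassandra.partition.max=$max" \
       "--master \"local[*]\" --driver-memory 25G --executor-memory 25G" \
       "--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar"
done)
printf '%s\n' "$commands"
```

Running each range as a separate job keeps the runs independent, so a failed range can be repeated without reprocessing the others.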
Perform large-field guardrail violation checks
Use Cassandra Data Migrator to identify large fields in a table that may violate your cluster guardrails.
For example, Astra DB has a 10 MB limit for a single large field.
Specify --class com.datastax.cdm.job.GuardrailCheck on the command line.
Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt