Cassandra Data Migrator
Use Cassandra Data Migrator to migrate and validate tables between the origin and target Cassandra clusters, with available logging and reconciliation support.
Use Cassandra Data Migrator
-
Configure the cdm*.properties file provided in the Cassandra Data Migrator GitHub repo for your environment. The file can have any name; it does not need to be cdm.properties or cdm-detailed.properties. In both versions, the spark-submit job processes only the parameters that aren't commented out. Other parameter values use defaults or are ignored. See the descriptions and defaults in each file. For more information, see the following:
-
The simplified sample properties configuration, cdm.properties. This file contains only those parameters that are commonly configured.
-
The complete sample properties configuration, cdm-detailed.properties, for the full set of configurable settings.
-
Place the properties file that you chose and customized where it can be accessed when you run the job with spark-submit.
-
Run the job using the spark-submit command:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
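The command above sends both standard output and standard error to a timestamped log file by using the bash-specific &> redirection. As a minimal sketch of the same pattern in POSIX shell syntax, the following uses run_job, a hypothetical stand-in for the spark-submit command, so the redirection can be tried without a cluster.

```shell
# Timestamped log name in the same pattern as the command above,
# for example logfile_name_20230406_08_43.txt.
log="logfile_name_$(date +%Y%m%d_%H_%M).txt"

# run_job is a hypothetical stand-in for the spark-submit command;
# it writes one line to stdout and one line to stderr.
run_job() {
  echo "progress message"
  echo "error message" >&2
}

# "> file 2>&1" is the POSIX spelling of bash's "&> file":
# both output streams end up in the same log file.
run_job > "$log" 2>&1
```

Because the job output goes to the file rather than the console, a command such as tail -f "$log" in a second terminal is one way to follow progress.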
Use Cassandra Data Migrator steps in validation mode
To run your migration job with Cassandra Data Migrator in data validation mode, use the class option --class com.datastax.cdm.job.DiffData.
Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
The Cassandra Data Migrator validation job reports differences as ERROR entries in the log file.
Example:
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
To get the list of missing or mismatched records, grep for all ERROR entries in the log file.
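As a rough sketch of how the ERROR entries could be reduced to a list of affected keys, the function below filters a log file and prints one line per reported difference. The name list_diff_keys is hypothetical, and the patterns assume the message wording and the bracketed key format of the sample entries above; adjust them if your log output differs.

```shell
# list_diff_keys is a hypothetical helper, not part of Cassandra Data Migrator.
# It prints "mismatch [key]" or "missing [key]" for each difference that the
# validation job logged, based on the sample message wording shown above.
list_diff_keys() {
  grep 'ERROR DiffJobSession' "$1" |
    sed -n \
      -e 's/.*Mismatch row found for key: \(\[[^]]*]\).*/mismatch \1/p' \
      -e 's/.*Missing target row found for key: \(\[[^]]*]\).*/missing \1/p'
}
```

For example, list_diff_keys logfile_name_20230406_08_43.txt would print the keys from that run's log. Entries that only confirm a correction or an insert are skipped because they repeat a key that was already reported.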
You can also run the Cassandra Data Migrator validation job in an AutoCorrect mode, which can:
-
Add any missing records from the origin to target cluster.
-
Update any mismatched records between the origin and target clusters; this action makes the target cluster the same as the origin cluster.
To enable or disable this feature, use one or both of the following settings in your *.properties configuration file.
spark.cdm.autocorrect.missing false|true
spark.cdm.autocorrect.mismatch false|true
The Cassandra Data Migrator validation job never deletes records from the target cluster. The job only adds or updates data on the target cluster.
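One possible way to switch these settings on or off without editing the file by hand is sketched below. The helper set_prop is hypothetical, and it assumes the properties file uses the whitespace-separated "key value" layout shown above; commented-out lines are left untouched.

```shell
# set_prop is a hypothetical helper, not part of Cassandra Data Migrator.
# Usage: set_prop FILE KEY VALUE
# Assumes whitespace-separated "key value" lines in the properties file.
set_prop() {
  file=$1
  key=$2
  value=$3
  tmp="${file}.tmp.$$"
  # Keep every line whose first field is not KEY (comments included),
  # then append the new active setting.
  awk -v k="$key" '$1 != k' "$file" > "$tmp"
  printf '%s %s\n' "$key" "$value" >> "$tmp"
  mv "$tmp" "$file"
}
```

For example, set_prop cdm.properties spark.cdm.autocorrect.missing true followed by set_prop cdm.properties spark.cdm.autocorrect.mismatch true would enable both AutoCorrect actions for the next validation run.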
Migrate or validate specific partition ranges
You can also use Cassandra Data Migrator to migrate or validate specific partition ranges. Use a partition file named ./<keyspacename>.<tablename>_partitions.csv.
Place the CSV file in the current folder as input, and use the following format.
Example:
-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
2637884402540451982,4638499294009575633
798869613692279889,8699484505161403540
Each line in the CSV represents a partition range (min,max).
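Before passing a partition file to a job, it can help to confirm that every line follows the min,max layout. The following is a minimal sketch of such a check; check_partition_file is a hypothetical helper, and it validates only the format (two integers separated by a comma), not whether min is less than max.

```shell
# check_partition_file is a hypothetical helper, not part of Cassandra Data Migrator.
# It prints (with line numbers) every line that is not "<integer>,<integer>"
# and returns a non-zero status if any such line exists.
check_partition_file() {
  if grep -Evn '^-?[0-9]+,-?[0-9]+$' "$1"; then
    return 1
  fi
}
```

For example, check_partition_file ./<keyspacename>.<tablename>_partitions.csv prints nothing and succeeds when all lines are well formed. Blank lines and lines with Windows line endings are reported as malformed.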
Alternatively, you can pass the partition file with a command-line parameter. Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.tokenrange.partitionFile.input="/<path-to-file>/<csv-input-filename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
This mode is especially useful for processing a subset of partition ranges that may have failed during a previous run.
In the format shown above, the migration and validation jobs autogenerate a file named
Perform large-field guardrail violation checks
Use Cassandra Data Migrator to identify large fields in a table that may break your cluster guardrails.
For example, Astra DB has a 10 MB limit for a single large field.
Specify --class com.datastax.cdm.job.GuardrailCheck on the command line.
Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
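The example passes 10000 to spark.cdm.feature.guardrail.colSizeInKB for a 10 MB limit. As a small sketch of that arithmetic, the snippet below derives the setting from a limit given in MB. It assumes 1 MB = 1000 KB, which is inferred from the pairing in the example rather than stated by the tool; the variable names are illustrative.

```shell
# Convert a field-size limit in MB to the KB value used by the setting.
# Assumes 1 MB = 1000 KB, matching the 10 MB / 10000 pairing in the example.
limit_mb=10
col_size_kb=$((limit_mb * 1000))

# Build the --conf argument for the spark-submit command.
guardrail_conf="spark.cdm.feature.guardrail.colSizeInKB=${col_size_kb}"
echo "--conf ${guardrail_conf}"
```

Changing limit_mb to match your own cluster's guardrail gives the corresponding value to pass on the command line.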