Cassandra Data Migrator
Use Cassandra Data Migrator to migrate and validate tables between origin and target Cassandra clusters, with available logging and reconciliation support.
Cassandra Data Migrator prerequisites
Read the prerequisites below before using the Cassandra Data Migrator.
- Install or switch to Java 11. The Spark binaries are compiled with this version of Java.
- Select a single VM to run this job and install Spark 3.5.3 there. No cluster is necessary for most one-time migrations. However, Spark cluster mode is also supported for complex migrations.
- Optionally, install Maven 3.9.x if you want to build the JAR for local development.
Run the following commands to install Apache Spark:
wget https://archive.apache.org/dist/spark/spark-3.5.3/spark-3.5.3-bin-hadoop3-scala2.13.tgz
tar -xvzf spark-3.5.3-bin-hadoop3-scala2.13.tgz
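The spark-submit command used in the following steps is located in the bin directory of the extracted folder. As a quick sanity check of the prerequisites, you can confirm the Java and Spark versions. This is a minimal sketch; adjust paths if you extracted the archive elsewhere.
java -version
cd spark-3.5.3-bin-hadoop3-scala2.13
./bin/spark-submit --version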
Install Cassandra Data Migrator as a Container
Get the latest image that includes all dependencies from DockerHub.
All migration tools, cassandra-data-migrator, dsbulk, and cqlsh, are available in the /assets/ folder of the container.
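For example, a typical pull-and-run sequence might look like the following. This assumes the image is published as datastax/cassandra-data-migrator on DockerHub; confirm the image name on DockerHub before pulling.
docker pull datastax/cassandra-data-migrator:latest
docker run --rm -it datastax/cassandra-data-migrator:latest bash
ls /assets/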
Install Cassandra Data Migrator as a JAR file
Download the latest JAR file from the Cassandra Data Migrator GitHub repo.
Version 4.x of Cassandra Data Migrator is not backward-compatible with *.properties files created in previous versions, and package names have changed.
Build Cassandra Data Migrator JAR for local development (optional)
Optionally, you can build the Cassandra Data Migrator JAR for local development. You’ll need Maven 3.9.x.
Example:
cd ~/github
git clone git@github.com:datastax/cassandra-data-migrator.git
cd cassandra-data-migrator
mvn clean package
The fat JAR file, cassandra-data-migrator-x.y.z.jar, should now be present in the target folder.
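You can confirm the build output before running a job. For example:
ls target/cassandra-data-migrator-*.jar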
Use Cassandra Data Migrator
- Configure the cdm*.properties file provided in the Cassandra Data Migrator GitHub repo for your environment. The file can have any name; it does not need to be cdm.properties or cdm-detailed.properties. In both versions, the spark-submit job processes only the parameters that aren't commented out. Other parameters use defaults or are ignored. See the descriptions and defaults in each file. The cdm-detailed.properties file contains the full set of configurable settings. For an illustrative connection snippet, see the sketch after these steps.
- Place the properties file that you chose and customized where it can be accessed while running the job with spark-submit.
- Run the job using the spark-submit command:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
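As referenced in the configuration step above, the following is an illustrative subset of a properties file for a basic migration. The property names shown here are assumptions based on the sample cdm.properties file; confirm the exact names and defaults in cdm-detailed.properties for your Cassandra Data Migrator version.
spark.cdm.connect.origin.host origin_host_or_contact_point
spark.cdm.connect.origin.port 9042
spark.cdm.connect.origin.username origin_username
spark.cdm.connect.origin.password origin_password
spark.cdm.connect.target.host target_host_or_contact_point
spark.cdm.connect.target.port 9042
spark.cdm.connect.target.username target_username
spark.cdm.connect.target.password target_password
spark.cdm.schema.origin.keyspaceTable keyspace_name.table_name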
Use Cassandra Data Migrator in validation mode
To run your migration job with Cassandra Data Migrator in data validation mode, use the class option --class com.datastax.cdm.job.DiffData.
Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
The Cassandra Data Migrator validation job reports differences as ERROR entries in the log file.
Example:
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
To get the list of missing or mismatched records, grep for all ERROR entries in the log file.
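For example, a quick way to pull those records out of a run log, using the log file name pattern from the commands above:
grep "ERROR DiffJobSession" logfile_name_*.txt
grep "Missing target row" logfile_name_*.txt
grep "Mismatch row" logfile_name_*.txt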
You can also run the Cassandra Data Migrator validation job in an AutoCorrect mode, which can:
- Add any missing records from the origin to the target cluster.
- Update any mismatched records between the origin and target clusters; this action makes the target cluster the same as the origin cluster.
To enable or disable this feature, use one or both of the following settings in your *.properties configuration file.
spark.cdm.autocorrect.missing false|true
spark.cdm.autocorrect.mismatch false|true
The Cassandra Data Migrator validation job never deletes records from the origin or target clusters. The job only adds or updates data on the target cluster.
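Because these are Spark configuration properties, you can also pass them on the command line instead of editing the properties file; values supplied with --conf take precedence over the properties file. The following is a sketch based on the validation command shown earlier:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.autocorrect.missing=true \
--conf spark.cdm.autocorrect.mismatch=true \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt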
Migrate or validate specific partition ranges
You can also use Cassandra Data Migrator to migrate or validate specific partition ranges by passing the following additional parameters.
--conf spark.cdm.filter.cassandra.partition.min=<token-range-min>
--conf spark.cdm.filter.cassandra.partition.max=<token-range-max>
This mode is especially useful for processing a subset of partition ranges.
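For example, combined with the migration command shown earlier, a run restricted to one token range might look like this sketch; substitute your own token values:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.filter.cassandra.partition.min=<token-range-min> \
--conf spark.cdm.filter.cassandra.partition.max=<token-range-max> \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt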
Perform large-field guardrail violation checks
Use Cassandra Data Migrator to identify large fields from a table that may break your cluster guardrails.
For example, Astra DB has a 10MB limit for a single large field.
Specify --class com.datastax.cdm.job.GuardrailCheck on the command line.
Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
Next steps
For advanced operations, see the documentation in the Cassandra Data Migrator GitHub repository.