Cassandra Data Migrator
Use Cassandra Data Migrator to migrate and validate tables between origin and target Cassandra clusters, with logging and reconciliation support.
Cassandra Data Migrator prerequisites
Read the following prerequisites before using Cassandra Data Migrator.

- Install or switch to Java 11. The Spark binaries are compiled with this version of Java.
- Select a single VM to run this job, and install Spark 3.5.1 there. No cluster is necessary.
- Optionally, install Maven 3.9.x if you want to build the JAR for local development.
Run the following commands to install Apache Spark:
wget https://archive.apache.org/dist/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3-scala2.13.tgz
tar -xvzf spark-3.5.1-bin-hadoop3-scala2.13.tgz
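After extracting the archive, make the Spark binaries available on your path and confirm the Java version. This is a minimal sketch that assumes you downloaded the archive into your home directory; adjust the paths for your VM.
Example:
export SPARK_HOME=~/spark-3.5.1-bin-hadoop3-scala2.13   # folder created by the tar command above
export PATH="$SPARK_HOME/bin:$PATH"
java -version          # should report Java 11
spark-submit --version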
Install Cassandra Data Migrator as a Container
Get the latest image that includes all dependencies from DockerHub.
All migration tools, cassandra-data-migrator, dsbulk, and cqlsh, are available in the /assets/ folder of the container.
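For example, you can pull the image and list the bundled tools. This is a sketch only; datastax/cassandra-data-migrator is the assumed image name, so confirm the exact repository and tag on DockerHub.
Example:
docker pull datastax/cassandra-data-migrator:latest
docker run --rm -it datastax/cassandra-data-migrator:latest bash
# inside the container:
ls /assets/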
Install Cassandra Data Migrator as a JAR file
Download the latest JAR file from the Cassandra Data Migrator GitHub repo.
Version 4.x of Cassandra Data Migrator is not backward-compatible with properties files created for previous versions.
Build Cassandra Data Migrator JAR for local development (optional)
Optionally, you can build the Cassandra Data Migrator JAR for local development. You’ll need Maven 3.9.x.
Example:
cd ~/github
git clone git@github.com:datastax/cassandra-data-migrator.git
cd cassandra-data-migrator
mvn clean package
The fat JAR file, cassandra-data-migrator-x.y.z.jar, should now be present in the target folder.
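To confirm the build, list the artifact; the exact version number in the file name depends on the release you built.
Example:
ls target/cassandra-data-migrator-*.jar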
Use Cassandra Data Migrator
- Configure the cdm*.properties file that's provided in the Cassandra Data Migrator GitHub repo for your environment. The file can have any name; it does not need to be cdm.properties or cdm-detailed.properties. In both versions, the spark-submit job processes only the parameters that aren't commented out. Other parameter values use defaults or are ignored. See the descriptions and defaults in each file. For more information, see the following:
  - The simplified sample properties configuration, cdm.properties. This file contains only those parameters that are commonly configured.
  - The complete sample properties configuration, cdm-detailed.properties, for the full set of configurable settings.
- Place the properties file that you elected to use and customize where it can be accessed while running the job using spark-submit.
- Run the job using the spark-submit command:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.Migrate cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
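The trailing redirection sends all output to a timestamped log file. To watch the job while it runs, tail that file from another shell; the wildcard below assumes the file name pattern used above.
Example:
tail -f logfile_name_*.txt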
Use Cassandra Data Migrator steps in validation mode
To run your migration job with Cassandra Data Migrator in data validation mode, use the class option --class com.datastax.cdm.job.DiffData.
Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
The Cassandra Data Migrator validation job reports differences as ERROR
entries in the log file.
Example:
23/04/06 08:43:06 ERROR DiffJobSession: Mismatch row found for key: [key3] Mismatch: Target Index: 1 Origin: valueC Target: value999)
23/04/06 08:43:06 ERROR DiffJobSession: Corrected mismatch row in target: [key3]
23/04/06 08:43:06 ERROR DiffJobSession: Missing target row found for key: [key2]
23/04/06 08:43:06 ERROR DiffJobSession: Inserted missing row in target: [key2]
To get the list of missing or mismatched records, grep for all ERROR entries in the log file.
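For example, assuming the log file name pattern used in the commands above:
Example:
grep ERROR logfile_name_*.txt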
You can also run the Cassandra Data Migrator validation job in an AutoCorrect mode, which can:
- Add any missing records from the origin to the target cluster.
- Update any mismatched records between the origin and target clusters; this action makes the target cluster the same as the origin cluster.
To enable or disable this feature, use one or both of the following settings in your *.properties
configuration file.
spark.cdm.autocorrect.missing false|true
spark.cdm.autocorrect.mismatch false|true
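Because Cassandra Data Migrator reads its parameters from the Spark configuration, you can also enable auto-correction for a single run on the command line instead of editing the properties file. A sketch, reusing the validation command from above:
Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.autocorrect.missing=true \
--conf spark.cdm.autocorrect.mismatch=true \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.DiffData cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt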
The Cassandra Data Migrator validation job never deletes records from the target cluster. The job only adds or updates data on the target cluster.
Migrate or validate specific partition ranges
You can also use Cassandra Data Migrator to migrate or validate specific partition ranges. Use a partition file with the name ./<keyspacename>.<tablename>_partitions.csv, placed in the current folder as input.
Use the following format in the CSV file.
Example:
-507900353496146534,-107285462027022883
-506781526266485690,1506166634797362039
2637884402540451982,4638499294009575633
798869613692279889,8699484505161403540
Each line in the CSV represents a partition range (min,max).
Alternatively, you can pass the partition file with a command-line parameter. Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.tokenrange.partitionFile.input="/<path-to-file>/<csv-input-filename>" \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.<Migrate|DiffData> cassandra-data-migrator-x.y.z.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
This mode is especially useful for processing a subset of partition ranges that failed during a previous run.
In the format shown above, the migration and validation jobs autogenerate a file named ./<keyspacename>.<tablename>_partitions.csv, which contains any partition ranges that raised exceptions. You can use this file as input to a subsequent run so that only the failed ranges are processed.
Perform large-field guardrail violation checks
Use Cassandra Data Migrator to identify large fields from a table that may break your cluster guardrails.
For example, Astra DB has a 10MB limit for a single large field.
Specify --class com.datastax.cdm.job.GuardrailCheck on the command line.
Example:
./spark-submit --properties-file cdm.properties \
--conf spark.cdm.schema.origin.keyspaceTable="<keyspacename>.<tablename>" \
--conf spark.cdm.feature.guardrail.colSizeInKB=10000 \
--master "local[*]" --driver-memory 25G --executor-memory 25G \
--class com.datastax.cdm.job.GuardrailCheck cassandra-data-migrator-4.x.x.jar &> logfile_name_$(date +%Y%m%d_%H_%M).txt
Cassandra Data Migrator references
Common connection parameters for Origin and Target

Property | Default | Notes
---|---|---
`spark.cdm.connect.origin.host` | `localhost` | Hostname/IP address of the cluster. May be a comma-separated list, and can follow the `<hostname>:<port>` format.
`spark.cdm.connect.origin.port` | `9042` | Port number to use if not specified on `spark.cdm.connect.origin.host`.
`spark.cdm.connect.origin.scb` | (Not set) | Secure Connect Bundle, used to connect to an Astra DB database. Example: `file:///<path>/<scb-file>.zip`.
`spark.cdm.connect.origin.username` | `cassandra` | Username (or `client_id` value) used to authenticate.
`spark.cdm.connect.origin.password` | `cassandra` | Password (or Client Secret value) used to authenticate.
`spark.cdm.connect.target.host` | `localhost` | Hostname/IP address of the cluster. May be a comma-separated list, and can follow the `<hostname>:<port>` format.
`spark.cdm.connect.target.port` | `9042` | Port number to use if not specified on `spark.cdm.connect.target.host`.
`spark.cdm.connect.target.scb` | (Not set) | Secure Connect Bundle, used to connect to an Astra DB database. Default is not set. Example if set: `file:///<path>/<scb-file>.zip`.
`spark.cdm.connect.target.username` | `cassandra` | Username (or `client_id` value) used to authenticate.
`spark.cdm.connect.target.password` | `cassandra` | Password (or Client Secret value) used to authenticate.
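As an illustration, a minimal connection block in a properties file might look like the following. This is a sketch only: the hosts, credentials, and bundle path are placeholders, and it assumes a self-managed origin cluster and an Astra DB target.
Example:
spark.cdm.connect.origin.host origin-node1,origin-node2
spark.cdm.connect.origin.port 9042
spark.cdm.connect.origin.username cassandra
spark.cdm.connect.origin.password cassandra
spark.cdm.connect.target.scb file:///<path>/<scb-file>.zip
spark.cdm.connect.target.username <client_id>
spark.cdm.connect.target.password <client_secret>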
Origin schema parameters

Property | Default | Notes
---|---|---
`spark.cdm.schema.origin.keyspaceTable` | | Required - the `<keyspacename>.<tablename>` of the table to be migrated.
`spark.cdm.schema.origin.column.ttl.automatic` | `true` | Default is `true`, unless `spark.cdm.schema.origin.column.ttl.names` is specified. When `true`, the TTL is determined automatically from the origin columns that can have a TTL.
`spark.cdm.schema.origin.column.ttl.names` | | Default is empty, meaning the names are determined automatically if `spark.cdm.schema.origin.column.ttl.automatic` is set. Otherwise, specify a comma-separated list of origin column names to use when determining the TTL.
`spark.cdm.schema.origin.column.writetime.automatic` | `true` | Default is `true`, unless `spark.cdm.schema.origin.column.writetime.names` is specified. When `true`, the writetime is determined automatically from the origin columns that have a writetime.
`spark.cdm.schema.origin.column.writetime.names` | | Default is empty, meaning the names are determined automatically if `spark.cdm.schema.origin.column.writetime.automatic` is set. Otherwise, specify a comma-separated list of origin column names to use when determining the writetime.
`spark.cdm.schema.origin.column.names.to.target` | | Default is empty. If column names are changed between the origin and target clusters, then this mapped list provides a mechanism to associate the two. The format is `<origin_column_name>:<target_column_name>`.
For optimization reasons, Cassandra Data Migrator does not migrate TTL and writetime at the field level. Instead, Cassandra Data Migrator finds the field with the highest TTL and the field with the highest writetime within an origin table row, and uses those values on the entire target table row.
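For example, if a column was renamed on the target table, the mapping might look like the following sketch; fname and first_name are hypothetical column names, and the parameter name is taken from the table above.
Example:
spark.cdm.schema.origin.column.names.to.target fname:first_name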
Target schema parameters

Property | Default | Notes
---|---|---
`spark.cdm.schema.target.keyspaceTable` | Equals the value of `spark.cdm.schema.origin.keyspaceTable` | This parameter is commented out. It's the `<keyspacename>.<tablename>` of the table to be migrated into on the target cluster.
Auto-correction parameters
Auto-correction parameters allow Cassandra Data Migrator to correct data differences found between the origin and target clusters when you run the DiffData program.
Typically, these parameters are left disabled for "what if" migration testing, which generates a list of data discrepancies.
The reasons for the discrepancies can then be investigated, and if necessary the parameters below can be enabled.
For information about invoking DiffData in a Cassandra Data Migrator command, see Use Cassandra Data Migrator steps in validation mode.
Property | Default | Notes
---|---|---
`spark.cdm.autocorrect.missing` | `false` | When `true`, data that is missing in the target cluster but found in the origin cluster is added to the target cluster.
`spark.cdm.autocorrect.mismatch` | `false` | When `true`, data that is different between the origin and target clusters is updated on the target cluster to match the origin cluster.
`spark.cdm.autocorrect.missing.counter` | `false` | Commented out. By default, counter tables are not copied when missing, unless explicitly set.
`spark.cdm.tokenrange.partitionFile` | `./<keyspacename>.<tablename>_partitions.csv` | Commented out. This CSV file is used as input, as well as output, when applicable. If the file exists, only the partition ranges in this file are migrated or validated. Similarly, if exceptions occur while migrating or validating, partition ranges with exceptions are logged to this file.
Performance and operations parameters
These performance and operations parameters can affect migration throughput, error handling, and similar concerns.
Property | Default | Notes
---|---|---
`spark.cdm.perfops.numParts` | `10000` | In standard operation, the full token range of -2^63 to 2^63-1 is divided into a number of parts, which are processed in parallel. You should aim for each part to comprise a total of ≈1-10 GB of data to migrate. During initial testing, you may want this to be a small number, such as `1`.
`spark.cdm.perfops.batchSize` | `5` | When writing to the target cluster, this comprises the number of records that are put into an unlogged batch.
`spark.cdm.perfops.readRateLimit` | `20000` | Concurrent number of operations across all parallel threads from the origin cluster. This value may be adjusted up or down, depending on the amount of data and the processing capacity of the origin cluster.
`spark.cdm.perfops.writeRateLimit` | `40000` | Concurrent number of operations across all parallel threads from the target cluster. This value may be adjusted up or down, depending on the amount of data and the processing capacity of the target cluster.
`spark.cdm.perfops.consistency.read` | `LOCAL_QUORUM` | Commented out. Read consistency from the origin cluster, and from the target cluster when records are read for comparison purposes. The consistency parameters may be one of: `ANY`, `ONE`, `TWO`, `THREE`, `QUORUM`, `LOCAL_ONE`, `EACH_QUORUM`, `LOCAL_QUORUM`, `SERIAL`, `LOCAL_SERIAL`, `ALL`.
`spark.cdm.perfops.consistency.write` | `LOCAL_QUORUM` | Commented out. Write consistency to the target cluster. The consistency parameters may be one of: `ANY`, `ONE`, `TWO`, `THREE`, `QUORUM`, `LOCAL_ONE`, `EACH_QUORUM`, `LOCAL_QUORUM`, `SERIAL`, `LOCAL_SERIAL`, `ALL`.
`spark.cdm.perfops.printStatsAfter` | `100000` | Commented out. Number of rows of processing after which a progress log entry is made.
`spark.cdm.perfops.fetchSizeInRows` | `1000` | Commented out. This parameter affects the frequency of reads from the origin cluster and the frequency of flushes to the target cluster.
`spark.cdm.perfops.errorLimit` | `0` | Commented out. Controls how many errors a thread may encounter during Migrate and DiffData operations before failing. It's recommended to set a non-zero value only when not performing a mutation operation, for example when running DiffData without auto-correction.
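As a starting point for tuning, you might lower the rate limits when the origin cluster is small or busy, and raise them once you have confirmed headroom. A sketch using the parameter names from the table above; the values are illustrative only.
Example:
spark.cdm.perfops.readRateLimit 5000
spark.cdm.perfops.writeRateLimit 5000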
Transformation parameters
Parameters to perform schema transformations between the origin and target clusters.
By default, these parameters are commented out.
Property | Default | Notes
---|---|---
`spark.cdm.transform.missing.key.ts.replace.value` | | Timestamp value in milliseconds. Partition and clustering columns cannot have null values. If they are added as part of a schema transformation between the origin and target clusters, it is possible that the origin side is null. In this case, the migration would fail. This parameter allows a crude constant value to be used in its place.
`spark.cdm.transform.custom.writetime` | `0` | Default is 0 (disabled). Timestamp value in microseconds to use as the `WRITETIME` of the target record.
`spark.cdm.transform.custom.writetime.incrementBy` | `0` | Default is `0`.
`spark.cdm.transform.codecs` | | Default is empty. A comma-separated list of additional codecs to enable.
`spark.cdm.transform.codecs.timestamp.string.format` | `yyyyMMddHHmmss` | Configuration for the timestamp-to-string codec. Default is `yyyyMMddHHmmss`.
`spark.cdm.transform.codecs.timestamp.string.zone` | `UTC` | Default is `UTC`.
Cassandra filter parameters
Cassandra filters are applied on the coordinator node.
Depending on the filter, the coordinator node may need to do a lot more work than is normal, notably because Cassandra Data Migrator specifies ALLOW FILTERING.
By default, these parameters are commented out.
Property | Default | Notes
---|---|---
`spark.cdm.filter.cassandra.partition.min` | `-9223372036854775808` | Default is `-2^63`. The lowest (inclusive) partition token to migrate or validate.
`spark.cdm.filter.cassandra.partition.max` | `9223372036854775807` | Default is `2^63-1`. The highest (inclusive) partition token to migrate or validate.
`spark.cdm.filter.cassandra.whereCondition` | | CQL added to the `WHERE` clause of `SELECT` statements from the origin cluster.
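For example, to process only the lower half of the token ring, a sketch using the parameter names from the table above; adjust the bounds for your own split strategy.
Example:
spark.cdm.filter.cassandra.partition.min -9223372036854775808
spark.cdm.filter.cassandra.partition.max 0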
Java filter parameters
Java filters are applied on the client node.
Data must be pulled from the origin cluster and then filtered.
However, this option may have a lower impact on the production cluster than Cassandra filters.
Java filters put a load onto the Cassandra Data Migrator processing node, because they require more data to be sent from Cassandra.
Cassandra filters put a load on the Cassandra nodes because Cassandra Data Migrator specifies ALLOW FILTERING, which could cause the coordinator node to perform a lot more work.
By default, these parameters are commented out.
Property | Default | Notes
---|---|---
`spark.cdm.filter.java.token.percent` | `100` | Between 1 and 100 percent of the tokens in each split that are migrated. This property is used to do a wide and random sampling of the data. The percentage value is applied to each split. Invalid percentages are treated as 100.
`spark.cdm.filter.java.writetime.min` | `0` | The lowest (inclusive) writetime value to be migrated. Using the `spark.cdm.transform.custom.writetime` property overrides this filter.
`spark.cdm.filter.java.writetime.max` | `9223372036854775807` | The highest (inclusive) writetime value to be migrated. The `spark.cdm.transform.custom.writetime` property overrides this filter.
`spark.cdm.filter.java.column.name` | | Filter rows based on matching a configured value. With `spark.cdm.filter.java.column.value`, specify the column name to compare.
`spark.cdm.filter.java.column.value` | | String value to use as the comparison. The whitespace on the ends of the value is not stripped.
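For example, to migrate only rows written on or after a given time, set the minimum writetime in microseconds, using the parameter name from the table above. The value below is a hypothetical cutoff of 2023-01-01 00:00:00 UTC.
Example:
spark.cdm.filter.java.writetime.min 1672531200000000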
Constant column feature parameters
The constant columns feature allows you to add constant columns to the target table.
If used, the spark.cdm.feature.constantColumns.names, spark.cdm.feature.constantColumns.types, and spark.cdm.feature.constantColumns.values lists must all be the same length.
By default, these parameters are commented out.
Property | Default | Notes
---|---|---
`spark.cdm.feature.constantColumns.names` | | A comma-separated list of column names, such as `const1,const2`.
`spark.cdm.feature.constantColumns.types` | | A comma-separated list of column types.
`spark.cdm.feature.constantColumns.values` | | A comma-separated list of hard-coded values. Each value should be provided as you would use it on the `CQLSH` command line. Examples: `'abcd'` for a string, `1234` for an int, and so on.
`spark.cdm.feature.constantColumns.splitRegex` | `,` | Defaults to comma, but can be any regex character that works with `String.split(regex)`; this option is needed because hard-coded values may contain commas.
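A sketch of adding one constant column, using the parameter names from the table above; the column name, type, and value are hypothetical, and the value is written as it would appear on the CQLSH command line.
Example:
spark.cdm.feature.constantColumns.names cluster_name
spark.cdm.feature.constantColumns.types text
spark.cdm.feature.constantColumns.values 'origin_a'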
Explode map feature parameters
The explode map feature allows you to convert an origin table map into multiple target table records.
By default, these parameters are commented out.
Property | Notes
---|---
`spark.cdm.feature.explodeMap.origin.name` | The name of the map column, such as `my_map`. It must be defined on the origin table, and its type must be a map.
`spark.cdm.feature.explodeMap.target.name.key` | The name of the column on the target table that holds the map key, such as `my_map_key`. This key must be part of the target primary key.
`spark.cdm.feature.explodeMap.target.name.value` | The name of the column on the target table that holds the map value, such as `my_map_value`.
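Putting these together, a sketch that explodes an origin map column into key and value columns on the target, using the parameter names from the table above; the column names are hypothetical.
Example:
spark.cdm.feature.explodeMap.origin.name my_map
spark.cdm.feature.explodeMap.target.name.key my_map_key
spark.cdm.feature.explodeMap.target.name.value my_map_value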
Guardrail feature parameter
The guardrail feature manages records that exceed guardrail checks. The guardrail job generates a report; other jobs skip records that exceed the guardrail limit.
By default, these parameters are commented out.
Property | Default | Notes
---|---|---
`spark.cdm.feature.guardrail.colSizeInKB` | `0` | The default of `0` means the guardrail check is not done. When set to a positive value, a column larger than this size in KB is flagged in the report. For example, `10000` flags columns larger than 10 MB.
TLS (SSL) connection parameters
These are TLS (SSL) connection parameters, if configured, for the origin and target clusters. Note that a secure connect bundle (SCB) embeds these details.
By default, these parameters are commented out.
Property | Default | Notes
---|---|---
`spark.cdm.connect.origin.tls.enabled` | `false` | If TLS is used, set to `true`.
`spark.cdm.connect.origin.tls.trustStore.path` | | Path to the Java truststore file.
`spark.cdm.connect.origin.tls.trustStore.password` | | Password needed to open the truststore.
`spark.cdm.connect.origin.tls.trustStore.type` | `JKS` | Type of the truststore.
`spark.cdm.connect.origin.tls.keyStore.path` | | Path to the Java keystore file.
`spark.cdm.connect.origin.tls.keyStore.password` | | Password needed to open the keystore.
`spark.cdm.connect.origin.tls.enabledAlgorithms` | `TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA` | Enabled TLS algorithms.
`spark.cdm.connect.target.tls.enabled` | `false` | If TLS is used, set to `true`.
`spark.cdm.connect.target.tls.trustStore.path` | | Path to the Java truststore file.
`spark.cdm.connect.target.tls.trustStore.password` | | Password needed to open the truststore.
`spark.cdm.connect.target.tls.trustStore.type` | `JKS` | Type of the truststore.
`spark.cdm.connect.target.tls.keyStore.path` | | Path to the Java keystore file.
`spark.cdm.connect.target.tls.keyStore.password` | | Password needed to open the keystore.
`spark.cdm.connect.target.tls.enabledAlgorithms` | `TLS_RSA_WITH_AES_128_CBC_SHA,TLS_RSA_WITH_AES_256_CBC_SHA` | Enabled TLS algorithms.
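For example, enabling TLS to the origin cluster with a truststore might look like the following sketch, using the parameter names from the table above; the path and password are placeholders.
Example:
spark.cdm.connect.origin.tls.enabled true
spark.cdm.connect.origin.tls.trustStore.path /path/to/truststore.jks
spark.cdm.connect.origin.tls.trustStore.password <truststore_password>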