Phase 2: Migrate and validate data
This topic presents the benefits of using two free, open-source data migration tools during your migration project:
Cassandra Data Migrator
DSBulk Migrator
Both tools work with Apache Cassandra®, DataStax Enterprise (DSE), and Astra DB databases. You can use them to migrate data from any Cassandra Origin (Cassandra/DSE/Astra DB) to any Cassandra Target (Cassandra/DSE/Astra DB).
Illustrated view of this phase:
For illustrations of all the migration phases, see the Introduction.
What’s the difference between these data migration tools?
Cassandra Data Migrator is the best choice for migrating large quantities of data, and for cases where detailed verification and reconciliation options are needed.
DSBulk Migrator leverages DataStax Bulk Loader (DSBulk) to perform the actual data migration, and provides new commands specific to migrations. DSBulk Migrator is ideal for migrating small quantities of data, such as databases that have less than 20 GB of data in the table rows.
How do I install and use these data migration tools?
They’re available in the following GitHub repos:
Cassandra Data Migrator repo.
DSBulk Migrator repo.
Refer to the README in each repo for the latest, detailed instructions to install and use these data migrators. The READMEs include prerequisites, download resources, configuration, and command-line usage information.
Summary of features
Here’s a quick summary of the features per data migration tool. See each repo’s README for details.
Cassandra Data Migrator
For large data migrations, including cases where advanced logging is needed, Cassandra Data Migrator is designed to:
Connect to and compare your Target database with Origin
Report differences in a detailed log file
Reconcile any missing records and fix any data inconsistencies in Target, if you enable autocorrect in a config file
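The compare, report, and reconcile steps listed above can be sketched in a few lines. This is a simplified illustration only, not Cassandra Data Migrator's implementation: plain dictionaries stand in for the Origin and Target tables, and the names `DiffReport`, `validate`, `autocorrect_missing`, and `autocorrect_mismatch` are hypothetical, chosen to mirror the two autocorrect settings in the configuration.

```python
from dataclasses import dataclass, field


@dataclass
class DiffReport:
    """Hypothetical report structure; the real tool writes a log file."""
    missing: list = field(default_factory=list)     # keys in Origin but not in Target
    mismatched: list = field(default_factory=list)  # keys whose rows differ
    corrected: list = field(default_factory=list)   # keys rewritten in Target


def validate(origin, target, autocorrect_missing=False, autocorrect_mismatch=False):
    """Compare Target against Origin and optionally repair Target in place."""
    report = DiffReport()
    for key, origin_row in origin.items():
        if key not in target:
            report.missing.append(key)
            if autocorrect_missing:
                target[key] = dict(origin_row)
                report.corrected.append(key)
        elif target[key] != origin_row:
            report.mismatched.append(key)
            if autocorrect_mismatch:
                target[key] = dict(origin_row)
                report.corrected.append(key)
    return report


# Dictionaries keyed by primary key stand in for the two tables.
origin = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
target = {1: {"name": "a"}, 2: {"name": "B"}}

report = validate(origin, target, autocorrect_missing=True)
print(report.missing, report.mismatched, report.corrected)
```

With only the missing-record correction enabled, the absent row is copied to Target while the mismatched row is reported but left unchanged.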
Cassandra Data Migrator runs in a lightweight Apache Spark wrapper that is easy to set up. For example, you can configure a sparkConf.properties file for the environment. There’s a sample sparkConf.properties file in the GitHub repo.
In its settings, you’ll identify values for your Origin and Target databases. A subset example:
spark.origin.isAstra false
spark.origin.host localhost
spark.origin.username some-username
spark.origin.password some-secret-password
spark.origin.read.consistency.level LOCAL_QUORUM
spark.origin.keyspaceTable test.a1
spark.target.isAstra true
spark.target.scb file:///aaa/bbb/secure-connect-enterprise.zip
spark.target.username client-id
spark.target.password client-secret
spark.target.read.consistency.level LOCAL_QUORUM
spark.target.keyspaceTable test.a2
spark.target.autocorrect.missing false
spark.target.autocorrect.mismatch false
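Before launching a migration, it can help to sanity-check the settings file. The following is a minimal sketch, not part of Cassandra Data Migrator: it assumes each setting is a key and a value separated by whitespace, and it assumes (based only on the sample above) that an Astra DB side needs `scb` while a non-Astra side needs `host`. The helper names `parse_properties` and `missing_keys` are hypothetical.

```python
def parse_properties(text):
    """Parse 'key value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        props[parts[0]] = parts[1] if len(parts) > 1 else ""
    return props


def missing_keys(props, side):
    """Return required keys that are absent or empty for 'origin' or 'target'."""
    prefix = f"spark.{side}."
    required = ["username", "password", "keyspaceTable"]
    # Assumption: Astra DB connects through a secure connect bundle (scb),
    # any other cluster through a host.
    is_astra = props.get(prefix + "isAstra", "false").lower() == "true"
    required.append("scb" if is_astra else "host")
    return [prefix + name for name in required if not props.get(prefix + name)]


# Sample settings with the Target secure connect bundle deliberately left out.
sample = """
spark.origin.isAstra false
spark.origin.host localhost
spark.origin.username some-username
spark.origin.password some-secret-password
spark.origin.keyspaceTable test.a1
spark.target.isAstra true
spark.target.username client-id
spark.target.password client-secret
spark.target.keyspaceTable test.a2
"""

props = parse_properties(sample)
print(missing_keys(props, "origin"))
print(missing_keys(props, "target"))
```

The Origin check returns an empty list, and the Target check reports the omitted `spark.target.scb` entry. Refer to the repo's README for the authoritative list of required settings.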
An important prerequisite is that the matching schema already exists on Target. For every table migrated by Cassandra Data Migrator, the tool can use a mapping configuration that links each Origin column to its corresponding Target column.
The validation checks verify that all the data has been migrated successfully. For data written by idempotent writes, these checks are optional, because any errors, timeouts, or other failures during the migration are made visible by Cassandra Data Migrator and by ZDM Proxy.
In the case of data written by non-idempotent writes, it is necessary to reconcile and realign any discrepancies before starting to use Target as the primary cluster.
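To illustrate why non-idempotent writes call for reconciliation, here is a deliberately simplified model, not a description of how Cassandra or ZDM Proxy executes writes. It assumes a retry after a timeout causes the same writes to be applied twice on one cluster; the function `apply_writes` and the operation names are hypothetical.

```python
def apply_writes(writes):
    """Apply a sequence of writes to a single in-memory row."""
    row = {"status": None, "visits": 0}
    for kind, value in writes:
        if kind == "set_status":
            # Idempotent: applying it again leaves the same value.
            row["status"] = value
        elif kind == "add_visits":
            # Non-idempotent: every application changes the result.
            row["visits"] += value
    return row


writes = [("set_status", "active"), ("add_visits", 1)]

origin = apply_writes(writes)
# Assumption: a timeout led to both writes being retried against Target.
target = apply_writes(writes + writes)

print(origin)
print(target)
```

The replayed assignment leaves `status` identical on both sides, while the replayed increment makes `visits` diverge. That kind of discrepancy is what needs to be reconciled before Target becomes the primary cluster.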
For installation and usage details, see the Cassandra Data Migrator repo’s README.
DSBulk Migrator
DSBulk Migrator, which is based on DataStax Bulk Loader (DSBulk), is best for migrating smaller amounts of data, and/or when you can shard data from table rows into more manageable quantities.
DSBulk Migrator provides the following main commands:
migrate-live starts a live data migration using a pre-existing DSBulk installation, or alternatively, the embedded DSBulk version. A "live" migration means that the data migration will start immediately and will be performed by this migrator tool through the desired DSBulk installation.
generate-script generates a migration script that, once executed, will perform the desired data migration, using a pre-existing DSBulk installation. Please note: this command does not actually migrate the data; it only generates the migration script.
generate-ddl reads the schema from Origin and generates CQL files to recreate it in an Astra DB cluster used as Target.
For installation and usage details, see the DSBulk Migrator repo’s README.