Phase 2: Migrate and validate data

This topic describes two free, open-source data migration tools that you can use during your migration project, and explains when to use each:

  • Cassandra Data Migrator

  • DSBulk Migrator

Both tools support Apache Cassandra®, DataStax Enterprise (DSE), and Astra DB, and provide features that you can adapt to your requirements. They can migrate data from any Cassandra-based Origin (Cassandra, DSE, or Astra DB) to any Cassandra-based Target (Cassandra, DSE, or Astra DB).

Illustrated view of this phase:

Phase 2 diagram shows using tools to migrate data from Origin to Target.

For illustrations of all the migration phases, see the Introduction.

What’s the difference between these data migration tools?

In general:

  • Cassandra Data Migrator is the best choice for migrating large amounts of data, and for migrations that need detailed verification and reconciliation options.

  • DSBulk Migrator leverages DataStax Bulk Loader (DSBulk) to perform the actual data migration, and adds new commands specific to migrations. DSBulk Migrator is ideal for migrating smaller amounts of data, such as databases with less than 20 GB of data in the table rows.

How do I install and use these data migration tools?

They’re available in the following GitHub repos:

  • Cassandra Data Migrator repo.

  • DSBulk Migrator repo.

Refer to the README in each repo for the latest, detailed instructions to install and use these data migrators. The READMEs include prerequisites, download resources, configuration, and command-line usage information.
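
For example, a common way to obtain the tools is to clone each repo and then build or download the released jar, as described in its README. The repo paths below assume the DataStax organization on GitHub and are shown for illustration only:

# Clone the migration tool repos (paths shown are assumptions; follow each README).
git clone https://github.com/datastax/cassandra-data-migrator.git
git clone https://github.com/datastax/dsbulk-migrator.git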

Summary of features

Here’s a quick summary of the features per data migration tool. See each repo’s README for details.

Cassandra Data Migrator

For large data migrations, including cases where advanced logging is needed, Cassandra Data Migrator is designed to:

  • Connect to and compare your Target database with Origin

  • Report differences in a detailed log file

  • Reconcile any missing records and fix any data inconsistencies on Target, if you enable the autocorrect options in a configuration file (shown immediately below)
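
For example, enabling autocorrect in the tool's properties file (described below) comes down to two settings. These property names appear in the sample configuration later in this topic; here they are shown switched on:

spark.target.autocorrect.missing                   true
spark.target.autocorrect.mismatch                  true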

Cassandra Data Migrator runs in a lightweight, easily set up Apache Spark wrapper. You configure it for your environment through a sparkConf.properties file; a sample sparkConf.properties configuration is available in the GitHub repo.

In its settings, you’ll identify values for your Origin and Target databases. A subset example:

spark.origin.isAstra                               false
spark.origin.host                                  localhost
spark.origin.username                              some-username
spark.origin.password                              some-secret-password
spark.origin.read.consistency.level                LOCAL_QUORUM
spark.origin.keyspaceTable                         test.a1

spark.target.isAstra                               true
spark.target.scb                                   file:///aaa/bbb/secure-connect-enterprise.zip
spark.target.username                              client-id
spark.target.password                              client-secret
spark.target.read.consistency.level                LOCAL_QUORUM
spark.target.keyspaceTable                         test.a2
spark.target.autocorrect.missing                   false
spark.target.autocorrect.mismatch                  false
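
With the properties file in place, you typically launch the migration as a Spark job. The following is a minimal sketch assuming a local Spark installation; the jar file name and the fully qualified job class vary by release, so treat them as assumptions and use the exact command from the repo's README:

# Illustrative only: the jar name and job class are assumptions; see the README.
./spark-submit --properties-file sparkConf.properties \
  --master "local[*]" \
  --class datastax.astra.migrate.Migrate \
  cassandra-data-migrator-assembly.jar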

An important prerequisite is that the matching schema must already exist on Target. For every table migrated by Cassandra Data Migrator, the tool can use a mapping configuration that links each Origin column to its corresponding Target column.

The validation checks are a way to verify that all of the data has been migrated successfully. For data written by idempotent writes, these checks are optional, because any errors, timeouts, or other failures during the migration are made visible by Cassandra Data Migrator and by ZDM Proxy.

In the case of data written by non-idempotent writes, it is necessary to reconcile and realign any discrepancies before starting to use Target as the primary cluster.
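
A validation run is typically launched the same way as the migration itself, pointing a dedicated validation job at the same properties file. The sketch below assumes a job class named DiffData; as with the migration command, the class and jar names are assumptions that depend on the release, so confirm them in the README. With the autocorrect properties set to true, the validation job also reconciles the discrepancies it finds:

# Illustrative only: the class and jar names are assumptions; see the README.
./spark-submit --properties-file sparkConf.properties \
  --master "local[*]" \
  --class datastax.astra.migrate.DiffData \
  cassandra-data-migrator-assembly.jar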

For installation and usage details, see the Cassandra Data Migrator repo’s README.

DSBulk Migrator

DSBulk Migrator, which is based on DataStax Bulk Loader (DSBulk), is best for migrating smaller amounts of data, or for cases where you can shard data from table rows into more manageable quantities.

DSBulk Migrator provides the following main commands:

  • migrate-live starts a live data migration using a pre-existing DSBulk installation or, alternatively, the embedded DSBulk version. A "live" migration means that the data migration starts immediately and is performed by the migrator tool through the chosen DSBulk installation.

  • generate-script generates a migration script that, once executed, performs the desired data migration using a pre-existing DSBulk installation. Note that this command does not migrate the data itself; it only generates the migration script.

  • generate-ddl reads the schema from Origin and generates CQL files to recreate it in an Astra DB cluster used as Target.
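
As a rough sketch of how these commands are invoked: DSBulk Migrator ships as an executable jar that takes a command name followed by connection options for Origin and Target. The jar name and option names below are assumptions shown for illustration only; see the repo's README for the exact syntax:

# Illustrative only: the jar name and options are assumptions; see the README.
java -jar dsbulk-migrator.jar migrate-live \
  --export-host origin-contact-point \
  --export-username origin-username \
  --export-password origin-password \
  --import-bundle /path/to/secure-connect-target.zip \
  --import-username client-id \
  --import-password client-secret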

For installation and usage details, see the DSBulk Migrator repo’s README.
