Migrate to DataStax Enterprise (DSE)
Migrations to DataStax Enterprise (DSE) can be data-only migrations or full-scale platform migrations. The tools you use and the complexity of the process depend on the migration type.
-
Data-only migrations: Copy, stream, or bulk load data into DSE databases. For example, write data to a DSE database from a streaming source such as Apache Kafka®.
You can perform data-only migrations as needed or as part of a platform migration.
-
Platform migrations: Configure a new DSE cluster, copy or restore data from your existing cluster, and then reconfigure your applications to use the new cluster using DSE-compatible APIs and libraries. For this type of migration, you can use the Zero Downtime Migration (ZDM) tools (recommended) or an in-place upgrade process.
|
You can also use a platform migration to support significant upgrades, such as multiple major DSE versions. |
Cluster compatibility
Technically, you can migrate any data into DSE, but some sources are more compatible with DSE than others.
Because DSE is based on Apache Cassandra®, it expects data to be in a format that is compatible with Cassandra table schemas.
Migrations from open-source Apache Cassandra and other Cassandra-based NoSQL databases are the most compatible sources because they share the same foundational architecture as DSE. However, you must carefully evaluate the compatibility of your source database with DSE before migrating data. For example, when migrating from open-source Cassandra, make sure your version is compatible with DSE. If it isn’t, you must determine the differences between your version and DSE to ensure that your migration succeeds. Before migrating to DSE, you might need to modify your schema, disable incompatible features, or upgrade your platform to a compatible version.
Migrations from RDBMS and other non-Cassandra sources are more complex because of differences in data models, schemas, and query patterns. You might need to redesign your data models, prepare the data before migrating it, or manipulate the data during the migration process.
If your source data is schemaless or semi-structured, you can use techniques like super shredding to flatten, normalize, and map schemaless or semi-structured JSON/CSV data into a Cassandra-compatible fixed schema. Then, you can load the data into DSE with data migration tools. However, super shredding can be complex and cumbersome, depending on the structure (or lack thereof) of the source data.
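As an illustration of the flattening step only, the following minimal Python sketch walks a nested JSON document and emits one fixed-shape row per leaf value. It is a simplified reading of the idea, not a DataStax tool or a complete shredding implementation. The row shape `(doc_id, path, value)` and the function name `shred` are assumptions made for this example; such rows could map to a hypothetical table keyed on `(doc_id, path)`.

```python
import json
from typing import Any, Iterator


def shred(doc_id: str, value: Any, path: str = "") -> Iterator[tuple[str, str, str]]:
    """Yield one (doc_id, path, value) row for every leaf in a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from shred(doc_id, child, f"{path}.{key}" if path else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from shred(doc_id, child, f"{path}[{index}]")
    else:
        # Leaves are stored as JSON text so a single column can hold any scalar type.
        yield (doc_id, path, json.dumps(value))


# Hypothetical semi-structured source document.
document = {"name": "Ada", "tags": ["a", "b"], "address": {"city": "London", "zip": None}}
rows = list(shred("user-1", document))
for row in rows:
    print(row)
```

Note that empty objects and empty arrays produce no rows in this sketch, which is one of the details a real mapping would need to decide on.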
Pre-migration data modeling
Before moving data to DSE, consider how your client application will query the tables. DataStax recommends pre-migration data modeling, particularly in scenarios where data types or query patterns might change. For example:
-
Changing the DDL, either before migration or with an ETL tool.
-
Moving from a relational database management system (RDBMS) to a NoSQL database like DSE. Direct imports from your RDBMS will fail.
-
Upgrading or changing the APIs or libraries that your applications use to access your data.
Data model changes aren’t always required, but they can be necessary for your migration to be successful and avoid data loss, corruption, or performance degradation due to inefficient or unsupported data types and query patterns.
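To make the RDBMS-to-NoSQL case more concrete, the following sketch reshapes two normalized relational tables into rows organized around a single query ("all orders for a customer, newest first"). The table names, column names, and the choice of partition key and clustering columns are hypothetical; the point is only that the join happens at migration time rather than at query time.

```python
from collections import defaultdict

# Hypothetical relational source rows from two normalized tables.
customers = [
    {"customer_id": 1, "email": "ada@example.com"},
    {"customer_id": 2, "email": "lin@example.com"},
]
orders = [
    {"order_id": 10, "customer_id": 1, "placed_at": "2024-01-05", "total": 30.0},
    {"order_id": 11, "customer_id": 1, "placed_at": "2024-02-01", "total": 12.5},
    {"order_id": 12, "customer_id": 2, "placed_at": "2024-01-20", "total": 99.0},
]


def orders_by_customer(customers, orders):
    """Join at migration time so each target row answers the query without a JOIN."""
    email_by_id = {c["customer_id"]: c["email"] for c in customers}
    partitions = defaultdict(list)
    for order in orders:
        partitions[order["customer_id"]].append({
            "customer_id": order["customer_id"],         # assumed partition key
            "placed_at": order["placed_at"],             # assumed clustering column
            "order_id": order["order_id"],               # clustering column (tie-breaker)
            "email": email_by_id[order["customer_id"]],  # denormalized copy
            "total": order["total"],
        })
    for rows in partitions.values():
        # Newest first, mirroring a descending clustering order.
        rows.sort(key=lambda r: (r["placed_at"], r["order_id"]), reverse=True)
    return dict(partitions)


table = orders_by_customer(customers, orders)
print(table[1][0]["order_id"])
```

The duplicated `email` value is deliberate in this sketch: it trades storage for the ability to serve the query from one partition.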
Data migration tools
There are many solutions you can use to migrate data from other databases and data sources. Most of these tools require that the source data is in a DSE-compatible format, such as data from another NoSQL database, CSV files, or JSON files. If your source data is in an incompatible format, you can use an ETL tool to manipulate the data before writing it to your DSE cluster.
You can use data migration tools for bulk writes, data-only migrations, or as part of a full-scale platform migration. Some examples include the following:
-
Cassandra Data Migrator (CDM): Migrate and validate tables between origin Apache Cassandra® clusters and target DSE clusters, with available logging and reconciliation support.
You can use CDM alone or in conjunction with the ZDM tools.
DataStax recommends CDM for large-scale and sensitive migrations because of its validation and reconciliation features.
-
DataStax Bulk Loader (DSBulk): Extract and load CSV and JSON files containing Cassandra table data. You can use DSBulk to move data between compatible NoSQL databases, including DSE, as long as the source and target schemas are compatible.
-
CQL shell COPY: Use these commands to read and write data in CSV format. The COPY commands mirror what the PostgreSQL RDBMS uses for file import and export. When moving from an RDBMS, typically the RDBMS has unload utilities for writing table data to a file system.
-
sstableloader: Bulk load data from SSTable snapshots of other Cassandra-based clusters.
Streaming connectors
These tools are designed for data streaming use cases, where data is continuously ingested from a source into DSE databases:
-
The DataStax Apache Pulsar connector is open-source software installed in the Pulsar IO framework. The connector synchronizes records from a Pulsar topic with table rows in your DSE database.
-
The DataStax Apache Kafka Connector synchronizes records from a Kafka topic with rows in one or more DSE database tables.
ETL tools
If you need to change the source data before writing it to your DSE cluster, you can use an extract, transform, load (ETL) solution that is compatible with DSE, such as tools from Talend, Informatica, and StreamSets.
ETL tools provide transformation routines for manipulating source data before loading it into a DSE cluster, as well as other helpful features, such as visual interfaces and scheduling engines. This is useful when your source schema doesn’t match your target DSE schema, when you need to change data types or formats, or when your source database is wholly incompatible with DSE, such as an RDBMS.
If you are performing a platform migration where you plan to stop using the source database completely, be aware that ETL tools can require some application downtime to run the ETL pipeline and move the data to the new cluster. Additionally, ETL pipelines aren’t compatible with the ZDM tools if the clusters have different schemas because the ZDM tools must send the same CQL read/write statements to both clusters.
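The extract, transform, and load stages can be sketched in a few lines of Python. This is a toy pipeline under stated assumptions, not a replacement for an ETL product: an in-memory SQLite database stands in for the source RDBMS, the `users` table and its columns are hypothetical, and the output is CSV text that a CSV-based loading tool could consume.

```python
import csv
import io
import sqlite3
from datetime import datetime, timezone


def extract(conn):
    """Extract: read rows from the relational source."""
    return conn.execute("SELECT id, full_name, created FROM users ORDER BY id").fetchall()


def transform(row):
    """Transform: rename columns, trim text, and convert a Unix epoch to ISO 8601."""
    user_id, full_name, created = row
    created_at = datetime.fromtimestamp(created, tz=timezone.utc).isoformat()
    return {"user_id": user_id, "name": full_name.strip(), "created_at": created_at}


def load(records, out):
    """Load: write CSV with a header row for a CSV-based loading tool."""
    writer = csv.DictWriter(out, fieldnames=["user_id", "name", "created_at"])
    writer.writeheader()
    writer.writerows(records)


# In-memory SQLite stands in for the source RDBMS in this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, full_name TEXT, created INTEGER)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, " Ada Lovelace ", 0), (2, "Lin Chen", 86400)],
)

buffer = io.StringIO()
load([transform(r) for r in extract(conn)], buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping the three stages as separate functions makes it easier to test the transformation rules on their own before running them against production data.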
Platform migration options
For platform migrations where the source and target schemas are compatible, you can use the ZDM tools or perform an in-place upgrade.
If your schemas aren’t compatible, you must use an ETL tool to manipulate the data before writing it to your new DSE cluster. This type of migration requires more manual oversight and planning because there will be some downtime while you stop writes, run the ETL pipeline, and switch your applications to the new cluster.
Generally, DataStax recommends migrating to the latest version of DSE, unless you have a specific functional requirement or a compatibility issue that requires migrating to an earlier version.
Zero Downtime Migration (ZDM)
|
DataStax strongly recommends using the ZDM tools for data migrations whenever possible. Typically, the ZDM tools are used for full-scale platform migrations, but you can also use them to reduce risk during sensitive upgrades, such as major version upgrades with breaking changes. For supported migration paths, see Compatibility requirements for ZDM Proxy. |
The ZDM tools provide the safest upgrade approach with blue-green deployment capabilities that eliminate time pressure and ensure optimal availability and operational safety.
Here’s how the ZDM process works:
-
Set up your new, empty DSE cluster separate from your existing cluster.
The ZDM tools minimize risk and complexity by isolating the source and target clusters. You can configure your new DSE cluster as needed from the start, including settings that you wouldn’t be able to change during an in-place upgrade. There is no need for progressive reconfiguration and node restarts because the clusters are independent.
Incompatible settings don’t disrupt the migration because your existing cluster remains active and unchanged while you configure the new cluster and copy your data.
-
ZDM Proxy orchestrates live reads and writes while you use a data migration tool to replicate your existing data on the new cluster.
ZDM Proxy uses one cluster as the source of truth for reads while sending writes to both clusters. The dual writes feature ensures that your new cluster doesn’t miss ongoing writes during the migration process, and your existing cluster remains current with all mutations.
As long as ZDM Proxy is active and connected, both clusters remain synchronized.
-
Validate the data on the new cluster and simulate production workloads before permanently switching your application connections and traffic to the new cluster.
It is crucial that you fully validate and test your new cluster before switching your traffic over to it. Data validation tools can identify inconsistencies, such as missing or mismatched data, but you still need to have a plan to resolve them. For example, you might need to modify your applications to use a different data type or perform additional post-migration writes to populate lost data.
Because your original cluster remains running and synchronized throughout the migration process, you can seamlessly stop the migration at any point up until the last phase when you route writes exclusively to the new cluster.
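The dual-write and validation ideas described in the steps above can be modeled with a small toy in Python. This is a conceptual sketch only: it does not reflect how ZDM Proxy is implemented, the class and function names are hypothetical, and plain dictionaries stand in for the two clusters.

```python
class DualWriteProxy:
    """Toy model: writes go to both clusters, reads come from one primary."""

    def __init__(self, origin: dict, target: dict):
        self.origin = origin
        self.target = target
        self.primary = origin  # source of truth for reads

    def write(self, key, value):
        # Every live write is applied to both clusters.
        self.origin[key] = value
        self.target[key] = value

    def read(self, key):
        return self.primary.get(key)

    def switch_reads_to_target(self):
        self.primary = self.target


def diff(origin: dict, target: dict):
    """Return keys missing from the target and keys whose values differ."""
    missing = sorted(k for k in origin if k not in target)
    mismatched = sorted(k for k in origin if k in target and origin[k] != target[k])
    return missing, mismatched


origin = {"a": 1, "b": 2}  # existing data, written before the proxy was deployed
target = {}                # new, empty cluster
proxy = DualWriteProxy(origin, target)

proxy.write("c", 3)                  # a live write during the migration reaches both
before_copy = diff(origin, target)   # historical rows are still missing on the target

# Stand-in for the bulk data copy: fill in rows the target does not have yet.
target.update({k: v for k, v in origin.items() if k not in target})
after_copy = diff(origin, target)

proxy.switch_reads_to_target()
print(before_copy, after_copy, proxy.read("a"))
```

In this model, dual writes alone never backfill historical rows, which is why the separate data copy and the validation pass are both needed before reads move to the new cluster.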
To learn more and get started on your zero downtime migration to DSE, see Phases of the Zero Downtime Migration (ZDM) process.
In-place upgrades
|
DataStax recommends that you use this option only if you cannot use the ZDM tools. |
In-place upgrades replace the database platform on your current cluster with DSE without moving your data.
In-place upgrades have the following limitations:
-
Available for specific migration paths only.
-
Require downtime.
-
Involve systematic manual reconfiguration of the cluster before, during, and after the migration.
-
Have limited rollback options, and the complexity of the upgrade process could lead to data loss or corruption.
Because an in-place upgrade manipulates a single cluster, a rollback requires that you revert to the previous platform version, and then restore a backup of your data. Any data written to the cluster after the backup was taken is lost.
In contrast, the ZDM tools isolate your source and target clusters, allowing you to cleanly discard the target cluster if something goes wrong during the migration process.
For in-place upgrade instructions, see the following:
Migrate your applications
Platform migrations require code changes.
At minimum, you must update your application’s connection strings to point to your new DSE cluster after fully migrating your data to DSE.
Aside from the database connection, your code might not require any other changes if you already use a compatible Cassandra driver and CQL statements.
Additional changes depend on the differences between your source database and DSE, such as changes to query statements, data types, APIs or libraries, and enabling DSE-specific features.
If you are using ZDM Proxy for your migration, you will configure your new cluster, copy data to the new cluster, and then validate the data on the new cluster before changing your application code to connect exclusively to the new cluster. These steps are all part of the ZDM process.
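One way to keep the connection change small is to read cluster addresses from configuration instead of hard-coding them. The sketch below assumes two hypothetical environment variables, `CASSANDRA_CONTACT_POINTS` and `CASSANDRA_PORT`, and only parses them; passing the result to a driver is left to the application. The fallback port 9042 is the default CQL native transport port.

```python
import os


def contact_points(env=os.environ):
    """Read cluster hosts from the hypothetical CASSANDRA_CONTACT_POINTS variable."""
    raw = env.get("CASSANDRA_CONTACT_POINTS", "127.0.0.1")
    return [host.strip() for host in raw.split(",") if host.strip()]


def port(env=os.environ):
    """Read the CQL port from the hypothetical CASSANDRA_PORT variable."""
    return int(env.get("CASSANDRA_PORT", "9042"))


# The same code resolves different clusters depending on configuration.
old = contact_points({"CASSANDRA_CONTACT_POINTS": "old-1.example.com, old-2.example.com"})
new = contact_points({"CASSANDRA_CONTACT_POINTS": "dse-1.example.com"})
print(old, new, port({}))
```

With this arrangement, pointing the application at a proxy during the migration and at the new cluster afterward becomes a configuration change rather than a code change.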
Migrate to DSE Advanced Workloads
After migrating to DSE, you might need to take additional steps to prepare some of your data for use with DSE Advanced Workloads, like DSE Analytics and DSE Graph.
-
Migrate graph data to DSE Graph: See Migrating to DSE Graph from a relational database and Migrating to DSE Graph from Apache Cassandra.
-
Migrate data with DSE Analytics: DSE Analytics can use Apache Spark™ to connect to a wide variety of data sources and save the data to DSE by using either the older RDD or newer DataFrame method.
-
Load DSE Search nodes: See DSE Search initial data migration.
Migrate clusters to Mission Control
You can migrate your clusters to Mission Control by adding a new cluster or datacenter to your Mission Control deployment.
Get support for your migration
If you need help planning or executing your migration to DSE, contact your DataStax account representative or IBM Support.
If you have a subscription to IBM Elite Support for Apache Cassandra, contact IBM Elite Support or your account representative to see if your plan includes migration assistance.
For any observed problems with ZDM Proxy or the other open-source ZDM and data migration tools, you can report an issue in their respective GitHub repositories:
-
ZDM Proxy Automation repository (includes ZDM Proxy Automation and ZDM Utility)