Rebuild a failed datacenter

Use this procedure to rebuild a failed datacenter after an outage that lasts longer than the hint window. The procedure removes all of the failed datacenter's replicas from the cluster, including all resources and data. It then redeploys the datacenter and streams data from at least one surviving datacenter while the cluster continues normal operations.

For less catastrophic scenarios where the datacenter is still operational, see Rebuild a datacenter’s replicas.

This procedure permanently deletes all resources and data in the failed datacenter. Run this procedure only when you cannot recover the failed datacenter, such as when data is corrupted, lost, or out-of-date. Do not run this procedure in production without fully understanding these impacts.

Prerequisites

This procedure requires at least one surviving datacenter in a healthy, operational state. If all datacenters have failed and no datacenter is operational, you cannot use this rebuild procedure. Instead, you must restore from backup.

Rebuild procedure

To rebuild the failed datacenter, complete the steps in the following sections. The steps use an example cluster with two datacenters: east is the surviving datacenter, and west is the failed datacenter that needs to be rebuilt. Each datacenter has three racks and nine nodes (three nodes per rack).

The example uses two user keyspaces (inventory and analytics) with the following original replication strategy:

ALTER KEYSPACE inventory WITH replication = {'class': 'NetworkTopologyStrategy', 'west': '3', 'east': '3'};
ALTER KEYSPACE analytics WITH replication = {'class': 'NetworkTopologyStrategy', 'west': '3', 'east': '3'};

This configuration provides triple replication in each datacenter.

Remove replication to the failed datacenter

Reconfigure all user keyspaces to replicate only to the surviving datacenter:

  1. Connect to a node in the surviving datacenter using cqlsh.

  2. Alter all user-defined keyspaces to remove the failed datacenter from replication.

    This example shows two user keyspaces: inventory and analytics.

    ALTER KEYSPACE inventory WITH replication = {'class': 'NetworkTopologyStrategy', 'east': '3'};
    ALTER KEYSPACE analytics WITH replication = {'class': 'NetworkTopologyStrategy', 'east': '3'};

    The database displays a warning about running nodetool repair -pr. This warning applies when you increase the replication factor. You can disregard it here because the replication factor is not changing in the surviving datacenter.
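The ALTER KEYSPACE statements above all follow one pattern, so a cluster with many user keyspaces could generate them instead of typing each one. The following is a minimal shell sketch under that assumption. The alter_keyspace_cql function name and its dc:rf argument format are hypothetical and are not part of Mission Control or cqlsh. The function only prints CQL, so you can review the statements before running them in cqlsh.

```shell
#!/bin/sh
# Hypothetical helper: print one ALTER KEYSPACE statement per keyspace.
# Usage: alter_keyspace_cql "<keyspace> [<keyspace>...]" <dc>:<rf> [<dc>:<rf>...]
alter_keyspace_cql() (
  keyspaces=$1
  shift
  # Build the datacenter part of the replication map, for example: , 'east': '3'
  dcs=""
  for pair in "$@"; do
    dc=${pair%%:*}
    rf=${pair##*:}
    dcs="$dcs, '$dc': '$rf'"
  done
  for ks in $keyspaces; do
    printf "ALTER KEYSPACE %s WITH replication = {'class': 'NetworkTopologyStrategy'%s};\n" "$ks" "$dcs"
  done
)

# Example: keep only the surviving datacenter (east) for both example keyspaces.
alter_keyspace_cql "inventory analytics" east:3
```

Passing both datacenters, for example west:3 east:3, produces the statements needed later when the rebuilt datacenter is added back to replication.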

Remove the failed datacenter from MissionControlCluster

Remove the failed datacenter definition from the MissionControlCluster object:

  1. Edit the MissionControlCluster manifest to remove the datacenter definition. Remove the section describing the failed datacenter from .spec.k8ssandra.cassandra.datacenters. This example shows the west datacenter configuration being removed:

    - datacenterName: west
      metadata:
        name: west
      racks:
      - name: us-west1-a
        nodeAffinityLabels:
          mission-control.datastax.com/role: database
          topology.kubernetes.io/zone: us-west1-a
      - name: us-west1-b
        nodeAffinityLabels:
          mission-control.datastax.com/role: database
          topology.kubernetes.io/zone: us-west1-b
      - name: us-west1-c
        nodeAffinityLabels:
          mission-control.datastax.com/role: database
          topology.kubernetes.io/zone: us-west1-c
      size: 9
      stopped: false

  2. Apply the modified manifest:

    kubectl apply -f missioncontrolcluster.yaml

    This triggers the removal of:

    • The CassandraDatacenter in the failed Kubernetes cluster.

    • All StatefulSets representing racks in the failed datacenter.

    • All data PVCs in the failed datacenter.

  3. Verify removal in the failed datacenter’s Kubernetes cluster:

    kubectl get CassandraDatacenter,pvc,sts,pod -n database-namespace
    Result
    No resources found in database-namespace namespace.

    If resources remain, forcefully remove them:

    1. Get the IP addresses of failed nodes and assassinate each one from the surviving datacenter:

      nodetool status  # Get failed node IPs
      nodetool assassinate 10.0.2.10
      nodetool assassinate 10.0.2.11
      # Repeat for all failed nodes

    2. Verify that only the surviving datacenter remains in the nodetool status output.

    3. Manually delete the CassandraDatacenter resource:

      kubectl delete cassandradatacenter west -n database-namespace

    4. Verify all resources are deleted:

      kubectl get CassandraDatacenter,pvc,sts,pod -n database-namespace

  4. Verify that the surviving datacenter shows all nodes in Up/Normal state:

    nodetool status
    Result
    Datacenter: east
    =================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving/Stopped
    -- Address      Load     Tokens  Owns (effective)  Host ID                               Rack
    UN 10.0.1.10    10.85 TiB  16     34.3%            aaaaaaaa-1111-2222-3333-aaaaaaaaaaaa  rack1
    UN 10.0.1.11    10.61 TiB  16     33.3%            bbbbbbbb-1111-2222-3333-bbbbbbbbbbbb  rack1
    UN 10.0.1.12    10.2 TiB   16     32.3%            cccccccc-1111-2222-3333-cccccccccccc  rack1
    UN 10.0.1.13    10.5 TiB   16     32.3%            dddddddd-1111-2222-3333-dddddddddddd  rack3
    UN 10.0.1.14    10.51 TiB  16     32.3%            eeeeeeee-1111-2222-3333-eeeeeeeeeeee  rack2
    UN 10.0.1.15    11.15 TiB  16     34.3%            ffffffff-1111-2222-3333-ffffffffffff  rack2
    UN 10.0.1.16    11.12 TiB  16     34.3%            11111111-2222-3333-4444-111111111111  rack3
    UN 10.0.1.17    10.86 TiB  16     33.3%            22222222-2222-3333-4444-222222222222  rack2
    UN 10.0.1.18    10.86 TiB  16     33.3%            33333333-2222-3333-4444-333333333333  rack3

At this stage, the surviving datacenter stores all data, and the system has removed all data from the failed datacenter.
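The nodetool status check above can also be scripted so that it is easy to repeat. The following is a minimal sketch, assuming the output layout shown in this example (a "Datacenter:" line followed by node lines that start with a two-letter code such as UN or DN). The summarize_status function name is hypothetical. It reads the output on standard input and prints one line per datacenter with the number of nodes in UN state and the number in any other state.

```shell
#!/bin/sh
# Hypothetical helper: summarize `nodetool status` output read from stdin.
# Prints one line per datacenter: <datacenter> <UN nodes> <other nodes>
summarize_status() {
  awk '
    $1 == "Datacenter:" {
      dc = $2
      if (!(dc in up)) { order[++n] = dc; up[dc] = 0; other[dc] = 0 }
      next
    }
    # Node lines start with a two-letter status and state code, such as UN or DN.
    $1 ~ /^[UD][NLJMS]$/ {
      if ($1 == "UN") up[dc]++; else other[dc]++
    }
    END {
      for (i = 1; i <= n; i++) print order[i], up[order[i]], other[order[i]]
    }
  '
}

# Usage, on a node in the surviving datacenter:
#   nodetool status | summarize_status
```

For the example cluster at this stage, the expected summary is a single line for east with nine UN nodes and no other nodes. A line for west, or a non-zero last column, suggests that the removal is not complete.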

Re-add the datacenter

To add the datacenter definition back to the MissionControlCluster manifest, do the following:

  1. Edit the MissionControlCluster manifest to add the datacenter definition.

    Add the exact datacenter configuration that you removed in the previous section back to .spec.k8ssandra.cassandra.datacenters. This example shows the west datacenter configuration:

    - datacenterName: west
      metadata:
        name: west
      racks:
      - name: us-west1-a
        nodeAffinityLabels:
          mission-control.datastax.com/role: database
          topology.kubernetes.io/zone: us-west1-a
      - name: us-west1-b
        nodeAffinityLabels:
          mission-control.datastax.com/role: database
          topology.kubernetes.io/zone: us-west1-b
      - name: us-west1-c
        nodeAffinityLabels:
          mission-control.datastax.com/role: database
          topology.kubernetes.io/zone: us-west1-c
      size: 9
      stopped: false

  2. Apply the modified manifest:

    kubectl apply -f missioncontrolcluster.yaml

    This re-creates the CassandraDatacenter, StatefulSets, PVCs, and pods.

  3. Monitor datacenter creation progress:

    kubectl get CassandraDatacenter,pvc,sts,pod -n database-namespace

    Wait until all pods reach Running state. This can take several minutes.

  4. Verify that all nodes appear in the cluster:

    nodetool status inventory
    Result
    Datacenter: east
    =================
    Status=Up/Down
    |/ State=Normal/Leaving/Joining/Moving/Stopped
    -- Address      Load     Tokens  Owns (effective)  Host ID                               Rack
    UN 10.0.1.10    10.85 TiB  16     34.3%            aaaaaaaa-1111-2222-3333-aaaaaaaaaaaa  rack1
    UN 10.0.1.11    10.61 TiB  16     33.3%            bbbbbbbb-1111-2222-3333-bbbbbbbbbbbb  rack1
    UN 10.0.1.12    10.2 TiB   16     32.3%            cccccccc-1111-2222-3333-cccccccccccc  rack1
    UN 10.0.1.13    10.5 TiB   16     32.3%            dddddddd-1111-2222-3333-dddddddddddd  rack3
    UN 10.0.1.14    10.51 TiB  16     32.3%            eeeeeeee-1111-2222-3333-eeeeeeeeeeee  rack2
    UN 10.0.1.15    11.15 TiB  16     34.3%            ffffffff-1111-2222-3333-ffffffffffff  rack2
    UN 10.0.1.16    11.12 TiB  16     34.3%            11111111-2222-3333-4444-111111111111  rack3
    UN 10.0.1.17    10.86 TiB  16     33.3%            22222222-2222-3333-4444-222222222222  rack2
    UN 10.0.1.18    10.86 TiB  16     33.3%            33333333-2222-3333-4444-333333333333  rack3
    
    Datacenter: west
    ================
    -- Address          Load     Tokens  Owns  Host ID                               Rack
    UN 198.51.100.10    201.8 KiB  16     ?    44444444-3333-4444-5555-444444444444  rack1
    UN 198.51.100.11    179.9 KiB  16     ?    55555555-3333-4444-5555-555555555555  rack1
    UN 198.51.100.12    134.8 KiB  16     ?    66666666-3333-4444-5555-666666666666  rack1
    UN 198.51.100.13    118.1 KiB  16     ?    77777777-3333-4444-5555-777777777777  rack3
    UN 198.51.100.14    121.1 KiB  16     ?    88888888-3333-4444-5555-888888888888  rack2
    UN 198.51.100.15    193.8 KiB  16     ?    99999999-3333-4444-5555-999999999999  rack3
    UN 198.51.100.16    187.3 KiB  16     ?    aaaaaaaa-4444-5555-6666-aaaaaaaaaaaa  rack2
    UN 198.51.100.17    195.4 KiB  16     ?    bbbbbbbb-4444-5555-6666-bbbbbbbbbbbb  rack2
    UN 198.51.100.18    211.7 KiB  16     ?    cccccccc-4444-5555-6666-cccccccccccc  rack3
    
    Note: Non-system keyspaces don't have the same replication settings. Please specify a keyspace to see effective data ownership information.

    New nodes show Up/Normal (UN) status but hold only a few hundred KiB of data because the user keyspaces do not yet replicate to the new datacenter.

  5. Repair system keyspaces on the surviving datacenter to ensure consistency before you rebuild the new datacenter and stream data.

    Run a primary-range repair on every node in the surviving datacenter for critical system keyspaces. This example script loops through all pods in the east datacenter:

    # Repair each system keyspace on every pod in the east datacenter.
    for ks in system_auth dse_leases system_distributed dse_security dse_system
    do
      # Quote the jsonpath expression to protect it from shell globbing.
      for pod in $(kubectl get pods -n database-namespace -l cassandra.datastax.com/datacenter=east -o jsonpath='{.items[*].metadata.name}')
      do
        kubectl exec -n database-namespace "$pod" -c cassandra -- nodetool repair -pr "$ks"
      done
    done

    Alternatively, you can run nodetool repair -pr manually on each node for each keyspace.

    The repair duration scales with the number of nodes in the surviving datacenter. In the nine-node example datacenter, the repair takes several minutes. If you use the script, wait until it finishes processing all nodes and keyspaces before you proceed to the next step.
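Because the next step depends on every repair finishing, it can help to record which repairs fail instead of relying on scrolling output. The following is a minimal sketch of the same loop with failure tracking. The repair_keyspaces function and the KUBECTL variable are hypothetical conveniences, not part of Mission Control. The namespace, label selector, and container name are the ones used in this example.

```shell
#!/bin/sh
# KUBECTL is an assumption of this sketch: it lets you substitute another
# command (for example, a stub for a dry run) for kubectl.
KUBECTL=${KUBECTL:-kubectl}

# Run a primary-range repair of each keyspace on each pod of one datacenter.
# Prints "FAILED <pod> <keyspace>" for each failed repair and returns
# non-zero if any repair failed.
# Usage: repair_keyspaces <namespace> <datacenter> <keyspace>...
repair_keyspaces() (
  ns=$1
  dc=$2
  shift 2
  failures=0
  pods=$("$KUBECTL" get pods -n "$ns" \
    -l "cassandra.datastax.com/datacenter=$dc" \
    -o jsonpath='{.items[*].metadata.name}') || exit 1
  for ks in "$@"; do
    for pod in $pods; do
      if ! "$KUBECTL" exec -n "$ns" "$pod" -c cassandra -- nodetool repair -pr "$ks"; then
        echo "FAILED $pod $ks"
        failures=$((failures + 1))
      fi
    done
  done
  [ "$failures" -eq 0 ]
)

# Usage for the example cluster:
#   repair_keyspaces database-namespace east system_auth dse_leases \
#     system_distributed dse_security dse_system
```

A non-zero return value means at least one repair did not complete, and the FAILED lines identify which pod and keyspace to retry before continuing.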

Replicate data to the rebuilt datacenter

To complete the rebuild process, add the rebuilt datacenter to the replication settings of your user keyspaces, and then stream the existing data from the surviving datacenter.

  1. Connect to a node using cqlsh.

  2. Use ALTER KEYSPACE to add the rebuilt datacenter to the replication for the user keyspaces that you altered in Remove replication to the failed datacenter.

    ALTER KEYSPACE inventory WITH replication = {'class': 'NetworkTopologyStrategy', 'west': '3', 'east': '3'};
    ALTER KEYSPACE analytics WITH replication = {'class': 'NetworkTopologyStrategy', 'west': '3', 'east': '3'};

    Because this change increases the replication factor, the database displays a warning about running nodetool repair -pr. The following steps address this warning by streaming the existing data from the surviving datacenter.

    New write transactions begin flowing to the rebuilt datacenter immediately after altering the replication factor. You can monitor traffic to the rebuilt datacenter in the Mission Control Cluster dashboard under the Write Requests / Table panel.

  3. Create a K8ssandraTask to stream existing data from the surviving datacenter.

    Create the task in the control plane datacenter where the K8ssandraOperator is running:

    apiVersion: control.k8ssandra.io/v1alpha1
    kind: K8ssandraTask
    metadata:
      name: west-rebuild-task
      namespace: database-namespace
    spec:
      cluster:
        name: db
        namespace: database-namespace
      dcConcurrencyPolicy: Forbid
      datacenters:
      - west
      template:
        concurrencyPolicy: Allow
        maxConcurrentPods: 1
        jobs:
        - args:
            source_datacenter: east
            keyspace_name: inventory
          command: rebuild
          name: rebuild-keyspace

    Key parameters:

    • source_datacenter: The surviving datacenter from which to stream data.

    • datacenters: The rebuilt datacenter that receives the streamed data.

    • maxConcurrentPods: The number of nodes per rack that receive streamed data simultaneously (available in Mission Control 1.18 and later). On earlier versions, data streams to one node at a time. Increasing this value increases the streaming speed but also increases resource usage and cost. Depending on the size of your cluster, you might want to stream to all nodes in parallel.

      Set maxConcurrentPods to ceil(rackSize / 2) to stream data to each rack in two steps. For a rack of three nodes, ceil(3 / 2) = 2.

    • keyspace_name: The keyspace to stream. The example task streams only the inventory keyspace. To stream the analytics keyspace as well, run a second task with keyspace_name set to analytics.

  4. Apply the task:

    kubectl apply -f rebuild-task.yaml

  5. Monitor streaming progress at the datacenter level:

    kubectl describe K8ssandraTask -n database-namespace west-rebuild-task
    Result
    Status:
      Active: 1
      Conditions:
        Last Transition Time: 2026-04-22T06:37:49Z
        Message:
        Reason: Running
        Status: True
        Type: Running
      Datacenters:
        west:
          Active: 1
          Conditions:
            Last Transition Time: 2026-04-22T06:37:49Z
            Message:
            Reason: Running
            Status: True
            Type: Running
          Pod Statuses:
            db-west-rack1-sts-0:
              Completion Time: 2026-04-23T12:20:07Z
              Job Id: aaaaaaaa-1111-2222-3333-aaaaaaaaaaaa
              Start Time: 2026-04-22T06:37:49Z
              Status: COMPLETED
            db-west-rack1-sts-1:
              Completion Time: 2026-04-23T12:20:07Z
              Job Id: bbbbbbbb-1111-2222-3333-bbbbbbbbbbbb
              Start Time: 2026-04-22T06:37:49Z
              Status: COMPLETED
            db-west-rack2-sts-0:
              Job Id: cccccccc-1111-2222-3333-cccccccccccc
              Start Time: 2026-04-23T12:20:07Z
              Status: RUNNING
          Start Time: 2026-04-22T06:37:49Z
          Succeeded: 2
      Start Time: 2026-04-22T06:37:49Z
      Succeeded: 2

  6. Monitor streaming progress on individual nodes:

    nodetool netstats | awk '/\s+\/([0-9]{1,3}\.){3}[0-9]|Receiving/ { if (NF == 1) host=$1; else print host " : " $11/$4*100 "%\t" $11/1024/1024/1024 "/" $4/1024/1024/1024 "GB";}'|sort -n
    Result
    /10.0.1.10 : 0.0265546% 0.301309/1134.68GB
    /10.0.1.11 : 0.00364573% 0.046889/1286.14GB
    /10.0.1.12 : 0.0159373% 0.179635/1127.14GB
    /10.0.1.13 : 0.0343425% 0.360817/1050.64GB
    /10.0.1.14 : 0.0165495% 0.190338/1150.11GB
    /10.0.1.15 : 0.02645% 0.204172/771.92GB
    /10.0.1.16 : 0.0021786% 0.0343957/1578.79GB
    /10.0.1.17 : 0.00238363% 0.032081/1345.89GB
    /10.0.1.18 : 0.0161406% 0.263193/1630.63GB

    This shows the percentage and volume of data streamed from each source endpoint.
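The per-source lines above can be reduced to a single progress figure for the receiving node. The following is a minimal sketch, assuming input in exactly the format shown in the result above, with the received and total volumes in the fourth field. The total_progress function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read the per-source lines printed by the netstats
# one-liner on stdin and print one total: <received>/<total> GB (<percent>%)
total_progress() {
  awk '
    NF >= 4 && split($4, part, "/") == 2 {
      received += part[1]
      total += part[2] + 0      # "+ 0" drops the trailing "GB" suffix
    }
    END {
      if (total > 0)
        printf "%.2f/%.2f GB (%.2f%%)\n", received, total, received / total * 100
      else
        print "no active streams"
    }
  '
}

# Usage: append the function to the monitoring pipeline shown in the step above:
#   nodetool netstats | awk '...' | sort -n | total_progress
```

The total covers only the streams that the node reports at that moment, so it describes the progress of one receiving node rather than of the whole datacenter.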

During data streaming, the streaming traffic is visible in the network throughput of the receiving nodes. Data streams in parallel from multiple nodes in the surviving datacenter.
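The guidance for maxConcurrentPods in the task parameters, ceil(rackSize / 2), can be computed with integer arithmetic. The following is a small sketch. The max_concurrent_pods function name is hypothetical, and the rack size is derived from the example datacenter of nine nodes across three racks.

```shell
#!/bin/sh
# Hypothetical helper: integer ceiling of rack_size / 2, without floating point.
max_concurrent_pods() {
  echo $(( ($1 + 1) / 2 ))
}

# Example: the sample datacenter has 9 nodes across 3 racks.
rack_size=$(( 9 / 3 ))
max_concurrent_pods "$rack_size"   # prints 2
```

Adding 1 before the integer division rounds odd rack sizes up, so a rack of three nodes streams to two nodes first and then to the remaining node.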

Next steps

  • Monitor cluster health in the Mission Control UI.

  • Verify data consistency across both datacenters.

  • Update application connection strings to include the rebuilt datacenter.
