Rebuild a failed datacenter

Use this procedure to rebuild a failed datacenter after an outage that lasts longer than the hint window. This procedure removes all replicas in the failed datacenter from the cluster, including all resources and data. Then, it redeploys the failed datacenter and streams data from at least one surviving datacenter while the cluster continues normal operations.

For less catastrophic scenarios where the datacenter is still operational, see Rebuild a datacenter’s replicas.

This procedure permanently deletes all resources and data in the failed datacenter. Run this procedure only when you cannot recover the failed datacenter, such as when data is corrupted, lost, or out-of-date. Do not run this procedure in production without fully understanding these impacts.

Prerequisites

A prepared Mission Control environment on an existing Kubernetes cluster
kubectl access with permissions to modify, remove, and create resources in the database namespace
At least one surviving datacenter in a consistent state
cqlsh access with ALTER keyspace permission

This procedure requires at least one surviving datacenter in a healthy, operational state. If all datacenters have failed and no datacenter is operational, you cannot use this rebuild procedure. Instead, you must restore from backup.

Rebuild procedure

Complete these steps to rebuild the failed datacenter:

Remove replication to the failed datacenter.
Remove the failed datacenter from MissionControlCluster.
Re-add the datacenter.
Replicate data to the rebuilt datacenter.

The following steps demonstrate the rebuild process using an example cluster with two datacenters. The east datacenter is the surviving datacenter. The west datacenter is the failed datacenter that needs to be rebuilt. Each datacenter has three racks and nine nodes (three nodes per rack).

The example uses two user keyspaces (inventory and analytics) with the following original replication strategy:

ALTER KEYSPACE inventory WITH replication = {'class': 'NetworkTopologyStrategy', 'west': '3', 'east': '3'};
ALTER KEYSPACE analytics WITH replication = {'class': 'NetworkTopologyStrategy', 'west': '3', 'east': '3'};

This configuration provides triple replication in each datacenter.

Remove replication to the failed datacenter

Reconfigure all user keyspaces to replicate only to the surviving datacenter:

Connect to a node in the surviving datacenter using cqlsh.

Alter all user-defined keyspaces to remove the failed datacenter from replication.

This example shows two user keyspaces: inventory and analytics.

ALTER KEYSPACE inventory WITH replication = {'class': 'NetworkTopologyStrategy', 'east': '3'};
ALTER KEYSPACE analytics WITH replication = {'class': 'NetworkTopologyStrategy', 'east': '3'};

After you alter a keyspace, the database displays a warning about running nodetool repair -pr, which applies when you increase the replication factor. You can disregard this warning because the replication factor in the surviving datacenter does not change.
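
For clusters with many user keyspaces, the statements can be generated rather than typed by hand. The following is a minimal sketch; the function name alter_to_single_dc is hypothetical, and it assumes every keyspace uses NetworkTopologyStrategy with the same replication factor in the surviving datacenter.

```shell
# Print one ALTER KEYSPACE statement per keyspace, replicating only to the
# given datacenter. Arguments: <datacenter> <replication-factor> <keyspace>...
# alter_to_single_dc is a hypothetical helper, not part of any product CLI.
alter_to_single_dc() {
  dc=$1
  rf=$2
  shift 2
  for ks in "$@"; do
    printf "ALTER KEYSPACE %s WITH replication = {'class': 'NetworkTopologyStrategy', '%s': '%s'};\n" "$ks" "$dc" "$rf"
  done
}

# Example: print the statements for review before running them in cqlsh.
# alter_to_single_dc east 3 inventory analytics
```

Review the generated statements before you run them, because they overwrite the existing replication settings of each keyspace.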

Remove the failed datacenter from MissionControlCluster

Remove the failed datacenter definition from the MissionControlCluster object:

Edit the MissionControlCluster manifest to remove the datacenter definition. Remove the section describing the failed datacenter from .spec.k8ssandra.cassandra.datacenters. This example shows the west datacenter configuration being removed:

- datacenterName: west
  metadata:
    name: west
  racks:
  - name: us-west1-a
    nodeAffinityLabels:
      mission-control.datastax.com/role: database
      topology.kubernetes.io/zone: us-west1-a
  - name: us-west1-b
    nodeAffinityLabels:
      mission-control.datastax.com/role: database
      topology.kubernetes.io/zone: us-west1-b
  - name: us-west1-c
    nodeAffinityLabels:
      mission-control.datastax.com/role: database
      topology.kubernetes.io/zone: us-west1-c
  size: 9
  stopped: false

Apply the modified manifest:
```
kubectl apply -f missioncontrolcluster.yaml
```
This triggers the removal of:
- The CassandraDatacenter in the failed Kubernetes cluster.
- All StatefulSets representing racks in the failed datacenter.
- All data PVCs in the failed datacenter.

Verify removal in the failed datacenter’s Kubernetes cluster:

kubectl get CassandraDatacenter,pvc,sts,pod -n database-namespace

Result

No resources found in database-namespace namespace.
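
To check the result in a script instead of reading it by eye, you can count the resource lines that kubectl prints. This sketch uses the --no-headers flag of kubectl get and assumes that kubectl writes the "No resources found" message to standard error, so an empty standard output means that nothing remains. The function name count_remaining is hypothetical.

```shell
# Count resource lines read from standard input. Blank lines, which kubectl
# prints between resource types, are ignored. Empty input yields 0.
count_remaining() {
  awk 'NF > 0 { n++ } END { print n + 0 }'
}

# Example (assumes kubectl points at the failed datacenter's Kubernetes cluster):
# kubectl get CassandraDatacenter,pvc,sts,pod -n database-namespace --no-headers 2>/dev/null | count_remaining
```

A result of 0 corresponds to the "No resources found" message shown above.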

If resources remain, forcefully remove them:

Get the IP addresses of failed nodes and assassinate each one from the surviving datacenter:

nodetool status  # Get failed node IPs
nodetool assassinate 10.0.2.10
nodetool assassinate 10.0.2.11
# Repeat for all failed nodes

Verify only the surviving datacenter remains in nodetool status.
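
When many nodes have failed, collecting their addresses by hand is error-prone. The following sketch filters nodetool status output for nodes whose status code starts with D (Down) and prints their addresses. The function name down_node_addresses is hypothetical, and the sketch assumes the column layout shown in this procedure.

```shell
# Print the address (second column) of every node that nodetool status
# reports as Down, that is, a status code of D followed by a state letter.
down_node_addresses() {
  awk '$1 ~ /^D[NLJMS]$/ { print $2 }'
}

# Example: list the addresses first, then assassinate each one deliberately.
# nodetool status | down_node_addresses
```

Review the list before you assassinate any node, because the filter also matches nodes in the surviving datacenter that happen to be down.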

Manually delete the CassandraDatacenter resource:

kubectl delete cassandradatacenter west -n database-namespace

Verify all resources are deleted:

kubectl get CassandraDatacenter,pvc,sts,pod -n database-namespace

Verify that the surviving datacenter shows all nodes in Up/Normal state:

nodetool status

Result

Datacenter: east
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
-- Address        Load     Tokens  Owns (effective)  Host ID                               Rack
UN 10.0.1.10      10.85 TiB  16     34.3%            aaaaaaaa-1111-2222-3333-aaaaaaaaaaaa  rack1
UN 10.0.1.11      10.61 TiB  16     33.3%            bbbbbbbb-1111-2222-3333-bbbbbbbbbbbb  rack1
UN 10.0.1.12      10.2 TiB   16     32.3%            cccccccc-1111-2222-3333-cccccccccccc  rack1
UN 10.0.1.13      10.5 TiB   16     32.3%            dddddddd-1111-2222-3333-dddddddddddd  rack3
UN 10.0.1.14      10.51 TiB  16     32.3%            eeeeeeee-1111-2222-3333-eeeeeeeeeeee  rack2
UN 10.0.1.15      11.15 TiB  16     34.3%            ffffffff-1111-2222-3333-ffffffffffff  rack2
UN 10.0.1.16      11.12 TiB  16     34.3%            11111111-2222-3333-4444-111111111111  rack3
UN 10.0.1.17      10.86 TiB  16     33.3%            22222222-2222-3333-4444-222222222222  rack2
UN 10.0.1.18      10.86 TiB  16     33.3%            33333333-2222-3333-4444-333333333333  rack3

At this stage, the surviving datacenter stores all data, and the system has removed all data from the failed datacenter.
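
The Up/Normal check can also be scripted. This sketch prints every node line from nodetool status whose status code is anything other than UN, so empty output means that all nodes are Up/Normal. The function name not_up_normal is hypothetical.

```shell
# Print node lines whose status is not UN. Node lines are recognized by a
# two-letter code in the first column: U or D, followed by a state letter.
not_up_normal() {
  awk '$1 ~ /^[UD][NLJMS]$/ && $1 != "UN"'
}

# Example: no output means every listed node is Up/Normal.
# nodetool status | not_up_normal
```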

Re-add the datacenter

To add the datacenter definition back to the MissionControlCluster manifest, do the following:

Edit the MissionControlCluster manifest to add the datacenter definition.

Restore the exact datacenter configuration that was removed in the previous step to .spec.k8ssandra.cassandra.datacenters. This example shows the west datacenter configuration:

- datacenterName: west
  metadata:
    name: west
  racks:
  - name: us-west1-a
    nodeAffinityLabels:
      mission-control.datastax.com/role: database
      topology.kubernetes.io/zone: us-west1-a
  - name: us-west1-b
    nodeAffinityLabels:
      mission-control.datastax.com/role: database
      topology.kubernetes.io/zone: us-west1-b
  - name: us-west1-c
    nodeAffinityLabels:
      mission-control.datastax.com/role: database
      topology.kubernetes.io/zone: us-west1-c
  size: 9
  stopped: false

Apply the modified manifest:
```
kubectl apply -f missioncontrolcluster.yaml
```
This re-creates the CassandraDatacenter, StatefulSets, PVCs, and pods.
Monitor datacenter creation progress:
```
kubectl get CassandraDatacenter,pvc,sts,pod -n database-namespace
```
Wait until all pods reach Running state. This can take several minutes.
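
Instead of re-running the command by hand, you can poll until every pod reports Running. The following sketch assumes the default column layout of kubectl get pods --no-headers (NAME, READY, STATUS, RESTARTS, AGE). The function name all_pods_running is hypothetical, and the label selector in the example mirrors the one used for the east datacenter later in this procedure.

```shell
# Succeed only if at least one pod line is read from standard input and every
# line has Running in the STATUS (third) column.
all_pods_running() {
  awk 'NF > 0 { total++; if ($3 != "Running") pending++ }
       END { if (total > 0 && !pending) exit 0; exit 1 }'
}

# Example: poll every 30 seconds until all west pods are Running.
# until kubectl get pods -n database-namespace -l cassandra.datastax.com/datacenter=west --no-headers | all_pods_running; do
#   sleep 30
# done
```

Running only describes the pod phase. A pod can be Running before the database node inside it has joined the cluster, so the nodetool status check that follows is still required.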

Verify that all nodes appear in the cluster:

nodetool status inventory

Result

Datacenter: east
=================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving/Stopped
-- Address      Load     Tokens  Owns (effective)  Host ID                               Rack
UN 10.0.1.10    10.85 TiB  16     34.3%            aaaaaaaa-1111-2222-3333-aaaaaaaaaaaa  rack1
UN 10.0.1.11    10.61 TiB  16     33.3%            bbbbbbbb-1111-2222-3333-bbbbbbbbbbbb  rack1
UN 10.0.1.12    10.2 TiB   16     32.3%            cccccccc-1111-2222-3333-cccccccccccc  rack1
UN 10.0.1.13    10.5 TiB   16     32.3%            dddddddd-1111-2222-3333-dddddddddddd  rack3
UN 10.0.1.14    10.51 TiB  16     32.3%            eeeeeeee-1111-2222-3333-eeeeeeeeeeee  rack2
UN 10.0.1.15    11.15 TiB  16     34.3%            ffffffff-1111-2222-3333-ffffffffffff  rack2
UN 10.0.1.16    11.12 TiB  16     34.3%            11111111-2222-3333-4444-111111111111  rack3
UN 10.0.1.17    10.86 TiB  16     33.3%            22222222-2222-3333-4444-222222222222  rack2
UN 10.0.1.18    10.86 TiB  16     33.3%            33333333-2222-3333-4444-333333333333  rack3

Datacenter: west
================
-- Address          Load     Tokens  Owns  Host ID                               Rack
UN 198.51.100.10    201.8 KiB  16     ?    44444444-3333-4444-5555-444444444444  rack1
UN 198.51.100.11    179.9 KiB  16     ?    55555555-3333-4444-5555-555555555555  rack1
UN 198.51.100.12    134.8 KiB  16     ?    66666666-3333-4444-5555-666666666666  rack1
UN 198.51.100.13    118.1 KiB  16     ?    77777777-3333-4444-5555-777777777777  rack3
UN 198.51.100.14    121.1 KiB  16     ?    88888888-3333-4444-5555-888888888888  rack2
UN 198.51.100.15    193.8 KiB  16     ?    99999999-3333-4444-5555-999999999999  rack3
UN 198.51.100.16    187.3 KiB  16     ?    aaaaaaaa-4444-5555-6666-aaaaaaaaaaaa  rack2
UN 198.51.100.17    195.4 KiB  16     ?    bbbbbbbb-4444-5555-6666-bbbbbbbbbbbb  rack2
UN 198.51.100.18    211.7 KiB  16     ?    cccccccc-4444-5555-6666-cccccccccccc  rack3

Note: Non-system keyspaces don't have the same replication settings. Please specify a keyspace to see effective data ownership information.

New nodes show Up/Normal (UN) status but hold only a few hundred KiB of data because the user keyspaces do not yet replicate to the new datacenter.

Repair system keyspaces on the surviving datacenter to ensure consistency before you rebuild the new datacenter and stream data.

Run a primary-range repair on every node in the surviving datacenter for critical system keyspaces. This example script loops through all pods in the east datacenter:
```
for ks in system_auth dse_leases system_distributed dse_security dse_system
do
  for pod in $(kubectl get pods -n database-namespace -l cassandra.datastax.com/datacenter=east -o jsonpath='{.items[*].metadata.name}')
  do
    kubectl exec -it -n database-namespace "$pod" -c cassandra -- nodetool repair -pr "$ks"
  done
done
```
Alternatively, you can run nodetool repair -pr manually on each node for each keyspace.

The repair duration scales with the number of nodes in the surviving datacenter. In a nine-node datacenter, this takes several minutes. If you use the script, wait for it to finish processing all nodes and keyspaces before you proceed to the next step.

Replicate data to the rebuilt datacenter

To complete the rebuild process, you must add the rebuilt datacenter to the replication factor for your user keyspaces, and then enable data streaming from the surviving datacenter.

Connect to a node using cqlsh.
Use ALTER KEYSPACE to add the rebuilt datacenter to the replication for the user keyspaces that you altered in Remove replication to the failed datacenter.
```
ALTER KEYSPACE inventory WITH replication = {'class': 'NetworkTopologyStrategy', 'west': '3', 'east': '3'};
ALTER KEYSPACE analytics WITH replication = {'class': 'NetworkTopologyStrategy', 'west': '3', 'east': '3'};
```
When you increase the replication factor, you will receive a warning about running nodetool repair -pr. The following steps address this warning when you stream data from a surviving datacenter.

New write transactions begin flowing to the rebuilt datacenter immediately after altering the replication factor. You can monitor traffic to the rebuilt datacenter in the Mission Control Cluster dashboard under the Write Requests / Table panel.

Create a K8ssandraTask to stream existing data from the surviving datacenter.

Create the task in the control plane datacenter where the K8ssandraOperator is running:

apiVersion: control.k8ssandra.io/v1alpha1
kind: K8ssandraTask
metadata:
  name: west-rebuild-task
  namespace: database-namespace
spec:
  cluster:
    name: db
    namespace: database-namespace
  dcConcurrencyPolicy: Forbid
  datacenters:
  - west
  template:
    concurrencyPolicy: Allow
    maxConcurrentPods: 1
    jobs:
    - args:
        source_datacenter: east
        keyspace_name: inventory
      command: rebuild
      name: rebuild-keyspace

Key parameters:

source_datacenter: The surviving datacenter from which to stream data.
datacenters: The rebuilt datacenter that receives the streamed data.
maxConcurrentPods: The number of nodes to stream data to simultaneously per rack (available in Mission Control 1.18 and later). Increasing this value increases the streaming speed but also increases resource usage and cost. Depending on the size of your cluster, you might want to stream to all nodes in parallel.
keyspace_name: The keyspace to stream. The example task streams only the inventory keyspace, so the analytics keyspace needs its own rebuild job.

Set maxConcurrentPods to ceil(rackSize / 2) to stream data to each rack in two steps. This optimizes streaming throughput by allowing multiple nodes to receive data simultaneously. On Mission Control versions earlier than 1.18, data streams to one node at a time.
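
The ceiling can be computed with integer shell arithmetic, as in the following sketch. The function name max_concurrent_pods is hypothetical. For the example cluster, which has three nodes per rack, the result is 2, so each rack streams to two nodes first and then to the remaining node.

```shell
# Integer ceiling of rack_size / 2: adding 1 before the integer division
# rounds odd rack sizes up.
max_concurrent_pods() {
  rack_size=$1
  echo $(( (rack_size + 1) / 2 ))
}

# Example for the three-node racks in this procedure:
# max_concurrent_pods 3   # prints 2
```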

Apply the task:
```
kubectl apply -f rebuild-task.yaml
```
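
The example task covers a single keyspace. One way to cover several keyspaces is to generate a manifest per keyspace from the same template, as sketched here. This assumes one K8ssandraTask per keyspace, which is an illustration rather than a documented requirement. The function name rebuild_task_manifest, the task naming pattern, and the output file names are hypothetical. The cluster name and namespace are copied from the example above.

```shell
# Print a K8ssandraTask manifest that rebuilds one keyspace.
# Arguments: <keyspace> <source-datacenter> <target-datacenter>
# The task name "<target>-rebuild-<keyspace>" is a hypothetical convention.
rebuild_task_manifest() {
  cat <<EOF
apiVersion: control.k8ssandra.io/v1alpha1
kind: K8ssandraTask
metadata:
  name: $3-rebuild-$1
  namespace: database-namespace
spec:
  cluster:
    name: db
    namespace: database-namespace
  dcConcurrencyPolicy: Forbid
  datacenters:
  - $3
  template:
    concurrencyPolicy: Allow
    maxConcurrentPods: 1
    jobs:
    - args:
        source_datacenter: $2
        keyspace_name: $1
      command: rebuild
      name: rebuild-keyspace
EOF
}

# Example: write one manifest per user keyspace, then apply each file.
# for ks in inventory analytics; do
#   rebuild_task_manifest "$ks" east west > "rebuild-task-$ks.yaml"
# done
```

Kubernetes object names cannot contain underscores, so a keyspace name such as my_keyspace would need a different task name than this pattern produces.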

Monitor streaming progress at the datacenter level:

kubectl describe K8ssandraTask -n database-namespace west-rebuild-task

Result

Status:
  Active: 1
  Conditions:
    Last Transition Time: 2026-04-22T06:37:49Z
    Message:
    Reason: Running
    Status: True
    Type: Running
  Datacenters:
    west:
      Active: 1
      Conditions:
        Last Transition Time: 2026-04-22T06:37:49Z
        Message:
        Reason: Running
        Status: True
        Type: Running
      Pod Statuses:
        db-west-rack1-sts-0:
          Completion Time: 2026-04-23T12:20:07Z
          Job Id: aaaaaaaa-1111-2222-3333-aaaaaaaaaaaa
          Start Time: 2026-04-22T06:37:49Z
          Status: COMPLETED
        db-west-rack1-sts-1:
          Completion Time: 2026-04-23T12:20:07Z
          Job Id: bbbbbbbb-1111-2222-3333-bbbbbbbbbbbb
          Start Time: 2026-04-22T06:37:49Z
          Status: COMPLETED
        db-west-rack2-sts-0:
          Job Id: cccccccc-1111-2222-3333-cccccccccccc
          Start Time: 2026-04-23T12:20:07Z
          Status: RUNNING
      Start Time: 2026-04-22T06:37:49Z
      Succeeded: 2
  Start Time: 2026-04-22T06:37:49Z
  Succeeded: 2
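
For a datacenter with many pods, a per-state count is easier to read than the full description. The following sketch tallies the pod job states in the describe output. It assumes that pod states are written in upper case, as COMPLETED and RUNNING are in the result above, which separates them from condition lines such as "Status: True". The function name task_pod_summary is hypothetical.

```shell
# Count per-pod job states in `kubectl describe K8ssandraTask` output read
# from standard input. Only upper-case values are counted, so condition
# lines such as "Status: True" are skipped.
task_pod_summary() {
  awk '$1 == "Status:" && $2 ~ /^[A-Z_]+$/ { count[$2]++ }
       END { for (s in count) print s, count[s] }' | sort
}

# Example:
# kubectl describe K8ssandraTask -n database-namespace west-rebuild-task | task_pod_summary
```

Applied to the result above, the summary reports two COMPLETED pods and one RUNNING pod.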

Monitor streaming progress on individual nodes:

nodetool netstats | awk '/\s+\/([0-9]{1,3}\.){3}[0-9]|Receiving/ { if (NF == 1) host=$1; else print host " : " $11/$4*100 "%\t" $11/1024/1024/1024 "/" $4/1024/1024/1024 "GB";}'|sort -n

Result

/10.0.1.10 : 0.0265546% 0.301309/1134.68GB
/10.0.1.11 : 0.00364573% 0.046889/1286.14GB
/10.0.1.12 : 0.0159373% 0.179635/1127.14GB
/10.0.1.13 : 0.0343425% 0.360817/1050.64GB
/10.0.1.14 : 0.0165495% 0.190338/1150.11GB
/10.0.1.15 : 0.02645% 0.204172/771.92GB
/10.0.1.16 : 0.0021786% 0.0343957/1578.79GB
/10.0.1.17 : 0.00238363% 0.032081/1345.89GB
/10.0.1.18 : 0.0161406% 0.263193/1630.63GB

This shows the percentage and volume of data streamed from each source endpoint.
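
To reduce the per-endpoint lines to a single figure for the receiving node, you can sum the received and total volumes. This sketch parses lines in the format shown in the result above, where the last field is received/total followed by GB. The function name overall_progress is hypothetical.

```shell
# Sum the received and total volumes across all source endpoints and print
# the overall percentage. Expects lines such as:
#   /10.0.1.10 : 0.0265546% 0.301309/1134.68GB
overall_progress() {
  awk '{
         split($4, part, "/")        # part[1] = received, part[2] = total with a GB suffix
         sub(/GB$/, "", part[2])
         received += part[1]
         total += part[2]
       }
       END { if (total > 0) printf "%.2f%% of %.2f GB\n", received / total * 100, total }'
}

# Example: pipe the output of the nodetool netstats command above into
# overall_progress on the receiving node.
```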

During data streaming, the streaming traffic is visible in the network throughput of the receiving nodes. Data streams in parallel from multiple nodes in the surviving datacenter.

Next steps

Monitor cluster health in the Mission Control UI.
Verify data consistency across both datacenters.
Update application connection strings to include the rebuilt datacenter.
