Rebuild a datacenter’s replicas
A rebuild operation runs the nodetool rebuild command, which rebuilds the data on a single node by streaming it from another (source) datacenter (DC).
Run this operation on each node after defining the source datacenter from which to stream data. Together, the per-node rebuilds restore the local DC’s replicas.
Use it to redistribute data across the remaining nodes after one or more nodes or datacenters are terminated or added.
Performance impact
Rebuilding nodes restores the local datacenter’s replicas, which keeps data highly available and avoids single points of failure.
Prerequisites
- A prepared environment on either bare-metal/VM or an existing Kubernetes cluster.
Example
A single datacenter (dc1) is deployed on the Data Plane Kubernetes cluster.
A Control Plane Kubernetes cluster exists.
Workflow of user and operators
- Perform a backup of the node.
- User defines the rebuild CassandraTask. Specify the affected node as offline.
- User submits the rebuild CassandraTask to the Data Plane Kubernetes cluster where the datacenter is deployed.
- DC-operator detects the new task custom resource (CR).
- DC-operator iterates one rack at a time.
- DC-operator triggers and monitors rebuild operations one pod at a time.
- DC-operator reports task progress and status.
- User requests a status report of the rebuild CassandraTask with kubectl and views the status response.
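The operator steps above can be modeled as a sequential loop. The following Python code is a hypothetical sketch, not the DC-operator's actual implementation. The Pod type, the run_rebuild callback, and the dictionary form of the task are assumptions made for illustration. The sketch does four things:
- Checks that the job command is rebuild.
- Treats a missing keyspace_name as "all keyspaces".
- Visits pods in lexicographic order, first by rack name and then by pod name, rebuilding one pod at a time.
- Tallies succeeded and failed counts similar to those reported in the task status.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Pod:
    """Hypothetical pod model holding only the fields used for ordering."""
    rack: str
    name: str


def rolling_rebuild(
    task: Dict,
    pods: List[Pod],
    run_rebuild: Callable[[Pod, Optional[str]], bool],
) -> Dict[str, int]:
    """Rebuild pods one at a time, rack by rack, and tally the results."""
    job = task["spec"]["jobs"][0]
    if job["command"] != "rebuild":
        raise ValueError("spec.jobs[0].command must be 'rebuild'")
    # None means no keyspace restriction, i.e. all keyspaces are rebuilt.
    keyspace = (job.get("args") or {}).get("keyspace_name")

    status = {"succeeded": 0, "failed": 0}
    # Lexicographic (dictionary) order: by rack name first, then by pod name.
    for pod in sorted(pods, key=lambda p: (p.rack, p.name)):
        # run_rebuild is called sequentially and is assumed to return only
        # after this pod's rebuild finishes, so one pod rebuilds at a time.
        if run_rebuild(pod, keyspace):
            status["succeeded"] += 1
        else:
            status["failed"] += 1
    return status


if __name__ == "__main__":
    task = {"spec": {"jobs": [{"name": "rebuild-dc1", "command": "rebuild", "args": {}}]}}
    pods = [
        Pod("rack2", "demo-dc1-rack2-sts-0"),
        Pod("rack1", "demo-dc1-rack1-sts-2"),
        Pod("rack1", "demo-dc1-rack1-sts-10"),
        Pod("rack1", "demo-dc1-rack1-sts-0"),
    ]
    visited = []

    def fake_rebuild(pod: Pod, keyspace: Optional[str]) -> bool:
        # Stand-in for triggering and waiting on nodetool rebuild in the pod.
        visited.append(pod.name)
        return True

    print(rolling_rebuild(task, pods, fake_rebuild))
    print(visited)
```

Running the demo prints {'succeeded': 4, 'failed': 0}. The pods are visited in this order:
1. demo-dc1-rack1-sts-0
2. demo-dc1-rack1-sts-10
3. demo-dc1-rack1-sts-2
4. demo-dc1-rack2-sts-0

Dictionary order compares pod names as strings, so sts-10 sorts before sts-2.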
Rebuild a datacenter’s replicas
- Modify the rebuild-dc1.cassandratask.yaml file. Here is a sample:
apiVersion: control.k8ssandra.io/v1alpha1
kind: CassandraTask
metadata:
  name: rebuild-dc1
spec:
  datacenter:
    name: dc1
    namespace: demo
  jobs:
    - name: rebuild-dc1
      command: rebuild
      args:
        keyspace_name: KEYSPACE_NAME
Replace KEYSPACE_NAME with the name of the keyspace to rebuild.
Key options:
  - metadata.name: a unique identifier within the Kubernetes namespace where the task is submitted. While the name can be any value, consider including the cluster name to prevent collisions with other tasks.
  - spec.datacenter: a unique namespace and name combination used to determine which datacenter is the source for this operation.
  - spec.jobs[0].command: MUST be rebuild for this operation.
  - Optional: spec.jobs[0].args.keyspace_name: restricts this operation to a particular keyspace. If omitted, all keyspaces are rebuilt (the default).
- Submit the rebuild CassandraTask custom resource to the Data Plane Kubernetes cluster where the datacenter is deployed with this command:
kubectl apply -f rebuild-dc1.cassandratask.yaml
If nodetool rebuild is interrupted before completion, restart it by re-running the command. The process resumes from the point at which it was interrupted.
The DC-level operators perform a rolling rebuild operation, one node at a time, in lexicographic (dictionary) order: first by rack name, then by node (pod) name.
- Monitor the rebuild operation progress with this kubectl command:
kubectl get cassandratask rebuild-dc1 -o yaml | yq .status
Sample results
...
status:
  completionTime: "2022-10-23T23:34:38Z"
  conditions:
    - lastTransitionTime: "2022-10-23T23:34:08Z"
      status: "True"
      type: Running
    - lastTransitionTime: "2022-10-23T23:34:39Z"
      status: "True"
      type: Complete
  startTime: "2022-10-23T23:34:08Z"
  succeeded: 3
The DC-level operators set the startTime field before starting the rebuild operation and update the completionTime field when the rebuild operation completes.
The sample output indicates that the task is complete: the condition with type: Complete has its status set to "True". The succeeded: 3 field indicates that three nodes (pods) completed the requested task successfully. A failed field tracks a running count of pods that failed the rebuild operation.
- Monitor the DSE node logs on the Data Plane cluster to verify that the datacenter is rebuilt with this command:
kubectl logs -f demo-dc1-rack3-sts-1 -c server-system-logger | grep "finished rebuild"
Sample results
INFO [pool-18-thread-1] 2022-10-23 23:32:18,088 StorageService.java:1895 - finished rebuild for (All keyspaces), (All tokens), 1 streaming connections, NORMAL, included DCs: dc1 after 5 seconds receiving 3.52 MiB.
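The two checks in the last steps can also be scripted. The Python code below is a hedged sketch, not part of the product. It assumes two inputs:
- The task captured as JSON, for example with kubectl get cassandratask rebuild-dc1 -o json.
- A log line taken from the server-system-logger container.

The helper names rebuild_summary and parse_finished_rebuild are hypothetical. The sketch treats the task as finished when the Complete condition has status "True". It computes the elapsed time from startTime and completionTime. It also extracts the duration and received volume from a "finished rebuild" log line. The log pattern is inferred from the single sample line above and may not match every DSE version.

```python
import json
import re
from datetime import datetime
from typing import Optional

# Timestamps in the sample status use the form 2022-10-23T23:34:08Z (UTC).
TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"

# Assumed pattern, inferred from the single sample "finished rebuild" line.
FINISHED_REBUILD = re.compile(
    r"finished rebuild for .*? after (?P<seconds>\d+) seconds "
    r"receiving (?P<size>[\d.]+) (?P<unit>\w+)"
)


def rebuild_summary(task_json: str) -> dict:
    """Summarize the status of a CassandraTask object serialized as JSON."""
    status = json.loads(task_json).get("status", {})
    complete = any(
        c.get("type") == "Complete" and c.get("status") == "True"
        for c in status.get("conditions", [])
    )
    elapsed = None
    if "startTime" in status and "completionTime" in status:
        start = datetime.strptime(status["startTime"], TIME_FORMAT)
        end = datetime.strptime(status["completionTime"], TIME_FORMAT)
        elapsed = (end - start).total_seconds()
    return {
        "complete": complete,
        "succeeded": status.get("succeeded", 0),
        # Treated as 0 when absent, as in the sample output.
        "failed": status.get("failed", 0),
        "elapsed_seconds": elapsed,
    }


def parse_finished_rebuild(line: str) -> Optional[dict]:
    """Extract duration and received volume from a 'finished rebuild' log line."""
    match = FINISHED_REBUILD.search(line)
    if match is None:
        return None
    return {
        "seconds": int(match.group("seconds")),
        "received": float(match.group("size")),
        "unit": match.group("unit"),
    }


if __name__ == "__main__":
    sample = json.dumps({"status": {
        "completionTime": "2022-10-23T23:34:38Z",
        "conditions": [
            {"lastTransitionTime": "2022-10-23T23:34:08Z", "status": "True", "type": "Running"},
            {"lastTransitionTime": "2022-10-23T23:34:39Z", "status": "True", "type": "Complete"},
        ],
        "startTime": "2022-10-23T23:34:08Z",
        "succeeded": 3,
    }})
    print(rebuild_summary(sample))
    line = ("INFO [pool-18-thread-1] 2022-10-23 23:32:18,088 StorageService.java:1895 - "
            "finished rebuild for (All keyspaces), (All tokens), 1 streaming connections, "
            "NORMAL, included DCs: dc1 after 5 seconds receiving 3.52 MiB.")
    print(parse_finished_rebuild(line))
```

With the sample status and log line from this page, the sketch reports:
- complete: True
- succeeded: 3
- elapsed time: 30.0 seconds (23:34:08 to 23:34:38)
- rebuild: 5 seconds, receiving 3.52 MiB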