Interact with local operators during a control plane outage
When the Mission Control control plane is unavailable, you can still interact with the data plane’s local operators to manage your database clusters.
The cass-operator manages individual database datacenters in the data plane.
When the control plane is down, you can interact with this operator directly to troubleshoot and recover at the datacenter level.
Prerequisites
- A backup of your current cluster configuration
- Access to the data plane of the Mission Control cluster
- Permissions to manage Kubernetes resources
- kubectl version 1.20 or later installed on your local machine
Verify control plane availability
Before proceeding, confirm that the control plane is unavailable:
kubectl cluster-info
Sample result
Unable to connect to the server: dial tcp SERVER_IP:8443: i/o timeout
Manage a custom resource (CR) in the data plane
Each data plane runs its own Kubernetes API server, independent of the Mission Control control plane. You can interact with this API directly using standard Kubernetes tools:
- kubectl configured with the data plane’s kubeconfig file
- The Kubernetes API via direct REST calls
The Mission Control UI usually creates CRs, and these resources reside in the data plane. When the control plane is unavailable, you must manage these resources directly. For more information on CRs, see the Mission Control Custom Resource Definition (CRD) reference.
Change the context to the data plane
If the control plane is unavailable, change the context to the data plane using the kubeconfig file.
List all available contexts:
kubectl config get-contexts
Change the context to the data plane:
kubectl config use-context DATA_PLANE_CLUSTER_NAME
Replace DATA_PLANE_CLUSTER_NAME with the name of the data plane cluster.
View details of a CR
Describe a specific CR to view its details:
kubectl get CUSTOM_RESOURCE_KIND RESOURCE_NAME -n NAMESPACE -o yaml
Replace the following:
- CUSTOM_RESOURCE_KIND: The kind of CR
- RESOURCE_NAME: The name of the resource
- NAMESPACE: The namespace where the resource is deployed
Modify a CR
You can edit a CR to make changes:
kubectl edit CUSTOM_RESOURCE_KIND RESOURCE_NAME -n NAMESPACE
Replace the following:
- CUSTOM_RESOURCE_KIND: The kind of CR
- RESOURCE_NAME: The name of the resource
- NAMESPACE: The namespace where the resource is deployed
Apply changes to a CR
To apply changes to a CR, use the kubectl apply command:
kubectl apply -f CUSTOM_RESOURCE.yaml
Replace CUSTOM_RESOURCE.yaml with the YAML file containing the changes.
Delete a CR
To delete a CR, use the kubectl delete command:
kubectl delete CUSTOM_RESOURCE_KIND RESOURCE_NAME -n NAMESPACE
Replace the following:
- CUSTOM_RESOURCE_KIND: The kind of CR
- RESOURCE_NAME: The name of the resource
- NAMESPACE: The namespace where the resource is deployed
Manage database datacenter resources
When the control plane is down, you can only work with the cass-operator in the data plane, which manages individual datacenters.
You cannot access K8ssandraCluster level functionality during control plane outages.
During a control plane outage, the system ignores any changes you make to K8ssandraCluster objects. Work at the CassandraDatacenter level instead.
Backup and repair behavior during outages
Understanding how Mission Control handles backups and repairs during a control plane outage helps you plan for high availability and disaster recovery scenarios.
Mission Control runs scheduled backups in the data planes even when the control plane is unavailable. Your data protection strategy remains intact during control plane outages.
However, Mission Control doesn’t run repair operations when the control plane is down. Repair availability depends on your Reaper deployment configuration.
Reaper deployment configuration
By default, Mission Control doesn’t deploy a Reaper instance in each datacenter. Mission Control uses the following default behavior:
- Deploys Reaper in the first datacenter with data nodes
- Deploys Reaper in the first data plane datacenter if the control plane has no data nodes and you have multiple data planes with data nodes
- Doesn’t deploy Reaper in a control plane that doesn’t also function as a data plane
Your Reaper configuration and deployment determine when you can run repairs:
- Single Reaper in control plane: You can’t access Reaper functionality during a control plane outage if your deployment uses a single Reaper installation that the control plane manages.
- Reaper in data plane: You can still run repair operations when the control plane is down if you have a control plane with no data nodes and two data planes with data nodes. Mission Control deploys Reaper in the first data plane datacenter, not in the control plane.
- Datacenter failure: You can’t run repairs if one datacenter is down unless you restrict repairs to datacenters that are currently up.
You can configure Mission Control to deploy Reaper in each datacenter for improved availability during partial outages. This configuration lets repairs continue in available datacenters even when other datacenters or the control plane are down.
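In K8ssandra-based deployments, per-datacenter Reaper deployment is typically controlled through the Reaper deployment mode on the K8ssandraCluster object. A minimal sketch, assuming the k8ssandra-operator `reaper.deploymentMode` field and hypothetical cluster and datacenter names; apply this kind of change while the control plane is healthy, since K8ssandraCluster edits are ignored during an outage:

```yaml
apiVersion: k8ssandra.io/v1alpha1
kind: K8ssandraCluster
metadata:
  name: demo-cluster              # hypothetical cluster name
spec:
  reaper:
    deploymentMode: PER_DC        # one Reaper per datacenter instead of a single shared instance
  cassandra:
    datacenters:
      - metadata:
          name: dc1               # hypothetical datacenter names
        size: 3
      - metadata:
          name: dc2
        size: 3
```

With `PER_DC` mode, each datacenter keeps its own Reaper instance, so repairs can continue in any datacenter that remains reachable.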
Available cass-operator resources
The cass-operator manages the following resources at the datacenter level:
- CassandraDatacenter: Defines individual datacenters and their configurations, including size, rack definitions, and storage
- CassandraTask: Defines maintenance tasks for database datacenters
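For orientation, a minimal CassandraDatacenter manifest looks roughly like the following sketch. Field names follow the cass-operator CRD; the datacenter name, cluster name, server version, and storage class are hypothetical:

```yaml
apiVersion: cassandra.datastax.com/v1beta1
kind: CassandraDatacenter
metadata:
  name: dc1                        # hypothetical datacenter name
spec:
  clusterName: demo-cluster        # logical database cluster this datacenter belongs to
  serverType: cassandra
  serverVersion: "4.1.2"           # hypothetical version
  size: 3                          # total number of nodes across all racks
  racks:
    - name: rack1
    - name: rack2
    - name: rack3
  storageConfig:
    cassandraDataVolumeClaimSpec:
      storageClassName: standard   # hypothetical storage class
      accessModes:
        - ReadWriteOnce
      resources:
        requests:
          storage: 10Gi
```

The size, rack definitions, and storage settings shown here are the datacenter-level configuration that the sections below read and modify.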
Update CassandraDatacenter resources
List all CassandraDatacenter resources:
kubectl get cassandradatacenter -A
View details of a CassandraDatacenter resource:
kubectl describe cassandradatacenter DATACENTER_NAME -n NAMESPACE
Replace the following:
- DATACENTER_NAME: The name of the CassandraDatacenter
- NAMESPACE: The namespace where the CassandraDatacenter is deployed
When the control plane becomes available again, it might reconcile data plane resources and overwrite your direct changes. Document any manual changes made during the outage so that they are properly incorporated when the control plane is restored.
Modify a CassandraDatacenter resource:
kubectl edit cassandradatacenter DATACENTER_NAME -n NAMESPACE
Replace the following:
- DATACENTER_NAME: The name of the CassandraDatacenter
- NAMESPACE: The namespace where the CassandraDatacenter is deployed
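For example, a common outage-time edit is scaling a datacenter by changing spec.size. A hedged sketch of the relevant fragment as it appears in the editor, with hypothetical sizes:

```yaml
# Fragment of a CassandraDatacenter spec as seen in `kubectl edit`.
# Changing spec.size from 3 to 4 causes cass-operator to provision one more node.
spec:
  size: 4   # previously 3; cass-operator reconciles the datacenter to the new size
```

Record any such edit, since the control plane may reconcile the resource after it recovers.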
Trigger a rolling restart
To trigger a rolling restart, you must create a CassandraTask resource with the restart command.
You can restart the entire datacenter or add an argument to restart a specific rack.
To restart the entire datacenter:

apiVersion: control.k8ssandra.io/v1alpha1
kind: CassandraTask
metadata:
  name: restart-task
spec:
  datacenter:
    name: DATACENTER_NAME
    namespace: cass-operator
  jobs:
    - name: JOB_NAME
      command: restart

Replace the following:
- DATACENTER_NAME: The name of the CassandraDatacenter to restart
- JOB_NAME: The name of the job

To restart a specific rack instead, add a rack argument:
apiVersion: control.k8ssandra.io/v1alpha1
kind: CassandraTask
metadata:
  name: restart-task
spec:
  datacenter:
    name: DATACENTER_NAME
    namespace: cass-operator
  jobs:
    - name: JOB_NAME
      command: restart
      args:
        rack: RACK_NAME
Replace the following:
- DATACENTER_NAME: The name of the CassandraDatacenter to restart
- JOB_NAME: The name of the job
- RACK_NAME: The name of the rack to restart
Apply the restart task:
kubectl apply -f RESTART_TASK_FILENAME.yaml
Replace RESTART_TASK_FILENAME.yaml with the name of the restart task file.
For more information on CassandraDatacenter resources, see the CassandraDatacenter CRD reference in the K8ssandra documentation.
Create a CassandraTask
CassandraTask resources define maintenance tasks for database clusters, such as rebuilds and restarts.
You can create these tasks directly.
Supported tasks include:
- rebuild: Rebuild a node
- cleanup: Clean up a node
- restart: Restart a node
- replacenode: Replace a node
- upgradesstables: Upgrade SSTables
- scrub: Scrub a node
- compaction: Compact a node
- move: Move a node
- flush: Flush a node
- garbagecollect: Garbage collect a node
- refresh: Refresh a node
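For instance, creating a cleanup task can be sketched as a small script that writes the manifest and then applies it. The datacenter, namespace, task, and job names are hypothetical, and the kubectl apply step is shown commented out because it requires access to a live data plane cluster:

```shell
#!/bin/sh
# Write a CassandraTask manifest that runs a cleanup on a hypothetical
# datacenter named dc1 in the cass-operator namespace.
cat > cleanup-dc1-task.yaml <<'EOF'
apiVersion: control.k8ssandra.io/v1alpha1
kind: CassandraTask
metadata:
  name: cleanup-dc1
spec:
  datacenter:
    name: dc1
    namespace: cass-operator
  jobs:
    - name: cleanup-dc1-job
      command: cleanup
EOF

# Apply the task against the data plane context (requires cluster access):
# kubectl apply -f cleanup-dc1-task.yaml
```

The same pattern applies to any of the commands listed above; only the command value and the job arguments change.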
For example, to create a task to replace a node, you can use the following YAML:
apiVersion: control.k8ssandra.io/v1alpha1
kind: CassandraTask
metadata:
  name: REPLACE_TASK_NAME
spec:
  datacenter:
    name: DATACENTER_NAME
    namespace: cass-operator
  jobs:
    - name: JOB_NAME
      command: replacenode
      args:
        pod_name: POD_NAME

Replace the following:
- REPLACE_TASK_NAME: The name of the replace task
- DATACENTER_NAME: The name of the CassandraDatacenter
- JOB_NAME: The name of the job
- POD_NAME: The name of the pod to replace
Apply the replace task:
kubectl apply -f REPLACE_TASK_FILENAME.yaml
Replace REPLACE_TASK_FILENAME.yaml with the name of the replace task file.
For more information on CassandraTask resources, see the CassandraTask CRD reference in the K8ssandra documentation.
Best practices
Follow these best practices to ensure a smooth recovery process:
Before making changes
- Document all manual changes
- Create backups of critical resources
- Test changes in a non-production environment if possible
During the outage
- Make only necessary changes
- Keep detailed logs of all modifications
- Coordinate changes with team members
After control plane recovery
- Verify all changes are properly synchronized
- Update documentation
- Review logs for any inconsistencies
Recovery procedures
After the control plane is restored, complete the following steps:
- Verify control plane connectivity
- Review the changes made during the outage
- Synchronize configurations if needed
- Test cluster functionality
- Update documentation with any permanent changes