Performing Cluster Operations
Cluster operations include initiating administrative actions (such as garbage collection) on nodes in a Cassandra or DSE cluster, rebalancing a cluster, and managing API requests sent to a cluster.
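Operation | Endpoint |
---|---|
Perform bulk operations. | POST /{cluster_id}/ops |
Initiate JVM garbage collection on a node. | GET /{cluster_id}/ops/gc/{node_ip} |
Assign a new token to the node. | PUT /{cluster_id}/ops/move/{node_ip} |
Drain a node. | GET /{cluster_id}/ops/drain/{node_ip} |
Decommission a node. | POST /{cluster_id}/ops/decommission/{node_ip} |
Clean up a keyspace. | POST /{cluster_id}/ops/cleanup/{node_ip}/{ks_name} |
Flush memtables from a keyspace. | POST /{cluster_id}/ops/flush/{node_ip}/{ks_name} |
Repair a keyspace. | POST /{cluster_id}/ops/repair/{node_ip}/{ks_name} |
Compact a keyspace. | POST /{cluster_id}/ops/compact/{node_ip}/{ks_name} |
Start Cassandra/DSE on a node. | POST /{cluster_id}/ops/start/{node_ip} |
Stop Cassandra/DSE on a node. | POST /{cluster_id}/ops/stop/{node_ip} |
Restart Cassandra/DSE on a node. | POST /{cluster_id}/ops/restart/{node_ip} |
Perform a rolling restart of the cluster. | POST /{cluster_id}/ops/restart |
List moves to balance a cluster. | GET /{cluster_id}/ops/rebalance |
Run a list of moves to balance a cluster. | POST /{cluster_id}/ops/rebalance |
Get the status of cluster services. | GET /{cluster_id}/services |
Turn on the cluster repair service. | POST /{cluster_id}/services/repair |
Turn off the cluster repair service. | DELETE /{cluster_id}/services/repair |
Get the status of the repair service. | GET /{cluster_id}/services/repair |
Get a summary of the repair service progress. | GET /{cluster_id}/repair-status |
Get details of the repair service progress. | GET /{cluster_id}/repair-details |
Turn on/off NodeSync for tables. | POST /{cluster_id}/nodesync |
Get the NodeSync status. | GET /{cluster_id}/nodesync |
Get the summary of NodeSync progress. | GET /{cluster_id}/nodesync/summary |
Get per-keyspace summary of NodeSync progress. | GET /{cluster_id}/nodesync/summary/keyspace |
Get per-table summary of NodeSync progress. | GET /{cluster_id}/nodesync/summary/table |
Get per-table summary of NodeSync progress in keyspace. | GET /{cluster_id}/nodesync/summary/table/{keyspace} |
Get the status of a long-running request. | GET /request/{request_id}/status |
Cancel a request. | POST /request/{request_id}/cancel |
List requests of a specific type. | GET /{cluster_id}/request/{request_type} |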
Node Administration Methods
POST /{cluster_id}/ops
Initiate a bulk set of operations on one or more nodes.
Body: A JSON dictionary with the following keys:
- ips: List of IP addresses of the nodes on which the operation will run.
- action: The operation to perform on each node. Valid values are cleanup, compact, flush, perform_gc, repair, restart, start, and stop.
- is_rolling: A boolean indicating whether the jobs run in a rolling fashion rather than in parallel.
- sleep: Seconds to wait between each grouping of jobs. Default is 60.
- args: List of arguments to pass to each operation.
Returns a Request ID.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops \
-d '{"ips":["127.0.0.1"],"action":"cleanup", "is_rolling": true, "sleep": 1, "args":["OpsCenter", "events"]}'
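The body for a bulk operation can be assembled and checked before it is sent. The following is a minimal sketch, not part of OpsCenter: the helper name `build_bulk_ops_body` is hypothetical, and the client-side validation (rejecting unknown actions and empty IP lists) is an assumption about what is useful to check locally, not documented server behavior.

```python
import json

# Action names copied from the list of valid values above.
VALID_ACTIONS = {"cleanup", "compact", "flush", "perform_gc",
                 "repair", "restart", "start", "stop"}


def build_bulk_ops_body(ips, action, is_rolling=True, sleep=60, args=None):
    """Return the JSON body for POST /{cluster_id}/ops as a string."""
    if action not in VALID_ACTIONS:
        raise ValueError(f"unsupported action: {action!r}")
    if not ips:
        # Assumption: an empty node list is treated as a caller error here.
        raise ValueError("at least one node IP is required")
    body = {
        "ips": list(ips),
        "action": action,
        "is_rolling": bool(is_rolling),
        "sleep": int(sleep),          # seconds between groupings of jobs
        "args": list(args or []),
    }
    return json.dumps(body)


# Same request as the curl example above.
body = build_bulk_ops_body(["127.0.0.1"], "cleanup", sleep=1,
                           args=["OpsCenter", "events"])
print(body)
```

The resulting string can be passed to any HTTP client as the request body.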
GET /{cluster_id}/ops/gc/{node_ip}
Initiate JVM garbage collection on a node.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the target node.
Returns null.
Example:
curl -X GET http://127.0.0.1:8888/Test_Cluster/ops/gc/1.2.3.4
PUT /{cluster_id}/ops/move/{node_ip}
Assign a new token to the node.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node to be assigned a new token.
Body: New token to assign to the node.
Returns a Request ID.
Example:
curl -X PUT http://127.0.0.1:8888/Test_Cluster/ops/move/10.11.12.72 \
-d '"85070591730234615865843651857942052864"'
Output:
"72ff69b2-9cf5-4777-a600-9173b3fe7e6a"
GET /{cluster_id}/ops/drain/{node_ip}
Initiate a drain operation to flush all memtables from the node.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node to be flushed of memtables.
Returns null.
Example:
curl -X GET http://127.0.0.1:8888/Test_Cluster/ops/drain/1.2.3.4
POST /{cluster_id}/ops/decommission/{node_ip}
Initiate decommissioning of a node.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node to be decommissioned.
Returns null.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/decommission/1.2.3.4
POST /{cluster_id}/ops/cleanup/{node_ip}/{ks_name}
Initiate a cleanup operation for the specified keyspace.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node that initiates cleanup of the keyspace.
- ks_name: Name of the keyspace to be cleaned up. If empty, all keyspaces are cleaned up.
Body: List of tables to clean up. If empty, all tables are cleaned up.
Returns null.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/cleanup/1.2.3.4/Keyspace1 \
-d '["ColFam1", "ColFam2"]'
POST /{cluster_id}/ops/flush/{node_ip}/{ks_name}
Flush memtables for a keyspace.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node whose memtables are flushed for the keyspace.
- ks_name: Keyspace of the memtables to be flushed. If empty, all keyspaces are flushed.
Body: List of tables to flush. If empty, all tables are flushed.
Returns null.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/flush/1.2.3.4/Keyspace1 \
-d '["ColFam1", "ColFam2"]'
POST /{cluster_id}/ops/repair/{node_ip}/{ks_name}
Initiate repair of a keyspace.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node that initiates the repair.
- ks_name: Keyspace to be repaired.
Body: A JSON dictionary with the following keys:
- is_sequential: A boolean indicating whether to run the repair sequentially. Default is true.
- is_local: A boolean indicating whether to use only nodes in the same datacenter during the repair. Default is false.
- primary_range: A boolean indicating whether to repair only the primary range of the node; if false, all ranges are repaired. Default is false.
- cfs: List of tables (column families) to repair. If empty, all tables are repaired.
Returns null.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/repair/1.2.3.4/Keyspace1 \
-d '{"is_sequential": false, "cfs":["ColFam1", "ColFam2"]}'
POST /{cluster_id}/ops/compact/{node_ip}/{ks_name}
Initiate a major compaction on a keyspace.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node that initiates the compaction.
- ks_name: Keyspace to be compacted. If empty, all keyspaces are compacted.
Body: List of tables to compact. If empty, all tables are compacted.
Returns null.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/compact/1.2.3.4/Keyspace1 \
-d '["ColFam1", "ColFam2"]'
Process Management Methods
POST /{cluster_id}/ops/start/{node_ip}
Start the Cassandra/DSE process on a single node.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node to be started.
Returns a Request ID.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/start/10.11.12.72
Output:
"a34814a6-4896-11e2-a563-e0b9a54a6d93"
POST /{cluster_id}/ops/stop/{node_ip}
Stop the Cassandra/DSE process on a single node.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node to be stopped.
Body: A JSON dictionary with an optional key:
- drain_first: A boolean indicating whether to perform a drain operation before stopping the node.
Returns a Request ID.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/stop/10.11.12.72 \
-d '{"drain_first": true}'
Output:
"c0d81d54-4896-11e2-a563-e0b9a54a6d93"
POST /{cluster_id}/ops/restart/{node_ip}
Restart the Cassandra/DSE process on a single node.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- node_ip: IP address of the node to be restarted.
Body: A JSON dictionary with two optional keys:
- wait_for_cassandra: A boolean indicating whether the asynchronous request should wait until Cassandra/DSE is fully started before it completes.
- drain_first: A boolean indicating whether to perform a drain operation before stopping the node.
Returns a Request ID.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/restart/10.11.12.72 \
-d '{"wait_for_cassandra": true, "drain_first": true}'
Output:
"e2212500-4896-11e2-a563-e0b9a54a6d93"
POST /{cluster_id}/ops/restart
Perform a rolling restart of the entire cluster or a select list of nodes.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
Body: A JSON dictionary with three optional keys:
- sleep: Time in seconds to sleep between restarting each node. Default is 60.
- ips: A list of IP addresses of the nodes to restart. If left empty (the default), all nodes are restarted.
- drain_first: A boolean indicating whether to perform a drain operation before stopping each node.
Returns a Request ID.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/restart
Output:
"e2212500-4896-11e2-a563-e0b9a54a6d93"
Cluster Rebalancing Methods
GET /{cluster_id}/ops/rebalance
Returns a list of proposed moves to run to balance a cluster. Throws an error if called on a cluster using vnodes.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
Returns a list of moves, where each move is a token and the IP address of its assigned node. The result of this call can be passed to POST /{cluster_id}/ops/rebalance.
Example:
curl http://127.0.0.1:8888/Test_Cluster/ops/rebalance
Output:
[
[
"85070591730234615865843651857942052864",
"10.11.12.152"
]
]
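For orientation, the sample token above equals 2**126. That is the value an evenly spaced two-node ring would assign to its second node if the cluster used RandomPartitioner, whose token space is 0 to 2**127 - 1. The sketch below only illustrates that arithmetic. The partitioner is an assumption (the text above does not name it), `balanced_tokens` is a hypothetical helper, and this is not necessarily how OpsCenter computes its proposed moves.

```python
# Assumption: RandomPartitioner, whose tokens range over 0 .. 2**127 - 1.
RING_SIZE = 2 ** 127


def balanced_tokens(node_count):
    """Return evenly spaced tokens for a ring with node_count nodes."""
    if node_count < 1:
        raise ValueError("node_count must be at least 1")
    # Integer arithmetic keeps the very large token values exact.
    return [i * RING_SIZE // node_count for i in range(node_count)]


tokens = balanced_tokens(2)
print([str(t) for t in tokens])
```

Tokens are sent to the API as strings, which is why the example converts them with `str`.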
POST /{cluster_id}/ops/rebalance
Run the specified list of moves to balance a cluster. Throws an error if called on a cluster using vnodes.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
Optional parameters:
- sleep: Number of seconds to wait between each move.
Body: A list of moves to run to balance this cluster. This is typically the result of GET /{cluster_id}/ops/rebalance.
Returns a Request ID for determining the status of, or cancelling, a running rebalance.
Example:
curl -X POST http://127.0.0.1:8888/Test_Cluster/ops/rebalance \
-d '[["85070591730234615865843651857942052864", "10.11.12.152"]]'
Output:
"e330b179-1b9f-40c2-a2f5-d2f3d24aa85c"
Cluster Services
GET /{cluster_id}/services
Get the status of cluster services.
Returns a dictionary with service names as keys and the status, parameters, and associated activity or progress of the service as the values.
Example:
curl "http://localhost:8888/Test_Cluster/services"
Output:
{
"repair": {
"progress": {
"completed": 26,
"total": 256
},
"status": {
"parameters": {
"time_to_completion": 100000
},
"status": true
}
}
}
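As an illustration of reading this response, the sketch below parses the sample output shown above and derives a completion percentage for the repair service. The helper name `repair_progress_percent` is hypothetical, and treating completed / total as a percentage is an interpretation of the `progress` fields, not something the API itself returns.

```python
import json

# The sample response from the example above.
sample = """
{
  "repair": {
    "progress": {"completed": 26, "total": 256},
    "status": {"parameters": {"time_to_completion": 100000}, "status": true}
  }
}
"""


def repair_progress_percent(services):
    """Return repair progress as a percentage of completed over total."""
    progress = services["repair"]["progress"]
    total = progress["total"]
    if total == 0:
        return 0.0  # nothing scheduled yet; avoid dividing by zero
    return 100.0 * progress["completed"] / total


services = json.loads(sample)
percent = repair_progress_percent(services)
print(f"repair service: {percent:.1f}% complete")
```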
Cluster Repair Service
POST /{cluster_id}/services/repair
Start the cluster repair service with the given parameters.
Body: A dictionary of repair service parameters:
- time_to_completion: The time in seconds to complete a repair cycle of the entire cluster. For example, 864000 (10 days).
DELETE /{cluster_id}/services/repair
Stop the cluster repair service.
GET /{cluster_id}/services/repair
Get the status of the repair service.
Returns a dictionary describing the status and parameters of the service.
Example:
curl "http://127.0.0.1:8888/Test_Cluster/services/repair"
{
"status": true,
"parameters": {"time_to_completion": 100000}
}
GET /{cluster_id}/repair-status
Get a status summary of the repair service progress.
Returns a progress summary for the current repair cycle, including total counts of pending, in-progress, failed, and completed repairs.
Example:
curl "http://127.0.0.1:8888/Test_Cluster/repair-status"
{
"config": {
"cluster_stabilization_period": "30",
"error_logging_window": "86400",
"ignore_keyspaces": ""
},
"status": "active",
"time_to_completion": 777600,
"overview": {
"completed": 36,
"failed": 0,
"in_progress": 1,
"remaining": 19,
"repair_times": {
"50": 1,
"75": 1,
"90": 1,
"99": 5,
"average": 1.3611111111111112,
"max": 7,
"min": 1
},
"total": 56
},
"incremental": {
"completed": 8,
"completed_bytes": 40000,
"estimated_time": 0,
"job_state": "success",
"last_repair_ts": 0,
"remaining": 0,
"remaining_bytes": 0,
"throughput": 1.0,
"throughput_bytes": 5000,
"total": 8,
"total_bytes": 40000,
"ttc_remaining": 777329
},
"subrange": {
"completed": 28,
"completed_bytes": 445648829,
"estimated_time": 190,
"job_state": "running",
"last_repair_ts": 0,
"remaining": 19,
"remaining_bytes": 164736194,
"throughput": 0.6829268292682927,
"throughput_bytes": 11141095,
"total": 48,
"total_bytes": 610390023,
"ttc_remaining": 777329
},
"details": {
"OpsCenter.backup_reports": {
"attempts": 0,
"average_time": 0,
"state": {
"aborted": 0,
"failure": 0,
"pending": 4,
"running": 0,
"success": 0
},
"time": 0,
"type": "incremental"
}
}
}
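The `overview` block can be sanity-checked on the client side. The sketch below uses the numbers from the sample above and assumes (this is an inference from the sample, not a documented guarantee) that completed, failed, in_progress, and remaining add up to total. The helper name `summarize_overview` is hypothetical.

```python
# The "overview" values from the sample response above.
overview = {
    "completed": 36,
    "failed": 0,
    "in_progress": 1,
    "remaining": 19,
    "total": 56,
}


def summarize_overview(ov):
    """Return (accounted_for, percent_completed) for an overview block."""
    accounted = (ov["completed"] + ov["failed"]
                 + ov["in_progress"] + ov["remaining"])
    percent = 100.0 * ov["completed"] / ov["total"] if ov["total"] else 0.0
    return accounted, percent


accounted, percent = summarize_overview(overview)
print(f"{accounted} of {overview['total']} repairs accounted for, "
      f"{percent:.1f}% completed")
```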
GET /{cluster_id}/repair-details
Get a detailed list of the current cycle's repairs.
Optional parameters:
- keyspace: Limits results to the specified keyspace.
- table: Limits results to the specified table.
Returns a detailed list of every repair in the current repair cycle and its present status.
Example:
curl "http://127.0.0.1:8888/Test_Cluster/repair-details?keyspace=myks&table=mytable"
[
{
"attempts": 0,
"executing": false,
"ksname": "blackhat",
"last_error": "",
"node": "127.0.0.4",
"repair_range": [
"0",
"4611686018427387904"
],
"size": 281480,
"start_ts": 1492099955.84,
"tables": [
"cc"
],
"time": 1,
"type": "subrange"
},
{
"attempts": 0,
"executing": false,
"ksname": "OpsCenter",
"last_error": "",
"node": "127.0.0.2",
"size": 5000,
"start_ts": 1492099102.985,
"table": "settings",
"time": 1,
"type": "incremental"
}
]
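Because both filters are optional, the request URL is best built by including only the parameters that are set. A minimal sketch using the standard library follows; the helper name `repair_details_url` is hypothetical, and the host and port are simply those used in the examples here.

```python
from urllib.parse import urlencode


def repair_details_url(base_url, cluster_id, keyspace=None, table=None):
    """Build the repair-details URL, adding only the filters that are set."""
    candidates = (("keyspace", keyspace), ("table", table))
    params = {name: value for name, value in candidates if value is not None}
    url = f"{base_url}/{cluster_id}/repair-details"
    # urlencode also escapes characters that are not URL-safe.
    return f"{url}?{urlencode(params)}" if params else url


print(repair_details_url("http://127.0.0.1:8888", "Test_Cluster",
                         keyspace="myks", table="mytable"))
print(repair_details_url("http://127.0.0.1:8888", "Test_Cluster"))
```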
NodeSync Service
Terminology:
deadline:
Target for the maximum time between two validations of the same data. As long as the deadline is met, all parts of the ring (for the table) are validated at least that often. The deadline can be set via the deadline_target_sec property of the table, or inferred from the gc_grace_seconds property. The deadline should always be less than or equal to the grace period. As long as the deadline is met, no data is resurrected due to tombstone purging. NodeSync prioritizes segments in order to try to meet the deadline. The next segment to validate at any given time is the one closest to missing its deadline.
segment:
A segment is a small local token range of a table. NodeSync recursively splits local ranges in half a certain number of times (depth) to create segments. The depth is calculated using the total table size, assuming equal distribution of data. Typically, segments cover no more than 200 MB. The token ranges can be no smaller than a single partition, so very large partitions can result in segments larger than the configured size.
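The halving described under "segment" can be sketched as follows. This is an illustration of the idea only, not NodeSync's actual implementation: the function name `split_depth` is hypothetical, "200 MB" is read as 200 MiB, and the input is the estimated size of one local range under the equal-distribution assumption.

```python
# Assumption: the "200 MB" target is interpreted as 200 MiB.
MAX_SEGMENT_BYTES = 200 * 1024 * 1024


def split_depth(range_bytes, max_segment_bytes=MAX_SEGMENT_BYTES):
    """Return how many times a local range must be halved so that each
    resulting segment covers at most max_segment_bytes."""
    depth = 0
    # After `depth` halvings there are 2**depth segments of equal size.
    while range_bytes > max_segment_bytes * 2 ** depth:
        depth += 1
    return depth


range_bytes = 1600 * 1024 * 1024          # a 1600 MiB local range
depth = split_depth(range_bytes)
print(depth, 2 ** depth)                  # depth and resulting segment count
```

A single partition larger than the target cannot be split further, which is why real segments can exceed the configured size.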
POST /{cluster_id}/nodesync
Enable or disable NodeSync and Incremental NodeSync for the specified tables. There is a limit on the maximum number of tables for which NodeSync can be enabled or disabled in one request; see the documentation for the nodesync.max_request_tables configuration parameter. For tables matching multiple sections, the order of precedence is incremental > enable > disable.
Body: A dictionary of parameters.
- incremental: Array of table names for which Incremental NodeSync should be enabled. A table name can be specified as a full name in the form keyspace.table, or as a wildcard: keyspace.* (all tables in the given keyspace) or * (all tables).
- enable: Array of table names for which Standard NodeSync should be enabled. Calling enable on a table with Incremental NodeSync enabled disables Incremental NodeSync and sets the table to Standard NodeSync.
- disable: Array of table names for which NodeSync should be disabled.
{
"incremental": ["test.t0"],
"enable": ["test.t1", "test.t2"],
"disable":["test.t3"]
}
Returns true if the request was accepted, or a JSON object describing the error.
{
"brief": "error",
"message": "Please select fewer tables and try again. To maximize performance, OpsCenter has been configured to limit the number of concurrent updates to 50 tables. The total number of tables to process in this request, after expanding wild card selectors, was 102.",
"type": "InvalidArguments"
}
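The precedence rule (incremental > enable > disable) for tables that appear in more than one section can be sketched as below. This is a client-side illustration with a hypothetical helper name, `resolve_nodesync_actions`. It handles only fully qualified table names, since expanding wildcards would require the cluster's table list, and it makes no claim about how OpsCenter resolves the request internally.

```python
def resolve_nodesync_actions(incremental=(), enable=(), disable=()):
    """Map each table to the single action that wins for it."""
    actions = {}
    # Apply sections from lowest to highest precedence so that a later
    # (higher-precedence) section overwrites an earlier one.
    ordered = (("disable", disable), ("enable", enable),
               ("incremental", incremental))
    for action, tables in ordered:
        for table in tables:
            actions[table] = action
    return actions


resolved = resolve_nodesync_actions(
    incremental=["test.t0"],
    enable=["test.t0", "test.t1"],
    disable=["test.t1", "test.t3"],
)
for table, action in sorted(resolved.items()):
    print(table, action)
```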
GET /{cluster_id}/nodesync
Retrieve the NodeSync status.
Returns a JSON object with the following fields:
- status: Boolean describing whether NodeSync is supported in the cluster. For example, it is false for DSE 5.1, which does not have NodeSync. The actual running status of NodeSync can be obtained via the BestPractice rule check-nodesync-running.
- supports_incremental: Boolean describing whether Incremental NodeSync is supported in the cluster.
- enabled: Array of table names for which NodeSync is enabled.
- incremental: Array of table names for which Incremental NodeSync is enabled. These tables also appear in the enabled list.
- disabled: Array of table names for which NodeSync is disabled.
- pending: Array of table names for which the status of NodeSync is changing (being enabled or disabled).
- ineligible: JSON object with the following fields:
  - system: Array of system table names.
  - rf1: Array of table names in keyspaces with RF=1.
{
"pending": [],
"incremental": [
"dse_insights_local.insights_config"
],
"enabled": [
"dse_system_local.solr_resources",
"dse_insights_local.insights_config"
],
"ineligible": {
"system": [
"system.local",
...
],
"rf1": [
"dse_system_local.solr_resources",
...
]
},
"disabled": [
"test.t40",
...
],
"supports_incremental": true,
"status": true
}
GET /{cluster_id}/nodesync/summary
Get a summary of the NodeSync progress.
Returns a JSON object with the following fields:
- last_updated: Timestamp of the last update of NodeSync status (in seconds).
- segment_sync_counts: Array of 4 numbers:
  - the count of segments whose timestamp occurred at most half of the deadline ago
  - the count of segments whose timestamp occurred at most 90% of the deadline ago
  - the count of segments whose timestamp occurred no older than the deadline
  - the count of segments whose timestamp occurred longer ago than the deadline
- segment_sync_percentages: Array of 4 numbers that express the data in segment_sync_counts as percentages of the total count.
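To make the four buckets concrete, the sketch below classifies segment ages against a deadline and derives the percentages. It assumes the buckets are mutually exclusive (each segment is counted once, in the first bucket it fits), which is an interpretation of the descriptions above rather than a documented fact. The function names are hypothetical.

```python
def segment_sync_counts(ages_sec, deadline_sec):
    """Bucket segment ages (seconds since last validation) into 4 counts."""
    counts = [0, 0, 0, 0]
    for age in ages_sec:
        if age <= 0.5 * deadline_sec:
            counts[0] += 1      # at most half of the deadline ago
        elif age <= 0.9 * deadline_sec:
            counts[1] += 1      # at most 90% of the deadline ago
        elif age <= deadline_sec:
            counts[2] += 1      # no older than the deadline
        else:
            counts[3] += 1      # deadline missed
    return counts


def segment_sync_percentages(counts):
    """Express each count as a percentage of the total count."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [100.0 * c / total for c in counts]


counts = segment_sync_counts([10, 50, 60, 95], deadline_sec=100)
print(counts, segment_sync_percentages(counts))
```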
GET /{cluster_id}/nodesync/summary/keyspace
Get a per-keyspace summary of the NodeSync progress.
Optional parameters:
- page: Page number to retrieve, starting with 1.
- per_page: Number of results to return per page.
Returns a JSON object with the following fields:
- next: Number of the next page, or null if it is the last page.
- per_page: Number of results per page.
- previous: Number of the previous page, or null if it is the first page.
- last: Number of the last page.
- count: Total number of results.
- current: Number of the current page.
- proximate: JSON object describing navigation to the previous and next pages.
- results: Array of JSON objects with the following fields:
  - keyspace: Name of the keyspace.
  - last_updated: Timestamp of the last update of NodeSync status (in seconds).
  - segment_sync_counts: Array of 4 numbers:
    - the count of segments whose timestamp occurred at most half of the deadline ago
    - the count of segments whose timestamp occurred at most 90% of the deadline ago
    - the count of segments whose timestamp occurred no older than the deadline
    - the count of segments whose timestamp occurred longer ago than the deadline
  - segment_sync_percentages: Array of 4 numbers that express the data in segment_sync_counts as percentages of the total count.
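The paginated NodeSync summary responses can be consumed by following the `next` field until it is null. The sketch below shows that loop with the HTTP call abstracted away: `fetch_page` is a hypothetical callable that takes `page` and `per_page` and returns the decoded JSON object, and the canned pages stand in for real responses.

```python
def iter_results(fetch_page, per_page=50):
    """Yield every entry of "results", following "next" until it is null."""
    page = 1  # pages are numbered starting with 1
    while page is not None:
        payload = fetch_page(page, per_page)
        yield from payload["results"]
        page = payload["next"]  # None (JSON null) on the last page


# Canned two-page response standing in for the HTTP call.
_pages = {
    1: {"next": 2, "results": [{"keyspace": "ks1"}, {"keyspace": "ks2"}]},
    2: {"next": None, "results": [{"keyspace": "ks3"}]},
}

names = [row["keyspace"]
         for row in iter_results(lambda page, per_page: _pages[page],
                                 per_page=2)]
print(names)
```

In real use, `fetch_page` would issue the GET request with `page` and `per_page` as parameters and decode the JSON body.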
GET /{cluster_id}/nodesync/summary/table
Get a per-table summary of the NodeSync progress.
Optional parameters:
- page: Page number to retrieve, starting with 1.
- per_page: Number of results to return per page.
Returns a JSON object with the following fields:
- next: Number of the next page, or null if it is the last page.
- per_page: Number of results per page.
- previous: Number of the previous page, or null if it is the first page.
- last: Number of the last page.
- count: Total number of results.
- current: Number of the current page.
- proximate: JSON object describing navigation to the previous and next pages.
- results: Array of JSON objects with the following fields:
  - keyspace: Name of the keyspace.
  - table: Name of the table.
  - last_updated: Timestamp of the last update of NodeSync status (in seconds).
  - segment_sync_counts: Array of 4 numbers:
    - the count of segments whose timestamp occurred at most half of the deadline ago
    - the count of segments whose timestamp occurred at most 90% of the deadline ago
    - the count of segments whose timestamp occurred no older than the deadline
    - the count of segments whose timestamp occurred longer ago than the deadline
  - segment_sync_percentages: Array of 4 numbers that express the data in segment_sync_counts as percentages of the total count.
GET /{cluster_id}/nodesync/summary/table/{keyspace}
Get a per-table summary of the NodeSync progress in a specific keyspace.
Path arguments:
- keyspace: Name of the keyspace.
Optional parameters:
- page: Page number to retrieve, starting with 1.
- per_page: Number of results to return per page.
Returns a JSON object with the following fields:
- next: Number of the next page, or null if it is the last page.
- per_page: Number of results per page.
- previous: Number of the previous page, or null if it is the first page.
- last: Number of the last page.
- count: Total number of results.
- current: Number of the current page.
- proximate: JSON object describing navigation to the previous and next pages.
- results: Array of JSON objects with the following fields:
  - table: Name of the table.
  - last_updated: Timestamp of the last update of NodeSync status (in seconds).
  - segment_sync_counts: Array of 4 numbers:
    - the count of segments whose timestamp occurred at most half of the deadline ago
    - the count of segments whose timestamp occurred at most 90% of the deadline ago
    - the count of segments whose timestamp occurred no older than the deadline
    - the count of segments whose timestamp occurred longer ago than the deadline
  - segment_sync_percentages: Array of 4 numbers that express the data in segment_sync_counts as percentages of the total count.
Request Management Methods
Request
Requests are how OpsCenter tracks potentially long-running operations that must be completed asynchronously. When one of these API calls is made, opscenterd immediately returns a Request ID that can be used to look up the status of the request.
Once a Request is started, you can fetch its status information until opscenterd is restarted or a large number of Requests have been started.
A Request status takes the following form:
{
"id": ID,
"state": STATE,
"started": STARTED,
"finished": FINISHED,
"cluster_id": CLUSTER_ID,
"details": DETAILS
}
Data:
- ID (string): The unique UUID for this Request. When an operation is potentially long-running, opscenterd returns this ID immediately.
- STATE (string): Either "running", "success", or "error".
- STARTED (int): A Unix timestamp representing when the Request started.
- FINISHED (int): A Unix timestamp representing when the Request finished, or null if it has not finished yet.
- CLUSTER_ID (string): The name of the cluster on which the Request is operating.
- DETAILS: Typically a string containing a status or error message, but may be a dictionary in the form {<subrequest_id>: <Request>} when the Request holds a collection of subrequests.
Content Types: JSON
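A client typically polls a Request until its state is no longer "running". The sketch below shows one way to structure that loop. It follows the status form described above (the `state` key), and everything else is an assumption for illustration: `fetch_status` is a hypothetical callable that performs the status lookup and returns the decoded JSON, and the poll interval and timeout defaults are arbitrary.

```python
import time


def wait_for_request(fetch_status, request_id, poll_interval=5.0,
                     timeout=600.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the Request leaves the "running" state; return its status."""
    give_up_at = clock() + timeout
    while True:
        status = fetch_status(request_id)
        if status["state"] != "running":
            return status  # "success" or "error"
        if clock() >= give_up_at:
            raise TimeoutError(
                f"request {request_id} still running after {timeout}s")
        sleep(poll_interval)


# Canned responses standing in for the HTTP status lookup.
_canned = iter([
    {"id": "abc", "state": "running", "finished": None},
    {"id": "abc", "state": "success", "finished": 1334856200},
])
final = wait_for_request(lambda _id: next(_canned), "abc",
                         sleep=lambda seconds: None)
print(final["state"])
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise without real delays.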
GET /request/{request_id}/status
Check the status of an asynchronous request sent to OpsCenter.
Path arguments:
- request_id: The ID returned by the API call that triggered the request.
Returns a dictionary describing the status of the request.
Example:
curl http://127.0.0.1:8888/request/6b6b15aa-df8a-43f1-aab3-efce6b8589e4/status
{
"status": "running",
"started": 1334856122,
"error_message": null,
"finished": null,
"moves": [
{
"status": null,
"ip": "10.100.100.100",
"old": "2",
"new": "85070591730234615865843651857942052864"
}
],
"id": "6b6b15aa-df8a-43f1-aab3-efce6b8589e4"
}
POST /request/{request_id}/cancel
Cancel an asynchronous request sent to OpsCenter. Not all requests can be cancelled.
Path arguments:
- request_id: The ID returned by the API call that triggered the request.
Returns null.
Example:
curl -X POST http://127.0.0.1:8888/request/6b6b15aa-df8a-43f1-aab3-efce6b8589e4/cancel
The request is canceled.
GET /{cluster_id}/request/{request_type}
List requests of a particular type. By default, only the latest request of that type is returned.
Path arguments:
- cluster_id: The ID of a cluster returned from GET /cluster-configs.
- request_type: Either rolling-restart, restore, or bulk-operations.
Query params:
- list_all: A boolean (0 or 1) indicating whether all of the requests should be returned or just the latest. Default is 0 (false).
Returns a Request ID. If list_all is true, returns an array of Request IDs.
Example:
curl -X GET http://127.0.0.1:8888/Test_Cluster/request/rolling-restart
"8f4f71e7-65d3-41a7-bb1a-789af07dbd73"
curl -X GET "http://127.0.0.1:8888/Test_Cluster/request/rolling-restart?list_all=1"
[
"35dd37a5-4170-4694-9253-faa9532d47b6",
"8f4f71e7-65d3-41a7-bb1a-789af07dbd73"
]