Configure very large clusters

OpsCenter can manage very large clusters that contain hundreds of nodes, but with the default settings its performance degrades on clusters of that size. To improve performance, adjust the cluster settings to increase the time between polls of a cluster's nodes and token lists.

OpsCenter performs best when monitoring 200 or fewer nodes. DataStax recommends a dedicated OpsCenter instance for each cluster with more than 200 nodes.

Mission Control is the successor to OpsCenter and was designed with enterprise scale in mind. Consider evaluating Mission Control if you are hitting the limits of OpsCenter. For information about Mission Control, see Intro to Mission Control.

Lifecycle Manager can provision and manage nodes per cluster within its UI. See Supported capabilities for more details.

Change default settings to improve the performance of very large clusters

Locate the cluster_name.conf file. The location of this file depends on the type of installation:
- Package installations: /etc/opscenter/clusters/cluster_name.conf
- Tarball installations: install_location/conf/clusters/cluster_name.conf
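If you script these changes, the location rules above (and the matching ones later in this procedure for opscenterd.conf and address.yaml) can be captured in a small helper. This is a sketch only: the function name opsc_conf_path is hypothetical, the cluster name is whatever your cluster is called, and install_location stands for the directory where you unpacked the tarball (for address.yaml, the agent's own tarball directory).

```shell
# Hypothetical helper: print the path of a config file edited in this procedure.
#   $1  install type: "package" or "tarball"
#   $2  which file:   "cluster", "opscenterd", or "agent" (address.yaml)
#   $3  cluster name (used only for "cluster")
#   $4  install location (used only for "tarball")
opsc_conf_path() {
  install_type=$1 file=$2 cluster_name=$3 install_location=$4
  case "$install_type:$file" in
    package:cluster)    echo "/etc/opscenter/clusters/$cluster_name.conf" ;;
    package:opscenterd) echo "/etc/opscenter/opscenterd.conf" ;;
    package:agent)      echo "/var/lib/datastax-agent/conf/address.yaml" ;;
    tarball:cluster)    echo "$install_location/conf/clusters/$cluster_name.conf" ;;
    tarball:opscenterd) echo "$install_location/conf/opscenterd.conf" ;;
    tarball:agent)      echo "$install_location/conf/address.yaml" ;;
    *) echo "unknown install type or file: $install_type $file" >&2; return 1 ;;
  esac
}
```

For example, opsc_conf_path package cluster prod prints /etc/opscenter/clusters/prod.conf for a cluster named prod (a placeholder name).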
Open cluster_name.conf for editing.
1. Increase the node list poll period to 30 minutes by setting the nodelist_poll_period option to 1800 under [collection]:
  - [collection] nodelist_poll_period
    
    The interval, in seconds, between OpsCenter polls of the nodes in a cluster. The default value is 30.
    
    [collection]
    nodelist_poll_period = 1800
2. If an agent is overloaded, increase http_timeout from its default value:
  - [agents] http_timeout
    
    The timeout, in seconds, for an HTTP call to the agent. The default value is 10.
    
    [agents]
    http_timeout = 20
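Both edits to cluster_name.conf can also be made from a script. The sketch below is one way to do it, assuming the file uses simple "key = value" lines; the function name ini_set is hypothetical, and the sketch does not handle inline comments or multi-line values. It replaces the key if it already exists in the section, adds it to the section otherwise, and appends a new section if the section is missing.

```shell
# Hypothetical helper: set KEY = VALUE in [SECTION] of an INI-style file.
#   usage: ini_set FILE SECTION KEY VALUE
ini_set() {
  file=$1
  tmp=$(mktemp) || return 1
  awk -v sec="[$2]" -v key="$3" -v val="$4" '
    {
      line = $0
      gsub(/^[ \t]+|[ \t]+$/, "", line)      # trimmed copy for matching
    }
    line ~ /^\[.*\]$/ {
      # Leaving the target section without having seen the key: add it here.
      if (in_sec && !done) { print key " = " val; done = 1 }
      in_sec = (line == sec)
      if (in_sec) seen_sec = 1
      print; next
    }
    in_sec && !done {
      eq = index(line, "=")
      k = substr(line, 1, eq - 1); gsub(/[ \t]+$/, "", k)
      # Existing key in the target section: replace the whole line.
      if (eq > 0 && k == key) { print key " = " val; done = 1; next }
    }
    { print }
    END {
      if (in_sec && !done) { print key " = " val; done = 1 }
      if (!seen_sec) { print ""; print sec; print key " = " val }
    }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat preserves owner and mode of the original
  status=$?
  rm -f "$tmp"
  return $status
}
```

For example, running ini_set /etc/opscenter/clusters/prod.conf collection nodelist_poll_period 1800 and then ini_set /etc/opscenter/clusters/prod.conf agents http_timeout 20 applies both changes above to a package installation whose cluster is named prod (a placeholder). Keep a backup of the file before running it.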
Set the following properties in the [agent_config] section of cluster_name.conf on the opscenterd machine; the properties then propagate automatically to all DataStax Agents.

By contrast, if you set these options in address.yaml instead, you must set them on every node.

Increasing the [agent_config] values slows how quickly the UI displays node status updates, such as in the Ring view.
- [agent_config] realtime_interval
  
  The length of time, in seconds, between polling attempts to capture rapidly changing realtime information. Default value: 5
  
  [agent_config]
  realtime_interval = 40
- [agent_config] shorttime_interval
  
  The length of time, in seconds, between polling attempts to capture information that changes frequently. Default value: 10
  
  [agent_config]
  shorttime_interval = 80
- [agent_config] longtime_interval
  
  The length of time, in seconds, between polling attempts to capture information that changes infrequently. Default value: 300
  
  [agent_config]
  longtime_interval = 2400
- [agent_config] status_reporting_interval
  
  The length of time, in seconds, between sending agent health information. Default value: 20
  
  [agent_config]
  status_reporting_interval = 160
- [agent_config] disk_usage_update_period
  
  The length of time, in seconds, to wait between attempts to poll the disk for usage. Agents send storage capacity data to opscenterd based on the disk_usage_update_period value in address.yaml or in cluster_name.conf. Default value: 30
  
  [agent_config]
  disk_usage_update_period = 240
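Each example value in this list is exactly eight times its default (5 to 40, 10 to 80, 300 to 2400, 20 to 160, 30 to 240). The text does not state a scaling rule, so treat the factor as an observation about these examples, not a recommendation. The sketch below prints an [agent_config] block for any integer factor; the variable SCALE and the function name are assumptions for illustration.

```shell
# Print an [agent_config] block with every interval set to SCALE times its
# default. SCALE=8 reproduces the example values above; the factor is an
# assumption to tune for your cluster, not a documented rule.
SCALE=${SCALE:-8}

agent_config_block() {
  echo "[agent_config]"
  # option name and default value in seconds, as listed above
  while read -r name default; do
    echo "$name = $((default * SCALE))"
  done <<EOF
realtime_interval 5
shorttime_interval 10
longtime_interval 300
status_reporting_interval 20
disk_usage_update_period 30
EOF
}

agent_config_block
```

With the default SCALE of 8, the output matches the five example settings above, one per line under a single [agent_config] header, ready to merge into cluster_name.conf.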
Locate the opscenterd.conf file. The location of this file depends on the type of installation:
- Package installations: /etc/opscenter/opscenterd.conf
- Tarball installations: install_location/conf/opscenterd.conf
Open opscenterd.conf for editing and adjust the following settings:
- [agents] not_seen_threshold
  
  The maximum time, in seconds, that can pass after an agent last reported status for a specific connection (such as stomp) before OpsCenter considers that agent connection down. This threshold also affects how long OpsCenter waits before marking node health as unknown. Default value: 180 seconds.
- [agents] http_poll_period
  
  The interval, in seconds, between attempts to poll agent HTTP health. Default value: 60 seconds.
- [ui] default_api_timeout
  
  The default timeout value in seconds for an API call from the OpsCenter UI to the OpsCenter API. Default value: 30 seconds. Some API calls require a timeout longer than 30 seconds. In those cases, the API call timeouts are scaled relative to the default_api_timeout (for example, 6 * default_api_timeout). Changing the default_api_timeout affects those timeouts accordingly. For example:
  [agents]
  not_seen_threshold = 620
  http_poll_period = 500
  
  [ui]
  default_api_timeout = 60
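The scaling of the longer API timeouts is plain multiplication. This sketch uses the 6 * default_api_timeout case given above as the example multiplier; which API calls use which multiplier is not listed here, so the numbers are illustrative only, and the function name is hypothetical.

```shell
# Effective timeout, in seconds, of an API call whose timeout is scaled
# relative to default_api_timeout (for example, 6 * default_api_timeout).
scaled_api_timeout() {
  mult=$1 base=$2
  echo $((mult * base))
}

for t in 30 60; do
  echo "default_api_timeout=$t -> 6x call times out after $(scaled_api_timeout 6 "$t") s"
done
```

With the default of 30 seconds, a call scaled by 6 times out after 180 seconds; raising default_api_timeout to 60 as in the example above raises that call's timeout to 360 seconds.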
Locate the address.yaml file. The location of this file depends on the type of installation:
- Package installations: /var/lib/datastax-agent/conf/address.yaml
- Tarball installations: install_location/conf/address.yaml
Open address.yaml for editing, and adjust the following setting to override the default of 30 seconds:
- stomp-setup-timeout
  
  The maximum time, in seconds, before the agents try to reconnect via stomp. Increasing this timeout to 120 seconds gives agents more time to reconnect after OpsCenter restarts in a large cluster.
  stomp-setup-timeout: 120
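Because address.yaml lives on each node, this change is made once per node. A minimal sketch for setting the key from a script, assuming a flat address.yaml in which stomp-setup-timeout, if present, appears unindented at the start of a line; the function name is hypothetical. It replaces an existing line for the key or appends one.

```shell
# Hypothetical helper: set a top-level "key: value" entry in a flat YAML file
# such as address.yaml, replacing the line if the key exists, else appending it.
yaml_set_top_key() {
  file=$1 key=$2 value=$3
  tmp=$(mktemp) || return 1
  awk -v key="$key" -v val="$value" '
    # Match only unindented lines that start with "key:"
    index($0, key ":") == 1 { print key ": " val; done = 1; next }
    { print }
    END { if (!done) print key ": " val }
  ' "$file" > "$tmp" && cat "$tmp" > "$file"   # cat preserves owner and mode of the original
  status=$?
  rm -f "$tmp"
  return $status
}
```

For example, yaml_set_top_key /var/lib/datastax-agent/conf/address.yaml stomp-setup-timeout 120 applies the setting on a package installation; run it on every node, using whatever remote-execution tooling you already have.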
Use the environment variable OPSC_JVM_OPTS to override the default parameters for the OpsCenter JVM.

The following command doubles the heap size to 4096m (4GB).
```
export OPSC_JVM_OPTS=-Xmx4096m
```
See Configuring the OpsCenter JVM for additional information.
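The export command sets OPSC_JVM_OPTS only for the current shell and the processes started from it, so OpsCenter sees the value only when it is started from an environment where the variable is set. To check which maximum heap the options request before restarting, this sketch prints the value of each -Xmx flag it finds; the function name is hypothetical.

```shell
# Print the maximum-heap value of each -Xmx flag in OPSC_JVM_OPTS
# (or in the string passed as $1), one per line.
xmx_from_opts() {
  opts=${1-$OPSC_JVM_OPTS}
  for opt in $opts; do          # split on whitespace on purpose
    case "$opt" in
      -Xmx*) echo "${opt#-Xmx}" ;;
    esac
  done
}

export OPSC_JVM_OPTS=-Xmx4096m
xmx_from_opts   # prints 4096m
```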
If you continually receive OutOfMemory errors, consider also adjusting the DataStax Agent JVM; see Configure the DataStax Agent JVM.
Restart OpsCenter.
