Capacity planning and hardware selection for HCD deployments
Follow the capacity planning guidelines in this document to choose the right hardware for your Hyper-Converged Database (HCD) deployment.
General capacity planning guidelines
The following guidelines provide general recommendations for capacity planning:
- Your hardware choices depend on your particular use case. For example, the optimal balance of CPUs, memory, disks, network resources, and the number of nodes will differ significantly between an environment with static data accessed infrequently and one with volatile data accessed frequently.
- The guidelines suggest the minimum requirements. You may need to increase CPU capacity, memory, and disk space beyond the recommended minimums.
- Test your configuration thoroughly before deployment.
For optimal performance in Linux environments, DataStax recommends using the latest version of the Linux operating system. Newer versions of Linux handle highly concurrent workloads more efficiently.
CPUs and cores
In HCD, insert-heavy workloads become CPU-bound before they become memory-bound. All writes go to the commit log, but the database writes so efficiently that the CPU becomes the limiting factor. HCD is highly concurrent and uses all available vCPUs.
Minimum for core HCD workloads
The following table lists the minimum number of cores for production and development environments without storage attached indexing (SAI), high node density, or vector search:
| Environment | Minimum cores | Remarks |
|---|---|---|
| Production | 16 cores (vCPUs) | Use cases vary. More may be needed for higher request throughput, indexing, or higher node density. |
| Development | 4 cores (vCPUs) | Sufficient for non-load testing environments. |
Minimum for SAI, high node density, or vector search
The following table lists the minimum number of cores for production and development environments with storage attached indexing (SAI), high node density, or vector search:
| Environment | Minimum cores | Remarks |
|---|---|---|
| Production | 32 cores (vCPUs) | Use cases vary. Additional cores might be required for higher request throughput, indexing, or higher node density. |
| Development | 8 cores (vCPUs) | Sufficient for non-load testing environments. |
HCD is horizontally scalable. In some scenarios, adding more database nodes might be better for scaling production clusters.
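As a quick provisioning check, a minimal sketch like the following compares a node's visible vCPU count against these minimums. The thresholds simply mirror the tables above, and the environment and workload labels are illustrative:

```python
import os

# Minimum vCPU counts from the tables above.
MINIMUM_CORES = {
    ("production", "core"): 16,
    ("production", "sai_density_or_vector"): 32,
    ("development", "core"): 4,
    ("development", "sai_density_or_vector"): 8,
}

def check_cores(environment: str, workload: str) -> None:
    """Warn if this machine has fewer vCPUs than the recommended minimum."""
    available = os.cpu_count() or 0
    minimum = MINIMUM_CORES[(environment, workload)]
    if available < minimum:
        print(f"WARNING: {available} vCPUs found; {minimum} recommended "
              f"for {environment} ({workload}) workloads.")
    else:
        print(f"OK: {available} vCPUs found (minimum {minimum}).")

if __name__ == "__main__":
    check_cores("production", "sai_density_or_vector")
```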
Memory and heap
The more memory an HCD node has, the better its read performance. More RAM also lets memory tables (memtables) hold more recently written data. Larger memtables reduce the number of sorted string tables (SSTables) that the system flushes to disk, hold more data in the operating system page cache, and reduce the number of files the system scans from disk during a read.
DataStax recommends that you base the amount of memory on the size of your data set and the number of requests you expect to handle.
The recommended minimums are:
| Node type | System memory | Heap |
|---|---|---|
| Production: Transactional Apache Cassandra® | 32 GB | 8 GB |
| Production: Transactional Apache Cassandra® | 32 GB to 512 GB | 24 GB (RAM < 64 GB) |
| Development: Non-load testing environments | 8 GB to 16 GB | 4 GB to 8 GB |
Storage subsystem
The storage subsystem is critical for your database’s performance. The database writes data to disk when it appends data to the commit log for durability and when it flushes memtables to SSTable data files for persistent storage.
Maximum capacity per node - node density
DataStax recommends using up to 2 TB of data per node with enough free disk space to accommodate the compaction strategy.
Exceeding the recommended node density can cause the following effects:
- Compactions can fall behind depending on write throughput, hardware, and compaction strategy.
- Streaming operations like bootstrap, repair, and replacing nodes take longer to complete.
High-capacity nodes perform best with low to moderate write throughput and without indexing.
One special case applies to time-series data.
You can use the TimeWindowCompactionStrategy (TWCS) to scale beyond these limits if the following conditions hold (a sketch of such a table follows this list):
- The time-series data is written once and never updated.
- The data includes a clustering column based on time.
- Read queries cover specific time-bound ranges of data rather than its entire history.
- You configure the TWCS windows appropriately.
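For illustration, here is a minimal sketch of a time-series table definition that meets these conditions: it clusters on a timestamp and uses TWCS with daily windows. The keyspace, table, column names, and window settings are examples only, not recommendations for any particular workload:

```python
# Example CQL for a time-series table using TWCS with daily windows.
# The keyspace, table, columns, and window settings are illustrative only.
CREATE_TIMESERIES_TABLE = """
CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
    sensor_id  uuid,
    reading_ts timestamp,
    value      double,
    PRIMARY KEY ((sensor_id), reading_ts)
) WITH CLUSTERING ORDER BY (reading_ts DESC)
  AND compaction = {
    'class': 'TimeWindowCompactionStrategy',
    'compaction_window_unit': 'DAYS',
    'compaction_window_size': '1'
  };
"""

if __name__ == "__main__":
    # Print the statement; run it with cqlsh or your preferred driver.
    print(CREATE_TIMESERIES_TABLE)
```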
Vector search capacity
Vector search is an extension of SAI that enables semantic associations among data.
From an operational standpoint, vector search works like any other database index. Writing data uses additional CPU resources to index it. When reading data, vector search and SAI require extra work to consult the indexes, gather results, and send them to the application client. Therefore, account for this overhead in your capacity planning, particularly CPU usage, available memory, storage speed, and per-node data density. Specifically, DataStax recommends that nodes using vector search have at least 32 vCPUs, at least 64 GB of memory, and fast storage, such as SSDs or NVMe drives.
Thoroughly test your setup before deploying to production. DataStax highly recommends testing with tools like NoSQLBench using your desired configuration. Make sure to test common administrative operations, such as bootstrap, repair, and failure scenarios, to ensure your hardware selections meet your business needs. For more information, see Testing Your Cluster Before Production.
In addition to the database cluster, DataStax optionally provides a Data API that abstracts vector search data and indexes behind a JSON collection-oriented interface. The Data API runs in a separate stateless service deployed as containers. You can scale the Data API independently of the cluster based on request throughput.
Per-node data density factors
Node density depends heavily on the environment. You must determine node density based on several factors, including:
- How frequently data changes and how often it is accessed.
- Service-Level Agreements (SLAs) and the ability to handle outages.
- Data compression.
- Compaction strategy: These strategies control which SSTables the system selects for compaction and how it sorts the compacted rows into new SSTables. Each strategy has its strengths. To choose your compaction strategy, see compaction strategy.
- Network performance: Remote links typically limit storage bandwidth and increase latency.
- Replication factor: For more information, see Data distribution to nodes.
Free disk space
Disk space requirements depend on usage, so it is important to understand how the database uses the disk.
The database writes data to disk when it appends data to the commit log for durability and when it flushes memtables to SSTable data files for persistent storage. The commit log has a different access pattern and read/write ratio than reads from SSTables, which matters more for spinning disks than for SSDs.
The system periodically compacts SSTables. Compaction improves performance by merging and rewriting data while discarding old data. However, depending on the type and size of the compactions, disk utilization and data directory volume temporarily increase during compaction. For this reason, ensure you leave an adequate amount of free disk space on a node.
The following table lists the minimum disk space requirements based on the compaction strategy:
| Compaction strategy | Minimum requirements |
|---|---|
| UnifiedCompactionStrategy (UCS) | Worst case: 20% of free disk space at the default configuration. |
| SizeTieredCompactionStrategy (STCS) | Sum of all the SSTables compacting must be smaller than the remaining disk space. Worst case: 50% of free disk space. This scenario can occur in a manual compaction where all SSTables are merged into one giant SSTable. |
| LeveledCompactionStrategy (LCS) | Generally 10%. Worst case: 50% when the Level 0 backlog exceeds 32 SSTables (LCS uses STCS for Level 0). |
| TimeWindowCompactionStrategy (TWCS) | TWCS requirements are similar to STCS. TWCS requires a maximum disk space overhead of approximately 50% of the total size of SSTables in the last created bucket. To ensure adequate disk space, determine the size of the largest bucket or window ever generated and calculate 50% of that size. |
Cassandra periodically merges SSTables and discards old data to keep the database healthy through a process called compaction. For more information, see How is data maintained?.
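As a rough planning aid, the following sketch turns the worst-case figures above into free-space estimates for a given amount of data per node. It assumes the percentages can be read as overhead relative to the data size being compacted (for TWCS, relative to the largest window rather than the total data), which is a simplification; verify against your own configuration:

```python
# Worst-case compaction headroom from the table above, expressed as a
# fraction of the data being compacted. This is a simplification; actual
# requirements depend on configuration and workload.
WORST_CASE_HEADROOM = {
    "UnifiedCompactionStrategy": 0.20,
    "SizeTieredCompactionStrategy": 0.50,
    "LeveledCompactionStrategy": 0.50,      # generally ~10%; 50% is the worst case
    "TimeWindowCompactionStrategy": 0.50,   # applies to the largest window, not total data
}

def required_free_gb(strategy: str, data_gb: float) -> float:
    """Return the worst-case free space (GB) to reserve for compaction."""
    return data_gb * WORST_CASE_HEADROOM[strategy]

if __name__ == "__main__":
    for strategy in WORST_CASE_HEADROOM:
        # Example: 2 TB (2000 GB) of data per node.
        print(f"{strategy}: reserve ~{required_free_gb(strategy, 2000):.0f} GB")
```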
Estimate usable disk capacity
To estimate how much data your nodes can hold, calculate the usable disk capacity per node and then multiply that by the number of nodes in your cluster, as shown in the example after these steps.
1. Start with the raw capacity of the physical disks:

   raw_capacity = disk_size * number_of_data_disks

2. Calculate the usable disk space, accounting for file system formatting overhead of roughly 10 percent:

   formatted_disk_space = raw_capacity * 0.9

3. Calculate the recommended working disk capacity:

   usable_disk_space = formatted_disk_space * (0.5 to 0.8)
During normal operations, the database routinely needs disk capacity for compaction and repair operations. For optimal performance and cluster health, DataStax recommends not filling your disks, but running at 50% to 80% capacity.
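A minimal sketch of the same calculation, assuming the 10% formatting overhead and the 50% to 80% utilization target described above; the disk sizes and node count are placeholders:

```python
def usable_disk_space_gb(disk_size_gb: float, number_of_data_disks: int,
                         target_utilization: float = 0.5) -> float:
    """Estimate usable per-node capacity using the steps above."""
    raw_capacity = disk_size_gb * number_of_data_disks
    # Roughly 10% is lost to file system formatting overhead.
    formatted_disk_space = raw_capacity * 0.9
    # Plan to run at 50% to 80% of formatted capacity to leave room for
    # compaction and repair.
    return formatted_disk_space * target_utilization

if __name__ == "__main__":
    # Example: two 1.92 TB data disks per node, planning at 50% utilization.
    per_node = usable_disk_space_gb(1920, 2, target_utilization=0.5)
    cluster = per_node * 6  # multiply by the number of nodes in the cluster
    print(f"Per node: {per_node:.0f} GB; 6-node cluster: {cluster:.0f} GB")
```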
Capacity and I/O
When choosing disks for your nodes, consider both capacity, which refers to how much data you plan to store, and I/O, which refers to the write/read throughput rate. Some workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes with more RAM.
Number of disks - HDD
DataStax recommends using at least two disks per node: one for the commit log and the other for the data directories.
At a minimum, you should place the commit log on its own partition.
Commit log disk - HDD
The commit log disk does not need to be large, but it should be fast enough to receive all of your writes as appends (sequential I/O).
Commit log disk - SSD
With SSDs, there is less of a penalty for sharing the commit log and data directories on the same device than there is with HDDs.
DataStax recommends separating commit logs and data for highest performance and resiliency.
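As a sketch of how you might verify that separation, the following script checks whether the configured commit log and data directories live on the same block device. It assumes a standard cassandra.yaml location and the Cassandra-style commitlog_directory and data_file_directories settings, and it requires PyYAML; adjust the path and defaults for your HCD installation:

```python
import os
import yaml  # PyYAML: pip install pyyaml

def same_device(path_a: str, path_b: str) -> bool:
    """Return True if two paths reside on the same block device."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev

def check_commitlog_separation(config_path: str = "/etc/cassandra/cassandra.yaml") -> None:
    """Report whether the commit log shares a device with any data directory."""
    with open(config_path) as f:
        config = yaml.safe_load(f)
    commitlog = config.get("commitlog_directory", "/var/lib/cassandra/commitlog")
    data_dirs = config.get("data_file_directories") or ["/var/lib/cassandra/data"]
    for data_dir in data_dirs:
        if same_device(commitlog, data_dir):
            print(f"NOTE: commit log ({commitlog}) shares a device with {data_dir}")
        else:
            print(f"OK: commit log and {data_dir} are on separate devices")

if __name__ == "__main__":
    check_commitlog_separation()
```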
Data disks
Use one or more disks per node and make sure they are large enough for the data volume and fast enough both to satisfy reads that are not cached in memory and to keep up with compaction.
RAID on data disks
You generally do not need to use a Redundant Array of Independent Disks (RAID) for the following reasons:
- The system replicates data across the cluster based on the replication factor you choose.
- HCD includes a just-a-bunch-of-disks (JBOD) feature for disk management. You can configure the database to respond to a disk failure according to your availability/consistency requirements, either by stopping the affected node or by denylisting the failed drive, so you can deploy nodes with large disk arrays without the overhead of RAID 10. Also see Recovering from a single disk failure using JBOD.
RAID on the commit log disk
Generally, you do not need RAID for the commit log disk. Replication adequately protects against data loss. If you need extra redundancy, use RAID 1.
Extended file systems
DataStax recommends that you deploy on XFS or ext4. On ext2 or ext3, the maximum file size is 2 TB, even with a 64-bit kernel. Because the database can use almost half your disk space for a single file with SizeTieredCompactionStrategy (STCS), use XFS with large disks, especially if you are using a 32-bit kernel. With a 32-bit kernel, XFS file size is capped at 16 TB; on a 64-bit kernel, it is essentially unlimited.
Partition size and count
To ensure efficient operation, you must size partitions within certain limits.
You can measure partition size by the number of values in the partition and the size of the partition on disk. The practical limit of cells per partition is 2 billion.
Sizing the disk space is more complex; it depends on the number of rows, columns, primary key columns, and static columns in each table. Each application has different efficiency parameters, but a good rule of thumb is to keep the number of rows per partition below 100,000 and the partition size under 100 MB.
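As a back-of-the-envelope check against these limits, the following sketch estimates partition size from an assumed average row size. The row count and row size are placeholders, and the on-disk size also depends on encoding, compression, and per-cell overhead:

```python
# Rule-of-thumb limits from the paragraph above.
MAX_ROWS_PER_PARTITION = 100_000
MAX_PARTITION_BYTES = 100 * 1024 * 1024  # 100 MB

def check_partition(rows_per_partition: int, avg_row_bytes: float) -> None:
    """Estimate partition size and flag values that exceed the guidelines."""
    estimated_bytes = rows_per_partition * avg_row_bytes
    if rows_per_partition > MAX_ROWS_PER_PARTITION:
        print(f"Too many rows per partition: {rows_per_partition:,}")
    if estimated_bytes > MAX_PARTITION_BYTES:
        print(f"Partition too large: ~{estimated_bytes / 2**20:.0f} MB")
    else:
        print(f"Estimated partition size: ~{estimated_bytes / 2**20:.1f} MB")

if __name__ == "__main__":
    # Example: one 200-byte row per minute, partitioned by sensor and day.
    check_partition(rows_per_partition=24 * 60, avg_row_bytes=200)
```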
Network bandwidth
Minimum recommended bandwidth: 1000 Mb/s (1 Gbps).
A distributed data store increases the load on the network to handle read/write requests and replicate data across nodes. Make sure your network can handle inter-node traffic without bottlenecks.
DataStax recommends binding your interfaces to separate network interface cards (NICs). You can use public or private NICs based on your requirements.
The database efficiently routes requests to replicas geographically closest to the coordinator node and chooses a replica in the same rack when possible. The database always chooses replicas located in the same datacenter over those in a remote datacenter.
Firewall configuration for clusters
If you use a firewall, make sure nodes within a cluster can communicate with each other.
You must open the following ports to allow bi-directional communication between nodes. Configure the firewall running on nodes in your cluster accordingly. If you do not open the following ports, nodes will act as standalone database servers and will not join the cluster.
| Port number | Description |
|---|---|
| Public Port | |
| 22 | SSH port |
| Inter-node Ports | |
| 7000 | Inter-node cluster communication |
| 7001 | SSL inter-node cluster communication |
| 7199 | JMX monitoring port |
| Client Ports | |
| 9042 | Client port |
| 9142 | Default SSL client port (native_transport_port_ssl) for encrypted client connections |
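As a simple connectivity check, the following sketch attempts a TCP connection to each of the ports above from wherever you run it. The target address is a placeholder, and a successful connection only shows that the port is reachable from that machine, not that the firewall configuration is complete:

```python
import socket

# Ports from the table above.
CLUSTER_PORTS = {
    22: "SSH",
    7000: "Inter-node cluster communication",
    7001: "SSL inter-node cluster communication",
    7199: "JMX monitoring",
    9042: "CQL client",
    9142: "SSL CQL client",
}

def check_ports(host: str, timeout: float = 2.0) -> None:
    """Attempt a TCP connection to each port and report reachability."""
    for port, description in CLUSTER_PORTS.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                print(f"{host}:{port} reachable ({description})")
        except OSError:
            print(f"{host}:{port} NOT reachable ({description})")

if __name__ == "__main__":
    check_ports("10.0.0.11")  # placeholder: replace with a node's address
```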