Capacity planning and hardware selection for DSE deployments

Before you install DataStax Enterprise (DSE), you must provision and deploy a cluster. The hardware you choose for your DSE cluster directly influences cluster performance.

This guide provides general capacity planning recommendations for a broad range of DSE use cases. Unless otherwise noted, any specific values, such as system memory, are minimums that you can exceed as needed.

Adjust these recommendations as needed for your environment and workload characteristics. For example, environments with static, infrequently accessed data have different requirements than environments with volatile, frequently accessed data.

These recommendations aren’t fixed limits, and they don’t guarantee optimal performance or stability.

Use these recommendations as a starting point. Then, test the configuration and make adjustments as needed for your workloads and performance targets.

This guide assumes you have basic knowledge of Apache Cassandra-based databases and DevOps concepts. Use this guide in conjunction with other DSE documentation.

Deployment sizing

Before deployment, plan for capacity and performance by determining the cluster size. This includes selecting the number of datacenters, the number of nodes per datacenter, and the appropriate node size, cloud instance types, and disk capacity.

To learn more about clusters, datacenters, and nodes, see the following:

Architecture in brief
Initialize a DataStax Enterprise (DSE) cluster

If you plan to use DSE advanced workloads, DataStax recommends separate nodes for each workload type for better performance and stability. Multiple workload types can share the same cluster, but single-workload nodes avoid resource competition problems while allowing the cluster to serve multi-workload requests. For more information, see Initializing datacenters.

Operating system

For optimal performance in Linux environments, DataStax recommends using the latest version of a supported Linux distribution. Newer versions of Linux handle highly concurrent workloads more efficiently.

Some examples of supported Linux distributions include Amazon Linux, Oracle Linux, Red Hat Enterprise Linux (RHEL), and Ubuntu. For more information, see Supported platforms and compatibility for DSE.

Java runtime

A supported Java runtime is required.

If you install multiple Java versions, you must set $JAVA_HOME to the latest DSE-supported Java version that you have installed.

Linux command-line tools

Familiarity with Linux command-line tools is important for managing and operating your DSE clusters. For example:

Parallel Secure Shell (SSH) and Cluster SSH: The pssh and cssh tools allow SSH access to multiple nodes. This is useful for inspections and cluster-wide changes.
Passwordless SSH: SSH authentication is carried out by using public and private keys. This allows SSH connections to easily move from node to node without password access. In cases where more security is required, you can implement either a bastion host or a VPN, or both.
dstat: Command-line tool that shows all system resources instantly. For example, you can compare disk usage in combination with interrupts from your IDE controller, or compare the network bandwidth numbers directly with the disk throughput (in the same interval).
top: Command-line tool that provides an ongoing, real-time view of CPU activity.
vmstat: Command-line tool that reports information about processes, memory, paging, block I/O, traps, and CPU activity.
iftop: Command-line tool that shows a list of network connections. Connections are ordered by bandwidth usage, with the pair of hosts responsible for the most traffic at the top of the list. This tool makes it easier to identify the hosts causing network congestion.
Other system performance tools: In addition to the previously mentioned tools, other Linux command-line tools, such as iostat, mpstat, sar, lsof, netstat, and htop, can collect and report a variety of metrics about the operation of the system.

CPUs

In DSE, insert-heavy workloads become CPU-bound before becoming memory-bound. All writes go to the commit log, but the database is so efficient in writing that the CPU is the limiting factor. DSE is highly concurrent and uses as many CPU cores as available.

The following table lists the minimum number of cores for each DSE node in production and development environments. Multiply these minimum requirements based on the cluster size.

More cores are required for environments that support resource intensive workloads, including auxiliary indexes and high node density. For example, storage attached indexing (SAI) and vector search at scale always require additional cores and memory.

Minimum required cores
Node type	Environment	Minimum CPU cores (logical) per node	Notes
Transactional (Cassandra)	Production	Standard workloads: 16 cores; resource intensive workloads: 32 cores	Adjust according to your actual workloads and performance targets. Additional cores might be required for higher request throughput or auxiliary indexing. Fewer than 16 cores per node isn’t recommended for most production workloads.
Transactional (Cassandra)	Development	Standard workloads: 2 cores; resource intensive workloads: 4 cores	Sufficient for non-load testing environments. When testing production workloads, use production recommendations.
DSE Analytics	Any	16 to 32 cores	Advanced workloads are considered resource intensive. DSE Analytics workloads rely heavily on memory for optimal performance. For more information, see Memory and heap.
DSE Search	Any	2 cores per search index	Advanced workloads are considered resource intensive. Total number of search indexes must not exceed half of the number of physical cores. For more information, see DSE Search workloads.
DSE Graph	Any	16 to 32 cores	Advanced workloads are considered resource intensive. DSE Graph queries can cause a CPU bottleneck due to query optimization and result set preparation. Therefore, DataStax recommends more cores, or more powerful cores, for these nodes.

The most effective way to scale DSE production clusters is to add more servers. Additionally, DSE 6.9 has a thread-per-core architecture that can also benefit from more cores or a faster storage layer.

Memory and heap

The more memory a DSE node has, the better its read performance. More RAM also lets memory tables (memtables) hold more recently written data. Larger memtables are more efficient:

Fewer sorted string tables (SSTables) flushed to disk.
More data held in the chunk cache and, if needed, the OS page cache.
Fewer files scanned from disk during reads.

DataStax recommends that you base the amount of memory on the size of your hot dataset (indexes, frequently accessed data) and the number of requests you expect to handle.

As a general starting point, DataStax recommends the following:

Recommended memory for dedicated hardware and virtual environments
Environment	Node type	System memory	Heap
Production	Transactional (Cassandra)	32 GB	8 GB
Production	Vector Search	Minimum: 32 GB; balanced: 64 GB; maximum: 512 GB	System memory less than 64 GB: 24 GB; system memory greater than 64 GB: 31 GB. See Vector search workloads.
Production	DSE Analytics	32 GB to 512 GB	System memory less than 64 GB: 24 GB; system memory greater than 64 GB: 31 GB
Production	DSE Search	32 GB to 512 GB	System memory less than 64 GB: 24 GB; system memory greater than 64 GB: 31 GB. See DSE Search workloads.
Production	DSE Graph	Add 2 to 4 GB to the recommended memory for DSE Search or DSE Analytics, based on your cluster’s combination of advanced workloads. For a large dedicated graph cache, add more RAM.	System memory less than 64 GB: 24 GB; system memory greater than 64 GB: 31 GB
Development (non-load testing; when testing production workloads, use production recommendations)	Any	Transactional: 8 GB; advanced workloads or auxiliary indexing: 16 GB	Transactional: 4 GB; advanced workloads or auxiliary indexing: 8 GB

For more information, see Set the heap size for Java garbage collection.
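As a rough illustration, the production heap guidance in the table can be expressed as a small lookup. This is a minimal sketch, not a DSE tool: the function name `recommended_heap_gb` and the `node_type` labels are hypothetical, and the handling of exactly 64 GB of system memory is an assumption because the table doesn't specify it.

```python
def recommended_heap_gb(system_memory_gb: float, node_type: str = "transactional") -> int:
    """Return the starting-point heap size (GB) for a production node.

    node_type is a hypothetical label: "transactional" for Cassandra-only
    nodes, anything else for vector search or DSE advanced workloads.
    """
    if system_memory_gb < 32:
        raise ValueError("production nodes start at 32 GB of system memory")
    if node_type == "transactional":
        return 8
    # Vector search, DSE Analytics, DSE Search, and DSE Graph share one rule.
    # Assumption: exactly 64 GB takes the smaller heap; the table doesn't say.
    return 31 if system_memory_gb > 64 else 24


print(recommended_heap_gb(32))             # transactional node
print(recommended_heap_gb(128, "search"))  # advanced workload node
```

Treat the result as a starting value only, and adjust it after testing with your own workloads.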

Storage subsystem

The storage subsystem is critical for your database’s performance, and disk space requirements depend on usage.

The database writes to disk when it appends data to the commit log for durability, and when it flushes memtables to SSTable data files for persistent storage. For more information about these processes, see Architecture in brief and Writes.

The commit log has a different access pattern (read/write ratio) than reads from SSTables. This is more important for spinning disks than for SSDs.

Disk space required for compaction

For any compaction strategy, it is important that each node has adequate resources to support the commit log, memtables, SSTables, and the compaction process.

To keep the database healthy, the database periodically merges and rewrites SSTables while discarding old data through a process called compaction.

It is important that you choose an appropriate compaction strategy for your use case and data model. Misconfigured or unsuitable compaction strategies can degrade performance and overconsume system resources. The following table summarizes the supported compaction strategies and general configuration guidance for each strategy.

Supported compaction strategies
Strategy	Use case	Configuration notes	General disk requirements
UnifiedCompactionStrategy (UCS)	Unifies and builds on tiered (STCS) and leveled (LCS) compaction. Recommended for all workloads with the exception of time series data with expiring time-to-live (TTL) workloads, which are better suited to TimeWindowCompactionStrategy (TWCS).	Understand how the scaling property works and make sure it is set to your preferred mode and performance targets. Many compaction properties for this strategy are sufficient at the default values, but you might need to tune them for certain workloads.	Disk space requirements depend on the configured mode (STCS, LCS, balanced, or multiple tier-specific modes). The maximum disk space for this strategy is set by the `max_space_overhead` parameter. If the default configuration is 20 percent of your node’s free disk space, DataStax recommends tuning this parameter.
SizeTieredCompactionStrategy (STCS)	Good for write-heavy workloads that prioritize fast writes over read latency. DataStax recommends UCS in tiered mode over traditional STCS.	In most cases, the default properties are sufficient. If this strategy produces too many outliers, or compaction runs more often than you would like, then the size range and thresholds might need tuning.	Make sure there is sufficient memory and storage available for the number and size of SSTables as well as overhead for the compaction process. The sum of all SSTables being compacted must be smaller than the remaining disk space, ideally less than 50 percent. Avoid exceeding 50 percent of free disk space, which is likely to occur with manual compaction where all SSTables are merged into one giant SSTable.
LeveledCompactionStrategy (LCS)	Good for read-heavy workloads that perform best with fewer SSTables. DataStax recommends UCS in leveled mode or multiple tier-specific modes over traditional LCS, which can alleviate some performance concerns at high levels (L3 and above).	Understand the mechanisms and resource requirements at L0 compared to L1 and higher. Requires tuning `memtable` parameters for optimal performance, such as less frequent flushing of memtables to avoid overloading L0.	Due to guaranteed non-overlapping row key ranges, LCS requires much less disk space for compaction compared to STCS. However, you must account for the use of STCS as a failsafe at L0, which requires more disk space. Furthermore, the maximum overhead for LCS increases dramatically beyond L3 because each level is approximately 10 times larger than the preceding level. I/O saturation is possible when compacting at the highest levels due to progressively larger SSTables at each additional level. For more information, see LCS compaction write amplification and disk requirements.
TimeWindowCompactionStrategy (TWCS)	Designed for time series data and expiring time-to-live (TTL) workloads, especially data that is written once, in chronological order, and never updated.	TWCS must be enabled when you create a table. You cannot apply TWCS retroactively to existing tables that weren’t created with the proper time windowing layout.	Similar to STCS, TWCS requires a maximum disk space overhead of 50 percent of the total size of SSTables in the last created bucket. To ensure adequate disk space, determine the size of the largest bucket or window ever generated, and divide that value by 2: `TWCS disk space = Largest bucket size / 2`. For new deployments, you must monitor and tune this during cluster performance tests.

For more information about compaction strategy properties and tuning, see Configure compaction.
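The STCS and TWCS disk guidance in the table can be sketched as two small helpers. This is only an illustration of the arithmetic: the function names and the status labels `ok`, `tight`, and `insufficient` are hypothetical and aren't part of DSE.

```python
def stcs_headroom(sstable_sizes_gb, free_disk_gb):
    """Classify free disk space for an STCS compaction.

    The SSTables being compacted must total less than the free disk space,
    and ideally less than 50 percent of it.
    """
    total = sum(sstable_sizes_gb)
    if total >= free_disk_gb:
        return "insufficient"
    if total >= 0.5 * free_disk_gb:
        return "tight"
    return "ok"


def twcs_overhead_gb(largest_bucket_gb):
    """TWCS disk space = Largest bucket size / 2."""
    return largest_bucket_gb / 2


print(stcs_headroom([100, 100], free_disk_gb=500))
print(twcs_overhead_gb(80))
```

For TWCS, the largest bucket size is something you measure during performance tests, so the second helper is only as accurate as that measurement.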

Estimate usable disk capacity

To estimate how much data your nodes can hold, calculate the usable disk capacity per node and then multiply that by the number of nodes in your cluster:

Start with the raw capacity of the physical disks:
```
raw_capacity = disk_size * number_of_data_disks
```
Calculate the usable disk space accounting for file system formatting overhead, which is approximately 10 percent:
```
formatted_disk_space = (raw_capacity * 0.9)
```
Calculate the recommended working disk capacity:
```
usable_disk_space = formatted_disk_space * (0.5 to 0.8)
```
During normal operations, the database routinely requires disk capacity for compaction and repair operations. For optimal performance and cluster health, DataStax recommends not filling your disks to capacity. Instead, run at 50 to 80 percent of maximum capacity. For example, if you want to support 10 TB of node density at 80 percent of maximum capacity, use 12 TB disks.
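The three formulas above can be combined into one function. The following is a minimal sketch; the function and parameter names are illustrative, and the disk sizes in the example are arbitrary values chosen only to show the arithmetic.

```python
def usable_disk_space_tb(disk_size_tb, number_of_data_disks,
                         fill_ratio=0.5, format_overhead=0.10):
    """Estimate the recommended working disk capacity per node, in TB.

    fill_ratio is the fraction of formatted space to run at (0.5 to 0.8).
    format_overhead is the approximate file system formatting overhead.
    """
    if not 0.5 <= fill_ratio <= 0.8:
        raise ValueError("fill_ratio should be between 0.5 and 0.8")
    raw_capacity = disk_size_tb * number_of_data_disks
    formatted_disk_space = raw_capacity * (1 - format_overhead)
    return formatted_disk_space * fill_ratio


# Example with arbitrary values: four 2 TB data disks, run at 50 percent.
per_node = usable_disk_space_tb(2, 4, fill_ratio=0.5)
print(f"usable per node: {per_node:.2f} TB")
print(f"usable across 6 nodes: {per_node * 6:.2f} TB")
```

Multiplying the per-node result by the node count gives the cluster-wide estimate. This figure includes all replicas, so the amount of unique data it holds is lower.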

Estimate partition size

See Evaluate partitions.

Maximum capacity per node (node density)

Determining node density depends heavily on the environment and factors such as the following:

Frequency of reads.
Frequency of new writes and mutations.
Using HDDs or SSDs.
Storage speed and whether the storage is local.
Your Service-Level Agreements (SLAs) and tolerance for outages.
Data compression.
Compaction strategy, depending on whether the workload is write-intensive, read-intensive, or time dependent.
Network performance: Remote links can limit storage bandwidth and increase latency.
Replication factor.

DataStax recommends no more than 2 TB of data per node.
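To see how this density recommendation affects cluster size, the following sketch estimates a node count from the dataset size and replication factor. It assumes every replica stores a full copy of the data and ignores compression, indexes, and growth, so treat the result as a rough lower bound. The function name is hypothetical.

```python
import math


def estimate_node_count(unique_data_tb, replication_factor, max_density_tb=2.0):
    """Estimate the minimum number of nodes in a single datacenter.

    Assumption: total stored data = unique data * replication factor.
    The result is never below the replication factor, because the
    replication factor must not exceed the number of nodes.
    """
    total_stored_tb = unique_data_tb * replication_factor
    nodes_for_density = math.ceil(total_stored_tb / max_density_tb)
    return max(nodes_for_density, replication_factor)


print(estimate_node_count(10, 3))  # 30 TB stored at 2 TB per node -> 15
print(estimate_node_count(1, 3))   # density needs 2 nodes, RF needs 3 -> 3
```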

Exceeding the recommended data density has the following effects:

Compactions can fall behind depending on write throughput, hardware, and compaction strategy.
Substantially more compactions per node.
Excessively long run times for streaming operations, such as bootstrapping, repairing, and replacing nodes. In extreme cases, these operations can take days to complete.
Interferes with routine maintenance, such as recovering, adding, and replacing nodes. Operations take longer to complete and are less efficient.

High-capacity nodes work best with low to moderate write throughput and no indexing. An ideal use case is static data that is rarely accessed.

If you have time-series data, you can use the TimeWindowCompactionStrategy (TWCS) to scale larger than these limits if the following conditions are true for your cluster:

The time-series data is written once and never updated.
The data has a clustering column that is time based.
Reads cover specific time-bounded ranges of data rather than its full history.
You are prepared to configure the TWCS windows appropriately.

If you require additional data density, contact IBM Support to determine if the workload and hardware being used are appropriate for higher densities.

Spinning disks versus solid state drives (SSD) (local only)

For cloud deployments, contact IBM Support.

For assistance with determining the most cost-effective hardware options for any DSE deployment, contact IBM Support.

Solid state drives (SSDs) are recommended for all DSE nodes.

The NAND Flash chips that power SSDs provide extremely low-latency response times for random reads while supplying ample sequential write performance for compaction operations. In recent years, drive manufacturers have improved overall endurance, usually in conjunction with spare (unexposed) capacity. Additionally, because PBW/DWPD ratings are probabilistic estimates based on worst case scenarios, such as random write workloads, and because the database does only large sequential writes, drives significantly exceed their endurance ratings.

However, it is important to plan for drive failures and have spares available. A large variety of SSDs are available from server vendors and third-party drive manufacturers. Longevity is a key factor when purchasing SSDs. Base the decision on how difficult it is to replace drives when they fail, not on the drive’s workload. Remember, your data is protected because the database replicates data across the cluster. Buying strategies include:

If drives are quickly available, buy the cheapest drives that provide the performance you want.
If it is more challenging to swap the drives, consider higher endurance models, possibly starting in the mid range, and then choose replacements of higher or lower endurance based on the failure rates of the initial model chosen.

When choosing disks for your nodes, consider capacity (how much data you plan to store) and I/O (the write/read throughput rate). Some workloads are best served by using less expensive SATA disks and scaling disk capacity and I/O by adding more nodes (with more RAM).

For more information about SSDs with DSE, see Storage engine and Optimize disk settings.

Use separate disks for commit logs and data directories

DataStax recommends placing commit logs and data directories on separate disks (at least two disks) for better performance and resiliency. If you cannot use separate disks, the commit log should be on its own partition.

With SSDs, sharing a disk between commit logs and data directories degrades performance less than it does with spinning disks. However, separation is still recommended.

Commit log disk: The commit log disk doesn’t need to be large, but it must be fast enough to receive all writes as appends for sequential I/O.
Data disks: For data disks, use one or more disks per node. Disks must be large enough for the required data volume and fast enough to satisfy reads that are not cached in memory while keeping up with compaction.

DSE requires at least 10,000 IOPS per node. For DSE to linearly scale, every data disk in the cluster must be capable of sustaining this IOPS rate or better. Make sure there are no bottlenecks (controller/LUN) in any node’s I/O.

Avoid SAN storage for on-premise deployments

In a physical, on-premise deployment, Storage Area Network (SAN) storage is aggregated storage that is external to a server. In cloud deployments, virtual SAN storage is local to compute nodes.

DataStax strongly discourages traditional SAN storage for on-premise DSE deployments.

This restriction doesn’t apply to cloud deployments. Virtual SAN storage is less susceptible to the SAN storage issues that can occur with distributed databases.

Although used frequently in enterprise IT environments, traditional SAN storage is typically less performant and more expensive than other options when used with distributed databases:

Traditional SAN return on investment (ROI) does not scale along with that of DSE clusters, with regard to capital expenses and engineering resources.
In distributed architectures, traditional SAN generally introduces a bottleneck and single point of failure because the database’s I/O frequently surpasses the ability of the array controller to keep pace.
External storage increases latency for all operations, even with a high-speed network and SSD.
Heap pressure is increased because pending I/O operations take longer.
When the SAN transport shares operations with internal and external database traffic, it can saturate the network and lead to network availability problems.

Taken together, these factors can create problems that are difficult to resolve in production. In particular, deploying DSE clusters with an external SAN requires additional preparation and testing that isn’t common to other deployments. This can include development of specialized testing methods and additional personnel, time, and resource requirements for testing and reconfiguration cycles. For example, methods are needed for all key scaling factors, such as operational rates and SAN fiber saturation.

For more information about the disadvantages of traditional SAN storage, contact IBM Support.

Avoid NAS devices

DataStax does not recommend storing SSTables on a Network-Attached Storage (NAS) device. If your organization or environment requires NAS, contact IBM Support.

Using a NAS device often results in network-related bottlenecks caused by high levels of I/O wait time on reads and writes. Examples of these bottlenecks include router latency and the Network Interface Cards (NICs) in the node and the NAS device.

RAID isn’t required

Typically, you don’t need to use a Redundant Array of Independent Disks (RAID) on data disks for the following reasons:

Data is replicated across the cluster based on the configured replication factor.
DSE includes JBOD (Just a Bunch of Disks) features for disk management.

Based on your data availability and consistency configuration, the database responds to disk failure by stopping the affected node or denylisting the failed drive. This means that you can deploy nodes with large disk arrays without the overhead of RAID-10. For more information, see Recover from a single disk failure using JBOD and the disk_failure_policy parameter in cassandra.yaml.

Generally, DataStax recommends the built-in JBOD configurations. If you need extra redundancy, use RAID-0, RAID-1, or RAID-10. For certain workloads, RAID-0 can provide better throughput because it splits every block onto another device, allowing parallel (instead of serial) writes on disk. Don’t use RAID-5, RAID-6, or variants like RAID-50 or RAID-60, which exhibit poor performance.

Additionally, you don’t typically need RAID for the commit log disk because built-in replication functionality adequately prevents data loss. If you need extra redundancy for the commit log disk, use RAID-1.

Extended file systems

DataStax recommends that you deploy on XFS. If XFS is not available, use ext4.

Don’t use ext2 or ext3 because they limit the maximum file size to 2 TB, even with a 64-bit kernel. In contrast, the XFS file size limit is 16 TB with a 32-bit kernel and essentially unlimited with a 64-bit kernel.

Because the database can use almost half your disk space for a single file with SizeTieredCompactionStrategy (STCS), use XFS with large disks, especially with a 32-bit kernel.

Use a 4 KB block size for optimal performance.

Extended file system limits are different from node density limits.

Network

The minimum recommended bandwidth is 1000 Mb/s (gigabit).

A distributed data store puts load on the network to handle read/write requests and replication of data across nodes. Make sure your network can handle inter-node traffic without bottlenecks.

DataStax recommends binding your interfaces to separate Network Interface Cards (NICs). You can use public or private NICs depending on your requirements.

The database efficiently routes requests to replicas that are geographically closest to the coordinator node, and it chooses a replica in the same rack when possible. The database always chooses replicas located in the same datacenter over replicas in a remote datacenter.

Firewall and ports

If you use a firewall, make sure that nodes within a cluster can communicate with each other. For required ports, see Secure DataStax Enterprise ports.

If you are using other components with DSE, make sure the required ports for those components are open and not blocked by a firewall. For example, for DSE OpsCenter, see OpsCenter ports reference to set firewall rules.

Encryption

DataStax strongly recommends that you configure peer-to-peer encryption and client-to-server encryption during the initial setup of your production cluster. Even if you don’t intend to use network encryption immediately, it is better to configure it during initial setup.

It is more difficult to enable this encryption after the cluster starts serving production traffic. Misconfiguration on a live cluster can result in cluster-wide downtime and data loss, such as missed writes.

For more information, see the following:

Load balancers

DSE was designed to avoid the need for load balancers. Putting load balancers between the database and clients can be harmful to performance, cost, availability, debugging, testing, and scaling.

All high-level clients implement load balancing directly. For example, see Load balancing in Cassandra drivers.

With Mission Control, you can use load balancers to support your Mission Control deployment, but they must not intersect database-client communication.

Racks

If you intend to use racks, include them in your deployment and cluster configuration planning from the beginning.

DataStax doesn’t recommend attempting to incorporate racks after deploying clusters.

Don’t change fundamental rack architecture after cluster deployment

You cannot reconfigure or change racks after provisioning a cluster. This includes migrations from a single rack to multiple racks. These changes can result in data loss.

If racks are fundamentally misconfigured, you must redeploy your cluster with the correct configuration.

Avoid multiple racks in single-token architecture deployments

The following guidance applies to single-token architectures only; it doesn’t apply to virtual nodes. For more information, see Data distribution and replication.

In single-token architectures, defining one rack for the entire cluster is the simplest and most common implementation. Avoid multiple racks for the following reasons:

Operators often forget or ignore the requirement to organize racks in an alternating order. However, alternating rack order is intentional because it allows the data to be distributed safely and appropriately.
Operators often use rack information inefficiently. For example, having the same number of racks and nodes provides no organizational value to the deployment architecture.
Expanding a cluster when using racks can be tedious. The procedure typically involves several node moves, and you must ensure that racks are distributing data correctly and evenly. When you need to scale clusters urgently, consider all other infrastructure before racks.

To set up racks correctly, plan to allocate the same number of nodes to each rack. Add nodes to the first rack, and then configure subsequent racks in an alternating pattern.
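The alternating pattern described above can be sketched as a round-robin assignment over the nodes in ring order. This is an illustration only: the node and rack names are hypothetical, and the input list is assumed to already be sorted by each node's position in the token ring.

```python
def assign_racks(nodes_in_ring_order, racks):
    """Assign racks round-robin so neighboring nodes in the ring alternate racks.

    Assumption: nodes_in_ring_order is sorted by token position.
    """
    if len(nodes_in_ring_order) % len(racks) != 0:
        raise ValueError("allocate the same number of nodes to each rack")
    return {
        node: racks[i % len(racks)]
        for i, node in enumerate(nodes_in_ring_order)
    }


layout = assign_racks(["node1", "node2", "node3", "node4"], ["rack1", "rack2"])
for node, rack in layout.items():
    print(node, rack)
```

The check on the node count reflects the guidance to allocate the same number of nodes to each rack. A layout that fails it cannot alternate evenly around the ring.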

The rack feature benefits from quick and fully functional cluster expansions. Once the cluster is stable, you can swap nodes and make the appropriate node moves to ensure that nodes are placed in the ring in an alternating fashion with respect to the racks.

Snitches, partitioners, and replicas

For better resilience and scalability, determine your data distribution and replication strategy before deploying your cluster. Some of these configurations cannot be changed after deployment.

Snitches

You must configure a snitch when you create a cluster.

A snitch maps the IP addresses of nodes to physical and virtual locations, such as racks and datacenters. Snitches use gossip to inform the database about the network topology for efficient request routing and replica distribution in datacenters and racks.

There are multiple types of snitches you can use, depending on your deployment characteristics:

The default SimpleSnitch doesn’t recognize datacenter or rack information. Use it for single-datacenter or single-zone deployments in public clouds.
The GossipingPropertyFileSnitch is recommended for production. It is the most flexible solution for on-premise or mixed cloud environments. It defines a node’s datacenter and rack and uses gossip for propagating this information to other nodes.
Other snitches, such as the Ec2Snitch, are available for specific deployment environments and architectures.

Partitioners

You must choose a partitioner when you deploy a cluster, and all nodes in a cluster must use the same partitioner. To change the partitioner after deployment, you must reload all data for the entire cluster.

The partitioner determines how data is distributed across the nodes in the cluster.

The default Murmur3Partitioner is recommended for all new deployments. Other partitioners are included for backwards compatibility only.

Single-token architecture or vnodes

Determine your token architecture when deploying a cluster or datacenter. All nodes in a datacenter must use the same token architecture type: single-token architecture or virtual nodes (vnodes).

DataStax recommends vnodes because they simplify token management across partitions when deploying and scaling clusters. For most use cases, DataStax recommends 8 or 16 vnodes. To enable vnodes in a new cluster, set num_tokens and allocate_tokens_for_local_replication_factor in cassandra.yaml. For more information, see the following:

Replicas

The replication strategy and factor are set at the keyspace level, but your replication requirements determine your cluster infrastructure. Generally, the replication factor (number of replicas) must not exceed the number of nodes in the cluster.

The NetworkTopologyStrategy is recommended even if you have only one datacenter because it makes it easier to scale to multiple datacenters in the future, if needed. For more information, see Evaluate keyspace replication.

If you are using vnodes, set allocate_tokens_for_local_replication_factor in cassandra.yaml to match the replication factor of the node’s keyspaces.

Vector search workloads

Vector search enables semantic associations among data as an extension of storage attached indexes (SAI).

From an operational standpoint, vector search works like any other database index:

Writing data uses additional CPU resources for indexing.
When reading data, vector search and SAI require extra work to consult the indexes, gather results, and send them to the application client.

You must account for this overhead in your capacity planning, particularly in CPU usage, memory, storage speed, and per-node data density. Specifically, DataStax recommends the following for nodes that support vector search:

Minimum 32 vCPUs
64 GB or more of memory
Fast storage, such as SSD or NVMe.

In addition to the database cluster, DataStax optionally provides a Data API that abstracts vector search data and indexes behind a JSON collection-oriented interface. The Data API runs in a separate stateless service deployed as containers. You can scale the Data API independently of the cluster based on request throughput.

DSE Search workloads

Proper capacity planning for DSE Search helps ensure that your nodes have sufficient memory resources to meet operational requirements. This includes the following:

Setting the optimal heap size per node.
Estimating the number of nodes required for your application.
Increasing the replication factor to support more queries per second.
Using distributed queries with DSE Search. These are more efficient when the number of nodes in the queried datacenter is a multiple of the datacenter’s replication factor.
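The last point can be checked with simple arithmetic. The following sketch flags a datacenter whose node count isn't a multiple of its replication factor, and also applies the general rule that the replication factor must not exceed the number of nodes. The function name and messages are hypothetical.

```python
def check_search_datacenter(node_count, replication_factor):
    """Return a list of planning problems for a DSE Search datacenter."""
    problems = []
    if replication_factor > node_count:
        problems.append("replication factor exceeds the number of nodes")
    if node_count % replication_factor != 0:
        problems.append("node count is not a multiple of the replication factor")
    return problems


print(check_search_datacenter(6, 3))  # no problems
print(check_search_datacenter(5, 3))  # not a multiple of the replication factor
```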

Capacity recommendations for DSE Search

Use SSDs with DSE Search

For best performance, DataStax recommends using solid-state drives (SSDs).

DSE Search is memory intensive. It rereads the entire row when updating indexes, which can cause a significant performance hit on spinning disks. DataStax strongly recommends SSDs for applications that have aggressive insert and update requirements.

Separate disks for transactional and DSE Search data

To avoid search index performance degradation, it is critical that you locate DSE transactional (Cassandra) data and DSE Search (Solr) data on separate SSDs.

DSE Search is I/O intensive. If both workloads share one SSD, the SSD can be overrun.

Make sure that you set the location of search indexes.

For more information, see Set the location of search indexes and Tune DSE Search for maximum indexing throughput.

Use a single-token architecture or minimal vnodes

Because DSE Search performs a scatter-gather query against all token ranges, the number of queries sent is directly proportional to the number of token ranges. For DSE Search, use either:

A single-token architecture
Eight or fewer vnodes, with the allocate_tokens_for_local_replication_factor option in cassandra.yaml configured as needed for your environment.
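The effect of the vnode count can be approximated with simple arithmetic. The following Python sketch assumes that the number of token ranges in a datacenter is the node count multiplied by the tokens per node. This is a simplified model for comparison only: it ignores any range merging or other optimizations that DSE might apply, and the function name is hypothetical.

```python
def token_ranges(nodes: int, tokens_per_node: int) -> int:
    """Approximate token ranges in a datacenter (simplified model)."""
    if nodes < 1 or tokens_per_node < 1:
        raise ValueError("nodes and tokens_per_node must be positive")
    return nodes * tokens_per_node


if __name__ == "__main__":
    nodes = 6
    # Compare single-token, 8 vnodes, and an arbitrary larger vnode count.
    for tokens in (1, 8, 128):
        print(f"{tokens:>3} tokens per node: {token_ranges(nodes, tokens)} token ranges")
```

Under this model, going from a single token to 8 vnodes multiplies the token ranges by 8, which is why lower vnode counts are preferred for DSE Search.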

Monitor index size

The recommended maximum size for a single index is 250 GB. In most cases, performance degrades at or before 250 GB. To keep an index below this size, add nodes to further distribute the search index.

For multiple indexes, the recommended maximum size is 500 GB total (sum of all indexes). Supporting multiple indexes depends on the available hardware, particularly the number of physical CPU cores. DataStax recommends at least two physical cores per search index. The maximum number of search indexes should be equal to or less than half the number of physical cores. For example, if a machine has 16 virtual CPUs on 8 physical cores, the recommended maximum number of search indexes is 4.
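The index limits above can be checked with a few lines of arithmetic. The following Python sketch is a planning aid under the stated recommendations (250 GB per index, 500 GB total, two physical cores per index). The function names and the example index names are hypothetical.

```python
from typing import Dict, List

# Recommended limits from the text above.
SINGLE_INDEX_MAX_GB = 250
TOTAL_INDEX_MAX_GB = 500
PHYSICAL_CORES_PER_INDEX = 2


def max_search_indexes(physical_cores: int) -> int:
    """Recommended maximum number of search indexes for a machine."""
    return physical_cores // PHYSICAL_CORES_PER_INDEX


def index_plan_warnings(physical_cores: int, index_sizes_gb: Dict[str, float]) -> List[str]:
    """Return a warning for each recommendation that the plan exceeds."""
    warnings = []
    limit = max_search_indexes(physical_cores)
    if len(index_sizes_gb) > limit:
        warnings.append(f"{len(index_sizes_gb)} indexes exceed the limit of {limit}")
    for name, size_gb in index_sizes_gb.items():
        if size_gb > SINGLE_INDEX_MAX_GB:
            warnings.append(f"index {name} is larger than {SINGLE_INDEX_MAX_GB} GB")
    total_gb = sum(index_sizes_gb.values())
    if len(index_sizes_gb) > 1 and total_gb > TOTAL_INDEX_MAX_GB:
        warnings.append(f"total index size {total_gb} GB exceeds {TOTAL_INDEX_MAX_GB} GB")
    return warnings


if __name__ == "__main__":
    # 16 virtual CPUs on 8 physical cores, as in the example above.
    print(max_search_indexes(8))  # prints 4
    for warning in index_plan_warnings(8, {"orders": 300, "customers": 120}):
        print(warning)
```

Pass the number of physical cores, not virtual CPUs, because the recommendation is based on physical cores.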

A DSE Search index can be significantly larger than the actual table data, depending on the data types of the indexed columns and the index type. For example, text columns indexed for full-text search or substring search can produce large indexes. Index only the columns that are required to support your data model, and create the indexes with the appropriate configuration.

Test DSE Search capacity

Capacity planning for DSE Search requires that you estimate how large the search index could become. To do this, you need to index documents on a single node, run typical user queries, and then examine the memory usage for heap allocation.

Repeat this process with more documents until you get a clear estimate of the size of the index for the maximum number of documents that a single node can handle. Then, you can determine how many servers to deploy for a cluster, as well as the optimal heap size.

You can perform this test on a dedicated test instance, or you can include it when you test your cluster with simulated production workloads. Whichever schedule you choose, make sure you thoroughly test before deploying to production.

Create the schema.xml and solrconfig.xml files.
Start a node with the following configuration:
- The amount of RAM that you determined during capacity planning. If you are unsure, you might need to repeat the test multiple times to find the ideal amount.
- Data and commit log disks configured as explained in Disk space required for compaction.
- A dedicated drive for search indexes.
  
  Make sure you store the index on SSDs or in the system I/O cache.
Write documents to the node.

Ideally, this should reflect the expected number of documents per node in production. However, you might want to have additional documents available for continued testing and validating maximum limits.
Run queries that simulate a production environment.

Include all common queries as well as edge cases and less frequent queries, including known slow queries. This ensures your tests are reflective of many scenarios, not only ideal queries.
View the size of the index (on disk) as reported in the status information about the Solr core.
Based on the server’s available system I/O cache, set a maximum index size per server.

Based on the available system memory, set the maximum heap size required per server.

For faster live indexing, configure live indexing (RT) postings to be allocated offheap. See Tune DSE Search for maximum indexing throughput.

Enable live indexing on only one search core per cluster.

Calculate the maximum number of documents per node based on the available system I/O cache and the available system memory.

This determines the maximum index size and maximum heap size per server.
When the system approaches the maximum documents per node, deploy more nodes.
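The final calculation can be sketched as follows. This Python example extrapolates from one test run and rests on two assumptions that the procedure above doesn't guarantee: index size grows linearly with the number of documents, and every replica is indexed, so the replication factor multiplies the total document count. The function names and sample figures are hypothetical. Use your own measurements, and repeat the test at several document counts to confirm the trend.

```python
def max_docs_per_node(sample_docs: int, sample_index_gb: float,
                      max_index_gb_per_node: float) -> int:
    """Extrapolate the document limit per node from a measured sample.

    Assumes index size grows linearly with document count.
    """
    if sample_docs < 1 or sample_index_gb <= 0 or max_index_gb_per_node <= 0:
        raise ValueError("all inputs must be positive")
    return int(max_index_gb_per_node * sample_docs / sample_index_gb)


def nodes_needed(total_docs: int, replication_factor: int, docs_per_node: int) -> int:
    """Nodes required to hold every replica of every document.

    Assumes each replica is indexed on the node that stores it.
    """
    if docs_per_node < 1:
        raise ValueError("docs_per_node must be positive")
    total_replicas = total_docs * replication_factor
    # Integer ceiling division avoids floating-point rounding.
    return (total_replicas + docs_per_node - 1) // docs_per_node


if __name__ == "__main__":
    # Hypothetical test result: 10 million documents produced a 20 GB index,
    # and the chosen maximum index size per server is 200 GB.
    per_node = max_docs_per_node(10_000_000, 20, 200)
    print(per_node)
    print(nodes_needed(250_000_000, 3, per_node))
```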

Next steps

Always start with a test environment. Don’t deploy untested configurations directly to production.

Continue planning your deployment by reviewing recommended settings and data modeling guidance.

Some configurations cannot be implemented until you install DSE. However, it is helpful to be familiar with this information so you can incorporate it into your deployment plan.
- Recommended production settings
- Create and evaluate data models and schemas
In a test environment, initialize a cluster, and then install DSE on the nodes in the cluster.
Test your cluster with simulated production workloads to ensure that your hardware and settings are sufficient.

After testing, you should be familiar with deploying and configuring DSE clusters and nodes. You should feel confident about your cluster’s ability to handle production workloads, and you can begin deploying clusters in production according to your organization’s operational procedures and schedules.
