Capacity Planning for DSE Search

About this task

DSE Search is memory-intensive and rereads the entire row when updating indexes, which can cause a significant performance hit on spinning disks. Use solid-state drives (SSDs) for applications that have very aggressive insert and update requirements.

This capacity planning discovery process helps you plan for sufficient memory resources to meet your operational requirements.

The location of the cassandra.yaml file depends on the type of installation:

Package installations

/etc/dse/cassandra/cassandra.yaml

Tarball installations

<installation_location>/resources/cassandra/conf/cassandra.yaml

Overview

First, estimate how large your search index will grow. Index a number of documents on a single node, execute typical user queries, and then examine the memory usage for heap allocation. Repeat this process using a greater number of documents until you get a solid estimate of the size of the index for the maximum number of documents that a single node can handle. You can then determine how many servers to deploy for a cluster and the optimal heap size. Store the index on SSDs or in the system I/O cache.
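
For example, after indexing a batch of test documents and running representative queries on a single node, you can check the on-disk index size and JVM heap usage with standard tools. This is a minimal sketch only; the solr.data path below is an assumption (the default for package installations) and varies by environment.

    # Hedged sketch: measure index size and heap usage on a single test node.
    # Adjust the solr.data path to match your installation.
    du -sh /var/lib/cassandra/data/solr.data/*

    # JVM heap usage reported by the node (see the "Heap Memory (MB)" line).
    nodetool info | grep -i heap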

Capacity planning requires a significant effort by operations personnel to:

  • Set the optimal heap size per node.

  • Estimate the number of nodes required for your application.

  • Increase the replication factor to support more queries per second.

  • Plan for distributed queries: in DSE Search they are most efficient when the number of nodes in the queried data center (DC) is a multiple of the replication factor (RF) in that DC (see the sketch after this list).
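
For example, with RF = 3 in the queried DC, deploy 3, 6, or 9 search nodes in that DC. The following minimal sketch shows raising the replication factor for a search keyspace to support more queries per second; the keyspace and DC names (cycling, Search_DC) are placeholders, and the additional replicas must be streamed with repair after the change.

    # Hedged sketch: raise the replication factor for a search keyspace.
    # Keyspace and data center names are hypothetical.
    cqlsh -e "ALTER KEYSPACE cycling
              WITH replication = {'class': 'NetworkTopologyStrategy', 'Search_DC': 3};"

    # Stream the additional replicas.
    nodetool repair --full cycling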

The Preflight check tool can detect and fix many invalid or suboptimal configuration settings.

Locate transactional and search data on separate SSDs

It is critical that you locate DSE transactional (Cassandra) data and Solr-based DSE Search data on separate solid-state drives (SSDs). Failure to do so is very likely to result in suboptimal search indexing performance. For the steps to accomplish this task, refer to Set the location of search indexes.
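
The following is a minimal sketch of pointing search indexes at a dedicated SSD mount. It assumes the index location is controlled by the solr_data_dir setting in dse.yaml and that /mnt/search-ssd is the SSD mount point; both are illustrative assumptions, so follow Set the location of search indexes for the authoritative steps.

    # Hedged sketch: place search indexes on a dedicated SSD mount point.
    # /mnt/search-ssd is a hypothetical mount; the dse.yaml path shown is for
    # package installations.
    sudo mkdir -p /mnt/search-ssd/solr.data
    sudo chown -R cassandra:cassandra /mnt/search-ssd/solr.data

    # In dse.yaml, set (verify the option name for your DSE version):
    #   solr_data_dir: /mnt/search-ssd/solr.data
    grep 'solr_data_dir' /etc/dse/dse.yaml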

Recommendations

  • For best performance, DataStax recommends using SSDs for all versions.

    For DSE 5.1, you can use spinning disks; DSE 5.1 provides tuning capabilities for them. If SSDs are not available, consult DataStax Support or the DataStax Services team for tuning guidance.

    If you use spinning disks with DSE 5.1 for search, DataStax recommends staying on 5.1. See Planning your DataStax Enterprise upgrade.

  • For DSE 6.0 and later, DSE Search requires SSDs. The tuning capabilities of DSE 5.1 are not available in later versions.

  • Using vnodes with DSE Search is not recommended. However, if you decide to use vnodes with DSE Search, do not use more than 8 vnodes and ensure that the allocate_tokens_for_local_replication_factor option in cassandra.yaml is correctly configured for your environment (see the configuration sketch after this list).

  • Recommended maximum index sizes:

    • Single index: 250 GB maximum

      Once a single index exceeds 250 GB or if performance degrades, consider adding nodes to further distribute the search index.

    • Multiple indexes: collective index size: 500 GB maximum

      Supporting more than one search index is hardware dependent with respect to the number of physical CPU cores available. DataStax recommends a minimum of two physical cores per search index; that is, the maximum number of search indexes is the number of physical cores divided by two.

      For example, if a machine has 16 virtual CPUs on 8 physical cores, the recommended maximum number of search indexes is 4.

  • Set the location of search indexes.

  • Perform extensive testing or consult the DataStax Services team.
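
If you do use vnodes with DSE Search, the following minimal sketch verifies the vnode-related settings noted above. The example values are hypothetical; in particular, the value of allocate_tokens_for_local_replication_factor should match the replication factor of your search keyspaces in that DC, and both settings must be in place before the node first starts.

    # Hedged sketch: verify vnode settings on a DSE Search node.
    # Example values only:
    #   num_tokens: 8
    #   allocate_tokens_for_local_replication_factor: 3
    grep -E '^(num_tokens|allocate_tokens_for_local_replication_factor)' \
        /etc/dse/cassandra/cassandra.yaml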

Prerequisites

A node with:

  • The amount of RAM that is determined during capacity planning.

  • DataStax recommends the following:

    • One data/logs disk. (If using spinning disks, a separate disk for the commit log. See Disk space.)

    • A dedicated drive for search indexes.

Input data:

  • N documents indexed on a single test node

  • A complete set of sample queries to be executed

  • The maximum number of documents the system supports

Procedure

The capacity planning discovery steps:

  1. Create the schema.xml and solrconfig.xml files (see the index-creation sketch after these steps).

  2. Start a node.

  3. Add N documents.

  4. Run a range of queries that simulate a production environment.

  5. View the size of the index (on disk) included in the status information about the Solr core.

  6. Based on the server’s available system I/O cache, set a maximum index size per server.

  7. Based on the available system memory, set the maximum heap size required per server. For faster live indexing, configure live indexing (RT) postings to be allocated off-heap (see the heap and off-heap sketch after these steps).

    DataStax recommends the following heap sizes:

    • System memory less than 64 GB: 24 GB

    • System memory greater than 64 GB: 30 GB

    Enable live indexing on only one search core per cluster.

  8. Calculate the maximum number of documents per node based on steps 6 and 7.

    When the system approaches the maximum number of documents per node, add more nodes.
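
As a starting point for steps 1 through 3, the following minimal sketch creates a search index on a test table and lets DSE generate the schema.xml and solrconfig.xml resources automatically. The keyspace and table names (cycling.comments) are placeholders, and auto-generated resources usually need tuning before they represent your production schema.

    # Hedged sketch: create a search core with auto-generated resources
    # (keyspace and table names are hypothetical).
    dsetool create_core cycling.comments generateResources=true

    # Alternatively, in DSE 5.1 and later, the equivalent CQL is:
    cqlsh -e "CREATE SEARCH INDEX ON cycling.comments;"

    # Confirm indexing has finished before running the test queries.
    dsetool core_indexing_status cycling.comments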
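
For step 7, this minimal sketch shows where the heap and the live-indexing (RT) buffers are typically configured. The jvm.options path is for package installations, and the ram_buffer_* option names in dse.yaml are assumptions for illustration; verify both against the documentation for your DSE version.

    # Hedged sketch: set the JVM heap for a node with more than 64 GB of RAM.
    # In /etc/dse/cassandra/jvm.options (package installs), set for example:
    #   -Xms30G
    #   -Xmx30G
    grep -E '^-Xm[sx]' /etc/dse/cassandra/jvm.options

    # To allocate live indexing (RT) postings off-heap, dse.yaml exposes RAM
    # buffer settings (option names are an assumption), for example:
    #   ram_buffer_heap_space_in_mb: 1024
    #   ram_buffer_offheap_space_in_mb: 2048
    grep 'ram_buffer' /etc/dse/dse.yaml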
