Capacity planning for DSE Search

Use a discovery process to develop a DSE Search capacity plan that ensures sufficient memory resources.

DSE Search is memory-intensive and rereads the entire row when updating indexes, which can cause a significant performance hit on spinning disks. Use solid-state drives (SSDs) for applications that have very aggressive insert and update requirements.

This capacity planning discovery process helps you develop a plan for having sufficient memory resources to meet the operational requirements.

cassandra.yaml

The location of the cassandra.yaml file depends on the type of installation:
  • Package installations: /etc/dse/cassandra/cassandra.yaml
  • Tarball installations: installation_location/resources/cassandra/conf/cassandra.yaml

Overview

First, estimate how large your search index will grow by indexing a number of documents on a single node, executing typical user queries, and then examining the memory usage for heap allocation. Repeat this process using a greater number of documents until you get a solid estimate of the size of the index for the maximum number of documents that a single node can handle. You can then determine how many servers to deploy for a cluster and the optimal heap size. Store the index on SSDs or in the system IO cache.

Capacity planning requires a significant effort by operations personnel to:
  • Set the optimal heap size per node.
  • Estimate the number of nodes that are required for your application.
  • Increase the replication factor to support more queries per second.
  • Keep the number of nodes in each queried data center (DC) a multiple of that DC's replication factor (RF); distributed queries in DSE Search are most efficient in this configuration.
Note: The Preflight check tool can detect and fix many invalid or suboptimal configuration settings.

Recommendations

  • For best performance, DataStax recommends using SSDs for all versions.
    Note: For DSE 5.1, you can use spinning disks. DSE 5.1 provides capabilities for tuning spinning disks. If SSDs are not available, consult DataStax Support or the DataStax Services team for tuning guidance.

    If you use DSE 5.1 for search, DataStax recommends staying on 5.1. See Planning your DataStax Enterprise upgrade.

  • For DSE 6.0 and later, DSE Search requires SSDs. The tuning capabilities of DSE 5.1 are not available in later versions.
  • Using vnodes with DSE Search is not recommended. However, if you decide to use vnodes with DSE Search, do not use more than 8 vnodes, and ensure that the allocate_tokens_for_local_replication_factor option in cassandra.yaml is correctly configured for your environment.
  • Recommended maximum index sizes:
    • Single index: 250 GB maximum

      Once a single index exceeds 250 GB or if performance degrades, consider adding nodes to further distribute the search index.

    • Multiple indexes: 500 GB maximum collective index size

      Supporting more than one search index is hardware dependent with respect to the number of available physical CPU cores. DataStax recommends a minimum of two physical cores per search index; therefore, the maximum number of search indexes is the number of physical cores divided by two.

      For example, if a machine has 16 virtual CPUs on 8 physical cores, the recommended maximum number of search indexes is 4.

  • Set the location of search indexes.
  • Perform extensive testing or consult the DataStax Services team.
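The cores-per-index rule above reduces to a one-line calculation. This sketch simply encodes that rule; the machine sizes are example values, and note that the division uses physical cores, not virtual CPUs.

```python
# Sizing rule from the recommendations: at least two physical cores per
# search index, so the maximum number of indexes is physical cores // 2.

def max_search_indexes(physical_cores):
    """Maximum recommended search indexes for a machine, per the
    two-physical-cores-per-index guideline."""
    return physical_cores // 2

# A machine with 16 virtual CPUs on 8 physical cores supports at most:
print(max_search_indexes(8))  # 4
```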

Prerequisites

A node with:
  • The amount of RAM that is determined during capacity planning.
  • DataStax recommends the following:
    • One data/logs disk. If you use spinning disks, use a separate disk for the commit log; see Disk space.
    • A dedicated drive for search indexes.
Input data:
  • N documents indexed on a single test node
  • A complete set of sample queries to be executed
  • The maximum number of documents the system will support

Procedure

The capacity planning discovery steps:
  1. Create the schema.xml and solrconfig.xml files.
  2. Start a node.
  3. Add N docs.
  4. Run a range of queries that simulate a production environment.
  5. View the size of the index (on disk) included in the status information about the Solr core.
  6. Based on the server's system IO cache available, set a maximum index size per server.
  7. Based on the available system memory, set a maximum heap size required per server.
    DataStax recommends the following heap sizes:
    • System memory less than 64 GB: 24 GB
    • System memory 64 GB or greater: 30 GB
    For faster live indexing, configure live indexing, also called real-time (RT) indexing, so that its postings are allocated off-heap.
    Note: Enable live indexing on only one search core per cluster.
  8. Calculate the maximum number of documents per node based on the results of steps 6 and 7.

    When the system approaches the maximum number of documents per node, add more nodes.
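Steps 6 through 8 combine into one calculation: divide the per-server index budget (bounded by the system IO cache) by the per-document index footprint measured on the test node. The sketch below is a hedged illustration with hypothetical inputs; it is not a DataStax-provided formula, and actual limits should be confirmed by the testing described in the procedure.

```python
# Hypothetical sketch of steps 6-8: derive the maximum documents per node
# from the measured per-document index footprint and the per-server index
# size budget. All input values are placeholder examples.

def max_docs_per_node(index_budget_gb, measured_docs, measured_index_gb):
    """Maximum documents a node can hold before its index exceeds the
    per-server budget established from the system IO cache."""
    per_doc_gb = measured_index_gb / measured_docs
    return int(index_budget_gb / per_doc_gb)

# Example: a test node indexed 5 million docs into 20 GB, and the server's
# IO cache supports a 200 GB index:
print(max_docs_per_node(200, 5_000_000, 20))  # 50000000
```

When monitoring shows document counts nearing this figure, that is the signal to add nodes, as the procedure notes.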