Capacity planning for DSE Search

Using a discovery process to develop a DSE Search capacity plan that ensures sufficient memory resources.

DSE Search is memory-intensive, and it rereads the entire row when it updates an index, which can cause a significant performance hit on spinning disks. Use solid-state drives (SSDs) for applications that have very aggressive insert and update requirements.

This capacity planning discovery process helps you develop a plan for provisioning sufficient memory resources to meet your operational requirements.

Overview

First, estimate how large your search index will grow by indexing a set of documents on a single node, executing typical user queries, and then examining the memory used for heap allocation. Repeat this process with progressively larger document counts until you have a solid estimate of the index size for the maximum number of documents that a single node can handle, as in the sketch below. You can then determine how many servers to deploy for a cluster and the optimal heap size. Store the index on SSDs, or ensure that it fits in the system IO cache.
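For example, the following sketch extrapolates a projected index size from a few single-node test runs. It is illustrative only: the document counts, index sizes, and target maximum are placeholders for the measurements you take yourself.

    # Extrapolate the on-disk index size observed at increasing document
    # counts. All numbers are placeholders; substitute your own measurements.
    measurements = [        # (documents indexed, index size in GB)
        (1_000_000, 2.1),
        (2_000_000, 4.3),
        (4_000_000, 8.6),
    ]

    # Estimate GB per document from the largest run, where fixed overhead
    # matters least, then project to the application's maximum.
    docs, size_gb = measurements[-1]
    gb_per_doc = size_gb / docs

    max_docs = 50_000_000   # maximum documents the application must support
    print(f"Projected index size: {max_docs * gb_per_doc:,.0f} GB")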

Capacity planning requires a significant effort by operations personnel to:
  • Set the optimal heap size per node.
  • Estimate the number of nodes that your application requires.
  • Increase the replication factor to support more queries per second. Distributed queries in DSE Search are most efficient when the number of nodes in the queried data center (DC) is a multiple of that DC's replication factor (RF), as shown in the sketch after the following note.
Note: The Preflight check tool can detect and fix many invalid or suboptimal configuration settings.
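Rounding up to a multiple of the replication factor is simple arithmetic; here is a minimal sketch (the function name and example numbers are illustrative):

    import math

    def nodes_for_dc(min_nodes: int, rf: int) -> int:
        """Round a node count up to the next multiple of the DC's
        replication factor so distributed queries divide evenly."""
        return math.ceil(min_nodes / rf) * rf

    # Example: capacity math calls for 7 nodes in a DC with RF = 3,
    # so deploy 9.
    print(nodes_for_dc(7, 3))   # -> 9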

Prerequisites

A node with:
  • The amount of RAM that is determined during capacity planning.
  • The disk layout that DataStax recommends:
    • One disk for data and logs. (If you use spinning disks, put the commit log on a separate disk. See Disk space.)
    • One disk for DSE Search.
Input data:
  • N documents indexed on a single test node
  • A complete set of sample queries to be executed
  • The maximum number of documents the system will support

Procedure

  1. Create the schema.xml and solrconfig.xml files.
  2. Start a node.
  3. Add N documents.
  4. Run a range of queries that simulate a production environment.
  5. View the on-disk size of the index, which is included in the status information for the Solr core. (A sketch of reading the core status follows this procedure.)
  6. Based on the server's available system IO cache, set a maximum index size per server.
  7. Based on the available system memory, set the maximum heap size per server.
    DataStax recommends the following heap sizes:
    • System memory less than 64 GB: 24 GB
    • System memory greater than 64 GB: 32 GB
    For faster live indexing, configure live indexing (RT) postings to be allocated off-heap.
    Note: Enable live indexing on only one search index per cluster.
  8. Calculate the maximum number of documents per node, based on the results of steps 6 and 7. (A sketch of this calculation also follows the procedure.)

    When the system approaches the maximum number of documents per node, add more nodes.
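The following sketch shows one way to read the on-disk index size in step 5: DSE Search exposes the standard Solr CoreAdmin status over HTTP, and a search core is named after the table it indexes. The host, port, and core name here are placeholder assumptions for a single test node.

    import json
    import urllib.request

    # Query the Solr CoreAdmin STATUS API on the test node. Port 8983 is
    # the default Solr HTTP port; "mykeyspace.mytable" is a hypothetical
    # core named after the indexed table.
    core = "mykeyspace.mytable"
    url = ("http://localhost:8983/solr/admin/cores"
           f"?action=STATUS&core={core}&wt=json")
    with urllib.request.urlopen(url) as resp:
        status = json.load(resp)

    index_bytes = status["status"][core]["index"]["sizeInBytes"]
    print(f"Index size on disk: {index_bytes / 2**30:.1f} GB")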
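And a minimal sketch of the capacity arithmetic in steps 6 through 8, under placeholder inputs; substitute the values from your own tests:

    import math

    # Placeholder inputs; replace with your measured and target values.
    index_bytes = 9 * 2**30        # on-disk index size from step 5
    docs_indexed = 4_000_000       # N documents loaded on the test node
    io_cache_bytes = 48 * 2**30    # system IO cache available per server
    total_docs = 500_000_000       # maximum documents to support
    rf = 3                         # replication factor in the target DC

    bytes_per_doc = index_bytes / docs_indexed
    max_docs_per_node = io_cache_bytes / bytes_per_doc

    # Each document is indexed on rf nodes, so the cluster must hold
    # total_docs * rf indexed copies; round the node count up to a
    # multiple of rf so distributed queries divide evenly.
    min_nodes = math.ceil(total_docs * rf / max_docs_per_node)
    nodes = math.ceil(min_nodes / rf) * rf
    print(f"Maximum documents per node: {max_docs_per_node:,.0f}")
    print(f"Deploy at least {nodes} nodes")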