Tuning DSE Search for maximum indexing throughput
dse.yaml
The location of the dse.yaml file depends on the type of installation:
- Package installations: /etc/dse/dse.yaml
- Tarball installations: installation_location/resources/dse/conf/dse.yaml
cassandra.yaml
The location of the cassandra.yaml file depends on the type of installation:
- Package installations: /etc/dse/cassandra/cassandra.yaml
- Tarball installations: installation_location/resources/cassandra/conf/cassandra.yaml
To tune DataStax Enterprise (DSE) Search for maximum indexing throughput, follow the recommendations in this topic. Also see the related topics in DSE Search performance tuning and monitoring. If search throughput improves in your development environment, consider using the recommendations in production.
Locate transactional and search data on separate SSDs
For the steps to accomplish this task, refer to Set the location of search indexes.
In addition, plan for sufficient memory resources and disk space to meet operational requirements. Refer to Capacity planning for DSE Search.
Determine physical CPU resources
Before you tune anything, determine how many physical CPUs you have. The JVM does not know whether CPUs are using hyper-threading.
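One way to separate physical cores from hyper-threaded logical CPUs on Linux is to count the unique (physical id, core id) pairs reported in /proc/cpuinfo. The following is a minimal sketch, not part of DSE; the function name and the sample text are hypothetical, and it assumes that the "physical id" and "core id" fields are present, which may not be true on every platform or virtual machine.

```python
def count_physical_cores(cpuinfo_text: str) -> int:
    """Count unique (physical id, core id) pairs in /proc/cpuinfo-style text."""
    cores = set()
    physical_id = None
    core_id = None
    # The extra empty string flushes the last processor block.
    for line in cpuinfo_text.splitlines() + [""]:
        if not line.strip():
            # A blank line ends one logical-processor block.
            if physical_id is not None and core_id is not None:
                cores.add((physical_id, core_id))
            physical_id = core_id = None
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "physical id":
            physical_id = value.strip()
        elif key == "core id":
            core_id = value.strip()
    return len(cores)


# Hypothetical sample: one socket, two cores, hyper-threading on
# (four logical processors share two physical cores).
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 1

processor : 2
physical id : 0
core id : 0

processor : 3
physical id : 0
core id : 1
"""

physical = count_physical_cores(SAMPLE)
logical = sum(1 for l in SAMPLE.splitlines() if l.startswith("processor"))
print(f"logical CPUs: {logical}, physical cores: {physical}")
```

On a real node, pass the contents of /proc/cpuinfo to the function instead of the sample text.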
Assess the IO throughput
Check the iowait system metric by using the iostat command during peak load. For example, on Linux:
iostat -x -c -d -t 1 600
Iowait is a measure of the time over a given period that a CPU (or all CPUs) spent idle because all runnable tasks were waiting for an IO operation to complete. While each environment is unique, a general guideline is to check whether iowait is above 5% more than 5% of the time. If that scenario occurs, try upgrading to faster SSD devices or tune the machine to use less IO, and test again. As noted earlier, it is important to locate the search data on dedicated SSDs, separate from the transactional data.
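The guideline above can be expressed as a small check over collected samples. This is a sketch only: the function name is hypothetical, and it assumes you have already extracted the %iowait values (one per iostat interval) into a list of numbers.

```python
def iowait_exceeds_guideline(samples, threshold_pct=5.0, max_fraction=0.05):
    """Return True if iowait is above threshold_pct in more than
    max_fraction of the samples (the 5% / 5% rule of thumb)."""
    if not samples:
        raise ValueError("no iowait samples provided")
    above = sum(1 for s in samples if s > threshold_pct)
    return above / len(samples) > max_fraction


# Hypothetical data: 600 one-second samples, 60 of them above 5% iowait.
busy = [1.2] * 540 + [9.5] * 60
quiet = [1.2] * 590 + [9.5] * 10

print(iowait_exceeds_guideline(busy))   # 10% of samples are above 5%
print(iowait_exceeds_guideline(quiet))  # under 2% of samples are above 5%
```

If the check returns True for samples taken at peak load, the IO subsystem is a likely candidate for tuning or an upgrade.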
Enabling Asynchronous I/O (AIO)
In prior DSE 6.x releases, DataStax recommended disabling AIO and setting file_cache_size_in_mb to 512 for search workloads, to improve indexing and query performance. Starting with DSE 6.8.0, DataStax recommends enabling AIO and using the default file cache size, which is calculated as 50% of -XX:MaxDirectMemorySize. Test the performance in your development environment. If the settings result in improved performance, consider making the changes in production.
The changed recommendation is based on DataStax performance testing results and is specific to DSE 6.8.0 and later releases. DSE enhancements were made so that the buffer pool no longer over-allocates memory.
By default, AIO is enabled. However, if you previously disabled AIO in your DSE 6.7.x or 6.0.x configuration, pass -Ddse.io.aio.enabled=true to DSE at startup.
file_cache_size_in_mb: 2048
inflight_data_overhead_in_mb: 512
With those properties, the size of the buffer pool will be 2048 MB, while the size of the cache will be 2048 - 512, or 1536 MB.
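The sizing arithmetic can be sketched as follows. The helper names are hypothetical and the code only restates the relationships described in this topic (default file cache is 50% of -XX:MaxDirectMemorySize; cache is the file cache size minus the inflight data overhead). The range check is a sanity guard added for this sketch, not documented DSE behavior.

```python
def default_file_cache_mb(max_direct_memory_mb: int) -> int:
    """Default file cache size: 50% of -XX:MaxDirectMemorySize (in MB)."""
    return max_direct_memory_mb // 2


def cache_size_mb(file_cache_size_in_mb: int,
                  inflight_data_overhead_in_mb: int) -> int:
    """Cache size left after reserving the inflight data overhead."""
    if not 0 <= inflight_data_overhead_in_mb < file_cache_size_in_mb:
        # Sketch-only guard: the overhead must fit inside the buffer pool.
        raise ValueError("overhead must be smaller than the file cache size")
    return file_cache_size_in_mb - inflight_data_overhead_in_mb


# The example from this topic: a 2048 MB buffer pool with 512 MB overhead.
print(cache_size_mb(2048, 512))

# Hypothetical JVM setting of -XX:MaxDirectMemorySize=8192M.
print(default_file_cache_mb(8192))
```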
The gap between the buffer pool size and the cache size is set by the inflight_data_overhead_in_mb property. Alternatively, as recommended above, enable AIO and use the calculated default file cache size.
Differences between indexing modes
- Near-real-time (NRT) indexing is the default indexing mode for Apache Solr™ and Apache Lucene®.
- Live indexing, also called real-time (RT) indexing, supports searching directly against the Lucene RAM buffer and more frequent, cheaper soft-commits, which provide earlier visibility to newly indexed data. However, RT indexing requires a larger RAM buffer and more memory usage than an otherwise equivalent NRT setup.
Tune NRT reindexing
DSE Search provides multi-threaded asynchronous indexing with a back pressure mechanism to avoid saturating available memory and to maintain stable performance. Multi-threaded indexing improves performance on machines that have multiple CPU cores.
For reindexing only, the IndexPool MBean provides operational visibility and tuning through JMX.
- Increase the soft commit time, which is set to 10 seconds (10000 ms) by default. For example, increase the time to 60 seconds and then reload the search index:
ALTER SEARCH INDEX CONFIG ON demo.health_data SET autoCommitTime = 60000;
To make the pending changes active:
RELOAD SEARCH INDEX ON demo.health_data;
The trade-off of increasing the autoSoftCommit attribute is that newly updated rows take longer than usual (10000 ms) to appear in search results.
Tune RT indexing
- To enable live indexing (also known as RT):
ALTER SEARCH INDEX CONFIG ON demo.health_data SET realtime = true;
- To configure live indexing, set the autoCommitTime to a value between 100 and 1000 ms:
ALTER SEARCH INDEX CONFIG ON demo.health_data SET autoCommitTime = 1000;
Test with tuning values of 100-1000 ms. An optimal setting in this range depends on your hardware and environment. For live indexing (RT), this refresh interval saturates at 1000 ms. A value higher than 1000 ms is not recognized.
- Ensure that search nodes have at least 14 GB heap.
- If you change the heap, restart DSE to use live indexing with the changed heap size.
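When testing several autoCommitTime values for live indexing, a small helper can keep candidate values inside the 100-1000 ms range before the statement is run. This is an illustrative sketch: the function and constant names are hypothetical, rejecting out-of-range values is a choice made here rather than DSE behavior, and the keyspace and table names are inserted into the statement as given.

```python
RT_MIN_MS = 100   # lower end of the suggested tuning range
RT_MAX_MS = 1000  # values above this are not recognized for RT


def rt_auto_commit_statement(keyspace: str, table: str,
                             auto_commit_ms: int) -> str:
    """Build the ALTER SEARCH INDEX CONFIG statement for an RT
    autoCommitTime, rejecting values outside 100-1000 ms."""
    if not RT_MIN_MS <= auto_commit_ms <= RT_MAX_MS:
        raise ValueError(
            f"autoCommitTime must be between {RT_MIN_MS} and {RT_MAX_MS} ms"
        )
    return (
        f"ALTER SEARCH INDEX CONFIG ON {keyspace}.{table} "
        f"SET autoCommitTime = {auto_commit_ms};"
    )


# Candidate values to test in a development environment.
for ms in (100, 250, 500, 1000):
    print(rt_auto_commit_statement("demo", "health_data", ms))
```

Each printed statement can then be run in cqlsh, one value per test run, while you measure indexing throughput.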
Tune TPC cores
DSE Search workloads do not benefit from hyper-threading for writes (indexing). To optimize DSE Search for indexing throughput for both modes (NRT and RT), change tpc_cores in cassandra.yaml from the default to the number of physical CPUs. Change this setting only on search nodes, because this change might degrade throughput for workloads other than search.
Size RAM buffer
- ram_buffer_heap_space_in_mb: 1024
- ram_buffer_offheap_space_in_mb: 1024
Because NRT does not use offheap, these settings apply only to RT.
JMX MBean path: com.datastax.bdp.metrics.search.RamBufferSize
Check back pressure setting
The back_pressure_threshold_per_core in dse.yaml affects only index rebuilding/reindexing. If you upgraded to DSE 6.0 from earlier versions, ensure that you use the new default value of 1024.
Use default mergeScheduler
The default mergeScheduler settings are set automatically. Do not adjust these settings in DSE 6.0 and later. In earlier versions, the default settings were different and might have required tuning.