DSE Search initial data migration

Best practices and guidelines for loading data into DSE Search.

When you initially load data into DataStax Enterprise (DSE), plan for resource contention to maintain performance.

  • DSE is performant when writing data.

  • Apache Solr™ is resource intensive when creating a search index.

These two activities compete for resources, so proper resource allocation is critical to maximize efficiency during the initial data load.

Recommendations

  • For maximum throughput, store the search index data and DataStax Enterprise (Cassandra) data on separate physical disks.

    If you are unable to use separate disks, DataStax recommends using SSDs with a minimum read/write bandwidth of 500 MB/s.

  • Enable the OpsCenter 6.1 Repair Service.

Also see memory recommendations in the planning guide.

Initial bulk loading

DataStax recommends following this high-level procedure:

  1. Install DSE and configure nodes for search workloads.

  2. Use the CQL CREATE SEARCH INDEX command to create search indexes.

  3. Tune the index for maximum indexing throughput.

  4. Load data into the database using best practices for data loading. For example, load data through the driver with the consistency level set to LOCAL_ONE (CL.LOCAL_ONE) and a sufficiently high write timeout.

    After data loading completes, the index might lag behind the data because indexing is asynchronous.

  5. Check the indexing QueueSize with the IndexPool MBean. After the index queue size has receded, run this CQL query to verify that the number of indexed records is as expected:

    SELECT count(*) FROM ks.table WHERE solr_query = '*:*';

New data is automatically indexed.
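
Because indexing is asynchronous, the count returned by the query in step 5 can keep rising for a while after loading ends. The following is a minimal sketch, not a DataStax-provided tool, of one way to wait for the count to stop changing before comparing it with the expected number of records. The function name wait_for_stable_count and all of its parameters are hypothetical; get_count stands in for whatever code runs the step 5 query through your driver.

```python
import time
from typing import Callable


def wait_for_stable_count(
    get_count: Callable[[], int],
    stable_polls: int = 3,
    interval_s: float = 10.0,
    max_polls: int = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll get_count() until it returns the same value stable_polls times in a row.

    get_count is a caller-supplied function (an assumption of this sketch) that
    runs the count query from step 5 and returns the result as an int.
    Raises TimeoutError if the count never settles within max_polls polls.
    """
    last = None
    streak = 0
    for _ in range(max_polls):
        current = get_count()
        if current == last:
            streak += 1          # same reading as the previous poll
        else:
            last = current       # count changed, so start a new streak
            streak = 1
        if streak >= stable_polls:
            return current
        sleep(interval_s)
    raise TimeoutError("record count did not stabilize")
```

A caller would pass a function that executes SELECT count(*) FROM ks.table WHERE solr_query = '*:*' and returns the single value, then compare the returned count with the number of records loaded. The sleep parameter exists only so the wait can be replaced in tests.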

Troubleshooting

If the record count does not stabilize:

  • If the nodetool tpstats output shows dropped mutations on some nodes, and the OpsCenter repair service is not enabled, run a manual repair on those nodes.

  • If there are no dropped mutations, check the system.log and the Solr validation log for indexing errors.
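
The dropped-mutation check above can be scripted. The sketch below assumes that the nodetool tpstats output contains a dropped-messages table whose rows start with the message type followed by the dropped count. The exact layout varies between versions, so the sample text is illustrative only, and the helper name dropped_mutations is hypothetical.

```python
from typing import Optional


def dropped_mutations(tpstats_output: str) -> Optional[int]:
    """Return the dropped count from the MUTATION row, or None if no such row exists.

    Assumes each dropped-message row looks like "<MESSAGE_TYPE> <dropped> ...".
    """
    for line in tpstats_output.splitlines():
        fields = line.split()
        # Match the MUTATION row exactly, so COUNTER_MUTATION and similar rows are skipped.
        if len(fields) >= 2 and fields[0] == "MUTATION" and fields[1].isdigit():
            return int(fields[1])
    return None


# Illustrative sample only; real output has more sections and columns.
SAMPLE = """\
Message type           Dropped
READ                         0
MUTATION                    42
COUNTER_MUTATION             0
"""

if __name__ == "__main__":
    count = dropped_mutations(SAMPLE)
    if count:
        print(f"{count} dropped mutations: consider running a repair on this node")
    else:
        print("no dropped mutations reported")
```

On a node, the text could be captured by running nodetool tpstats (for example with subprocess.run and capture_output=True) and passing its standard output to the helper.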

