Search index filtering best practices

Best practices for DSE Search queries.

DataStax recommends following these best practices for running queries in DSE Search:
  • Use CQL to run search queries.

    Perform all data manipulation with CQL, except for deleting by query.

  • Use the simplest Solr field type that meets the requirements of your query. See Understanding schema field types.
  • For improved performance, use Solr filter query (fq) parameters instead of q parameters whenever possible. Filter query results are stored in a cache, which can reduce the average response time from seconds to milliseconds. The following example filters on the cyclist first name and last name:
    '{"q":"*:*", "fq":"firstname:Alex AND lastname:FRAME"}'
    You can also pass each field name and value pair as a separate member of an fq array. The array members are treated as if they are joined by AND. For example:
    '{"q":"*:*", "fq":["lastname:BELKOV", "nationality:Russia"]}'
    Adjust your queries so that the results fit into the memory cache.
  • Use profiles when creating a search index.
  • Avoid querying nodes that are indexing.

    When responding to queries, DSE Search ranks nodes that are not indexing higher than nodes that are indexing. If the only nodes that can satisfy a query are indexing, the query does not fail but can return only partial results.

  • Avoid wildcard queries that contain multiple search tokens, unless you are making those queries in conjunction with the Lucene KeywordTokenizer class. If the analyzer creates multiple search tokens from your original input, then you can perform wildcard queries on those tokens. You can also use the KeywordTokenizer class and define a multiple term wildcard query.
  • If you specify time-to-live (TTL) settings for data, add a query condition similar to the following example to keep search results synchronized with the data. The condition filters out expired data that is older than the TTL boundary given in the query. It does not remove expired data from the index; instead, it improves query consistency. In the example, the epoch time in seconds is a time in the past, rounded to the nearest minute, hour, and so on.
    "q":"-_ttl_expire:[* TO <epoch time in seconds>]" 
  • In addition to the TTL guidance in the previous point, do not use the current time in every query, because a filter value that changes on every query cannot be reused from the filter cache. Instead, create a window of staleness: for example, change the TTL filter value only every 60 seconds, or at another interval for which the data can be out of synchronization. This avoids caching filters that cannot be reused. Alternatively, use {!cache=false} with the TTL filter so that the filter is not cached. For example:
    "fq":"{!cache=false}-_ttl_expire:[0 TO <epoch time in seconds>]"
  • For optimal CQL single-pass queries, including queries where solr_query is used with a partition restriction, and queries with partition restrictions and a search predicate, ensure that the columns in the SELECT clause are not indexed in the search index schema.
    Auto-generation indexes all columns by default. You can ensure that a field is not indexed but is still returned in a single-pass query. For example, this statement indexes everything except column c3, and informs the search index schema about column c3 for efficient and correct single-pass queries.
    CREATE SEARCH INDEX
    ON test_search.abc
    WITH COLUMNS * { indexed : true }, c3 { indexed : false }; 
  • When vnodes are not configured in a cluster, distributed queries in DSE Search are most efficient when the number of nodes in the queried data center (DC) is a multiple of the replication factor (RF) in that DC.
  • Avoid using too many terms in a query. For example, avoid a query like the following:
    SELECT request_id, store_id
    FROM store_search.transaction_search
    WHERE solr_query =
      '{"q":"*:*","shards.failover":true,
      "shards.tolerant":false,"fq":"store_id:store1a
      store_id:store2b store_id:store2c ... store_id:store19987d"}';
    Instead, use a terms filter query.
  • For collections that are rarely updated, DataStax recommends frozen collections over non-frozen collections to address query latency.
    For example, a simple frozen set of text elements:
    CREATE TABLE foo (
      id text, values frozen<set<text>>, PRIMARY KEY (id)
    );
    A frozen list of UDTs:
    CREATE TYPE name (
      first text, last text
    );

    CREATE TABLE tableWithList (
      id text, names frozen<list<frozen<name>>>, PRIMARY KEY (id)
    );
  • JSON query limitations
    Failover and tolerance of partial results cannot coexist in the same JSON query. A query can enable only one of the two parameters (shards.failover or shards.tolerant), not both.
    Note: The shards.tolerant parameter is not supported when deep paging is on.
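
As an illustration of the fq array guidance above, the following minimal Python sketch builds the solr_query JSON string from a list of filters. The helper name build_solr_query is hypothetical and is not part of DSE or any driver; the sketch only assembles the string that you would then pass to a CQL SELECT statement.

```python
import json


def build_solr_query(filters, q="*:*"):
    """Return a solr_query JSON string with one fq array member per filter.

    Hypothetical helper: keeping each condition as its own fq member
    lets each filter be cached and reused independently.
    """
    return json.dumps({"q": q, "fq": list(filters)})


# Same filters as the fq array example in the text.
solr_query = build_solr_query(["lastname:BELKOV", "nationality:Russia"])
print(solr_query)
```

Generating the string with a JSON library, rather than by concatenation, avoids quoting mistakes when a filter value contains quotation marks.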
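
The window of staleness described in the TTL guidance above can be sketched as follows. This is a minimal example under stated assumptions: the helper name ttl_filter is hypothetical, and the sketch rounds the time down to the start of the current window, which is one way to pick "a time in the past" so that every query issued within the same window produces an identical, reusable filter string.

```python
import time


def ttl_filter(now_seconds=None, window_seconds=60):
    """Return a TTL filter whose boundary changes once per window.

    Hypothetical helper. The boundary is the current epoch time rounded
    down to a multiple of window_seconds, so it is always in the past.
    """
    if now_seconds is None:
        now_seconds = int(time.time())
    boundary = now_seconds - (now_seconds % window_seconds)
    return "-_ttl_expire:[* TO %d]" % boundary


# Two queries issued 59 seconds apart within one window share a filter.
print(ttl_filter(1700000100))
print(ttl_filter(1700000159))
```

A larger window_seconds value makes the filter reusable for longer, at the cost of returning recently expired data for up to that long.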
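
For the recommendation above to replace a long list of terms with a terms filter query, the following minimal sketch assumes the Solr terms query parser syntax, {!terms f=<field>}<comma-separated values>. The helper name terms_filter is hypothetical, and the store_id values are the ones shown in the text; verify the exact syntax against the Solr version that your DSE release bundles.

```python
import json


def terms_filter(field, values):
    """Return a Solr terms filter string for one field and many values.

    Hypothetical helper. Values are joined with the default comma
    separator, so a value that itself contains a comma is rejected.
    """
    values = list(values)
    for value in values:
        if "," in value:
            raise ValueError("value contains the separator: %r" % value)
    return "{!terms f=%s}%s" % (field, ",".join(values))


fq = terms_filter("store_id", ["store1a", "store2b", "store2c"])
solr_query = json.dumps({"q": "*:*", "fq": fq})
print(solr_query)
```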