DSE Search architecture

An overview of DSE Search architecture.

In a distributed environment, data is spread across multiple nodes. Deploy DSE Search nodes in their own datacenter, and run DSE Search on every node in that datacenter.

Data is written to the database first, and the search indexes are updated afterward.

When you update a table using CQL, the search index is updated automatically. Writes are durable: all writes to a replica node are recorded both in memory and in a commit log before they are acknowledged as a success. If a crash or server failure occurs before the memtables are flushed to disk, the commit log is replayed on restart to recover any lost writes.
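
For example, the following Java sketch, using the DataStax Java driver 4.x, performs an ordinary CQL write and then a search-backed read through the solr_query column. The demo.products keyspace and table, and an existing search index on that table, are assumptions made for the example; the driver connects with its default contact point.

  import com.datastax.oss.driver.api.core.CqlSession;
  import com.datastax.oss.driver.api.core.cql.ResultSet;
  import com.datastax.oss.driver.api.core.cql.Row;

  public class WriteThenSearch {
    public static void main(String[] args) {
      // Connects with the driver's default contact point; adjust for a real cluster.
      try (CqlSession session = CqlSession.builder().build()) {
        // An ordinary CQL write: the row is written to the database first,
        // and the search index on the node is updated afterward.
        session.execute(
            "INSERT INTO demo.products (id, name, description) "
                + "VALUES (uuid(), 'widget', 'a small blue widget')");

        // A search-backed read through the solr_query column.
        ResultSet rs = session.execute(
            "SELECT name FROM demo.products WHERE solr_query = 'description:widget'");
        for (Row row : rs) {
          System.out.println(row.getString("name"));
        }
      }
    }
  }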

DSE Search terms

In DSE Search, there are several names for an index of documents on a single node:
  • A search index (formerly referred to as a search core)
  • A collection
  • One shard of a collection

See the following table for a mapping between database and DSE Search concepts.

Table 1. Relationship between the database and DSE Search concepts

  Database       Search (single-node environment)
  -------------  ---------------------------------
  Table          Search index (core) or collection
  Row            Document
  Primary key    Unique key
  Column         Field
  Node           n/a
  Partition      n/a
  Keyspace       n/a
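
As an illustration of this mapping, the following Java sketch uses the DataStax Java driver to create a hypothetical demo.products table and a search index over it; the comments mark how each database concept corresponds to its Solr counterpart. The demo keyspace is assumed to already exist, the table and column names are invented for the example, and the CREATE SEARCH INDEX statement is the CQL index-management syntax used by recent DSE releases.

  import com.datastax.oss.driver.api.core.CqlSession;

  public class CreateSearchIndexExample {
    public static void main(String[] args) {
      try (CqlSession session = CqlSession.builder().build()) {
        // The CQL table: each row becomes one document and each column a field.
        session.execute(
            "CREATE TABLE IF NOT EXISTS demo.products ("
                + "id uuid PRIMARY KEY, "   // primary key -> unique key
                + "name text, "             // column      -> field
                + "description text)");     // column      -> field

        // One search index per table per node. The index name is prefixed with
        // the keyspace (demo.products); the keyspace has no Solr counterpart.
        session.execute(
            "CREATE SEARCH INDEX IF NOT EXISTS ON demo.products "
                + "WITH COLUMNS name, description");
      }
    }
  }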

How DSE Search works

  • Each document in a search index is unique and contains a set of fields that adhere to a user-defined schema.
  • The schema lists the field types and defines how they should be indexed.
  • DSE Search maps search indexes to tables.
  • Each table has a separate search index on a particular node.
  • Solr documents are mapped to rows, and document fields to columns.
  • A shard is the indexed data for the subset of the table's data stored on the local node.
  • The keyspace is a prefix for the name of the search index and has no counterpart in Solr.
  • Search queries are routed to enough nodes to cover all token ranges.
    • The query is sent to nodes that together cover all token ranges, so that all possible results are returned.
    • The search engine considers the token ranges that each node is responsible for, taking into account the replication factor (RF), and computes the minimum number of nodes required to query all ranges (a simplified sketch follows this list).
  • On DSE Search nodes, the shard selection algorithm for distributed queries uses a series of criteria to route sub-queries to the nodes best able to handle them. Shard routing is token aware, but routing is not restricted to particular token ranges unless the search query specifies one.
  • With replication, a node or search index contains more than one partition (shard) of table (collection) data. Unless the replication factor equals the number of cluster nodes, the node or search index contains only a portion of the data of the table or collection.
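
The range-covering computation mentioned above can be pictured as a simple covering problem. The following standalone Java sketch greedily chooses replica nodes until every token range is covered, which is one straightforward way to approximate the minimum node set; the node names, integer range identifiers, and the greedy strategy are assumptions for illustration, not DSE internals.

  import java.util.HashSet;
  import java.util.LinkedHashMap;
  import java.util.Map;
  import java.util.Set;

  public class ShardRoutingSketch {

    /** Greedily selects nodes until every token range is covered; returns the
     *  ranges each chosen node is asked to query. */
    static Map<String, Set<Integer>> coverAllRanges(
        Map<String, Set<Integer>> rangesByNode, Set<Integer> allRanges) {
      Set<Integer> uncovered = new HashSet<>(allRanges);
      Map<String, Set<Integer>> plan = new LinkedHashMap<>();
      while (!uncovered.isEmpty()) {
        // Pick the node that covers the most still-uncovered ranges.
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Set<Integer>> entry : rangesByNode.entrySet()) {
          Set<Integer> covered = new HashSet<>(entry.getValue());
          covered.retainAll(uncovered);
          if (covered.size() > bestCount) {
            best = entry.getKey();
            bestCount = covered.size();
          }
        }
        if (best == null) {
          break; // no replica covers the remaining ranges
        }
        Set<Integer> contribution = new HashSet<>(rangesByNode.get(best));
        contribution.retainAll(uncovered);
        plan.put(best, contribution); // the sub-query sent to this node
        uncovered.removeAll(contribution);
      }
      return plan;
    }

    public static void main(String[] args) {
      // Four token ranges with RF = 2: each range lives on two nodes, so two
      // well-chosen nodes are enough to cover all four ranges.
      Map<String, Set<Integer>> rangesByNode = Map.of(
          "node1", Set.of(1, 2),
          "node2", Set.of(2, 3),
          "node3", Set.of(3, 4),
          "node4", Set.of(4, 1));
      System.out.println(coverAllRanges(rangesByNode, Set.of(1, 2, 3, 4)));
    }
  }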

DSE Search path

This section provides an overview of the DSE Search path and how Solr integrates with Cassandra:

  1. A row mutation is performed in Cassandra.
  2. A thread in the Thread Per Core (TPC) architecture processes the mutation.
  3. The mutation is forwarded to the secondary index API.
  4. A Lucene document is built from the latest full row in the backing table.
  5. The document is placed in the Lucene RAM buffer.
  6. Control is returned to Cassandra.
  7. The Cassandra write operation is completed.
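
Steps 4 and 5 can be pictured with plain Lucene APIs. The sketch below builds a document from made-up row values and hands it to an IndexWriter, whose in-memory buffer plays the role of the RAM buffer described above; the field names and buffer size are arbitrary, and DSE's actual secondary-index integration is more involved than this.

  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.document.TextField;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.store.ByteBuffersDirectory;

  public class WritePathSketch {
    public static void main(String[] args) throws IOException {
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      config.setRAMBufferSizeMB(64); // documents accumulate in memory until a flush
      try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {

        // Step 4: build a document from the latest full row (values are made up).
        Document doc = new Document();
        doc.add(new StringField("id", "row-42", Field.Store.YES)); // unique key
        doc.add(new TextField("description", "a small blue widget", Field.Store.NO));

        // Step 5: hand the document to the writer's RAM buffer. Updating by the
        // unique key marks any older version of the same row as deleted, so only
        // one live document exists per identifier.
        writer.updateDocument(new Term("id", "row-42"), doc);

        // Steps 6-7: control returns and the write completes; nothing has been
        // flushed into a segment yet -- that happens on commit (see the notes below).
      }
    }
  }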

Note:

  • The RAM buffer is flushed when a commit is performed.

    • Commits occur when the RAM buffer is full, or when a soft commit or hard commit is performed.

    • Soft commits are triggered by the auto soft commit timer.

    • Hard commits are triggered by a memtable flush on the base Cassandra table.

  • The Lucene documents are flushed to disk into a Lucene segment.

  • Part of the flush process ensures that for a given document identifier, only one live document exists. Therefore, any duplicate older documents are deleted.

  • Lucene merges segments periodically, similar to the way Cassandra performs compaction.
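
The flush and de-duplication behavior described in these notes can also be demonstrated with plain Lucene, as a loose analogy rather than DSE's actual code path. In the sketch below, indexing the same identifier twice through updateDocument leaves a single live document, and commit() flushes the RAM buffer into a segment; the merge policy is set explicitly only to make the periodic background merging visible in the configuration.

  import java.io.IOException;

  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.document.StringField;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.IndexWriter;
  import org.apache.lucene.index.IndexWriterConfig;
  import org.apache.lucene.index.Term;
  import org.apache.lucene.index.TieredMergePolicy;
  import org.apache.lucene.store.ByteBuffersDirectory;

  public class CommitSketch {
    public static void main(String[] args) throws IOException {
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
      // TieredMergePolicy is Lucene's default; set explicitly here to show that
      // background segment merging is part of the writer's configuration.
      config.setMergePolicy(new TieredMergePolicy());
      try (IndexWriter writer = new IndexWriter(new ByteBuffersDirectory(), config)) {

        // Index the "same row" twice: updateDocument keeps one live document per
        // key, so the older duplicate is deleted during the flush.
        for (int version = 1; version <= 2; version++) {
          Document doc = new Document();
          doc.add(new StringField("id", "row-42", Field.Store.YES));
          doc.add(new StringField("version", Integer.toString(version), Field.Store.YES));
          writer.updateDocument(new Term("id", "row-42"), doc);
        }

        // The commit flushes the RAM buffer into a Lucene segment, analogous to
        // the flush described in the notes above.
        writer.commit();

        try (DirectoryReader reader = DirectoryReader.open(writer)) {
          System.out.println(reader.numDocs()); // prints 1: only one live document
        }
      }
    }
  }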