Deleting by query

How to delete data based on search criteria, and best practices for delete by query.

Delete by query deletes data that matches search criteria. After you issue a delete by query, documents are deleted immediately, and deletions continue until all matching documents are removed. For example, you can delete the data that you inserted using this command:

curl http://localhost:8983/solr/mykeyspace.mysolr/update --data '<delete><query>color:red</query></delete>' -H 'Content-type:text/xml; charset=utf-8'

Setting the &allowPartialDeletes parameter to false (the default) prevents deletes if a node is down. Setting &allowPartialDeletes to true causes the delete to fail if a node is down and the delete does not meet a consistency level of QUORUM. Delete-by-query operations that use *:* are an exception to these rules: they issue a truncate, which requires all nodes to be up in order to succeed.
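As a minimal sketch, the parameter could be appended to the update URL as follows. This assumes that allowPartialDeletes is passed as a URL query parameter, which the & prefix suggests; the helper names update_url and delete_by_query are hypothetical, and the host, port, and core name are taken from the command above.

```shell
# Hypothetical helper: builds the update URL with allowPartialDeletes set.
# Assumption: the parameter is passed as a URL query parameter.
update_url() {
  printf 'http://localhost:8983/solr/mykeyspace.mysolr/update?allowPartialDeletes=%s' "$1"
}

# Hypothetical wrapper: sends a delete by query with the chosen setting.
# $1 is true or false, $2 is the search query.
delete_by_query() {
  curl "$(update_url "$1")" \
    --data "<delete><query>$2</query></delete>" \
    -H 'Content-type:text/xml; charset=utf-8'
}

# Example call (needs a running search node):
# delete_by_query true 'color:red'
```

Keeping the URL construction in its own function makes it easy to check the request before sending it to a live cluster.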

Best practices

DataStax recommends that queries for delete-by-query operations touch only columns that are not updated. For example, the columns that make up a compound primary key are never updated.
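A delete that follows this recommendation might look like the sketch below. The helper name delete_body and the column name id are assumptions for illustration; substitute a column from your own primary key.

```shell
# Hypothetical helper: builds a delete-by-query body for one field and value.
delete_body() {
  printf '<delete><query>%s:%s</query></delete>' "$1" "$2"
}

# "id" is an assumed primary key column. Primary key columns cannot be
# updated, so a replica cannot hold a stale value for them.
# Example call (needs a running search node):
# curl http://localhost:8983/solr/mykeyspace.mysolr/update \
#   --data "$(delete_body id 123)" \
#   -H 'Content-type:text/xml; charset=utf-8'
```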

Delete by query problem example

The following workflow demonstrates that not following this best practice is problematic:

  • When a search coordinator receives a delete-by-query request, the request is forwarded to every node in the search datacenter.
  • At each search node, the query is run locally to identify the candidates for deletion, and then a database delete is issued at the LOCAL_ONE consistency level for each of those candidates.
  • When those database deletes are perceived at the appropriate nodes across the cluster, the records are deleted from the search index.

For example, in a certificates table, each certificate has a date of issue that is a timestamp. When a certificate is renewed, the new issue date is written to the row, and that write is propagated to all replicas. In this example, assume that one replica misses the write. Suppose you run a periodic delete by query that removes all certificates with issue dates older than a specified date. The replica that missed the certificate renewal still holds the old issue date, so the renewed certificate matches the delete query on that replica. The certificate is then deleted across the entire cluster, in all datacenters, making that delete unrecoverable.
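The periodic delete in this scenario could be sketched as follows. The helper name stale_cert_delete, the field name issue_date, and the core name mykeyspace.certificates are assumptions for illustration. The query uses a Solr range with an inclusive upper bound, so it matches issue dates on or before the cutoff. Because issue_date is updated on renewal, this is the kind of query the best practice advises against.

```shell
# Hypothetical helper: builds a delete for certificates issued on or
# before a cutoff. "issue_date" is an assumed field name.
stale_cert_delete() {
  printf '<delete><query>issue_date:[* TO %s]</query></delete>' "$1"
}

# Risky: issue_date changes on renewal, so a replica that missed the
# renewal still matches this query and triggers a cluster-wide delete.
# Example call (needs a running search node; core name is assumed):
# curl http://localhost:8983/solr/mykeyspace.certificates/update \
#   --data "$(stale_cert_delete 2015-01-01T00:00:00Z)" \
#   -H 'Content-type:text/xml; charset=utf-8'
```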