Search index syntax
DataStax Enterprise Search nodes can perform a wide range of functionality, and it’s important to understand the syntax required for such tasks as filtering CQL queries, using the solr_query
JSON formats, and escaping special characters.
Filtering synopsis
Applications connecting to the DSE database can use full-text search to filter on these fields within CQL queries.
Applications no longer need to use Apache Solr™ APIs to filter queries on search index columns. |
In a CQL SELECT statement, you can filter results using the following syntax:
SELECT <selectors> FROM <table> WHERE (<indexed_column_expression> | solr_query = '<search_expression>') [ LIMIT <n> ] [ ORDER BY <column_name> ] ;
Write filters on search index columns as follows:
-
Full text search using column names and CQL operators. See Filtering on search indexed columns.
-
Recommended for most queries: Native CQL query expressions.
-
Still supported: the Solr syntax in the
solr_query
option:-
Basic expressions (Solr q parameter) in plain text, see Filtering queries with the solr_query q format. For improved performance, you should use Solr filter queries (
fq
) whenever possible. See Search index filtering best practices. -
Advanced expressions using Solr JSON syntax, see Using the solr_query JSON format.
-
Filtering restrictions and best practices
The following restrictions apply to filtering a CQL query on a search indexed field:
-
Search index queries are defaulted to an equivalent of
LIMIT 10
. -
Pagination is off by default.
In dse.yaml, the cql_solr_query_paging option specifies when to use pagination (also called cursors).
-
Apache Solr and Apache Lucene limitations for field names apply to pagination.
-
Queries with smaller result sets will see increased performance with paging off.
-
Limitations and known Apache Solr issues apply to DSE Search queries. For example: incorrect SORT results for tokenized text fields.
-
Column aliases are not supported in search index queries.
-
All of the fields that are queried on DSE SearchAnalytics clusters must be defined in the search index schema definition. Fields that are not defined in the search index schema columns are excluded from the results returned from Spark queries.
-
Solr allows the
LOCAL_ONE
andONE
consistency levels for read operations.
Filtering on search indexed columns
On a DataStax Enterprise search node, when a query is filtered on an indexed column the corresponding Solr document is used.
Filtering on indexed columns in queries that do not have the partition key fully defined do not require ALLOW FILTERING
.
The search index now supports CQL operators, such as LIKE
, IS NOT NULL
, range, and =/!=
to filter.
See Native CQL search queries.
Combine filters to form more complex queries:
SELECT * FROM test.test WHERE
title IS NOT NULL AND
collection CONTAINS 'Anaisa Pye'
AND age > 35;
Sort using ORDER BY.
Filtering queries with the solr_query q format
Filter a CQL query using solr_query
option using the syntax of the Solr q parameter in plain text.
See Filtering on terms for examples.
solr_query = '<q_expression>'
Examples
Filtering on a single column
To find rows in the pets table where the name
column contains cat
, dog
but not fish
:
SELECT * FROM pets WHERE solr_query='name: cat name: dog -name:fish';
To use punctuation such as a single quote in the Solr q expression, see Escaping characters in a solr_query. |
Limiting results
When you name specific columns, DSE Search retrieves only the specified columns and returns the columns as part of the resulting rows.
DSE Search supports projections (SELECT a, b, c…) only, not functions, for the select expression.
The following example retrieves only the name
column:
SELECT name FROM keyspace.table WHERE solr_query='name:cat name:dog -name:fish'
Use the LIMIT clause to specify how many rows to return. The following example retrieves only 1 row:
SELECT * FROM keyspace.table WHERE solr_query='name:cat name:dog -name:fish' LIMIT 1
Using the count function
Use the count() function in CQL Solr queries to return the number of rows that satisfy the Solr query:
SELECT count(*) FROM table WHERE solr_query = '...';
Using count() in combination with LIMIT or facets results in an error.
All response queries of the drivers have a custom payload where the total number of documents found is returned.
This number is keyed as DSESearch.numFound
.
Using the solr_query JSON format
DSE Search supports advanced Solr search features to apply filters to CQL queries using JSON-formatted expressions. Also see Overriding the default TimeZone (UTC) in search queries.
JSON query syntax
The JSON query expression syntax is a JSON string. The JSON-based query expression supports local parameters in addition to the following parameters:
`{
"q": <query_expression (string)>,
"fq": <filter_query_expression(s) (string_or_array_of_strings, ...)>,
"facet": <facet_query_expression (object)>
"sort": <sort_expression (string)>,
"start": <start_index(number)>,
timeAllowed: <search_time_limit_ms>,
"TZ": <zoneID>), // Any valid zone ID in java TimeZone class
"paging": <"driver" (string)>,
"distrib.singlePass": true|false (boolean),
"shards.failover": true|false (boolean), // Default: true
"shards.tolerant": true|false (boolean), // Default: false
"commit": true|false (boolean),
"route.partition": <partition_routing_expression (array_of_strings)>,
"route.range": <range_routing_expression (array_of_strings)>,
"query.name": <query_name (string)>,
}
For more information: "q", "fq", "facet", "sort", "start", timeAllowed, "TZ" (Any valid zone ID in java TimeZone class), and "paging".
For example:
SELECT id FROM nhanes_ks.nhanes WHERE solr_query='{"q":"ethnicity:Asian"}';
SELECT id FROM nhanes_ks.nhanes WHERE solr_query='{"q":"ethnicity:Mexi*", "sort":"id asc"}' LIMIT 3;
SELECT * FROM mykeyspace.mytable WHERE solr_query='{"q":"{!edismax}quotes:yearning or kills"}';
To use Apache Solr™ Extended DisMax Query Parser (eDisMax) with |
Making distributed queries tolerant of shard failures
Because distributed queries contact many shards, making queries more tolerant of shard failures ensures more successful completions. Use shards.failover and shards.tolerant parameters to define query failover and tolerance of shard failures during JSON queries:
Valid configurations | Description |
---|---|
"shards.failover": true, "shards.tolerant": false, |
This default configuration enables query failover and disables fault tolerance. Attempt to retry the failed shard requests when errors indicate that there is a reasonable chance of recovery. |
"shards.failover": false, "shards.tolerant": true, |
Disable query failover. Enable fault tolerance. Make the query succeed, even if the query only partially succeeded, and did not succeed for all nodes. |
"shards.failover": false, "shards.tolerant": false, |
Disable query failover. Disable fault tolerance. |
Failover and tolerance of partial results cannot coexist in the same JSON query. Queries support enabling tolerance for only one parameter.
The |
Other fault tolerance configuration options include netty_client_request_timeout
in dse.yaml
and read_request_timeout_in_ms
in cassandra.yaml
.
JSON queries with literal characters that are Apache Solr™/Apache Lucene® special characters
Solr and Lucene support escaping special characters that are part of the query syntax.
Character | Description |
---|---|
+ |
Plus. Required search term operator. |
- |
Minus. Prohibited search term operator. |
&& |
Double ampersand. AND operator. Both terms either side of the operator are required for a match. |
|| |
Double pipe. OR operator. |
! |
Exclamation mark. NOT operator. |
( |
Left parenthesis |
) |
Right parenthesis |
" |
Double quote |
~ |
Tilde |
* |
Asterisk |
? |
Question mark |
: |
Colon |
Using JSON with solr_query
requires additional syntax for literal characters that are Lucene special characters.
Syntax for a simple search string:
Simple search string |
|
Solr query |
|
CQL Solr query |
|
Syntax for a complex search string:
Complex search string |
|
Solr query |
|
CQL Solr query |
|
Field, query, and range faceting with a JSON query
Specify the facet parameters inside a facet JSON object to perform field, query, and range faceting inside Solr queries. Distributed pivot faceting is supported.The query syntax is less verbose to specify facets by:
-
Specifying each facet parameter without the facet prefix that is required by HTTP APIs.
-
Expressing multiple facet fields and queries inside a JSON array.
- Faceted search example
-
SELECT * FROM solr WHERE solr_query='{"q":"id:*","facet":{"field":"type"}}';
- Query facet example
-
SELECT * FROM solr WHERE solr_query='{"q":"id:*","facet":{"query":"type:0"}}';
- Multiple queries example
-
SELECT * FROM solr WHERE solr_query='{"q":"id:*","facet":{"query":["type:0","type:1"]}}';
- Distributed pivot faceting example
-
SELECT id FROM <table> WHERE solr_query='{"q":"id:*","facet":{"pivot":"type,value","limit":"10"}}'
- Range facet example
-
SELECT * FROM solr WHERE solr_query='{"q":"id:*","facet":{"range":"type", "f.type.range.start":-10, "f.type.range.end":10, "range.gap":1}}}';
The returned result is formatted as a single row with each column corresponding to the output of a facet (either field, query, or range). The value is represented as a JSON blob because facet results can be complex and nested. For example:
facet_fields | facet_queries ------------------------+------------------------- {"type":{"0":2,"1":1}} | {"type:0":2,"type:1":1}
- Range by date facet example
-
SELECT * FROM solr WHERE solr_query='{"q":"business_date:*","facet":{"range":"business_date", "f.business_date.range.start":"2015-01-01T00:00:00Z", "f.business_date.range.end":"2015-08-01T00:00:00Z", "f.business_date.range.gap":"+1MONTH"}}';
- Interval facet example
-
SELECT * FROM solr WHERE solr_query='{"q":"id:*","facet":{"interval":"id", "interval.set":"[*,500]"}';
Tracing distributed queries
During a distributed query, every node is responsible for a set of token ranges. A shard is the node/ranges combination. The shard token range is reported:
-
In the
shards.info
response for HTTP queries. -
In the
system_traces.events
table for HTTP queries that provide cassandra.trace=true and CQL Solr queries that enable tracing at the driver level.
Solr single-pass CQL queries
Single-pass distributed queries are supported in CQL Solr queries.
There are a number of ways to use a single-pass distributed query instead of the standard two-pass query. You can:
-
Specify the
distrib.singlePass
Boolean parameter in the query expression. Example:SELECT * FROM keyspace.table WHERE solr_query = '{"q" : "*:*", "distrib.singlePass" : true}'
-
Use a
token()
or partition key restriction in theWHERE
clause. Example:SELECT * FROM keyspace.table WHERE token(id) >= 3074457345618258604 AND solr_query = '{"q" : "*:*"}'
-
Execute a
COUNT
query. Example:SELECT count(*) FROM keyspace.table WHERE solr_query = '{"q" : "*:*"}'
-
Specify the primary key elements in the
SELECT
clause. Example:CREATE TABLE cycling.cyclist_name ( id UUID, lastname text, firstname text ); PRIMARY KEY ((id), lastname);
CREATE SEARCH INDEX ON cycling.cyclist_name ... ;
dsetool create_core cycling.cyclist_name ...
INSERT into cycling.cyclist_name ... ;
SELECT id, lastname FROM cycling.cyclist_name WHERE solr_query='{"q": "*:*","fq":"(id:a6f94417-e27b-444f-9d0c-dccb588e421f)"}';
Using a single-pass distributed query has an operational cost that includes potentially more disk and network overhead. With single-pass queries, each node reads all rows that satisfy the query and returns them to the coordinator node. An advanced feature, a single-pass distributed query saves one network round trip transfer during the retrieval of queried rows. A regular distributed query performs two network round trips:
-
The first one to retrieve IDs from DSE Search that satisfy the query.
-
The second one to retrieve only the rows that satisfy the query from the database, based on IDs from the first step.
Single-pass distributed queries are most efficient when most of the documents found are returned in the search results. Single-pass distributed queries are inefficient when most of the documents found are not returned to the coordinator node.
For example, a distributed query that only fans out to a single node from the coordinator node will likely be most efficient as a single-pass query.
[IMPORTANT
With single-pass queries, there is a limitation that returns only the table columns that are defined in the Solr schema.xml in the query results. This limitation also applies to map entries that do not conform to the field mapping.
Consider the cycling.cyclist_name
table example above. If the column firstname
was not indexed, but needed to be retrievable by queries, the entry in the schema.xml for the firstname
field could be:
<field indexed="false" multiValued="false" docValues="false" name="firstname" type="StrField"></field>
If the Solr core was created without a customized schema.xml file and generateResources=true
was included, the default schema.xml file includes all fields that exist in the table. For more information, see dsetool create_core.
JSON query name option
Using the following syntax to name your queries to support metrics and monitoring for performance objects. Naming queries can be useful for tagging and JMX operations, for example.
SELECT id FROM nhanes_ks.nhanes WHERE solr_query=' {"query.name":"Asian subjects", "q":"ethnicity:Asia*"}' LIMIT 50;
JSON query commit option
When executing custom queries after bulk document loading, with auto soft commit disabled (or is an extremely infrequent configured value), use the JSON query commit option to ensure that all pending updates are soft-committed before the running a query. The commit makes the latest data visible to a query.
Do not use the JSON commit option for live operations against a production cluster.
The |
By default, the commit option is set to false.
The example below sets the commit
option to true
:
SELECT * FROM wiki.solr
WHERE solr_query='{"q":"title:Asia*", "commit":true}'
LIMIT 50;
Queries that dynamically enable paging using a driver
To dynamically enable paging when cql_solr_query_paging is set to off
in dse.yaml, set the Solr paging parameter to driver ("paging":"driver"
).
For example:
SELECT id FROM wiki.solr
WHERE solr_query= '{"q":"*", "sort":"id asc", "paging":"driver"}' ;
Limiting queries by time
DSE Search supports limiting queries by time by using the Solr timeAllowed parameter.
Restriction: DSE Search differs from native Solr:
-
If
timeAllowed
is exceeded, an exception is thrown. -
If
timeAllowed
is exceeded, and the additionalshards.tolerant
parameter is set totrue
, the application returns the partial results collected so far.
When partial results are returned, the CQL custom payload contains the DSESearch.isPartialResults
key.
Example with a 30 second timeout:
SELECT * FROM users
WHERE solr_query = '{ "q": "*:*", "timeAllowed":30000}';
The Solr timeAllowed
parameter is enabled by default to prevent long running shard queries, such as complex facets and Boolean queries, from using system resources after they have timed out from the DSE Search coordinator.
DSE Search checks the timeout per segment instead of during document or terms iteration.
The system property |
By default, the value for timeAllowed
is the same as the internode_messaging_options.client_request_timeout_seconds
parameter in dse.yaml
.
Queries that breach client_request_timeout_seconds
fail by default.
The 50th percentile latency for queries using timeAllowed
should be within 5% of the same query that does not use timeAllowed
.
The timeAllowed
parameter applies to these queries:
-
Standard queries and filtering queries, where the timeout is applied on a match collection.
-
Facet queries, including pivot facets.
The json.facet
queries are not supported.
Queries that are terminated due to a timeout are logged with a warning.
Using the Solr |
Escaping characters in a solr_query
Solr queries require escaping special characters that are part of the query syntax.
Character | Description |
---|---|
+ |
Plus. Required search term operator. |
- |
Minus. Prohibited search term operator. |
&& |
Double ampersand. AND operator. Both terms either side of the operator are required for a match. |
|| |
Double pipe. OR operator. |
! |
Exclamation mark. NOT operator. |
( |
Left parenthesis |
) |
Right parenthesis |
" |
Double quote |
~ |
Tilde |
* |
Asterisk |
? |
Question mark |
: |
Colon |
To escape these characters, use a slash (\
) before the character to escape.
For example, to search for a literal double quotation mark ("
) character, escape the "
for Solr with \"
.
When using solr_query
, you can escape special characters using either CQL Solr or JSON forms:
- CQL Solr
-
...WHERE solr_query='<field>:<value>'
- JSON
-
WHERE solr_query='{ "q": "<field>:<value>"}'
JSON-encoded queries require that values must also be JSON-escaped for special characters.
For queries that contain double quotation marks, use triple slashes \\\
:
-
For query syntax: One slash
\
to escape the"
-
For the JSON string syntax: Two slashes
\\
to escape the\
Triple slashes
\\\
escape both characters in\"
to produce\\
(an escaped escape) and\"
(an escaped double quote).
Escaping single quotation marks
-
Double the single quotation mark (
**'**
)- CQL
-
...WHERE solr_query='name:Walter**''**s'
- JSON
-
...WHERE solr_query='{ "q": "Walter'**'**s"}'
-
Use dollar-quotes for the string constant
- CQL
-
...WHERE solr_query=****name:Walter's****
- JSON
-
...WHERE solr_query=****{ "q": "Walter's"}****
Escaping double quotation marks
- CQL
-
Double the single quotation mark (
**'**
) and add the backslash (**\\**
) for Solr escaping...WHERE solr_query='name:Walter**\\''**s'
- JSON
-
Escape
**\\"**
to**\\\\\\"**
to escape both special characters for JSON...WHERE solr_query='{ "q": "Walter**\\\\\\"**s"}'
Exact and fuzzy query examples
- Exact phrase query
-
For a row that looks like this, with an email address that includes a double quotation mark
greenr"q@example.com
:INSERT INTO users(id, email) VALUES(1, 'greenr"q@example.com')"
Perform a phrase query to search for the email address that is enclosed in double quotation marks:
SELECT * FROM users where solr_query = ' { "q": "*:*", "fq": "email:\"greenr\\\"q@example.com\""} ';
- Fuzzy query
-
For a row that looks like this, with the same email address that includes a double quotation mark
greenr"q@example.com
:select * from test.users where solr_query='{"q":"email:r\\\"q@example"}' ; id | email | solr_query ------+-------+------------------------------ 1 | greenr"q@example.com | null (1 rows)
For a term query (fuzzy search) for all email addresses that include
r"q@example
, remove the double quotation marks but retain triple quotation marks for the escaped double quotation character that is part of the email address:SELECT * FROM users where solr_query = ' { "q": "*:*", "fq": "email:r\\\"q@example"} ';
Using JSON with solr_query
requires additional syntax for literal characters that are Lucene special characters.