Find data with CQL analyzers
Use analyzers in CQL statements to filter results on tokenized indexes.
This is an advanced option that imparts additional meaning to the keyword filter, and it enables full-text search in databases based on Apache Cassandra®, such as Astra DB Serverless.
CQL analyzers are also referred to as the : operator.
How analyzers work
Analyzers process the text in a column to enable term matching on strings (keyword search). Combined with vector search, term matching can improve the relevance of search results returned from a table. Analyzers semantically filter the results of a vector search by specific terms.
Analyzers are built on the Lucene Java Analyzer API. Storage Attached Indexes (SAI) use this API to transform text columns into tokens for indexing and querying, with either built-in or custom analyzers.
For example, if you generate an embedding from the phrase "tell me about available shoes", you can then use a vector search to get a list of rows with similar vectors. These rows will likely correlate with shoe-related strings.
SELECT * from products
ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
LIMIT 10;
Alternatively, you can filter these search results by a specific tokenized keyword, such as hiking:
SELECT * from products
WHERE val : 'hiking'
ORDER BY vector ANN OF [6.0,2.0, … 3.1,4.0]
LIMIT 10;
The : operator triggers the WHERE clause to use the tokenized SAI, rather than a regular keyword filter.
Configure and use analyzers
To enable analyzer operations with CQL on a Serverless (Vector) database, you must create an SAI with the index_analyzer option, and then use the : operator to search the indexed column that has been analyzed.
An analyzed index stores values derived from the raw column values. The stored values depend on the index_analyzer configuration: the analyzer determines how the column values are analyzed before indexing occurs, and the same analyzer is applied to query terms.
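For example, one common configuration is to reference a built-in analyzer by name. The following sketch assumes the default_keyspace.products table used later in this topic; the index name is illustrative:
-- Hypothetical index name; the built-in analyzer is referenced by name instead of a JSON configuration.
CREATE CUSTOM INDEX products_val_standard_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = { 'index_analyzer': 'standard' };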
Standard tokenizer example
- Create a table:
CREATE TABLE default_keyspace.products (
  id text PRIMARY KEY,
  val text
);
- Create an SAI with the index_analyzer option and stemming enabled:
CREATE CUSTOM INDEX default_keyspace_products_val_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "standard"},
    "filters" : [{"name" : "porterstem"}]
  }'
};
- Insert some rows:
INSERT INTO default_keyspace.products (id, val) VALUES ('1', 'soccer cleats');
INSERT INTO default_keyspace.products (id, val) VALUES ('2', 'running shoes');
INSERT INTO default_keyspace.products (id, val) VALUES ('3', 'hiking shoes');
- Use the : operator with keywords to query data from the inserted rows. The analyzer splits the text into case-independent terms. A stemming variation is sketched after these queries.
Query on "running":
SELECT * FROM default_keyspace.products WHERE val : 'running';
Query on "hiking" and "shoes":
SELECT * FROM default_keyspace.products WHERE val : 'hiking' AND val : 'shoes';
For more information, see Simple standard analyzer example.
N-Gram tokenizer example
An ngram tokenizer processes text by splitting it into contiguous sequences of n characters (grams) to capture linguistic patterns and context. This technique is commonly used in natural language processing (NLP) tasks.
To configure an ngram tokenizer that also lowercases all tokens, ensure the "tokenizer" key specifies the Lucene tokenizer. The remaining key-value pairs in the JSON object configure the tokenizer:
WITH OPTIONS = {
'index_analyzer': '{
"tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}},
"filters" : [{"name" : "lowercase"}]
}'
}
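For context, this options fragment fits into a full index definition. The following sketch reuses the products table and index name from the standard tokenizer example; only the analyzer options differ:
-- Character n-grams of length 2 to 3, lowercased before indexing.
CREATE CUSTOM INDEX default_keyspace_products_val_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}},
    "filters" : [{"name" : "lowercase"}]
  }'
};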
For more information, see N-GRAM with lowercase.
Supported built-in analyzers
There are several built-in analyzers from the Lucene project (version 9.8.0), grouped into the following categories:

Analyzers: standard, simple, whitespace, stop, lowercase, keyword

Language analyzers: Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Thai, Turkish

Tokenizers: standard, classic, keyword, letter, nGram, edgeNGram, pathHierarchy, pattern, simplePattern, simplePatternSplit, thai, uax29UrlEmail, whitespace, wikipedia

Char filters: cjk, htmlstrip, mapping, persian, patternreplace

Token filters: apostrophe, wordDelimiterGraph, portugueseLightStem, latvianStem, dropIfFlagged, keepWord, indicNormalization, bengaliStem, turkishLowercase, galicianStem, bengaliNormalization, portugueseMinimalStem, galicianMinimalStem, swedishMinimalStem, stop, limitTokenCount, italianLightStem, wordDelimiter, teluguStem, hungarianLightStem, protectedTerm, lowercase, capitalization, hyphenatedWords, type, keywordMarker, frenchMinimalStem, kStem, swedishLightStem, soraniNormalization, commonGramsQuery, numericPayload, persianStem, limitTokenOffset, hunspellStem, soraniStem, czechStem, norwegianMinimalStem, englishMinimalStem, norwegianLightStem, germanMinimalStem, snowballPorter, removeDuplicates, minHash, keywordRepeat, germanNormalization, dictionaryCompoundWord, synonymGraph, englishPossessive, spanishMinimalStem, fixedShingle, patternTyping, classic, frenchLightStem, trim, indonesianStem, spanishPluralStem, hindiStem, scandinavianFolding, delimitedBoost, commonGrams, reverseString, cjkWidth, fingerprint, finnishLightStem, greekStem, porterStem, limitTokenPosition, persianNormalization, typeAsSynonym, patternReplace, tokenOffsetPayload, codepointCount, bulgarianStem, synonym, germanStem, asciiFolding, decimalDigit, Word2VecSynonym, scandinavianNormalization, russianLightStem, serbianNormalization, elision, portugueseStem, arabicNormalization, length, greekLowercase, concatenateGraph, flattenGraph, fixBrokenOffsets, truncate, cjkBigram, brazilianStem, uppercase, nGram, dateRecognizer, teluguNormalization, shingle, norwegianNormalization, hindiNormalization, delimitedPayload, spanishLightStem, stemmerOverride, patternCaptureGroup, hyphenationCompoundWord, germanLightStem, edgeNGram, typeAsPayload, irishLowercase, delimitedTermFrequency, arabicStem
For more information and examples, see Built-in analyzers.
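The built-in tokenizers and filters can be combined in the index_analyzer JSON configuration. As an illustrative sketch (the index name is hypothetical, and the filter names are taken from the lists above), the following index splits on whitespace, lowercases tokens, and folds accented characters to their ASCII equivalents:
-- Hypothetical index combining the whitespace tokenizer with lowercase and asciiFolding filters.
CREATE CUSTOM INDEX products_val_folded_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "whitespace"},
    "filters" : [{"name" : "lowercase"}, {"name" : "asciiFolding"}]
  }'
};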
Non-tokenizing filters
Non-tokenizing filters include normalize, case_sensitive, and ascii. You can’t combine non-tokenizing filters with index_analyzer, but you can chain them in a pipeline with other non-tokenizing filters:
OPTIONS = {'normalize': true, 'case_sensitive': false}
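For example, a complete index definition with these options might look like the following sketch, which reuses the products table from the earlier examples with a hypothetical index name:
-- Non-tokenizing index: the whole value is matched, normalized and without regard to letter case.
CREATE CUSTOM INDEX products_val_ci_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = { 'normalize': true, 'case_sensitive': false };
With case_sensitive set to false, a query such as WHERE val : 'Hiking Shoes' is expected to match the stored value 'hiking shoes', because the entire value is compared case-insensitively rather than being split into terms.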