Find data with CQL analyzers
Use analyzers in CQL statements to filter results on tokenized indexes.
This is an advanced option that imparts additional meaning to the keyword filter, and it enables full-text search in databases based on Apache Cassandra®, such as Astra DB Serverless.
CQL analyzers are also referred to as the : operator.
How analyzers work
Analyzers process the text in a column to enable term matching on strings (keyword search). Combined with vector search, term matching can improve the relevance of search results returned from a table. Analyzers semantically filter the results of a vector search by specific terms.
Analyzers are built on the Lucene Java Analyzer API. Storage Attached Indexes (SAI) use this API to transform text columns into tokens for indexing and querying, with either built-in or custom analyzers.
For example, if you generate an embedding from the phrase "tell me about available shoes", you can then use a vector search to get a list of rows with similar vectors. These rows will likely correlate with shoe-related strings.
SELECT * from products
ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
LIMIT 10;
Alternatively, you can filter these search results by a specific tokenized keyword, such as hiking:
SELECT * from products
WHERE val : 'hiking'
ORDER BY vector ANN OF [6.0,2.0, … 3.1,4.0]
LIMIT 10;
The : operator triggers the WHERE clause to use the tokenized SAI, rather than a regular keyword filter.
Configure and use analyzers
To enable analyzer operations with CQL on a Serverless (Vector) database, you must create an SAI with the index_analyzer option, and then use the : operator to search the indexed column that has been analyzed.
An analyzed index stores values derived from the raw column values. The stored values depend on the index_analyzer configuration: the analyzer determines how the column values are analyzed before indexing occurs, and the same analyzer is applied to query terms.
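For example, one common configuration is to reference a built-in analyzer by name. The following sketch assumes the default_keyspace.products table used later in this topic; the index name is illustrative:
-- Hypothetical index name; the built-in analyzer is referenced by name instead of a JSON configuration.
CREATE CUSTOM INDEX products_val_standard_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = { 'index_analyzer': 'standard' };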
Standard tokenizer example
- Create a table:
CREATE TABLE default_keyspace.products (
  id text PRIMARY KEY,
  val text
);
- Create an SAI with the index_analyzer option and stemming enabled:
CREATE CUSTOM INDEX default_keyspace_products_val_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "standard"},
    "filters" : [{"name" : "porterstem"}]
  }'
};
- Insert some rows:
INSERT INTO default_keyspace.products (id, val) VALUES ('1', 'soccer cleats');
INSERT INTO default_keyspace.products (id, val) VALUES ('2', 'running shoes');
INSERT INTO default_keyspace.products (id, val) VALUES ('3', 'hiking shoes');
- Use the : operator with keywords to query data from the inserted rows. The analyzer splits the text into case-independent terms. A stemming variation is sketched after these queries.
Query on "running":
SELECT * FROM default_keyspace.products WHERE val : 'running';
Query on "hiking" and "shoes":
SELECT * FROM default_keyspace.products WHERE val : 'hiking' AND val : 'shoes';
For more information, see Simple standard analyzer example.
N-Gram tokenizer example
An ngram tokenizer processes text by splitting it into contiguous sequences of n characters (grams) to capture linguistic patterns and context. This technique is commonly used in natural language processing (NLP) tasks.
To configure an ngram tokenizer that also lowercases all tokens, ensure the "tokenizer" key specifies the Lucene tokenizer. The remaining key-value pairs in the JSON object configure the tokenizer:
WITH OPTIONS = {
'index_analyzer': '{
"tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}},
"filters" : [{"name" : "lowercase"}]
}'
}
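For context, this options fragment fits into a full index definition. The following sketch reuses the products table and index name from the standard tokenizer example; only the analyzer options differ:
-- Character n-grams of length 2 to 3, lowercased before indexing.
CREATE CUSTOM INDEX default_keyspace_products_val_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}},
    "filters" : [{"name" : "lowercase"}]
  }'
};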
For more information, see N-GRAM with lowercase.
Supported built-in analyzers
There are several built-in analyzers from the Lucene project (version 9.8.0), grouped into the following categories:

Analyzers: standard, simple, whitespace, stop, lowercase, keyword

Language analyzers: Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Thai, Turkish

Tokenizers: standard, classic, keyword, letter, nGram, edgeNGram, pathHierarchy, pattern, simplePattern, simplePatternSplit, thai, uax29UrlEmail, whitespace, wikipedia

Char filters: cjk, htmlstrip, mapping, persian, patternreplace

Token filters: apostrophe, wordDelimiterGraph, portugueseLightStem, latvianStem, dropIfFlagged, keepWord, indicNormalization, bengaliStem, turkishLowercase, galicianStem, bengaliNormalization, portugueseMinimalStem, galicianMinimalStem, swedishMinimalStem, stop, limitTokenCount, italianLightStem, wordDelimiter, teluguStem, hungarianLightStem, protectedTerm, lowercase, capitalization, hyphenatedWords, type, keywordMarker, frenchMinimalStem, kStem, swedishLightStem, soraniNormalization, commonGramsQuery, numericPayload, persianStem, limitTokenOffset, hunspellStem, soraniStem, czechStem, norwegianMinimalStem, englishMinimalStem, norwegianLightStem, germanMinimalStem, snowballPorter, removeDuplicates, minHash, keywordRepeat, germanNormalization, dictionaryCompoundWord, synonymGraph, englishPossessive, spanishMinimalStem, fixedShingle, patternTyping, classic, frenchLightStem, trim, indonesianStem, spanishPluralStem, hindiStem, scandinavianFolding, delimitedBoost, commonGrams, reverseString, cjkWidth, fingerprint, finnishLightStem, greekStem, porterStem, limitTokenPosition, persianNormalization, typeAsSynonym, patternReplace, tokenOffsetPayload, codepointCount, bulgarianStem, synonym, germanStem, asciiFolding, decimalDigit, Word2VecSynonym, scandinavianNormalization, russianLightStem, serbianNormalization, elision, portugueseStem, arabicNormalization, length, greekLowercase, concatenateGraph, flattenGraph, fixBrokenOffsets, truncate, cjkBigram, brazilianStem, uppercase, nGram, dateRecognizer, teluguNormalization, shingle, norwegianNormalization, hindiNormalization, delimitedPayload, spanishLightStem, stemmerOverride, patternCaptureGroup, hyphenationCompoundWord, germanLightStem, edgeNGram, typeAsPayload, irishLowercase, delimitedTermFrequency, arabicStem
For more information and examples, see Built-in analyzers.
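The built-in tokenizers and filters can be combined in the index_analyzer JSON configuration. As an illustrative sketch (the index name is hypothetical, and the filter names are taken from the lists above), the following index splits on whitespace, lowercases tokens, and folds accented characters to their ASCII equivalents:
-- Hypothetical index combining the whitespace tokenizer with lowercase and asciiFolding filters.
CREATE CUSTOM INDEX products_val_folded_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "whitespace"},
    "filters" : [{"name" : "lowercase"}, {"name" : "asciiFolding"}]
  }'
};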
Non-tokenizing filters
Non-tokenizing filters include normalize, case_sensitive, and ascii. You can’t combine non-tokenizing filters with index_analyzer, but you can chain them in a pipeline with other non-tokenizing filters:
OPTIONS = {'normalize': true, 'case_sensitive': false}
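For example, a complete index definition with these options might look like the following sketch, which reuses the products table from the earlier examples with a hypothetical index name:
-- Non-tokenizing index: the whole value is matched, normalized and without regard to letter case.
CREATE CUSTOM INDEX products_val_ci_idx ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = { 'normalize': true, 'case_sensitive': false };
With case_sensitive set to false, a query such as WHERE val : 'Hiking Shoes' is expected to match the stored value 'hiking shoes', because the entire value is compared case-insensitively rather than being split into terms.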