Configure and use SAI text analyzers with CQL

Your Hyper-Converged Database (HCD) database is ready to query vector data with SAI text analyzers using CQL. You can connect to your database with the CQL shell (CQLSH) or the CQL console in Mission Control.

Analyzers process the text in a column to enable term matching on strings. Combined with vector-based search algorithms, term matching makes it easier to find relevant information in a table. Analyzers semantically filter the results of a vector search by specific terms.

For example, if you generate a vector embedding from the phrase "tell me about available shoes", you can use a vector search to get a list of rows with similar vectors. These rows will likely correlate with shoe-related strings.

SELECT * FROM products
  ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
  LIMIT 10;

Alternatively, you can filter these search results by a specific keyword, such as hiking:

SELECT * FROM products
  WHERE val : 'hiking'
  ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
  LIMIT 10;
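
The two queries above assume a products table that has both a text column with an analyzed SAI index and a vector column with its own SAI index. As a minimal sketch, such a schema might look like the following. The table name products_with_vectors, the index names, the 5-dimension vector size, and the vector values in the query are assumptions chosen for illustration; they are not part of the original example.

```cql
-- Hypothetical schema: table name, index names, and vector dimension are assumptions.
CREATE TABLE IF NOT EXISTS default_keyspace.products_with_vectors (
  id text PRIMARY KEY,
  val text,
  vector vector<float, 5>
);

-- SAI index for ANN search on the vector column.
CREATE CUSTOM INDEX IF NOT EXISTS products_with_vectors_vector_idx
  ON default_keyspace.products_with_vectors(vector)
  USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';

-- Analyzed SAI index on the text column, which enables the : operator.
CREATE CUSTOM INDEX IF NOT EXISTS products_with_vectors_val_idx
  ON default_keyspace.products_with_vectors(val)
  USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
  WITH OPTIONS = {'index_analyzer': 'STANDARD'};

-- Vector search restricted to rows whose val contains the term 'hiking'.
-- The five vector values are illustrative placeholders.
SELECT id, val FROM default_keyspace.products_with_vectors
  WHERE val : 'hiking'
  ORDER BY vector ANN OF [6.0, 2.0, 1.5, 3.1, 4.0]
  LIMIT 10;
```

Keeping the two indexes separate means the analyzer settings on val can change without rebuilding the vector index.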

An analyzed index stores values derived from the raw column values. The stored values depend on the analyzer configuration options, which include tokenization, filtering, and character filtering.

Analyzer Operator for CQL

To enable analyzer operations using Cassandra Query Language (CQL) on an HCD database, use the : operator. This operator searches columns that have an analyzed Storage Attached Index (SAI).

Example

Create a table:

CREATE TABLE default_keyspace.products
(
  id text PRIMARY KEY,
  val text
);

Create an SAI index with the index_analyzer option and stemming enabled:

CREATE CUSTOM INDEX default_keyspace_products_val_idx
  ON default_keyspace.products(val)
  USING 'org.apache.cassandra.index.sai.StorageAttachedIndex' WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "standard"},
    "filters" : [{"name" : "porterstem"}]
}'};

Insert sample rows:

INSERT INTO default_keyspace.products (id, val)
  VALUES ('1', 'soccer cleats');

INSERT INTO default_keyspace.products (id, val)
  VALUES ('2', 'running shoes');

INSERT INTO default_keyspace.products (id, val)
  VALUES ('3', 'hiking shoes');

Query the table to retrieve your data:
1. To get data from the row with id = '2':
  SELECT * FROM default_keyspace.products WHERE val : 'running';
2. To get the row with id = '3' using two values:
  SELECT * FROM default_keyspace.products WHERE val : 'hiking' AND val : 'shoes';
  The analyzer splits the text into case-independent terms.
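
Because the index in this example enables the porterstem filter, and the analyzer is also applied to the query term, a query term only needs to reduce to the same stem as an indexed term. The following sketch assumes the table, index, and three sample rows from this example. The stems noted in the comments follow the Porter stemming algorithm and are expectations, not captured output.

```cql
-- 'run' and 'running' both reduce to the stem 'run',
-- so this query is expected to match the row with id = '2'.
SELECT * FROM default_keyspace.products WHERE val : 'run';

-- 'shoe' and 'shoes' both reduce to the stem 'shoe',
-- so this query is expected to match the rows with id = '2' and id = '3'.
SELECT * FROM default_keyspace.products WHERE val : 'shoe';
```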

Restrictions

Only SAI indexes support the : operator.
The analyzed column cannot be part of the primary key, including the partition key and clustering columns.
The : operator can be used only with SELECT statements.
The : operator cannot be used with lightweight transactions, such as in a condition for an IF clause.

Configuration options

To query with an analyzer, you must configure the index_analyzer option for the SAI index. This analyzer determines how a column value is analyzed before it is indexed. The same analyzer is also applied to the search term in a query.

The index_analyzer option takes either a single string, which names a built-in analyzer, or a JSON object as its value. The JSON object configures a single tokenizer, with optional filters and charFilters.

The following built-in non-tokenizing filters are also available:

normalize

Normalize input using Normalization Form C (NFC)

case_sensitive

When set to false, transform all input to lowercase

ascii

Same as the ASCIIFoldingFilter from Lucene; converts non-ASCII Unicode characters to their ASCII equivalents, where one exists

Analyzer example

To configure a built-in analyzer, specify the analyzer name in the index OPTIONS:

OPTIONS = {'index_analyzer':'STANDARD'}
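
In context, the OPTIONS fragment above belongs to a CREATE CUSTOM INDEX statement. The following is a minimal sketch that reuses the products table from the earlier example. The index name is hypothetical, and the statement assumes the val column has no other SAI index yet.

```cql
-- Hypothetical index name; assumes val is not already indexed.
CREATE CUSTOM INDEX products_val_standard_idx
  ON default_keyspace.products(val)
  USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
  WITH OPTIONS = {'index_analyzer': 'STANDARD'};
```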

Tokenizer example

An ngram tokenizer splits the given text into contiguous sequences of n characters, which capture linguistic patterns and context. This technique is common in natural language processing (NLP) tasks.

To configure an ngram tokenizer that also lowercases all tokens, ensure the "tokenizer" key specifies the Lucene tokenizer. The remaining key-value pairs in that JSON object configure the tokenizer:

OPTIONS = {
  'index_analyzer':
  '{
    "tokenizer" : {
      "name" : "ngram",
      "args" : {
        "minGramSize" : "2",
        "maxGramSize" : "3"
      }
    },
    "filters" : [
      {
        "name" : "lowercase",
        "args" : {}
      }
    ],
    "charFilters" : []
  }'
}
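
As a sketch of how these options fit into a complete statement, the following creates an ngram-analyzed index and runs a partial-term query against it. The index name is hypothetical, and the statement assumes the products table and sample rows from the earlier example, with no other SAI index on val. The match noted in the comment is an expectation based on how the grams overlap, not captured output.

```cql
-- Hypothetical index name; assumes val is not already indexed.
CREATE CUSTOM INDEX products_val_ngram_idx
  ON default_keyspace.products(val)
  USING 'org.apache.cassandra.index.sai.StorageAttachedIndex' WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "ngram", "args" : {"minGramSize" : "2", "maxGramSize" : "3"}},
    "filters" : [{"name" : "lowercase"}]
}'};

-- The fragment 'hik' is itself split into the grams 'hi', 'hik', and 'ik'.
-- All three also occur in 'hiking', so the row with id = '3' is expected to match.
SELECT * FROM default_keyspace.products WHERE val : 'hik';
```

Small gram sizes make partial matches possible at the cost of a larger index, because every value is stored as many short overlapping terms.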

Non-tokenizing analyzers example

This example shows non-tokenizing analyzers:

OPTIONS = {'case_sensitive': false} // default is true

OPTIONS = {'normalize': true} // default is false

OPTIONS = {'ascii': true} // default is false

These analyzers can be mixed to build a pipeline of filters:

OPTIONS = {'normalize': true, 'case_sensitive': false}

Built-in analyzers

Several built-in analyzers from the Lucene project are available, including the following:

Built-in analyzer types
Generic analyzers	standard, simple, whitespace, stop, lowercase
Language-specific analyzers	Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish
Tokenizers	standard, korean, hmmChinese, openNlp, japanese, wikipedia, letter, keyword, whitespace, classic, pathHierarchy, edgeNGram, nGram, simplePatternSplit, simplePattern, pattern, thai, uax29UrlEmail, icu
CharFilters	htmlstrip, mapping, persian, patternreplace
TokenFilters	apostrophe, arabicnormalization, arabicstem, bulgarianstem, bengalinormalization, bengalistem, brazilianstem, cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams, commongramsquery, dictionarycompoundword, hyphenationcompoundword, decimaldigit, lowercase, stop, type, uppercase, czechstem, germanlightstem, germanminimalstem, germannormalization, germanstem, greeklowercase, greekstem, englishminimalstem, englishpossessive, kstem, porterstem, spanishlightstem, persiannormalization, finnishlightstem, frenchlightstem, frenchminimalstem, irishlowercase, galicianminimalstem, galicianstem, hindinormalization, hindistem, hungarianlightstem, hunspellstem, indonesianstem, indicnormalization, italianlightstem, latvianstem, minhash, asciifolding, capitalization, codepointcount, concatenategraph, daterecognizer, delimitedtermfrequency, fingerprint, fixbrokenoffsets, hyphenatedwords, keepword, keywordmarker, keywordrepeat, length, limittokencount, limittokenoffset, limittokenposition, removeduplicates, stemmeroverride, protectedterm, trim, truncate, typeassynonym, worddelimiter, worddelimitergraph, scandinavianfolding, scandinaviannormalization, edgengram, ngram, norwegianlightstem, norwegianminimalstem, patternreplace, patterncapturegroup, delimitedpayload, numericpayload, tokenoffsetpayload, typeaspayload, portugueselightstem, portugueseminimalstem, portuguesestem, reversestring, russianlightstem, shingle, fixedshingle, snowballporter, serbiannormalization, classic, standard, swedishlightstem, synonym, synonymgraph, flattengraph, turkishlowercase, elision

HCD doesn’t support these filters:

synonymgraph
synonym
commongrams
stop
snowballporter