Configure and use SAI text analyzers with CQL
Your Hyper-Converged Database (HCD) database is ready to query vector data with SAI text analyzers using CQL. You can connect to your database with the CQL shell (CQLSH) or the CQL console in Mission Control.
Analyzers process the text in a column to enable term matching on strings. Combined with vector-based search algorithms, term matching makes it easier to find relevant information in a table. Analyzers semantically filter the results of a vector search by specific terms.
For example, if you generate a vector embedding from the phrase "tell me about available shoes", you can use a vector search to get a list of rows with similar vectors. These rows will likely correlate with shoe-related strings.
SELECT * FROM products
ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
LIMIT 10;
Alternatively, you can filter these search results by a specific keyword, such as hiking:
SELECT * FROM products
WHERE val : 'hiking'
ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
LIMIT 10;
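To make the combination of term filtering and vector ordering concrete, here is a minimal conceptual sketch in Python. It is not how HCD executes the query: it uses an exact brute-force cosine ranking instead of an approximate nearest neighbor (ANN) index, a plain whitespace split instead of a real analyzer, and hypothetical two-dimensional vectors and rows invented for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def filtered_vector_search(rows, query_vector, term, limit):
    # Step 1: keep only rows whose text contains the term
    # (the role played by `WHERE val : 'hiking'`).
    matches = [r for r in rows if term in r["val"].lower().split()]
    # Step 2: rank the remaining rows by similarity to the query vector
    # (the role played by `ORDER BY vector ANN OF [...]`, done exactly here).
    matches.sort(key=lambda r: cosine(r["vector"], query_vector), reverse=True)
    # Step 3: apply the LIMIT.
    return matches[:limit]


# Hypothetical sample data; real embeddings have many more dimensions.
rows = [
    {"id": "1", "val": "soccer cleats", "vector": [0.9, 0.1]},
    {"id": "2", "val": "running shoes", "vector": [0.2, 0.8]},
    {"id": "3", "val": "hiking shoes", "vector": [0.3, 0.7]},
    {"id": "4", "val": "hiking boots", "vector": [0.6, 0.4]},
]

result = filtered_vector_search(rows, [0.25, 0.75], "hiking", 10)
print([r["id"] for r in result])
```

Only rows containing the keyword are candidates, and the vector ordering applies within that filtered set.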
An analyzed index stores values derived from the raw column values. The stored values depend on the analyzer configuration options, which include tokenization, filtering, and character filtering (charFilters).
Analyzer Operator for CQL
To enable analyzer operations using Cassandra Query Language (CQL) on an HCD database, use the index_analyzer option when you create the index.
The : operator can then search indexed columns in Storage Attached Indexes (SAI) that are analyzed.
Example
- Create a table:
CREATE TABLE default_keyspace.products ( id text PRIMARY KEY, val text );
- Create an SAI index with the index_analyzer option and stemming enabled:
CREATE CUSTOM INDEX default_keyspace_products_val_idx
ON default_keyspace.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = { 'index_analyzer': '{ "tokenizer" : {"name" : "standard"}, "filters" : [{"name" : "porterstem"}] }'};
- Insert sample rows:
INSERT INTO default_keyspace.products (id, val) VALUES ('1', 'soccer cleats');
INSERT INTO default_keyspace.products (id, val) VALUES ('2', 'running shoes');
INSERT INTO default_keyspace.products (id, val) VALUES ('3', 'hiking shoes');
- Query to retrieve your data.
  - To get data from the row with id = '2':
  SELECT * FROM default_keyspace.products WHERE val : 'running';
  - To get the row with id = '3' using two values:
  SELECT * FROM default_keyspace.products WHERE val : 'hiking' AND val : 'shoes';
  The analyzer splits the text into case-independent terms.
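The matching behavior in the example above can be pictured with a small Python sketch. It is only a mental model under stated assumptions: `toy_stem` is a hypothetical stand-in that strips two suffixes and is not the Porter stemmer, the regular expression only approximates the standard tokenizer, and treating a match as "every analyzed query term appears among the analyzed column terms" is an assumption of this sketch rather than a documented rule.

```python
import re


def toy_stem(token):
    # Toy stand-in for a real stemmer such as porterstem:
    # strips one of two common suffixes when enough of the word remains.
    for suffix in ("ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    # Split into alphanumeric tokens, lowercase them, then stem each one.
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return {toy_stem(t.lower()) for t in tokens}


def matches(column_value, query_term):
    # The same analysis is applied to the stored value and to the query term;
    # in this sketch a row matches when all query terms are among its terms.
    return analyze(query_term) <= analyze(column_value)


products = {"1": "soccer cleats", "2": "running shoes", "3": "hiking shoes"}

running = [pid for pid, val in products.items() if matches(val, "running")]
hiking_shoes = [
    pid
    for pid, val in products.items()
    if matches(val, "hiking") and matches(val, "shoes")
]
print(running, hiking_shoes)
```

Because the query term goes through the same analysis as the stored text, different surface forms of a word can reduce to the same indexed term.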
Restrictions
- Only SAI indexes support the : operator.
- The analyzed column cannot be part of the primary key, including the partition key and clustering columns.
- The : operator can be used only with SELECT statements.
- The : operator cannot be used with lightweight transactions, such as a condition for an IF clause.
Configuration options
To query with an analyzer, you must configure the index_analyzer option for the SAI index.
This analyzer determines how a column value is analyzed before it is indexed.
The same analyzer is also applied to the query term.
The index_analyzer option takes a single string or a JSON object as its value.
The JSON object configures a single tokenizer, with optional filters and charFilters.
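Because the JSON object is embedded inside a single-quoted CQL string, quoting mistakes are easy to make when writing it by hand. The following Python sketch shows one way to generate the statement from a dictionary. The helper name `build_create_index` is hypothetical, the index naming pattern simply mirrors the earlier example, and the identifiers are interpolated without validation, so treat it as an illustration rather than production code.

```python
import json


def build_create_index(keyspace, table, column, analyzer_config):
    """Return a CREATE CUSTOM INDEX statement with an index_analyzer option."""
    # Serialize the analyzer configuration as JSON (double-quoted keys and values).
    analyzer_json = json.dumps(analyzer_config)
    # CQL string literals use single quotes; a literal single quote inside
    # the string is escaped by doubling it.
    analyzer_literal = analyzer_json.replace("'", "''")
    return (
        f"CREATE CUSTOM INDEX {keyspace}_{table}_{column}_idx "
        f"ON {keyspace}.{table}({column}) "
        "USING 'org.apache.cassandra.index.sai.StorageAttachedIndex' "
        f"WITH OPTIONS = {{ 'index_analyzer': '{analyzer_literal}' }};"
    )


# One tokenizer plus an optional list of filters, as described above.
config = {"tokenizer": {"name": "standard"}, "filters": [{"name": "porterstem"}]}
statement = build_create_index("default_keyspace", "products", "val", config)
print(statement)
```

Generating the JSON with a serializer keeps the inner double quotes and the outer single quotes consistent.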
The following built-in non-tokenizing filters are also available:
| normalize | Normalize input using Normalization Form C (NFC) |
| case_sensitive | When set to false, transform all inputs to lowercase |
| ascii | Same as the ASCIIFoldingFilter from Lucene; converts non-ASCII Unicode characters to their ASCII equivalents, where one exists |
Analyzer example
To configure a built-in analyzer, add the analyzer name to the index OPTIONS:
OPTIONS = {'index_analyzer':'STANDARD'}
Tokenizer example
An ngram tokenizer splits the given text into contiguous sequences of n characters to capture linguistic patterns and context.
N-grams are commonly used in natural language processing (NLP) tasks.
To configure an ngram tokenizer that also lowercases all tokens, ensure the "tokenizer" key specifies the Lucene tokenizer.
The remaining key-value pairs in that JSON object configure the tokenizer:
OPTIONS = {
  'index_analyzer':
  '{
    "tokenizer" : {
      "name" : "ngram",
      "args" : {
        "minGramSize":"2",
        "maxGramSize":"3"
      }
    },
    "filters" : [
      {
        "name" : "lowercase",
        "args": {}
      }
    ],
    "charFilters" : []
  }'
}
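To see what terms a configuration like the one above might produce, the following Python sketch generates character n-grams of sizes 2 to 3 and then lowercases them. It is an approximation for illustration only: the function names are hypothetical, and the actual Lucene tokenizer may differ in details such as token order and how whitespace is handled.

```python
def char_ngrams(text, min_gram_size=2, max_gram_size=3):
    """Return all contiguous character sequences of the given sizes."""
    grams = []
    for start in range(len(text)):
        for size in range(min_gram_size, max_gram_size + 1):
            # Only keep grams that fit entirely inside the text.
            if start + size <= len(text):
                grams.append(text[start:start + size])
    return grams


def analyze(text):
    # Tokenize first, then apply the lowercase filter to every token,
    # mirroring the tokenizer -> filters order of the configuration.
    return [gram.lower() for gram in char_ngrams(text)]


tokens = analyze("Shoes")
print(tokens)
```

Each of these short fragments becomes an indexed term, which is why n-gram indexes can match partial words at the cost of storing more terms per value.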
Non-tokenizing analyzers example
This example shows non-tokenizing analyzers:
OPTIONS = {'case_sensitive': false} // default is true
OPTIONS = {'normalize': true} // default is false
OPTIONS = {'ascii': true} // default is false
These analyzers can be mixed to build a pipeline of filters:
OPTIONS = {'normalize': true, 'case_sensitive': false}
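The effect of these three options on a single, untokenized value can be approximated in Python with the standard `unicodedata` module. This is a sketch under assumptions: the function and parameter names are hypothetical, the order in which the filters run is assumed, and the ASCII folding shown here (decompose, then drop combining marks) covers only accented characters, whereas Lucene's ASCIIFoldingFilter handles many more cases.

```python
import unicodedata


def analyze_value(value, case_sensitive=True, normalize=False, ascii_folding=False):
    """Apply non-tokenizing filters to a whole value; defaults mirror the options above."""
    if normalize:
        # 'normalize': true -> Unicode Normalization Form C (NFC).
        value = unicodedata.normalize("NFC", value)
    if not case_sensitive:
        # 'case_sensitive': false -> lowercase the input.
        value = value.lower()
    if ascii_folding:
        # 'ascii': true -> approximate ASCII folding by decomposing characters
        # and removing the combining marks (accents).
        decomposed = unicodedata.normalize("NFKD", value)
        value = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return value


# "e" followed by a combining acute accent becomes the single character "é".
print(analyze_value("Cafe\u0301", normalize=True) == "Caf\u00e9")
print(analyze_value("Caf\u00e9", case_sensitive=False, ascii_folding=True))
```

With no options set, the value is indexed unchanged, so matches must be exact.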
Built-in analyzers
Several built-in analyzers, tokenizers, and filters from the Lucene project are available:
| Generic analyzers | standard, simple, whitespace, stop, lowercase |
| Language-specific analyzers | Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish |
| Tokenizers | standard, korean, hmmChinese, openNlp, japanese, wikipedia, letter, keyword, whitespace, classic, pathHierarchy, edgeNGram, nGram, simplePatternSplit, simplePattern, pattern, thai, uax29UrlEmail, icu |
| CharFilters | htmlstrip, mapping, persian, patternreplace |
| TokenFilters | apostrophe, arabicnormalization, arabicstem, bulgarianstem, bengalinormalization, bengalistem, brazilianstem, cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams, commongramsquery, dictionarycompoundword, hyphenationcompoundword, decimaldigit, lowercase, stop, type, uppercase, czechstem, germanlightstem, germanminimalstem, germannormalization, germanstem, greeklowercase, greekstem, englishminimalstem, englishpossessive, kstem, porterstem, spanishlightstem, persiannormalization, finnishlightstem, frenchlightstem, frenchminimalstem, irishlowercase, galicianminimalstem, galicianstem, hindinormalization, hindistem, hungarianlightstem, hunspellstem, indonesianstem, indicnormalization, italianlightstem, latvianstem, minhash, asciifolding, capitalization, codepointcount, concatenategraph, daterecognizer, delimitedtermfrequency, fingerprint, fixbrokenoffsets, hyphenatedwords, keepword, keywordmarker, keywordrepeat, length, limittokencount, limittokenoffset, limittokenposition, removeduplicates, stemmeroverride, protectedterm, trim, truncate, typeassynonym, worddelimiter, worddelimitergraph, scandinavianfolding, scandinaviannormalization, edgengram, ngram, norwegianlightstem, norwegianminimalstem, patternreplace, patterncapturegroup, delimitedpayload, numericpayload, tokenoffsetpayload, typeaspayload, portugueselightstem, portugueseminimalstem, portuguesestem, reversestring, russianlightstem, shingle, fixedshingle, snowballporter, serbiannormalization, classic, standard, swedishlightstem, synonym, synonymgraph, flattengraph, turkishlowercase, elision |
HCD doesn’t support these filters:
|