Using analyzers with CQL
Analyzers process the text in a column to enable term matching on large strings. Combined with vector-based search algorithms, this makes it easier to find relevant information in large datasets. Instead of returning only an unfiltered list of results, analyzers let you match on specific terms while still semantically ordering the results by your query vector.
For example, if you ask an LLM “Tell me about available shoes” and run this query with vector search against your Astra DB table, you would get a list of several shoes with a variety of features.
SELECT * FROM products
ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
LIMIT 10;
Alternatively, you can use the analyzer search to specify a keyword, such as hiking:
SELECT * FROM products
WHERE val : 'hiking'
ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
LIMIT 10;
An analyzed index is one where the stored values are derived from the raw column values. The stored values depend on the analyzer configuration options, which can include tokenization, filtering, and character filtering.
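As a sketch of this derivation (the pipeline here, a standard tokenizer plus a lowercase filter, is chosen only for illustration and is not the exact configuration of any index below):

-- Illustrative only: how an analyzer might derive stored terms
-- from a raw column value, assuming a standard tokenizer
-- followed by a lowercase filter.
--
--   raw value    : 'Hiking Shoes'
--   tokenization : ['Hiking', 'Shoes']
--   filtering    : ['hiking', 'shoes']
--
-- The index stores these derived terms rather than the raw string,
-- and the same analysis is applied to query terms at search time.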
Analyzer Operator for CQL
To enable analyzer operations using CQL on Astra serverless databases with Vector Search, a : operator is available in the Cassandra Query Language. This operator searches columns that are indexed with an analyzed Storage Attached Index (SAI).
Example
- Create a table:
CREATE TABLE vsearch.products (id text PRIMARY KEY, val text);
- Create an SAI index with the index_analyzer option and stemming enabled:
CREATE CUSTOM INDEX vsearch_products_val_idx ON vsearch.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {
  'index_analyzer': '{
    "tokenizer" : {"name" : "standard"},
    "filters" : [{"name" : "porterstem"}]
  }'
};
- Insert sample rows:
INSERT INTO vsearch.products (id, val) VALUES ('1', 'soccer cleats');
INSERT INTO vsearch.products (id, val) VALUES ('2', 'running shoes');
INSERT INTO vsearch.products (id, val) VALUES ('3', 'hiking shoes');
- Query to retrieve your data.
To get the data from the rows with id = '2' and id = '3':
SELECT * FROM vsearch.products WHERE val : 'shoes';
The analyzer splits the text into case-independent terms. To get the row with id = '3' using a different case, the analyzer standardizes the query term to perform the match:
SELECT * FROM vsearch.products WHERE val : 'HIKING' AND val : 'shoes';
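Because the index was created with the porterstem filter, both indexed and query terms are reduced to their stems before matching, so a morphological variant of an indexed word should also match. A sketch (assuming the Porter stemmer maps 'runs' and 'running' to the same stem, 'run'):

-- Should return the row with id = '2' ('running shoes'):
-- 'runs' and 'running' both stem to 'run'
SELECT * FROM vsearch.products WHERE val : 'runs';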
Restrictions
- Only SAI indexes support the : operator.
- The analyzed column cannot be part of the primary key, including the partition key and clustering columns.
- The : operator can be used only in SELECT statements.
- The : operator cannot be used with lightweight transactions, such as in a condition for an IF clause.
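For example, a statement like the following sketch would be rejected, because the : operator appears in a lightweight-transaction condition (the statement is illustrative, not valid CQL):

-- Not allowed: ':' in a lightweight-transaction IF condition
UPDATE vsearch.products SET val = 'trail shoes' WHERE id = '3' IF val : 'hiking';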
Configuration options
When querying with an analyzer, you must configure the index_analyzer option for your Storage Attached Index (SAI). This analyzer determines how a column value is analyzed before it is indexed. The same analysis is applied to the query term, too.
The index_analyzer option takes a single string or a JSON object as its value. The elements of the JSON object configure the analyzer: each element configures a tokenizer, a filter, or a charFilter, and exactly one tokenizer must be configured.
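For instance, both of the following forms are valid; the whitespace tokenizer and lowercase filter here are placeholders chosen for illustration:

-- String form: name a built-in analyzer
OPTIONS = {'index_analyzer': 'STANDARD'}

-- JSON form: configure the pipeline explicitly (exactly one tokenizer)
OPTIONS = {'index_analyzer': '{"tokenizer": {"name": "whitespace"}, "filters": [{"name": "lowercase"}]}'}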
The following built-in non-tokenizing filters are also available:
- normalize - normalizes input using Normalization Form C (NFC) when set to true
- case_sensitive - lowercases all input when set to false
- ascii - behaves like Lucene's ASCIIFoldingFilter when set to true, converting non-ASCII characters to their closest ASCII equivalents
Examples
Analyzer
To configure a built-in analyzer, add the analyzer to your index OPTIONS:
OPTIONS = {'index_analyzer':'STANDARD'}
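In context, this option is supplied when creating the index. For example, reusing the vsearch.products table from earlier (the index name is illustrative):

CREATE CUSTOM INDEX vsearch_products_val_std_idx ON vsearch.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {'index_analyzer': 'STANDARD'};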
Tokenizer
An ngram tokenizer processes text by splitting it into contiguous sequences of n items to capture linguistic patterns and context. This is common in natural language processing (NLP) tasks.
To configure an ngram
tokenizer that also lowercases all tokens, ensure the “tokenizer” key specifies the Lucene tokenizer. The remaining key-value pairs in that JSON object configure the tokenizer:
OPTIONS = {
'index_analyzer':
'{
"tokenizer" : {
"name" : "ngram",
"args" : {
"minGramSize":"2",
"maxGramSize":"3"
}
},
"filters" : [
{
"name" : "lowercase",
"args": {}
}
],
"charFilters" : []
}'
}
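For example, with minGramSize 2 and maxGramSize 3, the value 'Shoe' is split into the grams 'Sh', 'ho', 'oe', 'Sho', and 'hoe', and the lowercase filter then normalizes them to 'sh', 'ho', 'oe', 'sho', and 'hoe'.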
Non-tokenizing analyzers
This example shows non-tokenizing analyzers:
OPTIONS = {'case_sensitive': false} // default is true
OPTIONS = {'normalize': true} // default is false
OPTIONS = {'ascii': true} // default is false
These analyzers can be mixed to build a pipeline of filters:
OPTIONS = {'normalize': true, 'case_sensitive': false}
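Like index_analyzer, these options are supplied when creating the index. A sketch, reusing the earlier table (the index name is illustrative):

CREATE CUSTOM INDEX vsearch_products_val_exact_idx ON vsearch.products(val)
USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
WITH OPTIONS = {'normalize': true, 'case_sensitive': false};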
Built-in analyzers
There are several built-in analyzers from the Lucene project, including the following:
- Analyzers: standard, simple, whitespace, stop, lowercase
- Language analyzers: Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish
- Tokenizers: standard, korean, hmmChinese, openNlp, japanese, wikipedia, letter, keyword, whitespace, classic, pathHierarchy, edgeNGram, nGram, simplePatternSplit, simplePattern, pattern, thai, uax29UrlEmail, icu
- CharFilters: htmlstrip, mapping, persian, patternreplace
- Filters: apostrophe, arabicnormalization, arabicstem, bulgarianstem, bengalinormalization, bengalistem, brazilianstem, cjkbigram, cjkwidth, soraninormalization, soranistem, commongrams, commongramsquery, dictionarycompoundword, hyphenationcompoundword, decimaldigit, lowercase, stop, type, uppercase, czechstem, germanlightstem, germanminimalstem, germannormalization, germanstem, greeklowercase, greekstem, englishminimalstem, englishpossessive, kstem, porterstem, spanishlightstem, persiannormalization, finnishlightstem, frenchlightstem, frenchminimalstem, irishlowercase, galicianminimalstem, galicianstem, hindinormalization, hindistem, hungarianlightstem, hunspellstem, indonesianstem, indicnormalization, italianlightstem, latvianstem, minhash, asciifolding, capitalization, codepointcount, concatenategraph, daterecognizer, delimitedtermfrequency, fingerprint, fixbrokenoffsets, hyphenatedwords, keepword, keywordmarker, keywordrepeat, length, limittokencount, limittokenoffset, limittokenposition, removeduplicates, stemmeroverride, protectedterm, trim, truncate, typeassynonym, worddelimiter, worddelimitergraph, scandinavianfolding, scandinaviannormalization, edgengram, ngram, norwegianlightstem, norwegianminimalstem, patternreplace, patterncapturegroup, delimitedpayload, numericpayload, tokenoffsetpayload, typeaspayload, portugueselightstem, portugueseminimalstem, portuguesestem, reversestring, russianlightstem, shingle, fixedshingle, snowballporter, serbiannormalization, classic, standard, swedishlightstem, synonym, synonymgraph, flattengraph, turkishlowercase, elision
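Assuming the string form of index_analyzer also accepts these names (an assumption based on the 'STANDARD' example above), a language analyzer could be selected like this:

OPTIONS = {'index_analyzer': 'english'}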
Astra DB doesn’t support these Elastic filters: