Find data with CQL analyzers

Use analyzers in CQL statements to filter results on tokenized indexes.

This is an advanced option that adds term-matching semantics to the keyword filter, enabling full-text search in databases based on Apache Cassandra®, such as Astra DB Serverless.

In CQL, analyzer operations are expressed with the : operator.

How analyzers work

Analyzers process the text in a column to enable term matching on strings (keyword search). Combined with vector search, term matching can improve the relevance of search results returned from a table. Analyzers semantically filter the results of a vector search by specific terms.

Analyzers are built on the Lucene Java Analyzer API. Storage Attached Indexes (SAI) use this API to transform the text in a column into tokens for indexing and querying, using either built-in or custom analyzers.

For example, if you generate an embedding from the phrase "tell me about available shoes", you can then use a vector search to get a list of rows with similar vectors. These rows will likely correlate with shoe-related strings.

Example: CQL vector search
SELECT * FROM products
  ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
  LIMIT 10;

Alternatively, you can filter these search results by a specific tokenized keyword, such as hiking:

Example: CQL vector search with analyzer
SELECT * FROM products
  WHERE val : 'hiking'
  ORDER BY vector ANN OF [6.0,2.0, ... 3.1,4.0]
  LIMIT 10;

The : operator directs the WHERE clause to match against the tokenized SAI, rather than applying a regular keyword filter.
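For example, assuming the products table and analyzed index built in the walkthrough below, an equality filter matches only exact column values, while the : operator matches individual analyzed terms (the exact behavior of = on an analyzed column can depend on index options):

  -- Exact-value filter: matches only rows where val is exactly 'hiking'
  SELECT * FROM products WHERE val = 'hiking';

  -- Analyzed filter: matches rows whose tokenized val contains the term
  -- 'hiking', such as 'hiking shoes'
  SELECT * FROM products WHERE val : 'hiking';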

Configure and use analyzers

To enable analyzer operations with CQL on a Serverless (Vector) database, create an SAI with the index_analyzer option, and then use the : operator to search the analyzed, indexed column.

An analyzed index stores values derived from the raw column values. The stored values depend on the analyzer configuration options: the index_analyzer option determines how the column values are analyzed before indexing, and the same analyzer is applied to query terms as well.
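In the simplest case, index_analyzer names a built-in analyzer directly. A minimal sketch, with an illustrative index name:

  CREATE CUSTOM INDEX products_val_standard_idx
    ON default_keyspace.products(val)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {'index_analyzer': 'standard'};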

Standard tokenizer example

  1. Create a table:

    CREATE TABLE default_keyspace.products
    (
      id text PRIMARY KEY,
      val text
    );
  2. Create an SAI with the index_analyzer option and stemming enabled:

    CREATE CUSTOM INDEX default_keyspace_products_val_idx
      ON default_keyspace.products(val)
      USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
      WITH OPTIONS = {
        'index_analyzer': '{
          "tokenizer" : {"name" : "standard"},
          "filters" : [{"name" : "porterstem"}]
        }'
      };
  3. Insert some rows:

    INSERT INTO default_keyspace.products (id, val)
      VALUES ('1', 'soccer cleats');
    INSERT INTO default_keyspace.products (id, val)
      VALUES ('2', 'running shoes');
    INSERT INTO default_keyspace.products (id, val)
      VALUES ('3', 'hiking shoes');
  4. Use the : operator with keywords to query data from the inserted rows. The analyzer splits the text into case-independent terms, and stemmed variants also match, as shown after these steps.

    Query on "running"
    SELECT * FROM default_keyspace.products
      WHERE val : 'running';
    Query on "hiking" and "shoes"
    SELECT * FROM default_keyspace.products
      WHERE val : 'hiking' AND val : 'shoes';
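Because the index includes the porterstem filter and the analyzer is also applied to query terms, stemmed variants should match as well. A sketch of the expected behavior: the following query should return the 'running shoes' row, since 'running' and 'run' reduce to the same stem.

  Query on the stem "run"
  SELECT * FROM default_keyspace.products
    WHERE val : 'run';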

For more information, see Simple standard analyzer example.

N-Gram tokenizer example

An ngram tokenizer splits text into contiguous character sequences of length n, capturing subword patterns that support partial and fuzzy matching in natural language processing (NLP) tasks.

To configure an ngram tokenizer that also lowercases all tokens, set the "tokenizer" key to the Lucene tokenizer name. The remaining key-value pairs in the JSON object configure the tokenizer:

  WITH OPTIONS = {
    'index_analyzer': '{
      "tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}},
      "filters" : [{"name" : "lowercase"}]
    }'
  }
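As a sketch of end-to-end usage, the following creates an index with this analyzer and queries on a partial term (the index name is illustrative, and a column typically carries only one analyzed index at a time):

  CREATE CUSTOM INDEX products_val_ngram_idx
    ON default_keyspace.products(val)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {
      'index_analyzer': '{
        "tokenizer" : {"name" : "ngram", "args" : {"minGramSize":"2", "maxGramSize":"3"}},
        "filters" : [{"name" : "lowercase"}]
      }'
    };

  -- Because 2- and 3-character grams are indexed and lowercased, a partial,
  -- case-insensitive term such as 'Hik' should match 'hiking shoes'
  SELECT * FROM default_keyspace.products
    WHERE val : 'Hik';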

For more information, see N-GRAM with lowercase.

Supported built-in analyzers

Several built-in analyzers from the Lucene project (version 9.8.0) are supported, including the following:

Built-in analyzer types:

Generic analyzers: standard, simple, whitespace, stop, lowercase, keyword

Language-specific analyzers: Arabic, Armenian, Basque, Bengali, Brazilian, Bulgarian, Catalan, CJK, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Irish, Italian, Latvian, Lithuanian, Norwegian, Persian, Portuguese, Romanian, Russian, Sorani, Spanish, Swedish, Thai, Turkish

Tokenizers: standard, classic, keyword, letter, nGram, edgeNGram, pathHierarchy, pattern, simplePattern, simplePatternSplit, thai, uax29UrlEmail, whitespace, wikipedia

CharFilters: cjk, htmlstrip, mapping, persian, patternreplace

TokenFilters: apostrophe, wordDelimiterGraph, portugueseLightStem, latvianStem, dropIfFlagged, keepWord, indicNormalization, bengaliStem, turkishLowercase, galicianStem, bengaliNormalization, portugueseMinimalStem, galicianMinimalStem, swedishMinimalStem, stop, limitTokenCount, italianLightStem, wordDelimiter, teluguStem, hungarianLightStem, protectedTerm, lowercase, capitalization, hyphenatedWords, type, keywordMarker, frenchMinimalStem, kStem, swedishLightStem, soraniNormalization, commonGramsQuery, numericPayload, persianStem, limitTokenOffset, hunspellStem, soraniStem, czechStem, norwegianMinimalStem, englishMinimalStem, norwegianLightStem, germanMinimalStem, snowballPorter, removeDuplicates, minHash, keywordRepeat, germanNormalization, dictionaryCompoundWord, synonymGraph, englishPossessive, spanishMinimalStem, fixedShingle, patternTyping, classic, frenchLightStem, trim, indonesianStem, spanishPluralStem, hindiStem, scandinavianFolding, delimitedBoost, commonGrams, reverseString, cjkWidth, fingerprint, finnishLightStem, greekStem, porterStem, limitTokenPosition, persianNormalization, typeAsSynonym, patternReplace, tokenOffsetPayload, codepointCount, bulgarianStem, synonym, germanStem, asciiFolding, decimalDigit, Word2VecSynonym, scandinavianNormalization, russianLightStem, serbianNormalization, elision, portugueseStem, arabicNormalization, length, greekLowercase, concatenateGraph, flattenGraph, fixBrokenOffsets, truncate, cjkBigram, brazilianStem, uppercase, nGram, dateRecognizer, teluguNormalization, shingle, norwegianNormalization, hindiNormalization, delimitedPayload, spanishLightStem, stemmerOverride, patternCaptureGroup, hyphenationCompoundWord, germanLightStem, edgeNGram, typeAsPayload, irishLowercase, delimitedTermFrequency, arabicStem
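Built-in analyzers can be referenced by name in the index_analyzer option. A hedged sketch using the english language analyzer (the index name is illustrative):

  CREATE CUSTOM INDEX products_val_english_idx
    ON default_keyspace.products(val)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {'index_analyzer': 'english'};

A language analyzer typically applies language-specific stemming and stop words, so a query such as val : 'shoe' should also match rows containing 'shoes'.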

For more information and examples, see Built-in analyzers.

Non-tokenizing filters

Non-tokenizing filters include normalize, case_sensitive, and ascii.

You can’t combine non-tokenizing filters with index_analyzer, but you can chain them in a pipeline with other non-tokenizing filters:

OPTIONS = {'normalize': true, 'case_sensitive': false}
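For example, a complete index definition using these options might look like the following (the index name is illustrative):

  CREATE CUSTOM INDEX products_val_norm_idx
    ON default_keyspace.products(val)
    USING 'org.apache.cassandra.index.sai.StorageAttachedIndex'
    WITH OPTIONS = {'normalize': true, 'case_sensitive': false};

  -- With case_sensitive set to false, an equality match such as the following
  -- should match the stored value 'hiking shoes' regardless of case
  SELECT * FROM default_keyspace.products
    WHERE val = 'HIKING SHOES';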
