SELECT

Returns data from a single table. A SELECT statement without a WHERE clause is not recommended, because it returns all rows from all partitions, and queries that span multiple partitions can seriously impact performance.

See also: CREATE TABLE

Syntax

SELECT [ JSON | DISTINCT ] <selectors>
  FROM [<keyspace_name>.]<table_name>
  [ WHERE [ <primary_key_conditions> | <non_primary_key_conditions> IN (<column_name> [, ...]) ALLOW FILTERING ] [ AND <index_conditions> ] ]
  [ GROUP BY <column_name> [ , ... ] ]
  [ ORDER BY <column_name> ( ASC | DESC ) [ , ... ] ] |
  [ ORDER BY <vector_column_name> ANN OF [n,n,n,...] [ LIMIT N ] ]
  [ ( LIMIT <N> [ OFFSET <N> ]  | PER PARTITION LIMIT <N> ) ]
  [ ALLOW FILTERING ] ;
Syntax legend

UPPERCASE

Literal keyword.

Lowercase

Not literal.

< >

Variable value. Replace with a user-defined value.

[]

Optional. Square brackets ([]) surround optional command arguments. Do not type the square brackets.

( )

Group. Parentheses ( ( ) ) identify a group to choose from. Do not type the parentheses.

|

Or. A vertical bar (|) separates alternative elements. Type any one of the elements. Do not type the vertical bar.

...

Repeatable. An ellipsis ( ... ) indicates that you can repeat the syntax element as often as required.

'<Literal string>'

Single quotation (') marks must surround literal strings in CQL statements. Use single quotation marks to preserve upper case.

{ <key> : <value> }

Map collection. Braces ({ }) enclose map collections or key value pairs. A colon separates the key and the value.

<datatype1>,<datatype2>

Set, list, map, or tuple. Angle brackets ( < > ) enclose data types in a set, list, map, or tuple. Separate the data types with a comma.

<cql_statement>;

End CQL statement. A semicolon (;) terminates all CQL statements.

[--]

Separate the command line options from the command arguments with two hyphens ( -- ). This syntax is useful when arguments might be mistaken for command line options.

'<schema> ... </schema>'

Search CQL only: Single quotation marks (') surround an entire XML schema declaration.

@<xml_entity>='<xml_entity_type>'

Search CQL only: Identify the entity and literal value to overwrite the XML element in the schema and solrConfig files.

Parameters


keyspace_name

Optional. Name of the keyspace that contains the table to select.

If no name is specified, the current keyspace is used.

table_name

Name of the table to select.

column_name

Name of the column to select.

vector_column_name

Name of the vector column to use for ANN ordering in the ORDER BY clause.

Selectors

The SELECT statement is used to query data from a table in CQL. The results are returned as a set of rows. Selectors are used to specify which columns are queried and returned. The SELECT statement can also transform the data on return using functions. Table data also has related metadata that can be returned using metadata functions.

Return all columns

To return all columns from a table, use the wildcard * selector.

SELECT * FROM cycling.cyclist_name;
Results
 id                                   | firstname | lastname
--------------------------------------+-----------+-----------------
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 |      Alex |           FRAME
 fb372533-eb95-4bb4-8685-6ef61e994caa |   Michael |        MATTHEWS
 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |  Marianne |             VOS
 220844bf-4860-49d6-9a4b-6b5d3a79cbfb |     Paolo |       TIRALONGO
 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 |    Steven |      KRUIKSWIJK
 e7cd5752-bc0d-4157-a80f-7523add8dbcd |      Anna | VAN DER BREGGEN

(6 rows)

Since the * selector returns all columns, it is not necessary to specify the column names. The more columns a table has, the more resources are required to return the data.

Return specific columns

Because the number of columns returned can affect performance, it is recommended to specify the columns to return, if possible.

For example, to return only the lastname column from the cycling.cyclist_name table:

SELECT lastname FROM cycling.cyclist_name;
Results
 lastname
-----------------
           FRAME
        MATTHEWS
             VOS
       TIRALONGO
      KRUIKSWIJK
 VAN DER BREGGEN

(6 rows)

To return multiple columns, separate the column names with a comma:

SELECT lastname, firstname FROM cycling.cyclist_name;
Results
 lastname        | firstname
-----------------+-----------
           FRAME |      Alex
        MATTHEWS |   Michael
             VOS |  Marianne
       TIRALONGO |     Paolo
      KRUIKSWIJK |    Steven
 VAN DER BREGGEN |      Anna

(6 rows)

Columns can be renamed on return using an alias:

SELECT id, cyclist_age AS age FROM cycling.cyclist_alt_stats LIMIT 3;
Results
 id                                   | age
--------------------------------------+------
 e0953617-07eb-4c82-8f91-3b2757981625 | null
 1ba0417d-62da-4103-b710-de6fb222db6f |   34
 4ceb495c-55ab-4f71-83b9-81117252bf13 |   34

(3 rows)

A column alias cannot be used in the WHERE clause; use the original column name.
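
For example, a minimal sketch that repeats the previous query but filters on the original column name (assuming no index exists on cyclist_age, the filter also requires ALLOW FILTERING because the column is not part of the primary key):

SELECT id, cyclist_age AS age FROM cycling.cyclist_alt_stats
  WHERE cyclist_age = 34 ALLOW FILTERING;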

Return transformed data

The SELECT statement can transform data on return using functions. For example, the JSON keyword returns all selected columns in JSON format, the toJson() function returns a selected column as JSON, and the CAST function changes the data type of a returned column. DISTINCT returns only the unique values of a column, and the COUNT function returns the number of rows that match the query. Native functions such as SUM or AVG, user-defined functions, and user-defined aggregates (such as an average function) can also transform the data. The TOKEN function returns the token value of the partition key columns.

To output selected data from a table in JSON format, use the JSON keyword:

SELECT JSON * FROM cycling.birthday_list;
Results
 [json]
----------------------------------------------------------------------------------------------------------
    {"cyclist_name": "Claudio HEINEN", "blist": {"age": "23", "bday": "27/07/1992", "nation": "GERMANY"}}
    {"cyclist_name": "Claudio VANDELLI", "blist": {"age": "54", "bday": "27/07/1961", "nation": "ITALY"}}
 {"cyclist_name": "Luc HAGENAARS", "blist": {"age": "28", "bday": "27/07/1987", "nation": "NETHERLANDS"}}
   {"cyclist_name": "Toine POELS", "blist": {"age": "52", "bday": "27/07/1963", "nation": "NETHERLANDS"}}
     {"cyclist_name": "Allan DAVIS", "blist": {"age": "35", "bday": "27/07/1980", "nation": "AUSTRALIA"}}
   {"cyclist_name": "Laurence BOURQUE", "blist": {"age": "23", "bday": "27/07/1992", "nation": "CANADA"}}

(6 rows)

Note the nested JSON structure in the output, with the column name as the key and the column value as the value.

To specify the JSON format for a particular column, enclose its name in the toJson() function:

SELECT id, commenter, toJson(created_at) FROM cycling.comments_vs;
Results
 id                                   | commenter | system.tojson(created_at)
--------------------------------------+-----------+----------------------------
 e8ae5cf3-d358-4d99-b900-85902fda9bb0 |      John | "2017-04-01 14:33:02.160Z"
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 |      Alex | "2017-04-01 14:33:02.160Z"
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 |      Alex | "2017-03-21 21:11:09.999Z"
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 |      Alex | "2017-02-14 20:43:20.000Z"
 c7fceba0-c141-4207-9494-a29f9809de6f |       Amy | "2024-07-02 22:39:30.758Z"
 c7fceba0-c141-4207-9494-a29f9809de6f |       Amy | "2017-04-01 13:43:08.030Z"
 c7fceba0-c141-4207-9494-a29f9809de6f |       Amy | "2017-03-22 01:16:59.001Z"
 c7fceba0-c141-4207-9494-a29f9809de6f |       Amy | "2017-02-17 08:43:20.234Z"
 c7fceba0-c141-3207-9494-a29f9809de6f |       Amy | "2017-02-17 08:43:20.234Z"
 c7fceba0-c141-7207-9494-a29f9809de6f |       Amy | "2024-07-02 22:39:30.763Z"
 e8ae5df3-d358-4d99-b900-85902fda9bb0 |      Jane | "2017-04-01 14:33:02.160Z"
 c7fceba0-c141-4207-9594-a29f9809de6f |      Jane | "2017-02-17 08:43:20.234Z"

(12 rows)

Note that the JSON-formatted timestamp column can include complete time zone information.

To cast a column to a different data type, use the CAST function:

  SELECT CAST(created_at AS date) FROM cycling.comments;
Results
 cast(created_at as date)
--------------------------
               2023-04-01
               2024-06-07

(2 rows)

If a table has duplicate values in a column, use DISTINCT to return only the unique values:

SELECT DISTINCT country FROM cycling.country_flag;
Results
 country
---------
 Belgium
  France

(2 rows)

To count the number of rows that match a query, use the COUNT function. Aggregate functions can be combined, as in this example, which returns both the maximum start_day and the row count:

SELECT start_month, MAX(start_day), COUNT(*) FROM cycling.events
  WHERE year = 2017 AND discipline = 'Cyclo-cross';
Results
 start_month | system.max(start_day) | count
-------------+-----------------------+-------
           1 |                    28 |    11

(1 rows)

The COUNT function can also be combined with a WHERE clause. For example, to get the number of cyclists from Belgium:

SELECT count(cyclist_name) FROM cycling.country_flag 
 WHERE country='Belgium';
Results
 system.count(cyclist_name)
----------------------------
                          2

(1 rows)

A number of native functions can be used to transform the data. The data is transformed only on return; the stored values are unchanged. These are common functions, such as AVG, which returns the average:

SELECT AVG(race_points) FROM cycling.cyclist_points 
  WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
Results
 system.avg(race_points)
-------------------------
                      67

(1 rows)

User-defined functions (UDFs) can also be used to transform the data. For example, a left function can return the leftmost characters of a text column:

SELECT left(firstname, 1), lastname FROM cycling.cyclist_name;
Results

User-defined aggregates (UDAs) can also be used to transform the data. For example, an average aggregate can return the average of the cyclist_time_sec column:

SELECT average(cyclist_time_sec) AS Average FROM cycling.team_average
  WHERE team_name = 'UnitedHealthCare Pro Cycling Womens Team';
Results
 average
-------------
 11474.66667

(1 rows)

Warnings :
User defined function cycling.average_state : (tuple<int, bigint>, int) -> tuple<int, bigint> consumed more than 500µs CPU time (2120µs)

The TOKEN function returns the token value, which is the hashed value of the partition key; it can therefore only be used on the partition key column or columns. This function is useful for filtering data on token ranges in a WHERE clause.

To filter on the token value of the composite partition key (race_year, race_name) in the cycling.rank_by_year_and_name table:

SELECT race_name AS name, race_year AS year FROM cycling.rank_by_year_and_name
  WHERE TOKEN(race_year,race_name) >= 4582455970709790046;
Results
 name                                       | year
--------------------------------------------+------
 Tour of Japan - Stage 4 - Minami > Shinshu | 2015
 Tour of Japan - Stage 4 - Minami > Shinshu | 2015
 Tour of Japan - Stage 4 - Minami > Shinshu | 2015

(3 rows)

Return column metadata

Each non-primary key column in a table has metadata associated with it. The WRITETIME function returns the timestamp of the last write to a column. The TTL function returns the time to live of a column.

Note that the metadata functions are not available for all columns. For example, they cannot be used on primary key columns or on counter columns. They also cannot be used on columns that are non-frozen collections or non-frozen user-defined types.

To return the writetime of the firstname column in the cycling.cyclist_points table:

SELECT WRITETIME (firstname) FROM cycling.cyclist_points
  WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
Results
 writetime(firstname)
----------------------
     1719960338817118
     1719960338823276
     1719960338827071

(3 rows)

The TTL function returns the time to live of a column. This function is useful when using Time to Live (TTL) to expire data in a table. If a TTL is set on a column, the data is automatically deleted after the specified time has elapsed.

To return the TTL of the race_name column in the cycling.calendar table:

SELECT TTL(race_name) FROM cycling.calendar WHERE race_id = 200;
Results
 ttl(race_name)
----------------
            199

(1 rows)

Options

GROUP BY clause

Group by one or more columns. GROUP BY condenses the selected rows that share the same values for a set of columns, or for values returned by a function, into a single group. Only primary key columns (in primary key order) or deterministic functions of a clustering column, such as FLOOR, can be used in the GROUP BY clause.

SELECT race_date, race_time FROM cycling.race_times_summary
  GROUP BY race_date;
Results
 race_date  | race_time
------------+--------------------
 2019-03-21 | 10:01:18.000000000
 2018-07-26 | 10:01:18.000000000
 2017-04-14 | 10:01:18.000000000

(3 rows)

Warnings :
Aggregation query used without partition key

Each set of rows with the same race_date column value is grouped into one row in the query output. Three rows are returned because there are three distinct race_date values. For a column that is neither grouped nor aggregated, such as race_time, the value returned is the first value found for the group.

The following query groups the rows by race_date and FLOOR(race_time, 1h), which truncates race_time to the start of the hour. The number of rows in each group is returned by COUNT(*).

  SELECT race_date, FLOOR(race_time, 1h), COUNT(*) FROM cycling.race_times_summary
    GROUP BY race_date, FLOOR(race_time, 1h);
Results
 race_date  | system.floor(race_time, 1h) | count
------------+-----------------------------+-------
 2019-03-21 |          10:00:00.000000000 |     2
 2019-03-21 |          11:00:00.000000000 |     1
 2019-03-21 |          12:00:00.000000000 |     1
 2018-07-26 |          10:00:00.000000000 |     2
 2018-07-26 |          11:00:00.000000000 |     1
 2018-07-26 |          12:00:00.000000000 |     1
 2017-04-14 |          10:00:00.000000000 |     2
 2017-04-14 |          11:00:00.000000000 |     1
 2017-04-14 |          12:00:00.000000000 |     1

(9 rows)

Warnings :
Aggregation query used without partition key

Nine rows are returned because there are nine groups of rows with the same race_date and FLOOR(race_time, 1h) values.

ORDER BY clause

You can fine-tune the display order using the ORDER BY clause. The partition key must be defined in the WHERE clause and then the ORDER BY clause defines one or more clustering columns to use for ordering. The order of the specified columns must match the order of the clustering columns in the PRIMARY KEY definition. The options for ordering are ASC (ascending) and DESC (descending).

If no order is specified, the results are returned in the stored order.

Note that using both IN and ORDER BY requires turning off paging with the PAGING OFF command in cqlsh.
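
For example, disable paging in cqlsh before running the query that follows:

PAGING OFF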

SELECT * FROM cycling.calendar WHERE race_id IN (100, 101, 102)
ORDER BY race_start_date ASC;
Results

The ORDER BY clause also supports vector search on a vector column. The result set is sorted by similarity to the supplied vector using the Approximate Nearest Neighbor (ANN) algorithm, as shown in the LIMIT clause example below.

LIMIT clause

If a query returns a large number of rows, you can use LIMIT to reduce the amount of data returned. The default limit is 10,000 rows, the maximum number of rows that cqlsh allows. This example limits the result to 3 rows:

SELECT * FROM cycling.comments_vs 
  ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55] 
  LIMIT 3;
Results
 id                                   | created_at                      | comment                                | comment_vector                                    | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+---------------------------------------------------+-----------+--------------------------------------
 e8ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |              rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' |      John | f25e4fe1-38c3-11ef-bd85-f92c3c7170c3
 e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' |      Alex | f259bc00-38c3-11ef-bd85-f92c3c7170c3
 e8ae5df3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 |                    Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' |      Jane | f25eec21-38c3-11ef-bd85-f92c3c7170c3

(3 rows)

OFFSET option

The LIMIT clause can also use the OFFSET option to skip a number of rows before returning the results. OFFSET cannot be used without LIMIT.

This is an expensive feature for large offsets and should be avoided. Non-offset pagination is a better option for large datasets. There are guardrail settings in the cassandra.yaml file that will prevent offsets that are too large. The offset_rows_warn_threshold defaults to 10,000 rows. The offset_rows_failure_threshold defaults to 20,000 rows. There are a number of restrictions on the use of OFFSET:

  • Specifying an OFFSET disables normal key-based paging.

  • Using OFFSET 0 also disables key-based paging.

  • ANN queries don’t support OFFSET.

This example limits the rows to 3, but skips the first two rows:

SELECT comment, comment_vector, commenter FROM cycling.comments_vs
  WHERE commenter = 'Alex'
  LIMIT 3 OFFSET 2;
Results
 comment                                | comment_vector                | commenter
----------------------------------------+-------------------------------+-----------
 Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] |      Alex

(1 rows)

PER PARTITION LIMIT clause

The PER PARTITION LIMIT option sets the maximum number of rows that the query returns from each partition. It is only useful for tables whose data is spread across more than one partition. An example of such a table is defined here:

USE cycling;
CREATE TABLE rank_by_year_and_name ( 
  race_year int, 
  race_name text, 
  cyclist_name text, 
  rank int, 
  PRIMARY KEY ((race_year, race_name), rank) 
);

where the partition key is a composite of race_year and race_name.

The following query returns the top two cyclists from each partition stored:

SELECT rank, cyclist_name AS name FROM cycling.rank_by_year_and_name 
  PER PARTITION LIMIT 2;
Results
 rank | name
------+----------------------
    1 |    Phillippe GILBERT
    2 |        Daniel MARTIN
    1 |        Daniel MARTIN
    2 | Johan Esteban CHAVES
    1 |        Ilnur ZAKARIN
    2 |      Carlos BETANCUR
    1 |      Benjamin PRADES
    2 |          Adam PHELAN

(8 rows)

ALLOW FILTERING

The ALLOW FILTERING clause allows queries that require scanning all partitions, with no primary key columns specified. It should not be used in production because it can cause severe performance issues. When modeling your data, design the schema so that ALLOW FILTERING is not needed. However, it can be useful for small datasets or for testing, and it may even help you identify where to add indexes to your data model.

For more information, see Allow Filtering explained.

The following query filters on the birthday and nationality columns of the cyclist_alt_stats table, which requires the ALLOW FILTERING clause:

SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats
  WHERE birthday = '1991-08-25' AND nationality = 'Ethiopia'
ALLOW FILTERING;
Results
 lastname | birthday   | nationality
----------+------------+-------------
    GRMAY | 1991-08-25 |    Ethiopia

(1 rows)

Usage notes

The WHERE clause is the key to filtering rows returned by the SELECT statement. If indexes exist for columns in a table, use the indexed columns in the WHERE clause to improve query performance.

The query evaluates the WHERE logical statements hierarchically. First, partition keys are evaluated, followed by clustering columns, and then regular columns.
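
For example, a minimal sketch using the cycling.rank_by_year_and_name table defined in the PER PARTITION LIMIT section above; the partition key columns are restricted first, then a clustering column (a condition on a regular column such as cyclist_name would additionally require an index or ALLOW FILTERING):

SELECT rank, cyclist_name FROM cycling.rank_by_year_and_name
  WHERE race_year = 2015                                        -- partition key column
  AND race_name = 'Tour of Japan - Stage 4 - Minami > Shinshu'  -- partition key column
  AND rank <= 3;                                                -- clustering column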

Partition keys

The query requires that all partition key columns are restricted, except when querying with an index. Use logic statements that identify the partition key columns with these operators (see the sketch after this list):

  • equals (=): Any partition key column.

  • IN: Restricted to the last column of the partition key to search multiple partitions.

  • range (>=, <=, >, and <) on tokens: Fully tokenized partition key (all partition key columns specified in order as arguments of the token function). Use token ranges to scan data stored on a particular node.
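
For example, minimal sketches against the rank_by_year_and_name table: the first restricts multiple partitions with IN on the last partition key column, and the second scans a token range (the second race name is a hypothetical value):

SELECT * FROM cycling.rank_by_year_and_name
  WHERE race_year = 2015
  AND race_name IN ('Tour of Japan - Stage 4 - Minami > Shinshu', 'Another Race');  -- 'Another Race' is hypothetical

SELECT * FROM cycling.rank_by_year_and_name
  WHERE TOKEN(race_year, race_name) >= 4582455970709790046;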

See Partition keys for examples and instructions.

Clustering columns

Use logic statements that identify the clustering segment. Clustering columns set the sort order of the stored data, which is nested when there are multiple clustering columns. After evaluating the partition key, the query evaluates the clustering statements in nested order: first (top level), second, third, and so on. If the table has only one clustering column, all operators are supported in logical statements. For tables with multiple clustering columns, the following restrictions apply so that the data can be located efficiently within the partition:

  • All clustering columns excluding the last clustering column:

    • Equals (=)

    • IN

  • Last clustering column:

    • All equality and inequality operators, and multi-column comparisons

Clustering column logic statements also support returning slices across clustering segments. A slice identifies the row with the corresponding values and lets you return all rows before or after it, or between two rows when two slice statements are included.
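
For example, a minimal sketch using a hypothetical table with two clustering columns; the non-final clustering column is restricted with equals, the last clustering column with a range, and a multi-column comparison returns a slice:

CREATE TABLE IF NOT EXISTS cycling.race_schedule (  -- hypothetical table
  race_year int,
  race_start_date date,
  race_end_date date,
  race_name text,
  PRIMARY KEY (race_year, race_start_date, race_end_date)
);

SELECT * FROM cycling.race_schedule
  WHERE race_year = 2017
  AND race_start_date = '2017-04-14'    -- non-final clustering column: equals or IN
  AND race_end_date >= '2017-04-16';    -- last clustering column: range allowed

SELECT * FROM cycling.race_schedule
  WHERE race_year = 2017
  AND (race_start_date, race_end_date) > ('2017-04-14', '2017-04-16');  -- slice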

See Clustering columns for examples and instructions.

Regular columns

The database allows retrieving data by filtering on a regular or static column if an index exists for the column. Otherwise, filtering on these columns (for example, with IN) requires ALLOW FILTERING.

The ALLOW FILTERING option overrides restrictions on filtering partition columns, clustering columns, and regular columns, but can negatively impact performance, causing read latencies.
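
For example, a minimal sketch that creates an index on the nationality column of cycling.cyclist_alt_stats (the index name is hypothetical) so that the column can be filtered without ALLOW FILTERING:

CREATE INDEX IF NOT EXISTS cyclist_nationality_idx  -- hypothetical index name
  ON cycling.cyclist_alt_stats (nationality);

SELECT lastname, nationality FROM cycling.cyclist_alt_stats
  WHERE nationality = 'Ethiopia';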

Examples

Detailed examples can be found for various filtering conditions:

  • All columns

  • Transform columns (JSON, DISTINCT, CAST, COUNT, functions, aggregates)

  • Specific columns

  • Column metadata (TTL, WRITETIME)

  • Arithmetic operators

  • Non-arithmetic operators (IN, CONTAINS, CONTAINS KEY)

  • Vector search with ORDER BY

  • LIMIT and OFFSET
