SELECT

Retrieve data from a Cassandra table.

Retrieve data from a Cassandra table.

Synopsis

SELECT select_expression
  FROM keyspace_name.table_name
  WHERE relation AND relation ... 
  ORDER BY ( clustering_column  ASC | DESC ...)
  LIMIT n
  ALLOW FILTERING

select expression is:

selection_list
| DISTINCT selection_list
|  COUNT ( * | 1 ) 
selection_list is one of:
  • A list of partition keys (used with DISTINCT)
  • selector AS alias, selector AS alias, ...| * 

    alias is an alias for a column name.

selector is:

column name
|  WRITETIME (column_name) 
|  TTL (column_name) 
| function (selector , selector, ...) 

function is a timeuuid function, a token function, or a blob conversion function.

relation is:

column_name op term
| ( column_name, column_name, ... ) op term-tuple 
| column_name IN  term, ( term ... ) 
| ( column_name, column_name, ... ) IN  term-tuple, ( term-tuple ... ) 
| TOKEN (column_name, ...) op  term 

op is = | < | > | <= | > | = | CONTAINS | CONTAINS KEY

term-tuple (Apache Cassandra 2.1 and later) is:
( term, term, ... )
term is
  • a constant: string, number, uuid, boolean, hex
  • a bind marker (?)
  • a function
  • set:
    { literal, literal, ... }
  • list:
    [ literal, literal, ... ]
  • map:
    { literal : literal, literal : literal, ... }
Table 1. Legend
  • Uppercase means literal
  • Lowercase means not literal
  • Italics mean optional
  • The pipe (|) symbol means OR or AND/OR
  • Ellipsis (...) means repeatable

A semicolon that terminates CQL statements is not included in the synopsis.

Description

A SELECT statement reads one or more records from a Cassandra table. The input to the SELECT statement is the select expression. The output of the select statement depends on the select expression:

Table 2. Select Expression Output
Select Expression Output
Column of list of columns Rows having a key value and collection of columns
COUNT aggregate function One row with a column that has the value of the number of rows in the resultset
DISTINCT partition key list Values of columns that are different from other column values
WRITETIME function The date/time that a write to a column occurred
TTL function The remaining time-to-live for a column

Specifying columns

The SELECT expression determines which columns, if any, appear in the result. Using the asterisk specifies selection of all columns:

SELECT * from People;

Columns in big data applications duplicate values. Use the DISTINCT keyword to return only distinct (different) values of partition keys.

Counting returned rows

A SELECT expression using COUNT(*) returns the number of rows that matched the query. Alternatively, you can use COUNT(1) to get the same result.

Count the number of rows in the users table:

SELECT COUNT(*) FROM users;
The capability to use an alias for a column name is particularly useful when using a function call on a column, such as dateOf(created_at), in the select expression.

SELECT event_id, dateOf(created_at), blobAsText(content) FROM timeline;
      

Using a column alias

You can define an alias on columns using the AS keyword.


SELECT event_id, 
  dateOf(created_at) AS creation_date,
  blobAsText(content) AS content 
FROM timeline;
      

In the output, columns assume the aesthetically-pleasing name.


 event_id                | creation_date            | content
-------------------------+--------------------------+----------------
 550e8400-e29b-41d4-a716 | 2013-07-26 10:44:33+0200 | Some stuff        

Specifying rows returned using LIMIT

Using the LIMIT option, you can specify that the query return a limited number of rows.

SELECT COUNT(*) FROM big_table LIMIT 50000;
SELECT COUNT(*) FROM big_table LIMIT 200000;

The output of these statements if you had 105,291 rows in the database would be: 50000, and 105,291. The cqlsh shell has a default row limit of 10,000. The Cassandra server and native protocol do not limit the number of rows that can be returned, although a timeout stops running queries to protect against running malformed queries that would cause system instability.

Specifying the table using FROM

The FROM clause specifies the table to query. Optionally, specify a keyspace for the table followed by a period, (.), then the table name. If a keyspace is not specified, the current keyspace is used.

For example, count the number of rows in the IndexInfo table in the system keyspace:

SELECT COUNT(*) FROM system."IndexInfo";

Filtering data using WHERE

The WHERE clause specifies which rows to query. In the WHERE clause, refer to a column using the actual name, not an alias. Columns in the WHERE clause need to meet one of these requirements:
  • The partition key definition includes the column.
  • A column that is indexed using CREATE INDEX.
The primary key in the WHERE clause tells Cassandra to race to the specific node that has the data. Put the name of the column to the left of the = or IN operator. Put the column value to the right of the operator. For example, empID and deptID columns are included in the partition key definition in the following table, so you can query all the columns using the empID in the WHERE clause:
CREATE TABLE emp (
  empID int,
  deptID int,
  first_name varchar,
  last_name varchar,
  PRIMARY KEY (empID, deptID));

SELECT deptid FROM emp WHERE empid = 104; 

Cassandra supports these conditional operators in the WHERE clause: CONTAINS, CONTAINS KEY, IN, =, >, >=, <, or <=, but not all in certain situations.

Restrictions on the use of conditions

  • Non-equal conditional operations on the partition key

    Regardless of the partitioner in use, Cassandra does not support non-equal conditional operations on the partition key. Use the token function for range queries on the partition key.

  • Using the IN condition with a compound or composite partition key

    The IN condition is allowed on the last column of the partition key only if you query all preceding columns of key for equality. "Using the IN filter condition" presents examples of using the IN operation.

  • Querying an indexed table

    A query on an indexed table must have at least one equality condition on the indexed column, as shown in "Indexing a column".

  • Range queries

    Cassandra supports greater-than and less-than comparisons, but for a given partition key, the conditions on the clustering column are restricted to the filters that allow Cassandra to select a contiguous ordering of rows.

    For example:
CREATE TABLE ruling_stewards (
  steward_name text,
  king text,
  reign_start int,
  event text,
  PRIMARY KEY (steward_name, king, reign_start)
); 

This query constructs a filter that selects data about stewards whose reign started by 2450 and ended before 2500. If king were not a component of the primary key, you would need to create an index on king to use this query:

SELECT * FROM ruling_stewards
  WHERE king = 'Brego'
  AND reign_start >= 2450
  AND reign_start < 2500 ALLOW FILTERING;

The output is:

 steward_name | king  | reign_start | event
--------------+-------+-------------+--------------------
      Boromir | Brego |        2477 |   Attacks continue
       Cirion | Brego |        2489 | Defeat of Balchoth
       
(2 rows)

To allow Cassandra to select a contiguous ordering of rows, you need to include the king component of the primary key in the filter using an equality condition. The ALLOW FILTERING clause is also required. ALLOW FILTERING provides the capability to query the clustering columns using any condition if performance is not an issue.

ALLOW FILTERING clause

When you attempt a potentially expensive query, such as searching a range of rows, this prompt appears:

Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want
to execute this query despite the performance unpredictability,
use ALLOW FILTERING.

To run the query, use the ALLOW FILTERING clause. Imposing a limit using the LIMIT n clause is recommended to reduce memory used. For example:

Select * FROM ruling_stewards
  WHERE king = 'none'
  AND reign_start >= 1500
  AND reign_start < 3000 LIMIT 10 ALLOW FILTERING;

Critically, LIMIT doesn't protect you from the worst liabilities. For instance, what if there are no entries with no king? Then you have to scan the entire list no matter what LIMIT is.

ALLOW FILTERING will probably become less strict as we collect more statistics on our data. For example, if we knew that 90% of entries have no king we would know that finding 10 such entries should be relatively inexpensive.

Using the IN filter condition

Use IN, an equals condition operator, in the WHERE clause to specify multiple possible values for a column. For example, select two columns, first_name and last_name, from three rows having employee ids (primary key) 105, 107, or 104:

SELECT first_name, last_name FROM emp WHERE empID IN (105, 107, 104);

Format values for the IN conditional test as a comma-separated list. The list can consist of a range of column values.

Using IN to filter on a compound or composite primary key

The IN condition is recommended on the last column of the partition key only if you query all preceding columns of key for equality. For example:
CREATE TABLE parts (part_type text, part_name text, part_num int, part_year text, serial_num text, PRIMARY KEY ((part_type, part_name), part_num, part_year));

SELECT * FROM parts WHERE part_type='alloy' AND part_name='hubcap' AND part_num=1249 AND part_year IN ('2010', '2015');
The IN condition will not work on any clustering column except the last one. The following query will not work, since part_year is the final clustering column, not part_num:
SELECT * FROM parts WHERE part_type='alloy' AND part_name='hubcap' AND part_num IN (1249, 3456);
You can omit the equality test for clustering columns other than the last when using IN, but such a query might involve data filtering and thus may have unpredictable performance. Such a query requires use of ALLOW FILTERING. For example:
SELECT * FROM parts WHERE part_num=123456 AND part_year IN ('2010', '2015') ALLOW FILTERING;

CQL supports an empty list of values in the IN clause, useful in Java Driver applications when passing empty arrays as arguments for the IN clause.

When not to use IN

The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.

Comparing clustering columns

You can group the partition key and clustering columns and compare the tuple to values for slicing over rows in a partition. For example:
SELECT * FROM ruling_stewards WHERE (steward_name, king) = ('Boromir', 'Brego');

The syntax used in the WHERE clause compares records of steward_name and king as a tuple against the Boromir, Brego tuple.

Paging through unordered results

The TOKEN function can be used with a condition operator on the partition key column to query. The query selects rows based on the token of their partition key rather than on their value. The token of a key depends on the partitioner in use. Use with the RandomPartitioner or Murmur3Partitioner will not give you a meaningful order.

For example, assume you defined this table:

CREATE TABLE periods (
  period_name text,
  event_name text,
  event_date timestamp,
  weak_race text,
  strong_race text,
  PRIMARY KEY (period_name, event_name, event_date)
);

After inserting data, this query uses the TOKEN function to find the data using the partition key.

SELECT * FROM periods
  WHERE TOKEN(period_name) > TOKEN('Third Age')
  AND TOKEN(period_name) < TOKEN('Fourth Age');

Using compound primary keys and sorting results

ORDER BY clauses can select a single column only. That column has to be the second column in a compound PRIMARY KEY. This also applies to tables with more than two column components in the primary key. Ordering can be done in ascending or descending order, default ascending, and specified with the ASC or DESC keywords.

In the ORDER BY clause, refer to a column using the actual name, not the aliases.

For example, set up the playlists table, which uses a compound primary key, insert the example data, and use this query to get information about a particular playlist, ordered by song_order. You do not need to include the ORDER BY column in the select expression.

SELECT * FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
  ORDER BY song_order DESC LIMIT 50;

Output is:

Or, create an index on playlist artists, and use this query to get titles of Fu Manchu songs on the playlist:

CREATE INDEX ON playlists(artist)

SELECT album, title FROM playlists WHERE artist = 'Fu Manchu';

Output is:

Filtering a collection set, list, or map

You can query a table containing a collection to retrieve the collection in its entirety. You can also index the collection column, and then use the CONTAINS condition in the WHERE clause to filter the data for a particular value in the collection. Continuing with the music service example, after adding the collection of tags to the playlists table, adding some tag data, and then indexing the tags, you can filter on 'blues' in the tags set.

SELECT album, tags FROM playlists WHERE tags CONTAINS 'blues';

After indexing the music venue map, you can filter on map values, such as 'The Fillmore':
SELECT * FROM playlists WHERE venue CONTAINS 'The Fillmore';
After indexing the collection keys in the venues map, you can filter on map keys.
SELECT * FROM playlists WHERE venue CONTAINS KEY '2014-09-22 22:00:00-0700';

Retrieving the date/time a write occurred

Using WRITETIME followed by the name of a column in parentheses returns date/time in microseconds that the column was written to the database.

Retrieve the date/time that a write occurred to the first_name column of the user whose last name is Jones:

SELECT WRITETIME (first_name) FROM users WHERE last_name = 'Jones';

 writetime(first_name)
-----------------------
 1353010594789000

The writetime output in microseconds converts to November 15, 2012 at 12:16:34 GMT-8