SELECT
Retrieve data from a Cassandra table.
Retrieve data from a Cassandra table.
Synopsis
SELECT select_expression
FROM keyspace_name.table_name
WHERE relation AND relation ...
ORDER BY ( clustering_column ASC | DESC ...)
LIMIT n
ALLOW FILTERING
select expression is:
selection_list
| DISTINCT selection_list
| COUNT ( * | 1 )
- A list of partition keys (used with DISTINCT)
-
selector AS alias, selector AS alias, ...| *
alias is an alias for a column name.
selector is:
column name
| WRITETIME (column_name)
| TTL (column_name)
| function (selector , selector, ...)
function is a timeuuid function, a token function, or a blob conversion function.
relation is:
column_name op term
| ( column_name, column_name, ... ) op term-tuple
| column_name IN term, ( term ... )
| ( column_name, column_name, ... ) IN term-tuple, ( term-tuple ... )
| TOKEN (column_name, ...) op term
op is = | < | > | <= | > | = | CONTAINS | CONTAINS KEY
( term, term, ... )
- a constant: string, number, uuid, boolean, hex
- a bind marker (?)
- a function
- set:
{ literal, literal, ... }
- list:
[ literal, literal, ... ]
- map:
{ literal : literal, literal : literal, ... }
A semicolon that terminates CQL statements is not included in the synopsis. |
Description
A SELECT statement reads one or more records from a Cassandra table. The input to the SELECT statement is the select expression. The output of the select statement depends on the select expression:
Select Expression | Output |
---|---|
Column of list of columns | Rows having a key value and collection of columns |
COUNT aggregate function | One row with a column that has the value of the number of rows in the resultset |
DISTINCT partition key list | Values of columns that are different from other column values |
WRITETIME function | The date/time that a write to a column occurred |
TTL function | The remaining time-to-live for a column |
Specifying columns
The SELECT expression determines which columns, if any, appear in the result. Using the asterisk specifies selection of all columns:
SELECT * from People;
Columns in big data applications duplicate values. Use the DISTINCT keyword to return only distinct (different) values of partition keys.
Counting returned rows
A SELECT expression using COUNT(*) returns the number of rows that matched the query. Alternatively, you can use COUNT(1) to get the same result.
Count the number of rows in the users table:
SELECT COUNT(*) FROM users;
SELECT event_id, dateOf(created_at), blobAsText(content) FROM timeline;
Using a column alias
You can define an alias on columns using the AS keyword.
SELECT event_id,
dateOf(created_at) AS creation_date,
blobAsText(content) AS content
FROM timeline;
In the output, columns assume the aesthetically-pleasing name.
event_id | creation_date | content
-------------------------+--------------------------+----------------
550e8400-e29b-41d4-a716 | 2013-07-26 10:44:33+0200 | Some stuff
Specifying rows returned using LIMIT
Using the LIMIT option, you can specify that the query return a limited number of rows.
SELECT COUNT(*) FROM big_table LIMIT 50000;
SELECT COUNT(*) FROM big_table LIMIT 200000;
The output of these statements if you had 105,291 rows in the database would be: 50000, and 105,291. The cqlsh shell has a default row limit of 10,000. The Cassandra server and native protocol do not limit the number of rows that can be returned, although a timeout stops running queries to protect against running malformed queries that would cause system instability.
Specifying the table using FROM
The FROM clause specifies the table to query. Optionally, specify a keyspace for the table followed by a period, (.), then the table name. If a keyspace is not specified, the current keyspace is used.
For example, count the number of rows in the IndexInfo table in the system keyspace:
SELECT COUNT(*) FROM system."IndexInfo";
Filtering data using WHERE
- The partition key definition includes the column.
- A column that is indexed using CREATE INDEX.
CREATE TABLE emp (
empID int,
deptID int,
first_name varchar,
last_name varchar,
PRIMARY KEY (empID, deptID));
SELECT deptid FROM emp WHERE empid = 104;
Cassandra supports these conditional operators in the WHERE clause: CONTAINS, CONTAINS KEY, IN, =, >, >=, <, or <=, but not all in certain situations.
Restrictions on the use of conditions
- Non-equal conditional operations on the partition key
Regardless of the partitioner in use, Cassandra does not support non-equal conditional operations on the partition key. Use the token function for range queries on the partition key.
- Using the IN condition with a compound or composite partition key
The IN condition is allowed on the last column of the partition key only if you query all preceding columns of key for equality. "Using the IN filter condition" presents examples of using the IN operation.
- Querying an indexed table
A query on an indexed table must have at least one equality condition on the indexed column, as shown in "Indexing a column".
- Range queries
Cassandra supports greater-than and less-than comparisons, but for a given partition key, the conditions on the clustering column are restricted to the filters that allow Cassandra to select a contiguous ordering of rows.
For example:
CREATE TABLE ruling_stewards (
steward_name text,
king text,
reign_start int,
event text,
PRIMARY KEY (steward_name, king, reign_start)
);
This query constructs a filter that selects data about stewards whose reign started by 2450 and ended before 2500. If king were not a component of the primary key, you would need to create an index on king to use this query:
SELECT * FROM ruling_stewards
WHERE king = 'Brego'
AND reign_start >= 2450
AND reign_start < 2500 ALLOW FILTERING;
The output is:
steward_name | king | reign_start | event
--------------+-------+-------------+--------------------
Boromir | Brego | 2477 | Attacks continue
Cirion | Brego | 2489 | Defeat of Balchoth
(2 rows)
To allow Cassandra to select a contiguous ordering of rows, you need to include the king component of the primary key in the filter using an equality condition. The ALLOW FILTERING clause is also required. ALLOW FILTERING provides the capability to query the clustering columns using any condition if performance is not an issue.
ALLOW FILTERING clause
When you attempt a potentially expensive query, such as searching a range of rows, this prompt appears:
Bad Request: Cannot execute this query as it might involve data
filtering and thus may have unpredictable performance. If you want
to execute this query despite the performance unpredictability,
use ALLOW FILTERING.
To run the query, use the ALLOW FILTERING clause. Imposing a limit using the LIMIT n clause is recommended to reduce memory used. For example:
Select * FROM ruling_stewards
WHERE king = 'none'
AND reign_start >= 1500
AND reign_start < 3000 LIMIT 10 ALLOW FILTERING;
Critically, LIMIT doesn't protect you from the worst liabilities. For instance, what if there are no entries with no king? Then you have to scan the entire list no matter what LIMIT is.
ALLOW FILTERING will probably become less strict as we collect more statistics on our data. For example, if we knew that 90% of entries have no king we would know that finding 10 such entries should be relatively inexpensive.
Using the IN filter condition
Use IN, an equals condition operator, in the WHERE clause to specify multiple possible values for a column. For example, select two columns, first_name and last_name, from three rows having employee ids (primary key) 105, 107, or 104:
SELECT first_name, last_name FROM emp WHERE empID IN (105, 107, 104);
Format values for the IN conditional test as a comma-separated list. The list can consist of a range of column values.
Using IN to filter on a compound or composite primary key
CREATE TABLE parts (part_type text, part_name text, part_num int, part_year text, serial_num text, PRIMARY KEY ((part_type, part_name), part_num, part_year));
SELECT * FROM parts WHERE part_type='alloy' AND part_name='hubcap' AND part_num=1249 AND part_year IN ('2010', '2015');
SELECT * FROM parts WHERE part_type='alloy' AND part_name='hubcap' AND part_num IN (1249, 3456);
SELECT * FROM parts WHERE part_num=123456 AND part_year IN ('2010', '2015') ALLOW FILTERING;
CQL supports an empty list of values in the IN clause, useful in Java Driver applications when passing empty arrays as arguments for the IN clause.
When not to use IN
The recommendations about when not to use an index apply to using IN in the WHERE clause. Under most conditions, using IN in the WHERE clause is not recommended. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM, a single key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried are most likely even higher, up to 20 nodes depending on where the keys fall in the token range.
Comparing clustering columns
SELECT * FROM ruling_stewards WHERE (steward_name, king) = ('Boromir', 'Brego');
The syntax used in the WHERE clause compares records of steward_name and king as a tuple against the Boromir, Brego tuple.
Paging through unordered results
The TOKEN function can be used with a condition operator on the partition key column to query. The query selects rows based on the token of their partition key rather than on their value. The token of a key depends on the partitioner in use. Use with the RandomPartitioner or Murmur3Partitioner will not give you a meaningful order.
For example, assume you defined this table:
CREATE TABLE periods (
period_name text,
event_name text,
event_date timestamp,
weak_race text,
strong_race text,
PRIMARY KEY (period_name, event_name, event_date)
);
After inserting data, this query uses the TOKEN function to find the data using the partition key.
SELECT * FROM periods
WHERE TOKEN(period_name) > TOKEN('Third Age')
AND TOKEN(period_name) < TOKEN('Fourth Age');
Using compound primary keys and sorting results
ORDER BY clauses can select a single column only. That column has to be the second column in a compound PRIMARY KEY. This also applies to tables with more than two column components in the primary key. Ordering can be done in ascending or descending order, default ascending, and specified with the ASC or DESC keywords.
In the ORDER BY clause, refer to a column using the actual name, not the aliases.
For example, set up the playlists table, which uses a compound primary key, insert the example data, and use this query to get information about a particular playlist, ordered by song_order. You do not need to include the ORDER BY column in the select expression.
SELECT * FROM playlists WHERE id = 62c36092-82a1-3a00-93d1-46196ee77204
ORDER BY song_order DESC LIMIT 50;
Output is:
Or, create an index on playlist artists, and use this query to get titles of Fu Manchu songs on the playlist:
CREATE INDEX ON playlists(artist)
SELECT album, title FROM playlists WHERE artist = 'Fu Manchu';
Output is:
Filtering a collection set, list, or map
You can query a table containing a collection to retrieve the collection in its entirety. You can also index the collection column, and then use the CONTAINS condition in the WHERE clause to filter the data for a particular value in the collection. Continuing with the music service example, after adding the collection of tags to the playlists table, adding some tag data, and then indexing the tags, you can filter on 'blues' in the tags set.
SELECT album, tags FROM playlists WHERE tags CONTAINS 'blues';
SELECT * FROM playlists WHERE venue CONTAINS 'The Fillmore';
SELECT * FROM playlists WHERE venue CONTAINS KEY '2014-09-22 22:00:00-0700';
Retrieving the date/time a write occurred
Using WRITETIME followed by the name of a column in parentheses returns date/time in microseconds that the column was written to the database.
Retrieve the date/time that a write occurred to the first_name column of the user whose last name is Jones:
SELECT WRITETIME (first_name) FROM users WHERE last_name = 'Jones';
writetime(first_name)
-----------------------
1353010594789000
The writetime output in microseconds converts to November 15, 2012 at 12:16:34 GMT-8