SELECT
Returns data from a single table.
A SELECT
statement without a WHERE
clause is not recommended because all rows from all partitions are returned.
DataStax recommends limiting queries to a single partition using the |
Synopsis
SELECT [ JSON ] <selectors> FROM [<keyspace_name>.]<table_name> [ WHERE [ <primary_key_conditions> | <non_primary_key_conditions> IN (<column_name> …​) ALLOW FILTERING ] [ AND ] [ <index_conditions> ] [ GROUP BY <column_name> [ , …​ ] ] [ ORDER BY <PK_column_name> [ , …​ ] ( ASC | DESC ) ] | [ ORDER BY <vector_column_name> ANN OF [n,n,n,...] [ LIMIT N ] ] [ ( LIMIT <N> | PER PARTITION LIMIT <N> ) ] [ ALLOW FILTERING ] ;
Syntax legend
Syntax conventions | Description |
---|---|
UPPERCASE |
Literal keyword. |
Lowercase |
Not literal. |
|
Variable value. Replace with a user-defined value. |
|
Optional.
Square brackets ( |
|
Group.
Parentheses ( |
|
Or.
A vertical bar ( |
|
Repeatable.
An ellipsis ( |
|
Single quotation ( |
|
Map collection.
Braces ( |
Set, list, map, or tuple.
Angle brackets ( |
|
|
End CQL statement.
A semicolon ( |
|
Separate the command line options from the command arguments with two hyphens ( |
|
Search CQL only: Single quotation marks ( |
|
Search CQL only: Identify the entity and literal value to overwrite the XML element in the schema and solrConfig files. |
- selectors
-
Determines the columns returned in the results set.
<column_list> | DISTINCT <partition_key> [ AS <output_name> ]
Restriction: Use either a column list or
DISTINCT
<partition_key>. - column_list
-
Determines the columns and column order returned in the result set. Specify a comma-separated list of columns or use an asterisk to return all columns in the stored order.
<column_name> | <function_name>( <argument_list> )
-
<column_name>: Includes a column in the result set.
-
<function_name>( <arguments> ): Execute a function on the specified argument for each row in the result set. See CQL native functions and Creating a user-defined function (UDF).
-
<aggregate_name>( <arguments> ): Executes the aggregate on matching data and returns a single result. See CQL native aggregates and CREATE AGGREGATE.
-
- DISTINCT partition_key
-
Returns unique values for the full partition key. Use a comma-separated list of columns for a composite partition key.
Run |
- AS output_name
-
Renames the column to the new output name in the result set; for example:
COUNT(id) AS "Cyclist Count"
If the name contains special characters, spaces, or to retain capitalization, surround the new name with double quotes.
keyspace_name.table_name
The keyspace name is required to identify a table in a different keyspace or if no keyspace is set for the session. If the keyspace or table name contain uppercase letters, enclose the name in double quotation marks; for example:
FROM "TestTable"
primary_key_conditions
Improves the efficiency of the query using logic statements to identify the data location and allows filtering on the last clustering column.
<partition_conditions>
[ AND <clustering_conditions> ] | [ AND <index_conditions> ]
To return all the data stored on a partition specify just the partition key values. |
Logical statement syntax
To create logic statements that test the column value, use the syntax:
<column_name> <operator> <value>
Separate multiple statements with AND
.
Rows that meet all the conditions are returned.
For example:
SELECT
rank,
cyclist_name AS name
FROM cycling.rank_by_year_and_name
WHERE "race_name" = 'Tour of Japan - Stage 4 - Minami > Shinshu'
AND race_year = 2014;
The database does not support queries with logical disjunctions (OR). |
- column_name
-
Enclose column names that have uppercase or special characters in double quotes.
Enclose string values in single quotes.
- operators
-
Operator Description =
Column value exactly matches the specified value.
IN
Equal to any value in a comma-separated list of values
>=
Greater than or equal to the value.
<=
Less than or equal to the value.
>
Greater than the value.
<
Less than the value.
CONTAINS
Matches a value in any type of collection. Only use on indexed collections.
CONTAINS KEY
Matches a key name in a map. Only use on maps with indexed keys.
- value
-
Enclose string values in single quotes.
Enclose column names that have uppercase or special characters in double quotes.
Identifying the data location and filtering by clustering columns
Use WHERE
clauses to maximize read efficiency by identifying the location of the data.
The database evaluates the WHERE
logical statements hierarchically:
-
Partition key columns: Use the equals operator to identify all partition key values (or none). Ensure that the data model supports single partition queries to avoid performance issues.
Partitions are typically large sets of data. The partitioner distributes the data by creating a hash of the partition key columns and stores all the rows with the same hash on the same node. Similar or like data, such as partition key date column values 7/01/2017 and 7/02/2017, may not be located on the same node.
-
Clustering columns determine the sort order within the partition. Data is sorted by the first clustering column, the second clustering column, and so on.
ALLOW FILTERING
overrides restrictions on filtering partition columns, clustering columns, and regular columns, but can negatively impact performance, causing read latencies.
Avoid ALLOW FILTERING
in a production environment.
In test environments, use cqlsh TRACING to analyze performance problems.
- partition_conditions
-
The database requires that all partitions are restricted except when querying a secondary or search index. Use logic statements that identify the partition key columns with these operators:
-
Equals (=): Any partition key column.
-
IN: Restricted to the last column of the partition key to search multiple partitions.
-
Range (>=, <=, >, and <) on tokens: Fully tokenized partition key (all partition key columns specified in order as arguments of the token function). Use token ranges to scan data stored on a particular node.
-
For secondary index queries, equals is the only operator supported for partition key logical statements. |
See Partition keys for examples and instructions.
- non_primary_key_conditions
-
The database allows retrieving data using a regular or static column with IN if
ALLOW FILTERING
is used in conjunction. A list of valid values for the column must be included. - clustering_conditions
-
Use logic statements that identify the clustering segment. Clustering columns set the sort order of the stored data, which is nested when there are multiple clustering columns. After evaluating the partition key, the database evaluates the clustering statements in the nested order, the first (top level), second, third, and so on.
All operators are supported in logical statements if the table has only one clustering column. To efficiently locate the data within the partition for tables with multiple clustering columns, the following restrictions apply:
-
All clustering columns excluding the last clustering column:
-
Equals (=)
-
IN
-
-
Last clustering column: All equality and inequality operators, and multi-column comparisons Clustering column logic statements also support returning slices across clustering segments:
( column1, column2, …​ ) <operator> ( value1, value2, …​ ) [ AND ( column1, column2, …​ ) <operator> ( value1, value2, …​ ) ]
-
The slice identifies the row that has the corresponding values and allows you to return all rows before, after, or between (when two slice statements are included).
See Clustering columns for examples and instructions.
index_conditions
These index types are supported:
- Secondary index
-
Logical statements on secondary index columns support these operators:
-
=
-
CONTAINS on index collection types
-
CONTAINS KEY on index map types
-
- Solr query
-
Filter the query using the
solr_query
option by creating a Solr expression. See Search index syntax. - Storage-Attached Indexing (SAI)
-
To retrieve data using an SAI index, see CREATE CUSTOM INDEX.
Additional options
Change the scope and order of the data returned by the query.
- GROUP BY <column_name> | <function_name>( <argument_list> )
-
Condenses the selected rows that share the same values for a set of columns or values returned by a function into a group.
- ORDER BY ( ASC | DESC )
-
Sorts the result set in either ascending (ASC) or descending (DESC) order.
When no order is specified, the results are returned in the order that they are stored. |
- ORDER BY <vector_column_name> ANN OF [n,n,n,…]
-
Use a vector search array to sort result set by Approximate Nearest Neighbor (ANN) of the supplied array values.
- ALLOW FILTERING
-
Enables filtering without applying logic statements that identify the primary key. Avoid
ALLOW FILTERING
in a production environment because a full scan of the cluster is performed.
- LIMIT <N> | PER PARTITION LIMIT <N>
-
Limits the number of records returned in the result set.
Examples
Using a column alias
When your selection list includes functions or other complex expressions, use aliases to make the output more readable.
This query uses the alias best_rank
for MIN(rank)
:
CAPTURE 'select_best_rank_from_rank_by_year_and_name.results';
SELECT
MIN(rank) AS best_rank,
cyclist_name
FROM cycling.rank_by_year_and_name
WHERE "race_name" = 'Tour of Japan - Stage 4 - Minami > Shinshu'
AND race_year = 2014;
Output:
best_rank | cyclist_name
-----------+---------------
1 | Daniel MARTIN
(1 rows)
The number of rows returned by the query is shown at the bottom of the output.
Specifying the source table using FROM
The following example SELECT statement returns the number of rows in the rank_by_year_and_name
table:
SELECT COUNT(*)
FROM cycling.rank_by_year_and_name;
Controlling the number of rows returned using LIMIT
The LIMIT
option sets the maximum number of rows that the query returns:
SELECT cyclist_name
FROM cycling.rank_by_year_and_name
LIMIT 50000;
Even if the query matches 105,291 rows, the database only returns the first 50,000.
The cqlsh
shell has a default row limit of 10,000.
The DSE server and native protocol do not limit the number of returned rows, but they apply a timeout to prevent malformed queries from causing system instability.
Selecting partitions
Simple partition key, select a single partition:
WHERE <partition_column> = <value>
Simple partition key, select multiple partitions:
WHERE <partition_column> IN ( <value1>, <value2> [ ,…​ ] )
For a composite partition key, create a condition for each partition key column separated by AND
:
WHERE <partition_column1> = <value1>
AND <partition_column2> = <value2> [ AND …​ ] )
Controlling the number of rows returned using PER PARTITION LIMIT
The PER PARTITION LIMIT
option sets the maximum number of rows that the query returns from each partition.
For example, the cycling keyspace contains this table:
CREATE TABLE IF NOT EXISTS cycling.rank_by_year_and_name (
race_year int,
race_name text,
cyclist_name text,
rank int,
PRIMARY KEY ((race_year, race_name), rank)
);
The table contains these rows:
race_year | race_name | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
2014 | 4th Tour of Beijing | 1 | Phillippe GILBERT
2014 | 4th Tour of Beijing | 2 | Daniel MARTIN
2014 | 4th Tour of Beijing | 3 | Johan Esteban CHAVES
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Daniel MARTIN
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Johan Esteban CHAVES
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Benjamin PRADES
2015 | Giro d'Italia - Stage 11 - Forli > Imola | 1 | Ilnur ZAKARIN
2015 | Giro d'Italia - Stage 11 - Forli > Imola | 2 | Carlos BETANCUR
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Benjamin PRADES
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Adam PHELAN
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 3 | Thomas LEBAS
(11 rows)
This query returns the top two racers for each race year and race name combination using PER PARTITION LIMIT 2
:
SELECT *
FROM cycling.rank_by_year_and_name
PER PARTITION LIMIT 2;
Output:
race_year | race_name | rank | cyclist_name
-----------+--------------------------------------------+------+----------------------
2014 | 4th Tour of Beijing | 1 | Phillippe GILBERT
2014 | 4th Tour of Beijing | 2 | Daniel MARTIN
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Daniel MARTIN
2014 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Johan Esteban CHAVES
2015 | Giro d'Italia - Stage 11 - Forli > Imola | 1 | Ilnur ZAKARIN
2015 | Giro d'Italia - Stage 11 - Forli > Imola | 2 | Carlos BETANCUR
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 1 | Benjamin PRADES
2015 | Tour of Japan - Stage 4 - Minami > Shinshu | 2 | Adam PHELAN
(8 rows)
Filtering data using WHERE
The WHERE
clause contains one or more relations that filter the rows returned by SELECT.
Column specifications
The column specification of the relation must be one of these:
-
One or more members of the partition key of the table.
-
A clustering column, only if the relation is preceded by other relations that specify all columns in the partition key.
-
A column that is indexed using CREATE INDEX.
Restriction: In the WHERE
clause, refer to a column using the actual name, not an alias.
Filtering on the partition key
This table has id
as the partition key (it is the only column in the primary key, and is therefore also the partition key by default):
CREATE TABLE IF NOT EXISTS cycling.cyclist_career_teams (
id UUID PRIMARY KEY,
lastname text,
teams set<text>
);
This query includes the partition key id
value in the WHERE clause:
SELECT id, lastname, teams
FROM cycling.cyclist_career_teams
WHERE id = 5b6962dd-3f90-4c93-8f61-eabfa4a803e2;
Restriction: A relation that references the partition key can only use an equality operator = or IN. For more details about the IN operator, see the Examples below.
This example table contains a more complex primary key:
CREATE TABLE IF NOT EXISTS cycling.events (
year int,
start_month int,
start_day int,
end_month int,
end_day int,
race text,
discipline text,
location text,
uci_code text,
PRIMARY KEY (
(year, discipline), start_month, start_day, race
)
);
This query contains a WHERE
clause that provides values for the primary key columns that precede the race
column:
SELECT *
FROM cycling.events
WHERE year = 2017
AND discipline = 'Cyclo-cross'
AND start_month = 1
AND start_day = 1;
Output:
year | discipline | start_month | start_day | race | end_day | end_month | location | uci_code
------+-------------+-------------+-----------+----------------------------------------+---------+-----------+----------+----------
2017 | Cyclo-cross | 1 | 1 | DVV verzekeringen trofee - GP Sven Nys | null | null | Baal | C1
(1 rows)
Filtering on a clustering column
Use a relation on a clustering column only if it is preceded by relations that reference all the elements of the partition key.
Example table in the cycling keyspace (the partition key is the id column, the clustering column is race_points):
CREATE TABLE IF NOT EXISTS cycling.cyclist_points (
id UUID,
race_points int,
firstname text,
lastname text,
race_title text,
PRIMARY KEY (id, race_points)
);
Example query:
SELECT SUM(race_points)
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac
AND race_points > 7;
Output:
system.sum(race_points)
-------------------------
195
(1 rows)
It is possible to add ALLOW FILTERING
to filter a non-indexed cluster column.
Avoid |
The following table definition contains a clustering column named race_start_date
and does not have a secondary index.
CREATE TABLE IF NOT EXISTS cycling.calendar (
race_id int,
race_name text,
race_start_date timestamp,
race_end_date timestamp,
PRIMARY KEY (
race_id, race_start_date, race_end_date
)
) WITH CLUSTERING ORDER BY (
race_start_date DESC, race_end_date DESC
);
Example query with ALLOW FILTERING
:
SELECT *
FROM cycling.calendar
WHERE race_start_date = '2015-06-13'
ALLOW FILTERING;
Output:
race_id | race_start_date | race_end_date | race_name
---------+---------------------------------+---------------------------------+----------------
102 | 2015-06-13 00:00:00.000000+0000 | 2015-06-21 00:00:00.000000+0000 | Tour de Suisse
(1 rows)
Filtering on indexed columns
A WHERE clause in a SELECT on a table with a Storage-Attached Indexing (SAI) secondary indexed column must include at least one equality relation to the indexed column. See CREATE CUSTOM INDEX.
Using the IN operator
Use IN
, an equals condition operator, to list multiple values for a column in a WHERE clause.
This example selects the rows where race_id
is in a list of values:
SELECT *
FROM cycling.calendar
WHERE race_id IN (101, 102, 103);
SELECT *
FROM cycling.calendar
WHERE race_id IN (100, 101, 102)
ORDER BY race_start_date DESC;
SELECT *
FROM cycling.calendar
WHERE race_id IN (100, 101, 102)
ORDER BY race_start_date ASC;
SELECT *
FROM cycling.calendar
WHERE race_id IN (100, 101, 102)
AND (race_start_date, race_end_date) >= ('2015-05-09', '2015-05-24');
The values in the list are separated by commas.
Using IN to filter on a compound primary key
Use an IN
condition on the last column of a compound primary key only when it is preceded by equality conditions for all preceding columns of the primary key.
For example, examine the primary key in this table:
CREATE TABLE IF NOT EXISTS cycling.cyclist_id (
lastname text,
firstname text,
age int,
id UUID,
PRIMARY KEY ((lastname, firstname), age)
);
This query contains the appropriate WHERE
clause containing equality conditions for the first two columns of the primary key and an IN
condition for the last column of the primary key:
SELECT *
FROM cycling.cyclist_id
WHERE lastname = 'EENKHOORN'
AND firstname = 'Pascal'
AND age IN (17, 18);
When using IN
, you can omit the equality test for clustering columns other than the last clustering column.
This may require ALLOW FILTERING
and should not be used in a production environment.
This table shows an example in which the race column is the last clustering column:
CREATE TABLE IF NOT EXISTS cycling.events (
year int,
start_month int,
start_day int,
end_month int,
end_day int,
race text,
discipline text,
location text,
uci_code text,
PRIMARY KEY (
(year, discipline), start_month, start_day, race
)
);
This query contains a WHERE
clause with the equality condition for the race
column (the last clustering column), an IN
clause for the start_month
column, and ALLOW FILTERING
(avoid in a production environment):
SELECT *
FROM cycling.events
WHERE race = 'Superprestige - Hoogstraten -2017'
AND start_month IN (1, 2)
ALLOW FILTERING;
CQL supports an empty list of values in the IN
clause, which can be useful in driver applications when passing empty arrays as arguments for the IN
clause.
See Connecting to DSE clusters using DSE drivers.
When not to use IN
Typically, using IN
in relations on the partition key is not recommended.
To process a list of values, the SELECT may have to query many nodes, which degrades performance.
For example, consider a single local datacenter cluster with 30 nodes, a replication factor of 3, and a consistency level of LOCAL_QUORUM
.
A query on a single partition key query goes out to two nodes.
But if the SELECT uses the IN
condition, the operation can involve more nodes — up to 20, depending on where the keys fall in the token range.
Using IN
for clustering columns will cause less performance latency because all query actions are performed in a single partition.
Filtering on collections
A query can retrieve a collection in its entirety.
You can also index the collection column, and then use the CONTAINS
condition in the WHERE
clause to filter the data for a particular value in the collection, or use CONTAINS KEY
to filter by key.
This example features a set of text values named teams
in the cyclist_career_teams
table.
This query filters on a value in the teams
set.
SELECT *
FROM cycling.cyclist_career_teams
WHERE teams CONTAINS 'Rabobank-Liv Giant';
Output:
id | lastname | teams
--------------------------------------+----------+-----------------------------------------------------------------------------
-----------------------
1c9ebc13-1eab-4ad5-be87-dce433216d40 | BRAND | {'AA Drink - Leontien.nl', 'Leontien.nl', 'Rabobank-Liv Giant', 'Rabobank-Li
v Woman Cycling Team'}
(1 rows)
The cyclist_teams
table contains a map of int
keys and text
values named teams
.
The teams
map keys are indexed:
CREATE INDEX IF NOT EXISTS team_year_keys_idx
ON cycling.cyclist_teams ( KEYS (teams) );
The index allows a query to filter the map keys:
SELECT *
FROM cycling.cyclist_teams
WHERE teams CONTAINS KEY 2015;
Filtering map entries
This example adds an index for map entries.
CREATE INDEX IF NOT EXISTS blist_idx
ON cycling.birthday_list ( ENTRIES(blist) );
This method only works for maps. |
This query finds all cyclists who are 23 years old based on their entry in the blist
map in the birthday_list
table.
SELECT *
FROM cycling.birthday_list
WHERE blist[ 'age' ] = '23';
Filtering a full frozen collection
The example in this section uses a table containing a FROZEN list
collection named rnumbers
.
This statement creates an index, which is required for the query:
CREATE INDEX IF NOT EXISTS rnumbers_idx
ON cycling.race_starts ( FULL(rnumbers) );
This query retrieves the row that fully matches the collection’s values, specifically a cyclist who has 39 Pro wins, 7 Grand Tour starts, and 14 Classic starts in rnumbers
:
SELECT *
FROM cycling.race_starts
WHERE rnumbers = [39, 7, 14];
Range relations
Cassandra supports greater-than and less-than comparisons. But, for a given partition key, the conditions on the clustering column are restricted to the filters that allow selection of a contiguous set of rows.
This query constructs a filter that selects cycling calendar data whose start date is within a specified range and the race_id
is 101.
(If race_id
were not a component of the primary key, you would need to create an index on race_id
to use this query.)
SELECT *
FROM cycling.calendar
WHERE race_id = 101
AND race_start_date >= '2014-05-27'
AND race_start_date < '2017-06-16';
Output:
race_id | race_start_date | race_end_date | race_name
---------+---------------------------------+---------------------------------+-----------------------
101 | 2015-06-07 00:00:00.000000+0000 | 2015-06-14 00:00:00.000000+0000 | Criterium du Dauphine
101 | 2014-06-06 00:00:00.000000+0000 | 2014-06-13 00:00:00.000000+0000 | Criterium du Dauphine
(2 rows)
To allow selection of a contiguous set of rows, the WHERE clause must apply an equality condition to the race_id
component of the primary key.
Using compound primary keys and sorting results
These restrictions apply when using an ORDER BY
clause with a compound primary key:
-
Only include clustering columns in the
ORDER BY
clause. -
In the
WHERE
clause, provide all the partition key values and clustering column values that precede the column(s) in theORDER BY
clause. In 6.0 and later, the columns specified in theORDER BY
clause must be an ordered subset of the columns of the clustering key; however, columns restricted by the equals operator (=) or a single-valuedIN
restriction can be skipped. -
When sorting multiple columns, the columns must be listed in the same order in the
ORDER BY
clause as they are listed in thePRIMARY KEY
clause of the table definition. -
Sort ordering is limited. For example, if your table definition uses
CLUSTERING ORDER BY (start_month ASC, start_day ASC)
, then you can useORDER BY start_day, race
in your query (ASC
is the default). You can also reverse the sort ordering if you apply it to all of the columns; for example,ORDER BY start_day DESC, race DESC
. -
Refer to a column using the actual name, not an alias.
See the cyclist_category table, which uses a compound primary key. This query retrieves the cyclist categories, in descending order by points.
SELECT *
FROM cycling.cyclist_category
WHERE category = 'Time-trial'
ORDER BY points DESC;
Output:
category | points | id | lastname
------------+--------+--------------------------------------+------------
Time-trial | 182 | 220844bf-4860-49d6-9a4b-6b5d3a79cbfb | TIRALONGO
Time-trial | 3 | 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 | KRUIJSWIJK
(2 rows)
The following example shows a table with a more complex compound primary key.
CREATE TABLE IF NOT EXISTS cycling.events (
year int,
start_month int,
start_day int,
end_month int,
end_day int,
race text,
discipline text,
location text,
uci_code text,
PRIMARY KEY (
(year, discipline), start_month, start_day, race
)
);
This query contains a WHERE
clause that provides values for all primary key columns that precede the race
column and orders the results by race
:
SELECT *
FROM cycling.events
WHERE year = 2017
AND discipline = 'Cyclo-cross'
AND start_month = 1
AND start_day = 1
ORDER BY race;
Output:
year | discipline | start_month | start_day | race | end_day | end_month | location | uci_code
------+-------------+-------------+-----------+----------------------------------------+---------+-----------+----------+----------
2017 | Cyclo-cross | 1 | 1 | DVV verzekeringen trofee - GP Sven Nys | null | null | Baal | C1
(1 rows)
This query has multiple clustering columns in the ORDER BY
clause:
SELECT *
FROM cycling.events
WHERE year = 2017
AND discipline = 'Cyclo-cross'
AND start_month = 1
ORDER BY start_day, race;
Vector search sorting
This query uses a vector in the ORDER BY
clause:
SELECT * FROM cycling.comments_vs
ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 3;
Grouping results
The GROUP BY
clause condenses the selected rows that share the same values for a set of columns into a group.
A GROUP BY
clause can contain:
-
Partition key columns and clustering columns.
-
A deterministic monotonic function, including a user-defined function (UDF), on the last clustering column specified in the
GROUP BY
clause. TheFLOOR()
function is monotonic when the duration and start time parameters are constants. -
A deterministic aggregate.
The examples in this section use the race_times_summary
table:
CREATE TABLE IF NOT EXISTS cycling.race_times_summary (
race_date date,
race_time time,
PRIMARY KEY (race_date, race_time)
);
The table contains these rows:
race_date | race_time
------------+--------------------
2019-03-21 | 10:01:18.000000000
2019-03-21 | 10:15:20.000000000
2019-03-21 | 11:15:38.000000000
2019-03-21 | 12:15:40.000000000
2018-07-26 | 10:01:18.000000000
2018-07-26 | 10:15:20.000000000
2018-07-26 | 11:15:38.000000000
2018-07-26 | 12:15:40.000000000
2017-04-14 | 10:01:18.000000000
2017-04-14 | 10:15:20.000000000
2017-04-14 | 11:15:38.000000000
2017-04-14 | 12:15:40.000000000
(12 rows)
This query groups the rows by the race_date
column values:
SELECT race_date, race_time
FROM cycling.race_times_summary
GROUP BY race_date;
Each set of rows with the same race_date
column value are grouped together into one row in the query output.
Three rows are returned because there are three groups of rows with the same race_date
column value.
The value returned is the first value that is found for the group.
race_date | race_time
------------+--------------------
2019-03-21 | 10:01:18.000000000
2018-07-26 | 10:01:18.000000000
2017-04-14 | 10:01:18.000000000
(3 rows)
This query groups the rows by race_date
and FLOOR(race_time, 1h)
, which returns the hour.
The number of rows in each group is returned by COUNT(*)
.
SELECT race_date, FLOOR(race_time, 1h), COUNT(*)
FROM cycling.race_times_summary
GROUP BY race_date, FLOOR(race_time, 1h);
Nine rows are returned because there are nine groups of rows with the same race_date
and FLOOR(race_time, 1h)
values:
race_date | system.floor(race_time, 1h) | count
------------+-----------------------------+-------
2019-03-21 | 10:00:00.000000000 | 2
2019-03-21 | 11:00:00.000000000 | 1
2019-03-21 | 12:00:00.000000000 | 1
2018-07-26 | 10:00:00.000000000 | 2
2018-07-26 | 11:00:00.000000000 | 1
2018-07-26 | 12:00:00.000000000 | 1
2017-04-14 | 10:00:00.000000000 | 2
2017-04-14 | 11:00:00.000000000 | 1
2017-04-14 | 12:00:00.000000000 | 1
(9 rows)
Computing aggregates
DSE provides the built-in functions COUNT()
, MIN()
, MAX()
, SUM()
, and AVG()
that return aggregate values to SELECT
statements.
You can also create user-defined aggregates (UDAs).
The following sections show examples.
Using COUNT() to get the non-null value count for a column
A SELECT
using COUNT(column_name)
returns the number of non-null values in a column.
COUNT
ignores null values.
This query counts the number of last names in the cyclist_name
table:
SELECT COUNT(lastname)
FROM cycling.cyclist_name;
Getting the number of matching rows and aggregate values with COUNT()
A SELECT
using COUNT()
returns the number of rows that matched the query.
Use COUNT(1)
to get the same result.
COUNT()
or COUNT(1)
can be used in conjunction with other aggregate functions or columns.
This query returns the number of rows in the cyclist_name
table:
SELECT COUNT(*)
FROM cycling.cyclist_name;
This query counts the number of rows and calculates the maximum value for start_day
in the events
table:
SELECT start_month, MAX(start_day), COUNT(*)
FROM cycling.events
WHERE year = 2017
AND discipline = 'Cyclo-cross';
This example provides a year that is not stored in the events table:
SELECT start_month, MAX(start_day)
FROM cycling.events
WHERE year = 2022
ALLOW FILTERING;
In the output, notice that the columns are null, and one row is returned:
start_month | system.max(start_day)
-------------+-----------------------
null | null
(1 rows)
Getting maximum and minimum values in a column
A SELECT
using MAX(<column_name>)
returns the maximum value in a column.
When the column’s data type is numeric (bigint
, decimal
, double
, float
, int
, or smallint
), MAX(<column_name>)
returns the maximum value.
For example:
SELECT MAX(race_points)
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
Output:
system.max(race_points)
-------------------------
120
(1 rows)
If you do not include a |
Warnings :
Aggregation query used without partition key
The MIN
function returns the minimum value:
system.min(race_points)
-------------------------
6
(1 rows)
If the column referenced by the MAX
or MIN
functions has an ascii
or text
data type, the functions return the last or first item in an alphabetic sort of the column values.
If the specified column has data type date
or timestamp
, the functions return the most recent or least recent times and dates.
If a column has a null value, MAX
and MIN
ignore the null value.
If the column for an entire set of rows contains null, MAX
and MIN
return null.
If the query includes a WHERE
clause (recommended), MAX
returns the largest value from the rows that satisfy the WHERE
condition, and MIN
returns the smallest value from the rows that satisfy the WHERE
condition.
Getting the average or sum of a column of numbers
This example computes the average of all values in a column using AVG
:
SELECT AVG(race_points)
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
Output:
system.avg(race_points)
-------------------------
67
(1 rows)
Use SUM
to get a total:
SELECT SUM(race_points)
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac
AND race_points > 7;
Output:
system.sum(race_points)
-------------------------
195
(1 rows)
If any of the rows returned have a null value for the column referenced in the AVG
function, the function includes that row in the row count, but uses zero for the null value to calculate the average.
The SUM
and AVG
functions do not work with text
, uuid
, or date
fields.
This query returns the cyclist team average time using a user-defined aggregate (UDA).
SELECT cycling.average(cyclist_time_sec)
FROM cycling.team_average
WHERE team_name = 'UnitedHealthCare Pro Cycling Womens Team'
AND race_title = 'Amgen Tour of California Women''s Race presented by SRAM - Stage 1 - Lake Tahoe > Lake Tahoe';
Retrieving the date/time a write occurred
The WRITETIME function applied to a column returns the date/time in microseconds at which the column was written to the database.
This query retrieves the date/time that writes occurred to the firstname
column of a cyclist:
SELECT WRITETIME (firstname)
FROM cycling.cyclist_points
WHERE id = e3b19ec4-774a-4d1c-9e5a-decec1e30aac;
Output:
writetime(firstname)
----------------------
1538688876521481
1538688876523973
1538688876525239
The WRITETIME
output of the last write 1538688876525239
in microseconds converts to Thursday, October 4, 2018 4:34:36.525 PM GMT-05:00 DST
.
Retrieving the time-to-live of a column
The time-to-live (TTL) of a column value in a row is the number of seconds before the value is marked with a tombstone.
This example INSERT
sets the TTL of the column values to 200 seconds:
INSERT INTO cycling.calendar (
race_id, race_name, race_start_date, race_end_date
) VALUES (
200, 'placeholder', '2015-05-27', '2015-05-27'
)
USING TTL 200;
This example UPDATE
sets the TTL of a single race_name
column value to 200 seconds:
UPDATE cycling.calendar
USING TTL 300
SET race_name = 'Tour de France - Stage 12'
WHERE race_id = 200
AND race_start_date = '2015-05-27'
AND race_end_date = '2015-05-27';
This query retrieves the current TTL of the specified race_name
column value:
SELECT cycling.average(cyclist_time_sec)
FROM cycling.team_average
WHERE team_name = 'UnitedHealthCare Pro Cycling Womens Team'
AND race_title = 'Amgen Tour of California Women''s Race presented by SRAM - Stage 1 - Lake Tahoe > Lake Tahoe';
calendar-table.cql#token=START-select_with_TTL
SELECT TTL(race_name)
FROM cycling.calendar
WHERE race_id = 200;
Output:
ttl(race_name)
----------------
276
(1 rows)