GROUP BY clause
Group by one or more columns.
Condenses the selected rows that share the same values for a set of columns or values returned by a function into a group.
Either one or more primary key columns or a deterministic function or aggregate can be used in the GROUP BY
clause.
SELECT race_date, race_time FROM cycling.race_times_summary
GROUP BY race_date;
Results
race_date | race_time
------------+--------------------
2019-03-21 | 10:01:18.000000000
2018-07-26 | 10:01:18.000000000
2017-04-14 | 10:01:18.000000000
(3 rows)
Warnings :
Aggregation query used without partition key
Each set of rows with the same race_date
column value are grouped together into one row in the query output.
Three rows are returned because there are three groups of rows with the same race_date
column value.
The value returned is the first value that is found for the group.
This query groups the rows by race_date
and FLOOR(race_time, 1h)
, which returns the hour.
The number of rows in each group is returned by COUNT(*)
.
SELECT race_date, FLOOR(race_time, 1h), COUNT(*) FROM cycling.race_times_summary
GROUP BY race_date, FLOOR(race_time, 1h);
Results
race_date | system.floor(race_time, 1h) | count
------------+-----------------------------+-------
2019-03-21 | 10:00:00.000000000 | 2
2019-03-21 | 11:00:00.000000000 | 1
2019-03-21 | 12:00:00.000000000 | 1
2018-07-26 | 10:00:00.000000000 | 2
2018-07-26 | 11:00:00.000000000 | 1
2018-07-26 | 12:00:00.000000000 | 1
2017-04-14 | 10:00:00.000000000 | 2
2017-04-14 | 11:00:00.000000000 | 1
2017-04-14 | 12:00:00.000000000 | 1
(9 rows)
Warnings :
Aggregation query used without partition key
Nine rows are returned because there are nine groups of rows with the same race_date
and FLOOR(race_time, 1h)
values.
ORDER BY clause
You can fine-tune the display order using the ORDER BY
clause.
The partition key must be defined in the WHERE
clause and then the ORDER BY
clause defines one or more clustering columns to use for ordering.
The order of the specified columns must match the order of the clustering columns in the PRIMARY KEY
definition.
The options for ordering are ASC
(ascending) and DESC
(descending).
If no order is specified, the results are returned in the stored order.
Note that using both |
SELECT * FROM cycling.calendar WHERE race_id IN (100, 101, 102)
ORDER BY race_start_date ASC;
Results
The ORDER BY
clause also supports vector searches of the vector column.
The result set is sorted using the Approximate Nearest Neighbor (ANN) algorithm with the supplied array values.
LIMIT clause
If a query returns a large number of rows, you can limit the number of rows returned, to limit the amount of data returned.
The default limit is set to 10,000 rows, the number of rows cqlsh
allows.
This examples limits the rows to 3:
SELECT * FROM cycling.comments_vs
ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 3;
Results
id | created_at | comment | comment_vector | commenter | record_id
--------------------------------------+---------------------------------+----------------------------------------+---------------------------------------------------+-----------+--------------------------------------
e8ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | rain, rain,rain, go away! | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | John | f25e4fe1-38c3-11ef-bd85-f92c3c7170c3
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | LATE RIDERS SHOULD NOT DELAY THE START | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Alex | f259bc00-38c3-11ef-bd85-f92c3c7170c3
e8ae5df3-d358-4d99-b900-85902fda9bb0 | 2017-04-01 14:33:02.160000+0000 | Rain like a monsoon | b'?fff?\\n=q=\\xf5\\xc2\\x8f=\\xcc\\xcc\\xcd?s33' | Jane | f25eec21-38c3-11ef-bd85-f92c3c7170c3
(3 rows)
OFFSET option
The LIMIT
clause can also use the OFFSET
option to skip a number of rows before returning the results.
OFFSET
cannot be used without LIMIT
.
This is an expensive feature for large offsets and should be avoided.
Non-offset pagination is a better option for large datasets.
There are guardrail settings in the cassandra.yaml
file that will prevent offsets that are too large.
The offset_rows_warn_threshold
defaults to 10,000 rows. The offset_rows_failure_threshold
defaults to 20,000 rows.
There are a number of restrictions on the use of OFFSET
:
-
Specifying an OFFSET disables normal key-based paging.
-
Using
OFFSET 0
also disables key-based paging. -
ANN
queries don’t supportOFFSET
.
This example limits the rows to 3, but skips the first two rows:
SELECT comment,comment_vector,commenter FROM cycling.comments_vs
WHERE commenter : 'Alex'
LIMIT 3 OFFSET 2;
Results
comment | comment_vector | commenter
----------------------------------------+-------------------------------+-----------
Raining too hard should have postponed | [0.45, 0.09, 0.01, 0.2, 0.11] | Alex
(1 rows)
PER PARTITION LIMIT clause
The PER PARTITION LIMIT
option sets the maximum number of rows that the query returns from each partition.
This will only apply to tables that spread across more than one partition.
An example of such a table is defined here:
USE cycling;
CREATE TABLE rank_by_year_and_name (
race_year int,
race_name text,
cyclist_name text,
rank int,
PRIMARY KEY ((race_year, race_name), rank)
);
where the partition key is a composite of race_year
and race_name
.
The following query returns the top two cyclists from each partition stored:
SELECT rank, cyclist_name AS name FROM cycling.rank_by_year_and_name
PER PARTITION LIMIT 2;
Results
rank | name
------+----------------------
1 | Phillippe GILBERT
2 | Daniel MARTIN
1 | Daniel MARTIN
2 | Johan Esteban CHAVES
1 | Ilnur ZAKARIN
2 | Carlos BETANCUR
1 | Benjamin PRADES
2 | Adam PHELAN
(8 rows)
ALLOW FILTERING
The ALLOW FILTERING
clause allows you to perform queries that require scanning all partitions, with no primary key columns specified.
It should not be used in production as it can cause severe performance issues!
When initially modeling your data, you should avoid using ALLOW FILTERING
and instead model your data to avoid it.
However, for a small dataset or for testing purposes, it can be useful.
It may even help you identify where you need to add indexes to your data model.
For more information, see Allow Filtering explained.
The following query selects the birthday
and nationality
columns from the cyclist_alt_stats
table, with the ALLOW FILTERING
clause:
SELECT lastname, birthday, nationality FROM cycling.cyclist_alt_stats
WHERE birthday = '1991-08-25' AND nationality = 'Ethiopia'
ALLOW FILTERING;
Results
lastname | birthday | nationality
----------+------------+-------------
GRMAY | 1991-08-25 | Ethiopia
(1 rows)