WHERE clause
The SELECT
statement is used to query data from a table in CQL.
The results are returned as a set of rows.
Selectors are used to specify which columns are queried and returned.
The WHERE
clause is used to filter the rows that are returned.
Using WHERE
, you can specify a condition that must be met for a row to be included in the result set.
The condition can include multiple columns and multiple conditions linked with AND
.
A subtle variation to the use of AND
is using a tuple to group columns to meet a condition.
The TOKEN
function can be used to retrieve data from a specific node in the cluster.
Several arithmetic and non-arithmetic operators are used in the WHERE
clause to specify the condition.
Operator | Description | Usage |
---|---|---|
|
Equal |
Any column type |
|
Greater than |
Numeric, text, date, time, timestamp |
|
Less than |
Numeric, text, date, time, timestamp |
|
Greater than or equal |
Numeric, text, date, time, timestamp |
|
Less than or equal |
Numeric, text, date, time, timestamp |
|
Supply a list of values to match |
Any column type |
|
Any collection contains value |
Any column type within a non-frozen collection, used with SAI and 2i |
|
Map collection contains key |
Any column type within a non-frozen map collection, used with SAI and 2i |
|
Fuzzy text pattern matching |
Text column type, used with SASI |
Note that the |
Arithmetic operators
To return a result set that uses an equality condition, use the =
operator.
Partition key columns always use the equality operator in the WHERE
clause.
The exception is if the partition key is retrieved using the TOKEN
function, inequality operators can be used.
Clustering columns can use the equality operator in the WHERE
clause.
The equality operator will result in the quickest query response time, provided the query avoids multiple partitions.
SELECT * FROM cycling.cyclist_base WHERE age = 28;
Results
cid | age | birthday | country | name
--------------------------------------+-----+------------+-------------+-------------------
6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 | 28 | 1987-06-07 | Netherlands | Steven KRUIKSWIJK
(1 rows)
Only tokenized partition key columns and the last clustering columns in a WHERE
clause can use the inequality operators.
To return a result set that uses an inequality condition, use one of the the operators: <
, >
, ⇐
, or >=
.
SELECT * FROM cycling.cyclist_base WHERE age >= 20;
Results
cid | age | birthday | country | name
--------------------------------------+-----+------------+---------------+-------------------
e7ae5cf3-d358-4d99-b900-85902fda9bb0 | 22 | 1993-06-18 | New Zealand | Alex FRAME
c9c9c484-5e4a-4542-8203-8d047a01b8a8 | 27 | 1987-09-04 | Brazil | Cristian EGIDIO
862cc51f-00a1-4d5a-976b-a359cab7300e | 20 | 1994-09-04 | Denmark | Joakim BUKDAL
220844bf-4860-49d6-9a4b-6b5d3a79cbfb | 38 | 1977-07-08 | Italy | Paolo TIRALONGO
6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47 | 28 | 1987-06-07 | Netherlands | Steven KRUIKSWIJK
95fed1a4-db7e-4c8d-8b2d-f7f9340f2c0b | 20 | 1993-04-09 | United States | Zam WEATON
(6 rows)
Non-arithmetic operators
Non-arithmetic operators are used to specify conditions that are not based on numerical values.
Partition key columns always use the equality operator in the WHERE
clause, with the exception being the use of the IN
operator if the column is the last partition key column.
Clustering columns can use the IN
operator in the WHERE
clause.
To return a result set that uses the IN
condition, use the IN
operator.
Note that using both |
SELECT * FROM cycling.calendar WHERE race_id IN (100, 101, 102)
ORDER BY race_start_date DESC;
Results
There are cases in which the IN
operator should not be used.
A good blogpost on this topic is Cassandra Query Patterns: Not using the “in” query for multiple partitions.
The CONTAINS
and CONTAINS KEY
operators are used with collections (set, list, map) to specify conditions that are based on the contents of the collection.
In order to facilitate the use of these operators, a secondary index is often required, as the collection column is usually a non-primary key column.
To return a result set that uses the CONTAINS
condition, use the CONTAINS
operator.
This particular example uses the CONTAINS
operator to return a result set that contains a specific value for a column in a map collection.
SELECT firstname,lastname,teams FROM cycling.cyclist_teams
WHERE teams CONTAINS 'Team Garmin - Cervelo';
Results
firstname | lastname | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Jamie | BENNETT | {2011: 'Team Garmin - Cervelo', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team'}
Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
(2 rows)
firstname | lastname | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Jamie | BENNETT | {2011: 'Team Garmin - Cervelo', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team'}
Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
(2 rows)
Another example demonstrates that frozen collections can also be used with the CONTAINS
operator:
SELECT * FROM cycling.race_results WHERE race_history CONTAINS 1984;
Results
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Column race_history has an index but does not support the operators
specified in the query. If you want to execute this query despite the performance
unpredictability, use ALLOW FILTERING"
Remember that the CONTAINS KEY
is used only with a map collection.
This example uses the CONTAINS KEY
operator to return a result set that contains a specific key of a key-value pair in a map collection.
SELECT firstname,lastname,teams FROM cycling.cyclist_teams
WHERE teams CONTAINS KEY 2015;
Results
firstname | lastname | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
Marianne | VOS | {2015: 'Rabobank-Liv Woman Cycling Team'}
(2 rows)
firstname | lastname | teams
-----------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Elizabeth | ARMITSTEAD | {2011: 'Team Garmin - Cervelo', 2012: 'AA Drink - Leontien.nl', 2013: 'Boels:Dolmans Cycling Team', 2014: 'Boels:Dolmans Cycling Team', 2015: 'Boels:Dolmans Cycling Team'}
Marianne | VOS | {2015: 'Rabobank-Liv Woman Cycling Team'}
(2 rows)
A SELECT
statement can also use the LIKE
operator to return a result set that matches a specific pattern.
A SASI indexis required to use the LIKE
operator.
SASI is considered experimental and is not recommended for production use.
SELECT * FROM cycling.country_flag WHERE cyclist_name LIKE 'Jac%';
Results
country | cyclist_name | flag
---------+--------------+------
Belgium | Jacques | 1
(1 rows)
Warnings :
SASI index was enabled for 'cycling.country_flag'. SASI is still in beta, take extra caution when using it in production.
country | cyclist_name | flag
---------+--------------+------
Belgium | Jacques | 1
(1 rows)
Warnings :
SASI index was enabled for 'cycling.country_flag'. SASI is still in beta, take extra caution when using it in production.
Multiple conditions
Multiple conditions can be specified in the WHERE
clause using the AND
operator.
Using more than one condition in the WHERE
clause allows you to filter the result set based on multiple criteria.
In the following example, the AND
operator is used to return a result set that meets two conditions.
SELECT * FROM cycling.cyclist_base WHERE age >= 28 AND age < 20;
Results
cid | age | birthday | country | name
-----+-----+----------+---------+------
(0 rows)
Fails because of the disjoint ranges
Using a tuple for multiple conditions
A grouping of columns can be used to meet a condition.
In this example, a tuple is used to group two columns, race_start_date
and race_end_date
, to meet a condition.
Not that searching for race_start_date = <value> AND race_end_date = <value>
may return a different result set than searching for the tuple of values.
SELECT * FROM cycling.calendar WHERE race_id IN (100, 101, 102)
AND (race_start_date, race_end_date) >= ('2015-05-09', '2015-05-24');
Results
race_id | race_start_date | race_end_date | race_name
---------+---------------------------------+---------------------------------+-----------------------
100 | 2015-05-09 00:00:00.000000+0000 | 2015-05-31 00:00:00.000000+0000 | Giro d'Italia
101 | 2015-06-07 00:00:00.000000+0000 | 2015-06-14 00:00:00.000000+0000 | Criterium du Dauphine
102 | 2015-06-13 00:00:00.000000+0000 | 2015-06-21 00:00:00.000000+0000 | Tour de Suisse
(3 rows)
race_id | race_start_date | race_end_date | race_name
---------+---------------------------------+---------------------------------+-----------------------
100 | 2015-05-09 00:00:00.000000+0000 | 2015-05-31 00:00:00.000000+0000 | Giro d'Italia
101 | 2015-06-07 00:00:00.000000+0000 | 2015-06-14 00:00:00.000000+0000 | Criterium du Dauphine
102 | 2015-06-13 00:00:00.000000+0000 | 2015-06-21 00:00:00.000000+0000 | Tour de Suisse
(3 rows)
Clustering columns
Clustering columns can make filtering with a WHERE
clause uniquely responsive for some data.
The clustering columns are used to sort the data within a partition, and the data is stored in the sorted order.
This quality means that slices, where a range of rows are retrieved, can be very efficient.
A particular row can be found using the clustering columns in the order that they are defined in the primary key.
The rows that come before or after that row can be retrieved quickly.
In fact, even a slice of a slice can be retrieved, by narrowing down the range of rows with conditions on the first clustering columns.
A tuple can be handy for sorting on multiple clustering columns.
Because the database uses the clustering columns to determine the location of the data on the partition, you must identify the higher level clustering columns definitively using the equals (=) or IN operators. In a query, you can only restrict the lowest level using the range operators (>, >=, <, or <=).
When a query contains no restrictions on clustering or index columns, all the data from the partition is returned.
How order impacts clustering restrictions
Because the database uses the clustering columns to determine the location of the data on the partition, you must identify the higher level clustering columns definitively using the equals (=) or IN operators. In a query, you can only restrict the lowest level using the range operators (>, >=, <, or <=).
How data is stored
The following table is used to illustrate how clustering works:
CREATE TABLE cycling.numbers (
key int,
col_1 int,
col_2 int,
col_3 int,
col_4 int,
PRIMARY KEY ((key), col_1, col_2, col_3, col_4)
);
The example table contains the following data:
key | col_1 | col_2 | col_3 | col_4
-----+-------+-------+-------+-------
100 | 1 | 1 | 1 | 1
100 | 1 | 1 | 1 | 2
100 | 1 | 1 | 1 | 3
100 | 1 | 1 | 2 | 1
100 | 1 | 1 | 2 | 2
100 | 1 | 1 | 2 | 3
100 | 1 | 2 | 2 | 1
100 | 1 | 2 | 2 | 2
100 | 1 | 2 | 2 | 3
100 | 2 | 1 | 1 | 1
100 | 2 | 1 | 1 | 2
100 | 2 | 1 | 1 | 3
100 | 2 | 1 | 2 | 1
100 | 2 | 1 | 2 | 2
100 | 2 | 1 | 2 | 3
100 | 2 | 2 | 2 | 1
100 | 2 | 2 | 2 | 2
100 | 2 | 2 | 2 | 3
(18 rows)
The database stores and locates the data using a nested sort order. The data is stored in hierarchy that the query must traverse:
[json]
--------------------------------------------------------------
{"key": 100, "col_1": 1, "col_2": 1, "col_3": 1, "col_4": 1}
{"key": 100, "col_1": 1, "col_2": 1, "col_3": 1, "col_4": 2}
{"key": 100, "col_1": 1, "col_2": 1, "col_3": 1, "col_4": 3}
{"key": 100, "col_1": 1, "col_2": 1, "col_3": 2, "col_4": 1}
{"key": 100, "col_1": 1, "col_2": 1, "col_3": 2, "col_4": 2}
{"key": 100, "col_1": 1, "col_2": 1, "col_3": 2, "col_4": 3}
{"key": 100, "col_1": 1, "col_2": 2, "col_3": 2, "col_4": 1}
{"key": 100, "col_1": 1, "col_2": 2, "col_3": 2, "col_4": 2}
{"key": 100, "col_1": 1, "col_2": 2, "col_3": 2, "col_4": 3}
{"key": 100, "col_1": 2, "col_2": 1, "col_3": 1, "col_4": 1}
{"key": 100, "col_1": 2, "col_2": 1, "col_3": 1, "col_4": 2}
{"key": 100, "col_1": 2, "col_2": 1, "col_3": 1, "col_4": 3}
{"key": 100, "col_1": 2, "col_2": 1, "col_3": 2, "col_4": 1}
{"key": 100, "col_1": 2, "col_2": 1, "col_3": 2, "col_4": 2}
{"key": 100, "col_1": 2, "col_2": 1, "col_3": 2, "col_4": 3}
{"key": 100, "col_1": 2, "col_2": 2, "col_3": 2, "col_4": 1}
{"key": 100, "col_1": 2, "col_2": 2, "col_3": 2, "col_4": 2}
{"key": 100, "col_1": 2, "col_2": 2, "col_3": 2, "col_4": 3}
(18 rows)
To avoid full scans of the partition and to make queries more efficient, the database requires that the higher level columns in the sort order (col_1, col_2, and col_3) are identified using the equals or IN operators. Ranges are allowed on the last column (col_4).
Selecting data from a clustering segment
For example, to find only values in column 4 that are less than or equal to 2:
SELECT * FROM cycling.numbers
WHERE key = 100 AND col_1 = 1 AND col_2 = 1 AND col_3 = 1 AND col_4 <= 2;
Results
The results contain the first two rows:
key | col_1 | col_2 | col_3 | col_4
-----+-------+-------+-------+-------
100 | 1 | 1 | 1 | 1
100 | 1 | 1 | 1 | 2
(2 rows)
The IN operator can impact performance on medium-large datasets. When selecting multiple segments, the database loads and filters all the specified segments.
For example, to find all values less than or equal to 2 in both col_1
segments 1 and 2:
SELECT * FROM cycling.numbers WHERE key = 100 AND col_1 = 1 AND col_2 > 1
Results
The following visualization shows all the segments the database must load to filter multiple segments:
The results return the range from both segments.
key | col_1 | col_2 | col_3 | col_4
-----+-------+-------+-------+-------
100 | 1 | 2 | 2 | 1
100 | 1 | 2 | 2 | 2
100 | 1 | 2 | 2 | 3
(3 rows)
Use TRACING to analyze the impact of various queries in your environment. |
Invalid restrictions
Queries that attempt to return ranges without identifying any of the higher level segments are rejected:
SELECT * FROM cycling.numbers WHERE key = 100 AND col_4 <= 2;
Results
The request is invalid:
InvalidRequest: Error from server: code=2200 [Invalid query] message="PRIMARY KEY column "col_4"
cannot be restricted as preceding column "col_1" is not restricted"
You can force the query using the ALLOW FILTERING option; however, this loads the entire partition and negatively impacts performance by causing long READ latencies. |
Only restricting top level clustering columns
Unlike partition columns, a query can omit lower level clustering column in logical statements.
For example, to filter one of the mid-level columns, restrict the first level column using equals or IN, then specify a range on the second level:
SELECT * FROM cycling.numbers WHERE key = 100 AND col_1 = 1 AND col_2 > 1
Results
key | col_1 | col_2 | col_3 | col_4
-----+-------+-------+-------+-------
100 | 1 | 2 | 2 | 1
100 | 1 | 2 | 2 | 2
100 | 1 | 2 | 2 | 3
(3 rows)
Returning ranges that span clustering segments
Slicing provides a way to look at an entire clustering segment and find a row that matches values in multiple columns. The slice logical statement finds a single row location and allows you to return all the rows before, including, between, or after the row.
Slice syntax:
(clustering1, clustering2[, …]) <range_operator> (value1, value2[, …])
[AND (clustering1, clustering2[, …]) <range_operator> (value1, value2[, …])]
Slices across full partition
The slice determines the exact location within the sorted columns; therefore, the highest level is evaluated first, then the second, and so forth in order to drill down to the precise row location. The following statement identifies the row where column 1, 2, and 3 are equal to 2 and column 4 is less than or equal to 1.
SELECT * FROM cycling.numbers
WHERE key = 100 AND (col_1, col_2, col_3, col_4) <= (2, 2, 2, 1);
The database locates the matching row and then returns every record before the identified row in the results set.
Where col_1 = 1, col_4 contains values 2 and 3 in the results (which are greater than 1). |
The database is NOT filtering on all values in column 4, it is finding the exact location shown in dark green. Once it locates the row, the evaluation ends.
The location might be hypothetical, that is the dataset does not contain a row that exactly matches the values. For example, the query specifies slice values of (2, 1, 1, 4).
SELECT * FROM cycling.numbers
WHERE key = 100 AND (col_1, col_2, col_3, col_4) <= (2, 1, 1, 4);
The query finds where the row would be in the order if a row with those values existed and returns all rows before it:
The value of column 4 is only evaluated to locate the row placement within the clustering segment. The database locates the segment and then finds col_4 = 4. After finding the location, it returns the row and all the rows before it in the sort order (which in this case spans all clustering columns). |
Slices of clustering segments
The same rules apply to slice restrictions when finding a slice on a lower level segment; identify the higher level clustering segments using equals or IN and specify a range on the lower segments.
For example, to return rows where the value is greater than (1, 3) and less than or equal to (2, 5):
SELECT * FROM cycling.numbers WHERE key = 100 AND col_1 = 1 AND col_2 = 1
AND (col_3, col_4) >= (1, 2) AND (col_3, col_4) < (2, 3);
When finding a between range, the two slice statements must be on the same columns for lowest columns in the hierarchy.
Invalid queries
When returning a slice between two rows, the slice statements must define the same clustering columns. The query is rejected if the columns are different:
SELECT * FROM cycling.numbers WHERE key = 100 AND col_1 = 1
AND (col_2, col_3, col_4) >= (1, 1, 2) AND (col_3, col_4) < (2, 3);
Results
The request is invalid:
InvalidRequest: Error from server: code=2200 [Invalid query]
message="Column "col_3" cannot be restricted by two inequalities not starting
with the same column"
-
Find the road cycling races that start in 2017 between January 15th and February 14th.
CREATE TABLE IF NOT EXISTS cycling.events ( year int, start_month int, start_day int, end_month int, end_day int, race text, discipline text, location text, uci_code text, PRIMARY KEY ( (year, discipline), start_month, start_day, race ) );
Limit the start_month
and start_day
for the range using a slice:
SELECT start_month as month, start_day as day, race FROM cycling.events WHERE year = 2017 AND discipline = 'Road' AND (start_month, start_day) < (2, 14) AND (start_month, start_day) > (1, 15);
Results
The results contain events in that time period:
month | day | race
-------+-----+------
(0 rows)
month | day | race
-------+-----+-----------------------------------------------------------------
1 | 23 | Vuelta Ciclista a la Provincia de San Juan
1 | 26 | Cadel Evans Great Ocean Road Race - Towards Zero Race Melbourne
1 | 26 | Challenge Mallorca: Trofeo Porreres-Felanitx-Ses Salines-Campos
1 | 28 | Cadel Evans Great Ocean Road Race
1 | 28 | Challenge Mallorca: Trofeo Andratx-Mirador des Colomer
1 | 28 | Challenge Mallorca: Trofeo Serra de Tramuntana -2017
1 | 29 | Cadel Evans Great Ocean Road Race
1 | 29 | Dubai Tour
1 | 29 | Grand Prix Cycliste la Marseillaise
1 | 29 | Mallorca Challenge: Trofeo Palma
1 | 31 | Ladies Tour of Qatar
2 | 1 | Etoile de Besseges
2 | 1 | Jayco Herald Sun Tour
2 | 1 | Volta a la Comunitat Valenciana
2 | 5 | G.P. Costa degli Etruschi
2 | 6 | Tour of Qatar
2 | 9 | South African Road Championships
2 | 11 | Trofeo Laigueglia
2 | 11 | Vuelta Ciclista a la Region de Murcia
2 | 12 | Clasica de Almeria
(20 rows)
Vector search sorting
This query uses a vector in the ORDER BY
clause to get the closest matches to the stored embeddings vector:
SELECT * FROM cycling.comments_vs
ORDER BY comment_vector ANN OF [0.15, 0.1, 0.1, 0.35, 0.55]
LIMIT 3;
Results