Using a secondary index

Using CQL to create a secondary index on a column after defining a table.

Create indexes on a column after defining a table. Secondary indexes are used to query a table using a column that is not normally query-able.

Secondary indexes can impact performance greatly. The index table is stored on each node in a cluster, so a query involving a secondary index can rapidly become a performance problem if multiple nodes are accessed. A general rule is to index a column with low cardinality of few values. Before creating an index, be aware of when not to create an index.

Procedure

  • The table rank_by_year_and_name stores the rank of cyclists for races.
    CREATE TABLE cycling.rank_by_year_and_name ( 
      race_year int, 
      race_name text, 
      cyclist_name text, 
      rank int, 
      PRIMARY KEY ((race_year, race_name), rank) 
    );
  • Both race_year and race_name must be specified as these columns comprise the partition key.
    SELECT *
    FROM cycling.rank_by_year_and_name
    WHERE race_year = 2015
      AND race_name = 'Tour of Japan - Stage 4 - Minami > Shinshu';
    race_year | race_name                                  | rank | cyclist_name
    ----------+--------------------------------------------+------+-----------------
         2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 | Benjamin PRADES
         2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    2 |     Adam PHELAN
         2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |    Thomas LEBAS
  • A logical query to try is a listing of the rankings for a particular year. Because the table has a composite partition key, this query will fail if only the first column is used in the conditional operator.
    SELECT *
    FROM cycling.rank_by_year_and_name
    WHERE race_year = 2015;
    InvalidRequest: code=2200 [Invalid query] message=
    "Partition key parts: race_name must be restricted as other parts are"
  • An index is created for the race year, and the query will succeed. An index name is optional and must be unique within a keyspace. If you do not provide a name, DataStax Distribution of Apache Cassandra assigns a name like race_year_idx.
    CREATE INDEX ryear ON
    cycling.rank_by_year_and_name (race_year);
    
    SELECT *
    FROM cycling.rank_by_year_and_name
    WHERE race_year = 2015;
    race_year | race_name                                  | rank | cyclist_name
    ----------+--------------------------------------------+------+-----------------
         2015 |   Giro d'Italia - Stage 11 - Forli > Imola |    1 |   Ilnur ZAKARIN
         2015 |   Giro d'Italia - Stage 11 - Forli > Imola |    2 | Carlos BETANCUR
         2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 | Benjamin PRADES
         2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    2 |     Adam PHELAN
         2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    3 |    Thomas LEBAS
  • A clustering column can also be used to create an index. An index is created on rank, and used in a query.
    CREATE INDEX rrank ON
    cycling.rank_by_year_and_name (rank);
    
    SELECT *
    FROM cycling.rank_by_year_and_name
    WHERE rank = 1;
    race_year | race_name                                  | rank | cyclist_name
    ----------+--------------------------------------------+------+-------------------
         2014 |                        4th Tour of Beijing |    1 | Phillippe GILBERT
         2015 |   Giro d'Italia - Stage 11 - Forli > Imola |    1 |     Ilnur ZAKARIN
         2015 | Tour of Japan - Stage 4 - Minami > Shinshu |    1 |   Benjamin PRADES