How data modeling should be approached for Cassandra. A music service example is used throughout the CQL document.
For a very basic explanation of CQL, see the Data model distilled in Cassandra & DataStax Enterprise Essentials.
Data modeling is a process that involves identifying the entities, or items to be stored, and the relationships between them. It also involves identifying the data access patterns and the queries that will be performed. These two ideas inform how the data is organized and structured, and how the database's tables are designed and created. In some cases, indexing the data improves performance, so secondary indexes must be chosen judiciously.
Data modeling in Cassandra uses a query-driven approach, in which specific queries are the key to organizing the data. Cassandra's database design is based on the requirement for fast reads and writes, so the better the schema design, the faster data is written and retrieved. A query selects data from a table; the schema defines how the data in the table is arranged.
Cassandra's data model is a partitioned row store with tunable consistency. Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns can be indexed separately from the primary key. Because Cassandra is a distributed database, efficiency is gained for reads and writes when data is grouped together on nodes by partition. The fewer partitions that must be queried to get an answer to a question, the faster the response. Tuning the consistency level is another factor in latency, but is not part of the data modeling process.
For this reason, Cassandra data modeling focuses on the queries. Throughout this topic, the music service example demonstrates the schema that results from modeling the Cassandra tables for specific queries.
One basic query for a music service is a listing of songs, including the title, album, and artist. To uniquely identify each song in the table, an additional column, id, is added. For a simple query that lists all songs, a table is created that includes all of these columns and uses id as the partition key (K).
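The table described above might be sketched in CQL as follows. This is a minimal illustration, not a definitive schema: the table name songs and the column types (uuid for id, text for the rest) are assumptions, and a keyspace is assumed to be selected already.

```cql
-- Hypothetical table for the "list all songs" query.
-- id is the only primary key column, so it is also the partition key (K).
CREATE TABLE IF NOT EXISTS songs (
  id uuid PRIMARY KEY,
  title text,
  album text,
  artist text
);

-- List all songs with their title, album, and artist.
SELECT id, title, album, artist FROM songs;
```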
A related query searches for all songs by a particular artist. For Cassandra, this query is more efficient if a table is created that groups all songs by artist. The same columns of title, album, and artist are required, but now the primary key of the table uses artist as the partition key (K) and id as the clustering column (C), which groups rows within each partition. Including id in the primary key ensures that each song is stored as a unique record.
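A sketch of this second table in CQL could look like the following. The table name songs_by_artist, the column types, and the artist value in the query are assumptions made for illustration; a keyspace is again assumed to be in use.

```cql
-- Hypothetical table for the "songs by a particular artist" query.
-- artist is the partition key (K); id is the clustering column (C).
CREATE TABLE IF NOT EXISTS songs_by_artist (
  artist text,
  id uuid,
  title text,
  album text,
  PRIMARY KEY ((artist), id)
);

-- All songs by one artist are read from a single partition.
SELECT id, title, album
FROM songs_by_artist
WHERE artist = 'Example Artist';
```

Because the query restricts the partition key, only the partition for that artist needs to be read to answer it.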
Notice that the key to designing the table is not the relationship of the table to other tables, as it is in relational database modeling. Data in Cassandra is often arranged as one query per table, and data is repeated among many tables, a process known as denormalization. The relationships between the entities are still important, because the order in which data is stored in Cassandra can greatly affect the ease and speed of data retrieval.
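As a rough illustration of denormalization, the same song can be written to both a table keyed by id and a table partitioned by artist. The table names, column types, and sample values below are hypothetical, and the logged batch is only one possible way to keep the two copies in step; an application could also issue the two writes separately.

```cql
-- Hypothetical tables, one per query (created only if they do not already exist).
CREATE TABLE IF NOT EXISTS songs (
  id uuid PRIMARY KEY,
  title text,
  album text,
  artist text
);

CREATE TABLE IF NOT EXISTS songs_by_artist (
  artist text,
  id uuid,
  title text,
  album text,
  PRIMARY KEY ((artist), id)
);

-- Write the same (made-up) song to both tables.
BEGIN BATCH
  INSERT INTO songs (id, title, album, artist)
  VALUES (62c36092-82a1-3a00-93d1-46196ee77204, 'Example Title', 'Example Album', 'Example Artist');

  INSERT INTO songs_by_artist (artist, id, title, album)
  VALUES ('Example Artist', 62c36092-82a1-3a00-93d1-46196ee77204, 'Example Title', 'Example Album');
APPLY BATCH;
```

Each table then answers its own query directly, at the cost of storing the song's data twice.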