Data Modeling Concepts

How data modeling should be approached for Apache Cassandra. A music service example is used throughout the CQL document.

Note: DataStax Academy provides a course in Apache Cassandra™ data modeling. This course presents techniques using the Chebotko method for translating a real-world domain model into a running Cassandra schema.

Data modeling is a process that involves identifying the entities, or items to be stored, and the relationships between entities. In addition, data modeling involves identifying the patterns of data access and the queries that will be performed. These two ideas inform how the data is organized and structured for storage, and how the database's tables are designed and created. In some cases, indexing the data improves performance, so secondary indexes must be chosen judiciously.

Data modeling in Cassandra uses a query-driven approach, in which specific queries are the key to organizing the data. Cassandra's database design is based on the requirement for fast reads and writes, so the better the schema design, the faster data is written and retrieved. A query selects data from a table; the schema defines how the data in that table is arranged.

Apache Cassandra™'s data model is a partitioned row store with tunable consistency. Rows are organized into tables; the first component of a table's primary key is the partition key; within a partition, rows are clustered by the remaining columns of the key. Other columns can be indexed separately from the primary key. Because Cassandra is a distributed database, efficiency is gained for reads and writes when data is grouped together on nodes by partition. The fewer partitions that must be queried to get an answer to a question, the faster the response. Tuning the consistency level is another factor in latency, but is not part of the data modeling process.

For this reason, Cassandra data modeling focuses on the queries. Throughout this topic, the music service example demonstrates the schema that results from modeling the Cassandra tables for specific queries.

One basic query for a music service is a listing of songs, including the title, album, and artist. To uniquely identify a song in the table, an additional column id is added. For a simple query to list all songs, a table that includes all the columns identified and a partition key (K) of id is created.
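
The table described above could be sketched in CQL roughly as follows. The keyspace name music, the table name songs, the column types, and the replication settings are all assumptions made for illustration; the text specifies only the columns and the partition key.

```cql
-- Hypothetical keyspace for the music service; replication settings are placeholders.
CREATE KEYSPACE IF NOT EXISTS music
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- One row per song; id is the partition key (K).
CREATE TABLE IF NOT EXISTS music.songs (
  id     uuid PRIMARY KEY,
  title  text,
  album  text,
  artist text
);

-- List all songs. This reads every partition in the table.
SELECT id, title, album, artist FROM music.songs;
```

Because each song lives in its own partition, listing all songs touches every partition, which is acceptable for a small catalog but becomes expensive as the table grows.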

A related query searches for all songs by a particular artist. For Cassandra, this query is more efficient if a table is created that groups all songs by artist. All the same columns of title, album, and artist are required, but now the primary key of the table uses artist as the partition key (K) and id as a clustering column (C) that orders rows within the partition. Including id in the primary key ensures that each song is stored as a unique record.
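
A minimal CQL sketch of such a table might look like the following. The names music and songs_by_artist, the column types, the replication settings, and the sample artist value are illustrative assumptions, not part of the original example.

```cql
-- Hypothetical keyspace; replication settings are placeholders.
CREATE KEYSPACE IF NOT EXISTS music
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

-- artist is the partition key (K); id is a clustering column (C).
CREATE TABLE IF NOT EXISTS music.songs_by_artist (
  artist text,
  id     uuid,
  title  text,
  album  text,
  PRIMARY KEY ((artist), id)
);

-- All songs for one artist are read from a single partition.
SELECT id, title, album
  FROM music.songs_by_artist
  WHERE artist = 'Example Artist';
```

Restricting the query to one value of artist lets Cassandra answer it from a single partition rather than searching across the cluster.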

Notice that the key to designing the table is not the relationship of the table to other tables, as it is in relational database modeling. Data in Cassandra is often arranged as one query per table, and data is repeated among many tables, a process known as denormalization. The relationships between entities are still important, because the order in which data is stored in Cassandra can greatly affect the ease and speed of data retrieval.
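
One way to picture denormalization is to write the same song into both query tables. The sketch below assumes a keyspace named music and tables named songs and songs_by_artist, with column types, replication settings, and sample values chosen purely for illustration. Using a logged batch to keep the two copies together is one possible approach and not something the text prescribes.

```cql
-- Hypothetical keyspace and tables; names, types, and settings are assumptions.
CREATE KEYSPACE IF NOT EXISTS music
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

CREATE TABLE IF NOT EXISTS music.songs (
  id uuid PRIMARY KEY, title text, album text, artist text
);

CREATE TABLE IF NOT EXISTS music.songs_by_artist (
  artist text, id uuid, title text, album text,
  PRIMARY KEY ((artist), id)
);

-- The same song is stored twice, once per query table.
BEGIN BATCH
  INSERT INTO music.songs (id, title, album, artist)
    VALUES (6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47,
            'Example Title', 'Example Album', 'Example Artist');
  INSERT INTO music.songs_by_artist (artist, id, title, album)
    VALUES ('Example Artist', 6ab09bec-e68e-48d9-a5f8-97e6fb4c9b47,
            'Example Title', 'Example Album');
APPLY BATCH;
```

The duplicated write costs extra storage and write work, which is the trade-off accepted in exchange for serving each query from a table shaped for it.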