Apache Cassandra structure
Apache Cassandra is a distributed database system, meaning that data is spread across multiple nodes in a cluster. Each node in the cluster stores some of the pieces, or partitions, of the total data. Because of this, it’s important to design your schema and queries with distribution in mind, to ensure good performance and scalability.
These partitions play a key role in retrieving data efficiently, because the partition key determines which partition, and therefore which nodes, hold a given row. This makes the partition key a critical element of data modeling.
Let’s step back for a moment to look at the overall structure of how data is stored in Cassandra, from the point of view of the Cassandra Query Language (CQL).
The first database object that must be defined is a keyspace. A keyspace is an object that stores one or more tables. Its main characteristic is the replication factor: how many replicas of a particular piece of data are stored in the Cassandra cluster.
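As a minimal sketch of a keyspace definition, the statement below sets a replication factor of 3. The keyspace name store is hypothetical, and SimpleStrategy is assumed here because it suits a single-datacenter development cluster; a multi-datacenter deployment would more likely use NetworkTopologyStrategy with a factor per datacenter.

```cql
-- Hypothetical keyspace name; adjust the replication settings to your cluster.
-- 'replication_factor': 3 asks Cassandra to keep three replicas of each partition.
CREATE KEYSPACE IF NOT EXISTS store
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
```

Every table created in this keyspace inherits its replication settings.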
As mentioned, tables are database objects stored in keyspaces. Tables define the pieces of information that are stored together. Generally, a table groups the information needed to answer a particular query, so that retrieval is fast.
Tables may be thought of as storing rows of data in columns. The contents of a particular row-column pair are stored in a cell and must conform to the column's data type, which must be declared for every column in the table. One or more columns are used to define the partition key. A partition key is required in all table definitions as part of the primary key. A primary key consists of a partition key and, optionally, one or more clustering columns. Clustering columns define how rows are sorted within a partition.
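The pieces of a table definition can be sketched as follows. The table and column names are hypothetical, and the example assumes a keyspace named store already exists. Here customer_id is the partition key, while order_time and order_id are clustering columns.

```cql
-- Assumes a keyspace named store already exists (hypothetical name).
CREATE TABLE IF NOT EXISTS store.orders_by_customer (
  customer_id uuid,        -- partition key: decides which partition holds the row
  order_time  timestamp,   -- clustering column: sorts rows within the partition
  order_id    uuid,        -- clustering column: breaks ties on order_time
  total       decimal,     -- regular column
  PRIMARY KEY ((customer_id), order_time, order_id)
) WITH CLUSTERING ORDER BY (order_time DESC, order_id ASC);

-- Reads a single partition; rows come back newest first.
SELECT order_time, order_id, total
  FROM store.orders_by_customer
 WHERE customer_id = 123e4567-e89b-12d3-a456-426614174000
 LIMIT 10;

-- token() shows the hash value Cassandra derives from the partition key
-- to decide where each partition is placed.
SELECT token(customer_id), customer_id
  FROM store.orders_by_customer
 LIMIT 5;
```

Because the first query names the full partition key, it can be answered from one partition without consulting the rest of the cluster's data.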
Tables can be indexed, making non-partition key columns available for query. Apache Cassandra has multiple methods of indexing: secondary indexes (2i), SSTable-attached secondary indexes (SASI), and storage-attached indexes (SAI).
Indexes allow data to be retrieved from tables without specifying a primary key. However, queries that use an index are less performant than queries that use a primary key, so index use should be considered carefully when data modeling and designing table structure.
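A minimal sketch of a secondary index (2i) follows. The table, column, and index names are hypothetical, and the example assumes a keyspace named store already exists. The SAI variant in the trailing comment assumes Cassandra 5.0 or later.

```cql
-- Assumes a keyspace named store already exists (hypothetical name).
CREATE TABLE IF NOT EXISTS store.users (
  user_id uuid PRIMARY KEY,   -- partition key
  email   text,
  country text
);

-- Secondary index (2i) on a column that is not part of the primary key.
CREATE INDEX IF NOT EXISTS users_country_idx ON store.users (country);

-- Without the index, this query would be rejected unless ALLOW FILTERING is added.
SELECT user_id, email FROM store.users WHERE country = 'FR';

-- On Cassandra 5.0 or later, an SAI index could be created instead (assumption):
-- CREATE INDEX IF NOT EXISTS users_country_sai ON store.users (country) USING 'sai';
```

A query by user_id reads one known partition, whereas the indexed query on country may need to contact many nodes, which is why it tends to be slower.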
Security is an important feature of all databases, and Cassandra is no different. Table data can be secured with a login-password combination based on defined roles. In addition, row-based access can be controlled.
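Role-based access can be sketched as below. The role name, password, and keyspace name are hypothetical. The statements assume that authentication and authorization have been enabled in cassandra.yaml (for example, PasswordAuthenticator and CassandraAuthorizer) and that they are run by a role with sufficient privileges.

```cql
-- Assumes a keyspace named store already exists (hypothetical name).
-- Create a role that can log in with a password (placeholder value shown).
CREATE ROLE IF NOT EXISTS reporting
  WITH PASSWORD = 'change-me' AND LOGIN = true;

-- Allow the role to read, but not modify, every table in the keyspace.
GRANT SELECT ON KEYSPACE store TO reporting;

-- Review the permissions granted to the role.
LIST ALL PERMISSIONS OF reporting;
```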