Architecture in brief

Essential information for understanding and using Cassandra.

Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure. Its architecture is based in the understanding that system and hardware failure can and do occur. Cassandra addresses the problem of failures by employing a peer-to-peer distributed system where all nodes are the same and data is distributed among all nodes in the cluster. Each node exchanges information across the cluster every second. A commit log on each node captures write activity to ensure data durability. Data is also written to an in-memory structure, called a memtable, and then written to a data file called an SSTable on disk once the memory structure is full. All writes are automatically partitioned and replicated throughout the cluster.

Cassandra is a row-oriented database. Cassandra's architecture allows any authorized user to connect to any node in any data center and access data using the CQL language. For ease of use, CQL uses a similar syntax to SQL. From the CQL perspective the database consists of tables. Typically, a cluster has one keyspace per application. Developers can access CQL through cqlsh as well as via drivers for application languages.

Client read or write requests can go to any node in the cluster. When a client connects to a node with a request, that node serves as the coordinator for that particular client operation. The coordinator acts as a proxy between the client application and the nodes that own the data being requested. The coordinator determines which nodes in the ring should get the request based on how the cluster is configured. For more information, see Client requests.

Key components for configuring Cassandra

Gossip : A peer-to-peer communication protocol to discover and share location and state information about the other nodes in a Cassandra cluster.

Gossip information is also persisted locally by each node to use immediately when a node restarts. You may want to purge gossip history on node restart for various reasons, such as when the node's IP addresses has changed.
Partitioner : A partitioner determines how to distribute the data across the nodes in the cluster. Choosing a partitioner determines which node to place the first copy of data on.

You must set the partitioner type and assign the node a num_tokens value for each node. If not using virtual nodes (vnodes), use the initial_token setting instead.
Replica placement strategy : Cassandra stores copies (replicas) of data on multiple nodes to ensure reliability and fault tolerance. A replication strategy determines which nodes to place replicas on. The first replica of data is simply the first copy; it is not unique in any sense.

When you create a keyspace, you must define the replica placement strategy and the number of replicas you want.
Snitch : A snitch defines the topology information that the replication strategy uses to place replicas and route requests efficiently.

You need to configure a snitch when you create a cluster. The snitch is responsible for knowing the location of nodes within your network topology and distributing replicas by grouping machines into data centers and racks.
The cassandra.yaml file is the main configuration file for Cassandra. In this file, you set the initialization properties for a cluster, caching parameters for tables, properties for tuning and resource utilization, timeout settings, client connections, backups, and security.
Cassandra stores keyspace attributes in the system keyspace. You set keyspace or table attributes on a per-keyspace or per-table basis programmatically or using a client application, such as CQL.

By default, a node is configured to store the data it manages in the /var/lib/cassandra directory. In a production cluster deployment, you change the commitlog-directory to a different disk drive from the data_file_directories.

Related topics