Data distribution and replication

How data is distributed and factors influencing replication.

In Cassandra, data distribution and replication go together because Cassandra is designed as a peer-to-peer system that makes copies of the data and distributes those copies among a group of nodes. Data is organized by table and identified by a primary key. The primary key, specifically its first component (the partition key), determines which node the data is stored on. Each copy of a row is called a replica; the first copy written is also a replica, not a special master copy.

When you create a cluster, you must specify the following:

  • Virtual nodes: assign data ownership to physical machines.
  • Partitioner: partitions the data across the cluster.
  • Replication strategy: determines the replicas for each row of data.
  • Snitch: defines the topology information that the replication strategy uses to place replicas.
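
To make the interplay of these pieces concrete, the following Python sketch models a toy token ring. It is an illustration only, not Cassandra's implementation: the function names (token_for, build_ring, replicas_for) are hypothetical, MD5 stands in for the partitioner's hash function, virtual-node tokens are derived from node names so the result is repeatable (a real cluster does not assign tokens this way), and replicas are chosen by walking the ring clockwise without consulting any topology, so no snitch is modeled.

```python
import bisect
import hashlib


def token_for(partition_key: str) -> int:
    """Map a partition key to a token. MD5 is a stand-in hash (assumption)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_ring(nodes, vnodes_per_node=8):
    """Give each node several tokens (virtual nodes); return the sorted ring."""
    ring = []
    for node in nodes:
        for i in range(vnodes_per_node):
            # Deterministic token per vnode, purely for illustration.
            ring.append((token_for(f"{node}#vnode{i}"), node))
    ring.sort()
    return ring


def replicas_for(ring, partition_key, replication_factor):
    """Walk clockwise from the key's token, collecting distinct nodes."""
    tokens = [token for token, _ in ring]
    start = bisect.bisect_left(tokens, token_for(partition_key))
    distinct_nodes = len({node for _, node in ring})
    wanted = min(replication_factor, distinct_nodes)
    replicas = []
    i = start
    while len(replicas) < wanted:
        node = ring[i % len(ring)][1]  # wrap around the end of the ring
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas


if __name__ == "__main__":
    ring = build_ring(["node1", "node2", "node3"])
    print(replicas_for(ring, "user:42", replication_factor=2))
```

In this model the partitioner is token_for, virtual nodes are the multiple ring entries per machine, and the replication strategy is the clockwise walk in replicas_for. A topology-aware strategy would additionally use rack and datacenter information from the snitch when deciding which nodes to skip.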