Configuring replication

How to set up DataStax Enterprise to store multiple copies of data on multiple nodes for reliability and fault tolerance.

Cassandra can store multiple copies of data on multiple nodes for reliability and fault tolerance. To configure replication, you must:
  • Configure gossip.

    Nodes communicate with each other about replication and other cluster state using the gossip protocol (see the cassandra.yaml sketch following this list).

  • Choose whether to use vnodes.

    Vnodes provide many tokens per node and simplify many tasks in Cassandra.

    Attention: DataStax Enterprise turns off virtual nodes (vnodes) by default. DataStax does not recommend turning on vnodes for DSE Hadoop or BYOH nodes. Before turning vnodes on for Hadoop, understand the implications. DataStax Enterprise does support turning on vnodes for Spark nodes.
  • Choose a data partitioner.

    Data partitioning determines how data is placed across the nodes in the cluster.

  • Choose a snitch.

    A snitch determines which data centers and racks are written to and read from.

  • Choose a replica placement strategy.

    A replication strategy determines the nodes where replicas are placed.

For information about how these components work, see Data distribution and replication.
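
Gossip and vnodes are configured through cluster settings in cassandra.yaml. The following excerpt is a minimal sketch only: the cluster name, addresses, and token count are placeholders, and num_tokens stays commented out on nodes that keep vnodes disabled.

# cassandra.yaml (excerpt)
cluster_name: 'MyCluster'        # must be identical on every node in the cluster
listen_address: 110.82.155.0     # address this node uses for gossip with other nodes
# num_tokens: 64                 # uncomment only on nodes that use vnodes
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "110.82.155.0,110.82.156.1"   # gossip contact points; use the same list on every node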

Partitioner settings 

You can use either Murmur3Partitioner or RandomPartitioner with virtual nodes.

The Murmur3Partitioner (org.apache.cassandra.dht.Murmur3Partitioner) is the default partitioning strategy for Cassandra clusters and is the right choice for new clusters in almost all cases. You cannot change the partitioner in an existing cluster.

The RandomPartitioner (org.apache.cassandra.dht.RandomPartitioner) was the default partitioner in Cassandra 1.2 and earlier. You can continue to use this partitioner when migrating to virtual nodes.
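
The partitioner is set with the partitioner property in cassandra.yaml. For example, the default setting selects Murmur3Partitioner; only clusters that already use RandomPartitioner keep the alternative line instead:

# cassandra.yaml (excerpt)
partitioner: org.apache.cassandra.dht.Murmur3Partitioner
# partitioner: org.apache.cassandra.dht.RandomPartitioner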

Snitch settings 

A snitch determines which data centers and racks are written to and read from. It informs Cassandra about the network topology so that requests are routed efficiently and allows Cassandra to distribute replicas by grouping machines into data centers and racks. All nodes must have exactly the same snitch configuration. You set the snitch in the endpoint_snitch property in the cassandra.yaml file.
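
For example, a cluster that uses the GossipingPropertyFileSnitch described below sets the following in cassandra.yaml:

endpoint_snitch: GossipingPropertyFileSnitch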

The following sections describe commonly used snitches. All available snitches are described in the Cassandra documentation.

DseSimpleSnitch 

Use DseSimpleSnitch only for development in DataStax Enterprise deployments. This snitch logically configures each type of node in separate data centers to segregate the analytics, real-time, and search workloads.

When defining your keyspace, use Analytics, Cassandra, or Search for your data center names.
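
For example, a development keyspace on a cluster that runs all three workloads might be defined as follows; the keyspace name and replica counts are illustrative only:

CREATE KEYSPACE mykeyspace
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'Analytics' : 1, 'Cassandra' : 1, 'Search' : 1 };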

Note: Do not use SimpleSnitch with DataStax Enterprise nodes.

GossipingPropertyFileSnitch

The GossipingPropertyFileSnitch defines a local node's data center and rack; it uses gossip for propagating this information to other nodes. The cassandra-rackdc.properties file defines the default data center and rack that are used by this snitch:
dc=DC1
rack=RAC1
The default location of the cassandra-rackdc.properties file depends on the type of installation:
Installer-Services and Package installations: /etc/dse/cassandra/cassandra-rackdc.properties
Installer-No Services and Tarball installations: install_location/resources/cassandra/conf/cassandra-rackdc.properties
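
For example, on a node that belongs to a second data center, you would edit that node's cassandra-rackdc.properties to reflect its location; the names are illustrative and must match the data center names used in your keyspaces:

dc=DC2
rack=RAC1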

PropertyFileSnitch 

The PropertyFileSnitch lets you name your data centers and racks whatever you want. Using this snitch requires that you define network details for each node in the cluster in the cassandra-topology.properties configuration file.

The default location of the cassandra-topology.properties file depends on the type of installation:
Installer-Services and Package installations: /etc/dse/cassandra/cassandra-topology.properties
Installer-No Services and Tarball installations: install_location/resources/cassandra/conf/cassandra-topology.properties

Every node in the cluster should be described in this file, and the file must be exactly the same on every node in the cluster.

For example, suppose the cluster has non-uniform IPs and two physical data centers with two racks in each, plus a third logical data center for replicating analytics data. You would specify the nodes as follows:

# Data Center One

175.56.12.105=DC1:RAC1
175.50.13.200=DC1:RAC1
175.54.35.197=DC1:RAC1

120.53.24.101=DC1:RAC2
120.55.16.200=DC1:RAC2
120.57.102.103=DC1:RAC2

# Data Center Two

110.56.12.120=DC2:RAC1
110.50.13.201=DC2:RAC1
110.54.35.184=DC2:RAC1

50.33.23.120=DC2:RAC2
50.45.14.220=DC2:RAC2
50.17.10.203=DC2:RAC2

# Analytics Replication Group

172.106.12.120=DC3:RAC1
172.106.12.121=DC3:RAC1
172.106.12.122=DC3:RAC1

# default for unknown nodes 
default=DC3:RAC1

Make sure the data center names defined in the cassandra-topology.properties file match the data center names in your keyspace definitions.
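
For example, a keyspace replicated across the topology above would use the same data center names; the keyspace name and replica counts are illustrative only:

CREATE KEYSPACE mykeyspace
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'DC1' : 3, 'DC2' : 3, 'DC3' : 1 };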

Choosing keyspace replication options 

When you create a keyspace, you must define the replica placement strategy class and the number of replicas. DataStax recommends choosing NetworkTopologyStrategy for both single and multiple data center clusters. This strategy is as easy to use as SimpleStrategy and allows for expansion to multiple data centers in the future. It is easier to configure the most flexible replication strategy when you create a keyspace than to reconfigure replication after data has been loaded into your cluster.

NetworkTopologyStrategy takes as options the number of replicas you want per data center. Even for single data center clusters, you can use this replica placement strategy and just define the number of replicas for one data center. For example:

CREATE KEYSPACE test
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'us-east' : 6 };

For a single node cluster, use the default data center name, for example, Cassandra, Solr, or Analytics:

CREATE KEYSPACE test
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'Analytics' : 1 };

To define the number of replicas for a multiple data center cluster:

CREATE KEYSPACE test2
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 3, 'dc2' : 3 };

What you name your data centers when creating a keyspace depends on the snitch you have chosen for your cluster. The data center names must match the names used by the snitch so that replicas are placed in the correct locations.

As a general rule, the number of replicas should not exceed the number of nodes in a replication group. However, you can increase the number of replicas and then add the desired number of nodes afterwards, as shown below. When the replication factor exceeds the number of nodes, writes are rejected, but reads are served as long as the desired consistency level can be met. The default consistency level is QUORUM.
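
For example, to raise the replication factor of the test2 keyspace shown above before adding nodes to each data center:

ALTER KEYSPACE test2
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'dc1' : 4, 'dc2' : 4 };

After changing the replication factor of an existing keyspace, run nodetool repair on the affected nodes so that the existing data is copied to the new replicas.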

To avoid DSE Hadoop operational problems, change the replication factor of these system keyspaces:

  • cfs
  • cfs_archive
  • HiveMetaStore
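
For example, assuming the analytics data center is named Analytics and a replication factor of 3 suits your cluster (adjust both to your environment), alter the keyspaces as follows and then run nodetool repair on the affected nodes:

ALTER KEYSPACE cfs
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'Analytics' : 3 };
ALTER KEYSPACE cfs_archive
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'Analytics' : 3 };
ALTER KEYSPACE "HiveMetaStore"
    WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'Analytics' : 3 };
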
The location of the cassandra.yaml file depends on the type of installation:
Package installations: /etc/dse/cassandra/cassandra.yaml
Tarball installations: install_location/resources/cassandra/conf/cassandra.yaml