Internode communications (gossip)
DataStax Enterprise 5.1 uses a protocol called gossip to discover location and state information about the other nodes participating in a cluster.
What is gossip?
Gossip is a peer-to-peer communication protocol in which nodes periodically exchange state information about themselves and about other nodes they know about. The gossip process runs every second and exchanges state messages with up to three other nodes in the cluster. The nodes exchange information about themselves and about the other nodes that they have gossiped about, so all nodes quickly learn about all other nodes in the cluster. A gossip message has a version, so that during a gossip exchange, older information is overwritten with the most current state for a particular node.
Preventing problems in gossip communications
To prevent problems in gossip communications, be sure to use the same list of seed nodes for all nodes in a cluster. Setting the seeds the same on all nodes most critical the first time a node starts up. By default, a node remembers other nodes it has gossiped with between subsequent restarts. The seed node designation has no purpose other than bootstrapping the gossip process for new nodes joining the cluster. Seed nodes are not a single point of failure, nor do they have any other special purpose in cluster operations beyond the bootstrapping of nodes.
Making every node a seed node is not recommended because of increased maintenance and reduced gossip performance. Gossip optimization is not critical, but it is recommended to use a small seed list (approximately three nodes per datacenter). |
About failure detection and recovery
Failure detection is a method for locally determining from gossip state and history when a node in the system is down or has come back up. The DSE database uses this information to avoid routing client requests to unreachable nodes whenever possible. (The database can also avoid routing to poorly performing nodes, through the dynamic snitch.)
The gossip process tracks state from other nodes both directly (nodes gossiping directly to it) and indirectly (nodes communicated about secondhand, third-hand, and so on). Rather than using a fixed threshold for marking failing nodes, the database uses an accrual detection mechanism to calculate a per-node threshold. The threshold takes into account network performance, workload, and historical conditions. During gossip exchanges, every node maintains a sliding window of inter-arrival times of gossip messages from other nodes in the cluster.
To adjust the sensitivity of the failure detector, configure the phi_convict_threshold
property in cassandra.yaml.
Lower values increase the likelihood that an unresponsive node will be marked as down.
Use the default value for most situations, but increase it to 10 or 12 for Amazon EC2 (due to frequently encountered network congestion).
In unstable network environments (EC2 at times), raising the value to 10 or 12 helps prevent false failures.
Values higher than 12 and lower than 5 are not recommended.
Node failures can result from various causes such as hardware failures and network outages. Node outages are often transient but can last for extended periods. Because a node outage rarely signifies a permanent departure from the cluster, it does not automatically result in permanent removal of the node from the ring. Other nodes will periodically try to re-establish contact with failed nodes to see if they are back up. To permanently change a node’s membership in a cluster, you must explicitly add or remove nodes from a cluster.
When a node comes back online after an outage, it may have missed writes for the replica data it maintains.
Repair mechanisms exist to recover missed data, such as hinted handoffs and manual repair with nodetool repair
.
The length of the outage determines which repair mechanism is used to make the data consistent.