Nearly every organisation in the world has two data centres for a given solution or region. In order for Quorums to work they require one or three plus DataCenters. This article covers the complexities of quorums running over two sites.

1. - Background, What is a Quorum

A Quorum is a computer science pattern for allowing HA across multiple nodes or sites while removing the risk of HA.

In a Quorum a node can only write to a disk (or elect a primary) if it can connect to more then 50% of the total nodes. This means that if a node loses connectivity with 50% or more other nodes it shuts itself down greatly reducing the risk of split brain.

The key advantage of Quorums is the reduction of data corruption caused by split brain.

1.1. - 3 Node Quorum Example

Node1 Node2 Node3 Quorum Intact
Up Up Up Yes
Down Up Up Yes
Down Down Up No

If you lose two nodes the quorum is lost.

1.2. - 4 Node Quorum Example

Node1 Node2 Node3 Node4 Quorum Intact
Up Up Up Up Yes
Down Up Up Up Yes
Down Down Up Up No

If you lose two nodes we still have lost quorum because we do not have more then 50% of the nodes available. Quorums are recommended to have an odd number of nodes as from an availability standpoint an even number has the same availability as one less.

2. - Active Active

There are three common anti patterns for Active Active deployments.

2.1 - Majority of nodes on one site

TLDR: This provide no additional availability over having a single site.

For this example I will use two nodes on site 1 and one on site 2. This is true regardless of the number of nodes as long as they are not equal and there are only two sites.

Site 1   Site 2 Quorum Intact Comment
Node1 Node2 Node3    
Up Up Up Yes Everything running as normal
Up Up Outage Yes Site 2 Outage
Outage Outage Up No Site 1 Outage

In the event of a site 2 outage everything continues as normal. However in the event of a site 1 outage the quorum is list and we are unable to write or elect a primary.

This means that site 1 is a single point of failure and this solution provide no availability advantage over a single site solution.

2.2 - Equal number of nodes on each site

TLDR: This provide LESS availability then having a single site.

For this example I will use one node on site 1 and one on site 2. This is true regardless of the number of nodes as long as they are equal and there are only two sites.

Site 1 Site 2 Quorum Intact Comment
Node1 Node2    
Up Up Yes Everything running as normal
Up Outage No Site 2 Outage
Outage Up No Site 1 Outage

Regardless if there is an outage on site one or site two the quorum is unavailable as there is no majority.

This means that there are two critical points of failure and this solution provide worse availability over a single site solution.

2.3 - Floating node

TLDR: This has a significant risk of data Corruption. If data corruption occurs then you must role back to your last good backup.

For this example I will use one static node on each site and a third node that is replicated between the sites. If the second site detects that the first site is having an outage then it will activate the third node in site 2 thus creating a majority.

This example brings in the concept of site connectivity. The site connectivity is the network connection linking the sites together.

Site 1   Site Connectivity Site 2   Quorum Intact Comment
Node1 Floating Node3   Node2 Floating Node3    
Up Up Up Up Standby Yes Everything running as normal
Up Up Up Outage Outage Yes Site 2 Outage
Outage Outage Up Up Up No Site 1 Outage
Up Up Down Up Up No Split Brain, Connectivity lost between sites.

With this pattern if we lose either Site 1 or Site 2 then the Quorum will be intact, allowing for RPO RTO.

However if a site connectivity issue occurs, site 2 will believe that site 1 is having an outage. This means that the floating VM will be activated on site 2. Once this is activated on site 2 There are two quorums both writing to disk. In the event of this scenario you have data corruption as the data sets will not trivially be merged as there is a very high chance of data conflicts.

This pattern carries the most risk.

3. How it should work - Three sites

Three plus sites is the preferred option for Quorums. There are two common patterns for three sites but not all solutions support both.

3.1. Active Active Active

Three DataCenters with a low latency connection. The latency requirement is dictated by the application requiring the Quorum. For Kubernetes this is 50ms round trip between each site.

Site1 Site2 Site3 Quorum Intact Comment
Node1 Node2 Node3    
Up Up Up Yes Business as Usual
Up Up Down Yes Site 3 Outage
Up Down Down No Site 3 and Site 2 Outage

This pattern allows for the quorum to fail after two outages. Greatly improving availability over the pattens in section 2.

3.2. - Arbiter Node

This Pattern is the same as 3.1 however node 3 is the arbiter node. In this pattern no data is sent to site3. Node 3 is only used to determine Quorum integrity.

Please note: This still requires three sites. If you put the Arbiter Node on to site 1 or site 2 then you have the same problems as those dictated in 2.

4. - Conclusion

From section two of this article I have shown the significant risks of having,

  • A majority on one site
  • An equal numbers on each site
  • Replicating a node between sites.

Given the solutions above I would recommend having a third site. If this is not possible I would evaluate Active Passive along side any above other solution. Though the cost is more, the risk is greatly reduced.