Implementing Replica Set Priorities

Replica set priorities will, very shortly, be allowed to vary between 0.0 and 100.0. The member with the highest priority that can reach a majority of the set will be elected master. (The change is done and works, but is being held up by 1.8.0… look for it after that release.) Implementing priorities was kind of an interesting problem, so I thought people might be interested in how it works. Following in the grand distributed system lit tradition I’m using the island nation of Replicos to demonstrate.

Replicos is a nation of islands that elect a primary island, called the Grand Poobah, to lead them. Each island cast a vote (or votes) and the island that gets a majority of the votes wins poobahship. If no one gets a majority (out of the total number of votes), no one becomes Grand Poobah. The islands can only communicate via carrier seagull.

Healthy configurations of islands, completely connected via carrier seagulls.

However, due to a perpetual war with the neighboring islands of Entropos, seagulls are often shot down mid-flight, distrupting communications between the Replicos islands until new seagulls can be trained.

The people of Replicos have realized that some islands are better at Poobah-ing than others. Their goal is to elect the island with the highest Poobah-ing ability that can reach a majority of the other islands. If all of the seagulls can make it to their destinations and back, electing a Poobah becomes trivial: an island sends a message saying they want to be Grand Poobah and everyone votes for them or says “I’m better at Poobah-ing, I should be Poobah.” However, it becomes tricky when you throw the Entropos Navy into the mix.

So, let’s Entropos has shot down a bunch of seagulls, leaving us with only three seagulls:

The island with .5 Poobah ability should be elected leader (the island with 1 Poobah ability can’t reach a majority of the set). But how can .5 know that it should be Poobah? It knows 1.0 exists, so theoretically it could ask the islands it can reach to ask 1.0 if it wants to be Poobah, but it’s a pain to pass messages through multiple islands (takes longer, more chances of failure, a lot more edge cases to check), so we’d like to be able to elect a Poobah using only the directly reachable islands, if possible.

One possibility might be for the islands sent a response indicating if they were connected to an island with a higher Poobah ability. In the case above, this would work (only one island is connected to an island with higher Poobah ability, so it can’t have a majority), but what about this case:

Every island, other than .5, is connected to a 1.0, but .5 should be the one elected! So, suppose we throw in a bit more information (which island of higher priority can be reached) and let the island trying to elect itself figure things out? Well, that doesn’t quite work, what if both .5 and 1.0 can reach a majority, but not the same one?

Conclusion: the Poobah-elect can’t figure this out on their own, everyone needs to work together.

Preliminaries: define an island to be Poohable if it has any aptitude for Poobah-ing and can reach a majority of the set. An island is not Poohable if it has no aptitude for Poobah-ing and/or cannot reach a majority of the set. Islands can be more or less Poohable, depending on their aptitude for Poobah-ing.

Every node knows whether or not it, itself, is Poohable: it knows its leadership abilities and if it can reach a majority of the islands. If more than one island (say islands A and B) is Poohable, then there must be at least one island that can reach both A and B [Proof at the bottom].

Let’s have each island keep a list of “possible Poobahs.” So, say we have an island A, that starts out with an empty list. If A is Poohable, it’ll add itself to the list (if it stops being Poohable, it’ll remove itself from the list). Now, whenever A communicates with another island, the other island will either say “add me to your list” or “remove me from your list,” depending on whether it is currently Poohable or not. Every other island does the same, so now each island has a list of the Poohable islands it can reach.

Now, say island X tries to elect itself master. It contacts all of the islands it can reach for votes. Each of the voting islands checks its list: if it has an entry on it that is more Poohable than X, it’ll send a veto. Otherwise X can be elected master. If you check the situations above (and any other situation) you can see that Poohability works, due to the strength of the guarantee that a Poobah must be able to reach a majority of the set.

Proof: suppose a replica set has n members and a node A can reach a majority of the set (at least ⌊n/2+1⌋) and a node B can reach a majority of the set (again, ⌊n/2+1⌋). If the sets of members A and B can reach are disjoint, then there must be ⌊n/2+1⌋+⌊n/2+1⌋ = at least n+1 members in the set. Therefore the set of nodes that A can reach and the set of nodes that B can reach are not disjoint.

kristina chodorow's blog