ZooKeeper Leader election source code analysis: the core election method lookForLeader()

Posted by degsy on Sun, 16 Jan 2022 22:11:51 +0100

Business logic analysis of election core method lookForLeader()

Now that we know the role of the important classes and member variables involved in the election, let's analyze lookForLeader(), the method that actually executes the election logic:

1) Preparations for the election
2) Vote for yourself as the initial leader
3) Exchange votes in a loop until a Leader is elected. Within the loop there are three cases, depending on the state of the server that sent the received vote:
3.1) the sender's state is LOOKING:
3.1.1) decide which candidate is more suitable to be the leader
3.1.2) judge whether the current round of the election can end
3.2) the sender's state is OBSERVING:
3.3) the sender's state is FOLLOWING/LEADING:

As long as the leader has successfully sent a message out, the cluster as a whole will not lose it, because the leader election chooses the server with the largest zxid. If the leader crashes before sending it, the message is lost.

1) Preparations for the election

public Vote lookForLeader() throws InterruptedException {
	// -----------------------1 pre election initialization---------------------
    try {
        // Java Management eXtensions, distributed application monitoring technology provided by Oracle
        self.jmxLeaderElectionBean = new LeaderElectionBean();
        MBeanRegistry.getInstance().register(
                self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
    } catch (Exception e) {
        LOG.warn("Failed to register with JMX", e);
        self.jmxLeaderElectionBean = null;
    }

    if (self.start_fle == 0) {
        // Start time of this election (recorded only if it has not been set yet)
        self.start_fle = Time.currentElapsedTime();
    }
    try {
        // recvset, receive set, used to store external ballots. One entry represents one vote
        // key is the serverid of the voter and value is the vote
        // This collection is equivalent to a ballot box
        HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();
        // outofelection, out of election
        // Illegal ballots are stored, that is, the voter's status is not looking
        HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();
        // notTimeout,notification Timeout
        int notTimeout = finalizeWait;

        // -----------------------2 cast yourself as the initial leader---------------------
        synchronized(this){
        ...

self.start_fle = Time.currentElapsedTime();

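Why not System.currentTimeMillis()?

System.currentTimeMillis() reads the wall clock, which can be changed (manually or by clock synchronization) and is therefore unsafe for measuring durations. Time.currentElapsedTime() also returns milliseconds, but it is derived from System.nanoTime(), which is monotonic.

The value is relative to an arbitrary origin inside the virtual machine, so it is only meaningful as the difference between two readings and is not affected by changes to the system time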
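
A minimal sketch of measuring elapsed time with a monotonic clock, assuming the helper used by ZooKeeper is built on System.nanoTime(). The class MonotonicClock and its method names are made up for illustration and are not part of ZooKeeper:

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper (not ZooKeeper's own class): elapsed milliseconds from a monotonic source.
final class MonotonicClock {
    // System.nanoTime() has an arbitrary origin, so a single reading means nothing;
    // only the difference between two readings is useful. It does not jump when
    // the wall clock is changed.
    static long currentElapsedMillis() {
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime());
    }

    // Milliseconds elapsed since an earlier reading of currentElapsedMillis().
    static long elapsedSince(long startMillis) {
        return currentElapsedMillis() - startMillis;
    }

    public static void main(String[] args) throws InterruptedException {
        long start = currentElapsedMillis();
        Thread.sleep(50);
        System.out.println("election has been running for about " + elapsedSince(start) + " ms");
    }
}
```

Had the start time been taken from System.currentTimeMillis(), an operator or a time synchronization daemon moving the clock backwards could make the measured duration negative.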
HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

receive set is used to store votes from outside. An entry represents a vote
key is the serverid of the voter and value is the Vote
This collection is equivalent to a ballot box, which records the voting results of other nodes in the cluster
HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

outofelection: out of election
It stores ballots that do not count toward the current round: those sent by Servers that have already left the election, that is, voters whose state is not LOOKING
int notTimeout = finalizeWait;

notification Timeout, 200 ms
The time allowed to wait for a reply after a ballot is issued

2) Vote for yourself as the initial leader

 // ----------------------2 vote for yourself as the initial leader----------------
            // notTimeout,notification timeout
            int notTimeout = finalizeWait;

            synchronized(this){
                // Logic clock increment one
                logicalclock.incrementAndGet();
                // Update proposal (update your vote)
                // getInitId(): returns the id of the current server
                // getInitLastLoggedZxid(): returns the maximum zxid (the last zxid) of the current server
                // getPeerEpoch(): returns the epoch of the current Server
                // Vote for yourself as the initial leader by updating the proposal
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            // Send notification to queue
            sendNotifications();

logicalclock.incrementAndGet();
Increments the logical clock by one

The logical clock can be understood as the number of the election round (similar to the 18th and 19th National People's Congress in reality). The value starts at 0 and increases. Within the same election, all nodes normally hold the same value, but there are exceptions. For example, node A crashes during the 18th election and the other nodes complete the Leader election. Before long, that Leader crashes as well and the 19th Leader election begins. At the same time, node A recovers and joins the election. The logicalclock of node A is then 18 while the logicalclock of the other nodes is 19. In this case, node A sets its logicalclock directly to 19 and takes part in the 19th Leader election.
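
The catch-up rule described above can be sketched roughly as follows. This is a simplified model, not the real FastLeaderElection code: the class ElectionRound, its Action enum and the method onNotification are hypothetical names used only for illustration.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical model of one node's election round counter.
final class ElectionRound {
    enum Action { ADOPT_NEWER_ROUND, IGNORE_STALE, SAME_ROUND }

    private final AtomicLong logicalClock;

    ElectionRound(long initialRound) {
        logicalClock = new AtomicLong(initialRound);
    }

    long current() {
        return logicalClock.get();
    }

    // Compare the round carried by an incoming notification with our own round.
    Action onNotification(long senderRound) {
        long mine = logicalClock.get();
        if (senderRound > mine) {
            // We are behind: jump straight to the sender's round.
            logicalClock.set(senderRound);
            return Action.ADOPT_NEWER_ROUND;
        } else if (senderRound < mine) {
            // The sender is behind: its vote is of no use for our round.
            return Action.IGNORE_STALE;
        }
        return Action.SAME_ROUND;
    }

    public static void main(String[] args) {
        ElectionRound nodeA = new ElectionRound(18);   // node A recovered in round 18
        System.out.println(nodeA.onNotification(19) + ", clock=" + nodeA.current());
    }
}
```

In the real method, adopting a newer round also empties the ballot box and re-broadcasts the vote; the sketch only models the clock.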

updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());

Updates the current server's proposal so that it recommends the current server itself. Note that this call and logicalclock.incrementAndGet() sit in the same synchronized block, so together they form one atomic operation
synchronized void updateProposal(long leader, long zxid, long epoch){
    if(LOG.isDebugEnabled()){
        LOG.debug("Updating proposal: " + leader + " (newleader), 0x"
                + Long.toHexString(zxid) + " (newzxid), " + proposedLeader
                + " (oldleader), 0x" + Long.toHexString(proposedZxid) + " (oldzxid)");
    }
    // Update the recommendation information of the current server
    // As mentioned in the previous chapter, these three fields are member variables that record the Leader information recommended by the current Server
    proposedLeader = leader;
    proposedZxid = zxid;
    proposedEpoch = epoch;
}

getInitId(): get the id of the current server

private long getInitId(){
	//Judge whether it is a participant. Only the Server with the right to vote is a participant during the election, otherwise it is an OBSERVER observer
	//If it is a participant, the ServerId of the current Server is returned
    if(self.getLearnerType() == LearnerType.PARTICIPANT)
        return self.getId();
    else return Long.MIN_VALUE;
}

public enum LearnerType {
    PARTICIPANT, OBSERVER;
}

Checks whether the current server is a participant, that is, whether it has the right to vote; observers are excluded because they do not
A Server with the right to vote is called a Follower once a Leader exists; during an election it is called a Participant

getInitLastLoggedZxid(): get the last (and largest) zxid of the current server, that is, the transaction Id

private long getInitLastLoggedZxid(){
	//Similarly, judge whether it has the right to vote
    if(self.getLearnerType() == LearnerType.PARTICIPANT)
        return self.getLastLoggedZxid();
    else return Long.MIN_VALUE;
}

getPeerEpoch(): get the epoch of the current server

private long getPeerEpoch(){
    if(self.getLearnerType() == LearnerType.PARTICIPANT)
    	try {
    		return self.getCurrentEpoch();
    	} catch(IOException e) {
    		RuntimeException re = new RuntimeException(e.getMessage());
    		re.setStackTrace(e.getStackTrace());
    		throw re;
    	}
    else return Long.MIN_VALUE;
}

sendNotifications();

Sends the updated Leader recommendation (the updated information is written into a send queue. The actual sending logic is not here; as mentioned in the previous chapter, dedicated threads handle it)

/**
 * Send notifications to all peers upon a change in our vote
 */
private void sendNotifications() {
    // Iterate over all servers with voting rights
    for (QuorumServer server : self.getVotingView().values()) {
        long sid = server.id;

        // notmsg,notification msg
        ToSend notmsg = new ToSend(ToSend.mType.notification,//Message type
                proposedLeader,//ServerId (myid) of the recommended Leader
                proposedZxid,//Recommended Leader zxid
                logicalclock.get(),//The logical clock of this election
                QuorumPeer.ServerState.LOOKING,//Status of the current Server
                sid,    // server id of the recipient
                proposedEpoch);//Recommended Leader's epoch
        if(LOG.isDebugEnabled()){
            LOG.debug("Sending Notification: " + proposedLeader + " (n.leader), 0x"  +
                  Long.toHexString(proposedZxid) + " (n.zxid), 0x" + Long.toHexString(logicalclock.get())  +
                  " (n.round), " + sid + " (recipient), " + self.getId() +
                  " (myid), 0x" + Long.toHexString(proposedEpoch) + " (n.peerEpoch)");
        }
        //Put in send queue
        sendqueue.offer(notmsg);
    }
}

What is being traversed?
self.getVotingView().values() returns all servers with the right to vote and stand for election

public Map<Long,QuorumPeer.QuorumServer> getVotingView() {
    return QuorumPeer.viewToVotingView(getView());
}

/**
* A 'view' is a node's current opinion of the membership of the entire ensemble,
* that is, how a node (Server) currently sees the members of the whole system.
*/
// Get all servers in zk cluster (including participant and observer)
public Map<Long,QuorumPeer.QuorumServer> getView() {
   return Collections.unmodifiableMap(this.quorumPeers);
}

static Map<Long,QuorumPeer.QuorumServer> viewToVotingView(Map<Long,QuorumPeer.QuorumServer> view) {
    Map<Long,QuorumPeer.QuorumServer> ret = new HashMap<Long, QuorumPeer.QuorumServer>();
    // Exclude observers and keep only participants, that is, servers with voting rights
    for (QuorumServer server : view.values()) {
        if (server.type == LearnerType.PARTICIPANT) {
            ret.put(server.id, server);
        }
    }
    return ret;
}

ToSend notmsg = new ToSend(...)

notmsg is short for notification msg, a notification message. The recommended Leader information is wrapped in a ToSend object and put into the send queue; a dedicated thread sends the message
sid is the server id of the message receiver

3) Exchange votes in a loop until the Leader is elected

After voting for itself as the initial leader, the server loops over the vote notifications it receives:

// -----------------------3. Cycle the exchange of votes until the Leader is elected---------------------
/*
 * Loop in which we exchange notifications until we find a leader
 * Cycle through notifications until the Leader is found
 */

while ((self.getPeerState() == ServerState.LOOKING) && (!stop)){
    /*
     * Remove next notification from queue, times out after 2 times
     * the termination time
     */
    // recvqueue, receive queue, which stores all received external notifications
    // There are special threads to process and receive notifications from other servers, parse and encapsulate the received information into notifications and put them into the recvqueue queue
    Notification n = recvqueue.poll(notTimeout,TimeUnit.MILLISECONDS);

    /*
     * Sends more notifications if haven't received enough.
     * Otherwise processes new notification.
     */
    if(n == null){
        if(manager.haveDelivered()){
            // Resend our vote so that the other servers reply again
            sendNotifications();
        } else {
            // Reconnect each server in the zk cluster
            manager.connectAll();
        }

        /*
         * Exponential backoff
         */
        int tmpTimeOut = notTimeout*2;
        notTimeout = (tmpTimeOut < maxNotificationInterval?
                tmpTimeOut : maxNotificationInterval);
        LOG.info("Notification time out: " + notTimeout);
    }
    else if(validVoter(n.sid) && validVoter(n.leader)) {
	   	//validVoter(n.sid): verify the ServerId of the sender
	    //validVoter(n.leader): verify the ServerId of the recommended leader in the current notification
	...

while ((self.getPeerState() == ServerState.LOOKING) && (!stop)){

Exchanges notifications in a loop until the Leader is found (once the Leader is found, the state is no longer LOOKING)
Notification n = recvqueue.poll(notTimeout,TimeUnit.MILLISECONDS);

receive queue, which stores all received external notifications
There are special threads to process and receive notifications from other servers, parse and encapsulate the received information into notifications and put them into the recvqueue queue

The notification taken from the recvqueue may be null

When is it null?
Suppose 8 votes were broadcast but, because of network problems, only 3 replies arrived; the fourth poll then returns null
Even if all 8 replies arrived and were consumed, the election may still not be over, and the next poll returns null as well
In short, this branch ensures that while the election is unfinished we keep receiving votes from other servers and keep processing them until the Leader is elected
if(manager.haveDelivered()) { // in short, this method checks whether the current Server has lost contact with the cluster; false means contact is lost

manager: a QuorumCnxManager, the connection manager, which maintains the TCP connections between servers
haveDelivered: checks whether messages have been delivered. Despite the name, the code below returns true as soon as any one send queue is empty; it does not require all queues to be empty.

boolean haveDelivered() {
    for (ArrayBlockingQueue<ByteBuffer> queue : queueSendMap.values()) {
        LOG.debug("Queue size: " + queue.size());
        if (queue.size() == 0) {
            return true;
        }
    }

    return false;
}

queueSendMap is the previously mentioned Map maintained by the connection manager; it holds one send queue per remote Server
As soon as one queue is empty the method returns true and the remaining queues are not checked. As mentioned earlier, one empty queue is enough to show that the connection between the current Server and the zk cluster is fine
Only when every queue is non-empty does it mean that the current Server has lost contact with the zk cluster
sendNotifications();

If manager.haveDelivered() returns true, the connection between the current Server and the cluster is fine, so the current Server resends the vote notification for its recommended Leader in order to receive replies from the other servers again
manager.connectAll();

If manager.haveDelivered() returns false, the current Server has lost contact with the cluster, so it reconnects to every server in the zk cluster

public void connectAll(){
    long sid;
    for(Enumeration<Long> en = queueSendMap.keys();
        en.hasMoreElements();){
        sid = en.nextElement();
        connectOne(sid);
    }      
}

Why is it not necessary to resend the notification after reconnecting?
Because although contact was lost, the messages are still in the send queues and will be sent once the connection is re-established. In addition, any other server whose recvqueue.poll() returns null and which has not lost contact with the cluster will call sendNotifications() again, so resending is not needed here.

int tmpTimeOut = notTimeout*2;
notTimeout = (tmpTimeOut < maxNotificationInterval?tmpTimeOut : maxNotificationInterval);

After resending the notification or reconnecting the cluster, double the notification timeout. If the maximum notification time is exceeded, set the timeout to the maximum time
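
The doubling-with-a-cap rule can be tried out in isolation. This is only a sketch: NotificationBackoff and next are hypothetical names, and the values 200 ms and 60000 ms in main are assumed here for the initial timeout and the maximum interval.

```java
// Hypothetical helper mirroring the "double, but never exceed the maximum" rule.
final class NotificationBackoff {
    // Returns the next timeout: twice the current one, capped at maxIntervalMs.
    static int next(int currentTimeoutMs, int maxIntervalMs) {
        int doubled = currentTimeoutMs * 2;
        return Math.min(doubled, maxIntervalMs);
    }

    public static void main(String[] args) {
        int timeout = 200;        // assumed initial timeout in ms
        int max = 60000;          // assumed maximum interval in ms
        for (int i = 0; i < 10; i++) {
            System.out.println("poll timeout: " + timeout + " ms");
            timeout = next(timeout, max);
        }
    }
}
```

Doubling keeps an isolated server from flooding its peers with notifications, while the cap guarantees that it still retries at a bounded interval.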
else if(validVoter(n.sid) && validVoter(n.leader)) {

If the vote notification taken from the recvqueue is not null, the sender of the vote and the leader it recommends are both validated before processing continues

// Verify whether the specified server is legal
private boolean validVoter(long sid) {
	//That is to judge whether they have the right to vote and stand for election
    return self.getVotingView().containsKey(sid);
}

3.1) the sender's state is LOOKING:
3.1.1) decide which candidate is more suitable to be the leader

...
while ((self.getPeerState() == ServerState.LOOKING) && (!stop)){
	...
    if(n == null){...}
    else if(validVoter(n.sid) && validVoter(n.leader)) {
        switch (n.state) {
            case LOOKING:
            	// 3.1.1) verify who is more suitable to be a leader than others
                // If notification > current, replace and send messages out
                // n.electionEpoch: the logical clock of the election where the external notification is located
                // logicalclock.get(): get the logical clock of the current server
                // Deal with the current election obsolescence: empty the ballot box and update the logical clock
                if (n.electionEpoch > logicalclock.get()) {
                    // Update the logical clock of the election where the current server is located
                    logicalclock.set(n.electionEpoch);
                    // Empty the ticket box
                    recvset.clear();
                    // Judge the current server and n who is more suitable to be a leader, no matter who is more suitable,
                    // You need to update the recommendation information of the current server, and then broadcast it
                    if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                        getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                        updateProposal(n.leader, n.zxid, n.peerEpoch);
                    } else {
                        updateProposal(getInitId(),
                            getInitLastLoggedZxid(),
                            getPeerEpoch());
                    }
                    sendNotifications();
                    // Deal with the situation that n is outdated: n is of no use to the current election and is directly discarded
                } else if (n.electionEpoch < logicalclock.get()) {
                    if(LOG.isDebugEnabled()){
                        LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                            + Long.toHexString(n.electionEpoch)
                            + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                    }
                    break;
                    // Handle the case where n.electionEpoch and logicalclock.get() are equal
                    // totalOrderPredicate() decides whether the incoming n or the leader recommended by the current server
                    // is more suitable to be the new leader
                } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                    proposedLeader, proposedZxid, proposedEpoch)) {
                    // Update the recommendation information of the current server
                    updateProposal(n.leader, n.zxid, n.peerEpoch);
                    // Broadcast out
                    sendNotifications();
                }

                if(LOG.isDebugEnabled()){
                    LOG.debug("Adding vote: from=" + n.sid +
                        ", proposed leader=" + n.leader +
                        ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                        ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                }
                // Encapsulate the external n notice as a ballot and put it into the "ballot box"
                recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
				// n.sid: ServerId of the notification sender
				// n.leader, n.zxid, n.peerEpoch: information about the recommended leader
				// n.electionEpoch: the logical clock of the election where the external notification is located
                // -----------------------3.1.2) judge whether the current round of elections can be ended---------------------
                if (termPredicate(recvset,
                    new Vote(proposedLeader, proposedZxid,
                            logicalclock.get(), proposedEpoch))) {
                ...

n.electionEpoch: the logical clock of the election where the external notification is located
logicalclock.get(): get the logical clock of the current server election

Normally every Server has the same electionEpoch during an election, because they are all in the same round; each Server reaches that value by incrementing its own logical clock, not by synchronizing with the others. There are exceptions. For example, node A crashes during the 18th election and the other nodes complete the Leader election. Not long after, the Leader crashes as well, so the 19th Leader election begins. At the same time, node A recovers and joins the election. The logicalclock of node A is then 18 and the logicalclock of the other nodes is 19. In this case, node A sets its logicalclock directly to 19 and takes part in the 19th Leader election.

At this point the logical clock of the election that the ballot belongs to must be compared with the logical clock of the current Server's election. Comparing n.electionEpoch with logicalclock.get() gives three cases:

When is the incoming ballot's epoch larger or smaller?
For example, five machines have elected a Leader; two of them have been notified of the result and the other two have not. If the newly elected Leader crashes at this moment, the two notified machines start a new election and increment their logical clock (their election epoch) by one, while the epoch of the two machines that were never notified stays unchanged
From the point of view of a Server that was not notified, a reply from a notified Server carries a larger epoch
From the point of view of a notified Server, a notification from a non-notified Server carries a smaller epoch than its own

if (n.electionEpoch > logicalclock.get()) {...}

Handles the case where n.electionEpoch is greater than logicalclock.get() (the incoming ballot's epoch is larger)
The current server is out of date and whatever it has voted for so far is meaningless, so it does the following:
logicalclock.set(n.electionEpoch): updates the logical clock of the election the current server is in
recvset.clear(): empties the ballot box. The votes collected before are outdated and meaningless.
totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch()): decides whether the incoming n or the current server itself is more suitable to be the new leader (note: the comparison is with the current server itself, not with the leader it currently recommends)
updateProposal(...): updates the current server's proposal to the more suitable candidate
sendNotifications(): broadcasts the updated vote
else if (n.electionEpoch < logicalclock.get()) {...}

Handles the case where n.electionEpoch is less than logicalclock.get() (the incoming ballot's epoch is smaller)
The incoming ballot is outdated and meaningless, so nothing is done with it: break out of the switch, re-enter the loop, take the next notification from the recvqueue and continue processing
else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {...}

Handles the case where n.electionEpoch and logicalclock.get() are equal, that is, both are in the same round of the election
totalOrderPredicate(...): decides whether the incoming n is more suitable to be the new leader than the leader currently recommended by the current server. It returns true if n (the incoming one) is more suitable
If true is returned, the following methods are executed:
updateProposal(): updates the proposal of the current server
sendNotifications(): broadcasts it
After the cases above, if there was no break (the logical clock of the incoming ballot is larger or equal), the incoming ballot is valid and is put into the ballot box:

recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

Wraps the incoming notification n in a Vote and puts it into the "ballot box"

Special case: when the current Server receives a notification and finds that the leader it recommends is more suitable, it updates its own proposal and broadcasts again. The recvqueue then receives replies to this new broadcast in addition to the replies to the first one, so another Server may end up replying twice. This has no effect on the local Server, because the ballot box recvset is a Map keyed by the ServerId of the sender: only one vote is recorded per Server, and a newer one overwrites the older one
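
The overwrite behaviour of the ballot box is easy to check with a plain HashMap. In this sketch the vote is reduced to the id of the recommended leader, and the class BallotBox with its methods is hypothetical, not ZooKeeper code:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified ballot box: one entry per voter.
final class BallotBox {
    // key: server id of the voter; value: id of the leader that voter currently recommends
    private final Map<Long, Long> votes = new HashMap<>();

    // Record a vote; a later vote from the same voter replaces the earlier one.
    void record(long voterSid, long recommendedLeader) {
        votes.put(voterSid, recommendedLeader);
    }

    int size() {
        return votes.size();
    }

    Long voteOf(long voterSid) {
        return votes.get(voterSid);
    }

    public static void main(String[] args) {
        BallotBox box = new BallotBox();
        box.record(2L, 2L);   // server 2 first votes for itself
        box.record(2L, 3L);   // then changes its mind and votes for server 3
        System.out.println("entries=" + box.size() + ", server 2 votes for " + box.voteOf(2L));
    }
}
```

Because the key is the sender's id, a Server that replies twice can never be counted twice when the votes are tallied.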

Next, step 3.1.2 checks whether this round of the election can end. At the very start of an election this check fails, because the election cannot end until more than half of the ballots agree, so that branch is skipped: the code breaks out of the switch, loops back to the beginning, takes the next notification from the recvqueue and continues processing

totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,proposedLeader, proposedZxid, proposedEpoch)

totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,getInitId(), getInitLastLoggedZxid(), getPeerEpoch())

Decides who is more suitable to be the leader
The method returns true when the incoming candidate (new) is more suitable

protected boolean totalOrderPredicate(long newId, long newZxid, long newEpoch, long curId, long curZxid, long curEpoch) {
    LOG.debug("id: " + newId + ", proposed id: " + curId + ", zxid: 0x" +
            Long.toHexString(newZxid) + ", proposed zxid: 0x" + Long.toHexString(curZxid));
    // Get the weight. The weight of observer is 0. If 0 is observer, return false
    if(self.getQuorumVerifier().getWeight(newId) == 0){
        return false;
    }
    // zxid is a 64-bit long: the high 32 bits are the epoch and the low 32 bits are a counter (xid).
    // First compare the epochs: if newEpoch > curEpoch, return true
    // If newEpoch and curEpoch are equal, compare the zxids: if newZxid > curZxid, return true
    // If the zxids are equal as well, the larger ServerId wins
    return ((newEpoch > curEpoch) || 
            ((newEpoch == curEpoch) &&
            ((newZxid > curZxid) || ((newZxid == curZxid) && (newId > curId)))));
}
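
To make the ordering concrete, here is a stand-alone sketch of the same three-level comparison (epoch first, then zxid, then server id). It leaves out the weight check on observers, and the class VoteOrder and the method newWins are hypothetical names:

```java
// Hypothetical stand-alone version of the comparison; the observer weight check is omitted.
final class VoteOrder {
    // Returns true if the "new" candidate beats the current one.
    static boolean newWins(long newId, long newZxid, long newEpoch,
                           long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;     // 1. higher epoch wins
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;       // 2. same epoch: higher zxid wins
        }
        return newId > curId;               // 3. same zxid: higher server id wins
    }

    public static void main(String[] args) {
        // Same epoch and zxid, so the tie is broken by server id: 3 beats 1.
        System.out.println(newWins(3, 0x200, 1, 1, 0x200, 1));
    }
}
```

Because the last tie-breaker is the unique server id, two different candidates never compare as equal, which is what lets all servers converge on the same leader.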

3.1.2) judge whether the current round of elections can be ended

...
case LOOKING:
	....
    // -----------------------3.1.2) judge whether the current round of elections can be ended---------------------
    /*
 	 * Try to decide from the votes received so far whether the final leader can be confirmed, using termPredicate().
 	 * The criterion is simple: do more than half of the servers recommend the same leader that the current server recommends?
 	 * If so, to be on the safe side, wait up to finalizeWait (200 ms by default) for final confirmation.
 	 * If a notification with a more suitable leader arrives, it is put back into the recvqueue and the election continues.
 	 * Otherwise the election ends, and the server sets its state to LEADING, FOLLOWING or OBSERVING depending on whether it is the elected leader.
 	 */
    if (termPredicate(recvset,
        new Vote(proposedLeader, proposedZxid,
                logicalclock.get(), proposedEpoch))) {

        // Verify if there is any change in the proposed leader
        // The loop has two exits:
        // break: n is not null, meaning a notification with a more suitable leader was found among the remaining notifications
        // while() condition: n is null, meaning none of the remaining notifications recommends a leader more suitable than the current server's
        while((n = recvqueue.poll(finalizeWait,
            TimeUnit.MILLISECONDS)) != null){
            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                proposedLeader, proposedZxid, proposedEpoch)){
                // Put the more appropriate n back into the recvqueue for re voting
                recvqueue.put(n);
                break;
            }
        }

        // If n is null, the leader recommended by the current server is the final leader,
        // Then you can finish the work at this time
        if (n == null) {
            // Set the state of the current server: LEADING if it is the elected leader, otherwise FOLLOWING or OBSERVING
            self.setPeerState((proposedLeader == self.getId()) ?
                ServerState.LEADING: learningState());
            // Form the final vote
            Vote endVote = new Vote(proposedLeader,
                                proposedZxid,
                                logicalclock.get(),
                                proposedEpoch);
            // Clear recvqueue queue
            leaveInstance(endVote);
            return endVote;
        }
    }
    break;

if (termPredicate(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch))) {...}

Termination predicate: checks whether the leader recommended by the current Server is supported by more than half of the ballots in the ballot box

/**
 * Terminate assertion. Given a group of votes, decide whether there are enough votes to declare the election closed.
 */
protected boolean termPredicate(
        HashMap<Long, Vote> votes,
        Vote vote) {

    HashSet<Long> set = new HashSet<Long>();

    // Traverse the ballot box: find the same ballot as vote from the ballot box
    for (Map.Entry<Long,Vote> entry : votes.entrySet()) {
        if (vote.equals(entry.getValue())){
            set.add(entry.getKey());
        }
    }

    return self.getQuorumVerifier().containsQuorum(set);
}

org.apache.zookeeper.server.quorum.flexible.QuorumMaj#containsQuorum

/**
 * Verifies if a set is a majority.
 */
public boolean containsQuorum(Set<Long> set){
    return (set.size() > half);
}

half is half of the number of voting servers in the cluster (rounded down)

As you can see, the number of votes must be strictly greater than half; exactly half is not enough. This is why an odd number of servers is recommended.
By this rule, a cluster of 5 hosts tolerates at most 2 failures (at least 3 votes are needed). A cluster of 6 hosts also tolerates at most 2 failures (3 votes are only half, so at least 4 are needed). In other words, 6 hosts and 5 hosts have the same fault tolerance, which is why an odd number of hosts is recommended to avoid wasting resources. In terms of read throughput, however, 6 hosts can serve more requests than 5, so using six hosts is not necessarily a waste of resources.
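
The arithmetic can be verified with a few lines. This is a sketch of a simple majority rule only; MajorityQuorum, votesNeeded and toleratedFailures are hypothetical names, and half is assumed to be the number of voting members divided by two with integer division:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical majority check, assuming half = votingMembers / 2 (integer division).
final class MajorityQuorum {
    private final int half;

    MajorityQuorum(int votingMembers) {
        this.half = votingMembers / 2;
    }

    // A set of supporters is a quorum only if it is strictly larger than half.
    boolean containsQuorum(Set<Long> supporters) {
        return supporters.size() > half;
    }

    // Smallest number of agreeing votes that forms a quorum.
    int votesNeeded() {
        return half + 1;
    }

    // How many voting members may be down while a quorum is still possible.
    static int toleratedFailures(int votingMembers) {
        return votingMembers - (votingMembers / 2 + 1);
    }

    public static void main(String[] args) {
        Set<Long> threeVotes = new HashSet<>(Arrays.asList(1L, 2L, 3L));
        for (int n : new int[] {5, 6}) {
            MajorityQuorum q = new MajorityQuorum(n);
            System.out.println(n + " hosts: need " + q.votesNeeded()
                    + " votes, tolerate " + toleratedFailures(n)
                    + " failures, 3 votes enough: " + q.containsQuorum(threeVotes));
        }
    }
}
```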

More than half of the votes already support the proposal, but the notifications still waiting in recvqueue have not been processed yet, and one of them may recommend a more suitable leader.

If a more suitable notification is found, it is put back at the tail of recvqueue and the loop is exited with break. In this case n != null, so no closing work is done; the election continues, and the current server eventually updates its proposal to this more suitable leader and broadcasts it
If none is found, n is null, which means the leader recommended by the current server is the final leader, and the closing work can be done

if (termPredicate(recvset,new Vote(proposedLeader, proposedZxid,logicalclock.get(), proposedEpoch))) {
    // The loop has two exits:
    // break: n is not null, meaning a more suitable leader was found among the remaining notifications
    // while() condition: n is null, meaning none of the remaining notifications recommends a leader more suitable than the current proposal
    while((n = recvqueue.poll(finalizeWait,
        TimeUnit.MILLISECONDS)) != null){
        if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
            proposedLeader, proposedZxid, proposedEpoch)){
            // A more suitable leader was found:
            // put n back into recvqueue so that it is processed again by the main loop
            recvqueue.put(n);
            // put: inserts the element at the tail of the queue, waiting for space if necessary
            break;
        }
    }

    // If n is null, the leader recommended by the current server is the final leader,
    // so the closing work can be done now
    if (n == null) {
		...
    }
}
break;
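
The behaviour of this drain loop can be tried out in isolation. The following is only a sketch under assumptions: Proposal is a hypothetical stand-in for Notification, isBetter() is a simplified ordering (higher epoch, then higher zxid, then higher server id) used in place of totalOrderPredicate(), and canFinalize() is an invented helper name.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class FinalizeWaitSketch {
    /** Hypothetical stand-in for Notification, reduced to the compared fields. */
    static final class Proposal {
        final long leader, zxid, peerEpoch;
        Proposal(long leader, long zxid, long peerEpoch) {
            this.leader = leader;
            this.zxid = zxid;
            this.peerEpoch = peerEpoch;
        }
    }

    /** Assumed ordering: higher epoch wins, then higher zxid, then higher server id. */
    static boolean isBetter(Proposal a, Proposal b) {
        if (a.peerEpoch != b.peerEpoch) return a.peerEpoch > b.peerEpoch;
        if (a.zxid != b.zxid) return a.zxid > b.zxid;
        return a.leader > b.leader;
    }

    /** Returns true if no better proposal arrived within waitMs, false if one was re-queued. */
    static boolean canFinalize(LinkedBlockingQueue<Proposal> queue, Proposal current, long waitMs) {
        Proposal n = null;
        try {
            while ((n = queue.poll(waitMs, TimeUnit.MILLISECONDS)) != null) {
                if (isBetter(n, current)) {
                    queue.put(n);   // back to the tail, to be handled by the main loop
                    break;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return n == null;
    }

    public static void main(String[] args) {
        Proposal current = new Proposal(3, 100, 1);
        LinkedBlockingQueue<Proposal> queue = new LinkedBlockingQueue<>();

        queue.add(new Proposal(1, 90, 1));                     // worse: lower zxid
        System.out.println(canFinalize(queue, current, 10));   // prints true

        queue.add(new Proposal(2, 200, 1));                    // better: higher zxid
        System.out.println(canFinalize(queue, current, 10));   // prints false
        System.out.println(queue.size());                      // prints 1
    }
}
```

In the second call the better proposal is still in the queue afterwards, which is what lets the election continue with it.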

Closing work:

  // If n is null, the leader recommended by the current server is the final leader,
  // so the closing work can be done now
  if (n == null) {
      // Update the state of the current server:
      // if the recommended leader is this server, the state becomes LEADING;
      // otherwise learningState() returns FOLLOWING for a participant and OBSERVING for an observer
      self.setPeerState((proposedLeader == self.getId()) ?
          ServerState.LEADING: learningState());
      // Form the final vote
      Vote endVote = new Vote(proposedLeader,
                          proposedZxid,
                          logicalclock.get(),
                          proposedEpoch);
      // Clear recvqueue queue
      leaveInstance(endVote);
      // Return final ballot
      return endVote;
  }
  
  private ServerState learningState(){
  	if(self.getLearnerType() == LearnerType.PARTICIPANT){
  		LOG.debug("I'm a participant: " + self.getId());
  		return ServerState.FOLLOWING;
  	}
  	else{
  		LOG.debug("I'm an observer: " + self.getId());
  		return ServerState.OBSERVING;
  	}
  }
  
  private void leaveInstance(Vote v) {
      if(LOG.isDebugEnabled()){
          LOG.debug("About to leave FLE instance: leader="
              + v.getId() + ", zxid=0x" +
              Long.toHexString(v.getZxid()) + ", my id=" + self.getId()
              + ", my state=" + self.getPeerState());
      }
      recvqueue.clear();
  }

3.2) the sender's status is OBSERVING:

Observers do not participate in leader election, so a notification sent by an observer is only logged and otherwise ignored

case OBSERVING: 
    LOG.debug("Notification from observer: " + n.sid);
    break;

3.3) sender status is FOLLOWING/LEADING:

First of all, two points should be clear:
When a server receives a notification from another server, it sends its own notification back to that server, regardless of its own state
If a server can receive notifications from other servers, it is a Participant, not an Observer, because the sendNotifications() method does not send to observers
Also note that n.state is the state of the server that sent the received notification

Each host in a zk cluster is in a different state at different stages. There are four states:

LOOKING: election status
FOLLOWING: normal working state of Follower
OBSERVING: the normal working state of the Observer
LEADING: the normal working state of the Leader

The code also handles notifications whose sender state is OBSERVING, which raises two questions:
Why would an Observer send notifications?
I have not read the rest of the code, so I am not sure, but I can speculate: when an Observer joins, how does it learn who the leader is at startup? Presumably through notifications from the other servers, although that logic is not in this method
case OBSERVING also seems to contradict else if (validVoter(n.sid) && validVoter(n.leader)) {. A notification from an observer should never reach case OBSERVING, because the else if has already filtered it out. Why is it written this way?
Perhaps it is there for thread safety and program robustness

// Keep looping as long as the current state is LOOKING, i.e. no leader has been elected yet
while ((self.getPeerState() == ServerState.LOOKING) && (!stop))
{
    // recvqueue, receive queue, which stores all received external notifications
    Notification n = recvqueue.poll(notTimeout,TimeUnit.MILLISECONDS);

    if(n == null){
		...
    }
    else if(validVoter(n.sid) && validVoter(n.leader)) {
        switch (n.state) {
            case LOOKING:
			...
            case OBSERVING: 
                LOG.debug("Notification from observer: " + n.sid);
                break;
            // -----------------------3.3) the sender status is FOLLOWING/LEADING-----------------------
            // -----------------------Handling the case where no election is required---------------------
            // First of all, two points should be clear:
            // 1) When a server receives a notification from another server, it sends its own
            //    notification back to that server, regardless of its own state
            // 2) If a server can receive notifications from other servers, it is a Participant,
            //    not an Observer, because the sendNotifications() method does not send to observers

            // There are two scenarios in which a leader or follower sends a notification to the current server:
            // 1) A new server wants to join a healthy cluster. When it starts, its state is LOOKING,
            //    so it sends out notifications to find the leader. The leader and the followers are
            //    not LOOKING but LEADING and FOLLOWING respectively, and when they receive the
            //    notification they reply with their own.
            //    In this case the logical clock of the current server may or may not equal the
            //    election epoch of the followers and the leader
            //
            // 2) The other servers have already elected a new leader in this round, but the current
            //    server has not been notified yet. Its state is therefore still LOOKING, while some
            //    of the other servers are already LEADING or FOLLOWING.
            //    In this case the logical clock of the current server must equal the election epoch
            //    of the followers and the leader

            // So the analysis boils down to two cases:
            // 1) The logical clock of the current server equals the election epoch of the followers or leader
            // 2) The logical clock of the current server differs from the election epoch of the followers or leader
            case FOLLOWING:
            case LEADING:
            /*
             * Consider all notifications from the same epoch together.
             */
            if(n.electionEpoch == logicalclock.get()){
                recvset.put(n.sid, new Vote(n.leader,
                                              n.zxid,
                                              n.electionEpoch,
                                              n.peerEpoch));

                // Decide whether the current server can leave this round of election:
                // first check whether the leader recommended by n is supported by more than half of the ballot box (recvset);
                // if so, check in outofelection whether that leader's state is valid;
                // if it is, the current server can leave this round of election
                if(ooePredicate(recvset, outofelection, n)) {
                    // Closing work
                    self.setPeerState((n.leader == self.getId()) ?
                            ServerState.LEADING: learningState());

                    Vote endVote = new Vote(n.leader, 
                            n.zxid, 
                            n.electionEpoch, 
                            n.peerEpoch);
                    leaveInstance(endVote);
                    return endVote;
                }
            }

            /*
             * Before joining an established ensemble, verify
             * a majority is following the same leader.
             */
            outofelection.put(n.sid, new Vote(n.version,
                                                n.leader,
                                                n.zxid,
                                                n.electionEpoch,
                                                n.peerEpoch,
                                                n.state));

            // If the leader recommended by n is supported by more than half of the votes in outofelection
            // (the votes of the LEADING and FOLLOWING servers), the leader is known and the election can be left
            if(ooePredicate(outofelection, outofelection, n)) {
                synchronized(this){
                    logicalclock.set(n.electionEpoch);
                    self.setPeerState((n.leader == self.getId()) ?
                            ServerState.LEADING: learningState());
                }
                Vote endVote = new Vote(n.leader,
                                        n.zxid,
                                        n.electionEpoch,
                                        n.peerEpoch);
                leaveInstance(endVote);
                return endVote;
            }
            break;
        default:
            LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                    n.state, n.sid);
            break;
        }
    } else {
		...
    }
}

There are two scenarios in which a leader or follower sends a notification to the current server:

1) A new server wants to join a healthy cluster. When it starts, its state is LOOKING, so it sends out notifications to find the leader. The leader and the followers are not LOOKING but LEADING and FOLLOWING respectively, and when they receive the notification they reply with their own
In this case the logical clock of the current server may or may not equal the election epoch of the followers and the leader
2) The other servers have already elected a new leader in this round, but the current server has not been notified yet. Its state is therefore still LOOKING, while some of the other servers are already LEADING or FOLLOWING
In this case the logical clock of the current server must equal the election epoch of the followers and the leader

So the analysis boils down to two cases:

The logical clock of the current server equals the election epoch of the followers or leader
The logical clock of the current server differs from the election epoch of the followers or leader

The epoch is the same

case FOLLOWING:
case LEADING:
/*
 * Consider all notifications from the same epoch together.
 */
if(n.electionEpoch == logicalclock.get()){
    recvset.put(n.sid, new Vote(n.leader,
                                  n.zxid,
                                  n.electionEpoch,
                                  n.peerEpoch));

    // Decide whether the current server can leave this round of election:
    // first check whether the leader recommended by n is supported by more than half of the ballot box (recvset);
    // if so, check in outofelection whether that leader's state is valid;
    // if it is, the current server can leave this round of election
    if(ooePredicate(recvset, outofelection, n)) {
        // Closing work
        self.setPeerState((n.leader == self.getId()) ?
                ServerState.LEADING: learningState());

        Vote endVote = new Vote(n.leader, 
                n.zxid, 
                n.electionEpoch, 
                n.peerEpoch);
        leaveInstance(endVote);
        return endVote;
    }
}

recvset.put(n.sid,new Vote(...));
If the notification belongs to the same logical clock, that is, the same election round, it is wrapped into a vote and put into the ballot box. Although the sender's state is FOLLOWING or LEADING, the vote is considered valid because it belongs to the same logical clock
if(ooePredicate(recvset, outofelection, n)) {...}
Judges whether the current server can leave this round of election
recvset: the ballot box
outofelection: stores the votes of servers that have already left the election, that is, votes whose sender state is not LOOKING
How is this judged?
First, check whether the leader recommended by n is supported by more than half of the votes in the ballot box of the current server (that is, in the set passed as the first parameter)
If so, check whether the state of the leader recommended by n is valid according to outofelection (the set passed as the second parameter)
If it is valid, the current server can leave this round of election

If ooePredicate returns true, the current server leaves this round of election and performs the closing work: changing its state, generating the final vote, and clearing the queue
If ooePredicate returns false, processing continues with the different-epoch case

In scenario 2 above, recvset may already contain many votes rather than being empty, so there is a fair chance that the leader can be determined right here. The same-epoch branch is therefore an optimization for scenario 2 that speeds up finding the leader

The epoch is different

Note that the same-epoch case is handled first. If the leader has not been determined there, the code goes on to the handling below, which is in effect aimed at scenario 1 above

outofelection.put(n.sid, new Vote(n.version,
                                    n.leader,
                                    n.zxid,
                                    n.electionEpoch,
                                    n.peerEpoch,
                                    n.state));

// If the leader recommended by n is supported by more than half of the votes in outofelection
// (the votes of the LEADING and FOLLOWING servers), the leader is known and the election can be left
if(ooePredicate(outofelection, outofelection, n)) {
    synchronized(this){
        logicalclock.set(n.electionEpoch);
        self.setPeerState((n.leader == self.getId()) ?
                ServerState.LEADING: learningState());
    }
    Vote endVote = new Vote(n.leader,
                            n.zxid,
                            n.electionEpoch,
                            n.peerEpoch);
    leaveInstance(endVote);
    return endVote;
}
break;

outofelection.put(n.sid, new Vote(xxxx));

Puts the vote from a server in FOLLOWING/LEADING state into outofelection, the set of votes from servers that are no longer in the election
if(ooePredicate(outofelection, outofelection, n)) {...}

In scenario 1, a new server joins a healthy cluster, and its election logical clock usually differs from that of the cluster. If the condition is not met yet, the loop runs again and takes the next notification from the queue, which is again put into outofelection, so outofelection collects more and more votes.

Once the leader recommended by n is supported by more than half of the votes in outofelection (the votes of the LEADING and FOLLOWING servers), the current server knows who the leader is and can leave the election
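
How outofelection fills up until the joining server can leave the election can be illustrated with a simplified model. This is only a sketch: Note is a hypothetical, trimmed-down notification, the cluster size of 5 is an assumption, and votes are compared by recommended leader only, whereas the real termPredicate() compares whole Vote objects.

```java
import java.util.HashMap;
import java.util.Map;

public class JoinClusterSketch {
    enum State { LOOKING, FOLLOWING, LEADING }

    /** Hypothetical notification: sender id, recommended leader, sender state. */
    static final class Note {
        final long sid, leader;
        final State state;
        Note(long sid, long leader, State state) {
            this.sid = sid;
            this.leader = leader;
            this.state = state;
        }
    }

    static final int CLUSTER_SIZE = 5;   // assumed number of voting servers

    /** Simplified majority check: more than half of the cluster recommends this leader. */
    static boolean hasQuorum(Map<Long, Note> ooe, long leader) {
        int count = 0;
        for (Note v : ooe.values()) {
            if (v.leader == leader) count++;
        }
        return count > CLUSTER_SIZE / 2;
    }

    /** The recommended leader must itself have reported that it is LEADING. */
    static boolean leaderConfirmed(Map<Long, Note> ooe, long leader) {
        Note v = ooe.get(leader);
        return v != null && v.state == State.LEADING;
    }

    /** Number of notifications consumed before the leader is accepted, or -1 if never. */
    static int notificationsNeeded(Note[] incoming) {
        Map<Long, Note> outofelection = new HashMap<>();
        for (int i = 0; i < incoming.length; i++) {
            Note n = incoming[i];
            outofelection.put(n.sid, n);
            if (hasQuorum(outofelection, n.leader) && leaderConfirmed(outofelection, n.leader)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Note[] withLeader = {
            new Note(1, 3, State.FOLLOWING),
            new Note(2, 3, State.FOLLOWING),
            new Note(3, 3, State.LEADING)
        };
        System.out.println(notificationsNeeded(withLeader));      // prints 3

        Note[] followersOnly = {
            new Note(1, 3, State.FOLLOWING),
            new Note(2, 3, State.FOLLOWING),
            new Note(4, 3, State.FOLLOWING)
        };
        System.out.println(notificationsNeeded(followersOnly));   // prints -1
    }
}
```

In the second run a majority recommends server 3, but server 3 itself never reports LEADING, so the joining server keeps waiting.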

protected boolean ooePredicate(HashMap<Long,Vote> recv, 
                                HashMap<Long,Vote> ooe, 
                                Notification n) {
    // First check whether the leader recommended by n is supported by more than half of the votes in recv.
    // If so, execute checkLeader(),
    // which determines whether the leader's state is valid
    return (termPredicate(recv, new Vote(n.version, 
                                         n.leader,
                                         n.zxid, 
                                         n.electionEpoch, 
                                         n.peerEpoch, 
                                         n.state))
            && checkLeader(ooe, n.leader, n.electionEpoch));
    
}

termPredicate(recv, new Vote(n.xx...))
Described above: determines whether the vote carried by n is supported by more than half of the votes in the set recv
checkLeader(...)
Executed only if the majority condition is met, because of the short-circuit &&
Determines whether the leader's state is valid

/**
 * In the case there is a leader elected, and a quorum supporting this leader,
 * we have to check whether the leader has voted and acknowledged that it is leading.
 * We need this check to avoid peers electing over and over a peer that has
 * crashed and is no longer leading.
 */
protected boolean checkLeader(
        HashMap<Long, Vote> votes,
        long leader,
        long electionEpoch){
    // Start with true, then rule out the failing cases
    boolean predicate = true;

    /*
     * If everyone else thinks I'm the leader, I must be the leader.
     * The other two checks are just for the case in which I'm not the
     * leader. If I'm not the leader and I haven't received a message
     * from leader stating that it is leading, then predicate is false.
     */

    if(leader != self.getId()){
        // The recommended leader is another server, so I am not the leader
        // No vote from the leader in votes (that is, in outofelection): false
        if(votes.get(leader) == null) predicate = false;
        // A vote exists, but the leader's state is not LEADING: also false
        else if(votes.get(leader).getState() != ServerState.LEADING) predicate = false;
    } else if(logicalclock.get() != electionEpoch) {
        // The recommended leader is the current server: if everyone thinks I am the leader, I am the leader,
        // but only if my logical clock equals the election epoch of the recommendation; otherwise false
        predicate = false;
    }

    return predicate;
}
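
The three outcomes of checkLeader() can be reproduced with a standalone sketch. It is a simplified model, not the ZooKeeper method: votes maps a server id directly to its reported state, and myId and myClock are hypothetical parameters standing in for self.getId() and logicalclock.get().

```java
import java.util.HashMap;
import java.util.Map;

public class CheckLeaderSketch {
    enum State { LOOKING, FOLLOWING, LEADING }

    static boolean checkLeader(Map<Long, State> votes, long leader, long electionEpoch,
                               long myId, long myClock) {
        if (leader != myId) {
            // Someone else is recommended: it must have reported LEADING itself.
            // A missing entry (null) or any other state fails the check.
            return votes.get(leader) == State.LEADING;
        }
        // I am the recommended leader: accept only if the election epochs match.
        return myClock == electionEpoch;
    }

    public static void main(String[] args) {
        Map<Long, State> votes = new HashMap<>();
        votes.put(1L, State.FOLLOWING);
        votes.put(2L, State.FOLLOWING);

        // Leader 3 has not been heard from yet.
        System.out.println(checkLeader(votes, 3, 7, 5, 7));   // prints false

        // Leader 3 confirms that it is leading.
        votes.put(3L, State.LEADING);
        System.out.println(checkLeader(votes, 3, 7, 5, 7));   // prints true

        // Server 5 itself is recommended, but for a different election epoch.
        System.out.println(checkLeader(votes, 5, 7, 5, 6));   // prints false
    }
}
```

The first case is the one the comment in the original method warns about: followers may still point at a leader that has crashed, so their votes alone are not enough.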

Topics: Zookeeper