ZooKeeper source code analysis 13 Leader election

Posted by spooke2k on Sun, 02 Jan 2022 06:14:24 +0100

1, run method of QuorumPeer

The previous article only prepared for the Leader election; it did not actually trigger the election process. In the start method, after startLeaderElection() is called, the QuorumPeer thread starts. Next, let's look at its run method.

    public void run() {
        try {
            while (running) {
                switch (getPeerState()) {
                case LOOKING:
                    setCurrentVote(makeLEStrategy().lookForLeader());
                    break;
                case OBSERVING:
                    setObserver(makeObserver(logFactory));
                    observer.observeLeader();
                    break;
                case FOLLOWING:
                    setFollower(makeFollower(logFactory));
                    follower.followLeader();
                    break;
                case LEADING:
                    setLeader(makeLeader(logFactory));
                    leader.lead();
                    setLeader(null);
                    break;
                }
            }
        } catch (Exception e) {
            // exception handling omitted
        }
    }

Part of the code is omitted. Here we only need to care about the call made in the initial state, that is, the branch under LOOKING:

setCurrentVote(makeLEStrategy().lookForLeader());

Judging by the name, this is the entry method that triggers the Leader election. What is actually called is the lookForLeader method of FastLeaderElection.

    public Vote lookForLeader() throws InterruptedException {

        self.start_fle = Time.currentElapsedTime();
        try {
            // Initialize the two vote-tracking collections
            Map<Long, Vote> recvset = new HashMap<Long, Vote>();
            Map<Long, Vote> outofelection = new HashMap<Long, Vote>();
            int notTimeout = minNotificationInterval;

            synchronized (this) {
                logicalclock.incrementAndGet();
                // First, update the current ballot to vote for this node itself
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }
            /*
             * Start sending notifications, i.e. send our own ballot to the other nodes:
             * each ballot is packaged into a ToSend entity and added to sendqueue.
             * The sending thread consumes that queue and hands each message to the
             * sending thread in the Listener. We know that when the Listener is
             * initialized it only opens the listening port; the send/receive worker
             * threads have not been created yet, so the toSend method of
             * QuorumCnxManager calls connectOne(sid) to establish the connection.
             * If the target node in the cluster has already started, the
             * corresponding connection is established.
             */
            sendNotifications();
            SyncedLearnerTracker voteSet;
            // Loop until a Leader is elected
            while ((self.getPeerState() == ServerState.LOOKING) && (!stop)) {
                // Receive ballot information from the other nodes
                Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS);
                if (n == null) {
                    if (manager.haveDelivered()) {
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }
                    int tmpTimeOut = notTimeout * 2;
                    notTimeout = Math.min(tmpTimeOut, maxNotificationInterval);
                } else if (validVoter(n.sid) && validVoter(n.leader)) {
                    // First, make sure this is a legal vote from a legal voter
                    switch (n.state) {
                    case LOOKING:
                        // The received ballot belongs to a newer election round than ours
                        if (n.electionEpoch > logicalclock.get()) {
                            // Move logicalclock up to the received ballot's round
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            // Compare by epoch first; if equal, compare zxid; if zxid
                            // is also equal, compare sid
                            if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                // The received ballot wins: adopt it as our proposal
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                // Otherwise, propose this node again
                                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
                            }
                            // Notify the other nodes of the current proposal
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            // The received ballot belongs to an older round: ignore it
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                            // Same round: adopt the received ballot if it wins the comparison
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }
                        // Record the vote we just received
                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));
                        // Instantiate the vote tracker
                        voteSet = getVoteTracker(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch));
                        // Check whether more than half of the nodes agree on the proposal
                        if (voteSet.hasAllQuorums()) {
                            // Drain the queue: if any remaining ballot beats our
                            // proposal, put it back and keep electing
                            while ((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null) {
                                if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) {
                                    recvqueue.put(n);
                                    break;
                                }
                            }
                            // No better ballot arrived: the election is complete
                            if (n == null) {
                                // If the elected leader is this node, set its state to
                                // LEADING; otherwise set it to FOLLOWING or OBSERVING
                                // depending on whether it is a participant
                                setPeerState(proposedLeader, voteSet);
                                Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch);
                                // leaveInstance clears the recvqueue queue
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        // Observers do not participate in voting
                        break;
                    case FOLLOWING:
                    case LEADING:
                        // A re-election after a crash follows the same idea
                        if (n.electionEpoch == logicalclock.get()) {
                            recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                            voteSet = getVoteTracker(recvset, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                            if (voteSet.hasAllQuorums() && checkLeader(recvset, n.leader, n.electionEpoch)) {
                                setPeerState(n.leader, voteSet);
                                Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        // Track votes from servers that are already outside this
                        // election round
                        outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));
                        voteSet = getVoteTracker(outofelection, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state));

                        if (voteSet.hasAllQuorums() && checkLeader(outofelection, n.leader, n.electionEpoch)) {
                            synchronized (this) {
                                logicalclock.set(n.electionEpoch);
                                setPeerState(n.leader, voteSet);
                            }
                            Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        break;
                    }
                } else {
                    // Votes from invalid voters are only logged; logging omitted
                }
            }
            return null;
        } finally {
            // cleanup omitted
        }
    }

2, Process walkthrough

For Leader election, each node in the cluster first votes for itself and sends that vote to the other nodes. A vote carries three values: epoch, zxid, and sid, and electing the Leader uses exactly these three values. Epoch, analyzed earlier, represents the latest globally unique proposal round in the system; zxid is the latest transaction id; and sid is the node's own myid. Votes are compared by epoch first, and the largest epoch wins; if the epochs are equal, the larger zxid wins; if the zxids are also equal, the larger myid wins. Therefore, when a cluster starts for the first time, the Leader will often be the server with the largest myid. Once more than half of the nodes in the cluster vote for one node, that node is elected Leader, and the rest become Follower or Observer nodes (Observer nodes do not participate in voting).
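The three-way comparison described above can be sketched as a small predicate, modeled on ZooKeeper's totalOrderPredicate. The class name and test values here are illustrative, not taken from the ZooKeeper codebase:

```java
// Minimal sketch of the vote comparison: epoch first, then zxid, then sid.
public class VoteComparison {

    // Returns true if the new vote (newId, newZxid, newEpoch) should replace
    // the currently proposed vote (curId, curZxid, curEpoch).
    static boolean totalOrderPredicate(long newId, long newZxid, long newEpoch,
                                       long curId, long curZxid, long curEpoch) {
        return (newEpoch > curEpoch)
            || (newEpoch == curEpoch
                && (newZxid > curZxid
                    || (newZxid == curZxid && newId > curId)));
    }

    public static void main(String[] args) {
        // A higher epoch wins regardless of zxid and sid
        System.out.println(totalOrderPredicate(1, 0, 2, 3, 9, 1)); // true
        // Equal epoch: the higher zxid wins
        System.out.println(totalOrderPredicate(1, 5, 1, 3, 4, 1)); // true
        // Equal epoch and zxid: the higher sid wins
        System.out.println(totalOrderPredicate(3, 5, 1, 1, 5, 1)); // true
    }
}
```

This ordering is why, on a fresh cluster where all epochs and zxids are equal, the server with the largest myid tends to win.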

The election algorithm itself is not difficult to understand, but the whole communication process is relatively complex. As we saw earlier, communication among the nodes is carried out through the Listener, while FastLeaderElection starts its own send and receive worker threads to process the election data. Simply put, the send/receive worker threads in FastLeaderElection handle the votes themselves, and the send/receive threads in the Listener handle the actual network communication. The specific communication process is as follows:

  1. The lookForLeader() method starts the Leader election. It first calls the sendNotifications() method, which encapsulates this node's vote into n ToSend objects (where n is the number of nodes in the cluster) and adds them to the sendqueue sending queue.
  2. The sending thread of FastLeaderElection continuously takes votes out of sendqueue, packages each into a ByteBuffer, and calls the toSend method of QuorumCnxManager. That method adds the ByteBuffer to the queueSendMap collection, keyed by sid. In the initial state the Listener is only listening on its port and no communication link has been established yet, so toSend actively calls connectOne(sid) to establish the link. Once the sendqueue queue has been consumed, the communication links are in place.
  3. The SendWorker thread in the Listener constantly takes values out of queueSendMap; each sid corresponds to one SendWorker, so votes are ultimately sent through the SendWorker. At this point every node sends its vote to all the other nodes.
  4. SendWorker is paired with RecvWorker. RecvWorker continuously receives votes from other nodes, packages them into messages, and adds them to the blocking queue recvQueue.
  5. The receiving thread WorkerReceiver in FastLeaderElection continuously takes votes out of recvQueue and packages each into a Notification object. If a Leader election is in progress and this node's state is LOOKING, the Notification is added to the recvqueue queue, and the received ballot is compared with this node's current proposal; if our proposal wins, it is encapsulated into a ToSend object, added to sendqueue, and sent back to the other nodes. If this node is no longer electing but the sending node still is, this node casts its vote by adding it to the sendqueue queue, since this node's state is newer than the voting node's.
  6. Execution continues after sendNotifications(). The lookForLeader() method takes Notification objects out of recvqueue, decides whether to update the current proposal according to each received ballot, and sends the result of each comparison. Once the current proposal has more than half of the votes, Notification objects are drained from recvqueue in a while loop and compared with the proposal. If a better ballot is found, it is put back into recvqueue and the loop exits so the election continues; if no better ballot arrives, the currently proposed node can serve as the Leader. The node state is then set according to the winning ballot, the recvqueue queue is cleared, and the method returns.
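The queueing in steps 1 and 2 above can be sketched as follows. This is a hypothetical miniature, not ZooKeeper's actual classes: the election thread enqueues one notification per voting peer into a blocking queue, which a WorkerSender-style thread would later drain and hand to the per-sid connection.

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical miniature of the sendqueue pattern described above.
public class NotificationQueues {

    // Simplified stand-in for ZooKeeper's ToSend entity
    static final class ToSend {
        final long sid;    // destination server id
        final long leader; // proposed leader carried by this vote
        ToSend(long sid, long leader) { this.sid = sid; this.leader = leader; }
    }

    // Corresponds to the sendqueue consumed by the sending worker thread
    final BlockingQueue<ToSend> sendQueue = new LinkedBlockingQueue<>();

    // Mirrors sendNotifications(): enqueue one ToSend per voting member
    void sendNotifications(List<Long> votingPeers, long proposedLeader) {
        for (long sid : votingPeers) {
            sendQueue.offer(new ToSend(sid, proposedLeader));
        }
    }

    public static void main(String[] args) {
        NotificationQueues q = new NotificationQueues();
        // A three-node cluster proposing server 3 as leader
        q.sendNotifications(List.of(1L, 2L, 3L), 3L);
        System.out.println(q.sendQueue.size()); // prints 3
    }
}
```

In the real implementation the drain side lives in QuorumCnxManager, which looks up (or establishes via connectOne) the socket for each destination sid before writing the serialized vote.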

3, Summary

This article has roughly analyzed and summarized the Leader election process and the establishment of the corresponding communication links when the cluster starts. Next, we will analyze data processing and the execution flow in cluster mode, and how data is recovered when the Leader crashes or more than half of the Follower nodes in the cluster go down.

If anything above is wrong, please leave a message and correct me. Thank you for your understanding.

Topics: Zookeeper Distribution Cloud Native