Meticulous in-depth explanation of Zookeeper source code - sorting out the core process

Posted by davemwohio on Sun, 20 Feb 2022 20:12:46 +0100

I. Source code repository:

zookeeper
Based on branch 3.4.14; this article analyzes the server startup process of that branch on Windows.

II. Process analysis:

  1. Source code entry
    From the contents of the zkServer.cmd script, we can see that the ZooKeeper server is started through the main method of the org.apache.zookeeper.server.quorum.QuorumPeerMain class. The main method receives the path of our zoo.cfg file and parses it, converting the key/value configuration entries into a QuorumPeerConfig object. See the QuorumPeerConfig#parse method for the conversion details. The core configuration parameters after conversion are:
| Parameter name | Parameter description |
|---|---|
| dataLogDir | Transaction log storage path |
| dataDir | Snapshot storage path |
| electionType | Election algorithm; at present only 3 (fast leader election) is supported |
| myid | ID of the current server |
| tickTime | Basic time unit |
| initLimit | Number of ticks a follower may take to connect to and sync with the leader at startup |
| syncLimit | Number of ticks a follower may fall behind the leader before it is dropped |
| minSessionTimeout | Minimum session timeout |
| maxSessionTimeout | Maximum session timeout |
| peerType | Role type: OBSERVER or PARTICIPANT |
| clientPort | Client connection port |
| clientPortAddress | Client connection host |
| snapRetainCount | Number of snapshots to retain, minimum 3 |
| purgeInterval | Snapshot cleanup interval |
| server.sid | hostname:port:electionPort:peerType |
| maxClientCnxns | Maximum number of client connections |
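
The parsing step can be sketched roughly as follows. This is a minimal illustration, not the actual QuorumPeerConfig code: the class name ZooCfgSketch and the method parseServers are hypothetical, and only the server.sid entries are collected.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

public class ZooCfgSketch {
    /** Collects the server.N entries of zoo.cfg-style text, keyed by N (hypothetical helper). */
    public static Map<Long, String> parseServers(String cfgText) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(cfgText));
        Map<Long, String> servers = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("server.")) {
                // The part after "server." is the sid
                long sid = Long.parseLong(key.substring("server.".length()));
                servers.put(sid, props.getProperty(key).trim());
            }
        }
        return servers;
    }

    public static void main(String[] args) throws IOException {
        String cfg = "tickTime=2000\n"
                + "dataDir=/tmp/zk\n"
                + "clientPort=2181\n"
                + "server.1=host1:2888:3888\n"
                + "server.2=host2:2888:3888\n";
        Map<Long, String> servers = parseServers(cfg);
        // No server entries corresponds to the standalone case
        System.out.println(servers.isEmpty() ? "standalone" : "cluster: " + servers);
    }
}
```

An empty result from this sketch corresponds to a configuration with no server.sid entries at all.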

After obtaining the parsed parameters, whether any server.id entries are configured determines whether the server starts in cluster mode or standalone mode. Standalone startup goes through the ZooKeeperServerMain#main method, while cluster startup is handled in the QuorumPeerMain#runFromConfig method. Here we go straight to cluster mode, because it involves more inter-node communication than standalone mode, such as Leader election, data synchronization, and request forwarding:

    public void runFromConfig(QuorumPeerConfig config) throws IOException {
      try {
          ManagedUtil.registerLog4jMBeans();
      } catch (JMException e) {
          LOG.warn("Unable to register log4j JMX control", e);
      }
  
      LOG.info("Starting quorum peer");
      try {
          ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
          cnxnFactory.configure(config.getClientPortAddress(),
                                config.getMaxClientCnxns());

          quorumPeer = getQuorumPeer();

          quorumPeer.setQuorumPeers(config.getServers());
          quorumPeer.setTxnFactory(new FileTxnSnapLog(
                  new File(config.getDataLogDir()),
                  new File(config.getDataDir())));
          quorumPeer.setElectionType(config.getElectionAlg());
          quorumPeer.setMyid(config.getServerId());
          quorumPeer.setTickTime(config.getTickTime());
          quorumPeer.setInitLimit(config.getInitLimit());
          quorumPeer.setSyncLimit(config.getSyncLimit());
          quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
          quorumPeer.setCnxnFactory(cnxnFactory);
          quorumPeer.setQuorumVerifier(config.getQuorumVerifier());
          quorumPeer.setClientPortAddress(config.getClientPortAddress());
          quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
          quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
          quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
          quorumPeer.setLearnerType(config.getPeerType());
          quorumPeer.setSyncEnabled(config.getSyncEnabled());

          // sets quorum sasl authentication configurations
          quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
          if(quorumPeer.isQuorumSaslAuthEnabled()){
              quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
              quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
              quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
              quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
              quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
          }

          quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
          quorumPeer.initialize();

          quorumPeer.start();
          quorumPeer.join();
      } catch (InterruptedException e) {
          // warn, but generally this is ok
          LOG.warn("Quorum Peer interrupted", e);
      }
    }

The code snippet shows that a QuorumPeer object is obtained through getQuorumPeer(). In OOP terms, this instance represents one node of the cluster, and the values from QuorumPeerConfig are then copied into it. Several new classes appear here:

| Class name | Class description |
|---|---|
| FileTxnSnapLog | Core class of persistence; covers snapshot and transaction log operations |
| ServerCnxnFactory | Core class of server-side network processing; has two implementations, NIO and Netty |
| ZKDatabase | Core class of in-memory operations; data is stored in a tree structure |

After the parameters are set, the QuorumPeer#initialize method is called; it instantiates some authentication-related objects. The core logic is in the QuorumPeer#start method:

        loadDataBase();//Load data from snapshots and transaction logs into memory
        cnxnFactory.start();        //Network service startup
        startLeaderElection(); //Preparations for elections
        super.start(); 
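
The final super.start() call works because QuorumPeer is ultimately a subclass of the JDK's Thread. The pattern can be sketched as below; PeerThreadSketch and its event names are hypothetical and only mimic the order of the four calls above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PeerThreadSketch extends Thread {
    // Records the order in which the steps ran (illustration only)
    final List<String> events = Collections.synchronizedList(new ArrayList<String>());

    @Override
    public void start() {
        // Preparation runs on the caller's thread
        events.add("loadDataBase");
        events.add("cnxnFactory.start");
        events.add("startLeaderElection");
        // Thread#start spawns a new thread that executes run()
        super.start();
    }

    @Override
    public void run() {
        events.add("run");
    }

    public static List<String> demo() throws InterruptedException {
        PeerThreadSketch peer = new PeerThreadSketch();
        peer.start();
        peer.join(); // wait until run() has finished
        return new ArrayList<String>(peer.events);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(demo());
    }
}
```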

loadDataBase:
In this method, the loading work is mainly carried out by delegating to ZKDatabase#loadDataBase

    public long loadDataBase() throws IOException {
        long zxid = snapLog.restore(dataTree, sessionsWithTimeouts, commitProposalPlaybackListener);
        initialized = true;
        return zxid;
    }
    public long restore(DataTree dt, Map<Long, Integer> sessions, 
            PlayBackListener listener) throws IOException {
        snapLog.deserialize(dt, sessions); //Data deserialization
        return fastForwardFromEdits(dt, sessions, listener);
    }
    public long deserialize(DataTree dt, Map<Long, Integer> sessions)
            throws IOException {
        // Find up to 100 valid snapshot files, newest first
        List<File> snapList = findNValidSnapshots(100);
        if (snapList.size() == 0) {
            return -1L;
        }
        File snap = null;
        boolean foundValid = false;
        for (int i = 0; i < snapList.size(); i++) {
            snap = snapList.get(i);
            InputStream snapIS = null;
            CheckedInputStream crcIn = null;
            try {
                LOG.info("Reading snapshot " + snap);
                snapIS = new BufferedInputStream(new FileInputStream(snap));
                crcIn = new CheckedInputStream(snapIS, new Adler32());
                InputArchive ia = BinaryInputArchive.getArchive(crcIn);
                // The actual deserialization happens here
                deserialize(dt,sessions, ia);
                long checkSum = crcIn.getChecksum().getValue();
                long val = ia.readLong("val");
                //Verify the integrity of the snapshot file
                if (val != checkSum) {
                    throw new IOException("CRC corruption in snapshot :  " + snap);
                }
                foundValid = true;
                break;
            } catch(IOException e) {
                LOG.warn("problem reading snap file " + snap, e);
            } finally {
                if (snapIS != null) 
                    snapIS.close();
                if (crcIn != null) 
                    crcIn.close();
            } 
        }
        if (!foundValid) {
            throw new IOException("Not able to find valid snapshots in " + snapDir);
        }
        // Snapshot files are named snapshot.<lastZxid>
        dt.lastProcessedZxid = Util.getZxidFromName(snap.getName(), SNAPSHOT_FILE_PREFIX);
        return dt.lastProcessedZxid;
    }
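
The integrity check in the loop above (compute an Adler32 checksum while reading, then compare it with the value stored after the payload) can be tried in isolation. The sketch below is only an illustration under simplified assumptions: ChecksumSketch is a hypothetical class, and the layout it uses (length, payload, checksum) is much simpler than a real snapshot file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

public class ChecksumSketch {
    /** Writes [length][payload][adler32 of everything before it]. */
    public static byte[] write(byte[] payload) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        CheckedOutputStream crcOut = new CheckedOutputStream(bytes, new Adler32());
        DataOutputStream out = new DataOutputStream(crcOut);
        out.writeInt(payload.length);
        out.write(payload);
        out.flush();
        // Capture the checksum before the checksum field itself is written
        long checksum = crcOut.getChecksum().getValue();
        out.writeLong(checksum);
        out.flush();
        return bytes.toByteArray();
    }

    /** Re-reads the data and compares the computed checksum with the stored one. */
    public static boolean verify(byte[] file) throws IOException {
        CheckedInputStream crcIn =
                new CheckedInputStream(new ByteArrayInputStream(file), new Adler32());
        DataInputStream in = new DataInputStream(crcIn);
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        // Checksum of the bytes consumed so far, taken before reading the stored value
        long computed = crcIn.getChecksum().getValue();
        long stored = in.readLong();
        return computed == stored;
    }
}
```

As in the code above, the computed checksum is taken before the stored value is read, so the stored field does not affect the comparison.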

ZKDatabase has the following core attributes:

| Attribute | Description |
|---|---|
| DataTree dataTree | Tree structure that stores the data |
| FileTxnSnapLog snapLog | Persistence class for transaction logs and snapshots |
| ConcurrentHashMap<Long, Integer> sessionsWithTimeouts | Session management; maps sessionId to timeout |

In the loadDataBase method, you can see the call to snapLog#restore. Inside restore, snapLog#deserialize (implemented by FileSnap) is called to deserialize the snapshot into the dt and sessions parameters that were passed in. That method in turn delegates to the overloaded FileSnap#deserialize(DataTree dt, Map<Long, Integer> sessions, InputArchive ia). Let's look at how this overload deserializes a snapshot file:

    public void deserialize(DataTree dt, Map<Long, Integer> sessions,
            InputArchive ia) throws IOException {
        FileHeader header = new FileHeader();
        header.deserialize(ia, "fileheader");
        if (header.getMagic() != SNAP_MAGIC) {
            throw new IOException("mismatching magic headers "
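                    + header.getMagic() + 
                    " !=  " + FileSnap.SNAP_MAGIC);
        }
        // Load the sessions and the data tree that follow the header
        SerializeUtils.deserializeSnapshot(dt, ia, sessions);
    }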

First, the header is read through InputArchive, a wrapper around the file input stream, by calling the FileHeader#deserialize method:

  public void deserialize(InputArchive a_, String tag) throws java.io.IOException {
    a_.startRecord(tag);
    magic=a_.readInt("magic");
    version=a_.readInt("version");
    dbid=a_.readLong("dbid");
    a_.endRecord(tag);
}

FileHeader implements the Record interface. Every type that is serialized or deserialized later implements this interface as well, defining its own serialization and deserialization details against the archive object passed in.
From the code above, the storage layout of FileHeader is:

| Attribute | Size | Description |
|---|---|---|
| magic | 4 bytes | Magic number |
| version | 4 bytes | Version number |
| dbid | 8 bytes | Database id |
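
The 16-byte figure can be checked with plain JDK streams. This sketch assumes that the binary archive writes ints and longs the same way DataOutputStream does; FileHeaderSketch is a hypothetical class, and the field values are illustrative rather than taken from the ZooKeeper source.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class FileHeaderSketch {
    /** Writes magic (int), version (int) and dbid (long) in that order. */
    public static byte[] write(int magic, int version, long dbid) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeInt(magic);    // 4 bytes
        out.writeInt(version);  // 4 bytes
        out.writeLong(dbid);    // 8 bytes
        out.flush();
        return bytes.toByteArray();
    }

    /** Reads the three fields back; returns {magic, version, dbid}. */
    public static long[] read(byte[] header) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(header));
        int magic = in.readInt();
        int version = in.readInt();
        long dbid = in.readLong();
        return new long[] { magic, version, dbid };
    }

    public static void main(String[] args) throws IOException {
        // Illustrative values only
        byte[] header = write(0x12345678, 2, -1L);
        System.out.println("header length = " + header.length);
    }
}
```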

After FileHeader#deserialize returns, 16 bytes have been read from the file stream. Next, SerializeUtils#deserializeSnapshot(dt, ia, sessions) is called to load the remaining content:

    public static void deserializeSnapshot(DataTree dt,InputArchive ia,
            Map<Long, Integer> sessions) throws IOException {
        //Number of sessions
        int count = ia.readInt("count");
        while (count > 0) {
            //Session id
            long id = ia.readLong("id");
            //Session timeout
            int to = ia.readInt("timeout");
            sessions.put(id, to);
            if (LOG.isTraceEnabled()) {
                ZooTrace.logTraceMessage(LOG, ZooTrace.SESSION_TRACE_MASK,
                        "loadData --- session in archive: " + id
                        + " with timeout: " + to);
            }
            count--;
        }
        dt.deserialize(ia, "tree");
    }

You can see that a 4-byte count (the number of sessions) is read from the stream first. Then, for each session, an 8-byte sessionId and a 4-byte timeout are read and put into sessions (that is, the sessionsWithTimeouts attribute of ZKDatabase). Finally, DataTree#deserialize is called to deserialize the actual stored content:

    public void deserialize(InputArchive ia, String tag) throws IOException {
        aclCache.deserialize(ia);
        nodes.clear();
        pTrie.clear();
        String path = ia.readString("path");
        while (!path.equals("/")) {
            DataNode node = new DataNode();
            ia.readRecord(node, "node");
            nodes.put(path, node);
            synchronized (node) {
                aclCache.addUsage(node.acl);
            }
            int lastSlash = path.lastIndexOf('/');
            if (lastSlash == -1) {
                root = node;
            } else {
                String parentPath = path.substring(0, lastSlash);
                node.parent = nodes.get(parentPath);
                if (node.parent == null) {
                    throw new IOException("Invalid Datatree, unable to find " +
                            "parent " + parentPath + " of path " + path);
                }
                node.parent.addChild(path.substring(lastSlash + 1));
                long eowner = node.stat.getEphemeralOwner();
                if (eowner != 0) {
                    HashSet<String> list = ephemerals.get(eowner);
                    if (list == null) {
                        list = new HashSet<String>();
                        ephemerals.put(eowner, list);
                    }
                    list.add(path);
                }
            }
            path = ia.readString("path");
        }
        nodes.put("/", root);

        setupQuota();

        aclCache.purgeUnused();
    }
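
The parent-linking logic in DataTree#deserialize relies on parents appearing before their children in the snapshot and on the root being stored under the empty path. A simplified sketch of just that part follows; TreeLinkSketch is hypothetical and tracks only child names, with no node data, ACLs, or ephemerals.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class TreeLinkSketch {
    /** Rebuilds a path -> child names map from paths listed parent-first. */
    public static Map<String, Set<String>> link(List<String> pathsInSnapshotOrder) {
        Map<String, Set<String>> children = new LinkedHashMap<String, Set<String>>();
        for (String path : pathsInSnapshotOrder) {
            children.put(path, new TreeSet<String>());
            int lastSlash = path.lastIndexOf('/');
            if (lastSlash == -1) {
                continue; // the root, stored under the empty path
            }
            String parentPath = path.substring(0, lastSlash);
            Set<String> siblings = children.get(parentPath);
            if (siblings == null) {
                throw new IllegalStateException(
                        "unable to find parent " + parentPath + " of path " + path);
            }
            // Register this node's name under its parent
            siblings.add(path.substring(lastSlash + 1));
        }
        return children;
    }

    public static void main(String[] args) {
        System.out.println(link(Arrays.asList("", "/a", "/a/b", "/c")));
    }
}
```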

  2. Network transmission (NIO)
    Connections between ZooKeeper and its clients, as well as the transmission of requests and responses, are handled by an implementation of ServerCnxnFactory. Here we look directly at the NIO implementation, NIOServerCnxnFactory. In QuorumPeer's start method, we saw that the NIOServerCnxnFactory#start method is called:

    public void start() {
        // ensure thread is started once and only once
        if (thread.getState() == Thread.State.NEW) {
            thread.start();
        }
    }

The start method simply calls Thread#start to start the thread. The thread object itself is initialized in the NIOServerCnxnFactory#configure method:

    public void configure(InetSocketAddress addr, int maxcc) throws IOException {
        configureSaslLogin();
        //Initialize thread object
        thread = new ZooKeeperThread(this, "NIOServerCxn.Factory:" + addr);
        thread.setDaemon(true);
        //Set the maximum number of connections parameter
        maxClientCnxns = maxcc;
        //Initialize Socket related configuration
        this.ss = ServerSocketChannel.open();
        ss.socket().setReuseAddress(true);
        LOG.info("binding to port " + addr);
        ss.socket().bind(addr);
        ss.configureBlocking(false);
        ss.register(selector, SelectionKey.OP_ACCEPT);
    }
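
What registering for OP_ACCEPT leads to can be seen in a small standalone sketch. It is not the NIOServerCnxnFactory loop: NioAcceptSketch is hypothetical, binds to an ephemeral loopback port, connects one client to itself, and counts the connections it accepts.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

public class NioAcceptSketch {
    /** Accepts a single loopback connection through a selector; returns the accept count. */
    public static int acceptOne() throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (Selector selector = Selector.open();
             ServerSocketChannel ss = ServerSocketChannel.open()) {
            ss.socket().setReuseAddress(true);
            ss.socket().bind(new InetSocketAddress(loopback, 0)); // port 0 = ephemeral
            ss.configureBlocking(false);
            ss.register(selector, SelectionKey.OP_ACCEPT);

            int port = ss.socket().getLocalPort();
            int accepted = 0;
            try (SocketChannel client = SocketChannel.open(new InetSocketAddress(loopback, port))) {
                // Wait up to one second for the accept event
                selector.select(1000);
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (key.isAcceptable()) {
                        try (SocketChannel sc = ((ServerSocketChannel) key.channel()).accept()) {
                            if (sc != null) {
                                accepted++;
                            }
                        }
                    }
                }
            }
            return accepted;
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("accepted connections: " + acceptOne());
    }
}
```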

  3. Election
    After the network service is started, preparations for the election begin. QuorumPeer#start calls QuorumPeer#startLeaderElection(), which is the entry point for these preparations:

     synchronized public void startLeaderElection() {
         try {
             // Set the initial vote
             currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
         } catch(IOException e) {
             RuntimeException re = new RuntimeException(e.getMessage());
             re.setStackTrace(e.getStackTrace());
             throw re;
         }
         for (QuorumServer p : getView().values()) {
             if (p.id == myid) {
                 myQuorumAddr = p.addr;
                 break;
             }
         }
         if (myQuorumAddr == null) {
             throw new RuntimeException("My id " + myid + " not in the peer list");
         }
         if (electionType == 0) {
             try {
                 udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                 //Start response thread
                 responder = new ResponderThread();
                 responder.start();
             } catch (SocketException e) {
                 throw new RuntimeException(e);
             }
         }
         //Perform some initialization according to the configured election algorithm
         this.electionAlg = createElectionAlgorithm(electionType);
     }

    The startLeaderElection method mainly sets the initial vote to the node itself: sid is its own serverId, zxid is the largest zxid (lastZxid) loaded from snapshots and transaction logs, and peerEpoch is the node's current election epoch. If electionType is 0, a ResponderThread is also started. The core logic is in the createElectionAlgorithm method, so let's follow it:

   protected Election createElectionAlgorithm(int electionAlgorithm){
        Election le=null;
                
        //TODO: use a factory rather than a switch
        switch (electionAlgorithm) {
        case 0:
            le = new LeaderElection(this);
            break;
        case 1:
            // Deprecated
            le = new AuthFastLeaderElection(this);
            break;
        case 2:
            // Deprecated
            le = new AuthFastLeaderElection(this, true);
            break;
        case 3:
            // Create the connection manager
            qcm = createCnxnManager();
            QuorumCnxManager.Listener listener = qcm.listener;
            if(listener != null){
                // Start listening for connection requests from other nodes
                listener.start();
                // Instantiate the core class of the fast election algorithm
                le = new FastLeaderElection(this, qcm);
            } else {
                LOG.error("Null listener when initializing cnx manager");
            }
            break;
        default:
            assert false;
        }
        return le;
    }

From the above code, we can see that the main work is to instantiate a QuorumCnxManager object, whose Listener class handles connection requests from other nodes. Calling Listener#start ends up running the Listener#run method:

        public void run() {
            int numRetries = 0;
            InetSocketAddress addr;
            while((!shutdown) && (numRetries < 3)){
                try {
                    //Instantiate ServerSocket
                    ss = new ServerSocket();
                    ss.setReuseAddress(true);
                    if (listenOnAllIPs) {
                        int port = view.get(QuorumCnxManager.this.mySid)
                            .electionAddr.getPort();
                        addr = new InetSocketAddress(port);
                    } else {
                        addr = view.get(QuorumCnxManager.this.mySid)
                            .electionAddr;
                    }
                    LOG.info("My election bind port: " + addr.toString());
                    setName(view.get(QuorumCnxManager.this.mySid)
                            .electionAddr.toString());
                    ss.bind(addr);
                    while (!shutdown) {
                        //Blocking waiting for other nodes to request a connection
                        Socket client = ss.accept();
                        setSockOpts(client);
                        LOG.info("Received connection request "
                                + client.getRemoteSocketAddress());

                        if (quorumSaslAuthEnabled) {
                            receiveConnectionAsync(client);
                        } else {
                            //Accept request core logic
                            receiveConnection(client);
                        }

                        numRetries = 0;
                    }
                } catch (IOException e) {
                    LOG.error("Exception while listening", e);
                    numRetries++;
                    try {
                        ss.close();
                        Thread.sleep(1000);
                    } catch (IOException ie) {
                        LOG.error("Error closing server socket", ie);
                    } catch (InterruptedException ie) {
                        LOG.error("Interrupted while sleeping. " +
                                  "Ignoring exception", ie);
                    }
                }
            }
            LOG.info("Leaving listener");
            if (!shutdown) {
                LOG.error("As I'm leaving the listener thread, "
                        + "I won't be able to participate in leader "
                        + "election any longer: "
                        + view.get(QuorumCnxManager.this.mySid).electionAddr);
            }
        }

This method uses the JDK's blocking I/O to accept connections from other nodes; readers unfamiliar with it may want to review basic JDK socket programming. The ss.accept() call in the inner while loop blocks until another node requests a connection. When a connection is established, it returns a Socket instance, which is passed to the receiveConnection method so that we can communicate with that node. The receiveConnection logic is as follows:

    public void receiveConnection(final Socket sock) {
        DataInputStream din = null;
        try {
            // Wrap the socket's input stream in buffered and data streams
            din = new DataInputStream(
                    new BufferedInputStream(sock.getInputStream()));

            // Actually handle the connection
            handleConnection(sock, din);
        } catch (IOException e) {
            LOG.error("Exception handling connection, addr: {}, closing server connection",
                     sock.getRemoteSocketAddress());
            closeSocket(sock);
        }
    }

After wrapping the io input stream, handleConnection is further called for connection processing:

    private void handleConnection(Socket sock, DataInputStream din)
            throws IOException {
        Long sid = null;
        try {
            // Block until the first packet of the peer's connection request arrives.
            // Read 8 bytes first: either the sid (server id) or a protocol version
            sid = din.readLong();
            // A negative value means a protocol version was read
            if (sid < 0) {
                // Read 8 more bytes: the real sid
                sid = din.readLong();
                // Read 4 bytes: the number of remaining bytes
                int num_remaining_bytes = din.readInt();
                // Validate the byte count
                if (num_remaining_bytes < 0 || num_remaining_bytes > maxBuffer) {
                    LOG.error("Unreasonable buffer length: {}", num_remaining_bytes);
                    closeSocket(sock);
                    return;
                }
                byte[] b = new byte[num_remaining_bytes];

                // Read the remaining bytes into b in a single call
                int num_read = din.read(b);
                if (num_read != num_remaining_bytes) {
                    LOG.error("Read only " + num_read + " bytes out of " + num_remaining_bytes + " sent by server " + sid);
                }
            }
            if (sid == QuorumPeer.OBSERVER_ID) {
                sid = observerCounter.getAndDecrement();
                LOG.info("Setting arbitrary identifier to observer: " + sid);
            }
        } catch (IOException e) {
            closeSocket(sock);
            LOG.warn("Exception reading or writing challenge: " + e.toString());
            return;
        }

        LOG.debug("Authenticating learner server.id: {}", sid);
        authServer.authenticate(sock, din);
        //If the read sid is less than the SID of the current node, the previously established connection will be closed
        if (sid < this.mySid) {
            SendWorker sw = senderWorkerMap.get(sid);
            if (sw != null) {
                sw.finish();
            }
            LOG.debug("Create new connection to server: " + sid);
            closeSocket(sock);
            //After closing the previous connection, the current node initiates a connection request
            connectOne(sid);

        } else {
            // Sender thread
            SendWorker sw = new SendWorker(sock, sid);
            // Receiver thread
            RecvWorker rw = new RecvWorker(sock, din, sid, sw);
            sw.setRecv(rw);
            SendWorker vsw = senderWorkerMap.get(sid);
            if(vsw != null)
                vsw.finish();
            senderWorkerMap.put(sid, sw);
            queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));
            // Start the sender thread
            sw.start();
            // Start the receiver thread
            rw.start();
            return;
        }
    }
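
The first-packet parsing at the top of handleConnection can be sketched on its own with in-memory streams. HandshakeSketch is a hypothetical class, and its PROTOCOL_VERSION and MAX_BUFFER values are assumptions chosen for illustration. Unlike the original, the sketch uses readFully, so a short read raises an exception instead of being logged.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class HandshakeSketch {
    static final long PROTOCOL_VERSION = -65536L; // assumed value; any negative long works here
    static final int MAX_BUFFER = 2048;           // assumed limit for the trailing bytes

    /** Returns the sender's sid from either form of the first packet. */
    public static long readSid(DataInputStream din) throws IOException {
        long first = din.readLong();
        if (first >= 0) {
            return first; // old form: the first long is the sid itself
        }
        // New form: protocol version, then sid, then a length-prefixed remainder
        long sid = din.readLong();
        int remaining = din.readInt();
        if (remaining < 0 || remaining > MAX_BUFFER) {
            throw new IOException("Unreasonable buffer length: " + remaining);
        }
        byte[] rest = new byte[remaining];
        din.readFully(rest);
        return sid;
    }

    /** Builds a new-form first packet (helper for the demo). */
    public static byte[] newForm(long sid, byte[] extra) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        out.writeLong(PROTOCOL_VERSION);
        out.writeLong(sid);
        out.writeInt(extra.length);
        out.write(extra);
        out.flush();
        return bytes.toByteArray();
    }

    /** Builds an old-form first packet: just the sid. */
    public static byte[] oldForm(long sid) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        new DataOutputStream(bytes).writeLong(sid);
        return bytes.toByteArray();
    }

    public static long parse(byte[] packet) throws IOException {
        return readSid(new DataInputStream(new ByteArrayInputStream(packet)));
    }
}
```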

It can be seen from this code that, between any two nodes, the connection that survives is always the one initiated by the node with the larger sid and accepted by the node with the smaller sid. With three nodes, sid = 1, sid = 2 and sid = 3, node 3 initiates the connections to nodes 1 and 2, and node 2 initiates the connection to node 1. If a node with a smaller sid connects first, the accepting node closes that socket and calls connectOne to connect back. This ensures that only one connection is kept between each pair of nodes, which is sufficient because a socket is full duplex and supports communication in both directions. A socket is therefore obtained either through ss.accept() or by connecting to a node with a smaller sid through the connectOne method called above:

    synchronized public void connectOne(long sid){
        // Check whether senderWorkerMap already contains this sid
        if (!connectedToPeer(sid)){
            InetSocketAddress electionAddr;
            if (view.containsKey(sid)) {
                // Get the election address configured in server.id
                electionAddr = view.get(sid).electionAddr;
            } else {
                LOG.warn("Invalid server id: " + sid);
                return;
            }
            try {
                LOG.debug("Opening channel to server " + sid);
                // Instantiate the Socket object
                Socket sock = new Socket();
                setSockOpts(sock);
                //Connect
                sock.connect(view.get(sid).electionAddr, cnxTO);
                LOG.debug("Connected to server " + sid);
                if (quorumSaslAuthEnabled) {
                    initiateConnectionAsync(sock, sid);
                } else {
                    //Synchronously initialize the connection, that is, send some information about itself to other nodes
                    initiateConnection(sock, sid);
                }
            } catch (UnresolvedAddressException e) {
                LOG.warn("Cannot open channel to " + sid
                        + " at election address " + electionAddr, e);
                if (view.containsKey(sid)) {
                    view.get(sid).recreateSocketAddresses();
                }
                throw e;
            } catch (IOException e) {
                LOG.warn("Cannot open channel to " + sid
                        + " at election address " + electionAddr,
                        e);
                if (view.containsKey(sid)) {
                    view.get(sid).recreateSocketAddresses();
                }
            }
        } else {
            LOG.debug("There is a connection already for server " + sid);
        }
    }
    public void initiateConnection(final Socket sock, final Long sid) {
        try {
            startConnection(sock, sid);
        } catch (IOException e) {
            LOG.error("Exception while connecting, id: {}, addr: {}, closing learner connection",
                     new Object[] { sid, sock.getRemoteSocketAddress() }, e);
            closeSocket(sock);
            return;
        }
    }
    private boolean startConnection(Socket sock, Long sid)
            throws IOException {
        DataOutputStream dout = null;
        DataInputStream din = null;
        try {
            dout = new DataOutputStream(sock.getOutputStream());
            //Send its own sid to other nodes
            dout.writeLong(this.mySid);
            dout.flush();
            din = new DataInputStream(
                    new BufferedInputStream(sock.getInputStream()));
        } catch (IOException e) {
            LOG.warn("Ignoring exception reading or writing challenge: ", e);
            closeSocket(sock);
            return false;
        }
        // authenticate learner
        authLearner.authenticate(sock, view.get(sid).hostname);
        if (sid > this.mySid) {
            LOG.info("Have smaller server identifier, so dropping the " +
                     "connection: (" + sid + ", " + this.mySid + ")");
            closeSocket(sock);
            // Otherwise proceed with the connection
        } else {
            // From here on, the logic is the same as after obtaining the socket through ss.accept()
            SendWorker sw = new SendWorker(sock, sid);
            RecvWorker rw = new RecvWorker(sock, din, sid, sw);
            sw.setRecv(rw);
            SendWorker vsw = senderWorkerMap.get(sid);
            if(vsw != null)
                vsw.finish();
            senderWorkerMap.put(sid, sw);
            queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));
            sw.start();
            rw.start();
            return true;    
            
        }
        return false;
    }
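
Taken together, the sid comparisons in handleConnection and startConnection amount to a single rule: a connection survives only if its initiator has the larger sid. A minimal sketch of that rule follows (ConnectionRuleSketch is hypothetical and involves no sockets):

```java
import java.util.ArrayList;
import java.util.List;

public class ConnectionRuleSketch {
    /** True when a connection opened by initiatorSid to acceptorSid is kept by both sides. */
    public static boolean survives(long initiatorSid, long acceptorSid) {
        // The acceptor drops it when initiatorSid < acceptorSid (handleConnection);
        // the initiator drops it when acceptorSid > initiatorSid (startConnection).
        return initiatorSid > acceptorSid;
    }

    /** Lists every surviving "from->to" link in a cluster with the given sids. */
    public static List<String> survivingLinks(long[] sids) {
        List<String> links = new ArrayList<String>();
        for (long from : sids) {
            for (long to : sids) {
                if (from != to && survives(from, to)) {
                    links.add(from + "->" + to);
                }
            }
        }
        return links;
    }

    public static void main(String[] args) {
        System.out.println(survivingLinks(new long[] {1, 2, 3}));
    }
}
```

For sids 1, 2 and 3 this yields exactly one link per pair of nodes, always opened by the larger sid.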

As can be seen from the above methods, once a Socket object is obtained through ServerSocket#accept or Socket#connect, a SendWorker and a RecvWorker are instantiated and their start methods are called to start two threads. These two threads carry requests and responses to and from the other node. For each peer, a node maintains one SendWorker, one RecvWorker, and one send queue stored in queueSendMap.
How these three objects work will be explained in detail in the election part. After this series of election preparations, we return to the QuorumPeer#start method, which next calls super.start(). QuorumPeer extends ZooKeeperThread, and ZooKeeperThread extends the JDK's Thread class, so calling super.start() starts a separate thread that executes the QuorumPeer#run method, which is where the election is actually held:

    public void run() {
        setName("QuorumPeer" + "[myid=" + getId() + "]" +
                cnxnFactory.getLocalAddress());
        LOG.debug("Starting quorum peer");
        // 1. JMX extension point
        try {
            jmxQuorumBean = new QuorumBean(this);
            MBeanRegistry.getInstance().register(jmxQuorumBean, null);
            for(QuorumServer s: getView().values()){
                ZKMBeanInfo p;
                if (getId() == s.id) {
                    p = jmxLocalPeerBean = new LocalPeerBean(this);
                    try {
                        MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                        jmxLocalPeerBean = null;
                    }
                } else {
                    p = new RemotePeerBean(s);
                    try {
                        MBeanRegistry.getInstance().register(p, jmxQuorumBean);
                    } catch (Exception e) {
                        LOG.warn("Failed to register with JMX", e);
                    }
                }
            }
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            jmxQuorumBean = null;
        }
        // 2. Election logic
        try {
            /*
             * Main loop
             */
            while (running) {
                switch (getPeerState()) {
                // 1. LOOKING state
                case LOOKING:
                    LOG.info("LOOKING");
                    // Read-only mode is enabled
                    if (Boolean.getBoolean("readonlymode.enabled")) {
                        LOG.info("Attempting to start ReadOnlyZooKeeperServer");
                        final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer(
                                logFactory, this,
                                new ZooKeeperServer.BasicDataTreeBuilder(),
                                this.zkDb);
                        Thread roZkMgr = new Thread() {
                            public void run() {
                                try {
                                    // lower-bound grace period to 2 secs
                                    sleep(Math.max(2000, tickTime));
                                    if (ServerState.LOOKING.equals(getPeerState())) {
                                        roZk.startup();
                                    }
                                } catch (InterruptedException e) {
                                    LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started");
                                } catch (Exception e) {
                                    LOG.error("FAILED to start ReadOnlyZooKeeperServer", e);
                                }
                            }
                        };
                        try {
                            roZkMgr.start();
                            setBCVote(null);
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception",e);
                            setPeerState(ServerState.LOOKING);
                        } finally {
                            // If the thread is in the grace period, interrupt
                            // to come out of waiting.
                            roZkMgr.interrupt();
                            roZk.shutdown();
                        }
                    } else {
                        try {
                            setBCVote(null);
                            //Call Election#lookForLeader (FastLeaderElection here); it returns the final vote once the election ends
                            setCurrentVote(makeLEStrategy().lookForLeader());
                        } catch (Exception e) {
                            LOG.warn("Unexpected exception", e);
                            setPeerState(ServerState.LOOKING);
                        }
                    }
                    break;
                //After the election, the Observer role enters here
                case OBSERVING:
                    try {
                        LOG.info("OBSERVING");
                        setObserver(makeObserver(logFactory));
                        observer.observeLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e );                        
                    } finally {
                        observer.shutdown();
                        setObserver(null);
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                //After the election, the Follower role enters here
                case FOLLOWING:
                    try {
                        LOG.info("FOLLOWING");
                        setFollower(makeFollower(logFactory));
                        follower.followLeader();
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e);
                    } finally {
                        follower.shutdown();
                        setFollower(null);
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                //After the election, the Leader role enters here
                case LEADING:
                    LOG.info("LEADING");
                    try {
                        setLeader(makeLeader(logFactory));
                        leader.lead();
                        setLeader(null);
                    } catch (Exception e) {
                        LOG.warn("Unexpected exception",e);
                    } finally {
                        if (leader != null) {
                            leader.shutdown("Forcing shutdown");
                            setLeader(null);
                        }
                        setPeerState(ServerState.LOOKING);
                    }
                    break;
                }
            }
        } finally {
            LOG.warn("QuorumPeer main thread exited");
            try {
                MBeanRegistry.getInstance().unregisterAll();
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            jmxQuorumBean = null;
            jmxLocalPeerBean = null;
        }
    }
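
The main loop of QuorumPeer#run can be read as a small state machine: the election decides the next role, and every role branch falls back to LOOKING in its finally block, whether the role work returns normally or throws. The following is only a minimal sketch of that pattern. MainLoopSketch, State and runOnce are hypothetical names used for illustration and are not part of ZooKeeper.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of the QuorumPeer main loop; not ZooKeeper code.
final class MainLoopSketch {
    enum State { LOOKING, FOLLOWING, LEADING, OBSERVING }

    private State state = State.LOOKING;
    // States visited so far, recorded only so the behaviour can be inspected.
    final List<State> trace = new ArrayList<>();

    State current() {
        return state;
    }

    // One pass of the loop. electedRole stands in for the outcome of the
    // election and is only used while LOOKING; roleWork stands in for
    // followLeader() / lead() / observeLeader().
    void runOnce(State electedRole, Runnable roleWork) {
        trace.add(state);
        switch (state) {
            case LOOKING:
                // In the real code the election itself sets the peer state.
                state = electedRole;
                break;
            default:
                try {
                    roleWork.run();
                } catch (RuntimeException e) {
                    // The real loop logs "Unexpected exception" here.
                } finally {
                    // Every role branch ends by going back to LOOKING.
                    state = State.LOOKING;
                }
        }
    }

    public static void main(String[] args) {
        MainLoopSketch loop = new MainLoopSketch();
        loop.runOnce(State.FOLLOWING, () -> { });
        loop.runOnce(State.FOLLOWING, () -> {
            throw new IllegalStateException("lost contact with leader");
        });
        System.out.println("visited: " + loop.trace + ", now: " + loop.current());
    }
}
```

The point of the sketch is that a node never stays in a role after that role's work ends: it always re-enters the election.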

Let's start from the main loop in the QuorumPeer#run code above. After entering the while loop, the node takes the LOOKING branch first, because it is still in the LOOKING state at startup. That branch begins by checking whether read-only mode is enabled (the readonlymode.enabled system property). Read-only mode is not covered here, so we go straight to the other branch:

                        setBCVote(null);
                        //Call Election#lookForLeader (FastLeaderElection here); it returns the final vote once the election ends
                        setCurrentVote(makeLEStrategy().lookForLeader());

The makeLEStrategy method actually returns the FastLeaderElection instance created in the QuorumPeer#startLeaderElection method discussed earlier. The FastLeaderElection#lookForLeader method is then called to carry out the Leader election:

    public Vote lookForLeader() throws InterruptedException {
        try {
            self.jmxLeaderElectionBean = new LeaderElectionBean();
            MBeanRegistry.getInstance().register(
                    self.jmxLeaderElectionBean, self.jmxLocalPeerBean);
        } catch (Exception e) {
            LOG.warn("Failed to register with JMX", e);
            self.jmxLeaderElectionBean = null;
        }
        if (self.start_fle == 0) {
           self.start_fle = Time.currentElapsedTime();
        }
        try {
            HashMap<Long, Vote> recvset = new HashMap<Long, Vote>();

            HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>();

            int notTimeout = finalizeWait;

            synchronized(this){
                logicalclock.incrementAndGet();
                updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch());
            }

            LOG.info("New election. My id =  " + self.getId() +
                    ", proposed zxid=0x" + Long.toHexString(proposedZxid));
            sendNotifications();

            /*
             * Loop in which we exchange notifications until we find a leader
             */

            while ((self.getPeerState() == ServerState.LOOKING) &&
                    (!stop)){
                /*
                 * Remove next notification from queue, times out after 2 times
                 * the termination time
                 */
                Notification n = recvqueue.poll(notTimeout,
                        TimeUnit.MILLISECONDS);

                /*
                 * Sends more notifications if haven't received enough.
                 * Otherwise processes new notification.
                 */
                if(n == null){
                    if(manager.haveDelivered()){
                        sendNotifications();
                    } else {
                        manager.connectAll();
                    }

                    /*
                     * Exponential backoff
                     */
                    int tmpTimeOut = notTimeout*2;
                    notTimeout = (tmpTimeOut < maxNotificationInterval?
                            tmpTimeOut : maxNotificationInterval);
                    LOG.info("Notification time out: " + notTimeout);
                }
                else if(validVoter(n.sid) && validVoter(n.leader)) {
                    /*
                     * Only proceed if the vote comes from a replica in the
                     * voting view for a replica in the voting view.
                     */
                    switch (n.state) {
                    case LOOKING:
                        // If notification > current, replace and send messages out
                        if (n.electionEpoch > logicalclock.get()) {
                            logicalclock.set(n.electionEpoch);
                            recvset.clear();
                            if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                    getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) {
                                updateProposal(n.leader, n.zxid, n.peerEpoch);
                            } else {
                                updateProposal(getInitId(),
                                        getInitLastLoggedZxid(),
                                        getPeerEpoch());
                            }
                            sendNotifications();
                        } else if (n.electionEpoch < logicalclock.get()) {
                            if(LOG.isDebugEnabled()){
                                LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x"
                                        + Long.toHexString(n.electionEpoch)
                                        + ", logicalclock=0x" + Long.toHexString(logicalclock.get()));
                            }
                            break;
                        } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                proposedLeader, proposedZxid, proposedEpoch)) {
                            updateProposal(n.leader, n.zxid, n.peerEpoch);
                            sendNotifications();
                        }

                        if(LOG.isDebugEnabled()){
                            LOG.debug("Adding vote: from=" + n.sid +
                                    ", proposed leader=" + n.leader +
                                    ", proposed zxid=0x" + Long.toHexString(n.zxid) +
                                    ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch));
                        }

                        recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch));

                        if (termPredicate(recvset,
                                new Vote(proposedLeader, proposedZxid,
                                        logicalclock.get(), proposedEpoch))) {

                            // Verify if there is any change in the proposed leader
                            while((n = recvqueue.poll(finalizeWait,
                                    TimeUnit.MILLISECONDS)) != null){
                                if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch,
                                        proposedLeader, proposedZxid, proposedEpoch)){
                                    recvqueue.put(n);
                                    break;
                                }
                            }

                            /*
                             * This predicate is true once we don't read any new
                             * relevant message from the reception queue
                             */
                            if (n == null) {
                                self.setPeerState((proposedLeader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(proposedLeader,
                                                        proposedZxid,
                                                        logicalclock.get(),
                                                        proposedEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }
                        break;
                    case OBSERVING:
                        LOG.debug("Notification from observer: " + n.sid);
                        break;
                    case FOLLOWING:
                    case LEADING:
                        /*
                         * Consider all notifications from the same epoch
                         * together.
                         */
                        if(n.electionEpoch == logicalclock.get()){
                            recvset.put(n.sid, new Vote(n.leader,
                                                          n.zxid,
                                                          n.electionEpoch,
                                                          n.peerEpoch));
                           
                            if(ooePredicate(recvset, outofelection, n)) {
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());

                                Vote endVote = new Vote(n.leader, 
                                        n.zxid, 
                                        n.electionEpoch, 
                                        n.peerEpoch);
                                leaveInstance(endVote);
                                return endVote;
                            }
                        }

                        /*
                         * Before joining an established ensemble, verify
                         * a majority is following the same leader.
                         */
                        outofelection.put(n.sid, new Vote(n.version,
                                                            n.leader,
                                                            n.zxid,
                                                            n.electionEpoch,
                                                            n.peerEpoch,
                                                            n.state));
           
                        if(ooePredicate(outofelection, outofelection, n)) {
                            synchronized(this){
                                logicalclock.set(n.electionEpoch);
                                self.setPeerState((n.leader == self.getId()) ?
                                        ServerState.LEADING: learningState());
                            }
                            Vote endVote = new Vote(n.leader,
                                                    n.zxid,
                                                    n.electionEpoch,
                                                    n.peerEpoch);
                            leaveInstance(endVote);
                            return endVote;
                        }
                        break;
                    default:
                        LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)",
                                n.state, n.sid);
                        break;
                    }
                } else {
                    if (!validVoter(n.leader)) {
                        LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid);
                    }
                    if (!validVoter(n.sid)) {
                        LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid);
                    }
                }
            }
            return null;
        } finally {
            try {
                if(self.jmxLeaderElectionBean != null){
                    MBeanRegistry.getInstance().unregister(
                            self.jmxLeaderElectionBean);
                }
            } catch (Exception e) {
                LOG.warn("Failed to unregister with JMX", e);
            }
            self.jmxLeaderElectionBean = null;
            LOG.debug("Number of connection processing threads: {}",
                    manager.getConnectionThreadCount());
        }
    }
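
Two checks drive the LOOKING case above: totalOrderPredicate decides whether an incoming proposal beats the current one, and termPredicate decides whether the current proposal already has a quorum behind it. The sketch below illustrates both under simplifying assumptions. It compares peer epoch first, then zxid, then server id, and it uses a plain majority. It leaves out the server weights consulted by the real quorum verifier and the electionEpoch field of the real Vote. SimpleVote, VoteOrderSketch, totalOrder and hasQuorum are hypothetical names, not ZooKeeper APIs.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for ZooKeeper's Vote class.
final class SimpleVote {
    final long leader;    // sid of the proposed leader
    final long zxid;      // last zxid seen by the proposed leader
    final long peerEpoch; // epoch of the proposed leader

    SimpleVote(long leader, long zxid, long peerEpoch) {
        this.leader = leader;
        this.zxid = zxid;
        this.peerEpoch = peerEpoch;
    }

    boolean sameProposal(SimpleVote other) {
        return leader == other.leader
                && zxid == other.zxid
                && peerEpoch == other.peerEpoch;
    }
}

final class VoteOrderSketch {
    // True when the new proposal wins: higher epoch first, then higher zxid,
    // then higher server id as the final tie-breaker.
    static boolean totalOrder(long newId, long newZxid, long newEpoch,
                              long curId, long curZxid, long curEpoch) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newId > curId;
    }

    // True when strictly more than half of the ensemble voted for the proposal.
    // Assumption: unweighted majority quorum.
    static boolean hasQuorum(Map<Long, SimpleVote> received,
                             SimpleVote proposal, int ensembleSize) {
        int agreeing = 0;
        for (SimpleVote v : received.values()) {
            if (v.sameProposal(proposal)) {
                agreeing++;
            }
        }
        return agreeing > ensembleSize / 2;
    }

    public static void main(String[] args) {
        // Servers 1 and 2 have the same epoch and zxid, so the higher sid wins.
        System.out.println("2 beats 1: " + totalOrder(2, 0x10, 1, 1, 0x10, 1));

        Map<Long, SimpleVote> recvset = new HashMap<>();
        SimpleVote proposal = new SimpleVote(2, 0x10, 1);
        recvset.put(1L, new SimpleVote(2, 0x10, 1));
        recvset.put(2L, new SimpleVote(2, 0x10, 1));
        System.out.println("quorum of 3: " + hasQuorum(recvset, proposal, 3));
    }
}
```

In the real method, a notification that wins the comparison makes the node adopt that proposal and broadcast it again. Once the quorum check passes and no better proposal arrives within finalizeWait, the node leaves the election.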

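When recvqueue.poll returns null, lookForLeader doubles notTimeout and caps it at maxNotificationInterval. The sketch below isolates that backoff. The starting value of 200 ms and the cap of 60000 ms are assumptions about the finalizeWait and maxNotificationInterval constants in this branch, so check them against the source before relying on them. NotificationBackoffSketch and nextTimeout are hypothetical names.

```java
// Hypothetical, isolated version of the poll-timeout backoff in lookForLeader.
final class NotificationBackoffSketch {
    // Assumed values for finalizeWait and maxNotificationInterval.
    static final int FINALIZE_WAIT_MS = 200;
    static final int MAX_NOTIFICATION_INTERVAL_MS = 60000;

    // Doubles the timeout and caps it at the maximum interval.
    static int nextTimeout(int currentMs) {
        int doubled = currentMs * 2;
        return Math.min(doubled, MAX_NOTIFICATION_INTERVAL_MS);
    }

    public static void main(String[] args) {
        int timeout = FINALIZE_WAIT_MS;
        // Print the timeout used for each of the first ten empty polls.
        for (int i = 0; i < 10; i++) {
            System.out.println("poll timeout (ms): " + timeout);
            timeout = nextTimeout(timeout);
        }
    }
}
```

The longer waits keep an isolated node from flooding the network with notifications while it cannot reach a quorum.
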
To be continued
