I Source code repository:
zookeeper
This analysis is based on branch 3.4.14 and walks through the server startup process on Windows.
II Process analysis:
- Source code entry
From the contents of the zkServer.cmd script, we can see that the ZooKeeper server is started through the main method of the org.apache.zookeeper.server.quorum.QuorumPeerMain class. The main method receives the path of our zoo.cfg file and parses it, converting the key/value configuration entries into a QuorumPeerConfig object. See the QuorumPeerConfig#parse method for the conversion details. The core configuration parameters after conversion are:
Parameter name | Parameter description |
---|---|
dataLogDir | Transaction log storage path |
dataDir | Snapshot storage path |
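electionType | Election algorithm; 3 (FastLeaderElection) is the default, the other values are deprecated |
myid | ID of the current server |
tickTime | Basic time unit, in milliseconds |
initLimit | Time, in ticks, that followers are allowed to connect and sync to the leader |
syncLimit | Time, in ticks, that a follower may lag behind the leader before it is dropped |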
minSessionTimeout | Minimum session timeout |
maxSessionTimeout | Maximum session timeout |
peerType | Role type - OBSERVER,PARTICIPANT |
clientPort | Client connection port |
clientPortAddress | Address (host) on which client connections are accepted |
snapRetainCount | Number of snapshots to retain, minimum 3 |
purgeInterval | Snapshot cleanup interval, in hours |
server.sid | hostname:port:electionPort:peerType |
maxClientCnxns | Maximum number of client connections |
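To make the key/value conversion concrete, here is a minimal sketch that parses zoo.cfg-style text with `java.util.Properties` and collects the `server.sid` entries. The class `ZooCfgSketch` and its methods are hypothetical names used only for illustration; they are not part of ZooKeeper, and the real QuorumPeerConfig performs far more validation.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

/** Hypothetical helper for illustration; not a ZooKeeper class. */
final class ZooCfgSketch {

    /** Parses zoo.cfg-style text into key/value pairs. */
    static Properties parse(String cfgText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(cfgText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Collects the "server.<sid>" entries into a map keyed by sid. */
    static Map<Long, String> servers(Properties props) {
        Map<Long, String> servers = new TreeMap<>();
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith("server.")) {
                long sid = Long.parseLong(key.substring("server.".length()));
                servers.put(sid, props.getProperty(key).trim());
            }
        }
        return servers;
    }

    public static void main(String[] args) {
        String cfg = "tickTime=2000\n"
                + "dataDir=/tmp/zookeeper\n"
                + "clientPort=2181\n"
                + "server.1=host1:2888:3888\n"
                + "server.2=host2:2888:3888\n"
                + "server.3=host3:2888:3888:observer\n";
        Properties props = parse(cfg);
        Map<Long, String> servers = servers(props);
        System.out.println("tickTime = " + props.getProperty("tickTime"));
        System.out.println("servers  = " + servers);
        // Assumption of this sketch: no server entries means standalone mode.
        System.out.println(servers.isEmpty() ? "standalone" : "quorum");
    }
}
```

The host names and paths in `main` are made-up sample values.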
After the parameters are parsed, whether any server.sid entries are configured determines whether the server starts in cluster mode or in standalone mode. Standalone startup goes through the ZooKeeperServerMain#main method, while cluster startup is handled in the QuorumPeerMain#runFromConfig method. Here we go straight to cluster mode, because it involves communication between nodes that standalone mode does not have, such as leader election, data synchronization, and request forwarding:

    public void runFromConfig(QuorumPeerConfig config) throws IOException {
        try {
            ManagedUtil.registerLog4jMBeans();
        } catch (JMException e) {
            LOG.warn("Unable to register log4j JMX control", e);
        }

        LOG.info("Starting quorum peer");
        try {
            ServerCnxnFactory cnxnFactory = ServerCnxnFactory.createFactory();
            cnxnFactory.configure(config.getClientPortAddress(),
                    config.getMaxClientCnxns());

            quorumPeer = getQuorumPeer();
            quorumPeer.setQuorumPeers(config.getServers());
            quorumPeer.setTxnFactory(new FileTxnSnapLog(
                    new File(config.getDataLogDir()),
                    new File(config.getDataDir())));
            quorumPeer.setElectionType(config.getElectionAlg());
            quorumPeer.setMyid(config.getServerId());
            quorumPeer.setTickTime(config.getTickTime());
            quorumPeer.setInitLimit(config.getInitLimit());
            quorumPeer.setSyncLimit(config.getSyncLimit());
            quorumPeer.setQuorumListenOnAllIPs(config.getQuorumListenOnAllIPs());
            quorumPeer.setCnxnFactory(cnxnFactory);
            quorumPeer.setQuorumVerifier(config.getQuorumVerifier());
            quorumPeer.setClientPortAddress(config.getClientPortAddress());
            quorumPeer.setMinSessionTimeout(config.getMinSessionTimeout());
            quorumPeer.setMaxSessionTimeout(config.getMaxSessionTimeout());
            quorumPeer.setZKDatabase(new ZKDatabase(quorumPeer.getTxnFactory()));
            quorumPeer.setLearnerType(config.getPeerType());
            quorumPeer.setSyncEnabled(config.getSyncEnabled());

            // sets quorum sasl authentication configurations
            quorumPeer.setQuorumSaslEnabled(config.quorumEnableSasl);
            if(quorumPeer.isQuorumSaslAuthEnabled()){
                quorumPeer.setQuorumServerSaslRequired(config.quorumServerRequireSasl);
                quorumPeer.setQuorumLearnerSaslRequired(config.quorumLearnerRequireSasl);
                quorumPeer.setQuorumServicePrincipal(config.quorumServicePrincipal);
                quorumPeer.setQuorumServerLoginContext(config.quorumServerLoginContext);
                quorumPeer.setQuorumLearnerLoginContext(config.quorumLearnerLoginContext);
            }

            quorumPeer.setQuorumCnxnThreadsSize(config.quorumCnxnThreadsSize);
            quorumPeer.initialize();

            quorumPeer.start();
            quorumPeer.join();
        } catch (InterruptedException e) {
            // warn, but generally this is ok
            LOG.warn("Quorum Peer interrupted", e);
        }
    }

The snippet creates a new QuorumPeer object. In keeping with OOP, this instance represents one node of the cluster, and the values held by QuorumPeerConfig are copied into it through setters. Several new classes appear here:
Class name | Class description |
---|---|
FileTxnSnapLog | Core persistence class; handles snapshot and transaction log operations |
ServerCnxnFactory | Core class for server-side network handling; has two implementations, NIO and Netty |
ZKDatabase | Core in-memory database class; data is stored in a tree structure |
After the parameters are set, the QuorumPeer#initialize method is called, which instantiates some authentication-related objects. The core logic, however, is in the QuorumPeer#start method:

    loadDataBase();          // Load data from snapshots and transaction logs into memory
    cnxnFactory.start();     // Start the network service
    startLeaderElection();   // Prepare for the election
    super.start();           // Start the QuorumPeer thread

loadDataBase:
In this method, the loading work is mainly delegated to ZKDatabase#loadDataBase:

    public long loadDataBase() throws IOException {
        long zxid = snapLog.restore(dataTree, sessionsWithTimeouts,
                commitProposalPlaybackListener);
        initialized = true;
        return zxid;
    }

    public long restore(DataTree dt, Map<Long, Integer> sessions,
            PlayBackListener listener) throws IOException {
        snapLog.deserialize(dt, sessions); // Deserialize the snapshot data
        return fastForwardFromEdits(dt, sessions, listener);
    }


    public long deserialize(DataTree dt, Map<Long, Integer> sessions)
            throws IOException {
        // Find up to 100 valid snapshot files, in descending order (newest first)
        List<File> snapList = findNValidSnapshots(100);
        if (snapList.size() == 0) {
            return -1L;
        }
        File snap = null;
        boolean foundValid = false;
        for (int i = 0; i < snapList.size(); i++) {
            snap = snapList.get(i);
            InputStream snapIS = null;
            CheckedInputStream crcIn = null;
            try {
                LOG.info("Reading snapshot " + snap);
                snapIS = new BufferedInputStream(new FileInputStream(snap));
                crcIn = new CheckedInputStream(snapIS, new Adler32());
                InputArchive ia = BinaryInputArchive.getArchive(crcIn);
                // The actual deserialization happens here
                deserialize(dt,sessions, ia);
                long checkSum = crcIn.getChecksum().getValue();
                long val = ia.readLong("val");
                // Verify the integrity of the snapshot file
                if (val != checkSum) {
                    throw new IOException("CRC corruption in snapshot :  " + snap);
                }
                foundValid = true;
                break;
            } catch(IOException e) {
                LOG.warn("problem reading snap file " + snap, e);
            } finally {
                if (snapIS != null)
                    snapIS.close();
                if (crcIn != null)
                    crcIn.close();
            }
        }
        if (!foundValid) {
            throw new IOException("Not able to find valid snapshots in " + snapDir);
        }
        // The snapshot file is named snapshot.lastZxid
        dt.lastProcessedZxid = Util.getZxidFromName(snap.getName(), SNAPSHOT_FILE_PREFIX);
        return dt.lastProcessedZxid;
    }

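The integrity check in the method above (compute an Adler32 checksum while reading, then compare it with the value stored at the end of the file) can be tried in isolation. The following is a minimal sketch that uses only JDK classes on an in-memory byte array. `ChecksumSketch` and its payload format are hypothetical and do not reproduce the real snapshot format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.Adler32;
import java.util.zip.CheckedInputStream;
import java.util.zip.CheckedOutputStream;

/** Hypothetical illustration of a payload followed by its Adler32 checksum. */
final class ChecksumSketch {

    /** Writes the payload, then appends the checksum of the payload bytes. */
    static byte[] write(String payload) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            CheckedOutputStream crcOut = new CheckedOutputStream(bytes, new Adler32());
            DataOutputStream out = new DataOutputStream(crcOut);
            out.writeUTF(payload);
            out.flush();
            // Capture the checksum before the trailer itself is written.
            long checksum = crcOut.getChecksum().getValue();
            out.writeLong(checksum);
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Re-reads the payload and compares the computed and stored checksums. */
    static boolean isValid(byte[] file) {
        try {
            CheckedInputStream crcIn =
                    new CheckedInputStream(new ByteArrayInputStream(file), new Adler32());
            DataInputStream in = new DataInputStream(crcIn);
            in.readUTF();
            long computed = crcIn.getChecksum().getValue();
            long stored = in.readLong();
            return computed == stored;
        } catch (IOException e) {
            return false; // truncated or unreadable data counts as invalid
        }
    }

    public static void main(String[] args) {
        byte[] file = write("snapshot-data");
        System.out.println("intact:    " + isValid(file));
        file[5] ^= 1; // flip one bit inside the payload
        System.out.println("corrupted: " + isValid(file));
    }
}
```

As in the original method, the checksum is read from the checked stream before the stored value is consumed, so the trailer does not influence the comparison.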
ZKDatabase has the following core attributes:
Attribute | Description |
---|---|
DataTree dataTree | In-memory tree structure that stores the data |
FileTxnSnapLog snapLog | Persistence class for transaction logs and snapshots |
ConcurrentHashMap<Long, Integer> sessionsWithTimeouts | Session management; maps sessionId to session timeout |
The loadDataBase method calls snapLog#restore. Inside restore, FileSnap#deserialize is called to deserialize the snapshot, and the result is stored in the dt and sessions parameters that were passed in. Let's look at the overloaded method FileSnap#deserialize(DataTree dt, Map<Long, Integer> sessions, InputArchive ia) to see how a snapshot file is deserialized:

    public void deserialize(DataTree dt, Map<Long, Integer> sessions,
            InputArchive ia) throws IOException {
        FileHeader header = new FileHeader();
        header.deserialize(ia, "fileheader");
        if (header.getMagic() != SNAP_MAGIC) {
            throw new IOException("mismatching magic headers "
                    + header.getMagic() +
                    " !=  " + FileSnap.SNAP_MAGIC);
        }
        // The rest of the method is covered below
    }

First, the file header is read through InputArchive, a wrapper around the file input stream, by calling the FileHeader#deserialize method:

    public void deserialize(InputArchive a_, String tag) throws java.io.IOException {
        a_.startRecord(tag);
        magic=a_.readInt("magic");
        version=a_.readInt("version");
        dbid=a_.readLong("dbid");
        a_.endRecord(tag);
    }

FileHeader implements the Record interface. In fact, every class that is serialized or deserialized later also implements this interface, defining its own serialization and deserialization details on top of the archive object that is passed in.
Here you can see that the storage structure of FileHeader is:
Field | Size | Description |
---|---|---|
magic | 4 bytes | Magic number |
version | 4 bytes | Version number |
dbid | 8 bytes | Database id |
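The sizes in the table add up to 4 + 4 + 8 = 16 bytes. The sketch below writes and reads such a header with the JDK's `DataOutputStream` and `DataInputStream`, assuming big-endian fields with no extra framing between them. `FileHeaderSketch` is a hypothetical class, and the magic value (the ASCII bytes "ZKSN" read as an int) as well as the sample version and dbid are used purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in for a 16-byte header: magic, version, dbid. */
final class FileHeaderSketch {
    // Illustrative magic: the four ASCII bytes "ZKSN" read as a big-endian int.
    static final int MAGIC =
            ByteBuffer.wrap("ZKSN".getBytes(StandardCharsets.US_ASCII)).getInt();

    final int magic;
    final int version;
    final long dbid;

    FileHeaderSketch(int magic, int version, long dbid) {
        this.magic = magic;
        this.version = version;
        this.dbid = dbid;
    }

    /** Serializes the three fields in declaration order. */
    byte[] toBytes() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(magic);    // 4 bytes
            out.writeInt(version);  // 4 bytes
            out.writeLong(dbid);    // 8 bytes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back and rejects an unexpected magic number. */
    static FileHeaderSketch fromBytes(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int magic = in.readInt();
            if (magic != MAGIC) {
                throw new IOException("mismatching magic headers " + magic + " != " + MAGIC);
            }
            return new FileHeaderSketch(magic, in.readInt(), in.readLong());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] raw = new FileHeaderSketch(MAGIC, 2, -1L).toBytes();
        FileHeaderSketch header = fromBytes(raw);
        System.out.println("header length = " + raw.length + " bytes");
        System.out.println("version = " + header.version + ", dbid = " + header.dbid);
    }
}
```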
After FileHeader#deserialize returns, 16 bytes have been read from the file stream. Next, SerializeUtils#deserializeSnapshot(dt, ia, sessions) is called to load the remaining content:

    public static void deserializeSnapshot(DataTree dt,InputArchive ia,
            Map<Long, Integer> sessions) throws IOException {
        // Number of sessions
        int count = ia.readInt("count");
        while (count > 0) {
            // Session id
            long id = ia.readLong("id");
            // Session timeout
            int to = ia.readInt("timeout");
            sessions.put(id, to);
            if (LOG.isTraceEnabled()) {
                ZooTrace.logTraceMessage(LOG, ZooTrace.SESSION_TRACE_MASK,
                        "loadData --- session in archive: " + id
                        + " with timeout: " + to);
            }
            count--;
        }
        dt.deserialize(ia, "tree");
    }

First, the 4-byte count, that is, the number of sessions, is read from the stream. Then, for each session, an 8-byte sessionId and a 4-byte timeout are read and put into sessions (the sessionsWithTimeouts attribute of ZKDatabase). Finally, DataTree#deserialize is called to deserialize the actual stored data:
public void deserialize(InputArchive ia, String tag) throws IOException { aclCache.deserialize(ia); nodes.clear(); pTrie.clear(); String path = ia.readString("path"); while (!path.equals("/")) { DataNode node = new DataNode(); ia.readRecord(node, "node"); nodes.put(path, node); synchronized (node) { aclCache.addUsage(node.acl); } int lastSlash = path.lastIndexOf('/'); if (lastSlash == -1) { root = node; } else { String parentPath = path.substring(0, lastSlash); node.parent = nodes.get(parentPath); if (node.parent == null) { throw new IOException("Invalid Datatree, unable to find " + "parent " + parentPath + " of path " + path); } node.parent.addChild(path.substring(lastSlash + 1)); long eowner = node.stat.getEphemeralOwner(); if (eowner != 0) { HashSet<String> list = ephemerals.get(eowner); if (list == null) { list = new HashSet<String>(); ephemerals.put(eowner, list); } list.add(path); } } path = ia.readString("path"); } nodes.put("/", root); setupQuota(); aclCache.purgeUnused(); }
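The parent lookup in the method above relies on two conventions visible in the code: paths arrive parent-first, and the path "/" marks the end of the list, while the root node itself carries an empty path (it has no slash, so `lastIndexOf` returns -1). A minimal sketch of just that linking step follows. `PathTreeSketch` is a hypothetical class that tracks child names only, not node data or ACLs.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical illustration of rebuilding parent/child links from paths. */
final class PathTreeSketch {

    /** Returns path -> child names; stops at the "/" terminator. */
    static Map<String, Set<String>> link(List<String> paths) {
        Map<String, Set<String>> children = new LinkedHashMap<>();
        for (String path : paths) {
            if (path.equals("/")) {
                break; // end-of-tree marker
            }
            children.put(path, new TreeSet<String>());
            int lastSlash = path.lastIndexOf('/');
            if (lastSlash == -1) {
                continue; // the root is stored under the empty path
            }
            String parentPath = path.substring(0, lastSlash);
            Set<String> siblings = children.get(parentPath);
            if (siblings == null) {
                throw new IllegalStateException(
                        "unable to find parent " + parentPath + " of path " + path);
            }
            siblings.add(path.substring(lastSlash + 1));
        }
        return children;
    }

    public static void main(String[] args) {
        List<String> paths =
                Arrays.asList("", "/zookeeper", "/zookeeper/quota", "/app", "/");
        for (Map.Entry<String, Set<String>> e : link(paths).entrySet()) {
            String shown = e.getKey().isEmpty() ? "(root)" : e.getKey();
            System.out.println(shown + " -> " + e.getValue());
        }
    }
}
```

Because parents are always listed before their children, a single pass is enough, and a missing parent can be reported immediately as a corrupt tree.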
- Network transmission (NIO)
Connections between ZooKeeper and its clients, as well as the transmission of requests and responses, are handled by an implementation of ServerCnxnFactory. Here we look directly at the NIO implementation, NIOServerCnxnFactory. In the QuorumPeer#start method, we saw that the NIOServerCnxnFactory#start method is called:

    public void start() {
        // ensure thread is started once and only once
        if (thread.getState() == Thread.State.NEW) {
            thread.start();
        }
    }

The start method simply calls Thread#start to start the thread. The thread object itself is initialized in the NIOServerCnxnFactory#configure method:

    public void configure(InetSocketAddress addr, int maxcc) throws IOException {
        configureSaslLogin();

        // Initialize the thread object
        thread = new ZooKeeperThread(this, "NIOServerCxn.Factory:" + addr);
        thread.setDaemon(true);
        // Set the maximum number of client connections
        maxClientCnxns = maxcc;
        // Initialize the socket-related configuration
        this.ss = ServerSocketChannel.open();
        ss.socket().setReuseAddress(true);
        LOG.info("binding to port " + addr);
        ss.socket().bind(addr);
        ss.configureBlocking(false);
        ss.register(selector, SelectionKey.OP_ACCEPT);
    }

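The socket setup in configure follows the standard JDK NIO pattern: open a ServerSocketChannel, switch it to non-blocking mode, and register it with a Selector for OP_ACCEPT. The sketch below exercises that pattern on the loopback interface with one local client. `NioAcceptSketch`, the use of port 0 (let the OS pick a free port), and the 5 second select timeout are choices made for this sketch, not ZooKeeper settings.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

/** Hypothetical illustration of non-blocking accept with a Selector. */
final class NioAcceptSketch {

    /** Binds a listener, connects one local client, returns accepted count. */
    static int acceptOne() {
        try (Selector selector = Selector.open();
             ServerSocketChannel ss = ServerSocketChannel.open()) {
            ss.socket().setReuseAddress(true);
            // Port 0 asks the OS for any free port (sketch-only choice).
            ss.socket().bind(new InetSocketAddress(InetAddress.getLoopbackAddress(), 0));
            ss.configureBlocking(false);
            ss.register(selector, SelectionKey.OP_ACCEPT);

            int accepted = 0;
            // A blocking client connect; the server side is only notified via select().
            try (SocketChannel client = SocketChannel.open(ss.getLocalAddress())) {
                if (selector.select(5000) > 0) {
                    for (SelectionKey key : selector.selectedKeys()) {
                        if (key.isAcceptable()) {
                            SocketChannel peer =
                                    ((ServerSocketChannel) key.channel()).accept();
                            if (peer != null) {
                                accepted++;
                                peer.close();
                            }
                        }
                    }
                    selector.selectedKeys().clear();
                }
            }
            return accepted;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("accepted connections: " + acceptOne());
    }
}
```

In a real server the select loop runs in its own thread, which is the role of the thread object created in configure.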
- Election
After the network service has been started, preparation for the election begins. The call to QuorumPeer#startLeaderElection() in the QuorumPeer#start method is the entry point:

    synchronized public void startLeaderElection() {
        try {
            // Set the initial vote
            currentVote = new Vote(myid, getLastLoggedZxid(), getCurrentEpoch());
        } catch(IOException e) {
            RuntimeException re = new RuntimeException(e.getMessage());
            re.setStackTrace(e.getStackTrace());
            throw re;
        }
        for (QuorumServer p : getView().values()) {
            if (p.id == myid) {
                myQuorumAddr = p.addr;
                break;
            }
        }
        if (myQuorumAddr == null) {
            throw new RuntimeException("My id " + myid + " not in the peer list");
        }
        if (electionType == 0) {
            try {
                udpSocket = new DatagramSocket(myQuorumAddr.getPort());
                // Start the responder thread
                responder = new ResponderThread();
                responder.start();
            } catch (SocketException e) {
                throw new RuntimeException(e);
            }
        }
        // Perform some initialization according to the configured election algorithm
        this.electionAlg = createElectionAlgorithm(electionType);
    }

As the startLeaderElection method shows, the initial vote is cast for the node itself: sid is its own server id, zxid is the largest zxid (lastZxid) loaded from the snapshots and transaction logs, and peerEpoch is the node's current election epoch. If electionType is 0, a ResponderThread is also started. The core logic, however, is in the createElectionAlgorithm method, so let's follow it:
protected Election createElectionAlgorithm(int electionAlgorithm){ Election le=null; //TODO: use a factory rather than a switch switch (electionAlgorithm) { case 0: le = new LeaderElection(this); break; case 1: //Deprecated le = new AuthFastLeaderElection(this); break; case 2: //Deprecated le = new AuthFastLeaderElection(this, true); break; case 3: //Create connection manager qcm = createCnxnManager(); QuorumCnxManager.Listener listener = qcm.listener; if(listener != null){ //Start listening for other connection requests of nodes listener.start(); //Instantiate the core class of fast election algorithm le = new FastLeaderElection(this, qcm); } else { LOG.error("Null listener when initializing cnx manager"); } break; default: assert false; } return le; }
From the code above, we can see that for election type 3 the main work is to instantiate a QuorumCnxManager object, whose Listener class handles connection requests from other nodes. Calling Listener#start ends up running the Listener#run method:
public void run() { int numRetries = 0; InetSocketAddress addr; while((!shutdown) && (numRetries < 3)){ try { //Instantiate ServerSocket ss = new ServerSocket(); ss.setReuseAddress(true); if (listenOnAllIPs) { int port = view.get(QuorumCnxManager.this.mySid) .electionAddr.getPort(); addr = new InetSocketAddress(port); } else { addr = view.get(QuorumCnxManager.this.mySid) .electionAddr; } LOG.info("My election bind port: " + addr.toString()); setName(view.get(QuorumCnxManager.this.mySid) .electionAddr.toString()); ss.bind(addr); while (!shutdown) { //Blocking waiting for other nodes to request a connection Socket client = ss.accept(); setSockOpts(client); LOG.info("Received connection request " + client.getRemoteSocketAddress()); if (quorumSaslAuthEnabled) { receiveConnectionAsync(client); } else { //Accept request core logic receiveConnection(client); } numRetries = 0; } } catch (IOException e) { LOG.error("Exception while listening", e); numRetries++; try { ss.close(); Thread.sleep(1000); } catch (IOException ie) { LOG.error("Error closing server socket", ie); } catch (InterruptedException ie) { LOG.error("Interrupted while sleeping. " + "Ignoring exception", ie); } } } LOG.info("Leaving listener"); if (!shutdown) { LOG.error("As I'm leaving the listener thread, " + "I won't be able to participate in leader " + "election any longer: " + view.get(QuorumCnxManager.this.mySid).electionAddr); } }
This method uses the JDK's blocking I/O to establish connections with other nodes; readers unfamiliar with it may want to review the basics of JDK socket programming. The ss.accept() call in the inner while loop blocks until another node requests a connection. Once a connection is established, a Socket instance is returned and passed to the receiveConnection method, after which this node can communicate with the other node. The receiveConnection logic is as follows:

    public void receiveConnection(final Socket sock) {
        DataInputStream din = null;
        try {
            // Wrap the input stream
            din = new DataInputStream(
                    new BufferedInputStream(sock.getInputStream()));
            // Actually handle the connection
            handleConnection(sock, din);
        } catch (IOException e) {
            LOG.error("Exception handling connection, addr: {}, closing server connection",
                     sock.getRemoteSocketAddress());
            closeSocket(sock);
        }
    }

After wrapping the input stream, handleConnection is called to process the connection:
private void handleConnection(Socket sock, DataInputStream din) throws IOException { Long sid = null; try { // Blocking the first packet waiting for another node to send an establishment request //Read 8 bytes first, which may be sid (service id) or protocol version sid = din.readLong(); //The protocol version was read if (sid < 0) { //Read 8 bytes further, which is the real sid sid = din.readLong(); //Read 4 bytes, that is, read the number of bytes of other contents remaining int num_remaining_bytes = din.readInt(); //Perform word count check if (num_remaining_bytes < 0 || num_remaining_bytes > maxBuffer) { LOG.error("Unreasonable buffer length: {}", num_remaining_bytes); closeSocket(sock); return; } byte[] b = new byte[num_remaining_bytes]; //Read all the remaining byte contents into b this byte array at one time int num_read = din.read(b); if (num_read != num_remaining_bytes) { LOG.error("Read only " + num_read + " bytes out of " + num_remaining_bytes + " sent by server " + sid); } } if (sid == QuorumPeer.OBSERVER_ID) { sid = observerCounter.getAndDecrement(); LOG.info("Setting arbitrary identifier to observer: " + sid); } } catch (IOException e) { closeSocket(sock); LOG.warn("Exception reading or writing challenge: " + e.toString()); return; } LOG.debug("Authenticating learner server.id: {}", sid); authServer.authenticate(sock, din); //If the read sid is less than the SID of the current node, the previously established connection will be closed if (sid < this.mySid) { SendWorker sw = senderWorkerMap.get(sid); if (sw != null) { sw.finish(); } LOG.debug("Create new connection to server: " + sid); closeSocket(sock); //After closing the previous connection, the current node initiates a connection request connectOne(sid); } else { //Send thread SendWorker sw = new SendWorker(sock, sid); //Accept thread RecvWorker rw = new RecvWorker(sock, din, sid, sw); sw.setRecv(rw); SendWorker vsw = senderWorkerMap.get(sid); if(vsw != null) vsw.finish(); senderWorkerMap.put(sid, 
sw); queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY)); //Start sending thread sw.start(); //Start accept thread rw.start(); return; } }
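The two decisions made in handleConnection can be isolated into small functions: how the remote sid is extracted from the first packet, and whether the incoming connection is kept. The sketch below works on byte arrays instead of sockets. `HandshakeSketch`, the `Action` enum, the packet builders, and the sample version constant are hypothetical; the observer id special case and SASL authentication are left out, and `readFully` is used here (instead of `read`) so that a short read cannot go unnoticed.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical illustration of the first-packet parsing and sid tie-break. */
final class HandshakeSketch {
    enum Action { KEEP_INCOMING, CLOSE_AND_DIAL_BACK }

    /** Reads the sender's sid; a negative first long is a protocol version. */
    static long readSid(byte[] firstPacket, int maxBuffer) {
        try (DataInputStream din =
                     new DataInputStream(new ByteArrayInputStream(firstPacket))) {
            long sid = din.readLong();
            if (sid < 0) {
                sid = din.readLong();          // the real sid follows the version
                int remaining = din.readInt(); // length of the trailing payload
                if (remaining < 0 || remaining > maxBuffer) {
                    throw new IOException("Unreasonable buffer length: " + remaining);
                }
                din.readFully(new byte[remaining]);
            }
            return sid;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** The larger sid must be the initiator, so a smaller remote sid is dialed back. */
    static Action decide(long remoteSid, long mySid) {
        return remoteSid < mySid ? Action.CLOSE_AND_DIAL_BACK : Action.KEEP_INCOMING;
    }

    /** Builds a first packet that carries only the sid. */
    static byte[] plainPacket(long sid) {
        return build(null, sid, new byte[0]);
    }

    /** Builds a first packet with a (negative) version, sid and payload. */
    static byte[] versionedPacket(long version, long sid, byte[] payload) {
        return build(version, sid, payload);
    }

    private static byte[] build(Long version, long sid, byte[] payload) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            if (version != null) {
                out.writeLong(version);
            }
            out.writeLong(sid);
            if (version != null) {
                out.writeInt(payload.length);
                out.write(payload);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        long mySid = 2;
        long fromOne = readSid(plainPacket(1), 2048);
        long fromThree = readSid(versionedPacket(-1L, 3, new byte[] {1, 2, 3}), 2048);
        System.out.println("sid " + fromOne + " -> " + decide(fromOne, mySid));
        System.out.println("sid " + fromThree + " -> " + decide(fromThree, mySid));
    }
}
```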
This code shows that, between any two nodes, a connection is only kept if it was initiated by the node with the larger sid and accepted by the node with the smaller sid. For example, with three nodes sid = 1, sid = 2 and sid = 3, node 3 initiates the connections to nodes 1 and 2, node 2 initiates the connection to node 1, and node 1 only accepts connections. This ensures that only one connection is maintained between each pair of nodes, which is sufficient because a socket is full duplex and supports communication in both directions. A Socket can therefore be obtained in two ways: through ss.accept(), or by connecting to a node with a smaller sid through the connectOne method called above:
synchronized public void connectOne(long sid){ //This is to determine whether the sendWorkerMap contains the current sid if (!connectedToPeer(sid)){ InetSocketAddress electionAddr; if (view.containsKey(sid)) { //Get the previously configured server Election address of ID electionAddr = view.get(sid).electionAddr; } else { LOG.warn("Invalid server id: " + sid); return; } try { LOG.debug("Opening channel to server " + sid); //Instantiate Socket object Socket sock = new Socket(); setSockOpts(sock); //Connect sock.connect(view.get(sid).electionAddr, cnxTO); LOG.debug("Connected to server " + sid); if (quorumSaslAuthEnabled) { initiateConnectionAsync(sock, sid); } else { //Synchronously initialize the connection, that is, send some information about itself to other nodes initiateConnection(sock, sid); } } catch (UnresolvedAddressException e) { LOG.warn("Cannot open channel to " + sid + " at election address " + electionAddr, e); if (view.containsKey(sid)) { view.get(sid).recreateSocketAddresses(); } throw e; } catch (IOException e) { LOG.warn("Cannot open channel to " + sid + " at election address " + electionAddr, e); if (view.containsKey(sid)) { view.get(sid).recreateSocketAddresses(); } } } else { LOG.debug("There is a connection already for server " + sid); } }

    public void initiateConnection(final Socket sock, final Long sid) {
        try {
            startConnection(sock, sid);
        } catch (IOException e) {
            LOG.error("Exception while connecting, id: {}, addr: {}, closing learner connection",
                     new Object[] { sid, sock.getRemoteSocketAddress() }, e);
            closeSocket(sock);
            return;
        }
    }

private boolean startConnection(Socket sock, Long sid) throws IOException { DataOutputStream dout = null; DataInputStream din = null; try { dout = new DataOutputStream(sock.getOutputStream()); //Send its own sid to other nodes dout.writeLong(this.mySid); dout.flush(); din = new DataInputStream( new BufferedInputStream(sock.getInputStream())); } catch (IOException e) { LOG.warn("Ignoring exception reading or writing challenge: ", e); closeSocket(sock); return false; } // authenticate learner authLearner.authenticate(sock, view.get(sid).hostname); if (sid > this.mySid) { LOG.info("Have smaller server identifier, so dropping the " + "connection: (" + sid + ", " + this.mySid + ")"); closeSocket(sock); // Otherwise proceed with the connection } else { //The following logic is through SS Accept has the same logic after getting the socket object SendWorker sw = new SendWorker(sock, sid); RecvWorker rw = new RecvWorker(sock, din, sid, sw); sw.setRecv(rw); SendWorker vsw = senderWorkerMap.get(sid); if(vsw != null) vsw.finish(); senderWorkerMap.put(sid, sw); queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY)); sw.start(); rw.start(); return true; } return false; }
As can be seen from the methods above, once a Socket object has been obtained through ServerSocket#accept or Socket#connect, a SendWorker and a RecvWorker are instantiated and their start methods are called to start two threads. These two threads handle sending data to and receiving data from the other node. For each peer, a node maintains one SendWorker, one RecvWorker, and one send queue stored in queueSendMap.
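The per-peer send queue can be pictured with a small sketch: a ConcurrentHashMap from sid to a bounded ArrayBlockingQueue, filled by the caller and drained by a worker thread. `SendQueueSketch`, its capacity of 4, and the 100 ms poll timeout are assumptions of this sketch; the worker here only collects the messages into a list instead of writing them to a socket.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

/** Hypothetical illustration of one bounded send queue per peer. */
final class SendQueueSketch {
    static final int SEND_CAPACITY = 4; // sketch-only capacity

    private final ConcurrentHashMap<Long, ArrayBlockingQueue<ByteBuffer>> queueSendMap =
            new ConcurrentHashMap<>();

    /** Queues a message for the given peer; returns false if the queue is full. */
    boolean enqueue(long sid, String message) {
        queueSendMap.putIfAbsent(sid, new ArrayBlockingQueue<ByteBuffer>(SEND_CAPACITY));
        ByteBuffer buffer = ByteBuffer.wrap(message.getBytes(StandardCharsets.UTF_8));
        return queueSendMap.get(sid).offer(buffer);
    }

    /** Plays the role of a sender thread: drains everything queued for one peer. */
    List<String> drain(long sid) {
        final List<String> sent = new ArrayList<>();
        final ArrayBlockingQueue<ByteBuffer> queue = queueSendMap.get(sid);
        if (queue == null) {
            return sent;
        }
        Thread worker = new Thread(() -> {
            try {
                ByteBuffer b;
                // Stop once nothing arrives within the poll timeout.
                while ((b = queue.poll(100, TimeUnit.MILLISECONDS)) != null) {
                    sent.add(new String(b.array(), StandardCharsets.UTF_8));
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "SendWorker-" + sid);
        worker.start();
        try {
            worker.join(); // join makes the worker's writes visible to the caller
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sent;
    }

    public static void main(String[] args) {
        SendQueueSketch sketch = new SendQueueSketch();
        sketch.enqueue(2L, "vote-a");
        sketch.enqueue(2L, "vote-b");
        sketch.enqueue(3L, "vote-c");
        System.out.println("to sid 2: " + sketch.drain(2L));
        System.out.println("to sid 3: " + sketch.drain(3L));
    }
}
```

Keeping a separate queue per peer means a slow or unreachable node only backs up its own queue and does not delay messages to the other nodes.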
How SendWorker, RecvWorker and the send queue work will be explained in detail in the election section. After completing these election preparations, we return to the QuorumPeer#start method, which next calls super.start(). QuorumPeer extends ZooKeeperThread, which in turn extends the JDK Thread class, so calling super.start() starts a separate thread that executes the QuorumPeer#run method, where the election actually takes place:
public void run() { setName("QuorumPeer" + "[myid=" + getId() + "]" + cnxnFactory.getLocalAddress()); LOG.debug("Starting quorum peer"); //1.jmx expansion points try { jmxQuorumBean = new QuorumBean(this); MBeanRegistry.getInstance().register(jmxQuorumBean, null); for(QuorumServer s: getView().values()){ ZKMBeanInfo p; if (getId() == s.id) { p = jmxLocalPeerBean = new LocalPeerBean(this); try { MBeanRegistry.getInstance().register(p, jmxQuorumBean); } catch (Exception e) { LOG.warn("Failed to register with JMX", e); jmxLocalPeerBean = null; } } else { p = new RemotePeerBean(s); try { MBeanRegistry.getInstance().register(p, jmxQuorumBean); } catch (Exception e) { LOG.warn("Failed to register with JMX", e); } } } } catch (Exception e) { LOG.warn("Failed to register with JMX", e); jmxQuorumBean = null; } 2.//Election logic try { /* * Main loop */ while (running) { switch (getPeerState()) { //1.Looking status case LOOKING: LOG.info("LOOKING"); //Turn on read-only mode if (Boolean.getBoolean("readonlymode.enabled")) { LOG.info("Attempting to start ReadOnlyZooKeeperServer"); final ReadOnlyZooKeeperServer roZk = new ReadOnlyZooKeeperServer( logFactory, this, new ZooKeeperServer.BasicDataTreeBuilder(), this.zkDb); Thread roZkMgr = new Thread() { public void run() { try { // lower-bound grace period to 2 secs sleep(Math.max(2000, tickTime)); if (ServerState.LOOKING.equals(getPeerState())) { roZk.startup(); } } catch (InterruptedException e) { LOG.info("Interrupted while attempting to start ReadOnlyZooKeeperServer, not started"); } catch (Exception e) { LOG.error("FAILED to start ReadOnlyZooKeeperServer", e); } } }; try { roZkMgr.start(); setBCVote(null); setCurrentVote(makeLEStrategy().lookForLeader()); } catch (Exception e) { LOG.warn("Unexpected exception",e); setPeerState(ServerState.LOOKING); } finally { // If the thread is in the the grace period, interrupt // to come out of waiting. 
roZkMgr.interrupt(); roZk.shutdown(); } } else { try { setBCVote(null); //Call the selectionalg#lookforleader method, and then return the voting information after the election setCurrentVote(makeLEStrategy().lookForLeader()); } catch (Exception e) { LOG.warn("Unexpected exception", e); setPeerState(ServerState.LOOKING); } } break; //After the election, enter the observer role here case OBSERVING: try { LOG.info("OBSERVING"); setObserver(makeObserver(logFactory)); observer.observeLeader(); } catch (Exception e) { LOG.warn("Unexpected exception",e ); } finally { observer.shutdown(); setObserver(null); setPeerState(ServerState.LOOKING); } break; //After the election, the Follower role enters here case FOLLOWING: try { LOG.info("FOLLOWING"); setFollower(makeFollower(logFactory)); follower.followLeader(); } catch (Exception e) { LOG.warn("Unexpected exception",e); } finally { follower.shutdown(); setFollower(null); setPeerState(ServerState.LOOKING); } break; //After the election, the Leader role enters here case LEADING: LOG.info("LEADING"); try { setLeader(makeLeader(logFactory)); leader.lead(); setLeader(null); } catch (Exception e) { LOG.warn("Unexpected exception",e); } finally { if (leader != null) { leader.shutdown("Forcing shutdown"); setLeader(null); } setPeerState(ServerState.LOOKING); } break; } } } finally { LOG.warn("QuorumPeer main thread exited"); try { MBeanRegistry.getInstance().unregisterAll(); } catch (Exception e) { LOG.warn("Failed to unregister with JMX", e); } jmxQuorumBean = null; jmxLocalPeerBean = null; } }
We can start from the main loop in the code above. After entering the while loop, the node goes into the LOOKING branch because it is still in the LOOKING state. This branch first checks whether read-only mode is enabled. Since read-only mode is not covered here, we go straight to the other branch:

    setBCVote(null);
    // Call Election#lookForLeader, which returns the vote once the election is over
    setCurrentVote(makeLEStrategy().lookForLeader());

The makeLEStrategy method actually returns the FastLeaderElection instance that was created in the QuorumPeer#startLeaderElection method. FastLeaderElection#lookForLeader is then called to carry out the leader election:
public Vote lookForLeader() throws InterruptedException { try { self.jmxLeaderElectionBean = new LeaderElectionBean(); MBeanRegistry.getInstance().register( self.jmxLeaderElectionBean, self.jmxLocalPeerBean); } catch (Exception e) { LOG.warn("Failed to register with JMX", e); self.jmxLeaderElectionBean = null; } if (self.start_fle == 0) { self.start_fle = Time.currentElapsedTime(); } try { HashMap<Long, Vote> recvset = new HashMap<Long, Vote>(); HashMap<Long, Vote> outofelection = new HashMap<Long, Vote>(); int notTimeout = finalizeWait; synchronized(this){ logicalclock.incrementAndGet(); updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch()); } LOG.info("New election. My id = " + self.getId() + ", proposed zxid=0x" + Long.toHexString(proposedZxid)); sendNotifications(); /* * Loop in which we exchange notifications until we find a leader */ while ((self.getPeerState() == ServerState.LOOKING) && (!stop)){ /* * Remove next notification from queue, times out after 2 times * the termination time */ Notification n = recvqueue.poll(notTimeout, TimeUnit.MILLISECONDS); /* * Sends more notifications if haven't received enough. * Otherwise processes new notification. */ if(n == null){ if(manager.haveDelivered()){ sendNotifications(); } else { manager.connectAll(); } /* * Exponential backoff */ int tmpTimeOut = notTimeout*2; notTimeout = (tmpTimeOut < maxNotificationInterval? tmpTimeOut : maxNotificationInterval); LOG.info("Notification time out: " + notTimeout); } else if(validVoter(n.sid) && validVoter(n.leader)) { /* * Only proceed if the vote comes from a replica in the * voting view for a replica in the voting view. 
*/ switch (n.state) { case LOOKING: // If notification > current, replace and send messages out if (n.electionEpoch > logicalclock.get()) { logicalclock.set(n.electionEpoch); recvset.clear(); if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, getInitId(), getInitLastLoggedZxid(), getPeerEpoch())) { updateProposal(n.leader, n.zxid, n.peerEpoch); } else { updateProposal(getInitId(), getInitLastLoggedZxid(), getPeerEpoch()); } sendNotifications(); } else if (n.electionEpoch < logicalclock.get()) { if(LOG.isDebugEnabled()){ LOG.debug("Notification election epoch is smaller than logicalclock. n.electionEpoch = 0x" + Long.toHexString(n.electionEpoch) + ", logicalclock=0x" + Long.toHexString(logicalclock.get())); } break; } else if (totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)) { updateProposal(n.leader, n.zxid, n.peerEpoch); sendNotifications(); } if(LOG.isDebugEnabled()){ LOG.debug("Adding vote: from=" + n.sid + ", proposed leader=" + n.leader + ", proposed zxid=0x" + Long.toHexString(n.zxid) + ", proposed election epoch=0x" + Long.toHexString(n.electionEpoch)); } recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch)); if (termPredicate(recvset, new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch))) { // Verify if there is any change in the proposed leader while((n = recvqueue.poll(finalizeWait, TimeUnit.MILLISECONDS)) != null){ if(totalOrderPredicate(n.leader, n.zxid, n.peerEpoch, proposedLeader, proposedZxid, proposedEpoch)){ recvqueue.put(n); break; } } /* * This predicate is true once we don't read any new * relevant message from the reception queue */ if (n == null) { self.setPeerState((proposedLeader == self.getId()) ? 
ServerState.LEADING: learningState()); Vote endVote = new Vote(proposedLeader, proposedZxid, logicalclock.get(), proposedEpoch); leaveInstance(endVote); return endVote; } } break; case OBSERVING: LOG.debug("Notification from observer: " + n.sid); break; case FOLLOWING: case LEADING: /* * Consider all notifications from the same epoch * together. */ if(n.electionEpoch == logicalclock.get()){ recvset.put(n.sid, new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch)); if(ooePredicate(recvset, outofelection, n)) { self.setPeerState((n.leader == self.getId()) ? ServerState.LEADING: learningState()); Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch); leaveInstance(endVote); return endVote; } } /* * Before joining an established ensemble, verify * a majority is following the same leader. */ outofelection.put(n.sid, new Vote(n.version, n.leader, n.zxid, n.electionEpoch, n.peerEpoch, n.state)); if(ooePredicate(outofelection, outofelection, n)) { synchronized(this){ logicalclock.set(n.electionEpoch); self.setPeerState((n.leader == self.getId()) ? ServerState.LEADING: learningState()); } Vote endVote = new Vote(n.leader, n.zxid, n.electionEpoch, n.peerEpoch); leaveInstance(endVote); return endVote; } break; default: LOG.warn("Notification state unrecognized: {} (n.state), {} (n.sid)", n.state, n.sid); break; } } else { if (!validVoter(n.leader)) { LOG.warn("Ignoring notification for non-cluster member sid {} from sid {}", n.leader, n.sid); } if (!validVoter(n.sid)) { LOG.warn("Ignoring notification for sid {} from non-quorum member sid {}", n.leader, n.sid); } } } return null; } finally { try { if(self.jmxLeaderElectionBean != null){ MBeanRegistry.getInstance().unregister( self.jmxLeaderElectionBean); } } catch (Exception e) { LOG.warn("Failed to unregister with JMX", e); } self.jmxLeaderElectionBean = null; LOG.debug("Number of connection processing threads: {}", manager.getConnectionThreadCount()); } }
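Three small rules carry most of the loop above: deciding whether a received vote beats the current proposal, deciding whether enough votes agree, and doubling the poll timeout up to a cap. The bodies of totalOrderPredicate and termPredicate are not shown in this article, so the sketch below is an assumption: it orders votes by peer epoch, then zxid, then server id, and treats a strict majority as a quorum. `ElectionSketch` and its method names are hypothetical, and the timeout values in `main` are sample numbers.

```java
import java.util.Arrays;
import java.util.Collection;

/** Hypothetical illustration of vote ordering, majority check and backoff. */
final class ElectionSketch {

    /** Assumed ordering: higher epoch wins, then higher zxid, then higher sid. */
    static boolean supersedes(long newId, long newZxid, long newEpoch,
                              long curId, long curZxid, long curEpoch) {
        return newEpoch > curEpoch
                || (newEpoch == curEpoch
                    && (newZxid > curZxid
                        || (newZxid == curZxid && newId > curId)));
    }

    /** Assumed quorum rule: strictly more than half of the ensemble agrees. */
    static boolean hasMajority(Collection<Long> votedLeaders,
                               long proposedLeader, int ensembleSize) {
        int count = 0;
        for (long leader : votedLeaders) {
            if (leader == proposedLeader) {
                count++;
            }
        }
        return count > ensembleSize / 2;
    }

    /** Doubles the timeout, capped at max, as in the backoff step of the loop. */
    static int nextTimeout(int current, int max) {
        int doubled = current * 2;
        return doubled < max ? doubled : max;
    }

    public static void main(String[] args) {
        // Same epoch and zxid: the larger server id wins.
        System.out.println(supersedes(3, 0x100, 1, 1, 0x100, 1));
        // Two of three nodes vote for server 3.
        System.out.println(hasMajority(Arrays.asList(3L, 3L, 1L), 3L, 3));
        // Sample backoff sequence starting from 200 ms with a 1000 ms cap.
        int timeout = 200;
        for (int i = 0; i < 4; i++) {
            timeout = nextTimeout(timeout, 1000);
            System.out.println("next timeout: " + timeout);
        }
    }
}
```

With this ordering, nodes that start with identical data converge on the node with the largest server id, while a node with a more recent zxid or epoch always beats one with older data.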
To be continued