Bug-Hunting Journey - Dubbo Apps Cannot Reconnect to ZooKeeper

Posted by rschneid on Tue, 07 Apr 2020 05:02:01 +0200

Preface

Dubbo is a mature and widely used framework. Even so, in some extreme cases dubbo-based applications can fail to reconnect to zookeeper. Because this problem can easily cause fairly large failures, the author spent a lot of time locating it and is writing this post to share the investigation process.

Bug Site

This failure occurred in the test environment. It was triggered by a network switch failover drill; possibly because the drill was performed incorrectly, the disconnection lasted minutes instead of the expected seconds. When the network was restored, the test environment blew up: virtually all applications were unable to serve requests, and no providers were visible on the dubbo console. They had been disconnected from zk and showed no sign of reconnecting at all. As shown in the following figure:

Unable to recover quickly

So as not to hold up testing, the operations team urgently restarted the applications. Frustratingly, most systems have startup dependencies, so a blind restart simply fails at startup because provider xxx does not exist. The only option was to restart from the most basic services and recover slowly. As shown in the following figure:

Fortunately this was only a test environment, but to keep production from hitting the same problem, the bug had to be tracked down.

Start investigation

Simulate zookeeper disconnection

The advantage of the test environment is that we can reproduce the problem by various means, instead of hunting for clues and reasoning from them (a very brain-burning process) as we must in production. So the author asked the SA team to simulate a network disconnection with iptables. The commands are as follows:

# Drop traffic between the local machine and the zk machines, in both directions
iptables -A INPUT -s zk-1-ip/32 -j DROP
iptables -A INPUT -s zk-2-ip/32 -j DROP
iptables -A INPUT -s zk-3-ip/32 -j DROP

# Outbound packets must be matched by destination (-d), not source (-s)
iptables -A OUTPUT -d zk-1-ip/32 -j DROP
iptables -A OUTPUT -d zk-2-ip/32 -j DROP
iptables -A OUTPUT -d zk-3-ip/32 -j DROP

The topology diagram is as follows:

After dropping the zk packets, no matter how long we waited, the application reconnected to zk as soon as the rules were removed! It seems that dubbo's reconnection to zookeeper is quite reliable.

Simultaneously simulate DNS disconnection

Simulating a zk disconnection alone did not reproduce the problem. The author began to wonder whether the switch fault had made all packets fail to send or receive, so that something other than the zookeeper connection itself caused the reconnection problem. The author therefore reviewed the configuration for anything else involved in reconnecting, and on close inspection found:

# An easily overlooked point: domain name resolution also requires network packets.
dubbo.registry.address=zookeeper://dubbo-1.com?backup=dubbo-2.com,dubbo-3.com
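
To make the DNS dependency concrete, here is a minimal sketch that lists every host named in such a registry address and flags which ones would need a DNS lookup. The class `RegistryHosts` and its methods are hypothetical helpers written for this post, not part of Dubbo, and the sketch assumes the backup hosts are passed in a `backup` query parameter.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Dubbo.
class RegistryHosts {
    // Collects the primary host plus any hosts listed in the "backup" query parameter.
    static List<String> hostsOf(String registryAddress) {
        URI uri = URI.create(registryAddress);
        List<String> hosts = new ArrayList<>();
        hosts.add(uri.getHost());
        String query = uri.getQuery();
        if (query != null) {
            for (String param : query.split("&")) {
                if (param.startsWith("backup=")) {
                    for (String hostPort : param.substring("backup=".length()).split(",")) {
                        // Drop an optional ":port" suffix.
                        int colon = hostPort.indexOf(':');
                        hosts.add(colon < 0 ? hostPort : hostPort.substring(0, colon));
                    }
                }
            }
        }
        return hosts;
    }

    // True for a dotted-quad IPv4 literal, which needs no DNS lookup.
    static boolean isIpv4Literal(String host) {
        return host.matches("\\d{1,3}(\\.\\d{1,3}){3}");
    }

    public static void main(String[] args) {
        String address = "zookeeper://dubbo-1.com?backup=dubbo-2.com,dubbo-3.com";
        for (String host : hostsOf(address)) {
            System.out.println(host + " needs DNS: " + !isIpv4Literal(host));
        }
    }
}
```

With the address above, all three hosts are domain names, so every reconnection attempt that re-resolves them depends on DNS being reachable.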

Was unreachable DNS causing this problem? We continued the simulation in the test environment with the following commands:

# Drop traffic between the local machine and the zk machines, in both directions
iptables -A INPUT -s zk-1-ip/32 -j DROP
iptables -A INPUT -s zk-2-ip/32 -j DROP
iptables -A INPUT -s zk-3-ip/32 -j DROP
iptables -A OUTPUT -d zk-1-ip/32 -j DROP
iptables -A OUTPUT -d zk-2-ip/32 -j DROP
iptables -A OUTPUT -d zk-3-ip/32 -j DROP
# Drop traffic between the local machine and the DNS machine, in both directions
iptables -A INPUT -s dns-ip/32 -j DROP
iptables -A OUTPUT -d dns-ip/32 -j DROP

The network topology is as follows:

This time, after blocking traffic, we deliberately restored zk traffic before DNS traffic, as shown in the following figure:

It appears that if DNS also fails to respond while dubbo is reconnecting to zookeeper, the application can no longer reconnect even after the network recovers. However, this alone does not prove that the failure after the switch fault was caused by this bug. Evidence is needed!

Stones from other hills can polish jade

With this DNS clue in hand, the first step was to google whether anyone else had fallen into this pit. The author found this link:

https://github.com/sgroschupf/zkclient/issues/23


As described on GitHub, zkclient can no longer reconnect to zookeeper after an UnknownHostException is thrown. The reporter hit it in Kafka, but inferred that it affects all older versions of org.apache.zookeeper. Following the bug-fix link given there:

https://issues.apache.org/jira/browse/ZOOKEEPER-1576

The author found that it was fixed in 3.5.0.

The application's org.apache.zookeeper dependency was upgraded to version 3.5.5. After rerunning the experiment, the problem was solved!
Here is a little trick we can use:

# Remove the old jar from the package, then add the new one uncompressed (-0)
zip -d xxx.jar WEB-INF/lib/zookeeper-3.4.8.jar
zip -r -0 xxx.jar WEB-INF/lib/zookeeper-3.5.5.jar
# Likewise, zip -r any other new dependencies that zookeeper-3.5.5 needs

This swaps the jar version used by the application without rebuilding the package, so for quick validation there is no need to ask the developers to change the dependency.

Find zookeeper jar packages that support jdk1.6

The production environment the author works in still runs many older systems on JDK 1.6, and zookeeper-3.5.5 requires JDK 1.8 or above, so a package usable on JDK 1.6 was needed. At this point a colleague responsible for Kafka, who had run chaos tests on Kafka and believed Kafka does not have this problem, repeated the test with the zookeeper-3.4.13 package that Kafka depends on, and found that zookeeper-3.4.13 is also fine. The specific code changes are described below.

Search logs for evidence

Since UnknownHostException may have caused this problem, the author searched the logs of the affected application and did find an UnknownHostException. The logs are as follows:

// A large number of repeated disconnection logs have been filtered out; only the core entries are listed
2020-03-19 21:06:28.926 [DubboZkclientConnector-EventThread] zookeeper state changed (Disconnected)
2020-03-19 21:06:28.926 [ZkClient-EventThread-101-dubbo.com] Zookeeper Lost Connection
2020-03-19 21:06:49.758 [DubboZkclientConnector-EventThread] zookeeper state changed (Expired)
2020-03-19 21:06:49.759 [DubboZkclientConnector-SendThread] Unable to reconnect to ZooKeeper service, session 0xXXXXX has expired
2020-03-19 21:07:29.793 [DubboZkclientConnector-EventThread] ERROR ClientCnxn - Error while calling watcher
java.lang.RuntimeException: Exception while restarting zk client
	......
	......
Caused by: java.net.UnknownHostException: dubbo-1.com
	at...lookupAllHostAddr...
	......
	at...StaticHostProvider...
	......
	at...reconnect...[zookeeper-3.4.8.jar:3.4.8--1]
	

The log above shows that if java.net.UnknownHostException is thrown while the session is being re-established after the zookeeper session expires, the zkclient thread takes no further action.

Code analysis

Old version code logic

Combining the evidence above with the experimental results, it is almost certain that this DNS failure is what leaves dubbo unable to reconnect to zookeeper. So the author started reading the code. First, let's look at dubbo's reconnection logic:

public class ZkclientZookeeperClient extends AbstractZookeeperClient<IZkChildListener> {
	......
    public ZkclientZookeeperClient(URL url) {
        super(url);
        client = new ZkClientWrapper(url.getBackupAddress(), 30000);
        client.addListener(new IZkStateListener() {
            public void handleStateChanged(KeeperState state) throws Exception {
                ZkclientZookeeperClient.this.state = state;
                if (state == KeeperState.Disconnected) {
                    stateChanged(StateListener.DISCONNECTED);
                } else if (state == KeeperState.SyncConnected) {
                    stateChanged(StateListener.CONNECTED);
                }
            }
            // Called once the session has been rebuilt
            public void handleNewSession() throws Exception {
                stateChanged(StateListener.RECONNECTED);
            }
        });
        client.start();
    }
	......
}
// Handling of StateListener.RECONNECTED in ZookeeperRegistry.java
public class ZookeeperRegistry extends FailbackRegistry {
	......
        zkClient.addStateListener(new StateListener() {
            // recover() runs the recovery process after the RECONNECTED event is received
            public void stateChanged(int state) {
                if (state == RECONNECTED) {
                    try {
                        recover();
                    } catch (Exception e) {
                        logger.error(e.getMessage(), e);
                    }
                }
            }
        });
   ......
}

From the code above, we can see that after the session expires a new session is created internally, and once it is, dubbo's StateListener receives a RECONNECTED event and runs the recover process, as shown in the following figure:
So let's see what happens when UnknownHostException is thrown. The code looks like this:

public class ZkClient implements Watcher {
......
    private void processStateChanged(WatchedEvent event) {
        // This corresponds to the session expired log
        LOG.info("zookeeper state changed (" + event.getState() + ")");
        setCurrentState(event.getState());
        if (getShutdownTrigger()) {
            return;
        }
        try {
            fireStateChangedEvent(event.getState());

            if (event.getState() == KeeperState.Expired) {
				// This is where UnknownHostException is thrown.
                reconnect();
                fireNewSessionEvents();
            }
        } catch (final Exception e) {
            // This corresponds to the "Error while calling watcher" log
            throw new RuntimeException("Exception while restarting zk client", e);
        }
    }

......
}

The exception is thrown by reconnect(). Once it is thrown, fireNewSessionEvents() never runs, so the listener's handleNewSession logic is never executed and recover() is never triggered, leaving dubbo unable to reconnect! As shown in the following figure:
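
The control flow can be reproduced in isolation. The following is a simplified sketch, not the zkclient source: `ExpiredSessionDemo`, `Reconnector`, and the `newSessionFired` flag are hypothetical stand-ins for the expired-session branch, `reconnect()`, and `fireNewSessionEvents()`.

```java
import java.net.UnknownHostException;

// Simplified stand-in for the Expired branch shown above; not the zkclient source.
class ExpiredSessionDemo {
    // Hypothetical hook that plays the role of reconnect().
    interface Reconnector {
        void reconnect() throws Exception;
    }

    boolean newSessionFired = false;

    // Same shape as the original: reconnect first, then notify listeners.
    void onSessionExpired(Reconnector reconnector) {
        try {
            reconnector.reconnect();
            newSessionFired = true; // stands in for fireNewSessionEvents()
        } catch (Exception e) {
            throw new RuntimeException("Exception while restarting zk client", e);
        }
    }

    // Reports whether the new-session notification ran for a given reconnect behaviour.
    static boolean newSessionFiredWith(Reconnector reconnector) {
        ExpiredSessionDemo demo = new ExpiredSessionDemo();
        try {
            demo.onSessionExpired(reconnector);
        } catch (RuntimeException e) {
            // The caller only sees the wrapped exception; nothing in this flow retries.
        }
        return demo.newSessionFired;
    }

    public static void main(String[] args) {
        // DNS healthy: reconnect succeeds and the notification fires.
        System.out.println(newSessionFiredWith(() -> { })); // prints "true"
        // DNS down: reconnect throws and the notification is skipped.
        System.out.println(newSessionFiredWith(() -> {
            throw new UnknownHostException("dubbo-1.com");
        })); // prints "false"
    }
}
```

Because the notification sits after the call that can throw, a single DNS failure at the wrong moment means the recovery step is never scheduled.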

How the new version fixes it

Since UnknownHostException is triggered in StaticHostProvider, the author gives the corresponding code for the old and new versions here. First, zookeeper-3.4.8:

public final class StaticHostProvider implements HostProvider{
	......
    public StaticHostProvider(Collection<InetSocketAddress> serverAddresses)
            throws UnknownHostException {
        for (InetSocketAddress address : serverAddresses) {
            InetAddress ia = address.getAddress();
            // UnknownHostException is not caught here
            InetAddress resolvedAddresses[] = InetAddress.getAllByName((ia!=null) ? ia.getHostAddress():
                address.getHostName());
            for (InetAddress resolvedAddress : resolvedAddresses) {
 				 	......
                if (resolvedAddress.toString().startsWith("/") 
                        && resolvedAddress.getAddress() != null) {
                    this.serverAddresses.add(
                            new InetSocketAddress(InetAddress.getByAddress(
                                    address.getHostName(),
                                    resolvedAddress.getAddress()), 
                                    address.getPort()));
                } else {
                    this.serverAddresses.add(new InetSocketAddress(resolvedAddress.getHostAddress(), address.getPort()));
                }  
            }
        }
        
        if (this.serverAddresses.isEmpty()) {
            throw new IllegalArgumentException(
                    "A HostProvider may not be empty!");
        }
        Collections.shuffle(this.serverAddresses);
    }
   ......
}

The newer zookeeper-3.4.13 did a minor refactoring: it moved the DNS resolution logic into the next() function and catches UnknownHostException there.

public final class StaticHostProvider implements HostProvider{
	......
    public InetSocketAddress next(long spinDelay) {
        currentIndex = ++currentIndex % serverAddresses.size();
        if (currentIndex == lastIndex && spinDelay > 0) {
            try {
                Thread.sleep(spinDelay);
            } catch (InterruptedException e) {
                LOG.warn("Unexpected exception", e);
            }
        } else if (lastIndex == -1) {
            // We don't want to sleep on the first ever connect attempt.
            lastIndex = 0;
        }

        InetSocketAddress curAddr = serverAddresses.get(currentIndex);
        try {
            String curHostString = getHostString(curAddr);
            List<InetAddress> resolvedAddresses = new ArrayList<InetAddress>(Arrays.asList(this.resolver.getAllByName(curHostString)));
            if (resolvedAddresses.isEmpty()) {
                return curAddr;
            }
            Collections.shuffle(resolvedAddresses);
            return new InetSocketAddress(resolvedAddresses.get(0), curAddr.getPort());
        } catch (UnknownHostException e) {
            // UnknownHostException is caught here, which fixes the problem
            return curAddr;
        }
    }	
	......
}

As we can see, the newer zookeeper-3.4.13 catches this UnknownHostException, which fixes the problem.
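
The idea of resolving lazily and falling back to the unresolved address can be sketched on its own. This is an illustration under stated assumptions rather than the ZooKeeper implementation: `LazyResolveDemo` and its `Resolver` interface are hypothetical, introduced so a DNS outage can be simulated without touching the network, and 10.0.0.1 is a made-up address.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Illustration only; not the ZooKeeper implementation.
class LazyResolveDemo {
    // Hypothetical hook so that DNS behaviour can be simulated without a network.
    interface Resolver {
        InetAddress[] getAllByName(String host) throws UnknownHostException;
    }

    // Resolve at connect time. If DNS fails, return the unresolved address
    // instead of propagating the exception, so a later attempt can try again.
    static InetSocketAddress resolve(InetSocketAddress addr, Resolver resolver) {
        try {
            InetAddress[] resolved = resolver.getAllByName(addr.getHostString());
            if (resolved.length == 0) {
                return addr;
            }
            return new InetSocketAddress(resolved[0], addr.getPort());
        } catch (UnknownHostException e) {
            return addr;
        }
    }

    public static void main(String[] args) {
        InetSocketAddress zk = InetSocketAddress.createUnresolved("dubbo-1.com", 2181);

        // Simulated DNS outage: the caller gets an address back, not an exception.
        InetSocketAddress down = resolve(zk, host -> {
            throw new UnknownHostException(host);
        });
        System.out.println("DNS down, unresolved: " + down.isUnresolved());

        // Simulated healthy DNS.
        InetSocketAddress up = resolve(zk,
                host -> new InetAddress[] {InetAddress.getByAddress(host, new byte[] {10, 0, 0, 1})});
        System.out.println("DNS up, unresolved: " + up.isUnresolved());
    }
}
```

The design point is that a failed lookup degrades into a failed connection attempt, which the client already knows how to retry, instead of an exception that aborts session re-establishment.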

Review of bug trigger conditions

The bug is in fact easy to trigger. It takes an older version of the zookeeper jar, a broken connection to the zookeeper service lasting long enough for the session to expire (default 30s), and an expired DNS cache (default 30s) while DNS is unreachable. In the earlier test environment incident, the switch failover for some reason caused a network outage of more than 30 seconds. This is not limited to Dubbo: any application using an older zookeeper jar can hit this problem!
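
The DNS cache lifetime depends on the JVM's configuration, so it is worth checking on the machines in question. A small sketch, assuming a standard JDK: it prints the two security properties that control the JVM's address cache. `DnsCacheSettings` is a hypothetical helper name; when `networkaddress.cache.ttl` is unset, the JVM uses an implementation-specific default.

```java
import java.security.Security;

// Hypothetical helper for inspecting the JVM's DNS cache settings.
class DnsCacheSettings {
    // Formats one security property, noting when it has no explicit value.
    static String describe(String key) {
        String value = Security.getProperty(key);
        return key + "=" + (value == null ? "(unset, JVM default applies)" : value);
    }

    public static void main(String[] args) {
        // Lifetime of successful lookups, in seconds.
        System.out.println(describe("networkaddress.cache.ttl"));
        // Lifetime of failed lookups, in seconds.
        System.out.println(describe("networkaddress.cache.negative.ttl"));
    }
}
```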

Summary

Heinrich's Law states that behind every serious accident there are 29 minor accidents, 300 near misses, and 1,000 potential hazards. Therefore, every problem in the test environment deserves attention, so that it can be eliminated in its infancy.

Original link

https://my.oschina.net/alchemystar/blog/3222767
