Analyzing how a Dubbo service registers with Nacos

Posted by juhl on Wed, 15 Apr 2020 13:55:37 +0200

We previously described migrating the registry for our Dubbo services from Redis to Nacos. After the migration, we noticed that the error ERROR com.alibaba.nacos.client.naming - [CLIENT-BEAT] failed to send beat: was thrown from time to time. Tracing through the registration process to find the cause, we eventually discovered that the error came from a problem with our SLB network mapping and had nothing to do with Nacos itself.

  • Dubbo version: 2.7.4.1
  • Nacos client version: 1.0.0
  • Nacos server version: 1.1.3

Brief description of the process

  • Dubbo side: Dubbo registers its services with Nacos through its Nacos registry implementation and adds a heartbeat task that sends a health heartbeat every 5s. At the same time, it queries Nacos every 1s to check whether the service list has changed. If it has, a service instance update notification is triggered and Dubbo's local service list is updated.
  • Nacos side: when Nacos receives a heartbeat and the service instance does not yet exist, it creates a new instance. If the instance exists but is unhealthy, Nacos marks it healthy and actively pushes the new state to clients. Nacos also runs a task that checks service status: if no heartbeat has been reported for 15s, the instance is marked unhealthy; if none has been reported for 30s, the instance is taken offline and the change is pushed to clients.
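
The timing rules on the Nacos side can be illustrated with a small sketch. This is not Nacos code: the class and method names are hypothetical, only the 15s and 30s thresholds come from the description above, and whether the comparison is strict is an assumption.

```java
import java.time.Duration;

/** Hypothetical sketch of the server-side timing rules described above; not Nacos code. */
final class BeatHealthSketch {

    enum State { HEALTHY, UNHEALTHY, REMOVED }

    // Thresholds taken from the description above.
    static final Duration UNHEALTHY_AFTER = Duration.ofSeconds(15);
    static final Duration REMOVE_AFTER = Duration.ofSeconds(30);

    /** Classifies an instance by how long it has been silent since its last heartbeat. */
    static State classify(long lastBeatMillis, long nowMillis) {
        long silence = nowMillis - lastBeatMillis;
        if (silence > REMOVE_AFTER.toMillis()) {
            return State.REMOVED;      // taken offline, change pushed to clients
        }
        if (silence > UNHEALTHY_AFTER.toMillis()) {
            return State.UNHEALTHY;    // still listed, but marked unhealthy
        }
        return State.HEALTHY;
    }
}
```

With a 5s heartbeat, an instance has to miss roughly three consecutive beats before it is marked unhealthy. For example, BeatHealthSketch.classify(0, 20_000) yields UNHEALTHY.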

Source code analysis

Dubbo's registry package defines four interfaces for service registration. Every supported registry (ZooKeeper, Nacos, Redis, etcd, and so on) is an implementation of these interfaces:

  • NotifyListener: the listener interface for service change notifications. A registry implementation does not need to care how it is implemented; it only has to take the listener it is given and pass that instance along.
  • RegistryService: the interface that defines service registration, deregistration, subscription, unsubscription, and lookup. It is the core interface and contains the core functions a registry must implement.
  • Registry: wraps RegistryService and Node, adding methods to check availability and to destroy the registry when it goes offline. A registry is generally written against this interface.
  • RegistryFactory: the interface for obtaining a registry implementation from a registry URL. Dubbo's SPI design maps each concrete implementation to a registry protocol prefix. For example, the Nacos implementation corresponds to nacos://.

To integrate a new registry, you do not have to implement the Registry interface directly. You can extend the FailbackRegistry abstract class and implement the relevant do methods. Dubbo's abstraction for service registration matches that of Nacos very closely, so most of the Nacos interfaces can be used as they are. Only the definition of the subscription listener differs, and it needs just a little wrapping and conversion, so the implementation is very simple.
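
The idea of inheriting an abstract class and filling in only the do methods can be sketched as a template method. The classes below are simplified, hypothetical stand-ins written for illustration; they are not Dubbo's FailbackRegistry, and the retry bookkeeping is an assumption about what "failback" means here.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified, hypothetical stand-in for the failback idea; not Dubbo's FailbackRegistry. */
abstract class FailbackRegistrySketch {

    private final Set<String> failedRegistrations = ConcurrentHashMap.newKeySet();

    /** Template method: delegate to the concrete registry and remember failures for a later retry. */
    public final void register(String url) {
        try {
            doRegister(url);
            failedRegistrations.remove(url);
        } catch (RuntimeException e) {
            failedRegistrations.add(url);
        }
    }

    /** Retries every registration that failed earlier. */
    public final void retry() {
        for (String url : failedRegistrations) {
            register(url);
        }
    }

    public final int pendingRetries() {
        return failedRegistrations.size();
    }

    /** The only method a concrete registry has to supply in this sketch. */
    protected abstract void doRegister(String url);
}

/** Hypothetical registry whose first attempt fails and whose later attempts succeed. */
final class FlakyRegistrySketch extends FailbackRegistrySketch {

    private int attempts = 0;

    @Override
    protected void doRegister(String url) {
        if (attempts++ == 0) {
            throw new IllegalStateException("registry unreachable");
        }
    }
}
```

The concrete class contains only the registry-specific call, while failure handling lives in the base class. That division of labor is why a new registry integration can stay small.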

Service registration

org.apache.dubbo.registry.nacos.NacosRegistry:152

    @Override
    public void doRegister(URL url) {
        final String serviceName = getServiceName(url);
        final Instance instance = createInstance(url);
        execute(namingService -> namingService.registerInstance(serviceName, instance));
    }

In Dubbo, every service is encapsulated as a URL, which corresponds to a service Instance in Nacos. Registering a service therefore only requires converting the URL into an Instance and registering it with Nacos. Let's look at what the registration actually does in NamingService.
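
To make the URL-to-Instance conversion concrete, here is a minimal sketch. It is not Dubbo's createInstance method: InstanceSketch is a hypothetical stand-in for the Nacos Instance class, the example URL is made up, and storing the protocol, path, and query parameters as metadata is an assumption about what the conversion needs to preserve.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of turning a service URL into a registry instance. */
final class UrlToInstanceSketch {

    /** Minimal stand-in for a registry instance: an address plus free-form metadata. */
    static final class InstanceSketch {
        final String ip;
        final int port;
        final Map<String, String> metadata;

        InstanceSketch(String ip, int port, Map<String, String> metadata) {
            this.ip = ip;
            this.port = port;
            this.metadata = metadata;
        }
    }

    static InstanceSketch toInstance(String serviceUrl) {
        URI uri = URI.create(serviceUrl);
        Map<String, String> metadata = new LinkedHashMap<>();
        // Keep everything that is not the address as metadata so the URL can be rebuilt later.
        metadata.put("protocol", uri.getScheme());
        String path = uri.getPath() == null ? "" : uri.getPath();
        metadata.put("path", path.startsWith("/") ? path.substring(1) : path);
        String query = uri.getQuery();
        if (query != null) {
            for (String pair : query.split("&")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    metadata.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        return new InstanceSketch(uri.getHost(), uri.getPort(), metadata);
    }
}
```

Host and port become the instance address; everything else travels as metadata, so a subscriber can reconstruct the original URL from the instance it receives.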

com.alibaba.nacos.client.naming.NacosNamingService:283

    @Override
    public void registerInstance(String serviceName, String groupName, Instance instance) throws NacosException {
        if (instance.isEphemeral()) {
            BeatInfo beatInfo = new BeatInfo();
            beatInfo.setServiceName(NamingUtils.getGroupedName(serviceName, groupName));
            beatInfo.setIp(instance.getIp());
            beatInfo.setPort(instance.getPort());
            beatInfo.setCluster(instance.getClusterName());
            beatInfo.setWeight(instance.getWeight());
            beatInfo.setMetadata(instance.getMetadata());
            beatInfo.setScheduled(false);

            beatReactor.addBeatInfo(NamingUtils.getGroupedName(serviceName, groupName), beatInfo);
        }
        serverProxy.registerService(NamingUtils.getGroupedName(serviceName, groupName), groupName, instance);
    }

Besides registering the instance, the code above also checks whether the instance is ephemeral (temporary). If it is, the instance is added to the beatReactor heartbeat list. This is because Nacos divides services into two categories. The first is ephemeral services, such as Dubbo and Spring Cloud services, which must keep themselves alive with heartbeats; if heartbeats are not sent in time, the server automatically removes the instance. The second is persistent services, such as databases and caches, whose clients do not and cannot send heartbeats; the server probes these itself, for example through TCP port checks. Let's see how the heartbeat of an ephemeral instance is sent.

com.alibaba.nacos.client.naming.NacosNamingService:104

    private int initClientBeatThreadCount(Properties properties) {
        if (properties == null) {
            return UtilAndComs.DEFAULT_CLIENT_BEAT_THREAD_COUNT;
        }
        return NumberUtils.toInt(properties.getProperty(PropertyKeyConst.NAMING_CLIENT_BEAT_THREAD_COUNT),
            UtilAndComs.DEFAULT_CLIENT_BEAT_THREAD_COUNT);
    }
    // The number of heartbeat threads can be set with dubbo.registries.nacos.parameters.namingClientBeatThreadCount=10

First, look at the initialization code that determines the number of threads in the beatReactor heartbeat thread pool. The Properties passed in are the parameters configured on the Dubbo registry. If namingClientBeatThreadCount is configured, that value is used. Otherwise the default pool size is one thread on a single-core machine and half the number of CPU cores on a multi-core machine. Now for the heartbeat logic itself:
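
The default described above can be written out as a small sketch. The class and method names are hypothetical; only the property key namingClientBeatThreadCount and the "one thread, or half the cores" rule come from the text, and falling back to the default on an unparsable value is an assumption.

```java
import java.util.Properties;

/** Hypothetical sketch of resolving the heartbeat thread count; not the Nacos client code. */
final class BeatThreadCountSketch {

    static final String KEY = "namingClientBeatThreadCount";

    /** One thread on a single core, otherwise half the cores. */
    static int defaultThreadCount(int cores) {
        return cores > 1 ? cores / 2 : 1;
    }

    /** Uses the configured value when present and numeric, else the default. */
    static int resolve(Properties properties, int cores) {
        int fallback = defaultThreadCount(cores);
        if (properties == null) {
            return fallback;
        }
        String raw = properties.getProperty(KEY);
        if (raw == null) {
            return fallback;
        }
        try {
            return Integer.parseInt(raw);
        } catch (NumberFormatException e) {
            return fallback;
        }
    }
}
```

On an 8-core machine with no configuration, this resolves to 4 threads. In real code the core count would come from Runtime.getRuntime().availableProcessors().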

com.alibaba.nacos.client.naming.beat.BeatReactor:78

    class BeatProcessor implements Runnable {
        @Override
        public void run() {
            try {
                for (Map.Entry<String, BeatInfo> entry : dom2Beat.entrySet()) {
                    BeatInfo beatInfo = entry.getValue();
                    if (beatInfo.isScheduled()) {
                        continue;
                    }
                    beatInfo.setScheduled(true);
                    executorService.schedule(new BeatTask(beatInfo), 0, TimeUnit.MILLISECONDS);
                }
            } catch (Exception e) {
                NAMING_LOGGER.error("[CLIENT-BEAT] Exception while scheduling beat.", e);
            } finally {
                executorService.schedule(this, clientBeatInterval, TimeUnit.MILLISECONDS);
            }
        }
    }
    class BeatTask implements Runnable {
        BeatInfo beatInfo;
        public BeatTask(BeatInfo beatInfo) {
            this.beatInfo = beatInfo;
        }
        @Override
        public void run() {
            long result = serverProxy.sendBeat(beatInfo);
            beatInfo.setScheduled(false);
            if (result > 0) {
                clientBeatInterval = result;
            }
        }
    }

dom2Beat is the map that holds the ephemeral instances that need to report heartbeats. Entries are added to it by the ephemeral-instance check in NacosNamingService.registerInstance shown earlier. Once BeatReactor is initialized, the BeatProcessor task is started. BeatProcessor keeps rescheduling itself: after each round of heartbeat scheduling it schedules its next run, by default 5s later. The interval is held in the variable clientBeatInterval and can change depending on the value the Nacos server returns for a heartbeat. The server looks up the key preserved.heart.beat.interval in the instance metadata and returns its value; if the key is absent, it returns 5s.

Support for this setting is immature in Dubbo 2.7.4.1: it can only be specified through annotation attributes, such as @Reference(parameters = "preserved.heart.beat.interval,10000"). It will be more usable once the parameter can be configured directly on the registry URL. For now, treat it as experimental; using it is not recommended.
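
The self-rescheduling pattern with a server-controlled interval can be reduced to the following sketch. It is a simplification, not the Nacos BeatReactor: the class is hypothetical, it merges the processor and the task into one method, and the LongSupplier stands in for the call that sends a beat and returns the interval suggested by the server.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a self-rescheduling heartbeat loop. */
final class SelfSchedulingBeatSketch {

    private final ScheduledExecutorService executor =
        Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "beat-sketch");
            t.setDaemon(true);
            return t;
        });
    private final LongSupplier sendBeat;   // returns the interval suggested by the server, in ms
    private final AtomicInteger beatsSent = new AtomicInteger();
    private volatile long intervalMillis;

    SelfSchedulingBeatSketch(long initialIntervalMillis, LongSupplier sendBeat) {
        this.intervalMillis = initialIntervalMillis;
        this.sendBeat = sendBeat;
    }

    void start() {
        executor.schedule(this::beatOnce, 0, TimeUnit.MILLISECONDS);
    }

    private void beatOnce() {
        try {
            long suggested = sendBeat.getAsLong();
            beatsSent.incrementAndGet();
            if (suggested > 0) {
                intervalMillis = suggested;   // the server may stretch or shrink the interval
            }
        } finally {
            // Reschedule even when the beat failed, so one error does not stop the loop.
            if (!executor.isShutdown()) {
                executor.schedule(this::beatOnce, intervalMillis, TimeUnit.MILLISECONDS);
            }
        }
    }

    long currentIntervalMillis() {
        return intervalMillis;
    }

    int beatsSent() {
        return beatsSent.get();
    }

    void stop() {
        executor.shutdownNow();
    }
}
```

Rescheduling from a finally block, rather than using a fixed-rate schedule, is what lets the interval change between runs and keeps the loop alive after a failed beat.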

Service subscription

org.apache.dubbo.registry.nacos.NacosRegistry:399

    private void subscribeEventListener(String serviceName, final URL url, final NotifyListener listener)
            throws NacosException {
        if (!nacosListeners.containsKey(serviceName)) {
            EventListener eventListener = event -> {
                if (event instanceof NamingEvent) {
                    NamingEvent e = (NamingEvent) event;
                    notifySubscriber(url, listener, e.getInstances());
                }
            };
            namingService.subscribe(serviceName, eventListener);
            nacosListeners.put(serviceName, eventListener);
        }
    }

Nacos listens for service changes through EventListener, so Dubbo's subscription only needs to wrap the NotifyListener handling in an onEvent callback and register it with Nacos through namingService.subscribe. The EventListener object ends up in the listener list of the event dispatcher, as the following code shows:

com.alibaba.nacos.client.naming.core.EventDispatcher:

public class EventDispatcher {

    private ExecutorService executor = null;

    private BlockingQueue<ServiceInfo> changedServices = new LinkedBlockingQueue<ServiceInfo>();

    private ConcurrentMap<String, List<EventListener>> observerMap
        = new ConcurrentHashMap<String, List<EventListener>>();

    public EventDispatcher() {

        executor = Executors.newSingleThreadExecutor(new ThreadFactory() {
            @Override
            public Thread newThread(Runnable r) {
                Thread thread = new Thread(r, "com.alibaba.nacos.naming.client.listener");
                thread.setDaemon(true);

                return thread;
            }
        });

        executor.execute(new Notifier());
    }

    public void addListener(ServiceInfo serviceInfo, String clusters, EventListener listener) {

        NAMING_LOGGER.info("[LISTENER] adding " + serviceInfo.getName() + " with " + clusters + " to listener map");
        List<EventListener> observers = Collections.synchronizedList(new ArrayList<EventListener>());
        observers.add(listener);

        observers = observerMap.putIfAbsent(ServiceInfo.getKey(serviceInfo.getName(), clusters), observers);
        if (observers != null) {
            observers.add(listener);
        }

        serviceChanged(serviceInfo);
    }

    public void removeListener(String serviceName, String clusters, EventListener listener) {

        NAMING_LOGGER.info("[LISTENER] removing " + serviceName + " with " + clusters + " from listener map");

        List<EventListener> observers = observerMap.get(ServiceInfo.getKey(serviceName, clusters));
        if (observers != null) {
            Iterator<EventListener> iter = observers.iterator();
            while (iter.hasNext()) {
                EventListener oldListener = iter.next();
                if (oldListener.equals(listener)) {
                    iter.remove();
                }
            }
            if (observers.isEmpty()) {
                observerMap.remove(ServiceInfo.getKey(serviceName, clusters));
            }
        }
    }

    public List<ServiceInfo> getSubscribeServices() {
        List<ServiceInfo> serviceInfos = new ArrayList<ServiceInfo>();
        for (String key : observerMap.keySet()) {
            serviceInfos.add(ServiceInfo.fromKey(key));
        }
        return serviceInfos;
    }

    public void serviceChanged(ServiceInfo serviceInfo) {
        if (serviceInfo == null) {
            return;
        }

        changedServices.add(serviceInfo);
    }

    private class Notifier implements Runnable {
        @Override
        public void run() {
            while (true) {
                ServiceInfo serviceInfo = null;
                try {
                    serviceInfo = changedServices.poll(5, TimeUnit.MINUTES);
                } catch (Exception ignore) {
                }

                if (serviceInfo == null) {
                    continue;
                }

                try {
                    List<EventListener> listeners = observerMap.get(serviceInfo.getKey());

                    if (!CollectionUtils.isEmpty(listeners)) {
                        for (EventListener listener : listeners) {
                            List<Instance> hosts = Collections.unmodifiableList(serviceInfo.getHosts());
                            listener.onEvent(new NamingEvent(serviceInfo.getName(), hosts));
                        }
                    }

                } catch (Exception e) {
                    NAMING_LOGGER.error("[NA] notify error for service: "
                        + serviceInfo.getName() + ", clusters: " + serviceInfo.getClusters(), e);
                }
            }
        }
    }

    public void setExecutor(ExecutorService executor) {
        ExecutorService oldExecutor = this.executor;
        this.executor = executor;

        oldExecutor.shutdown();
    }
}

The event dispatcher maintains a listener map, observerMap, and a blocking queue of changed services, changedServices. When the dispatcher is initialized, it starts a thread that consumes the blocking queue. When a registered service changes, the changed data is put on the queue, which wakes the thread so that it notifies the listeners and the service list in Dubbo's memory is updated. As mentioned above, the Nacos client pulls the registered instances every 1s. When the pulled instances differ from those in local memory, an entry is put on the queue, as in:
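
Stripped of Nacos specifics, the queue-plus-notifier design looks roughly like the sketch below. The class is hypothetical and uses plain strings for service names; it also starts the thread from an explicit start() method rather than the constructor, which is a choice made for this sketch.

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Consumer;

/** Hypothetical sketch of a dispatcher that fans queued changes out to listeners. */
final class DispatcherSketch {

    private final BlockingQueue<String> changedServices = new LinkedBlockingQueue<>();
    private final Map<String, List<Consumer<String>>> observers = new ConcurrentHashMap<>();

    void addListener(String service, Consumer<String> listener) {
        observers.computeIfAbsent(service, k -> new CopyOnWriteArrayList<>()).add(listener);
    }

    /** Producers (a pull task or a push receiver) call this when a service changed. */
    void serviceChanged(String service) {
        changedServices.add(service);
    }

    void start() {
        Thread notifier = new Thread(this::drain, "dispatcher-sketch");
        notifier.setDaemon(true);
        notifier.start();
    }

    private void drain() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                String service = changedServices.take();   // blocks until a change arrives
                for (Consumer<String> listener
                        : observers.getOrDefault(service, Collections.emptyList())) {
                    listener.accept(service);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } catch (RuntimeException e) {
                // A failing listener must not kill the notifier thread.
            }
        }
    }
}
```

Because all notifications run on one thread, listeners are called in the order changes were queued, and a slow listener delays every notification behind it.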

com.alibaba.nacos.client.naming.core.HostReactor:296

    public class UpdateTask implements Runnable {
        long lastRefTime = Long.MAX_VALUE;
        private String clusters;
        private String serviceName;

        public UpdateTask(String serviceName, String clusters) {
            this.serviceName = serviceName;
            this.clusters = clusters;
        }
        @Override
        public void run() {
            try {
                ServiceInfo serviceObj = serviceInfoMap.get(ServiceInfo.getKey(serviceName, clusters));

                if (serviceObj == null) {
                    updateServiceNow(serviceName, clusters);
                    executor.schedule(this, DEFAULT_DELAY, TimeUnit.MILLISECONDS);
                    return;
                }

                if (serviceObj.getLastRefTime() <= lastRefTime) {
                    updateServiceNow(serviceName, clusters);
                    serviceObj = serviceInfoMap.get(ServiceInfo.getKey(serviceName, clusters));
                } else {
                    // if serviceName already updated by push, we should not override it
                    // since the push data may be different from pull through force push
                    refreshOnly(serviceName, clusters);
                }

                executor.schedule(this, serviceObj.getCacheMillis(), TimeUnit.MILLISECONDS);

                lastRefTime = serviceObj.getLastRefTime();
            } catch (Throwable e) {
                NAMING_LOGGER.warn("[NA] failed to update serviceName: " + serviceName, e);
            }

        }
    }

DEFAULT_DELAY is 1s. Nacos also actively pushes data change events. When a push arrives, the serviceObj in serviceInfoMap is updated, and the next pull by the Nacos client is then scheduled 10s later. The logic that compares the pulled data with the local list lives in the updateServiceNow method and is not covered here.
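
The comparison on lastRefTime in UpdateTask is easy to misread, so here it is isolated as a tiny sketch. The class, enum, and parameter names are hypothetical; only the comparison itself mirrors the code above.

```java
/** Hypothetical sketch of the pull-or-refresh decision made by UpdateTask. */
final class UpdateDecisionSketch {

    enum Action { PULL, REFRESH_ONLY }

    /**
     * cachedLastRefTime: refresh time stored with the cached service data.
     * taskLastRefTime:   refresh time this task saw on its previous run.
     */
    static Action decide(long cachedLastRefTime, long taskLastRefTime) {
        // If the cache has not moved since the last run, nothing was pushed: pull from the server.
        // If it is newer, a push already updated it, so the pulled data must not overwrite it.
        return cachedLastRefTime <= taskLastRefTime ? Action.PULL : Action.REFRESH_ONLY;
    }
}
```

Since the task starts with lastRefTime set to Long.MAX_VALUE, its first run always pulls.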

Epilogue

Registering Dubbo services with Nacos and subscribing to them is a fairly complex process. Reading the source code with specific questions in mind gets you twice the result for half the effort. In my case, I first wanted to find the cause of the abnormal Nacos heartbeat, and then became curious about how Nacos implements event listening; both became clear through this analysis. Of course, to analyze how Dubbo registers services with Nacos you also need to understand the processing logic on the Nacos server. Two core server classes, ClientBeatCheckTask and ClientBeatProcessor, contain the logic for heartbeat processing, health checks, and event push. If you are interested, they are worth a look.

About the author:

Chen Kailing joined Kaijing Technology in May 2016 and is currently the architecture group manager and firefighting team leader at the Kaijing R&D center. He writes the independent blog KL Blog (http://www.kailing.pub).
