Principle analysis of the Eureka heartbeat mechanism and self-preservation mechanism

Posted by misterguru on Mon, 18 Oct 2021 07:48:43 +0200

Eureka heartbeat mechanism:

After an application starts, its instances send heartbeats to the Eureka Server every 30 seconds by default. If the Eureka Server does not receive a heartbeat from an instance for several heartbeat cycles (90 seconds by default), it removes that service instance from the registry.
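
For reference, these two intervals correspond to the standard Spring Cloud client-side properties shown below (a minimal sketch; the values shown are the defaults):

# application.properties (Eureka client)
# Heartbeat (renewal) interval
eureka.instance.lease-renewal-interval-in-seconds=30
# How long the server tolerates a missing heartbeat before evicting the instance
eureka.instance.lease-expiration-duration-in-seconds=90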
 

Eureka self-preservation mechanism:

While running, the Eureka Server checks whether the percentage of successful heartbeats over the last 15 minutes has dropped below 85%. If it has, the Eureka Server assumes there is a network failure between itself and the clients' heartbeat connections, and it protects those instances so that their leases do not expire and the instances are not evicted. This covers the case where a Eureka client is running normally but its network connection to the Eureka Server is broken: the Eureka Server will not immediately remove the client's service.

The purpose is to prevent the Eureka Server from taking healthy services offline because of network instability or a network partition. The self-preservation mechanism makes a Eureka cluster more robust and stable. After entering the self-preservation state: the Eureka Server no longer removes expired services from the registry, even those that would normally be evicted for not sending heartbeats for a long time; the Eureka Server still accepts registration and query requests for new services, but does not synchronize them to other nodes, ensuring the current node remains available.

Self-preservation mode:

By default, if the Eureka Server does not receive a heartbeat from a microservice instance within a certain period (90 seconds by default), it deregisters the instance. However, during a network partition failure (delay, lag, congestion), the microservice and the Eureka Server cannot communicate normally, and the above behavior becomes dangerous: the microservice itself is actually healthy and should not be deregistered. Eureka solves this with "self-preservation mode": when a Eureka Server node loses too many clients in a short time (a network partition failure may have occurred), the node enters self-preservation mode.

Once in this mode, the Eureka Server protects the information in the service registry and does not delete its data (that is, it does not deregister any microservice). After the network fault recovers, the Eureka Server node automatically exits self-preservation mode. In short, self-preservation mode is a safety measure against network anomalies. Its architectural philosophy: it is better to keep all microservices at once (healthy and unhealthy alike) than to blindly deregister any healthy microservice. Self-preservation mode makes a Eureka cluster more robust and stable. In Spring Cloud, self-preservation can be disabled with eureka.server.enable-self-preservation=false.
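
For reference, the switch mentioned above goes in the server's configuration (disabling self-preservation is generally discouraged in production):

# application.properties (Eureka server)
eureka.server.enable-self-preservation=false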

Triggering the self-preservation prompt: shorten the wait time to 10 s through configuration, then start the Eureka Server; after about 10 s the self-preservation warning appears on the Eureka Server homepage, indicating that self-preservation has been activated.

# Set the Eureka Server's wait time when peer synchronization fails or returns empty (default: 5 minutes).
# During this period, the server does not provide registry information to clients.
eureka.server.wait-time-in-ms-when-sync-empty=10000

Important variables

There are two very important variables in Eureka's self-preservation mechanism, around which the whole mechanism is implemented. Both are defined in the AbstractInstanceRegistry class: expectedNumberOfClientsSendingRenews and numberOfRenewsPerMinThreshold.

**numberOfRenewsPerMinThreshold**

protected volatile int numberOfRenewsPerMinThreshold;

// The minimum number of renewals per minute: the threshold for the total number of client instance renewals the Eureka Server expects to receive each minute. If fewer renewals arrive, the self-preservation mechanism is triggered.

Its assignment method:

protected void updateRenewsPerMinThreshold() {
    this.numberOfRenewsPerMinThreshold = (int) ((double) this.expectedNumberOfClientsSendingRenews
            * (60.0D / (double) this.serverConfig.getExpectedClientRenewalIntervalSeconds())
            * this.serverConfig.getRenewalPercentThreshold());
}

getExpectedClientRenewalIntervalSeconds: the client's renewal interval, 30 s by default;

getRenewalPercentThreshold: the self-preservation renewal percentage threshold factor, 0.85 by default; that is, at least 85% of the expected renewals per minute must arrive.
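
Plugging the defaults into updateRenewsPerMinThreshold makes the formula concrete. Below is a minimal, self-contained sketch; the local variables mirror the fields and getters above, but this is illustrative code, not the Eureka source:

public class RenewThresholdExample {
    public static void main(String[] args) {
        // Assumptions: 10 registered clients, default 30 s renewal interval, default 0.85 factor
        int expectedNumberOfClientsSendingRenews = 10;
        int expectedClientRenewalIntervalSeconds = 30;
        double renewalPercentThreshold = 0.85;

        // Each client renews 60 / 30 = 2 times per minute; the server tolerates a 15% shortfall
        int numberOfRenewsPerMinThreshold = (int) (expectedNumberOfClientsSendingRenews
                * (60.0 / expectedClientRenewalIntervalSeconds)
                * renewalPercentThreshold);

        // Prints 17: fewer than 17 renewals in a minute would trigger self-preservation
        System.out.println(numberOfRenewsPerMinThreshold);
    }
}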

Note that these two variables are updated dynamically; there are four places in the code where their values change, covered in the sections below.

# Initialization of the Eureka Server

The initEurekaServerContext method in EurekaBootstrap class initializes Eureka server:

protected void initEurekaServerContext() throws Exception {
    EurekaServerConfig eurekaServerConfig = new DefaultEurekaServerConfig();
    // ...
    registry.openForTraffic(applicationInfoManager, registryCount);
}

openForTraffic method in the PeerAwareInstanceRegistryImpl class:

public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    this.expectedNumberOfClientsSendingRenews = count; // initialization
    this.updateRenewsPerMinThreshold(); // updates numberOfRenewsPerMinThreshold
    logger.info("Got {} instances from neighboring DS node", count);
    logger.info("Renew threshold is: {}", this.numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && this.serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        this.primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    super.postInit();
}

# Active offline (deregistration) of a service

cancel method in PeerAwareInstanceRegistryImpl class:

When a service provider actively goes offline, the Eureka Server removes that provider's address, which also means the heartbeat renewal threshold must change. The update can therefore be seen in PeerAwareInstanceRegistryImpl.cancel.

Call path

PeerAwareInstanceRegistryImpl.cancel -> AbstractInstanceRegistry.cancel -> internalCancel

After a service goes offline, fewer clients need to send renewals, so the counters are adjusted here:

protected boolean internalCancel(String appName, String id, boolean isReplication) {
    // ....
    synchronized (lock) {
        if (this.expectedNumberOfClientsSendingRenews > 0) {
            // Since the client wants to cancel it, reduce the number of clients to send renews.
            this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews - 1;
            updateRenewsPerMinThreshold();
        }
    }
    // ...
}
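
Continuing the earlier numbers: if 10 clients were expected (threshold 17) and one deregisters, expectedNumberOfClientsSendingRenews drops to 9 and the recomputed threshold becomes (int)(9 * 2 * 0.85) = 15.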

# Registration of a service

register method in PeerAwareInstanceRegistryImpl class:

When a new service provider registers with the Eureka Server, the number of clients expected to send renewals must increase, so the adjustment happens in the register method:

register -> super.register (AbstractInstanceRegistry)

public void register(InstanceInfo info, boolean isReplication) {
    int leaseDuration = 90;
    if (info.getLeaseInfo() != null && info.getLeaseInfo().getDurationInSecs() > 0) {
        leaseDuration = info.getLeaseInfo().getDurationInSecs();
    }
    super.register(info, leaseDuration, isReplication);
    this.replicateToPeers(PeerAwareInstanceRegistryImpl.Action.Register, info.getAppName(), info.getId(), info, (InstanceStatus)null, isReplication);
}

register method in parent AbstractInstanceRegistry:

public void register(InstanceInfo registrant, int leaseDuration, boolean isReplication) {
    // ....
    // The lease does not exist and hence it is a new registration
    synchronized (lock) {
        if (this.expectedNumberOfClientsSendingRenews > 0) {
            // Since the client wants to register it, increase the number of clients sending renews
            this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews + 1;
            updateRenewsPerMinThreshold();
        }
    }
    // ...
}

scheduleRenewalThresholdUpdateTask method in the PeerAwareInstanceRegistryImpl class (periodic renewal-threshold update):

This task runs every 15 minutes by default. It recounts the registerable instances and updates the expected renewal threshold, but only if the new count exceeds 85% of the current expectation, or if self-preservation is disabled (see updateRenewalThreshold below).

Call path: DefaultEurekaServerContext -> initialize() method annotated with @PostConstruct -> init() -> scheduleRenewalThresholdUpdateTask()

private void scheduleRenewalThresholdUpdateTask() {
        this.timer.schedule(new TimerTask() {
            public void run() {
                PeerAwareInstanceRegistryImpl.this.updateRenewalThreshold();
            }
        }, (long)this.serverConfig.getRenewalThresholdUpdateIntervalMs(), (long)this.serverConfig.getRenewalThresholdUpdateIntervalMs());
    }
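
The schedule interval comes from getRenewalThresholdUpdateIntervalMs(), which defaults to 15 minutes. Assuming the standard Spring Cloud property name derived from EurekaServerConfigBean, it could be overridden like this:

# Assumed property name; default is 900000 ms (15 minutes)
eureka.server.renewal-threshold-update-interval-ms=900000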

 

private void updateRenewalThreshold() {
        try {
            Applications apps = this.eurekaClient.getApplications();
            int count = 0;
            Iterator var3 = apps.getRegisteredApplications().iterator();
            while(var3.hasNext()) {
                Application app = (Application)var3.next();
                Iterator var5 = app.getInstances().iterator();
                while(var5.hasNext()) {
                    InstanceInfo instance = (InstanceInfo)var5.next();
                    if (this.isRegisterable(instance)) {
                        ++count;
                    }
                }
            }
            // Update threshold only if the threshold is greater than the
            // current expected threshold or if self preservation is disabled.
            synchronized(this.lock) {
                if ((double)count > this.serverConfig.getRenewalPercentThreshold() * (double)this.expectedNumberOfClientsSendingRenews || !this.isSelfPreservationModeEnabled()) {
                    this.expectedNumberOfClientsSendingRenews = count;
                    this.updateRenewsPerMinThreshold();
                }
            }

            logger.info("Current renewal threshold is : {}", this.numberOfRenewsPerMinThreshold);
        } catch (Throwable var9) {
            logger.error("Cannot update renewal threshold", var9);
        }
    }

# Self-preservation mechanism trigger task

In the postInit method of AbstractInstanceRegistry, an EvictionTask is scheduled; each run checks whether self-preservation should suppress eviction before removing expired leases:

protected void postInit() {
    this.renewsLastMin.start();
    if (this.evictionTaskRef.get() != null) {
        ((AbstractInstanceRegistry.EvictionTask)this.evictionTaskRef.get()).cancel();
    }
    this.evictionTaskRef.set(new AbstractInstanceRegistry.EvictionTask());
    this.evictionTimer.schedule((TimerTask)this.evictionTaskRef.get(), this.serverConfig.getEvictionIntervalTimerInMs(), this.serverConfig.getEvictionIntervalTimerInMs());
}
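
The task period comes from getEvictionIntervalTimerInMs(), which defaults to 60 seconds. Assuming the standard Spring Cloud property name, it could be tuned like this:

# Assumed property name; default is 60000 ms (60 seconds)
eureka.server.eviction-interval-timer-in-ms=60000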

EvictionTask is the task that is ultimately executed:

class EvictionTask extends TimerTask {
    private final AtomicLong lastExecutionNanosRef = new AtomicLong(0L);
    EvictionTask() {
    }
    public void run() {
        try {
            long compensationTimeMs = this.getCompensationTimeMs();
            AbstractInstanceRegistry.logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
            AbstractInstanceRegistry.this.evict(compensationTimeMs);
        } catch (Throwable var3) {
            AbstractInstanceRegistry.logger.error("Could not run the evict task", var3);
        }
    }
    long getCompensationTimeMs() {
        long currNanos = this.getCurrentTimeNano();
        long lastNanos = this.lastExecutionNanosRef.getAndSet(currNanos);
        if (lastNanos == 0L) {
            return 0L;
        } else {
            long elapsedMs = TimeUnit.NANOSECONDS.toMillis(currNanos - lastNanos);
            long compensationTime = elapsedMs - AbstractInstanceRegistry.this.serverConfig.getEvictionIntervalTimerInMs();
            return compensationTime <= 0L ? 0L : compensationTime;
        }
    }
    long getCurrentTimeNano() {
        return System.nanoTime();
    }
}
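
getCompensationTimeMs compensates for timer drift: for example (illustrative numbers), if 65 s elapsed since the previous run while the configured interval is 60 s, the task passes 5000 ms of extra lease time to evict, so instances are not unfairly expired after a GC pause or clock skew.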



The evict method:

public void evict(long additionalLeaseMs) {
    logger.debug("Running the evict task");
    // If self-preservation is in effect, lease expiration is disabled: return without evicting anything
    if (!isLeaseExpirationEnabled()) {
        logger.debug("DS: lease expiration is currently disabled.");
        return;
    }
    // ... the rest of the method performs the automatic offline (eviction) of expired services
}

isLeaseExpirationEnabled method: decides whether leases may expire. If self-preservation is disabled (it is enabled by default), eviction always proceeds. Otherwise, eviction is allowed only when the number of renewals received in the last minute is greater than numberOfRenewsPerMinThreshold; if it is not, the server is in self-preservation and skips eviction.

public boolean isLeaseExpirationEnabled() {
    if (!isSelfPreservationModeEnabled()) {
        // The self preservation mode is disabled, hence allowing the instances to expire.
        return true;
    }
    return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
}
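
Tying back to the earlier example: with a threshold of 17, if only 15 renewals arrived in the last minute, isLeaseExpirationEnabled() returns false, evict returns immediately, and no instance is removed; self-preservation is in effect.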

Topics: Java Big Data Microservices eureka microservice