Heartbeat renewal and self-protection mechanism of Spring Cloud Eureka source code analysis

Posted by Carline on Fri, 07 Jan 2022 09:36:02 +0100

How does Eureka server determine that a service is unavailable?

Eureka checks the health status of each service provider through heartbeat renewal.

Determining that a service is unavailable actually involves two pieces of logic.

  1. Eureka server needs to regularly check the health status of service providers.
  2. Eureka client needs to update its registration information regularly during operation.

Eureka's heartbeat renewal mechanism is shown in the figure below.

  1. When the client starts, it launches a heartbeat task that sends a heartbeat request to the server every 30s.
  2. The server maintains the last heartbeat time of each instance. Each heartbeat the server receives from a client updates that time.
  3. When the server starts, it starts a scheduled task. The task runs every 60s and checks whether more than 90s have passed since the last heartbeat of each instance. If so, the instance is considered expired and is evicted.
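
The server-side check in step 3 can be sketched roughly as follows. This is only an illustration, not Eureka's code: the class ExpiryScan, its method, and the map of last-heartbeat timestamps are hypothetical names, and the 90s timeout is passed in as a parameter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: find instances whose last heartbeat is older than the timeout.
final class ExpiryScan {
    private ExpiryScan() {}

    /**
     * @param lastHeartbeatMs instance id -> time of the last heartbeat (epoch millis)
     * @param nowMs           current time (epoch millis)
     * @param timeoutMs       lease timeout, e.g. 90_000 ms
     * @return ids of the instances considered expired, in sorted order
     */
    static List<String> expired(Map<String, Long> lastHeartbeatMs, long nowMs, long timeoutMs) {
        List<String> result = new ArrayList<>();
        // A TreeMap copy gives a deterministic iteration order for the example.
        for (Map.Entry<String, Long> e : new TreeMap<>(lastHeartbeatMs).entrySet()) {
            if (nowMs - e.getValue() > timeoutMs) {
                result.add(e.getKey());
            }
        }
        return result;
    }
}
```

With a scan at 100s, an instance last heard from at 0s is reported as expired, while one heard from at 50s is kept.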

The intervals above can be changed through the following configuration:

# How long the Server waits for the next heartbeat after receiving the Client's last one. If no heartbeat arrives within this time, the Instance is removed.
eureka.instance.lease-expiration-duration-in-seconds=90
# The interval at which the Server cleans up expired instances, in milliseconds. The default is 60000 milliseconds, or 60 seconds.
eureka.server.eviction-interval-timer-in-ms=60000

Client heartbeat initiation process

Heartbeat renewal is initiated by the client and executed every 30s.

DiscoveryClient.initScheduledTasks

Back in the DiscoveryClient.initScheduledTasks method:

private void initScheduledTasks() {
    //Omit
    heartbeatTask = new TimedSupervisorTask(
        "heartbeat",
        scheduler,
        heartbeatExecutor,
        renewalIntervalInSecs,
        TimeUnit.SECONDS,
        expBackOffBound,
        new HeartbeatThread()
    );
    scheduler.schedule(
        heartbeatTask,
        renewalIntervalInSecs, TimeUnit.SECONDS);
    //Omit
}

renewalIntervalInSecs defaults to 30, so the heartbeat task runs every 30s by default.
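
The expBackOffBound argument suggests that the task backs off when a heartbeat times out. A minimal sketch of that idea, assuming the delay doubles after each timeout, is capped at the base interval times the bound, and resets after a successful run. BackoffDelay and its members are hypothetical names and this is not the TimedSupervisorTask source.

```java
// Hypothetical sketch of a self-rescheduling delay with bounded exponential backoff.
final class BackoffDelay {
    private final long baseDelayMs;
    private final long maxDelayMs;
    private long currentDelayMs;

    BackoffDelay(long baseDelayMs, int backoffBound) {
        this.baseDelayMs = baseDelayMs;
        this.maxDelayMs = baseDelayMs * backoffBound; // upper limit for the delay
        this.currentDelayMs = baseDelayMs;
    }

    /** Returns the delay before the next run, given whether the last run timed out. */
    long next(boolean timedOut) {
        if (timedOut) {
            // Double the delay, but never exceed the cap.
            currentDelayMs = Math.min(maxDelayMs, currentDelayMs * 2);
        } else {
            // A successful run resets the delay to the base interval.
            currentDelayMs = baseDelayMs;
        }
        return currentDelayMs;
    }
}
```

With a 30s base interval and a bound of 10, repeated timeouts give 60s, 120s, 240s and then stay at 300s until a run succeeds.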

HeartbeatThread

The implementation of this thread is very simple: it calls renew() to renew the lease, and if the renewal succeeds, it updates the timestamp of the last successful heartbeat.

private class HeartbeatThread implements Runnable {

    public void run() {
        if (renew()) {
            lastSuccessfulHeartbeatTimestamp = System.currentTimeMillis();
        }
    }
}

The renew() method sends the heartbeat to the Eureka server at the address "apps/" + appName + "/" + id. If the server answers 404, the client registers again.

boolean renew() {
    EurekaHttpResponse<InstanceInfo> httpResponse;
    try {
        httpResponse = eurekaTransport.registrationClient.sendHeartBeat(instanceInfo.getAppName(), instanceInfo.getId(), instanceInfo, null);
        logger.debug(PREFIX + "{} - Heartbeat status: {}", appPathIdentifier, httpResponse.getStatusCode());
        if (httpResponse.getStatusCode() == Status.NOT_FOUND.getStatusCode()) {
            REREGISTER_COUNTER.increment();
            logger.info(PREFIX + "{} - Re-registering apps/{}", appPathIdentifier, instanceInfo.getAppName());
            long timestamp = instanceInfo.setIsDirtyWithTime();
            boolean success = register();
            if (success) {
                instanceInfo.unsetIsDirty(timestamp);
            }
            return success;
        }
        return httpResponse.getStatusCode() == Status.OK.getStatusCode();
    } catch (Throwable e) {
        logger.error(PREFIX + "{} - was unable to send heartbeat!", appPathIdentifier, e);
        return false;
    }
}
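
The decision made in renew() can be reduced to a small mapping from the HTTP status to the client's next step. The sketch below uses hypothetical names (HeartbeatOutcome, Action) and leaves out the actual HTTP call.

```java
// Hypothetical sketch: map a heartbeat response status to the client's next step.
final class HeartbeatOutcome {
    enum Action { RENEWED, RE_REGISTER, FAILED }

    private HeartbeatOutcome() {}

    static Action decide(int httpStatus) {
        if (httpStatus == 404) {
            // The server has no lease for this instance, so the client must register again.
            return Action.RE_REGISTER;
        }
        // Only 200 counts as a successful renewal; anything else is a failed heartbeat.
        return httpStatus == 200 ? Action.RENEWED : Action.FAILED;
    }
}
```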

Server-side heartbeat handling

On the server, the renewLease method of the InstanceResource class in the [com.netflix.eureka.resources] package handles the renewal. The code is as follows:

@PUT
public Response renewLease(
        @HeaderParam(PeerEurekaNode.HEADER_REPLICATION) String isReplication,
        @QueryParam("overriddenstatus") String overriddenStatus,
        @QueryParam("status") String status,
        @QueryParam("lastDirtyTimestamp") String lastDirtyTimestamp) {
    boolean isFromReplicaNode = "true".equals(isReplication);
    //Call renew to renew
    boolean isSuccess = registry.renew(app.getName(), id, isFromReplicaNode);

    // Not found in the registry, immediately ask for a register
    if (!isSuccess) { //If the renewal fails, return 404 so that the client registers again
        logger.warn("Not Found (Renew): {} - {}", app.getName(), id);
        return Response.status(Status.NOT_FOUND).build();
    }
    // Check if we need to sync based on dirty time stamp, the client
    // instance might have changed some value
    Response response;
    //Compare the client's lastDirtyTimestamp with the server's copy. If they differ, the client may need to register again
    if (lastDirtyTimestamp != null && serverConfig.shouldSyncWhenTimestampDiffers()) {
        response = this.validateDirtyTimestamp(Long.valueOf(lastDirtyTimestamp), isFromReplicaNode);
        // Store the overridden status since the validation found out the node that replicates wins
        if (response.getStatus() == Response.Status.NOT_FOUND.getStatusCode()
                && (overriddenStatus != null)
                && !(InstanceStatus.UNKNOWN.name().equals(overriddenStatus))
                && isFromReplicaNode) {
            registry.storeOverriddenStatusIfRequired(app.getAppName(), id, InstanceStatus.valueOf(overriddenStatus));
        }
    } else {
        response = Response.ok().build(); // If the contract is renewed successfully, 200 is returned
    }
    logger.debug("Found (Renew): {} - {}; reply status={}", app.getName(), id, response.getStatus());
    return response;
}

InstanceRegistry.renew

The renew implementation is shown below. It has two main steps:

  1. Find the instance matching the current request from the service registration list
  2. Publish EurekaInstanceRenewedEvent event
@Override
public boolean renew(final String appName, final String serverId,
                     boolean isReplication) {
    log("renew " + appName + " serverId " + serverId + ", isReplication {}"
        + isReplication);
    //Get all service registration information
    List<Application> applications = getSortedApplications();
    for (Application input : applications) { //Traverse one by one
        if (input.getName().equals(appName)) { //If the client currently renewed is the same as a service registration information node
            InstanceInfo instance = null;
            for (InstanceInfo info : input.getInstances()) { //Traverse all nodes under the service cluster, find a matching instance, and the instance returns.
                if (info.getId().equals(serverId)) {
                    instance = info; //
                    break;
                }
            }
            //Publish EurekaInstanceRenewedEvent event. This event is not handled in EurekaServer. We can listen to this event to do some things, such as monitoring.
            publishEvent(new EurekaInstanceRenewedEvent(this, appName, serverId,
                                                        instance, isReplication));
            break;
        }
    }
    return super.renew(appName, serverId, isReplication);
}
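
As a sketch of the monitoring idea mentioned in the comment above, the class below counts renewals per application. It is plain Java so that it stands alone: RenewalMonitor and its methods are hypothetical, and in a real Spring application onRenewed would be called from a listener for EurekaInstanceRenewedEvent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical stand-in for the body of an EurekaInstanceRenewedEvent listener.
final class RenewalMonitor {
    private final Map<String, LongAdder> renewalsByApp = new ConcurrentHashMap<>();

    // Record one renewal; serverId is accepted for symmetry with the event but not used here.
    void onRenewed(String appName, String serverId) {
        renewalsByApp.computeIfAbsent(appName, k -> new LongAdder()).increment();
    }

    // Number of renewals seen so far for the given application.
    long renewals(String appName) {
        LongAdder adder = renewalsByApp.get(appName);
        return adder == null ? 0L : adder.sum();
    }
}
```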

PeerAwareInstanceRegistryImpl.renew (super.renew)

public boolean renew(final String appName, final String id, final boolean isReplication) {
    if (super.renew(appName, id, isReplication)) { //Call the renewal method of the parent class. If the renewal is successful
        replicateToPeers(Action.Heartbeat, appName, id, null, null, isReplication); //Synchronize to all nodes in the cluster
        return true;
    }
    return false;
}

AbstractInstanceRegistry.renew

This method looks up the lease map for the application, finds the lease of the instance, and calls Lease.renew() to renew it.

public boolean renew(String appName, String id, boolean isReplication) {
    RENEW.increment(isReplication);
    Map<String, Lease<InstanceInfo>> gMap = registry.get(appName); //Get instance information according to service name
    Lease<InstanceInfo> leaseToRenew = null;
    if (gMap != null) { 
        leaseToRenew = gMap.get(id);  //Obtain the service instance that needs to be renewed,
    }
    if (leaseToRenew == null) { //If it is empty, it indicates that the service instance does not exist, and the renewal failure is returned directly
        RENEW_NOT_FOUND.increment(isReplication);
        logger.warn("DS: Registry: lease doesn't exist, registering resource: {} - {}", appName, id);
        return false;
    } else { //Indicates that the instance exists
        InstanceInfo instanceInfo = leaseToRenew.getHolder(); //Get the basic information of the instance
        if (instanceInfo != null) { //Instance basic information cannot be empty
            // touchASGCache(instanceInfo.getASGName());
            //Get the running status of the instance
            InstanceStatus overriddenInstanceStatus = this.getOverriddenInstanceStatus(
                    instanceInfo, leaseToRenew, isReplication);
            if (overriddenInstanceStatus == InstanceStatus.UNKNOWN) { //If the running status is unknown, the renewal failure is also returned
                logger.info("Instance status UNKNOWN possibly due to deleted override for instance {}"
                        + "; re-register required", instanceInfo.getId());
                RENEW_NOT_FOUND.increment(isReplication);
                return false;
            }
            //If the instance's current status differs from the overridden status, apply the overridden status
            if (!instanceInfo.getStatus().equals(overriddenInstanceStatus)) {
                logger.info(
                        "The instance status {} is different from overridden instance status {} for instance {}. "
                                + "Hence setting the status to overridden status", instanceInfo.getStatus().name(),
                                overriddenInstanceStatus.name(),
                                instanceInfo.getId());
                instanceInfo.setStatusWithoutDirty(overriddenInstanceStatus);

            }
        }
        //Update the number of renewals in the last minute
        renewsLastMin.increment();
        leaseToRenew.renew(); //Renewal
        return true;
    }
}
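
The two-level lookup used above (application name, then instance id) can be illustrated with a stripped-down registry. MiniRegistry is a hypothetical class that stores only a timestamp per lease. It shows why a missing lease makes renew return false, which the resource layer turns into a 404.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the lookup: app name -> instance id -> last renewal time.
final class MiniRegistry {
    private final Map<String, Map<String, Long>> registry = new ConcurrentHashMap<>();

    void register(String appName, String id, long nowMs) {
        registry.computeIfAbsent(appName, k -> new ConcurrentHashMap<>()).put(id, nowMs);
    }

    /** Returns false when no lease exists for the given app and instance id. */
    boolean renew(String appName, String id, long nowMs) {
        Map<String, Long> leases = registry.get(appName);
        if (leases == null || !leases.containsKey(id)) {
            return false; // unknown instance: the client has to register first
        }
        leases.put(id, nowMs);
        return true;
    }
}
```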

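Renewing a lease simply updates lastUpdateTimestamp, the time the server uses to decide when the lease expires. Note that the stored value is the current time plus the lease duration, not the raw time at which the heartbeat arrived.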

public void renew() {
    lastUpdateTimestamp = System.currentTimeMillis() + duration;

}
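
Because renew() already adds the duration, the effective lifetime of a lease depends on how expiry is checked. The sketch below assumes the check compares the current time with lastUpdateTimestamp + duration. That check is not shown in this article, so treat it as an assumption. LeaseSketch is a hypothetical class with the clock passed in explicitly.

```java
// Hypothetical sketch of a lease whose renew() stores "now + duration".
final class LeaseSketch {
    private final long durationMs;
    private long lastUpdateTimestamp;

    LeaseSketch(long durationMs, long nowMs) {
        this.durationMs = durationMs;
        this.lastUpdateTimestamp = nowMs;
    }

    // Same arithmetic as the renew() shown above, with the clock passed in.
    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs;
    }

    // Assumed expiry check: expired once "now" passes lastUpdateTimestamp + duration.
    boolean isExpired(long nowMs) {
        return nowMs > lastUpdateTimestamp + durationMs;
    }
}
```

Under that assumption, a lease with a 90s duration renewed at time 0 is still valid at 100s and only expires after 180s, twice the configured duration.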

Eureka's self-protection mechanism

The heartbeat mechanism on its own is not fully reliable. For example, a service provider may be healthy, but because of a network problem its heartbeats do not reach the server within 90s, which causes a healthy service to be evicted by mistake.

To avoid this problem, Eureka provides a self-protection mechanism. In short, once self-protection is in effect, Eureka Server protects the registered service instances from being evicted on expiration, which keeps the Eureka cluster more robust and stable.

After entering the self-protection state, the following situations will occur:

  • Eureka Server no longer removes from the registry expired services that would otherwise be removed because no heartbeat has been received for a long time. If a service provider goes offline abnormally during the protection period, service consumers will get an invalid service instance and the call will fail. Consumers therefore need fault-tolerance mechanisms such as retries and circuit breakers.
  • Eureka Server still accepts registration and query requests for new services, but does not synchronize them to other nodes, so that the current node remains available.

The self-protection mechanism is controlled by the eureka.server.enable-self-preservation property: [true] enables it and [false] disables it. It is enabled by default, and it is recommended to keep it enabled in production.

How can the self-protection mechanism tell communication delays caused by a "network abnormality" apart from real service downtime?

Eureka does this: if the number of heartbeats received per minute falls below 85% of the expected number, Eureka Server assumes that there is a network failure between the clients and the registry, and it automatically enters the self-protection state.

The threshold of 85% can be set through the following configuration

# Self protection renewal percentage, default is 0.85
eureka.server.renewal-percent-threshold=0.85

But this raises another question: 85% of what? The baseline is the expected number of renewals per minute, which is calculated as follows:

//Self protection threshold = total number of instances * renewals per minute (60s / client renewal interval) * self-protection renewal percentage threshold

Assuming that there are 100 service instances, the renewal interval is 30s, and the self-protection percentage is 0.85, the expected number of renewals per minute is:

Self protection threshold = 100 * (60 / 30) * 0.85 = 170
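
The same arithmetic as a small Java helper, as a sketch only. RenewThreshold and perMinute are illustrative names, and the cast to int truncates any fractional part.

```java
// Sketch of the threshold formula; class and method names are illustrative.
final class RenewThreshold {
    private RenewThreshold() {}

    static int perMinute(int expectedClients, int renewalIntervalSeconds, double percentThreshold) {
        // expected clients * renewals per client per minute * percentage, truncated to an int
        return (int) (expectedClients * (60.0 / renewalIntervalSeconds) * percentThreshold);
    }
}
```

For example, RenewThreshold.perMinute(100, 30, 0.85) gives 170, and a single instance gives 1 because 1.7 is truncated.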

Setting the renewal threshold

In the contextInitialized method of the EurekaServerBootstrap class, initEurekaServerContext is called for initialization:

public void contextInitialized(ServletContext context) {
    try {
        initEurekaEnvironment();
        initEurekaServerContext();

        context.setAttribute(EurekaServerContext.class.getName(), this.serverContext);
    }
    catch (Throwable e) {
        log.error("Cannot bootstrap eureka server :", e);
        throw new RuntimeException("Cannot bootstrap eureka server :", e);
    }
}

Next, look at initEurekaServerContext:

protected void initEurekaServerContext() throws Exception {
    EurekaServerConfig eurekaServerConfig = new DefaultEurekaServerConfig();
    //...
    registry.openForTraffic(applicationInfoManager, registryCount);
}

In the openForTraffic method, expectedNumberOfClientsSendingRenews is initialized. This value is the number of clients expected to send renewals, and it depends on the number of instances registered on the Eureka server.

@Override
public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    this.expectedNumberOfClientsSendingRenews = count; //The initial value is 1
    updateRenewsPerMinThreshold();
    logger.info("Got {} instances from neighboring DS node", count);
    logger.info("Renew threshold is: {}", numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    super.postInit();
}

updateRenewsPerMinThreshold

openForTraffic then calls updateRenewsPerMinThreshold to update the minimum number of renewals per minute, that is, the total number of renewals that Eureka Server expects to receive from client instances each minute. If the number actually received is below this threshold, the self-protection mechanism is triggered.

protected void updateRenewsPerMinThreshold() {
    this.numberOfRenewsPerMinThreshold = (int) (this.expectedNumberOfClientsSendingRenews
            * (60.0 / serverConfig.getExpectedClientRenewalIntervalSeconds())
            * serverConfig.getRenewalPercentThreshold());
}
//Self protection threshold = total number of instances * renewals per minute (60s / client renewal interval) * self-protection renewal percentage threshold
  • getExpectedClientRenewalIntervalSeconds: the renewal interval of the client, 30s by default.
  • getRenewalPercentThreshold: the self-protection renewal percentage threshold, 0.85 by default. In other words, the number of renewals received per minute should exceed 85% of the expected number.

Trigger mechanism of expected value change

expectedNumberOfClientsSendingRenews and numberOfRenewsPerMinThreshold change whenever a new service registers or a service goes offline.

PeerAwareInstanceRegistryImpl.cancel

When a service provider actively goes offline, Eureka server removes its address, which also means that the heartbeat renewal threshold must change. The update can be seen in PeerAwareInstanceRegistryImpl.cancel.

Call path: PeerAwareInstanceRegistryImpl.cancel -> AbstractInstanceRegistry.cancel -> internalCancel

After a service goes offline, the number of clients expected to send renewals decreases, so it is adjusted here:

protected boolean internalCancel(String appName, String id, boolean isReplication) {
  //....
    synchronized (lock) {
        if (this.expectedNumberOfClientsSendingRenews > 0) {
            // Since the client wants to cancel it, reduce the number of clients to send renews.
            this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews - 1;
            updateRenewsPerMinThreshold();
        }
    }
}

PeerAwareInstanceRegistryImpl.register

When a new service provider registers with Eureka server, the number of clients expected to send renewals increases, so the register method handles it.

Call path: PeerAwareInstanceRegistryImpl.register -> super.register (AbstractInstanceRegistry)

public void register(InstanceInfo registrant, int leaseDuration, boolean isReplication) {
    //....    
    // The lease does not exist and hence it is a new registration
    synchronized (lock) {
        if (this.expectedNumberOfClientsSendingRenews > 0) {
            // Since the client wants to register it, increase the number of clients sending renews
            this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews + 1;
            updateRenewsPerMinThreshold();
        }
    }
}
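
The bookkeeping done in internalCancel and register can be condensed into one small class. This is a hedged sketch: ExpectedClientsTracker is a hypothetical name, and the 30s interval and 0.85 percentage are hard-coded defaults rather than values read from serverConfig.

```java
// Hypothetical sketch of how register/cancel adjust the expected client count.
final class ExpectedClientsTracker {
    private final Object lock = new Object();
    private int expectedClients;
    private int renewsPerMinThreshold;

    ExpectedClientsTracker(int initialClients) {
        this.expectedClients = initialClients;
        recompute();
    }

    void onRegister() {
        synchronized (lock) {
            // Same guard as in the excerpts: only adjust while the count is positive.
            if (expectedClients > 0) {
                expectedClients++;
                recompute();
            }
        }
    }

    void onCancel() {
        synchronized (lock) {
            if (expectedClients > 0) {
                expectedClients--;
                recompute();
            }
        }
    }

    int threshold() {
        synchronized (lock) {
            return renewsPerMinThreshold;
        }
    }

    // Assumes the default 30s renewal interval and 0.85 percentage.
    private void recompute() {
        renewsPerMinThreshold = (int) (expectedClients * (60.0 / 30) * 0.85);
    }
}
```

Starting from 2 expected clients the threshold is 3. One registration raises it to 5, and one cancellation brings it back to 3.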

Refresh self-protection threshold every 15 minutes

PeerAwareInstanceRegistryImpl.scheduleRenewalThresholdUpdateTask

scheduleRenewalThresholdUpdateTask schedules updateRenewalThreshold, which recalculates the self-protection threshold every 15 minutes.

private void updateRenewalThreshold() {
    try {
        // 1. Calculate the number of application instances
        Applications apps = eurekaClient.getApplications();
        int count = 0;
        for (Application app : apps.getRegisteredApplications()) {
            for (InstanceInfo instance : app.getInstances()) {
                if (this.isRegisterable(instance)) {
                    ++count;
                }
            }
        }
        
        synchronized (lock) {
            // Update threshold only if the threshold is greater than the
            // current expected threshold or if self preservation is disabled.
            //Recalculate expectedNumberOfClientsSendingRenews and numberOfRenewsPerMinThreshold only when count exceeds renewalPercentThreshold * expectedNumberOfClientsSendingRenews, or when the self-protection mechanism is disabled
            if ((count) > (serverConfig.getRenewalPercentThreshold() * expectedNumberOfClientsSendingRenews)
                || (!this.isSelfPreservationModeEnabled())) {
                this.expectedNumberOfClientsSendingRenews = count;
                updateRenewsPerMinThreshold();
            }
        }
        logger.info("Current renewal threshold is : {}", numberOfRenewsPerMinThreshold);
    } catch (Throwable e) {
        logger.error("Cannot update renewal threshold", e);
    }
}
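
The guard inside the synchronized block is what stops the threshold from being lowered during a sudden mass drop in instances. A minimal sketch of just that condition, where ThresholdRefresh and shouldUpdate are hypothetical names:

```java
// Hypothetical sketch of the condition that guards the periodic threshold refresh.
final class ThresholdRefresh {
    private ThresholdRefresh() {}

    static boolean shouldUpdate(int count, int expectedClients,
                                double percentThreshold, boolean selfPreservationEnabled) {
        // Refresh only if enough instances are still present, or if self-protection is off.
        return count > percentThreshold * expectedClients || !selfPreservationEnabled;
    }
}
```

With 100 expected clients and a percentage of 0.85, a count of 90 refreshes the threshold, while a count of 80 leaves the old (higher) threshold in place unless self-protection is disabled.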

Trigger of self-protection mechanism

In the postInit method of AbstractInstanceRegistry, an EvictionTask is started. Each run first checks whether self-protection is in effect before evicting anything.

postInit is called at the end of openForTraffic, so it also runs while EurekaServerBootstrap starts the server.

protected void postInit() {
    renewsLastMin.start(); //Start a scheduled task that counts renewals per minute and resets the counter every 60s
    if (evictionTaskRef.get() != null) {
        evictionTaskRef.get().cancel();
    }
    evictionTaskRef.set(new EvictionTask()); //Create the EvictionTask, which is scheduled below to run every 60s by default
    evictionTimer.schedule(evictionTaskRef.get(),
                           serverConfig.getEvictionIntervalTimerInMs(),
                           serverConfig.getEvictionIntervalTimerInMs());
}

The code of EvictionTask is as follows.

private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);

@Override
public void run() {
    try {
        //Get compensation time milliseconds
        long compensationTimeMs = getCompensationTimeMs();
        logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
        evict(compensationTimeMs);
    } catch (Throwable e) {
        logger.error("Could not run the evict task", e);
    }
}
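
getCompensationTimeMs is not shown in this article. One plausible way to compute such a value, assuming it measures how much later than the configured interval the task actually ran, is sketched below. CompensationClock is a hypothetical class and the current time is passed in instead of being read from the system clock.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a compensation-time calculation for a delayed timer task.
final class CompensationClock {
    private final AtomicLong lastExecutionNanos = new AtomicLong(0L);
    private final long intervalMs;

    CompensationClock(long intervalMs) {
        this.intervalMs = intervalMs;
    }

    /** Returns how many milliseconds later than the configured interval this run started. */
    long compensationMs(long nowNanos) {
        long last = lastExecutionNanos.getAndSet(nowNanos);
        if (last == 0L) {
            return 0L; // first run: nothing to compensate for
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - last);
        // Only a positive delay counts; running on time or early gives 0.
        return Math.max(0L, elapsedMs - intervalMs);
    }
}
```

If the interval is 60s and a run starts 65s after the previous one, the compensation is 5000 ms, which evict could add to the lease timeout so that instances are not punished for the server's own delay.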

evict method

public void evict(long additionalLeaseMs) {
    logger.debug("Running the evict task");
    // Check whether self-protection is in effect. If it is, return immediately without evicting anything
    if (!isLeaseExpirationEnabled()) {
        logger.debug("DS: lease expiration is currently disabled.");
        return;
    }

    //The rest of the method takes expired service instances offline automatically.
}

isLeaseExpirationEnabled

  • Check whether the self-protection mechanism is enabled (it is by default). If it is disabled, lease expiration is always allowed.
  • Otherwise, decide whether self-protection should take effect by checking whether the number of renewals received in the last minute is greater than numberOfRenewsPerMinThreshold.
public boolean isLeaseExpirationEnabled() {
    if (!isSelfPreservationModeEnabled()) {
        // The self preservation mode is disabled, hence allowing the instances to expire.
        return true;
    }
    return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
}
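
To make the three outcomes concrete, here is the same decision as a pure function with the inputs passed in explicitly. SelfPreservationCheck is a hypothetical name and the numbers in the note below reuse the 170 threshold from the earlier example.

```java
// Hypothetical sketch of the lease-expiration decision as a pure function.
final class SelfPreservationCheck {
    private SelfPreservationCheck() {}

    static boolean leaseExpirationEnabled(boolean selfPreservationEnabled,
                                          int renewsPerMinThreshold, long renewsLastMin) {
        if (!selfPreservationEnabled) {
            return true; // self-protection is switched off: always allow eviction
        }
        // Eviction is allowed only while renewals stay above the threshold.
        return renewsPerMinThreshold > 0 && renewsLastMin > renewsPerMinThreshold;
    }
}
```

With a threshold of 170, receiving 200 renewals in the last minute allows eviction, while receiving 150 blocks it, which means the server is in self-protection.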

Copyright notice: unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. When reprinting, please credit "Mic to take you to learn architecture".
