How does Eureka server determine that a service is unavailable?
Eureka checks the health status of each service provider through heartbeat renewal.
Determining that a service is unavailable involves two pieces of logic:
- Eureka server needs to regularly check the health status of service providers.
- Eureka client needs to update its registration information regularly during operation.
Eureka's heartbeat renewal mechanism is shown in the figure below.
- When the client starts, it starts a heartbeat task that sends a heartbeat request to the Eureka server every 30s.
- The server maintains the last heartbeat time of each instance. Each time a client heartbeat arrives, that time is updated.
- When the server starts, it starts a scheduled task. The task runs every 60s and checks whether the last heartbeat of each instance is more than 90s old. If it is, the instance is considered expired and is evicted.
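The three rules above can be modelled in a few lines. This is a simplified illustration rather than Eureka's code: `HeartbeatRegistrySketch`, `heartbeat`, and `evictExpired` are hypothetical names, and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Simplified model of the heartbeat rules described above (hypothetical, not Eureka source).
class HeartbeatRegistrySketch {
    // An instance expires after 90s without a heartbeat.
    static final long LEASE_DURATION_MS = 90_000;

    // instanceId -> time of the last heartbeat, in milliseconds
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    // Called whenever a client heartbeat arrives.
    void heartbeat(String instanceId, long nowMs) {
        lastHeartbeat.put(instanceId, nowMs);
    }

    // Called by the periodic sweep (every 60s in the description above).
    // Removes expired instances and returns their ids.
    List<String> evictExpired(long nowMs) {
        List<String> evicted = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = lastHeartbeat.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMs - entry.getValue() > LEASE_DURATION_MS) {
                evicted.add(entry.getKey());
                it.remove();
            }
        }
        return evicted;
    }

    boolean isRegistered(String instanceId) {
        return lastHeartbeat.containsKey(instanceId);
    }
}
```

Because the sweep only runs periodically, an instance that stops sending heartbeats can remain in this model's registry for up to 90s plus one sweep interval before it is removed.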
The intervals involved in this process can be changed through the following configuration:
# How long the server waits for the next heartbeat after receiving the client's last one. If no heartbeat arrives within this time, the instance is removed.
eureka.instance.lease-expiration-duration-in-seconds=90
# Interval at which the server cleans up expired instances. The default is 60000 milliseconds, or 60 seconds.
eureka.server.eviction-interval-timer-in-ms=60000
Client heartbeat initiation process
Heartbeat renewal is initiated by the client and executed every 30s.
DiscoveryClient.initScheduledTasks
Back in the DiscoveryClient.initScheduledTasks method:
private void initScheduledTasks() {
    // ... omitted ...
    heartbeatTask = new TimedSupervisorTask(
            "heartbeat",
            scheduler,
            heartbeatExecutor,
            renewalIntervalInSecs,
            TimeUnit.SECONDS,
            expBackOffBound,
            new HeartbeatThread()
    );
    scheduler.schedule(
            heartbeatTask,
            renewalIntervalInSecs, TimeUnit.SECONDS);
    // ... omitted ...
}
renewalIntervalInSecs defaults to 30, so the heartbeat task runs every 30s by default.
HeartbeatThread
The implementation of this thread is very simple. It calls renew() to renew the lease. If the renewal succeeds, the timestamp of the last successful heartbeat is updated.
private class HeartbeatThread implements Runnable {
    public void run() {
        if (renew()) {
            lastSuccessfulHeartbeatTimestamp = System.currentTimeMillis();
        }
    }
}
The renew() method sends the heartbeat to the Eureka server at the path "apps/" + appName + "/" + id. If the server answers 404 (NOT_FOUND), the client registers again.
boolean renew() {
    EurekaHttpResponse<InstanceInfo> httpResponse;
    try {
        httpResponse = eurekaTransport.registrationClient.sendHeartBeat(
                instanceInfo.getAppName(), instanceInfo.getId(), instanceInfo, null);
        logger.debug(PREFIX + "{} - Heartbeat status: {}", appPathIdentifier, httpResponse.getStatusCode());
        if (httpResponse.getStatusCode() == Status.NOT_FOUND.getStatusCode()) {
            // The server does not know this instance, so register again
            REREGISTER_COUNTER.increment();
            logger.info(PREFIX + "{} - Re-registering apps/{}", appPathIdentifier, instanceInfo.getAppName());
            long timestamp = instanceInfo.setIsDirtyWithTime();
            boolean success = register();
            if (success) {
                instanceInfo.unsetIsDirty(timestamp);
            }
            return success;
        }
        return httpResponse.getStatusCode() == Status.OK.getStatusCode();
    } catch (Throwable e) {
        logger.error(PREFIX + "{} - was unable to send heartbeat!", appPathIdentifier, e);
        return false;
    }
}
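The decision logic in renew() is easier to see without the logging and bookkeeping. The sketch below reduces it to three outcomes. `HeartbeatTransportSketch` is a hypothetical stand-in for the HTTP transport and is not part of Eureka's API.

```java
// Hypothetical stand-in for the HTTP transport; not part of Eureka's API.
interface HeartbeatTransportSketch {
    int sendHeartbeat();  // returns an HTTP status code
    boolean register();   // returns true if registration succeeded
}

class RenewSketch {
    // Mirrors the outcomes of renew(): 404 -> re-register, 200 -> success, anything else -> failure.
    static boolean renew(HeartbeatTransportSketch transport) {
        try {
            int status = transport.sendHeartbeat();
            if (status == 404) {
                // The server no longer knows this instance, so register again.
                return transport.register();
            }
            return status == 200;
        } catch (RuntimeException e) {
            // Any transport failure counts as a failed heartbeat.
            return false;
        }
    }
}
```

The point of the 404 branch is that the client recovers on its own: if the server evicted the instance, for example after a network interruption, the next heartbeat re-registers it.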
Server-side heartbeat processing
On the server, the heartbeat is handled by the renewLease method of the InstanceResource class in the com.netflix.eureka.resources package. The code is as follows:
@PUT
public Response renewLease(
        @HeaderParam(PeerEurekaNode.HEADER_REPLICATION) String isReplication,
        @QueryParam("overriddenstatus") String overriddenStatus,
        @QueryParam("status") String status,
        @QueryParam("lastDirtyTimestamp") String lastDirtyTimestamp) {
    boolean isFromReplicaNode = "true".equals(isReplication);
    // Call renew to renew the lease
    boolean isSuccess = registry.renew(app.getName(), id, isFromReplicaNode);
    // Not found in the registry, immediately ask for a register
    if (!isSuccess) {
        // The renewal failed, so 404 is returned and the client registers again
        logger.warn("Not Found (Renew): {} - {}", app.getName(), id);
        return Response.status(Status.NOT_FOUND).build();
    }
    // Check if we need to sync based on dirty time stamp, the client
    // instance might have changed some value
    Response response;
    // Compare the client's lastDirtyTimestamp with the server's copy. If they disagree, registration may need to be re-initiated
    if (lastDirtyTimestamp != null && serverConfig.shouldSyncWhenTimestampDiffers()) {
        response = this.validateDirtyTimestamp(Long.valueOf(lastDirtyTimestamp), isFromReplicaNode);
        // Store the overridden status since the validation found out the node that replicates wins
        if (response.getStatus() == Response.Status.NOT_FOUND.getStatusCode()
                && (overriddenStatus != null)
                && !(InstanceStatus.UNKNOWN.name().equals(overriddenStatus))
                && isFromReplicaNode) {
            registry.storeOverriddenStatusIfRequired(app.getAppName(), id, InstanceStatus.valueOf(overriddenStatus));
        }
    } else {
        // The lease was renewed successfully, so 200 is returned
        response = Response.ok().build();
    }
    logger.debug("Found (Renew): {} - {}; reply status={}", app.getName(), id, response.getStatus());
    return response;
}
InstanceRegistry.renew
The renew method is implemented as follows. It does two main things before delegating to the parent class:
- Find the instance matching the current request from the service registration list
- Publish EurekaInstanceRenewedEvent event
@Override
public boolean renew(final String appName, final String serverId, boolean isReplication) {
    log("renew " + appName + " serverId " + serverId + ", isReplication {}" + isReplication);
    // Get all service registration information
    List<Application> applications = getSortedApplications();
    // Traverse the applications one by one
    for (Application input : applications) {
        // The renewing client belongs to this application
        if (input.getName().equals(appName)) {
            InstanceInfo instance = null;
            // Traverse all instances of the application and find the matching one
            for (InstanceInfo info : input.getInstances()) {
                if (info.getId().equals(serverId)) {
                    instance = info;
                    break;
                }
            }
            // Publish EurekaInstanceRenewedEvent. Eureka Server itself does not handle this event.
            // We can listen for it to do things such as monitoring.
            publishEvent(new EurekaInstanceRenewedEvent(this, appName, serverId, instance, isReplication));
            break;
        }
    }
    return super.renew(appName, serverId, isReplication);
}
super.renew
public boolean renew(final String appName, final String id, final boolean isReplication) {
    // Call the renewal method of the parent class
    if (super.renew(appName, id, isReplication)) {
        // The renewal succeeded, so synchronize the heartbeat to the other nodes in the cluster
        replicateToPeers(Action.Heartbeat, appName, id, null, null, isReplication);
        return true;
    }
    return false;
}
AbstractInstanceRegistry.renew
This method looks up the lease map of the application, finds the lease of the instance, and then calls Lease.renew() to renew it.
public boolean renew(String appName, String id, boolean isReplication) {
    RENEW.increment(isReplication);
    // Get the leases registered under this service name
    Map<String, Lease<InstanceInfo>> gMap = registry.get(appName);
    Lease<InstanceInfo> leaseToRenew = null;
    if (gMap != null) {
        // Get the lease of the instance that is renewing
        leaseToRenew = gMap.get(id);
    }
    if (leaseToRenew == null) {
        // The instance does not exist, so the renewal fails immediately
        RENEW_NOT_FOUND.increment(isReplication);
        logger.warn("DS: Registry: lease doesn't exist, registering resource: {} - {}", appName, id);
        return false;
    } else {
        // The instance exists; get its basic information
        InstanceInfo instanceInfo = leaseToRenew.getHolder();
        if (instanceInfo != null) {
            // touchASGCache(instanceInfo.getASGName());
            // Get the (possibly overridden) status of the instance
            InstanceStatus overriddenInstanceStatus = this.getOverriddenInstanceStatus(
                    instanceInfo, leaseToRenew, isReplication);
            if (overriddenInstanceStatus == InstanceStatus.UNKNOWN) {
                // The status is unknown, so the renewal also fails
                logger.info("Instance status UNKNOWN possibly due to deleted override for instance {}"
                        + "; re-register required", instanceInfo.getId());
                RENEW_NOT_FOUND.increment(isReplication);
                return false;
            }
            // If the instance's current status differs from the overridden status, apply the overridden status
            if (!instanceInfo.getStatus().equals(overriddenInstanceStatus)) {
                logger.info(
                        "The instance status {} is different from overridden instance status {} for instance {}. "
                                + "Hence setting the status to overridden status", instanceInfo.getStatus().name(),
                        overriddenInstanceStatus.name(),
                        instanceInfo.getId());
                instanceInfo.setStatusWithoutDirty(overriddenInstanceStatus);
            }
        }
        // Update the number of renewals in the last minute
        renewsLastMin.increment();
        // Renew the lease
        leaseToRenew.renew();
        return true;
    }
}
Renewal itself just updates the lease's lastUpdateTimestamp, which records when the server last received a heartbeat for the instance.
public void renew() {
    lastUpdateTimestamp = System.currentTimeMillis() + duration;
}
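Note that renew() stores the current time plus duration, not the current time alone. The sketch below illustrates what that implies under one assumption: that expiry is checked as lastUpdateTimestamp + duration < now. The expiry check is not shown in this article, so treat that formula, and the hypothetical `LeaseTimingSketch` class, as assumptions.

```java
// Simplified lease model; LeaseTimingSketch is a hypothetical name, not Eureka's Lease class.
class LeaseTimingSketch {
    private final long durationMs;
    private long lastUpdateTimestamp;

    LeaseTimingSketch(long durationMs) {
        this.durationMs = durationMs;
    }

    // Same arithmetic as the renew() shown above: now + duration.
    void renew(long nowMs) {
        lastUpdateTimestamp = nowMs + durationMs;
    }

    // ASSUMED expiry check: the stored timestamp plus the duration must lie in the past.
    boolean isExpired(long nowMs) {
        return lastUpdateTimestamp + durationMs < nowMs;
    }
}
```

With a 90s duration and a heartbeat at t=0, this model reports the lease as expired only after t=180s. If the assumption holds, the effective timeout is closer to twice the configured duration than to 90s, which is worth keeping in mind when tuning lease-expiration-duration-in-seconds.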
Eureka's self-protection mechanism
In fact, heartbeat detection is not fully reliable. For example, a service provider may be healthy, but because of a network communication problem its heartbeats do not reach the server within 90s, which leads to healthy instances being evicted by mistake.
To avoid this problem, Eureka provides a self-protection mechanism. In short, once self-protection is in effect, Eureka Server protects the registered service instances and stops evicting them for expiration, which makes the Eureka cluster more robust and stable.
After entering the self-protection state, the following situations will occur:
- Eureka Server no longer removes expired services from the registration list, even those that would normally be removed because no heartbeat has been received for a long time. If a service provider goes offline abnormally during the protection period, service consumers will receive an invalid service instance and the call will fail. To handle this, service consumers need fault-tolerance mechanisms such as retries and circuit breakers.
- Eureka Server can still accept registration and query requests for new services, but they are not synchronized to other nodes. This ensures that the current node remains available.
The self-protection mechanism is controlled by the eureka.server.enable-self-preservation property: true enables it and false disables it. It is enabled by default, and keeping it enabled in production is recommended.
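As a minimal illustration of the consumer-side fault tolerance mentioned in the list above, the sketch below tries each candidate instance in turn and moves on when a call fails. `FailoverCallSketch` and `callWithFailover` are hypothetical names. A real client would normally rely on a load balancer or circuit-breaker library instead of hand-written logic like this.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical helper: try each instance until one call succeeds.
class FailoverCallSketch {
    static <T> T callWithFailover(List<String> instances, Function<String, T> call) {
        RuntimeException last = null;
        for (String instance : instances) {
            try {
                return call.apply(instance);
            } catch (RuntimeException e) {
                // The registry entry may be stale (for example during self-protection); try the next one.
                last = e;
            }
        }
        throw new IllegalStateException("no instance could serve the request", last);
    }
}
```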
How should the self-protection mechanism be designed so that it can tell communication delays caused by network problems apart from real service downtime?
Eureka does this: if the number of heartbeat renewals received in the last minute falls below 85% of the expected number, Eureka Server concludes that there is a network failure between the clients and the registry and automatically enters the self-protection state.
The threshold of 85% can be set through the following configuration:
# Self-protection renewal percentage, default is 0.85
eureka.server.renewal-percent-threshold=0.85
But this raises another question: 85% of what? The answer is the expected number of renewals per minute, which is calculated as follows:
// Self-protection threshold = total number of service instances * renewals per minute per instance (60s / client renewal interval) * self-protection renewal percentage threshold factor
Assuming that there are 100 service instances, the renewal interval is 30s, and the self-protection percentage is 0.85, the threshold is:
Self-protection threshold = 100 * (60 / 30) * 0.85 = 170
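The arithmetic above can be checked with a few lines of Java. `RenewalThresholdSketch` and its parameter names are hypothetical; the expression follows the formula given in the text.

```java
// Hypothetical helper that evaluates the self-protection threshold formula.
class RenewalThresholdSketch {
    // expectedClients: number of registered instances expected to send heartbeats
    // renewalIntervalSeconds: client heartbeat interval (30 by default)
    // percentThreshold: eureka.server.renewal-percent-threshold (0.85 by default)
    static int renewsPerMinThreshold(int expectedClients, int renewalIntervalSeconds, double percentThreshold) {
        return (int) (expectedClients * (60.0 / renewalIntervalSeconds) * percentThreshold);
    }

    public static void main(String[] args) {
        System.out.println(renewsPerMinThreshold(100, 30, 0.85)); // prints 170
    }
}
```

The cast to int truncates the result, so a single instance yields a threshold of 1 rather than 1.7.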
Setting the renewal threshold
In the contextInitialized method of the EurekaServerBootstrap class, initEurekaServerContext is called for initialization.
public void contextInitialized(ServletContext context) {
    try {
        initEurekaEnvironment();
        initEurekaServerContext();
        context.setAttribute(EurekaServerContext.class.getName(), this.serverContext);
    } catch (Throwable e) {
        log.error("Cannot bootstrap eureka server :", e);
        throw new RuntimeException("Cannot bootstrap eureka server :", e);
    }
}
Next, look at initEurekaServerContext.
protected void initEurekaServerContext() throws Exception {
    EurekaServerConfig eurekaServerConfig = new DefaultEurekaServerConfig();
    //...
    registry.openForTraffic(applicationInfoManager, registryCount);
}
In the openForTraffic method, the value expectedNumberOfClientsSendingRenews is initialized. It is the number of clients that are expected to send renewals, and it depends on the number of service instances registered on the Eureka server.
@Override
public void openForTraffic(ApplicationInfoManager applicationInfoManager, int count) {
    // Renewals happen every 30 seconds and for a minute it should be a factor of 2.
    this.expectedNumberOfClientsSendingRenews = count; // The initial value is 1
    updateRenewsPerMinThreshold();
    logger.info("Got {} instances from neighboring DS node", count);
    logger.info("Renew threshold is: {}", numberOfRenewsPerMinThreshold);
    this.startupTime = System.currentTimeMillis();
    if (count > 0) {
        this.peerInstancesTransferEmptyOnStartup = false;
    }
    DataCenterInfo.Name selfName = applicationInfoManager.getInfo().getDataCenterInfo().getName();
    boolean isAws = Name.Amazon == selfName;
    if (isAws && serverConfig.shouldPrimeAwsReplicaConnections()) {
        logger.info("Priming AWS connections for all replicas..");
        primeAwsReplicas(applicationInfoManager);
    }
    logger.info("Changing status to UP");
    applicationInfoManager.setInstanceStatus(InstanceStatus.UP);
    super.postInit();
}
updateRenewsPerMinThreshold
openForTraffic then calls updateRenewsPerMinThreshold to update the minimum number of renewals per minute, that is, the threshold for the total number of renewals that Eureka Server expects to receive from client instances each minute. If the number actually received is below this threshold, the self-protection mechanism is triggered.
// Self-protection threshold = total number of service instances * renewals per minute per instance (60s / client renewal interval) * self-protection renewal percentage threshold factor
protected void updateRenewsPerMinThreshold() {
    this.numberOfRenewsPerMinThreshold = (int) (this.expectedNumberOfClientsSendingRenews
            * (60.0 / serverConfig.getExpectedClientRenewalIntervalSeconds())
            * serverConfig.getRenewalPercentThreshold());
}
- getExpectedClientRenewalIntervalSeconds: the renewal interval of the client. The default is 30s.
- getRenewalPercentThreshold: the self-protection renewal percentage threshold factor. The default is 0.85. In other words, the number of renewals received per minute should be greater than 85% of the expected number.
What triggers changes to the expected values
expectedNumberOfClientsSendingRenews and numberOfRenewsPerMinThreshold change when a new service instance registers and when an instance goes offline.
PeerAwareInstanceRegistryImpl.cancel
When a service provider actively goes offline, Eureka Server removes the address of that provider, which also means the heartbeat renewal threshold has to change. The corresponding update can be seen in PeerAwareInstanceRegistryImpl.cancel.
Call path: PeerAwareInstanceRegistryImpl.cancel -> AbstractInstanceRegistry.cancel -> internalCancel
After the instance goes offline, the number of clients expected to send renewals decreases, so it is adjusted here:
protected boolean internalCancel(String appName, String id, boolean isReplication) {
    //....
    synchronized (lock) {
        if (this.expectedNumberOfClientsSendingRenews > 0) {
            // Since the client wants to cancel it, reduce the number of clients to send renews.
            this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews - 1;
            updateRenewsPerMinThreshold();
        }
    }
    return true;
}
PeerAwareInstanceRegistryImpl.register
When a new service provider registers with Eureka Server, the number of clients expected to send renewals has to increase, so this is handled in the register method.
Call path: PeerAwareInstanceRegistryImpl.register -> super.register (AbstractInstanceRegistry)
public void register(InstanceInfo registrant, int leaseDuration, boolean isReplication) {
    //....
    // The lease does not exist and hence it is a new registration
    synchronized (lock) {
        if (this.expectedNumberOfClientsSendingRenews > 0) {
            // Since the client wants to register it, increase the number of clients sending renews
            this.expectedNumberOfClientsSendingRenews = this.expectedNumberOfClientsSendingRenews + 1;
            updateRenewsPerMinThreshold();
        }
    }
}
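The interplay between the counter and the threshold in the cancel and register snippets above can be modelled in isolation. The sketch below is a simplified model, not Eureka's code: `ExpectedClientsSketch` is a hypothetical class, and the 30s interval and 0.85 factor are hard-coded defaults.

```java
// Simplified model of how registration and cancellation move the renewal threshold.
class ExpectedClientsSketch {
    private final Object lock = new Object();
    private int expectedClients;
    private int renewsPerMinThreshold;

    ExpectedClientsSketch(int initialClients) {
        this.expectedClients = initialClients;
        updateThreshold();
    }

    void onRegister() {
        synchronized (lock) {
            // Same guard as in the snippets above: only adjust a positive counter.
            if (expectedClients > 0) {
                expectedClients++;
                updateThreshold();
            }
        }
    }

    void onCancel() {
        synchronized (lock) {
            if (expectedClients > 0) {
                expectedClients--;
                updateThreshold();
            }
        }
    }

    int threshold() {
        synchronized (lock) {
            return renewsPerMinThreshold;
        }
    }

    // Defaults assumed here: 30s renewal interval, 0.85 percentage factor.
    private void updateThreshold() {
        renewsPerMinThreshold = (int) (expectedClients * (60.0 / 30) * 0.85);
    }
}
```

Starting from 10 expected clients the threshold is 17. One registration raises it to 18, and one cancellation brings it back to 17.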
Refresh self-protection threshold every 15 minutes
PeerAwareInstanceRegistryImpl.scheduleRenewalThresholdUpdateTask
scheduleRenewalThresholdUpdateTask schedules updateRenewalThreshold, which refreshes the self-protection threshold every 15 minutes.
private void updateRenewalThreshold() {
    try {
        // 1. Calculate the number of application instances
        Applications apps = eurekaClient.getApplications();
        int count = 0;
        for (Application app : apps.getRegisteredApplications()) {
            for (InstanceInfo instance : app.getInstances()) {
                if (this.isRegisterable(instance)) {
                    ++count;
                }
            }
        }
        synchronized (lock) {
            // Update threshold only if the threshold is greater than the
            // current expected threshold or if self preservation is disabled.
            // That is: recalculate expectedNumberOfClientsSendingRenews and numberOfRenewsPerMinThreshold
            // only when count exceeds 85% of the currently expected clients, or when self-protection is disabled
            if ((count) > (serverConfig.getRenewalPercentThreshold() * expectedNumberOfClientsSendingRenews)
                    || (!this.isSelfPreservationModeEnabled())) {
                this.expectedNumberOfClientsSendingRenews = count;
                updateRenewsPerMinThreshold();
            }
        }
        logger.info("Current renewal threshold is : {}", numberOfRenewsPerMinThreshold);
    } catch (Throwable e) {
        logger.error("Cannot update renewal threshold", e);
    }
}
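The condition inside the synchronized block decides whether the freshly counted number of instances replaces the current expectation. The sketch below isolates that rule. `ThresholdRefreshSketch` and its parameter names are hypothetical.

```java
// Isolates the acceptance rule used by the periodic threshold refresh.
class ThresholdRefreshSketch {
    // Returns the new value for the expected number of clients.
    static int refreshExpectedClients(int currentExpected, int observedCount,
                                      double percentThreshold, boolean selfPreservationEnabled) {
        boolean accept = observedCount > percentThreshold * currentExpected
                || !selfPreservationEnabled;
        return accept ? observedCount : currentExpected;
    }
}
```

With 100 expected clients and self-protection enabled, an observed count of 90 is accepted, while a count of 50 is ignored. A plausible reading of this design is that a sudden large drop is more likely a network problem than a real mass shutdown, so the expectation is not lowered to match it.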
Triggering the self-protection mechanism
In the postInit method of AbstractInstanceRegistry, an EvictionTask is started. Each time it runs, it first checks whether the self-protection mechanism is in effect before evicting expired instances.
postInit is called at the end of openForTraffic, so it also runs when EurekaServerBootstrap initializes the server.
protected void postInit() {
    // Start a scheduled task that tracks the number of renewals per minute; it is recalculated every 60s
    renewsLastMin.start();
    if (evictionTaskRef.get() != null) {
        evictionTaskRef.get().cancel();
    }
    // Start the scheduled EvictionTask, which runs every 60s
    evictionTaskRef.set(new EvictionTask());
    evictionTimer.schedule(evictionTaskRef.get(),
            serverConfig.getEvictionIntervalTimerInMs(),
            serverConfig.getEvictionIntervalTimerInMs());
}
The code of EvictionTask is as follows.
class EvictionTask extends TimerTask {
    private final AtomicLong lastExecutionNanosRef = new AtomicLong(0l);

    @Override
    public void run() {
        try {
            // Get the compensation time in milliseconds
            long compensationTimeMs = getCompensationTimeMs();
            logger.info("Running the evict task with compensationTime {}ms", compensationTimeMs);
            evict(compensationTimeMs);
        } catch (Throwable e) {
            logger.error("Could not run the evict task", e);
        }
    }
}
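getCompensationTimeMs() is not shown in this article. A plausible reading is that it measures how much later than scheduled the task actually ran, for example because of a GC pause, so that evict can extend leases by that amount. The sketch below illustrates the idea under that assumption. `CompensationSketch` and its explicit `nowNanos` parameter are hypothetical.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of a "compensation time": the delay beyond the configured eviction interval.
class CompensationSketch {
    private final AtomicLong lastExecutionNanos = new AtomicLong(0L);
    private final long evictionIntervalMs;

    CompensationSketch(long evictionIntervalMs) {
        this.evictionIntervalMs = evictionIntervalMs;
    }

    long compensationTimeMs(long nowNanos) {
        long last = lastExecutionNanos.getAndSet(nowNanos);
        if (last == 0L) {
            return 0L; // first run: nothing to compare against
        }
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - last);
        // Only time beyond the scheduled interval counts as compensation.
        return Math.max(0L, elapsedMs - evictionIntervalMs);
    }
}
```

For example, with a 60s interval, a run that starts 65s after the previous one yields 5000 ms of compensation, and a run that starts on time yields 0.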
evict method
public void evict(long additionalLeaseMs) {
    logger.debug("Running the evict task");
    // Check whether lease expiration is currently allowed. If it is not (self-protection is in effect),
    // return directly without evicting anything
    if (!isLeaseExpirationEnabled()) {
        logger.debug("DS: lease expiration is currently disabled.");
        return;
    }
    // The rest of the method takes expired service instances offline automatically.
}
isLeaseExpirationEnabled
- Check whether the self-protection mechanism is enabled. If it is disabled, expiration is always allowed and the method returns true. It is enabled by default.
- Otherwise, decide whether self-protection should take effect by checking whether the number of renewals received in the last minute is greater than numberOfRenewsPerMinThreshold. If it is not, the method returns false and eviction is skipped.
public boolean isLeaseExpirationEnabled() {
    if (!isSelfPreservationModeEnabled()) {
        // The self preservation mode is disabled, hence allowing the instances to expire.
        return true;
    }
    return numberOfRenewsPerMinThreshold > 0 && getNumOfRenewsInLastMin() > numberOfRenewsPerMinThreshold;
}
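To make the outcomes concrete, the same decision can be written as a pure function and evaluated for a few inputs. `LeaseExpirationSketch` is a hypothetical name; the logic follows the method above.

```java
// Pure-function version of the eviction decision, for experimenting with inputs.
class LeaseExpirationSketch {
    static boolean isLeaseExpirationEnabled(boolean selfPreservationEnabled,
                                            int renewsPerMinThreshold,
                                            long renewsInLastMin) {
        if (!selfPreservationEnabled) {
            // Self-protection is switched off: instances may always expire.
            return true;
        }
        // Eviction is allowed only while enough heartbeats are arriving.
        return renewsPerMinThreshold > 0 && renewsInLastMin > renewsPerMinThreshold;
    }
}
```

With a threshold of 170, receiving 180 renewals in the last minute allows eviction, while receiving 160 blocks it, which is the self-protection state. Receiving exactly 170 also blocks eviction because the comparison is strict.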
Copyright notice: unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 4.0. When reprinting, please credit the source: Mic to take you to learn architecture!