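Alibaba second-round interview: how do you do a graceful release with Nacos as the registry?
Hi everyone, I'm Jun Ge. I'm reposting this article today.
Today let's talk about how to do a graceful release when Nacos is the registry.
As with other registries, Nacos is used as a registry as shown in the figure below:
[Image]
After a Service Provider starts, it registers with Nacos Server. A Service Consumer pulls the service list from Nacos Server and picks a Service Provider by some load-balancing algorithm to send the request to.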
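1. Requirements for a graceful release
A graceful release means that once a Service Provider is online (registered with Nacos), it can receive and process requests normally, and once it has been stopped, it receives no further requests. That gives two requirements:
- Graceful online: until the Service Provider has finished deploying, a Service Consumer should not get its address from the service list;
- Graceful offline: after the Service Provider goes offline, a Service Consumer no longer gets its address from the service list.
Solve these two problems and a graceful release is achieved.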
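2. Setting up the environment
The environment is there so we can read the Nacos logs and use them to find the corresponding source code. The environment for this article is shown below:
[Image]
2.1 Starting the provider
Start the springboot-provider application and register it with Nacos. The startup log is: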
2023-06-11 18:58:10,120 [main] [INFO] com.alibaba.nacos.client.naming - [BEAT] adding beat: BeatInfo{port=8083, ip='192.168.31.94', weight=1.0, serviceName='DEFAULT_GROUP@@springboot-provider', cluster='DEFAULT', metadata={management.endpoints.web.base-path=/actuator, management.port=18082, preserved.register.source=SPRING_CLOUD, management.address=127.0.0.1}, scheduled=false, period=5000, stopped=false} to beat map.
2023-06-11 18:58:10,121 [main] [INFO] com.alibaba.nacos.client.naming - [REGISTER-SERVICE] public registering service DEFAULT_GROUP@@springboot-provider with instance: Instance{instanceId='null', ip='192.168.31.94', port=8083, weight=1.0, healthy=true, enabled=true, ephemeral=true, clusterName='DEFAULT', serviceName='null', metadata={management.endpoints.web.base-path=/actuator, management.port=18082, preserved.register.source=SPRING_CLOUD, management.address=127.0.0.1}}
2023-06-11 18:58:10,133 [main] [INFO] com.alibaba.cloud.nacos.registry.NacosServiceRegistry - nacos registry, DEFAULT_GROUP springboot-provider 192.168.31.94:8083 register finished
2023-06-11 18:58:10,221 [main] [INFO] org.springframework.boot.web.embedded.tomcat.TomcatWebServer - Tomcat initialized with port(s): 18082 (http)
2023-06-11 18:58:10,222 [main] [INFO] org.apache.coyote.http11.Http11NioProtocol - Initializing ProtocolHandler ["http-nio-127.0.0.1-18082"]
2023-06-11 18:58:10,223 [main] [INFO] org.apache.catalina.core.StandardService - Starting service [Tomcat]
2023-06-11 18:58:10,223 [main] [INFO] org.apache.catalina.core.StandardEngine - Starting Servlet engine: [Apache Tomcat/9.0.21]
2023-06-11 18:58:10,239 [main] [INFO] org.apache.catalina.core.ContainerBase.[Tomcat-1].[localhost].[/] - Initializing Spring embedded WebApplicationContext
2023-06-11 18:58:10,239 [main] [INFO] org.springframework.web.context.ContextLoader - Root WebApplicationContext: initialization completed in 99 ms
2023-06-11 18:58:10,268 [main] [INFO] org.springframework.boot.actuate.endpoint.web.EndpointLinksResolver - Exposing 22 endpoint(s) beneath base path '/actuator'
2023-06-11 18:58:10,336 [main] [INFO] org.apache.coyote.http11.Http11NioProtocol - Starting ProtocolHandler ["http-nio-127.0.0.1-18082"]
2023-06-11 18:58:10,340 [main] [INFO] org.springframework.boot.web.embedded.tomcat.TomcatWebServer - Tomcat started on port(s): 18082 (http) with context path ''
2023-06-11 18:58:10,342 [main] [INFO] boot.Application - Started Application in 7.051 seconds (JVM running for 7.874)
2023-06-11 18:58:10,358 [main] [INFO] com.alibaba.nacos.client.config.impl.ClientWorker - [fixed-39.105.183.91_8848] [subscribe] springboot-provider.properties+DEFAULT_GROUP
2023-06-11 18:58:10,359 [main] [INFO] com.alibaba.nacos.client.config.impl.CacheData - [fixed-39.105.183.91_8848] [add-listener] ok, tenant=, dataId=springboot-provider.properties, group=DEFAULT_GROUP, cnt=1
2023-06-11 18:58:10,359 [main] [INFO] com.alibaba.nacos.client.config.impl.ClientWorker - [fixed-39.105.183.91_8848] [subscribe] springboot-provider-dev.properties+DEFAULT_GROUP
2023-06-11 18:58:10,359 [main] [INFO] com.alibaba.nacos.client.config.impl.CacheData - [fixed-39.105.183.91_8848] [add-listener] ok, tenant=, dataId=springboot-provider-dev.properties, group=DEFAULT_GROUP, cnt=1
2023-06-11 18:58:10,360 [main] [INFO] com.alibaba.nacos.client.config.impl.ClientWorker - [fixed-39.105.183.91_8848] [subscribe] springboot-provider+DEFAULT_GROUP
2023-06-11 18:58:10,360 [main] [INFO] com.alibaba.nacos.client.config.impl.CacheData - [fixed-39.105.183.91_8848] [add-listener] ok, tenant=, dataId=springboot-provider, group=DEFAULT_GROUP, cnt=1
2023-06-11 18:58:10,639 [RMI TCP Connection(1)-192.168.31.94] [INFO] org.apache.catalina.core.ContainerBase.[Tomcat].[localhost].[/] - Initializing Spring DispatcherServlet 'dispatcherServlet'
2023-06-11 18:58:10,839 [com.alibaba.nacos.client.naming.updater] [INFO] com.alibaba.nacos.client.naming - [BEAT] adding beat: BeatInfo{port=8083, ip='192.168.31.94', weight=1.0, serviceName='DEFAULT_GROUP@@springboot-provider', cluster='DEFAULT', metadata={management.endpoints.web.base-path=/actuator, management.port=18082, preserved.register.source=SPRING_CLOUD, management.address=127.0.0.1}, scheduled=false, period=5000, stopped=false} to beat map.
2023-06-11 18:58:10,840 [com.alibaba.nacos.client.naming.updater] [INFO] com.alibaba.nacos.client.naming - modified ips(1) service: DEFAULT_GROUP@@springboot-provider@@DEFAULT -> [{"instanceId":"192.168.31.94#8083#DEFAULT#DEFAULT_GROUP@@springboot-provider","ip":"192.168.31.94","port":8083,"weight":1.0,"healthy":true,"enabled":true,"ephemeral":true,"clusterName":"DEFAULT","serviceName":"DEFAULT_GROUP@@springboot-provider","metadata":{"management.endpoints.web.base-path":"/actuator","management.port":"18082","preserved.register.source":"SPRING_CLOUD","management.address":"127.0.0.1"},"ipDeleteTimeout":30000,"instanceHeartBeatInterval":5000,"instanceHeartBeatTimeOut":15000}]
2023-06-11 18:58:10,841 [com.alibaba.nacos.client.naming.updater] [INFO] com.alibaba.nacos.client.naming - current ips:(1) service: DEFAULT_GROUP@@springboot-provider@@DEFAULT -> [{"instanceId":"192.168.31.94#8083#DEFAULT#DEFAULT_GROUP@@springboot-provider","ip":"192.168.31.94","port":8083,"weight":1.0,"healthy":true,"enabled":true,"ephemeral":true,"clusterName":"DEFAULT","serviceName":"DEFAULT_GROUP@@springboot-provider","metadata":{"management.endpoints.web.base-path":"/actuator","management.port":"18082","preserved.register.source":"SPRING_CLOUD","management.address":"127.0.0.1"},"ipDeleteTimeout":30000,"instanceHeartBeatInterval":5000,"instanceHeartBeatTimeOut":15000}]
Now look at the Nacos log. The file here is naming-server.log:
2023-06-11 18:58:09,723 INFO Client connection 192.168.31.94:51885#true connect
2023-06-11 18:58:10,105 INFO Client change for service Service{namespace='public', group='DEFAULT_GROUP', name='springboot-provider', ephemeral=true, revision=1}, 192.168.31.94:8083#true
2023-06-11 18:58:18,204 INFO Client connection 192.168.31.94:60850#true disconnect, remove instances and subscribers
Once springboot-provider has started, the Nacos console shows the following:
[Image]
2.2 Provider goes offline
After the service goes offline, the Nacos log is:
2023-06-11 19:01:03,375 INFO Client connection 192.168.31.94:51885#true disconnect, remove instances and subscribers
2023-06-11 19:01:05,048 INFO [AUTO-DELETE-IP] service: Service{namespace='public', group='DEFAULT_GROUP', name='springboot-provider', ephemeral=true, revision=2}, ip: {"ip":"192.168.31.94","port":8083,"healthy":false,"cluster":"DEFAULT","extendDatum":{"management.endpoints.web.base-path":"/actuator","management.port":"18082","preserved.register.source":"SPRING_CLOUD","management.address":"127.0.0.1","customInstanceId":"192.168.31.94#8083#DEFAULT#DEFAULT_GROUP@@springboot-provider"},"lastHeartBeatTime":1686481231604,"metadataId":"192.168.31.94:8083:DEFAULT"}
2023-06-11 19:01:05,048 INFO Client remove for service Service{namespace='public', group='DEFAULT_GROUP', name='springboot-provider', ephemeral=true, revision=2}, 192.168.31.94:8083#true
2023-06-11 19:01:08,379 INFO Client connection 192.168.31.94:8083#true disconnect, remove instances and subscribers
2.3 Service invocation
Run a unit test on springboot-consumer that calls the following method through FeignClient:
@FeignClient(value = "springboot-provider", configuration = FeignMultipartSupportConfig.class)
public interface FeignAsEurekaClient {
    @PostMapping("/employee/save")
    String saveEmployeebyName(@RequestBody Employee employee);
}
The log is:
2023-06-11 19:15:47,694 [main] [INFO] org.springframework.test.context.transaction.TransactionContext - Began transaction (1) for test context [DefaultTestContext@5bf0d49 testClass = TestFeignAsEurekaClient, testInstance = boot.service.TestFeignAsEurekaClient@10683d9d, testMethod = testPostEmployByFeign@TestFeignAsEurekaClient, testException = [null], mergedContextConfiguration = [WebMergedContextConfiguration@5b7a5baa testClass = TestFeignAsEurekaClient, locations = '{}', classes = '{class boot.Application, class boot.Application}', contextInitializerClasses = '[]', activeProfiles = '{}', propertySourceLocations = '{}', propertySourceProperties = '{org.springframework.boot.test.context.SpringBootTestContextBootstrapper=true, server.port=0}', contextCustomizers = set[org.springframework.boot.test.context.filter.ExcludeFilterContextCustomizer@166fa74d, org.springframework.boot.test.json.DuplicateJsonObjectContextCustomizerFactory$DuplicateJsonObjectContextCustomizer@588df31b, org.springframework.boot.test.mock.mockito.MockitoContextCustomizer@0, org.springframework.boot.test.web.client.TestRestTemplateContextCustomizer@7fad8c79, org.springframework.boot.test.autoconfigure.properties.PropertyMappingContextCustomizer@0, org.springframework.boot.test.autoconfigure.web.servlet.WebDriverContextCustomizerFactory$Customizer@10b48321], resourceBasePath = 'src/main/webapp', contextLoader = 'org.springframework.boot.test.context.SpringBootContextLoader', parent = [null]], attributes = map['org.springframework.test.context.web.ServletTestExecutionListener.activateListener' -> false]]; transaction manager [org.springframework.jdbc.datasource.DataSourceTransactionManager@693676d]; rollback [true]
2023-06-11 19:15:47,941 [main] [INFO] com.netflix.config.ChainedDynamicProperty - Flipping property: springboot-provider.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647
2023-06-11 19:15:47,962 [main] [INFO] com.netflix.loadbalancer.BaseLoadBalancer - Client: springboot-provider instantiated a LoadBalancer: DynamicServerListLoadBalancer:{NFLoadBalancer:name=springboot-provider,current list of Servers=[],Load balancer stats=Zone stats: {},Server stats: []}ServerList:null
2023-06-11 19:15:47,969 [main] [INFO] com.netflix.loadbalancer.DynamicServerListLoadBalancer - Using serverListUpdater PollingServerListUpdater
2023-06-11 19:15:48,064 [main] [INFO] com.netflix.config.ChainedDynamicProperty - Flipping property: springboot-provider.ribbon.ActiveConnectionsLimit to use NEXT property: niws.loadbalancer.availabilityFilteringRule.activeConnectionsLimit = 2147483647
2023-06-11 19:15:48,064 [main] [INFO] com.netflix.loadbalancer.DynamicServerListLoadBalancer - DynamicServerListLoadBalancer for client springboot-provider initialized: DynamicServerListLoadBalancer:{NFLoadBalancer:name=springboot-provider,current list of Servers=[192.168.31.94:8083],Load balancer stats=Zone stats: {unknown=[Zone:unknown; Instance count:1; Active connections count: 0; Circuit breaker tripped count: 0; Active connections per server: 0.0;]
},Server stats: [[Server:192.168.31.94:8083; Zone:UNKNOWN; Total Requests:0; Successive connection failure:0; Total blackout seconds:0; Last connection made:Thu Jan 01 08:00:00 CST 1970; First connection made: Thu Jan 01 08:00:00 CST 1970; Active Connections:0; total failure count in last (1000) msecs:0; average resp time:0.0; 90 percentile resp time:0.0; 95 percentile resp time:0.0; min resp time:0.0; max resp time:0.0; stddev resp time:0.0]
]}ServerList:com.alibaba.cloud.nacos.ribbon.NacosServerList@24d998ba
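Note that OpenFeign is used here, and it uses Ribbon for load balancing, so we have to take into account how often Ribbon refreshes its local service list. From the source code, the refresh period is 30s, as shown below:
[Image]
Ribbon's cache refresh logic is in the code below: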
public synchronized void start(final UpdateAction updateAction) {
    if (isActive.compareAndSet(false, true)) {
        final Runnable wrapperRunnable = new Runnable() {
            @Override
            public void run() {
                //...
            }
        };
        scheduledFuture = getRefreshExecutor().scheduleWithFixedDelay(
                wrapperRunnable,
                initialDelayMs,
                refreshIntervalMs, // defined as 30s here
                TimeUnit.MILLISECONDS
        );
    }//...
}
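To make the effect of that interval concrete, here is a minimal sketch in plain Java of a consumer-side list that only changes when a refresh runs. It is not Ribbon's code: `PollingCacheSketch`, `registry`, `refresh()` and `servers()` are hypothetical names, and the scheduled task is replaced by explicit calls so the behavior is easy to follow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical model of a consumer-side server list that is refreshed by polling.
final class PollingCacheSketch {
    // Stands in for the registry's current view of the provider list.
    final List<String> registry = new CopyOnWriteArrayList<>();
    // Local copy used for load balancing; it only changes when refresh() runs.
    private volatile List<String> cached = new ArrayList<>();

    // In Ribbon this step is driven by a scheduled task (every 30s per the text above).
    void refresh() {
        cached = new ArrayList<>(registry);
    }

    List<String> servers() {
        return cached;
    }

    public static void main(String[] args) {
        PollingCacheSketch consumer = new PollingCacheSketch();
        consumer.registry.add("192.168.31.94:8083");
        consumer.refresh();
        consumer.registry.clear();               // provider disappears from the registry
        System.out.println(consumer.servers());  // still lists the old address
        consumer.refresh();
        System.out.println(consumer.servers());  // empty after the next refresh
    }
}
```

Between the two refreshes the consumer can still route requests to an address the registry has already dropped, which is the window the rest of the article tries to cover.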
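3. Graceful release
As mentioned in section 1, a graceful release has two requirements: graceful online and graceful offline.
When a Service Consumer initializes, it fetches the service list from Nacos Server and updates its local cache, and it also subscribes to the service list on Nacos Server (if the list on Nacos Server changes, the server actively notifies the Service Consumer). After that, it pulls the service list on a timer (default interval 1s) and updates the local cache. The code is: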
// NacosNamingService class
public List<Instance> selectInstances(String serviceName, String groupName, List<String> clusters, boolean healthy,
        boolean subscribe) throws NacosException {
    ServiceInfo serviceInfo;
    String clusterString = StringUtils.join(clusters, ",");
    if (subscribe) {
        serviceInfo = serviceInfoHolder.getServiceInfo(serviceName, groupName, clusterString);
        if (null == serviceInfo) {
            serviceInfo = clientProxy.subscribe(serviceName, groupName, clusterString);
        }
    } else {
        serviceInfo = clientProxy.queryInstancesOfService(serviceName, groupName, clusterString, 0, false);
    }
    return selectInstances(serviceInfo, healthy);
}
The subscribe code also schedules the periodic update of the service list:
// NamingClientProxyDelegate class
public ServiceInfo subscribe(String serviceName, String groupName, String clusters) throws NacosException {
    NAMING_LOGGER.info("[SUBSCRIBE-SERVICE] service:{}, group:{}, clusters:{} ", serviceName, groupName, clusters);
    String serviceNameWithGroup = NamingUtils.getGroupedName(serviceName, groupName);
    String serviceKey = ServiceInfo.getKey(serviceNameWithGroup, clusters);
    serviceInfoUpdateService.scheduleUpdateIfAbsent(serviceName, groupName, clusters);
    ServiceInfo result = serviceInfoHolder.getServiceInfoMap().get(serviceKey);
    if (null == result || !isSubscribed(serviceName, groupName, clusters)) {
        result = grpcClientProxy.subscribe(serviceName, groupName, clusters);
    }
    serviceInfoHolder.processServiceInfo(result);
    return result;
}
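Nacos Server periodically (every 5s) checks whether each Service Provider is healthy, based on heartbeats. If no heartbeat arrives for 15s (the default, configurable), it marks the instance unhealthy and notifies the Service Consumers. The code is: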
// UnhealthyInstanceChecker class
public void doCheck(Client client, Service service, HealthCheckInstancePublishInfo instance) {
    if (instance.isHealthy() && isUnhealthy(service, instance)) {
        changeHealthyStatus(client, service, instance);
    }
}
private void changeHealthyStatus(Client client, Service service, HealthCheckInstancePublishInfo instance) {
    instance.setHealthy(false);
    NotifyCenter.publishEvent(new ServiceEvent.ServiceChangedEvent(service));
    NotifyCenter.publishEvent(new ClientEvent.ClientChangedEvent(client));
    NotifyCenter.publishEvent(new HealthStateChangeTraceEvent(System.currentTimeMillis(),
            service.getNamespace(), service.getGroup(), service.getName(), instance.getIp(), instance.getPort(),
            false, "client_beat"));
}
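The guard in doCheck means an instance is flipped to unhealthy, and consumers are notified, only once per outage. The following is a small plain-Java model of that behavior, not the Nacos implementation: `HealthCheckSketch` and its fields are hypothetical, the 15s timeout is the default quoted above, and the current time is passed in rather than read from the clock.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "mark unhealthy once, then notify"; not Nacos source code.
final class HealthCheckSketch {
    static final long HEARTBEAT_TIMEOUT_MS = 15_000; // default quoted in the text

    boolean healthy = true;
    final long lastBeatMs;
    final List<String> notifications = new ArrayList<>();

    HealthCheckSketch(long lastBeatMs) {
        this.lastBeatMs = lastBeatMs;
    }

    // Called by a periodic task (every 5s in the text); nowMs is a parameter to keep it deterministic.
    void check(long nowMs) {
        boolean timedOut = nowMs - lastBeatMs > HEARTBEAT_TIMEOUT_MS;
        if (healthy && timedOut) { // same guard shape as doCheck above
            healthy = false;
            notifications.add("service changed at " + nowMs);
        }
    }

    public static void main(String[] args) {
        HealthCheckSketch instance = new HealthCheckSketch(0);
        for (long now = 5_000; now <= 30_000; now += 5_000) {
            instance.check(now);
        }
        System.out.println(instance.healthy);       // false
        System.out.println(instance.notifications); // a single entry, added at the 20s check
    }
}
```

Because the check only runs every 5s, the instance in this model is marked unhealthy at the 20s tick rather than at exactly 15s.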
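3.1 Graceful online
The main problem with going online is that requests arrive after the Service Provider has registered with Nacos but before it has finished initializing. This mostly happens because the Service Provider registers with Nacos immediately on startup, while the interfaces it exposes may not be initialized yet.
The fix is to turn off automatic registration: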
spring.cloud.nacos.discovery.registerEnabled=false
After the service has initialized, register it manually in code:
Properties setting8 = new Properties();
String serverIp8 = "127.0.0.1:8848";
setting8.put(PropertyKeyConst.SERVER_ADDR, serverIp8);
setting8.put(PropertyKeyConst.USERNAME, "nacos");
setting8.put(PropertyKeyConst.PASSWORD, "nacos");
// Create the naming client from the properties built above.
NamingService inaming8 = NacosFactory.createNamingService(setting8);
// Register this instance only now that initialization is complete.
inaming8.registerInstance("springboot-provider", "192.168.31.94", 8083);
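The ordering is the important part: initialize first, register last, and do not register at all if initialization fails. Below is a minimal sketch of that ordering in plain Java. `GracefulStartupSketch` and its `Registry` interface are hypothetical stand-ins for the application and the Nacos client, so the example runs without a Nacos server.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of "initialize first, register last"; Registry stands in for NamingService.
final class GracefulStartupSketch {
    interface Registry {
        void register(String service, String ip, int port);
    }

    // Runs warm-up, and registers only if warm-up finished without throwing.
    static boolean start(Runnable warmUp, Registry registry) {
        try {
            warmUp.run();
        } catch (RuntimeException e) {
            return false; // never registered, so consumers never see a half-started instance
        }
        registry.register("springboot-provider", "192.168.31.94", 8083);
        return true;
    }

    public static void main(String[] args) {
        List<String> calls = new ArrayList<>();
        Registry fake = (service, ip, port) -> calls.add(service + "@" + ip + ":" + port);

        System.out.println(start(() -> { /* load caches, open pools */ }, fake)); // true
        System.out.println(start(() -> { throw new IllegalStateException("init failed"); }, fake)); // false
        System.out.println(calls); // only the successful start registered
    }
}
```

In a Spring Boot application the same idea would usually be hooked to an application-ready callback, which is an assumption about the setup rather than something the article specifies.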
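3.2 Graceful offline
For a normal shutdown, once Nacos Server receives the deregistration request from the Provider, it notifies the subscribed Service Consumers, and each Consumer also refreshes its local service list every 1s, so this path is already very close to graceful.
For an abnormal shutdown, Nacos Server relies on heartbeat detection to update the service list. The heartbeat period is 5s, and Nacos Server only marks an instance unhealthy after 15s without a heartbeat.
3.2.1 Normal shutdown
For a normal shutdown, the most graceful approach is to send the offline notification to Nacos Server first and stop the service only after a delay (say 5s). For example, add an API endpoint and have a preStopHook call it before the service goes down. Sample code for the endpoint: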
@GetMapping(value = "/nacos/deregisterInstance")
public String deregisterInstance() throws NacosException {
    Properties prop = new Properties();
    prop.setProperty("serverAddr", "localhost");
    prop.put(PropertyKeyConst.NAMESPACE, "test");
    NacosNamingService client = new NacosNamingService(prop);
    // Remove this instance from the service list before the process stops.
    client.deregisterInstance("springboot-provider", "192.168.31.94", 8083);
    return "success";
}
When Ribbon is in use, its local service list cache also has to be considered: after deregistering manually, wait another 30s before shutting the service down.
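Put together, the shutdown order is: deregister, wait for consumer caches to drop the address, then stop. Here is a rough sketch of that sequence in plain Java. `GracefulShutdownSketch`, `Registry`, and the injected `sleeper` are hypothetical; the sleeper is a parameter so the demo does not actually block for 30s.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongConsumer;

// Hypothetical model of the shutdown order; Registry stands in for the Nacos client.
final class GracefulShutdownSketch {
    interface Registry {
        void deregister(String service, String ip, int port);
    }

    // Returns the steps in the order they ran, so the order is easy to inspect.
    static List<String> shutdown(Registry registry, long drainMs, LongConsumer sleeper, Runnable stopServer) {
        List<String> steps = new ArrayList<>();
        registry.deregister("springboot-provider", "192.168.31.94", 8083);
        steps.add("deregistered");
        sleeper.accept(drainMs); // let consumer-side caches drop the address
        steps.add("drained " + drainMs + "ms");
        stopServer.run();
        steps.add("stopped");
        return steps;
    }

    public static void main(String[] args) {
        List<String> steps = shutdown(
                (service, ip, port) -> { /* call the registry here */ },
                30_000, // Ribbon refresh period from the text
                ms -> { /* a real hook would sleep for ms milliseconds */ },
                () -> { /* stop the web server */ });
        System.out.println(steps); // [deregistered, drained 30000ms, stopped]
    }
}
```

The drain time is a tunable assumption: it only needs to cover the slowest consumer-side refresh, which is 30s when Ribbon's default polling is in play.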
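3.2.2 Service failure
When the service fails, Nacos Server has to rely on heartbeats to detect whether it is alive. It marks the instance unhealthy only after 15s without a heartbeat, and removes it from the list only after 30s without a heartbeat. These timings can be tuned: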
# heartbeat interval: 5s -> 1s
spring.cloud.nacos.discovery.metadata.preserved.heart.beat.interval=1000
# heartbeat timeout: 15s -> 3s
spring.cloud.nacos.discovery.metadata.preserved.heart.beat.timeout=3000
# instance delete timeout: 30s -> 5s
spring.cloud.nacos.discovery.metadata.preserved.ip.delete.timeout=5000
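To see what the two timeouts change, the sketch below classifies an instance by how long it has been silent. It is a simplified model, not Nacos code: `BeatTimeoutSketch` and `classify` are hypothetical names, and it assumes a plain "greater than" comparison against each threshold.

```java
// Hypothetical model of how the two timeouts classify an instance; not Nacos source code.
final class BeatTimeoutSketch {
    enum State { HEALTHY, UNHEALTHY, REMOVED }

    // silenceMs is the time since the last heartbeat reached the server.
    static State classify(long silenceMs, long heartbeatTimeoutMs, long ipDeleteTimeoutMs) {
        if (silenceMs > ipDeleteTimeoutMs) {
            return State.REMOVED;
        }
        if (silenceMs > heartbeatTimeoutMs) {
            return State.UNHEALTHY;
        }
        return State.HEALTHY;
    }

    public static void main(String[] args) {
        long silenceMs = 4_000;
        System.out.println(classify(silenceMs, 15_000, 30_000)); // HEALTHY with the defaults
        System.out.println(classify(silenceMs, 3_000, 5_000));   // UNHEALTHY with the tuned values
        System.out.println(classify(6_000, 3_000, 5_000));       // REMOVED with the tuned values
    }
}
```

The server still evaluates these thresholds on its own check period (every 5s per the text above), so the delay actually observed can be somewhat longer than the timeout itself.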
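Even so, when a Service Provider fails, tuned settings still can hardly make the failure invisible to Service Consumers.
In the extreme case, the Provider may no longer be able to serve some of its functionality yet still send heartbeats to Nacos Server. Here the service's own health check can be used to tell Nacos Server to take it offline.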
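One possible shape for such a self-check is to deregister after several consecutive failures, so a single blip does not take the instance offline. The sketch below is a plain-Java illustration under that assumption. `SelfCheckSketch`, `Registry`, and the failure threshold are all hypothetical, and the article does not prescribe this exact policy.

```java
// Hypothetical model: take the instance offline after N consecutive failed self-checks.
final class SelfCheckSketch {
    interface Registry {
        void deregister();
    }

    private final int maxFailures; // assumption: threshold chosen by the service owner
    private final Registry registry;
    private int consecutiveFailures;
    private boolean offline;

    SelfCheckSketch(int maxFailures, Registry registry) {
        this.maxFailures = maxFailures;
        this.registry = registry;
    }

    // Feed in the result of each periodic self-check (for example a database ping).
    void report(boolean ok) {
        if (offline) {
            return; // already deregistered, nothing more to do
        }
        consecutiveFailures = ok ? 0 : consecutiveFailures + 1;
        if (consecutiveFailures >= maxFailures) {
            offline = true;
            registry.deregister();
        }
    }

    boolean isOffline() {
        return offline;
    }

    public static void main(String[] args) {
        SelfCheckSketch check = new SelfCheckSketch(3, () -> System.out.println("deregistered"));
        boolean[] results = {true, false, false, true, false, false, false, false};
        for (boolean ok : results) {
            check.report(ok);
        }
        System.out.println(check.isOffline()); // true, after "deregistered" was printed once
    }
}
```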
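4. Summary
Whichever registry you use, a graceful release comes down to graceful online and graceful offline. This article walked through graceful release with Nacos based on how Nacos works. I hope it helps.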