Monitoring Cluster Networking with KubeNurse
Introduction
In Kubernetes, networking is provided by third-party network plugins. These plugins are complex in their own right, so troubleshooting network problems often runs into a wall. Is there a way to monitor all network connections in the cluster?
kubenurse is a project that does exactly this: it monitors all network connections in the cluster and exposes metrics for Prometheus to scrape.
Kubenurse
Deploying kubenurse is straightforward: it runs as a DaemonSet on the cluster nodes, and the YAML files are in the project's examples directory.
Once deployed, a check request is sent to /alive every 5 seconds. Internally, kubenurse then runs a range of checks that probe the cluster network from all sides. To avoid generating excessive network traffic, check results are cached for 3 seconds. Its check mechanism is as follows:
As the diagram shows, kubenurse probes the ingress, DNS, the API server, and kube-proxy.
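The 3-second result cache described above can be sketched as follows. This is a minimal Python illustration of the idea, not kubenurse's own code (kubenurse is written in Go). The names `CachedChecker` and `run_checks` are hypothetical, and the clock is injected so the behavior is easy to verify.

```python
import time
from typing import Callable, Dict, Optional


class CachedChecker:
    """Runs a set of checks, reusing the last result while it is still fresh."""

    def __init__(self, run_checks: Callable[[], Dict[str, str]],
                 ttl_seconds: float = 3.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._run_checks = run_checks      # hypothetical function performing all probes
        self._ttl = ttl_seconds            # 3 seconds, as described in the text
        self._clock = clock
        self._last_result: Optional[Dict[str, str]] = None
        self._last_time = 0.0

    def result(self) -> Dict[str, str]:
        """Return cached results, re-running the checks once the TTL has expired."""
        now = self._clock()
        if self._last_result is None or now - self._last_time >= self._ttl:
            self._last_result = self._run_checks()
            self._last_time = now
        return self._last_result
```

With a cache like this, several requests arriving within the same 3-second window trigger only one round of probes.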
Every check produces publicly exposed metrics, which can be used to detect:
- SDN network latency and errors
- Kubelet-to-kubelet network latency and errors
- Pod-to-apiserver communication problems
- Ingress round-trip network latency and errors
- Service round-trip network latency and errors (kube-proxy)
- kube-apiserver problems
- kube-dns (CoreDNS) errors
- External DNS resolution errors (ingress URL resolution)
This data is surfaced mainly through two metrics:
- kubenurse_errors_total: an error counter partitioned by error type
- kubenurse_request_duration: the distribution of request durations partitioned by type
These metrics are labeled by type, and each type corresponds to a different check target:
- api_server_direct: checks the API server directly from the node
- api_server_dns: checks the API server from the node via DNS
- me_ingress: checks kubenurse's own Service through the Ingress
- me_service: checks kubenurse's own Service through the Service
- path_$KUBELET_HOSTNAME: checks between nodes
The metrics are then broken down into P50, P90, and P99 quantiles, so you can judge the state of the cluster network under different conditions.
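To make the per-type quantile breakdown concrete, here is a small Python sketch that groups quantile samples by check type. It assumes that `kubenurse_request_duration` is exposed in the Prometheus text format as a summary carrying `type` and `quantile` labels, which is consistent with the P50/P90/P99 breakdown but should be verified against your own /metrics output. The function name and all sample values are invented for illustration.

```python
import re
from typing import Dict

# Matches lines such as:
#   kubenurse_request_duration{type="me_service",quantile="0.5"} 0.002
_SAMPLE = re.compile(r'^kubenurse_request_duration\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
_LABEL = re.compile(r'(\w+)="([^"]*)"')


def quantiles_by_type(exposition: str) -> Dict[str, Dict[str, float]]:
    """Group quantile values by check type: {type: {quantile: seconds}}."""
    result: Dict[str, Dict[str, float]] = {}
    for line in exposition.splitlines():
        match = _SAMPLE.match(line.strip())
        if not match:
            continue  # comments, other metrics, _sum/_count series
        labels = dict(_LABEL.findall(match.group("labels")))
        if "type" not in labels or "quantile" not in labels:
            continue
        result.setdefault(labels["type"], {})[labels["quantile"]] = float(match.group("value"))
    return result


# Invented sample values, for illustration only.
SAMPLE_TEXT = """\
kubenurse_request_duration{type="me_service",quantile="0.5"} 0.002
kubenurse_request_duration{type="me_service",quantile="0.99"} 0.015
kubenurse_request_duration{type="me_ingress",quantile="0.5"} 0.004
kubenurse_request_duration_count{type="me_ingress"} 120
"""

if __name__ == "__main__":
    for check_type, quantiles in quantiles_by_type(SAMPLE_TEXT).items():
        print(check_type, quantiles)
```

Comparing, for example, the P99 of me_service against the P99 of me_ingress helps narrow down whether extra latency comes from the ingress path or from the service network itself.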
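Installation and Deployment
Here we deploy with the official manifests directly, although a few places need to be changed.
(1) First, clone the repository locally:
git clone https://github.com/postfinance/kubenurse.git
(2) Enter the examples directory and edit ingress.yaml, mainly to add the host name, as follows.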
---
# extensions/v1beta1 Ingress was removed in Kubernetes 1.22;
# newer clusters need networking.k8s.io/v1 instead.
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  annotations:
    kubernetes.io/ingress.class: nginx
  name: kubenurse
  namespace: kube-system
spec:
  rules:
  - host: kubenurse-test.coolops.cn
    http:
      paths:
      - backend:
          serviceName: kubenurse
          servicePort: 8080
(3) Update daemonset.yaml, mainly to change the ingress entry-point domain, as follows.
---
apiVersion: apps/v1
kind: DaemonSet
metadata:
  labels:
    app: kubenurse
  name: kubenurse
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kubenurse
  template:
    metadata:
      labels:
        app: kubenurse
      annotations:
        prometheus.io/path: "/metrics"
        prometheus.io/port: "8080"
        prometheus.io/scheme: "http"
        prometheus.io/scrape: "true"
    spec:
      serviceAccountName: nurse
      containers:
      - name: kubenurse
        env:
        - name: KUBENURSE_INGRESS_URL
          value: http://kubenurse-test.coolops.cn # change this; use a full URL including the scheme
        - name: KUBENURSE_SERVICE_URL
          value: http://kubenurse.kube-system.svc.cluster.local:8080
        - name: KUBENURSE_NAMESPACE
          value: kube-system
        - name: KUBENURSE_NEIGHBOUR_FILTER
          value: "app=kubenurse"
        image: "postfinance/kubenurse:v1.2.0"
        ports:
        - containerPort: 8080
          protocol: TCP
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Equal
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Equal
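Before applying the manifest, it can be worth checking that the two URL settings are complete URLs. The following Python sketch does this on the assumption that kubenurse hands `KUBENURSE_INGRESS_URL` and `KUBENURSE_SERVICE_URL` to an HTTP client unchanged, so a value without `http://` or `https://` would not be usable. The function name `is_absolute_http_url` is hypothetical.

```python
from urllib.parse import urlsplit


def is_absolute_http_url(value: str) -> bool:
    """True if value has an http/https scheme and a host part."""
    parts = urlsplit(value)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


if __name__ == "__main__":
    candidates = {
        "KUBENURSE_INGRESS_URL": "http://kubenurse-test.coolops.cn",
        "KUBENURSE_SERVICE_URL": "http://kubenurse.kube-system.svc.cluster.local:8080",
    }
    for name, value in candidates.items():
        print(name, "ok" if is_absolute_http_url(value) else "missing scheme or host")
```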
(4) Create a new ServiceMonitor to collect the metrics, as follows:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: kubenurse
  namespace: monitoring
  labels:
    k8s-app: kubenurse
spec:
  jobLabel: k8s-app
  endpoints:
  - port: "8080-8080"
    interval: 30s
    scheme: http
  selector:
    matchLabels:
      app: kubenurse
  namespaceSelector:
    matchNames:
    - kube-system
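The `selector.matchLabels` field decides which Services the ServiceMonitor picks up: every key/value pair in the selector must be present on the Service. The Python sketch below illustrates that matching rule only; it is not Prometheus Operator code, and the function name and the extra `tier` label are made up for the example.

```python
from typing import Mapping


def matches_labels(match_labels: Mapping[str, str],
                   object_labels: Mapping[str, str]) -> bool:
    """True if every selector pair is present, with the same value, on the object."""
    return all(object_labels.get(key) == value for key, value in match_labels.items())


if __name__ == "__main__":
    selector = {"app": "kubenurse"}
    print(matches_labels(selector, {"app": "kubenurse", "tier": "monitoring"}))  # selected
    print(matches_labels(selector, {"app": "other"}))                            # ignored
```

If no targets appear in Prometheus, comparing the Service's labels against the selector in this way is a reasonable first thing to check.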
(5) Deploy the application by running the following command in the examples directory.
kubectl apply -f .
(6) Wait until all pods are Running, as shown below.
# kubectl get all -n kube-system -l app=kubenurse
NAME                  READY   STATUS    RESTARTS   AGE
pod/kubenurse-fznsw   1/1     Running   0          17h
pod/kubenurse-n52rq   1/1     Running   0          17h
pod/kubenurse-nwtl4   1/1     Running   0          17h
pod/kubenurse-xp92p   1/1     Running   0          17h
pod/kubenurse-z2ksz   1/1     Running   0          17h

NAME                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/kubenurse   ClusterIP   10.96.229.244   <none>        8080/TCP   17h

NAME                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/kubenurse   5         5         5       5            5           <none>          17h
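The readiness check in this step can also be scripted. Below is a minimal Python sketch that scans text in the `kubectl get all` layout shown above and lists pods that are not fully ready. It assumes the default column order (NAME, READY, STATUS); the function name `pods_not_ready` is hypothetical.

```python
from typing import List


def pods_not_ready(kubectl_output: str) -> List[str]:
    """Return names of pods whose STATUS is not Running or whose READY is not n/n."""
    not_ready = []
    for line in kubectl_output.splitlines():
        fields = line.split()
        # Only pod rows are of interest; skip headers, services, and daemonsets.
        if len(fields) < 3 or not fields[0].startswith("pod/"):
            continue
        name, ready, status = fields[0], fields[1], fields[2]
        ready_count, _, total = ready.partition("/")
        if status != "Running" or ready_count != total:
            not_ready.append(name)
    return not_ready
```

An empty list means every kubenurse pod in the output is Running and ready.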
(7) Check in Prometheus that the data is being scraped correctly.
Then check that the metrics look normal.
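Besides the Prometheus web UI, the same check can be made through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The Python sketch below builds such a query URL and sums `kubenurse_errors_total` per `type` label from a response body. The Prometheus address, the function names, and the sample response are assumptions made for illustration; the response is hand-written, not real output.

```python
import json
from typing import Dict
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def errors_by_type(response_body: str) -> Dict[str, float]:
    """Sum the returned sample values per `type` label."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query failed")
    totals: Dict[str, float] = {}
    for series in payload["data"]["result"]:
        check_type = series["metric"].get("type", "unknown")
        _timestamp, value = series["value"]  # value arrives as a string
        totals[check_type] = totals.get(check_type, 0.0) + float(value)
    return totals


# Hand-written sample response, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "kubenurse_errors_total", "type": "me_ingress"},
         "value": [1700000000, "3"]},
        {"metric": {"__name__": "kubenurse_errors_total", "type": "me_ingress"},
         "value": [1700000000, "2"]},
    ]},
})

if __name__ == "__main__":
    # prometheus.example:9090 is a placeholder address.
    print(build_query_url("http://prometheus.example:9090", "kubenurse_errors_total"))
    print(errors_by_type(SAMPLE_RESPONSE))
```

An empty result for `kubenurse_errors_total` or `kubenurse_request_duration` usually means the scrape target has not been picked up yet.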
(8) At this point you can build dashboards in Grafana to visualize the monitoring data.
References
[1] https://github.com/postfinance/kubenurse
[2] https://github.com/postfinance/kubenurse/tree/master/examples