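Kubernetes Pod Suddenly Cannot Mount Its Ceph RBD Volume
This article is reposted from the WeChat official account 「云原生實驗室」, written by 米開朗基楊. To repost it, please contact the 云原生實驗室 official account.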
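Preface
Is Kubernetes full of pitfalls? Yes! Is Ceph full of pitfalls? Yes! And when the two come together? One giant pit!
I had previously deployed a highly available Harbor image registry in a Kubernetes cluster, with Ceph RBD providing the persistent storage. Everything was running nicely until yesterday, when one node went NotReady. The Pod of one Harbor component on that node was rescheduled, but the rescheduled Pod never started successfully.
Looking at the events with describe pod showed the following Warning: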
- Events:
- Type Reason Age From Message
- ---- ------ ---- ---- -------
- Normal Scheduled 23s default-scheduler Successfully assigned harbor/harbor-harbor-registry-5796cdddd7-kxzp9 to k8s03
- Warning FailedAttachVolume 22s attachdetach-controller Multi-Attach error for volume "pvc-ec045b5e-2471-469d-9a1b-6e7db0e938b3" Volume is already exclusively attached to one node and can't be attached to another
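So the RBD image behind this PV is still held by the old Pod, which is why it cannot be attached for the new one. I went to the NotReady node and removed the old Pod's container directly with docker rm -vf xxx, but that made no difference.
It now looked like the only option was to forcibly unmap the block device that the RBD image is mapped to on the old Pod's node. The first step is to find the RBD image that corresponds to the PV, which can be read straight from the PV object: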
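As a side check that is not part of the original troubleshooting, the cluster's VolumeAttachment objects show which node Kubernetes still believes a PV is attached to. The sketch below assumes a CSI-provisioned volume; the function name `attached_nodes` is hypothetical, and it only filters text that `kubectl` has already printed.

```shell
#!/bin/sh
# Hypothetical helper: print "<node> <attached>" for every VolumeAttachment
# that references the given PV.
#   $1    : PV name
#   stdin : three whitespace-separated columns (PV, node, attached flag)
attached_nodes() {
    awk -v pv="$1" '$1 == pv { print $2, $3 }'
}

# Usage (requires cluster access; the column paths are fields of the
# VolumeAttachment API object):
#   kubectl get volumeattachment --no-headers \
#     -o custom-columns=PV:.spec.source.persistentVolumeName,NODE:.spec.nodeName,ATTACHED:.status.attached \
#     | attached_nodes pvc-ec045b5e-2471-469d-9a1b-6e7db0e938b3
```

If the old node still appears here, the Multi-Attach warning is expected until that attachment is removed.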
- 🐳 → kubectl -n harbor get pv pvc-ec045b5e-2471-469d-9a1b-6e7db0e938b3 -o go-template='{{.spec.csi.volumeAttributes.imageName}}'
- csi-vol-bf0dc641-4a5a-11eb-988c-6ab597a1411c
On a Ceph admin node, check who is currently using that image:
- 🐳 → rbd status kubernetes/csi-vol-bf0dc641-4a5a-11eb-988c-6ab597a1411c
- Watchers:
- watcher=172.16.7.1:0/3619044864 client.195600 cookie=18446462598732840980
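The watcher line carries the address of the client that still holds the image open. A minimal sketch for pulling that address out of the output, assuming IPv4 watcher addresses in the format shown above (the function name `watcher_ips` is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: read `rbd status` output on stdin and print the IP
# address of each watcher, one per line. Assumes IPv4 addresses.
watcher_ips() {
    # A watcher line looks like:
    #   watcher=172.16.7.1:0/3619044864 client.195600 cookie=...
    # Keep whatever sits between "watcher=" and the first ":".
    sed -n 's/.*watcher=\([^: ]*\):.*/\1/p'
}

# Usage (on a Ceph admin node):
#   rbd status kubernetes/csi-vol-bf0dc641-4a5a-11eb-988c-6ab597a1411c | watcher_ips
```

An image with no watchers produces no output, which is a convenient signal that it is safe to map elsewhere.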
There is the culprit. Log in to 172.16.7.1, enter the csi-rbdplugin container, and forcibly unmap the block device:
- 🐳 → docker ps|grep csi
- 77255fe4f26b 650757c4f32d "/usr/local/bin/ceph…" 3 weeks ago Up 3 weeks k8s_liveness-prometheus_csi-rbdplugin-hscf8_ceph-csi_2b7da817-3f4a-4e8f-9f99-a39da07c5b94_5
- fb4e5e10f064 650757c4f32d "/usr/local/bin/ceph…" 3 weeks ago Up 3 weeks k8s_csi-rbdplugin_csi-rbdplugin-hscf8_ceph-csi_2b7da817-3f4a-4e8f-9f99-a39da07c5b94_5
- 5330c84529e9 37c1d9ea538b "/csi-node-driver-re…" 3 weeks ago Up 3 weeks k8s_driver-registrar_csi-rbdplugin-hscf8_ceph-csi_2b7da817-3f4a-4e8f-9f99-a39da07c5b94_6
- 4452755ffccf k8s.gcr.io/pause:3.2 "/pause" 3 weeks ago Up 3 weeks k8s_POD_csi-rbdplugin-hscf8_ceph-csi_2b7da817-3f4a-4e8f-9f99-a39da07c5b94_5
- 🐳 → docker exec -it fb4e5e10f064 bash
- [root@k8s01 /]# rbd showmapped|grep csi-vol-bf0dc641-4a5a-11eb-988c-6ab597a1411c
- 4 kubernetes csi-vol-bf0dc641-4a5a-11eb-988c-6ab597a1411c - /dev/rbd4
- [root@k8s01 /]# rbd unmap -o force /dev/rbd4
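The lookup and unmap steps above can be sketched as a small helper that resolves the device from the image name instead of reading it off the screen. The function name `mapped_device` is hypothetical. It matches on any column, because the column layout of `rbd showmapped` is assumed to differ between Ceph releases, and it takes the device from the last column as in the output above.

```shell
#!/bin/sh
# Hypothetical helper: print the block device that an RBD image is mapped to.
#   $1    : RBD image name
#   stdin : output of `rbd showmapped`
mapped_device() {
    awk -v img="$1" '
        {
            # Scan every column except the last; the device path is the last.
            for (i = 1; i < NF; i++) {
                if ($i == img) { print $NF; exit }
            }
        }'
}

# Usage (inside the csi-rbdplugin container on the node holding the image):
#   dev=$(rbd showmapped | mapped_device csi-vol-bf0dc641-4a5a-11eb-988c-6ab597a1411c)
#   [ -n "$dev" ] && rbd unmap -o force "$dev"
```

A forced unmap discards any I/O still in flight on that device, so it only makes sense once the old Pod is known to be gone.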
Checking the new Pod again, it has now started successfully:
- Events:
- Type Reason Age From Message
- ---- ------ ---- ---- -------
- Normal Scheduled 18m default-scheduler Successfully assigned harbor/harbor-harbor-registry-5796cdddd7-kxzp9 to k8s03
- Warning FailedAttachVolume 18m attachdetach-controller Multi-Attach error for volume "pvc-ec045b5e-2471-469d-9a1b-6e7db0e938b3" Volume is already exclusively attached to one node and can't be attached to another
- Warning FailedMount 14m kubelet, k8s03 Unable to attach or mount volumes: unmounted volumes=[registry-data], unattached volumes=[default-token-phjbz registry-data registry-root-certificate registry-htpasswd registry-config]: timed out waiting for the condition
- Normal SuccessfulAttachVolume 12m attachdetach-controller AttachVolume.Attach succeeded for volume "pvc-ec045b5e-2471-469d-9a1b-6e7db0e938b3"
- Warning FailedMount 12m kubelet, k8s03 Unable to attach or mount volumes: unmounted volumes=[registry-data], unattached volumes=[registry-htpasswd registry-config default-token-phjbz registry-data registry-root-certificate]: timed out waiting for the condition
- Warning FailedMount 5m21s (x2 over 16m) kubelet, k8s03 Unable to attach or mount volumes: unmounted volumes=[registry-data], unattached volumes=[registry-config default-token-phjbz registry-data registry-root-certificate registry-htpasswd]: timed out waiting for the condition
- Warning FailedMount 3m5s (x2 over 9m55s) kubelet, k8s03 Unable to attach or mount volumes: unmounted volumes=[registry-data], unattached volumes=[registry-root-certificate registry-htpasswd registry-config default-token-phjbz registry-data]: timed out waiting for the condition
- Warning FailedMount 2m54s (x9 over 11m) kubelet, k8s03 MountVolume.MountDevice failed for volume "pvc-ec045b5e-2471-469d-9a1b-6e7db0e938b3" : rpc error: code = Internal desc = rbd image kubernetes/csi-vol-bf0dc641-4a5a-11eb-988c-6ab597a1411c is still being used
- Warning FailedMount 50s (x2 over 7m39s) kubelet, k8s03 Unable to attach or mount volumes: unmounted volumes=[registry-data], unattached volumes=[registry-data registry-root-certificate registry-htpasswd registry-config default-token-phjbz]: timed out waiting for the condition
- Normal Pulling 15s kubelet, k8s03 Pulling image "goharbor/registry-photon:v2.1.2"
- Normal Pulled 12s kubelet, k8s03 Successfully pulled image "goharbor/registry-photon:v2.1.2"
- Normal Created 12s kubelet, k8s03 Created container registry
- Normal Started 12s kubelet, k8s03 Started container registry