OBSERVE! This post should not be seen as a tutorial or best practices; the fixes I do in this post are probably bad practice, and had I kept a better backup routine on this cluster, the issue would have been a whole lot smaller! Read it as a rant, as a sort of funny story. Don't follow it and then hold me liable for any data loss!!
So, today I woke up to a k8s cluster which had a broken node. This is usually not a huge issue: you can restart the node and it often re-joins the cluster pretty much automatically. But in this case it didn't…
So what was the problem then? Well, as it turns out, the master node had some issues. In this cluster, which is a very small one, the master node is untainted and runs pods itself; it also runs the etcd server internally. The scheduler had broken and the server itself was behaving badly…
Well, smart as I am, instead of figuring this out from the start, I tried to restart the containers… and when that didn't work, I actually deleted a couple of daemonsets and deployments… and well, you can guess that it didn't end up the way I wanted…
After messing around with restarting and deleting, I decided to restart the node (which I had not done yet, as I didn't know it was broken…). It rebooted, and Kubernetes didn't start.
When Kubernetes doesn't start and you don't know why, the best thing to do is to check some of the logs. The easiest way to check the startup process is through journalctl; just run:
> sudo journalctl -u kubelet --since "1 hour ago"
and you will get all your kubelet logs in the terminal right away. In this case, it was quite an easy fix: the server had swap enabled (something that Kubernetes DOES NOT LIKE!), so turning that off and changing the entry in the /etc/fstab file fixed it, and kubelet started fine.
In the fstab file, you can find an entry which has the type swap; just put a # (comment) in front of it and it will not load during boot.
> cat /etc/fstab
# Hah! I masked my UUIDs!
UUID=xxxxxxxx-xxxx-4d2c-xxxx-7bdxx94e39da /     ext4 noatime,errors=remount-ro 0 1
# /boot was on /dev/sda1 during installation
UUID=xxxxxxxx-a569-xxxx-xxxx-1405f7axx148 /boot ext4 noatime 0 2
# swap was on /dev/sda3 during installation
UUID=xxxxxxxx-4008-xxxx-a9db-2ff3bxxxxxx7 none  swap sw 0 0
To turn it off right away without having to remount the system, run [sudo] swapoff -a. This will turn off the swap, but it will be back after a restart if you don't change the fstab file.
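Putting both steps together, something like this should do the trick. A minimal sketch: the sed expression is my own guess at a typical fstab, so it keeps a backup and you should verify the result before rebooting:

> sudo swapoff -a   # turn swap off immediately (does not survive a reboot)
> sudo sed -i.bak '/\bswap\b/ s/^[^#]/#&/' /etc/fstab   # comment out uncommented swap entries, backup in /etc/fstab.bak
> swapon --show     # verify: no output means no active swap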
So, now my little minion was up and running again, but did that fix the issues with the pods that were broken? No… It did not… Not only that: daemonsets wouldn't spawn new pods, and containers didn't have their Ceph mounts working… Fun, fun, I thought.
That is when I understood that the master must be misbehaving, so I tried to reboot it… and it didn't work… Not at all… The command returned nothing and the server was still up…
Luckily my bare-metal provider has a control panel with the ability to force-reboot machines. This is pretty much like pulling the plug and putting it back in, something I do NOT like to do, but this time I had to… I waited and waited… Finally, the server came back online (it took maybe half a minute, but felt like hours)!
So, did this fix my issue? Well, yes, slightly! The containers went back up, all daemonsets spawned new pods, all networking and Ceph storage was just as it should be… yay!… err… Right, I forgot that I had actually killed a couple of deployments with kubectl delete -f ...yml… And that, smart as I am sometimes, those deployments had their storage claims in the same yml file as the deployment… so the pods had created new persistent volumes for their claims! Now I had a couple of deployments with the wrong volumes, behaving as if they were brand new… The containers were my OAuth provider and my issue tracker, stuff that you really don't want to lose!
This story ends quite well (I hope… for now it seems good at the least!), because a lost PV is just a lost reference, not an actually lost volume, as long as you use the Retain reclaim policy on them!
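And if some of your volumes are not set to Retain yet, you can change the reclaim policy on an existing PV with a patch; a one-liner sketch, replace <pv-name> with your own PV's name:

> kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'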
So, what do you do when you lose a PV from your pod, and the PVC is pointing to a new volume instead? Well, first off, you need to find the volume that you need. To fetch all your PVs from Kubernetes, just run:
> kubectl get pv
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                         STORAGECLASS      REASON   AGE
pvc-00632722-2e2d-11e9-bc92-0007cb040cb0   20Gi       RWX            Retain           Bound      jitesoft/minio-storage-minio-2                rook-ceph-block            96d
pvc-127342a6-2e2d-11e9-bc92-0007cb040cb0   20Gi       RWX            Retain           Bound      jitesoft/minio-storage-minio-3                rook-ceph-block            96d
pvc-141f359f-552c-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Bound      kube-system/consul-storage-consul-1           rook-ceph-block            47d
pvc-193f8ac5-552c-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Bound      kube-system/consul-storage-consul-2           rook-ceph-block            47d
pvc-1defdb0c-5b73-11e9-bc92-0007cb040cb0   5Gi        RWX            Retain           Bound      monitoring/grafana-storage                    rook-ceph-block            39d
pvc-2649d5ae-6a65-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Released   jitesoft/mongo-persistent-storage-mongo-0     rook-ceph-block            20d
pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Released   jitesoft/mongo-persistent-storage-mongo-1     rook-ceph-block            20d
pvc-2fb2a7d4-2af7-11e9-a529-0007cb040cb0   1Gi        RWO            Retain           Released   default/jiteeu.isso.persistent-volume.claim   rook-ceph-block            101d
pvc-3df9fba6-552b-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Bound      kube-system/consul-storage-consul-0           rook-ceph-block            47d
As you can see in the above, I use Rook-Ceph as my storage engine; it works great, I like it a lot!
What you need from the above view is the NAME of the PV. The names of the ones above are all auto-generated, so let's for example just take the mongo-1 claim, as it is Released! pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0… a long and nice name, something you really don't have to remember though, hehe…
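By the way, if the list is long, a quick way to see only the volumes that have lost their claim is to just grep for the status; a trivial sketch:

> kubectl get pv | grep Released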
The first thing we want to do is to edit the volume:
> kubectl edit pv pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0
This will - depending on the system you are using - open a text editor. The file will look something like this:
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: ceph.rook.io/block
  creationTimestamp: 2019-04-29T09:43:01Z
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0
  resourceVersion: "11233174"
  selfLink: /api/v1/persistentvolumes/pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0
  uid: 2e120442-6a63-11e9-bc92-0007cb040cb0
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mongo-persistent-storage-mongo-1
    namespace: jitesoft
    resourceVersion: "11231795"
    uid: 2d9e264b-6a63-11e9-bc92-0007cb040cb0
  flexVolume:
    driver: ceph.rook.io/rook-ceph-system
    options:
      clusterNamespace: rook-ceph
      dataBlockPool: ""
      image: pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0
      pool: replicapool
      storageClass: rook-ceph-block
  persistentVolumeReclaimPolicy: Retain
  storageClassName: rook-ceph-block
  volumeMode: Filesystem
status:
  phase: Released
The part that we care about is the claimRef entry. To be sure that we can re-mount it on a new PVC, we ought to remove the claimRef. So delete the lines of the claimRef object, save the file, and close it.
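If you'd rather skip the editor, the same removal can be done with a JSON patch; a minimal sketch that should have the same effect as deleting the claimRef block by hand:

> kubectl patch pv pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0 --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'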
When that is done, the PV can be taken by any pod in the system, so now we go to our deployment yaml file. In the claim's spec object, we add a volumeName property and set the NAME of the volume as the value:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: jitesoft
  name: mongo-persistent-storage-mongo-1
spec:
  # HERE!
  volumeName: pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0
  storageClassName: rook-ceph-block
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
We remove the currently running pod and the volume claim that we are now editing, and when that is done, we redeploy! When the pod is back up, it will use the correct volume and everything is awesome!
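Spelled out as commands, it looks something like this; a sketch where the pod name mongo-1 and the manifest file mongo.yml are my assumptions, so swap in your own names:

> kubectl delete pod mongo-1 -n jitesoft   # remove the running pod (assumed name)
> kubectl delete pvc mongo-persistent-storage-mongo-1 -n jitesoft   # remove the mis-bound claim
> kubectl apply -f mongo.yml   # redeploy; the new claim binds to the retained PV via volumeName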
So, that was a short story about what I like to do during my Sunday afternoons! What are your hobbies?!