That time my k8s master broke and I deleted my volumes…

OBSERVE! This post should not be seen as a tutorial or best practices; the fixes I do in this post are probably bad practice, and had I had a better backup routine on this cluster, the issue would have been a whole lot smaller! Read it as a rant, as a sort of funny story, and don’t follow it and then hold me liable for any data loss!!


So, today I woke up to a k8s cluster which had a broken node. This is usually not a huge issue; you can restart the node and it often re-joins the cluster pretty much automatically. But in this case it didn’t…

So what was the problem then? Well, as it turns out, the master node had some issues. In this cluster, which is a very small cluster, the master node is untainted and runs pods on itself; it also runs the etcd servers internally. The scheduler had broken and the server itself was behaving badly…

Well, smart as I am, instead of figuring this out from the start I tried to restart the containers… and when that didn’t work I actually deleted a couple of daemonsets and deployments… and well, you can guess that it didn’t end up as I wanted…
After messing around with restarting and deleting I decided to restart the node (which I had not done yet, as I didn’t know it was broken…); it rebooted and Kubernetes didn’t start.

When Kubernetes doesn’t start and you don’t know why, the best thing to do is to check some of the logs. The easiest way to check the startup process is through journalctl; just run:

> sudo journalctl -u kubelet --since "1 hour ago"

and you will get all your kubelet logs in the terminal right away. In this case, it was quite an easy fix: the server had swap enabled (something Kubernetes DOES NOT LIKE!), so turning that off and changing the entry in the /etc/fstab file fixed it, and kubelet started fine.


In the fstab file, you can find an entry which has the type swap; just put a # (comment) in front of it and it will not load during boot (see the example right after the fstab output below).

> cat /etc/fstab

# Hah! I masked my UUIDs!
UUID=xxxxxxxx-xxxx-4d2c-xxxx-7bdxx94e39da /               ext4    noatime,errors=remount-ro 0       1
# /boot was on /dev/sda1 during installation
UUID=xxxxxxxx-a569-xxxx-xxxx-1405f7axx148 /boot           ext4    noatime         0       2
# swap was on /dev/sda3 during installation
UUID=xxxxxxxx-4008-xxxx-a9db-2ff3bxxxxxx7 none            swap    sw              0       0
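
With the comment added, that last entry would look something like this (same masked UUID, of course):

# UUID=xxxxxxxx-4008-xxxx-a9db-2ff3bxxxxxx7 none            swap    sw              0       0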

To turn it off right away without having to reboot, run [sudo] swapoff -a; this will turn off the swap, but it will be back after a restart if you don’t change the fstab file.
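
Put together, the whole swap fix looks something like this (a rough sketch; the sed line is a blunt match on lines containing swap, so double-check your fstab afterwards):

# Turn off swap right away (kubelet refuses to start with swap on by default)
> sudo swapoff -a

# Comment out the swap entry so it stays off after the next reboot
> sudo sed -i '/^[^#].*\sswap\s/ s/^/#/' /etc/fstab

# Restart kubelet and check that it actually comes up
> sudo systemctl restart kubelet
> sudo systemctl status kubelet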


So, now my little minion was up and running again, but did that fix the issues with the pods that were broken? No… It did not… and not only that… Daemonsets wouldn’t spawn new pods, containers didn’t have their Ceph mounts working… Fun fun, I thought.

That is when I understood that the master must be misbehaving, so I tried to reboot it… and it didn’t work… Not at all… The command returned nothing and the server was still up…
Luckily my bare-metal provider has a control panel with the ability to force-reboot machines. This is pretty much like pulling the plug and turning it on again, something I do NOT like to do, but this time I had to… Waited and waited… Finally, the server came back online (it took like half a minute, but felt like hours)!

So, did this fix my issue? Well yes, sort of! The containers went back up, all daemonsets spawned new pods, all networking and Ceph storage was just as it should be… yay!… err… Right, I forgot that I had actually killed a couple of deployments with kubectl delete -f ...yml… And that - smart as I am sometimes - those deployments had their storage claims in the same yml file as the deployment… so the pods had created new persistent volumes for their claims!
Doh!

Now I had a couple of deployments with the wrong volumes, so it was as if they were all brand new… The containers were my OAuth provider and my issue tracker, stuff that you really don’t want to lose!

This story ends quite well (I hope… for now it seems good at the least!), because a “lost” PV is just a lost reference, not actually a lost volume; the data is still there as long as you use the Retain reclaim policy on them!
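
If you are not sure that your volumes actually use Retain, you can see it in the RECLAIM POLICY column of the kubectl get pv output below, and you can change it on an existing PV with a patch along these lines (a sketch; swap in your own PV name):

> kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'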

So, what do you do when you lose the PV for your pod, and the PVC is pointing to a new volume instead?
Well, first off, you need to find the volume that you need.

To fetch all your PVs from Kubernetes, just run:

> kubectl get pv

NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                         STORAGECLASS      REASON    AGE
pvc-00632722-2e2d-11e9-bc92-0007cb040cb0   20Gi       RWX            Retain           Bound      jitesoft/minio-storage-minio-2                rook-ceph-block             96d
pvc-127342a6-2e2d-11e9-bc92-0007cb040cb0   20Gi       RWX            Retain           Bound      jitesoft/minio-storage-minio-3                rook-ceph-block             96d
pvc-141f359f-552c-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Bound      kube-system/consul-storage-consul-1           rook-ceph-block             47d
pvc-193f8ac5-552c-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Bound      kube-system/consul-storage-consul-2           rook-ceph-block             47d
pvc-1defdb0c-5b73-11e9-bc92-0007cb040cb0   5Gi        RWX            Retain           Bound      monitoring/grafana-storage                    rook-ceph-block             39d
pvc-2649d5ae-6a65-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Released   jitesoft/mongo-persistent-storage-mongo-0     rook-ceph-block             20d
pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Released   jitesoft/mongo-persistent-storage-mongo-1     rook-ceph-block             20d
pvc-2fb2a7d4-2af7-11e9-a529-0007cb040cb0   1Gi        RWO            Retain           Released   default/jiteeu.isso.persistent-volume.claim   rook-ceph-block             101d
pvc-3df9fba6-552b-11e9-bc92-0007cb040cb0   10Gi       RWX            Retain           Bound      kube-system/consul-storage-consul-0           rook-ceph-block             47d

As you can see in the above, I use Rook-Ceph as my storage engine; it works great, I like it a lot!
What you need from the above view is the NAME of the PV. The names above are all auto-generated, so let’s for example just take the mongo-1 claim, as it is Released!
pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0… a long and nice name, something you really don’t have to remember though, hehe…
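By the way, if the list is long, a lazy way to show only the Released volumes is to just grep the output (nothing fancy):

> kubectl get pv | grep Released
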
The first thing we want to do is to edit the volume:

> kubectl edit pv pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0

This will - depending on the system you are using - open a text editor. The file will look something like this:

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: ceph.rook.io/block
  creationTimestamp: 2019-04-29T09:43:01Z
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0
  resourceVersion: "11233174"
  selfLink: /api/v1/persistentvolumes/pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0
  uid: 2e120442-6a63-11e9-bc92-0007cb040cb0
spec:
  accessModes:
  - ReadWriteMany
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: mongo-persistent-storage-mongo-1
    namespace: jitesoft
    resourceVersion: "11231795"
    uid: 2d9e264b-6a63-11e9-bc92-0007cb040cb0
  flexVolume:
    driver: ceph.rook.io/rook-ceph-system
    options:
      clusterNamespace: rook-ceph
      dataBlockPool: ""
      image: pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0
      pool: replicapool
      storageClass: rook-ceph-block
  persistentVolumeReclaimPolicy: Retain
  storageClassName: rook-ceph-block
  volumeMode: Filesystem
status:
  phase: Released

The part that we care about is the claimRef entry. To make sure we can bind it to a new PVC, we ought to remove the claimRef, so delete the lines of the claimRef object, save the file and close it. When that is done, the PV can be claimed by any PVC in the system.
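
If you rather skip the editor, the same thing can be done with a patch; something like this should do it (a sketch, using the same PV name as above):

> kubectl patch pv pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0 --type json -p '[{"op": "remove", "path": "/spec/claimRef"}]'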
Now we go to our deployment yaml file. On the spec object of the PersistentVolumeClaim, we add a volumeName property and set the NAME of the volume as the value:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  namespace: jitesoft
  name: mongo-persistent-storage-mongo-1
spec:
  # HERE!
  volumeName: pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0
  storageClassName: rook-ceph-block
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi

We remove the currently running pod and the volume claim that we are now editing, and when that is done, we redeploy!
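
For the mongo example above, where the claim lives in the same yml file as the deployment, that would be something along these lines (the file name mongo.yml is made up for the example; use whatever your file is actually called):

# Tear down the deployment and its (new, wrong) claim
> kubectl delete -f mongo.yml

# ...and bring it back up, now with volumeName pointing at the old PV
> kubectl apply -f mongo.yml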

When the pod is back up, it will now use the correct volume and everything is awesome!
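
To double-check that the claim actually grabbed the old volume, you can look at which PV the PVC is bound to:

> kubectl get pvc mongo-persistent-storage-mongo-1 -n jitesoft
> kubectl get pv pvc-2d9e264b-6a63-11e9-bc92-0007cb040cb0

The VOLUME column of the PVC should show the old name, and the PV should be back in the Bound state.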


So, that was a short story about what I like to do during my Sunday afternoons! What are your hobbies?!
