Certs and why to use stage.

So, I promised I would write a short post about what happened to my blog and why my certificates expired…

During a customer project, we decided to use my company’s (Jitesoft) servers as a staging environment.
I haven’t had the time to provision a Kubernetes cluster on those servers, so I use a Docker Swarm cluster instead. That is fine and works quite well, especially since all that runs on it is my blog, the company page and some CI runners (well, and monitoring tools like Prometheus and Grafana).
While doing this, I noticed that none of the certificates for the review applications were successfully applied. I use wildcard certificates on all my pages… Or at least so I thought…

When checking on the server I noticed that the connection to the Consul key-value store was broken, so instead of creating one wildcard certificate for the whole domain including subdomains, the server was creating a new certificate for each new host rule. This was not good, not at all!
The certificate for the domain that we used for the stage environment had not expired, but it had issues because we had hit the API request limit for the week… So I thought: well, I’ll quickly fix the issue with the Consul storage and make sure it works as intended!

So…
When Traefik (which I use for load balancing, certificate creation and routing) on my server doesn’t have a Consul connection, it uses a single file called acme.json to store all certificates and keys. That file was probably fine, but I had forgotten that it was stored in persistent storage, not shared with the host (my persistent storage is an NFS server, so it is shared across the cluster). I started by testing the Consul connection and fixed it after a while, all good there… But once it was fixed, the store had no recent certificates (since it had not been used for a while). I re-provisioned the Traefik service and the surrounding stuff and it started fine, uploading the acme.json content to the storage… The acme.json file that I had just shared from the host, that is, instead of the one in the persistent storage, smart as I am…
Now, the certificates in the local acme.json file were not recent. They were quite old, like… really old. The persistent volume was gone (due to the re-provisioning of all the volumes and such) and one of the certificates was out of date, so I started trying to figure out some way to get the old certs back…
After messing with Traefik and the certificates for a while I noticed another big mistake… The certificate directory (for Let’s Encrypt certificates) was not the stage one, but the real one… Which means that each time I restarted the server without the Consul backend (which I had disconnected for now), Let’s Encrypt got a new request for each cert.
I’m happy I noticed it at least this time, because some of the certificates had not yet reached the limit of requests, while two had.
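
In hindsight, a tiny guard like the one below would have saved me. It is only a sketch: the two URLs are the Let’s Encrypt ACME v2 directories, but the function names and the allow_production flag are made up for illustration, and none of this is part of Traefik itself. The idea is simply to refuse the real directory unless I explicitly say I mean it.

```python
from urllib.parse import urlparse

# Let's Encrypt ACME v2 directory URLs.
LE_PRODUCTION = "https://acme-v02.api.letsencrypt.org/directory"
LE_STAGING = "https://acme-staging-v02.api.letsencrypt.org/directory"


def is_staging_directory(ca_server: str) -> bool:
    """Return True if the URL points at the Let's Encrypt staging host."""
    host = urlparse(ca_server).hostname or ""
    return host == urlparse(LE_STAGING).hostname


def check_ca_server(ca_server: str, allow_production: bool = False) -> str:
    """Hypothetical pre-deploy check: only accept a non-staging
    directory when the caller opts in explicitly."""
    if is_staging_directory(ca_server):
        return ca_server
    if not allow_production:
        raise ValueError(
            f"{ca_server} is not the staging directory; "
            "pass allow_production=True once the setup is known to work"
        )
    return ca_server


if __name__ == "__main__":
    # While experimenting, this passes; swapping in LE_PRODUCTION raises.
    print(check_ca_server(LE_STAGING))
```

Run against whatever directory URL the load balancer is configured with before restarting it, something like this turns "oops, that was the real one" into an error message instead of a week-long rate limit.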

Jite.eu was down, and I had no backup certificate file anywhere!

So what did I learn (and what do I hope someone else might take a lesson from)? Never mess with the server when tired, never use the production endpoint for certificates until you know the setup works, and never forget to create a backup!
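
The backup part does not need to be fancy. Here is a minimal sketch of what I should have run before touching anything; the function name and the paths in the usage example are assumptions for illustration, not my actual setup. It copies the file to a timestamped name so an older backup never gets overwritten.

```python
import shutil
import time
from pathlib import Path


def backup_acme_file(acme_path: str, backup_dir: str) -> Path:
    """Copy the certificate store to a timestamped backup file.

    Returns the path of the backup that was written.
    """
    source = Path(acme_path)
    if not source.is_file():
        raise FileNotFoundError(f"no certificate store at {source}")

    target_dir = Path(backup_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    # UTC timestamp in the file name keeps every backup distinct.
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    target = target_dir / f"{source.name}.{stamp}.bak"

    # copy2 also preserves the file mode and timestamps, which matters
    # because the file holds private keys.
    shutil.copy2(source, target)
    return target


if __name__ == "__main__":
    # Hypothetical paths; adjust to wherever acme.json really lives.
    print(backup_acme_file("/data/traefik/acme.json", "/backup/traefik"))
```

Since the file contains private keys, the backup directory should be at least as locked down as the original.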

Feel free to leave a comment showing how fun it is when I do stupid things! ;)

Johannes Tegnér