Using ccache to optimize a Node.js Docker build!


At my company, I once decided that the best approach to Docker and Docker images is to keep the images as dependency-free as possible. All (or at least most) of the images in the Jitesoft organization are built from scratch and upwards through the GitLab registry (and also published to Docker Hub). This makes it a lot easier to control all the dependencies and to make sure that no images are infected or compromised. It is also a lot easier to make sure that all the software is of the expected version and, especially, that the images are built with our own GitLab runners (making it fast!).

While this is great, one issue is that some of the software we use in containers is quite annoying to build. As an example, our initial Node.js image (based on Alpine Linux) required a very, very long compilation step, something that could make the whole build take up to 4 hours if the runner did not have a whole lot of RAM and CPU. This made the update process for the image quite long, and if something went wrong at the end of the process, a re-build would take a whole lot of time!

Now, when building the old image, the script was basically just a docker build -t jitesoft/node-base . command. That is sufficient for smaller images that don't really need anything extra, and it would have been fine if the image didn't take ages to build!
At the time, the runners used for GitLab CI were not exactly loaded with resources either, which made the jobs time out (in the end even with a 5 hour limit!) and sometimes even crash the Kubernetes nodes they ran on (when they started trying to swap).
So, with this issue, a new build process was really needed. One that required neither loads of resources nor a 5 hour build time!

The first approach tried was to split the job into multiple build stages, something that eased some of the burden when it comes to the timeout issue. But, of course, that did not solve the actual issue… So, the best approach was to move the actual compilation of the source code out of the image and directly into the build script (the CI script, that is).

The refactoring ended up with a whole lot of build steps:

  • Import the current keys used to sign the source code tarballs.
  • Fetch the stable and latest version numbers.
  • Download the actual tarballs and verify them with gnupg.
  • Build the source code and produce a binary from it.
  • Build a slim version of the image, without some of the dependencies wanted in a production image.
  • Build & deploy the final full image.
  • Scan the image with Clair to analyse possible vulnerabilities.

I’m so happy that I use GitLab and the CI that they provide. Being able to easily throw a runner up and let it take care of the actual work, instead of having to do all of the above manually, is very, very nice.

Security, validation and PGP

I like security. I really enjoy being sure that an application I'm about to install was actually made and released by the people who claim to have made it, not by some script kiddie wanting to inject malicious code into my production environment!
Pretty Good Privacy is a good tool to make sure that the things I download are signed by the correct key, and SHA checksum validation is something I use for everything it can possibly be used for.

I don't always do this on my development machines, but when it comes to production: always.

Some of the PGP key-servers were recently attacked and poisoned with keys that could break keychains. The attack is described here, and you might want to have a look at the following gist.

What is PGP?

PGP is an acronym for “Pretty Good Privacy” and is an open standard used to encrypt and sign data. The data can then be validated using the signer's public key, to make sure that it was actually signed by the person or company in question.

A key consists of a key-pair: one key is public and one is secret. The secret key is required to sign data, while the public key is used to validate the signature. Data can also be encrypted for a specific recipient's public key, so that only the holder of the matching secret key can read it, but that is not really useful to go through in this post.

If you wish to read more about PGP, a decent description can be found at Wikipedia.
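
As a tiny illustration of what this looks like in practice (the key ID is a placeholder, and the SHASUMS256.txt files are the ones used later in this post), verifying a detached signature with gnupg boils down to two commands:

# Fetch the signer's public key from a key-server (placeholder key ID)
gpg --recv-keys <KEY_ID>
# Verify the detached signature against the file it signs
gpg --verify SHASUMS256.txt.sig SHASUMS256.txt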

What is a SHA Checksum?

By using a hash algorithm such as SHA, you are able to create a checksum: a short string of data, often represented as a hex encoded string. The checksum will always be the same for the same file, making it possible for someone - who has both the checksum and the file - to verify that the file is what it was expected to be.
If the checksum does not match, the file has been altered since the checksum was created, and if so, it should probably be deleted right away!

Common algorithms used for checksums are MD5, SHA-1, SHA-256 and SHA-384, but basically any type of hash algorithm can be used.
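
As a quick example (the file names are made up), creating and verifying a SHA-256 checksum looks like this:

# Create a checksum file for a download
sha256sum some-release.tar.xz > some-release.tar.xz.sha256
# Later - possibly on another machine - verify that the file still matches
sha256sum -c some-release.tar.xz.sha256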

Build step 1.1: PGP keyring cache

Importing PGP keys takes time, and in the case of the Node.js image in question, there are 11 keys used for releases. Any of them could be the one that signed the downloaded tarball, so all of them are required in the keychain when verifying it.
To make this part of the job less of an issue, the keychain is added to the CI cache instead of re-adding the keys on each build.
Now, the keys could change, and if they do, the keys should of course be re-added from scratch.

The way we approached this was to add an extra file to the cache together with the key-chain. It consists of a single string: the MD5 checksum of the gpgkeys.txt file. If the checksum no longer matches the file, the whole keychain is re-imported, else the old chain is used and no re-import is needed!

At the end of the step, an export is used so that the imported keys are written to a file, which is added to the artifacts of the job and used in a later step.

The build step looks something like this:

download:pgp:
  image: registry.gitlab.com/jitesoft/dockerfiles/docker
  stage: pre
  before_script:
    - apk add --no-cache gnupg linux-headers
  script:
    - touch keysum.txt
    - md5sum gpgkeys.txt
    - |
      # Re-import the keys only when gpgkeys.txt has changed since the checksum was cached.
      if ! md5sum -c keysum.txt
      then
        for key in $(cat gpgkeys.txt); do
          gpg --recv-keys "$key"
        done
        # Export the whole keyring to a file so later jobs can import it.
        gpg --export > keys.out
        md5sum gpgkeys.txt | tee keysum.txt
      fi
  cache:
    key: nodejs.gpg.keyring
    paths:
      - keys.out
      - keysum.txt
  artifacts:
    paths:
      - keys.out
    expire_in: 1 day

Now, any job depending on the download:pgp step will have the keys.out file and can easily import it into its keyring!
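
Importing the exported keyring in a later job is a one-liner (you will see it again in the download step below):

gpg --import keys.out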

Build step 1.2: Versions

While the keyring is being created, another small build step is running. All this step does is fetch the latest full version numbers of the Node.js source tarballs from the Node.js dist server.
A very simple wget command is used, piped to grep and then dumped into a file. The step itself is simple, while the regex fetching the actual version is somewhat complicated.

download:versions:
  stage: pre
  image: registry.gitlab.com/jitesoft/dockerfiles/alpine:latest
  variables:
    GIT_STRATEGY: none
    NODE_VERSIONS: "10 12"
  before_script:
    - apk add --no-cache grep
  script:
    - mkdir versions
    - |
      for version in $NODE_VERSIONS; do
        wget -qO- https://nodejs.org/dist/latest-v${version}.x/ | grep -oP "(?<=>node-v)(([0-9]{0,3}(\.?)){3})" | awk 'NR==1{print $1}' > versions/${version}.txt
      done;
  artifacts:
    paths:
      - versions/*.txt
    expire_in: 1 day

The regex in the grep command looks for a string in the HTML index produced by the path wget requests.
The (?<=...) part is a “look-behind” for >node-v, which is part of the links in the listing; after that the regex extracts a set of digits following the semver system (major.minor.patch), and awk picks the first match.
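
You can try the pipeline in isolation by feeding it a line similar to what the dist index contains (the HTML snippet is a made-up example):

echo '<a href="node-v12.8.0-linux-x64.tar.xz">node-v12.8.0-linux-x64.tar.xz</a>' \
  | grep -oP "(?<=>node-v)(([0-9]{0,3}(\.?)){3})" \
  | awk 'NR==1{print $1}'
# prints: 12.8.0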

Build step 2: Download!

Downloading the files is quite simple: at this point we have the exact versions in files (versions/<version>.txt) and we have the PGP keys (in keys.out). When downloading, we want to find three files: the checksum file (in this case a SHA-256 sum in txt format), a .sig file and the actual tarball.

To make sure we get the exact version for each major version in the NODE_VERSIONS variable, we loop through all the files in the versions directory and use the content of each file as the version. Each file is named after the major version, while its content (as seen above) is the full version.

download:tars:
  stage: download
  dependencies:
    - download:versions
    - download:pgp
  image: registry.gitlab.com/jitesoft/dockerfiles/alpine:latest
  before_script:
    - apk add --no-cache curl grep gnupg
    - gpg --import keys.out
  script:
    - |
      for VERSION_FILE in versions/*.txt; do
        VERSION=$(cat ${VERSION_FILE})
        curl -OsS https://nodejs.org/dist/v${VERSION}/node-v${VERSION}.tar.xz
        curl -OsS https://nodejs.org/dist/v${VERSION}/SHASUMS256.txt
        curl -OsS https://nodejs.org/dist/v${VERSION}/SHASUMS256.txt.sig
        gpg --verify SHASUMS256.txt.sig SHASUMS256.txt
        grep " node-v${VERSION}.tar.xz\$" SHASUMS256.txt | sha256sum -c -
        mv node-v${VERSION}.tar.xz versions/node-v${VERSION}.tar.xz
      done;
  artifacts:
    paths:
      - versions/*.tar.xz
      - versions/*.txt
    expire_in: 1 day

As seen above, we verify the signature of the shasum text file with gpg, and we then verify the tar.xz file with the sha256sum tool. If both checks pass, the file is moved to versions/node-v<version>.tar.xz and uploaded as an artifact together with the txt files.

In the before-script, curl, grep and gnupg are installed and the gpg keys are imported into the job's local keychain.

Build step 3: Build the binary

Building Node.js is not really a special process: the source contains a makefile, and make is an easy tool to use.
This job does have one notable difference from the earlier version of the build script though, as it uses the ccache program!

Ccache is a compiler cache which wraps the C and C++ compilers. The first job will require a full, slow, compilation, while each subsequent job will make the cache more effective.
Due to the cache functionality in the GitLab runner, we have to change the cache dir to a path relative to the build directory, and we need to add ccache to the PATH. Other than that, it's pretty much straightforward.
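
In shell terms the wiring is just a couple of exported variables; a minimal sketch (the paths match the job below):

# Make the ccache compiler shims shadow cc/c++ for everything run in this shell
export PATH="/usr/lib/ccache/bin:$PATH"
# Keep the cache inside the build directory so the GitLab runner can cache it
export CCACHE_DIR="$(pwd)/ccache"
# After a build, print hit/miss statistics to see how effective the cache was
ccache -s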

.build:binary: &node_binary
  stage: build
  image: registry.gitlab.com/jitesoft/dockerfiles/alpine
  dependencies:
    - download:tars
  before_script:
    - apk add --no-cache build-base binutils-gold git python linux-headers ccache libstdc++ xz gnupg curl
    - if [ ! -d "result" ]; then mkdir result; fi
    - if [ ! -d "ccache" ]; then mkdir ccache; fi
    - mkdir -p node-src
  script:
    # Exported so that the compiler processes spawned by make can see them.
    - export PATH="/usr/lib/ccache/bin:$PATH"
    - export CCACHE_RECACHE="yes"
    - export CCACHE_DIR="$(pwd)/ccache"
    - NODE=$(cat versions/${NODE_VERSION}.txt)
    - tar -Jxf versions/node-v${NODE}.tar.xz -C node-src --strip-components=1
    - cd node-src
    - make -j2 binary V= DESTCPU="x64" ARCH="x64" VARIATION="musl" 2>&1 | tee ../result/build.log | awk 'NR%100==0 {print NR,$0}'
    - cd ..
    - mv node-src/node-*.tar.?z result/
  cache:
    paths:
      - ccache/
    key: node.build.ccache-${BUILD_TYPE}
  artifacts:
    paths:
      - result/*.log
      - result/*.tar.*
    expire_in: 30 days

This job is used as a template; it does not run by itself. First, all dependencies required to build Node are installed, and directories are created for the source, the ccache and the results (where the binary files are later placed).
The script itself adds the ccache binary directory to the PATH variable and points the ccache directory at the newly created cache directory. The versions/<version>.txt file is used again to fetch the actual Node version, and the source for that version is extracted into the source dir.

Make is in this case invoked with the -j2 flag, which means that it will run two jobs in parallel. Depending on your CI runners, this can be changed, but the runners used in the Jitesoft org run a lot of jobs in parallel, so I always try to keep compilations somewhat less resource heavy than they could be.

All the variables defined in the make command are just sugar for the executable and not really needed for the compilation, while the tee ... | awk ... part makes only every 100th line print to STDOUT, something that makes the build log in GitLab smaller (it would otherwise consist of ~15,000 lines).
The make command is also invoked with the binary target, telling make to produce a tar.xz and a tar.gz file that can be used on another “computer”.
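
The log filtering is easy to try out in isolation; here seq stands in for the compiler output:

# Print only every 100th line, prefixed with its line number
seq 1 250 | awk 'NR%100==0 {print NR,$0}'
# prints:
# 100 100
# 200 200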

Artifacts are uploaded (including the log file, for debugging in case that is needed!) and stored for 30 days in this case. That is not really needed, but if you want to be able to download a specific tar file later on, it might be nice.
The ccache directory is uploaded to the GitLab CI cache (which in Jitesoft's case uses Rook-Ceph block storage, working like S3) to be used in future builds.

As you might notice, the BUILD_TYPE variable is never defined, even though it should be. But as this is a template job, it is defined in the actual jobs instead!

The jitesoft/node-base image keeps two versions up to date: the stable and the latest branches of Node. One is the most recent LTS version and one is the latest released version; as of right now, these are the v10 and v12 branches.
They are built in two different jobs, both of which use the build template, and they look like this:

build:src:latest:
  variables:
    BUILD_TYPE: "latest"
    NODE_VERSION: "12"
    GIT_STRATEGY: "none"
  <<: *node_binary

build:src:stable:
  variables:
    BUILD_TYPE: "stable"
    NODE_VERSION: "10"
    GIT_STRATEGY: "none"
  <<: *node_binary

When we created the node_binary template job, we gave it a reference/anchor name (&node_binary). By doing this, we can use YAML-specific features to pretty much copy the whole template into another job (using the <<: *node_binary merge in the example above).
All the variables defined in a job are available to the template when it is invoked, and as you can see above, we define NODE_VERSION and BUILD_TYPE in each job.

When these jobs have finished, all we need to do is build the image!

Build step 4: Image

Before the refactoring of the build script, all of the above was done directly in the Dockerfile. This made it hard to cache anything and put a heavy load on the runners. Now, the image just copies the tar.xz file and un-tars it into /usr/local, which places the node binary in /usr/local/bin.

FROM registry.gitlab.com/jitesoft/dockerfiles/alpine:latest
ARG NODE_VERSION
COPY ./result/node-v${NODE_VERSION}-linux-x64-musl.tar.xz /node.tar.xz
RUN addgroup -g 1000 node \
 && adduser -u 1000 -G node -s /bin/sh -D node \
 && apk add --no-cache --virtual .node-deps libstdc++ \
 && apk add --no-cache --virtual .install-deps tar xz \
 && tar -Jxf /node.tar.xz -C /usr/local/ --strip-components=1 \
 && apk del .install-deps \
 && rm -rf /node.tar.xz \
 && chmod +x /usr/local/bin/*
CMD [ "node" ]

Additionally, a new user and group named node are created in Alpine, using 1000 as uid/gid. libstdc++ is required by Node, so it is installed permanently, while the tar & xz programs are only required to un-tar the binary and are removed again afterwards.
When the image is built, we have a new image with Node.js built from source, running on Alpine Linux. The image above is the jitesoft/node-base:<v>-slim image; it is missing some extra tools used in the actual full image but works fine for a lot of things. Its total size is less than 40 MB compressed, and it is deployed to both the GitLab registry and Docker Hub.
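
With the binary artifact in place under result/, building the slim image locally would look something like this (the version and tag are just examples):

# NODE_VERSION has to match the tarball name produced by the build job
docker build --build-arg NODE_VERSION=12.8.0 -t jitesoft/node-base:12-slim .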

If you wish to check out the full source for all the above, go visit the repository at GitLab!


If you find any issues, things that could be further optimized, or just have any questions about the post, please let me know in the comments below or in an MR/PR/issue in the repository!