26 August 2014

The problem

I’m working on a small web application written in Java, with Travis CI as my build server. Although the application is currently just a minimal “Hello World” Spring app, it already has quite a few dependencies, and downloading them was adding a lot of time to my CI builds. Because each build takes place in an isolated sandbox, Maven had to download every dependency on every build.

To make matters worse, I’m using Travis CI’s ability to deploy to Heroku. Heroku builds the application from source too, so it was downloading all of the Maven dependencies again on every deployment. As a result, every dependency was downloaded twice per build, and my build times were 5-8 minutes for a minimal “Hello World” app that didn’t even do anything yet.

Alternative solutions

For private repositories, Travis CI supports caching dependencies between builds. All you have to do to cache your Maven dependencies is add the following to the end of your .travis.yml file:

cache:
  directories:
  - $HOME/.m2

However, this doesn’t work for public repositories using Travis CI for free. Elsewhere in Travis CI’s documentation is an article on speeding up the build, which suggests either rolling your own solution to cache dependencies in S3 or, for Ruby projects that manage dependencies via Bundler, using WAD.

I considered trying to generalise WAD to allow it to cache dependencies other than Ruby gems, but concluded that it would be difficult to do this cleanly and that a custom build script would be more lightweight in the end.

The solution

I wrote a small bash script to compress my local Maven folder and upload it to S3 at the end of each build, and to download and uncompress it at the start. This cut my build time to 2-3 minutes. The rest of this section explains how to use the script and a bit about how it works.

Setting up an S3 bucket

The first step is to set up an S3 bucket and make it available to the build. Assuming you’re starting from scratch, you’ll need to do the following:

  • Sign up for AWS
  • Create a bucket (I believe Travis CI is in the ‘US Standard’ region, so it makes sense to use the same one)
  • Create an IAM user, generating an access key and a secret (keep a note of these, as AWS doesn’t let you retrieve the secret again later on)
  • Attach a policy based on the S3 Full Access template, replacing the resource string * with arn:aws:s3:::bucketname/*
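
As a rough illustration of the last step, the sketch below prints a policy document of the shape described: full S3 access, but with the resource restricted to the objects in one bucket. The function name s3_cache_policy is hypothetical and the JSON is my assumption of what the edited template looks like, so compare it against what the IAM console actually generates before relying on it.

```shell
#!/bin/bash
# Hypothetical helper (not part of the original script): prints an IAM policy
# that allows every S3 action, but only on objects inside the given bucket.
function s3_cache_policy {
  local bucket=$1
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": "arn:aws:s3:::${bucket}/*"
    }
  ]
}
EOF
}

# Print the policy for the placeholder bucket name used in this post
s3_cache_policy bucketname
```

The output can be pasted into the policy editor for the IAM user in place of the unmodified template.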

Add encrypted environment variables to your .travis.yml file for the credentials created above. Using the travis client, just run:

> travis encrypt AWS_ACCESS_KEY_ID=accesskey --add
> travis encrypt AWS_SECRET_ACCESS_KEY=secretkey --add

Writing the build script

The build script is available in full on Gist. You can skip to using the build script in .travis.yml if you just want to get it working. The rest of this section goes into a bit more detail on how the script works internally. The beginning of the script looks like this:

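#!/bin/bash
# Folder to cache, and the name of the archive stored in S3
_s3_caching_dependencyFolder=$HOME/.m2
_s3_caching_file="cached.tar.bz2"

# Start of the build: restore the cache, but only if no pom.xml has changed
function getCachedDependencies {
  if [[ -z $(_s3_caching_diffPomFiles) ]]; then
    echo "pom.xml files unchanged - using cached dependencies"
    _s3_caching_downloadArchive
    _s3_caching_extractDependencies
  fi
}

# End of the build: recreate the cache, but only if a pom.xml has changed
function cacheDependencies {
  if [[ -n $(_s3_caching_diffPomFiles) ]]; then
    echo "pom.xml files have changed - updating cached dependencies"
    _s3_caching_compressDependencies
    _s3_caching_uploadArchive
  fi
}

# Prints the diff of all pom.xml files over the last push (empty if unchanged)
function _s3_caching_diffPomFiles {
  # The pattern is quoted so that git, not the shell, expands it;
  # this way it matches pom.xml files at any depth
  git diff "${TRAVIS_COMMIT_RANGE}" -- pom.xml '**/pom.xml'
}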

Implementation details

The extract/compress functions just call tar. The download/upload functions use curl, but are rather more involved as they have to construct a valid authentication header for the request to S3. I got on the right path thanks to Tommy Montgomery’s blog post on uploading to S3. Unfortunately, Amazon have introduced a new authentication protocol since that post was written, and while I think the old one is still supported, I couldn’t find it in the AWS documentation.
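
To give an idea of the tar half, here is a minimal sketch of what the two functions could look like. The function and variable names are taken from the script above, but the bodies are my assumption of a reasonable implementation and may differ from the Gist.

```shell
#!/bin/bash
# Defaults match the variables at the top of the script; override them to test.
_s3_caching_dependencyFolder=${_s3_caching_dependencyFolder:-$HOME/.m2}
_s3_caching_file=${_s3_caching_file:-cached.tar.bz2}

# Assumed implementation: bzip2-compress the dependency folder into the archive.
function _s3_caching_compressDependencies {
  # -C changes into the parent directory first, so the archive stores
  # relative paths (".m2/...") rather than absolute ones
  tar -cjf "$_s3_caching_file" \
    -C "$(dirname "$_s3_caching_dependencyFolder")" \
    "$(basename "$_s3_caching_dependencyFolder")"
}

# Assumed implementation: unpack the archive next to where the folder lives.
function _s3_caching_extractDependencies {
  mkdir -p "$(dirname "$_s3_caching_dependencyFolder")"
  tar -xjf "$_s3_caching_file" -C "$(dirname "$_s3_caching_dependencyFolder")"
}
```

Storing relative paths keeps the archive usable even if the home directory of the build user changes between builds.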

To support both uploads and downloads I ended up implementing the new authentication protocol in my script. While quite fiddly to get working, the resulting script just uses openssl and a few other simple tools, all of which are available in the Travis CI build environment.
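
The fiddliest part is deriving the key used to sign each request. The sketch below assumes the new protocol is AWS Signature Version 4, where the signing key is a chain of HMAC-SHA256 operations over the date, region and service name. The helper names (to_hex, hmac_sha256, sigv4_signing_key) are hypothetical and not taken from the Gist, and the sketch stops short of building the canonical request and the Authorization header.

```shell
#!/bin/bash
# Hex-encode a string so it can be handed to openssl as a binary HMAC key.
function to_hex {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
}

# HMAC-SHA256 of a message ($2) under a hex-encoded key ($1); prints hex.
function hmac_sha256 {
  printf '%s' "$2" \
    | openssl dgst -sha256 -mac HMAC -macopt "hexkey:$1" \
    | sed 's/^.*= *//'   # strip the "(stdin)= " style prefix
}

# Derive the signing key (assuming Signature Version 4).
# Arguments: secret key, date as YYYYMMDD, region, service.
function sigv4_signing_key {
  local k_date k_region k_service
  k_date=$(hmac_sha256 "$(to_hex "AWS4$1")" "$2")
  k_region=$(hmac_sha256 "$k_date" "$3")
  k_service=$(hmac_sha256 "$k_region" "$4")
  hmac_sha256 "$k_service" "aws4_request"
}

# The request signature is then the HMAC of the "string to sign" under that key:
#   key=$(sigv4_signing_key "$AWS_SECRET_ACCESS_KEY" "$(date -u +%Y%m%d)" us-east-1 s3)
#   signature=$(hmac_sha256 "$key" "$string_to_sign")
```

Keys are passed around hex-encoded because the intermediate HMAC results are binary and cannot be stored safely in shell variables.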

Optimisation

Finally, the conditionals and the _s3_caching_diffPomFiles function seen above form an optimisation to address a couple of issues:

  • Compressing the dependencies and uploading them takes a while (~20s for me right now but this will grow over time)
  • When dependencies change, the old ones will continue to bloat our cache unless we recreate it from scratch

To address both of these, I wanted to download cached dependencies only if they hadn’t changed since the last push, and update the cache only if they had changed (recreating it from scratch). I could do this by checking for any changes to the project’s pom.xml files. Fortunately, Travis CI provides an environment variable containing the commit range of the last push, which can be passed straight to git diff for this purpose.
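
The check itself can be sketched on its own as follows. The function name pomFilesChanged is hypothetical; it relies on git diff --quiet, which prints nothing and exits with status 1 when the given paths differ across the range.

```shell
#!/bin/bash
# Hypothetical helper: succeeds only when some pom.xml changed in the range.
# $1 is a commit range such as the value of TRAVIS_COMMIT_RANGE.
function pomFilesChanged {
  local range=$1
  # The quoted pattern is expanded by git, so it matches pom.xml at any depth
  git diff --quiet "$range" -- pom.xml '**/pom.xml'
  # Exit status 1 means "differences found"; 0 means none, anything else is an error
  [[ $? -eq 1 ]]
}

# Usage inside a Travis CI build:
#   if pomFilesChanged "$TRAVIS_COMMIT_RANGE"; then
#     echo "pom.xml files have changed"
#   fi
```

Treating only exit status 1 as "changed" means that a git error, such as an unknown commit in the range, is not mistaken for a change.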

Using the build script in .travis.yml

The build script can be used in .travis.yml as follows:

before_install: source ./build/s3_caching.sh
install: getCachedDependencies
after_script: cacheDependencies
env:
  global:
  - AWS_BUCKET: bucketname
  ...

Travis CI’s default behaviour for Java projects is to call Maven twice: once in the install step to pull in dependencies and again in the script step to run the tests. The above config overrides the install step to use my script instead, and adds steps to load the script and cache dependencies after the build completes.

Caching dependencies between Heroku builds

The above solution addresses the problem of Travis CI downloading all of my dependencies on every build. However, I mentioned at the start that Heroku was also downloading all of the dependencies for every deployment. Heroku’s Java documentation claims that it automatically caches dependencies between builds, and looking at the source of the Heroku Java buildpack, it’s fairly clear that it does cache Maven dependencies.

Some further digging around in the Travis CI documentation about Heroku revealed that there are two deployment methods available: the standard git-based deployment and the alternative, Anvil, which Travis CI uses by default. I didn’t spend a lot of time getting my head around Anvil, so I don’t know whether it simply doesn’t use the same buildpack or whether there was some other issue, or how fixable it might be. However, there didn’t seem to be any particular downside to using git-based deployment. Instructing Travis CI to deploy to Heroku via git resulted in Maven dependencies being cached between Heroku builds as expected.

In the next post I’ll cover another approach for deploying to Heroku and further optimising the build.