by Carl Furrow of Lumos Labs
Making sure your test suite runs quickly ensures that it will be run often. We at Lumos Labs (lumosity.com) have been working on an in-house Jenkins CI setup to run our ~2500 tests across ~360 files in under 10 minutes. Our Jenkins setup consists of about 24 executor VMs. For each build we allocate 12 executors, and each executor gets a subset of the files to run. For example, with 360 test files, each executor VM would be responsible for running 30 of them.
Under this configuration a build would complete in anywhere from 12 to 20 minutes. That is fine when production releases are coming slowly, but it’s an eternity when three or more people are queueing up changes that need to go into production. Running just the rspec tests locally can take 45 minutes, so parallelization is a necessity when testing the entire suite.
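To make the split concrete, here is a minimal sketch of dividing test files evenly across executors. It is not our actual Jenkins scripting: the helper name `shard_files` and the file names are made up for illustration, and it assumes a simple round-robin assignment.

```ruby
# Hypothetical helper (not our real build script): assign test files to
# executors round-robin so every executor gets roughly the same number.
def shard_files(files, executor_count)
  raise ArgumentError, "executor_count must be at least 1" if executor_count < 1

  shards = Array.new(executor_count) { [] }
  files.each_with_index do |file, index|
    shards[index % executor_count] << file
  end
  shards
end

# Illustrative file names only; 360 files across 12 executors.
files = (1..360).map { |i| format("spec/example_%03d_spec.rb", i) }
shards = shard_files(files, 12)

puts shards.size        # number of executors used
puts shards.first.size  # files handled by the first executor
```

Splitting by file count is the simplest approach, but it balances the number of files rather than their running time, so one executor can still end up with the slowest files and hold up the whole build.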
As our company grew and more developers were creating feature branches, more CI builds were being queued in Jenkins, and with our limited number of executors the queue kept growing. If you were behind 2-3 people in the build queue, you’d be waiting up to 30-40 minutes for your build to even start running! It was becoming a headache for all of us, so we looked at increasing the number of executor VMs, as well as beefing up the processing power of each one.
Adding more VMs to the cluster brought on additional headaches. With the increase in speed, we noticed more segfaults occurring during builds, each of which marked the build as a failure, even though re-running the build would usually get it to pass. We spent many hours debugging the different environments, gems, etc., trying to determine where the segfaults were happening, and eventually coded up scripts that could detect a segfault and re-run the subset of tests where it occurred. It was not a permanent solution, but it got our builds passing more often without these ‘flickering’ segfaults. Coupled with this was a constant hunt to determine whether a failed cucumber scenario was a legitimate failure or something related to Capybara Webkit. More developer time went into re-working our selectors and the specs that hit Capybara. That was time well spent, but it took a long time to re-code and to deal with version changes to the Capybara API. Obviously you cannot rid yourself of all responsibility, but running tests and managing our own servers was becoming tedious (wait for it).
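The segfault workaround can be sketched roughly as follows. This is an illustration under assumptions, not the script we actually ran: the names `run_command` and `run_with_segfault_retry` are hypothetical, and the sketch assumes a crash can be recognized by the text "Segmentation fault" in the command's combined output.

```ruby
require "open3"

# Assumed crash marker: MRI prints "[BUG] Segmentation fault ..." when it crashes.
SEGFAULT_MARKER = /Segmentation fault/

# Runs a shell command and returns [combined stdout/stderr, success flag].
def run_command(command)
  output, status = Open3.capture2e(command)
  [output, status.success?]
end

# Hypothetical wrapper: re-run a command only when it failed because of a
# segfault. Ordinary test failures are reported right away, without a retry.
def run_with_segfault_retry(command, max_attempts: 3, runner: method(:run_command))
  1.upto(max_attempts) do |attempt|
    output, success = runner.call(command)
    return { passed: true, attempts: attempt } if success
    # No crash marker means a genuine failure, so stop here.
    return { passed: false, attempts: attempt } unless output.match?(SEGFAULT_MARKER)
  end
  # Every attempt crashed; give up and report the failure.
  { passed: false, attempts: max_attempts }
end

# Example call (the spec path is illustrative only):
#   run_with_segfault_retry("bundle exec rspec spec/models/user_spec.rb")
```

The `runner` parameter exists only so the retry logic can be exercised with a stand-in instead of a real test run. Capping the number of attempts matters: a segfault that happens every time should still fail the build rather than loop forever.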
Knowing that we wanted to stop managing our own CI environment, we went looking at CI service providers (hosted and self-hosted) to see how they would perform. Unfortunately, we ended up investing days into configuration and setup, and the test suite times were still worse than what we were seeing in our own setup. It seemed we already had the best CI for us, and we’d have to give up on finding a hosted CI service that was easy to set up and, more importantly, faster than what we currently had. So we started building a beefier set of servers and VMs to run our Jenkins setup. That was promising, but it was expensive.
Enter tddium, exit tedium
Flash-forward to a testing-related meetup this past August, hosted by Solano Labs in SF. They showed off their hosted CI product, tddium, along with a general discussion on testing strategies and horror stories. I had a chance to talk with co-founders Jay and William about our current CI setup, and they felt strongly they could improve the running time, if nothing else.
After setting up the trial account, creating a tddium.yml configuration file, and working with Solano’s support staff to set up an environment that more closely resembled our current Jenkins setup, I had a green build!
Today most of our builds run in about 5 minutes on Ruby 1.9.3-p327.
We even had our Ruby 2.0.0-p247 branch under 4 minutes!
TL;DR We’re glad we switched
Now that our tests are run via tddium, we’ve phased out our Jenkins setup, and the testing queue has been all but eliminated. We ended up with a setup that allows three builds to run at once, and that seems like the sweet spot for us, with builds taking about five minutes apiece.
|| ||Average Build Time||Speed Improvement %||
|tddium (ruby 1.9.3-p327)| | |
|tddium (ruby 2.0.0-p247)| | |