Solano Labs Blog

OpenSSL Heartbeat Vulnerability (aka: Heartbleed)

As most of our users are by now no doubt aware, on April 7th a serious vulnerability was announced in recent versions of OpenSSL. Dubbed Heartbleed, CVE-2014-0160 allows a remote attacker to read potentially sensitive data from server memory. This vulnerability has had a widespread impact on many providers. We take security and the trust our customers place in us extremely seriously, so we want to take this opportunity to explain the steps we have taken over the last few days to address Heartbleed here at Solano.

Our incident response began immediately with the release of CVE-2014-0160. We do use SSL/TLS to secure communications between our customers and the service and between components of the service, and we do use the OpenSSL implementation. The response team began by upgrading all parts of our infrastructure, including the front-end website, the core API and control plane, database servers, test environments, and ancillary services (issue trackers, workstations, and so on). We then scheduled downtime for the evening of April 8 to replace all of our certificates and revoke the previous certificate. We are not aware of any compromise of the old certificate, but given the severity of CVE-2014-0160 we believe rekeying all servers is the best practice in this case.
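
For readers who want to run a quick check of their own environments, here is a minimal Ruby sketch: it prints the OpenSSL release the Ruby interpreter is linked against (releases 1.0.1 through 1.0.1f are affected; 1.0.1g and the older 0.9.8/1.0.0 lines are not). Note that this reflects only Ruby's OpenSSL, not necessarily the library your web server or load balancer uses.

    require 'openssl'

    # Prints something like "OpenSSL 1.0.1g 7 Apr 2014". Compare against the
    # affected range (1.0.1 through 1.0.1f) noted above.
    puts OpenSSL::OPENSSL_VERSION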

We continue to monitor the situation and strongly recommend that all users change their authentication tokens not only on Solano services but also with any other providers that they may use.

In summary:

  • All of our infrastructure was patched by 8pm PT on 4/7
  • Fresh, re-keyed certificates were installed across our infrastructure by 11pm PT on 4/8
  • All logged-in sessions were invalidated and reset after the infrastructure updates

We do use Amazon Web Services to host much of our infrastructure but do not use AWS Elastic Load Balancers (ELB) to terminate SSL.  For AWS-specific information, we recommend reading Amazon’s detailed security advisories.

If you have any questions or concerns that are not addressed here, please contact us at support@solanolabs.com.

Per-Repo GitHub Status Configuration

The Solano CI integration with GitHub uses OAuth for authentication.  Today we have rolled out the ability to set the credentials used to post GitHub status on a per-repository basis.  To configure an alternate set of credentials for a repository, go to the GitHub Status menu item on the repo configuration page  (click on the gear icon in the dashboard).  You can then select from the list of users that have linked their accounts with GitHub via OAuth or enter a personal OAuth token.

One of the useful features of GitHub OAuth tokens is that they can be used to authenticate command-line tools. In addition to using curl to access the full GitHub API, you can use OAuth tokens for git over HTTPS or to download files from the command line. The GitHub OAuth token is made available as an automatically managed config variable, GITHUB_OAUTH_TOKEN, which is exported to the build environment as an environment variable where it can be used by setup hooks and post-build hooks. For instance, you can use it to generate an authenticated URL to download custom Elasticsearch plugins from a private repository, or to authenticate git pull or git push as part of setup and teardown.
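
As an illustration, here is a minimal sketch of a setup hook that uses the token; the repository owner, name, and file path are placeholders, and it goes through GitHub's standard contents API rather than any Solano-specific helper:

    #!/usr/bin/env ruby
    require 'open-uri'

    token = ENV.fetch('GITHUB_OAUTH_TOKEN')

    # Placeholder repository and path: fetch a single file from a private
    # repo via the GitHub contents API, asking for the raw file body.
    url = 'https://api.github.com/repos/myorg/private-deps/contents/plugins/custom_plugin.zip'

    File.open('custom_plugin.zip', 'wb') do |f|
      f.write URI.open(url,
                       'Authorization' => "token #{token}",
                       'Accept'        => 'application/vnd.github.v3.raw').read
    end

    # The same token also authenticates git over HTTPS, for example:
    #   git pull https://<token>@github.com/myorg/private-deps.git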

 

3/14 Service Interruption

Around 7am PT, one of our app server nodes (and alas, also our primary redis server) started exhibiting average network (ping) latency of several tens of ms — spiking to >100ms — to our DB master and other nodes in the cluster.

We have removed the app server from use, and failed over to replicas as of 10 am PT.

We are bringing on additional capacity to service the backlog as quickly as we can.

We’ll update this page as we make more progress.

Update: we are in communications with our infrastructure provider to get more information on the root cause of this situation.

Update 5:15pm: we have restored capacity and mostly drained the backlog of builds.  We continue communication with our infrastructure provider to understand root cause.

Recent Service Interruptions

There have been several recent service interruptions that have made the experience of using Tddium fall below our high standards. We here at Solano Labs sincerely apologize for these issues. We'd like to take a few minutes to explain the incidents and describe our short- and long-term mitigation strategies.

2/24 tddium.com Domains Unresolvable

Sometime before 8:30am PT on Monday 2/24, our domain registrar, name.com, deleted our DNS glue records. We had received no notice of this change, and discovered it only through DNS investigation.

  • Route53 nameservers up and running – Check.
  • SOA Records in place – Check.
  • dig @8.8.8.8 tddium.com – Failure. How could that have happened?
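
The comparison we were making boils down to asking a public resolver and the zone's own authoritative nameserver for the same name and comparing the answers. A minimal Ruby sketch of that check (the Route53 nameserver address below is a placeholder; use one assigned to your hosted zone):

    require 'resolv'

    # Ask a public resolver and the zone's authoritative Route53 server for
    # the same name. Authoritative answers that succeed while public
    # resolution fails point at registrar-level delegation/glue records.
    resolvers = {
      'public (8.8.8.8)'        => Resolv::DNS.new(nameserver: ['8.8.8.8']),
      'Route53 (authoritative)' => Resolv::DNS.new(nameserver: ['205.251.192.1'])
    }

    resolvers.each do |label, dns|
      begin
        puts "#{label}: #{dns.getaddress('tddium.com')}"
      rescue Resolv::ResolvError => e
        puts "#{label}: FAILED (#{e.message})"
      end
    end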

As DNS caches expired, tddium.com domain names gradually became unresolvable, and our hosts therefore became effectively unreachable. Both our custom monitoring infrastructure and our off-site pings (from New Relic) still had cached DNS entries and reported no errors. Many users and build workers were happily compiling and testing. We replaced the registrar configuration at 9:55 am, returning service to operation.

After identifying the missing configuration, we emailed name.com support for help understanding what happened. Their response after 36 hours was that our domain was “disabled for using our name servers for URL forwarding DDos attack”. That’s funny – we have been using Route53’s DNS servers since early 2013. We were still using name.com’s URL forwarding hosts to route *.tddium.com to www, and we discovered later in the day on 2/24 that our forwarding entries had also been reset to point to a stock parking site hosting ads. We have since switched all redirection to use Route53 and S3 buckets. Our requests for further information from name.com were politely deflected.

Unfortunately, there’s no good way to have a “backup” domain registration, but we plan to switch registrars very soon to a company that will better respect our business interests.

3/10 Elevated Error Rates

Many of our users over the past few months have noticed “waiting for available socket” slowdowns in Chrome. These issues were due to Chrome’s single-host connection limits in the face of long-polling. We’ve been slowly migrating our live-updating UI from polling, to push over long-poll, and finally to WebSockets, which would resolve all of those available socket delays. The last component in the WebSockets conversion was rolling out a web front-end that natively supported WebSockets – specifically, switching from Apache to nginx. After a few weeks of soak testing that convinced us that the switch could be done seamlessly, we began the production nginx rollout on Wednesday 3/5, and it held up firmly under peak load on Thursday 3/6 and Friday 3/7. We declared success.

Monday 3/10, our alerts lit up.

It was immediately obvious from New Relic reporting that something was seriously wrong and that 3/10 traffic was markedly different from Thursday 3/6. Monday is not normally a peak traffic day for us, but for some reason we were seeing huge front-end queue delays and big spikes of 503 errors, which unfortunately manifested to our users as error builds. We eventually traced the problem to a combination of nginx and Passenger configuration settings. We prepared a new configuration, ran high-volume load tests of it overnight, and have seen service quality return to more acceptable levels. We are still seeing intermittent recurrences, and we continue to tune.

3/11 GitHub Gets DDoSed

Tuesday afternoon, GitHub experienced major connectivity issues, which they have announced were due to a DDoS attack. We are hugely thankful for and respectful of their quick response under fire. On our servers, we had a lingering shadow of git processes stuck on dead TCP connections, and webhooks and commit status updates that never made it to their rightful destinations. We cleaned these up as they occurred over the course of the afternoon, and finally had clean queues around 9pm PT.

Conclusions and Next Steps

Our next steps involve hardening our DNS infrastructure, completing our nginx tuning, and productizing our admin UI for displaying received external webhooks to make debugging easier. We continue to develop our internal monitoring systems, and we're scoping out production canary servers for updates to low-level infrastructure components like nginx.

We’d like to specially thank our partner New Relic for the invaluable insight their monitoring has provided us in debugging these interruptions.

We strive to provide a stable, trustworthy platform on which our customers can build and test great software, faster than ever, and we want to thank all of you for your patience and understanding while we weather these service interruptions. We will push to improve wherever we can, and we welcome your feedback at support@solanolabs.com.

The Solano Labs Team

Solano Labs Sponsors New Community: AutoTestCentral.com

We are extremely happy to announce the launch of an online community blog, based on the interest we have received from our first three Automated Testing Meetup groups. The blog is called AutoTestCentral.com: “Where people who write and test software come to talk about automation.” We are very excited to grow and support this community!

Here we will post on all things related to the automation of software testing. We decided to create this blog after trying to share content among our Automated Testing Meetup groups. We currently have groups in San Francisco, New York City, and Boston. We have had some great talks in each meetup, and posting presentation materials only to each city's own meetup page was not going to cut it! We had people in SF wanting to know about NYC and people in Boston trying to learn what last month's SF talk was about! With hopes of launching in more cities in the new year, we knew we needed to change something. So we created this blog so that we can share all the content from the Meetup groups in one place… here!

We are also going to be asking the community to contribute posts.  We already have some great ones posted from leaders in the space.  If you or someone you know would like to author a post, please reach out to Sarah at sfoster@solanolabs.com, and she will guide you through the process.

If you are in one of our covered cities, please join! If you would like an Automated Testing Meetup group in your city, please say so in the comments section.

http://www.meetup.com/Automated-Testing-Boston/   

http://www.meetup.com/Automated-Testing-NYC/   

http://www.meetup.com/Automated-Testing-San-Francisco/   

We hope to see this community grow organically into a place where all testing professionals can learn, share knowledge, post content, and talk with one another.

Thank you! Let's get started!

– The Solano Labs Team

SimpleCov and Ruby 2.1

Solano CI uses the exit status from commands to determine whether a test passes or fails. This behavior follows the venerable Unix tradition whereby an exit status of zero indicates success and a non-zero exit status indicates failure.

On occasion we’ve seen bugs in test frameworks that can cause false positives or, worse, false negatives. Users with Ruby test suites should check that they are not impacted by a recent defect when using SimpleCov 0.8.x, RSpec 2.14, Rails 4.0.x, and Ruby 2.1.0. Details may be found in the GitHub issue: https://github.com/colszowka/simplecov/issues/281.
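
If you want a quick local sanity check, the sketch below runs a spec file you know should fail (the path is a placeholder) and verifies that the process exits non-zero, which is exactly the signal Solano CI keys off of:

    # Placeholder path: point this at a spec that is known to fail.
    passed = system('bundle exec rspec spec/known_failing_spec.rb')
    puts "exit status: #{$?.exitstatus}"

    # Kernel#system returns true only for a zero exit status, so a "passing"
    # run of a failing spec means CI would report a false positive.
    abort 'WARNING: failing suite exited 0' if passed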

GitHub Authentication Updates Released

We’re happy to announce that the changes we’ve been planning to our GitHub authentication integration are live in our production environment!

As we described in an earlier post, we've changed our OAuth model to allow users to select the privilege level they give Tddium to communicate with GitHub. Now, when you link a GitHub account, you'll see a menu of privilege levels that you can authorize. You can always change the level you've authorized by visiting your User Settings page, where you'll see the same menu. For more information on Tddium's use of GitHub permissions, see our documentation section.

GitHub API Authentication Updates

At Solano Labs, we believe that a seamless integration between our service and our customers’ tools provides the best user experience. Many of our customers today use GitHub and have connected a GitHub account with their Tddium account using OAuth.

We take the security of our customers’ code very seriously, and we’re making some important changes to our GitHub OAuth integration that should give you much finer-grained control over the privileges you give Tddium to operate on your GitHub account.

What we do now
Our current GitHub OAuth functionality requests nearly complete permissions to your GitHub account (“user,repo” scope in GitHub’s API terminology). Tddium requests these privileges so that it can fully automate the setup of the CI workflow (commit hooks, deploy keys, and keys to install private dependencies). Our updated GitHub integration allows for multiple privilege levels so that you can make a tradeoff between permissions and automated setup.

In the next week or so
we’ll roll out changes that will:

  • Allow basic single sign-on with no other GitHub API access.
  • Let you choose between 3 privilege levels that allow Tddium to:
    1. post commit status to update pull requests (for public and private repos)
      (“repo:status” scope)
    2. automate CI webhooks and deploy keys for public repos
      (“repo:status,public_repo” scope)
    3. automate CI webhooks and deploy keys for public and private repos
      (“repo” scope)
  • Give instructions on creating bot GitHub users to allow your builds to pull dependencies installed from private GitHub repos.
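
For reference, these privilege levels map directly onto GitHub's OAuth web flow: the level you authorize is simply the scope parameter on the authorize URL. A small sketch (the client_id value is a placeholder):

    require 'uri'

    # Build a GitHub OAuth authorize URL for a given scope string.
    def authorize_url(scope)
      query = URI.encode_www_form(client_id: 'TDDIUM_CLIENT_ID', scope: scope)
      "https://github.com/login/oauth/authorize?#{query}"
    end

    puts authorize_url('repo:status')              # level 1: commit status only
    puts authorize_url('repo:status,public_repo')  # level 2: adds public repo automation
    puts authorize_url('repo')                     # level 3: adds private repo automation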

If you have already linked your GitHub account, it will continue to be linked, and will give Tddium the current high level of permissions. After the rollout, you’ll be able to easily edit Tddium’s permissions on your GitHub account on your User Settings page.

We look forward to your feedback at support@solanolabs.com.

Cheers,

The Solano Labs Team

Fast Tests Remove Testing Barriers

by Carl Furrow of Lumos Labs

Making sure your test suite runs quickly ensures that it will be run often. We at Lumos Labs (lumosity.com) had been working on an in-house Jenkins CI setup to run our ~2500 tests across ~360 files in under 10 minutes. Our Jenkins setup consisted of about 24 executor VMs. For each build we allocated 12 executors, and each executor got a subset of the total files to run; for example, with 360 test files, each executor VM was responsible for running 30 of them.
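
As a rough sketch of that static split (not Lumos Labs' actual scripts; EXECUTOR_INDEX is an assumed per-VM variable, 0 through 11, set by the Jenkins job):

    EXECUTORS = 12
    index = Integer(ENV.fetch('EXECUTOR_INDEX'))

    # Deal the spec files round-robin into one bucket per executor, then run
    # only this executor's share of the suite, exiting non-zero on failure.
    files   = Dir.glob('spec/**/*_spec.rb').sort
    buckets = Array.new(EXECUTORS) { [] }
    files.each_with_index { |f, i| buckets[i % EXECUTORS] << f }

    exit(system('bundle', 'exec', 'rspec', *buckets[index]) ? 0 : 1)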

Under this configuration a build would complete in anywhere from 12-20 minutes, which is fine when production releases are coming slowly, but an eternity when three or more people are queueing up changes that need to go into production. Running just the RSpec tests locally can take 45 minutes, so parallelization is a necessity when testing the entire suite.

As our company grew and more developers created feature branches, more CI builds were queued in Jenkins. With the limited number of executors we had, builds backed up: if you were behind 2-3 people in the queue, you could wait 30-40 minutes for your build to even start running! It was becoming a headache for all of us, so we looked at increasing the number of executor VMs as well as beefing up the processing power of each one.

Adding more VMs to the cluster brought on additional headaches. With the increase in speed, we noticed more segfaults occurring during builds, each of which marked the build as a failure, even though re-running the build would usually get it to pass. We spent many hours debugging the different environments, gems, etc., trying to determine where the segfaults were happening, and eventually coded up scripts that could detect a segfault and re-run the subset of tests where it occurred. Not a permanent solution, but one that got our builds passing more often without these ‘flickering’ segfaults. Coupled with this was a constant hunt to determine whether a failed Cucumber scenario was a legitimate failure or something related to capybara-webkit. More developer time was spent reworking our selectors and specs that hit Capybara, which was time well spent, but it took a long time to re-code and to deal with version changes to the Capybara API. Obviously you cannot rid yourself of all responsibility, but running tests and managing our own servers was becoming tedious (wait for it).

Knowing that we wanted to stop managing our own CI environment, we looked at CI service providers (hosted and self-hosted) to see how they would perform. Unfortunately, we ended up investing days into configuration and setup, and still the test suite times were worse than what we were seeing in our own setup. It seemed we already had the best CI available to us, and we'd have to give up on finding a hosted CI service that was easy to set up and, more importantly, faster than what we already had. So we started building a beefier set of servers and VMs to run our Jenkins setup; that was promising, but expensive.

Enter tddium, exit tedium

Flash-forward to a testing-related meetup this past August, hosted by Solano Labs in SF. They showed off their hosted CI product, tddium, and led a general discussion on testing strategies and horror stories. I had a chance to talk with co-founders Jay and William about our current CI setup, and they felt strongly that they could improve the running time, if nothing else.

After setting up the trial account, creating a tddium.yml configuration file, and working with Solano's support staff to set up an environment that more closely resembled our Jenkins setup, I had a green build!

All green!

Today most of our builds run in about 5 minutes on 1.9.3-p327.

All green again!

We even had our Ruby 2.0.0-p247 branch running in under 4 minutes!

All Green, and ruby 2.0!

TL;DR We’re glad we switched

Now that our tests run via tddium, we've phased out our Jenkins setup, and the testing queue has been all but eliminated. We ended up with a setup that allows three builds to run at once, and that seems like the sweet spot for us, with builds taking about five minutes apiece.

System                     Average Build Time   Executors/Workers   Speed Improvement
Self-Hosted Jenkins        17 minutes           12                  -
tddium (ruby 1.9.3-p327)   5 minutes            24                  340% improvement!
tddium (ruby 2.0.0-p247)   4 minutes            24                  425% improvement!
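
For clarity, the improvement figures above are just the ratio of the old average build time to the new one, expressed as a percentage:

    # Old average build time divided by new, as a percentage.
    puts((17.0 / 5 * 100).round)   # => 340
    puts((17.0 / 4 * 100).round)   # => 425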

Service Outage Oct 24, 2013 – Maintenance Today

At approximately 2:14pm PT on Oct 24, 2013, Tddium's DB master server experienced a CPU usage spike that cascaded into a server stoppage. No data was lost.

Examining data (thanks, New Relic!) and logs, we concluded that although average CPU usage hovers around 20-30%, our DB master bursts close to 100%. Once Postgres crosses into “queue backup” territory, it never comes back.

Tonight, we will upgrade our DB cluster to use faster servers.  This upgrade should only take a few minutes, but it will require the app to be down.

We appreciate your patience as we address these infrastructure issues.

Thanks,

– The Solano Labs Team
