Recent Service Interruptions

Several recent service interruptions have delivered an experience of using TDDium that falls below our high standards. We here at Solano Labs sincerely apologize for these issues. We’d like to take a few minutes to explain each incident and describe our short- and long-term mitigation strategies.

2/24 tddium.com Domains Unresolvable

Sometime before 8:30am PT on Monday 2/24, our domain registrar, name.com, deleted our DNS glue records. We had received no notice of this change, and discovered it only through DNS investigation.

  • Route53 nameservers up and running – Check.
  • SOA Records in place – Check.
  • dig @8.8.8.8 tddium.com – Failure. How could that have happened?

As DNS caches expired, tddium.com domain names became unresolvable, and our hosts therefore became effectively unreachable. Both our custom monitoring infrastructure and our off-site pings (from New Relic) still had cached DNS entries, and so reported no errors. Many users and build workers were, in fact, still happily compiling and testing.
We restored the registrar configuration at 9:55 am, returning service to normal operation.
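
For the curious, here is the kind of check that would have caught this while every cache in between still looked healthy. It’s a minimal Python sketch (not our actual monitoring code) that shells out to dig and asks a .com registry server directly for our delegation, bypassing resolver caches entirely; the choice of a.gtld-servers.net is arbitrary.

```
#!/usr/bin/env python3
"""Uncached delegation check: ask the .com registry who tddium.com is
delegated to.  If the registry returns no NS referral, the delegation is
gone, no matter what cached resolvers still claim."""
import subprocess
import sys

DOMAIN = "tddium.com"
TLD_SERVER = "a.gtld-servers.net"  # any .com registry server works


def delegation_nameservers(domain):
    """Return the NS names the .com registry currently hands out for `domain`."""
    out = subprocess.run(
        ["dig", "@" + TLD_SERVER, domain, "NS", "+noall", "+authority"],
        capture_output=True, text=True, check=True,
    ).stdout
    servers = []
    for line in out.splitlines():
        parts = line.split()
        # Referral lines look like: "tddium.com.  172800  IN  NS  ns-foo.awsdns-00.org."
        if len(parts) >= 5 and parts[2] == "IN" and parts[3] == "NS":
            servers.append(parts[4])
    return servers


if __name__ == "__main__":
    servers = delegation_nameservers(DOMAIN)
    if not servers:
        print("ALERT: %s has no delegation at %s" % (DOMAIN, TLD_SERVER))
        sys.exit(1)
    print("%s is delegated to: %s" % (DOMAIN, ", ".join(servers)))
```

Something along these lines, run from a few external vantage points, is the sort of registry-level check we have in mind as part of hardening our DNS monitoring.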

After identifying the missing configuration, we emailed name.com support for help understanding what happened. Their response after 36 hours was that our domain was “disabled for using our name servers for URL forwarding DDos attack”. That’s funny – we have been using Route53’s DNS servers since early 2013. We were still using name.com’s URL forwarding hosts to route *.tddium.com to www, and we discovered later in the day on 2/24 that our forwarding entries had also been reset to point to a stock parking site hosting ads. We have since switched all redirection to use Route53 and S3 buckets. Our requests for further information from name.com were politely deflected.
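
For illustration, the S3 side of that redirection amounts to an empty bucket whose website configuration bounces every request to www; a Route53 ALIAS record then points the hostname at the bucket’s website endpoint. Here is a rough Python (boto3) sketch of the apex-domain piece – the hostnames and protocol are illustrative, not a dump of our exact configuration:

```
#!/usr/bin/env python3
"""Sketch of an S3 redirect-only bucket replacing registrar URL forwarding.
The bucket name must match the hostname being redirected; a Route53 ALIAS
record then points that hostname at the bucket's website endpoint."""
import boto3

s3 = boto3.client("s3")

# An empty bucket that serves nothing but redirects to www.
s3.put_bucket_website(
    Bucket="tddium.com",
    WebsiteConfiguration={
        "RedirectAllRequestsTo": {
            "HostName": "www.tddium.com",
            "Protocol": "https",
        }
    },
)
```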

Unfortunately, there’s no good way to have a “backup” domain registration, but we plan to switch very soon to a registrar that will better respect our business interests.

3/10 Elevated Error Rates

Over the past few months, many of our users have noticed “waiting for available socket” slowdowns in Chrome. These slowdowns were due to Chrome’s per-host connection limit colliding with our long-polling updates. We’ve been gradually migrating our live-updating UI from polling, to push over long-poll, and finally to WebSockets, which will eliminate those available-socket delays entirely. The last component in the WebSockets conversion was rolling out a web front-end that natively supports WebSockets – specifically, switching from Apache to nginx. After a few weeks of soak testing convinced us that the switch could be made seamlessly, we began the production nginx rollout on Wednesday 3/5, and it held up firmly under peak load on Thursday 3/6 and Friday 3/7. We declared success.
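
To make “natively supports WebSockets” concrete: the front-end has to pass the HTTP Upgrade handshake through to the application rather than swallowing it. Below is a small Python sketch of the kind of end-to-end probe that’s useful when soak testing a front-end change like this – the host and path are hypothetical, and this is not our production test suite:

```
#!/usr/bin/env python3
"""Probe whether a front-end passes the WebSocket (RFC 6455) opening
handshake through to the backend: send an Upgrade request and expect
an HTTP 101 Switching Protocols response."""
import base64
import os
import socket
import ssl

HOST = "ci.example.com"  # hypothetical front-end host
PATH = "/live"           # hypothetical live-update endpoint


def websocket_upgrade_accepted(host, path):
    """Return True if the server answers the opening handshake with a 101."""
    key = base64.b64encode(os.urandom(16)).decode()
    request = (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Upgrade: websocket\r\n"
        "Connection: Upgrade\r\n"
        f"Sec-WebSocket-Key: {key}\r\n"
        "Sec-WebSocket-Version: 13\r\n"
        "\r\n"
    )
    raw = socket.create_connection((host, 443), timeout=10)
    with ssl.create_default_context().wrap_socket(raw, server_hostname=host) as tls:
        tls.sendall(request.encode())
        # The status line fits comfortably in the first read.
        status_line = tls.recv(1024).split(b"\r\n", 1)[0]
    parts = status_line.split()
    return len(parts) >= 2 and parts[1] == b"101"


if __name__ == "__main__":
    ok = websocket_upgrade_accepted(HOST, PATH)
    print("upgrade accepted" if ok else "upgrade refused or proxied away")
```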

Monday 3/10, our alerts lit up.

It was immediately obvious from New Relic reporting that something was seriously wrong and that 3/10 traffic looked very different from Thursday 3/6’s. Monday is not normally a peak traffic day for us, yet we were seeing huge front-end queue delays and big spikes of 503 errors, which unfortunately manifested to our users as error builds. We eventually traced the problem to a combination of nginx and Passenger configuration settings. We prepared a new configuration, ran high-volume load tests against it overnight, and have since seen service quality return to more acceptable levels. We are still seeing intermittent recurrences, and we continue to tune.
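
For those wondering what a load test like that actually measures, the essential part is simple: drive the candidate configuration with concurrent requests and watch the 5xx and connection-failure rates. Here is a deliberately simplified Python sketch – the URL, request count, and concurrency are placeholders, and our real runs replay production-shaped traffic at far higher volume:

```
#!/usr/bin/env python3
"""Tiny concurrent load driver: issue many requests against a candidate
front-end configuration and report the status-code breakdown, especially
the 5xx / connection-failure rate."""
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import urllib.error
import urllib.request

URL = "https://staging.example.com/health"  # placeholder endpoint
REQUESTS = 2000
CONCURRENCY = 50


def hit(_request_number):
    """Issue one request and return its HTTP status (0 = connection failure)."""
    try:
        with urllib.request.urlopen(URL, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # 4xx/5xx responses still carry a code
        return err.code
    except Exception:
        return 0


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
        statuses = Counter(pool.map(hit, range(REQUESTS)))
    failures = sum(n for code, n in statuses.items() if code >= 500 or code == 0)
    print("status breakdown:", dict(statuses))
    print("5xx / connection-failure rate: %.2f%%" % (100.0 * failures / REQUESTS))
```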

3/11 GitHub Gets DDoSed

Tuesday afternoon, GitHub experienced major connectivity issues, which they have announced were due to a DDoS attack. We are hugely thankful for, and respectful of, their quick response under fire. On our servers, the attack left a lingering shadow of git processes stuck on dead TCP connections, along with webhooks and commit status updates that never made it to their rightful destinations. We cleaned these up as they appeared over the course of the afternoon, and we finally had clean queues around 9pm PT.
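
The process cleanup itself is mundane but worth sketching: find git operations that have been running far longer than any healthy fetch or push should, and terminate them so their slots free up. A hedged Python sketch follows – the 30-minute threshold and the kill policy are assumptions for illustration, not our exact tooling:

```
#!/usr/bin/env python3
"""Find git processes that have been alive longer than any healthy fetch or
push should be (e.g. stuck on dead TCP connections) and send them SIGTERM.
Must run with enough privilege to signal the worker processes."""
import os
import signal
import subprocess

MAX_AGE_SECONDS = 30 * 60  # assumed threshold; tune to your longest honest clone


def stuck_git_pids(max_age):
    """Return PIDs of git processes older than `max_age` seconds."""
    # etimes = elapsed seconds since process start (procps ps on Linux).
    out = subprocess.run(
        ["ps", "-eo", "pid,etimes,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    pids = []
    for line in out.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) != 3:
            continue
        pid, etimes, comm = fields
        if (comm == "git" or comm.startswith("git-")) and int(etimes) > max_age:
            pids.append(int(pid))
    return pids


if __name__ == "__main__":
    for pid in stuck_git_pids(MAX_AGE_SECONDS):
        print("terminating stuck git process", pid)
        os.kill(pid, signal.SIGTERM)
```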

Conclusions and Next Steps

Our next steps involve hardening our DNS infrastructure, completing our nginx tuning, and productizing the admin UIs that display the external webhooks we’ve received, to make debugging easier. We also continue to develop our internal monitoring systems, and we’re scoping out production canary servers for updates to low-level infrastructure components like nginx.

We’d like to extend special thanks to our partner New Relic for the invaluable insight their monitoring has provided us while debugging these interruptions.

We strive to provide a stable, trustworthy platform on which our customers can build and test great software, faster than ever, and we want to thank all of you for your patience and understanding while we weather these service interruptions. We will push to improve wherever we can, and we welcome your feedback at support@solanolabs.com.

The Solano Labs Team
