The New Year is already off to a great start here at Solano Labs, with new features and product upgrades getting ready to roll out. With the start of the New Year, we also decided to look back at the year that was and ask as a company, “What have we learned?” and “What should our New Year’s resolutions be?”
2012 saw some high-profile successes and failures in the world of software. Some mistakes went unnoticed, but others were front-page news. Many cost time and money, and a few even destroyed entire companies! However small or large the screw-up, there was a common thread: in hindsight, these defects could have been identified earlier, and prevented from reaching users, with more automated validation!
What follows are a few of the bugs and outages that we found most interesting. For each story, one of our engineers shared his thoughts on the matter. Many of us got sidetracked while researching the outages, as they often offer a fascinating look inside the affected businesses. A good jumping-off point for your own exploration of software screw-ups in 2012 is the ChannelBiz Top Ten List for 2012 Software Blunders; you can check it out here.
A bug in a newly rolled-out update to load balancing software caused an error in the interpretation of unavailable data centers. The result was an 18-minute outage during which 8% to 40% of Gmail users experienced slow performance, timeouts, or errors.
Nobody would even think of pushing new code without testing it first, and probably also doing a staged rollout to catch bugs that only show up in production. But configuration files and other small data aren’t usually given the same consideration, even though bugs in them can have just as devastating consequences. The small size of the data and ease of checking it by eye can give you a false sense of security.
Two practices can help mitigate risks. First, create a verifier for your complex configuration files that checks syntactic correctness and, more importantly, presents the user with a semantic delta from the previously deployed version. This may be difficult depending on the meaning of your configuration, but even a very rough attempt will make unintended changes easy to catch. Second, stage deployment of configuration changes just like code changes and carefully monitor instances that have the new configuration for unexpected behavior.
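To make the first practice concrete, here is a minimal sketch of such a verifier. It assumes the configuration is JSON, which may not match your setup, and the names `flatten` and `semantic_delta` are our own inventions for illustration. Parsing catches syntax errors, and the delta reports which settings were added, removed, or changed relative to the previously deployed version.

```python
import json


def flatten(config, prefix=""):
    """Flatten nested dicts into {"a.b.c": value} pairs so they compare easily."""
    flat = {}
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict) and value:
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def semantic_delta(old_text, new_text):
    """Return the settings added, removed, or changed between two JSON configs.

    json.loads raises ValueError on malformed input, which doubles as the
    syntactic check.
    """
    old = flatten(json.loads(old_text))
    new = flatten(json.loads(new_text))
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": {k: old[k] for k in old.keys() - new.keys()},
        "changed": {
            k: (old[k], new[k])
            for k in old.keys() & new.keys()
            if old[k] != new[k]
        },
    }


if __name__ == "__main__":
    # Hypothetical load balancer settings, purely for illustration.
    deployed = '{"lb": {"timeout_ms": 500, "pools": {"us-east": "up"}}}'
    proposed = '{"lb": {"timeout_ms": 50, "pools": {"us-east": "up", "eu-west": "down"}}}'
    for kind, entries in semantic_delta(deployed, proposed).items():
        print(kind, entries)
```

A delta keyed by setting name ignores whitespace and key order, so a reviewer sees only the settings whose meaning changed. A real verifier would also need domain-specific checks that this sketch leaves out.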
Errors in Nasdaq’s computer systems caused delays and mishandling of orders during the start of the Facebook IPO.
Web 2.0 companies usually treat past failures as bygones: the bugs of yesterday are replaced by today’s hot new successes. Facebook wasn’t so lucky, as software glitches in Nasdaq’s touted OMX trading platform “engulfed” its IPO and created confusion that investors haven’t yet forgiven. The series of errors in May cost Facebook’s underwriters an estimated $115 million. According to analysts, Nasdaq’s servers slowed down under heavy load from a 3ms response time to 5ms, and failed to establish an opening price. This caused a 2+ hour window where trades languished unconfirmed (“going into a black hole”) or were lost entirely because of a data rollback. You may think you understand your app’s hotspots, but when all 1 billion of Facebook’s enthusiastic users decide today is the day they want to use your trading platform, you may well be proven wrong. I guess that’s a success problem? The lesson: anticipate and test for extreme load conditions before the thundering herd arrives.
After his Ruby on Rails bug report was ignored, software developer Egor Homakov used the bug to hack into Github! Although unlawful, the exploit brought necessary attention to the bug.
In March, a Github user exploited a mass assignment vulnerability in Github to add new authorized public keys to the Ruby on Rails account and push a change as a proof of concept. Github reacted swiftly to close the security vulnerability and publicized the details in this blog post. Github is to be commended both for closing the hole quickly and for providing a detailed description of the problem. Github is a high-profile website, particularly in the Ruby community, so the incident also brought a prevalent problem to the community’s attention.
The Ruby on Rails approach to using convention over configuration is a large part of what has made it a popular platform for building new web applications. Convention is a powerful way to promote collaboration and productivity within a software development organization – but it can also lead to severe bugs if programmers aren’t cognizant of the implications of the conventions. In the case of Rails, mass assignment makes it easy to map form input data sent by a user’s web browser into a convenient data representation for the application. When combined with an ORM such as ActiveRecord, however, it is all too easy for unsafe updates to slip into the database, for instance updating an access control list, and granting a user unwarranted privileges. Careful validation of user-supplied data is necessary for security and often requires more domain knowledge than simple convention provides.
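The risk is easy to see in miniature. The sketch below is plain Python rather than Rails, and the `User` class, the `PERMITTED` set, and both update functions are hypothetical names of ours, not anything from Github’s or Rails’ code. It contrasts copying every submitted field onto a record with copying only an explicit allow-list.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str = ""
    email: str = ""
    is_admin: bool = False
    public_keys: list = field(default_factory=list)


# Fields a user may change through the (hypothetical) profile form.
PERMITTED = {"name", "email"}


def unsafe_update(user, params):
    """Mass assignment: every submitted key is written to the record."""
    for key, value in params.items():
        setattr(user, key, value)


def safe_update(user, params):
    """Write only allow-listed keys and return the sorted keys that were refused."""
    rejected = []
    for key, value in params.items():
        if key in PERMITTED:
            setattr(user, key, value)
        else:
            rejected.append(key)
    return sorted(rejected)


if __name__ == "__main__":
    # A forged form submission that smuggles in privileged fields.
    forged = {"name": "Mallory", "is_admin": True, "public_keys": ["ssh-rsa AAAA"]}
    victim = User(name="Alice")
    print("refused:", safe_update(victim, forged))
    print(victim)
```

Deciding what belongs in the allow-list is exactly the domain knowledge that convention cannot supply; returning the refused keys also gives you something to log when someone probes the form.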
An overload at the call centers created heavy confusion when changes went live this year.
United Airlines and Continental Airlines merged in 2010 to create one of the world’s largest airlines. Representatives of the combined entity extolled the virtues of the merger: greater ‘reach’ and convenience for customers, a more efficient business which would be positive for investors, and more. Now, merging two such large organizations takes time and effort, and sub-par planning can have a large negative impact. This has been very evident to the airline’s customers. Due to problems in the merged reservation system, many have suffered through flight delays and long call center wait times. The airline apparently did a thorough job in merging the data from their two separate reservation systems (which are called Apollo and SHARES) into a single one (SHARES, which was chosen as the sole successor). However, they seem to have been less detail-oriented in their load testing and UI testing efforts. On Mar 3, 2012, SHARES went live as the sole reservation system for United.
Unfortunately, roughly half of the airline’s employees who used the system (at ticket counters, at airline gates, at call centers) were unfamiliar with the interface, which caused delays in boarding and in flight departures and arrivals. This in turn caused a 30% increase in calls to the call centers. The airline had planned for no more than a 10% increase. Queue times for customer calls jumped by 120%. Perhaps if the airline had planned and tested for a more extreme increase in traffic, SHARES could have more gracefully handled the increased load. The lesson is similar to that from NASDAQ’s problems with the Facebook IPO: one should certainly test for the thundering herd (apologies to Merrill Lynch for co-opting their tag line) and be extra paranoid when considering the possible size of the herd. Better to think too big than too small!
A $440+ million mistake occurred when a bug in a recently updated piece of trading software was let loose on the market for over 40 minutes. This eventually caused the collapse of the institutional financial giant.
Knight Trading is one of a number of brokerage houses that act as market makers and provide trade execution to other brokers on Wall Street. Knight specializes in small cap equities and is responsible for roughly 10% of US equities volume. On August 1st, an error in Knight’s trading platform resulted in a $457MM loss for the firm, threatening its viability as a going concern. Core infrastructure in Knight’s trading platform, not fancy quantitative trading algorithms, was responsible for wild swings in the prices of roughly 150 equities traded on the NYSE. Apparently, disused portions of Knight’s infrastructure were still running old, incompatible versions of their proprietary software. Talk about integration testing pain! Although conceptually simple, keeping accurate, auditable track of deployed software versions, and testing and tracking compatibility in a large deployment infrastructure, is no simple task. Failing to track and test the multiple versions of a software stack that may be deployed at the same time, whether intentionally during a rolling upgrade or inadvertently, can have serious real-world consequences for a business. Reportedly the direct trading loss was in excess of $200MM and Knight was required to pay a massive additional 5% risk premium to Goldman Sachs to exit the position, costing them a further $230+MM.
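Why does a modest miss in the traffic forecast hurt so much? A rough way to see it is the textbook Erlang C queueing model, sketched below. Everything here is an assumption of ours: the model itself (random arrivals, a fixed pool of agents, no callers hanging up) and the numbers (100 agents, 5-minute calls, 14 calls per minute at baseline) are illustrative only and are not United’s actual figures.

```python
def erlang_c(agents, offered_load):
    """Probability that an arriving call has to wait (Erlang C formula).

    offered_load is arrival rate divided by per-agent service rate.
    """
    if offered_load >= agents:
        raise ValueError("offered load must be below the number of agents")
    term = 1.0   # a^k / k!, starting at k = 0
    total = 1.0  # running sum of a^k / k! for k = 0 .. agents - 1
    for k in range(1, agents):
        term *= offered_load / k
        total += term
    last = term * offered_load / agents              # a^c / c!
    queued = last / (1.0 - offered_load / agents)    # weight of the "all busy" states
    return queued / (total + queued)


def mean_wait(arrival_rate, service_rate, agents):
    """Average time a call spends in the queue, in the time unit of the rates."""
    load = arrival_rate / service_rate
    return erlang_c(agents, load) / (agents * service_rate - arrival_rate)


if __name__ == "__main__":
    AGENTS = 100            # hypothetical staffing level
    SERVICE_RATE = 1 / 5    # calls per minute per agent (5-minute calls)
    BASELINE = 14.0         # hypothetical calls per minute before the cutover
    for growth in (1.0, 1.1, 1.3):
        wait = mean_wait(BASELINE * growth, SERVICE_RATE, AGENTS)
        print(f"traffic x{growth:.1f}: mean wait {wait:.4f} minutes")
```

In this model the wait grows much faster than the traffic once agents approach full utilization, which is why the gap between a planned 10% increase and an actual 30% one matters far more than the raw percentages suggest.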
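Even a crude audit helps here. The following is a minimal sketch that assumes you can collect a map from host name to deployed version; the function name, host names, and version strings are all made up for illustration and say nothing about how Knight’s systems were actually organized. It groups hosts by version and reports any host running something outside the approved set.

```python
from collections import defaultdict


def audit_versions(inventory, allowed):
    """Group hosts by deployed version and return those on unapproved versions.

    inventory maps host name -> version string; allowed is the set of versions
    expected in production (more than one during a rolling upgrade).
    """
    by_version = defaultdict(list)
    for host, version in sorted(inventory.items()):
        by_version[version].append(host)
    return {
        version: hosts
        for version, hosts in by_version.items()
        if version not in allowed
    }


if __name__ == "__main__":
    # Hypothetical inventory: one forgotten server still runs an old build.
    inventory = {
        "gateway-1": "2.4.0",
        "gateway-2": "2.4.0",
        "gateway-8": "1.9.3",
    }
    stale = audit_versions(inventory, allowed={"2.4.0"})
    for version, hosts in stale.items():
        print(f"unapproved version {version} on: {', '.join(hosts)}")
```

Running a check like this as a gate before traffic is enabled turns “disused” servers from a silent assumption into an explicit failure. The harder part, which this sketch does not address, is keeping the inventory itself accurate.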
We can all sympathize with and respect the efforts of engineering and ops teams worldwide to keep the computer systems we rely on running. We here at Solano Labs are no strangers to critical bugs and performance fire drills – in our current work (Tddium had an unexpected downtime in late December 2012) and over our years of experience. But we can all learn from these mistakes, and apply the lessons to our own practice of building, testing and releasing software. Some common themes emerge:
- Small-scale correctness is necessary, but not sufficient. Poor performance under load can turn quickly into incorrect behavior. Especially in a distributed system, retroactive fixes can be too late.
- Similarly, retrofitting security is risky. Nonetheless, plan for security as a war, not a battle. New vulnerabilities and attackers will arise, and it’s critical to be responsive (the Rails core team has done a great job of this!) and to be self-aware about the risks to your business and open source communities.
- Be bullish about traffic growth unless your system has a natural rate limit. Planning a large launch or an event with a large audience? What happens if they all show up? Even for a closed system, getting scaling right can be a multi-month proposition, so start now!
- Configuration and deployment processes are just as crucial as code – make sure they are tested and validated with the same rigor.
As 2013 innovation starts, let’s raise a toast to learning our lessons and to our New Year’s resolution: keeping those bugs where they belong! We’re looking forward to another year of helping our customers build great software, and to an open discussion of ways we can make that easier. We’d love to hear your thoughts.
Happy New Year to all from the Team at Solano Labs!