I recently read an interesting article describing how outages in Amazon’s cloud service, triggered by violent storms in the Eastern United States, were significantly aggravated and prolonged by hidden bugs in their software that were exposed only when they tried to recover from the initial storm-caused outages. I’m not going to take this as an opportunity to get up on a soapbox and preach about how ‘better testing could have prevented these problems’.
First off, I have no detailed knowledge of Amazon Web Services’ current testing and QA environments. However, given the mission criticality of this arm of their business, their leadership position in cloud service provision, and the reputation risk associated with service failures, I’m inclined to give them the benefit of the doubt and assume that they do fairly rigorous testing of their software, and of their disaster recovery procedures and environments.
Given that, the take-away for me, the thought-provoking element of the article, is just how DIFFICULT it is to build highly robust, fault-tolerant software in today’s increasingly complex and interdependent technology environment. The circumstances that exposed the bugs in Amazon’s failover and recovery management software were apparently the result of an extraordinary series of coincidental events that would have been extremely difficult to anticipate and test for. Yet the end result of the outages was significant cost to Amazon and its customers, and likely a significant reduction in customers’ faith in the reliability and robustness of its cloud services.
So what are we to take away from this?
If we accept the hypothesis that Amazon is in fact diligent in stress-testing its cloud services, and that in spite of this the likelihood of testing the actual series of events that eventually caused the recent problems was astronomically small, are we to conclude that the situation is hopeless? That no matter what we do, software will always be failure-prone? And if that’s the case, is there any point to it all?
I think it’s fair, and completely realistic, to expect that software will always have bugs and that there will always be ways to make it fail. The increasing complexity of the software we use, and the diversity of the environments in which it’s required to operate, all but guarantee that unforeseen circumstances will arise that expose flaws in even the most rigorously tested software. That being said, the enormous potential costs of these failures demonstrate that it is well worth our while to invest heavily in quality assurance and monitoring programs, so that such events are as rare as possible.
This includes standard measures such as maintaining a robust, continuously monitored testing program, and taking advantage of the many automation tools that can analyze code and detect potential problems early in the development process, before they surface in production.
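One concrete practice in this vein is deliberate fault injection: rather than only testing the happy path, a test harness kills components on purpose and checks that the recovery logic holds up. As a minimal sketch (the `ReplicaSet` model and node names here are hypothetical illustrations, not Amazon’s actual systems or tooling):

```python
import random

class ReplicaSet:
    """Toy model of a service with one primary and several standby replicas."""
    def __init__(self, replicas):
        self.replicas = list(replicas)
        self.primary = self.replicas[0]

    def fail(self, node):
        # Simulate a node dropping out (e.g. a storm-related power loss).
        self.replicas.remove(node)
        if node == self.primary:
            self.promote()

    def promote(self):
        # Recovery logic under test: promote a surviving standby to primary.
        if not self.replicas:
            raise RuntimeError("total outage: no standby left to promote")
        self.primary = self.replicas[0]

def test_cascading_failures(seed=0, nodes=5, failures=3):
    """Fault-injection test: kill several nodes in a pseudo-random order
    and assert that a valid primary survives after every failure."""
    rng = random.Random(seed)
    cluster = ReplicaSet([f"node-{i}" for i in range(nodes)])
    for _ in range(failures):
        victim = rng.choice(cluster.replicas)
        cluster.fail(victim)
        # The invariant we care about: the primary is always a live node.
        assert cluster.primary in cluster.replicas
    return cluster.primary

print(test_cascading_failures())
```

The value of a test like this is not any single run but running it across many random seeds, so the failure *orderings* that no one thought to enumerate by hand still get exercised.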
As our reliance on software and technology grows, and our technological infrastructure becomes more interdependent, the global, systemic importance of the quality of those components grows with it. The more companies understand this, and invest in appropriate up-front risk-mitigation procedures to minimize software defects, the fewer painful object lessons we will have to learn in real time to drive this point home.
How do your own testing and QA practices measure up? Are you confident in the quality of the software your team or company develops? Are you sure it won’t be you in the firing line, diagnosing critical issues and scrambling for emergency fixes after the next big storm or power outage?