4th of July Fireworks — A Balanced Action Plan.
Shuttle to SRE —STS Lesson 2
This week, my friends and colleagues in the United States are celebrating their Independence Day and I am reminded of the time in 2006 when I watched the greatest 4th of July fireworks display of all time — the launch of Shuttle Discovery on the mission STS-121.
While the shuttle launched a few hours after dawn’s early light, the rocket’s red glare was visible for miles, and the thunder of the engines was louder than bombs bursting in the air. A most impressive sight which I consider myself fortunate to have experienced.
STS-121 was only the 2nd flight after the 2003 Columbia disaster, where the shuttle disintegrated upon reentry into the Earth’s atmosphere as a result of the heat shield being damaged by pieces of foam hitting it during launch.
After NASA spent over 2 years researching, analyzing, and fixing the problems which destroyed Columbia, Space Shuttle Discovery flew the first “return to flight” mission (STS-114) in July of 2005.
In order to guarantee that a disaster such as Columbia’s would not recur, a number of action items were defined as safety measures for all future flights. An abbreviated list includes:
- Since the damage to Columbia was not verified (though it was suspected) during the flight, no actions were performed to try to resolve the issue early. A new mandatory procedure was added to every shuttle flight: inspecting the underside of the shuttle with a camera.
This added capability to the shuttle’s observability.
- Even if the damage to Columbia had been detected early, there were no procedures in place to fix the broken heat shield. During the hiatus from shuttle flights, new equipment was developed and new procedures tested to perform in-flight repairs.
This added runbooks and operational procedures to improve the reliability of the shuttle.
- To guarantee the crew’s survival, even if repairing the shuttle proved to be impossible, nearly every shuttle launch after Columbia docked with the International Space Station (ISS) so that the crew would have a safe haven and could survive until a rescue shuttle was launched to retrieve them. By itself, the shuttle could carry enough supplies to keep the crew alive for about 2 weeks, while they could stay in the ISS indefinitely.
This added contingency plans in case all other options failed.
- The so-called “root cause” of the disaster was insulation foam breaking off the orange external fuel tank and striking Columbia’s wing. The foam was reformulated and new procedures were defined for applying it, to make sure it would not break off.
This reduced the technical debt (the use of insulation foam had been a way of reducing the weight of the external fuel tank) and, while the other items in this list were designed to solve problems as swiftly as possible, this item acted to prevent them in the first place.
The combination of all four types of solutions (added observability, additional runbooks, new contingency plans, and changes to reduce technical debt) demonstrates the type of balanced action items which SREs plan after post-incident analyses. Each type of solution improves the reliability of the system, and the synergy between them means that if one solution fails, the others will be able to “pick up the slack” and make sure that the problem does not cause actual damage again.
And this kind of synergy was required: the very first test flight which used these new safety measures, Discovery (STS-114) in July 2005, demonstrated their complementary function.
Despite all the effort spent on preventing the problem, pieces of foam yet again detached from the external fuel tank!
Thanks to the monitoring and observability capabilities added by the other action items, it was quickly determined that the damage to the shuttle’s heat shield was minimal, the astronauts were safe, and there was no need to either repair the shuttle or use the space station as a safe harbor. The mission could continue as normal and the astronauts could perform all the safety tests planned for the first “return to flight” mission after the Columbia disaster.
It was, however, highly frustrating for the engineers to discover that many of their theories regarding the cause of the foam breaking off the external fuel tank were incorrect — it was back to the drawing board for more investigation!
About a year later, Discovery launched on the 2nd return to flight test mission with an additional battery of fixes, tests and action items designed to maximize the safety of craft and crew.
On the 4th of July, Discovery’s second test of the lessons learned after Columbia passed with flying colours and the shuttle fleet went on successfully flying, constructing and supplying the International Space Station, repairing the Hubble Space Telescope, and more until it was finally retired just 5 years later in 2011.
Despite all the reliability additions and successful fixes, the Challenger and Columbia disasters had ultimately demonstrated the inherent safety and reliability flaws in the space shuttle.
Site Reliability Engineers, DevOps Engineers, Operations and Sysadmin engineers all perform post-mortems or post-incident analysis sessions after significant issues are found in the environments under their responsibility.
In IBM’s Garage Methodology, the action items which come out of these sessions are divided into four classes:
- Detection: Improve the monitoring and instrumentation components to detect the issue faster.
- Investigation: Provide improvements to isolate and diagnose issues faster.
- Correction: Provide improvements to correct malfunctions faster.
- Prevention: Improve the underlying application code, architecture, or both.
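As a rough illustration, the four classes above can be used to check whether a post-incident action plan is balanced. The sketch below is only one possible way to do this: the `ActionClass`, `ActionItem`, and `missing_classes` names are hypothetical, and the sample plan is an invented software incident rather than anything taken from the Garage Methodology or from NASA.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ActionClass(Enum):
    """The four classes of action items listed above."""
    DETECTION = "detection"
    INVESTIGATION = "investigation"
    CORRECTION = "correction"
    PREVENTION = "prevention"


@dataclass(frozen=True)
class ActionItem:
    description: str
    action_class: ActionClass


def missing_classes(plan):
    """Return the classes that no action item in the plan covers."""
    covered = Counter(item.action_class for item in plan)
    return [cls for cls in ActionClass if covered[cls] == 0]


# Hypothetical plan (an assumption for illustration): heavy on prevention,
# with nothing that helps diagnose or correct the next failure.
plan = [
    ActionItem("Alert when the job queue depth exceeds a threshold", ActionClass.DETECTION),
    ActionItem("Rewrite the retry logic in the worker", ActionClass.PREVENTION),
    ActionItem("Add a circuit breaker around the database client", ActionClass.PREVENTION),
]

for cls in missing_classes(plan):
    print(f"No action item covers: {cls.value}")
```

A plan which leaves a class uncovered is not necessarily wrong, but a gap like this is a useful prompt to ask what happens if the prevention items turn out to be flawed.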
This article has demonstrated the importance of a balanced action plan, containing tasks which have different approaches to solving the issue. This ensures that even if some of the action items are flawed and do not achieve their goals, the resulting solution still improves the reliability of the system overall.
For future lessons and articles, follow me here as Robert Barron or as @flyingbarron on Twitter and LinkedIn.
Learn more at www.ibm.com/garage