Lessons from the Lunar Landing — Resilience and redundancy on the way to the Moon — Part X

Robert Barron
6 min read · Apr 11, 2020

Between 1968 and 1972 NASA sent nine Apollo missions to the Moon. Reaching the Moon was an effort that required a decade of work by 400,000 people, billions of dollars and an incalculable number of moving parts. Of the nine missions to the Moon, eight were spectacular successes and one was a spectacular near-disaster.

Apollo 13 launched on April 11th, 1970 and was meant to be the first of the Apollo missions dedicated to exploring the Moon, after Apollo 11 had made the first landing and Apollo 12 had improved on it with a pinpoint landing.

On April 13th, a little over 2 days after launch, one of the two large oxygen tanks in the Service Module component of the spacecraft exploded, crippling the spacecraft on the way to the Moon. Over the next days, NASA’s engineers worked feverishly together with the astronauts to overcome the seemingly insurmountable problems and brought them back to Earth — safe and sound.

When looking at Apollo 13, its problems and their solutions, what stands out is not how much the astronauts, engineers, and managers improvised to solve unexpected problems but rather the reverse: how readily their existing procedures could be adapted to the unexpected.

Aptly, Apollo 13’s command module was named Odyssey, meaning a long voyage usually marked by many changes of fortune.

Even before Odyssey began its odyssey, it had an unusual start when it became the first mission where the flight crew was disrupted just before launch.

Flying a spacecraft is a complicated activity. So many things happen simultaneously that there are more buttons to press and procedures to follow than a single person can deal with at any one time. While Mercury, the first spacecraft, had been simple enough for one astronaut to handle, Apollo was a much larger and more complex beast.

The three components of the Apollo spacecraft: the Command Module (CM) with the astronauts, the Service Module (SM) with the supplies and main engine for the flight to the Moon, and the Lunar Module (LM) for the landing itself. During the launch, the LM was shielded within the Saturn V rocket. (NASA)

Instead of expecting a single astronaut to control the spacecraft from beginning to end, the work was divided between three astronauts: the Commander, the Command Module Pilot and the Lunar Module Pilot. Note that no astronaut is a mere co-pilot ;)

Now, each astronaut was able to specialize in their specific part of the mission (while remaining competent in other parts too), but astronauts were also able to support each other. After the lightning strike which crippled Apollo 12 at launch, the astronaut who flipped the “SCE switch to Aux” in the Command Module was Alan Bean, despite being Lunar Module pilot, because he had the easiest access to the critical switch.

Taking the idea a step further, in addition to having the three astronauts support each other during the flight, NASA also designated a “backup astronaut” for each one. The backup astronaut underwent nearly the same amount of training as the astronaut designated to fly and was sent to represent him at planning meetings (always fun!). Like an understudy, the backup astronaut was available to replace the prime astronaut at a moment’s notice, but nothing short of a crippling injury would ever make an astronaut give up his flight. While a few astronauts had been forced to cancel their flights and allow their backups to fly, this had always been the result of serious conditions (Deke Slayton had heart arrhythmia and Michael Collins had spinal surgery).

The crew of Apollo 13: Lovell, Swigert, Haise.

In the case of the ill-fated Apollo 13, Command Module pilot Ken Mattingly had been inadvertently exposed to Rubella (German Measles) just before the flight and was removed from the flight for medical reasons — despite the usually mild effects of the disease, no doctor was going to take a chance on some exotic and unexpected side effect while the astronauts were about to land on the Moon!

Just three days before the scheduled launch, flight commander James Lovell and Lunar Module pilot Fred Haise set out for a last-minute training regimen with backup Command Module pilot Jack Swigert. One of the few inaccuracies of the 1995 movie Apollo 13 was its suggestion that the backup astronaut was less capable than the prime astronauts. The purpose of the last-minute training was not to check whether Swigert knew how to fly the spacecraft (which he unarguably did) but to see how the entire crew functioned together as a unit.

The last-minute changes in the Apollo 13 crew and the way the crew functioned together are examples of the way the astronauts themselves were part of the Resilience and Reliability of the mission.

As explained in the IBM Garage build for reliability article, one must often trade cost for reliability. Training six astronauts instead of three takes more time, money and other resources, but having backups available means that you can recover when the unexpected occurs.
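The trade-off can be made concrete with a little arithmetic. The following is only a minimal sketch, assuming every redundant copy fails independently and with the same probability (real systems only approximate this). The function name and the 1% failure rate are hypothetical, chosen for illustration, and are not taken from NASA or IBM figures.

```python
def failure_probability(p_single, copies):
    """Probability that every one of `copies` redundant components fails.

    Assumes independent failures with identical probability p_single,
    which is a simplification: shared causes can take out all copies at once.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return p_single ** copies


if __name__ == "__main__":
    p = 0.01  # illustrative only: 1% chance that a single component is unavailable
    for copies in (1, 2, 3):
        combined = failure_probability(p, copies)
        # Cost grows linearly with the number of copies, while the chance of
        # losing all of them shrinks geometrically (under the assumptions above).
        print(f"{copies} copies: cost x{copies}, failure probability {combined:.6f}")
```

Under these assumptions each extra copy multiplies the cost by a constant but divides the chance of total failure by a large factor, which is why paying for a backup is often worth it.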

While the flesh-and-blood astronauts were the most critical component of the flight (the whole point of the flight was for a man, not a machine, to walk on the Moon), the entire 110 meter (363 foot) stack was built out of millions upon millions of highly reliable engines, pipes, connectors, switches, pumps, gauges, valves, computer chips and more.

The second stage of the Saturn V rocket. Note the five J-2 engines, which supply a total of 1,150,000 pounds of thrust. (NASA)

During the first few minutes of flight, a failure in one of the second stage engines caused the rocket to gyrate wildly, and the wayward engine was shut down seconds before the flight would have been aborted.
As it happens, the engines had been designed with these types of failures in mind and could “pick up the slack”.

The remaining four healthy engines continued firing for longer than planned and made up for the defective engine.

Not even ten minutes into its flight, Apollo 13 had validated the engineering practices of building reliable components by having backups for everything and anything — both man and machine.

In the modern development of reliable software services, we use many patterns and techniques to achieve the reliability we require. While there are many similarities between the requirements of getting a man on the Moon and reaching your chosen website, software development leads to many abstractions that are not relevant for a flight in space. For example, if there’s a temporary failure between your local phone or laptop and the server you’re trying to reach then the local application can “invisibly” retry transient failures until it succeeds or decides that the failure is critical. With any luck, you won’t even notice this issue beyond a very temporary delay in bringing up the screen.
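The “invisible” retry described above can be sketched in a few lines. This is a minimal illustration rather than a production client: `TransientError`, `call_with_retry` and `make_flaky_fetch` are hypothetical names invented for the example, and the choice of exponential backoff with a little jitter is an assumption about how one might space out the retries.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Call operation(), retrying transient failures with exponential backoff.

    Any other exception is treated as critical and propagates immediately.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of retries: the failure is now treated as critical
            # Wait base_delay, then 2x, then 4x, ... plus a little jitter so
            # that many clients do not all retry at the same instant.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 10))


def make_flaky_fetch(failures_before_success):
    """Build a fake request that fails a fixed number of times, then succeeds."""
    remaining = {"failures": failures_before_success}

    def fetch():
        if remaining["failures"] > 0:
            remaining["failures"] -= 1
            raise TransientError("connection dropped")
        return "page contents"

    return fetch


if __name__ == "__main__":
    # Two dropped connections are absorbed; the caller only sees a short delay.
    print(call_with_retry(make_flaky_fetch(2)))
```

The caller of `call_with_retry` never sees the first two failures. Only when the retries run out does the error surface, which is the point at which the application has decided that the failure is critical.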

Now, having overcome Rubella before the flight even began and a failed engine during launch, the Apollo 13 astronauts and the NASA engineers in Houston could relax and enjoy a routine flight to the Moon, couldn’t they?

What else could have gone wrong?

April is the 50th anniversary of the Apollo 13 mission. Follow me here or on Twitter at @flyingbarron for more lessons from that mission.
In the meantime, thanks to @Ben Feist you can follow along with the mission in real time here: https://apolloinrealtime.org/13/.

50 years after Apollo 13, designing your applications for reliability doesn’t have to be particularly complex, but it should be done ahead of time, because adding reliability after the fact is much more complicated. If you are interested in learning more, check the IBM Garage Method for Cloud site for a new article on our practices.

For future lessons and articles, follow me here as Robert Barron, on Twitter @flyingbarron or on LinkedIn.

