Lessons from the Lunar Landing — Day 2 Operations — Part XI

Robert Barron
7 min read · Jul 20, 2020

This week marks the 51st anniversary of the Apollo 11 moon landing and a full year since I started on this series of articles.

As with any event (or series of events) that takes place over a period of time, there are always periods that have more drama and generate more excitement than others. In every mission the Apollo astronauts spent over a week flying to the Moon and back, yet most people’s memories are composed of the few hours around launch, the Moon landing and splashing down in the ocean. Not only do they ignore the time during the mission between those highlights, but they are also unaware that leading up to these highlights were months and years of preparation.

In the same way, when we consider the operation and support of a software solution, there are often three “periods” or “types” of operational work —

  • Day 0 — The period of preparation leading up to the delivery of a new solution.
  • Day 1 — The launch of the new solution itself (this takes up the shortest period of time).
  • Day 2 — The period when the solution is running and needs to be supported.
Celebrating a “Day-1” type of event — The successful splashdown of Apollo 13! Flight Directors are from left to right: Gerald D. Griffin, Eugene F. Kranz, and Glynn S. Lunney. (copyright NASA)

As with a Moon mission, Day 0 and Day 1 are more dramatic and attention-grabbing than the routine Day 2. While Day 2 is a period of “business as usual”, it is periodically interrupted by a Day 1 highlight event such as a Moon landing or a new software version.

The table above shows operations on different day types, both for Spaceflight and for Software Development.

After Apollo 11, the most famous and noteworthy mission was Apollo 13, which suffered a catastrophic failure midway to the Moon, in the middle of a “Day 2” period that was expected to be routine and uneventful.

A few months ago, while marking the 50th anniversary of the Apollo 13 mission, Flight Director Gene Kranz spoke at the IBM Academies for Business Operations & Supply Chain online conference and had a few insights regarding the importance of preparation for Day 2 operations.¹

From the perspective of Flight Operations, the time when the spacecraft coasted to the Moon (in our parlance, day-2 operations) was less eventful and perhaps less stressful than the launch or a critical checkpoint during the mission (day-0/day-1 operations). It was nonetheless just as important to make sure the spacecraft and astronauts were fully functional and prepared for any eventuality at any moment.

The engineers in the Mission Operations Control Room (the MOCR or “Houston”) needed to know their systems inside and out, and to be able not only to predict how they would function when things were going well, but also to handle malfunctions and unexpected failures.

On the 13th of April 1970, the explosion which rocked the Apollo 13 Odyssey service module took place in the middle of a routine Day-2 operation: “stirring the oxygen tanks”. The spacecraft “died” halfway to the Moon. From that moment the routine became extraordinary as NASA engineers rushed to stabilize the spacecraft and save the astronauts. The Aquarius lunar module had to keep three men alive for the entire trip back to Earth rather than the planned two men for a much shorter jaunt to the lunar surface.

Don’t be constrained by precedent — Innovate, Experiment, Find the Limits.
It is not what [something] is designed to do; it is what it can do that is important in a crisis.

— Gene Kranz, Apollo Flight Director

While the dramatic highlights of Day-0 and Day-1 (Launches, Landings, Walking in space, or on the Moon) are the ones that garner the limelight, NASA’s engineers knew full well that planning and preparing for the day-to-day activities were just as critical for the success of the mission. They needed to know how the spacecraft would function in times of normal behavior and in times of crisis and be able to switch from normal (nominal in NASA-speak) activity to crisis solving and heroism in the blink of an eye.

And except in very specific cases (such as the Apollo 12 lightning strike), the heroism was not that of independent individuals, but of individuals who made up a team more powerful than the sum of its parts. In both the Apollo 11 computer crisis before landing and the lengthier Apollo 13 crisis, the issues were resolved successfully only because of teamwork and collaboration.

One of the most surprising aspects of the Apollo 13 emergency was not how unprepared NASA was but how well prepared it was!
NASA had already prepared a series of studies and plans which they used to resolve many of the individual issues and problems which arose during the flight. It was not any individual threat that proved to be such a danger to the astronauts but the fact that so many of them occurred at once.

The flight plan of Apollo 13, with the approximate location of the explosion (copyright NASA)

Between the explosion that crippled the spacecraft (and the announcement, “Houston, we’ve had a problem”²) about 56 hours after launch and the splashdown about 143 hours after launch, a span of roughly three and a half days, the astronauts and NASA needed to deal with:

  • Using the Lunar Module as a “lifeboat”, which had been practiced/planned for during the Apollo 10 mission.
  • Navigating without the computer, which had been practiced during the Gemini missions, the previous spacecraft program.
  • Maneuvering (course correction) without the computer, which had been practiced during the Apollo 8 mission.
  • Rapidly changing mission parameters, for which the IBM computers on the ground were well prepared, constantly running “what if” scenarios before and during flights.

Of course, the scenarios did not play out exactly as the engineers had prepared for them, and many of their preliminary plans needed to be adjusted significantly when they were deployed in reality. One of the most memorable parts of the 1995 Apollo 13 movie details the creation of a carbon-dioxide purifier by fitting a “square filter into a round hole”. While there was no pre-existing plan for this, the engineers still started their work with a deep understanding of how the air-purifying and filtering systems worked and how far they could jury-rig them before risking failure.

The knowledge instilled in the engineers was codified both by the training they performed before flights to prepare them for every eventuality and by the documentation which detailed every jot and tittle of the capabilities of the equipment.
Do you need to know how a battery will function when it’s been frozen and then thawed? You can check the documentation of that particular battery to see what kinds of stress tests it has been subjected to. Not enough detail? You have the contact information of the factory which made that component and access to the engineers who designed it in the first place.

Understanding the capabilities, the limits, and the possible behavior of the myriad systems which composed the spacecraft enabled the engineers to take them to their limits and beyond — both during “routine” flights and during emergencies. In truth, every flight had many anomalies and emergencies which threatened success.

The difference between a successful flight and a failed one was the ability of the astronauts and engineers to manage the anomalies.

Just as NASA’s engineers, flight controllers and astronauts needed to know the capabilities of the Apollo spacecraft and plan accordingly, modern operators and Site Reliability Engineers (SREs) need to have a good grasp of the Day 2 operations of their favored platform and plan accordingly.

The IBM Cloud Pak Playbook (https://cloudpak8s.io/) documents exactly this type of information for the Red Hat OpenShift Container Platform and the IBM Cloud Paks. It is not merely a reference list of capabilities but includes practical information and scenarios.
While it has a lot of Day 0 and Day 1 content, I’d like to focus your attention on the Day 2 chapters which will help manage the many operational aspects of your OpenShift cluster.

  1. To access the full recording of Kranz’s speech, please access the IBM website and register for his session (registration is free and will allow you to watch the replay).
  2. The line “Houston, we have a problem” was made famous when Tom Hanks said it in 1995. However, Lovell used the present perfect (“we’ve had”) in 1970.


For future lessons and articles, follow me here as Robert Barron, on Twitter @flyingbarron or LinkedIn.

Bring your plan to the IBM Garage.
Are you ready to learn more about Day 2 Operations?
We’re here to help. Contact us today to schedule a time to speak with a Garage expert about your next big idea. Learn about our IBM Garage Method, the design, development, and startup communities we work in, and the deep expertise and capabilities we bring to the table.

Schedule a no-charge session with the IBM Garage.
