Lightning strikes! — Lessons from the Apollo Lunar Landings — Part VI

Robert Barron
8 min read · Dec 5, 2019

Launch of Apollo 12, just before lightning strikes

While the spectacular achievement of landing a man on the Moon was the equivalent of “catching lightning in a bottle” on an unprecedented scale, no one expected Apollo 12, the very next mission, to be literally struck by lightning!

Apollo 12 launched on the 14th of November 1969 with the goal of repeating Apollo 11’s success and improving on it. The astronauts Charles Conrad and Alan Bean planned to stay on the Moon for longer than Neil Armstrong and Buzz Aldrin and, even more importantly, to have a less exciting landing.

The astronauts of Apollo 12 were going to improve on Apollo 11, whose goal had been to simply “land anywhere”, by making a pinpoint landing next to an old unmanned probe called Surveyor 3 that had reached the moon back in 1967. Only 2 years separated the robotic Surveyor from the manned Lunar Module lander, but this was generations in spaceflight technology.

However, before you reach the moon you need to take off from Earth. Like everything else in the Moon program, the launch was practiced again and again using ground-based simulations until the astronauts and flight controllers knew each step perfectly. Years before “Chaos Engineering” became an industry standard, simulation controllers challenged the engineers with simulated failures to test their procedures. Every time the Simulation Supervisor (SimSup) “won” it meant that the practice flight had failed. Every failure meant a lesson learned and one less potential surprise during an actual flight.

The collected lessons were called “flight rules”, and they defined the actions and responses for any possible event during the flight.

Set of flight rules

Despite the years and endless hours of tests and simulations, real life has a way of surprising you. There were multiple mission rules defining when the mighty Saturn V rocket could or could not launch — for example, launch could not take place if the winds were too strong for a stable flight or if the cloud cover was too thick for the telescopes and cameras to track the rocket.
However, a certain amount of cloud cover was considered fine, and the flight launched as planned on November 14th, 1969 at 11:22 in the morning, Florida time. As the 110 meter long metal tower, trailing an even longer exhaust plume of ionized gases, passed through the clouds, it became a flying lightning rod and, precisely 36.5 seconds after launch… lightning struck!

Lightning striking the launch tower, just after the rocket lifts off

The Apollo 12 spacecraft, nestled at the top of the rocket stack, bore the brunt of the electrical discharge, and the fuel cells powering the spacecraft overloaded. They were disconnected automatically, and the spacecraft continued flying “blind” with countless alarms flashing all across the astronauts’ control systems.

Back on Earth, flight controllers anxiously scanned their displays in an effort to figure out how to reinitialize the crippled spacecraft. Their training had equipped them with the skills to interpret the cryptic numbers transmitted by the spacecraft to their screens in Mission Control. If the numbers showing up were too high or too low, they knew what actions to take in order to rectify the situation. But after the lightning struck Apollo 12 the telemetry they saw was “not even wrong”. The numbers were gibberish and made no sense — as an illustration, a fuel tank might have been showing a negative amount of fuel or a compass an angle of over 360 degrees.
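The difference between "too high or too low" and "not even wrong" can be sketched in a few lines of Python. This is only an illustration, not how Mission Control's software worked: the `Channel` fields, the limit values and the 50% threshold in `telemetry_is_suspect` are all assumptions made up for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    NOMINAL = "nominal"
    OUT_OF_LIMITS = "out of limits"  # believable value, but someone must act
    IMPLAUSIBLE = "implausible"      # a value the real world cannot produce


@dataclass(frozen=True)
class Channel:
    """One telemetry channel (hypothetical structure for this sketch)."""
    name: str
    physical_min: float  # lowest value that is physically possible
    physical_max: float  # highest value that is physically possible
    limit_low: float     # operating limits covered by the "rules"
    limit_high: float


def classify(channel: Channel, value: float) -> Status:
    """Check plausibility first, and only then the operating limits."""
    if not (channel.physical_min <= value <= channel.physical_max):
        return Status.IMPLAUSIBLE
    if not (channel.limit_low <= value <= channel.limit_high):
        return Status.OUT_OF_LIMITS
    return Status.NOMINAL


def telemetry_is_suspect(statuses, threshold=0.5):
    """Guess that the measurement chain itself is broken when a large
    share of channels report impossible values (threshold is arbitrary)."""
    statuses = list(statuses)
    if not statuses:
        return False
    implausible = sum(1 for s in statuses if s is Status.IMPLAUSIBLE)
    return implausible / len(statuses) >= threshold
```

With a hypothetical `fuel = Channel("fuel_percent", 0, 100, 10, 100)`, a reading of 5 is `OUT_OF_LIMITS` (a real problem with a known response), while a reading of -5 is `IMPLAUSIBLE`. When most channels are implausible at once, the first thing to fix is the telemetry itself, not the components it describes.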

Without solving the problem of unreadable telemetry, there would be no way to resolve the underlying problems of the fuel cells and other components damaged by the lightning strike.

Of course, all this was happening not during a training simulation but during the launch of the second attempt to land on the Moon! Fortunately, the Saturn V rocket, whose IBM brain had not been disabled by the lightning, stayed on course. But unless the astronauts and controllers could solve the problem, it was lifting a dead spacecraft into orbit. The decision to separate the duties of the rocket and the spacecraft, as described in a previous article, had paid off. As long as the rocket was flying according to the flight plan, the mission was safe. But the rocket’s role ends once the spacecraft reaches orbit, and if the astronauts did not have full control by then, there would be no choice but to abort the mission.

Each controller in Mission Control had a distinct responsibility for a specific set of components: one was responsible for the on-board computer, another for the propulsion systems and so on. On shift in the role of Electrical, Environmental, and Consumables Manager (EECOM), responsible for the wellbeing of the electrical components, was John Aaron.

Apollo 12 EECOM (Electrical, Environmental, and Consumables Manager) John Aaron.

All eyes in Mission Control were on John Aaron… he had seconds to resolve the underlying problem of bad telemetry before all the others could even attempt to solve their problems.

Despite having nearly countless mission rules and instructions, there was nothing “in the book” that covered this eventuality... or was there?

John scanned the erratic data in front of his eyes and was reminded of a similar chaotic pattern which he had seen about a year earlier. During a standard mission simulation, a power failure in the Mission Control building had knocked out the terminals and the practice run was cancelled for the day. But instead of being blank, John’s terminal had shown the same “ratty data” that he was seeing after the lightning strike.

Curious, John tried to backtrack and trace why his terminal was showing the aberrant information. Truth be told, he was poking his nose into something that didn’t really involve him — he was a Flight Controller, responsible for the electrical wellbeing of the Apollo spacecraft, not the computers and other pieces of hardware that composed the simulation machines and certainly not the power supply of the Mission Control building. Fortunately, NASA operated on a principle of transparency within the organizations at Mission Control. There was no such thing as “none of your business” when your business is going to the moon! John collected the necessary experts and they took out the blueprints of the simulation to work out just why a power failure could make the consoles display illogical numbers.

They traced it down to a small switch-box which converted the signals coming from the analog sensors monitoring the “real world” into a standard set of voltages so that the digital computer could consume them. This small box was called the Signal Conditioning Equipment, or SCE for short (of course there’s an acronym). When the power failed, the SCE was knocked out. Even after power returned and the sensors were back to monitoring and measuring the simulated spacecraft, the SCE did not pass the correct voltages to the computer, so the consoles showed bad data. Resetting the SCE by shifting it into Auxiliary (or Aux) mode fixed the problem.
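To make the idea of signal conditioning concrete, here is a minimal Python sketch of what such a box does in principle: it maps each sensor's raw range onto one standard voltage range, and the ground converts that voltage back into engineering units. The 0 to 5 volt range, the function names and the example sensor are assumptions for illustration, not the actual SCE design.

```python
def condition(raw, raw_min, raw_max, out_min=0.0, out_max=5.0):
    """Linearly map a raw sensor reading onto a standard output range.

    The 0-5 V default is an assumed standard range for this sketch.
    """
    if raw_max == raw_min:
        raise ValueError("raw range must not be empty")
    fraction = (raw - raw_min) / (raw_max - raw_min)
    fraction = min(1.0, max(0.0, fraction))  # clamp to the valid range
    return out_min + fraction * (out_max - out_min)


def decode(volts, eng_min, eng_max, in_min=0.0, in_max=5.0):
    """Convert a standardized voltage back into engineering units.

    No clamping here: a voltage outside the standard range decodes to a
    value outside the engineering range, which is what "gibberish" looks
    like on a console.
    """
    fraction = (volts - in_min) / (in_max - in_min)
    return eng_min + fraction * (eng_max - eng_min)
```

For a hypothetical tank sensor producing 0 to 40 millivolts for 0 to 100 percent, `condition(20, 0, 40)` yields 2.5 volts and `decode(2.5, 0, 100)` yields 50 percent. If a misbehaving conditioner put out -1 volt instead, the same `decode` call would report a negative fuel quantity even though the sensor and the tank were fine.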

This obscure solution to an esoteric problem was documented twice — once deep in the thick binders of flight manuals and once in the brain of John Aaron.

A year later, while Apollo 12 was flying with a dead computer, John Aaron’s memory came to the rescue and he sent up the brief command to the astronauts: “Set SCE to Aux”. No one else in Mission Control had any idea what this meant. John did not have time to consult his support team of backroom experts, who were searching the mission rules to try to solve the problem.

SCE switch in the Apollo spacecraft

Aboard Apollo 12, astronaut Al Bean reached out and flipped a tiny switch, one of hundreds, nestled in a hard-to-reach corner of the console. Immediately the negative and impossible numbers on the controllers’ consoles changed to the actual metrics they needed to see. The spacecraft was still crippled, but it was no longer blind. Within minutes all the controllers had figured out how to solve their own respective problems, and Apollo 12 went on to make a perfectly targeted landing on the Moon (and came safely back).

Apollo 12 astronauts on the Moon, landed just next to Surveyor

One of the key differences between a modern Site Reliability Engineer and a traditional Operator is the scope of responsibility. An operator is traditionally responsible for a specific domain of technology and is limited to that specific silo (or silos). If you’re a sysadmin you may be responsible for multiple types of operating systems, but you’re probably not going to own any part of the business applications running on top. If you’re a DBA, you’re probably not going to debug the lowest levels of the network stack.

John Aaron did not recognize any specific domain that defined the limits of his responsibility. He was responsible for understanding and managing anything and everything that affected the Electrical, Environmental, and Consumables on board Apollo. That included the simulation equipment and electrical wiring leading to his console as well as the spacecraft itself.

It’s not just that Aaron had a responsibility. The organization around him was also open (or transparent) enough to allow people from various departments to cooperate freely. The simulation controllers never thought that Aaron was trying to “blame them” for the irregular telemetry data. The engineers responsible for the wiring didn’t block Aaron from looking under the covers or hide the blueprints because he was from a different group.

The combination of individual responsibility and organizational transparency put John Aaron in the position to make the call that the mission was “go” in the split seconds after the lightning strike.

As stated in the IBM Method article on transparency:

“In a culture of blame-free transparency, you have more opportunities to detect errors before they occur, and it’s easier to track down causes when errors occur.”

John Aaron knew immediately what the problem was, not just because he was smart, but because he had trained for this and had gone the extra mile when necessary — even when (and perhaps especially when) he was the only person to decide what that extra mile was. He was what NASA’s engineers called a “Steely Eyed Missile Man” — but he was equally a prototype and inspiration for today’s Site Reliability Engineers.

Adopting SRE and changing your organization’s behaviour can be a complex task — IBM’s Garage Experts can help you take your first steps.

Read more about Site Reliability Engineering and Transparency in the IBM Method site.

To find out when the next article is published you can follow me on Medium at Robert Barron or on Twitter at @flyingbarron

