Computer-aided responses and reflexes

Lesson XV from the Lunar Landings

Nov 10, 2020

It’s obvious that astronauts need lightning-fast reflexes and split-second decision-making while flying their missions, but the engineers on the ground also need to decide quickly and decisively, often based on limited information.

Flight controllers monitored the spacecraft — both the technical parameters of how it was functioning and the mission parameters of whether the flight plan was succeeding. In other words, was the spacecraft in the right place at the right time, and could it perform the next step needed? If not, what needed to be done to fix that; when, how, and by whom?

Some responses were simple and well defined. For example, planners knew in advance that the spacecraft might drift slightly off course during the mission and that the main engine would then be fired for a course correction. It was simply a matter of watching the flight and deciding on the best time to perform that action. This is equivalent to simple operational orchestration in an IT environment: if we detect a higher load or a slower response time, we can spin up more nodes or containers in response.
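To make the IT analogy concrete, here is a minimal sketch of such a predefined rule in Python. The `ServiceSample` type, the `desired_replicas` function, and every threshold are hypothetical values chosen for illustration; they are not taken from any particular orchestration product.

```python
from dataclasses import dataclass


@dataclass
class ServiceSample:
    """One observation of a service (all field names are illustrative)."""
    cpu_load: float          # average CPU utilisation, 0.0 to 1.0
    response_time_ms: float  # observed response time in milliseconds
    replicas: int            # nodes/containers currently running


def desired_replicas(sample, max_load=0.75, max_response_ms=500.0,
                     step=1, max_replicas=10):
    """Return the replica count after applying a simple predefined rule.

    Scale out by `step` when load or response time crosses its threshold,
    otherwise keep the current count. All thresholds are assumed values.
    """
    overloaded = sample.cpu_load > max_load
    slow = sample.response_time_ms > max_response_ms
    if overloaded or slow:
        return min(sample.replicas + step, max_replicas)
    return sample.replicas
```

Like the planned course correction, the decision itself is trivial; the only real question is when the measurements say it is time to act.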

Other responses were more difficult.

Antenna for long-range communications in Goldstone, California (NASA)

Suppose communications between the spacecraft and Mission Control in Houston became garbled. Which of the possible responses should Mission Control take? Perhaps a different, more powerful transmitter on the spacecraft should be activated? If this solves the problem, do we stop there, or do we investigate why we suddenly need a stronger transmitter?
If the solution turns out to be switching the receiving antenna on Earth from the California antenna to the one in Madrid or Australia, then when can the California antenna be reactivated? Is the problem in the antenna itself or in the network of wires between California and Texas?

In this scenario, the engineers can’t follow a simple predefined routine. They need to investigate the problem, analyze the different sources of information they have, come up with possible solutions, and implement the best one as quickly as possible. This is equivalent to troubleshooting an unexplained performance problem in your application: something you might expect to happen, but which you need to investigate end to end each time it occurs.

Let’s take another example, involving one of the most dramatic moments of a mission to the Moon — flying around the far side of the Moon, out of contact with the Earth.

Once the spacecraft reaches the Moon, it swings around the far side and fires the main engine to slow down enough to be captured by the Moon’s gravity and remain in orbit. While the astronauts are hidden behind the Moon, they cannot communicate with the ground and there is nothing for the ground controllers to do.

The first time it occurred was during the Apollo 8 mission, and Frank Borman [the mission commander] found the accuracy of Houston’s predictions awe-inspiring. At the precise time that he had been told communications would disappear, they did.
“Geeze!” he said to his crewmates, there being no one else to hear. “That was great, wasn’t it?” Then he mused: “I wonder if they’ve turned it off”
[Fellow astronaut] Bill Anders laughing replied: “Chris [Kraft, the boss in Houston] probably said, “No matter what happens, turn it off”.”
Bill’s humorous suggestion was that, in order not to worry the crew if the predictions had not been as accurate as they had hoped, Kraft would have ordered the people at the transmitting station to turn off the radio signal at just the right moment.
— from How Apollo Flew to the Moon. W. David Woods, pg 207

The predictions, performed by a series of IBM computers in the Real-Time Computer Complex (RTCC), were perfect.

Left: The IBM Real-Time Computer Complex (RTCC) collected, processed, and sent to Mission Control information to direct every phase of an Apollo mission. The RTCC perfectly calculated the time of the Apollo 8 loss & acquisition of signal. (IBM & NASA)

For the flight controllers, the period of loss of signal was usually an opportunity to relax, stretch their legs, and take a break (not a cigarette break, because the smokers all smoked at their consoles anyway!). However, what happens if there’s a glitch in the monitoring telemetry just before communications are cut? Perhaps there’s a sudden dip in the monitor reporting the oxygen pressure of the cabin.

If you have missing information, do you say “well, this was just a bit of bad telemetry because the radio signal got cut off midway” or do you say “hmm, they might have developed a leak just as they went behind the Moon”?

If you’re confident that it’s just a bad signal, then you can take your break and continue with the flight plan once the astronauts re-establish communications.

But if you’re wrong, then the astronauts will re-establish communications after having dealt with the leak for half an hour without aid from the ground — while the team back on Earth had been relaxing for half an hour! That half-hour of preparation and planning for an oxygen leak could mean the difference between life and death for the astronauts.

So it was essential for the controllers to take every minor hiccup and change in telemetry into account. They would detect patterns in their data; they’d know that certain measurements change together and that specific settings would modify other metrics in well-defined ways. So if (as an example) you saw a spike in a measurement at the same time an antenna array moved, you could ignore one specific measurement, but if it continued for another few seconds then you’d realize that the antenna movement was not related to the measurement change.
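The antenna example can be written down as a small rule. This is only a sketch of the reasoning described above, not a reconstruction of any procedure used in Mission Control; the function name `classify_spike` and both time limits are assumptions made for illustration.

```python
def classify_spike(spike_start, spike_end, antenna_moves,
                   tolerance_s=1.0, max_artifact_s=3.0):
    """Classify a telemetry spike as 'artifact' or 'investigate'.

    spike_start, spike_end: start and end of the spike, in seconds.
    antenna_moves: times (in seconds) at which the antenna array moved.
    A spike counts as an artifact only if it began within `tolerance_s`
    of an antenna movement AND lasted no longer than `max_artifact_s`.
    """
    duration = spike_end - spike_start
    near_move = any(abs(spike_start - t) <= tolerance_s
                    for t in antenna_moves)
    if near_move and duration <= max_artifact_s:
        return "artifact"
    return "investigate"
```

A spike that starts alongside an antenna movement but keeps going past the limit falls through to "investigate", which mirrors the controller realizing that the movement does not explain the change.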

Another way to be sure about how well the spacecraft functioned was by using two different mechanisms to measure the same thing and make sure they agreed. Before landing, the height of the lunar lander was determined both by its internal radar system and by the spacecraft’s internal guidance system.
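A cross-check of two independent measurements can be sketched in a few lines. The function name `cross_check` and the 5% tolerance are hypothetical; the actual comparison logic used for the lunar lander is not described in this article.

```python
def cross_check(radar_altitude_m, guidance_altitude_m,
                max_relative_diff=0.05):
    """Compare two independent altitude estimates.

    Returns (agree, difference_m). The estimates agree when their
    absolute difference is within `max_relative_diff` of the larger
    of the two magnitudes. The tolerance is an assumed value.
    """
    difference = abs(radar_altitude_m - guidance_altitude_m)
    scale = max(abs(radar_altitude_m), abs(guidance_altitude_m))
    agree = difference <= max_relative_diff * scale
    return agree, difference
```

When the two sources disagree, the function cannot say which one is wrong. It only signals that a human, or a further check, needs to decide which measurement to trust.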

Mission control with endless engineers, consoles and data sources that need to be correlated (NASA)

The only way the ground controllers could be confident that they were making the correct decisions was by endless training and practice sessions where they simulated space flight after space flight and emergency after emergency. The computer systems of those days could only display the information in a tabular fashion; only some specific metrics could be displayed as graphs.

Pattern matching and anomaly detection were up to the humans in charge. Another problem was finding patterns that crossed engineering domains. During Apollo 13's return to Earth, the spacecraft slowly drifted off course because gases venting from its side were pushing it sideways. The venting was being monitored by the expert responsible for the mechanical components of the spacecraft (the Electrical, Environmental, and Consumables Manager, aka EECOM), while the flight of the spacecraft was monitored by the Flight Dynamics Officer (FIDO). While the two collaborated, it was not always easy for them to share raw data and to detect patterns that crossed domains.

While the Apollo flights were short periods of high tension, supported by hundreds of highly trained engineers, today’s IT systems run 24x7 throughout the year. Since they are constantly being changed and upgraded, it is somewhere between difficult and impossible for today’s engineers to know their systems as well as the NASA engineers knew theirs.

This is where newer and better solutions come into play.

In some cases, it’s enough to improve the visualization of displays and metrics so that patterns can immediately be seen by humans. But in many cases, we need something that can do more than find obvious patterns and can cross multiple informational domains.

Watson AIOps collects information from a wide variety of data sources:

Metrics: measurements that can vary from the speed of a spacecraft to the number of calls made to an API.
Topology: the relationship between components, whether it be the electrical wiring and plumbing of a spacecraft or the physical and logical makeup of a Kubernetes service.
Events: information about a discrete occurrence in time. This could be an alarm stating that the spacecraft has lost power or that a disk has run out of space.
Logs and Traces: lines of data regarding specific applications. They display information about the workload being done and whether any errors have occurred.
Tickets: formal records of past incidents. During the Apollo missions, tickets would be documented in print and were not available for computer analysis. Today, Watson AIOps can automatically analyze tickets and use them to make suggestions about the best actions to take now.

More information, and more types of information, can be found in the article A Glossary for IT Operations Management.

In theory, all this information is available even without Watson AIOps to collect, aggregate, understand, and suggest solutions.

In theory, there’s no difference between theory and reality. But in reality — there is!

In reality, it’s extremely difficult to find patterns of issues and errors across multiple domains such as cloud infrastructure, middleware (databases, buses, and event streams), and applications. It’s even more difficult to filter out the real issues from the red herrings.

Watson AIOps takes the information from all the various sources and aggregates it, finding patterns that cross domains and suggesting solutions that might be missed by humans.
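To give a feel for what "patterns that cross domains" can mean, here is a deliberately naive sketch that groups events by time and keeps only the groups that span more than one domain. It is not how Watson AIOps works internally, which this article does not describe; the `Event` type, the `cross_domain_groups` function, and the 30-second window are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds since an arbitrary reference time
    domain: str       # e.g. "infrastructure", "middleware", "application"
    message: str


def cross_domain_groups(events, window_s=30.0):
    """Group events that occur close together; keep multi-domain groups.

    Events are sorted by time, and a new group starts whenever the gap
    to the previous event exceeds `window_s`. Only groups containing
    events from more than one domain are returned.
    """
    groups, current = [], []
    for event in sorted(events, key=lambda e: e.timestamp):
        if current and event.timestamp - current[-1].timestamp > window_s:
            groups.append(current)
            current = []
        current.append(event)
    if current:
        groups.append(current)
    return [g for g in groups if len({e.domain for e in g}) > 1]
```

For example, a "disk full" event from the infrastructure domain followed ten seconds later by "write errors" from an application would land in one group, while an unrelated application event minutes later would be left out. A real system would also use topology and past tickets rather than timing alone.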

We can use Watson AIOps to support the Site Reliability Engineers who have to manage and understand the myriad applications under their responsibility. Watson AIOps can reduce the amount of training they need on specific applications, help them communicate and collaborate, and help them keep up with the ever-changing environment.

A handy overview of Watson AIOps is available here.

I’m happy to report that I’ve collected some of the lessons I’ve shared here, along with some unpublished ones, and will be presenting them at the upcoming SRE conference, SREcon Americas 2020!

Failure is Not an Option! SRE Lessons 50 Years after the Apollo 13 Flight to the Moon

Join SREcon20 Americas on December 7–9, 2020. View the full program and register today!

For future lessons and articles, follow me here as Robert Barron, or as @flyingbarron on Twitter and LinkedIn.

How Apollo Flew to the Moon, W. David Woods, pg 207


Written by Robert Barron