ChatOps — Lessons from the Apollo Lunar Landings — Part V
In previous articles we saw the way flight controllers and their support teams swiftly solved problems as they occurred during the Moon landing. In this article we’ll discuss one of the key aspects of their success — communication & collaboration.
The “front line” or “public face” of Mission Control was called the Mission Operations Control Room (MOCR), and that is where the on-duty flight controllers worked.
Each desk (or console) in the MOCR was dedicated to a specific role, such as FLIGHT (#4 managed the whole flight), CAPCOM (#10 spoke to the astronauts) or GUIDO (#15 managed the spacecraft computers).
The primary goal of the MOCR was to achieve mission success — this meant keeping the spacecraft going and the astronauts alive. Besides their regular duties of supporting the flight, when an issue arose (for example, the 1201 and 1202 computer errors discussed in part I and part IV of this series), the controllers in the MOCR were the first to seek solutions. In modern IT Service Management, we’d call them the “First Responders”. They were supported by other groups:
- The various “back rooms”, which directly supported controllers in their field of expertise. Today, we’d call these Subject Matter Experts.
- The SPacecraft ANalysis (SPAN) room held technical experts and managers from both NASA and contractors whose goal was to identify the nature of problems and bring the relevant expertise to the forefront.
- The Mission Evaluation Room (MER) was a place for experts to brainstorm solutions to problems.
So while the MOCR flight controllers and their backroom support were performing the initial triage and fixes to keep the astronauts alive and the flight working for the next few hours, the SPAN would be finding the engineers who had worked on the exact piece of machinery that was malfunctioning, and the MER might be looking for a longer-term solution.
Since the Moon landing predates any of the modern conveniences of video conferences, email or messaging, flight controllers had limited options to work together and share information.
Controllers in the control room had consoles with various tools at their disposal — a display, status lights, communication equipment, and many, many, many, many, many buttons.
While the controller could control the displays and which metrics and telemetry from the spacecraft were displayed, it was always a highly condensed and cryptic set of information — no wonder that flights that took days to complete required years of training and practice.
There were status report lights (Red, Amber, Green) that had a special job; they let the rest of the team in the control room (and especially the Flight Director) know whether all was well within the scope of the controller’s responsibility. A lit-up Red light was a way for the controller to say “I need attention” quickly and wordlessly. Conversely, the Flight Director could quickly get a situation report for the entire flight by resetting all lights on the consoles to Amber and polling the controllers on their status: are they “go” (switch to Green) or “no-go” (switch to Red) for the next phase of the mission?
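The go/no-go poll described above can be sketched in a few lines of Python. This is only an illustrative model, not a description of the real console hardware: the `Light`, `Console`, `reset_for_poll` and `poll` names are my own, and the assumption that an unanswered console simply stays Amber is mine as well.

```python
from dataclasses import dataclass
from enum import Enum


class Light(Enum):
    RED = "no-go"
    AMBER = "pending"
    GREEN = "go"


@dataclass
class Console:
    name: str
    light: Light = Light.GREEN


def reset_for_poll(consoles):
    """The Flight Director resets every console to Amber before polling."""
    for console in consoles:
        console.light = Light.AMBER


def poll(consoles, answers):
    """Record each controller's answer and return the consoles that are not 'go'.

    `answers` maps a console name to True (go) or False (no-go). A console
    with no answer stays Amber and is reported as still outstanding.
    """
    for console in consoles:
        if console.name in answers:
            console.light = Light.GREEN if answers[console.name] else Light.RED
    return [c.name for c in consoles if c.light is not Light.GREEN]


consoles = [Console("GUIDO"), Console("CAPCOM")]
reset_for_poll(consoles)
# Only GUIDO has answered so far, so CAPCOM is still Amber.
blockers = poll(consoles, {"GUIDO": True})
print(blockers)
```

The useful property is that the whole room's state collapses into one short list: an empty list means everyone is “go”, and anything else tells the Flight Director exactly whom to talk to next.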
So the controller knew that there was a problem and requested help — what next?
Since shouting in a crowded room would be inefficient (especially since many of the people the controller was trying to reach were in other rooms!), the solution was a highly complex communications network where everyone could talk. Each console had multiple jacks where headsets could be plugged in and communication “loops” could be made.
And there were many such loops to listen in to!
“There are dozens of voice circuits on each panel, and each controller might listen to half a dozen or more at any one time. To the uninitiated, it’s got to sound like a cross between a hog auction and a Chinese fire drill” ¹
Let’s go through the loops that were in play during the 1201 alarm that we discussed previously:
- A unique communication loop between the Astronauts and CAPCOM — This is the only loop that the astronauts could hear, so that they wouldn’t be disturbed by the rest of the communication.
- A loop that the Flight Director controlled and all the flight controllers listened in on. This was a very important loop because it’s where the controllers got their primary instructions!
- The controllers had a loop to their backroom (this is how Steve Bales and Jack Garman talked to each other).
- The controllers had another loop to their representative in the MER and/or SPAN rooms.
- Another special loop was that of the Public Affairs Office (PAO) which was broadcast to the public. This was usually the CAPCOM-Astronaut loop with additional explanations to clarify what was going on.
In addition to their own channels, astute controllers would listen in to other channels in order to find out about their neighbours’ problems too, either to render aid or to avoid being struck unawares.
Over time, the controllers came up with various techniques to manage their communications. For example, you would lower the volume on the less important channels so that you could concentrate on higher-value messages. Still, imagine having a conversation and listening in to 2–3 others at the same time, waiting to hear a key phrase which would mean that you need to get into action!
In addition to the difficulty in conversation, there were other barriers to collaboration: was everyone seeing exactly the same information? While the NASA consoles were state of the art, they were still primitive, and it was often difficult to share exactly the same information between rooms. The MER and SPAN often had print-outs of telemetry that might be slightly delayed, and they did not have the same flexible consoles as the MOCR had. As in many cases, the solution came down to training and excellence.
Today, when we solve IT issues, we find ourselves in similar situations — we have multiple people trying to put out whatever fire has brought their webapp/mobile app/corporate project to its knees. Recovery efforts could always be better coordinated and information shared more efficiently.
One of the popular solutions to this problem is called ChatOps — the integration of development tools, operations tools, and processes into a collaboration platform so that teams can efficiently communicate and easily manage the flow of their work.
The point of ChatOps is to use a collaboration platform such as Slack, Microsoft Teams or Mattermost (among others) as a base for communication & collaboration between humans and also between humans and their applications.
So instead of bringing people together into physical war-rooms to work together, you bring them into virtual rooms or channels to work together.
Instead of each person working individually and invisibly (and perhaps at cross purposes with others), people work transparently and share each other’s efforts. This leads both to higher transparency and better training as junior people see what seniors are doing. Since we try to not spend months preparing for new deployments and application versions, working together in a collaborative fashion makes training easier and faster.
In the above screenshot, Rick from Operations has been alerted about a critical issue in the ATM service he is supporting. In addition to the initial alert, Rick has easy access to links to other supporting systems, and the ChatOps automation (casebot) has brought him a performance graph which helps with the initial diagnosis and triage: is this a sudden performance spike, or have things been gradually degrading?
Rick can also get help from the ChatOps automation bots by running commands and asking them questions.
All these activities allow Rick to continue working in the incident channel without having to context switch between applications.
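To make the idea of “running commands” in a channel concrete, here is a minimal sketch of how a bot might route chat messages to handlers. It is not how casebot or any particular product works: the `status` and `help` commands, the `handle_message` function and the hard-coded service states are all hypothetical, and a real bot would query a monitoring system instead of a dictionary.

```python
from typing import Callable, Dict, Optional

# Registry of chat commands: command name -> handler function.
COMMANDS: Dict[str, Callable[[str], str]] = {}


def command(name: str):
    """Decorator that registers a handler under a command name."""
    def register(func: Callable[[str], str]) -> Callable[[str], str]:
        COMMANDS[name] = func
        return func
    return register


@command("status")
def status(service: str) -> str:
    # Hypothetical data; a real bot would call a monitoring API here.
    known = {"atm": "critical", "web": "ok"}
    key = service.strip().lower()
    return f"{key}: {known.get(key, 'unknown')}"


@command("help")
def help_(_: str) -> str:
    return "commands: " + ", ".join(sorted(COMMANDS))


def handle_message(text: str, bot_name: str = "casebot") -> Optional[str]:
    """Return the bot's reply, or None if the message is not addressed to it."""
    parts = text.split()
    if not parts or parts[0].lstrip("@").lower() != bot_name:
        return None  # ordinary human conversation, so the bot stays quiet
    if len(parts) < 2:
        return COMMANDS["help"]("")
    name, arg = parts[1].lower(), " ".join(parts[2:])
    handler = COMMANDS.get(name)
    if handler is None:
        return f"unknown command '{name}'. " + COMMANDS["help"]("")
    return handler(arg)


print(handle_message("@casebot status ATM"))
print(handle_message("has anyone looked at the logs?"))
```

Because the bot only answers messages addressed to it, humans and automation can share the same channel without the bot interrupting every sentence.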
Rick can arrange all the issues he’s trying to solve in individual channels if he likes. In this way, each problem has its own conversation and there is no overlap, unlike the many simultaneous loops in the MOCR.
Another advantage of ChatOps is, of course, the capability to chat!
Here, Rick (from Operations), Todd (the Site Reliability Engineer) and Olivia (the Application Owner) are discussing their problem and brainstorming solutions, in parallel to using the bot to pull out information from multiple sources at once. They could have done this over the phone and could have queried their databases themselves, but by using ChatOps they are saving a significant amount of time because the bot is bringing them the information they want, the way they want to see it, and as soon as they ask. In addition, since everything they’re discussing is stored in the conversation history, it’s easy for them to go back and see who said what, who did what, and what the system looked like back when they were having the conversation.
While the MOCR controllers had channels for their backrooms, the modern Operator or Site Reliability Engineer might have a dedicated channel for a given service they support or a specific technology they use. But while a Mission Control channel was a physical wire, a new channel in Slack can be created effortlessly and on the fly. Thus, a channel can be a temporary thing which is opened when a problem arises, people and bots are invited in to work together to solve it, and the channel is then closed once the problem has been solved.
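The lifecycle of such a temporary channel (open, invite, work, close) can be modelled in a few lines. This is an in-memory sketch only and does not call any chat platform: `IncidentChannel`, `open_incident_channel` and the `inc-<id>-<service>` naming scheme are assumptions I made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentChannel:
    name: str
    members: set = field(default_factory=set)
    history: list = field(default_factory=list)
    archived: bool = False

    def invite(self, *who: str) -> None:
        """Add people or bots to the channel."""
        self.members.update(who)

    def post(self, author: str, text: str) -> None:
        """Append a timestamped message; closed channels are read-only."""
        if self.archived:
            raise RuntimeError(f"{self.name} is archived")
        self.history.append((datetime.now(timezone.utc), author, text))

    def archive(self) -> None:
        """Close the channel but keep its history for later review."""
        self.archived = True


def open_incident_channel(service: str, incident_id: int) -> IncidentChannel:
    # Hypothetical naming convention: one channel per incident.
    return IncidentChannel(name=f"inc-{incident_id}-{service}")


channel = open_incident_channel("atm", 42)
channel.invite("rick", "todd", "olivia", "casebot")
channel.post("rick", "Response times are climbing on the ATM service.")
channel.archive()
print(channel.name, len(channel.history), channel.archived)
```

Archiving rather than deleting is the design choice that matters here: the conversation stays available as a record of who said what and who did what, even though nobody can post to it any more.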
If this reminds you of Steve Bales and Jack Garman discussing the Apollo 1201 and 1202 errors — good!
The goal of ChatOps is the same as that of the communication channels in NASA’s mission control — but utilizing the more advanced technological solutions we have available today.
There are many ways of utilizing ChatOps. At the very simplest level, opening a Slack channel to discuss an operational problem without any added automations is a good first step, since you’ll be improving the transparency between various teams and team members. After that, adding simple automations which will inject relevant information into the channels and even alert on the initial problems is easy.
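As an example of a “simple automation”, the sketch below builds an alert message for a chat channel using an incoming webhook, a mechanism that Slack and Mattermost both offer and that accepts a JSON body with a `text` field. The webhook URL is a placeholder, the `build_alert_request` and `send` helpers and the message layout are my own invention, and the example only constructs the request; nothing is sent unless you call `send` with a real URL.

```python
import json
import urllib.request

# Placeholder only: a real incoming-webhook URL is issued by your chat platform.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"


def build_alert_request(service, severity, summary, webhook_url=WEBHOOK_URL):
    """Build (but do not send) an HTTP POST carrying a one-line alert."""
    payload = {"text": f"[{severity.upper()}] {service}: {summary}"}
    return urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=5):
    """Deliver the alert; only call this with a real webhook URL."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


req = build_alert_request("atm-service", "critical", "response time above threshold")
print(json.loads(req.data)["text"])
```

Keeping the message construction separate from the delivery makes the automation easy to test without touching a live channel, which is a sensible habit for anything that can page people at 3 a.m.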
Next time, we’ll see how the flight controllers kept their cool under pressure and solved problems fast. What happens when lightning strikes a Saturn V rocket? Who makes the “go/no-go” call and finds the solution? Today we’d call them Site Reliability Engineers.
I would be pleased if you would join me. You can follow me here (Robert Barron) or on Twitter at @flyingbarron.
In my experience, many clients combine applications that are “ChatOps ready” (such as IBM’s own Cloud Event Management) with customizable solutions (such as the open source Hubot and Botkit) to get a combination of speed and flexibility.
If you’re interested in learning more about ChatOps or how IBM’s Garage can help you, please reach out to me or schedule a no-charge visit with the IBM Garage.
¹ Below Tranquility Base, R. Stachurski, p. 28.
Articles in this series: