Operational Scorecard — Lessons from the Apollo Lunar Landings — Part VII

Robert Barron
8 min read · Dec 23, 2019


If you say “Christmas” to a space aficionado, one of the first things to come to mind will always be the flight of Apollo 8 in December of 1968: the very first flight in which mankind left the cradle of the Earth and traveled to the Moon (although the crew only circled the Moon and did not land).

Between 1961 and 1969, every flight NASA undertook was a stepping stone towards the Moon landing. Every flight was a test to make sure that the next, more advanced flight would be successful. By taking a series of stepping stones NASA minimized the risks and maximized the success of each flight.

By far the largest “step” between flights was the gap between Apollo 7 in October of 1968 (which, despite the number in its name, was the first manned test of Apollo, the American third-generation spacecraft) and the flight to the Moon on Apollo 8. Apollo 8 would also be the first manned flight of the mighty 110-meter-tall Saturn V rocket.

Prior to Apollo 8, the furthest any human had traveled from the Earth had been 1,369 km (739.2 nautical miles) in Gemini 11 in 1966. Apollo 8 planned to reach lunar orbit, over 377,000 km (203,560 nautical miles) away from Earth.

Flight patch of Apollo 8, with a symbolic diagram of the flight plan. (NASA)

The flight of Apollo 8 was planned to test:

  1. The Saturn V launch rocket
  2. The ability to navigate and fly to the Moon
  3. The ability to change the direction of flight
  4. The ability to reach a stable orbit around the Moon
  5. The ability to perform scientific work (aka photography) as preparation for future missions
  6. The ability to break out of lunar orbit and return to Earth
  7. The ability of the Apollo heat shields to protect the astronauts as Apollo 8 returned to Earth from the Moon, reaching previously unheard-of speeds and temperatures during re-entry

While the above were the functional mission goals, there were other unstated mission goals which entailed the capability of managing and operating the complex flight. Down at mission control in Houston, the flight controllers worked in shifts to make sure that the mission was a success. As discussed in earlier articles, there were many flight controllers watching over the various systems and making sure that the spacecraft & the astronauts were healthy and nothing was interfering with the flight plan.

Orchestrating the entire flight was the Flight Director (call sign FLIGHT), who periodically needed to decide whether the flight was proceeding correctly and could continue to the next step. For example, before leaving Earth orbit for the Moon, the entire spacecraft was checked out again, to make sure that nothing untoward had occurred during the launch (perhaps a lightning strike had damaged some equipment?). This meant that the flight plan was full of tests and pauses, time to repair or realign damaged equipment or just re-test and make sure everything was as planned.

These decision points were called “go/no-go polls” and entailed each and every controller essentially saying “I have checked everything within my responsibility and I have made sure, to the best of my ability, that we can continue with the next stage of the mission and nothing will go wrong.” Of course, since this was NASA-speak, that sentence was shortened to “go”.

In Apollo 8, the most important go/no-go poll came a few hours after launch. The flight controllers tested and verified that everything was ready for the rocket engine to ignite and propel humans away from the bonds of Mother Earth for the very first time. In NASA-speak, this action was called the “Trans-Lunar Injection” because it meant that the spacecraft was injected into a flight path that would transfer it to the Moon. In true NASA-speak, this was abbreviated as TLI. For many astronauts and engineers, this moment was second only to the actual Moon landing in historical impact.

After FIDO, GUIDO, EECOM, NETWORK, SURGEON, CAPCOM, and others all gave their “go”, the Flight Director Cliff Charlesworth gave the signal to astronaut Michael Collins who passed on the news to the Apollo 8 astronauts — “you are GO for TLI”.

Mankind was on the way to the Moon, in time to reach it by Christmas.

Just as the NASA flight controllers needed to make sure that the flight would continue as planned with no problems, so in the modern world of DevOps and Cloud Service Management & Operations we need to make sure that new deployments of services and changes to the environment (whether cloud, traditional or a hybrid of them) do not endanger the performance and availability of the services we support.

But with changes becoming more and more frequent, how do we make sure a new version of an application or a new Cloud service will not impact and damage the existing environment? The solution is a version of the “go/no-go poll”, which we call an “operational readiness scorecard”.

In essence, this is a checklist that reproduces the spirit of the “go/no-go poll”, but for services and applications instead of spacecraft.
The checklist is a series of tests which every application needs to go through in order to be considered Production Ready. Examples of these tests might include:

  • Have all the test cases (unit, system, regression) been executed successfully?
  • Do we have updated release notes and runbooks/operational guides?
  • Have we updated the monitoring thresholds?
  • Is the underlying infrastructure at the necessary level (firmware for hardware, software version for middleware/containers)?
  • Have we done performance/load testing? If so, have we allocated the necessary resources in advance?
  • Do we have backup/restore and back-out/reversion procedures?
  • Are the application logs in the agreed upon standard?
  • Are the service APIs part of the organization service mesh?
  • Is deployment of new versions gradual (blue/green, canary) or is it a risky all-or-nothing deployment?
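
A checklist like this lends itself to automation. The following is only a minimal Python sketch, assuming each test can be expressed as a yes/no question about a release candidate. The `Release` fields, the check names and the `run_checklist` helper are all hypothetical, invented here for illustration and not taken from any real tool.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Release:
    # Hypothetical description of a release candidate; field names are assumptions.
    tests_passed: bool
    runbook_updated: bool
    monitoring_thresholds_updated: bool
    rollback_procedure_exists: bool
    deployment_strategy: str  # e.g. "canary", "blue/green", "all-at-once"


# Each checklist item is a named yes/no question about the release.
CHECKLIST: Dict[str, Callable[[Release], bool]] = {
    "test cases executed successfully": lambda r: r.tests_passed,
    "release notes and runbooks updated": lambda r: r.runbook_updated,
    "monitoring thresholds updated": lambda r: r.monitoring_thresholds_updated,
    "back-out procedure available": lambda r: r.rollback_procedure_exists,
    "gradual deployment": lambda r: r.deployment_strategy in {"canary", "blue/green"},
}


def run_checklist(release: Release) -> Dict[str, bool]:
    """Evaluate every checklist item and return its pass/fail result by name."""
    return {name: check(release) for name, check in CHECKLIST.items()}


# Example: everything is ready except the monitoring thresholds.
candidate = Release(True, True, False, True, "canary")
results = run_checklist(candidate)
for name, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}  {name}")
```

In practice each lambda would be replaced by a call to whatever system holds the answer (the CI pipeline, the monitoring tool, the documentation repository), but the shape of the result, one named pass/fail per test, stays the same.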

If development follows concepts such as 12-factor apps, then we would test to see that the applications follow each of the factors too.

As you can see, the operational readiness checklist is a combination of development and operational concerns — a true DevOps/Site Reliability Engineering synergy.

While the NASA “go/no-go poll” was a pass/fail test (a single “no-go” meant that the mission would be put on hold and could not proceed until the underlying issue was resolved), the operational readiness scorecard is usually more flexible. We aggregate these tests in a scorecard rather than a simple yes/no checklist because we want to allow a level of flexibility that is relevant in most application development scenarios. For example:

  • Full test coverage might be too costly/time consuming for certain cases.
  • Keeping documentation up to date often has a lower priority.
  • Legacy logs might remain in an older format because no one is tasked with updating them.
  • The underlying middleware might be a “good enough” version and not always need to be updated to the latest version.
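
The difference between the two approaches can be sketched in a few lines of Python. This is an illustration under simple assumptions: the check names are hypothetical, and the scorecard side is reduced to an unweighted pass ratio.

```python
from typing import Dict


def strict_poll(results: Dict[str, bool]) -> bool:
    """NASA-style poll: a single failed check means no-go."""
    return all(results.values())


def pass_ratio(results: Dict[str, bool]) -> float:
    """Scorecard-style view: the fraction of checks that passed."""
    return sum(results.values()) / len(results) if results else 0.0


# Hypothetical results: everything is fine except the documentation.
results = {
    "tests": True,
    "documentation": False,
    "log format": True,
    "middleware version": True,
}

print(strict_poll(results))  # False: one "no-go" holds the whole mission
print(pass_ratio(results))   # 0.75: the scorecard records a partial result instead
```

The strict poll throws away the information that three out of four checks passed; the scorecard keeps it and leaves the decision about what is "good enough" to a separate, explicit threshold.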

The following sample scorecard shows how 4 different microservices are scored against 12-factor app development. In a real-world scenario, we would add many more operational factors to the scorecard and give different weights to different factors.

Sample scorecard, only taking 12-factor app development into account.

The operational readiness scorecard allows us to give different weights and values to different tests and prioritize different components. Perhaps we don’t mind if the logs are in multiple formats as long as the documentation on how to read the logs is crystal clear. Perhaps we don’t mind that the application has a complex restart procedure if we have enough automation in place to make the process simple from a human perspective.
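
One way to express such priorities is a weighted average. The sketch below is an assumption about how the weighting could work, not a description of any particular product: the test names, scores and weights are made up, and each test is assumed to produce a score between 0.0 and 1.0.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of per-test scores (each between 0.0 and 1.0).

    Tests without an explicit weight count with weight 1.0.
    """
    total_weight = sum(weights.get(name, 1.0) for name in scores)
    if total_weight == 0:
        raise ValueError("at least one test must have a non-zero weight")
    weighted_sum = sum(score * weights.get(name, 1.0) for name, score in scores.items())
    return weighted_sum / total_weight


# Hypothetical service: logs are in mixed formats, but the documentation on how
# to read them is excellent and the restart procedure is fully automated.
scores = {"log format": 0.5, "log documentation": 1.0, "restart automation": 1.0}

# Hypothetical priorities: clear documentation matters three times as much
# as a uniform log format.
weights = {"log format": 1.0, "log documentation": 3.0, "restart automation": 2.0}

print(round(weighted_score(scores, weights), 3))  # 0.917
```

With equal weights the same service would score about 0.833, so the weights are where the organization records what it actually cares about.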

A scorecard allows us to objectively grade services before they deploy a new version and allows us to stop (“no-go” in NASA speak) deployments that seem risky.

If we define an aggregated score of B to be the minimum required for deployment, we might end up with the following procedure for a given service:

Grading options for Operations Scorecard
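
A deployment gate built on a minimum grade of B could be sketched as follows. The grade boundaries are hypothetical and chosen only for illustration; the only thing taken from the text above is that B is the lowest grade allowed to deploy.

```python
from typing import List, Tuple

# Hypothetical grade boundaries as (minimum score, grade), highest first.
GRADE_BOUNDARIES: List[Tuple[float, str]] = [
    (0.9, "A"),
    (0.8, "B"),
    (0.7, "C"),
    (0.6, "D"),
]
GRADE_ORDER = ["F", "D", "C", "B", "A"]  # worst to best


def to_grade(score: float) -> str:
    """Map an aggregated score between 0.0 and 1.0 to a letter grade."""
    for minimum, grade in GRADE_BOUNDARIES:
        if score >= minimum:
            return grade
    return "F"


def deployment_decision(score: float, minimum_grade: str = "B") -> str:
    """Return "go" if the grade meets the minimum, otherwise "no-go"."""
    grade = to_grade(score)
    if GRADE_ORDER.index(grade) >= GRADE_ORDER.index(minimum_grade):
        return "go"
    return "no-go"


print(to_grade(0.85), deployment_decision(0.85))  # B go
print(to_grade(0.75), deployment_decision(0.75))  # C no-go
```

A real procedure would probably do more than return a string on a "no-go", for example listing the tests that pulled the grade down so the team knows what to fix first.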

Additional ideas might include gamification of the scorecards, so that the team with the highest grade in the organization will get a prize of some kind.

Earthrise over the Moon (NASA)

The Apollo 8 team (astronauts in space and controllers on the ground) passed all their go/no-go readiness tests with A+ marks and reached lunar orbit on December 24th, 1968.
The astronauts had a hugely successful mission, seeing the far side of the Moon with human eyes for the first time, taking the iconic Earthrise photograph, reading from Genesis as a season’s greeting to all humans on planet Earth, and (most important from their children’s perspective) affirming that they could see Santa Claus!

Improving your operational readiness and creating an operational scorecard is not a trivial exercise. You need to balance creating as many tests as possible against the risk of micro-managing and over-burdening the teams responsible. As many of the tests as possible should be automated, so that they can be re-run on a schedule.

If you want to see what a scorecard for your organization might look like, IBM’s Garage Experts can help you take your first steps.

To be informed when the next article is published follow me on Medium at Robert Barron or on Twitter at @flyingbarron


Bring your plan to the IBM Garage.
Are you ready to learn more about working with the IBM Garage? We’re here to help. Contact us today to schedule time to speak with a Garage expert about your next big idea. Learn about our IBM Garage Method, the design, development and startup communities we work in, and the deep expertise and capabilities we bring to the table.

Schedule a no-charge visit with the IBM Garage.

As I write this in December 2019, I’d like to close with a message from 51 years ago:

And from the crew of Apollo 8, we close with good night, good luck, a Merry Christmas — and God bless all of you, all of you on the good Earth.
— Frank Borman, James Lovell & Bill Anders

Happy Hanukkah and Happy Holidays to all IBMers and everyone else on the good Earth (and around it)!
