A look at managing an incident without AIOps and then with AIOps.

Incident Management is the practice of restoring a damaged service as quickly as possible. In this post, I’ll describe the Incident Management process twice, first without AIOps and then with AIOps, to show the differences and the advantages AIOps brings.

While many of the concepts are global and generic, the techniques and use cases displayed were developed at IBM as part of the IBM Cloud Pak® for Watson AIOps product.

The process will flow along a timeline from left to right.

Without AIOps: Before the incident occurs

Michael Collins, Apollo 11 astronaut. 1930–2021.

In July of 1969 two astronauts looked up at the sky from Tranquility Base on the Moon. Over 3.5 billion people lived on the Earth and when the astronauts gazed at the bright blue sphere, they saw the whole of humanity suspended in velvet blackness.

All of humanity, except for one man — Michael Collins — pilot of the Apollo 11 spacecraft Columbia which was orbiting the Moon, waiting for Neil Armstrong and Buzz Aldrin to complete the first exploration of the Moon and to carry them back to the safety of the home planet. But half the time Collins…

Shuttle To SRE — STS Lesson 1

April 12th is a landmark date in space exploration: on this day in 1961 Yuri Gagarin became the first human to explore outer space, and on this day in 1981 Space Shuttle Columbia, the first reusable spacecraft, was launched.
While the technology which launched Gagarin into space has a legacy which continues to this day, the Space Shuttle was a departure from space systems which came before it, and its DNA does not seem to have been passed on to further systems.

In this series of lessons we will discuss why this was and what modern developers, DevOps practitioners and Site…

Glynn Lunney “Black Flight”, 1936–2021

Glynn Lunney, one of the most senior engineers of the Apollo project, passed away on Friday, 19th of March 2021. Lunney was a leader and an inspiration to reliability engineers for over 50 years.

Glynn Lunney (NASA)

Glynn Lunney, call sign “Black Flight”, was one of the first of NASA’s flight directors, the engineers who orchestrated the space flights. While the astronauts flew, the flight directors had the responsibility for the overall success of the mission — overseeing everything from the moment the rocket lifted off until the astronauts were recovered by the navy. In modern IT parlance…

Lesson XVIII from the lunar landings

The most famous words spoken on the Moon begin, “That’s one small step…”, but close behind are “Houston, Tranquility Base here. The Eagle has landed.”

For Neil Armstrong, astronaut and engineer, the actual landing of the spacecraft was personally of more importance and historical relevance than his first steps. After all, he was first and foremost a pilot…

Pilots take no special joy in walking: pilots like flying. Pilots generally take pride in a good landing, not in getting out of the vehicle.
— Neil Armstrong

While “one small step” was directed at all…

2020 was certainly a year to remember!

I started writing these articles just before the 50th anniversary of the Apollo 11 Moon Landing, and I’ve found myself continuing to share more and more stories which show how modern-day Site Reliability Engineers, Sysadmins, Operators and DevOps Engineers can learn valuable lessons from the actions and practices of NASA’s astronauts and flight controllers as they took humanity to the Moon.

Disregarding global events and concentrating on my series of Lunar Landing Lessons, in 2020 I managed to:

  • Publish nine lesson articles. This is less than the rate of one a month…

Lesson XVII from the lunar landings

The word monolithic would seem to define the Saturn V rocket: 110 meters (363 feet) tall, filled to the brim with liquid oxygen, liquid hydrogen, and kerosene.

Especially in DevOps, the word monolithic has negative connotations — unwieldy, cumbersome, and difficult to change. But the Saturn V rocket, and the Apollo program of which it was part, was actually a more flexible solution than one might expect. Indeed, the Apollo project was the opposite; it was frankly Agile to an extent that would make modern-day space entrepreneurs such as Elon Musk and Richard Branson envious!

Operational agility…

Lesson XVI from the Lunar Landings

For Moon landing aficionados, December is the month of both the first and the last Apollo Moon missions. Exactly 49 years ago, Apollo 17 closed the first chapter of human exploration of the Moon when it launched on December 7th, 1972. This was a short chapter, which opened 4 years earlier with the voyage of Apollo 8 to the Moon, when the first humans left the immediate area of the Earth and ventured forth into the cosmos. Apollo 8 launched on the 21st of December, 1968, and achieved lunar orbit on the 24th.

Later that day, on Christmas Eve, the…

Lesson XV from the Lunar Landings

While it’s obvious that astronauts need lightning-fast reflexes and split-second decision-making while flying their missions, it’s important to remember that the engineers on the ground also need to make decisions quickly and decisively — often based on limited information.

Flight controllers monitored the spacecraft — both the technical parameters of how it was functioning and the mission parameters of whether the flight plan was succeeding. In other words, was the spacecraft in the right place at the right time, and could it perform the next step needed? …

Using chaos engineering to validate your resilient infrastructure and applications.

This guest article was written by Rajesh Jaluka, David Nguyen & Haytham Elkhoja

While we’ve discussed the need for chaos engineering and many of the concepts and principles behind it, it’s also important to build your understanding by performing chaos engineering experiments yourself.

This tutorial shows you how to get started with chaos engineering using Gremlin, a chaos engineering platform, to validate the resiliency and reliability of your application and infrastructure on IBM Cloud Kubernetes Service or Red Hat OpenShift on IBM Cloud. …

Robert Barron

Lessons from the Lunar Landing, Shuttle to SRE | AIOps, ChatOps, DevOps and other Ops | IBMer, opinions are my own
