Chaos Engineering vs Voodoo

Robert Barron
5 min read · Oct 5, 2020


Years ago, I started out as a junior software developer in a project that would probably be described today as a semi-failed monolith, a white elephant drinking at the muddy pool of the waterfall development process which birthed it.

It was a multi-million-line application that implemented the design spec exactly, never mind what the end users really wanted by the time it was delivered. However, this monolith was an ornery and fragile beast, prone to failures just before the end of your on-call shift.
Before I was allowed to wrangle the problems myself, I often watched my elders deal with them. Since these more senior developers had had a hand in the creation of the beast, they had a much better understanding of its components and their dependencies.

This was back in the late 90s, so we’re talking about hundreds of processes scattered over a few dozen servers, not the endless numbers of containers in modern Kubernetes applications. Still, standardization was more limited, and I watched in wonder as my seniors debugged and drilled deep into the innards of the wayward beast, navigating without a map.
But even for these seasoned veterans, the application was simply too large to be fully comprehended, and its behaviour was often bafflingly mysterious.

Who wants to debug this muddy monolith? (Photo by Ansie Potgieter on Unsplash)

Why did 18 parallel processes function perfectly, but as soon as you added another you started getting failures? And the more you added, the more often failures occurred.
Why did the application choke on an incoming file, but if you copied the identical file to an identical server running an identical transaction process, the application succeeded?
Why did stopping one server actually make the processes on another server run faster?

We used to call this “Voodoo”: a solution that returned the application to a proper functional state, but without our quite understanding why it worked!

One of the major gaps in our understanding of how the application worked was the lack of application Observability: we learned about issues only after the fact, by spotting errors in the logs or watching a queue grow too large. We did not use modern Build to Manage techniques such as traces, consistent error messages and so on, which tell us what is going on inside the application before and after problems arise, so we had to depend on more limited external monitoring capabilities.
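Just to make the contrast concrete, here is a minimal sketch (in Python, with hypothetical component and field names, and bearing no resemblance to anything we actually ran back then) of the kind of consistent, structured log event that Build to Manage encourages. Every message carries the same fields and a trace ID, so a single transaction can be followed across processes:

```python
import json
import logging
import time
import uuid

logging.basicConfig(level=logging.INFO, format="%(message)s")
log = logging.getLogger("file-processor")  # hypothetical component name


def log_event(event: str, **fields) -> None:
    """Emit one structured, consistently formatted log line that monitoring tools can parse."""
    log.info(json.dumps({"ts": time.time(), "component": log.name, "event": event, **fields}))


# A trace ID carried through every message lets us follow one transaction across processes.
trace_id = str(uuid.uuid4())
log_event("file_received", trace_id=trace_id, file="incoming.dat", size_bytes=10_240)
log_event("file_processed", trace_id=trace_id, status="ok", duration_ms=42)
```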

The other major gap was that, although we knew how fragile the application was, we barely performed any reliability or availability tests in development. After all, we had enough reliability troubles in production; why should we spend time looking for more in development?
The fact of the matter is that we were flying blind and depending on our experience, our collection of runbooks, Voodoo, and luck to solve problems in production.

Over the last few years, Chaos Engineering has emerged as a discipline within Site Reliability Engineering and IT Operations that aims to close exactly this gap.

Simply put, Chaos Engineering means defining an experiment (or a series of experiments) that challenges the successful operation of a system: for example, killing a process and verifying that the service is resilient enough to keep functioning until the process is automatically restarted. Unlike traditional testing, Chaos Engineering experiments are automated with an element of randomness, and we aspire to run them in the production environment as well. After all, if we don’t have confidence that such a test will succeed, how can we have confidence that the application won’t be brought down at a moment’s notice by a random failure?
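Here is a minimal sketch of what such an experiment might look like, just to make the idea concrete. Everything in it is an assumption for illustration: the health endpoint, the worker process names and the ten-second restart window; a real experiment would use a proper chaos tool and a carefully chosen steady-state metric:

```python
import random
import subprocess
import time
import urllib.request

SERVICE_URL = "http://localhost:8080/health"      # hypothetical health endpoint
WORKERS = ["worker-1", "worker-2", "worker-3"]    # hypothetical worker process names


def steady_state_ok() -> bool:
    """Steady-state hypothesis: the service answers health checks within two seconds."""
    try:
        with urllib.request.urlopen(SERVICE_URL, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


def kill_random_worker() -> str:
    """Inject the failure: terminate one randomly chosen worker process."""
    victim = random.choice(WORKERS)
    subprocess.run(["pkill", "-f", victim], check=False)
    return victim


if __name__ == "__main__":
    assert steady_state_ok(), "System is not healthy to begin with; aborting the experiment."
    victim = kill_random_worker()
    print(f"Killed {victim}; waiting for the supervisor to restart it...")
    time.sleep(10)                                 # give the restart mechanism time to act
    print("Steady state restored:", steady_state_ok())
```

The pattern is always the same: verify the steady state, inject a failure, and verify the steady state again, with the randomness ensuring that we don’t only test the failures we already expect.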

Cat doing what it does best — experimenting on an application (IBM)

While the name implies a chaotic (or random) process, the practice actually follows strict engineering procedures to attack a system in a contained and observable manner. The goal is to break a system in order to understand its weak points, correct its architecture, and anticipate how both the system and the people operating it will behave when failures occur.
So Chaos Engineering helps us understand not only the system but also how people react to its failures, which allows us to fix both the technical issues in the system and the human-factor issues in the troubleshooting process.
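To give a flavour of what “contained and observable” means in practice, here is one more hedged sketch: a guardrail check. The metrics endpoint and the 5% threshold are purely illustrative assumptions, but the idea of continuously comparing the observed impact against an agreed limit, and aborting the experiment the moment it is exceeded, is the essence of keeping the blast radius under control:

```python
import json
import urllib.request

METRICS_URL = "http://localhost:9090/error_rate"   # hypothetical monitoring endpoint
ERROR_RATE_LIMIT = 0.05                            # hypothetical guardrail: abort above 5% errors


def current_error_rate() -> float:
    """Read the current error rate from the monitoring system (endpoint is an assumption)."""
    with urllib.request.urlopen(METRICS_URL, timeout=2) as resp:
        return float(json.load(resp)["error_rate"])


def experiment_may_continue() -> bool:
    """Containment check: the experiment runs only while its impact stays below the guardrail."""
    return current_error_rate() < ERROR_RATE_LIMIT
```

In a real experiment this check would run continuously alongside the failure injection and trigger an automatic rollback.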

Within IBM, we have been adopting Chaos Engineering both internally, to support our Cloud services, and externally, to support our clients. Adopting Chaos Engineering is not a trivial task, since it seemingly goes against the engineering instinct of not breaking what’s working.

But Chaos Engineering, properly practiced, will improve the reliability of your systems and reduce the amount of guesswork and “Voodoo” involved in solving problems, because you will already know exactly how your systems respond to failures.

As part of IBM Cloud Architectures, we’ve developed a series of Chaos Engineering principles that can serve as guidance for organizations adopting the practice. And because we know that organizational or cultural issues can make these principles difficult to implement, we have also developed a methodology around them which can take you from random Voodoo to rigorous Chaos Engineering in a few easy steps.

For further information about IBM’s Chaos Engineering principles and methodology, see these pages on our Architecture site: Chaos engineering principles & Use chaos engineering to assess application reliability.

If you’d prefer to listen, please attend (or catch the recording of) the session by IBM’s Always-On architect Haytham Elkhoja at ChaosConf 2020.

Bring your plan to the IBM Garage.
IBM Garage is built for moving faster, working smarter, and innovating in a way that lets you disrupt disruption.

Learn more at www.ibm.com/garage


Written by Robert Barron

Lessons from the Lunar Landing, Shuttle to SRE | AIOps, ChatOps, DevOps and other Ops | IBMer, opinions are my own
