Is there such a thing as a system that’s too reliable?
The year 1977 was a landmark for the computer and hi-tech industry. Revolutionary systems appeared, ranging from the first affordable and practical home computers like the Commodore PET, the Apple II, and the TRS-80 to the VAX-11/780 “super-minicomputer”, the first system to run VMS. IBM introduced two new lines of systems: the high-end IBM 3033 and the mid-range IBM System/34.
But of all the computer systems which went online during 1977, only two are still reliably operating, collecting data, analyzing signals, and reporting back to their human masters. The Voyager space probes were launched in August and September of 1977 and have each traveled around 20,000,000,000 km (roughly 12,000,000,000 miles) from Earth.
Dating from an era without cellular phone networks or Wi-Fi, when long-distance landline connections were fragile, the Voyagers were designed to communicate over a distance that takes light nearly a day to travel.
And recently, after nearly half a century of operations, decades after encountering the planets Jupiter, Saturn, Uranus, and Neptune, years after leaving the solar system to traverse interstellar space, the NASA engineers who support Voyager 2 have decided that it is too reliable and have removed some of the onboard safety mechanisms!
While this is a counterintuitive step, it is a result of the never-ending balancing act every system performs.
The long-term success of a system is the balance between the expense of doing something now and making sure we can do it again tomorrow.
When we define a system's requirements (whether it be a mobile game, a financial app, a booking system, an HR or sales application, an elevator control system, a self-driving car, or an interplanetary probe), we need to know that it is reliable. In other words, if something goes wrong in one component, the overall system has a solution in place that allows us to continue working.
The simplest example of adding reliability is redundancy. If something breaks, we’ve got a spare (or redundant) component ready to replace it. It’s obviously much easier to create a system without redundancy — less cost, less complexity, less operational overhead.
A system with built-in reliability is inherently more complex than one without. Often the easiest way to make sure you have enough capacity to handle unexpected failures is simply to double the infrastructure you're running your system on, so that you have spare capacity when something fails. The problem is that you have doubled the cost of infrastructure, and you now also have to spend time and effort coordinating the original infrastructure with the additional infrastructure you paid for.
A single server is simpler to manage than a host running multiple virtual machines, which in turn is simpler than a Kubernetes cluster running many nodes. Balancing that complexity is the benefit that a well-managed Kubernetes cluster is much more reliable than a single server, because it will survive failures which would cripple the non-redundant single server.
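The trade-off between redundancy and cost can be sketched with a little arithmetic. The snippet below is a minimal illustration, not a capacity-planning tool: it assumes component failures are independent (real systems only approximate this), and the function names and the `coordination_overhead` figure are hypothetical values chosen for the example.

```python
def combined_availability(single: float, copies: int) -> float:
    """Chance that at least one of `copies` identical components is up.

    Assumption: failures are independent of each other.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # All copies must be down at once for the system to be down.
    return 1.0 - (1.0 - single) ** copies


def relative_cost(copies: int, coordination_overhead: float = 0.1) -> float:
    """Cost relative to a single component.

    Assumption: every additional copy adds `coordination_overhead`
    (an illustrative, made-up fraction) to the cost of each copy.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return copies * (1.0 + coordination_overhead * (copies - 1))


if __name__ == "__main__":
    for n in (1, 2, 3):
        availability = combined_availability(0.99, n)
        cost = relative_cost(n)
        print(f"copies={n} availability={availability:.6f} cost={cost:.2f}x")
```

Under these assumptions, a second copy of a 99% available component raises availability to 99.99%, while the bill more than doubles once coordination is counted.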
In the case of Voyager 2, the reliability in question came in the form of a reserve of backup power. Voyager's power comes from radioisotope thermoelectric generators (RTGs). Simply put, these are lumps of plutonium which decay, emitting heat. This heat is converted to electricity, which powers the various sensors, engines, and computers on Voyager. When Voyager 2 was launched in 1977, the RTGs were capable of generating about 470 watts. Today, they have decayed to the point where they can generate just over a third of that. Since each instrument requires power to work, the system's ability to run its scientific instruments declines as the power drops. Most of the instruments on both of the Voyagers have been permanently shut down.
NASA engineers planned to shut down one of the remaining instruments on Voyager 2 later this year, to allow four others to continue operating for longer. However, they also found a hidden trove of reserve energy — a small reservoir of power designed to be a safety mechanism in case of a sudden power fluctuation. Voyager 2 has not survived decades of interplanetary travel by skimping on safety mechanisms, but each safety mechanism comes with a cost. As an example of the cost of reliability, many people over-pack their suitcases when they travel with spare clothes, in case they unexpectedly need clean clothes. A more technical example might be a computer service which is over-provisioned with extra servers, containers, memory, storage, or CPU in case of an unexpected burst of usage. All these “extras” come with a cost, both financial (you need to have bought extra clothes ahead of time) and opportunity (you may end up with overweight baggage due to the shopping you did while on vacation and never having used the spare clothes).
In the case of Voyager, the cost of this reservoir is that less power is available for the scientific instruments.
So NASA engineers decided, after looking at the data and how critical this specific safety mechanism is projected to be over the next few years, that they could lower their safety criteria and release the spare power to operate all five instruments for a few more years.
In essence, this is a rebalance — or recalculation — of the Error Budget on Voyager. In modern development and Site Reliability Engineering parlance, the Error Budget is the balance between how fast we can produce and deploy new features and how careful we must be in not harming the system. In other words, how many issues and problems can we allow before we say “we need to slow down and be more careful”. In this case, NASA decided that the error budget on Voyager was too tight and loosened it.
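To make the error budget idea concrete, here is a minimal sketch using the common SRE convention that the budget is the share of a time window a service is allowed to miss its target (1 minus the SLO). The function names, the 30-day window, and the downtime figure are assumptions for illustration; nothing here reflects how NASA actually modeled Voyager's power reserve.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a given SLO over the window."""
    if not 0.0 < slo <= 1.0:
        raise ValueError("slo must be in the range (0, 1]")
    return (1.0 - slo) * window_days * MINUTES_PER_DAY


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo, window_days) - downtime_minutes


def should_slow_down(slo: float, downtime_minutes: float,
                     window_days: int = 30) -> bool:
    """True once the budget is spent: time to be more careful."""
    return budget_remaining(slo, downtime_minutes, window_days) <= 0


if __name__ == "__main__":
    downtime = 60.0  # hypothetical minutes of downtime this window
    for slo in (0.999, 0.99):
        budget = error_budget_minutes(slo)
        print(f"SLO={slo}: budget={budget:.1f} min, "
              f"slow down={should_slow_down(slo, downtime)}")
```

With a 99.9% target, a 30-day window allows about 43 minutes of downtime, so 60 minutes of incidents means slowing down. Loosening the target to 99% allows about 432 minutes, and the same 60 minutes no longer blocks new work. That is the same kind of loosening described above: accepting more risk in exchange for getting more done.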
In comparing the risk of a power surge damaging instruments against the certainty that a lack of power would reduce the number of operating instruments, NASA engineers, scientists, and project managers jointly concluded that gathering new scientific data should take priority.
This is actually quite a counter-intuitive conclusion. Usually, SREs want to prioritize long-term reliability and add safety mechanisms, because we want to have as many options as possible when something goes wrong.
But in this case, since NASA had plenty of historical data to analyze (how unstable the power generation is and how the safety reservoir has been used), they could make the judgement call to favor performance over safety.
Voyager 2 is going to be a little less reliable, but much more productive.
“Variable voltages pose a risk to the instruments, but we’ve determined that it’s a small risk, and the alternative offers a big reward of being able to keep the science instruments turned on longer,”
— Suzanne Dodd, Voyager’s project manager at JPL.
The confidence in making this decision, based on data, is similar to how parents gradually reduce the amount of spare clothes they take for their children as the children grow up!
Back on Earth, balancing the reliability and capacity requirements of IT systems is a difficult process and one which troubles SREs everywhere. Unlike Voyager, which can only use the capacity it was built and launched with in the 1970s, our computing environments are in a constant state of change and growth. Here in the CIO office, we've recently started using a solution called Turbonomic, which continually rebalances and reconfigures our systems to find the best balance between performance today and confidence in capacity tomorrow, while trying to be as economical as possible. Even better, one doesn't need to be a rocket scientist to use it!
NASA's Voyager Will Do More Science With New Power Strategy