The James Webb Space Telescope — Success through Redundancy
Success through Reliability — Webb Lesson 2
Success is overcoming obstacles, whether in deploying sun-shields and mirrors in the James Webb Space Telescope (Webb from here on) or in keeping an application your company depends on running smoothly in the cloud.
In the first article in this series I discussed the three ways in which NASA engineers¹ could ensure the success of the Webb telescope:
- Redundancy — building multiple telescopes, so that if one failed, the others could complete the mission.
- Repairability — building the telescope so that it could be fixed if anything went wrong.
- Reliability/Resiliency — building the telescope so that it could cope with any failures which might occur.
NASA invested heavily in the third option, designing, building and perfecting mechanisms which would guarantee the success of the mission. So far, as I write this in January of 2022, everything is going well.
Before investigating the Reliability & Resiliency techniques used by Webb, this article will describe the use of Redundancy in previous space exploration missions while the next article will cover Repairability. Of course, we’ll also use these as opportunities to extract some lessons for today’s Site Reliability Engineers (SRE) and DevOps engineers.
Now, to make sense of what we mean by “success through redundancy” we’ll need to define a few terms. First, by success we mean “Mission Success”. In the case of Webb, success is not “launch” or even “arriving on station”. Success is “performing the scientific observations required and sending the information back to Earth”. Other NASA spacecraft might have a mission of “orbit Mars and return images” or “land on the Moon” and so on.
In the case of SREs and DevOps engineers, the mission of the services they support is to complement whatever their company requires — in whatever industry they may be. There may be individual failures in parts of the underlying infrastructure, but as long as the infrastructure is redundant enough, these failures will not affect the overall success of the mission or the experience of people using your applications.
Next, for the purposes of this article we’ll limit the discussion of redundancy to the redundancy of the entire system (either a spacecraft or a computer application). In other words, the system has redundancy if it can be replaced and the mission will still succeed. The most basic example of redundancy in the IT world is the humble backup, which enables one to recover lost data if the database or disk drive fails.
One of the major concepts (and successes) of modern Cloud architectures is the inherent redundancy of components:
- If a server fails, instantly replace it with an identical one and continue working. (aka Cattle vs Pets)
- Functions-as-a-Service and Event Driven Architectures use message (or event) brokers to guarantee delivery of messages by re-sending (or re-processing) failed messages.
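To make the second point concrete, here is a minimal sketch of redelivery, assuming a toy in-memory broker. `TinyBroker`, `make_flaky_handler` and the `max_attempts` limit are hypothetical names invented for this illustration; they are not the API of any real message broker.

```python
from collections import deque


class TinyBroker:
    """Hypothetical in-memory broker that redelivers messages whose handler fails."""

    def __init__(self, max_attempts=3):
        self.max_attempts = max_attempts
        self.queue = deque()      # entries are (message, failed_attempts_so_far)
        self.dead_letters = []    # messages that used up all their attempts

    def publish(self, message):
        self.queue.append((message, 0))

    def drain(self, handler):
        """Deliver every queued message, putting it back on the queue if the handler raises."""
        delivered = []
        while self.queue:
            message, attempts = self.queue.popleft()
            try:
                handler(message)
                delivered.append(message)
            except Exception:
                attempts += 1
                if attempts < self.max_attempts:
                    self.queue.append((message, attempts))   # try again later
                else:
                    self.dead_letters.append(message)        # give up, keep for inspection
        return delivered


def make_flaky_handler(fail_first_n):
    """Build a handler that fails the first `fail_first_n` deliveries of each message."""
    seen = {}

    def handler(message):
        seen[message] = seen.get(message, 0) + 1
        if seen[message] <= fail_first_n:
            raise RuntimeError("simulated consumer crash")

    return handler


broker = TinyBroker(max_attempts=3)
broker.publish("a")
broker.publish("b")
print(broker.drain(make_flaky_handler(fail_first_n=1)))
```

Because a message may be handed to the consumer more than once, handlers in this style of system generally need to be safe to run twice on the same message.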
The simplest way to achieve redundancy is to add more hardware: if we suspect that a computer/node/container/whatever might fail, then we purchase two and make sure that the second is ready to fulfill the first’s role as soon as it is necessary.
By this logic, NASA should have built two copies of the Webb telescope — just in case the first failed. To understand why NASA did not depend on redundancy for the success of Webb we must first understand how NASA has successfully used redundancy in previous missions.
Let’s rewind to the 1960s and look at some of the first spacecraft to explore Mars.
Sending a probe to Mars is not as simple as pointing the rocket and launching it in the right direction. Due to the movement of both Earth and Mars around the Sun, it takes more energy (and time) when Mars is further away. There is a window of time approximately every two years when it is easiest and cheapest to launch a probe. Neither the rockets launching the probes nor the probes themselves were particularly dependable in the 60s, and there wasn’t time to invest in further improvements. The simplest way to get the most out of each opportunity was therefore to build two copies of each probe, launch them both, and hope that at least one would succeed. As the following table shows, many of the early failures were not even failures of the spacecraft themselves but of the rockets launching them.
As the technology matured, both rockets and spacecraft became more reliable. By the 1990s, missions to Mars stopped depending on redundant spacecraft and NASA’s missions were each unique. Mars Observer, Mars Global Surveyor, Mars Pathfinder, Mars Climate Orbiter, and Mars Polar Lander were all launched between 1992 and 1999 without a “twin” to keep them company along the long journey between Earth and Mars.
The complexity of each mission, and the number of scientific measurements expected from each one, were orders of magnitude higher than those of the earlier missions. Much of this was thanks to the progression of technological capabilities: more powerful rockets which could launch heavier space probes, newer and more sensitive instruments which could perform more measurements, miniaturization of equipment which could pack more scientific instruments into the same space and, above all, more advanced & powerful computers which made the spacecraft itself more intelligent, resilient, and reliable.
As a general rule, this kind of redundancy was less common for satellites which orbited the Earth. Because they were not traveling to deep space, there was less pressure to launch at a specific moment in time. It was easier to launch a single mission, learn from any mistakes and launch again without needing to wait for the planets to align. Even when identical scientific satellites were launched, it was usually to collect more information and not for them to serve as redundant backups for each other.
A series of early space telescopes (the Orbiting Astronomical Observatories) each used different instruments to perform different astronomical observations. They did not all succeed, but lessons learned from their design & operations were used for future missions.
By the 1990s, the Great Observatories (the Hubble Space Telescope, the Compton Gamma Ray Observatory, the Chandra X-ray Observatory, and the Spitzer Space Telescope) added another complication to the idea of redundant observatories. These flagships of astronomy were simply too expensive to justify building another one “just in case”. Instead of investing in redundancy, NASA invested in making the spacecraft as reliable as possible. Famously, the Hubble Space Telescope suffered a colossal failure of reliability: its main mirror was ground to slightly the wrong shape, leaving its images out of focus. Fortunately, another way of ensuring the success of the mission, repairability, was available. The saga of how Hubble was repaired (and why Webb does not have this same capability) will be the subject of the next article.
We can see that the use of redundancy in spacecraft was driven by three factors:
- Venturing into the unknown. The spacecraft using redundancy were usually the first of their kind to approach planets, orbit planets and land on planets. There was very little previous experience to draw on and so simply doubling the number of missions lowered the chance of failure.
- Low reliability of the spacecraft themselves. Space exploration was in its infancy and many of the engineering processes and technical solutions which we are familiar with today were just being developed. The confidence in any single component — and the capability of the spacecraft to handle failures — was much lower than today.
- Lower cost & complexity. Early missions had limited goals and lifespans. The first Mariners sent back a handful of images. The Lunar Orbiters, which surveyed the Moon before the astronauts landed, spent mere weeks in orbit. In addition, the number of scientific instruments they had on board was more limited. Today’s missions are larger and have many more potential points of failure which must be accounted for. If the probe isn’t reliable then simply doubling the number of probes would just mean that the 2nd probe would fail in a different way.
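The factors above can be put into rough numbers. The sketch below assumes that failures are independent and that a probe works only if every one of its components works. The function names and the reliability figures are illustrative assumptions, not historical data.

```python
def p_at_least_one_success(p_single, copies):
    """Chance that at least one of `copies` independent attempts succeeds."""
    return 1 - (1 - p_single) ** copies


def p_single_probe(component_reliability, components):
    """Chance that a probe works when every one of its components must work."""
    return component_reliability ** components


# A simple early probe: a coin-flip mission becomes a 3-in-4 chance with a twin.
print(p_at_least_one_success(0.5, copies=2))

# A complex probe: many points of failure drag down each copy,
# so a twin helps far less than improving the probe itself.
simple = p_single_probe(0.99, components=10)
complex_probe = p_single_probe(0.99, components=200)
print(round(simple, 3), round(p_at_least_one_success(simple, 2), 3))
print(round(complex_probe, 3), round(p_at_least_one_success(complex_probe, 2), 3))
```

Under these assumptions, doubling a simple probe takes an already decent chance close to certainty, while doubling a complex, unreliable one still leaves the mission more likely to fail than to succeed.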
So, as SREs, when should we look at Redundancy as our solution?
Essentially, redundancy is a form of horizontal scaling (adding more “of the same”) where each component is neither aware of nor dependent on the others. So adding a web server behind a load balancer is adding redundancy, as is adding another node to a Kubernetes cluster so that a DaemonSet can scale out. But adding a shard to a database is not, since the shards need to coordinate.
We do need external coordination of some kind — whether it’s a NASA flight controller coordinating missions, a loadbalancer or the Kubernetes cluster management system itself — so that the redundant resource can come into play at the right time.
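As a minimal sketch of that external coordination, assume a toy round-robin load balancer that skips any backend marked unhealthy. `RoundRobinBalancer` and the backend names are hypothetical; in a real system the health marks would come from periodic health checks.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hypothetical coordinator: spreads requests over identical, independent backends."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, skipping any that are marked down."""
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends left")


lb = RoundRobinBalancer(["web-1", "web-2", "web-3"])
print([lb.pick() for _ in range(3)])
lb.mark_down("web-2")          # one redundant server fails
print([lb.pick() for _ in range(4)])
```

The backends never talk to each other. Only the balancer knows which of them are alive, which is what lets a failed one drop out without the others noticing.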
So we can solve some performance problems or availability problems by “throwing more assets” at the problem. And this hardware can be small and simple and cheap or it can be large and complex and costly, depending on the problem we want to solve. A rule of thumb is that you’d want to spread your application across three environments, preferably in a hybrid cloud configuration, so that even if there’s a major failure in one environment the application will still function and you’ll have resources to spare in case of a sudden spike of usage. But this might not always be a cost-effective way of working. You need to know where and how to allocate your new resources in the most cost efficient way.
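The cost side of that rule of thumb can be sketched with some simple arithmetic, assuming load is spread evenly and the surviving environments must carry the full peak. The function names, the load figure and the `headroom` parameter are assumptions made for this illustration.

```python
def capacity_per_environment(peak_load, environments, tolerated_failures, headroom=0.0):
    """Capacity each environment needs so the survivors can still carry the peak."""
    survivors = environments - tolerated_failures
    if survivors <= 0:
        raise ValueError("at least one environment must survive")
    return peak_load * (1 + headroom) / survivors


def total_provisioned(peak_load, environments, tolerated_failures, headroom=0.0):
    """Total capacity paid for across all environments."""
    per_env = capacity_per_environment(peak_load, environments, tolerated_failures, headroom)
    return environments * per_env


peak = 1000  # requests per second, an illustrative figure
for envs in (2, 3, 4):
    print(envs, capacity_per_environment(peak, envs, 1), total_provisioned(peak, envs, 1))
```

With these assumptions, surviving the loss of one environment costs twice the peak capacity with two environments but only one and a half times the peak with three, which is one reason three is a popular number. Every additional environment also adds operational complexity, which this sketch does not price in.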
We can maximize the value we get out of Redundancy by utilizing Application Resource Management solutions such as Turbonomic ARM, which will orchestrate your solutions.
As long as the problem is conceptually simple and the financial cost of redundancy is not too high, adding redundant components is a quick and simple way of ensuring success. This also means that we might start adopting a new technology stack by utilizing redundancy, and gradually add more forms of reliability as we become more comfortable with the new technology.
In the next few articles we’ll examine cases where simple redundancy will not suffice.
1) I use the shorthand “NASA Engineers” to represent everyone involved in the design, development and operations of the James Webb Space Telescope, whether engineers, architects, technicians, managers or in any other role. Of course, this also includes all the contractors and partners involved.