The James Webb Space Telescope — making 300 points of failure reliable
Success through Reliability: Webb Lesson 1
The James Webb Space Telescope (JWST) is the largest and most sensitive space telescope ever devised. It will replace the Hubble Space Telescope as the flagship space-based observatory.
After years of delay, the JWST was launched on Christmas Day 2021 and is now on its way to the destination in space where it will remain while peering deep into the universe.
There, at the second Lagrange point (L2), the JWST will orbit the Sun while staying behind the Earth. This will allow it to keep its delicate instruments below 50 K (−223 °C, −370 °F), cold enough to make precise observations in the infrared spectrum.
Between launch and arrival, the JWST must perform a bewildering array of tasks — unfurling solar panels, main mirrors, secondary mirrors, and sun shields, calibrating instruments, setting temperatures, aligning optics and more. Each and every step will be tested before, during, and after it is performed to make sure that it is successful and does not adversely affect future steps. Despite the barrage of tests which the JWST undertook before launch, NASA engineers are extremely cautious with every task and take every opportunity to pause and double check before pushing forward. While the systems were thoroughly checked on the ground…
“Nothing we can learn from simulations on the ground is as good as analyzing the observatory when it’s up and running. Now is the time to take the opportunity to learn everything we can about its baseline operations.”
— Bill Ochs, Webb project manager.
And no wonder they’re so careful: conservatively, out of the thousands of tasks which must be completed before the telescope can be considered successfully deployed, over 300 are “single points of failure”, meaning that if any one of them fails, the telescope as a whole will fail.
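To get a feel for why a long chain of single points of failure is so unforgiving, here is a minimal Python sketch. It assumes every step succeeds independently and with the same probability, which is a simplification for illustration and not how NASA models the JWST. The function names and the per-step figures are hypothetical.

```python
def chain_success_probability(step_reliability: float, steps: int) -> float:
    """Probability that every step in a serial chain succeeds.

    Assumes independent steps that all share the same reliability
    (an illustrative simplification, not NASA's actual model).
    """
    if not 0.0 <= step_reliability <= 1.0:
        raise ValueError("step_reliability must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    return step_reliability ** steps


def required_step_reliability(target: float, steps: int) -> float:
    """Per-step reliability needed for the whole chain to reach `target`."""
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be in (0, 1]")
    if steps <= 0:
        raise ValueError("steps must be positive")
    return target ** (1.0 / steps)


if __name__ == "__main__":
    steps = 300  # "over 300" single points of failure, per the article
    for reliability in (0.99, 0.999, 0.9999):  # hypothetical per-step values
        overall = chain_success_probability(reliability, steps)
        print(f"per-step {reliability:.4f} -> whole chain {overall:.3f}")
```

Under these assumptions, steps that each succeed 99% of the time give the whole chain only about a 5% chance of success, and even 99.9% per step leaves the chain at roughly 74%. That is why each individual step has to be far more reliable than the overall target.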
If you ask any Site Reliability or DevOps engineer how they feel about a deployment plan with over 300 single points of failure, you’d see a lot of nauseous faces and an outbreak of nervous tics! However, NASA has decided that the best way to design, deploy and operate the JWST depends on their 300 and more steps succeeding, one after the other, with no recourse or second chance. How did we get into this situation and (more importantly) why do we have the confidence that this will succeed?
The answer to the first question is simply a matter of dependencies and constraints. For example, the main mirror of the JWST has a diameter of 6.5 meters which is wider than any existing rocket can carry into space. This meant that NASA could either design and build a larger rocket or build a foldable mirror. NASA was in a no-win situation; they needed either a brand-new class of rocket or a brand-new kind of mirror — both of which introduce their own risks and challenges. In the end, the decision was made that building a mirror which could unfurl in space would be better, even if that still left the unique challenge of folding and unfolding the telescope like the largest metal origami ever conceived.
The second question can be answered in three words — testing, testing, and testing. The JWST cost so much (over 9,000,000,000 dollars) and took so long to create (originally planned for 2013, only launched in the final week of 2021) because so many unplanned challenges arose and had to be solved before the launch.
But the JWST was not designed or tested in a vacuum (pun only partially intended). The concepts used to create the JWST are built on the same concepts used on the first satellites and space probes in the 1950s. These have been continually advanced ever since. The JWST, quite literally, learnt from the mistakes (and successes) of everything that came before it.
In the earliest days of space exploration, it was taken for granted that not all missions would succeed. Overall, the Ranger program to take close-up photographs of the Moon succeeded, but the first photos were only taken by Ranger 7 in 1964 after missions 1 through 6 all failed.
In 1966 Surveyors 1 & 2 were publicly defined as “engineering tests” and not scientific missions. No one was more surprised than the NASA engineers at the Jet Propulsion Laboratory when Surveyor 1 landed successfully on the Moon on the very first try!
Such was the regularity of failures in those embryonic explorers that the Soviet Union had a protocol where missions were only officially named after a successful launch. Missions which failed in their first moments were given generic nomenclature instead of the name they were designed for and which their engineers aspired towards. So Mars-1 (1962) and Mars-2 (1971) were actually the 4th and 10th probes launched to explore Mars and not the first and second as might be assumed by their names.
Many space probes were launched in pairs so that at least one of them would succeed. Such missions include the American Mariner 8 & 9 (8 failed, 9 succeeded), Viking 1 & 2 (both succeeded), and Voyager 1 & 2 (Voyager 1 succeeded so well that Voyager 2 could be reconfigured to go on to a new and improved mission), and the Soviet Mars 2 & 3 (both failed) and Vega 1 & 2 (both succeeded).
Success was assured via redundancy.
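The arithmetic behind launching in pairs can be sketched in a few lines of Python. This assumes each probe succeeds or fails independently with the same probability, which real missions sharing a design only approximate. The function name and the 50% figure are hypothetical illustrations, not historical success rates.

```python
def at_least_one_succeeds(mission_reliability: float, copies: int) -> float:
    """Probability that at least one of `copies` identical missions succeeds.

    Assumes the missions fail independently of each other
    (an illustrative simplification).
    """
    if not 0.0 <= mission_reliability <= 1.0:
        raise ValueError("mission_reliability must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # Every copy has to fail for the programme as a whole to fail.
    all_fail = (1.0 - mission_reliability) ** copies
    return 1.0 - all_fail


if __name__ == "__main__":
    single = 0.5  # hypothetical chance that one probe succeeds
    for copies in (1, 2, 3):
        print(copies, at_least_one_succeeds(single, copies))
```

With a hypothetical 50% chance per probe, a pair lifts the odds of at least one success to 75%. The same reasoning sits behind running more than one replica of a service, with the same caveat: it only helps if the copies do not share a common cause of failure.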
Another improvement was in the reliability of the individual probes themselves. Mariner 9 reached Mars in 1971 during a severe dust storm which completely hid the planet from its prying cameras. The mission succeeded because Mariner 9 was able to “pause” and wait months for the storm to pass before performing its mission. The additional longevity of the probe meant the difference between high-quality, scientifically fascinating photographs of Mars and useless photographs of a dust storm.
Success was assured via reliability.
Fast forward to the 1980s and a new concept arose: space probes and satellites could be repaired in orbit by astronauts visiting in the Space Shuttle. While this was by no means common, the Space Shuttle astronauts visited a number of satellites which orbited Earth and fixed, refueled and upgraded them. The first of these was the repair of the Solar Maximum Mission in 1984, but by far the most famous was the Hubble Space Telescope, which was visited by astronauts five times, at first to repair the faulty optics and later to upgrade the scientific instruments and repair or replace components which had failed.
Success was assured via repairability.
The high cost of the JWST meant that only one would be built, so redundancy could not be relied on to assure success.
Some of the specific requirements of the JWST mean that it cannot be repaired once launched, so repairability could not be relied on either.
The only solution left for NASA managers & engineers was making sure that the JWST would be as reliable as possible.
Endless tests pushed back the launch date. So far, two weeks into the mission, it appears that all systems are go!
If you want to know where the JWST is now, and how successful it is in deploying, the following site has all the information :
I plan for the next few articles to be split between further details of the James Webb Space Telescope and flashbacks to reliability lessons from past missions.
We’ll see how success is assured via redundancy, reliability, and repairability while contrasting NASA’s lessons to those relevant to today’s DevOps engineers and SREs.
In addition to those lessons, we’ll also discuss the Observability capabilities of the Webb telescope, both internally (how do engineers understand how well it functions) and externally (how do scientists use the telescope to observe the universe).
The next article, Success through Redundancy, is available.