Known Unknowns: Webb Struck by Meteoroid!
Success through Reliability — Webb Lesson 4
If it bleeds, it leads — and nothing bleeds like a potential crisis in a 10 billion dollar space telescope which has not yet started working!
Last week NASA announced that the James Webb Space Telescope had been struck by a meteoroid which (if you only skimmed media headlines) “struck a hole” in the telescope and “put it out of alignment”. Slightly less dramatically, NASA announced that, as expected, the Webb telescope had encountered tiny pieces of dust floating in space and that of the five which struck the telescope in the initial months of its life, one was larger than expected. A statistical anomaly, a speck of space dust which was larger than predicted.
Yes, it struck the telescope.
Yes, it caused some concern to engineers as they made (slight) adjustments to the mirror to correct for the dust-strike.
No, there’s no cause for concern.
It’s like a commercial website hiccuping when there’s an unexpected load on the system and quickly readjusting and returning to normal operations.
The usual questions after a headline are predictable — both for NASA missions and IT operations — How did this happen? Why weren’t we ready for this? What will happen to the system if this problem recurs?
There’s a handy tool Site Reliability Engineers often use to assess risks in a project. It’s commonly called the Rumsfeld Quadrant, after the US Secretary of Defense who brought the concept to widespread attention in 2002.
“…there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns — the ones we don’t know we don’t know.”
– Donald Rumsfeld
In short, we can divide risks into four types:
The case of known-knowns is the simplest — for NASA, it’s a risk which has a known solution such as “How do I cool Webb down?”. For SREs, it’s a technical issue which has good monitoring and automation coverage.
Unknown-knowns are more complicated: “How can I have any confidence that I can solve a problem I don’t know about?” Well, this is where experience and planning come in. For example, NASA didn’t know ahead of time how to save Apollo 13, but all the training and preparation for other issues enabled them to take control of the situation and rescue the astronauts. In the same way, operators and SREs have processes in place to deal with unexpected errors, and rehearse them with “game days”.
Known-unknowns are the compromises: you know that there’s a potential problem out there, but you don’t have enough time, money or people available to ferret out the issue and the potential solution ahead of time. NASA knows that there are specks of dust and meteoroids floating around in space; they even have an approximate idea of how many there are and how large they are. Back in 2017, NASA published a series of Questions and Answers about the hardiness of Webb, which included this exact issue:
“…In the inner Solar System where Webb will orbit, we have a good understanding of what the population of meteoroids is like from years of observations and research. It’s mostly dust and very small particles, with the majority being sparsely distributed and tinier than grains of sand. There are some pebbles, rocks, and boulders, but they are very sparse and very rare…We know Webb will get struck by micrometeoroids during its lifetime, and we have taken that into account in its design and construction… We even did tests on the ground that emulated micrometeoroid impacts to demonstrate what will happen to the mirrors in space.”
– Paul Geithner, Deputy Project Manager (Technical) for the James Webb Space Telescope
Unfortunately, Webb has been struck by a rare particle which is larger than was planned for. Could Webb have planned for larger particles? Perhaps launched with a heavier shield against such impacts? Maybe something that would generate a Star Trek-like “force field” around the space telescope to protect it?
The thing is that anything added to Webb to improve its resilience would have come at the expense of the scientific payload. A heavier shield means a smaller telescope or less helium to keep the telescope cool or… some other compromise which would reduce the scientific capability of the telescope.
A similar question was raised just hours after Webb’s launch: why doesn’t Webb have a “selfie” camera as we’ve seen on other rockets and space probes? The answer is again related to the compromises such a camera would require. If there’s enough light for the camera to see Webb properly, then there’s something wrong with the sunshade which protects Webb from the heat of the sun. So a camera capable of seeing Webb would need to generate a source of light, which would hurt Webb’s scientific capabilities. Despite the potential advantages of such a camera for keeping track of the health of Webb and investigating problems, Webb’s engineers came to the conclusion that the cost would outweigh the benefits.
In the same way, NASA scientists and engineers didn’t want to over-engineer and wrap Webb in heavier protective armour than necessary.
One of the reasons the Webb sun shield is layered is to ensure that even if particles impact the shield, they won’t make one big hole.
For SREs the question often becomes “Just how much reliability do I need?” If today the system you’re responsible for has four nines of reliability (i.e. it can be down for a little under an hour per year), how much money are you willing to invest to get to five nines (a little over five minutes of downtime per year)?
Basically you’re saying “I know that there are problems I’m not ready for, but avoiding them ahead of time will cost much more money than solving them when they occur”. How many of us pre-book a taxi/Uber/Lyft, just in case we get a flat tire in our car?
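The downtime figures behind each extra “nine” are simple arithmetic. Here is a minimal Python sketch of that calculation; the function name `allowed_downtime_minutes` is just an illustrative choice, and it assumes a 365-day year and that availability is measured purely as uptime over the whole year.

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget, in minutes, for a given number of nines.

    Four nines means 99.99% availability, i.e. an unavailability of 10^-4.
    """
    unavailability = 10 ** -nines
    return MINUTES_PER_YEAR * unavailability


for nines in range(2, 6):
    minutes = allowed_downtime_minutes(nines)
    print(f"{nines} nines: {minutes:.1f} minutes of downtime per year")
```

Four nines works out to roughly 52.6 minutes a year and five nines to roughly 5.3 minutes, so each additional nine cuts the downtime budget by a factor of ten, while the cost of achieving it usually moves in the opposite direction.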
Unknown-Unknowns are the wild card problems: the ones where we don’t have enough information to even judge what their impact will be when we encounter them.
Here we have two options. The first is to prepare as well as we can ahead of time, hoping that when an unknown problem occurs it will fall in the “unknown-known” zone and our processes and training will enable us to resolve the issue fast enough.
The second option is to investigate as many problems ahead of time as possible and even if we don’t find the solution, at least we’ll be better prepared. In essence, push this into the “known-unknown” zone.
Continuing that, further engineering work can always be done to shift “known-unknowns” and “unknown-knowns” into “known-knowns” (whew — try saying that three times fast!)
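To keep the four zones straight, here is a small Python sketch of the quadrant as a lookup. It is only my reading of the descriptions above: the first word says whether we anticipated the problem and the second says whether we have a response ready. The names `Quadrant`, `classify`, `problem_anticipated` and `response_ready` are hypothetical, chosen for illustration.

```python
from enum import Enum


class Quadrant(Enum):
    KNOWN_KNOWN = "known-known"
    UNKNOWN_KNOWN = "unknown-known"
    KNOWN_UNKNOWN = "known-unknown"
    UNKNOWN_UNKNOWN = "unknown-unknown"


def classify(problem_anticipated: bool, response_ready: bool) -> Quadrant:
    """Place a risk in the quadrant.

    Assumption: the first word describes the problem (did we see it coming?)
    and the second describes the response (do we know how to handle it?).
    """
    if problem_anticipated and response_ready:
        return Quadrant.KNOWN_KNOWN
    if response_ready:
        return Quadrant.UNKNOWN_KNOWN
    if problem_anticipated:
        return Quadrant.KNOWN_UNKNOWN
    return Quadrant.UNKNOWN_UNKNOWN


# Examples taken from the article's own scenarios.
print(classify(True, True).value)    # cooling Webb down
print(classify(False, True).value)   # Apollo 13: training covered the surprise
print(classify(True, False).value)   # a larger-than-planned meteoroid
print(classify(False, False).value)  # the wild cards
```

Seen this way, investigating a problem ahead of time flips `problem_anticipated` to true, and engineering work or practice flips `response_ready` to true, which is how a risk migrates towards the known-known zone.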
Back in the 1960s, the issue of micrometeoroids striking satellites was in its “unknown-unknown” infancy. NASA had no idea how many little specks of dust were whizzing around the Earth, from which angles they were coming, or what kind of effect they’d have on satellites. So research was done by launching a series of satellites whose raison d’etre was to be struck by micrometeoroids and report back.
These satellites had gigantic “wings” fitted with hundreds of embedded sensors, which recorded each strike by space dust so that the results could be analyzed.
The results of the Pegasus project turned the unknown-unknown risk of space impact into a known-unknown risk. Further engineering work on the design and manufacture of spacecraft found solutions to the problem (e.g. building components in layers so that impacts wouldn’t penetrate all the way through) which made them safe from micrometeoroid impact, thus turning it into a known-known risk.
In summary: Yes, there’s some cause for concern because the risk of large pieces of debris is apparently more significant than NASA originally considered. No, there’s no reason to be seriously concerned because Webb was designed to survive (and thrive) when struck by pieces of debris, even ones much larger than it was explicitly tested against. NASA will take this new knowledge and plan ways of turning the Unknown into Known.
SREs supporting IT environments down here on Earth use many Application Performance Management and Observability tools to keep track of what’s going on in the environments they’re responsible for, to make sure that the known-knowns are behaving and the unknowns become known.
I’m very happy to be able to say that Instana, the tool I’ve used the most in the last few years, has just been announced as a leader in the field by Gartner! This is an incredible achievement by the team and I’d like to take the opportunity to congratulate them.
Last month I presented a session about “Lessons from Webb” at the fantastic #WTFisSRE 2022 conference. If you attended, I hope you enjoyed it — I know I certainly did!
If you didn’t have a chance then you can watch my replay (and everyone else’s) at their video repository:
I know that I’ve learned a lot by attending conferences and even more by collecting my thoughts into a coherent session to share with others. It’s something I’d recommend anyone to try at least once.
If you’ve got something you’ve done in the areas of reliability, performance, DevOps and so on which you’d like to share with others, here are a couple of upcoming conferences I’m involved with:
- If you’re local to Israel (or want an excuse to visit!), then may I suggest submitting a session to DevOpsDays Tel Aviv — Call for papers open till the end of August (you’ve got plenty of time!).
- If you’re local to planet Earth then may I suggest submitting a session to a virtual conference: PREVAIL, IBM’s premier conference for presentations and panels about resilience, performance, security, and testing. By sheer coincidence, the call for papers is also open till the end of August.