What kind of Site Reliability Engineer was Sherlock Holmes?
Holmes as a role model for SREs
As humans, we learn from others: which good things to do and which bad things to avoid. As professionals, one of the ways we improve ourselves is by finding good role models to emulate. In this article I will explain why Sherlock Holmes is an excellent role model for Site Reliability Engineers (SREs), and in future articles I will dive deeper into further lessons we can learn from him.
SREs are, after all, responsible for the reliability, availability and performance of the systems or services they support. What does that have to do with the fictional character of Sherlock Holmes?
I have presented some of these concepts briefly at conferences, both public and within IBM, and I plan to use this platform to expand on them.
For the sake of the argument, there are obvious parallels between “solving crimes” and “resolving incidents”.
In both, we start in a normal environment in which an anomaly of some kind occurs and causes harm to our clients.
In both, the first task is to negate the immediate harm (if possible) and then investigate the underlying causes of the issue, resolve them, and make sure they don’t happen again. By investigating the methods of the preeminent criminal investigator in fiction, I believe we can be better engineers in reality.
Working in the late 19th and early 20th centuries, Holmes was conscious of the scientific and technological advances all around him and took advantage of them. In the very first Holmes story, A Study in Scarlet, he is introduced to Watson while performing an experiment:
“I’ve found it! I’ve found it,” he shouted to my companion, running towards us with a test-tube in his hand. “I have found a re-agent which is precipitated by hemoglobin, and by nothing else.”
— Sherlock Holmes, A Study in Scarlet
In short, Holmes can now detect bloodstains at a crime scene using a special chemical he has found. Holmes plans on publishing the test, making it available to others and thus helping to resolve hundreds of crimes. As he claims:
“Criminal cases are continually hinging upon that one point. A man is suspected of a crime months perhaps after it has been committed. His linen or clothes are examined, and brownish stains discovered upon them. Are they blood stains, or mud stains, or rust stains, or fruit stains, or what are they? That is a question which has puzzled many an expert, and why? Because there was no reliable test. Now we have the Sherlock Holmes’ test, and there will no longer be any difficulty.”
— Sherlock Holmes, A Study in Scarlet
Today’s SREs similarly wrestle with the investigation of an incident. The service is running slowly, with a latency much higher than the design calls for.
Is it because of slowness in the network or in the storage?
Is the delay caused by a message queue building up or because of a database buffer filling up?
Is there a reliable test? Each environment is different and each one requires testing and examination — and understanding of the normal behaviour — before we can properly investigate an anomaly.
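One way to picture "understanding the normal behaviour" is to compare the current reading of each component against its own baseline and see which one has drifted furthest. The sketch below is only an illustration under stated assumptions: the component names, the latency figures, and the helper functions `deviation_scores` and `most_anomalous` are all hypothetical, and a simple z-score stands in for whatever test actually suits your environment.

```python
from statistics import mean, stdev


def deviation_scores(baseline, current):
    """Return, per component, how many baseline standard deviations
    the current reading sits away from the baseline mean (a z-score)."""
    scores = {}
    for component, samples in baseline.items():
        mu = mean(samples)
        sigma = stdev(samples)
        if sigma == 0:
            # Perfectly flat baseline: any change at all is infinitely unusual.
            scores[component] = 0.0 if current[component] == mu else float("inf")
        else:
            scores[component] = (current[component] - mu) / sigma
    return scores


def most_anomalous(baseline, current):
    """Name the component whose current reading deviates most from normal."""
    scores = deviation_scores(baseline, current)
    return max(scores, key=lambda component: abs(scores[component]))


# Hypothetical latency samples in milliseconds, gathered while the
# service was healthy (assumed data, not from any real system).
baseline = {
    "network": [10, 12, 11, 9, 10],
    "storage": [5, 6, 5, 4, 5],
    "message_queue": [20, 22, 19, 21, 20],
    "database": [30, 31, 29, 30, 32],
}

# Hypothetical readings taken during the incident.
current = {"network": 11, "storage": 5, "message_queue": 95, "database": 33}

suspect = most_anomalous(baseline, current)
print(f"Investigate first: {suspect}")
```

With these made-up numbers the message queue stands out, but the point is the method rather than the result: without baseline samples collected before the incident, there is nothing to compare against.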
Holmes also shows us that we should never be satisfied with the current state of our equipment or observability systems: we should always push them to the limit and maximize the insights we can gather from them. And if the tools we have are limited, then we must improve them or develop our own. A good SRE is never satisfied with what exists, but always strives to improve.
Holmes’ expertise is not limited to identifying blood, but covers many domains:
I have made a special study of cigar ashes — in fact, I have written a monograph upon the subject. I flatter myself that I can distinguish at a glance the ash of any
known brand, either of cigar or of tobacco.
— Sherlock Holmes, A Study in Scarlet
Note that Holmes is not satisfied with just creating a new tool to solve his problems: he is eager to share his knowledge and publish it so that others can benefit from it too.
Within a few paragraphs of the first story we meet him in, Sherlock Holmes has already demonstrated three things a good SRE should always do:
- Practice using the tools and skills available to you before incidents/crimes occur, so that you will be well prepared to investigate efficiently when necessary.
- You will probably need a variety of tools working in tandem, each responsible for a different aspect of your work. If the tools you are using are not efficient enough or do not deliver the results you need, use your engineering skills to create new ones!
- Whenever you create a new tool, or leverage a new skill, be generous in sharing your new knowledge with others.
These are concepts all SREs should be familiar with. In future articles I will dive deeper into some more lessons SREs can learn from Mr. Sherlock Holmes.
Solving crimes and resolving incidents make the world safer and more reliable. It’s everyone’s business!
I cannot share the sessions I have delivered within IBM, but here is the five-minute lightning talk I gave at SRECon Americas 2023:
Please let me know what you think of this idea and whether you have a favourite Sherlock Holmes story.
If you liked this article please clap here and share elsewhere.
For future lessons and articles, follow me here as Robert Barron or as @flyingbarron on Twitter and LinkedIn.