Who to blame for all your problems
Conducting Blameless Postmortems
This post is based on my talk at PyCascades 2019.
To start off with, what is a postmortem?
There are two common uses of the term:
- A document detailing what happened during an incident
- A meeting to review an incident, usually resulting in the creation of the postmortem document
This post is focused on the meeting, but I'll also have some recommendations for the document.
Why Run Postmortems?
Why do we conduct postmortems, anyway? Production broke, we fixed it, call it a day, right?
Holding postmortems helps us understand better how our systems work -- and how they don't.
If your system is complex (and it probably is), the people who work on it have an incomplete and inaccurate view of how it works. Incidents highlight where these gaps and inaccuracies lie. Reviewing incidents after the fact will improve your understanding of your systems. By doing this as a group and sharing what you found, you can improve your whole organization's understanding.
You've probably already had failures -- or at least imagined what types of failures you might have -- and you've tried to defend against those. The defenses you've built are improving reliability, which means it takes several failures to cause a real problem. In fact, if you've built a moderately complex and robust system, it probably runs in a partly-degraded mode most of the time. Services sometimes hang, but the load balancer responds by trying a different server. A virtual machine crashes, but it's not the only one hosting this service, so you run at partial capacity until it's restarted or replaced.
In order for there to be a real problem, you might need to have taken some of the servers out of rotation for a deployment first, then have one or two instances hang, and the increased load on the others causes a cascade effect. Or maybe it takes several VMs crashing while the hypervisor is unavailable due to maintenance.
Understanding these edge cases gives you insight into the limitations of your systems and the false assumptions you hold about them. More importantly, it lets you see what people are doing behind the scenes on a daily basis to protect you from these sorts of failures.
Failures are your system telling you where it doesn't work the way you think it does. Postmortems let you listen to it.
What is Blamelessness?
Alright, you're sold on postmortems. Glad we got that covered. Now what was that bit about being blameless?
Being blameless boils down to this: you assume every person made the correct decision at every point along the way, given the information that was available to them at the time.
This means that if your causes -- root or proximate -- include 'human error', you have not really identified your causes. By approaching your postmortems blamelessly, you start with the assumption that 'human error' is not a valid conclusion.
Why Blameless Postmortems?
Why would you do that? How will we get people to stop breaking things if we just say "you did your best, I'm sure you'll get 'em next time"?
Well, first off, this is not about participation trophies, or positive affirmation, or anything like that (not that there's anything wrong with those ideas).
Honesty and transparency
The primary goal of blamelessness is encouraging honesty and transparency. When you include blame in your postmortems, you encourage people to hide information and distance themselves from problems.
Say Richard reboots a server to try to fix a problem, and that makes it worse. Then, we hold a postmortem where we find the root cause was 'operator error' on Richard's part. That leads to action items like taking away Richard's production access, or making Richard get approval for a while whenever he wants to reboot a server. Will he tell us what he did during the next outage? Or if we fire him, will Remy tell us what happened during a postmortem for their failed deployment?
By removing blame from postmortems, we encourage open, honest, complete feedback from the people who know best how the system operates, and how it failed to do so during this incident.
We're all on the same side
Reviewing incidents blamelessly also shifts your perspective by preventing "Monday-morning quarterbacking". In The Field Guide To Understanding Human Error, Sidney Dekker tells us this:
Human error is an attribution, a judgement that we make after the fact
In other words, 'human error' is not a class of behavior - it's how we classify actions after we know outcomes. People don't shift between "error mode" and "normal mode" - they take input from their surroundings, apply their knowledge and expertise, and take actions based on that.
This helps us remember that people don't come to work to fail: They come to work to succeed. Your coworkers are not the ones causing the systems to break. Most of the time, they're doing "messy work": little nudges and corrections and cleanups throughout the day. They're doing the little things needed to keep your systems online. The people closest to the problems - the ones you'll inevitably blame in a blameful postmortem - they are experts in maintaining these systems.
If we look at how people behaved and recall that they want to succeed, we realize that they only do things that make sense at the time. When we realize that, we have to stop saying "this was their fault because they pulled the lever" and start saying "why did they pull that lever?" Surely, if it made sense this time, it'll make sense in the future. Was this an observability problem? Was this an interface problem? Could the system have somehow prevented this person from pulling the lever?
Aside: Blamelessness != Anonymity
You might note that I used names in all the previous examples. Blamelessness is not about anonymizing actions, or about discussing the system without mentioning the humans in it.
Here's an anonymized version, so we're clear what we're talking about:
Trigger: An engineer powered down the main fooserver
Anonymizing says "we want to blame you, but we don't know who you are." Or maybe, "we want to blame you, but Ben said we couldn't."
If you anonymize, what will probably happen is someone will read the anonymized statement during the meeting, pause, and look conspicuously at James, who was indeed the engineer who powered down the system.
Removing all mention of the people involved says nobody works here, the system just did these things on its own, and makes it harder to understand what happened:
Trigger: The main fooserver was powered down
How? Why? Did we lose power to the rack? Did it have a kernel panic? Was this sabotage by our competitors?
Trigger: James powered off the main fooserver
Stating what someone did is not the same as assigning blame to them.
Accountability without Blame
At about this point you may be wondering how to ensure accountability within your team when you eliminate blame from incident reviews. At first glance, blamelessness seems directly at odds with ensuring accountability.
In actuality, removing blame from incidents turns your culture from backward-looking accountability to forward-looking accountability.
If you have a blameful culture, after an incident you might take away an engineer's access to certain systems, require them to get permission to make certain changes, or maybe even fire them. The next time there is a similar incident, that engineer's skills will not be available as readily, or maybe at all, and the incident probably won't go any better. Your incident investigation probably ended in a diagnosis of "operator error" last time, so other engineers have to learn the hard way, during the next incident, how best to handle it.
Compare that to a blameless environment. After the first incident, the engineer(s) involved likely learned more about the incident during the postmortem, as did the other engineers on the team. Changes were probably made to the system to make this type of incident less likely (real changes, not "Tell them not to do that anymore"). Even if the system itself is not more resilient to this failure mode, the engineers are now better equipped to handle it quickly and effectively.
By soliciting the open, honest feedback from the people who were there when everything went wrong -- the people who were there when it all got fixed -- we turn them into teachers. The information gathered in a postmortem helps make your experts more knowledgeable in their fields.
Blamelessness in High-Stakes Organizations
Sure, you say, this works for unicorns, or startups, or someone else, but my work is far too important, or my company is too highly regulated, or our processes are too sensitive.
Here are some real organizations that really do this:
The US Forest Service.
In response to a "serious" incident (generally an on-duty fatality), the US Forest Service follows their Coordinated Response Protocol and Learning Review Process. In the introduction to their guide on this, they say they developed this process "as a result of the agency's transition from focusing on finding cause to striving to understand conditions and influences that had an effect on decisions and actions." The guide goes on to say:
The Learning Review process begins with the understanding that if the action(s) had not made sense to those involved, they would have done something different. Conditions shape decisions and actions, and revealing these conditions will help the agency design more robust and resilient regulations, policies, and procedures (i.e., a more robust and resilient system).
That's pretty big talk, but it doesn't end with the talk. This was actually codified by an Executive Order and the Code of Federal Regulations:
Information gathered by Learning Review (LR) personnel will not be used as the basis for disciplinary action or to place blame on employees. No attempts to gather information resulting from other [Coordinated Response Protocol] functions or activities such as [Critical Incident Peer Support] will be made. This is in accordance with Executive Order 12196 paragraph 1-201[f] and CFR 1904.36 and CFR 1960.
Not only will the USFS not blame you, they will not allow others to use the results of their reviews to blame you.
National Transportation Safety Board
The NTSB investigates accidents in the aviation, highway, marine, pipeline, and railroad modes, as well as accidents related to the transportation of hazardous materials.
They say this in their instructions for conducting public hearings on accident investigations:
There are no adverse parties or interests. There are no formal pleadings. The Board does not determine liability, nor does it attempt to do so. For this reason, questions directed to issues of liability will not be permitted. I must emphasize the fact-finding nature of the hearing. Our sole purpose is to determine how and why this accident occurred and what can be done to prevent similar occurrences in the future.
If your work is more pressing than the NTSB or Forest Service, maybe you can tell me later how you run postmortems.
Getting Started with Blamelessness
So how do you get started?
First, make sure everyone understands what blamelessness means. For my team, we have the definition I shared earlier -- actually, it's important enough, here it is again:
assume every person made the correct decision at every point along the way, given the information that was available to them at the time.
We have that definition in the template we use to construct our postmortem documents, and typically the person running the postmortem meeting will remind everyone of this at the beginning of the meeting.
Next, it's ideal to draw up your specific procedures before the first incident. You'll probably refine them a bit to fit your team, but it helps to have something in place before your first time through.
Then, set some parameters on when you hold postmortems. Here are some suggestions:
- Any time an outage is declared
- Any time service is degraded for a certain amount of time
- Any time a human needs to intervene to correct a process that should be automatic
- Any time a stakeholder requests it
It might be a good idea to consider running a few postmortems on "near misses" if you don't frequently have incidents that meet these criteria. This gives you practice before "the big one" hits and, anyway, you'll probably learn something about your system.
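If you want these criteria to be unambiguous, one option is to write them down as a small check. The following is only a sketch: the `Incident` fields and the 30-minute threshold are assumptions made for illustration (the list above only says "a certain amount of time"), so substitute whatever your team agrees on.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import List


@dataclass
class Incident:
    """Hypothetical record of an incident; the field names are illustrative."""
    outage_declared: bool = False
    degraded_for: timedelta = timedelta(0)
    manual_intervention: bool = False  # a human corrected a process that should be automatic
    stakeholder_requested: bool = False


# Assumed threshold; pick a value that fits your own service.
DEGRADATION_THRESHOLD = timedelta(minutes=30)


def postmortem_reasons(incident: Incident) -> List[str]:
    """Return every criterion the incident meets; an empty list means none apply."""
    reasons = []
    if incident.outage_declared:
        reasons.append("outage declared")
    if incident.degraded_for >= DEGRADATION_THRESHOLD:
        reasons.append("service degraded past threshold")
    if incident.manual_intervention:
        reasons.append("manual intervention in an automatic process")
    if incident.stakeholder_requested:
        reasons.append("stakeholder request")
    return reasons
```

An empty result does not rule out a postmortem: a near miss that meets none of the criteria can still be worth reviewing for practice.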
The Postmortem Document
I mentioned our template document - here's what's in it:
Summary: All instances of the web server stopped responding to requests for approximately 1 hour
Impact: Users were unable to view most pages on the Visual Guide for 1 hour
Root causes:
- Incompatible versions of docker and docker-compose caused stdout to stop accepting input after a certain number of bytes were received
- The java process writes logs synchronously
- We never test this service in docker compose except in production
- Alerts were misconfigured, so we did not know when instances failed their healthchecks
Resolution:
- Turned on instance termination for failed healthchecks
- Pinned docker-compose and docker versions in ansible deployment scripts
Action items:
|Action item|Status|Ticket|
|Support docker-compose in staging|Open|JIRA-1234|
|Fix and test health check alarms|Closed|JIRA-1235|
Lessons learned:
- What went well:
  - Changing healthcheck failure behavior to instance termination quickly made system stable
- What went wrong:
  - We don't have a staging environment that matches production for this service
  - We never tested alerting, so we relied on users to report failures
- Where we got lucky:
  - Internal user discovered this before public launch
Timeline:
- 20180210-1301: last instance becomes unavailable (Outage Begins)
- 20180210-1520: Graham messages in #software that he cannot access Visual Guide
- 20180210-1523: Gabe tests and confirms he cannot either (Incident declared)
- 20180210-1614: Ben changes healthcheck failure policy to Terminate Instance (Outage Mitigated)
- 20180210-1750: Ben deploys new AMI with docker and docker-compose versions pinned (Issue Resolved)
I'll call out some of the more interesting things here:
- Root causes: This is a plural. There are definitely multiple root causes to any issue.
- Lessons Learned: Here you list the key takeaways. What went well and what went wrong are pretty self-explanatory. Where we got lucky is where you call out ways that this could've been a more severe issue on a different day.
- Timeline: Shown is a very abbreviated version. This should usually be very verbose, in chronological order, with timestamps wherever possible. We include callouts to when an issue started, when it was detected, when it was mitigated and when we considered the incident resolved.
- Supporting information: This is anything that might help understand the issue or document the timeline better. Links to log or metrics queries, thread dumps, stack traces, etc.
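Because the timeline carries timestamps and callouts, the durations people tend to ask about can be derived from it rather than estimated. Here is a minimal sketch that assumes the `YYYYMMDD-HHMM` format used in the sample timeline; the dictionary keys and helper names are made up for this example.

```python
from datetime import datetime

# Timestamps copied from the callouts in the sample timeline.
TIMELINE = {
    "began": "20180210-1301",
    "declared": "20180210-1523",
    "mitigated": "20180210-1614",
    "resolved": "20180210-1750",
}


def parse(stamp: str) -> datetime:
    """Parse a YYYYMMDD-HHMM timestamp as written in the timeline."""
    return datetime.strptime(stamp, "%Y%m%d-%H%M")


def minutes_between(start: str, end: str) -> int:
    """Whole minutes elapsed between two timeline timestamps."""
    return int((parse(end) - parse(start)).total_seconds() // 60)


time_to_declare = minutes_between(TIMELINE["began"], TIMELINE["declared"])
time_to_mitigate = minutes_between(TIMELINE["declared"], TIMELINE["mitigated"])
time_to_resolve = minutes_between(TIMELINE["declared"], TIMELINE["resolved"])
```

With the sample timeline, this gives 142 minutes from the start of the outage to the incident being declared, and 51 minutes from declaration to mitigation. A long gap before declaration is the kind of detail that points back at detection and alerting.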
Running a Postmortem
The big day comes. You have an outage, you fix it. Now it's time to run your postmortem.
First, the person who led the incident response should fill out a draft of the postmortem document. It should not be complete at this time: only enter the date, the known parts of the timeline, trigger, resolution and detection. Leave any of these blank that you do not know the answer to. This should be done as soon as practical, definitely within about a day of the end of the incident.
They should also schedule the meeting. This, again, should be as soon as practical, probably within a few business days. Everyone who directly worked on the incident should be there, and anyone affected by it or curious about it should be welcome.
Someone other than the person who led the response should lead most of the meeting. People too close to the incident may tend to gloss over details and make assumptions about others' knowledge. Of course, the person/people who were involved in the incident will probably end up doing a lot of the talking, either way.
The meeting should follow a rough structure, but it doesn't need to be strict or prescriptive.
Start by giving a background on the systems or processes involved, especially if some people in the room are not deeply familiar with them. Then move on to the summary, detection, and impact, if known. Basically here you're telling folks (briefly) what this incident looked like from the outside.
Next, work through the timeline, referencing supporting documents along the way.
This timeline is the most time-consuming and also most important part of this procedure. Work through what happened from start to finish. Make sure everyone is comfortable asking questions whenever they need clarification. These questions should not be limited to technical details, but should also probe into how people recognized issues, made decisions, and sought help. This is the interesting part where you will really start to learn new things about your system.
Next, discuss the root causes. Note that causes is plural here - there are almost definitely many causes to any given incident.
Some teams find it helpful to use a formalized technique for root cause analysis, such as the 5 Whys, but it's not always necessary. It's very possible to follow a formal process such as 5 Whys and end up at wrong answers.
Discussing root causes is where blame usually starts to enter this picture. Sometimes it's obvious, sometimes it's not. Here are some things to look out for:
Counterfactuals: by this, I mean comparing what actions people took (or didn't take) with what actions they might have taken. This includes things like "you rolled back instead of trying to push out a hotfix" or "you did X, but this Confluence page says to do Y".
Cherry-picking is constructing a story that only exists in hindsight by selecting data that support an existing conclusion. For instance, if we start with the conclusion that Ben doesn't know his way around the Kubernetes cluster, then look for data that points to that conclusion (such as consulting documentation or mistyping commands), we can construct a convincing story that the outage was caused by Ben's lack of knowledge around Kubernetes. Cherry-picking says "I know the cause and I can sift through the data to prove it"
The shopping bag is the opposite - it says "I know the outcome, and there's all this data here - you should've seen it coming!" The problem is we ignore the fact that we're collecting this data in hindsight. This data may have been available, but was it observable? Or, was there conflicting data at the time? Is it even possible to predict the outcome that you now know based on the data you're pointing to?
These cause-forming traps all exist because we now know how the incident ended. If you find yourself using them when looking for causes, it's a warning sign that you're looking to blame and explain, not to learn and understand.
Finally, discuss what was learned along the way. This is where you start looking for remediations, but this discussion should not be limited to action items. Broader topics should also be welcome here. Everyone should feel free to throw in ideas about what they learned and what could be improved. This can also lead into more questions for clarification or to add context.
Note that we've waited until now to discuss recommendations. Throughout the postmortem process, people will probably start to think up action items and recommendations. It's best to have folks write these ideas down, and discuss them all together at the same time. This keeps the meeting on track, reduces overlap, and helps people consider the whole picture before making suggestions.
This discussion should turn out a list of recommendations, not necessarily well-formed action items. These recommendations should be thought over and discussed before considering them tickets to be worked on. This can happen as the final stage of the postmortem, but it's best to give people some time to ponder what they've learned. Ideally, hold a separate meeting just for this, two or more days after the postmortem itself.
When you work through your recommendations, be careful about knee-jerk reactions and over-engineering. Getting a bunch of engineers together in a room and talking about a problem will probably lead to a lot of engineering solutions. Make sure the solutions you come up with are actually helpful and reasonable. Adding more steps to a process might mean people are less likely to follow the prescribed process. Adding more logic to your tooling means it's harder to reason about and adds places for bugs to hide. Also, make sure these action items are SMART: Specific, Measurable, Actionable, Reasonable, and Time-bound.
"Improve performance" is not an action item - "Reduce 99th percentile response times to 50 ms" is.
Remember, though: It's possible to have a very good postmortem and not leave with any action items. The primary goal of a postmortem is learning - action items are ancillary.
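When you do leave with an action item like the 99th-percentile example, "measurable" means someone can check it against data. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions and may differ from what your metrics system reports; the sample response times are invented for illustration.

```python
import math
from typing import Sequence


def percentile_nearest_rank(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank into the sorted samples
    return ordered[max(rank, 1) - 1]


TARGET_MS = 50.0  # the target from the example action item

# Hypothetical response times in milliseconds; real data would come from your metrics.
response_times_ms = [12.0] * 97 + [48.0, 49.0, 180.0]

p99 = percentile_nearest_rank(response_times_ms, 99)
action_item_done = p99 <= TARGET_MS
```

Here the single 180 ms outlier does not break the target, because the 99th percentile of these 100 samples is 49 ms. Whether that is the behavior you want is worth settling when you write the action item.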
After The Postmortem
Well, obviously work through your action items. Once they're done, update the postmortem document to reflect that.
More importantly, though, share your postmortems! Send out an email or a Slack message inviting people to read the postmortem. Put them somewhere where everyone can read them whenever they want. Remember, we did all this to gain knowledge - sharing it means more people benefit from your knowledge.
The Field Guide To Understanding 'Human Error' by Sidney Dekker
Site Reliability Engineering
Etsy's debriefing guide
How Complex Systems Fail
Slides from my PyCascades presentation
PyCascades conference information