Incident Analysis
We analyze incidents for two reasons. First, to identify ways we can improve our systems and services. Second, our clients are required by examiners to show that their core vendors are providing an incident analysis for service incidents.
Ideally, every incident should be examined in a blameless postmortem. Some incidents may be so self-evident that they do not require additional analysis.
Postmortem
A postmortem is a team effort, conducted in a meeting with a facilitator, with the intent of working through the sequence of events to identify, as thoroughly as possible, what happened and why. The output of the meeting will be a document that includes a timeline of those events and identifies the root cause. The document may also list follow-up tasks, each with a due date.
Useful links for conducting the postmortem:
- The Howie Guide to Post-Incident Investigations
- Post-Mortem Guide - Incident.io
- The Blameless Postmortem - PagerDuty
- How to run a blameless post-mortem - Atlassian
- Five Whys - Wikipedia
- Fishbone Diagram - Wikipedia
- Root Cause Analysis - Wikipedia
Post-Incident Flow
Incident.io helps us perform all the necessary post-incident steps by using the following flow. The “Post-incident” tab on each incident should present a nice checklist to work through, with buttons to automate things as much as possible.
Review the incident timeline
Before creating a post-mortem, it’s worth spending some time making sure that the incident timeline includes all the important events and hides those events which detract from the story.
If there are events that you think should be added to the timeline, you can either:
- Pin messages in the Slack channel, and they will be automatically added.
- Add a note directly onto the timeline.
Building a timeline of an incident helps us focus on cause and effect. “First, we tried X. That had result Y, which made things worse. So we tried Z, which resulted in a fix to the problem”. For short-lived incidents, a timeline might not reveal novel information; for longer or complex incidents a timeline can really help us separate signal from noise. A lot can happen during an incident, and important details can get lost in the excitement. A calm, dispassionate review of events after the fact can surface important information.
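As a rough illustration of what a cleaned-up timeline amounts to, the sketch below orders events chronologically and drops the ones marked as noise. The `TimelineEvent` type, its fields, and `build_timeline` are hypothetical names invented for this example; they are not part of Incident.io, which handles this in its own UI.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class TimelineEvent:
    """One entry on an incident timeline (hypothetical structure)."""
    at: datetime
    summary: str
    hidden: bool = False  # True for events that detract from the story


def build_timeline(events: list[TimelineEvent]) -> list[str]:
    """Return visible events as 'HH:MM summary' lines in chronological order."""
    visible = [e for e in events if not e.hidden]
    visible.sort(key=lambda e: e.at)
    return [f"{e.at:%H:%M} {e.summary}" for e in visible]


events = [
    TimelineEvent(datetime(2024, 3, 7, 10, 5), "Tried Z; problem fixed"),
    TimelineEvent(datetime(2024, 3, 7, 9, 30), "Tried X; made things worse"),
    TimelineEvent(datetime(2024, 3, 7, 9, 45), "Unrelated chatter", hidden=True),
]
timeline = build_timeline(events)
```

Reading the result top to bottom gives the "first we tried X, then Z" narrative described above, without the hidden entry.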
Export the post-incident review
Once the incident details are up-to-date, create your post-mortem. Incident.io uses a templating system to stub out the post-mortem document, allowing you to prepare most of your post-mortem document directly with its UI. Then you export this to a Google Doc.
The Google Doc can (and should) be edited to add additional information. The Google Doc will be shared with debrief participants.
Schedule a debrief
Schedule the debrief shortly after the incident, whilst ensuring that individuals still have a day or two to process what happened. Check that you have invited the relevant people, e.g. incident participants or stakeholders. Incident.io can schedule this for you and will automatically invite everyone who participated.
A good post-mortem is a team effort. Different people will have different recollections of events, and different ideas about what could have been done. One person (usually the Incident Coordinator) should prepare most of the post-mortem document and then review it with the group during the debrief. The Google Doc should be updated collaboratively to reflect the collective wisdom and experience of the group.
Review, Learn, Improve
The Jack Henry Enterprise Resilience Office uses a “Review, Learn, Improve” model for performing post-incident analyses. They have a template to follow. In order to streamline reporting efforts, we will use their template format when performing post-incident analyses in Incident.io.
The RLI template may feel a little brittle, or ask questions that don’t feel relevant. Do your best to follow the template, but focus on the intent: look for ways that our processes may have failed us, identify opportunities for improvement, and keep a blameless attitude.
Set timestamps
Timestamps are automatically set when an incident moves between statuses. However, timestamps don’t always reflect what happened in reality; updating them will ensure that the post-mortem will be more representative of what happened.
This is usually optional, but if symptoms started long before an incident was declared, it can be helpful to note that in the timeline. These details can be added during the “Review the incident timeline” step, during the debrief, or afterward as necessary.
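To show why a corrected timestamp matters, here is a minimal sketch that measures the gap between when symptoms began and when the incident was declared. The function name `detection_lag` and both parameter names are assumptions made for this example, not Incident.io fields.

```python
from datetime import datetime, timedelta


def detection_lag(symptoms_started: datetime, declared_at: datetime) -> timedelta:
    """Return how long symptoms were present before the incident was declared."""
    if symptoms_started > declared_at:
        raise ValueError("symptoms cannot start after the incident was declared")
    return declared_at - symptoms_started


# Example values only: symptoms at 09:15, incident declared at 11:45.
lag = detection_lag(
    symptoms_started=datetime(2024, 3, 7, 9, 15),
    declared_at=datetime(2024, 3, 7, 11, 45),
)
```

If the automatically set timestamp were used for both values, the lag would be zero and the post-mortem would understate how long customers were affected.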
Mark the post-incident review as complete
Ensure that all sections of the post-mortem have been filled in before marking it as complete.
Share the post-incident review
Once the post-mortem has been completed, share the document alongside any key learnings from the incident. Sharing posts a message to the incident channel with a link to the Google Doc. You may also email the Google Doc (or a link) to anyone else who might be interested: support, management, or teammates.
NOTE: ServiceNow is our record-of-truth for all incidents. You should also export the Google Doc to PDF and attach it to the ServiceNow ticket.
Review follow-ups
During the debrief, there may have been discussions about what work should be done in the future to help respond to incidents better, or to stop them from happening again. Make sure the follow-ups recorded on the incident accurately reflect what was discussed.
Time
All time frames below are in business days.
Within 48 hours of an incident, a postmortem is to be provided to the operations team to help them communicate relevant details about the incident to customers.
Within one week of an incident, the team that owns the incident will complete the post-incident flow.
- Customer communication should be transparent, but high level. Our goal is to be informative without oversharing: do not reveal insider information or details that put our organization at risk.
- Operations teams will publish incident analysis details to Jack Henry’s Client Portal within 5 days of an incident. Our clients’ examiners require this within 10 days.
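Because these time frames are counted in business days, the due dates can be worked out with a small helper. The sketch below is illustrative only and rests on several assumptions: "48 hours" is treated as 2 business days, "one week" as 5 business days, only weekends are skipped (holidays are not handled), and the names `add_business_days`, `DEADLINES_IN_BUSINESS_DAYS`, and `deadlines` are hypothetical.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start` (Monday-Friday only)."""
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


# Assumed mapping of the time frames above to business-day counts.
DEADLINES_IN_BUSINESS_DAYS = {
    "postmortem to operations": 2,      # "48 hours", assumed to mean 2 business days
    "post-incident flow complete": 5,   # "one week", assumed to mean 5 business days
    "publish to Client Portal": 5,
    "examiner requirement": 10,
}


def deadlines(incident_date: date) -> dict[str, date]:
    """Return the due date for each milestone, given the incident date."""
    return {
        name: add_business_days(incident_date, n)
        for name, n in DEADLINES_IN_BUSINESS_DAYS.items()
    }


# Example: an incident on Thursday, 7 March 2024.
due = deadlines(date(2024, 3, 7))
```

For an incident on a Thursday, the 2-business-day deadline lands on the following Monday because the weekend is skipped.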