Dreamwork Firefighter

On-call

The members of Dreamwork rotate through an on-call schedule. Everyone is on the calendar in some role: primary, secondary, tertiary, and so on. On-call shifts last one week and change over on Tuesdays. Primary and secondary firefighters are generally kept off mission-critical or time-sensitive feature work so they can focus on customer issues and maintenance tasks instead.
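As a rough illustration of the weekly rotation, the sketch below works out who holds which role on a given day. It assumes each person moves up one slot per week (secondary becomes primary, and so on), which this page does not state outright. The member names, the `epoch` argument, and the function names are hypothetical, and PagerDuty remains the source of truth for the real schedule.

```python
from datetime import date, timedelta

# Role names from this page; everyone past secondary is treated as
# "tertiary and beyond" and labeled "tertiary" here.
ROLES = ["primary", "secondary", "tertiary"]


def shift_start(day: date) -> date:
    """Return the Tuesday on or before `day` (shifts change over on Tuesdays)."""
    # date.weekday() numbers Monday as 0, so Tuesday is 1.
    return day - timedelta(days=(day.weekday() - 1) % 7)


def roles_for(day: date, members: list[str], epoch: date) -> dict[str, str]:
    """Map each member to a role for the shift that contains `day`.

    `members` is the rotation order and `epoch` is any day in a week when
    members[0] was primary. Both are hypothetical inputs for this sketch.
    """
    weeks = (shift_start(day) - shift_start(epoch)).days // 7
    roles = {}
    for position, member in enumerate(members):
        # Assumption: each person moves up one slot per week.
        rank = (position - weeks) % len(members)
        roles[member] = ROLES[min(rank, len(ROLES) - 1)]
    return roles


if __name__ == "__main__":
    team = ["ana", "ben", "cy", "dee"]  # hypothetical names
    print(roles_for(date(2024, 1, 9), team, epoch=date(2024, 1, 2)))
```

Keying the calculation on the Tuesday that starts each shift means any day in the same shift week gives the same answer, including the Monday before a handoff.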

Incidents

An incident is downtime of a service, degraded service, or anything else that negatively impacts our services' uptime or customer experience, whether it affects a single customer or many. If you notice a problem, post a message in #org-reliability-ff and start an incident. It’s better to be overly cautious. Here is the page that describes how to start an incident.

Starting a rotation

On the day before your rotation (currently Monday), find a clean stopping point for your current work and/or transition the rest to another team member. On Tuesday, the current primary on-call person and the next in line meet to hand off customer issues in order to preserve context. This typically happens after the daily standup.

Responsibilities

Primary

  • Provide primary on-call support through PagerDuty
  • Triage customer issues
  • Primarily work on active customer issues
  • When not doing the above, pick a ticket from our project that is not related to scheduled project work. Focus on issues that will improve your on-call situation or that are easy to drop if a customer issue or a page comes up. It is also fine to pick up work that you look forward to doing, to balance out the on-call stress.

Secondary

  • Provide backup on-call support through PagerDuty
  • Provide backup if primary needs help
  • Primarily work on maintenance and documentation issues
  • Respond to Slack questions and requests for help (pings for @dreamwork or messages in #org-dreamwork) so other team members can stay heads down
    • This also helps spread knowledge around the team, since we take turns answering questions and learning about areas of our domain that we may not be familiar with.
  • Moderate team meetings (standup, issue sweep, retro) during the week.
  • Try to stay current on the larger issues as well as help review PRs that the primary creates.
  • May still work on project work, but must switch if the primary needs help

Tertiary and beyond

  • Primarily work on project/feature work or technical debt, as decided at the weekly issue sweep
  • Provide backup on-call support through PagerDuty

Incident Analysis

  • General incident analysis can be found here: https://github.com/Banno/incident-analysis
  • We will link our team's incident analyses here as we get them started.
  • An analysis is required for:
    • anything that needs a roll back/forward (hotfix/quick fix)
      • Exception: Check and Documentation services need to be deployed to production for testing. If they encounter an issue and are rolled back during that testing phase, they do not require an incident analysis.
    • incidents we start or get pulled into (that we do work for)
    • incidents where Support updates our external status page
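
The criteria above can be read as a small decision rule. The following is a minimal sketch of one reading of that rule, not an official tool: the `Event` fields, the lowercase service identifiers, and the function name are all hypothetical. It assumes the Check/Documentation exception applies only to the rollback criterion.

```python
from dataclasses import dataclass

# Assumption: identifiers for the services that this page says must be
# deployed to production for testing.
TESTED_IN_PRODUCTION = {"check", "documentation"}


@dataclass
class Event:
    """Hypothetical record of something that happened to a service."""
    service: str
    rolled_back_or_forward: bool = False     # hotfix/quick fix was needed
    during_production_testing: bool = False  # happened in the testing phase
    team_did_incident_work: bool = False     # we started or were pulled in
    status_page_updated: bool = False        # Support updated the status page


def needs_incident_analysis(event: Event) -> bool:
    """Return True when the criteria on this page call for an analysis."""
    if event.team_did_incident_work or event.status_page_updated:
        return True
    if event.rolled_back_or_forward:
        # Exception: rollbacks of Check/Documentation during production testing.
        exempt = (
            event.service in TESTED_IN_PRODUCTION
            and event.during_production_testing
        )
        return not exempt
    return False
```

In this reading, a Check rollback during production testing still needs an analysis if it also turned into an incident the team worked on or one that reached the external status page.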