Pupper Firefighter

Pupper escalation policy can be found here

Incidents

An incident is downtime of a service, degraded service, or anything else that negatively impacts our services’ uptime or customer experience, whether for a single customer or many. If you notice a problem, post a message in #org-reliability-ff and start an incident. It’s better to be overly cautious. Here is the page that describes how to start an incident.

Issue flow

  • The open customer issues with our pupper component can be found here
  • Quick, consistent, and clear communication with support is required
  • In many instances we will need to work well with other teams and analysts to debug and/or resolve an issue
  • Team ownership can be found here
  • Analysts can be reached in the #org-analyst Slack room
  • Don’t hesitate to pull subject matter experts or leadership into an issue as needed

Runbook

All symxchange requests failing for a specific institution

  1. *Communicate with support-ff about the issue.
  2. Verify the behavior in Kibana by looking at the institution’s symxchange interaction logs.
  3. Capture the root request error (e.g., connection timeout, connection refused, other).
  4. Determine whether the issue is on Banno’s side or whether a jsource case needs to be opened.

*If support does not respond, page them with the Slack command /pd-support <message>
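
Step 3 above asks for the root request error to be sorted into one of three buckets. A minimal sketch of that sorting, assuming the error messages have already been copied out of Kibana as plain strings, could look like the following. The substrings in ERROR_PATTERNS and the sample messages are illustrative assumptions, not the actual symxchange log format, and should be adjusted to match what the logs really contain.

```python
from collections import Counter

# Hypothetical substrings; adjust to the messages actually seen in Kibana.
ERROR_PATTERNS = {
    "connection timeout": ("timed out", "timeout"),
    "connection refused": ("connection refused",),
}


def classify_error(message: str) -> str:
    """Return the root error category for one log message."""
    lowered = message.lower()
    for category, needles in ERROR_PATTERNS.items():
        if any(needle in lowered for needle in needles):
            return category
    return "other"


def summarize(messages):
    """Count how many messages fall into each category."""
    return Counter(classify_error(m) for m in messages)


if __name__ == "__main__":
    # Made-up sample messages, for illustration only.
    sample = [
        "java.net.SocketTimeoutException: connect timed out",
        "java.net.ConnectException: Connection refused",
        "HTTP 500 from upstream",
    ]
    print(summarize(sample))
```

A summary dominated by one category gives a concrete starting point for step 4, but it does not by itself say whether the cause is on Banno’s side.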

Restarting Symxchange servers

Restarting the Symxchange servers is done through Jenkins. Currently the services are running in Marathon.

  1. Use this link to open Jenkins
  2. Select “Build with Parameters” in the left column
  3. Enter “symxchange-http-server”
  4. Press the “Build” button
  5. Repeat for “symxchange-rpc-server”

Once the services move to Kubernetes, you will use this link for Jenkins
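
The restart steps above are manual. Parameterized Jenkins jobs can generally also be triggered through the buildWithParameters endpoint with an authenticated POST. The sketch below only builds the trigger URLs for the two services and sends nothing. The job URL and the parameter name SERVICE are hypothetical placeholders, since neither is given on this page.

```python
from urllib.parse import urlencode

# Services named in the restart steps above.
SERVICES = ["symxchange-http-server", "symxchange-rpc-server"]


def build_trigger_url(job_url: str, param_name: str, service: str) -> str:
    """Build a Jenkins buildWithParameters URL for one service."""
    query = urlencode({param_name: service})
    return f"{job_url.rstrip('/')}/buildWithParameters?{query}"


if __name__ == "__main__":
    # Placeholder values: the real job URL and parameter name are assumptions.
    job_url = "https://jenkins.example.com/job/restart-service"
    for service in SERVICES:
        print(build_trigger_url(job_url, "SERVICE", service))
```

Sending the request would additionally need credentials (for example a Jenkins user and API token), which is why the sketch stops at building the URL.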


Tooling

Most tooling links can be found here: https://docs.banno.com/infrastructure/urls/

Responsibilities

Primary - only doing FF work (no feature/project work)

  • Acknowledge and resolve PagerDuty alerts
  • Page other teams when necessary (via Slack /pd-* commands or the PagerDuty app)
  • Check for incoming customer issues at least once a day
  • Triage Jira issues: https://banno-jha.atlassian.net/issues/?filter=10843
  • Work on any customer issue related development/fixes, as needed
  • Work on any incident related development/fixes, as needed
  • Address Slack messages to the @pupper-ff group
  • Address Jira messages to @pupper & @pupper-ff
  • Monitor #org-pupper room for triage
  • Monitor #war-room-go-live for triage
  • Monitor #auto-pupper-alerts for alerts
  • Monitor #prod-people-reports for reports and mobile-admin triage
  • Monitor graphs periodically and after deploys
  • Work on non-feature related needs (such as logging, alerts, metrics, outside requests - infra/security, etc), as time allows

Secondary

  • Acts as a safety net
  • Primary reaches out to secondary when help is needed (whether primary is underwater or will be unavailable for a period of time)
  • Update PagerDuty as needed when swaps occur

Requirements

  • PagerDuty account
  • VPN access
  • Membership in the Banno organization and the Pupper team on GitHub
  • General knowledge of how to use Kibana, Marathon, Grafana, Mesos, Prometheus, Data Services Reporting

Incident Analysis

  • General incident analysis can be found here: https://github.com/Banno/incident-analysis
  • We will list and link our team’s incident analyses here as we get them started.
  • We write an incident analysis for:
    • anything that needs a roll back/forward (hotfix/quick fix)
    • incidents we start or get pulled into (that we do work for)
    • incidents where Support updates our external status page