SLO Philosophy
Expectations:
- Each service should have an SLO in Honeycomb with an alert.
- Each public-facing endpoint should have an SLO in Honeycomb with an alert that notifies us when we’re on track to burn through our error budget.
Why aren’t per-URL metrics over our ingress enough?
This metric checks whether the number of 500s for each URL is at or above a single fixed number. We have set the trigger quite high in order to accommodate the noisy parts of our system. That doesn’t let us get granular with each endpoint, which may have a different threshold of normal. We could have a low, consistent failure rate, e.g. 10 per second (600 per minute), which never reaches 1000 errors per minute and so stays under the rate that would trip the trigger, even if it’s constant. These numbers also depend on the rate of incoming requests. In addition, all this gives us is a trigger, the equivalent of an alert in Prometheus, rather than an SLO, which is an understanding of how often we’re not hitting what we see as “up”.
A spike in 500s may fire the trigger without burning down the SLO’s error budget fast enough to reach the point where we need to be aware of it. Once the error budget is burning down fast enough, the SLO will alert, because SLO alerting is built around budget burndown so that we stay above our SLO.
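The difference between a fixed trigger and budget burndown can be sketched with the numbers from the paragraph above. This is a minimal illustration, not how Honeycomb computes anything: the 1000 errors per minute threshold, the 100 requests per second of traffic, and the 99.9% target are all assumed values, and `trigger_fires` and `burn_rate` are hypothetical helper names.

```python
# Assumption: the ingress trigger fires at 1000 errors per minute.
TRIGGER_THRESHOLD_PER_MIN = 1000


def trigger_fires(errors_per_min: float) -> bool:
    """Fixed-threshold check: fires only when the raw error count is high enough."""
    return errors_per_min >= TRIGGER_THRESHOLD_PER_MIN


def burn_rate(errors: float, requests: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows.

    A value of 1.0 means the budget is being spent exactly as fast as the
    SLO permits; anything above 1.0 exhausts the budget early.
    """
    if requests <= 0:
        raise ValueError("requests must be positive")
    allowed_error_ratio = 1.0 - slo_target
    return (errors / requests) / allowed_error_ratio


# The constant low failure rate from the text: 10 errors per second.
errors_per_min = 10 * 60
# Hypothetical traffic and target, chosen only for illustration.
requests_per_min = 100 * 60
slo_target = 0.999

fired = trigger_fires(errors_per_min)
rate = burn_rate(errors_per_min, requests_per_min, slo_target)
print(f"trigger fired: {fired}, burn rate: {rate:.1f}x")
```

Under these assumptions the trigger stays silent, while the burn rate is roughly 100 times what the SLO allows, which is exactly the situation a burndown alert is meant to catch.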
Ingress SLOs
- SLOs are endpoint-centric from the ingress. The ingress may be node-api-gatway or some other public-facing API.
- SLOs should focus on the span in the ingress and define what success means for that endpoint.
- For example, a request should take less than 100 ms and return a 200 to be successful.
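The success definition in the example above can be written down as a small predicate. This is a sketch only: the span is modeled as a plain dictionary, and the field names `http.status_code` and `duration_ms` are assumptions about how the ingress span is instrumented, not confirmed column names.

```python
from typing import Optional

# Threshold taken from the example above: under 100 ms and a 200 response.
MAX_DURATION_MS = 100.0


def ingress_sli(span: dict) -> Optional[bool]:
    """Return True/False for a qualifying span, or None if it should not count.

    Field names are hypothetical; adjust them to match the real ingress span.
    """
    status = span.get("http.status_code")
    duration = span.get("duration_ms")
    if status is None or duration is None:
        # Spans without these fields are excluded from the SLO entirely.
        return None
    return status == 200 and duration < MAX_DURATION_MS


fast_ok = ingress_sli({"http.status_code": 200, "duration_ms": 42.0})
slow_ok = ingress_sli({"http.status_code": 200, "duration_ms": 250.0})
fast_error = ingress_sli({"http.status_code": 500, "duration_ms": 12.0})
not_http = ingress_sli({"name": "background-job"})
```

Returning `None` for spans that do not qualify keeps them out of both the numerator and the denominator, so unrelated spans neither help nor hurt the SLO.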
Service-Centric SLOs
- SLOs are usually more endpoint-centric, i.e. single endpoints or groups of them, such as all GETs.
- For external bulkheads (i.e. Jxchange, Symxchange, BSL, Zelle), SLOs usually cover the health of the entire service, possibly excluding some endpoints. This is because our bulkheads are multi-tenant, and a single institution may sometimes be much slower or down. SLOs currently do not allow us to easily make a per-institution SLO.
- Example:
  - Zelle-Service (Bulkhead to PayCenter)
    - Segregate internal paths from external ones that we don’t control.
    - 2 SLOs
      - PayCenter Bulkhead SLO (all endpoints excluding /accounts)
        - Checks for 200s only. Duration is an unknown.
      - Accounts (internal only) SLO
        - Non-500s and less than N ms
  - Banno-Accounts (Access to the Accounts layer)
    - SLO per endpoint / method, which allows for granular changes that take into account that things like inserts into the database are much slower than a single read.
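The Zelle-Service split above can be sketched as two predicates over the same spans, one per SLO. Everything here is illustrative: the span field names (`path`, `status`, `duration_ms`) are assumptions, `ACCOUNTS_MAX_MS` is a hypothetical stand-in for the unspecified N, and "non-500s" is read as "any status below 500".

```python
from typing import Optional

ACCOUNTS_PREFIX = "/accounts"
# Hypothetical stand-in for "N ms"; the real value is not specified.
ACCOUNTS_MAX_MS = 500.0


def is_accounts_path(path: str) -> bool:
    """True for /accounts itself and anything nested under it."""
    return path == ACCOUNTS_PREFIX or path.startswith(ACCOUNTS_PREFIX + "/")


def paycenter_bulkhead_sli(span: dict) -> Optional[bool]:
    """External paths only: success is a 200. Duration is ignored."""
    if is_accounts_path(span["path"]):
        return None  # internal path, covered by the Accounts SLO instead
    return span["status"] == 200


def accounts_sli(span: dict) -> Optional[bool]:
    """Internal /accounts paths only: not a 5xx and faster than the limit."""
    if not is_accounts_path(span["path"]):
        return None  # external path, covered by the bulkhead SLO instead
    return span["status"] < 500 and span["duration_ms"] < ACCOUNTS_MAX_MS


external = {"path": "/payments/send", "status": 200, "duration_ms": 4000.0}
internal = {"path": "/accounts/123", "status": 404, "duration_ms": 30.0}

external_result = (paycenter_bulkhead_sli(external), accounts_sli(external))
internal_result = (paycenter_bulkhead_sli(internal), accounts_sli(internal))
```

Each span counts toward exactly one of the two SLOs, so a slow external institution cannot drag down the internal Accounts SLO, and vice versa.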
Wise Wisdom
- When making SLOs, think about where your endpoint flows in the system, what services it hits, and what SLOs they may have. You are only as good as what you depend on.
- For example, an endpoint that only talks to a database will be fast compared to an endpoint that calls down to another service that then calls to a bulkhead.
- Remember to turn alerting on in the SLO.
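The point about being only as good as what you depend on can be made concrete with a rough calculation. The sketch below assumes that every dependency in the call chain must succeed and that failures are independent, so the best achievable success ratio is the product of the individual ones. The targets are made-up numbers, and `serial_availability` is a hypothetical helper.

```python
from math import prod
from typing import Sequence


def serial_availability(targets: Sequence[float]) -> float:
    """Upper bound on success ratio when every dependency in the chain must succeed.

    Assumes independent failures, which is a simplification.
    """
    return prod(targets)


# Hypothetical targets, for illustration only.
database_only = serial_availability([0.9995])
via_bulkhead = serial_availability([0.9995, 0.999, 0.99])

print(f"database only: {database_only:.4%}")
print(f"service -> service -> bulkhead: {via_bulkhead:.4%}")
```

An endpoint that calls through to a bulkhead cannot realistically promise a tighter SLO than the chain beneath it, so its target should be set with those downstream SLOs in mind.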