Building and maintaining infrastructure services requires striving for quality and taking ownership. But it’s not always easy to know what we are missing, or what assumptions we are making without realizing it. To help myself and my colleagues reason about whether we are addressing the important topics, I came up with something I call the Service Ownership Checklist. It’s still in a draft format, but I’ve already refined it thanks to the help and feedback of many of my peers, and I’m now releasing it on my blog in the hope that it can help other infrastructure engineers as well.
To use this document, share it with your colleagues and teams, and have them ask each other some of the questions to see how they’re doing on these topics and to challenge their assumptions. You will hopefully uncover unknown issues and create enough urgency to go and fix them.
This blog post is organized in two parts. The first part is the Service Ownership Checklist, a set of open-ended questions that can be used in brainstorming and sharing sessions. The second part is a condensed version in the form of a questionnaire, the Service Ownership Questionnaire, which lays out the different levels of quality for each reliability topic.
The SRE organization at Google runs Launch Reviews when releasing new services, and for this they use a Launch Review Checklist. I recommend reading Chapter 27 of the Google SRE book, which covers the subject.
Finally, if you think of any other topics, or of a better way to group the questions into categories, please post a comment below! And if you enjoyed this article, subscribe to the mailing list at the top of this page, and you will receive an update every time a new article is posted.
Service Ownership Checklist
1. Prioritization, Toil & Automation
- Do you have a backlog of tasks for your service?
- How do you decide which tasks are high priority and make sure they are getting done? How do you decide which tasks are low priority, and should be delayed or not be done?
- What is the ratio of toil to development work for each team member per week or month?
- What toil (repetitive) tasks need to be automated?
2. Users, Interfaces & Autonomy
- Who are your users: which teams or services are using your service?
- What do your users need? How frequently do you talk to your users?
- How are your users accessing your service?
- How much autonomy do your users have in using your service? How much operational work do you need to do for a new user to start and continue using your service?
- Do you have SLIs/SLOs that define what quality means for the users of your service, and what minimum level of quality you guarantee?
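To make the SLI/SLO question concrete, here is a minimal sketch, in Python, of what implementing an availability SLI and checking it against an SLO target could look like. The function names and all the numbers are illustrative, not part of any particular monitoring stack.

```python
# Hypothetical example: an availability SLI computed from request counts,
# checked against an SLO target. Numbers are illustrative.

def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully over the measurement window."""
    if total_requests == 0:
        return 1.0  # no traffic: treat the window as fully available
    return successful_requests / total_requests

def slo_met(sli: float, slo_target: float = 0.999) -> bool:
    """True if the measured SLI meets or exceeds the SLO target."""
    return sli >= slo_target

sli = availability_sli(successful_requests=999_532, total_requests=1_000_000)
print(f"SLI: {sli:.4%}, SLO met: {slo_met(sli)}")
```

The point is not the arithmetic, which is trivial, but having an agreed-upon definition of "successful request" and a target that users can rely on.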
3. Logging, Monitoring & Alerting
- Logging, monitoring, and alerting are related but distinct things.
- How do you keep logs of what is going on in your service?
- How do you monitor the health of your service? What metrics are you looking at?
- How do you know when something is broken and needs immediate human intervention?
- How is alerting duty (holding the pager) handled in your service/team?
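As one illustration of the "when does a human get paged" question, here is a small Python sketch (not tied to any particular monitoring tool) of an alert that only fires when the error rate stays above a threshold for several consecutive checks, so a single noisy data point doesn’t page anyone. The class name and thresholds are assumptions for the example.

```python
# Illustrative alerting sketch: fire only after the error rate exceeds a
# threshold for `consecutive` checks in a row, to avoid flapping alerts.

from collections import deque

class ErrorRateAlert:
    def __init__(self, threshold: float = 0.05, consecutive: int = 3):
        self.threshold = threshold
        self.window = deque(maxlen=consecutive)

    def observe(self, error_rate: float) -> bool:
        """Record one measurement; return True if the alert should fire."""
        self.window.append(error_rate)
        return (len(self.window) == self.window.maxlen
                and all(r > self.threshold for r in self.window))

alert = ErrorRateAlert()
for rate in (0.01, 0.09, 0.08, 0.07):
    print(rate, alert.observe(rate))  # fires only on the last observation
```

Real systems express this kind of rule in their monitoring configuration (e.g. "for 5 minutes" clauses); the sketch just shows the debouncing idea.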
4. Deployment, Configuration & Dependencies
- What other services, packages or libraries does your service need? Are any of these dependencies unstable and could impact you?
- How do you do OS upgrades and external dependency upgrades?
- Which repositories hold code for your service?
- Do you have development and staging environments for your service?
- How do you deploy new versions of your service? How do you roll back?
- Where do you store your configs? How do you do configuration changes?
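For the deploy and rollback questions, a useful mental model is a history of released versions where rolling back means returning to the last known-good one. The sketch below, in Python, is hypothetical; `DeployHistory` and its methods are illustrative names, not a real deployment API.

```python
# Hypothetical deploy/rollback model: keep a history of released versions
# so a bad release can be rolled back to the previous known-good one.

class DeployHistory:
    def __init__(self):
        self._versions: list[str] = []

    def deploy(self, version: str) -> str:
        self._versions.append(version)
        return version

    def rollback(self) -> str:
        """Drop the current version and return to the previous one."""
        if len(self._versions) < 2:
            raise RuntimeError("no previous version to roll back to")
        self._versions.pop()
        return self._versions[-1]

    @property
    def current(self) -> str:
        return self._versions[-1]

history = DeployHistory()
history.deploy("v1.4.0")
history.deploy("v1.5.0")     # bad release
print(history.rollback())    # back to v1.4.0
```

Whatever tooling you use, the checklist question is really asking whether this history exists somewhere and whether the rollback path has been exercised.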
5. Capacity Planning & Provisioning
- What is the current load on your service?
- How much extra capacity do you think you’ll need 6 months from now?
- What are the bottlenecks of your infrastructure under different workloads? Is it CPU, memory, disk space, network, etc.?
- How do you provision new servers to scale your service up?
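A back-of-the-envelope answer to the 6-months question can be sketched with a simple linear projection. All numbers below are made up for illustration; real capacity planning should use measured growth trends and load tests.

```python
# Back-of-the-envelope capacity projection, assuming roughly linear growth.
# All figures are illustrative.

import math

def projected_load(current_rps: float, monthly_growth_rps: float,
                   months: int) -> float:
    """Naive linear projection of requests per second."""
    return current_rps + monthly_growth_rps * months

def servers_needed(load_rps: float, capacity_per_server_rps: float,
                   headroom: float = 0.3) -> int:
    """Servers required for the load, keeping `headroom` spare capacity."""
    return math.ceil(load_rps / (capacity_per_server_rps * (1 - headroom)))

load = projected_load(current_rps=2000, monthly_growth_rps=250, months=6)
print(load, servers_needed(load, capacity_per_server_rps=500))
```

The headroom factor matters: a fleet running at 100% of theoretical capacity has no room for failovers or traffic spikes.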
6. Disaster Recovery & Traffic Management
- What happens when one of your servers crashes?
- What happens when one of the data centers becomes unreachable?
- What is the load balancing strategy for your service? How do you redirect traffic from one data center to another?
- Do you have sufficient redundancy?
- Can your service fail due to a Thundering Herd or a Cascading Failure?
- Any other ways that your service could fail?
- Does your system require backups, and if so, do you have a backup strategy?
- Are you regularly testing your backups and keeping track of how long it takes to recover your data?
- What is the current bus factor for the service, i.e. how many people have knowledge of the different parts of the service?
- Do you have entry-level tasks on the backlog that new team members can pick to get onboarded?
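The backup-testing questions above can be made concrete with a small sketch: restore into a scratch location, verify the data with a checksum, and record how long recovery took. The helper names and the stand-in `restore_fn` below are hypothetical; a real test would restore from your actual backup system.

```python
# Illustrative backup restore check: verify restored data via a checksum
# and measure recovery time. `restore_fn` is a stand-in for a real restore.

import hashlib
import time

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def verify_restore(original: bytes, restore_fn) -> tuple[bool, float]:
    """Run `restore_fn`, compare checksums, return (ok, seconds_taken)."""
    start = time.monotonic()
    restored = restore_fn()
    elapsed = time.monotonic() - start
    return checksum(restored) == checksum(original), elapsed

data = b"production database snapshot"
ok, seconds = verify_restore(data, restore_fn=lambda: data)
print(ok, f"{seconds:.3f}s")
```

Tracking the measured recovery time over successive tests is what turns "we have backups" into a recovery time you can actually promise.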
Service Ownership Questionnaire
1. Owners (team or person)
2. Email address of owner(s)
3. Filled at (date of review)
4. Filled by (email address of person who reviewed)
5. Monitoring & Alerting
1. This service has insufficient monitoring or alerting.
2. Graphs and dashboards are set to monitor the internal components.
3. Alerts are set up and going off when something breaks.
4. Alerts are set up and going off when something breaks; they also go to the primary oncall when needed.
6. SLIs & SLOs
If you don’t know what SLIs/SLOs are, read this.
1. SLIs for this service have not been defined yet.
2. SLIs have been defined for this service.
3. SLIs have been implemented for this service.
4. SLIs and SLOs have been implemented, and can be viewed in a dashboard.
5. SLOs are acted upon when breached.
7. Capacity planning
1. Last capacity tests were done 6 months ago or more (or never).
2. Last capacity tests were done between 3 to 6 months ago.
3. Basic capacity tests were done in the last 3 months (only in number of requests; no deep dive).
4. Advanced capacity tests were run in the last 3 months and bottlenecks were identified.
8. Redundancy
1. Runs on only 1 server; no redundancy is in place.
2. Has enough redundant servers/processes, but failovers to other servers were never tested.
3. Has enough redundant servers/processes, and failovers are done by hand.
4. Has enough redundant servers/processes, and failovers are automated.
9. Data centers where the service runs
[list of data centers where the service runs]
10. Data centers (DC) and failovers
1. Only needs one DC.
2. Runs in only one DC but needs to be in more DCs.
3. Runs in multiple DCs and failover tests were done 6 months ago or more (or never).
4. Runs in multiple DCs and failover tests were done 3 to 6 months ago.
5. Runs in multiple DCs and failover tests were done in the last 3 months.
11. Data backups
1. Doesn’t require data backups.
2. Requires data backups but nothing is in place yet.
3. Has data backups and restoring tests were done 3 months ago or more (or never).
4. Has data backups and restoring tests were done in the last 3 months.
12. Bus factor (How many people know about this service/system?)
1. 1 person
2. 2 people
3. 3+ people
13. Failure criticality (how bad is it if we lose this system?)
1. Minimal (ex: impacts internal tooling or slows down employees).
2. Medium (ex: impacts partners or customer service).
3. High (ex: money transactions are failing).
4. Critical (ex: the entire website/business is down).
14. Additional comment about the potential business impact of failure