Skip to content

Tag: infrastructure

Service Ownership Checklist

Building and maintaining infrastructure services requires to strive for quality and ownership. But it’s not always easy to know what we are missing, and what assumption we are making that we don’t know of. To help myself and my colleagues reason about whether we are addressing the important topics, I came up with something I call the Service Ownership Checklist. It’s still in a draft format, but I’ve already refined it thanks to the help and feedback of many of my peers around me, and I’m now releasing it on my blog hoping that it can help other infrastructure engineers as well.

The way to use this document is to share it with your colleagues and teams, and have them ask each other some of the questions to see how they’re doing on all these topics and challenge their assumptions. You will hopefully uncover unknown issues, and create enough urgency to go and fix them.

This blog post is organized in two parts. The first part is the Service Ownership Checklist, a set of loose questions that can be used in brainstorming and sharing sessions, and the second part is a condensed version in the form of a questionnaire, the Service Ownership Questionnaire, which shows the different levels of quality for each reliability topic.

The SRE organization at Google is running Launch Reviews when they release new services, and for this they use a Launch Review Checklist. I recommend that you read Chapter 27 of the Google SRE book, which covers the subject.

Finally, if you think of any other topics, or of a better way to group the questions into categories, please post a comment below! And if you enjoyed this article, subscribe to the mailing list at the top of this page, and you will receive an update every time a new article is posted.

How to get started with infrastructure and distributed systems

Most of us developers have had experience with web or native applications that run on a single computer, but things are a lot different when you need to build a distributed system to synchronize dozens, sometimes hundreds of computers to work together. I recently received an email from someone asking…