Optimize your monitoring for decision-making
Working in infrastructure is about building and maintaining complex systems made of many moving parts. To operate such systems, you need to make sure they are running and healthy, that is to say, performing to the level of quality you have defined, and for this you need monitoring.
For most engineers, “monitoring” means having web dashboards with graphs and numbers. I had to build monitoring for half a dozen large systems, and I’ll be honest, the first dashboards I made were really bad. They were packed with too much data, were using too many colors, and required too much internal knowledge to be useful outside of my team. Through my experience, I have come to the point where I can build decent monitoring dashboards — or at least I’d like to think so — and I want to share one simple yet powerful tool I’ve learned: the health box.
Building effective dashboards is hard
Everybody’s first dashboard looks close to this:
This dashboard is shiny and polished: it has many colors, and a black background to make it look cool. Now I want you to take a couple of minutes before you read further, and take a careful look at the graphs and labels in the dashboard above. Can you tell if this system is healthy or not?
Have you really looked, or did you cheat? If you haven’t looked, please do it. There is something at the top that counts the number of errors, and the count is two. But does that mean the system has an outage? Honestly, I don’t know if this system is healthy or not, and that’s the answer I was expecting from you.
We have no idea what’s going on with this system and it’s not our fault. That dashboard was poorly designed, and whoever made it was probably building a dashboard for the first time, so it’s not their fault either.
Assume the viewer knows nothing
The makers of the dashboard above assumed that the viewer was like them, and had as much knowledge of the system as they did. Instead, they should have assumed that the viewer:
- Does not know what every graph means.
- Does not know what the graphs are supposed to look like when the system becomes unhealthy.
- Does not know the internal components and how they fit together.
- Has never read the source code.
Before you read further, take a couple of minutes to look carefully at this other dashboard below. Can you tell if this system is healthy?
What I expected you to think was:
- The `log_statistics_minutely` service is definitely broken; it looks like it stopped running at least 90 minutes ago.
- There is something wrong going on with the `puppet` service on the host `storage-31`. It is not as serious as `log_statistics_minutely`, but it is worth looking into.
- Except for the two issues mentioned above, everything else seems healthy.
This dashboard is better than the previous one because it does not make you think: it tells you right away what is wrong and how critical each problem is. Now that you are aware of the limitations of graph dashboards, it’s time to formalize the solution. And here I give you: the health box.
Health box: a decision-making shortcut
The second dashboard did not have any graphs, it only had boxes of different colors with text in them. Those boxes are health boxes, and each of them has only one job: to tell you if the service it represents is healthy or not.
A health box has four bits of information:
- Service name: what service is represented.
- Status: OK, WARNING, or CRITICAL.
- Message: a hint as to what is causing the status.
- Opdocs code: this is a unique identifier that represents the service, and also a link to the Opdocs for that service. I will talk more about Opdocs later in this article.
A health box can only have three states, based on the three statuses it can represent: green, orange, or red. Below is an example of a health box for a service called `query_monitor`, and what the health box would look like in different states.
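The four bits of information and three states described above can be captured in a tiny data structure. Below is a minimal Python sketch; the `HealthBox` class, its field names, and the `STATUS_COLORS` mapping are my own illustrative choices, not a prescribed format:

```python
from dataclasses import dataclass

# Mapping from status to the color the dashboard displays (assumption:
# one color per status, as described in the text).
STATUS_COLORS = {"OK": "green", "WARNING": "orange", "CRITICAL": "red"}

@dataclass
class HealthBox:
    service: str      # what service is represented, e.g. "query_monitor"
    status: str       # "OK", "WARNING", or "CRITICAL"
    message: str      # a hint as to what is causing the status
    opdocs_code: str  # unique identifier, doubles as a link to the Opdocs

    @property
    def color(self) -> str:
        # The box color is fully determined by the status.
        return STATUS_COLORS[self.status]
```

For example, `HealthBox("query_monitor", "WARNING", "queue size above 100", "BLG-001").color` would render the box orange.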
There will be times when you have to look at your monitoring at 3 o’clock in the morning, and you’ll have to come up with an answer to the question: is my system on fire, and should I wake up my colleague?
By showing the viewer exactly what they need so they can answer that question within seconds, you remove cognitive load. The viewer no longer needs to interpret and combine data from multiple sources: the decision has already been made for them, and they can move on to troubleshooting and remediation.
Control your health box with thresholds
To control the state of your health boxes, you need to read the metrics that you have collected with your monitoring infrastructure, and compare those metrics against thresholds. For example, if your system uses a queue to distribute work among workers, you want to monitor the number of items on that queue, which we’ll call N. Let’s assume that under normal load, N hovers around 50, i.e. there are on average 50 items on your queue at any given time.
You want to check that the size of your queue is not growing too much, which could be a sign that the workers are dead or are not processing items fast enough. You also want to ensure that there are at least some items on the queue, as an empty queue could mean that there is a problem somewhere upstream in your pipeline. Here is what those conditions could look like:
```
if N > 200 for the last 5 minutes:
    set to CRITICAL
else if N == 0 for the last 30 minutes:
    set to CRITICAL
else if N > 100 for the last 5 minutes:
    set to WARNING
else if N < 10 for the last 10 minutes:
    set to WARNING
else:
    set to OK
```
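Those threshold rules could be evaluated against recent metric samples along these lines. This is a minimal Python sketch under my own assumptions: samples arrive as `(timestamp_seconds, queue_size)` pairs, and the function name `queue_status` is made up for illustration:

```python
import time

def queue_status(samples, now=None):
    """Evaluate the queue-size thresholds from the text.

    samples: list of (timestamp_seconds, queue_size) pairs.
    Returns "OK", "WARNING", or "CRITICAL".
    """
    now = now if now is not None else time.time()

    def all_match(minutes, predicate):
        cutoff = now - minutes * 60
        window = [n for t, n in samples if t >= cutoff]
        # An empty window means we have no data, so the rule cannot fire.
        return bool(window) and all(predicate(n) for n in window)

    if all_match(5, lambda n: n > 200):
        return "CRITICAL"   # queue growing out of control
    if all_match(30, lambda n: n == 0):
        return "CRITICAL"   # nothing coming from upstream for 30 minutes
    if all_match(5, lambda n: n > 100):
        return "WARNING"
    if all_match(10, lambda n: n < 10):
        return "WARNING"
    return "OK"
```

Note that each rule only fires if the condition held for the entire window, which avoids paging on a single noisy sample.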
All your monitoring and alerting systems must tell the same story
You’ll notice that the status of a health box is driven by the same thing as your alerting or paging, that is to say, the set of conditions you have configured for a service in order to send yourself a message when that service becomes unhealthy.
When one of your systems goes down and your on-call engineer gets paged, the first thing this engineer will do is open the health web dashboard that has all the relevant health boxes, to get an instant view of which services are unhealthy. Therefore it is very important for your health boxes to be in sync with your pager alerts at all times, so that there is only one version of the truth that can be trusted as the true state of your system.
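One simple way to guarantee that the dashboard and the pager never disagree is to derive both from a single status function. Here is a hypothetical Python sketch (all names and thresholds are mine, chosen to match the queue example earlier in the article):

```python
def service_status(queue_size: int) -> str:
    # The thresholds live in exactly one place.
    if queue_size > 200:
        return "CRITICAL"
    if queue_size > 100:
        return "WARNING"
    return "OK"

def health_box_color(queue_size: int) -> str:
    # The dashboard renders the shared status as a color.
    colors = {"OK": "green", "WARNING": "orange", "CRITICAL": "red"}
    return colors[service_status(queue_size)]

def should_page(queue_size: int) -> bool:
    # The pager fires on the exact same condition the dashboard shows as red.
    return service_status(queue_size) == "CRITICAL"
```

Because both consumers call `service_status`, it is impossible for the pager to fire while the health box still shows green.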
Opdocs: Operational Documentation
The Opdocs are the Operational Documentation for a service, which describes the remediation steps that one can use to fix outages. Some engineers also call this a Playbook. Another useful concept is the “Opdocs code”, a unique identifier for every service in a system. For example, if your system is named Bulldog, or BLG for short, then every service that forms that system and that you are monitoring should have its own Opdocs code: BLG-001, BLG-002, BLG-003, etc. This is just one possible convention; you could also have a different Opdocs code for each failure mode of the same service. That’s totally up to you.
When a service is unhealthy and sends an alert by text message or email, that alert should include the Opdocs code. The on-call engineer can then use the Opdocs code to search in the internal documentation of your company, and find guidance to fix the outage.
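Embedding the Opdocs code in the alert can be as simple as prefixing the message with it. A hypothetical sketch (the function name and message layout are illustrative, not a standard):

```python
def format_alert(opdocs_code: str, service: str, status: str, message: str) -> str:
    # Lead with the Opdocs code so it survives truncation in SMS or
    # email subject lines, and can be pasted straight into the internal
    # documentation search.
    return f"[{opdocs_code}] {status}: {service} - {message}"
```

For example, `format_alert("BLG-001", "query_monitor", "CRITICAL", "queue empty for 30 minutes")` gives the on-call engineer both the problem and the search key in one line.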
Putting it all together
Don’t get me wrong: I’m not saying that graph dashboards are bad. You definitely need them to monitor the internals of your systems. They should not, however, be your first level of monitoring.
Your top-level monitoring should be optimized for decision-making, so you can quickly figure out if you have an outage that needs a human to act immediately. One way to reach that goal is to build a health status dashboard using health boxes, and keep your graph dashboards as a second level, for troubleshooting.
This setup has proven very effective for me, but it's only a tool and you need to think for yourself whether it would work in the context of your own infrastructure.
Do you have monitoring or dashboard best practices that you want to share? Post a comment below! And if you enjoyed this article, subscribe to the mailing list at the top of this page, and you will receive an update every time a new article is posted.
Looking for a job?
Do you have experience in infrastructure, and are you interested in building and scaling large distributed systems? My employer, Booking.com, is recruiting Software Engineers and Site Reliability Engineers (SREs) in Amsterdam, Netherlands. If you think you have what it takes, send me your CV at emmanuel [at] codecapsule [dot] com.