A classic scenario is that you start out by checking whether your hosts are replying to ICMP ping requests. You slowly become more curious and start checking hardware temperatures, disk health, network latency, bandwidth, etc. You gather all manner of interesting statistics and numbers about your environment.
Triggers must be set! If a host doesn’t reply to an ICMP ping request, something is obviously wrong. If a web check fails, there might be underlying issues with your web service or database back-end. Maybe a certain process should always be up and running.
A year passes. You look at your Triggers page and count: 23 “High”, 8 “Average”, 38 “Warning” and 58 “Information”. Phew, no “Disaster” triggers! Everything is running smoothly! Except you’ve botched your monitoring solution and made it next to impossible to distinguish between what warrants your attention and what doesn’t.
Alerts must be actionable
ac·tion·a·ble – Relating to or being information that allows a decision to be made or action to be taken.
I’ve defined the word “actionable” above because, first of all, it’s important to understand precisely what it means. In short: if you can’t do anything about it, why are you being alerted?
You can do something about:
- A hard drive throwing read errors
- A host not replying to ICMP ping requests
- The temperature of a CPU
- Disk space running low
- Excessive swap usage
You cannot do anything about:
- A switch port having changed status
- Google’s public DNS service not replying
- An access point having lower than usual throughput
- A single data packet being lost
- A service replying slower than usual once or twice
A trigger list should be a to-do list
A hard drive is about to die; replace it as soon as possible. Free space is running low on a server; remove irrelevant files or upgrade your storage. A service is consistently replying slower than usual; start looking into why it’s happening.
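The difference between a service replying slower than usual once or twice and one that is consistently slow can be sketched in code. The following is a minimal illustration in Python, not the syntax of any particular monitoring tool; the class name `SustainedThresholdTrigger`, the 500 ms threshold, the window of 3 samples and the sample data are all assumptions made up for the example.

```python
from collections import deque


class SustainedThresholdTrigger:
    """Hypothetical trigger: PROBLEM only when every one of the
    last `window` samples exceeds `threshold`."""

    def __init__(self, threshold, window):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.threshold = threshold
        # deque with maxlen drops the oldest sample automatically
        self.samples = deque(maxlen=window)

    def update(self, value):
        """Record a sample and return "PROBLEM" or "OK"."""
        self.samples.append(value)
        window_full = len(self.samples) == self.samples.maxlen
        if window_full and all(s > self.threshold for s in self.samples):
            return "PROBLEM"
        return "OK"


# Assumed values: 500 ms threshold, 3 consecutive slow samples required.
trigger = SustainedThresholdTrigger(threshold=500, window=3)
response_times_ms = [120, 900, 130, 700, 800, 950, 140]
states = [trigger.update(ms) for ms in response_times_ms]
print(states)
```

The isolated 900 ms spike never raises a problem; only the run of 700, 800 and 950 ms does, and the status returns to “OK” as soon as a fast reply arrives.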
The trigger list should be your to-do list and not a list of interesting information. Every triggered event must be actionable and at some point correctable. It must also be important enough to warrant your attention and your valuable time.
When creating a trigger, always ask yourself: when that trigger pops up with “PROBLEM” status, can you turn that status back into “OK”?