Event Suppression Doesn't Work

If there’s one gripe most users have with their monitoring toolset, it’s the lack of effective event suppression. Virtually every organization I visit struggles with notification fatigue. The modern datacenter has a lot going on, and a lot of potential signs of failure: high CPU utilization, services occasionally not responding to requests, drives filling up. One switch goes down and administrators suddenly get hundreds of notifications. Most of these can be ignored, but how do we suppress them effectively?

The problem with most approaches to event suppression is that they focus on the components of a service, rather than measuring the service itself.

We must start evaluating the effects of IT, rather than the mechanics of service delivery.

So, what do we mean when we say the effects of IT? In short, the impact of downstream events on user experience, transaction volume, revenue, or any other outcome metric of the service. If we can measure these effects and include them in our event model, we gain a new way of looking at service issues. We can start asking the question: does this matter? For example, if a load-balanced web server fails, but user experience sees no impact and transaction volume remains stable, does it matter? Not so much. I may want to log the event, but it definitely doesn’t merit a Priority 1 issue. Now, let’s say we have a deadlocked table, and at the same time we see user experience take a nosedive; now that’s Priority 1! In many suppression models, a deadlocked table would not merit high priority by itself. It’s only when we view events from the perspective of how they impact service outcomes that we see what matters and what doesn’t.
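
To make the "does this matter?" question concrete, here is a minimal Python sketch of outcome-aware prioritization. It is an illustration only, not any particular product's event model: the names (ServiceOutcome, ComponentEvent, prioritize), the two outcome metrics, and the 20% threshold are all assumptions made up for this example. The idea is simply that a component event is escalated only when the service's outcome metrics show an impact at the same time.

```python
from dataclasses import dataclass

# Hypothetical threshold: an outcome metric below 80% of its
# normal baseline is treated as a real service impact.
IMPACT_THRESHOLD = 0.8


@dataclass
class ServiceOutcome:
    """Outcome metrics relative to baseline (1.0 = normal)."""
    user_experience: float
    transaction_volume: float


@dataclass
class ComponentEvent:
    """A raw event from a component of the service."""
    source: str
    description: str


@dataclass
class Decision:
    """The event together with the priority assigned to it."""
    event: ComponentEvent
    priority: str


def service_is_impacted(outcome: ServiceOutcome) -> bool:
    # The service is impacted if any outcome metric has dropped
    # below the threshold.
    worst = min(outcome.user_experience, outcome.transaction_volume)
    return worst < IMPACT_THRESHOLD


def prioritize(event: ComponentEvent, outcome: ServiceOutcome) -> Decision:
    # Escalate only when the service outcome is suffering;
    # otherwise keep the event for the record without notifying anyone.
    priority = "P1" if service_is_impacted(outcome) else "LOG_ONLY"
    return Decision(event=event, priority=priority)


# The two scenarios from the text, with made-up metric values.
web_failure = prioritize(
    ComponentEvent("web-03", "load-balanced web server down"),
    ServiceOutcome(user_experience=0.98, transaction_volume=1.0),
)
deadlock = prioritize(
    ComponentEvent("orders-db", "deadlocked table"),
    ServiceOutcome(user_experience=0.55, transaction_volume=0.9),
)

print(web_failure.priority)
print(deadlock.priority)
```

In this sketch the failed web server is only logged, because the outcome metrics stay near baseline, while the deadlocked table is escalated, because user experience has dropped well below the assumed threshold. A real model would likely also correlate events with outcomes over a time window rather than at a single instant.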