Can Monitoring Everything Be a Bad Idea?

The average network switch can expose upwards of 10,000 metrics describing availability, performance and security. I’ve seen filers with over 30,000. So the question I’m posing is this: for effective monitoring, should I be collecting and analyzing every one of these?

Here’s the complication. Somewhere in that pile of data is an early indicator that a critical service is about to fail, and if I’m not monitoring it, I miss that early warning. Users start calling the helpdesk, and my boss starts screaming. Elsewhere in that pile are distractions, abnormal behavior that affects virtually no one. But how can I easily filter out these distractions without also filtering out those golden nuggets that can help me prevent outages?

The key is starting at the right place. Most vendors start with discovery at the infrastructure level, identifying everything in existence and then trying to align these assets to critical services. In this approach, everything is important and the distractions cause so much noise that I’m stuck in perpetual fire-fighting. But what if we flipped it?

What if we started at the intersection between technology and people, the user experience? As we work down the technology stack, from the web server to the application tier to compute, storage and network, we can ask the critical question: how does this help deliver the user experience? At the switch, we can focus on the ports connecting the application servers to the web farm. At the filer, we can focus on the volumes mounted by the application servers. At the application server, we can focus on garbage collection metrics and ignore the FTP service.
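
To make that filtering idea concrete, here is a minimal sketch in Python. It is not FireScope code; the device names, component names, metric names and the SERVICE_DEPENDENCIES map are all hypothetical, made up purely for illustration. The assumption is that you have already worked down the stack and written down, per device, which components the user-facing service actually depends on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    device: str     # hypothetical device name, e.g. "switch-01"
    component: str  # hypothetical component on that device, e.g. "port-12"
    name: str       # hypothetical metric name, e.g. "errors_in"


# Hypothetical dependency map built top-down from the user experience:
# for each device, only the components the service relies on.
SERVICE_DEPENDENCIES = {
    "switch-01": {"port-12", "port-13"},  # ports linking app servers to the web farm
    "filer-01": {"vol-app"},              # volume mounted by the app servers
    "app-01": {"jvm-gc"},                 # garbage collection on the app server
}


def relevant_metrics(metrics, dependencies):
    """Keep only metrics whose component appears in the service's dependency map."""
    return [m for m in metrics if m.component in dependencies.get(m.device, set())]


# A tiny stand-in for the thousands of metrics bottom-up discovery would return.
discovered = [
    Metric("switch-01", "port-12", "errors_in"),
    Metric("switch-01", "port-40", "errors_in"),
    Metric("filer-01", "vol-app", "latency_ms"),
    Metric("filer-01", "vol-scratch", "latency_ms"),
    Metric("app-01", "jvm-gc", "pause_ms"),
    Metric("app-01", "ftp", "connections"),
]

kept = relevant_metrics(discovered, SERVICE_DEPENDENCIES)
for m in kept:
    print(m.device, m.component, m.name)
```

The point of the sketch is the direction of travel: the dependency map comes first, derived from the service, and discovery results are filtered against it, rather than collecting everything and trying to sort out relevance afterwards.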

Suddenly, we’ve gone from 10,000+ metrics to a dozen or so, and now I can focus my attention on those golden nuggets of intelligence that can help me leave work before the sun goes down.  This is the heart of FireScope’s Top-Down approach, helping organizations take a more meaningful approach to monitoring that eliminates the distractions that have been consuming far too much time and resources.