86 Expensive Minutes – Strategies to Leverage Automation to Improve Incident Recovery

According to a recent survey by the Ponemon Institute, the average reported incident lasted 86 minutes in 2013. While this is an improvement over the 97 minutes found by the same study in 2010, the impact on the business was actually greater ($690,200 versus $505,500 in 2010). That makes for a painful hour and a half, and a number that a lot of organizations are going to be struggling to drive down.

Unfortunately, there is no silver bullet that’s going to magically fix every incident. A lot of organizations have explored approaches to automated (or zero-touch) resolution, but these only work for specific use cases. However, other forms of automation could help reduce incident resolution times. Let’s start by breaking a typical incident into discrete segments and exploring the possibilities.

Identification that an incident has occurred. I’ve seen statistics indicating that anywhere from 37% to 76% of incidents are brought to IT’s attention by users rather than by alerting or instrumentation. In other words, the problem has existed for anywhere between 10 and 30 minutes before IT is even aware of it. This also hurts IT’s credibility within the organization, as the team is seen as behind the curve. An easy test of where your organization stands is to look at the last 5 major incidents that occurred and trace whether each was originally reported by users or by your monitoring tools. If users are your most consistent source of incident identification, I’d strongly recommend re-evaluating your current monitoring strategy.
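The self-test above is simple enough to script. Here is a minimal sketch in Python, assuming you can export your last five major incidents with a field that records who reported each one first. The incident IDs and the `reported_by` field are made up for illustration and are not taken from any particular ticketing tool.

```python
from collections import Counter

# Hypothetical records for the last five major incidents; the IDs and
# sources are illustrative, not real data.
incidents = [
    {"id": "INC-1", "reported_by": "user"},
    {"id": "INC-2", "reported_by": "monitoring"},
    {"id": "INC-3", "reported_by": "user"},
    {"id": "INC-4", "reported_by": "user"},
    {"id": "INC-5", "reported_by": "monitoring"},
]


def user_reported_share(records):
    """Return the fraction of incidents first reported by users."""
    if not records:
        return 0.0
    counts = Counter(r["reported_by"] for r in records)
    return counts["user"] / len(records)


share = user_reported_share(incidents)
print(f"{share:.0%} of recent major incidents were reported by users")
```

If that share sits above one half, users are your most consistent source of incident identification.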

Generating an incident ticket. When looking at optimization of incident life cycles, the incident ticket itself can have a big impact on resolution times. If the incident is reported by users, information is scarce and possibly inaccurate. This slows down all subsequent work, as IT operators must first interpret and verify all the information within the ticket. However, if the ticket is generated automatically by the monitoring tools, as we’ve accomplished with the FireScope+Cherwell integration, we can equip operators with much more accurate information about the issue and its related metrics. Additionally, since this ticket is generated by monitoring at the moment the first signs of an issue are identified, we can slash as much as 45 minutes from incident resolution times. Take those 5 major incidents we discussed in the last step, look at the initial reporting information, and evaluate how well armed your support team is at the beginning of an incident.
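To make the idea concrete, here is a minimal sketch of turning a monitoring alert into a pre-populated ticket. It does not use the FireScope or Cherwell APIs; the `Alert` and `Ticket` classes, their fields, and the sample values are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Alert:
    """Hypothetical alert as a monitoring tool might emit it."""
    service: str
    metric: str
    value: float
    threshold: float
    detected_at: datetime


@dataclass
class Ticket:
    """Hypothetical ticket record; the field names are illustrative."""
    title: str
    description: str
    opened_at: datetime
    metrics: dict = field(default_factory=dict)


def ticket_from_alert(alert: Alert) -> Ticket:
    """Pre-populate a ticket with the facts the monitoring tool already has."""
    return Ticket(
        title=f"{alert.service}: {alert.metric} breached threshold",
        description=(
            f"{alert.metric} measured {alert.value} "
            f"against a threshold of {alert.threshold}."
        ),
        opened_at=alert.detected_at,  # ticket opens at the first sign of trouble
        metrics={alert.metric: alert.value},
    )


# Illustrative alert; the service name, values, and timestamp are made up.
alert = Alert(
    service="checkout",
    metric="response_time_ms",
    value=4200.0,
    threshold=1500.0,
    detected_at=datetime(2014, 1, 15, 9, 30, tzinfo=timezone.utc),
)
ticket = ticket_from_alert(alert)
print(ticket.title)
```

The point of the sketch is that the operator starts with measured values and a detection timestamp instead of a user’s description of the symptoms.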

Initiating response. Now that we have a ticket for the issue, who do we route it to? How do we get the right person touching the right interface to get the incident resolved? In the average incident, we’re looking at 20-25 minutes being lost while the ticket is assigned to the appropriate person or team, notifications are sent out, and they review the ticket and finally begin work. And if they determine that their domain is not the source of the problem, the process starts all over again as the ticket gets bounced between the Storage, Server, Application and Network teams. This is especially common in user-reported incidents, because the user is only aware of the symptoms of the problem. Here too, an automated integration between monitoring and ticketing can radically shorten this phase of the incident by providing insight into the source of the problem and, in the case of the FireScope+Cherwell integration, automatically assigning the ticket to the right team. Take another look at those 5 incidents and see how often the ticket was re-assigned before being resolved. For each re-assignment, we can factor in at least 20 minutes lost that could be recovered.
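The re-assignment arithmetic can be sketched in a few lines of Python. The 20-minute figure is the rule of thumb from the paragraph above; the incident IDs and assignment histories below are hypothetical, so the result is a lower-bound estimate for made-up data, not a measurement.

```python
MINUTES_PER_REASSIGNMENT = 20  # rule of thumb from the text; a lower bound

# Hypothetical assignment histories: the teams each ticket passed through,
# in order. The last team listed is the one that resolved the incident.
assignment_history = {
    "INC-1": ["Network", "Server", "Application"],
    "INC-2": ["Storage"],
    "INC-3": ["Server", "Network"],
    "INC-4": ["Application"],
    "INC-5": ["Network", "Storage", "Server", "Application"],
}


def reassignment_minutes(teams):
    """Estimate minutes lost to bouncing a ticket between teams."""
    reassignments = max(len(teams) - 1, 0)  # the first assignment is not a bounce
    return reassignments * MINUTES_PER_REASSIGNMENT


lost = {inc: reassignment_minutes(teams) for inc, teams in assignment_history.items()}
total_lost = sum(lost.values())
print(f"Estimated minutes lost to re-assignment: {total_lost}")
```

In this made-up sample, six re-assignments across five incidents add up to at least two hours that correct routing on the first attempt could recover.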

Isolating the source of the problem. For each interface that has to be consulted to trace an issue, factor in 15 minutes added to the time it takes to resolve the incident. This is where a single-pane view of the entire infrastructure really shows its value, as isolation can often be the longest phase of any incident. In your own environment, how many interfaces does your organization have to consult to troubleshoot a typical incident?

Verifying resolution and closing the incident. Even though the issue may be resolved, the timer is still ticking. In the manual world, operators must go through every step of a typical use case of the impacted service to verify resolution, then enter final documentation and close the ticket. This can take anywhere from 5 to 35 minutes, time that we still want to shave off if at all possible. FireScope’s approach is to use its automated user experience testing, as well as its numerous other monitoring capabilities, to identify that the service is functioning within normal parameters and then automatically close the incident in Cherwell. And because the monitoring platform is continually testing multiple use cases and various other aspects of the service, we have greater assurance that the issue is effectively resolved, avoiding the potential for follow-on tickets based on forgotten use cases or checks.
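As a rough illustration of automated verification, the sketch below only allows a ticket to close once the most recent service checks have all passed. This is not FireScope’s actual logic; the function name, the three-check default, and the sample results are assumptions chosen for the example.

```python
def ready_to_close(check_results, required_passes=3):
    """Return True once the most recent `required_passes` checks all passed.

    `check_results` is a chronological list of booleans, one per automated
    check of the impacted service (True means the check passed).
    """
    if required_passes < 1:
        raise ValueError("required_passes must be at least 1")
    if len(check_results) < required_passes:
        return False  # not enough evidence yet
    return all(check_results[-required_passes:])


# Hypothetical check history: two failures during the incident, then recovery.
recovered = [False, False, True, True, True]
# A service that is still flapping should not trigger an automatic close.
flapping = [True, True, False, True, True]

print(ready_to_close(recovered))
print(ready_to_close(flapping))
```

Requiring several consecutive passes rather than one is a design choice that guards against closing a ticket on a service that recovers briefly and then fails again.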

As mentioned before, there is no silver bullet to automatically solve every issue. However, thoughtful implementation of automation, as outlined above, can have a huge impact on incident resolution times. Many FireScope and Cherwell customers have seen their incident resolution times cut by 60% in their first months of using the integrated and highly automated approach we just discussed.

Consider the self-analysis questions we posed in each of the preceding sections. How does your organization fare? How could leveraging any of the proposed automation paths ease your day?