Monday, April 19, 2010

INCIDENT MANAGEMENT

In a previous entry I was fascinated by the idea of running IT as a business. A key concept stemming from this is IT Service Management (ITSM). It is the main topic discussed in ITIL V3, and many companies are trying to adopt the ITIL framework in their own environments to deliver IT services and manage them at an enterprise level.

One of my upcoming projects is to review the Incident Management process currently in place in our company. There is a lot of discussion in ITIL, as well as in all kinds of white papers, about event management, incident management and problem management. The three certainly overlap in places, but the differences are quite clear and they should definitely be treated separately.

So I've been doing some homework for Incident Management and I thought I might just put some notes here.

The incident management that I'm talking about here is strictly IT Service related incident management, so if you see papers talking about "security incident management", or even "disaster incident management", that's a whole different story. (And maybe Business Continuity Planning and Disaster Recovery are good topics to talk about next ^^)

First of all, it's always nice to have a diagram to go with any boring notes:



This is a flowchart from CA, designed in a subway map style. So far it is my personal favorite flowchart illustrating all the different pieces of the ITIL service support processes. For now I'm only focusing on the light blue line: the Incident Management route.

ITIL defines an incident as an interruption that stops an IT service or reduces its quality. Incident management aims to resolve the issue and restore the service in a timely fashion. The processes to achieve this include: identification, logging, categorization, prioritization, investigation and diagnosis, escalation, resolution and recovery, and closure.
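To make the order of those steps concrete, here is a minimal Python sketch that treats them as stages of a ticket. The names Stage, next_stage and walk are made up for illustration, and the rule that escalation can be skipped when the help desk resolves the incident itself is my own simplification, not something ITIL prescribes in this form.

```python
from enum import Enum


class Stage(Enum):
    """Incident lifecycle stages, in the order listed above."""
    IDENTIFICATION = 1
    LOGGING = 2
    CATEGORIZATION = 3
    PRIORITIZATION = 4
    INVESTIGATION_AND_DIAGNOSIS = 5
    ESCALATION = 6
    RESOLUTION_AND_RECOVERY = 7
    CLOSURE = 8


def next_stage(stage: Stage, needs_escalation: bool = True) -> Stage:
    """Return the stage that follows `stage`.

    Assumption: escalation is skipped when the first tier can fix it.
    """
    if stage is Stage.CLOSURE:
        raise ValueError("a closed incident has no next stage")
    following = Stage(stage.value + 1)
    if following is Stage.ESCALATION and not needs_escalation:
        following = Stage.RESOLUTION_AND_RECOVERY
    return following


def walk(needs_escalation: bool) -> list:
    """List the stage names an incident passes through, start to finish."""
    stage = Stage.IDENTIFICATION
    path = [stage.name]
    while stage is not Stage.CLOSURE:
        stage = next_stage(stage, needs_escalation)
        path.append(stage.name)
    return path
```

Calling walk(False) gives the path of an incident the help desk resolves on its own, while walk(True) includes the escalation stop.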

When an incident occurs, it may be detected and reported manually by users, or generated automatically by the alerting systems, and a ticket is cut to the help desk. Based on incident type, SLA requirements and other possible factors, these records are categorized and prioritized for further processing. Before trying to figure out what went wrong from a root cause analysis perspective (which is what Problem Management is for), the help desk searches its knowledge management database for a way to restore the service to its normal status, or escalates the incident as necessary to other tiers of SMEs to find a workaround that re-enables the service in minimal time. The ticket is then transferred to Problem Management for root cause analysis.
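The triage described above can be sketched in a few lines of Python. Everything here is hypothetical: the Incident fields, the KNOWN_WORKAROUNDS table standing in for a knowledge base, and the scoring rule. Deriving priority from impact and urgency is a common convention, but the 1-to-3 scales and the formula below are assumptions for the example, not values taken from ITIL or from any tool.

```python
from dataclasses import dataclass

# Hypothetical knowledge base: incident category -> known workaround.
KNOWN_WORKAROUNDS = {
    "email": "fail over to the standby mail server",
    "vpn": "restart the VPN concentrator service",
}


@dataclass
class Incident:
    summary: str
    category: str
    impact: int   # assumed scale: 1 = high, 3 = low
    urgency: int  # assumed scale: 1 = high, 3 = low


def priority(incident: Incident) -> int:
    """Lower number means handled first; result ranges from 1 to 5."""
    return incident.impact + incident.urgency - 1


def handle(incident: Incident) -> str:
    """Use a known workaround if one exists, otherwise escalate."""
    workaround = KNOWN_WORKAROUNDS.get(incident.category)
    if workaround is not None:
        return f"help desk applies workaround: {workaround}"
    return "escalate to next tier"


queue = [
    Incident("Printer jam on floor 3", "printer", impact=3, urgency=3),
    Incident("Nobody can send email", "email", impact=1, urgency=1),
    Incident("VPN drops for remote staff", "vpn", impact=2, urgency=1),
]

# Work the queue from most to least pressing.
ordered = sorted(queue, key=priority)
for incident in ordered:
    print(priority(incident), incident.summary, "->", handle(incident))
```

The point of the sketch is the ordering of concerns: prioritize first, try the knowledge base second, and escalate only when no workaround is on record.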

A critical element in the incident management process (and in many other processes too) is obviously the data: whether your data is accurately logged, whether you keep a record of every workaround, whether you do metrics/trending for your SLAs... All these activities are very important to the final decision-making process.
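As one example of the kind of metric that depends on accurate logging, here is a rough Python sketch of an SLA compliance figure. The SLA_TARGETS values and the sample tickets are invented for illustration; real targets come from whatever your own SLAs say.

```python
from datetime import datetime, timedelta

# Assumed restore-time targets per priority level (illustrative only).
SLA_TARGETS = {
    1: timedelta(hours=1),
    2: timedelta(hours=4),
    3: timedelta(hours=24),
}

# Made-up ticket log: (priority, opened, service restored).
tickets = [
    (1, datetime(2010, 4, 12, 9, 0), datetime(2010, 4, 12, 9, 40)),
    (1, datetime(2010, 4, 13, 14, 0), datetime(2010, 4, 13, 16, 0)),
    (2, datetime(2010, 4, 14, 8, 0), datetime(2010, 4, 14, 11, 0)),
    (3, datetime(2010, 4, 15, 10, 0), datetime(2010, 4, 16, 9, 0)),
]


def sla_compliance(log):
    """Fraction of tickets restored within their SLA target."""
    if not log:
        raise ValueError("no tickets to measure")
    met = sum(
        1
        for prio, opened, restored in log
        if restored - opened <= SLA_TARGETS[prio]
    )
    return met / len(log)


print(f"SLA compliance: {sla_compliance(tickets):.0%}")
```

If the opened and restored timestamps are logged carelessly, a number like this is meaningless, which is exactly why the quality of the data matters so much.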

To sum up, the objective of Incident Management is to rapidly restore services within agreed SLAs. Unlike Problem Management, which focuses on finding the root cause of problems, Incident Management is essentially about getting things back up and running quickly, even if this means performing workarounds and quick fixes. Technology can play a critical role in optimizing the incident management process by automating the process activities themselves (such as incident recording and classification), and by accessing the outputs of other related processes. Integration with other processes (especially Problem Management, Change Management, Configuration Management and Service Level Management) is vitally important to ensure that incidents are kept to a minimum and that the highest levels of availability and service are maintained.
