Monday, May 24, 2010

Need For RCA

By definition, a root cause is the fundamental breakdown or failure of a process which, when resolved, prevents a recurrence of the problem. Root Cause Analysis (RCA) is the exercise of finding that breakdown, and it helps us understand why the problem occurred in the first place.

The whole purpose of root cause analysis is to identify the origin of a problem. It uses a specific set of steps, with associated tools, to find the primary cause of the problem, so that you can determine:

1. What actually happened.
2. How it happened.
3. Why it happened.

RCA assumes that systems and events are interrelated. An action in one area triggers an action in another, and another, and so on. By tracing back these actions, you can discover where the problem started and how it grew into the symptom you're now facing.
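The tracing-back idea can be sketched in a few lines of Python. This is only an illustration, not a real tool: the incident, the event names and the `trace_to_root` helper are all made up for the example, and it assumes each effect has a single recorded cause.

```python
# Hypothetical incident: each observed effect maps to the event that triggered it.
CAUSED_BY = {
    "website returns errors": "application server cannot reach database",
    "application server cannot reach database": "database host out of disk space",
    "database host out of disk space": "log rotation job disabled",
}


def trace_to_root(symptom, caused_by):
    """Follow effect -> cause links until an event with no recorded cause is reached."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in caused_by:
        cause = caused_by[chain[-1]]
        if cause in seen:  # guard against circular records
            break
        chain.append(cause)
        seen.add(cause)
    return chain


chain = trace_to_root("website returns errors", CAUSED_BY)
root_cause = chain[-1]  # the last event in the chain is where the problem started
print(" <- ".join(chain))
```

Each step in the chain is one "because" in the story of the incident. In real life an effect often has several contributing causes, so the chain becomes a tree, but the walk is the same.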

Initially, RCA is a reactive method of problem detection and solving: a post-analysis, done after an event has occurred. As you gain expertise in RCA, it becomes a proactive method, one that helps you anticipate an event before it occurs.

Being in data center operations, we too come across repetitive and irritating problems quite often, and it is very important to get to the roots. RCA comes in as an ally in such situations. I can understand it's annoying when your manager asks for it. But believe me, it's equally enjoyable to reveal the root cause. As someone rightly said, "Only the inquiring mind solves problems".

In order to deliver high levels of IT infrastructure availability, organizations need tools that help them isolate repetitive problems. If you belong to a team where alert provisioning is very tight, you might also land in a situation where multiple alarms fire at the same time. This is the point of root cause analysis: to dig below the symptoms and find the fundamental, underlying decisions and contradictions that led to the undesired consequences. If you want your problems to go away, your best option is to kill them at the root.

To identify the root cause, we have to ask "Why?" over and over until we get there. You need to trace back the events in a systematic way by looking at the effects and the causes that created or contributed to those effects. A 'Fishbone' diagram can be quite handy in this kind of situation to isolate the issue.
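When several alarms fire together, one simple way to narrow things down is to look for the alarming component that does not depend on any other alarming component. Here is a rough Python sketch of that idea. The component names, the `DEPENDS_ON` map and the `root_candidates` function are hypothetical, and it assumes you already have a dependency map of your infrastructure.

```python
# Hypothetical topology: each component lists what it directly depends on.
DEPENDS_ON = {
    "web-frontend": ["app-server"],
    "app-server": ["database", "cache"],
    "database": ["storage-array"],
    "cache": [],
    "storage-array": [],
}


def root_candidates(alarming, depends_on):
    """Return alarming components with no alarming dependency, direct or indirect."""
    alarming = set(alarming)

    def has_alarming_upstream(component, visited):
        for dep in depends_on.get(component, []):
            if dep in visited:
                continue
            visited.add(dep)
            if dep in alarming or has_alarming_upstream(dep, visited):
                return True
        return False

    return sorted(c for c in alarming if not has_alarming_upstream(c, {c}))


# Four alarms fired at the same time; only one of them has nothing alarming beneath it.
alarms = ["web-frontend", "app-server", "database", "storage-array"]
candidates = root_candidates(alarms, DEPENDS_ON)
print(candidates)
```

This only points you at where to start digging. The component it returns is a candidate, and you still need to ask "Why?" about it until you reach the actual root cause.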

Throughout your analysis, you must be willing to probe the data: first to determine what happened during the occurrence, second to describe how it happened, and third to understand why.

Once the root cause is identified, you need to decide whether to resolve it (actionable) or not to resolve it (non-actionable). This decision is even more crucial when the cost of resolving it is high, which forces us to treat the problem as a symptom to live with. It's a very difficult scenario, as the cost of the symptom is generally wrapped up in customer satisfaction in addition to the resource costs associated with it. But if the cost involved is minimal, the root cause should be addressed immediately, with appropriate failover, backup or planned downtime. And if it's identified as a deeply rooted cause with a higher cost of resolution, it is better to tag it as a known symptom.

Many organizations document a set of procedures to follow if the problem recurs. The problem is then tagged as a 'known issue', and a considerable amount of time is saved when addressing it. What we achieve here is at least a quicker resolution, even though the root cause has NOT been removed at all.

As someone rightly said, "Customers don't expect you to be perfect. They do expect you to fix things when they go wrong." So be equipped with your tool set for a quicker resolution by engaging in a continuous RCA hunt. Maybe "tomorrow" becomes predictable! As someone else rightly said, "It's what you learn after you know it all that counts".

So, dig it big !! AND Don't Skip! :-)

