Monday 19 January 2015

Causality and Root Cause Analysis

Software programs crash, people make mistakes, and accidents or disasters happen. We can't always be ready for all of them or prevent all of them, but we can analyze why they happened and prevent them from happening again. One approach to identifying the reasons for a problem is Root Cause Analysis, or RCA for short.

RCA helps us find the underlying cause or causes of a problem. Usually these are problems we don't want to happen, but sometimes we keep getting bad results and then suddenly get a good one. We can use RCA to find the reasons for such a good result too.

It is one of the best ways of problem solving because it doesn't just repair or patch the problem. It looks for the root causes, and once you remove them, the problem should not come back to play with you again.



Causality
[Figure: E2 and E4 are the main causes of E0]
Causality, or causation, is the relation between an event e1 that happened at time t1 and another event e2 that happened at time t2 (t2 > t1), where we consider e2 to be the result of e1. It is not the same as correlation: two events can be correlated without either of them being the cause of the other.

You can talk about causality from either a scientific or a philosophical point of view. Science may say that causality exists only when time exists. So, for example, you may believe that since there was no space or time before the (first) big bang, there was no causality, and so nothing was needed to trigger the big bang. A philosophical point of view, on the other hand, may not confine causality to the box of time and may hold that it can exist even when there is no time. (It is weird, I know.) Regardless of how we look at causality, it helps us trace the origin of an event.

Look at the picture. Here E0 happens at the current time, and we want to find the root cause of this event. If we go back in time, we find that the cause of E0 is E1, and the causes of E1 are E2 and E3. E2 has no cause itself, but E3 has a cause, which is E4. So if we prevent E2 and E4 from happening, we will not see E0 happen again.
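The walk back through the picture can be sketched in a few lines of Python. This is only a minimal illustration, assuming the picture is written down as a dictionary from each event to its direct causes; the names `CAUSES` and `root_causes` are made up for this example.

```python
# Hypothetical encoding of the picture: each event maps to its direct causes.
# E2 and E4 have no entry because no cause was recorded for them.
CAUSES = {
    "E0": ["E1"],
    "E1": ["E2", "E3"],
    "E3": ["E4"],
}


def root_causes(event, causes):
    """Return the events that lead to `event` and have no recorded cause."""
    roots = set()
    seen = set()
    stack = [event]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited through another path
        seen.add(current)
        parents = causes.get(current, [])
        if not parents and current != event:
            roots.add(current)  # nothing further back: a root cause
        stack.extend(parents)
    return sorted(roots)


print(root_causes("E0", CAUSES))  # ['E2', 'E4']
```

The result matches the picture: preventing E2 and E4 is what stops E0.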

Note that we can't say E2 and E4 happened by themselves with no reason. They surely had causes of their own, but those causes were outside the scope of our investigation. So we need to know how deep we should dig for root causes. In other words, if we dig too far, we reach events that we have no control over or no interest in. For example, in a power failure at a data center, we may finally find out that the reason was the previous night's thunderstorm. We can't prevent a thunderstorm, but we can act one step later, when lightning hits the building, so we don't try to find the cause of thunderstorms.
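One way to picture "how deep to dig" is to give the search a list of events we consider out of scope and stop as soon as every remaining cause is on that list. The sketch below is an assumption-laden illustration, not a standard RCA tool: the function `actionable_causes`, the `DATA_CENTER` dictionary, and the event labels are all invented for the data center example.

```python
def actionable_causes(event, causes, out_of_scope):
    """Walk back from `event`, but do not dig into causes we cannot control."""
    found = set()
    seen = set()
    stack = [event]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        # Keep only the causes that are still inside our investigation scope.
        in_scope = [c for c in causes.get(current, []) if c not in out_of_scope]
        if not in_scope and current != event:
            found.add(current)  # deepest event we can still act on
        stack.extend(in_scope)
    return sorted(found)


# Hypothetical chain for the data center example.
DATA_CENTER = {
    "power failure": ["lightning hit the building"],
    "lightning hit the building": ["thunderstorm"],
}

print(actionable_causes("power failure", DATA_CENTER, {"thunderstorm"}))
# ['lightning hit the building']
```

The thunderstorm is still in the graph, but because it is marked out of scope the search stops one step after it, which is where we can actually do something.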

Root Cause Analysis
RCA is a simple, systematic methodology for doing what we just talked about. Each event in the picture has its own story of what it is and why it happened. RCA suggests following these three simple steps to prevent something bad from happening again.
  1. Understand what happened (problem)
  2. Find out why the problem happened (reasons/causes)
  3. Prevent the causes from happening again
You may need to repeat these three steps several times as you work on step 3 (or 2) to prevent the causes you found in step 2.

There are many forms and methods for looping through these three steps, and all of them just try to define a discipline for doing RCA. Once you understand what RCA says, you can do it in any way you like or feel comfortable with. Just don't forget to use brainstorming and drawings, especially in step 2, which is something of an investigative or forensic process. In my view, most difficult problems can be solved with just these approaches.
