Saturday, June 26, 2010

When a Human Becomes a SPOF...

A SPOF (Single Point Of Failure) is an element or indivisible entity in our system which, when it fails, leads to downtime or even an unwanted blackout. This essential element can be a firewall, a Power Supply Unit (PSU) in a data center, a UPS system, a backup system, Enterprise Single Sign-On (SSO) or sometimes a human being too! While the rest are technical and seen in almost any IT/ITES or other organization, the last one, a human being as a single point of failure, may not cause the failure itself but surely leads to a slower recovery after any outage. Nowadays many organizations conduct an exercise in the form of a Single Point of Failure (SPOF) analysis. A SPOF analysis is a systematic analysis of what can go wrong in our environment and what impact each failure can cause. It details the inter-dependencies and relationships among the major components in our environment. This analysis really helps to figure out the crucial failure points, which can then be eradicated at the root with some serious collaborative effort by the various business units (BUs), partners and so on.
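To make the idea of a SPOF analysis concrete, here is a minimal sketch in Python. It assumes the environment can be written down as a simple dependency map, and it flags every component whose individual failure cuts the users off from the database. The topology and all component names (`firewall-a`, `app-1` and so on) are hypothetical and only for illustration; a real analysis would also have to cover power, network and people.

```python
from collections import deque

# Hypothetical topology: each component lists what it can reach next.
TOPOLOGY = {
    "users": ["firewall-a", "firewall-b"],
    "firewall-a": ["sso"],
    "firewall-b": ["sso"],
    "sso": ["app-1", "app-2"],
    "app-1": ["database"],
    "app-2": ["database"],
    "database": [],
}


def is_reachable(graph, start, goal, removed=None):
    """Breadth-first search from start to goal, treating `removed` as failed."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def find_spofs(graph, start, goal):
    """Return the components whose failure alone disconnects start from goal."""
    candidates = set(graph) - {start, goal}
    return sorted(
        c for c in candidates
        if not is_reachable(graph, start, goal, removed=c)
    )


print(find_spofs(TOPOLOGY, "users", "database"))
```

In this made-up topology the two firewalls and the two application servers back each other up, so only the single SSO box is reported. The end points themselves are excluded from the candidates, so a lone database would still need a separate look.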

For others who do not practice this, it may be due to time constraints. But this definitely limits the scope of a detailed analysis to understand where we are actually attracting SPOFs which might result in an undesirable outcome. Today I am going to focus on how and why one employee ends up taking the role of savior in the organization in spite of having a very capable, ready-to-jump-in team. This is because he is the person who normally has all the details, from minor to sensitive information. But is it really healthy? And can this be removed with much broader planning, right from the planning phase?

For the rest of my post I will call this person, who is definitely a hero, Mr. Rocky! He is usually the company's "Go-To" guy. He knows each minor detail of how a device works, why this device was chosen here and not another, and why it is configured this way and not the normal way. He is the guy whom your manager will always have to call when something goes wrong. Yes! The business needs experienced and serious people who can take ownership when something goes wrong. It is also completely acceptable that you become more important as you gain experience and skills. But when you start feeling too important and the business hinges on you, then please be aware: you are setting yourself up to be the SPOF here.

Failure and Post Failure analysis:

Before any analysis it is important to understand: what constitutes a failure? This may seem a trivial question at first, but the term 'failure' is meaningless in the abstract. In whose opinion has a failure occurred? Your customers'? Or your company's? It is important to understand and define the perspective from which a lack of functionality will be considered a failure. Once we know we have a failure, the first thing, of course, is to get out of it. After that we can do an RCA (root cause analysis), which can help figure out whether a human dependency or a technical malfunction of some equipment led to the outage. How much time did the recovery take? Was there a human dependency there too? Did Mr. Rocky get busy this time as well, since he had to deliver everything single-handedly in spite of having quite a big team, who were mere onlookers while the crisis was on?

Let's look at some of the negative aspects of having a guy like this in our organization:

  1. He/she might take unhealthy control of the work flow, such as change management or system configuration which only he knows, while our organization is not so good at Wiki documentation.
  2. He suddenly changes his mind and switches jobs at short notice or, going rogue, joins one of my competitors.
  3. He is on a Caribbean vacation and there is a crisis. By the time my sweet manager understands what he missed by NOT giving enough training to newcomers, we already have an outage.

Do we have a solution?

Yes! Human SPOFs can be reduced, if not completely eradicated. The first step is to identify the important/crucial person in each team in the organization. Of course this may be a huge exercise depending on the situation, size and complexity of your organization. Next comes assigning documentation to various members of the team, following up regularly, and getting reports on a continuous basis. Also needed: documenting everyone's job descriptions within the team along with their roles & responsibilities, implementing backup roles, cross-training your IT personnel, and most importantly, NOT having a team which is just the bare minimum.

This post would be incomplete if I didn't try to understand the viewpoint of Mr. Rocky! What led the people around him to reach a point where they had to discuss his role only when a crisis came in?

'Rocky' has a team. But probably they are not contributors. They are NOT self-starters, quite the opposite of what they wrote in their resumes while applying for the job! They don't love or care about the product as much as he does. The 'Non-Rockys' work on the product until COB (sharp) only, but not beyond, displaying a sheer sense of professionalism. They have a personal life too, and they need a balance. 'Rocky' has one too, but you know, he is 'different'.

Mr. Rocky does it again!

Crisis, downtime, upgrade, maintenance, target, deadline and so many more, and yet on top of that he has a new team to train. He might be OK with it, but to start with he puts more emphasis on reading the company wiki pages and all that junk he has in the name of documentation, following the emails, watching out for upcoming events and attending team meetings, and probably only after that would he prefer to sit with them for some training. I can't guess why, but he must have some thought process behind that too! A crisis is the best time to teach new people in your team. But this time too 'Rocky' misses it, because he thinks he can get it done quicker if he does it himself. That is the moment when he is the only one driving this, and definitely living up to people's expectations! So his first priority is to get out of it, he decides to do it single-handedly, and he rocks again! But in the end, nobody else learns the job.

The important thing to watch is: was the incident a good opportunity, a case study to document so many things which could in fact have helped the newcomers? Or was Rocky again too tired, or too lazy, to train, teach and document? Did the manager understand the importance of doing this exercise, or did he just relax after shooting off a "Kudos" mail? Or did Rocky take advantage of the point that "He can’t be fired and he knows it. And sadly, he probably need to be fired, but can’t be." I reiterate this point with serious emphasis so that business managers consider it and take a closer look with a top-down approach. Some people never build the redundancy which a company requires. This is NOT an achievement to be rewarded as a point Rocky made during his annual appraisal, but a reminder to the management that the SPOF himself is declaring, bold and loud, who he is!

Rocky has a point too. He is hard-working, he loves what he does and does it with utmost care and deep involvement. His hard work also often pays off. He and the people around him are happy, along with management. He is always on a helping spree. That is probably why he is always the My-Dear 'n' Go-To guy. On the other hand we have a new capable team too, thirsty for knowledge transfer and ready to jump in. Then why the hell does this SPOF still exist? Is he also suffering from this? Probably yes.

What is actually wrong?

What is seen here is a capable team, which is so rare to get, always willing to jump in and ready to give their best when there is a call for it. But there is a serious disconnect here in terms of teamwork and collaborative effort. To some extent this is situation-driven, but to some extent it is contributed to by our relaxed business leaders, who have tremendous faith in their current rock stars, who sometimes forget even to lay the foundation of the much-needed bridge for knowledge transfer to other team members.

Moreover, the senior people, whom I have named rock stars in this post, should always be open to sharing with and mentoring their fellow team members, who can fill the gap in their absence. And if at all there is that so-called 'command-and-conquer' mentality, better to get rid of it.

I think if we Don't 'Skip' these points, we can move 'Fast' towards removing this Human SPOF too! Watsay? :-)

