Friday, April 25, 2014

Ops: Opposite of Luck?



Running effective and efficient operations is definitely NOT an easy task! I keep saying it: “Ops is against luck!”

Let's consider the scenarios below:

  • If no one is complaining and you don't have a readily available report on how healthy your systems are, you might actually be in trouble.
  • Customers may be too angry, busy, or confused to bring up issues. Don't count those quiet days as your lucky ones!
  • 'No news' is NOT good news!
  • Assuming that unresolved issues won't recur is risky. The strategy of duck and cover, in the hope that the problem will disappear, is a recipe for catastrophe.

Someone rightly said: luck is just a thin wire between survival and the worst!

You may put such quiet days down to luck, and maybe that is so, but what we probably need for consistent efficiency is diligence, along with a willingness to act.

But how? Below are a couple of real-life scenarios:

If you own production, negotiate your own production readiness and excellence by asking ‘Why’, ‘When’ and ‘What’. It might seem overly cautious, BUT we should apply a lens to understand whether making the change ‘now’ will bring real benefit or whether it can wait for a more favorable time.

Does your PD consider operations as its customer? If NOT, speak up. Set expectations correctly so that you don't have to struggle to set up monitors and metrics overnight in order to support production from early the next morning.

Also, we should know when to pull the brake on projects, prepare for actual traffic, and earn $.

Relying on your firefighting superhero and showering praise on their heroics should give way to a different mindset: permanent resolution by getting to the root cause.

Yep! Sin kills...! And luck is always to blame :-)

Wednesday, April 9, 2014

Why kill -9 httpd is NOT always a good idea for invoking SSC




Sharing from my personal experience:

One fine day I fell into this trap myself.

With ‘n-4’ (LTM, BIG-IP F5 iRule) being the condition that triggers SSC (static site contingency) at the LTM layer, I managed to kill Apache on most of the web servers, all but the last 7 or 8 (I use a Windows laptop, so mPuTTY is my tool for running the commands in parallel). And guess what! All the load eventually fell onto those few.
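To put rough numbers on that pile-up, here is a minimal Python sketch. It assumes the load balancer spreads traffic evenly across whatever servers are still answering. The function names and the figures (40 servers, 4000 requests per second) are made up for illustration and are not taken from the incident.

```python
def load_per_survivor(total_rps: float, total_servers: int, stopped: int) -> float:
    """Requests per second landing on each server that is still up.

    Assumes an even spread across survivors (a simplification).
    """
    survivors = total_servers - stopped
    if survivors <= 0:
        raise ValueError("no servers left to take traffic")
    return total_rps / survivors


def amplification(total_servers: int, stopped: int) -> float:
    """How many times the normal per-server load each survivor now carries."""
    survivors = total_servers - stopped
    if survivors <= 0:
        raise ValueError("no servers left to take traffic")
    return total_servers / survivors


if __name__ == "__main__":
    # Hypothetical pool: 40 web servers, 4000 requests/sec in total.
    print(load_per_survivor(4000, 40, 0))   # normal share per server
    print(load_per_survivor(4000, 40, 32))  # share once 32 are stopped
    print(amplification(40, 32))            # multiple of the normal share
```

With those made-up numbers, stopping 32 of 40 servers leaves each of the 8 survivors carrying five times its normal share, which is exactly the kind of pile-up that hit my last few boxes.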

I lost control and lost time as well. I had a superman attitude and never asked Network (NOC) to do it by logging in to the BIG-IP and disabling the pool members, so the whole time was wasted, and in the end I had to call them in to the rescue.

Meanwhile the LTM flipped in and out of SSC intermittently, and the UX was sometimes good but mostly bad.


We can debate that high load on the servers will automatically trigger SSC anyway, but that would be a side effect (not an action!), and a worse, uncertain scenario in which we play no role. What we are talking about here is a human being putting a certain and timely contingency in place.

Hence I recommend that this be done at the LTM layer, deliberately, by a person... ALWAYS!

Thanks/-

RCA - Root Cause Analysis

An important step in finding the root causes of issues or occurrences that happen within a system or organization is root cause analysis (RC...