Inner Peace During Oncall
A team member just started solo oncall.

“You look stressed. How are you holding up?” I asked.

“I was paged last night. It took me 3 hours to find the root cause, and I still have 5 severity-2 tickets pending.” He was nervous, I could tell.

We took a quick look at the ticket queue.

“Is there any customer-impacting ticket that needs your immediate attention?”

“No, none of them have customer impact or operational risk at this point. I just need to root cause them and follow the runbook if they need actions.”

“Ah, but why are you still stressed? Since there is no customer impact, you can take your time to learn new skills, maybe even write a tool to automate some troubleshooting procedures. Isn't that good?”

“But I thought I needed to work on all 5 high severity tickets immediately. I have been switching from one to another ...”

“Ah, I see!”

When an oncall engineer or manager gets paged, the natural instinct is to feel stressed. But we don't have to be. In fact, being oncall can be a great opportunity to practice inner peace: stay calm, triage, think critically and solve problems effectively. These are the skills that separate good engineers from mediocre ones.

When we get paged by a high severity ticket,

1. The first step is triaging: understand the situation, evaluate the customer impact and decide the priority. At one large service event I got paged 30+ times in a row, but the pages were all for one problem. There was nothing to panic about.
2. The next step is mitigating the impact. Engineers are trained to ask “why did it happen?” Don't! Focus on “what can I do to stop or reduce the customer impact?”
3. In parallel with mitigation, you need hyper-communication. AWS' large service events trigger automatic conference calls with dedicated call leaders to coordinate the mitigation actions. It is important that one member of the oncall crew is assigned to communicate promptly, concisely, precisely and frequently with the rest of the stakeholders.
4. Long-term fix and retrospective. We will do the root cause analysis later, maybe even write a formal document called a “Correction of Error (COE)”, where we ask the 5 whys. We will apply the long-term fixes.

But for now, take a breath. See, not too bad if you can stay calm, right?

If you panic or get stressed out, you are reacting to the event without using your mind. You are letting your emotions take control of the outcome. When your kitchen sink gets blocked and water is overflowing, what do you do? Well, instead of panicking or cursing, you turn off the faucet so no more water goes into the sink. That is the mitigation-first approach.

All these events are nothing but signals to our mind. We can process these signals logically without losing our inner peace. We can stay calm in spite of the crisis. Urgency does not have to mean stress.

“Each one has to find his peace from within. And peace to be real must be unaffected by outside circumstances.”
Mahatma Gandhi