99% Automated - deleted an on-call rotation!
I set a career first recently: I deleted an on-call rotation. As an SDM I've created quite a few on-call rotations. But deleting an on-call rotation, oh man, it took me a while to figure out how. Now engineers have more time to work on quality project tasks, and more uninterrupted family life.
So how did we remove the need for the secondary on-call rotation while AWS continues to support more regions? In the language of Amazon's Leadership Principles we would say:
1. Insist on the Highest Standards - challenge the status quo, and don't accept repetitive manual work as the Standard Operating Procedure (SOP).
2. Invent and Simplify - use innovation to raise the security and operational excellence bar, instead of relying on on-call engineers' blood and sweat.
We looked hard at the use cases that required a secondary on-call engineer after hours, and launched multiple projects to make them either fully automated or manageable by a single primary on-call engineer. We mercilessly reduced the ticket noise: the tickets that page our on-call engineers but require no action. We automated the time-consuming manual log-diving processes, so that the information on-call engineers need to take action is added to the tickets automatically. We scheduled biweekly ticket bashes that helped us keep the ticket count down.
Technical innovations by themselves are not enough to remove an on-call rotation. We also needed a change of mindset. We had SOPs that were designed to require two engineers to execute certain commands, for better security and better safety. It was a mindset that two people were better than one, no matter what. When we dived deeper into why we needed two people, it turned out that 99% of the time we didn't. We can leverage technologies such as hardware attestation and codified guardrail rules to make our automated workflows more secure and safer than manual processes.
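To make "codified guardrail rules" a little more concrete, here is a minimal sketch in Python. It is an illustration under assumptions, not the tooling described in this post: the command names, the host limit, and the `attested` flag (standing in for the result of a hardware attestation check performed elsewhere) are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical policy values, chosen only for illustration.
ALLOWED_COMMANDS = frozenset({"restart-agent", "rotate-logs"})
MAX_HOSTS_PER_RUN = 5


@dataclass(frozen=True)
class Request:
    command: str         # name of the operation the automation wants to run
    attested: bool       # assumed result of an attestation check done elsewhere
    hosts_affected: int  # how many hosts the operation would touch


def evaluate(request: Request) -> tuple[str, list[str]]:
    """Return ("auto_approve", []) or ("escalate", [reasons])."""
    reasons = []
    if request.command not in ALLOWED_COMMANDS:
        reasons.append("command is not on the allowlist")
    if not request.attested:
        reasons.append("host attestation failed")
    if request.hosts_affected > MAX_HOSTS_PER_RUN:
        reasons.append("too many hosts affected")
    if reasons:
        # Any failed rule sends the request to a human instead of running it.
        return "escalate", reasons
    return "auto_approve", []


if __name__ == "__main__":
    print(evaluate(Request("restart-agent", True, 2)))
    print(evaluate(Request("restart-agent", False, 40)))
```

In a sketch like this, the rules play the role of the second engineer: a request that passes every check runs without a human in the loop, and anything that fails a check is escalated together with the reasons it failed.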
After launching the automations and monitoring the secondary on-call shifts for 4-5 months, we found there were zero events that needed another on-call engineer. Our on-call ticket queue is now small enough that a single on-call engineer can easily handle the load. So we pulled the plug on the secondary on-call rotation.
A careful reader might ask: wait a minute, you said you automated 99% of the manual work, so what about the last 1%? Well, that last 1% is the set of cases that do require expertise from Subject Matter Experts (SMEs) and real human judgment, but they almost never need to be handled in the middle of the night. We can cover them during work hours, when the whole team is online. We wrote an SOP so that on-call engineers can rely on escalation managers to find an SME when they absolutely need one immediately.
The moral of the story: for automation projects, don't shoot for 100%, shoot for 99%. Automate the mechanism, and leave the 1% of advanced decision making to human experts.