Why being oncall is a good thing
I am oncall today, so I have more time to contemplate work and life. Many people prefer jobs that have no oncall duty, or as little as possible. At its worst, oncall is considered the destroyer of work-life balance; at its best, a necessary evil. But I found that I learned most of what I know in AWS through its peculiar oncall and operational excellence culture.
AWS has practiced an extreme version of DevOps since before the term "DevOps" was even defined: software engineering teams own their services end to end. They not only design and develop the services, they also operate them directly, 24/7. Being oncall is part of our DNA. It is a high bar for our engineers. They have to know infrastructure, networking, load balancers, DNS - everything that is required to operate our services.
The beauty of this extreme ownership model is that our engineers have to design for observability, maintainability, and other properties of operational excellence from the very beginning of designing a feature. In KMS, we actually require an Operation Readiness Plan (ORP) as part of the design review phase for any nontrivial feature launch. We work backwards from operational excellence just as we work backwards from customer requirements using the PR/FAQ (Press Release/Frequently Asked Questions) process. If we don't design our services to be operationally friendly, we are the first ones to pay the price - we get paged in the middle of the night!
Being oncall allows us to understand customers' demands and pain points intimately. Handling a customer escalation or a large service event directly is very different from being in a supporting role during such situations, as happens when separate operations teams or SRE teams manage the services and infrastructure. How do we really learn about scalability if we have never seen with our own eyes how p99.9 latency responds to an increase in customers' traffic?
Being oncall also lets us learn about features we were not involved in during the development phase. Oncall is a chance to lift our heads above the trees and get a view of the forest. It is a breeding ground for innovation because we learn firsthand how customers really experience our services.
How does oncall impact our work-life balance? Well, there are definitely pains. But some pains are necessary in life. What we need to avoid is suffering. We have rotations to make sure everyone takes turns being oncall. When the team is large enough, being oncall for 7 days every 8-10 weeks is not that big a deal. We also invest heavily in automation and noise reduction. In KMS, we have many bots (we affectionately call one of them Woodpecker) that help us collect metrics and logs, automatically diagnose root causes, and recommend mitigations.
So if you really want to know your service and your customers, and if you want to be good at problem solving and troubleshooting, get into oncall!