How AWS SDM Dive Deep (6) - OPS Review
When I joined AWS, the first thing that impressed me was the weekly operational (OPS) review meeting, where AWS services can be randomly picked to present their dashboards in front of several hundred colleagues. SDMs of the picked services are required to explain how their customer experience is measured; if there is an availability drop on their dashboard, they need to know what happened; if there is a latency spike on their dashboard, they need to know how their customers were impacted… What shocked me was the number of executives and senior engineering leaders who participated in the weekly review meeting, and how deep they would go when questioning the dashboard presenters. When your senior leadership cares deeply about operational quality and understands it, that creates a culture of operational excellence from the top down.

To avoid being humiliated in the AWS weekly ops meeting, services commonly do their own weekly operational review before the AWS-wide review. This trickles down to each component owner of the services.

In KMS, we usually have on-call engineers drive our own operational review meeting. We go over the events that paged the primary on-call first, and discuss their customer impact and our mitigations. Then the secondary on-call presents a trend analysis of the operational metrics from their shift: the anomalies that didn't page on-call but are worth calling out, the ongoing operational pains we want to address, etc.

But SDMs are not just sitting there daydreaming. They ask questions, sometimes inconvenient questions:

1. For a ticket that paged on-call but required no action, why was that pain necessary? Is the ticket at the right severity?
2. For the ones that did trigger on-call actions, was the response fast enough? Was the runbook clear and easy to follow?
3. But if we have a clear runbook, can we not automate the actions?
4. Are we monitoring the right metrics? Are they measuring the right customer experience?
5. Do we have reliable baselines of our key performance indicators?
6. Do we have too much noise in our auto-cut tickets? Is it time to adjust alarm thresholds?
...

Besides the weekly ops reviews, KMS also has monthly performance engineering reviews, monthly capacity engineering reviews, and quarterly static stability reviews. These reviews have dedicated tech leads as owners. The tech leads prepare the content, but SDMs are expected to be actively involved in the discussions and to follow through on the actions decided in the reviews. They need to walk the walk, not just talk the talk.

The ops review is the heartbeat that measures the health of the service and the health of the team. To survive in AWS, SDMs need to have operational excellence in their blood!

For new SDMs: if you don't have a weekly ops review for your service yet, start one.
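
As a starting point for such a review, questions 1 and 6 above (pages that needed no action, noisy auto-cut tickets) can be turned into a number to look at each week. The sketch below is only an illustration and not how KMS or AWS does it: the `Page` record, the alarm names, and the 50% cutoff are all hypothetical, and in practice the data would come from whatever ticketing or paging system your team uses.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    alarm: str           # hypothetical alarm name
    action_taken: bool   # did the on-call actually have to do anything?


def noise_by_alarm(pages):
    """Return {alarm: fraction of its pages that required no action}."""
    total = defaultdict(int)
    no_action = defaultdict(int)
    for page in pages:
        total[page.alarm] += 1
        if not page.action_taken:
            no_action[page.alarm] += 1
    return {alarm: no_action[alarm] / total[alarm] for alarm in total}


def review_candidates(pages, max_noise=0.5):
    """Alarms whose no-action share exceeds max_noise, noisiest first.

    max_noise is an assumed cutoff; pick one that suits your service.
    """
    noise = noise_by_alarm(pages)
    flagged = [(alarm, ratio) for alarm, ratio in noise.items() if ratio > max_noise]
    return sorted(flagged, key=lambda item: item[1], reverse=True)


# Made-up pages from one on-call shift, for illustration only.
week = [
    Page("latency_p99_high", action_taken=False),
    Page("latency_p99_high", action_taken=False),
    Page("latency_p99_high", action_taken=True),
    Page("availability_drop", action_taken=True),
]

for alarm, ratio in review_candidates(week):
    print(f"{alarm}: {ratio:.0%} of pages needed no action")
```

An alarm that shows up in this list week after week is a natural candidate for the severity and threshold questions; the number does not answer them, it only tells you where to start asking.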