How AWS SDMs Dive Deep (1) - Dashboard
"How should a new SDM dive deep into their service?" one manager asked while I was taking a break with a few other SDMs.

"Just read the source code and figure it out," said one manager, who had been an SDE before transitioning to the SDM role.

"Well, reading source code is not necessarily the best way to learn a service as a beginner. You might get lost in the weeds and miss the big picture. An SDM's deep dive should start from how customers use the service," another manager disagreed. He came from a product management background.

"There we go. Start from the customer and work backwards. It is called Customer Obsession, dude!" I laughed. "We can learn a service from many perspectives: user experience, data flow, infrastructure, security posture, you name it! But let's talk about user experience first."

The question is: how do we quantify user experience? Most of us in AWS work on API services one way or another. To measure the user experience of API services, we build service dashboards.

You may ask: "How does knowing a few graphs on a dashboard require diving deep?" It depends on what questions we ask about the graphs. An API service's dashboard should include at least four pillar graphs: availability, requests per second (RPS), latency, and resource utilization. Sounds simple, right? Now let's dig in.

For availability:

1. What is the SLA for your service availability? Is it 99.9%, 99.99%, or 99.999%? How was the SLA decided?
2. How is availability measured? Is it per second, per minute, or per hour?
3. Where is availability measured? Load Balancer? API Gateway? Your frontend fleet? Your backend fleet? How do the availabilities of these layers influence your overall service availability?
4. Can you explain the availability drops on your graph? What were the root causes? How did they impact your customers?

For latency:

1. Which latency percentile are you monitoring? Is it p99, p99.9, or p99.99? Have you considered trimmed-mean metrics?
2. Latency is not a single number but a shape. What is the shape of your latency distribution? How is it trending over time?
3. What are your alarm thresholds for latency?
4. For latency spikes on your graph, do you know which components or dependencies triggered them?

…

See, starting from the service dashboard, we can go infinitely deep with the SDM, asking questions about customer experience, system performance, system resilience, scalability, and so on. The dashboard is just the tip of the iceberg. You can find the whole iceberg if you are willing to go down. In fact, AWS has a tradition of randomly picking a service in weekly operations meetings and asking the service's SDM to explain their dashboard on the spot. Depending on how well you know your dashboard, this can be your moment of glory or your moment of shame.

So here is my first piece of advice to new SDMs who want to dive deep: start from your service dashboard and ask many questions.
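To make a few of the availability and latency questions above concrete, here is a minimal Python sketch. It is not how any AWS dashboard actually computes its metrics. The function names, the 30-day period, the nearest-rank percentile, the simplified trimmed mean, and the assumption that layers in the request path fail independently are all illustrative assumptions.

```python
import math
import statistics


def downtime_budget_minutes(sla_percent, period_days=30):
    """Minutes of unavailability allowed by an SLA target over a period.

    Assumption: a 30-day period by default; real SLAs define their own window.
    """
    return (1 - sla_percent / 100) * period_days * 24 * 60


def availability_percent(successes, total):
    """Availability of one measurement window (e.g. one minute) as a percentage."""
    return 100.0 * successes / total if total else 100.0


def serial_availability_percent(layer_percents):
    """Combined availability when a request must pass through every layer.

    Assumption: layers fail independently, so their availabilities multiply.
    """
    combined = 1.0
    for percent in layer_percents:
        combined *= percent / 100
    return combined * 100


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def trimmed_mean(samples, upper_percent):
    """Mean of the samples at or below the given percentile.

    Assumption: a simplified trimmed mean that only drops the slowest tail.
    """
    cutoff = percentile(samples, upper_percent)
    kept = [s for s in samples if s <= cutoff]
    return statistics.fmean(kept)


# Hypothetical data: 98 fast requests, one slower request, one extreme outlier.
latencies_ms = [10] * 98 + [20, 1000]

for target in (99.9, 99.99, 99.999):
    print(f"SLA {target}%: {downtime_budget_minutes(target):.3f} min per 30 days")

print("frontend + backend + load balancer:",
      round(serial_availability_percent([99.99, 99.95, 99.9]), 4))
print("mean:", statistics.fmean(latencies_ms))
print("p50:", percentile(latencies_ms, 50))
print("p99:", percentile(latencies_ms, 99))
print("trimmed mean (up to p99):", round(trimmed_mean(latencies_ms, 99), 3))
```

With the made-up latencies, a single outlier doubles the plain mean while p50 stays flat, which is one reason to look at the shape of the distribution rather than one number. Likewise, stacking layers that each meet their own target can still leave the end-to-end availability below the target of any single layer.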