On dashboards
When I joined AWS, I was blown away by the number of metrics we collect, and by how easily we can graph them, set alarms on them, and cut tickets on them for anomaly detection. But with thousands of metrics and graphs to monitor, the data volume is beyond any human's mental capacity. The question is: where do we start to understand customers' real experience?

One of my first contributions to KMS five years ago was to develop a framework to standardize dashboard best practices. We built a Single Page Application (SPA) using ReactJS to define templates that group meaningful graphs together with annotations. We followed a data-driven approach so that engineers who want to change a dashboard can simply submit JSON files in a Code Review (CR), without any knowledge of the JavaScript framework (a sketch of what such a template might look like appears at the end of this post). But the technology is not today's topic. The important things we had to decide for a service dashboard were:

1. What are the metrics that measure customers' most important experience?
2. What are the metrics that measure the health and utilization of the most critical resources that might impact customers?

After many iterations we ended up with a pattern of four metrics that are the most critical for KMS APIs:

1. Total Requests Per Second (RPS).
2. Availability, defined as the number of successful requests over total requests. KMS is a 99.999% service; all our availability baselines start from five 9s.
3. High-percentile latency. Based on the volume of the API, we might choose to monitor P90, P99, P99.9 latency, or above. (Both this and the availability definition are sketched in code at the end of this post.)
4. Resource utilization. Choosing how to measure resource utilization is an art in large services because there are so many factors you can measure: CPU, memory, thread pools, disk I/O, network, etc. Choose the ones that are most likely to be the bottlenecks of your scalability.

We repeat this four-metric pattern across all our dashboards, with the most-used APIs on top. We want to make sure that when engineers and stakeholders look at the KMS dashboard for three seconds, they can answer from that first glance: is KMS OK?

We found that many metrics and graphs are more useful for troubleshooting specific problems, so we moved them into runbooks. We built bots that monitor tickets with certain names and augment those tickets with the relevant graphs. In some cases, we even provided auto-diagnosis of the likely root cause and mitigation. (In case you are curious, we used sheer statistical math, not machine learning; the automated reasoning needs to run in real time. A sketch of one such check appears at the end of this post.) The bots turned out to be super useful for on-call engineers paged in the middle of the night: when they opened the ticket that paged them, the relevant metrics, graphs, and suggestions were right in front of them, ready to act on.

The moral of the story: a dashboard is about what stakeholders need to know, not what you can put there. If you care about the user experience of your service, start with a good dashboard and work backwards.
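For the curious, here is a minimal sketch of what one of those JSON dashboard templates might look like, written as a typed TypeScript object. Every name in it (api, rows, title, metric, stat, annotation) is my own illustration, not the actual KMS schema.

```typescript
// Hypothetical shape of a dashboard template. All field names here are
// illustrative; the real KMS schema is internal and not shown in this post.
interface GraphSpec {
  title: string;        // graph title shown on the dashboard
  metric: string;       // metric identifier in the monitoring system
  stat: "rps" | "availability" | "p90" | "p99" | "p99.9" | "utilization";
  annotation?: string;  // context for the engineer reading the graph
}

interface DashboardTemplate {
  api: string;          // API name; most-used APIs are listed first
  rows: GraphSpec[];    // the four-metric pattern, one graph per entry
}

// An engineer adding a dashboard submits a JSON document shaped like this
// in a CR; the ReactJS SPA renders it without any JavaScript changes.
const encryptDashboard: DashboardTemplate = {
  api: "Encrypt",
  rows: [
    { title: "Requests Per Second", metric: "Encrypt.Requests", stat: "rps" },
    { title: "Availability", metric: "Encrypt.Success", stat: "availability",
      annotation: "Baseline: 99.999%" },
    { title: "P99.9 Latency", metric: "Encrypt.Latency", stat: "p99.9" },
    { title: "Thread Pool Utilization", metric: "Encrypt.Threads", stat: "utilization" },
  ],
};
```

The point of the data-driven design is that adding a new API's dashboard is a JSON change, reviewed like any other CR, with no edits to the React code.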
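The availability and latency metrics from the four-metric pattern have simple definitions. A minimal sketch, assuming raw success counts and a raw latency sample (real monitoring systems aggregate these incrementally rather than sorting raw samples):

```typescript
// Availability as defined above: successful requests over total requests.
function availability(successCount: number, totalCount: number): number {
  return totalCount === 0 ? 1 : successCount / totalCount;
}

// Five 9s baseline: 99.999% allows roughly 1 failure per 100,000 requests.
const FIVE_NINES = 0.99999;

// High-percentile latency: sort the observed latencies and index in.
function percentile(latenciesMs: number[], p: number): number {
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.min(sorted.length - 1, Math.max(0, idx))];
}

// Example: 1 failure in 200,000 requests still clears the baseline.
console.log(availability(199_999, 200_000) >= FIVE_NINES); // true
console.log(percentile([12, 15, 14, 210, 13], 99));        // 210, the outlier
```

This is also why high percentiles matter: the P99 of that tiny sample is the outlier itself, which an average would have hidden.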
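And for the bots' auto-diagnosis, here is one example of the kind of real-time statistical check I mean. This particular z-score rule is my own illustration, not the exact math the KMS bots use:

```typescript
// Flag the latest datapoint as anomalous if it sits more than `k` standard
// deviations from the mean of a recent window. Cheap enough to run in real
// time on every ticket, with no model training involved.
function isAnomalous(window: number[], latest: number, k = 3): boolean {
  const n = window.length;
  if (n < 2) return false; // not enough history to judge
  const mean = window.reduce((s, x) => s + x, 0) / n;
  const variance = window.reduce((s, x) => s + (x - mean) ** 2, 0) / n;
  const stddev = Math.sqrt(variance);
  if (stddev === 0) return latest !== mean;
  return Math.abs(latest - mean) / stddev > k;
}

// A bot evaluating recent one-minute P99 latency datapoints (in ms):
const recentP99 = [21, 22, 20, 23, 21, 22, 20, 21, 23, 22];
console.log(isAnomalous(recentP99, 95)); // true: a spike worth surfacing
console.log(isAnomalous(recentP99, 22)); // false: within normal variation
```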