Monitoring Your Service Health Like a Human's

When it comes to monitoring the health of a complex service with multiple components, I find it useful to think of it like monitoring a human body. Some checks are like monitoring heart rate and blood pressure: essential for immediate well-being. Others are like regular screenings for cholesterol or bone density: important, but not critical in the immediate term.

* Vital Signs: Direct Customer-Impacting Components

In any service, the signals that directly affect customers are your vital signs. Traffic volume, error rate, latency, and system utilization are like a human's heart rate or respiratory rate. When they go wrong, your customers feel the impact instantly. These signals tell us whether the service is alive and kicking, that is, healthy enough to serve customer traffic. As with monitoring a human body, there are two types of checks:

-- Active Checks: Like taking a pulse, send regular pings to these essential components. AWS Route53's health check (https://lnkd.in/ggTbTsBs) is a good example of an active check.
-- Passive Checks: Keep an eye on error rates and latencies, similar to how irregular breathing can be a passive indicator of a health issue. An AWS CloudWatch metric/alarm is a good example of a passive check.

If these "vital signs" of a service instance are not stable, immediate action is required. The service instance should be flagged as "unhealthy" and, for example, taken out of your load balancer so that it stops serving customer traffic. This is similar to rushing a patient to emergency care.

* Long-Term Health: Indirect Customer-Impacting Components

On the flip side, some components' health monitors are your long-term health indicators, similar to cholesterol levels or blood sugar in a human body. A failure here won't cause immediate symptoms but may lead to health issues if not addressed. For these components, use different monitoring setups, alarms, and mitigation strategies.
The response here could be a “doctor's appointment” (an engineer checking the system) rather than an “emergency room visit” that takes the service instance completely offline. The objective is to monitor and treat before the condition worsens and potentially affects other systems, much as people manage their lifestyle to avoid long-term health problems.

So when setting up health checks in a multi-component service, think of it as a human body and ask yourself: "Is this a vital sign or a long-term health marker?" Prioritize accordingly, and align your monitoring and mitigation strategies with the impact on customer experience. This dual focus allows for a robust, resilient system that keeps both immediate and future customer needs in sight. Just like in healthcare, the right monitoring and timely intervention can make all the difference between a system that merely survives and one that thrives.
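To make the "vital sign or long-term marker" question concrete, here is a minimal Python sketch of how the triage could look. Everything in it is an assumption for illustration: the names `Impact`, `Action`, `HealthCheck`, `error_rate_ok`, and `decide_action`, as well as the 5% error-rate threshold, are hypothetical and not part of any AWS API. A real setup would wire the probes to your own pings and metrics.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Impact(Enum):
    VITAL_SIGN = "vital_sign"  # directly customer-impacting
    LONG_TERM = "long_term"    # indirectly customer-impacting


class Action(Enum):
    NONE = "none"
    NOTIFY_ENGINEER = "notify_engineer"            # the "doctor's appointment"
    REMOVE_FROM_ROTATION = "remove_from_rotation"  # the "emergency room visit"


@dataclass
class HealthCheck:
    name: str
    impact: Impact
    probe: Callable[[], bool]  # returns True when the check passes


def error_rate_ok(errors: int, requests: int, max_rate: float = 0.05) -> bool:
    """Passive check: compare an observed error rate to a threshold.

    The 5% default is an arbitrary, illustrative value.
    """
    if requests == 0:
        return True  # no traffic observed, so nothing to judge
    return errors / requests <= max_rate


def decide_action(checks: List[HealthCheck]) -> Action:
    """Return the most severe action warranted by the failing checks."""
    action = Action.NONE
    for check in checks:
        if check.probe():
            continue  # this check is healthy
        if check.impact is Impact.VITAL_SIGN:
            # A failing vital sign outranks everything else.
            return Action.REMOVE_FROM_ROTATION
        action = Action.NOTIFY_ENGINEER
    return action


if __name__ == "__main__":
    checks = [
        # Passive vital-sign check: 2 errors out of 100 requests.
        HealthCheck("error_rate", Impact.VITAL_SIGN, lambda: error_rate_ok(2, 100)),
        # Long-term marker that is currently failing (hard-coded for the demo).
        HealthCheck("disk_headroom", Impact.LONG_TERM, lambda: False),
    ]
    print(decide_action(checks).value)
```

With these sample inputs, the vital sign passes and only the long-term marker fails, so the instance stays in rotation and an engineer gets notified instead. Keeping the two impact levels as separate categories is what lets the response differ.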

https://aws.amazon.com/builders-library/implementing-health-checks/

