How we got rid of log diving

One of the essential skills for surviving KMS on-call is being good at log diving. We use the word “diving” quite literally: as a tier-0 AWS service, we deal with an ocean of logs coming from different layers. Seasoned engineers tend to be wizards at getting the logs they need quickly, while new engineers may have nightmares about drowning in logs.

But this is pretty much history. Over the years, we have been building big data solutions to stream KMS operational logs into data lakes so we can query them in real time. Last week we launched the icing on the cake: for tickets that meet certain conditions, a bot queries the relevant logs and pastes them into the tickets automatically. When on-call engineers open the tickets, the logs are right in front of them. All they have to do is analyze the data, make a decision, and take appropriate action. Basically, we removed the time-consuming, laborious “log diving” step from ticket handling!

Machines are built to do repetitive work. They never feel tired or bored. We humans evolved differently. We suck at repetitive work, but we are good at spotting patterns and making decisions, whether based on rational reasoning or on intuition. Machines cannot have intuition, at least not in the foreseeable future.

This feature represents where we are going in pursuing operational excellence: automate the mechanisms as much as we can, but leave complicated decision making to human operators.

But what about the simple use cases where the decisions are always the same? Well, in these cases, we codify the decision making and turn on the full self-driving mode.

We didn’t get to fully automated log diving in one shot. In fact, we first delivered an intermediate milestone that let on-call engineers type one command to get the relevant logs.

The formula looks like this:

1. Data streaming and formatting into the data lake - fully automated
2. Data query - semi-automated with one command
3. Data analysis and decision making - manual or with semi-automated tools
4. Action to take - semi-automated with one command

In the automated log diving launch, we made step 2, the data query, fully automated. A small step forward.
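
To make the formula concrete, here is a minimal sketch that models the four steps and their automation levels, then promotes the data query step the way the launch did. The names (`Mode`, `Step`, `promote`) are made up for illustration and are not our actual tooling.

```python
from dataclasses import dataclass, replace
from enum import Enum


class Mode(Enum):
    MANUAL = "manual"
    SEMI_AUTOMATED = "semi-automated"
    FULLY_AUTOMATED = "fully automated"


@dataclass(frozen=True)
class Step:
    name: str
    mode: Mode


# The four steps of the formula, as they stood before the launch.
# Analysis is listed as MANUAL here for simplicity; in practice it can
# also be done with semi-automated tools.
PIPELINE = [
    Step("stream and format data into the data lake", Mode.FULLY_AUTOMATED),
    Step("query data", Mode.SEMI_AUTOMATED),
    Step("analyze data and decide", Mode.MANUAL),
    Step("take action", Mode.SEMI_AUTOMATED),
]


def promote(pipeline, step_number, mode):
    """Return a new pipeline with one step (1-indexed) switched to `mode`."""
    return [
        replace(step, mode=mode) if i == step_number else step
        for i, step in enumerate(pipeline, start=1)
    ]


# The launch: only the data query step changes.
after_launch = promote(PIPELINE, 2, Mode.FULLY_AUTOMATED)
```

Each launch moves exactly one step up a level and leaves the rest untouched, which is what keeps every increment small.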

This incremental style of automation also provides a fallback when the fully automated process goes astray. In an emergency, human operators can always stop the full self-driving mode, run the semi-automated data collection command, make the decision, and invoke the appropriate actions.
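
The fallback idea can be sketched roughly as follows. Everything here is hypothetical: `fetch_logs`, `LogDivingBot`, the `eligible` flag, and the ticket dictionary are stand-ins for illustration, not the real bot or ticketing API.

```python
def fetch_logs(ticket_id):
    """Stand-in for the real log query; returns placeholder text."""
    return f"logs for {ticket_id}"


class LogDivingBot:
    def __init__(self):
        # Operators can flip this off to stop the full self-driving mode.
        self.self_driving = True

    def on_ticket_opened(self, ticket):
        """Attach logs only when self-driving is on and the ticket qualifies."""
        if self.self_driving and ticket.get("eligible", False):
            ticket["logs"] = fetch_logs(ticket["id"])
        return ticket


def manual_query(ticket):
    """Semi-automated fallback: the same query, triggered by a human."""
    ticket["logs"] = fetch_logs(ticket["id"])
    return ticket


bot = LogDivingBot()
auto_ticket = bot.on_ticket_opened({"id": "T1", "eligible": True})

# Emergency: take back the wheel, then collect the logs with one command.
bot.self_driving = False
held_ticket = bot.on_ticket_opened({"id": "T2", "eligible": True})
had_logs_before_manual = "logs" in held_ticket
recovered_ticket = manual_query(held_ticket)
```

In this sketch the automated path and the manual command share the same query function, so the fallback is not a separate, rarely exercised code path.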

The moral of the story: automation is an incremental improvement process. Develop a semi-automated mode before turning on the full self-driving mode. And always have a way to take back the wheel if needed.


