How AWS SDM Dive Deep (5) - Correction of Error (COE)
My experience at AWS six years ago started with a COE presentation. I stood in front of the AWS weekly OPS meeting, which included a big chunk of AWS engineering senior leadership and 200-300 SDMs and SDEs from different services. I was two months into AWS; to say I was terrified would be an understatement. Since then I have written and participated in a lot of COEs. I joined a group called COE Bar Raiser, which basically means I help other teams improve the quality of their COE writing. AWS is serious about COEs. As an SDM, you should be too.

Here are some basics about COE:

1. We learn from our failures. To be more precise, we learn by reflecting on our failures. COE is the process at Amazon that institutionalizes the learning from failures and shares it across the company.
2. COE is about process failure, not human mistakes. No matter how severe the impact is, naming people, as in "xxx did that", is not allowed. For a COE that needs to be presented to a large audience, the best practice is for the team's SDM to be the presenter, not the engineers. SDMs need to be accountable for their team's failures.
3. COE is a data-driven process. It starts by quantifying the customer impact, then uses timeline data to establish the context. Here is the intriguing part: the COE author needs to ask at least five "whys" to get to the bottom of the root cause. For new COE writers, this is probably the hardest part: asking the whys logically, from the surface layer to the deep end, one step at a time. For example, "We had a bug in the code" is not a good root cause. Saying we had bugs in the code is basically saying "we are human, all too human". Why was the bug not caught by code review? Should we add a checklist for future code reviewers? Can our automated tests catch the bug? Can we reduce the blast radius of the bug during development and deployment? Can we detect the problem earlier using our operational metrics and alarms? We can keep asking like this. An SDM should be the champion of asking the whys.
4. COE actions need to be prioritized. They need to be specific, measurable, relevant, and time-bound, just like SMART goals. SDMs are usually the owners of the actions.

If you have worked at AWS but have never written a COE, or presented one in front of the AWS weekly OPS meeting, man, you have not lived! You might be like one of those lucky doctors who claim they have cured every single patient they met: their hands are clean ... or maybe those hands just lack real-life experience. As an SDM, a COE can be your best learning opportunity at AWS. Enjoy it when you can.