When it is too hard ... don't do it!

The topic might sound controversial, but it is a scenario we face often in software design. It is especially common for tier-0 distributed services like KMS, which need to operate at extreme scale, under tight security constraints and with very few dependencies. A recurring debate in our design reviews is: “Should we reuse a solution from another AWS service, or should we build one from scratch in KMS?” The former reduces development time, but it also introduces dependencies that might hurt KMS’s security, durability, availability and static stability. The latter could involve heavy development and operational cost, and it could become a snowflake solution that is a nightmare to support in the future. Engineering is an art of balanced trade-offs, isn't it?

When we run into a hard problem in a design, we need to ask:

1. Can we make the problem go away? Is there a simpler solution that does not involve solving this problem?
2. Can we make it other people’s problem? If other people have solved similar problems, can we reuse their solutions, for example by calling their service?
3. Can we delay solving the problem? Time is often our best friend in problem solving. The future “we” will have more data to learn from, will have gained more experience and will be in a better position to solve the problem than the “we” of today.

But there are problems we should solve in our design, even if they are hard. For example, KMS is a general-purpose provider of cryptographic primitives in the AWS Cloud. We have to solve the cryptography problem head on. Delegating our main value to another AWS service would defeat the whole point of being the root of trust for AWS data security.

There are also hard problems in KMS that are extremely important, yet we leverage other AWS services to solve them.
For example, “To help ensure that your keys and your data is highly available, KMS stores multiple copies of encrypted versions of your keys in systems that are designed for 99.999999999% durability.” Should KMS build a storage engine with 99.999999999% durability itself? No. If we had decided to do that, we would still be building the storage engine 8 years later, with no sign of completing the first cryptography API call that serves customers.

Similarly, in distributed computing we often run into tricky problems like:

1. Distributed transactions
2. Leader election and consensus
3. Load balancing
4. Data replication
5. Public Key Infrastructure
6. Transport Layer Security (TLS)
7. ...

For most engineering teams, these are NOT the type of problems they should invest time and resources in solving directly. Finding an existing solution that best fits their context and reusing it is the better choice.

The bottom line is: if you feel like you are solving a hard computer science problem in a project, think twice. You might be solving the wrong problem!


Last updated 1 year ago
