Quality starts with local stack

The irony of the cloud computing movement is that, on the one hand, it makes procuring and provisioning hardware, infrastructure and networks so much easier; on the other hand, developing and testing on such distributed systems has become a lot more difficult than before. I remember when I started in the telecommunication industry decades ago, each engineer had a mini digital voice switch in their cubicle. That was pretty much all we needed to develop software for an embedded system. Nowadays a complex software system typically consists of components distributed across many machines, a system “whose behavior is intrinsically difficult to model due to the dependencies, competitions, relationships, or other types of interactions between their parts or between a given system and its environment.”

Engineers often rely on preproduction stacks to test their non-trivial changes. But testing in preproduction stacks means long feedback loops: from hours to days. Preproduction stacks are also shared resources. If one fat finger breaks them, the whole organization suffers.

Efficient engineering requires a much faster feedback loop. We want to know whether our changes work within seconds or minutes. Unit tests give good feedback, but they mock dependencies and bake in assumptions, so they lose some fidelity to production systems.

So where can we establish a fast feedback loop for developing complex distributed systems? We use a local stack, where the whole system can be run inside a single EC2 machine. In the last few months we dedicated a team charter to improving KMS’ local stack. We set three goals for an efficient local stack:

1. Fast: for new team members, setting up a local stack for the KMS system should take less than one hour. We are not 100% there yet, but we are getting there.
2. High fidelity: we want the local stack to support the majority of KMS features, exactly the way they function in production systems.
3. Reliable: if the local stack constantly breaks, engineers will lose faith in it, and they will fall back to preproduction or production systems for testing.

We learned that keeping the local stack reliable is really hard. It runs on each individual engineer’s desktop, so we have no observability into how it operates. We often had merges that broke the local stack, and the breakage was only discovered by the next poor soul who rebased and needed to work on it. Before we dedicated a team to the local stack, whoever discovered the problem had to fix it. The process relied on individual heroism, and we all know that won’t last…

We recently introduced local stack metrics and a dashboard, just like the ones we use to monitor preproduction and production systems. They give us insights into and measurements of our local stack quality, and show us where to invest to get the best returns.

So in summary, developing distributed systems is hard. If you want to have a high quality product, start with your local stack!
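To make the idea of local stack metrics a bit more concrete, here is a minimal sketch of what collecting them could look like. It is not the tooling described in this post: the names `CheckResult`, `run_check`, `summarize` and `append_metrics`, and the JSON-lines file format, are all hypothetical and chosen only for illustration. The sketch runs a set of health checks against a local stack, records whether each one passed and how long it took, and appends the results to a file that a dashboard could later read.

```python
import json
import time
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass
class CheckResult:
    """Outcome of one local stack health check (hypothetical record shape)."""
    name: str
    ok: bool
    duration_seconds: float
    error: str = ""


def run_check(name: str, check: Callable[[], None]) -> CheckResult:
    """Run one check; a raised exception counts as a failure."""
    start = time.monotonic()
    try:
        check()
        return CheckResult(name, True, time.monotonic() - start)
    except Exception as exc:
        return CheckResult(name, False, time.monotonic() - start, str(exc))


def summarize(results: List[CheckResult]) -> Dict[str, float]:
    """Reduce a run to a few numbers a dashboard could plot over time."""
    total = len(results)
    passed = sum(1 for r in results if r.ok)
    return {
        "checks_total": total,
        "checks_passed": passed,
        "success_rate": passed / total if total else 0.0,
    }


def append_metrics(path: str, results: List[CheckResult]) -> None:
    """Append one JSON object per check so the history accumulates in one file."""
    with open(path, "a", encoding="utf-8") as f:
        for r in results:
            f.write(json.dumps(asdict(r)) + "\n")
```

In a setup like this, each engineer's local stack would call `run_check` for steps such as "stack starts" or "a basic request succeeds", and the appended records would be shipped to wherever the team's dashboards live. The point is only that a broken local stack shows up as a number, rather than as a surprise for the next person who rebases.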


Last updated 1 year ago
