How AWS SDM Dive Deep (3) - Infrastructure

AWS follows a unique end-to-end service ownership model: the software development team that implements a service also owns the infrastructure that runs it. We don’t have a dedicated operations team or SRE team, so it is essential for an SDM in AWS to have a deep understanding of the infrastructure their service runs on. KMS has a tech talk called “A request journey in KMS” that we rerun every year for new team members. The talk covers the infrastructure a customer request passes through in the different layers of the KMS stack. A new SDM of a service can follow the same thought experiment: imagine you can ride a request to your service, and ask the following questions along the journey:

1. Where is your DNS record? For AWS services, the DNS record will be in Route 53, but you can simply use the “dig” command to find out. DNS settings matter if your service does cross-AZ (Availability Zone) or cross-region failover: they let you quickly shift customers’ traffic away from data centers that are having an outage.

2. What is your load balancer solution? The load balancer is the first hop where a request enters your service infrastructure. Is it per AZ or per Region? If you use AWS Elastic Load Balancing, is it an NLB, which distributes load at the TCP layer, or an ALB, which distributes load at the HTTP layer? What algorithm do you use for load balancing: round robin, weighted by connection count, or some other factor, and why? Do you have DDoS attack protection? If you get a sudden traffic spike, how fast can your load balancer scale up?

3. Where are your service hosts? Which EC2 instance types does your service use? Do they run inside the service’s own VPC or on another network fabric? Do you have host health monitoring? Do you have heartbeat mechanisms to take bad hosts out of your load balancer? Do you use EC2 Auto Scaling? What thresholds trigger a scale-out? How do you keep host patching up to date?

4. Where do you terminate your TLS connections? Do you use a reverse proxy on the host, or do you rely on a solution like Amazon API Gateway to handle TLS? Where do you get your X.509 TLS certificate? Who is your certificate authority? When will your certificate expire? Do you have automated certificate rotation?

5. How many layers do you have in your service? How do they communicate with each other? Do you have retries and circuit breakers for inter-layer RPC calls? Do you have asynchronous workflows? How are they orchestrated?

6. How does your service authenticate and authorize requests? Do you have throttling mechanisms to prevent noisy neighbors from overwhelming your service? What are the external dependencies of your service? If they go down, can your service degrade gracefully?

…

Services in AWS are by nature large-scale distributed systems. To understand a service, an SDM must know the infrastructure it runs on.
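The certificate questions in item 4 can usually be answered straight from the endpoint itself. Below is a minimal sketch using only the Python standard library. It is not how any AWS service actually monitors its certificates; the function names and the 30-day threshold are assumptions made up for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days left before a certificate's notAfter timestamp.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires_at - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection and return the leaf certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_rotation(not_after: str, now: datetime, min_days: float = 30.0) -> bool:
    """True when the certificate expires within `min_days` (an assumed threshold)."""
    return days_until_expiry(not_after, now) < min_days
```

For example, `needs_rotation(fetch_not_after("example.com"), datetime.now(timezone.utc))` checks a live endpoint (the host here is a placeholder). If a check like this is the only thing standing between you and an expired certificate, that is itself an answer to the "automated rotation" question.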

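For the retry and circuit breaker question in item 5, it helps to have a concrete picture of what the two mechanisms do together. The sketch below is a simplified, single-threaded illustration, not the implementation used by KMS or any AWS service; the class name, thresholds, and backoff values are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without trying."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency marked unhealthy")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result


def call_with_retry(breaker: CircuitBreaker, fn: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry with exponential backoff, but stop as soon as the breaker opens."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return breaker.call(fn)
        except CircuitOpenError:
            raise  # do not keep hammering a dependency the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

The design point worth asking your team about is the interaction: retries help with transient errors, while the breaker prevents those same retries from amplifying load on a layer that is already failing.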