How we turned operational problems into big data innovation
Like most large services, KMS went through a period of struggling with all kinds of operational problems: mysterious availability drops, latency spikes, dead hosts, dependency failures, customer escalations, you name it. This is a story I am very proud of, because we turned KMS' operational problems into big data innovation!

We started small. The most time-consuming part of troubleshooting an operational problem is finding the relevant data in the logs. KMS produces massive amounts of logs from thousands of machines, and traditional log-diving tools can take hours, days, or sometimes weeks. Finally, an SDE on KMS had enough of this nonsense. He put together a data lake prototype for the logs using AWS EMR and S3. With AWS Athena on top, searching the log data became, for many cases, a matter of minutes. Our product managers quickly realized they could get a lot of business questions answered by the data lake too, so they became regular users.

But that was just the beginning of the story. As KMS continued to grow exponentially, the EMR process became slower and slower; we had to wait hours for it to finish indexing the logs. We also needed other metadata from our production datastore to make sense of the logs. The makeshift EMR clusters started to have operational problems of their own. Instead of solving our operational problem, we now had two problems: operations and the data lake. We were at a crossroads: double down on using big data to solve our operational problems, or fall back to the more traditional approach of hiring more engineers for operations, maybe some dedicated support engineers and systems engineers.

That was when I hired KMS' first Sr. Data Engineer. It took me a while to convince KMS leadership why we needed a real expert in data engineering, and why Jin should stop being an impostor part-time data engineer. We built a two-pizza team around the Sr. Data Engineer and rewrote our ETL pipeline to make it much faster and more stable.
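A big part of why Athena queries finish in minutes is how the logs are laid out in S3: Hive-style partitioned keys let the query engine skip whole date partitions instead of scanning everything. Here is a minimal sketch of that layout; the bucket prefix, service name, and key format are illustrative, not KMS' actual schema:

```python
from datetime import datetime, timezone

def partition_key(service: str, ts: float, filename: str) -> str:
    """Build a Hive-style partitioned S3 key (dt=YYYY-MM-DD) so a query
    engine like Athena can prune entire date partitions when a query
    filters on dt, rather than scanning every object in the bucket."""
    day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d")
    return f"logs/service={service}/dt={day}/{filename}"

# A log file with a 2021-03-05 timestamp lands under its date partition:
key = partition_key("kms-frontend", 1614902400.0, "host-42.log.gz")
# key == "logs/service=kms-frontend/dt=2021-03-05/host-42.log.gz"
```

With this layout, a query such as `... WHERE dt = '2021-03-05'` only reads one day's worth of objects, which is where most of the speedup over brute-force log diving comes from.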
We built a real-time log streaming pipeline to complement the batch-mode ETL process. We built an Online Analytical Processing (OLAP) datastore that mirrors metadata from our Online Transaction Processing (OLTP) datastore in real time. We built an Event Sourcing datastore to track our data mutation history. Finally, we built a universal query interface that lets users get results from data sources across all regions without having to know the internal complexity.

As KMS continued to grow rapidly, I can attest that our operational load actually went down dramatically. My favorite feature from our data engineering team is their recent intern project: it automated anomaly detection for availability drops, and it even correlates related graphs using statistical models to recommend potential root causes.

So if your service is overwhelmed by operational problems, instead of throwing more humans at them, big data technologies might help!
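To make the intern project's idea concrete, here is a toy sketch of flagging availability drops against a trailing baseline. This is my own simplified illustration of the general technique, not the team's actual statistical model; the window size and threshold are arbitrary:

```python
from statistics import mean, stdev

def find_drops(series, window=10, threshold=3.0):
    """Flag indices where availability falls more than `threshold`
    standard deviations below the mean of the trailing window."""
    drops = []
    for i in range(window, len(series)):
        baseline = series[i - window:i]
        mu, sigma = mean(baseline), stdev(baseline)
        if sigma > 0 and (mu - series[i]) / sigma > threshold:
            drops.append(i)
    return drops

# Steady ~99.9% availability with one sharp drop at index 15.
avail = [99.9, 99.91, 99.89, 99.9, 99.92, 99.88, 99.9, 99.91,
         99.89, 99.9, 99.91, 99.9, 99.89, 99.9, 99.91, 97.2]
print(find_drops(avail))  # -> [15]
```

A real detector would be more robust (seasonality, noisy baselines, multiple metrics), and correlating flagged drops across metrics is what turns detection into root-cause hints, but the core idea is this simple.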