Distributed Tracing: Get a Grasp on Your Production

Nakul Mishra, The Tracing Guy

Large-scale distributed systems are complex: they consist of hundreds of services, developed by different teams using a polyglot stack. A single request travelling through such a system might touch hundreds of services deployed across many machines. What happens when your system suddenly starts to slow down? How can you reason about performance issues, and how can you troubleshoot them? We can turn to our best engineers, but because responsibility is spread across different teams, they might not be able to guess or pinpoint the exact cause of a performance issue. Processing the sheer volume of log files takes a lot of time, and deducing anything meaningful from them that would quickly help diagnose a latency issue is hard. Can't metrics help? They can show us that we have a latency problem, but they can't tell us its cause, and depending on how we aggregate them they might even mislead us. What we need is a distributed tracing system. In this talk, we will take a look at OpenZipkin, which is based on Dapper, and see how it can help us pinpoint latency problems in production. We will also discuss why the 99th percentile matters for latency, and build a demo application using a polyglot stack (Java, Spring Cloud Sleuth, and Go) to see OpenZipkin in action.
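
To make the point about aggregation and the 99th percentile concrete, here is a minimal Python sketch. It is an illustration only, not material from the talk: the latency numbers, the `percentile` helper (a simple nearest-rank method), and the `chance_of_slow_call` function are all assumptions chosen for the example. The last function also assumes that slow calls occur independently across services, which real systems may not satisfy.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in milliseconds: 98 fast requests, 2 very slow ones.
latencies_ms = [10] * 98 + [2000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)


def chance_of_slow_call(fan_out, slow_fraction=0.01):
    """Probability that a request touching `fan_out` services hits at least
    one call slower than that service's 99th percentile, assuming the slow
    calls are independent (an assumption made for this sketch)."""
    return 1 - (1 - slow_fraction) ** fan_out


if __name__ == "__main__":
    print("mean:", mean_ms, "ms")
    print("p50: ", p50_ms, "ms")
    print("p99: ", p99_ms, "ms")
    print("chance of a slow call with fan-out 100:", chance_of_slow_call(100))
```

In this made-up data set the mean and the median both look healthy while the 99th percentile exposes the slow requests. Under the independence assumption, a request that fans out to 100 services has a better-than-even chance of hitting at least one slow call, which is one reason tail latency deserves attention in systems like the one described above.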

Level:
BEGINNER

Bio:
Consultant focused on the JVM and related technologies. Prefers automation over manual configuration. Keen on continuous delivery, unit testing, and code simplicity. Interested in developing applications that require creativity, imagination, fast learning, and a zest for putting theory into code.