Break It till You Make It: Ideas on Site Reliability Engineering

This talk was held on Wednesday, August 21, 2019, at 5:30 p.m.

Outages happen. Products break. Every time a failure occurs, it’s an opportunity to learn and improve. Web-based products are incredibly complex. By understanding and managing their complexity, carefully investigating incidents, and improving responses, we can build more reliable products and more resilient systems.

Three Indeed engineering leaders talk about successfully handling complexity.

How I Broke This: The One Where Ketan Takes Down Indeed.com

During his ten years in various roles at Indeed, Ketan Gangatirkar, VP of engineering, has wreaked havoc on the company’s site. In his talk, he showcases creative problem-solving that resulted in unintended and negative consequences, and he shares the specific lessons learned from each of these adventures. Ketan’s experiences all support the abiding principle that incidents in very complex systems seldom have a single root cause.

How the Incident Retrospective Helps Indeed Deliver Constant Change Safely

Site reliability engineering manager Alex Elman uses a recent incident at Indeed to demonstrate the benefits of the incident retrospective. Many high-profile events are associated with a seemingly innocuous change. A single change, however, rarely causes an incident on its own. By conducting thorough reviews, organizations can learn a lot about how their systems respond to failure. Applying these lessons helps organizations increase the capacity of their systems to adapt and absorb change.

Why SLOs Are Useful: Scaling an Organization from First Principles

Tristan Slominski, site reliability engineering manager, constantly strives to offer an answer to the question “Why are we doing this?” His talk is a distillation of models that seem to explain why certain known practices work. We understand that “two pizza” teams are about the right size. We know APIs are “good.” We adopt SLOs as “good.” But did you know that we can explain the effectiveness of these three practices through a single equation commonly referred to as the Universal Scalability Law? They’re all solutions to the problem of managing complexity at different organizational scales.
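
For readers unfamiliar with the Universal Scalability Law, the sketch below shows its commonly cited form, C(N) = N / (1 + α(N − 1) + βN(N − 1)), where N is the number of workers, α models contention, and β models the cost of keeping everyone coherent with everyone else. The function names and the coefficient values are hypothetical, chosen only for illustration. They are not taken from the talk, and reading N as team size is an assumption made here to connect the equation to the "two pizza" idea.

```python
def usl_capacity(n, alpha, beta):
    """Relative capacity of n workers under the Universal Scalability Law.

    alpha: contention (work that must be serialized)
    beta:  coherency cost (pairwise coordination between workers)
    """
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))


def peak_size(alpha, beta, max_n=100):
    """Return the size between 1 and max_n that gives the highest capacity."""
    return max(range(1, max_n + 1), key=lambda n: usl_capacity(n, alpha, beta))


if __name__ == "__main__":
    # Hypothetical coefficients, picked only to make the curve easy to see.
    ALPHA = 0.05
    BETA = 0.01

    for n in (1, 5, 10, 20, 50):
        print(f"size {n:>2}: relative capacity {usl_capacity(n, ALPHA, BETA):.2f}")
    print("capacity peaks at size", peak_size(ALPHA, BETA))
```

With any nonzero β, capacity eventually falls as N grows, because the pairwise coordination term grows quadratically. Under this reading, smaller teams, APIs, and SLOs can each be seen as ways of reducing the coordination a group has to carry.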