Ramón Medrano Llamas

Oops. I broke the Google, now what? by Ramón Medrano Llamas

My team and I are SREs and run one of the most important service stacks at Google: authentication. Our role is to make our services reliable and secure through engineering projects. But we sometimes fail, even though our SLO can be as high as 99.9999% uptime (31.56 seconds of downtime per year) in some cases.
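To make the "31.56 seconds" figure concrete, here is a minimal sketch of the arithmetic behind an availability budget. The function name and the choice of a 365.25-day year are assumptions made for illustration (that year length is the one that reproduces the quoted figure); this is not a description of how Google computes its SLOs.

```python
# Assumption: a year of 365.25 days, i.e. 31,557,600 seconds.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_budget_seconds(slo_percent: float,
                            period_seconds: float = SECONDS_PER_YEAR) -> float:
    """Return the downtime allowed by an availability SLO given in percent.

    The budget is the fraction of the period that may be spent "down":
    period * (100 - SLO) / 100.
    """
    if not 0 <= slo_percent <= 100:
        raise ValueError("slo_percent must be between 0 and 100")
    return period_seconds * (100 - slo_percent) / 100


if __name__ == "__main__":
    # Each extra "nine" shrinks the yearly budget by a factor of ten.
    for slo in (99.9, 99.99, 99.999, 99.9999):
        budget = downtime_budget_seconds(slo)
        print(f"{slo}% uptime -> {budget:.2f} seconds of downtime per year")
```

At six nines the whole yearly budget is roughly half a minute, so a single slow human response to a page can consume it entirely.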

Whenever our services are down, almost every Google product is down, billions of customers can't access the data they have entrusted to us, and the outage quickly makes it into the press. Businesses across the globe can be affected.

When you are on call and get one of those pages that makes you think "oh gosh", what really happens behind the scenes? How do you go from a potentially cryptic alert message to a full-blown incident response team coordinating tens of engineers?

After mitigation, the complete repair starts, and the forensics-style root cause analysis needs to establish what happened and how to prevent that class of failure for good. We also need to travel back in time: outages do not happen at random, but have a trigger in a broken process, a system interaction, or a small piece of code.

In this talk, we'll go through the beautiful process of failure and recovery, examining real outages that have affected hundreds of millions of customers and seeing what happened, how we approached it and what we learned. We'll dive deep into some of the responses and how they can be exported to other organisations. We'll also learn how our organisation has evolved to be resilient over the last 15 years of operating systems at hyper-scale.

Talk Questions

  • Question 184
    So it's your fault that we now have to use 2FA? ;-)
  • Question 189
    How do you make your runbooks easily discoverable and searchable?
  • Question 186
    How do you train your developers to mitigate outages?
  • Question 188
    How do you ensure that promoting from test to prod doesn't break something else or have a bigger impact, for example through a corner-case effect?
  • Question 185
    What is the difference between an RCA (root cause analysis) and your post-mortem? Both are the same, right?
  • Question 187
    In 'where do we focus our effort' and later slides, I think a major point, Time to Mobilization, was missing. Without mobilizing the right people and technology, the mitigation doesn't happen, right?
  • Question 191
    Rollout releases or package mirroring + canary?