Brian Landers’ Post

Brian Landers reposted this

David Masover

SRE at Rubrik

I’m the Google SRE who made sure to hand off the pager in the minutes after I got laid off on 2023-01-20. If you’ve worked at Google (or maybe even if you haven’t), you may have heard some version of this story. Here’s what actually happened:

Many SRE teams are split across at least two sites in two very different time zones. This ensures 24/7 coverage while avoiding waking people up in the middle of the night to handle a page. Even in the worst outages, people need to hand off the incident and get some sleep.

My oncall shift was a full week, from Monday to Sunday, but only from 10 AM to 10 PM each day. So I wasn’t working when the layoff email arrived (a little after 2 AM), and there was no ongoing incident. I just happened to check something on my work phone before I went to bed.

And one of my first thoughts was: I’m oncall again in less than 8 hours. I’m not a single point of failure, we have secondaries, but it takes time to find someone to cover a shift and to update the relevant systems. I’d start pinging Sunnyvale people, but they’re all asleep. I’ll just look up a phone number… right, no access to the corp directory anymore. The longer I wait, the harder it’s going to be to contact anyone.

So I sent this page:

> speckle-svl (at least) needs a new oncaller right now
> I've just been laid off, and I've lost all access to the ███████████████ account. I actually don't have a better way to contact anyone before handoff time.

It worked. Dublin got the page, and I could finally start to deal with what just happened.

Meanwhile, I was going a bit viral inside Google.

“When I saw it was you, I thought ‘yep, that’s the most Masover thing ever.’” – Robert Banz

“That person has SRE running through their veins!” – anonymous

So as long as I’ve got your attention, I’d like to say a bit about how a good SRE organization enables this mindset.

I was at Google for eight and a half years. Probably 3-4 years in, I caused at least one major outage – I broke YouTube for half an hour or so. I was the one to write up the postmortem, and the playbook on how to handle that situation next time. Google’s blameless postmortem culture is real, and it matters. I’d built a habit of setting aside any big-picture anxiety and focusing on the problem in front of me.

And, your team has your back. If you’re facing a problem bigger than you can handle, you can escalate, deputize, and otherwise get the help you need. Every night, you can hand off the pager and anything that’s still on fire, and sleep well knowing it’s handled.

That’s what I did: Work the problem, do the best you can, and trust your team to have your back.

It’s 10 PM on Sunday night, 2023-01-22. My shift is over.

#opentowork
