Site Reliability Engineering: How Google Runs Production Systems
My quick review of “Site Reliability Engineering: How Google Runs Production Systems”.
In a way, the Site Reliability Engineer is the new DevOps. A development team’s job is for the application they work on to, basically, work – and the SRE is part of this development team. This means thinking ahead while developing (using reliable and scalable technologies), launching properly and monitoring, but also being on-call, receiving alerts when things go wrong, and acting upon them.
There is one idea I really liked in this book: the notion of an error budget. First, you define an SLO for your service – say 99.99% availability. You then measure it. If the availability of the service is greater than this SLO, all good, the team can develop and deploy new features: it can take risks, given that deploying new features is more likely to break things than doing nothing. On the other hand, if the availability of the service is less than the SLO, nothing (except maybe urgent fixes) should be deployed. I wish I saw this logic applied more often!
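To make that rule concrete, here is a minimal Python sketch of the decision. It is not taken from the book: the function names, the request counts and the `urgent_fix` flag are assumptions made up for illustration, and availability is simplified to "successful requests divided by total requests" over some measurement window.

```python
def availability(good_requests: int, total_requests: int) -> float:
    """Measured availability: share of requests that succeeded."""
    if total_requests == 0:
        # No traffic means nothing failed (an assumption for this sketch).
        return 1.0
    return good_requests / total_requests


def can_deploy(slo: float, good_requests: int, total_requests: int,
               urgent_fix: bool = False) -> bool:
    """Allow a deployment while the SLO is met; urgent fixes always pass."""
    if urgent_fix:
        return True
    return availability(good_requests, total_requests) >= slo


SLO = 0.9999  # 99.99% availability, as in the example above

# 50 failures out of 1,000,000 requests: SLO is met, new features can ship.
print(can_deploy(SLO, 999_950, 1_000_000))
# 200 failures out of 1,000,000 requests: SLO is missed, feature work is frozen.
print(can_deploy(SLO, 999_800, 1_000_000))
# Same bad window, but an urgent fix is still allowed through.
print(can_deploy(SLO, 999_800, 1_000_000, urgent_fix=True))
```

A real setup would probably measure this over a rolling window and track how much of the budget is left, but the gate itself stays this simple.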
I’ve found interesting sections in this book, about management, about post-mortems and being on-call, about launching new projects, about scalability, reliability and distributed systems… Really, most of the book contains interesting information. But I found it a bit too long[1], and many of the ideas it presents are not directly applicable for most of us – not everyone is Google-scale, with hundreds of engineers and dozens of dedicated SREs embedded in development teams. Basically, “Site Reliability Engineering” is full of long-term ideas, but it is not the book I would recommend when looking for what one engineer could do today to help her current project.
→ 3⁄5 Several interesting notions, but the book is a bit too long and I think several chapters don’t apply directly enough to most non-Google-scale projects.
[1] This book is over 500 pages long. I actually read it in two halves, several months apart: the first before the summer, when I went abroad and had to wait several hours at airports, and the second near the end of the year, while commuting.