From book-skills
Knowledge base from "Site Reliability Engineering: How Google Runs Production Systems" by Betsy Beyer, Chris Jones, Jennifer Petoff & Niall Murphy (Google). Use when applying SRE philosophy, setting SLOs/error budgets, managing on-call and toil, designing incident response, conducting postmortems, handling cascading failures, or referencing Google's operational frameworks. Lane: SRE philosophy, org/process, postmortems, toil, on-call, error budgets, monitoring, cascading failures, load balancing, data integrity.
Slash command
/book-skills:google-sre [topic, framework name, or chapter number, e.g. 'error budget', 'four golden signals', 'ch06']
This skill is limited to the following tools:
**Editors**: Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy (Google) | **Pages**: ~550 | **Chapters**: 34 | **Generated**: 2026-06-03
- chapters/ch01-introduction.md
- chapters/ch02-production-environment.md
- chapters/ch03-embracing-risk.md
- chapters/ch04-slo.md
- chapters/ch05-eliminating-toil.md
- chapters/ch06-monitoring.md
- chapters/ch07-automation.md
- chapters/ch08-release-engineering.md
- chapters/ch09-simplicity.md
- chapters/ch10-practical-alerting.md
- chapters/ch11-being-on-call.md
- chapters/ch12-troubleshooting.md
- chapters/ch13-emergency-response.md
- chapters/ch14-managing-incidents.md
- chapters/ch15-postmortem-culture.md
- chapters/ch16-tracking-outages.md
- chapters/ch17-testing-reliability.md
- chapters/ch18-sw-engineering-sre.md
- chapters/ch19-load-balancing-frontend.md
- chapters/ch20-load-balancing-datacenter.md
- error budget, postmortem, cascading failures, on-call; I find and read the relevant chapter
- ch06 or ch22; I load that chapter file

For details beyond the core frameworks, I will Read the relevant chapters/chNN-*.md, glossary.md, patterns.md, or cheatsheet.md.
error_budget = 1 − SLO_target
The error budget is the permitted unreliability per quarter. As long as budget remains, releases ship. When exhausted, releases freeze until budget replenishes.
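The arithmetic above can be sketched as follows. This is a minimal illustration, assuming the budget is counted in failed requests over one measurement window; the function names and the request-based accounting are assumptions of this sketch, not definitions from the book.

```python
def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail: 1 - SLO target."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    return 1.0 - slo_target


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Unspent share of the window's error budget (negative when overspent)."""
    allowed_failures = error_budget(slo_target) * total_requests
    if allowed_failures == 0:
        # A 100% SLO (or an empty window) leaves no budget to spend.
        return 0.0 if failed_requests == 0 else float("-inf")
    return 1.0 - failed_requests / allowed_failures


def can_release(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """Hypothetical release gate: ship only while some budget remains."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0.0
```

With a 99.9% SLO and 1,000,000 requests in the window, about 1,000 failures are allowed, so 400 failures leave roughly 60% of the budget and 1,200 failures would freeze releases.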
Availability = successful_requests / total_requests (Google prefers this request success rate to time-based uptime for distributed systems).
If you can only measure four things on a user-facing system, measure these:
| Signal | What it measures | Leading indicator of |
|---|---|---|
| Latency | Time to service a request | User experience, saturation |
| Traffic | Demand (req/s, I/O rate) | Scaling needs |
| Errors | Rate of failed requests | Service health |
| Saturation | How "full" the system is | Impending failure |
Page on symptoms (what users see), not causes. Latency increases are the leading indicator of saturation.
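As a rough illustration of the four signals in the table, the sketch below summarizes one window of request records. The `Request` record, the nearest-rank p99, and the definition of saturation as traffic divided by a known capacity are all assumptions made for this example; real systems usually measure saturation on the most constrained resource.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests: List[Request], window_s: float, capacity_rps: float) -> Dict[str, float]:
    """Summarize one measurement window into latency, traffic, errors, saturation."""
    if not requests:
        raise ValueError("need at least one request in the window")
    latencies = sorted(r.latency_ms for r in requests)
    n = len(latencies)
    # Nearest-rank 99th percentile, computed with integer math: ceil(0.99 * n).
    rank = (99 * n + 99) // 100
    traffic_rps = n / window_s
    return {
        "latency_p99_ms": latencies[rank - 1],
        "traffic_rps": traffic_rps,
        "error_rate": sum(not r.ok for r in requests) / n,
        # Assumption: saturation approximated as offered load over rated capacity.
        "saturation": traffic_rps / capacity_rps,
    }
```

A tail percentile is used for latency rather than the mean because a small number of slow requests is what users notice first.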
SRE ops work (on-call, tickets, manual tasks) ≤ 50% of total time. Excess overflows to the dev team.
Toil = operational work that is: manual + repetitive + automatable + tactical + no enduring value + O(n) with service growth. Any three attributes = strong signal. All six = definitely toil.
Toil is not overhead (meetings, HR) and not the first or second time you do something. It is the N-th repetition that could be automated.
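The six-attribute test can be written as a small counting helper. This is only a sketch: the attribute identifiers and the "below threshold" label for fewer than three attributes are hypothetical names chosen here, since the text defines only the three-attribute and six-attribute cases.

```python
from typing import Iterable

TOIL_ATTRIBUTES = frozenset({
    "manual",
    "repetitive",
    "automatable",
    "tactical",
    "no_enduring_value",
    "scales_with_growth",  # O(n) with service growth
})


def classify_toil(attributes: Iterable[str]) -> str:
    """Count how many of the six toil attributes a task exhibits."""
    present = set(attributes)
    unknown = present - TOIL_ATTRIBUTES
    if unknown:
        raise ValueError(f"unknown attributes: {sorted(unknown)}")
    if len(present) == len(TOIL_ATTRIBUTES):
        return "definitely toil"
    if len(present) >= 3:
        return "strong signal"
    return "below threshold"  # assumption: the source does not name this case
```

For example, a task that is manual, repetitive, and automatable already counts as a strong signal.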
A postmortem is not punishment — it is a learning artifact. Blameless = focusing on systemic causes, not individual fault.
Write one when: user-visible downtime, data loss, on-call intervention, long MTTR, or monitoring failure.
Structure: Timeline → Impact → Root cause(s) → Contributing factors (systemic) → Action items with owners. Review with senior engineers. Share broadly.
"You can't fix people, but you can fix systems and processes."
Four roles, recursive separation: Incident Commander (IC), Ops Lead, Communications, and Planning.
Only Ops Lead touches production. Freelancing by well-meaning engineers makes incidents worse. Handoffs require explicit verbal acknowledgment.
Cascade = positive feedback loop. When only 10% of servers survive, shedding load to 90% of normal won't stabilize the system; load has to drop to roughly 10% of normal before it can recover.
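The recovery arithmetic can be made concrete with a short sketch. It assumes capacity scales linearly with the number of healthy servers, which is a simplification; the function names and sample figures are illustrative.

```python
def sustainable_load(normal_load_qps: float, healthy_servers: int, total_servers: int) -> float:
    """Load the surviving servers can carry, assuming capacity is linear in server count."""
    if total_servers <= 0 or not 0 <= healthy_servers <= total_servers:
        raise ValueError("need 0 <= healthy_servers <= total_servers and total_servers > 0")
    return normal_load_qps * healthy_servers / total_servers


def will_stabilize(offered_qps: float, normal_load_qps: float,
                   healthy_servers: int, total_servers: int) -> bool:
    """True when the offered load fits within what the healthy servers can serve."""
    return offered_qps <= sustainable_load(normal_load_qps, healthy_servers, total_servers)
```

With 10 of 100 servers healthy and a normal load of 10,000 QPS, offering 9,000 QPS keeps the survivors overloaded, while something under 1,000 QPS gives them room to recover.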
Prevention priority:
Retry rules: Always use randomized exponential backoff (jitter). Budget retries globally. Retries multiply across layers: 3 layers × 3 attempts each = 27× amplification at the database, and 3 retries per layer (4 attempts each) gives 4³ = 64×.
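A minimal sketch of these retry rules follows. It uses the "full jitter" variant of exponential backoff (each delay drawn uniformly between zero and the capped exponential bound), which is one common choice rather than a method prescribed by the book. The helper names and default values are assumptions, and a global retry budget is not implemented here.

```python
import random
import time


def backoff_delays(max_retries, base_s=0.1, cap_s=10.0, rng=random.random):
    """Yield one jittered delay per retry: rng() * min(cap_s, base_s * 2**attempt)."""
    for attempt in range(max_retries):
        yield rng() * min(cap_s, base_s * 2 ** attempt)


def call_with_retries(operation, max_retries=3, sleep=time.sleep, **backoff_kwargs):
    """Call operation(); on failure wait a jittered delay, retrying at most max_retries times."""
    delays = backoff_delays(max_retries, **backoff_kwargs)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retries exhausted: surface the last error to the caller
            sleep(delay)


def amplification(attempts_per_layer: int, layers: int) -> int:
    """Worst-case number of backend attempts caused by a single user action."""
    return attempts_per_layer ** layers
```

`amplification(3, 3)` is 27 and `amplification(4, 3)` is 64, which is why retrying at a single layer, or capping retries with a shared budget, matters more than tuning the delays.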
Automation is a force multiplier, not a panacea. Better than automation is a system designed to not need it.
Three valid monitoring outputs: Alert (immediate action), Ticket (days), Log (no action needed).
Alert quality checklist: urgent + actionable + user-visible + novel + requires intelligence. Any violation → redesign or eliminate.
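The checklist can be expressed as a simple review record, sketched below. The `AlertReview` type and the verdict strings are hypothetical; the five criteria are the ones listed above.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class AlertReview:
    urgent: bool
    actionable: bool
    user_visible: bool
    novel: bool
    requires_intelligence: bool


def violations(review: AlertReview) -> List[str]:
    """Names of the checklist criteria this alert fails."""
    return [f.name for f in fields(review) if not getattr(review, f.name)]


def verdict(review: AlertReview) -> str:
    """Any failed criterion means the alert should not page as-is."""
    return "keep" if not violations(review) else "redesign or eliminate"
```

An alert that fires on a known, self-healing condition would fail `novel` and `requires_intelligence`, so it belongs in a ticket or a log rather than a page.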
Monitoring systems that page must be simple, predictable, and reliable. Complex thresholds are fragile.
~70% of outages are change-induced. Hermetic builds + progressive rollout + fast rollback are the answer.
| # | Title | Key Frameworks |
|---|---|---|
| ch01 | Introduction | Error budget, 50% cap, hope-is-not-a-strategy |
| ch02 | Production Environment at Google | Borg, BNS, N+2 redundancy, Google stack |
| ch03 | Embracing Risk | Error budget formula, availability metrics, risk tiering |
| ch04 | Service Level Objectives | SLI/SLO/SLA, percentiles, safety margin, SLO rules |
| ch05 | Eliminating Toil | Toil definition, 50% rule, toil taxonomy |
| ch06 | Monitoring Distributed Systems | Four Golden Signals, symptoms vs. causes, alert quality |
| ch07 | Evolution of Automation | Automation hierarchy (5 levels), MoB case study |
| ch08 | Release Engineering | Hermetic builds, progressive rollout, Rapid, config mgmt |
| ch09 | Simplicity | Essential vs. accidental complexity, negative LOC, minimal APIs |
| ch10 | Practical Alerting (Borgmon) | Time-series monitoring, varz, hierarchical topology |
| ch11 | Being On-Call | 25% rule, 2 incidents/shift, follow-the-sun, cognitive load |
| ch12 | Effective Troubleshooting | Hypothetico-deductive, triage first, negative results |
| ch13 | Emergency Response | 3 emergency types, rollback testing, out-of-band comms |
| ch14 | Managing Incidents | ICS (IC/Ops/Comms/Planning), live incident doc, handoff |
| ch15 | Postmortem Culture | Blameless postmortem, triggers, Wheel of Misfortune |
| ch16 | Tracking Outages | Outalator, alerts vs. incidents, tagging taxonomy |
| ch17 | Testing for Reliability | Zero MTTR, test hierarchy, canary analysis, config tests |
| ch18 | Software Engineering in SRE | Intent-based capacity planning, Auxon case study |
| ch19 | Load Balancing at the Frontend | DNS LB, VIP, EDNS0, geographic routing |
| ch20 | Load Balancing in the Datacenter | Lame duck state, subsetting, WRR, CPU as capacity |
| ch21 | Handling Overload | Adaptive throttling formula, criticality levels, deadline propagation |
| ch22 | Addressing Cascading Failures | Cascade triggers, defense-in-depth, retry amplification, GC spiral |
| ch23 | Distributed Consensus | CAP, split-brain, Paxos/Raft, ACID vs. BASE |
| ch24 | Distributed Periodic Scheduling | Idempotency classification, fail-closed, leader election |
| ch25 | Data Processing Pipelines | Periodic pipeline failures, Google Workflow, checkpointing |
| ch26 | Data Integrity | Proactive detection + rapid repair, soft deletion, backup vs. replication |
| ch27 | Reliable Product Launches | LCE, launch checklist, gradual rollout, NORAD Santa example |
| ch28 | Accelerating SREs to On-Call | Wheel of Misfortune, education blueprint, anti-patterns |
| ch29 | Dealing with Interrupts | Pages/tickets/ops categories, flow state, bystander effect |
| ch30 | Embedding an SRE | 3-phase intervention, ops mode diagnosis, kindling |
| ch31 | Communication and Collaboration | Production meetings, two-masters model, API-as-contract |
| ch32 | The Evolving SRE Engagement Model | PRR, early engagement, SRE platform, engagement spectrum |
| ch33 | Lessons from Other Industries | 4 universal themes, aviation/nuclear/telecom parallels |
| ch34 | Conclusion | 747 analogy, two-pilot model, enduring principles |
This skill covers the 2016 first edition of the SRE book. It does not include:
For hands-on implementation, combine with project-specific tools and current cloud provider documentation. The core principles (error budgets, blameless culture, the 50% cap, four golden signals) remain authoritative.
npx claudepluginhub andersonfpcorrea/andersonfpcorrea-skills --plugin book-skills