From book-skills
Knowledge base from "Implementing Service Level Objectives" by Alex Hidalgo. Use when applying Hidalgo's frameworks for meaningful SLIs, reliability targets, error budget math and policy, SLO alerting, statistics behind SLOs, data reliability, or building SLO culture. This is the deepest practical SLI/SLO/error-budget treatment available — it covers the full Reliability Stack from measurement design through probability/statistics, alerting architecture, and organizational change.
Slash command: /book-skills:hidalgo-slo [topic, framework name, chapter number, or 'index']
**Author**: Alex Hidalgo | **Pages**: ~646 | **Chapters**: 17 + 2 appendices | **Generated**: 2026-06-03
chapters/ch01-the-reliability-stack.md
chapters/ch02-how-to-think-about-reliability.md
chapters/ch03-developing-meaningful-slis.md
chapters/ch04-choosing-good-slos.md
chapters/ch05-how-to-use-error-budgets.md
chapters/ch06-getting-buy-in.md
chapters/ch07-measuring-slis-and-slos.md
chapters/ch08-slo-monitoring-and-alerting.md
chapters/ch09-probability-and-statistics.md
chapters/ch10-architecting-for-reliability.md
chapters/ch11-data-reliability.md
chapters/ch12-a-worked-example.md
chapters/ch13-building-an-slo-culture.md
chapters/ch14-slo-evolution.md
chapters/ch15-discoverable-and-understandable-slos.md
chapters/ch16-slo-advocacy.md
chapters/ch17-reliability-reporting.md
cheatsheet.md
glossary.md
patterns.md
Topic or framework name (burn rate, error budget policy, percentiles, data reliability, etc.): I find and read the relevant chapter.
Chapter number (ch08 or ch09): I load that specific chapter.
For details beyond the core frameworks below, I will Read the relevant chapters/chNN-*.md, glossary.md, patterns.md, or cheatsheet.md files.
Three-layer hierarchy for user-centric reliability:
SLI (Service Level Indicator) — A measurement of service behavior from the user's perspective, expressed as a ratio: good_events / total_events. Not CPU usage or queue depth — user-observable outcomes.
SLO (Service Level Objective) — The target percentage for the SLI ratio. Good SLO: exceeding it means users are happy; missing it means users are unhappy. Not a contract — it can and should change.
Error Budget — (1 − SLO_target) × window = permitted unreliability. If SLO = 99.9%, the error budget = 0.1% of all events (or ~43 minutes per 30-day month).
SLA is a contractual version of SLO with financial consequences. This book focuses on SLOs, not SLAs.
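A minimal sketch of the SLI ratio and the SLO check described above. The function names and the sample counts are illustrative assumptions, not taken from the book.

```python
def sli_ratio(good_events: int, total_events: int) -> float:
    """SLI expressed as good_events / total_events."""
    if total_events == 0:
        # Assumption: with no traffic, no user had a bad experience.
        return 1.0
    return good_events / total_events


def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return sli >= slo_target


# Hypothetical numbers: 99,950 good responses out of 100,000 requests.
sli = sli_ratio(good_events=99_950, total_events=100_000)
print(f"SLI = {sli:.2%}, meets 99.9% SLO: {meets_slo(sli, 0.999)}")
```

Treating zero traffic as a perfect SLI is a design choice; some teams may prefer to report "no data" instead.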
The three service truths:
An SLI is fundamentally a user journey. Product teams call it a KPI; QA calls it an interface test. They're the same thing in different language.
Measure many things by measuring only a few: Verify the user gets correct data at the edge, and you've implicitly verified the service is up, available, responsive, format-correct, and returning good results. Use the highest-level user-observable outcome.
Six questions for any request/response service SLI:
Write every SLI as a plain English sentence: "When users search for a product, 99.8% of searches return a response within 4 seconds." If a non-engineer can't understand it, rewrite it.
Don't constrain to "nines": 99.7%, 98.3%, 97.2% are as valid as 99.9% if data-driven. The "number nine" convention is an artifact, not a requirement.
Dependency math: system_reliability = component_reliability^N. Forty components at 99.9% compose to 96%. You cannot exceed your hard dependencies' reliability.
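The dependency math can be checked with a short sketch. It assumes every component is a hard dependency and that failures are independent; `series_reliability` is an illustrative name.

```python
import math


def series_reliability(reliabilities: list[float]) -> float:
    """Reliability of a chain of hard dependencies: the product of the parts."""
    return math.prod(reliabilities)


# Forty components, each at 99.9%.
forty_components = [0.999] * 40
print(f"{series_reliability(forty_components):.2%}")  # 96.08%
```

Passing a list rather than a single value and an exponent lets the same function handle components with different reliabilities.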
The "too reliable" problem: If you routinely exceed your SLO by a wide margin, users build higher expectations, engineers stop learning from failures, and you lose the freedom to experiment.
Start from past performance: Use the P95 or P99 of historical data as a starting target; adjust based on user feedback and metric quality (resolution, quantity, noise).
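One possible reading of this guidance is to take a latency threshold from a percentile of historical measurements. The sketch below assumes that reading; the sample data and the `percentile` helper are hypothetical.

```python
import statistics


def percentile(samples: list[float], pct: int) -> float:
    """Return the pct-th percentile (1..99) using statistics.quantiles()."""
    if not 1 <= pct <= 99:
        raise ValueError("pct must be between 1 and 99")
    cut_points = statistics.quantiles(samples, n=100)  # 99 cut points
    return cut_points[pct - 1]


# Hypothetical history: one latency sample for each value from 1 ms to 1000 ms.
history_ms = list(range(1, 1001))
p95 = percentile(history_ms, 95)
p99 = percentile(history_ms, 99)
print(f"P95 = {p95:.2f} ms, P99 = {p99:.2f} ms")
```

The resulting values are only a starting point; the text above still calls for adjusting them based on user feedback and metric quality.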
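Bad time formula: bad_seconds_per_day = (1 − SLO_target) × 86400. The table below scales the same formula to an average month of about 30.44 days (2,629,800 seconds).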
| Target | Bad time/month | Practical on-call? |
|---|---|---|
| 99.99% | 4 m 23 s | Barely; needs auto-remediation |
| 99.95% | 21 m 54 s | Yes, with fast response |
| 99.9% | 43 m 50 s | Yes, standard |
| 99.7% | 2 h 11 m | Yes, comfortable |
| 99% | 7 h 18 m | Yes, for lower-criticality |
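The bad-time formula can be applied to any window length. This sketch assumes an average month of 365.25 / 12 days, which matches the 99.9% and 99.99% rows; the helper names are illustrative.

```python
SECONDS_PER_DAY = 86_400
AVG_MONTH_SECONDS = 365.25 / 12 * SECONDS_PER_DAY  # 2,629,800 seconds


def bad_seconds(slo_target: float, window_seconds: float) -> float:
    """Permitted bad time in a window: (1 - SLO_target) * window."""
    return (1 - slo_target) * window_seconds


def format_duration(seconds: float) -> str:
    """Render a duration as 'H h M m S s', rounded to the nearest second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours} h {minutes} m {secs} s"


for target in (0.9999, 0.999):
    allowed = bad_seconds(target, AVG_MONTH_SECONDS)
    print(f"{target:.2%}: {format_duration(allowed)} per month")
```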
Events-based:
budget = (1 - SLO) × total_requests
remaining_pct = (budget - failures) / budget
Time-based (30-day, 1-second resolution):
total = 2,592,000 seconds
budget = (1 - SLO) × 2,592,000
remaining = budget - bad_seconds
Which to use: Events-based for burn rate alerting; time-based for human communication. Use both.
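Both budget calculations above fit in a few lines. This is a sketch with hypothetical traffic and outage numbers; the function names are not from the book.

```python
WINDOW_SECONDS = 30 * 24 * 60 * 60  # 30-day window at 1-second resolution


def events_budget(slo_target: float, total_requests: int) -> float:
    """Number of requests allowed to fail in the window."""
    return (1 - slo_target) * total_requests


def remaining_fraction(budget: float, failures: int) -> float:
    """Share of the events budget still unspent (negative when overspent)."""
    return (budget - failures) / budget


def time_budget_remaining(slo_target: float, bad_seconds: float) -> float:
    """Seconds of bad time still available in the 30-day window."""
    return (1 - slo_target) * WINDOW_SECONDS - bad_seconds


# Hypothetical month: 1,000,000 requests, 250 failures, 600 bad seconds.
budget = events_budget(0.999, 1_000_000)
print(f"events budget: {budget:.0f}, remaining: {remaining_fraction(budget, 250):.0%}")
print(f"time budget remaining: {time_budget_remaining(0.999, 600):.0f} s")
```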
Rolling windows (recommended): Bad events expire as they age out. Calendar windows: Budget resets on a date — risk of release flood at reset.
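To illustrate how bad events age out of a rolling window, here is a minimal sketch. It assumes events arrive in non-decreasing timestamp order; the `RollingWindow` class is hypothetical and not tied to any monitoring product.

```python
from collections import deque


class RollingWindow:
    """Counts bad events whose timestamps fall inside the trailing window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.events: deque[tuple[float, bool]] = deque()  # (timestamp, is_bad)
        self.bad = 0

    def record(self, timestamp: float, is_bad: bool) -> None:
        """Add one event, then drop anything that has aged out."""
        self.events.append((timestamp, is_bad))
        self.bad += int(is_bad)
        self._expire(timestamp)

    def _expire(self, now: float) -> None:
        """Remove events at or before the window cutoff."""
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, was_bad = self.events.popleft()
            self.bad -= int(was_bad)

    def bad_count(self, now: float) -> int:
        """Bad events still inside the window as of `now`."""
        self._expire(now)
        return self.bad


window = RollingWindow(window_seconds=60)
window.record(0, is_bad=True)
window.record(30, is_bad=False)
print(window.bad_count(now=30))  # 1: the bad event is still in the window
print(window.bad_count(now=61))  # 0: the bad event has aged out
```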
Error budget policy: Write it before you need it. Define what actions to take at 20%, 50%, and 100% budget consumption. Without a pre-agreed policy, error budget data goes unused. This is the cultural acid test.
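A policy written in advance can be as simple as a lookup from budget consumption to an agreed action. The thresholds below come from the text; the actions are hypothetical placeholders that each team would replace with its own agreements.

```python
# (consumption threshold, action), checked from highest to lowest.
# The action strings are hypothetical examples, not the book's policy.
POLICY = [
    (1.00, "halt feature releases; reliability work only"),
    (0.50, "review recent incidents; prioritize reliability fixes"),
    (0.20, "notify the team; watch the burn rate"),
]


def policy_action(consumed_fraction: float) -> str:
    """Return the pre-agreed action for the share of budget consumed so far."""
    for threshold, action in POLICY:
        if consumed_fraction >= threshold:
            return action
    return "no action required"


for consumed in (0.10, 0.25, 0.60, 1.05):
    print(f"{consumed:.0%} consumed -> {policy_action(consumed)}")
```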
Budget uses beyond feature freeze: Chaos engineering, load tests, blackhole exercises (intentional location failures), library/config experiments. Use surplus budget to learn about your systems.
Simple threshold alerting (CPU, queue depth) fails because: thresholds decay as the system grows, they're poor proxies for user experience, they lose context, and they generate alert fatigue.
Two-alert architecture (the gold standard):
Burn rate formula:
burn_rate = observed_errors / allowable_errors
# > 1 → violating SLO at this rate
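A sketch of the burn rate formula, with allowable errors derived from the SLO target and the traffic seen in the same period. The time-to-exhaustion helper assumes a full budget and a constant burn rate; both function names are illustrative.

```python
def burn_rate(observed_errors: int, total_requests: int, slo_target: float) -> float:
    """observed_errors / allowable_errors for the same period."""
    if total_requests == 0:
        return 0.0  # assumption: no traffic burns no budget
    allowable_errors = (1 - slo_target) * total_requests
    return observed_errors / allowable_errors


def days_until_exhausted(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")
    return window_days / rate


# Hypothetical period: 50 errors in 10,000 requests against a 99.9% SLO.
rate = burn_rate(observed_errors=50, total_requests=10_000, slo_target=0.999)
print(f"burn rate = {rate:.1f}, budget gone in {days_until_exhausted(rate):.1f} days")
```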
The SLO alert condition:
SUM(errors) / SUM(all_requests) > (1.0 - SLO_target)
over a window substantially shorter than the SLO evaluation period.
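The alert condition above can be evaluated over a short run of recent buckets. This sketch assumes per-minute `(errors, requests)` pairs; the bucket size, the data, and the function name are hypothetical.

```python
from collections.abc import Sequence


def slo_alert_fires(buckets: Sequence[tuple[int, int]], slo_target: float) -> bool:
    """True when SUM(errors) / SUM(all_requests) exceeds (1.0 - SLO_target)."""
    total_errors = sum(errors for errors, _ in buckets)
    total_requests = sum(requests for _, requests in buckets)
    if total_requests == 0:
        return False  # assumption: no traffic, nothing to alert on
    return total_errors / total_requests > (1.0 - slo_target)


# Hypothetical last five minutes of traffic, one bucket per minute.
noisy = [(0, 1000), (2, 1000), (3, 1000), (1, 1000), (4, 1000)]
quiet = [(0, 1000)] * 5
print(slo_alert_fires(noisy, 0.999))  # True: 10 / 5000 = 0.2% > 0.1%
print(slo_alert_fires(quiet, 0.999))  # False
```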
Human response requires ≥5 minutes. SLO targets of 99.99% or higher on monthly windows leave less than 5 minutes of budget for a complete outage, so auto-remediation or an architecture change is needed.
For new services or low-QPS services: Raw percentage calculations are unreliable with sparse data. Use Bayesian methods with a prior, or widen the measurement window.
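One common way to apply a prior to sparse data is the Beta-Binomial update, sketched below. The choice of a Beta prior and its parameters are assumptions made for illustration; the book's own treatment may differ.

```python
def posterior_mean(good: int, total: int, prior_alpha: float, prior_beta: float) -> float:
    """Posterior mean success rate under a Beta(prior_alpha, prior_beta) prior.

    The prior behaves like prior_alpha earlier successes and prior_beta
    earlier failures added to the observed counts.
    """
    bad = total - good
    return (prior_alpha + good) / (prior_alpha + prior_beta + good + bad)


# Hypothetical low-traffic service: 9 good requests out of 10.
raw = 9 / 10
# Hypothetical prior: roughly 99% reliable, weighted like 100 past requests.
smoothed = posterior_mean(good=9, total=10, prior_alpha=99, prior_beta=1)
print(f"raw = {raw:.1%}, with prior = {smoothed:.1%}")
```

With only ten requests, a single failure swings the raw ratio to 90%, while the prior-weighted estimate moves far less. As more data arrives, the observed counts dominate the prior.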
Multi-datacenter independence math:
P(both DCs fail) = (1 − p)²
# Two DCs at 99% → system failure = 0.0001 → 99.99% composite
Only valid if failures are truly independent (cross-region, separate power/network).
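The independence math generalizes to any number of datacenters. This is a sketch under the same independence assumption stated above; `composite_availability` is an illustrative name.

```python
def composite_availability(p: float, datacenters: int = 2) -> float:
    """Availability when the service is up if at least one datacenter is up.

    Assumes every datacenter has availability p and fails independently.
    """
    all_fail = (1 - p) ** datacenters
    return 1 - all_fail


print(f"{composite_availability(0.99, 2):.4%}")  # 99.9900%
```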
Latency and queueing: At high utilization, queue wait → ∞. Latency SLOs must account for service utilization approaching capacity.
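To make the utilization effect concrete, the sketch below uses the textbook M/M/1 queue, where the mean wait in queue is ρ / (μ − λ). The M/M/1 model is an assumption chosen for illustration; real services rarely match it exactly, and the source does not name a specific model.

```python
def mm1_queue_wait(utilization: float, service_rate: float) -> float:
    """Mean time a request waits in queue for an M/M/1 system, in seconds.

    utilization is rho = arrival_rate / service_rate and must be below 1.
    service_rate is requests per second the server can complete.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    arrival_rate = utilization * service_rate
    return utilization / (service_rate - arrival_rate)


# Hypothetical server that completes 100 requests per second.
for rho in (0.5, 0.9, 0.99):
    wait_ms = mm1_queue_wait(rho, service_rate=100) * 1000
    print(f"utilization {rho:.0%}: mean queue wait = {wait_ms:.0f} ms")
```

In this model the wait grows from 10 ms at 50% utilization to 990 ms at 99%, which is why a latency SLO that holds at moderate load can fail near capacity.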
For durability:
P(all N replicas fail) = p_fail^N
# Three replicas at 0.1% annual failure rate → 10^-9 annual data loss probability
Same-DC replicas are NOT independent — correlated failure modes exist.
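The replica formula is a one-liner, sketched here under the assumption that replica failures are fully independent. Correlated failures, as the caveat above notes, make the real loss probability higher than this estimate.

```python
def loss_probability(p_fail: float, replicas: int) -> float:
    """Probability that every replica fails, assuming independent failures."""
    return p_fail ** replicas


# Three replicas, each with a 0.1% annual failure probability.
print(f"{loss_probability(0.001, 3):.1e}")  # 1.0e-09
```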
Seven data properties to consider for SLOs: Freshness (not the same as age), Completeness, Consistency, Accuracy, Validity, Integrity, Durability.
Critical distinction: Durability failures are permanent. All other SLO violations are recoverable. Invest disproportionately in preventing durability failures.
Design tensions: Completeness vs. Freshness, Consistency vs. Availability, Durability vs. Performance. SLOs make these trade-offs explicit.
Culture change is 80% of the work. Technical implementation takes days; organizational alignment takes months.
Six-step path: get buy-in → prioritize → implement → use → iterate → advocate.
The first error budget policy is the cultural acid test: If the team ignores it when budget is exhausted, the SLO program is decorative.
SLO advocacy phases: Crawl, Walk, Run.
First people, then processes, then technology.
| # | Title | Key Frameworks |
|---|---|---|
| ch01 | The Reliability Stack | SLI/SLO/Error Budget, Service Truths, SLA vs SLO |
| ch02 | How to Think About Reliability | Implied agreements, Hyrum's Law, reliability cost curve |
| ch03 | Developing Meaningful SLIs | User journey as SLI, measure many by measuring few, six questions |
| ch04 | Choosing Good SLOs | Five Ms, percentiles, dependency math, not-just-nines |
| ch05 | How to Use Error Budgets | Error budget math, rolling vs calendar windows, budget policy |
| ch06 | Getting Buy-In | Stakeholder matrix, objection responses, cultural test |
| ch07 | Measuring SLIs and SLOs | Metrics vs logs, design goals, measurement patterns |
| ch08 | SLO Monitoring and Alerting | Burn rate, fast/slow burn, why thresholds fail |
| ch09 | Probability and Statistics | Bernoulli, Binomial, Poisson, multi-DC math, Bayesian |
| ch10 | Architecting for Reliability | SLO-first design, MTTR, hardware trade-offs |
| ch11 | Data Reliability | 13 data properties, durability permanence, data lineage |
| ch12 | A Worked Example | SLIs as user journeys, Wiener Shirt-zel multi-service example |
| ch13 | Building an SLO Culture | Six-step path, error budget policy as cultural test |
| ch14 | SLO Evolution | Seven change drivers, aspirational SLOs, revisit schedules |
| ch15 | Discoverable and Understandable SLOs | SLO definition documents, phraseology, discoverability tooling |
| ch16 | SLO Advocacy | Crawl/Walk/Run, artifacts, stakeholder pitch templates |
| ch17 | Reliability Reporting | Why MTTX fails, error budget as reporting unit, audience matrix |
This skill covers the book content only (O'Reilly, 2020). It does not include:
For hands-on implementation, combine with your monitoring stack's documentation. The book's conceptual frameworks (Reliability Stack, user-journey SLIs, error budget policy, burn rate alerting) remain authoritative. This is the deepest single-source treatment of practical SLI/SLO implementation.
npx claudepluginhub andersonfpcorrea/andersonfpcorrea-skills --plugin book-skills