Implementing Service Level Objectives

A Practical Guide to SLIs, SLOs, and Error Budgets

Paperback, English, 2020, ISBN 9781492076810


Although service-level objectives (SLOs) continue to grow in importance, there’s a distinct lack of information about how to implement them. Practical advice that does exist usually assumes that your team already has the infrastructure, tooling, and culture in place. In this book, recognized SLO expert Alex Hidalgo explains how to build an SLO culture from the ground up.

Ideal as a primer and daily reference for anyone creating both the culture and tooling necessary for SLO-based approaches to reliability, this guide provides detailed analysis of advanced SLO and service-level indicator (SLI) techniques. Armed with mathematical models and statistical knowledge to help you get the most out of an SLO-based approach, you’ll learn how to build systems capable of measuring meaningful SLIs with buy-in across all departments of your organization.

- Define SLIs that meaningfully measure the reliability of a service from a user’s perspective
- Choose appropriate SLO targets, including how to perform statistical and probabilistic analysis
- Use error budgets to help your team have better discussions and make better data-driven decisions
- Build supportive tooling and resources required for an SLO-based approach
- Use SLO data to present meaningful reports to leadership and your users


Keywords: Data analysis
Number of pages: 378
Main category: IT management / ICT


Preface
You Don’t Have to Be Perfect
How to Read This Book
Conventions Used in This Book
O’Reilly Online Learning
How to Contact Us

I: SLO Development
1. The Reliability Stack
Service Truths
The Reliability Stack
Service Level Indicators
Service Level Objectives
Error Budgets
What Is a Service?
Example Services
Things to Keep in Mind
SLOs Are Just Data
SLOs Are a Process, Not a Project
Iterate Over Everything
The World Will Change
It’s All About Humans

2. How to Think About Reliability
Reliability Engineering
Past Performance and Your Users
Implied Agreements
Making Agreements
A Worked Example of Reliability
How Reliable Should You Be?
100% Isn’t Necessary
Reliability Is Expensive
How to Think About Reliability

3. Developing Meaningful Service Level Indicators
What Meaningful SLIs Provide
Happier Users
Happier Engineers
A Happier Business
Caring About Many Things
A Request and Response Service
Measuring Many Things by Measuring Only a Few
A Written Example
Something More Complex
Measuring Complex Service User Reliability
Another Written Example
Business Alignment and SLIs

4. Choosing Good Service Level Objectives
Reliability Targets
User Happiness
The Problem of Being Too Reliable
The Problem with the Number Nine
The Problem with Too Many SLOs
Service Dependencies and Components
Service Dependencies
Service Components
Reliability for Things You Don’t Own
Open Source or Hosted Services
Measuring Hardware
Choosing Targets
Past Performance
Basic Statistics
Metric Attributes
Percentile Thresholds
What to Do Without a History

5. How to Use Error Budgets
Error Budgets in Practice
To Release New Features or Not?
Project Focus
Examining Risk Factors
Experimentation and Chaos Engineering
Load and Stress Tests
Blackhole Exercises
Purposely Burning Budget
Error Budgets for Humans
Error Budget Measurement
Establishing Error Budgets
Decision Making
Error Budget Policies

II: SLO Implementation
6. Getting Buy-In
Engineering Is More than Code
Key Stakeholders
Executive Leadership
Making It So
Order of Operation
Common Objections and How to Overcome Them
Your First Error Budget Policy (and Your First Critical Test)
Lessons Learned the Hard Way

7. Measuring SLIs and SLOs
Design Goals
Flexible Targets
Testable Targets
Organizational Constraints
Common Machinery
Centralized Time Series Statistics (Metrics)
Structured Event Databases (Logging)
Common Cases
Latency-Sensitive Request Processing
Low-Lag, High-Throughput Batch Processing
Mobile and Web Clients
The General Case
Other Considerations
Integration with Distributed Tracing
SLI and SLO Discoverability

8. SLO Monitoring and Alerting
Motivation: What Is SLO Alerting, and Why Should You Do It?
The Shortcomings of Simple Threshold Alerting
A Better Way
How to Do SLO Alerting
Choosing a Target
Error Budgets and Response Time
Error Budget Burn Rate
Rolling Windows
Putting It Together
Troubleshooting with SLO Alerting
Corner Cases
SLO Alerting in a Brownfield Setup
Parting Recommendations

9. Probability and Statistics for SLIs and SLOs
On Probability
SLI Example: Availability
SLI Example: Low QPS
On Statistics
Maximum Likelihood Estimation
Maximum a Posteriori
Bayesian Inference
SLI Example: Queueing Latency
Batch Latency
SLI Example: Durability
Further Reading

10. Architecting for Reliability
Example System: Image-Serving Service
Architectural Considerations: Hardware
Architectural Considerations: Monolith or Microservices
Architectural Considerations: Anticipating Failure Modes
Architectural Considerations: Three Types of Requests
Systems and Building Blocks
Quantitative Analysis of Systems
Instrumentation! The System Also Needs Instrumentation!
Architectural Considerations: Hardware, Revisited
SLOs as a Result of System SLIs
The Importance of Identifying and Understanding Dependencies

11. Data Reliability
Data Services
Designing Data Applications
Users of Data Services
Setting Measurable Data Objectives
Data and Data Application Reliability
Data Properties
Data Application Properties
System Design Concerns
Data Application Failures
Other Qualities
Data Lineage

12. A Worked Example
Dogs Deserve Clothes
How a Service Grows
The Design of a Service
SLIs and SLOs as User Journeys
Customers: Finding and Browsing Products
Other Services as Users: Buying Products
Internal Users
Platforms as Services

III: SLO Culture
13. Building an SLO Culture
A Culture of No SLOs
Strategies for Shifting Culture
Path to a Culture of SLOs
Getting Buy-in
Prioritizing SLO Work
Implementing Your SLO
What Will Your SLIs Be?
What Will Your SLOs Be?
Using Your SLO
Iterating on Your SLO
Determining When Your SLOs Are Good Enough
Advocating for Others to Use SLOs

14. SLO Evolution
SLO Genesis
The First Pass
Listening to Users
Periodic Revisits
Usage Changes
Increased Utilization Changes
Decreased Utilization Changes
Functional Utilization Changes
Dependency Changes
Service Dependency Changes
Platform Changes
Dependency Introduction or Retirement
Failure-Induced Changes
User Expectation and Requirement Changes
User Expectation Changes
User Requirement Changes
Tooling Changes
Measurement Changes
Calculation Changes
Intuition-Based Changes
Setting Aspirational SLOs
Identifying Incorrect SLOs
Listening to Users (Redux)
Paying Attention to Failures
How to Change SLOs
Revisit Schedules

15. Discoverable and Understandable SLOs
SLO Definition Documents
Document Repositories
Discoverability Tooling
SLO Reports

16. SLO Advocacy
Do Your Research
Prepare Your Sales Pitch
Create Your Supporting Artifacts
Run Your First Training and Workshop
Implement an SLO Pilot with a Single Service
Spread Your Message
Learn How to Handle Challenges
Work with Early Adopters to Implement SLOs for More Services
Celebrate Achievements and Build Confidence
Create a Library of Case Studies
Scale Your Training Program by Adding More Trainers
Scale Your Communications
Share Your Library of SLO Case Studies
Create a Community of SLO Experts
Continuously Improve

17. Reliability Reporting
Basic Reporting
Counting Incidents
Severity Levels
The Problem with Mean Time to X
SLOs for Basic Reporting
Advanced Reporting
SLO Status
Error Budget Status

A. SLO Definition Template
SLO Definition: Service Name
Service Overview
SLIs and SLOs
Revisit Schedule
Error Budget Policy
External Links

B. Proofs for Chapter 9
Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
Theorem 6
Theorem 7

