The Site Reliability Workbook
Practical Ways to Implement SRE
Paperback, English, 2018, 1st edition, ISBN 9781492029502
Summary
In 2016, Google’s Site Reliability Engineering book ignited an industry discussion on what it means to run production services today—and why reliability considerations are fundamental to service design. Now, Google engineers who worked on that bestseller introduce The Site Reliability Workbook, a hands-on companion that uses concrete examples to show you how to put SRE principles and practices to work in your environment.
This new workbook not only combines practical examples from Google’s experiences, but also provides case studies from Google’s Cloud Platform customers who underwent this journey. Evernote, The Home Depot, The New York Times, and other companies outline hard-won experiences of what worked for them and what didn’t.
Dive into this workbook and learn how to flesh out your own SRE practice, no matter what size your company is.
You’ll learn:
- How to run reliable services in environments you don’t completely control, such as the cloud
- Practical applications of how to create, monitor, and run your services via Service Level Objectives
- How to convert existing ops teams to SRE—including how to dig out of operational overload
- Methods for starting SRE from either greenfield or brownfield
Table of Contents
Foreword II
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
1. How SRE Relates to DevOps
Background on DevOps
No More Silos
Accidents Are Normal
Change Should Be Gradual
Tooling and Culture Are Interrelated
Measurement Is Crucial
Background on SRE
Operations Is a Software Problem
Manage by Service Level Objectives (SLOs)
Work to Minimize Toil
Automate This Year’s Job Away
Move Fast by Reducing the Cost of Failure
Share Ownership with Developers
Use the Same Tooling, Regardless of Function or Job Title
Compare and Contrast
Organizational Context and Fostering Successful Adoption
Narrow, Rigid Incentives Narrow Your Success
It’s Better to Fix It Yourself; Don’t Blame Someone Else
Consider Reliability Work as a Specialized Role
When Can Substitute for Whether
Strive for Parity of Esteem: Career and Financial
Conclusion
I. Foundations
2. Implementing SLOs
Why SREs Need SLOs
Getting Started
Reliability Targets and Error Budgets
What to Measure: Using SLIs
A Worked Example
Moving from SLI Specification to SLI Implementation
Measuring the SLIs
Using the SLIs to Calculate Starter SLOs
Choosing an Appropriate Time Window
Getting Stakeholder Agreement
Establishing an Error Budget Policy
Documenting the SLO and Error Budget Policy
Dashboards and Reports
Continuous Improvement of SLO Targets
Improving the Quality of Your SLO
Decision Making Using SLOs and Error Budgets
Advanced Topics
Modeling User Journeys
Grading Interaction Importance
Modeling Dependencies
Experimenting with Relaxing Your SLOs
Conclusion
3. SLO Engineering Case Studies
Evernote’s SLO Story
Why Did Evernote Adopt the SRE Model?
Introduction of SLOs: A Journey in Progress
Breaking Down the SLO Wall Between Customer and Cloud Provider
Current State
The Home Depot’s SLO Story
The SLO Culture Project
Our First Set of SLOs
Evangelizing SLOs
Automating VALET Data Collection
The Proliferation of SLOs
Applying VALET to Batch Applications
Using VALET in Testing
Future Aspirations
Summary
Conclusion
4. Monitoring
Desirable Features of a Monitoring Strategy
Speed
Calculations
Interfaces
Alerts
Sources of Monitoring Data
Examples
Managing Your Monitoring System
Treat Your Configuration as Code
Encourage Consistency
Prefer Loose Coupling
Metrics with Purpose
Intended Changes
Dependencies
Saturation
Status of Served Traffic
Implementing Purposeful Metrics
Testing Alerting Logic
Conclusion
5. Alerting on SLOs
Alerting Considerations
Ways to Alert on Significant Events
1: Target Error Rate ≥ SLO Threshold
2: Increased Alert Window
3: Incrementing Alert Duration
4: Alert on Burn Rate
5: Multiple Burn Rate Alerts
6: Multiwindow, Multi-Burn-Rate Alerts
Low-Traffic Services and Error Budget Alerting
Generating Artificial Traffic
Combining Services
Making Service and Infrastructure Changes
Lowering the SLO or Increasing the Window
Extreme Availability Goals
Alerting at Scale
Conclusion
6. Eliminating Toil
What Is Toil?
Measuring Toil
Toil Taxonomy
Business Processes
Production Interrupts
Release Shepherding
Migrations
Cost Engineering and Capacity Planning
Troubleshooting for Opaque Architectures
Toil Management Strategies
Identify and Measure Toil
Engineer Toil Out of the System
Reject the Toil
Use SLOs to Reduce Toil
Start with Human-Backed Interfaces
Provide Self-Service Methods
Get Support from Management and Colleagues
Promote Toil Reduction as a Feature
Start Small and Then Improve
Increase Uniformity
Assess Risk Within Automation
Automate Toil Response
Use Open Source and Third-Party Tools
Use Feedback to Improve
Case Studies
Case Study 1: Reducing Toil in the Datacenter with Automation
Background
Problem Statement
What We Decided to Do
Design First Effort: Saturn Line-Card Repair
Implementation
Design Second Effort: Saturn Line-Card Repair Versus Jupiter Line-Card Repair
Implementation
Lessons Learned
Case Study 2: Decommissioning Filer-Backed Home Directories
Background
Problem Statement
What We Decided to Do
Design and Implementation
Key Components
Lessons Learned
Conclusion
7. Simplicity
Measuring Complexity
Simplicity Is End-to-End, and SREs Are Good for That
Case Study 1: End-to-End API Simplicity
Case Study 2: Project Lifecycle Complexity
Regaining Simplicity
Case Study 3: Simplification of the Display Ads Spiderweb
Case Study 4: Running Hundreds of Microservices on a Shared Platform
Case Study 5: pDNS No Longer Depends on Itself
Conclusion
II. Practices
8. On-Call
Recap of “Being On-Call” Chapter of First SRE Book
Example On-Call Setups Within Google and Outside Google
Google: Forming a New Team
Evernote: Finding Our Feet in the Cloud
Practical Implementation Details
Anatomy of Pager Load
On-Call Flexibility
On-Call Team Dynamics
Conclusion
9. Incident Response
Incident Management at Google
Incident Command System
Main Roles in Incident Response
Case Studies
Case Study 1: Software Bug—The Lights Are On but No One’s (Google) Home
Case Study 2: Service Fault—Cache Me If You Can
Case Study 3: Power Outage—Lightning Never Strikes Twice…Until It Does
Case Study 4: Incident Response at PagerDuty
Putting Best Practices into Practice
Incident Response Training
Prepare Beforehand
Drills
Conclusion
10. Postmortem Culture: Learning from Failure
Case Study
Bad Postmortem
Why Is This Postmortem Bad?
Good Postmortem
Why Is This Postmortem Better?
Organizational Incentives
Model and Enforce Blameless Behavior
Reward Postmortem Outcomes
Share Postmortems Openly
Respond to Postmortem Culture Failures
Tools and Templates
Postmortem Templates
Postmortem Tooling
Conclusion
11. Managing Load
Google Cloud Load Balancing
Anycast
Maglev
Global Software Load Balancer
Google Front End
GCLB: Low Latency
GCLB: High Availability
Case Study 1: Pokémon GO on GCLB
Autoscaling
Handling Unhealthy Machines
Working with Stateful Systems
Configuring Conservatively
Setting Constraints
Including Kill Switches and Manual Overrides
Avoiding Overloading Backends
Avoiding Traffic Imbalance
Combining Strategies to Manage Load
Case Study 2: When Load Shedding Attacks
Conclusion
12. Introducing Non-Abstract Large System Design
What Is NALSD?
Why “Non-Abstract”?
AdWords Example
Design Process
Initial Requirements
One Machine
Distributed System
Conclusion
13. Data Processing Pipelines
Pipeline Applications
Event Processing/Data Transformation to Order or Structure Data
Data Analytics
Machine Learning
Pipeline Best Practices
Define and Measure Service Level Objectives
Plan for Dependency Failure
Create and Maintain Pipeline Documentation
Map Your Development Lifecycle
Reduce Hotspotting and Workload Patterns
Implement Autoscaling and Resource Planning
Adhere to Access Control and Security Policies
Plan Escalation Paths
Pipeline Requirements and Design
What Features Do You Need?
Idempotent and Two-Phase Mutations
Checkpointing
Code Patterns
Pipeline Production Readiness
Pipeline Failures: Prevention and Response
Potential Failure Modes
Potential Causes
Case Study: Spotify
Event Delivery
Event Delivery System Design and Architecture
Event Delivery System Operation
Customer Integration and Support
Summary
Conclusion
14. Configuration Design and Best Practices
What Is Configuration?
Configuration and Reliability
Separating Philosophy and Mechanics
Configuration Philosophy
Configuration Asks Users Questions
Questions Should Be Close to User Goals
Mandatory and Optional Questions
Escaping Simplicity
Mechanics of Configuration
Separate Configuration and Resulting Data
Importance of Tooling
Ownership and Change Tracking
Safe Configuration Change Application
Conclusion
15. Configuration Specifics
Configuration-Induced Toil
Reducing Configuration-Induced Toil
Critical Properties and Pitfalls of Configuration Systems
Pitfall 1: Failing to Recognize Configuration as a Programming Language Problem
Pitfall 2: Designing Accidental or Ad Hoc Language Features
Pitfall 3: Building Too Much Domain-Specific Optimization
Pitfall 4: Interleaving “Configuration Evaluation” with “Side Effects”
Pitfall 5: Using an Existing General-Purpose Scripting Language Like Python, Ruby, or Lua
Integrating a Configuration Language
Generating Config in Specific Formats
Driving Multiple Applications
Integrating an Existing Application: Kubernetes
What Kubernetes Provides
Example Kubernetes Config
Integrating the Configuration Language
Integrating Custom Applications (In-House Software)
Effectively Operating a Configuration System
Versioning
Source Control
Tooling
Testing
When to Evaluate Configuration
Very Early: Checking in the JSON
Middle of the Road: Evaluate at Build Time
Late: Evaluate at Runtime
Guarding Against Abusive Configuration
Conclusion
16. Canarying Releases
Release Engineering Principles
Balancing Release Velocity and Reliability
What Is Canarying?
Release Engineering and Canarying
Requirements of a Canary Process
Our Example Setup
A Roll Forward Deployment Versus a Simple Canary Deployment
Canary Implementation
Minimizing Risk to SLOs and the Error Budget
Choosing a Canary Population and Duration
Selecting and Evaluating Metrics
Metrics Should Indicate Problems
Metrics Should Be Representative and Attributable
Before/After Evaluation Is Risky
Use a Gradual Canary for Better Metric Selection
Dependencies and Isolation
Canarying in Noninteractive Systems
Requirements on Monitoring Data
Related Concepts
Blue/Green Deployment
Artificial Load Generation
Traffic Teeing
Conclusion
III. Processes
17. Identifying and Recovering from Overload
From Load to Overload
Case Study 1: Work Overload When Half a Team Leaves
Background
Problem Statement
What We Decided to Do
Implementation
Lessons Learned
Case Study 2: Perceived Overload After Organizational and Workload Changes
Background
Problem Statement
What We Decided to Do
Implementation
Effects
Lessons Learned
Strategies for Mitigating Overload
Recognizing the Symptoms of Overload
Reducing Overload and Restoring Team Health
Conclusion
18. SRE Engagement Model
The Service Lifecycle
Phase 1: Architecture and Design
Phase 2: Active Development
Phase 3: Limited Availability
Phase 4: General Availability
Phase 5: Deprecation
Phase 6: Abandoned
Phase 7: Unsupported
Setting Up the Relationship
Communicating Business and Production Priorities
Identifying Risks
Aligning Goals
Setting Ground Rules
Planning and Executing
Sustaining an Effective Ongoing Relationship
Investing Time in Working Better Together
Maintaining an Open Line of Communication
Performing Regular Service Reviews
Reassessing When Ground Rules Start to Slip
Adjusting Priorities According to Your SLOs and Error Budget
Handling Mistakes Appropriately
Scaling SRE to Larger Environments
Supporting Multiple Services with a Single SRE Team
Structuring a Multiple SRE Team Environment
Adapting SRE Team Structures to Changing Circumstances
Running Cohesive Distributed SRE Teams
Ending the Relationship
Case Study 1: Ares
Case Study 2: Data Analysis Pipeline
Conclusion
19. SRE: Reaching Beyond Your Walls
Truths We Hold to Be Self-Evident
Reliability Is the Most Important Feature
Your Users, Not Your Monitoring, Decide Your Reliability
If You Run a Platform, Then Reliability Is a Partnership
Everything Important Eventually Becomes a Platform
When Your Customers Have a Hard Time, You Have to Slow Down
You Will Need to Practice SRE with Your Customers
How to: SRE with Your Customers
Step 1: SLOs and SLIs Are How You Speak
Step 2: Audit the Monitoring and Build Shared Dashboards
Step 3: Measure and Renegotiate
Step 4: Design Reviews and Risk Analysis
Step 5: Practice, Practice, Practice
Be Thoughtful and Disciplined
Conclusion
20. SRE Team Lifecycles
SRE Practices Without SREs
Starting an SRE Role
Finding Your First SRE
Placing Your First SRE
Bootstrapping Your First SRE
Distributed SREs
Your First SRE Team
Forming
Storming
Norming
Performing
Making More SRE Teams
Service Complexity
SRE Rollout
Geographical Splits
Suggested Practices for Running Many Teams
Mission Control
SRE Exchange
Training
Horizontal Projects
SRE Mobility
Travel
Launch Coordination Engineering Teams
Production Excellence
SRE Funding and Hiring
Conclusion
21. Organizational Change Management in SRE
SRE Embraces Change
Introduction to Change Management
Lewin’s Three-Stage Model
McKinsey’s 7-S Model
Kotter’s Eight-Step Process for Leading Change
The Prosci ADKAR Model
Emotion-Based Models
The Deming Cycle
How These Theories Apply to SRE
Case Study 1: Scaling Waze—From Ad Hoc to Planned Change
Background
The Messaging Queue: Replacing a System While Maintaining Reliability
The Next Cycle of Change: Improving the Deployment Process
Lessons Learned
Case Study 2: Common Tooling Adoption in SRE
Background
Problem Statement
What We Decided to Do
Design
Implementation: Monitoring
Lessons Learned
Conclusion
Conclusion
Onward…
The Future Belongs to the Past
SRE + <Insert Other Discipline>
Trickles, Streams, and Floods
SRE Belongs to All of Us
On Gratitude
A. Example SLO Document
B. Example Error Budget Policy
C. Results of Postmortem Analysis
Index