Cost-Effective Data Pipelines

Balancing Trade-Offs When Developing Pipelines in the Cloud

Paperback, English, 2023, 1st edition, ISBN 9781492098645

Summary

The low cost of getting started with cloud services can easily evolve into a significant expense down the road. That's challenging for teams developing data pipelines, particularly when rapid changes in technology and workload require a constant cycle of redesign. How do you deliver scalable, highly available products while keeping costs in check?

With this practical guide, author Sev Leonard provides a holistic approach to designing scalable data pipelines in the cloud. Intermediate data engineers, software developers, and architects will learn how to navigate cost/performance trade-offs and how to choose and configure compute and storage. You'll also pick up best practices for code development, testing, and monitoring.

By focusing on the entire design process, you'll be able to deliver cost-effective, high-quality products.

This book helps you:
- Reduce cloud spend with lower-cost cloud service offerings and smart design strategies
- Minimize waste without sacrificing performance by rightsizing compute resources
- Drive pipeline evolution, head off performance issues, and quickly debug with effective monitoring
- Set up development and test environments that minimize cloud service dependencies
- Create data pipeline code bases that are testable and extensible, fostering rapid development and evolution
- Improve data quality and pipeline operation through validation and testing

Specifications

ISBN-13: 9781492098645
Keywords: Data Warehousing
Language: English
Binding: Paperback
Number of pages: 275
Publisher: O'Reilly
Edition: 1
Publication date: 28 July 2023
Main category: IT management / ICT

Table of Contents

Preface
Who This Book Is For
What You Will Learn
What This Book Is Not
Running Example
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments

1. Designing Compute for Data Pipelines
Understanding Availability of Cloud Compute
Outages
Capacity Limits
Account Limits
Infrastructure
Leveraging Different Purchasing Options in Pipeline Design
On Demand
Spot/Interruptible
Contractual Discounts
Contractual Discounts in the Real World: A Cautionary Tale
Requirements Gathering for Compute Design
Business Requirements
Architectural Requirements
Requirements-Gathering Example: HoD Batch Ingest
Benchmarking
Instance Family Identification
Cluster Sizing
Monitoring
Benchmarking Example
Undersized
Oversized
Right-Sized
Summary
Recommended Readings

2. Responding to Changes in Demand by Scaling Compute
Identifying Scaling Opportunities
Variation in Data Pipelines
Scaling Metrics
Pipeline Scaling Example
Designing for Scaling
Implementing Scaling Plans
Scaling Mechanics
Common Autoscaling Pitfalls
Autoscaling Example
Summary
Recommended Readings

3. Data Organization in the Cloud
Cloud Storage Costs
Storage at Rest
Egress
Data Access
Cloud Storage Organization
Storage Bucket Strategies
Lifecycle Configurations
File Structure Design
File Formats
Partitioning
Compaction
Summary
Recommended Readings

4. Economical Pipeline Fundamentals
Idempotency
Preventing Data Duplication
Tolerating Data Duplication
Checkpointing
Automatic Retries
Retry Considerations
Retry Levels in Data Pipelines
Data Validation
Validating Data Characteristics
Schemas
Summary

5. Setting Up Effective Development Environments
Environments
Software Environments
Data Environments
Data Pipeline Environments
Environment Planning
Local Development
Containers
Resource Dependency Reduction
Resource Cleanup
Summary

6. Software Development Strategies
Managing Different Coding Environments
Example: A Multimodal Pipeline
Example: How Code Becomes Difficult to Change
Modular Design
Single Responsibility
Dependency Inversion
Modular Design with DataFrames
Configurable Design
Summary
Recommended Readings

7. Unit Testing
The Role of Unit Testing in Data Pipelines
Unit Testing Overview
Example: Identifying Unit Testing Needs
Pipeline Areas to Unit-Test
Data Logic
Connections
Observability
Data Modification Processes
Cloud Components
Working with Dependencies
Interfaces
Data
Example: Unit Testing Plan
Identifying Components to Test
Identifying Dependencies
Summary

8. Mocks
Considerations for Replacing Dependencies
Placement
Dependency Stability
Complexity Versus Criticality
Mocking Generic Interfaces
Responses
Requests
Connectivity
Mocking Cloud Services
Building Your Own Mocks
Mocking with Moto
Testing with Databases
Test Database Example
Working with Test Databases
Summary
Further Exploration
More Moto Mocks
Mock Placement

9. Data for Testing
Working with Live Data
Benefits
Challenges
Working with Synthetic Data
Benefits
Challenges
Is Synthetic Data the Right Approach?
Manual Data Generation
Automated Data Generation
Synthetic Data Libraries
Schema-Driven Generation
Property-Based Testing
Summary

10. Logging
Logging Costs
Impact of Scale
Impact of Cloud Storage Elasticity
Reducing Logging Costs
Effective Logging
Summary

11. Finding Your Way with Monitoring
Costs of Inadequate Monitoring
Getting Lost in the Woods
Navigation to the Rescue
System Monitoring
Data Volume
Throughput
Consumer Lag
Worker Utilization
Resource Monitoring
Understanding the Bounds
Understanding Reliability Impacts
Pipeline Performance
Pipeline Stage Duration
Profiling
Errors to Watch Out For
Query Monitoring
Minimizing Monitoring Costs
Summary
Recommended Readings

12. Essential Takeaways
An Ounce of Prevention Is Worth a Pound of Cure
Rein In Compute Spend
Organize Your Resources
Design for Interruption
Build In Data Quality
Change Is the Only Constant
Design for Change
Monitor for Change
Parting Thoughts

Appendix. Preparing a Cloud Budget
It’s All About the Details
Historical Data
Estimating for New Projects
Changes That Impact Costs
Creating a Budget
Budget Summary
Changes Between Previous and Next Budget Periods
Cost Breakdown
Communicating the Budget
Summary

Index
About the Author
