Fundamentals of Data Engineering

Paperback English 2022 9781098108304
Sales rank: 2353. Highest rank: 993
Expected delivery time: approximately 8 working days

Summary

Data engineering has grown rapidly in the past decade, leaving many software engineers, data scientists, and analysts looking for a comprehensive view of this practice. With this practical book, you'll learn how to plan and build systems to serve the needs of your organization and customers by evaluating the best technologies available through the framework of the data engineering lifecycle.

Authors Joe Reis and Matt Housley walk you through the data engineering lifecycle and show you how to stitch together a variety of cloud technologies to serve the needs of downstream data consumers. You'll understand how to apply the concepts of data generation, ingestion, orchestration, transformation, storage, and governance that are critical in any data environment regardless of the underlying technology.

This book will help you:
- Get a concise overview of the entire data engineering landscape
- Assess data engineering problems using an end-to-end framework of best practices
- Cut through marketing hype when choosing data technologies, architecture, and processes
- Use the data engineering lifecycle to design and build a robust architecture
- Incorporate data governance and security across the data engineering lifecycle

Specifications

ISBN13: 9781098108304
Keywords: data engineering
Language: English
Binding: paperback
Number of pages: 400
Publisher: O'Reilly
Edition: 1
Publication date: 5 July 2022
Main category: IT management / ICT

Table of Contents

Preface
What This Book Isn’t
What This Book Is About
Who Should Read This Book
Prerequisites
What You’ll Learn and How It Will Improve Your Abilities
Navigating This Book
Conventions Used in This Book
How to Contact Us
Acknowledgments

I. Foundation and Building Blocks
1. Data Engineering Described
What Is Data Engineering?
Data Engineering Defined
The Data Engineering Lifecycle
Evolution of the Data Engineer
Data Engineering and Data Science
Data Engineering Skills and Activities
Data Maturity and the Data Engineer
The Background and Skills of a Data Engineer
Business Responsibilities
Technical Responsibilities
The Continuum of Data Engineering Roles, from A to B
Data Engineers Inside an Organization
Internal-Facing Versus External-Facing Data Engineers
Data Engineers and Other Technical Roles
Data Engineers and Business Leadership
Conclusion
Additional Resources

2. The Data Engineering Lifecycle
What Is the Data Engineering Lifecycle?
The Data Lifecycle Versus the Data Engineering Lifecycle
Generation: Source Systems
Storage
Ingestion
Transformation
Serving Data
Major Undercurrents Across the Data Engineering Lifecycle
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

3. Designing Good Data Architecture
What Is Data Architecture?
Enterprise Architecture Defined
Data Architecture Defined
“Good” Data Architecture
Principles of Good Data Architecture
Principle 1: Choose Common Components Wisely
Principle 2: Plan for Failure
Principle 3: Architect for Scalability
Principle 4: Architecture Is Leadership
Principle 5: Always Be Architecting
Principle 6: Build Loosely Coupled Systems
Principle 7: Make Reversible Decisions
Principle 8: Prioritize Security
Principle 9: Embrace FinOps
Major Architecture Concepts
Domains and Services
Distributed Systems, Scalability, and Designing for Failure
Tight Versus Loose Coupling: Tiers, Monoliths, and Microservices
User Access: Single Versus Multitenant
Event-Driven Architecture
Brownfield Versus Greenfield Projects
Examples and Types of Data Architecture
Data Warehouse
Data Lake
Convergence, Next-Generation Data Lakes, and the Data Platform
Modern Data Stack
Lambda Architecture
Kappa Architecture
The Dataflow Model and Unified Batch and Streaming
Architecture for IoT
Data Mesh
Other Data Architecture Examples
Who’s Involved with Designing a Data Architecture?
Conclusion
Additional Resources

4. Choosing Technologies Across the Data Engineering Lifecycle
Team Size and Capabilities
Speed to Market
Interoperability
Cost Optimization and Business Value
Total Cost of Ownership
Total Opportunity Cost of Ownership
FinOps
Today Versus the Future: Immutable Versus Transitory Technologies
Our Advice
Location
On Premises
Cloud
Hybrid Cloud
Multicloud
Decentralized: Blockchain and the Edge
Our Advice
Cloud Repatriation Arguments
Build Versus Buy
Open Source Software
Proprietary Walled Gardens
Our Advice
Monolith Versus Modular
Monolith
Modularity
The Distributed Monolith Pattern
Our Advice
Serverless Versus Servers
Serverless
Containers
How to Evaluate Server Versus Serverless
Our Advice
Optimization, Performance, and the Benchmark Wars
Big Data...for the 1990s
Nonsensical Cost Comparisons
Asymmetric Optimization
Caveat Emptor
Undercurrents and Their Impacts on Choosing Technologies
Data Management
DataOps
Data Architecture
Orchestration Example: Airflow
Software Engineering
Conclusion
Additional Resources

II. The Data Engineering Lifecycle in Depth
5. Data Generation in Source Systems
Sources of Data: How Is Data Created?
Source Systems: Main Ideas
Files and Unstructured Data
APIs
Application Databases (OLTP Systems)
Online Analytical Processing System
Change Data Capture
Logs
Database Logs
CRUD
Insert-Only
Messages and Streams
Types of Time
Source System Practical Details
Databases
APIs
Data Sharing
Third-Party Data Sources
Message Queues and Event-Streaming Platforms
Whom You’ll Work With
Undercurrents and Their Impact on Source Systems
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

6. Storage
Raw Ingredients of Data Storage
Magnetic Disk Drive
Solid-State Drive
Random Access Memory
Networking and CPU
Serialization
Compression
Caching
Data Storage Systems
Single Machine Versus Distributed Storage
Eventual Versus Strong Consistency
File Storage
Block Storage
Object Storage
Cache and Memory-Based Storage Systems
The Hadoop Distributed File System
Streaming Storage
Indexes, Partitioning, and Clustering
Data Engineering Storage Abstractions
The Data Warehouse
The Data Lake
The Data Lakehouse
Data Platforms
Stream-to-Batch Storage Architecture
Big Ideas and Trends in Storage
Data Catalog
Data Sharing
Schema
Separation of Compute from Storage
Data Storage Lifecycle and Data Retention
Single-Tenant Versus Multitenant Storage
Whom You’ll Work With
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

7. Ingestion
What Is Data Ingestion?
Key Engineering Considerations for the Ingestion Phase
Bounded Versus Unbounded Data
Frequency
Synchronous Versus Asynchronous Ingestion
Serialization and Deserialization
Throughput and Scalability
Reliability and Durability
Payload
Push Versus Pull Versus Poll Patterns
Batch Ingestion Considerations
Snapshot or Differential Extraction
File-Based Export and Ingestion
ETL Versus ELT
Inserts, Updates, and Batch Size
Data Migration
Message and Stream Ingestion Considerations
Schema Evolution
Late-Arriving Data
Ordering and Multiple Delivery
Replay
Time to Live
Message Size
Error Handling and Dead-Letter Queues
Consumer Pull and Push
Location
Ways to Ingest Data
Direct Database Connection
Change Data Capture
APIs
Message Queues and Event-Streaming Platforms
Managed Data Connectors
Moving Data with Object Storage
EDI
Databases and File Export
Practical Issues with Common File Formats
Shell
SSH
SFTP and SCP
Webhooks
Web Interface
Web Scraping
Transfer Appliances for Data Migration
Data Sharing
Whom You’ll Work With
Upstream Stakeholders
Downstream Stakeholders
Undercurrents
Security
Data Management
DataOps
Orchestration
Software Engineering
Conclusion
Additional Resources

8. Queries, Modeling, and Transformation
Queries
What Is a Query?
The Life of a Query
The Query Optimizer
Improving Query Performance
Queries on Streaming Data
Data Modeling
What Is a Data Model?
Conceptual, Logical, and Physical Data Models
Normalization
Techniques for Modeling Batch Analytical Data
Modeling Streaming Data
Transformations
Batch Transformations
Materialized Views, Federation, and Query Virtualization
Streaming Transformations and Processing
Whom You’ll Work With
Upstream Stakeholders
Downstream Stakeholders
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

9. Serving Data for Analytics, Machine Learning, and Reverse ETL
General Considerations for Serving Data
Trust
What’s the Use Case, and Who’s the User?
Data Products
Self-Service or Not?
Data Definitions and Logic
Data Mesh
Analytics
Business Analytics
Operational Analytics
Embedded Analytics
Machine Learning
What a Data Engineer Should Know About ML
Ways to Serve Data for Analytics and ML
File Exchange
Databases
Streaming Systems
Query Federation
Data Sharing
Semantic and Metrics Layers
Serving Data in Notebooks
Reverse ETL
Whom You’ll Work With
Undercurrents
Security
Data Management
DataOps
Data Architecture
Orchestration
Software Engineering
Conclusion
Additional Resources

III. Security, Privacy, and the Future of Data Engineering
10. Security and Privacy
People
The Power of Negative Thinking
Always Be Paranoid
Processes
Security Theater Versus Security Habit
Active Security
The Principle of Least Privilege
Shared Responsibility in the Cloud
Always Back Up Your Data
An Example Security Policy
Technology
Patch and Update Systems
Encryption
Logging, Monitoring, and Alerting
Network Access
Security for Low-Level Data Engineering
Conclusion
Additional Resources

11. The Future of Data Engineering
The Data Engineering Lifecycle Isn’t Going Away
The Decline of Complexity and the Rise of Easy-to-Use Data Tools
The Cloud-Scale Data OS and Improved Interoperability
“Enterprisey” Data Engineering
Titles and Responsibilities Will Morph...
Moving Beyond the Modern Data Stack, Toward the Live Data Stack
The Live Data Stack
Streaming Pipelines and Real-Time Analytical Databases
The Fusion of Data with Applications
The Tight Feedback Between Applications and ML
Dark Matter Data and the Rise of...Spreadsheets?!
Conclusion

A. Serialization and Compression Technical Details
Serialization Formats
Row-Based Serialization
Columnar Serialization
Hybrid Serialization
Database Storage Engines
Compression: gzip, bzip2, Snappy, Etc.

B. Cloud Networking
Cloud Network Topology
Data Egress Charges
Availability Zones
Regions
GCP-Specific Networking and Multiregional Redundancy
Direct Network Connections to the Clouds
CDNs
The Future of Data Egress Fees

Index
About the Authors
