Data Pipelines Pocket Reference

Author: James Densmore

Paperback, English, 2021, 1st edition, ISBN 9781492087830

34,27

Expected delivery time: approximately 16 working days

Summary

Data pipelines are the foundation for success in data analytics. Moving data from numerous diverse sources and transforming it to provide context is the difference between having data and actually gaining value from it. This pocket reference defines data pipelines and explains how they work in the modern data stack.

You'll learn common considerations and key decision points when implementing pipelines, such as batch versus streaming data ingestion and build versus buy. This book addresses the most common decisions made by data professionals and discusses foundational concepts that apply to open source frameworks, commercial products, and homegrown solutions.

You'll learn:
- What a data pipeline is and how it works
- How data is moved and processed on modern data infrastructure, including cloud platforms
- Common tools and products used by data engineers to build pipelines
- How pipelines support analytics and reporting needs
- Considerations for pipeline maintenance, testing, and alerting

Specifications

ISBN-13: 9781492087830

Keywords: Data Warehousing, Data Pipelines

Language: English

Binding: Paperback

Number of pages: 200

Publisher: O'Reilly

Edition: 1

Publication date: 28 May 2021

Main category: IT management / ICT

Series: Pocket Reference (O'Reilly)

Table of Contents

Preface
Who This Book Is For
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments

1. Introduction to Data Pipelines
What Are Data Pipelines?
Who Builds Data Pipelines?
SQL and Data Warehousing Fundamentals
Python and/or Java
Distributed Computing
Basic System Administration
A Goal-Oriented Mentality
Why Build Data Pipelines?
How Are Pipelines Built?

2. A Modern Data Infrastructure
Diversity of Data Sources
Source System Ownership
Ingestion Interface and Data Structure
Data Volume
Data Cleanliness and Validity
Latency and Bandwidth of the Source System
Cloud Data Warehouses and Data Lakes
Data Ingestion Tools
Data Transformation and Modeling Tools
Workflow Orchestration Platforms
Directed Acyclic Graphs
Customizing Your Data Infrastructure

3. Common Data Pipeline Patterns
ETL and ELT
The Emergence of ELT over ETL
EtLT Subpattern
ELT for Data Analysis
ELT for Data Science
ELT for Data Products and Machine Learning
Steps in a Machine Learning Pipeline
Incorporate Feedback in the Pipeline
Further Reading on ML Pipelines

4. Data Ingestion: Extracting Data
Setting Up Your Python Environment
Setting Up Cloud File Storage
Extracting Data from a MySQL Database
Full or Incremental MySQL Table Extraction
Binary Log Replication of MySQL Data
Extracting Data from a PostgreSQL Database
Full or Incremental Postgres Table Extraction
Replicating Data Using the Write-Ahead Log
Extracting Data from MongoDB
Extracting Data from a REST API
Streaming Data Ingestions with Kafka and Debezium

5. Data Ingestion: Loading Data
Configuring an Amazon Redshift Warehouse as a Destination
Loading Data into a Redshift Warehouse
Incremental Versus Full Loads
Loading Data Extracted from a CDC Log
Configuring a Snowflake Warehouse as a Destination
Loading Data into a Snowflake Data Warehouse
Using Your File Storage as a Data Lake
Open Source Frameworks
Commercial Alternatives

6. Transforming Data
Noncontextual Transformations
Deduplicating Records in a Table
Parsing URLs
When to Transform? During or After Ingestion?
Data Modeling Foundations
Key Data Modeling Terms
Modeling Fully Refreshed Data
Slowly Changing Dimensions for Fully Refreshed Data
Modeling Incrementally Ingested Data
Modeling Append-Only Data
Modeling Change Capture Data

7. Orchestrating Pipelines
Directed Acyclic Graphs
Apache Airflow Setup and Overview
Installing and Configuring
Airflow Database
Web Server and UI
Scheduler
Executors
Operators
Building Airflow DAGs
A Simple DAG
An ELT Pipeline DAG
Additional Pipeline Tasks
Alerts and Notifications
Data Validation Checks
Advanced Orchestration Configurations
Coupled Versus Uncoupled Pipeline Tasks
When to Split Up DAGs
Coordinating Multiple DAGs with Sensors
Managed Airflow Options
Other Orchestration Frameworks

8. Data Validation in Pipelines
Validate Early, Validate Often
Source System Data Quality
Data Ingestion Risks
Enabling Data Analyst Validation
A Simple Validation Framework
Validator Framework Code
Structure of a Validation Test
Running a Validation Test
Usage in an Airflow DAG
When to Halt a Pipeline, When to Warn and Continue
Extending the Framework
Validation Test Examples
Duplicate Records After Ingestion
Unexpected Change in Row Count After Ingestion
Metric Value Fluctuations
Commercial and Open Source Data Validation Frameworks

9. Best Practices for Maintaining Pipelines
Handling Changes in Source Systems
Introduce Abstraction
Maintain Data Contracts
Limits of Schema-on-Read
Scaling Complexity
Standardizing Data Ingestion
Reuse of Data Model Logic
Ensuring Dependency Integrity

10. Measuring and Monitoring Pipeline Performance
Key Pipeline Metrics
Prepping the Data Warehouse
A Data Infrastructure Schema
Logging and Ingesting Performance Data
Ingesting DAG Run History from Airflow
Adding Logging to the Data Validator
Transforming Performance Data
DAG Success Rate
DAG Runtime Change Over Time
Validation Test Volume and Success Rate
Orchestrating a Performance Pipeline
The Performance DAG
Performance Transparency

Index