The Cloud Data Lake
A Guide to Building Robust Cloud Data Architecture
Paperback, English, 2022, ISBN 9781098116583
More organizations than ever understand the importance of data lake architectures for deriving value from their data. Building a robust, scalable, and performant data lake remains a complex proposition, however, with a buffet of tools and options that need to work together to provide a seamless end-to-end pipeline from data to insights.
This book provides a concise yet comprehensive overview of the setup, management, and governance of a cloud data lake. Author Rukmani Gopalan, a product management leader and data enthusiast, guides data architects and engineers through the major aspects of working with a cloud data lake, from design considerations and best practices to data format optimizations, performance optimization, cost management, and governance.
- Learn the benefits of a cloud-based big data strategy for your organization
- Get guidance and best practices for designing performant and scalable data lakes
- Examine architecture and design choices, and data governance principles and strategies
- Build a data strategy that scales as your organizational and business needs increase
- Implement a scalable data lake in the cloud
- Use cloud-based advanced analytics to gain more value from your data
Why I Wrote This Book
Who Should Read This Book?
Introducing Klodars Corporation
Navigating the Book
Conventions Used in This Book
O'Reilly Online Learning
How to Contact Us
1. Big Data Beyond the Buzz
What Is Big Data?
Elastic Data Infrastructure: The Challenge
Cloud Computing Fundamentals
Cloud Computing Terminology
Value Proposition of the Cloud
Cloud Data Lake Architecture
Limitations of On-Premises Data Warehouse Solutions
What Is a Cloud Data Lake Architecture?
Benefits of a Cloud Data Lake Architecture
Defining Your Cloud Data Lake Journey
2. Big Data Architectures on the Cloud
Why Klodars Corporation Moves to the Cloud
Fundamentals of Cloud Data Lake Architectures
A Word on Variety of Data
Cloud Data Lake Storage
Big Data Analytics Engines
Cloud Data Warehouses
Modern Data Warehouse Architecture
Sample Use Case for a Modern Data Warehouse Architecture
Benefits and Challenges of Modern Data Warehouse Architecture
Data Lakehouse Architecture
Reference Architecture for the Data Lakehouse
Sample Use Case for Data Lakehouse Architecture
Benefits and Challenges of the Data Lakehouse Architecture
Data Warehouses and Unstructured Data
Sample Use Case for a Data Mesh Architecture
Challenges and Benefits of a Data Mesh Architecture
What Is the Right Architecture for Me?
Know Your Customers
Know Your Business Drivers
Consider Your Growth and Future Scenarios
3. Design Considerations for Your Data Lake
Setting Up the Cloud Data Lake Infrastructure
Identify Your Goals
Plan Your Architecture and Deliverables
Implement the Cloud Data Lake
Release and Operationalize
Organizing Data in Your Data Lake
A Day in the Life of Data
Data Lake Zones
Introduction to Data Governance
Actors Involved in Data Governance
Metadata Management, Data Catalog, and Data Sharing
Data Access Management
Data Quality and Observability
Data Governance at Klodars Corporation
Data Governance Wrap-Up
Manage Data Lake Costs
Demystifying Data Lake Costs on the Cloud
Data Lake Cost Strategy
4. Scalable Data Lakes
A Sneak Peek into Scalability
What Is Scalability?
Scale in Our Day-to-Day Life
Scalability in Data Lake Architectures
Internals of Data Lake Processing Systems
Data Copy Internals
ELT/ETL Processing Internals
A Note on Other Interactive Queries
Considerations for Scalable Data Lake Solutions
Pick the Right Cloud Offerings
Plan for Peak Capacity
Data Formats and Job Profile
5. Optimizing Cloud Data Lake Architectures for Performance
Basics of Measuring Performance
Goals and Metrics for Performance
Optimizing for Faster Performance
Cloud Data Lake Performance
SLAs, SLOs, and SLIs
Example: How Klodars Corporation Managed Its SLAs, SLOs, and SLIs
Drivers of Performance
Performance Drivers for a Copy Job
Performance Drivers for a Spark Job
Optimization Principles and Techniques for Performance Tuning
Data Organization and Partitioning
Choosing the Right Configurations on Apache Spark
Minimize Overheads with Data Transfer
Premium Offerings and Performance
The Case of Bigger Virtual Machines
The Case of Flash Storage
6. Deep Dive on Data Formats
Why Do We Need These Open Data Formats?
Why Do We Need to Store Tabular Data?
Why Is It a Problem to Store Tabular Data in a Cloud Data Lake Storage?
Why Was Delta Lake Founded?
How Does Delta Lake Work?
When Do You Use Delta Lake?
Why Was Apache Iceberg Founded?
How Does Apache Iceberg Work?
When Do You Use Apache Iceberg?
Why Was Apache Hudi Founded?
How Does Apache Hudi Work?
When Do You Use Apache Hudi?
7. Decision Framework for Your Architecture
Cloud Data Lake Assessment
Cloud Data Lake Assessment Questionnaire
Analysis for Your Cloud Data Lake Assessment
Starting from Scratch
Migrating an Existing Data Lake or Data Warehouse to the Cloud
Improving an Existing Cloud Data Lake
Phase 1 of Decision Framework: Assess
Understand Customer Requirements
Understand Opportunities for Improvement
Know Your Business Drivers
Complete the Assess Phase by Prioritizing the Requirements
Phase 2 of Decision Framework: Define
Finalize the Design Choices for the Cloud Data Lake
Plan Your Cloud Data Lake Project Deliverables
Phase 3 of Decision Framework: Implement
Phase 4 of Decision Framework: Operationalize
8. Six Lessons for a Data Informed Future
Lesson 1: Focus on the How and When, Not the If and Why, When It Comes to Cloud Data Lakes
Lesson 2: With Great Power Comes Great Responsibility - Data Is No Exception
Lesson 3: Customers Lead Technology, Not the Other Way Around
Lesson 4: Change Is Inevitable, so Be Prepared
Lesson 5: Build Empathy and Prioritize Ruthlessly
Lesson 6: Big Impact Does Not Happen Overnight
A. Cloud Data Lake Decision Framework Template
Phase 1: Assess Framework
Phase 2: Define Framework
Planning the Cloud Data Lake Deliverables
Phase 3: Implement Framework
About the Author