Architecting Modern Data Platforms
A Guide to Enterprise Hadoop at Scale
Paperback, English, 2019, 1st edition, ISBN 9781491969274
Summary
There's a lot of information about big data technologies, but splicing these technologies into an end-to-end enterprise data platform is a daunting task not widely covered. With this practical book, you'll learn how to build big data infrastructure both on-premises and in the cloud and successfully architect a modern data platform.
Ideal for enterprise architects, IT managers, application architects, and data engineers, this book shows you how to overcome the many challenges that emerge during Hadoop projects.
You'll explore the vast landscape of tools available in the Hadoop and big data realm in a thorough technical primer before diving into:
- Infrastructure: Look at all component layers in a modern data platform, from the server to the data center, to establish a solid foundation for data in your enterprise
- Platform: Understand aspects of deployment, operation, security, high availability, and disaster recovery, along with everything you need to know to integrate your platform with the rest of your enterprise IT
- Taking Hadoop to the cloud: Learn the important architectural aspects of running a big data platform in the cloud while maintaining enterprise security and high availability
Table of Contents
Preface
Some Misconceptions
Some General Trends
Horizontal Scaling
Adoption of Open Source
Embracing Cloud Compute
Decoupled Compute and Storage
What Is This Book About?
Who Should Read This Book?
The Road Ahead
Conventions Used in This Book
O’Reilly Safari
How to Contact Us
Acknowledgments
1. Big Data Technology Primer
A Tour of the Landscape
Core Components
Computational Frameworks
Analytical SQL Engines
Storage Engines
Ingestion
Orchestration
Summary
I: Infrastructure
2. Clusters
Reasons for Multiple Clusters
Multiple Clusters for Resiliency
Multiple Clusters for Software Development
Multiple Clusters for Workload Isolation
Multiple Clusters for Legal Separation
Multiple Clusters and Independent Storage and Compute
Multitenancy
Requirements for Multitenancy
Sizing Clusters
Sizing by Storage
Sizing by Ingest Rate
Sizing by Workload
Cluster Growth
The Drivers of Cluster Growth
Implementing Cluster Growth
Data Replication
Replication for Software Development
Replication and Workload Isolation
Summary
3. Compute and Storage
Computer Architecture for Hadoop
Commodity Servers
Server CPUs and RAM
Nonuniform Memory Access
CPU Specifications
RAM
Commoditized Storage Meets the Enterprise
Modularity of Compute and Storage
Everything Is Java
Replication or Erasure Coding?
Alternatives
Hadoop and the Linux Storage Stack
User Space
Important System Calls
The Linux Page Cache
Short-Circuit and Zero-Copy Reads
Filesystems
Erasure Coding Versus Replication
Discussion
Guidance
Low-Level Storage
Storage Controllers
Disk Layer
Server Form Factors
Form Factor Comparison
Guidance
Workload Profiles
Cluster Configurations and Node Types
Master Nodes
Worker Nodes
Utility Nodes
Edge Nodes
Small Cluster Configurations
Medium Cluster Configurations
Large Cluster Configurations
Summary
4. Networking
How Services Use a Network
Remote Procedure Calls (RPCs)
Data Transfers
Monitoring
Backup
Consensus
Network Architectures
Small Cluster Architectures
Medium Cluster Architectures
Large Cluster Architectures
Network Integration
Reusing an Existing Network
Creating an Additional Network
Network Design Considerations
Layer 1 Recommendations
Layer 2 Recommendations
Layer 3 Recommendations
Summary
5. Organizational Challenges
Who Runs It?
Is It Infrastructure, Middleware, or an Application?
Case Study: A Typical Business Intelligence Project
The Traditional Approach
Typical Team Setup
Compartmentalization of IT
Revised Team Setup for Hadoop in the Enterprise
Solution Overview with Hadoop
New Team Setup
Split Responsibilities
Do I Need DevOps?
Do I Need a Center of Excellence/Competence?
Summary
6. Datacenter Considerations
Why Does It Matter?
Basic Datacenter Concepts
Cooling
Power
Network
Rack Awareness and Rack Failures
Failure Domain Alignment
Space and Racking Constraints
Ingest and Intercluster Connectivity
Software
Hardware
Replacements and Repair
Operational Procedures
Typical Pitfalls
Networking
Cluster Spanning
Summary
II: Platform
7. Provisioning Clusters
Operating Systems
OS Choices
OS Configuration for Hadoop
Automated Configuration Example
Service Databases
Required Databases
Database Integration Options
Database Considerations
Hadoop Deployment
Hadoop Distributions
Installation Choices
Distribution Architecture
Installation Process
Summary
8. Platform Validation
Testing Methodology
Useful Tools
Hardware Validation
CPU
Disks
Network
Hadoop Validation
HDFS Validation
General Validation
Validating Other Components
Operations Validation
Summary
9. Security
In-Flight Encryption
TLS Encryption
SASL Quality of Protection
Enabling in-Flight Encryption
Authentication
Kerberos
LDAP Authentication
Delegation Tokens
Impersonation
Authorization
Group Resolution
Superusers and Supergroups
Hadoop Service Level Authorization
Centralized Security Management
HDFS
YARN
ZooKeeper
Hive
Impala
HBase
Solr
Kudu
Oozie
Hue
Kafka
Sentry
At-Rest Encryption
Volume Encryption with Cloudera Navigator Encrypt and Key Trustee Server
HDFS Transparent Data Encryption
Encrypting Temporary Files
Summary
10. Integration with Identity Management Providers
Integration Areas
Integration Scenarios
Scenario 1: Writing a File to HDFS
Scenario 2: Submitting a Hive Query
Scenario 3: Running a Spark Job
Integration Providers
LDAP Integration
Background
LDAP Security
Load Balancing
Application Integration
Linux Integration
Kerberos Integration
Kerberos Clients
KDC Integration
Certificate Management
Signing Certificates
Converting Certificates
Wildcard Certificates
Automation
Summary
11. Accessing and Interacting with Clusters
Access Mechanisms
Programmatic Access
Command-Line Access
Web UIs
Access Topologies
Interaction Patterns
Proxy Access
Load Balancing
Edge Node Interactions
Access Security
Administration Gateways
Workbenches
Hue
Notebooks
Landing Zones
Summary
12. High Availability
High Availability Defined
Lateral/Service HA
Vertical/Systemic HA
Measuring Availability
Percentages
Percentiles
Operating for HA
Monitoring
Playbooks and Postmortems
HA Building Blocks
Quorums
Load Balancing
Database HA
Ancillary Services
General Considerations
Separation of Master and Worker Processes
Separation of Identical Service Roles
Master Servers in Separate Failure Domains
Balanced Master Configurations
Optimized Server Configurations
High Availability of Cluster Services
ZooKeeper
HDFS
YARN
HBase
KMS
Hive
Impala
Solr
Kafka
Oozie
Hue
Other Services
Autoconfiguration
Summary
13. Backup and Disaster Recovery
Context
Many Distributed Systems
Policies and Objectives
Failure Scenarios
Suitable Data Sources
Strategies
Data Types
Consistency
Validation
Summary
Data Replication
HBase
Cluster Management Tools
Kafka
Summary
Hadoop Cluster Backups
Subsystems
Case Study: Automating Backups with Oozie
Restore
Summary
III: Taking Hadoop to the Cloud
14. Basics of Virtualization for Hadoop
Compute Virtualization
Virtual Machine Distribution
Anti-Affinity Groups
Storage Virtualization
Virtualizing Local Storage
SANs
Object Storage and Network-Attached Storage
Network Virtualization
Cluster Life Cycle Models
Summary
15. Solutions for Private Clouds
OpenStack
Automation and Integration
Life Cycle and Storage
Isolation
Summary
OpenShift
Automation
Life Cycle and Storage
Isolation
Summary
VMware and Pivotal Cloud Foundry
Do It Yourself?
Automation
Isolation
Life Cycle Model
Summary
Object Storage for Private Clouds
EMC Isilon
Ceph
Summary
16. Solutions in the Public Cloud
Key Things to Know
Cloud Providers
AWS
Microsoft Azure
Google Cloud Platform
Implementing Clusters
Instances
Storage and Life Cycle Models
Network Architecture
High Availability
Summary
17. Automated Provisioning
Long-Lived Clusters
Configuration and Templating
Deployment Phases
Vendor Solutions
One-Click Deployments
Homegrown Automation
Hooking Into a Provisioning Life Cycle
Scaling Up and Down
Deploying with Security
Transient Clusters
Sharing Metadata Services
Summary
18. Security in the Cloud
Assessing the Risk
Risk Model
Environmental Risks
Deployment Risks
Identity Provider Options for Hadoop
Option A: Cloud-Only Self-Contained ID Services
Option B: Cloud-Only Shared ID Services
Option C: On-Premises ID Services
Object Storage Security and Hadoop
Identity and Access Management
Amazon Simple Storage Service
GCP Cloud Storage
Microsoft Azure
Auditing
Encryption for Data at Rest
Requirements for Key Material
Options for Encryption in the Cloud
On-Premises Key Persistence
Encryption via the Cloud Provider
Encryption Feature and Interoperability Summary
Recommendations and Summary for Cloud Encryption
Encrypting Data in Flight in the Cloud
Perimeter Controls and Firewalling
GCP
AWS
Azure
Summary
A. Backup Onboarding Checklist
Backup Onboarding Checklist
Backup
Services
Cloudera Manager
HDFS
HBase
Hive/Impala
Sqoop
Oozie
Hue
Sentry
Index