Foundations for Architecting Data Solutions

Authors: Ted Malaska, Jonathan Seidman

Managing Successful Data Projects

Paperback, English, 2018, 1st edition, ISBN 9781492038740

Not available.

Summary

While many companies ponder implementation details such as distributed processing engines and algorithms for data analysis, this practical book takes a much wider view of big data development, starting with initial planning and moving diligently toward execution. Authors Ted Malaska and Jonathan Seidman guide you through the major components necessary to start, architect, and develop successful big data projects.

Everyone from CIOs and COOs to lead architects and developers will explore a variety of big data architectures and applications, from massive data pipelines to web-scale applications. Each chapter addresses a piece of the software development life cycle and identifies patterns to maximize long-term success throughout the life of your project.

- Start the planning process by considering the key data project types
- Use guidelines to evaluate and select data management solutions
- Reduce risk related to technology, your team, and vague requirements
- Explore system interface design using APIs, REST, and pub/sub systems
- Choose the right distributed storage system for your big data system
- Plan and implement metadata collections for your data architecture
- Use data pipelines to ensure data integrity from source to final storage
- Evaluate the attributes of various engines for processing the data you collect

Specifications

ISBN-13: 9781492038740

Keywords: data analysis, big data, data solutions

Language: English

Binding: Paperback

Number of pages: 225

Publisher: O'Reilly

Edition: 1

Publication date: 12 September 2018

Main category: IT management / ICT

Table of Contents

Preface
Who This Book Is For
Navigating This Book
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments

1. Key Data Project Types and Considerations
Major Data Project Types
Data Pipelines and Data Staging
Primary Considerations and Risk Management
Pipeline and Staging Team Makeup
Data Processing and Analysis
Primary Considerations and Risk Management
Data Processing and Analytics Team Makeup
Application Development
Primary Considerations and Risk Management
Application Development Team Makeup
Summary

2. Evaluating and Selecting Data Management Solutions
Stages of Open Source Projects
Private Incubation Stage
Release Stage
“Curing Cancer” Stage
Broken Promises Stage
Hardening Stage
Enterprise Stage
Decline and Slow Death Stage
Common Life Cycles for Open Source Projects
Open Sourcing a Dead Product
The Follower
Evaluating Benchmarks
Considerations for Technology Selection
Understanding the Building Blocks
Looking to a Guide for Advice
Using Analysts
Looking to Market Trends
Summary

3. Managing Risk in Data Projects
Categories of Risk
Technology Risk
Team Risk
Requirements Risk
Managing Risk
Categorizing Risk in Your Architecture
Technology Risk
Strength of the Team
Other Teams
Requirements Risk
Tying This All Together
Using Prototypes and Proofs of Concept
Build Two to Three Ways
Build PoCs and Then Throw Them Away
Deployment Considerations
Using Interfaces
Start Building Early
Test Often and Keep Records
Monitoring and Alerting
Communicating Risk
Collaborate and Gain Buy-In
Share the Risk
Using Risk as a Negotiation Tool
Summary

4. Interface Design
The Human Body
The Human Body Versus a Data Architecture
Decoupling
Decoupling Considerations
Specialization
What Makes a Good Interface Design
The Contract
The Abstraction
Versioning
Being Defensive
Documentation and Naming for Interfaces
Nonfunctional Considerations
Availability
Response-Time Guarantees
Load Capacity
Using Testing to Determine SLAs
Common Interface Examples
Publish–Subscribe
Request–Response Asynchronous Example
Request–Response Synchronous Example
Summary

5. Distributed Storage Systems
Attributes of Distributed Storage Systems
Storage System Genealogy
Partitioning
Mutation Options
Read Paths
Availability Versus Consistency
Primary Use Cases
Storage System Breakdown
HDFS
S3 and Object Stores
Apache HBase
Apache Cassandra
Elasticsearch and Apache Solr
Newcomers: Apache Kudu and CockroachDB
In-Memory Storage Systems
Summary

6. The Meta of Enterprise Data
Reasons to Care About Metadata
Visibility
Relationships
Regulation
Types of Metadata in a Data Architecture
Data at Rest
Data in Motion
Metadata for Source Data
Metadata About Data Processing
Reports and Dashboards
Metadata Collection
Declarative Metadata Collection
Discovery of Metadata
Metadata Management in Practice
Summary

7. Ensuring Data Integrity
Examples of Building Data Pipelines to Ensure Data Integrity
Predefined Data Pipelines
Validation of Data Pipelines
Row Counts
Distinct Count
Full-Byte Comparison
Checksum Comparison
Summary

8. Data Processing
Attributes of Processing Engines
DAG Management
Compute Isolation
Performance
Fault Tolerance
Interaction Model
Batch and/or Streaming
Data Processing over Time
Summary

Index