The Self-Service Data Roadmap
Democratize Data and Reduce Time to Insight
Paperback, English, 2020, 1st edition, ISBN 9781492075257
Summary
Data-driven insights are a key competitive advantage for any industry today, but deriving insights from raw data can still take days or weeks. Most organizations can’t scale data science teams fast enough to keep up with the growing amounts of data to transform. What’s the answer? Self-service data.
With this practical book, data engineers, data scientists, and team managers will learn how to build a self-service data science platform that helps anyone in your organization extract insights from data. Sandeep Uttamchandani provides a scorecard to track and address bottlenecks that slow down time to insight across data discovery, transformation, processing, and production. This book bridges the gap between data scientists bottlenecked by engineering realities and data engineers unclear about ways to make self-service work.
- Build a self-service portal to support data discovery, quality, lineage, and governance
- Select the best approach for each self-service capability using open source cloud technologies
- Tailor self-service for the people, processes, and technology maturity of your data platform
- Implement capabilities to democratize data and reduce time to insight
- Scale your self-service portal to support a large number of users within your organization
Table of Contents
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
1. Introduction
Journey Map from Raw Data to Insights
Discover
Prep
Build
Operationalize
Defining Your Time-to-Insight Scorecard
Build Your Self-Service Data Roadmap
I. Self-Service Data Discovery
2. Metadata Catalog Service
Journey Map
Understanding Datasets
Analyzing Datasets
Knowledge Scaling
Minimizing Time to Interpret
Extracting Technical Metadata
Extracting Operational Metadata
Gathering Team Knowledge
Defining Requirements
Technical Metadata Extractor Requirements
Operational Metadata Requirements
Team Knowledge Aggregator Requirements
Implementation Patterns
Source-Specific Connectors Pattern
Lineage Correlation Pattern
Team Knowledge Pattern
Summary
3. Search Service
Journey Map
Determining Feasibility of the Business Problem
Selecting Relevant Datasets for Data Prep
Reusing Existing Artifacts for Prototyping
Minimizing Time to Find
Indexing Datasets and Artifacts
Ranking Results
Access Control
Defining Requirements
Indexer Requirements
Ranking Requirements
Access Control Requirements
Nonfunctional Requirements
Implementation Patterns
Push-Pull Indexer Pattern
Hybrid Search Ranking Pattern
Catalog Access Control Pattern
Summary
4. Feature Store Service
Journey Map
Finding Available Features
Training Set Generation
Feature Pipeline for Online Inference
Minimize Time to Featurize
Feature Computation
Feature Serving
Defining Requirements
Feature Computation
Feature Serving
Nonfunctional Requirements
Implementation Patterns
Hybrid Feature Computation Pattern
Feature Registry Pattern
Summary
5. Data Movement Service
Journey Map
Aggregating Data Across Sources
Moving Raw Data to Specialized Query Engines
Moving Processed Data to Serving Stores
Exploratory Analysis Across Sources
Minimizing Time to Data Availability
Data Ingestion Configuration and Change Management
Compliance
Data Quality Verification
Defining Requirements
Ingestion Requirements
Transformation Requirements
Compliance Requirements
Verification Requirements
Nonfunctional Requirements
Implementation Patterns
Batch Ingestion Pattern
Change Data Capture Ingestion Pattern
Event Aggregation Pattern
Summary
6. Clickstream Tracking Service
Journey Map
Minimizing Time to Click Metrics
Managing Instrumentation
Event Enrichment
Building Insights
Defining Requirements
Instrumentation Requirements Checklist
Enrichment Requirements Checklist
Implementation Patterns
Instrumentation Pattern
Rule-Based Enrichment Patterns
Consumption Patterns
Summary
II. Self-Service Data Prep
7. Data Lake Management Service
Journey Map
Primitive Life Cycle Management
Managing Data Updates
Managing Batching and Streaming Data Flows
Minimizing Time to Data Lake Management
Requirements
Implementation Patterns
Data Life Cycle Primitives Pattern
Transactional Pattern
Advanced Data Management Pattern
Summary
8. Data Wrangling Service
Journey Map
Minimizing Time to Wrangle
Defining Requirements
Curating Data
Operational Monitoring
Defining Requirements
Implementation Patterns
Exploratory Data Analysis Patterns
Analytical Transformation Patterns
Summary
9. Data Rights Governance Service
Journey Map
Executing Data Rights Requests
Discovery of Datasets
Model Retraining
Minimizing Time to Comply
Tracking the Customer Data Life Cycle
Executing Customer Data Rights Requests
Limiting Data Access
Defining Requirements
Current Pain Point Questionnaire
Interop Checklist
Functional Requirements
Nonfunctional Requirements
Implementation Patterns
Sensitive Data Discovery and Classification Pattern
Data Lake Deletion Pattern
Use Case–Dependent Access Control
Summary
III. Self-Service Build
10. Data Virtualization Service
Journey Map
Exploring Data Sources
Picking a Processing Cluster
Minimizing Time to Query
Picking the Execution Environment
Formulating Polyglot Queries
Joining Data Across Silos
Defining Requirements
Current Pain Point Analysis
Operational Requirements
Functional Requirements
Nonfunctional Requirements
Implementation Patterns
Automatic Query Routing Pattern
Unified Query Pattern
Federated Query Pattern
Summary
11. Data Transformation Service
Journey Map
Production Dashboard and ML Pipelines
Data-Driven Storytelling
Minimizing Time to Transform
Transformation Implementation
Transformation Execution
Transformation Operations
Defining Requirements
Current State Questionnaire
Functional Requirements
Nonfunctional Requirements
Implementation Patterns
Implementation Pattern
Execution Patterns
Summary
12. Model Training Service
Journey Map
Model Prototyping
Continuous Training
Model Debugging
Minimizing Time to Train
Training Orchestration
Tuning
Continuous Training
Defining Requirements
Training Orchestration
Tuning
Continuous Training
Nonfunctional Requirements
Implementation Patterns
Distributed Training Orchestrator Pattern
Automated Tuning Pattern
Data-Aware Continuous Training
Summary
13. Continuous Integration Service
Journey Map
Collaborating on an ML Pipeline
Integrating ETL Changes
Validating Schema Changes
Minimizing Time to Integrate
Experiment Tracking
Reproducible Deployment
Testing Validation
Defining Requirements
Experiment Tracking Module
Pipeline Packaging Module
Testing Automation Module
Implementation Patterns
Programmable Tracking Pattern
Reproducible Project Pattern
Summary
14. A/B Testing Service
Journey Map
Minimizing Time to A/B Test
Experiment Design
Execution at Scale
Experiment Optimization
Implementation Patterns
Experiment Specification Pattern
Metrics Definition Pattern
Automated Experiment Optimization
Summary
IV. Self-Service Operationalize
15. Query Optimization Service
Journey Map
Avoiding Cluster Clogs
Resolving Runtime Query Issues
Speeding Up Applications
Minimizing Time to Optimize
Aggregating Statistics
Analyzing Statistics
Optimizing Jobs
Defining Requirements
Current Pain Points Questionnaire
Interop Requirements
Functionality Requirements
Nonfunctional Requirements
Implementation Patterns
Avoidance Pattern
Operational Insights Pattern
Automated Tuning Pattern
Summary
16. Pipeline Orchestration Service
Journey Map
Invoke Exploratory Pipelines
Run SLA-Bound Pipelines
Minimizing Time to Orchestrate
Defining Job Dependencies
Distributed Execution
Production Monitoring
Defining Requirements
Current Pain Points Questionnaire
Operational Requirements
Functional Requirements
Nonfunctional Requirements
Implementation Patterns
Dependency Authoring Patterns
Orchestration Observability Patterns
Distributed Execution Pattern
Summary
17. Model Deploy Service
Journey Map
Model Deployment in Production
Model Maintenance and Upgrade
Minimizing Time to Deploy
Deployment Orchestration
Performance Scaling
Drift Monitoring
Defining Requirements
Orchestration
Model Scaling and Performance
Drift Verification
Nonfunctional Requirements
Implementation Patterns
Universal Deployment Pattern
Autoscaling Deployment Pattern
Model Drift Tracking Pattern
Summary
18. Quality Observability Service
Journey Map
Daily Data Quality Monitoring Reports
Debugging Quality Issues
Handling Low-Quality Data Records
Minimizing Time to Insight Quality
Verify the Accuracy of the Data
Detect Quality Anomalies
Prevent Data Quality Issues
Defining Requirements
Detection and Handling Data Quality Issues
Functional Requirements
Nonfunctional Requirements
Implementation Patterns
Accuracy Models Pattern
Profiling-Based Anomaly Detection Pattern
Avoidance Pattern
Summary
19. Cost Management Service
Journey Map
Monitoring Cost Usage
Continuous Cost Optimization
Minimizing Time to Optimize Cost
Expenditure Observability
Matching Supply and Demand
Continuous Cost Optimization
Defining Requirements
Pain Points Questionnaire
Functional Requirements
Nonfunctional Requirements
Implementation Patterns
Continuous Cost Monitoring Pattern
Automated Scaling Pattern
Cost Advisor Pattern
Summary
Index