Stream Processing with Apache Spark

Mastering Structured Streaming and Spark Streaming

Paperback, English, 2019, 1st edition, ISBN 9781491944240

Summary

Before you can build analytics tools to gain quick insights, you first need to know how to process data in real time. With this practical guide, developers familiar with Apache Spark will learn how to put this in-memory framework to use for streaming data. You’ll discover how Spark enables you to write streaming jobs in almost the same way you write batch jobs.

Authors Gerard Maas and François Garillot help you explore the theoretical underpinnings of Apache Spark. This comprehensive guide features two sections that compare and contrast the streaming APIs Spark now supports: the original Spark Streaming library and the newer Structured Streaming API.

- Learn fundamental stream processing concepts and examine different streaming architectures
- Explore Structured Streaming through practical examples; learn different aspects of stream processing in detail
- Create and operate streaming jobs and applications with Spark Streaming; integrate Spark Streaming with other Spark APIs
- Learn advanced Spark Streaming techniques, including approximation algorithms and machine learning algorithms
- Compare Apache Spark to other stream processing projects, including Apache Storm, Apache Flink, and Apache Kafka Streams

Specifications

ISBN-13: 9781491944240
Language: English
Binding: Paperback
Number of pages: 452
Publisher: O'Reilly
Edition: 1
Publication date: 18 June 2019
Main category: IT management / ICT

About François Garillot

François Garillot worked on Scala's type system in 2006, earned his PhD from the French École Polytechnique in 2011, and worked at Typesafe after a brief stint in Internet advertising. He has worked on interactive interfaces to the Scala compiler while nourishing a strong enthusiasm for data analytics in his spare time, until Apache Spark let him fulfill this passion as his main job. He received the first Spark Certification in November 2014 and has worked in London and Philadelphia, among other places. In his spare time, he can be found practicing one of a half-dozen ways of making coffee, climbing up or skiing down a not-necessarily-Alpine mountain, or sailing a not-necessarily-coastal course.

About Gerard Maas

Gerard Maas is the lead engineer at Kensu.io, an early-stage startup where he works on context management for big-data environments. Before that, he led the design and development of the data processing pipeline of Virdata.com, a startup building a cloud-native IoT platform, where Scala, Apache Spark, and Spark Streaming were crucial building blocks. He enjoys contributing to open source projects, small and large. Throughout his career in technology companies like Alcatel-Lucent, Bell Labs, Sony, and Technicolor, he has been mostly involved in the interaction of services and devices, from early service adaptation, when mobile screens had only a few lines of text, through multi-device interactions, to IoT device management. He has a degree in Computer Engineering from the Simón Bolívar University, Venezuela.

Table of Contents

Foreword
Preface
Who Should Read This Book?
Installing Spark
Learning Scala
The Way Ahead
Bibliography
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
From Gerard
From François

I. Fundamentals of Stream Processing with Apache Spark
1. Introducing Stream Processing
What Is Stream Processing?
Batch Versus Stream Processing
The Notion of Time in Stream Processing
The Factor of Uncertainty
Some Examples of Stream Processing
Scaling Up Data Processing
MapReduce
The Lesson Learned: Scalability and Fault Tolerance
Distributed Stream Processing
Stateful Stream Processing in a Distributed System
Introducing Apache Spark
The First Wave: Functional APIs
The Second Wave: SQL
A Unified Engine
Spark Components
Spark Streaming
Structured Streaming
Where Next?

2. Stream-Processing Model
Sources and Sinks
Immutable Streams Defined from One Another
Transformations and Aggregations
Window Aggregations
Tumbling Windows
Sliding Windows
Stateless and Stateful Processing
Stateful Streams
An Example: Local Stateful Computation in Scala
A Stateless Definition of the Fibonacci Sequence as a Stream Transformation
Stateless or Stateful Streaming
The Effect of Time
Computing on Timestamped Events
Timestamps as the Provider of the Notion of Time
Event Time Versus Processing Time
Computing with a Watermark
Summary

3. Streaming Architectures
Components of a Data Platform
Architectural Models
The Use of a Batch-Processing Component in a Streaming Application
Referential Streaming Architectures
The Lambda Architecture
The Kappa Architecture
Streaming Versus Batch Algorithms
Streaming Algorithms Are Sometimes Completely Different in Nature
Streaming Algorithms Can’t Be Guaranteed to Measure Well Against Batch Algorithms
Summary

4. Apache Spark as a Stream-Processing Engine
The Tale of Two APIs
Spark’s Memory Usage
Failure Recovery
Lazy Evaluation
Cache Hints
Understanding Latency
Throughput-Oriented Processing
Spark’s Polyglot API
Fast Implementation of Data Analysis
To Learn More About Spark
Summary

5. Spark’s Distributed Processing Model
Running Apache Spark with a Cluster Manager
Examples of Cluster Managers
Spark’s Own Cluster Manager
Understanding Resilience and Fault Tolerance in a Distributed System
Fault Recovery
Cluster Manager Support for Fault Tolerance
Data Delivery Semantics
Microbatching and One-Element-at-a-Time
Microbatching: An Application of Bulk-Synchronous Processing
One-Record-at-a-Time Processing
Microbatching Versus One-at-a-Time: The Trade-Offs
Bringing Microbatch and One-Record-at-a-Time Closer Together
Dynamic Batch Interval
Structured Streaming Processing Model
The Disappearance of the Batch Interval

6. Spark’s Resilience Model
Resilient Distributed Datasets in Spark
Spark Components
Spark’s Fault-Tolerance Guarantees
Task Failure Recovery
Stage Failure Recovery
Driver Failure Recovery
Summary
A. References for Part I

II. Structured Streaming
7. Introducing Structured Streaming
First Steps with Structured Streaming
Batch Analytics
Streaming Analytics
Connecting to a Stream
Preparing the Data in the Stream
Operations on Streaming Dataset
Creating a Query
Start the Stream Processing
Exploring the Data
Summary

8. The Structured Streaming Programming Model
Initializing Spark
Sources: Acquiring Streaming Data
Available Sources
Transforming Streaming Data
Streaming API Restrictions on the DataFrame API
Sinks: Output the Resulting Data
format
outputMode
queryName
option
options
trigger
start()
Summary

9. Structured Streaming in Action
Consuming a Streaming Source
Application Logic
Writing to a Streaming Sink
Summary

10. Structured Streaming Sources
Understanding Sources
Reliable Sources Must Be Replayable
Sources Must Provide a Schema
Available Sources
The File Source
Specifying a File Format
Common Options
Common Text Parsing Options (CSV, JSON)
JSON File Source Format
CSV File Source Format
Parquet File Source Format
Text File Source Format
The Kafka Source
Setting Up a Kafka Source
Selecting a Topic Subscription Method
Configuring Kafka Source Options
Kafka Consumer Options
The Socket Source
Configuration
Operations
The Rate Source
Options

11. Structured Streaming Sinks
Understanding Sinks
Available Sinks
Reliable Sinks
Sinks for Experimentation
The Sink API
Exploring Sinks in Detail
The File Sink
Using Triggers with the File Sink
Common Configuration Options Across All Supported File Formats
Common Time and Date Formatting (CSV, JSON)
The CSV Format of the File Sink
The JSON File Sink Format
The Parquet File Sink Format
The Text File Sink Format
The Kafka Sink
Understanding the Kafka Publish Model
Using the Kafka Sink
The Memory Sink
Output Modes
The Console Sink
Options
Output Modes
The Foreach Sink
The ForeachWriter Interface
TCP Writer Sink: A Practical ForeachWriter Example
The Moral of this Example
Troubleshooting ForeachWriter Serialization Issues

12. Event Time–Based Stream Processing
Understanding Event Time in Structured Streaming
Using Event Time
Processing Time
Watermarks
Time-Based Window Aggregations
Defining Time-Based Windows
Understanding How Intervals Are Computed
Using Composite Aggregation Keys
Tumbling and Sliding Windows
Record Deduplication
Summary

13. Advanced Stateful Operations
Example: Car Fleet Management
Understanding Group with State Operations
Internal State Flow
Using MapGroupsWithState
Using FlatMapGroupsWithState
Output Modes
Managing State Over Time
Summary

14. Monitoring Structured Streaming Applications
The Spark Metrics Subsystem
Structured Streaming Metrics
The StreamingQuery Instance
Getting Metrics with StreamingQueryProgress
The StreamingQueryListener Interface
Implementing a StreamingQueryListener

15. Experimental Areas: Continuous Processing and Machine Learning
Continuous Processing
Understanding Continuous Processing
Using Continuous Processing
Limitations
Machine Learning
Learning Versus Exploiting
Applying a Machine Learning Model to a Stream
Example: Estimating Room Occupancy by Using Ambient Sensors
Online Training
B. References for Part II

III. Spark Streaming
16. Introducing Spark Streaming
The DStream Abstraction
DStreams as a Programming Model
DStreams as an Execution Model
The Structure of a Spark Streaming Application
Creating the Spark Streaming Context
Defining a DStream
Defining Output Operations
Starting the Spark Streaming Context
Stopping the Streaming Process
Summary

17. The Spark Streaming Programming Model
RDDs as the Underlying Abstraction for DStreams
Understanding DStream Transformations
Element-Centric DStream Transformations
RDD-Centric DStream Transformations
Counting
Structure-Changing Transformations
Summary

18. The Spark Streaming Execution Model
The Bulk-Synchronous Architecture
The Receiver Model
The Receiver API
How Receivers Work
The Receiver’s Data Flow
The Internal Data Resilience
Receiver Parallelism
Balancing Resources: Receivers Versus Processing Cores
Achieving Zero Data Loss with the Write-Ahead Log
The Receiverless or Direct Model
Summary

19. Spark Streaming Sources
Types of Sources
Basic Sources
Receiver-Based Sources
Direct Sources
Commonly Used Sources
The File Source
How It Works
The Queue Source
How It Works
Using a Queue Source for Unit Testing
A Simpler Alternative to the Queue Source: The ConstantInputDStream
The Socket Source
How It Works
The Kafka Source
Using the Kafka Source
How It Works
Where to Find More Sources

20. Spark Streaming Sinks
Output Operations
Built-In Output Operations
print
saveAsxyz
foreachRDD
Using foreachRDD as a Programmable Sink
Third-Party Output Operations

21. Time-Based Stream Processing
Window Aggregations
Tumbling Windows
Window Length Versus Batch Interval
Sliding Windows
Sliding Windows Versus Batch Interval
Sliding Windows Versus Tumbling Windows
Using Windows Versus Longer Batch Intervals
Window Reductions
reduceByWindow
reduceByKeyAndWindow
countByWindow
countByValueAndWindow
Invertible Window Aggregations
Slicing Streams
Summary

22. Arbitrary Stateful Streaming Computation
Statefulness at the Scale of a Stream
updateStateByKey
Limitation of updateStateByKey
Performance
Memory Usage
Introducing Stateful Computation with mapWithState
Using mapWithState
Event-Time Stream Computation Using mapWithState

23. Working with Spark SQL
Spark SQL
Accessing Spark SQL Functions from Spark Streaming
Example: Writing Streaming Data to Parquet
Dealing with Data at Rest
Using Join to Enrich the Input Stream
Join Optimizations
Updating Reference Datasets in a Streaming Application
Enhancing Our Example with a Reference Dataset
Summary

24. Checkpointing
Understanding the Use of Checkpoints
Checkpointing DStreams
Recovery from a Checkpoint
Limitations
The Cost of Checkpointing
Checkpoint Tuning

25. Monitoring Spark Streaming
The Streaming UI
Understanding Job Performance Using the Streaming UI
Input Rate Chart
Scheduling Delay Chart
Processing Time Chart
Total Delay Chart
Batch Details
The Monitoring REST API
Using the Monitoring REST API
Information Exposed by the Monitoring REST API
The Metrics Subsystem
The Internal Event Bus
Interacting with the Event Bus
Summary

26. Performance Tuning
The Performance Balance of Spark Streaming
The Relationship Between Batch Interval and Processing Delay
The Last Moments of a Failing Job
Going Deeper: Scheduling Delay and Processing Delay
Checkpoint Influence in Processing Time
External Factors that Influence the Job’s Performance
How to Improve Performance?
Tweaking the Batch Interval
Limiting the Data Ingress with Fixed-Rate Throttling
Backpressure
Dynamic Throttling
Tuning the Backpressure PID
Custom Rate Estimator
A Note on Alternative Dynamic Handling Strategies
Caching
Speculative Execution
C. References for Part III

IV. Advanced Spark Streaming Techniques
27. Streaming Approximation and Sampling Algorithms
Exactness, Real Time, and Big Data
Exactness
Real-Time Processing
Big Data
The Exactness, Real-Time, and Big Data Triangle
Big Data and Real Time
Approximation Algorithms
Hashing and Sketching: An Introduction
Counting Distinct Elements: HyperLogLog
Role-Playing Exercise: If We Were a System Administrator
Practical HyperLogLog in Spark
Counting Element Frequency: Count Min Sketches
Introducing Bloom Filters
Bloom Filters with Spark
Computing Frequencies with a Count-Min Sketch
Ranks and Quantiles: T-Digest
T-Digest in Spark
Reducing the Number of Elements: Sampling
Random Sampling
Stratified Sampling

28. Real-Time Machine Learning
Streaming Classification with Naive Bayes
streamDM Introduction
Naive Bayes in Practice
Training a Movie Review Classifier
Introducing Decision Trees
Hoeffding Trees
Hoeffding Trees in Spark, in Practice
Streaming Clustering with Online K-Means
K-Means Clustering
Online Data and K-Means
The Problem of Decaying Clusters
Streaming K-Means with Spark Streaming
D. References for Part IV

V. Beyond Apache Spark
29. Other Distributed Real-Time Stream Processing Systems
Apache Storm
Processing Model
The Storm Topology
The Storm Cluster
Compared to Spark
Apache Flink
A Streaming-First Framework
Compared to Spark
Kafka Streams
Kafka Streams Programming Model
Compared to Spark
In the Cloud
Amazon Kinesis on AWS
Microsoft Azure Stream Analytics
Apache Beam/Google Cloud Dataflow

30. Looking Ahead
Stay Plugged In
Seek Help on Stack Overflow
Start Discussions on the Mailing Lists
Attend Conferences
Attend Meetups
Read Books
Contributing to the Apache Spark Project

E. References for Part V

Index
