High Performance Spark
Best Practices for Scaling and Optimizing Apache Spark
Paperback, English, 2017, 1st edition, ISBN 9781491943205
Summary
Apache Spark is amazing when everything clicks. But if you haven’t seen the performance improvements you expected, or still don’t feel confident enough to use Spark in production, this practical book is for you. Authors Holden Karau and Rachel Warren demonstrate performance optimizations to help your Spark queries run faster and handle larger data sizes, while using fewer resources.
Ideal for software engineers, data engineers, developers, and system administrators working with large-scale data applications, this book describes techniques that can reduce data infrastructure costs and developer hours. Not only will you gain a more comprehensive understanding of Spark, you’ll also learn how to make it sing.
With this book, you’ll explore:
- How Spark SQL’s new interfaces improve performance over Spark’s RDD data structure
- The choice between data joins in Core Spark and Spark SQL
- Techniques for getting the most out of standard RDD transformations
- How to work around performance issues in Spark’s key/value pair paradigm
- Writing high-performance Spark code without Scala or the JVM
- How to test for functionality and performance when applying suggested improvements
- Using Spark MLlib and Spark ML machine learning libraries
- Spark’s Streaming components and external community packages
Table of Contents
1. Introduction to High Performance Spark
What Is Spark and Why Performance Matters
What You Can Expect to Get from This Book
Spark Versions
Why Scala?
Conclusion
2. How Spark Works
How Spark Fits into the Big Data Ecosystem
Spark Model of Parallel Computing: RDDs
Spark Job Scheduling
The Anatomy of a Spark Job
Conclusion
3. DataFrames, Datasets, and Spark SQL
Getting Started with the SparkSession (or HiveContext or SQLContext)
Spark SQL Dependencies
Basics of Schemas
DataFrame API
Data Representation in DataFrames and Datasets
Data Loading and Saving Functions
Datasets
Extending with User-Defined Functions and Aggregate Functions (UDFs, UDAFs)
Query Optimizer
Debugging Spark SQL Queries
JDBC/ODBC Server
Conclusion
4. Joins (SQL and Core)
Core Spark Joins
Spark SQL Joins
Conclusion
5. Effective Transformations
Narrow Versus Wide Transformations
What Type of RDD Does Your Transformation Return?
Minimizing Object Creation
Iterator-to-Iterator Transformations with mapPartitions
Set Operations
Reducing Setup Overhead
Reusing RDDs
Conclusion
6. Working with Key/Value Data
The Goldilocks Example
Actions on Key/Value Pairs
What’s So Dangerous About the groupByKey Function
Choosing an Aggregation Operation
Multiple RDD Operations
Partitioners and Key/Value Data
Dictionary of OrderedRDDOperations
Secondary Sort and repartitionAndSortWithinPartitions
Straggler Detection and Unbalanced Data
Conclusion
7. Going Beyond Scala
Beyond Scala within the JVM
Beyond Scala, and Beyond the JVM
Calling Other Languages from Spark
The Future
Conclusion
8. Testing and Validation
Unit Testing
Getting Test Data
Property Checking with ScalaCheck
Integration Testing
Verifying Performance
Job Validation
Conclusion
9. Spark MLlib and ML
Choosing Between Spark MLlib and Spark ML
Working with MLlib
Working with Spark ML
General Serving Considerations
Conclusion
10. Spark Components and Packages
Stream Processing with Spark
GraphX
Using Community Packages and Libraries
Conclusion
Appendix: Tuning, Debugging, and Other Things Developers Like to Pretend Don’t Exist
Index