Alan Gates, Daniel Dai

Programming Pig

Dataflow Scripting with Hadoop

Paperback, English, 2016, 2nd edition, ISBN 9781491937099

€45.51

Delivery time: approximately 16 working days

Free shipping

Summary

For many organizations, Hadoop is the first step for dealing with massive amounts of data. The next step? Processing and analyzing datasets with the Apache Pig scripting platform. With Pig, you can batch-process data without having to create a full-fledged application, making it easy to experiment with new datasets.

Updated with use cases and programming examples, this second edition is the ideal learning tool for new and experienced users alike. You’ll find comprehensive coverage of key features such as the Pig Latin scripting language and the Grunt shell. When you need to analyze terabytes of data, this book shows you how to do it efficiently with Pig.

- Delve into Pig’s data model, including scalar and complex data types
- Write Pig Latin scripts to sort, group, join, project, and filter your data
- Use Grunt to work with the Hadoop Distributed File System (HDFS)
- Build complex data processing pipelines with Pig’s macros and modularity features
- Embed Pig Latin in Python for iterative processing and other advanced tasks
- Use Pig with Apache Tez to build high-performance batch and interactive data processing applications
- Create your own load and store functions to handle data formats and storage mechanisms

Specifications

ISBN-13: 9781491937099

Keywords: Programming languages, data analysis, scripting, Pig

Language: English

Binding: Paperback

Number of pages: 346

Publisher: O'Reilly

Edition: 2

Publication date: 28 November 2016

Main category: IT management / ICT

About Alan Gates

Alan is an original member of the engineering team that took Pig from a Yahoo! Labs research project to a successful Apache open source project. In this role he oversaw the implementation of the language, including programming interfaces and the overall design. He has presented Pig at numerous conferences, user groups, universities, and companies. Alan is a member of the Apache Software Foundation and a co-founder of Hortonworks. He has a BS in Mathematics from Oregon State University and an MA in Theology from Fuller Theological Seminary.

Table of Contents

Preface

1. What Is Pig?
Pig Latin, a Parallel Data Flow Language
Pig on Hadoop
What Is Pig Useful For?
The Pig Philosophy
Pig’s History

2. Installing and Running Pig
Downloading and Installing Pig
Running Pig
Grunt

3. Pig’s Data Model
Types
Schemas

4. Introduction to Pig Latin
Preliminary Matters
Input and Output
Relational Operations
User-Defined Functions

5. Advanced Pig Latin
Advanced Relational Operations
Integrating Pig with Executables and Native Jobs
split and Nonlinear Data Flows
Controlling Execution
Pig Latin Preprocessor

6. Developing and Testing Pig Latin Scripts
Development Tools
Testing Your Scripts with PigUnit

7. Making Pig Fly
Writing Your Scripts to Perform Well
Writing Your UDFs to Perform
Tuning Pig and Hadoop for Your Job
Using Compression in Intermediate Results
Data Layout Optimization
Map-Side Aggregation
The JAR Cache
Processing Small Jobs Locally
Bloom Filters
Schema Tuple Optimization
Dealing with Failures

8. Embedding Pig
Embedding Pig Latin in Scripting Languages
Using the Pig Java APIs

9. Writing Evaluation and Filter Functions
Writing an Evaluation Function in Java
The Algebraic Interface
The Accumulator Interface
Writing Filter Functions
Writing Evaluation Functions in Scripting Languages

10. Writing Load and Store Functions
Load Functions
Store Functions
Shipping JARs Automatically
Handling Bad Records

11. Pig on Tez
What Is Tez?
Running Pig on Tez
Potential Differences When Running on Tez
Pig on Tez Internals

12. Pig and Other Members of the Hadoop Community
Pig and Hive
Cascading
Spark
NoSQL Databases
DataFu
Oozie

13. Use Cases and Programming Examples
Sparse Tuples
k-Means
intersect and except
Pig at Yahoo!
Pig at Particle News

Appendix A: Built-in User Defined Functions and PiggyBank

Index