
Learning Apache Drill

Query and Analyze Distributed Data Sources with SQL

Paperback, English, 2018, 1st edition, ISBN 9781492032793

Summary

Get up to speed with Apache Drill, an extensible distributed SQL query engine that reads massive datasets in many popular file formats such as Parquet, JSON, and CSV. Drill reads data in HDFS or in cloud-native storage such as S3, and works with Hive metastores as well as with distributed databases such as HBase and MongoDB and with relational databases. Drill works everywhere: on your laptop or in your largest cluster.

In this practical book, Drill committers Charles Givre and Paul Rogers show analysts and data scientists how to query and analyze raw data using this powerful tool. Data scientists today spend about 80% of their time just gathering and cleaning data.

With this book, you'll learn how Drill helps you analyze data more effectively to drive down time to insight.
- Use Drill to clean, prepare, and summarize delimited data for further analysis
- Query file types including logfiles, Parquet, JSON, and other complex formats
- Query Hadoop, relational databases, MongoDB, and Kafka with standard SQL
- Connect to Drill programmatically using a variety of languages
- Use Drill even with challenging or ambiguous file formats
- Perform sophisticated analysis by extending Drill's functionality with user-defined functions
- Facilitate data analysis for network security, image metadata, and machine learning

Specifications

ISBN13: 9781492032793
Language: English
Binding: paperback
Number of pages: 311
Publisher: O'Reilly
Edition: 1
Publication date: 27 November 2018
Main category: IT management / ICT

About Charles Givre

Mr. Charles Givre is a lead data scientist at Deutsche Bank in the Central Security Office (CSO), where he works at the intersection of cyber security and data science. Prior to joining Deutsche Bank, Mr. Givre was a Senior Lead Data Scientist at Booz Allen Hamilton on one of the firm's largest analytic programs, where he led data science efforts and worked to expand the role of data science in the program. Mr. Givre is passionate about teaching others data science and analytic skills and has taught data science classes all over the world at conferences, at universities, and for clients. Most recently, Mr. Givre taught a data science class at the BlackHat conference in Las Vegas and at the Center for Research in Applied Cryptography and Cyber Security at Bar Ilan University. He is a sought-after speaker and has delivered presentations at major industry conferences such as Strata-Hadoop World, BlackHat, Open Data Science Conference, and others. Mr. Givre recently accepted the position of Program Chair of the Strategic Analytics Program at Brandeis University's Graduate School of Professional Studies. One of Mr. Givre's research interests is increasing the productivity of data science and analytic teams, and to that end he has been working extensively to promote the use of Apache Drill in security applications and has contributed to the code base. Mr. Givre teaches online classes for O'Reilly about Drill and Security Data Science and is a coauthor of the forthcoming O'Reilly book about Apache Drill. Prior to joining Booz Allen, Mr. Givre worked as a counterterrorism analyst at the Central Intelligence Agency for five years. Mr. Givre holds a master's degree in Middle Eastern Studies from Brandeis University, as well as a Bachelor of Science in Computer Science and a Bachelor of Music, both from the University of Arizona.
He speaks French reasonably well, plays trombone, lives in Baltimore with his family, and in his nonexistent spare time is restoring a classic British sports car. Mr. Givre blogs at thedataist.com and tweets @cgivre.

About Paul Rogers

Paul Rogers is an Apache Drill committer at MapR, where he focuses on Drill’s execution engine. Paul has worked as a software architect at a number of database and BI companies such as Oracle, Actuate, and Informix. Paul was the early architect of the Eclipse BIRT project. His interests include making Drill even easier to use for end users and plug-in developers.

Table of Contents

Preface
Who Should Read This Book
Why We Wrote This Book
Navigating This Book
Online Resources
Conventions Used in This Book
Using Code Examples
O’Reilly Safari
How to Contact Us
Acknowledgments
Special Thanks from Charles
Special Thanks from Paul

1. Introduction to Apache Drill
What Is Apache Drill?
Drill Is Versatile
Drill Is Easy to Use
A Word About Drill’s Performance
A Very Brief History of Big Data
Drill in the Big Data Ecosystem
Comparing Drill with Similar Tools

2. Installing and Running Drill
Preparing Your Machine for Drill
Special Configuration Instructions for Windows Installations
Installing Drill on Windows
Starting Drill on a Windows Machine
Installing Drill in Embedded Mode on macOS or Linux
Starting Drill on macOS or Linux in Embedded Mode
Installing Drill in Distributed Mode on macOS or Linux
Preparing Your Cluster for Drill
Starting Drill in Distributed Mode
Connecting to the Cluster
Conclusion

3. Overview of Apache Drill
The Apache Hadoop Ecosystem
Drill Is a Low-Latency Query Engine
Distributed Processing with HDFS
Elements of a Drill System
Drill Operation: The 30,000-Foot View
Drill Is a Query Engine, Not a Database
Drill Operation Overview
Drill Components
SQL Session State
Statement Preparation
Statement Execution
Low-Latency Features
Conclusion

4. Querying Delimited Data
Ways of Querying Data with Drill
Other Interfaces
Drill SQL Query Format
Choosing a Data Source
Defining a Workspace
Specifying a Default Data Source
Accessing Columns in a Query
Delimited Data with Column Headers
Table Functions
Querying Directories
Understanding Drill Data Types
Cleaning and Preparing Data Using String Manipulation Functions
Complex Data Conversion Functions
Working with Dates and Times in Drill
Converting Strings to Dates
Reformatting Dates
Date Arithmetic and Manipulation
Date and Time Functions in Drill
Creating Views
Data Analysis Using Drill
Summarizing Data with Aggregate Functions
Common Problems in Querying Delimited Data
Spaces in Column Names
Illegal Characters in Column Headers
Reserved Words in Column Names
Conclusion

5. Analyzing Complex and Nested Data
Arrays and Maps
Arrays in Drill
Accessing Maps (Key–Value Pairs) in Drill
Querying Nested Data
Analyzing Log Files with Drill
Configuring Drill to Read HTTPD Web Server Logs
Querying Web Server Logs
Other Log Analysis with Drill
Conclusion

6. Connecting Drill to Data Sources
Querying Multiple Data Sources
Configuring a New Storage Plug-in
Connecting Drill to a Relational Database
Querying Data in Hadoop from Drill
Connecting to and Querying HBase from Drill
Querying Hive Data from Drill
Connecting to and Querying Streaming Data with Drill and Kafka
Connecting to and Querying Kudu
Connecting to and Querying MongoDB from Drill
Connecting Drill to Cloud Storage
Querying Time Series Data from Drill and OpenTSDB
Conclusion

7. Connecting to Drill
Understanding Drill’s Interfaces
JDBC and Drill
ODBC and Drill
Drill’s REST Interface
Connecting to Drill with Python
Using drillpy to Query Drill
Connecting to Drill Using pydrill
Other Ways of Connecting to Drill from Python
Connecting to Drill Using R
Querying Drill from R Using sergeant
Connecting to Drill Using Java
Querying Drill with PHP
Using the Connector
Querying Drill from PHP
Interacting with Drill from PHP
Querying Drill Using Node.js
Using Drill as a Data Source in BI Tools
Exploring Data with Apache Zeppelin and Drill
Exploring Data with Apache Superset
Conclusion

8. Data Engineering with Drill
Schema-on-Read
The SQL Relational Model
Data Life Cycle: Data Exploration to Production
Schema Inference
Data Source Inference
Storage Plug-ins
Storage Configurations
Workspaces
Querying Directories
Default Schema
File Type Inference
Format Plug-ins and Format Configuration
Format Inference
File Format Variations
Schema Inference Overview
Distributed File Scans
Schema Inference for Delimited Data
CSV Summary
Schema Inference for JSON
Ambiguous Numeric Schemas
Aligning Schemas Across Files
JSON Objects
JSON Lists in Drill
JSON Summary
Using Drill with the Parquet File Format
Schema Evolution in Parquet
Partitioning Data Directories
Defining a Table Workspace
Working with Queries in Production
Capturing Schema Mapping in Views
Running Challenging Queries in Scripts
Conclusion

9. Deploying Drill in Production
Installing Drill
Prerequisites
Production Installation
Configuring ZooKeeper
Configuring Memory
Configuring Logging
Testing the Installation
Distributing Drill Binaries and Configuration
Starting the Drill Cluster
Configuring Storage
Working with Apache Hadoop HDFS
Working with Amazon S3
Admission Control
Additional Configuration
User-Defined Functions and Custom Plug-ins
Security
Logging Levels
Controlling CPU Usage
Monitoring
Monitoring the Drill Process
Monitoring JMX Metrics
Monitoring Queries
Other Deployment Options
MapR Installer
Drill-on-YARN
Docker
Conclusion

10. Setting Up Your Development Environment
Installing Maven
Creating the Drill Build Environment
Setting Up Git and Getting the Source Code
Building Drill from Source
Installing the IDE
Conclusion

11. Writing Drill User-Defined Functions
Use Case: Finding and Filtering Valid Credit Card Numbers
How User-Defined Functions Work in Drill
Structure of a Simple Drill UDF
The pom.xml File
The Function File
The Simple Function API
Putting It All Together
Building and Installing Your UDF
Statically Installing a UDF
Dynamically Installing a UDF
Complex Functions: UDFs That Return Maps or Arrays
Example: Extracting User Agent Metadata
The ComplexWriter
Writing Aggregate User-Defined Functions
The Aggregate Function API
Example Aggregate UDF: Kendall’s Rank Correlation Coefficient
Conclusion

12. Writing a Format Plug-in
The Example Regex Format Plug-in
Creating the “Easy” Format Plug-in
Creating the Maven pom.xml File
Creating the Plug-in Package
Drill Module Configuration
Format Plug-in Configuration
Cautions Before Getting Started
Creating the Regex Plug-in Configuration Class
Copyright Headers and Code Format
Testing the Configuration
Fixing Configuration Problems
Troubleshooting
Creating the Format Plug-in Class
Creating a Test File
Configuring RAT
Efficient Debugging
Creating the Unit Test
How Drill Finds Your Plug-in
The Record Reader
Testing the Reader Shell
Logging
Error Handling
Setup
Regex Parsing
Defining Column Names
Projection
Column Projection Accounting
Project None
Project All
Project Some
Opening the File
Record Batches
Drill’s Columnar Structure
Defining Vectors
Reading Data
Loading Data into Vectors
Releasing Resources
Testing the Reader
Testing the Wildcard Case
Testing Explicit Projection
Testing Empty Projection
Scaling Up
Additional Details
File Chunks
Default Format Configuration
Next Steps
Production Build
Contributing to Drill: The Pull Request
Maintaining Your Branch
Create a Plug-In Project
Conclusion

13. Unique Uses of Drill
Finding Photos Taken Within a Geographic Region
Drilling Excel Files
The pom.xml File
The Excel Custom Record Reader
Using the Excel Format Plug-in
Network Packet Analysis (PCAP) with Drill
Examples of Queries Using PCAP Data Files
Analyzing Twitter Data with Drill
Using Drill in a Machine Learning Pipeline
Making Predictions Within Drill
Building and Serializing a Model
Writing the UDF Wrapper
Making Predictions Using the UDF
Conclusion

A. List of Drill Functions
Aggregate and Window Functions
Window Functions
Cryptological and Hashing Functions
Data Conversion Functions
Geospatial Functions
Math and Trigonometric Functions
Networking Functions
Null Handling Functions
String Manipulation Functions
Approximate String Matching Functions
Phonetic Functions
String Distance Functions

B. Drill Formatting Strings

Index
