
Natural Language Annotation for Machine Learning

A Guide to Corpus-Building for Applications

Paperback, English, 2012, 1st edition, ISBN 9781449306663

Summary

Create your own natural language training corpus for machine learning. Whether you're working with English, Chinese, or any other natural language, this hands-on book guides you through a proven annotation development cycle: the process of adding metadata to your training corpus to help ML algorithms work more efficiently. You don't need any programming or linguistics experience to get started.

Using detailed examples at every step, you'll learn how the MATTER Annotation Development Process helps you Model, Annotate, Train, Test, Evaluate, and Revise your training corpus. You also get a complete walkthrough of a real-world annotation project.

- Define a clear annotation goal before collecting your dataset (corpus)
- Learn tools for analyzing the linguistic content of your corpus
- Build a model and specification for your annotation project
- Examine the different annotation formats, from basic XML to the Linguistic Annotation Framework
- Create a gold standard corpus that can be used to train and test ML algorithms
- Select the ML algorithms that will process your annotated data
- Evaluate the test results and revise your annotation task
- Learn how to use lightweight software for annotating texts and adjudicating the annotations

Specifications

ISBN-13: 9781449306663
Language: English
Binding: Paperback
Number of pages: 326
Publisher: O'Reilly
Edition: 1
Publication date: 20-10-2012
Main category: IT management / ICT

About James Pustejovsky

James Pustejovsky holds the TJX/Felberg Chair in Computer Science at Brandeis University, where he directs the Lab for Linguistics and Computation, and chairs both the Program in Language and Linguistics and the Computational Linguistics MA Program. He has conducted research in computational linguistics, AI, lexical semantics, temporal reasoning, and corpus linguistics and language annotation. He is currently head of a working group within ISO/TC37/SC4 to develop a Semantic Annotation Framework, and is chief architect of TimeML and ISO-TimeML, a newly adopted ISO standard for temporal information in language, as well as the draft specification for spatial information, ISO-Space. Pustejovsky was PI of a large NSF-funded effort, "Towards a Comprehensive Linguistic Annotation of Language," that involved merging several diverse linguistic annotations (PropBank, NomBank, the Discourse Treebank, TimeBank, and Opinion Corpus) into a unified representation. Currently, he is Co-PI of a major project funded by the NSF to address interoperability for NLP data and tools. He has taught computational linguistics to both graduates and undergraduates for 20 years, and corpus linguistics for eight years. He has authored numerous books, including Interpreting Motion (with I. Mani, Oxford University Press, 2012), Recent Advances in Generative Lexicon Theory (Springer, 2012), Generative Lexicon (MIT, 1995), The Problem of Polysemy (with B. Boguraev, Cambridge, 1996), The Language of Time (Oxford, with I. Mani and R. Gaizauskas, 2005), and Semantics and the Lexicon (Kluwer, 1993). He is currently finishing a textbook for Cambridge University Press, entitled Lexicon, to appear in 2013.

About Amber Stubbs

Amber Stubbs recently completed her Ph.D. in Computer Science at Brandeis University, and is currently a Postdoctoral Associate at SUNY Albany. Her dissertation focused on creating an annotation methodology to aid in extracting high-level information from natural language files, particularly biomedical texts. Her website can be found at http://pages.cs.brandeis.edu/~astubbs/

Table of Contents

Preface

1. The Basics
-The Importance of Language Annotation
-A Brief History of Corpus Linguistics
-Language Data and Machine Learning
-The Annotation Development Cycle
-Summary

2. Defining Your Goal and Dataset
-Defining Your Goal
-Background Research
-Assembling Your Dataset
-The Size of Your Corpus
-Summary

3. Corpus Analytics
-Basic Probability for Corpus Analytics
-Counting Occurrences
-Language Models
-Summary

4. Building Your Model and Specification
-Some Example Models and Specs
-Adopting (or Not Adopting) Existing Models
-Different Kinds of Standards
-Summary

5. Applying and Adopting Annotation Standards
-Metadata Annotation: Document Classification
-Text Extent Annotation: Named Entities
-Linked Extent Annotation: Semantic Roles
-ISO Standards and You
-Summary

6. Annotation and Adjudication
-The Infrastructure of an Annotation Project
-Specification Versus Guidelines
-Be Prepared to Revise
-Preparing Your Data for Annotation
-Writing the Annotation Guidelines
-Annotators
-Choosing an Annotation Environment
-Evaluating the Annotations
-Creating the Gold Standard (Adjudication)
-Summary

7. Training: Machine Learning
-What Is Learning?
-Defining Our Learning Task
-Classifier Algorithms
-Sequence Induction Algorithms
-Clustering and Unsupervised Learning
-Semi-Supervised Learning
-Matching Annotation to Algorithms
-Summary

8. Testing and Evaluation
-Testing Your Algorithm
-Evaluating Your Algorithm
-Problems That Can Affect Evaluation
-Final Testing Scores
-Summary

9. Revising and Reporting
-Revising Your Project
-Reporting About Your Work
-Summary

10. Annotation: TimeML
-The Goal of TimeML
-Related Research
-Building the Corpus
-Model: Preliminary Specifications
-Annotation: First Attempts
-Model: The TimeML Specification Used in TimeBank
-Annotation: The Creation of TimeBank
-TimeML Becomes ISO-TimeML
-Modeling the Future: Directions for TimeML
-Summary

11. Automatic Annotation: Generating TimeML
-The TARSQI Components
-Improvements to the TTK
-TimeML Challenges: TempEval-2
-Future of the TTK
-Summary

12. Afterword: The Future of Annotation
-Crowdsourcing Annotation
-Handling Big Data
-NLP Online and in the Cloud
-And Finally...

Appendix A: List of Available Corpora and Specifications
Appendix B: List of Software Resources
Appendix C: MAE User Guide
Appendix D: MAI User Guide
Appendix E: Bibliography

Index
