Natural Language Processing with Python

**Djug** · 06/02/2012, 08h11

Hello,

The DVP editorial team has read the following book for you: Natural Language Processing with Python, by Steven Bird, Ewan Klein, and Edward Loper.

Publisher's summary:

This book offers a highly accessible introduction to natural language processing, the field that supports a variety of language technologies, from predictive text and email filtering to automatic summarization and translation. With it, you'll learn how to write Python programs that work with large collections of unstructured text. You'll access richly annotated datasets using a comprehensive range of linguistic data structures, and you'll understand the main algorithms for analyzing the content and structure of written communication.

Packed with examples and exercises, Natural Language Processing with Python will help you:

Extract information from unstructured text, either to guess the topic or identify "named entities"
Analyze linguistic structure in text, including parsing and semantic analysis
Access popular linguistic databases, including WordNet and treebanks
Integrate techniques drawn from fields as diverse as linguistics and artificial intelligence

This book will help you gain practical skills in natural language processing using the Python programming language and the Natural Language Toolkit (NLTK) open source library. If you're interested in developing web applications, analyzing multilingual news sources, or documenting endangered languages -- or if you're simply curious to have a programmer's perspective on how human language works -- you'll find Natural Language Processing with Python both fascinating and immensely useful.

Have you read it? Do you plan to read it soon?

What do you think of it?

Tell us! Your opinion matters to us.

**Franck Dernoncourt** · 06/02/2012, 09h00

Here is a list of definitions I found interesting in this book (page references are given as book page number / page number in my PDF):

hypernym/hyponym relation, i.e., the relation between superordinate and subordinate concepts (p69 / 90)
Another important way to navigate the WordNet network is from items to their components (meronyms) or to the things they are contained in (holonyms) (p70 / 91)
the same dictionary word (or lemma) (p104 / 125)
strip off any affixes, a task known as stemming. (p107 / 128)
Tokenization is the task of cutting a string into identifiable linguistic units that constitute a piece of language data (p109 / 130)
Tokenization is an instance of a more general problem of segmentation. (p112 / 133)
The %s and %d symbols are called conversion specifiers (p118 / 139)
The process of classifying words into their parts-of-speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts-of-speech are also known as word classes or lexical categories. The collection of tags used for a particular task is known as a tagset. Our emphasis in this chapter is on exploiting tags, and tagging text automatically. (p179 / 200)
As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. As a consequence, there is a trade-off between the accuracy and the coverage of our results (and this is related to the precision/recall trade-off in information retrieval) (p205 / 226)
A convenient way to look at tagging errors is the confusion matrix. It charts expected tags (the gold standard) against actual tags generated by a tagger (p207 / 228)
All languages acquire new lexical items. A list of words recently added to the Oxford Dictionary of English includes cyberslacker, fatoush, blamestorm, SARS, cantopop, bupkis, noughties, muggle, and robata. Notice that all these new words are nouns, and this is reflected in calling nouns an open class. By contrast, prepositions are regarded as a closed class. That is, there is a limited set of words belonging to the class. (p211 / 232)
Common tagsets often capture some morphosyntactic information, that is, information about the kind of morphological markings that words receive by virtue of their syntactic role. (p212 / 233)
Classification is the task of choosing the correct class label for a given input. (p221 / 242)
The first step in creating a classifier is deciding what features of the input are relevant, and how to encode those features. For this example, we’ll start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name. (p223 / 244)
Recognizing the dialogue acts underlying the utterances in a dialogue can be an important first step in understanding the conversation. The NPS Chat Corpus, which was demonstrated in Section 2.1, consists of over 10,000 posts from instant messaging sessions. These posts have all been labeled with one of 15 dialogue act types, such as “Statement,” “Emotion,” “y/n Question,” and “Continuer.” (p235 / 256)
Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text called the “hypothesis”. (p235 / 256)
A confusion matrix is a table where each cell [i,j] indicates how often label j was predicted when the correct label was i. (p240 / 261)
Numeric features can be converted to binary features by binning, which replaces them with features such as “4<x<6.” (p249 / 270)
Named entities are definite noun phrases that refer to specific types of individuals, such as organizations, persons, dates, and so on. The goal of a named entity recognition (NER) system is to identify all textual mentions of the named entities. This can be broken down into two subtasks: identifying the boundaries of the NE, and identifying its type. (p281 / 302)
Since our grammar licenses two trees for this sentence, the sentence is said to be structurally ambiguous. The ambiguity in question is called a prepositional phrase attachment ambiguity. (p299 / 320)
A grammar is said to be recursive if a category occurring on the left-hand side of a production also appears on the right-hand side of a production. (p301 / 322)
A parser processes input sentences according to the productions of a grammar, and builds one or more constituent structures that conform to the grammar. A grammar is a declarative specification of well-formedness—it is actually just a string, not a program. A parser is a procedural interpretation of the grammar. It searches through the space of trees licensed by a grammar to find one that has the required sentence along its fringe. (p302 / 323)
Phrase structure grammar is concerned with how words and sequences of words combine to form constituents. A distinct and complementary approach, dependency grammar, focuses instead on how words relate to other words. (p310 / 331)
A dependency graph is projective if, when all the words are written in linear order, the edges can be drawn above the words without crossing. (p311 / 332)
In the tradition of dependency grammar, the verbs in Table 8-3 (whose dependents, namely Adj, NP, S, and PP, often called complements of the respective verbs, are different) are said to have different valencies. (p313 / 335)
This ambiguity is unavoidable, and leads to horrendous inefficiency in parsing seemingly innocuous sentences. The solution to these problems is provided by probabilistic parsing, which allows us to rank the parses of an ambiguous sentence on the basis of evidence from corpora. (p318 / 339)
A probabilistic context-free grammar (or PCFG) is a context-free grammar that associates a probability with each of its productions. It generates the same set of parses for a text that the corresponding context-free grammar does, and assigns a probability to each parse. The probability of a parse generated by a PCFG is simply the product of the probabilities of the productions used to generate it. (p320 / 341)
We can see that morphological properties of the verb co-vary with syntactic properties of the subject noun phrase. This co-variance is called agreement. (p329 / 350)
A feature path is a sequence of arcs that can be followed from the root node (p339 / 360)
A more general feature structure subsumes a less general one. (p341 / 362)
Merging information from two feature structures is called unification. (p342 / 363)
The two sentences in (5) can be both true, whereas those in (6) and (7) cannot be. In other words, the sentences in (5) are consistent, whereas those in (6) and (7) are inconsistent. (p365 / 386)
A model for a set W of sentences is a formal representation of a situation in which all the sentences in W are true. (p367 / 388)
An argument is valid if there is no possible situation in which its premises are all true and its conclusion is not true. (p369 / 390)
In the sentences "Cyril is tall. He likes maths.", we say that he is coreferential with the noun phrase Cyril. (p373 / 394)
In the sentence "Angus had a dog but he disappeared.", "he" is bound by the indefinite NP "a dog", and this is a different relationship than coreference. If we replace the pronoun he by a dog, the result "Angus had a dog but a dog disappeared" is not semantically equivalent to the original sentence "Angus had a dog but he disappeared." (p374 / 395)
In general, an occurrence of a variable x in a formula F is free in F if that occurrence doesn’t fall within the scope of all x or some x in F. Conversely, if x is free in formula F, then it is bound in all x.F and exists x.F. If all variable occurrences in a formula are bound, the formula is said to be closed. (p375 / 396)
The general process of determining truth or falsity of a formula in a model is called model checking. (p379 / 400)
Principle of Compositionality: the meaning of a whole is a function of the meanings of the parts and of the way they are syntactically combined. (p385 / 406)
λ is a binding operator, just as the first-order logic quantifiers are. (p387 / 408)
A discourse representation structure (DRS) presents the meaning of discourse in terms of a list of discourse referents and a list of conditions. The discourse referents are the things under discussion in the discourse, and they correspond to the individual variables of first-order logic. The DRS conditions apply to those discourse referents, and correspond to atomic open formulas of first-order logic. (p397 / 418)
Inline annotation modifies the original document by inserting special symbols or control sequences that carry the annotated information. For example, when part-of-speech tagging a document, the string "fly" might be replaced with the string "fly/NN", to indicate that the word fly is a noun in this context. In contrast, standoff annotation does not modify the original document, but instead creates a new file that adds annotation information using pointers that reference the original document. For example, this new document might contain the string "<token id=8 pos='NN'/>", to indicate that token 8 is a noun. (p421 / 442)
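
A few of the definitions above (confusion matrix, binning, probability of a PCFG parse) can be made concrete in plain Python. This is only a sketch and is not taken from the book: the function names `confusion_matrix`, `bin_feature`, and `parse_probability` and the toy data are assumptions chosen for illustration, and it uses the standard library rather than NLTK.

```python
from collections import Counter
from math import prod


def confusion_matrix(gold, predicted):
    """Count how often label j was predicted when the correct label was i.

    Returns a Counter keyed by (i, j) pairs.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter(zip(gold, predicted))


def bin_feature(name, value, edges):
    """Turn a numeric feature into a binary one by binning.

    `edges` is a sorted list of bin boundaries; the result names the
    half-open interval [low, high) that contains `value`.
    """
    bounds = [float("-inf")] + list(edges) + [float("inf")]
    for low, high in zip(bounds, bounds[1:]):
        if low <= value < high:
            return {f"{low}<={name}<{high}": True}
    raise ValueError(f"{value!r} cannot be binned")  # e.g. NaN


def parse_probability(production_probs):
    """Probability of a parse under a PCFG: the product of the
    probabilities of the productions used to generate it."""
    return prod(production_probs)


# Toy data (hypothetical tags, not taken from any corpus).
gold = ["NN", "VB", "NN", "JJ", "NN"]
predicted = ["NN", "NN", "NN", "JJ", "VB"]
matrix = confusion_matrix(gold, predicted)
print(matrix[("NN", "NN")])  # times NN was correctly tagged as NN
print(matrix[("VB", "NN")])  # times VB was mistagged as NN

print(bin_feature("x", 5, [4, 6]))
print(parse_probability([1.0, 0.5, 0.2]))
```

The confusion matrix is stored sparsely as a `Counter`, so cells that never occur simply read as 0. Note that the bins here are half-open, which differs slightly from the strict "4<x<6" wording of the quote.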

Another NLP dictionary available online: http://www.cse.unsw.edu.au/~billw/nlpdict.html

**Franck Dernoncourt** · 06/02/2012, 20h23

Also, for those interested in the subject, Stanford is launching an introductory course on natural language processing: http://www.nlp-class.org/

**Franck Dernoncourt** · 06/02/2012, 20h25

The detailed table of contents:

Chapter 1. Language Processing and Python 
Section 1.1. Computing with Language: Texts and Words 
Section 1.2. A Closer Look at Python: Texts as Lists of Words 
Section 1.3. Computing with Language: Simple Statistics 
Section 1.4. Back to Python: Making Decisions and Taking Control 
Section 1.5. Automatic Natural Language Understanding 
Section 1.6. Summary 
Section 1.7. Further Reading 
Section 1.8. Exercises 
Chapter 2. Accessing Text Corpora and Lexical Resources 
Section 2.1. Accessing Text Corpora 
Section 2.2. Conditional Frequency Distributions 
Section 2.3. More Python: Reusing Code 
Section 2.4. Lexical Resources 
Section 2.5. WordNet 
Section 2.6. Summary 
Section 2.7. Further Reading 
Section 2.8. Exercises 
Chapter 3. Processing Raw Text 
Section 3.1. Accessing Text from the Web and from Disk 
Section 3.2. Strings: Text Processing at the Lowest Level 
Section 3.3. Text Processing with Unicode 
Section 3.4. Regular Expressions for Detecting Word Patterns 
Section 3.5. Useful Applications of Regular Expressions 
Section 3.6. Normalizing Text 
Section 3.7. Regular Expressions for Tokenizing Text 
Section 3.8. Segmentation 
Section 3.9. Formatting: From Lists to Strings 
Section 3.10. Summary 
Section 3.11. Further Reading 
Section 3.12. Exercises 
Chapter 4. Writing Structured Programs 
Section 4.1. Back to the Basics 
Section 4.2. Sequences 
Section 4.3. Questions of Style 
Section 4.4. Functions: The Foundation of Structured Programming 
Section 4.5. Doing More with Functions 
Section 4.6. Program Development 
Section 4.7. Algorithm Design 
Section 4.8. A Sample of Python Libraries 
Section 4.9. Summary 
Section 4.10. Further Reading 
Section 4.11. Exercises 
Chapter 5. Categorizing and Tagging Words 
Section 5.1. Using a Tagger 
Section 5.2. Tagged Corpora 
Section 5.3. Mapping Words to Properties Using Python Dictionaries 
Section 5.4. Automatic Tagging 
Section 5.5. N-Gram Tagging 
Section 5.6. Transformation-Based Tagging 
Section 5.7. How to Determine the Category of a Word 
Section 5.8. Summary 
Section 5.9. Further Reading 
Section 5.10. Exercises 
Chapter 6. Learning to Classify Text 
Section 6.1. Supervised Classification 
Section 6.2. Further Examples of Supervised Classification 
Section 6.3. Evaluation 
Section 6.4. Decision Trees 
Section 6.5. Naive Bayes Classifiers 
Section 6.6. Maximum Entropy Classifiers 
Section 6.7. Modeling Linguistic Patterns 
Section 6.8. Summary 
Section 6.9. Further Reading 
Section 6.10. Exercises 
Chapter 7. Extracting Information from Text 
Section 7.1. Information Extraction 
Section 7.2. Chunking 
Section 7.3. Developing and Evaluating Chunkers 
Section 7.4. Recursion in Linguistic Structure 
Section 7.5. Named Entity Recognition 
Section 7.6. Relation Extraction 
Section 7.7. Summary 
Section 7.8. Further Reading 
Section 7.9. Exercises 
Chapter 8. Analyzing Sentence Structure 
Section 8.1. Some Grammatical Dilemmas 
Section 8.2. What's the Use of Syntax? 
Section 8.3. Context-Free Grammar 
Section 8.4. Parsing with Context-Free Grammar 
Section 8.5. Dependencies and Dependency Grammar 
Section 8.6. Grammar Development 
Section 8.7. Summary 
Section 8.8. Further Reading 
Section 8.9. Exercises 
Chapter 9. Building Feature-Based Grammars 
Section 9.1. Grammatical Features 
Section 9.2. Processing Feature Structures 
Section 9.3. Extending a Feature-Based Grammar 
Section 9.4. Summary 
Section 9.5. Further Reading 
Section 9.6. Exercises 
Chapter 10. Analyzing the Meaning of Sentences 
Section 10.1. Natural Language Understanding 
Section 10.2. Propositional Logic 
Section 10.3. First-Order Logic 
Section 10.4. The Semantics of English Sentences 
Section 10.5. Discourse Semantics 
Section 10.6. Summary 
Section 10.7. Further Reading 
Section 10.8. Exercises 
Chapter 11. Managing Linguistic Data 
Section 11.1. Corpus Structure: A Case Study 
Section 11.2. The Life Cycle of a Corpus 
Section 11.3. Acquiring Data 
Section 11.4. Working with XML 
Section 11.5. Working with Toolbox Data 
Section 11.6. Describing Language Resources Using OLAC Metadata 
Section 11.7. Summary 
Section 11.8. Further Reading 
Section 11.9. Exercises

**CompuTux** · 06/02/2012, 21h39

Very interesting! Thanks a lot for this write-up!

I think I'll browse through it, since I'm working on a text generation application.

I'll probably need a theoretical framework, so I'll also look at the theory, not just the practical applications in Python.

A question comes to mind for those of you who have read it:

Can the NLTK code be adapted and converted to PHP5?

I don't really know yet which language I'll have to choose for my application.

**Franck Dernoncourt** · 06/02/2012, 21h46

Glad you find it interesting. As far as I know, there is unfortunately no tool for converting Python code to PHP (let alone NLTK)... However, you can call Python code from PHP without any problem.
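
One possible way to call Python from PHP, sketched under assumptions: PHP launches a Python script as a subprocess, passes the text as a command-line argument, and reads JSON back from standard output. The script name `tokenize_cli.py` and the regex-based `tokenize` function are hypothetical stand-ins; a real script would call NLTK at that point.

```python
# tokenize_cli.py (hypothetical file name)
import json
import re
import sys


def tokenize(text):
    """Very rough tokenizer: runs of word characters, or single
    non-space symbols. A placeholder for a real NLTK call."""
    return re.findall(r"\w+|[^\w\s]", text)


def main(argv):
    """Join the command-line arguments into one text and print
    its tokens as a JSON object on standard output."""
    text = " ".join(argv[1:])
    print(json.dumps({"tokens": tokenize(text)}, ensure_ascii=False))


if __name__ == "__main__":
    main(sys.argv)
```

On the PHP side, something like `shell_exec('python3 tokenize_cli.py ' . escapeshellarg($text))` followed by `json_decode` should retrieve the tokens; the interpreter name and the script path depend on the server setup.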

**CompuTux** · 06/02/2012, 22h11

I had initially planned to write my application in PHP5, but I can certainly change my mind and write it in Python. Python is indeed very good, very simple, and easy to deploy on the web.

**Franck Dernoncourt** · 26/02/2012, 15h39

Why Python?

Natural Language Processing with Python
