nltk.chunk package

Submodules

nltk.chunk.api module

class nltk.chunk.api.ChunkParserI[source]

Bases: nltk.parse.api.ParserI

A processing interface for identifying non-overlapping groups in unrestricted text. Typically, chunk parsers are used to find base syntactic constituents, such as base noun phrases. Unlike ParserI, ChunkParserI guarantees that the parse() method will always generate a parse.

evaluate(gold)[source]

Score the accuracy of the chunker against the gold standard. Remove the chunking from the gold standard text, rechunk it using the chunker, and return a ChunkScore object reflecting the performance of this chunk parser.

Parameters:gold (list(Tree)) – The list of chunked sentences to score the chunker on.
Return type:ChunkScore
parse(tokens)[source]

Return the best chunk structure for the given tokens, as a tree.

Parameters:tokens (list(tuple)) – The list of (word, tag) tokens to be chunked.
Return type:Tree

nltk.chunk.named_entity module

Named entity chunker

class nltk.chunk.named_entity.NEChunkParser(train)[source]

Bases: nltk.chunk.api.ChunkParserI

Expected input: list of pos-tagged words

parse(tokens)[source]

Each token should be a pos-tagged word

class nltk.chunk.named_entity.NEChunkParserTagger(train)[source]

Bases: nltk.tag.sequential.ClassifierBasedTagger

The IOB tagger used by the chunk parser.

nltk.chunk.named_entity.build_model(fmt='binary')[source]
nltk.chunk.named_entity.cmp_chunks(correct, guessed)[source]
nltk.chunk.named_entity.load_ace_data(roots, fmt='binary', skip_bnews=True)[source]
nltk.chunk.named_entity.load_ace_file(textfile, fmt)[source]
nltk.chunk.named_entity.postag_tree(tree)[source]
nltk.chunk.named_entity.shape(word)[source]
nltk.chunk.named_entity.simplify_pos(s)[source]

nltk.chunk.regexp module

class nltk.chunk.regexp.ChinkRule(tag_pattern, descr)[source]

Bases: nltk.chunk.regexp.RegexpChunkRule

A rule specifying how to remove chinks from the chunks of a ChunkString, using a matching tag pattern. When applied to a ChunkString, it will find any substring that matches this tag pattern and that is contained in a chunk, and remove it from that chunk, splitting the chunk into two new chunks when the match falls in its interior.

unicode_repr()

Return a string representation of this rule. It has the form:

<ChinkRule: '<IN|VB.*>'>

Note that this representation does not include the description string; that string can be accessed separately with the descr() method.

Return type:str
class nltk.chunk.regexp.ChunkRule(tag_pattern, descr)[source]

Bases: nltk.chunk.regexp.RegexpChunkRule

A rule specifying how to add chunks to a ChunkString, using a matching tag pattern. When applied to a ChunkString, it will find any substring that matches this tag pattern and that is not already part of a chunk, and create a new chunk containing that substring.

unicode_repr()

Return a string representation of this rule. It has the form:

<ChunkRule: '<IN|VB.*>'>

Note that this representation does not include the description string; that string can be accessed separately with the descr() method.

Return type:str
class nltk.chunk.regexp.ChunkRuleWithContext(left_context_tag_pattern, chunk_tag_pattern, right_context_tag_pattern, descr)[source]

Bases: nltk.chunk.regexp.RegexpChunkRule

A rule specifying how to add chunks to a ChunkString, using three matching tag patterns: one for the left context, one for the chunk, and one for the right context. When applied to a ChunkString, it will find any substring that matches the chunk tag pattern, is surrounded by substrings that match the two context patterns, and is not already part of a chunk; and create a new chunk containing the substring that matched the chunk tag pattern.

Caveat: Both the left and right context are consumed when this rule matches; therefore, if you need to find overlapping matches, you will need to apply your rule more than once.

unicode_repr()

Return a string representation of this rule. It has the form:

<ChunkRuleWithContext: '<IN>', '<NN>', '<DT>'>

Note that this representation does not include the description string; that string can be accessed separately with the descr() method.

Return type:str
class nltk.chunk.regexp.ChunkString(chunk_struct, debug_level=1)[source]

Bases: builtins.object

A string-based encoding of a particular chunking of a text. Internally, the ChunkString class uses a single string to encode the chunking of the input text. This string contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:

{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>

ChunkStrings are created from tagged texts (i.e., lists of tokens whose type is TaggedType). Initially, nothing is chunked.

The chunking of a ChunkString can be modified with the xform() method, which uses a regular expression to transform the string representation. These transformations should only add and remove braces; they should not modify the sequence of angle-bracket delimited tags.
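The kind of transformation xform() expects can be sketched with plain re.sub on a hand-written encoding string. This sketch does not call ChunkString itself; the input string and the pattern are illustrative assumptions.

```python
import re

# Hand-written encoding of a tagged sentence with nothing chunked yet
# (an assumed example, not produced by ChunkString).
encoded = "<DT><JJ><NN><VBN><IN><DT><NN><.>"

# Wrap every run matching <DT>?<JJ>*<NN> in braces. The named group is
# reused in the replacement, so the tags themselves are left untouched.
chunked = re.sub(r"(?P<chunk>(<DT>)?(<JJ>)*<NN>)", r"{\g<chunk>}", encoded)
print(chunked)  # {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>

# A valid transformation only adds or removes braces, so stripping the
# braces must give back the original tag sequence.
tags_preserved = re.sub(r"[{}]", "", chunked) == encoded
```

Checking that the brace-free string is unchanged is one simple way to confirm that a candidate transformation respects the constraint described above.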

Variables:
  • _str

    The internal string representation of the text’s encoding. This string representation contains a sequence of angle-bracket delimited tags, with chunking indicated by braces. An example of this encoding is:

    {<DT><JJ><NN>}<VBN><IN>{<DT><NN>}<.>{<DT><NN>}<VBD><.>
  • _pieces – The tagged tokens and chunks encoded by this ChunkString.
  • _debug – The debug level. See the constructor docs.
  • IN_CHUNK_PATTERN – A zero-width regexp pattern string that will only match positions that are in chunks.
  • IN_CHINK_PATTERN – A zero-width regexp pattern string that will only match positions that are in chinks.
CHUNK_TAG = '(<[^\\{\\}<>]+?>)'
CHUNK_TAG_CHAR = '[^\\{\\}<>]'
IN_CHINK_PATTERN = '(?=[^\\}]*(\\{|$))'
IN_CHUNK_PATTERN = '(?=[^\\{]*\\})'
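To see how the two zero-width patterns behave, the following sketch appends each of them to a simple tag regexp and runs it over a hand-written encoding. It uses only the standard re module; the sample string and the tag regexp are assumptions made for illustration.

```python
import re

# The two lookahead patterns listed above, written as raw strings.
IN_CHUNK = r"(?=[^\{]*\})"
IN_CHINK = r"(?=[^\}]*(\{|$))"

# Assumed sample encoding: two chunks separated by a two-tag chink.
encoded = "{<DT><NN>}<VBD><IN>{<NN>}"

# A single tag: angle brackets around characters that are not brackets or braces.
TAG = r"<[^<>{}]+>"

# finditer is used instead of findall because IN_CHINK contains a capturing
# group, which would change what findall returns.
tags_in_chunks = [m.group(0) for m in re.finditer(TAG + IN_CHUNK, encoded)]
tags_in_chinks = [m.group(0) for m in re.finditer(TAG + IN_CHINK, encoded)]

print(tags_in_chunks)  # ['<DT>', '<NN>', '<NN>']
print(tags_in_chinks)  # ['<VBD>', '<IN>']
```

The chunk lookahead succeeds only if a closing brace is reached before any opening brace; the chink lookahead succeeds only if an opening brace or the end of the string is reached before any closing brace.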
to_chunkstruct(chunk_label='CHUNK')[source]

Return the chunk structure encoded by this ChunkString.

Return type:Tree
Raises ValueError:
 If a transformation has generated an invalid chunkstring.
unicode_repr()

Return a string representation of this ChunkString. It has the form:

<ChunkString: '{<DT><JJ><NN>}<VBN><IN>{<DT><NN>}'>
Return type:str
xform(regexp, repl)[source]

Apply the given transformation to the string encoding of this ChunkString. In particular, find all occurrences that match regexp, and replace them using repl (as done by re.sub).

This transformation should only add and remove braces; it should not modify the sequence of angle-bracket delimited tags. Furthermore, this transformation may not result in improper bracketing. Note, in particular, that bracketing may not be nested.

Parameters:
  • regexp (str or regexp) – A regular expression matching the substring that should be replaced. This will typically include a named group, which can be used by repl.
  • repl (str) – An expression specifying what should replace the matched substring. Typically, this will include a named replacement group, specified by regexp.
Return type:None
Raises ValueError:
 If this transformation generated an invalid chunkstring.

class nltk.chunk.regexp.ExpandLeftRule(left_tag_pattern, right_tag_pattern, descr)[source]

Bases: nltk.chunk.regexp.RegexpChunkRule

A rule specifying how to expand chunks in a ChunkString to the left, using two matching tag patterns: a left pattern, and a right pattern. When applied to a ChunkString, it will find any chunk whose beginning matches the right pattern and that is immediately preceded by a chink whose end matches the left pattern. It will then expand the chunk to incorporate the new material on the left.

unicode_repr()

Return a string representation of this rule. It has the form:

<ExpandLeftRule: '<NN|DT|JJ>', '<NN|JJ>'>

Note that this representation does not include the description string; that string can be accessed separately with the descr() method.

Return type:str
class nltk.chunk.regexp.ExpandRightRule(left_tag_pattern, right_tag_pattern, descr)[source]

Bases: nltk.chunk.regexp.RegexpChunkRule

A rule specifying how to expand chunks in a ChunkString to the right, using two matching tag patterns: a left pattern, and a right pattern. When applied to a ChunkString, it will find any chunk whose end matches the left pattern and that is immediately followed by a chink whose beginning matches the right pattern. It will then expand the chunk to incorporate the new material on the right.

unicode_repr()

Return a string representation of this rule. It has the form:

<ExpandRightRule: '<NN|DT|JJ>', '<NN|JJ>'>

Note that this representation does not include the description string; that string can be accessed separately with the descr() method.

Return type:str
class nltk.chunk.regexp.MergeRule(left_tag_pattern, right_tag_pattern, descr)[source]

Bases: nltk.chunk.regexp.RegexpChunkRule

A rule specifying how to merge chunks in a ChunkString, using two matching tag patterns: a left pattern, and a right pattern. When applied to a ChunkString, it will find any chunk whose end matches the left pattern and that is immediately followed by a chunk whose beginning matches the right pattern. It will then merge those two chunks into a single chunk.

unicode_repr()

Return a string representation of this rule. It has the form:

<MergeRule: '<NN|DT|JJ>', '<NN|JJ>'>

Note that this representation does not include the description string; that string can be accessed separately with the descr() method.

Return type:str
class nltk.chunk.regexp.RegexpChunkParser(rules, chunk_label='NP', root_label='S', trace=0)[source]

Bases: nltk.chunk.api.ChunkParserI

A regular expression based chunk parser. RegexpChunkParser uses a sequence of “rules” to find chunks of a single type within a text. The chunking of the text is encoded using a ChunkString, and each rule acts by modifying the chunking in the ChunkString. The rules are all implemented using regular expression matching and substitution.

The RegexpChunkRule class and its subclasses (ChunkRule, ChinkRule, UnChunkRule, MergeRule, and SplitRule) define the rules that are used by RegexpChunkParser. Each rule defines an apply() method, which modifies the chunking encoded by a given ChunkString.

Variables:
  • _rules – The list of rules that should be applied to a text.
  • _trace – The default level of tracing.
parse(chunk_struct, trace=None)[source]
Parameters:
  • chunk_struct (Tree) – the chunk structure to be (further) chunked
  • trace (int) – The level of tracing that should be used when parsing a text. 0 will generate no tracing output; 1 will generate normal tracing output; and 2 or higher will generate verbose tracing output. This value overrides the trace level value that was given to the constructor.
Return type:Tree
Returns:a chunk structure that encodes the chunks in a given tagged sentence. A chunk is a non-overlapping linguistic group, such as a noun phrase. The set of chunks identified in the chunk structure depends on the rules used to define this RegexpChunkParser.

rules()[source]
Returns:the sequence of rules used by RegexpChunkParser.
Return type:list(RegexpChunkRule)
unicode_repr()
Returns:a concise string representation of this RegexpChunkParser.
Return type:str
class nltk.chunk.regexp.RegexpChunkRule(regexp, repl, descr)[source]

Bases: builtins.object

A rule specifying how to modify the chunking in a ChunkString, using a transformational regular expression. The RegexpChunkRule class itself can be used to implement any transformational rule based on regular expressions. There are also a number of subclasses, which can be used to implement simpler types of rules, based on matching regular expressions.

Each RegexpChunkRule has a regular expression and a replacement expression. When a RegexpChunkRule is “applied” to a ChunkString, it searches the ChunkString for any substring that matches the regular expression, and replaces it using the replacement expression. This search/replace operation has the same semantics as re.sub.

Each RegexpChunkRule also has a description string, which gives a short (typically less than 75 characters) description of the purpose of the rule.

The transformation defined by this RegexpChunkRule should only add and remove braces; it should not modify the sequence of angle-bracket delimited tags. Furthermore, this transformation may not result in nested or mismatched bracketing.

apply(chunkstr)[source]

Apply this rule to the given ChunkString. See the class reference documentation for a description of what it means to apply a rule.

Parameters:chunkstr (ChunkString) – The chunkstring to which this rule is applied.
Return type:None
Raises ValueError:
 If this transformation generated an invalid chunkstring.
descr()[source]

Return a short description of the purpose and/or effect of this rule.

Return type:str
static parse(s)[source]

Create a RegexpChunkRule from a string description. Currently, the following formats are supported:

{regexp}         # chunk rule
}regexp{         # chink rule
regexp}{regexp   # split rule
regexp{}regexp   # merge rule

Where regexp is a regular expression for the rule. Any text following the comment marker (#) will be used as the rule’s description:

>>> from nltk.chunk.regexp import RegexpChunkRule
>>> RegexpChunkRule.parse('{<DT>?<NN.*>+}')
<ChunkRule: '<DT>?<NN.*>+'>
unicode_repr()

Return a string representation of this rule. It has the form:

<RegexpChunkRule: '{<IN|VB.*>}'->'<IN>'>

Note that this representation does not include the description string; that string can be accessed separately with the descr() method.

Return type:str
class nltk.chunk.regexp.RegexpParser(grammar, root_label='S', loop=1, trace=0)[source]

Bases: nltk.chunk.api.ChunkParserI

A grammar based chunk parser. chunk.RegexpParser uses a set of regular expression patterns to specify the behavior of the parser. The chunking of the text is encoded using a ChunkString, and each rule acts by modifying the chunking in the ChunkString. The rules are all implemented using regular expression matching and substitution.

A grammar contains one or more clauses in the following form:

NP:
  {<DT|JJ>}          # chunk determiners and adjectives
  }<[\.VI].*>+{      # chink any tag beginning with V, I, or .
  <.*>}{<DT>         # split a chunk at a determiner
  <DT|JJ>{}<NN.*>    # merge chunk ending with det/adj
                     # with one starting with a noun

The patterns of a clause are executed in order. An earlier pattern may introduce a chunk boundary that prevents a later pattern from executing. Sometimes an individual pattern will match on multiple, overlapping extents of the input. As with regular expression substitution more generally, the chunker will identify the first match possible, then continue looking for matches after this one has ended.

The clauses of a grammar are also executed in order. A cascaded chunk parser is one having more than one clause. The maximum depth of a parse tree created by this chunk parser is the same as the number of clauses in the grammar.

When tracing is turned on, the comment portion of a line is displayed each time the corresponding pattern is applied.
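As a rough illustration of a one-clause grammar, the sketch below chunks noun phrases in a hand-tagged sentence. The grammar, the sentence and its tags are made up for this example, and it assumes an installed NLTK in which RegexpParser.parse() accepts a Tree of (word, tag) leaves.

```python
from nltk.tree import Tree
from nltk.chunk.regexp import RegexpParser

# One clause with a single chunk pattern (an assumed example grammar).
grammar = r"""
NP:
  {<DT>?<JJ>*<NN.*>+}   # optional determiner, adjectives, then nouns
"""
parser = RegexpParser(grammar)

# A hand-tagged sentence wrapped in a flat tree, so that nothing is chunked yet.
sentence = Tree("S", [
    ("the", "DT"), ("little", "JJ"), ("cat", "NN"),
    ("sat", "VBD"), ("on", "IN"),
    ("the", "DT"), ("mat", "NN"),
])

tree = parser.parse(sentence)

# Collect the leaves of every NP chunk that the grammar produced.
np_chunks = [st.leaves() for st in tree.subtrees(lambda t: t.label() == "NP")]
print(np_chunks)
```

With this grammar the two determiner-led noun groups become NP subtrees, while the verb and the preposition stay as unchunked leaves of the root.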

Variables:
  • _start – The start symbol of the grammar (the root node of resulting trees)
  • _stages – The list of parsing stages corresponding to the grammar
parse(chunk_struct, trace=None)[source]

Apply the chunk parser to this input.

Parameters:
  • chunk_struct (Tree) – the chunk structure to be (further) chunked (this tree is modified, and is also returned)
  • trace (int) – The level of tracing that should be used when parsing a text. 0 will generate no tracing output; 1 will generate normal tracing output; and 2 or higher will generate verbose tracing output. This value overrides the trace level value that was given to the constructor.
Returns:the chunked output.
Return type:Tree

unicode_repr()
Returns:a concise string representation of this chunk.RegexpParser.
Return type:str
class nltk.chunk.regexp.SplitRule(left_tag_pattern, right_tag_pattern, descr)[source]

Bases: nltk.chunk.regexp.RegexpChunkRule

A rule specifying how to split chunks in a ChunkString, using two matching tag patterns: a left pattern, and a right pattern. When applied to a ChunkString, it will find any chunk that matches the left pattern followed by the right pattern. It will then split the chunk into two new chunks, at the point between the two pattern matches.

unicode_repr()

Return a string representation of this rule. It has the form:

<SplitRule: '<NN>', '<DT>'>

Note that this representation does not include the description string; that string can be accessed separately with the descr() method.

Return type:str
class nltk.chunk.regexp.UnChunkRule(tag_pattern, descr)[source]

Bases: nltk.chunk.regexp.RegexpChunkRule

A rule specifying how to remove chunks from a ChunkString, using a matching tag pattern. When applied to a ChunkString, it will find any complete chunk that matches this tag pattern, and un-chunk it.

unicode_repr()

Return a string representation of this rule. It has the form:

<UnChunkRule: '<IN|VB.*>'>

Note that this representation does not include the description string; that string can be accessed separately with the descr() method.

Return type:str
nltk.chunk.regexp.demo()[source]

A demonstration for the RegexpChunkParser class. A single text is parsed with four different chunk parsers, using a variety of rules and strategies.

nltk.chunk.regexp.demo_eval(chunkparser, text)[source]

Demonstration code for evaluating a chunk parser, using a ChunkScore. This function assumes that text contains one sentence per line, and that each sentence has the form expected by tree.chunk. It runs the given chunk parser on each sentence in the text, and scores the result. It prints the final score (precision, recall, and f-measure), and reports the set of chunks that were missed and the set of chunks that were incorrect. (At most 10 missing chunks and 10 incorrect chunks are reported.)

Parameters:
  • chunkparser (ChunkParserI) – The chunkparser to be tested
  • text (str) – The chunked tagged text that should be used for evaluation.
nltk.chunk.regexp.tag_pattern2re_pattern(tag_pattern)[source]

Convert a tag pattern to a regular expression pattern. A “tag pattern” is a modified version of a regular expression, designed for matching sequences of tags. The differences between regular expression patterns and tag patterns are:

  • In tag patterns, '<' and '>' act as parentheses; so '<NN>+' matches one or more repetitions of '<NN>', not '<NN' followed by one or more repetitions of '>'.
  • Whitespace in tag patterns is ignored. So '<DT> | <NN>' is equivalent to '<DT>|<NN>'.
  • In tag patterns, '.' is equivalent to '[^{}<>]'; so '<NN.*>' matches any single tag starting with 'NN'.

In particular, tag_pattern2re_pattern performs the following transformations on the given pattern:

  • Replace ‘.’ with ‘[^<>{}]’
  • Remove any whitespace
  • Add extra parens around ‘<’ and ‘>’, to make ‘<’ and ‘>’ act like parentheses. E.g., so that in ‘<NN>+’, the ‘+’ has scope over the entire ‘<NN>’; and so that in ‘<NN|IN>’, the ‘|’ has scope over ‘NN’ and ‘IN’, but not ‘<’ or ‘>’.
  • Check to make sure the resulting pattern is valid.
Parameters:tag_pattern (str) – The tag pattern to convert to a regular expression pattern.
Raises ValueError:
 If tag_pattern is not a valid tag pattern. In particular, tag_pattern should not include braces; and it should not contain nested or mismatched angle-brackets.
Return type:str
Returns:A regular expression pattern corresponding to tag_pattern.
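The transformations listed above can be approximated in a few lines. This is a simplified sketch, not the library's implementation: the function name is hypothetical, it skips the validity check, and it does not handle escaped dots.

```python
import re

def sketch_tag_pattern_to_regex(tag_pattern):
    """Simplified, hypothetical stand-in for tag_pattern2re_pattern."""
    # Remove any whitespace.
    pattern = re.sub(r"\s", "", tag_pattern)
    # Make '<' and '>' act like parentheses around each tag.
    pattern = pattern.replace("<", "(<(").replace(">", ")>)")
    # Make '.' match any character that cannot end or nest a tag. This is
    # done last so the brackets inside the character class are not rewritten.
    pattern = pattern.replace(".", "[^{}<>]")
    return pattern

regex = sketch_tag_pattern_to_regex("<DT>? <NN.*>+")
print(regex)  # (<(DT)>)?(<(NN[^{}<>]*)>)+

# The '+' now has scope over a whole tag, and '.*' cannot cross a tag boundary.
matches_noun_run = re.fullmatch(regex, "<DT><NN><NNS>") is not None
matches_verb = re.fullmatch(regex, "<DT><VB>") is not None
```

Because the dot is restricted to non-bracket characters, '<NN.*>' can match '<NNS>' or '<NNP>' but never spills over into the following tag.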

nltk.chunk.util module

class nltk.chunk.util.ChunkScore(**kwargs)[source]

Bases: builtins.object

A utility class for scoring chunk parsers. ChunkScore can evaluate a chunk parser’s output, based on a number of statistics (precision, recall, f-measure, missed chunks, incorrect chunks). It can also combine the scores from the parsing of multiple texts; this makes it significantly easier to evaluate a chunk parser that operates one sentence at a time.

Texts are evaluated with the score method. The results of evaluation can be accessed via a number of accessor methods, such as precision and f_measure. A typical use of the ChunkScore class is:

>>> chunkscore = ChunkScore()           
>>> for correct in correct_sentences:   
...     guess = chunkparser.parse(correct.leaves())   
...     chunkscore.score(correct, guess)              
>>> print('F Measure:', chunkscore.f_measure())       
F Measure: 0.823
Variables:
  • kwargs

    Keyword arguments:

    • max_tp_examples: The maximum number of actual examples of true positives to record. This affects the correct member function: correct will not return more than this number of true positive examples. This does not affect any of the numerical metrics (precision, recall, or f-measure).
    • max_fp_examples: The maximum number of actual examples of false positives to record. This affects the incorrect member function and the guessed member function: incorrect will not return more than this number of examples, and guessed will not return more than this number of true positive examples. This does not affect any of the numerical metrics (precision, recall, or f-measure).
    • max_fn_examples: The maximum number of actual examples of false negatives to record. This affects the missed member function and the correct member function: missed will not return more than this number of examples, and correct will not return more than this number of true negative examples. This does not affect any of the numerical metrics (precision, recall, or f-measure).
    • chunk_label: A regular expression indicating which chunks should be compared. Defaults to '.*' (i.e., all chunks).
  • _tp – List of true positives
  • _fp – List of false positives
  • _fn – List of false negatives
  • _tp_num – Number of true positives
  • _fp_num – Number of false positives
  • _fn_num – Number of false negatives.
accuracy()[source]

Return the overall tag-based accuracy for all texts that have been scored by this ChunkScore, using the IOB (conll2000) tag encoding.

Return type:float
correct()[source]

Return the chunks which were included in the correct chunk structures, listed in input order.

Return type:list of chunks
f_measure(alpha=0.5)[source]

Return the overall F measure for all texts that have been scored by this ChunkScore.

Parameters:alpha (float) – the relative weighting of precision and recall. Larger alpha biases the score towards the precision value, while smaller alpha biases the score towards the recall value. alpha should have a value in the range [0,1].
Return type:float
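The role of alpha can be made concrete with a small standalone function. This sketch assumes the usual weighted harmonic mean of precision and recall and a zero result when either value is zero; it is written independently of ChunkScore, and the function name is only illustrative.

```python
def weighted_f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean of precision and recall (illustrative sketch)."""
    # Guard against division by zero when nothing was found or nothing matched.
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

balanced = weighted_f_measure(0.8, 0.4)                    # alpha = 0.5
precision_only = weighted_f_measure(0.8, 0.4, alpha=1.0)   # ignores recall
recall_only = weighted_f_measure(0.8, 0.4, alpha=0.0)      # ignores precision
print(balanced, precision_only, recall_only)
```

At alpha = 1 the score reduces to the precision, and at alpha = 0 it reduces to the recall, which matches the bias described for the alpha parameter.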
guessed()[source]

Return the chunks which were included in the guessed chunk structures, listed in input order.

Return type:list of chunks
incorrect()[source]

Return the chunks which were included in the guessed chunk structures, but not in the correct chunk structures, listed in input order.

Return type:list of chunks
missed()[source]

Return the chunks which were included in the correct chunk structures, but not in the guessed chunk structures, listed in input order.

Return type:list of chunks
precision()[source]

Return the overall precision for all texts that have been scored by this ChunkScore.

Return type:float
recall()[source]

Return the overall recall for all texts that have been scored by this ChunkScore.

Return type:float
score(correct, guessed)[source]

Given a correctly chunked sentence, score another chunked version of the same sentence.

Parameters:
  • correct (chunk structure) – The known-correct (“gold standard”) chunked sentence.
  • guessed (chunk structure) – The chunked sentence to be scored.
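The relationship between the correct, guessed, missed and incorrect chunk sets and the precision and recall figures can be sketched with ordinary sets. Representing each chunk as a (start, end, label) span is an assumption made for this illustration; it is not claimed to be ChunkScore's internal format.

```python
# Gold-standard chunks and a chunker's guesses, as assumed (start, end, label) spans.
correct = {(0, 2, "NP"), (4, 6, "NP"), (7, 9, "NP")}
guessed = {(0, 2, "NP"), (4, 5, "NP"), (7, 9, "NP"), (9, 10, "NP")}

true_positives = correct & guessed   # chunks found exactly
incorrect = guessed - correct        # false positives
missed = correct - guessed           # false negatives

# Precision: share of guessed chunks that are right.
precision = len(true_positives) / len(guessed)
# Recall: share of correct chunks that were found.
recall = len(true_positives) / len(correct)

print(precision, recall)
```

A chunk counts as a true positive only when its boundaries and label match exactly, so the guess (4, 5, "NP") is both an incorrect chunk and the cause of a missed one.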
nltk.chunk.util.accuracy(chunker, gold)[source]

Score the accuracy of the chunker against the gold standard. Strip the chunk information from the gold standard and rechunk it using the chunker, then compute the accuracy score.

Parameters:
  • chunker (ChunkParserI) – The chunker being evaluated.
  • gold (tree) – The chunk structures to score the chunker on.
Return type:float

nltk.chunk.util.conllstr2tree(s, chunk_types=('NP', 'PP', 'VP'), root_label='S')[source]

Return a chunk structure for a single sentence encoded in the given CoNLL 2000 style string. This function converts a CoNLL IOB string into a tree. It uses the specified chunk types (defaults to NP, PP and VP), and creates a tree rooted at a node labeled S (by default).

Parameters:
  • s (str) – The CoNLL string to be converted.
  • chunk_types (tuple) – The chunk types to be converted.
  • root_label (str) – The node label to use for the root.
Return type:Tree

nltk.chunk.util.conlltags2tree(sentence, chunk_types=('NP', 'PP', 'VP'), root_label='S', strict=False)[source]

Convert the CoNLL IOB format to a tree.

nltk.chunk.util.demo()[source]
nltk.chunk.util.ieerstr2tree(s, chunk_types=['LOCATION', 'ORGANIZATION', 'PERSON', 'DURATION', 'DATE', 'CARDINAL', 'PERCENT', 'MONEY', 'MEASURE'], root_label='S')[source]

Return a chunk structure containing the chunked tagged text that is encoded in the given IEER style string. Convert a string of chunked tagged text in the IEER named entity format into a chunk structure. Chunks are of several types: LOCATION, ORGANIZATION, PERSON, DURATION, DATE, CARDINAL, PERCENT, MONEY, and MEASURE.

Return type:Tree
nltk.chunk.util.tagstr2tree(s, chunk_label='NP', root_label='S', sep='/')[source]

Divide a string of bracketed tagged text into chunks and unchunked tokens, and produce a Tree. Chunks are marked by square brackets ([...]). Words are delimited by whitespace, and each word should have the form text/tag. Words that do not contain a slash are assigned a tag of None.

Parameters:
  • s (str) – The string to be converted
  • chunk_label (str) – The label to use for chunk nodes
  • root_label (str) – The label to use for the root of the tree
Return type:Tree
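A short usage sketch for tagstr2tree, assuming an installed NLTK; the input string is an invented example in the bracketed text/tag format described above.

```python
from nltk.tree import Tree
from nltk.chunk.util import tagstr2tree

# Square brackets mark chunks; every word carries its tag after a slash.
text = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]"
tree = tagstr2tree(text, chunk_label="NP", root_label="S")

# Children of the root are either NP subtrees or plain (word, tag) tuples.
chunk_labels = [child.label() for child in tree if isinstance(child, Tree)]
unchunked = [child for child in tree if not isinstance(child, Tree)]
print(chunk_labels)
print(unchunked)
```

The unchunked tokens remain direct children of the root, which keeps the result a shallow chunk structure.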

nltk.chunk.util.tree2conllstr(t)[source]

Return a multiline string where each line contains a word, tag and IOB tag. Convert a tree to the CoNLL IOB string format.

Parameters:t (Tree) – The tree to be converted.
Return type:str
nltk.chunk.util.tree2conlltags(t)[source]

Return a list of 3-tuples containing (word, tag, IOB-tag). Convert a tree to the CoNLL IOB tag format.

Parameters:t (Tree) – The tree to be converted.
Return type:list(tuple)
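The two tag-based converters can be exercised as a round trip. The sketch assumes an installed NLTK and uses a tiny hand-built chunk structure whose words and tags are invented for the example.

```python
from nltk.tree import Tree
from nltk.chunk.util import tree2conlltags, conlltags2tree

# A chunk structure with one NP chunk and one unchunked token.
tree = Tree("S", [
    Tree("NP", [("the", "DT"), ("dog", "NN")]),
    ("barked", "VBD"),
])

# Flatten to (word, tag, IOB-tag) triples: B- opens a chunk, I- continues
# it, and O marks a token outside any chunk.
iob_tags = tree2conlltags(tree)
print(iob_tags)

# Rebuild the tree from the triples.
rebuilt = conlltags2tree(iob_tags)
```

For a shallow chunk structure like this one, the rebuilt tree is expected to compare equal to the original.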

Module contents

Classes and interfaces for identifying non-overlapping linguistic groups (such as base noun phrases) in unrestricted text. This task is called “chunk parsing” or “chunking”, and the identified groups are called “chunks”. The chunked text is represented using a shallow tree called a “chunk structure.” A chunk structure is a tree containing tokens and chunks, where each chunk is a subtree containing only tokens. For example, the chunk structure for base noun phrase chunks in the sentence “I saw the big dog on the hill” is:

(SENTENCE:
  (NP: <I>)
  <saw>
  (NP: <the> <big> <dog>)
  <on>
  (NP: <the> <hill>))

To convert a chunk structure back to a list of tokens, simply use the chunk structure’s leaves() method.
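As a small sketch of that conversion, the chunk structure shown above can be built by hand with nltk.tree.Tree and flattened again. It assumes an installed NLTK and, for brevity, uses bare words as leaves and 'S' as the root label.

```python
from nltk.tree import Tree

# The example sentence, with its three base noun phrases as subtrees.
chunked = Tree("S", [
    Tree("NP", ["I"]),
    "saw",
    Tree("NP", ["the", "big", "dog"]),
    "on",
    Tree("NP", ["the", "hill"]),
])

# leaves() walks the tree left to right and drops the chunk nodes.
tokens = chunked.leaves()
print(tokens)  # ['I', 'saw', 'the', 'big', 'dog', 'on', 'the', 'hill']
```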

This module defines ChunkParserI, a standard interface for chunking texts; and RegexpChunkParser, a regular-expression based implementation of that interface. It also defines ChunkScore, a utility class for scoring chunk parsers.

RegexpChunkParser

RegexpChunkParser is an implementation of the chunk parser interface that uses regular expressions over tags to chunk a text. Its parse() method first constructs a ChunkString, which encodes a particular chunking of the input text. Initially, nothing is chunked. RegexpChunkParser then applies a sequence of RegexpChunkRule rules to the ChunkString, each of which modifies the chunking that it encodes. Finally, the ChunkString is transformed back into a chunk structure, which is returned.

RegexpChunkParser can only be used to chunk a single kind of phrase. For example, you can use a RegexpChunkParser to chunk the noun phrases in a text, or the verb phrases in a text; but you cannot use it to simultaneously chunk both noun phrases and verb phrases in the same text. (This is a limitation of RegexpChunkParser, not of chunk parsers in general.)

RegexpChunkRules

A RegexpChunkRule is a transformational rule that updates the chunking of a text by modifying its ChunkString. Each RegexpChunkRule defines the apply() method, which modifies the chunking encoded by a ChunkString. The RegexpChunkRule class itself can be used to implement any transformational rule based on regular expressions. There are also a number of subclasses, which can be used to implement simpler types of rules:

  • ChunkRule chunks anything that matches a given regular expression.
  • ChinkRule chinks anything that matches a given regular expression.
  • UnChunkRule will un-chunk any chunk that matches a given regular expression.
  • MergeRule can be used to merge two contiguous chunks.
  • SplitRule can be used to split a single chunk into two smaller chunks.
  • ExpandLeftRule will expand a chunk to incorporate new unchunked material on the left.
  • ExpandRightRule will expand a chunk to incorporate new unchunked material on the right.
Tag Patterns

A RegexpChunkRule uses a modified version of regular expression patterns, called “tag patterns”. Tag patterns are used to match sequences of tags. Examples of tag patterns are:

r'(<DT>|<JJ>|<NN>)+'
r'<NN>+'
r'<NN.*>'

The differences between regular expression patterns and tag patterns are:

  • In tag patterns, '<' and '>' act as parentheses; so '<NN>+' matches one or more repetitions of '<NN>', not '<NN' followed by one or more repetitions of '>'.
  • Whitespace in tag patterns is ignored. So '<DT> | <NN>' is equivalent to '<DT>|<NN>'.
  • In tag patterns, '.' is equivalent to '[^{}<>]'; so '<NN.*>' matches any single tag starting with 'NN'.

The function tag_pattern2re_pattern can be used to transform a tag pattern to an equivalent regular expression pattern.

Efficiency

Preliminary tests indicate that RegexpChunkParser can chunk at a rate of about 300 tokens/second, with a moderately complex rule set.

There may be problems if RegexpChunkParser is used with more than 5,000 tokens at a time. In particular, evaluation of some regular expressions may cause the Python regular expression engine to exceed its maximum recursion depth. We have attempted to minimize these problems, but it is impossible to avoid them completely. We therefore recommend that you apply the chunk parser to a single sentence at a time.

Emacs Tip

If you evaluate the following elisp expression in emacs, it will colorize a ChunkString when you use an interactive python shell with emacs or xemacs (“C-c !”):

(let ()
  (defconst comint-mode-font-lock-keywords
    '(("<[^>]+>" 0 'font-lock-reference-face)
      ("[{}]" 0 'font-lock-function-name-face)))
  (add-hook 'comint-mode-hook (lambda () (turn-on-font-lock))))

You can evaluate this code by copying it to a temporary buffer, placing the cursor after the last close parenthesis, and typing “C-x C-e”. You should evaluate it before running the interactive session. The change will last until you close emacs.

Unresolved Issues

If we use the re module for regular expressions, Python’s regular expression engine generates “maximum recursion depth exceeded” errors when processing very large texts, even for regular expressions that should not require any recursion. We therefore use the pre module instead. But note that pre does not include Unicode support, so this module will not work with unicode strings. Note also that pre regular expressions are not quite as advanced as re ones (e.g., no leftward zero-length assertions).

Variables:
  • CHUNK_TAG_PATTERN (regexp) – A regular expression to test whether a tag pattern is valid.
nltk.chunk.batch_ne_chunk(tagged_sentences, binary=False)[source]

Use NLTK’s currently recommended named entity chunker to chunk the given list of tagged sentences, each consisting of a list of tagged tokens.

nltk.chunk.ne_chunk(tagged_tokens, binary=False)[source]

Use NLTK’s currently recommended named entity chunker to chunk the given list of tagged tokens.