nltk.parse package

Submodules

nltk.parse.api module

class nltk.parse.api.ParserI[source]

Bases: builtins.object

A processing class for deriving trees that represent possible structures for a sequence of tokens. These tree structures are known as “parses”. Typically, parsers are used to derive syntax trees for sentences. But parsers can also be used to derive other kinds of tree structure, such as morphological trees and discourse structures.

Subclasses must define:
  • at least one of: parse(), nbest_parse(), iter_parse(), batch_parse(), batch_nbest_parse(), batch_iter_parse().
Subclasses may define:
  • grammar()
  • either prob_parse() or batch_prob_parse() (or both)
batch_iter_parse(sents)[source]

Apply self.iter_parse() to each element of sents. I.e.:

return [self.iter_parse(sent) for sent in sents]
Return type:list(iter(Tree))
batch_nbest_parse(sents, n=None)[source]

Apply self.nbest_parse() to each element of sents. I.e.:

return [self.nbest_parse(sent, n) for sent in sents]
Return type:list(list(Tree))
batch_parse(sents)[source]

Apply self.parse() to each element of sents. I.e.:

return [self.parse(sent) for sent in sents]
Return type:list(Tree)
batch_prob_parse(sents)[source]

Apply self.prob_parse() to each element of sents. I.e.:

return [self.prob_parse(sent) for sent in sents]
Return type:list(ProbDistI(Tree))
grammar()[source]
Returns:The grammar used by this parser.
iter_parse(sent)[source]
Returns:An iterator that generates parse trees that represent

possible structures for the given sentence. When possible, this list is sorted from most likely to least likely.

Parameters:sent (list(str)) – The sentence to be parsed
Return type:iter(Tree)
nbest_parse(sent, n=None)[source]
Returns:A list of parse trees that represent possible

structures for the given sentence. When possible, this list is sorted from most likely to least likely. If n is specified, then the returned list will contain at most n parse trees.

Parameters:
  • sent (list(str)) – The sentence to be parsed
  • n (int) – The maximum number of trees to return.
Return type:

list(Tree)

parse(sent)[source]
Returns:A parse tree that represents the structure of the

given sentence, or None if no parse tree is found. If multiple parses are found, then return the best parse.

Parameters:sent (list(str)) – The sentence to be parsed
Return type:Tree
prob_parse(sent)[source]
Returns:A probability distribution over the possible parse

trees for the given sentence. If there are no possible parse trees for the given sentence, return a probability distribution that assigns a probability of 1.0 to None.

Parameters:sent (list(str)) – The sentence to be parsed
Return type:ProbDistI(Tree)

nltk.parse.chart module

Data classes and parser implementations for “chart parsers”, which use dynamic programming to efficiently parse a text. A chart parser derives parse trees for a text by iteratively adding “edges” to a “chart.” Each edge represents a hypothesis about the tree structure for a subsequence of the text. The chart is a “blackboard” for composing and combining these hypotheses.

When a chart parser begins parsing a text, it creates a new (empty) chart, spanning the text. It then incrementally adds new edges to the chart. A set of “chart rules” specifies the conditions under which new edges should be added to the chart. Once the chart reaches a stage where none of the chart rules adds any new edges, parsing is complete.

Charts are encoded with the Chart class, and edges are encoded with the TreeEdge and LeafEdge classes. The chart parser module defines three chart parsers:

  • ChartParser is a simple and flexible chart parser. Given a set of chart rules, it will apply those rules to the chart until no more edges are added.
  • SteppingChartParser is a subclass of ChartParser that can be used to step through the parsing process.
class nltk.parse.chart.AbstractChartRule[source]

Bases: nltk.parse.chart.ChartRuleI

An abstract base class for chart rules. AbstractChartRule provides:

  • A default implementation for apply, based on apply_iter.
  • A default implementation for apply_everywhere_iter, based on apply_iter.
  • A default implementation for apply_everywhere, based on apply_everywhere_iter. Currently, this implementation assumes that ``NUM_EDGES``<=3.
  • A default implementation for __str__, which returns a name basd on the rule’s class name.
apply(chart, grammar, *edges)[source]
apply_everywhere(chart, grammar)[source]
apply_everywhere_iter(chart, grammar)[source]
apply_iter(chart, grammar, *edges)[source]
unicode_repr None

x.__repr__() <==> repr(x)

class nltk.parse.chart.BottomUpChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

A ChartParser using a bottom-up parsing strategy. See ChartParser for more information.

class nltk.parse.chart.BottomUpLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

A ChartParser using a bottom-up left-corner parsing strategy. This strategy is often more efficient than standard bottom-up. See ChartParser for more information.

class nltk.parse.chart.BottomUpPredictCombineRule[source]

Bases: nltk.parse.chart.BottomUpPredictRule

A rule licensing any edge corresponding to a production whose right-hand side begins with a complete edge’s left-hand side. In particular, this rule specifies that [A -> alpha \*] licenses the edge [B -> A \* beta] for each grammar production B -> A beta.

Note:This is like BottomUpPredictRule, but it also applies the FundamentalRule to the resulting edge.
NUM_EDGES = 1
apply_iter(chart, grammar, edge)[source]
class nltk.parse.chart.BottomUpPredictRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule licensing any edge corresponding to a production whose right-hand side begins with a complete edge’s left-hand side. In particular, this rule specifies that [A -> alpha \*] licenses the edge [B -> \* A beta] for each grammar production B -> A beta.

NUM_EDGES = 1
apply_iter(chart, grammar, edge)[source]
class nltk.parse.chart.CachedTopDownPredictRule[source]

Bases: nltk.parse.chart.TopDownPredictRule

A cached version of TopDownPredictRule. After the first time this rule is applied to an edge with a given end and next, it will not generate any more edges for edges with that end and next.

If chart or grammar are changed, then the cache is flushed.

apply_iter(chart, grammar, edge)[source]
class nltk.parse.chart.Chart(tokens)[source]

Bases: builtins.object

A blackboard for hypotheses about the syntactic constituents of a sentence. A chart contains a set of edges, and each edge encodes a single hypothesis about the structure of some portion of the sentence.

The select method can be used to select a specific collection of edges. For example chart.select(is_complete=True, start=0) yields all complete edges whose start indices are 0. To ensure the efficiency of these selection operations, Chart dynamically creates and maintains an index for each set of attributes that have been selected on.

In order to reconstruct the trees that are represented by an edge, the chart associates each edge with a set of child pointer lists. A child pointer list is a list of the edges that license an edge’s right-hand side.

Variables:
  • _tokens – The sentence that the chart covers.
  • _num_leaves – The number of tokens.
  • _edges – A list of the edges in the chart
  • _edge_to_cpls – A dictionary mapping each edge to a set of child pointer lists that are associated with that edge.
  • _indexes – A dictionary mapping tuples of edge attributes to indices, where each index maps the corresponding edge attribute values to lists of edges.
child_pointer_lists(edge)[source]

Return the set of child pointer lists for the given edge. Each child pointer list is a list of edges that have been used to form this edge.

Return type:list(list(EdgeI))
dot_digraph()[source]
edges()[source]

Return a list of all edges in this chart. New edges that are added to the chart after the call to edges() will not be contained in this list.

Return type:list(EdgeI)
See:iteredges, select
initialize()[source]

Clear the chart.

insert(edge, *child_pointer_lists)[source]

Add a new edge to the chart, and return True if this operation modified the chart. In particular, return true iff the chart did not already contain edge, or if it did not already associate child_pointer_lists with edge.

Parameters:
  • edge (EdgeI) – The new edge
  • child_pointer_lists (sequence of tuple(EdgeI)) – A sequence of lists of the edges that were used to form this edge. This list is used to reconstruct the trees (or partial trees) that are associated with edge.
Return type:

bool

insert_with_backpointer(new_edge, previous_edge, child_edge)[source]

Add a new edge to the chart, using a pointer to the previous edge.

iteredges()[source]

Return an iterator over the edges in this chart. It is not guaranteed that new edges which are added to the chart before the iterator is exhausted will also be generated.

Return type:iter(EdgeI)
See:edges, select
leaf(index)[source]

Return the leaf value of the word at the given index.

Return type:str
leaves()[source]

Return a list of the leaf values of each word in the chart’s sentence.

Return type:list(str)
num_edges()[source]

Return the number of edges contained in this chart.

Return type:int
num_leaves()[source]

Return the number of words in this chart’s sentence.

Return type:int
parses(root, tree_class=<class 'nltk.tree.Tree'>)[source]

Return a list of the complete tree structures that span the entire chart, and whose root node is root.

pp(width=None)[source]

Return a pretty-printed string representation of this chart.

Parameters:width – The number of characters allotted to each index in the sentence.
Return type:str
pp_edge(edge, width=None)[source]

Return a pretty-printed string representation of a given edge in this chart.

Return type:str
Parameters:width – The number of characters allotted to each index in the sentence.
pp_leaves(width=None)[source]

Return a pretty-printed string representation of this chart’s leaves. This string can be used as a header for calls to pp_edge.

select(**restrictions)[source]

Return an iterator over the edges in this chart. Any new edges that are added to the chart before the iterator is exahusted will also be generated. restrictions can be used to restrict the set of edges that will be generated.

Parameters:
  • span – Only generate edges e where e.span()==span
  • start – Only generate edges e where e.start()==start
  • end – Only generate edges e where e.end()==end
  • length – Only generate edges e where e.length()==length
  • lhs – Only generate edges e where e.lhs()==lhs
  • rhs – Only generate edges e where e.rhs()==rhs
  • nextsym – Only generate edges e where e.nextsym()==nextsym
  • dot – Only generate edges e where e.dot()==dot
  • is_complete – Only generate edges e where e.is_complete()==is_complete
  • is_incomplete – Only generate edges e where e.is_incomplete()==is_incomplete
Return type:

iter(EdgeI)

trees(edge, tree_class=<class 'nltk.tree.Tree'>, complete=False)[source]

Return a list of the tree structures that are associated with edge.

If edge is incomplete, then the unexpanded children will be encoded as childless subtrees, whose node value is the corresponding terminal or nonterminal.

Return type:list(Tree)
Note:If two trees share a common subtree, then the same Tree may be used to encode that subtree in both trees. If you need to eliminate this subtree sharing, then create a deep copy of each tree.
class nltk.parse.chart.ChartParser(grammar, strategy=[<nltk.parse.chart.LeafInitRule object at 0x111ea08d0>, <nltk.parse.chart.EmptyPredictRule object at 0x111ea0910>, <nltk.parse.chart.BottomUpPredictCombineRule object at 0x111ea0950>, <nltk.parse.chart.SingleEdgeFundamentalRule object at 0x111ea0990>], trace=0, trace_chart_width=50, use_agenda=True, chart_class=<class 'nltk.parse.chart.Chart'>)[source]

Bases: nltk.parse.api.ParserI

A generic chart parser. A “strategy”, or list of ChartRuleI instances, is used to decide what edges to add to the chart. In particular, ChartParser uses the following algorithm to parse texts:

Until no new edges are added:
For each rule in strategy:
Apply rule to any applicable edges in the chart.
Return any complete parses in the chart
chart_parse(tokens, trace=None)[source]

Return the final parse Chart from which all possible parse trees can be extracted.

Parameters:tokens (list(str)) – The sentence to be parsed
Return type:Chart
grammar()[source]
nbest_parse(tokens, n=None, tree_class=<class 'nltk.tree.Tree'>)[source]
class nltk.parse.chart.ChartRuleI[source]

Bases: builtins.object

A rule that specifies what new edges are licensed by any given set of existing edges. Each chart rule expects a fixed number of edges, as indicated by the class variable NUM_EDGES. In particular:

  • A chart rule with NUM_EDGES=0 specifies what new edges are licensed, regardless of existing edges.
  • A chart rule with NUM_EDGES=1 specifies what new edges are licensed by a single existing edge.
  • A chart rule with NUM_EDGES=2 specifies what new edges are licensed by a pair of existing edges.
Variables:NUM_EDGES – The number of existing edges that this rule uses to license new edges. Typically, this number ranges from zero to two.
apply(chart, grammar, *edges)[source]

Add the edges licensed by this rule and the given edges to the chart. Return a list of the edges that were added.

Parameters:edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply is specified by the NUM_EDGES class variable.
Return type:list(EdgeI)
apply_everywhere(chart, grammar)[source]

Add all the edges licensed by this rule and the edges in the chart to the chart. Return a list of the edges that were added.

Return type:list(EdgeI)
apply_everywhere_iter(chart, grammar)[source]

Return a generator that will add all edges licensed by this rule, given the edges that are currently in the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.

Return type:iter(EdgeI)
apply_iter(chart, grammar, *edges)[source]

Return a generator that will add edges licensed by this rule and the given edges to the chart, one at a time. Each time the generator is resumed, it will either add a new edge and yield that edge; or return.

Parameters:edges (list(EdgeI)) – A set of existing edges. The number of edges that should be passed to apply() is specified by the NUM_EDGES class variable.
Return type:iter(EdgeI)
class nltk.parse.chart.EdgeI[source]

Bases: builtins.object

A hypothesis about the structure of part of a sentence. Each edge records the fact that a structure is (partially) consistent with the sentence. An edge contains:

  • A span, indicating what part of the sentence is consistent with the hypothesized structure.
  • A left-hand side, specifying what kind of structure is hypothesized.
  • A right-hand side, specifying the contents of the hypothesized structure.
  • A dot position, indicating how much of the hypothesized structure is consistent with the sentence.

Every edge is either complete or incomplete:

  • An edge is complete if its structure is fully consistent with the sentence.
  • An edge is incomplete if its structure is partially consistent with the sentence. For every incomplete edge, the span specifies a possible prefix for the edge’s structure.

There are two kinds of edge:

  • A TreeEdge records which trees have been found to be (partially) consistent with the text.
  • A LeafEdge records the tokens occurring in the text.

The EdgeI interface provides a common interface to both types of edge, allowing chart parsers to treat them in a uniform manner.

dot()[source]

Return this edge’s dot position, which indicates how much of the hypothesized structure is consistent with the sentence. In particular, self.rhs[:dot] is consistent with tokens[self.start():self.end()].

Return type:int
end()[source]

Return the end index of this edge’s span.

Return type:int
is_complete()[source]

Return True if this edge’s structure is fully consistent with the text.

Return type:bool
is_incomplete()[source]

Return True if this edge’s structure is partially consistent with the text.

Return type:bool
length()[source]

Return the length of this edge’s span.

Return type:int
lhs()[source]

Return this edge’s left-hand side, which specifies what kind of structure is hypothesized by this edge.

See:TreeEdge and LeafEdge for a description of the left-hand side values for each edge type.
nextsym()[source]

Return the element of this edge’s right-hand side that immediately follows its dot.

Return type:Nonterminal or terminal or None
rhs()[source]

Return this edge’s right-hand side, which specifies the content of the structure hypothesized by this edge.

See:TreeEdge and LeafEdge for a description of the right-hand side values for each edge type.
span()[source]

Return a tuple (s, e), where tokens[s:e] is the portion of the sentence that is consistent with this edge’s structure.

Return type:tuple(int, int)
start()[source]

Return the start index of this edge’s span.

Return type:int
class nltk.parse.chart.EmptyPredictRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule that inserts all empty productions as passive edges, in every position in the chart.

NUM_EDGES = 0
apply_iter(chart, grammar)[source]
class nltk.parse.chart.FilteredBottomUpPredictCombineRule[source]

Bases: nltk.parse.chart.BottomUpPredictCombineRule

apply_iter(chart, grammar, edge)[source]
class nltk.parse.chart.FilteredSingleEdgeFundamentalRule[source]

Bases: nltk.parse.chart.SingleEdgeFundamentalRule

class nltk.parse.chart.FundamentalRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule that joins two adjacent edges to form a single combined edge. In particular, this rule specifies that any pair of edges

  • [A -> alpha \* B beta][i:j]
  • [B -> gamma \*][j:k]

licenses the edge:

  • [A -> alpha B * beta][i:j]
NUM_EDGES = 2
apply_iter(chart, grammar, left_edge, right_edge)[source]
class nltk.parse.chart.LeafEdge(leaf, index)[source]

Bases: nltk.parse.chart.EdgeI

An edge that records the fact that a leaf value is consistent with a word in the sentence. A leaf edge consists of:

  • An index, indicating the position of the word.
  • A leaf, specifying the word’s content.

A leaf edge’s left-hand side is its leaf value, and its right hand side is (). Its span is [index, index+1], and its dot position is 0.

dot()[source]
end()[source]
is_complete()[source]
is_incomplete()[source]
length()[source]
lhs()[source]
nextsym()[source]
rhs()[source]
span()[source]
start()[source]
unicode_repr()
class nltk.parse.chart.LeafInitRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 0
apply_iter(chart, grammar)[source]
class nltk.parse.chart.LeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

class nltk.parse.chart.SingleEdgeFundamentalRule[source]

Bases: nltk.parse.chart.FundamentalRule

A rule that joins a given edge with adjacent edges in the chart, to form combined edges. In particular, this rule specifies that either of the edges:

  • [A -> alpha \* B beta][i:j]
  • [B -> gamma \*][j:k]

licenses the edge:

  • [A -> alpha B * beta][i:j]

if the other edge is already in the chart.

Note:This is basically FundamentalRule, with one edge left unspecified.
NUM_EDGES = 1
apply_iter(chart, grammar, edge)[source]
class nltk.parse.chart.SteppingChartParser(grammar, strategy=[], trace=0)[source]

Bases: nltk.parse.chart.ChartParser

A ChartParser that allows you to step through the parsing process, adding a single edge at a time. It also allows you to change the parser’s strategy or grammar midway through parsing a text.

The initialize method is used to start parsing a text. step adds a single edge to the chart. set_strategy changes the strategy used by the chart parser. parses returns the set of parses that has been found by the chart parser.

Variables:_restart – Records whether the parser’s strategy, grammar, or chart has been changed. If so, then step must restart the parsing algorithm.
chart()[source]

Return the chart that is used by this parser.

current_chartrule()[source]

Return the chart rule used to generate the most recent edge.

grammar()[source]

Return the grammar used by this parser.

initialize(tokens)[source]

Begin parsing the given tokens.

nbest_parse(tokens, n=None, tree_class=<class 'nltk.tree.Tree'>)[source]
parses(tree_class=<class 'nltk.tree.Tree'>)[source]

Return the parse trees currently contained in the chart.

set_chart(chart)[source]

Load a given chart into the chart parser.

set_grammar(grammar)[source]

Change the grammar used by the parser.

set_strategy(strategy)[source]

Change the strategy that the parser uses to decide which edges to add to the chart.

Parameters:strategy (list(ChartRuleI)) – A list of rules that should be used to decide what edges to add to the chart.
step()[source]

Return a generator that adds edges to the chart, one at a time. Each time the generator is resumed, it adds a single edge and yields that edge. If no more edges can be added, then it yields None.

If the parser’s strategy, grammar, or chart is changed, then the generator will continue adding edges using the new strategy, grammar, or chart.

Note that this generator never terminates, since the grammar or strategy might be changed to values that would add new edges. Instead, it yields None when no more edges can be added with the current strategy and grammar.

strategy()[source]

Return the strategy used by this parser.

class nltk.parse.chart.TopDownChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

A ChartParser using a top-down parsing strategy. See ChartParser for more information.

class nltk.parse.chart.TopDownInitRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule licensing edges corresponding to the grammar productions for the grammar’s start symbol. In particular, this rule specifies that [S -> \* alpha][0:i] is licensed for each grammar production S -> alpha, where S is the grammar’s start symbol.

NUM_EDGES = 0
apply_iter(chart, grammar)[source]
class nltk.parse.chart.TopDownPredictRule[source]

Bases: nltk.parse.chart.AbstractChartRule

A rule licensing edges corresponding to the grammar productions for the nonterminal following an incomplete edge’s dot. In particular, this rule specifies that [A -> alpha \* B beta][i:j] licenses the edge [B -> \* gamma][j:j] for each grammar production B -> gamma.

Note:This rule corresponds to the Predictor Rule in Earley parsing.
NUM_EDGES = 1
apply_iter(chart, grammar, edge)[source]
class nltk.parse.chart.TreeEdge(span, lhs, rhs, dot=0)[source]

Bases: nltk.parse.chart.EdgeI

An edge that records the fact that a tree is (partially) consistent with the sentence. A tree edge consists of:

  • A span, indicating what part of the sentence is consistent with the hypothesized tree.
  • A left-hand side, specifying the hypothesized tree’s node value.
  • A right-hand side, specifying the hypothesized tree’s children. Each element of the right-hand side is either a terminal, specifying a token with that terminal as its leaf value; or a nonterminal, specifying a subtree with that nonterminal’s symbol as its node value.
  • A dot position, indicating which children are consistent with part of the sentence. In particular, if dot is the dot position, rhs is the right-hand size, (start,end) is the span, and sentence is the list of tokens in the sentence, then tokens[start:end] can be spanned by the children specified by rhs[:dot].

For more information about edges, see the EdgeI interface.

dot()[source]
end()[source]
static from_production(production, index)[source]

Return a new TreeEdge formed from the given production. The new edge’s left-hand side and right-hand side will be taken from production; its span will be (index,index); and its dot position will be 0.

Return type:TreeEdge
is_complete()[source]
is_incomplete()[source]
length()[source]
lhs()[source]
move_dot_forward(new_end)[source]

Return a new TreeEdge formed from this edge. The new edge’s dot position is increased by 1, and its end index will be replaced by new_end.

Parameters:new_end (int) – The new end index.
Return type:TreeEdge
nextsym()[source]
rhs()[source]
span()[source]
start()[source]
unicode_repr()
nltk.parse.chart.demo(choice=None, should_print_times=True, should_print_grammar=False, should_print_trees=True, trace=2, sent='I saw John with a dog with my cookie', numparses=5)[source]

A demonstration of the chart parsers.

nltk.parse.chart.demo_grammar()[source]

nltk.parse.dependencygraph module

Tools for reading and writing dependency trees. The input is assumed to be in Malt-TAB format (http://stp.lingfil.uu.se/~nivre/research/MaltXML.html).

class nltk.parse.dependencygraph.DependencyGraph(tree_str=None)[source]

Bases: builtins.object

A container for the nodes and labelled edges of a dependency structure.

add_arc(head_address, mod_address)[source]

Adds an arc from the node specified by head_address to the node specified by the mod address.

add_node(node)[source]
connect_graph()[source]

Fully connects all non-root nodes. All nodes are set to be dependents of the root node.

contains_address(node_address)[source]

Returns true if the graph contains a node with the given node address, false otherwise.

contains_cycle()[source]
get_by_address(node_address)[source]

Returns the node with the given address.

get_cycle_path(curr_node, goal_node_index)[source]
left_children(node_index)[source]

Returns the number of left children under the node specified by the given address.

static load(file)[source]
Parameters:file – a file in Malt-TAB format
Returns:a list of DependencyGraphs
redirect_arcs(originals, redirect)[source]

Redirects arcs to any of the nodes in the originals list to the redirect node address.

remove_by_address(address)[source]

Removes the node with the given address. References to this node in others will still exist.

right_children(node_index)[source]

Returns the number of right children under the node specified by the given address.

to_conll(style)[source]

The dependency graph in CoNLL format.

Parameters:style (int) – the style to use for the format (3, 4, 10 columns)
Return type:str
tree()[source]

Starting with the root node, build a dependency tree using the NLTK Tree constructor. Dependency labels are omitted.

unicode_repr()
nltk.parse.dependencygraph.conll_demo()[source]

A demonstration of how to read a string representation of a CoNLL format dependency tree.

nltk.parse.dependencygraph.conll_file_demo()[source]
nltk.parse.dependencygraph.cycle_finding_demo()[source]
nltk.parse.dependencygraph.demo()[source]
nltk.parse.dependencygraph.malt_demo(nx=False)[source]

A demonstration of the result of reading a dependency version of the first sentence of the Penn Treebank.

nltk.parse.dependencygraph.nx_graph(self)[source]

Convert the data in a nodelist into a networkx labeled directed graph. :rtype: XDigraph

nltk.parse.earleychart module

Data classes and parser implementations for incremental chart parsers, which use dynamic programming to efficiently parse a text. A “chart parser” derives parse trees for a text by iteratively adding “edges” to a “chart”. Each “edge” represents a hypothesis about the tree structure for a subsequence of the text. The “chart” is a “blackboard” for composing and combining these hypotheses.

A parser is “incremental”, if it guarantees that for all i, j where i < j, all edges ending at i are built before any edges ending at j. This is appealing for, say, speech recognizer hypothesis filtering.

The main parser class is EarleyChartParser, which is a top-down algorithm, originally formulated by Jay Earley (1970).

class nltk.parse.earleychart.CompleteFundamentalRule[source]

Bases: nltk.parse.chart.SingleEdgeFundamentalRule

class nltk.parse.earleychart.CompleterRule[source]

Bases: nltk.parse.earleychart.CompleteFundamentalRule

apply_iter(chart, grammar, edge)[source]
class nltk.parse.earleychart.EarleyChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.FeatureCompleteFundamentalRule[source]

Bases: nltk.parse.featurechart.FeatureSingleEdgeFundamentalRule

class nltk.parse.earleychart.FeatureCompleterRule[source]

Bases: nltk.parse.earleychart.CompleterRule

class nltk.parse.earleychart.FeatureEarleyChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.FeatureIncrementalChartParser

class nltk.parse.earleychart.FeatureIncrementalBottomUpChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.FeatureIncrementalChartParser

class nltk.parse.earleychart.FeatureIncrementalBottomUpLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.FeatureIncrementalChartParser

class nltk.parse.earleychart.FeatureIncrementalChart(tokens)[source]

Bases: nltk.parse.earleychart.IncrementalChart, nltk.parse.featurechart.FeatureChart

select(end, **restrictions)[source]
class nltk.parse.earleychart.FeatureIncrementalChartParser(grammar, strategy=[<nltk.parse.chart.LeafInitRule object at 0x111fca790>, <nltk.parse.featurechart.FeatureEmptyPredictRule object at 0x111fca7d0>, <nltk.parse.featurechart.FeatureBottomUpPredictCombineRule object at 0x111fca810>, <nltk.parse.earleychart.FeatureCompleteFundamentalRule object at 0x111fca850>], trace_chart_width=20, chart_class=<class 'nltk.parse.earleychart.FeatureIncrementalChart'>, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser, nltk.parse.featurechart.FeatureChartParser

class nltk.parse.earleychart.FeatureIncrementalTopDownChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.FeatureIncrementalChartParser

class nltk.parse.earleychart.FeaturePredictorRule[source]

Bases: nltk.parse.featurechart.FeatureTopDownPredictRule

class nltk.parse.earleychart.FeatureScannerRule[source]

Bases: nltk.parse.earleychart.ScannerRule

class nltk.parse.earleychart.FilteredCompleteFundamentalRule[source]

Bases: nltk.parse.chart.FilteredSingleEdgeFundamentalRule

apply_iter(chart, grammar, edge)[source]
class nltk.parse.earleychart.IncrementalBottomUpChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.IncrementalBottomUpLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.IncrementalChart(tokens)[source]

Bases: nltk.parse.chart.Chart

edges()[source]
initialize()[source]
iteredges()[source]
select(end, **restrictions)[source]
class nltk.parse.earleychart.IncrementalChartParser(grammar, strategy=[<nltk.parse.chart.LeafInitRule object at 0x111fca050>, <nltk.parse.chart.EmptyPredictRule object at 0x111fca090>, <nltk.parse.chart.BottomUpPredictCombineRule object at 0x111fca0d0>, <nltk.parse.earleychart.CompleteFundamentalRule object at 0x111fca110>], trace=0, trace_chart_width=50, chart_class=<class 'nltk.parse.earleychart.IncrementalChart'>)[source]

Bases: nltk.parse.chart.ChartParser

An incremental chart parser implementing Jay Earley’s parsing algorithm:

For each index end in [0, 1, ..., N]:
For each edge such that edge.end = end:
If edge is incomplete and edge.next is not a part of speech:
Apply PredictorRule to edge
If edge is incomplete and edge.next is a part of speech:
Apply ScannerRule to edge
If edge is complete:
Apply CompleterRule to edge
Return any complete parses in the chart
chart_parse(tokens, trace=None)[source]
class nltk.parse.earleychart.IncrementalLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.IncrementalTopDownChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.earleychart.IncrementalChartParser

class nltk.parse.earleychart.PredictorRule[source]

Bases: nltk.parse.chart.CachedTopDownPredictRule

class nltk.parse.earleychart.ScannerRule[source]

Bases: nltk.parse.earleychart.CompleteFundamentalRule

apply_iter(chart, grammar, edge)[source]
nltk.parse.earleychart.demo(should_print_times=True, should_print_grammar=False, should_print_trees=True, trace=2, sent='I saw John with a dog with my cookie', numparses=5)[source]

A demonstration of the Earley parsers.

nltk.parse.featurechart module

Extension of chart parsing implementation to handle grammars with feature structures as nodes.

class nltk.parse.featurechart.FeatureBottomUpChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.featurechart.FeatureChartParser

class nltk.parse.featurechart.FeatureBottomUpLeftCornerChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.featurechart.FeatureChartParser

class nltk.parse.featurechart.FeatureBottomUpPredictCombineRule[source]

Bases: nltk.parse.chart.BottomUpPredictCombineRule

apply_iter(chart, grammar, edge)[source]
class nltk.parse.featurechart.FeatureBottomUpPredictRule[source]

Bases: nltk.parse.chart.BottomUpPredictRule

apply_iter(chart, grammar, edge)[source]
class nltk.parse.featurechart.FeatureChart(tokens)[source]

Bases: nltk.parse.chart.Chart

A Chart for feature grammars. :see: Chart for more information.

parses(start, tree_class=<class 'nltk.tree.Tree'>)[source]
select(**restrictions)[source]

Returns an iterator over the edges in this chart. See Chart.select for more information about the restrictions on the edges.

class nltk.parse.featurechart.FeatureChartParser(grammar, strategy=[<nltk.parse.chart.LeafInitRule object at 0x111ea85d0>, <nltk.parse.featurechart.FeatureEmptyPredictRule object at 0x111ea8610>, <nltk.parse.featurechart.FeatureBottomUpPredictCombineRule object at 0x111ea8650>, <nltk.parse.featurechart.FeatureSingleEdgeFundamentalRule object at 0x111ea8690>], trace_chart_width=20, chart_class=<class 'nltk.parse.featurechart.FeatureChart'>, **parser_args)[source]

Bases: nltk.parse.chart.ChartParser

class nltk.parse.featurechart.FeatureEmptyPredictRule[source]

Bases: nltk.parse.chart.EmptyPredictRule

apply_iter(chart, grammar)[source]
class nltk.parse.featurechart.FeatureFundamentalRule[source]

Bases: nltk.parse.chart.FundamentalRule

A specialized version of the fundamental rule that operates on nonterminals whose symbols are FeatStructNonterminal``s.  Rather tha simply comparing the nonterminals for equality, they are unified.  Variable bindings from these unifications are collected and stored in the chart using a ``FeatureTreeEdge. When a complete edge is generated, these bindings are applied to all nonterminals in the edge.

The fundamental rule states that:

  • [A -> alpha \* B1 beta][i:j]
  • [B2 -> gamma \*][j:k]

licenses the edge:

  • [A -> alpha B3 \* beta][i:j]

assuming that B1 and B2 can be unified to generate B3.

apply_iter(chart, grammar, left_edge, right_edge)[source]
class nltk.parse.featurechart.FeatureSingleEdgeFundamentalRule[source]

Bases: nltk.parse.chart.SingleEdgeFundamentalRule

A specialized version of the completer / single edge fundamental rule that operates on nonterminals whose symbols are ``FeatStructNonterminal``s. Rather than simply comparing the nonterminals for equality, they are unified.

class nltk.parse.featurechart.FeatureTopDownChartParser(grammar, **parser_args)[source]

Bases: nltk.parse.featurechart.FeatureChartParser

class nltk.parse.featurechart.FeatureTopDownInitRule[source]

Bases: nltk.parse.chart.TopDownInitRule

apply_iter(chart, grammar)[source]
class nltk.parse.featurechart.FeatureTopDownPredictRule[source]

Bases: nltk.parse.chart.CachedTopDownPredictRule

A specialized version of the (cached) top down predict rule that operates on nonterminals whose symbols are ``FeatStructNonterminal``s. Rather than simply comparing the nonterminals for equality, they are unified.

The top down expand rule states that:

  • [A -> alpha \* B1 beta][i:j]

licenses the edge:

  • [B2 -> \* gamma][j:j]

for each grammar production B2 -> gamma, assuming that B1 and B2 can be unified.

apply_iter(chart, grammar, edge)[source]
class nltk.parse.featurechart.FeatureTreeEdge(span, lhs, rhs, dot=0, bindings=None)[source]

Bases: nltk.parse.chart.TreeEdge

A specialized tree edge that allows shared variable bindings between nonterminals on the left-hand side and right-hand side.

Each FeatureTreeEdge contains a set of bindings, i.e., a dictionary mapping from variables to values. If the edge is not complete, then these bindings are simply stored. However, if the edge is complete, then the constructor applies these bindings to every nonterminal in the edge whose symbol implements the interface SubstituteBindingsI.

bindings()[source]

Return a copy of this edge’s bindings dictionary.

static from_production(production, index)[source]
Returns:A new TreeEdge formed from the given production. The new edge’s left-hand side and right-hand side will be taken from production; its span will be (index,index); and its dot position will be 0.
Return type:TreeEdge
move_dot_forward(new_end, bindings=None)[source]
Returns:

A new FeatureTreeEdge formed from this edge. The new edge’s dot position is increased by 1, and its end index will be replaced by new_end.

Return type:

FeatureTreeEdge

Parameters:
  • new_end (int) – The new end index.
  • bindings (dict) – Bindings for the new edge.
next_with_bindings()[source]
unicode_repr()
variables()[source]
Returns:The set of variables used by this edge.
Return type:set(Variable)
class nltk.parse.featurechart.InstantiateVarsChart(tokens)[source]

Bases: nltk.parse.featurechart.FeatureChart

A specialized chart that ‘instantiates’ variables whose names start with ‘@’, by replacing them with unique new variables. In particular, whenever a complete edge is added to the chart, any variables in the edge’s lhs whose names start with ‘@’ will be replaced by unique new ``Variable``s.

initialize()[source]
insert(edge, child_pointer_list)[source]
inst_vars(edge)[source]
instantiate_edge(edge)[source]

If the edge is a FeatureTreeEdge, and it is complete, then instantiate all variables whose names start with ‘@’, by replacing them with unique new variables.

Note that instantiation is done in-place, since the parsing algorithms might already hold a reference to the edge for future use.

nltk.parse.featurechart.demo(should_print_times=True, should_print_grammar=True, should_print_trees=True, should_print_sentence=True, trace=1, parser=<class 'nltk.parse.featurechart.FeatureChartParser'>, sent='I saw John with a dog with my cookie')[source]
nltk.parse.featurechart.demo_grammar()[source]
nltk.parse.featurechart.run_profile()[source]

nltk.parse.generate module

nltk.parse.generate.demo(N=23)[source]
nltk.parse.generate.generate(grammar, start=None, depth=None, n=None)[source]

Generates an iterator of all sentences from a CFG.

Parameters:
  • grammar – The Grammar used to generate sentences.
  • start – The Nonterminal from which to start generate sentences.
  • depth – The maximal depth of the generated tree.
  • n – The maximum number of sentences to return.
Returns:

An iterator of lists of terminal tokens.

nltk.parse.malt module

class nltk.parse.malt.MaltParser(tagger=None, mco=None, working_dir=None, additional_java_args=None)[source]

Bases: nltk.parse.api.ParserI

batch_parse(sentences, verbose=False)[source]

Use MaltParser to parse multiple sentence. Takes multiple sentences as a list where each sentence is a list of words. Each sentence will be automatically tagged with this MaltParser instance’s tagger.

Parameters:sentences – Input sentences to parse
Returns:list(DependencyGraph) the dependency graph representation of each sentence
config_malt(bin=None, verbose=False)[source]

Configure NLTK’s interface to the malt package. This searches for a directory containing the malt jar

Parameters:bin (str) – The full path to the malt binary. If not specified, then nltk will search the system for a malt binary; and if one is not found, it will raise a LookupError exception.
parse(sentence, verbose=False)[source]

Use MaltParser to parse a sentence. Takes a sentence as a list of words; it will be automatically tagged with this MaltParser instance’s tagger.

Parameters:sentence (list(str)) – Input sentence to parse
Returns:DependencyGraph the dependency graph representation of the sentence
raw_parse(sentence, verbose=False)[source]

Use MaltParser to parse a sentence. Takes a sentence as a string; before parsing, it will be automatically tokenized and tagged with this MaltParser instance’s tagger.

Parameters:sentence (str) – Input sentence to parse
Returns:DependencyGraph the dependency graph representation of the sentence
tagged_batch_parse(sentences, verbose=False)[source]

Use MaltParser to parse multiple sentences. Takes multiple sentences where each sentence is a list of (word, tag) tuples. The sentences must have already been tokenized and tagged.

Parameters:sentences – Input sentences to parse
Returns:list(DependencyGraph) the dependency graph representation of each sentence
tagged_parse(sentence, verbose=False)[source]

Use MaltParser to parse a sentence. Takes a sentence as a list of (word, tag) tuples; the sentence must have already been tokenized and tagged.

Parameters:sentence (list(tuple(str, str))) – Input sentence to parse
Returns:DependencyGraph the dependency graph representation of the sentence
train(depgraphs, verbose=False)[source]

Train MaltParser from a list of DependencyGraph objects

Parameters:depgraphs – list of DependencyGraph objects for training input data
train_from_file(conll_file, verbose=False)[source]

Train MaltParser from a file

Parameters:conll_file – str for the filename of the training input data
nltk.parse.malt.demo()[source]

nltk.parse.nonprojectivedependencyparser module

class nltk.parse.nonprojectivedependencyparser.DemoScorer[source]

Bases: builtins.object

score(graph)[source]
train(graphs)[source]
class nltk.parse.nonprojectivedependencyparser.DependencyScorerI[source]

Bases: builtins.object

A scorer for calculated the weights on the edges of a weighted dependency graph. This is used by a ProbabilisticNonprojectiveParser to initialize the edge weights of a DependencyGraph. While typically this would be done by training a binary classifier, any class that can return a multidimensional list representation of the edge weights can implement this interface. As such, it has no necessary fields.

score(graph)[source]
Parameters:graph (DependencyGraph) – A dependency graph whose set of edges need to be

scored. :rtype: A three-dimensional list of numbers. :return: The score is returned in a multidimensional(3) list, such that the outer-dimension refers to the head, and the inner-dimension refers to the dependencies. For instance, scores[0][1] would reference the list of scores corresponding to arcs from node 0 to node 1. The node’s ‘address’ field can be used to determine its number identification.

For further illustration, a score list corresponding to Fig.2 of Keith Hall’s ‘K-best Spanning Tree Parsing’ paper:

scores = [[[], [5], [1], [1]],
[[], [], [11], [4]], [[], [10], [], [5]], [[], [8], [8], []]]

When used in conjunction with a MaxEntClassifier, each score would correspond to the confidence of a particular edge being classified with the positive training examples.

train(graphs)[source]
Parameters:graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer.

Typically the edges present in the graphs can be used as positive training examples, and the edges not present as negative examples.

class nltk.parse.nonprojectivedependencyparser.NaiveBayesDependencyScorer[source]

Bases: nltk.parse.nonprojectivedependencyparser.DependencyScorerI

A dependency scorer built around a MaxEnt classifier. In this particular class that classifier is a NaiveBayesClassifier. It uses head-word, head-tag, child-word, and child-tag features for classification.

score(graph)[source]

Converts the graph into a feature-based representation of each edge, and then assigns a score to each based on the confidence of the classifier in assigning it to the positive label. Scores are returned in a multidimensional list.

Parameters:graph (DependencyGraph) – A dependency graph to score.
Return type:3 dimensional list
Returns:Edge scores for the graph parameter.
train(graphs)[source]

Trains a NaiveBayesClassifier using the edges present in graphs list as positive examples, the edges not present as negative examples. Uses a feature vector of head-word, head-tag, child-word, and child-tag.

Parameters:graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer.
class nltk.parse.nonprojectivedependencyparser.NonprojectiveDependencyParser(dependency_grammar)[source]

Bases: builtins.object

A non-projective, rule-based, dependency parser. This parser will return the set of all possible non-projective parses based on the word-to-word relations defined in the parser’s dependency grammar, and will allow the branches of the parse tree to cross in order to capture a variety of linguistic phenomena that a projective parser will not.

parse(tokens)[source]

Parses the input tokens with respect to the parser’s grammar. Parsing is accomplished by representing the search-space of possible parses as a fully-connected directed graph. Arcs that would lead to ungrammatical parses are removed and a lattice is constructed of length n, where n is the number of input tokens, to represent all possible grammatical traversals. All possible paths through the lattice are then enumerated to produce the set of non-projective parses.

param tokens: A list of tokens to parse. type tokens: list(str) return: A set of non-projective parses. rtype: list(DependencyGraph)

class nltk.parse.nonprojectivedependencyparser.ProbabilisticNonprojectiveParser[source]

Bases: builtins.object

A probabilistic non-projective dependency parser. Nonprojective dependencies allows for “crossing branches” in the parse tree which is necessary for representing particular linguistic phenomena, or even typical parses in some languages. This parser follows the MST parsing algorithm, outlined in McDonald(2005), which likens the search for the best non-projective parse to finding the maximum spanning tree in a weighted directed graph.

best_incoming_arc(node_index)[source]

Returns the source of the best incoming arc to the node with address: node_index

Parameters:node_index (integer.) – The address of the ‘destination’ node,

the node that is arced to.

collapse_nodes(new_node, cycle_path, g_graph, b_graph, c_graph)[source]

Takes a list of nodes that have been identified to belong to a cycle, and collapses them into on larger node. The arcs of all nodes in the graph must be updated to account for this.

Parameters:
  • new_node (Node.) – A Node (Dictionary) to collapse the cycle nodes into.
  • cycle_path (A list of integers.) – A list of node addresses, each of which is in the cycle.
  • b_graph, c_graph (g_graph,) – Graphs which need to be updated.
compute_max_subtract_score(column_index, cycle_indexes)[source]

When updating scores the score of the highest-weighted incoming arc is subtracted upon collapse. This returns the correct amount to subtract from that edge.

Parameters:column_index (integer.) – A index representing the column of incoming arcs

to a particular node being updated :type cycle_indexes: A list of integers. :param cycle_indexes: Only arcs from cycle nodes are considered. This is a list of such nodes addresses.

compute_original_indexes(new_indexes)[source]

As nodes are collapsed into others, they are replaced by the new node in the graph, but it’s still necessary to keep track of what these original nodes were. This takes a list of node addresses and replaces any collapsed node addresses with their original addresses.

Parameters:new_addresses – A list of node addresses to check for

subsumed nodes.

initialize_edge_scores(graph)[source]

Assigns a score to every edge in the DependencyGraph graph. These scores are generated via the parser’s scorer which was assigned during the training process.

Parameters:graph (DependencyGraph) – A dependency graph to assign scores to.
original_best_arc(node_index)[source]

???

parse(tokens, tags)[source]

Parses a list of tokens in accordance to the MST parsing algorithm for non-projective dependency parses. Assumes that the tokens to be parsed have already been tagged and those tags are provided. Various scoring methods can be used by implementing the DependencyScorerI interface and passing it to the training algorithm.

Parameters:
  • tokens (list(str)) – A list of words or punctuation to be parsed.
  • tags (list(str)) – A list of tags corresponding by index to the words in the tokens list.
train(graphs, dependency_scorer)[source]

Trains a DependencyScorerI from a set of DependencyGraph objects, and establishes this as the parser’s scorer. This is used to initialize the scores on a DependencyGraph during the parsing procedure.

Parameters:
  • graphs (list(DependencyGraph)) – A list of dependency graphs to train the scorer.
  • dependency_scorer (DependencyScorerI) – A scorer which implements the DependencyScorerI interface.
update_edge_scores(new_node, cycle_path)[source]

Updates the edge scores to reflect a collapse operation into new_node.

Parameters:
  • new_node (A Node.) – The node which cycle nodes are collapsed into.
  • cycle_path (A list of integers.) – A list of node addresses that belong to the cycle.
nltk.parse.nonprojectivedependencyparser.demo()[source]
nltk.parse.nonprojectivedependencyparser.hall_demo()[source]
nltk.parse.nonprojectivedependencyparser.nonprojective_conll_parse_demo()[source]
nltk.parse.nonprojectivedependencyparser.rule_based_demo()[source]

nltk.parse.pchart module

Classes and interfaces for associating probabilities with tree structures that represent the internal organization of a text. The probabilistic parser module defines BottomUpProbabilisticChartParser.

BottomUpProbabilisticChartParser is an abstract class that implements a bottom-up chart parser for PCFG grammars. It maintains a queue of edges, and adds them to the chart one at a time. The ordering of this queue is based on the probabilities associated with the edges, allowing the parser to expand more likely edges before less likely ones. Each subclass implements a different queue ordering, producing different search strategies. Currently the following subclasses are defined:

  • InsideChartParser searches edges in decreasing order of their trees’ inside probabilities.
  • RandomChartParser searches edges in random order.
  • LongestChartParser searches edges in decreasing order of their location’s length.

The BottomUpProbabilisticChartParser constructor has an optional argument beam_size. If non-zero, this controls the size of the beam (aka the edge queue). This option is most useful with InsideChartParser.

class nltk.parse.pchart.BottomUpProbabilisticChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.api.ParserI

An abstract bottom-up parser for PCFG grammars that uses a Chart to record partial results. BottomUpProbabilisticChartParser maintains a queue of edges that can be added to the chart. This queue is initialized with edges for each token in the text that is being parsed. BottomUpProbabilisticChartParser inserts these edges into the chart one at a time, starting with the most likely edges, and proceeding to less likely edges. For each edge that is added to the chart, it may become possible to insert additional edges into the chart; these are added to the queue. This process continues until enough complete parses have been generated, or until the queue is empty.

The sorting order for the queue is not specified by BottomUpProbabilisticChartParser. Different sorting orders will result in different search strategies. The sorting order for the queue is defined by the method sort_queue; subclasses are required to provide a definition for this method.

Variables:
  • _grammar – The grammar used to parse sentences.
  • _trace – The level of tracing output that should be generated when parsing a text.
grammar()[source]
nbest_parse(tokens, n=None)[source]
sort_queue(queue, chart)[source]

Sort the given queue of Edge objects, placing the edge that should be tried first at the beginning of the queue. This method will be called after each Edge is added to the queue.

Parameters:
  • queue (list(Edge)) – The queue of Edge objects to sort. Each edge in this queue is an edge that could be added to the chart by the fundamental rule; but that has not yet been added.
  • chart (Chart) – The chart being used to parse the text. This chart can be used to provide extra information for sorting the queue.
Return type:

None

trace(trace=2)[source]

Set the level of tracing output that should be generated when parsing a text.

Parameters:trace (int) – The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.
Return type:None
class nltk.parse.pchart.InsideChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.pchart.BottomUpProbabilisticChartParser

A bottom-up parser for PCFG grammars that tries edges in descending order of the inside probabilities of their trees. The “inside probability” of a tree is simply the probability of the entire tree, ignoring its context. In particular, the inside probability of a tree generated by production p with children c[1], c[2], ..., c[n] is P(p)P(c[1])P(c[2])...P(c[n]); and the inside probability of a token is 1 if it is present in the text, and 0 if it is absent.

This sorting order results in a type of lowest-cost-first search strategy.

sort_queue(queue, chart)[source]

Sort the given queue of edges, in descending order of the inside probabilities of the edges’ trees.

Parameters:
  • queue (list(Edge)) – The queue of Edge objects to sort. Each edge in this queue is an edge that could be added to the chart by the fundamental rule; but that has not yet been added.
  • chart (Chart) – The chart being used to parse the text. This chart can be used to provide extra information for sorting the queue.
Return type:

None

class nltk.parse.pchart.LongestChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.pchart.BottomUpProbabilisticChartParser

A bottom-up parser for PCFG grammars that tries longer edges before shorter ones. This sorting order results in a type of best-first search strategy.

sort_queue(queue, chart)[source]
class nltk.parse.pchart.ProbabilisticBottomUpInitRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 0
apply_iter(chart, grammar)[source]
class nltk.parse.pchart.ProbabilisticBottomUpPredictRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 1
apply_iter(chart, grammar, edge)[source]
class nltk.parse.pchart.ProbabilisticFundamentalRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 2
apply_iter(chart, grammar, left_edge, right_edge)[source]
class nltk.parse.pchart.ProbabilisticLeafEdge(leaf, index)[source]

Bases: nltk.parse.chart.LeafEdge

prob()[source]
class nltk.parse.pchart.ProbabilisticTreeEdge(prob, *args, **kwargs)[source]

Bases: nltk.parse.chart.TreeEdge

static from_production(production, index, p)[source]
prob()[source]
class nltk.parse.pchart.RandomChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.pchart.BottomUpProbabilisticChartParser

A bottom-up parser for PCFG grammars that tries edges in random order. This sorting order results in a random search strategy.

sort_queue(queue, chart)[source]
class nltk.parse.pchart.SingleEdgeProbabilisticFundamentalRule[source]

Bases: nltk.parse.chart.AbstractChartRule

NUM_EDGES = 1
apply_iter(chart, grammar, edge1)[source]
unicode_repr None

x.__repr__() <==> repr(x)

class nltk.parse.pchart.UnsortedChartParser(grammar, beam_size=0, trace=0)[source]

Bases: nltk.parse.pchart.BottomUpProbabilisticChartParser

A bottom-up parser for PCFG grammars that tries edges in whatever order.

sort_queue(queue, chart)[source]
nltk.parse.pchart.demo(choice=None, draw_parses=None, print_parses=None)[source]

A demonstration of the probabilistic parsers. The user is prompted to select which demo to run, and how many parses should be found; and then each parser is run on the same demo, and a summary of the results are displayed.

nltk.parse.projectivedependencyparser module

class nltk.parse.projectivedependencyparser.ChartCell(x, y)[source]

Bases: builtins.object

A cell from the parse chart formed when performing the CYK algorithm. Each cell keeps track of its x and y coordinates (though this will probably be discarded), and a list of spans serving as the cell’s entries.

add(span)[source]

Appends the given span to the list of spans representing the chart cell’s entries.

Parameters:span (DependencySpan) – The span to add.
unicode_repr()
Returns:A concise string representation of this ChartCell.
Return type:str.
class nltk.parse.projectivedependencyparser.DependencySpan(start_index, end_index, head_index, arcs, tags)[source]

Bases: builtins.object

A contiguous span over some part of the input string representing dependency (head -> modifier) relationships amongst words. An atomic span corresponds to only one word so it isn’t a ‘span’ in the conventional sense, as its _start_index = _end_index = _head_index for concatenation purposes. All other spans are assumed to have arcs between all nodes within the start and end indexes of the span, and one head index corresponding to the head word for the entire span. This is the same as the root node if the dependency structure were depicted as a graph.

head_index()[source]
Returns:An value indexing the head of the entire DependencySpan.
Return type:int
unicode_repr()
Returns:A concise string representatino of the DependencySpan.
Return type:str.
class nltk.parse.projectivedependencyparser.ProbabilisticProjectiveDependencyParser[source]

Bases: builtins.object

A probabilistic, projective dependency parser. This parser returns the most probable projective parse derived from the probabilistic dependency grammar derived from the train() method. The probabilistic model is an implementation of Eisner’s (1996) Model C, which conditions on head-word, head-tag, child-word, and child-tag. The decoding uses a bottom-up chart-based span concatenation algorithm that’s identical to the one utilized by the rule-based projective parser.

compute_prob(dg)[source]

Computes the probability of a dependency graph based on the parser’s probability model (defined by the parser’s statistical dependency grammar).

Parameters:dg (DependencyGraph) – A dependency graph to score.
Returns:The probability of the dependency graph.
Return type:int
concatenate(span1, span2)[source]

Concatenates the two spans in whichever way possible. This includes rightward concatenation (from the leftmost word of the leftmost span to the rightmost word of the rightmost span) and leftward concatenation (vice-versa) between adjacent spans. Unlike Eisner’s presentation of span concatenation, these spans do not share or pivot on a particular word/word-index.

Returns:A list of new spans formed through concatenation.
Return type:list(DependencySpan)
parse(tokens)[source]

Parses the list of tokens subject to the projectivity constraint and the productions in the parser’s grammar. This uses a method similar to the span-concatenation algorithm defined in Eisner (1996). It returns the most probable parse derived from the parser’s probabilistic dependency grammar.

train(graphs)[source]

Trains a StatisticalDependencyGrammar based on the list of input DependencyGraphs. This model is an implementation of Eisner’s (1996) Model C, which derives its statistics from head-word, head-tag, child-word, and child-tag relationships.

Parameters:graphs – A list of dependency graphs to train from.
Type:list(DependencyGraph)
class nltk.parse.projectivedependencyparser.ProjectiveDependencyParser(dependency_grammar)[source]

Bases: builtins.object

A projective, rule-based, dependency parser. A ProjectiveDependencyParser is created with a DependencyGrammar, a set of productions specifying word-to-word dependency relations. The parse() method will then return the set of all parses, in tree representation, for a given input sequence of tokens. Each parse must meet the requirements of the both the grammar and the projectivity constraint which specifies that the branches of the dependency tree are not allowed to cross. Alternatively, this can be understood as stating that each parent node and its children in the parse tree form a continuous substring of the input sequence.

concatenate(span1, span2)[source]

Concatenates the two spans in whichever way possible. This includes rightward concatenation (from the leftmost word of the leftmost span to the rightmost word of the rightmost span) and leftward concatenation (vice-versa) between adjacent spans. Unlike Eisner’s presentation of span concatenation, these spans do not share or pivot on a particular word/word-index.

Returns:A list of new spans formed through concatenation.
Return type:list(DependencySpan)
parse(tokens)[source]

Performs a projective dependency parse on the list of tokens using a chart-based, span-concatenation algorithm similar to Eisner (1996).

Parameters:tokens (list(str)) – The list of input tokens.
Returns:A list of parse trees.
Return type:list(Tree)
nltk.parse.projectivedependencyparser.arity_parse_demo()[source]

A demonstration showing the creation of a DependencyGrammar in which a specific number of modifiers is listed for a given head. This can further constrain the number of possible parses created by a ProjectiveDependencyParser.

nltk.parse.projectivedependencyparser.demo()[source]
nltk.parse.projectivedependencyparser.projective_prob_parse_demo()[source]

A demo showing the training and use of a projective dependency parser.

nltk.parse.projectivedependencyparser.projective_rule_parse_demo()[source]

A demonstration showing the creation and use of a DependencyGrammar to perform a projective dependency parse.

nltk.parse.rd module

class nltk.parse.rd.RecursiveDescentParser(grammar, trace=0)[source]

Bases: nltk.parse.api.ParserI

A simple top-down CFG parser that parses texts by recursively expanding the fringe of a Tree, and matching it against a text.

RecursiveDescentParser uses a list of tree locations called a “frontier” to remember which subtrees have not yet been expanded and which leaves have not yet been matched against the text. Each tree location consists of a list of child indices specifying the path from the root of the tree to a subtree or a leaf; see the reference documentation for Tree for more information about tree locations.

When the parser begins parsing a text, it constructs a tree containing only the start symbol, and a frontier containing the location of the tree’s root node. It then extends the tree to cover the text, using the following recursive procedure:

  • If the frontier is empty, and the text is covered by the tree, then return the tree as a possible parse.
  • If the frontier is empty, and the text is not covered by the tree, then return no parses.
  • If the first element of the frontier is a subtree, then use CFG productions to “expand” it. For each applicable production, add the expanded subtree’s children to the frontier, and recursively find all parses that can be generated by the new tree and frontier.
  • If the first element of the frontier is a token, then “match” it against the next token from the text. Remove the token from the frontier, and recursively find all parses that can be generated by the new tree and frontier.
See:nltk.grammar
grammar()[source]
nbest_parse(tokens, n=None)[source]
trace(trace=2)[source]

Set the level of tracing output that should be generated when parsing a text.

Parameters:trace (int) – The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.
Return type:None
class nltk.parse.rd.SteppingRecursiveDescentParser(grammar, trace=0)[source]

Bases: nltk.parse.rd.RecursiveDescentParser

A RecursiveDescentParser that allows you to step through the parsing process, performing a single operation at a time.

The initialize method is used to start parsing a text. expand expands the first element on the frontier using a single CFG production, and match matches the first element on the frontier against the next text token. backtrack undoes the most recent expand or match operation. step performs a single expand, match, or backtrack operation. parses returns the set of parses that have been found by the parser.

Variables:
  • _history – A list of (rtext, tree, frontier) tripples, containing the previous states of the parser. This history is used to implement the backtrack operation.
  • _tried_e – A record of all productions that have been tried for a given tree. This record is used by expand to perform the next untried production.
  • _tried_m – A record of what tokens have been matched for a given tree. This record is used by step to decide whether or not to match a token.
See:

nltk.grammar

backtrack()[source]

Return the parser to its state before the most recent match or expand operation. Calling undo repeatedly return the parser to successively earlier states. If no match or expand operations have been performed, undo will make no changes.

Returns:true if an operation was successfully undone.
Return type:bool
currently_complete()[source]
Returns:Whether the parser’s current state represents a complete parse.
Return type:bool
expand(production=None)[source]

Expand the first element of the frontier. In particular, if the first element of the frontier is a subtree whose node type is equal to production‘s left hand side, then add a child to that subtree for each element of production‘s right hand side. If production is not specified, then use the first untried expandable production. If all expandable productions have been tried, do nothing.

Returns:The production used to expand the frontier, if an expansion was performed. If no expansion was performed, return None.
Return type:Production or None
expandable_productions()[source]
Returns:A list of all the productions for which expansions are available for the current parser state.
Return type:list(Production)
frontier()[source]
Returns:A list of the tree locations of all subtrees that have not yet been expanded, and all leaves that have not yet been matched.
Return type:list(tuple(int))
initialize(tokens)[source]

Start parsing a given text. This sets the parser’s tree to the start symbol, its frontier to the root node, and its remaining text to token['SUBTOKENS'].

match()[source]

Match the first element of the frontier. In particular, if the first element of the frontier has the same type as the next text token, then substitute the text token into the tree.

Returns:The token matched, if a match operation was performed. If no match was performed, return None
Return type:str or None
nbest_parse(tokens, n=None)[source]
parses()[source]
Returns:A list of the parses that have been found by this parser so far.
Return type:list of Tree
remaining_text()[source]
Returns:The portion of the text that is not yet covered by the tree.
Return type:list(str)
set_grammar(grammar)[source]

Change the grammar used to parse texts.

Parameters:grammar (CFG) – The new grammar.
step()[source]

Perform a single parsing operation. If an untried match is possible, then perform the match, and return the matched token. If an untried expansion is possible, then perform the expansion, and return the production that it is based on. If backtracking is possible, then backtrack, and return 1. Otherwise, return 0.

Returns:0 if no operation was performed; a token if a match was performed; a production if an expansion was performed; and 1 if a backtrack operation was performed.
Return type:Production or String or bool
tree()[source]
Returns:A partial structure for the text that is currently being parsed. The elements specified by the frontier have not yet been expanded or matched.
Return type:Tree
untried_expandable_productions()[source]
Returns:A list of all the untried productions for which expansions are available for the current parser state.
Return type:list(Production)
untried_match()[source]
Returns:Whether the first element of the frontier is a token that has not yet been matched.
Return type:bool
nltk.parse.rd.demo()[source]

A demonstration of the recursive descent parser.

nltk.parse.sr module

class nltk.parse.sr.ShiftReduceParser(grammar, trace=0)[source]

Bases: nltk.parse.api.ParserI

A simple bottom-up CFG parser that uses two operations, “shift” and “reduce”, to find a single parse for a text.

ShiftReduceParser maintains a stack, which records the structure of a portion of the text. This stack is a list of strings and Trees that collectively cover a portion of the text. For example, while parsing the sentence “the dog saw the man” with a typical grammar, ShiftReduceParser will produce the following stack, which covers “the dog saw”:

[(NP: (Det: 'the') (N: 'dog')), (V: 'saw')]

ShiftReduceParser attempts to extend the stack to cover the entire text, and to combine the stack elements into a single tree, producing a complete parse for the sentence.

Initially, the stack is empty. It is extended to cover the text, from left to right, by repeatedly applying two operations:

  • “shift” moves a token from the beginning of the text to the end of the stack.
  • “reduce” uses a CFG production to combine the rightmost stack elements into a single Tree.

Often, more than one operation can be performed on a given stack. In this case, ShiftReduceParser uses the following heuristics to decide which operation to perform:

  • Only shift if no reductions are available.
  • If multiple reductions are available, then apply the reduction whose CFG production is listed earliest in the grammar.

Note that these heuristics are not guaranteed to choose an operation that leads to a parse of the text. Also, if multiple parses exists, ShiftReduceParser will return at most one of them.

See:nltk.grammar
grammar()[source]
parse(tokens)[source]
trace(trace=2)[source]

Set the level of tracing output that should be generated when parsing a text.

Parameters:trace (int) – The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.
Return type:None
class nltk.parse.sr.SteppingShiftReduceParser(grammar, trace=0)[source]

Bases: nltk.parse.sr.ShiftReduceParser

A ShiftReduceParser that allows you to setp through the parsing process, performing a single operation at a time. It also allows you to change the parser’s grammar midway through parsing a text.

The initialize method is used to start parsing a text. shift performs a single shift operation, and reduce performs a single reduce operation. step will perform a single reduce operation if possible; otherwise, it will perform a single shift operation. parses returns the set of parses that have been found by the parser.

Variables:_history – A list of (stack, remaining_text) pairs, containing all of the previous states of the parser. This history is used to implement the undo operation.
See:nltk.grammar
initialize(tokens)[source]

Start parsing a given text. This sets the parser’s stack to [] and sets its remaining text to tokens.

nbest_parse(tokens, n=None)[source]
parses()[source]
Returns:A list of the parses that have been found by this parser so far.
Return type:list of Tree
reduce(production=None)[source]

Use production to combine the rightmost stack elements into a single Tree. If production does not match the rightmost stack elements, then do nothing.

Returns:The production used to reduce the stack, if a reduction was performed. If no reduction was performed, return None.
Return type:Production or None
reducible_productions()[source]
Returns:A list of the productions for which reductions are available for the current parser state.
Return type:list(Production)
remaining_text()[source]
Returns:The portion of the text that is not yet covered by the stack.
Return type:list(str)
set_grammar(grammar)[source]

Change the grammar used to parse texts.

Parameters:grammar (CFG) – The new grammar.
shift()[source]

Move a token from the beginning of the remaining text to the end of the stack. If there are no more tokens in the remaining text, then do nothing.

Returns:True if the shift operation was successful.
Return type:bool
stack()[source]
Returns:The parser’s stack.
Return type:list(str and Tree)
step()[source]

Perform a single parsing operation. If a reduction is possible, then perform that reduction, and return the production that it is based on. Otherwise, if a shift is possible, then perform it, and return 1. Otherwise, return 0.

Returns:0 if no operation was performed; 1 if a shift was performed; and the CFG production used to reduce if a reduction was performed.
Return type:Production or bool
undo()[source]

Return the parser to its state before the most recent shift or reduce operation. Calling undo repeatedly return the parser to successively earlier states. If no shift or reduce operations have been performed, undo will make no changes.

Returns:true if an operation was successfully undone.
Return type:bool
nltk.parse.sr.demo()[source]

A demonstration of the shift-reduce parser.

nltk.parse.util module

Utility functions for parsers.

class nltk.parse.util.TestGrammar(grammar, suite, accept=None, reject=None)[source]

Bases: builtins.object

Unit tests for CFG.

run(show_trees=False)[source]
Sentences in the test suite are divided into two classes:
  • grammatical (accept) and
  • ungrammatical (reject).

If a sentence should parse accordng to the grammar, the value of trees will be a non-empty list. If a sentence should be rejected according to the grammar, then the value of trees will be None.

nltk.parse.util.extract_test_sentences(string, comment_chars='#%;', encoding=None)[source]

Parses a string with one test sentence per line. Lines can optionally begin with:

  • a bool, saying if the sentence is grammatical or not, or
  • an int, giving the number of parse trees is should have,

The result information is followed by a colon, and then the sentence. Empty lines and lines beginning with a comment char are ignored.

Returns:

a list of tuple of sentences and expected results, where a sentence is a list of str, and a result is None, or bool, or int

Parameters:
  • comment_charsstr of possible comment characters.
  • encoding – the encoding of the string, if it is binary
nltk.parse.util.load_parser(grammar_url, trace=0, parser=None, chart_class=None, beam_size=0, **load_args)[source]

Load a grammar from a file, and build a parser based on that grammar. The parser depends on the grammar format, and might also depend on properties of the grammar itself.

The following grammar formats are currently supported:
  • 'cfg' (CFGs: ContextFreeGrammar)
  • 'pcfg' (probabilistic CFGs: WeightedGrammar)
  • 'fcfg' (feature-based CFGs: ContextFreeGrammar)
Parameters:
  • grammar_url (str) – A URL specifying where the grammar is located. The default protocol is "nltk:", which searches for the file in the the NLTK data package.
  • trace (int) – The level of tracing that should be used when parsing a text. 0 will generate no tracing output; and higher numbers will produce more verbose tracing output.
  • parser – The class used for parsing; should be ChartParser or a subclass. If None, the class depends on the grammar format.
  • chart_class – The class used for storing the chart; should be Chart or a subclass. Only used for CFGs and feature CFGs. If None, the chart class depends on the grammar format.
  • beam_size (int) – The maximum length for the parser’s edge queue. Only used for probabilistic CFGs.
  • load_args – Keyword parameters used when loading the grammar. See data.load for more information.

nltk.parse.viterbi module

class nltk.parse.viterbi.ViterbiParser(grammar, trace=0)[source]

Bases: nltk.parse.api.ParserI

A bottom-up PCFG parser that uses dynamic programming to find the single most likely parse for a text. The ViterbiParser parser parses texts by filling in a “most likely constituent table”. This table records the most probable tree representation for any given span and node value. In particular, it has an entry for every start index, end index, and node value, recording the most likely subtree that spans from the start index to the end index, and has the given node value.

The ViterbiParser parser fills in this table incrementally. It starts by filling in all entries for constituents that span one element of text (i.e., entries where the end index is one greater than the start index). After it has filled in all table entries for constituents that span one element of text, it fills in the entries for constitutants that span two elements of text. It continues filling in the entries for constituents spanning larger and larger portions of the text, until the entire table has been filled. Finally, it returns the table entry for a constituent spanning the entire text, whose node value is the grammar’s start symbol.

In order to find the most likely constituent with a given span and node value, the ViterbiParser parser considers all productions that could produce that node value. For each production, it finds all children that collectively cover the span and have the node values specified by the production’s right hand side. If the probability of the tree formed by applying the production to the children is greater than the probability of the current entry in the table, then the table is updated with this new tree.

A pseudo-code description of the algorithm used by ViterbiParser is:

Create an empty most likely constituent table, MLC.
For width in 1...len(text):
For start in 1...len(text)-width:
For prod in grammar.productions:
For each sequence of subtrees [t[1], t[2], ..., t[n]] in MLC,
where t[i].label()==prod.rhs[i],
and the sequence covers [start:start+width]:
old_p = MLC[start, start+width, prod.lhs]
new_p = P(t[1])P(t[1])...P(t[n])P(prod)
if new_p > old_p:
new_tree = Tree(prod.lhs, t[1], t[2], ..., t[n])
MLC[start, start+width, prod.lhs] = new_tree
Return MLC[0, len(text), start_symbol]
Variables:
  • _grammar – The grammar used to parse sentences.
  • _trace – The level of tracing output that should be generated when parsing a text.
grammar()[source]
parse(tokens)[source]
trace(trace=2)[source]

Set the level of tracing output that should be generated when parsing a text.

Parameters:trace (int) – The trace level. A trace level of 0 will generate no tracing output; and higher trace levels will produce more verbose tracing output.
Return type:None
unicode_repr()
nltk.parse.viterbi.demo()[source]

A demonstration of the probabilistic parsers. The user is prompted to select which demo to run, and how many parses should be found; and then each parser is run on the same demo, and a summary of the results are displayed.

Module contents

NLTK Parsers

Classes and interfaces for producing tree structures that represent the internal organization of a text. This task is known as “parsing” the text, and the resulting tree structures are called the text’s “parses”. Typically, the text is a single sentence, and the tree structure represents the syntactic structure of the sentence. However, parsers can also be used in other domains. For example, parsers can be used to derive the morphological structure of the morphemes that make up a word, or to derive the discourse structure for a set of utterances.

Sometimes, a single piece of text can be represented by more than one tree structure. Texts represented by more than one tree structure are called “ambiguous” texts. Note that there are actually two ways in which a text can be ambiguous:

  • The text has multiple correct parses.
  • There is not enough information to decide which of several candidate parses is correct.

However, the parser module does not distinguish these two types of ambiguity.

The parser module defines ParserI, a standard interface for parsing texts; and two simple implementations of that interface, ShiftReduceParser and RecursiveDescentParser. It also contains three sub-modules for specialized kinds of parsing:

  • nltk.parser.chart defines chart parsing, which uses dynamic programming to efficiently parse texts.
  • nltk.parser.probabilistic defines probabilistic parsing, which associates a probability with each parse.