nltk.tag package

Submodules

nltk.tag.api module

Interface for tagging each token in a sentence with supplementary information, such as its part of speech.

class nltk.tag.api.FeaturesetTaggerI[source]

Bases: nltk.tag.api.TaggerI

A tagger that requires tokens to be featuresets. A featureset is a dictionary that maps from feature names to feature values. See nltk.classify for more information about features and featuresets.

class nltk.tag.api.TaggerI[source]

Bases: builtins.object

A processing interface for assigning a tag to each token in a list. Tags are case sensitive strings that identify some property of each token, such as its part of speech or its sense.

Some taggers require specific types for their tokens. This is generally indicated by the use of a sub-interface to TaggerI. For example, featureset taggers, which are subclassed from FeaturesetTaggerI, require that each token be a featureset.

Subclasses must define:
  • either tag() or batch_tag() (or both)
batch_tag(sentences)[source]

Apply self.tag() to each element of sentences. I.e.:

return [self.tag(sent) for sent in sentences]
evaluate(gold)[source]

Score the accuracy of the tagger against the gold standard. Strip the tags from the gold standard text, retag it using the tagger, then compute the accuracy score.

Parameters:gold (list(list(tuple(str, str)))) – The list of tagged sentences to score the tagger on.
Return type:float
tag(tokens)[source]

Determine the most appropriate tag sequence for the given token sequence, and return a corresponding list of tagged tokens. A tagged token is encoded as a tuple (token, tag).

Return type:list(tuple(str, str))
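
A minimal sketch of a TaggerI implementation (the class name and fixed tag here are illustrative, not part of NLTK): defining tag() is enough, since batch_tag() and evaluate() are derived from it.

from nltk.tag.api import TaggerI

class ConstantTagger(TaggerI):
    """A hypothetical tagger that assigns the same fixed tag to every token."""
    def __init__(self, tag):
        self._tag = tag
    def tag(self, tokens):
        # A tagged token is encoded as a (token, tag) tuple.
        return [(token, self._tag) for token in tokens]

tagger = ConstantTagger('NN')
tagger.tag('This is a test'.split())
# [('This', 'NN'), ('is', 'NN'), ('a', 'NN'), ('test', 'NN')]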

nltk.tag.brill module

Brill Tagger

The Brill Tagger is a transformational rule-based tagger. It starts by running an initial tagger, and then improves the tagging by applying a list of transformation rules. These transformation rules are automatically learned from the training corpus, based on one or more “rule templates.”

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> from nltk.tag.brill import SymmetricProximateTokensTemplate, ProximateTokensTemplate
>>> from nltk.tag.brill import ProximateTagsRule, ProximateWordsRule, FastBrillTaggerTrainer
>>> brown_train = list(brown.tagged_sents(categories='news')[:500])
>>> brown_test = list(brown.tagged_sents(categories='news')[500:600])
>>> unigram_tagger = UnigramTagger(brown_train)
>>> templates = [
...     SymmetricProximateTokensTemplate(ProximateTagsRule, (1,1)),
...     SymmetricProximateTokensTemplate(ProximateTagsRule, (2,2)),
...     SymmetricProximateTokensTemplate(ProximateTagsRule, (1,2)),
...     SymmetricProximateTokensTemplate(ProximateTagsRule, (1,3)),
...     SymmetricProximateTokensTemplate(ProximateWordsRule, (1,1)),
...     SymmetricProximateTokensTemplate(ProximateWordsRule, (2,2)),
...     SymmetricProximateTokensTemplate(ProximateWordsRule, (1,2)),
...     SymmetricProximateTokensTemplate(ProximateWordsRule, (1,3)),
...     ProximateTokensTemplate(ProximateTagsRule, (-1, -1), (1,1)),
...     ProximateTokensTemplate(ProximateWordsRule, (-1, -1), (1,1)),
...     ]
>>> trainer = FastBrillTaggerTrainer(initial_tagger=unigram_tagger,
...                                  templates=templates, trace=3,
...                                  deterministic=True)
>>> brill_tagger = trainer.train(brown_train, max_rules=10)
Training Brill tagger on 500 sentences...
Finding initial useful rules...
    Found 10210 useful rules.

           B      |
   S   F   r   O  |        Score = Fixed - Broken
   c   i   o   t  |  R     Fixed = num tags changed incorrect -> correct
   o   x   k   h  |  u     Broken = num tags changed correct -> incorrect
   r   e   e   e  |  l     Other = num tags changed incorrect -> incorrect
   e   d   n   r  |  e
------------------+-------------------------------------------------------
  46  46   0   0  | TO -> IN if the tag of the following word is 'AT'
  18  20   2   0  | TO -> IN if the tag of words i+1...i+3 is 'CD'
  14  14   0   0  | IN -> IN-TL if the tag of the preceding word is
                  |   'NN-TL', and the tag of the following word is
                  |   'NN-TL'
  11  11   0   1  | TO -> IN if the tag of the following word is 'NNS'
  10  10   0   0  | TO -> IN if the tag of the following word is 'JJ'
   8   8   0   0  | , -> ,-HL if the tag of the preceding word is 'NP-
                  |   HL'
   7   7   0   1  | NN -> VB if the tag of the preceding word is 'MD'
   7  13   6   0  | NN -> VB if the tag of the preceding word is 'TO'
   7   7   0   0  | NP-TL -> NP if the tag of words i+1...i+2 is 'NNS'
   7   7   0   0  | VBN -> VBD if the tag of the preceding word is
                  |   'NP'
>>> brill_tagger.evaluate(brown_test) 
0.742...
class nltk.tag.brill.BrillRule(original_tag, replacement_tag)[source]

Bases: yaml.YAMLObject

An interface for tag transformations on a tagged corpus, as performed by brill taggers. Each transformation finds all tokens in the corpus that are tagged with a specific original tag and satisfy a specific condition, and replaces their tags with a replacement tag. For any given transformation, the original tag, replacement tag, and condition are fixed. Conditions may depend on the token under consideration, as well as any other tokens in the corpus.

Brill rules must be comparable and hashable.

applies(tokens, index)[source]
Returns:

True if the rule would change the tag of tokens[index], False otherwise

Return type:

bool

Parameters:
  • tokens (list(str)) – A tagged sentence
  • index (int) – The index to check
apply(tokens, positions=None)[source]

Apply this rule at every position in positions where it applies to the given sentence. I.e., for each position p in positions, if tokens[p] is tagged with this rule’s original tag, and satisfies this rule’s condition, then set its tag to be this rule’s replacement tag.

Parameters:
  • tokens (list(tuple(str, str))) – The tagged sentence
  • positions (list(int)) – The positions where the transformation is to be tried. If not specified, try it at all positions.
Returns:

The indices of tokens whose tags were changed by this rule.

Return type:

list(int)

class nltk.tag.brill.BrillTagger(initial_tagger, rules)[source]

Bases: nltk.tag.api.TaggerI, yaml.YAMLObject

Brill’s transformational rule-based tagger. Brill taggers use an initial tagger (such as tag.DefaultTagger) to assign an initial tag sequence to a text; and then apply an ordered list of transformational rules to correct the tags of individual tokens. These transformation rules are specified by the BrillRule interface.

Brill taggers can be created directly, from an initial tagger and a list of transformational rules; but more often, Brill taggers are created by learning rules from a training corpus, using either BrillTaggerTrainer or FastBrillTaggerTrainer.

rules()[source]
tag(tokens)[source]
yaml_tag = '!nltk.BrillTagger'
class nltk.tag.brill.BrillTaggerTrainer(initial_tagger, templates, trace=0, deterministic=None)[source]

Bases: builtins.object

A trainer for brill taggers.

Parameters:deterministic – If true, then choose between rules that have the same score by picking the one whose __repr__ is lexicographically smaller. If false, then just pick the first rule we find with a given score – this will depend on the order in which keys are returned from dictionaries, and so may not be the same from one run to the next. If not specified, treat as true iff trace > 0.
train(train_sents, max_rules=200, min_score=2)[source]

Trains the Brill tagger on the corpus train_sents, producing at most max_rules transformations, each of which reduces the net number of errors in the corpus by at least min_score.

Parameters:
  • train_sents (list(list(tuple))) – The corpus of tagged sentences
  • max_rules (int) – The maximum number of transformations to be created
  • min_score (int) – The minimum acceptable net error reduction that each transformation must produce in the corpus.
class nltk.tag.brill.BrillTemplateI[source]

Bases: builtins.object

An interface for generating lists of transformational rules that apply at given sentence positions. BrillTemplateI is used by Brill training algorithms to generate candidate rules.

applicable_rules(tokens, i, correctTag)[source]

Return a list of the transformational rules that would correct the tag of the i-th token in the given sentence. In particular, return a list of zero or more rules that would change tokens[i][1] to correctTag, if applied to tokens[i].

If the i-th token already has the correct tag (i.e., if tagged_tokens[i][1] == correctTag), then applicable_rules() should return the empty list.

Parameters:
  • tokens (list(tuple)) – The tagged tokens being tagged.
  • i (int) – The index of the token whose tag should be corrected.
  • correctTag (any) – The correct tag for the i-th token.
Return type:

list(BrillRule)

get_neighborhood(token, index)[source]

Returns the set of indices i such that applicable_rules(token, i, ...) depends on the value of the index-th token of token.

This method is used by the “fast” Brill tagger trainer.

Parameters:
  • token (list(tuple)) – The tokens being tagged.
  • index (int) – The index whose neighborhood should be returned.
Return type:

set

class nltk.tag.brill.FastBrillTaggerTrainer(initial_tagger, templates, trace=0, deterministic=False)[source]

Bases: builtins.object

A faster trainer for brill taggers.

train(train_sents, max_rules=200, min_score=2)[source]
class nltk.tag.brill.ProximateTagsRule(original_tag, replacement_tag, *conditions)[source]

Bases: nltk.tag.brill.ProximateTokensRule

A rule which examines the tags of nearby tokens. See ProximateTokensRule for details. Also see SymmetricProximateTokensTemplate which generates these rules.

PROPERTY_NAME = 'tag'
static extract_property(token)[source]
Returns:The given token’s tag.
yaml_tag = '!ProximateTagsRule'
class nltk.tag.brill.ProximateTokensRule(original_tag, replacement_tag, *conditions)[source]

Bases: nltk.tag.brill.BrillRule

An abstract base class for brill rules whose condition checks for the presence of tokens with given properties at given ranges of positions, relative to the token.

Each subclass of proximate tokens brill rule defines a method extract_property(), which extracts a specific property from the token, such as its text or tag. Each instance is parameterized by a set of tuples, specifying ranges of positions and property values to check for in those ranges: (start, end, value).

The brill rule is then applicable to the *n*th token iff:

  • The *n*th token is tagged with the rule’s original tag; and
  • For each (start, end, value) triple, the property value of at least one token between n+start and n+end (inclusive) is value.

For example, a proximate token brill template with start=end=-1 generates rules that check just the property of the preceding token. Note that multiple properties may be included in a single rule; the rule applies if they all hold.

Construct a new brill rule that changes a token’s tag from original_tag to replacement_tag if all of the properties specified in conditions hold.

Parameters:conditions (tuple(int, int, *)) – A list of 3-tuples (start, end, value), each of which specifies that the property of at least one token between n+start and n+end (inclusive) is value.
Raises ValueError:
 If start>end for any condition.
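
As a sketch of the constructor described above (the example sentence and tags are illustrative), a rule matching the demo output "NN -> VB if the tag of the preceding word is 'TO'" could be built and applied like this:

from nltk.tag.brill import ProximateTagsRule

# Change 'NN' to 'VB' if some token between n-1 and n-1 (i.e. the
# preceding token) is tagged 'TO'.
rule = ProximateTagsRule('NN', 'VB', (-1, -1, 'TO'))

tagged_sent = [('I', 'PPSS'), ('want', 'VB'), ('to', 'TO'), ('run', 'NN')]
rule.applies(tagged_sent, 3)       # True: tokens[3] is 'NN' and follows a 'TO'
changed = rule.apply(tagged_sent)  # retags 'run' as 'VB'; returns the changed indices
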
applies(tokens, index)[source]
static extract_property(token)[source]

Returns some property characterizing this token, such as its base lexical item or its tag.

Each implementation of this method should correspond to an implementation of the method with the same name in a subclass of ProximateTokensTemplate.

Parameters:token (tuple(str, str)) – The token
Returns:The property
Return type:any
classmethod from_yaml(loader, node)[source]
classmethod to_yaml(dumper, data)[source]
unicode_repr()
class nltk.tag.brill.ProximateTokensTemplate(rule_class, *boundaries)[source]

Bases: nltk.tag.brill.BrillTemplateI

A brill template that generates a list of ProximateTokensRule rules that apply at a given sentence position. In particular, each ProximateTokensTemplate is parameterized by a proximate token brill rule class and a list of boundaries, and generates all rules that:

  • use the given brill rule class
  • use the given list of boundaries as the start and end points for their conditions
  • are applicable to the given token.

Construct a template for generating proximate token brill rules.

Parameters:
  • rule_class (class) – The proximate token brill rule class that should be used to generate new rules. This class must be a subclass of ProximateTokensRule.
  • boundaries (tuple(int, int)) – A list of (start, end) tuples each of which specifies a range for which a condition should be created by each rule.
Raises ValueError:
 If start>end for any boundary.

applicable_rules(tokens, index, correct_tag)[source]
get_neighborhood(tokens, index)[source]
class nltk.tag.brill.ProximateWordsRule(original_tag, replacement_tag, *conditions)[source]

Bases: nltk.tag.brill.ProximateTokensRule

A rule which examines the base types of nearby tokens. See ProximateTokensRule for details. Also see SymmetricProximateTokensTemplate which generates these rules.

PROPERTY_NAME = 'text'
static extract_property(token)[source]
Returns:The given token’s text.
yaml_tag = '!ProximateWordsRule'
class nltk.tag.brill.SymmetricProximateTokensTemplate(rule_class, *boundaries)[source]

Bases: nltk.tag.brill.BrillTemplateI

Simulates two ProximateTokensTemplate templates which are symmetric across the location of the token. For rules of the form “If the *n*th token is tagged A, and any tag preceding or following the *n*th token by a distance between x and y is B, and ... , then change the tag of the *n*th token from A to C.”

One ProximateTokensTemplate is formed by passing in the same arguments given to this class’s constructor: tuples representing intervals in which a tag may be found. The other ProximateTokensTemplate is constructed with the negative of all the arguments in reversed order. For example, a SymmetricProximateTokensTemplate constructed with the pair (-2,-1) generates the same rules as a ProximateTokensTemplate using (-2,-1) plus a second ProximateTokensTemplate using (1,2).

This is useful because we typically don’t want templates to specify only “following” or only “preceding”; we’d like our rules to be able to look in either direction.

Construct a template for generating proximate token brill rules.

Parameters:
  • rule_class (class) – The proximate token brill rule class that should be used to generate new rules. This class must be a subclass of ProximateTokensRule.
  • boundaries (tuple(int, int)) – A list of tuples (start, end), each of which specifies a range for which a condition should be created by each rule.
Raises ValueError:
 If start>end for any boundary.

applicable_rules(tokens, index, correctTag)[source]

See BrillTemplateI for full specifications.

Return type:list of ProximateTokensRule
get_neighborhood(tokens, index)[source]
nltk.tag.brill.demo(num_sents=2000, max_rules=200, min_score=3, error_output='errors.out', rule_output='rules.yaml', randomize=False, train=0.8, trace=3)[source]

Brill Tagger Demonstration

Parameters:
  • num_sents (int) – how many sentences of training and testing data to use
  • max_rules (int) – maximum number of rule instances to create
  • min_score (int) – the minimum score for a rule in order for it to be considered
  • error_output (str) – the file where errors will be saved
  • rule_output (str) – the file where rules will be saved
  • randomize (bool) – whether the training data should be a random subset of the corpus
  • train (float) – the fraction of the corpus to be used for training (1=all)
  • trace (int) – the level of diagnostic tracing output to produce (0-4)
nltk.tag.brill.error_list(train_sents, test_sents, radius=2)[source]

Returns a list of human-readable strings indicating the errors in the given tagging of the corpus.

Parameters:
  • train_sents (list(tuple)) – The correct tagging of the corpus
  • test_sents (list(tuple)) – The tagged corpus
  • radius (int) – How many tokens on either side of a wrongly-tagged token to include in the error string. For example, if radius=2, each error string will show the incorrect token plus two tokens on either side.

nltk.tag.crf module

An interface to the Linear Chain Conditional Random Field (LC-CRF) implementation of Mallet (http://mallet.cs.umass.edu/).

A user-supplied feature detector function is used to convert each token to a featureset. Each feature/value pair is then encoded as a single binary feature for Mallet.
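
A minimal sketch of such a feature detector (the feature names are illustrative); each name/value pair it returns is encoded as one binary feature for Mallet:

def word_features(tokens, index):
    # tokens is the list of words in the sentence; index is the position
    # of the word to describe.
    word = tokens[index]
    return {
        'word': word,
        'suffix(2)': word[-2:],
        'is-title': word.istitle(),
    }

# A MalletCRF could then be trained with this detector (requires a local
# Mallet installation), e.g.:
#     model = MalletCRF.train(word_features, train_corpus)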

class nltk.tag.crf.CRFInfo(states, gaussian_variance, default_label, max_iterations, transduction_type, weight_groups, add_start_state, add_end_state, model_filename, feature_detector)[source]

Bases: builtins.object

An object used to record configuration information about a MalletCRF object. This configuration information can be serialized to an XML file, which can then be read by NLTK’s custom interface to Mallet’s CRF.

CRFInfo objects are typically created by the MalletCRF.train() method.

Advanced users may wish to directly create custom CRFInfo.WeightGroup objects and pass them to the MalletCRF.train() function. See CRFInfo.WeightGroup for more information.

class State(name, initial_cost, final_cost, transitions)[source]

Bases: builtins.object

A description of a single CRF state.

toxml()[source]
class CRFInfo.Transition(destination, label, weightgroups)[source]

Bases: builtins.object

A description of a single CRF transition.

toxml()[source]
class CRFInfo.WeightGroup(name, src, dst, features='.*')[source]

Bases: builtins.object

A configuration object used by MalletCRF to specify how input-features (which are a function of only the input) should be mapped to joint-features (which are a function of both the input and the output tags).

Each weight group specifies that a given set of input features should be paired with all transitions from a given set of source tags to a given set of destination tags.

match(src, dst)[source]
toxml()[source]
static CRFInfo.fromstring(s)[source]
CRFInfo.toxml()[source]
CRFInfo.write(filename, encoding='utf8')[source]
class nltk.tag.crf.MalletCRF(filename, feature_detector=None)[source]

Bases: nltk.tag.api.FeaturesetTaggerI

A conditional random field tagger, which is trained and run by making external calls to Mallet. Tokens are converted to featuresets using a feature detector function:

feature_detector(tokens, index) -> featureset

These featuresets are then encoded into feature vectors by converting each feature (name, value) pair to a unique binary feature.

Each MalletCRF object is backed by a crf model file. This model file is actually a zip file containing one file for the serialized model (crf-model.ser) and one file with information about the structure of the CRF (crf-info.xml).

Create a new MalletCRF.

Parameters:
  • filename – The filename of the model file that backs this CRF.
  • feature_detector – The feature detector function that is used to convert tokens to featuresets. This parameter only needs to be given if the model file does not contain a pickled pointer to the feature detector (e.g., if the feature detector was a lambda function).
batch_tag(sentences)[source]
crf_info = None

A CRFInfo object describing this CRF.

feature_detector None[source]

The feature detector function that is used to convert tokens to featuresets. This function has the signature:

feature_detector(tokens, index) -> featureset

filename None[source]

The filename of the crf model file that backs this MalletCRF. The crf model file is actually a zip file containing one file for the serialized model (crf-model.ser) and one file with information about the structure of the CRF (crf-info.xml).

parse_mallet_output(s)[source]

Parse the output that is generated by the Java program org.nltk.mallet.TestCRF, and convert it to a labeled corpus.

classmethod train(feature_detector, corpus, filename=None, weight_groups=None, gaussian_variance=1, default_label='O', transduction_type='VITERBI', max_iterations=500, add_start_state=True, add_end_state=True, trace=1)[source]

Train a new linear chain CRF tagger based on the given corpus of training sequences. This tagger will be backed by a crf model file, containing both a serialized Mallet model and information about the CRF’s structure. This crf model file will not be automatically deleted – if you wish to delete it, you must delete it manually. The filename of the model file for a MalletCRF crf is available as crf.filename.

Parameters:
  • corpus (list(tuple(str, str))) – Training data, represented as a list of sentences, where each sentence is a list of (token, tag) tuples.
  • filename (str) – The filename that should be used for the crf model file that backs the new MalletCRF. If no filename is given, then a new filename will be chosen automatically.
  • weight_groups (list(CRFInfo.WeightGroup)) – Specifies how input-features should be mapped to joint-features. See CRFInfo.WeightGroup for more information.
  • gaussian_variance (float) – The gaussian variance of the prior that should be used to train the new CRF.
  • default_label (str) – The “label for initial context and uninteresting tokens” (from Mallet’s SimpleTagger.java.) It’s unclear whether this currently has any effect.
  • transduction_type (str) – The type of transduction used by the CRF. Can be VITERBI, VITERBI_FBEAM, VITERBI_BBEAM, VITERBI_FBBEAM, or VITERBI_FBEAMKL.
  • max_iterations (int) – The maximum number of iterations that should be used for training the CRF.
  • add_start_state (bool) – If true, then NLTK will add a special start state, named ‘__start__’. The initial cost for the start state will be set to 0; and the initial cost for all other states will be set to +inf.
  • add_end_state (bool) – If true, then NLTK will add a special end state, named ‘__end__’. The final cost for the end state will be set to 0; and the final cost for all other states will be set to +inf.
  • trace (int) – Controls the verbosity of trace output generated while training the CRF. Higher numbers generate more verbose output.
unicode_repr()
write_test_corpus(corpus, stream, close_stream=True)[source]

Write a given test corpus to a given stream, in a format that can be read by the Java program org.nltk.mallet.TestCRF.

write_training_corpus(corpus, stream, close_stream=True)[source]

Write a given training corpus to a given stream, in a format that can be read by the Java program org.nltk.mallet.TrainCRF.

nltk.tag.crf.demo(train_size=100, test_size=100, java_home=None, mallet_home=None)[source]

nltk.tag.hmm module

Hidden Markov Models (HMMs) are largely used to assign the correct label sequence to sequential data, or to assess the probability of a given label and data sequence. These models are finite state machines characterised by a number of states, transitions between these states, and output symbols emitted while in each state. The HMM is an extension to the Markov chain, where each state corresponds deterministically to a given event. In the HMM the observation is a probabilistic function of the state. HMMs share the Markov chain’s assumption, namely that the probability of transition from one state to another only depends on the current state - i.e. the series of states that led to the current state is not used. They are also time invariant.

The HMM is a directed graph, with probability weighted edges (representing the probability of a transition between the source and sink states) where each vertex emits an output symbol when entered. The symbol (or observation) is non-deterministically generated. For this reason, knowing that a sequence of output observations was generated by a given HMM does not mean that the corresponding sequence of states (and what the current state is) is known. This is the ‘hidden’ in the hidden markov model.

Formally, a HMM can be characterised by:

  • the output observation alphabet. This is the set of symbols which may be observed as output of the system.
  • the set of states.
  • the transition probabilities a_{ij} = P(s_t = j | s_{t-1} = i). These represent the probability of transition to each state from a given state.
  • the output probability matrix b_i(k) = P(X_t = o_k | s_t = i). These represent the probability of observing each symbol in a given state.
  • the initial state distribution. This gives the probability of starting in each state.

To ground this discussion, take a common NLP application, part-of-speech (POS) tagging. An HMM is desirable for this task as the highest probability tag sequence can be calculated for a given sequence of word forms. This differs from other tagging techniques which often tag each word individually, seeking to optimise each individual tagging greedily without regard to the optimal combination of tags for a larger unit, such as a sentence. The HMM does this with the Viterbi algorithm, which efficiently computes the optimal path through the graph given the sequence of word forms.

In POS tagging the states usually have a 1:1 correspondence with the tag alphabet - i.e. each state represents a single tag. The output observation alphabet is the set of word forms (the lexicon), and the remaining three parameters are derived by a training regime. With this information the probability of a given sentence can be easily derived, by simply summing the probability of each distinct path through the model. Similarly, the highest probability tagging sequence can be derived with the Viterbi algorithm, yielding a state sequence which can be mapped into a tag sequence.

This discussion assumes that the HMM has been trained. This is probably the most difficult task with the model, and requires either MLE estimates of the parameters or unsupervised learning using the Baum-Welch algorithm, a variant of EM.

For more information, please consult the source code for this module, which includes extensive demonstration code.
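
As a quick illustration (the corpus choice and example sentence are arbitrary), a supervised HMM tagger can be trained and applied as follows:

from nltk.corpus import treebank
from nltk.tag.hmm import HiddenMarkovModelTagger

train_sents = treebank.tagged_sents()[:300]    # labeled training data
test_sents = treebank.tagged_sents()[300:400]  # held-out evaluation data

hmm_tagger = HiddenMarkovModelTagger.train(train_sents)
hmm_tagger.tag('Stocks rose sharply today .'.split())
hmm_tagger.evaluate(test_sents)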

class nltk.tag.hmm.HiddenMarkovModelTagger(symbols, states, transitions, outputs, priors, transform=<function _identity at 0x11261e7c0>, **kwargs)[source]

Bases: nltk.tag.api.TaggerI

Hidden Markov model class, a generative model for labelling sequence data. These models define the joint probability of a sequence of symbols and their labels (state transitions) as the product of the starting state probability, the probability of each state transition, and the probability of each observation being generated from each state. This is described in more detail in the module documentation.

This implementation is based on the HMM description in Chapter 8, Huang, Acero and Hon, Spoken Language Processing and includes an extension for training shallow HMM parsers or specialized HMMs as in Molina et. al, 2002. A specialized HMM modifies training data by applying a specialization function to create a new training set that is more appropriate for sequential tagging with an HMM. A typical use case is chunking.

Parameters:
  • symbols (seq of any) – the set of output symbols (alphabet)
  • states (seq of any) – a set of states representing state space
  • transitions (ConditionalProbDistI) – transition probabilities; Pr(s_i | s_j) is the probability of a transition to state i given that the model is in state j
  • outputs (ConditionalProbDistI) – output probabilities; Pr(o_k | s_i) is the probability of emitting symbol k when entering state i
  • priors (ProbDistI) – initial state distribution; Pr(s_i) is the probability of starting in state i
  • transform (callable) – an optional function for transforming training instances, defaults to the identity function.
best_path(unlabeled_sequence)[source]

Returns the state sequence of the optimal (most probable) path through the HMM. Uses the Viterbi algorithm to calculate this part by dynamic programming.

Returns:the state sequence
Return type:sequence of any
Parameters:unlabeled_sequence (list) – the sequence of unlabeled symbols
best_path_simple(unlabeled_sequence)[source]

Returns the state sequence of the optimal (most probable) path through the HMM. Uses the Viterbi algorithm to calculate this part by dynamic programming. This uses a simple, direct method, and is included for teaching purposes.

Returns:the state sequence
Return type:sequence of any
Parameters:unlabeled_sequence (list) – the sequence of unlabeled symbols
entropy(unlabeled_sequence)[source]

Returns the entropy over labellings of the given sequence. This is given by:

H(O) = - sum_S Pr(S | O) log Pr(S | O)

where the summation ranges over all state sequences, S. Let Z = Pr(O) = sum_S Pr(S, O), where the summation ranges over all state sequences and O is the observation sequence. As such the entropy can be re-expressed as:

H = - sum_S Pr(S | O) log [ Pr(S, O) / Z ]
= log Z - sum_S Pr(S | O) log Pr(S, O)
= log Z - sum_S Pr(S | O) [ log Pr(S_0) + sum_t log Pr(S_t | S_{t-1}) + sum_t log Pr(O_t | S_t) ]

The order of summation for the log terms can be flipped, allowing dynamic programming to be used to calculate the entropy. Specifically, we use the forward and backward probabilities (alpha, beta) giving:

H = log Z - sum_s0 alpha_0(s0) beta_0(s0) / Z * log Pr(s0)
- sum_t,si,sj alpha_t(si) Pr(sj | si) Pr(O_t+1 | sj) beta_t(sj) / Z * log Pr(sj | si)
- sum_t,st alpha_t(st) beta_t(st) / Z * log Pr(O_t | st)

This simply uses alpha and beta to find the probabilities of partial sequences, constrained to include the given state(s) at some point in time.

log_probability(sequence)[source]

Returns the log-probability of the given symbol sequence. If the sequence is labelled, then returns the joint log-probability of the symbol, state sequence. Otherwise, uses the forward algorithm to find the log-probability over all label sequences.

Returns:the log-probability of the sequence
Return type:float
Parameters:sequence (Token) – the sequence of symbols which must contain the TEXT property, and optionally the TAG property
point_entropy(unlabeled_sequence)[source]

Returns the pointwise entropy over the possible states at each position in the chain, given the observation sequence.

probability(sequence)[source]

Returns the probability of the given symbol sequence. If the sequence is labelled, then returns the joint probability of the symbol, state sequence. Otherwise, uses the forward algorithm to find the probability over all label sequences.

Returns:the probability of the sequence
Return type:float
Parameters:sequence (Token) – the sequence of symbols which must contain the TEXT property, and optionally the TAG property
random_sample(rng, length)[source]

Randomly sample the HMM to generate a sentence of a given length. This samples the prior distribution then the observation distribution and transition distribution for each subsequent observation and state. This will mostly generate unintelligible garbage, but can provide some amusement.

Returns:

the randomly created state/observation sequence, generated according to the HMM’s probability distributions. The SUBTOKENS have TEXT and TAG properties containing the observation and state respectively.

Return type:

list

Parameters:
  • rng (Random (or any object with a random() method)) – random number generator
  • length (int) – desired output length
reset_cache()[source]
tag(unlabeled_sequence)[source]

Tags the sequence with the highest probability state sequence. This uses the best_path method to find the Viterbi path.

Returns:a labelled sequence of symbols
Return type:list
Parameters:unlabeled_sequence (list) – the sequence of unlabeled symbols
test(test_sequence, verbose=False, **kwargs)[source]

Tests the HiddenMarkovModelTagger instance.

Parameters:
  • test_sequence (list(list)) – a sequence of labeled test instances
  • verbose (bool) – boolean flag indicating whether training should be verbose or include printed output
classmethod train(labeled_sequence, test_sequence=None, unlabeled_sequence=None, **kwargs)[source]

Train a new HiddenMarkovModelTagger using the given labeled and unlabeled training instances. Testing will be performed if test instances are provided.

Returns:

a hidden markov model tagger

Return type:

HiddenMarkovModelTagger

Parameters:
  • labeled_sequence (list(list)) – a sequence of labeled training instances, i.e. a list of sentences represented as tuples
  • test_sequence (list(list)) – a sequence of labeled test instances
  • unlabeled_sequence (list(list)) – a sequence of unlabeled training instances, i.e. a list of sentences represented as words
  • transform (function) – an optional function for transforming training instances, defaults to the identity function, see transform()
  • estimator (class or function) – an optional function or class that maps a condition’s frequency distribution to its probability distribution, defaults to a Lidstone distribution with gamma = 0.1
  • verbose (bool) – boolean flag indicating whether training should be verbose or include printed output
  • max_iterations (int) – number of Baum-Welch iterations to perform
unicode_repr()
class nltk.tag.hmm.HiddenMarkovModelTrainer(states=None, symbols=None)[source]

Bases: builtins.object

Algorithms for learning HMM parameters from training data. These include both supervised learning (MLE) and unsupervised learning (Baum-Welch).

Creates an HMM trainer to induce an HMM with the given states and output symbol alphabet. Both supervised and unsupervised training methods may be used. If either the states or the symbols are not given, they may be derived from supervised training.

Parameters:
  • states (sequence of any) – the set of state labels
  • symbols (sequence of any) – the set of observation symbols
train(labeled_sequences=None, unlabeled_sequences=None, **kwargs)[source]

Trains the HMM using both (or either of) supervised and unsupervised techniques.

Returns:

the trained model

Return type:

HiddenMarkovModelTagger

Parameters:
  • labeled_sequences (list) – the supervised training data, a set of labelled sequences of observations
  • unlabeled_sequences (list) – the unsupervised training data, a set of sequences of observations
  • kwargs – additional arguments to pass to the training methods
train_supervised(labelled_sequences, **kwargs)[source]

Supervised training maximising the joint probability of the symbol and state sequences. This is done by collecting frequencies of transitions between states, of symbol observations while within each state, and of which states start a sentence. These frequency distributions are then normalised into probability estimates, which can be smoothed if desired.

Returns:

the trained model

Return type:

HiddenMarkovModelTagger

Parameters:
  • labelled_sequences (list) – the training data, a set of labelled sequences of observations
  • kwargs – may include an ‘estimator’ parameter, a function taking a FreqDist and a number of bins and returning a CProbDistI; otherwise a MLE estimate is used
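
For example (a sketch; the corpus and smoothing value are arbitrary), supervised training with a Lidstone-smoothed estimator might look like:

from nltk.corpus import brown
from nltk.probability import LidstoneProbDist
from nltk.tag.hmm import HiddenMarkovModelTrainer

train_sents = brown.tagged_sents(categories='news')[:500]

trainer = HiddenMarkovModelTrainer()   # states and symbols derived from the data
# A smoothed estimator; an MLE estimate is used if none is given.
estimator = lambda fd, bins: LidstoneProbDist(fd, 0.1, bins)
hmm_tagger = trainer.train_supervised(train_sents, estimator=estimator)
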
train_unsupervised(unlabeled_sequences, update_outputs=True, **kwargs)[source]

Trains the HMM using the Baum-Welch algorithm to maximise the probability of the data sequence. This is a variant of the EM algorithm, and is unsupervised in that it doesn’t need the state sequences for the symbols. The code is based on ‘A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition’, Lawrence Rabiner, IEEE, 1989.

Returns:the trained model
Return type:HiddenMarkovModelTagger
Parameters:unlabeled_sequences (list) – the training data, a set of sequences of observations

kwargs may include following parameters:

Parameters:
  • model – a HiddenMarkovModelTagger instance used to begin the Baum-Welch algorithm
  • max_iterations – the maximum number of EM iterations
  • convergence_logprob – the maximum change in log probability to allow convergence
nltk.tag.hmm.demo()[source]
nltk.tag.hmm.demo_bw()[source]
nltk.tag.hmm.demo_pos()[source]
nltk.tag.hmm.demo_pos_bw(test=10, supervised=20, unsupervised=10, verbose=True, max_iterations=5)[source]
nltk.tag.hmm.load_pos(num_sents)[source]
nltk.tag.hmm.logsumexp2(arr)[source]

nltk.tag.hunpos module

A module for interfacing with the HunPos open-source POS-tagger.

class nltk.tag.hunpos.HunposTagger(path_to_model, path_to_bin=None, encoding='ISO-8859-1', verbose=False)[source]

Bases: nltk.tag.api.TaggerI

A class for POS tagging with HunPos. The input is the paths to:
  • a model trained on training data
  • (optionally) the path to the hunpos-tag binary
  • (optionally) the encoding of the training data (default: ISO-8859-1)

Example:

>>> from nltk.tag.hunpos import HunposTagger
>>> ht = HunposTagger('english.model')
>>> ht.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'VB'), ('?', '.')]
>>> ht.close()

This class communicates with the hunpos-tag binary via pipes. When the tagger object is no longer needed, the close() method should be called to free system resources. The class supports the context manager interface; if used in a with statement, the close() method is invoked automatically:

>>> with HunposTagger('english.model') as ht:
...     ht.tag('What is the airspeed of an unladen swallow ?'.split())
...
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'NN'), ('swallow', 'VB'), ('?', '.')]
close()[source]

Closes the pipe to the hunpos executable.

tag(tokens)[source]

Tags a single sentence: a list of words. The tokens should not contain any newline characters.

nltk.tag.hunpos.setup_module(module)[source]

nltk.tag.mapping module

Interface for converting POS tags from various treebanks to the universal tagset of Petrov, Das, & McDonald.

The tagset consists of the following 12 coarse tags:

  • VERB - verbs (all tenses and modes)
  • NOUN - nouns (common and proper)
  • PRON - pronouns
  • ADJ - adjectives
  • ADV - adverbs
  • ADP - adpositions (prepositions and postpositions)
  • CONJ - conjunctions
  • DET - determiners
  • NUM - cardinal numbers
  • PRT - particles or other function words
  • X - other: foreign words, typos, abbreviations
  • . - punctuation

@see: http://arxiv.org/abs/1104.2086 and http://code.google.com/p/universal-pos-tags/

nltk.tag.mapping.map_tag(source, target, source_tag)[source]

Maps the tag from the source tagset to the target tagset.

>>> map_tag('en-ptb', 'universal', 'VBZ')
'VERB'
>>> map_tag('en-ptb', 'universal', 'VBP')
'VERB'
>>> map_tag('en-ptb', 'universal', '``')
'.'
nltk.tag.mapping.tagset_mapping(source, target)[source]

Retrieve the mapping dictionary between tagsets.

>>> tagset_mapping('ru-rnc', 'universal') == {'!': '.', 'A': 'ADJ', 'C': 'CONJ', 'AD': 'ADV',
... 'NN': 'NOUN', 'VG': 'VERB', 'COMP': 'CONJ', 'NC': 'NUM', 'VP': 'VERB', 'P': 'ADP',
... 'IJ': 'X', 'V': 'VERB', 'Z': 'X', 'VI': 'VERB', 'YES_NO_SENT': 'X', 'PTCL': 'PRT'}
True

nltk.tag.senna module

A module for interfacing with the SENNA pipeline.

class nltk.tag.senna.CHKTagger(path, encoding='utf-8')[source]

Bases: nltk.tag.senna.SennaTagger

A chunker.

The input is:
  • the path to the directory that contains the SENNA executables
  • (optionally) the encoding of the input data (default: utf-8)

Example:

>>> from nltk.tag.senna import CHKTagger
>>> chktagger = CHKTagger('/usr/share/senna-v2.0')
>>> chktagger.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', u'B-NP'), ('is', u'B-VP'), ('the', u'B-NP'), ('airspeed', u'I-NP'),
('of', u'B-PP'), ('an', u'B-NP'), ('unladen', u'I-NP'), ('swallow',u'I-NP'),
('?', u'O')]
batch_tag(sentences)[source]

Applies the tag method over a list of sentences. This method will return for each sentence a list of tuples of (word, tag).

exception nltk.tag.senna.Error[source]

Bases: builtins.Exception

Basic error handling class to be extended by the module-specific exceptions

exception nltk.tag.senna.ExecutableNotFound[source]

Bases: nltk.tag.senna.Error

Raised if the senna executable does not exist

class nltk.tag.senna.NERTagger(path, encoding='utf-8')[source]

Bases: nltk.tag.senna.SennaTagger

A named entity extractor.

The input is:
  • the path to the directory that contains the SENNA executables
  • (optionally) the encoding of the input data (default: utf-8)

Example:

>>> from nltk.tag.senna import NERTagger
>>> nertagger = NERTagger('/usr/share/senna-v2.0')
>>> nertagger.tag('Shakespeare theatre was in London .'.split())
[('Shakespeare', u'B-PER'), ('theatre', u'O'), ('was', u'O'), ('in', u'O'),
('London', u'B-LOC'), ('.', u'O')]
>>> nertagger.tag('UN headquarters are in NY , USA .'.split())
[('UN', u'B-ORG'), ('headquarters', u'O'), ('are', u'O'), ('in', u'O'),
('NY', u'B-LOC'), (',', u'O'), ('USA', u'B-LOC'), ('.', u'O')]
batch_tag(sentences)[source]

Applies the tag method over a list of sentences. This method will return for each sentence a list of tuples of (word, tag).

class nltk.tag.senna.POSTagger(path, encoding='utf-8')[source]

Bases: nltk.tag.senna.SennaTagger

A Part of Speech tagger.

The input is:
  • the path to the directory that contains the SENNA executables
  • (optionally) the encoding of the input data (default: utf-8)

Example:

>>> from nltk.tag.senna import POSTagger
>>> postagger = POSTagger('/usr/share/senna-v2.0')
>>> postagger.tag('What is the airspeed of an unladen swallow ?'.split())
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'),
('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
batch_tag(sentences)[source]

Applies the tag method over a list of sentences. This method will return for each sentence a list of tuples of (word, tag).

exception nltk.tag.senna.RunFailure[source]

Bases: nltk.tag.senna.Error

Raised if the pipeline fails to execute

class nltk.tag.senna.SennaTagger(senna_path, operations, encoding='utf-8')[source]

Bases: nltk.tag.api.TaggerI

A general interface of the SENNA pipeline that supports any of the operations specified in SUPPORTED_OPERATIONS.

Applying multiple operations at once has a speed advantage. For example, SENNA v2.0 calculates the POS tags while it is extracting the named entities, so applying both operations costs only the time of extracting the named entities.

The SENNA pipeline has a fixed maximum sentence size that it can read, 1024 tokens per sentence by default. If you have larger sentences, consider changing the MAX_SENTENCE_SIZE value in SENNA_main.c and rebuilding your system-specific binary; otherwise misalignment errors could be introduced.

The input is:
  • the path to the directory that contains the SENNA executables
  • the list of the operations to be performed
  • (optionally) the encoding of the input data (default: utf-8)

Example:

>>> from nltk.tag.senna import SennaTagger
>>> pipeline = SennaTagger('/usr/share/senna-v2.0', ['pos', 'chk', 'ner'])
>>> sent = u'Düsseldorf is an international business center'.split()
>>> pipeline.tag(sent)
[{'word': u'D\xfcsseldorf', 'chk': u'B-NP', 'ner': u'B-PER', 'pos': u'NNP'},
{'word': u'is', 'chk': u'B-VP', 'ner': u'O', 'pos': u'VBZ'},
{'word': u'an', 'chk': u'B-NP', 'ner': u'O', 'pos': u'DT'},
{'word': u'international', 'chk': u'I-NP', 'ner': u'O', 'pos': u'JJ'},
{'word': u'business', 'chk': u'I-NP', 'ner': u'O', 'pos': u'NN'},
{'word': u'center', 'chk': u'I-NP', 'ner': u'O','pos': u'NN'}]
SUPPORTED_OPERATIONS = ['pos', 'chk', 'ner']
batch_tag(sentences)[source]

Applies the tag method over a list of sentences. This method will return a list of dictionaries. Every dictionary will contain a word with its calculated annotations/tags.

executable None[source]

A property that determines the system-specific binary that should be used in the pipeline. If the system is not known, the senna binary will be used.

tag(tokens)[source]

Applies the specified operation(s) on a list of tokens.

exception nltk.tag.senna.SentenceMisalignment[source]

Bases: nltk.tag.senna.Error

Raised if the new sentence is shorter than the original one, or if the number of sentences in the result is less than in the input.

nltk.tag.senna.setup_module(module)[source]

nltk.tag.sequential module

Classes for tagging sentences sequentially, left to right. The abstract base class SequentialBackoffTagger serves as the base class for all the taggers in this module. Tagging of individual words is performed by the method choose_tag(), which is defined by subclasses of SequentialBackoffTagger. If a tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted instead. Any SequentialBackoffTagger may serve as a backoff tagger for any other SequentialBackoffTagger.
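
For example (training data and sentence chosen only for illustration), taggers from this module are commonly chained together through their backoff arguments:

from nltk.corpus import brown
from nltk.tag.sequential import DefaultTagger, UnigramTagger, BigramTagger

train_sents = brown.tagged_sents(categories='news')[:500]

# Each tagger defers to its backoff when it cannot determine a tag.
t0 = DefaultTagger('NN')
t1 = UnigramTagger(train_sents, backoff=t0)
t2 = BigramTagger(train_sents, backoff=t1)

t2.tag('The jury said it will investigate .'.split())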

class nltk.tag.sequential.AffixTagger(train=None, model=None, affix_length=-3, min_stem_length=2, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.ContextTagger, yaml.YAMLObject

A tagger that chooses a token’s tag based on a leading or trailing substring of its word string. (It is important to note that these substrings are not necessarily “true” morphological affixes). In particular, a fixed-length substring of the word is looked up in a table, and the corresponding tag is returned. Affix taggers are typically constructed by training them on a tagged corpus.

Construct a new affix tagger.

Parameters:
  • affix_length – The length of the affixes that should be considered during training and tagging. Use negative numbers for suffixes.
  • min_stem_length – Any words whose length is less than min_stem_length+abs(affix_length) will be assigned a tag of None by this tagger.
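
For example (a sketch; the corpus and backoff choice are illustrative), a suffix-based tagger can be trained as follows:

from nltk.corpus import brown
from nltk.tag.sequential import AffixTagger, DefaultTagger

train_sents = brown.tagged_sents(categories='news')[:500]

# Tag each word by its last three characters; words that are too short or
# have an unseen suffix fall back on the default tag.
suffix_tagger = AffixTagger(train_sents, affix_length=-3,
                            backoff=DefaultTagger('NN'))
suffix_tagger.tag('The investigation produced no evidence .'.split())
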
context(tokens, index, history)[source]
yaml_tag = '!nltk.AffixTagger'
class nltk.tag.sequential.BigramTagger(train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.NgramTagger

A tagger that chooses a token’s tag based on its word string and on the preceding word’s tag. In particular, a tuple consisting of the previous tag and the word is looked up in a table, and the corresponding tag is returned.

Parameters:
  • train (list(list(tuple(str, str)))) – The corpus of training data, a list of tagged sentences
  • model (dict) – The tagger model
  • backoff (TaggerI) – Another tagger which this tagger will consult when it is unable to tag a word
  • cutoff (int) – The number of instances of training data the tagger must see in order not to use the backoff tagger
yaml_tag = '!nltk.BigramTagger'
class nltk.tag.sequential.ClassifierBasedPOSTagger(feature_detector=None, train=None, classifier_builder=<function train at 0x1124e62f8>, classifier=None, backoff=None, cutoff_prob=None, verbose=False)[source]

Bases: nltk.tag.sequential.ClassifierBasedTagger

A classifier based part of speech tagger.

feature_detector(tokens, index, history)[source]
class nltk.tag.sequential.ClassifierBasedTagger(feature_detector=None, train=None, classifier_builder=<function train at 0x1124e62f8>, classifier=None, backoff=None, cutoff_prob=None, verbose=False)[source]

Bases: nltk.tag.sequential.SequentialBackoffTagger, nltk.tag.api.FeaturesetTaggerI

A sequential tagger that uses a classifier to choose the tag for each token in a sentence. The featureset input for the classifier is generated by a feature detector function:

feature_detector(tokens, index, history) -> featureset

Where tokens is the list of unlabeled tokens in the sentence; index is the index of the token for which feature detection should be performed; and history is a list of the tags for all tokens before index.

Construct a new classifier-based sequential tagger.

Parameters:
  • feature_detector – A function used to generate the featureset input for the classifier: feature_detector(tokens, index, history) -> featureset
  • train – A tagged corpus consisting of a list of tagged sentences, where each sentence is a list of (word, tag) tuples.
  • backoff – A backoff tagger, to be used by the new tagger if it encounters an unknown context.
  • classifier_builder – A function used to train a new classifier based on the data in train. It should take one argument, a list of labeled featuresets (i.e., (featureset, label) tuples).
  • classifier – The classifier that should be used by the tagger. This is only useful if you want to manually construct the classifier; normally, you would use train instead.
  • backoff – A backoff tagger, used if this tagger is unable to determine a tag for a given token.
  • cutoff_prob – If specified, then this tagger will fall back on its backoff tagger if the probability of the most likely tag is less than cutoff_prob.
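
A minimal sketch of such a feature detector and of constructing the tagger (the feature names are illustrative; the default classifier_builder is used):

from nltk.corpus import brown
from nltk.tag.sequential import ClassifierBasedTagger

def simple_features(tokens, index, history):
    # tokens: the unlabeled words; history: tags already assigned to tokens[:index]
    return {
        'word': tokens[index].lower(),
        'suffix(3)': tokens[index][-3:].lower(),
        'prev-tag': history[index - 1] if index > 0 else '<START>',
    }

train_sents = brown.tagged_sents(categories='news')[:500]
tagger = ClassifierBasedTagger(feature_detector=simple_features, train=train_sents)
tagger.tag('The jury said it will investigate .'.split())
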
choose_tag(tokens, index, history)[source]
classifier()[source]

Return the classifier that this tagger uses to choose a tag for each word in a sentence. The input for this classifier is generated using this tagger’s feature detector. See feature_detector()

feature_detector(tokens, index, history)[source]

Return the feature detector that this tagger uses to generate featuresets for its classifier. The feature detector is a function with the signature:

feature_detector(tokens, index, history) -> featureset

See classifier()

unicode_repr()
class nltk.tag.sequential.ContextTagger(context_to_tag, backoff=None)[source]

Bases: nltk.tag.sequential.SequentialBackoffTagger

An abstract base class for sequential backoff taggers that choose a tag for a token based on the value of its “context”. Different subclasses are used to define different contexts.

A ContextTagger chooses the tag for a token by calculating the token’s context, and looking up the corresponding tag in a table. This table can be constructed manually; or it can be automatically constructed based on a training corpus, using the _train() factory method.

Variables:_context_to_tag – Dictionary mapping contexts to tags.
choose_tag(tokens, index, history)[source]
context(tokens, index, history)[source]
Returns:the context that should be used to look up the tag for the specified token; or None if the specified token should not be handled by this tagger.
Return type:(hashable)
size()[source]
Returns:The number of entries in the table used by this tagger to map from contexts to tags.
unicode_repr()
class nltk.tag.sequential.DefaultTagger(tag)[source]

Bases: nltk.tag.sequential.SequentialBackoffTagger, yaml.YAMLObject

A tagger that assigns the same tag to every token.

>>> from nltk.tag.sequential import DefaultTagger
>>> default_tagger = DefaultTagger('NN')
>>> list(default_tagger.tag('This is a test'.split()))
[('This', 'NN'), ('is', 'NN'), ('a', 'NN'), ('test', 'NN')]

This tagger is recommended as a backoff tagger, in cases where a more powerful tagger is unable to assign a tag to the word (e.g. because the word was not seen during training).

Parameters:tag (str) – The tag to assign to each token
choose_tag(tokens, index, history)[source]
unicode_repr()
yaml_tag = '!nltk.DefaultTagger'
class nltk.tag.sequential.NgramTagger(n, train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.ContextTagger, yaml.YAMLObject

A tagger that chooses a token’s tag based on its word string and on the tags of the preceding n-1 words. In particular, a tuple (tags[i-n:i-1], words[i]) is looked up in a table, and the corresponding tag is returned. N-gram taggers are typically trained on a tagged corpus.

Train a new NgramTagger using the given training data or the supplied model. In particular, construct a new tagger whose table maps from each context (tag[i-n:i-1], word[i]) to the most frequent tag for that context. But exclude any contexts that are already tagged perfectly by the backoff tagger.

Parameters:
  • train – A tagged corpus consisting of a list of tagged sentences, where each sentence is a list of (word, tag) tuples.
  • backoff – A backoff tagger, to be used by the new tagger if it encounters an unknown context.
  • cutoff – If the most likely tag for a context occurs fewer than cutoff times, then exclude it from the context-to-tag table for the new tagger.
context(tokens, index, history)[source]
yaml_tag = '!nltk.NgramTagger'
class nltk.tag.sequential.RegexpTagger(regexps, backoff=None)[source]

Bases: nltk.tag.sequential.SequentialBackoffTagger, yaml.YAMLObject

Regular Expression Tagger

The RegexpTagger assigns tags to tokens by comparing their word strings to a series of regular expressions. The following tagger uses word suffixes to make guesses about the correct Brown Corpus part of speech tag:

>>> from nltk.corpus import brown
>>> from nltk.tag.sequential import RegexpTagger
>>> test_sent = brown.sents(categories='news')[0]
>>> regexp_tagger = RegexpTagger(
...     [(r'^-?[0-9]+(.[0-9]+)?$', 'CD'),   # cardinal numbers
...      (r'(The|the|A|a|An|an)$', 'AT'),   # articles
...      (r'.*able$', 'JJ'),                # adjectives
...      (r'.*ness$', 'NN'),                # nouns formed from adjectives
...      (r'.*ly$', 'RB'),                  # adverbs
...      (r'.*s$', 'NNS'),                  # plural nouns
...      (r'.*ing$', 'VBG'),                # gerunds
...      (r'.*ed$', 'VBD'),                 # past tense verbs
...      (r'.*', 'NN')                      # nouns (default)
... ])
>>> regexp_tagger
<Regexp Tagger: size=9>
>>> regexp_tagger.tag(test_sent)
[('The', 'AT'), ('Fulton', 'NN'), ('County', 'NN'), ('Grand', 'NN'), ('Jury', 'NN'),
('said', 'NN'), ('Friday', 'NN'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'NN'),
("Atlanta's", 'NNS'), ('recent', 'NN'), ('primary', 'NN'), ('election', 'NN'),
('produced', 'VBD'), ('``', 'NN'), ('no', 'NN'), ('evidence', 'NN'), ("''", 'NN'),
('that', 'NN'), ('any', 'NN'), ('irregularities', 'NNS'), ('took', 'NN'),
('place', 'NN'), ('.', 'NN')]
Parameters:regexps (list(tuple(str, str))) – A list of (regexp, tag) pairs, each of which indicates that a word matching regexp should be tagged with tag. The pairs will be evaluated in order. If none of the regexps match a word, the optional backoff tagger is consulted; if there is no backoff tagger, the word is assigned the tag None.
choose_tag(tokens, index, history)[source]
unicode_repr()
yaml_tag = '!nltk.RegexpTagger'
class nltk.tag.sequential.SequentialBackoffTagger(backoff=None)[source]

Bases: nltk.tag.api.TaggerI

An abstract base class for taggers that tag words sequentially, left to right. Tagging of individual words is performed by the choose_tag() method, which should be defined by subclasses. If a tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.

Variables:_taggers – A list of all the taggers that should be tried to tag a token (i.e., self and its backoff taggers).
backoff None[source]

The backoff tagger for this tagger.

choose_tag(tokens, index, history)[source]

Decide which tag should be used for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, return None – do not consult the backoff tagger. This method should be overridden by subclasses of SequentialBackoffTagger.

Return type:

str

Parameters:
  • tokens (list) – The list of words that are being tagged.
  • index (int) – The index of the word whose tag should be returned.
  • history (list(str)) – A list of the tags for all words before index.
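
A minimal sketch of a subclass (the class name and lookup table are hypothetical): returning None for unknown words is what triggers consultation of the backoff tagger in tag_one().

from nltk.tag.sequential import SequentialBackoffTagger, DefaultTagger

class LookupTagger(SequentialBackoffTagger):
    """A hypothetical tagger backed by a fixed word -> tag dictionary."""
    def __init__(self, table, backoff=None):
        SequentialBackoffTagger.__init__(self, backoff)
        self._table = table

    def choose_tag(self, tokens, index, history):
        # None means "no decision"; tag_one() will then try the backoff tagger.
        return self._table.get(tokens[index])

tagger = LookupTagger({'the': 'AT', 'dog': 'NN'}, backoff=DefaultTagger('NN'))
tagger.tag('the dog barked'.split())
# [('the', 'AT'), ('dog', 'NN'), ('barked', 'NN')]
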
tag(tokens)[source]
tag_one(tokens, index, history)[source]

Determine an appropriate tag for the specified token, and return that tag. If this tagger is unable to determine a tag for the specified token, then its backoff tagger is consulted.

Return type:

str

Parameters:
  • tokens (list) – The list of words that are being tagged.
  • index (int) – The index of the word whose tag should be returned.
  • history (list(str)) – A list of the tags for all words before index.
class nltk.tag.sequential.TrigramTagger(train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.NgramTagger

A tagger that chooses a token’s tag based on its word string and on the preceding two words’ tags. In particular, a tuple consisting of the previous two tags and the word is looked up in a table, and the corresponding tag is returned.

Parameters:
  • train (list(list(tuple(str, str)))) – The corpus of training data, a list of tagged sentences
  • model (dict) – The tagger model
  • backoff (TaggerI) – Another tagger which this tagger will consult when it is unable to tag a word
  • cutoff (int) – The number of instances of training data the tagger must see in order not to use the backoff tagger
yaml_tag = '!nltk.TrigramTagger'
class nltk.tag.sequential.UnigramTagger(train=None, model=None, backoff=None, cutoff=0, verbose=False)[source]

Bases: nltk.tag.sequential.NgramTagger

Unigram Tagger

The UnigramTagger finds the most likely tag for each word in a training corpus, and then uses that information to assign tags to new tokens.

>>> from nltk.corpus import brown
>>> from nltk.tag.sequential import UnigramTagger
>>> test_sent = brown.sents(categories='news')[0]
>>> unigram_tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> for tok, tag in unigram_tagger.tag(test_sent):
...     print("(%s, %s), " % (tok, tag))
(The, AT), (Fulton, NP-TL), (County, NN-TL), (Grand, JJ-TL),
(Jury, NN-TL), (said, VBD), (Friday, NR), (an, AT),
(investigation, NN), (of, IN), (Atlanta's, NP$), (recent, JJ),
(primary, NN), (election, NN), (produced, VBD), (``, ``),
(no, AT), (evidence, NN), ('', ''), (that, CS), (any, DTI),
(irregularities, NNS), (took, VBD), (place, NN), (., .),
Parameters:
  • train (list(list(tuple(str, str)))) – The corpus of training data, a list of tagged sentences
  • model (dict) – The tagger model
  • backoff (TaggerI) – Another tagger which this tagger will consult when it is unable to tag a word
  • cutoff (int) – The number of instances of training data the tagger must see in order not to use the backoff tagger
context(tokens, index, history)[source]
yaml_tag = '!nltk.UnigramTagger'

nltk.tag.stanford module

A module for interfacing with the Stanford taggers.

class nltk.tag.stanford.NERTagger(*args, **kwargs)[source]

Bases: nltk.tag.stanford.StanfordTagger

A class for NER (named entity recognition) tagging with the Stanford Tagger. The input is the paths to:

  • a model trained on training data
  • (optionally) the path to the stanford tagger jar file. If not specified here, then this jar file must be specified in the CLASSPATH environment variable.
  • (optionally) the encoding of the training data (default: ASCII)

Example:

>>> from nltk.tag.stanford import NERTagger
>>> st = NERTagger('/usr/share/stanford-ner/classifiers/all.3class.distsim.crf.ser.gz',
...                '/usr/share/stanford-ner/stanford-ner.jar') 
>>> st.tag('Rami Eid is studying at Stony Brook University in NY'.split()) 
[('Rami', 'PERSON'), ('Eid', 'PERSON'), ('is', 'O'), ('studying', 'O'),
 ('at', 'O'), ('Stony', 'ORGANIZATION'), ('Brook', 'ORGANIZATION'),
 ('University', 'ORGANIZATION'), ('in', 'O'), ('NY', 'LOCATION')]
parse_output(text)[source]
class nltk.tag.stanford.POSTagger(*args, **kwargs)[source]

Bases: nltk.tag.stanford.StanfordTagger

A class for POS tagging with the Stanford Tagger. The input is the paths to:
  • a model trained on training data
  • (optionally) the path to the stanford tagger jar file. If not specified here, then this jar file must be specified in the CLASSPATH environment variable.
  • (optionally) the encoding of the training data (default: ASCII)

Example:

>>> from nltk.tag.stanford import POSTagger
>>> st = POSTagger('/usr/share/stanford-postagger/models/english-bidirectional-distsim.tagger',
...                '/usr/share/stanford-postagger/stanford-postagger.jar') 
>>> st.tag('What is the airspeed of an unladen swallow ?'.split()) 
[('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('of', 'IN'), ('an', 'DT'), ('unladen', 'JJ'), ('swallow', 'VB'), ('?', '.')]
class nltk.tag.stanford.StanfordTagger(path_to_model, path_to_jar=None, encoding='ascii', verbose=False, java_options='-mx1000m')[source]

Bases: nltk.tag.api.TaggerI

An interface to Stanford taggers. Subclasses must define:

  • _cmd property: A property that returns the command that will be executed.
  • _SEPARATOR: Class constant that represents the character that is used to separate the tokens from their tags.
  • _JAR file: Class constant that represents the jar file name.
batch_tag(sentences)[source]
parse_output(text)[source]
tag(tokens)[source]
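
For example, batch_tag() can reuse the POSTagger instance st from the example above to tag several sentences with a single call; the exact tags depend on the model that was loaded:

>>> st.batch_tag([['The', 'quick', 'brown', 'fox'],
...               ['What', 'is', 'the', 'airspeed', '?']]) 
[[('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN')],
 [('What', 'WP'), ('is', 'VBZ'), ('the', 'DT'), ('airspeed', 'NN'), ('?', '.')]]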

nltk.tag.tnt module

Implementation of ‘TnT - A Statistical Part of Speech Tagger’ by Thorsten Brants

http://acl.ldc.upenn.edu/A/A00/A00-1031.pdf

class nltk.tag.tnt.TnT(unk=None, Trained=False, N=1000, C=False)[source]

Bases: nltk.tag.api.TaggerI

TnT - Statistical POS tagger

IMPORTANT NOTES:

  • DOES NOT AUTOMATICALLY DEAL WITH UNSEEN WORDS
    • It is possible to provide an untrained POS tagger to create tags for unknown words; see the __init__ function
  • SHOULD BE USED WITH SENTENCE-DELIMITED INPUT
    • Due to the nature of this tagger, it works best when trained over sentence-delimited input.
    • However, it still produces good results if the training and testing data are split on all punctuation, e.g. [,.?!]
    • Input for training is expected to be a list of sentences, where each sentence is a list of (word, tag) tuples
    • Input for the tag function is a single sentence; input for the tagdata function is a list of sentences; the output has the corresponding form
  • A function is provided to process unsegmented text
    • Please see basic_sent_chop()

TnT uses a second order Markov model to produce tags for a sequence of input, specifically:

argmax [Proj(P(t_i|t_i-1,t_i-2)P(w_i|t_i))] P(t_T+1 | t_T)

i.e. the tag sequence that maximizes the product (Proj) of these probabilities

The set of possible tags for a given word is derived from the training data. It is the set of all tags that exact word has been assigned.

To speed up the computation and improve numerical precision, log addition can be used instead of multiplication, specifically:

argmax [Sigma(log(P(t_i|t_i-1,t_i-2))+log(P(w_i|t_i)))] +
log(P(t_T+1|t_T))

The probability of a tag for a given word is the linear interpolation of three Markov models: a zero-order, a first-order, and a second-order model.

P(t_i| t_i-1, t_i-2) = l1*P(t_i) + l2*P(t_i| t_i-1) +
l3*P(t_i| t_i-1, t_i-2)

A beam search is used to limit the memory usage of the algorithm. The degree of the beam can be changed using N in the initialization. N represents the maximum number of possible solutions to maintain while tagging.

It is possible to differentiate the tags assigned to capitalized words; however, this does not result in a significant gain in accuracy.
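
A minimal training and tagging session (the training slice is illustrative; as noted above, unknown words are only handled if a tagger is supplied via the unk argument, here a simple DefaultTagger):

>>> from nltk.corpus import brown
>>> from nltk.tag import tnt, DefaultTagger
>>> train = brown.tagged_sents(categories='news')[:500]
>>> tnt_tagger = tnt.TnT(unk=DefaultTagger('NN'), Trained=True)  # Trained=True: the unk tagger needs no training
>>> tnt_tagger.train(train)
>>> tagged = tnt_tagger.tag(brown.sents(categories='news')[501])  # tags depend on the training data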

tag(data)[source]

Tags a single sentence

Parameters:data ([string,]) – list of words
Returns:[(word, tag),]

Calls the recursive function ‘_tagword’ to produce a list of tags, associates the returned tags with the corresponding words in the input sequence, and returns a list of (word, tag) tuples.

tagdata(data)[source]

Tags each sentence in a list of sentences

Parameters:data ([[string,],]) – list of lists of words
Returns:list of lists of (word, tag) tuples

Invokes the tag(sent) function for each sentence and compiles the results into a list of tagged sentences; each tagged sentence is a list of (word, tag) tuples.

train(data)[source]

Uses a set of tagged data to train the tagger. If an unknown word tagger is specified, it is trained on the same data.

Parameters:data (list(list(tuple(str, str)))) – List of lists of (word, tag) tuples
nltk.tag.tnt.basic_sent_chop(data, raw=True)[source]

Basic method for tokenizing input into sentences for this tagger:

Parameters:
  • data (list(str) or list(tuple(str, str))) – list of tokens (words or (word, tag) tuples)
  • raw (bool) – boolean flag marking the input data as a list of words or a list of tagged words
Returns:a list of sentences; each sentence is a list of tokens, and the tokens have the same type as the input

This function takes a list of tokens and separates them into lists, where each list represents a sentence fragment. It can separate both tagged and raw sequences into basic sentences.

Sentence markers are the set of [,.!?]

This is a simple method which enhances the performance of the TnT tagger. Better sentence tokenization will further enhance the results.
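
A short, illustrative call on raw tokens (the token list is made up; sentence markers are kept with the sentence they close):

>>> from nltk.tag.tnt import basic_sent_chop
>>> basic_sent_chop(['It', 'works', '.', 'Does', 'it', '?'], raw=True)
[['It', 'works', '.'], ['Does', 'it', '?']]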

nltk.tag.tnt.demo()[source]
nltk.tag.tnt.demo2()[source]
nltk.tag.tnt.demo3()[source]

nltk.tag.util module

nltk.tag.util.str2tuple(s, sep='/')[source]

Given the string representation of a tagged token, return the corresponding tuple representation. The rightmost occurrence of sep in s will be used to divide s into a word string and a tag string. If sep does not occur in s, return (s, None).

>>> from nltk.tag.util import str2tuple
>>> str2tuple('fly/NN')
('fly', 'NN')
Parameters:
  • s (str) – The string representation of a tagged token.
  • sep (str) – The separator string used to separate word strings from tags.
nltk.tag.util.tuple2str(tagged_token, sep='/')[source]

Given the tuple representation of a tagged token, return the corresponding string representation. This representation is formed by concatenating the token’s word string, followed by the separator, followed by the token’s tag. (If the tag is None, then just return the bare word string.)

>>> from nltk.tag.util import tuple2str
>>> tagged_token = ('fly', 'NN')
>>> tuple2str(tagged_token)
'fly/NN'
Parameters:
  • tagged_token (tuple(str, str)) – The tuple representation of a tagged token.
  • sep (str) – The separator string used to separate word strings from tags.
nltk.tag.util.untag(tagged_sentence)[source]

Given a tagged sentence, return an untagged version of that sentence. I.e., return a list containing the first element of each tuple in tagged_sentence.

>>> from nltk.tag.util import untag
>>> untag([('John', 'NNP'), ('saw', 'VBD'), ('Mary', 'NNP')])
['John', 'saw', 'Mary']

Module contents

NLTK Taggers

This package contains classes and interfaces for part-of-speech tagging, or simply “tagging”.

A “tag” is a case-sensitive string that specifies some property of a token, such as its part of speech. Tagged tokens are encoded as tuples (tag, token). For example, the following tagged token combines the word 'fly' with a noun part of speech tag ('NN'):

>>> tagged_tok = ('fly', 'NN')

An off-the-shelf tagger is available. It uses the Penn Treebank tagset:

>>> from nltk.tag import pos_tag  
>>> from nltk.tokenize import word_tokenize 
>>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) 
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
('.', '.')]

This package defines several taggers, which take a token list (typically a sentence), assign a tag to each token, and return the resulting list of tagged tokens. Most of the taggers are built automatically based on a training corpus. For example, the unigram tagger tags each word w by checking what the most frequent tag for w was in a training corpus:

>>> from nltk.corpus import brown
>>> from nltk.tag import UnigramTagger
>>> tagger = UnigramTagger(brown.tagged_sents(categories='news')[:500])
>>> sent = ['Mitchell', 'decried', 'the', 'high', 'rate', 'of', 'unemployment']
>>> for word, tag in tagger.tag(sent):
...     print(word, '->', tag)
Mitchell -> NP
decried -> None
the -> AT
high -> JJ
rate -> NN
of -> IN
unemployment -> None

Note that words that the tagger has not seen during training receive a tag of None.

We evaluate a tagger on data that was not seen during training:

>>> tagger.evaluate(brown.tagged_sents(categories='news')[500:600])
0.734...

For more information, please consult chapter 5 of the NLTK Book.

nltk.tag.batch_pos_tag(sentences)[source]

Use NLTK’s currently recommended part of speech tagger to tag the given list of sentences, each consisting of a list of tokens.
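
A sketch of the expected call shape; the sentences are invented and the tags shown are typical Penn Treebank tags, which may vary with the recommended tagger in use:

>>> from nltk.tag import batch_pos_tag 
>>> batch_pos_tag([['This', 'is', 'one', 'sentence'], ['Here', 'is', 'another']]) 
[[('This', 'DT'), ('is', 'VBZ'), ('one', 'CD'), ('sentence', 'NN')],
 [('Here', 'RB'), ('is', 'VBZ'), ('another', 'DT')]]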

nltk.tag.pos_tag(tokens)[source]

Use NLTK’s currently recommended part of speech tagger to tag the given list of tokens.

>>> from nltk.tag import pos_tag 
>>> from nltk.tokenize import word_tokenize 
>>> pos_tag(word_tokenize("John's big idea isn't all that bad.")) 
[('John', 'NNP'), ("'s", 'POS'), ('big', 'JJ'), ('idea', 'NN'), ('is',
'VBZ'), ("n't", 'RB'), ('all', 'DT'), ('that', 'DT'), ('bad', 'JJ'),
('.', '.')]
Parameters:tokens (list(str)) – Sequence of tokens to be tagged
Returns:The tagged tokens
Return type:list(tuple(str, str))