nltk.tokenize package

Submodules

nltk.tokenize.api module

Tokenizer Interface

class nltk.tokenize.api.StringTokenizer[source]

Bases: nltk.tokenize.api.TokenizerI

A tokenizer that divides a string into substrings by splitting on the specified string (defined in subclasses).

span_tokenize(s)[source]
tokenize(s)[source]
class nltk.tokenize.api.TokenizerI[source]

Bases: builtins.object

A processing interface for tokenizing a string. Subclasses must define tokenize() or batch_tokenize() (or both).

batch_span_tokenize(strings)[source]

Apply self.span_tokenize() to each element of strings. I.e.:

return [self.span_tokenize(s) for s in strings]
Return type: iter(list(tuple(int, int)))
batch_tokenize(strings)[source]

Apply self.tokenize() to each element of strings. I.e.:

return [self.tokenize(s) for s in strings]
Return type: list(list(str))
span_tokenize(s)[source]

Identify the tokens using integer offsets (start_i, end_i), where s[start_i:end_i] is the corresponding token.

Return type: iter(tuple(int, int))
tokenize(s)[source]

Return a tokenized copy of s.

Return type: list of str
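
A new tokenizer can be implemented by subclassing TokenizerI and defining tokenize(); batch_tokenize() is then derived automatically, while span_tokenize() can be overridden when character offsets are needed. A minimal sketch (CommaTokenizer is illustrative only, not part of NLTK):

>>> from nltk.tokenize.api import TokenizerI
>>> class CommaTokenizer(TokenizerI):  # illustrative, not part of NLTK
...     def tokenize(self, s):
...         return [tok.strip() for tok in s.split(',')]
>>> CommaTokenizer().tokenize('eggs, milk, bread')
['eggs', 'milk', 'bread']
>>> CommaTokenizer().batch_tokenize(['a, b', 'c'])
[['a', 'b'], ['c']]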

nltk.tokenize.punkt module

Punkt Sentence Tokenizer

This tokenizer divides a text into a list of sentences, by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used.

The NLTK data package includes a pre-trained Punkt tokenizer for English.

>>> import nltk.data
>>> text = '''
... Punkt knows that the periods in Mr. Smith and Johann S. Bach
... do not mark sentence boundaries.  And sometimes sentences
... can start with non-capitalized words.  i is a good variable
... name.
... '''
>>> sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
>>> print('\n-----\n'.join(sent_detector.tokenize(text.strip())))
Punkt knows that the periods in Mr. Smith and Johann S. Bach
do not mark sentence boundaries.
-----
And sometimes sentences
can start with non-capitalized words.
-----
i is a good variable
name.

(Note that whitespace from the original text, including newlines, is retained in the output.)

Punctuation following sentences is also included by default (from NLTK 3.0 onwards). It can be excluded with the realign_boundaries flag.

>>> text = '''
... (How does it deal with this parenthesis?)  "It should be part of the
... previous sentence." "(And the same with this one.)" ('And this one!')
... "('(And (this)) '?)" [(and this.)]
... '''
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip())))
(How does it deal with this parenthesis?)
-----
"It should be part of the
previous sentence."
-----
"(And the same with this one.)"
-----
('And this one!')
-----
"('(And (this)) '?)"
-----
[(and this.)]
>>> print('\n-----\n'.join(
...     sent_detector.tokenize(text.strip(), realign_boundaries=False)))
(How does it deal with this parenthesis?
-----
)  "It should be part of the
previous sentence.
-----
" "(And the same with this one.
-----
)" ('And this one!
-----
')
"('(And (this)) '?
-----
)" [(and this.
-----
)]

However, Punkt is designed to learn parameters (a list of abbreviations, etc.) unsupervised from a corpus similar to the target domain. The pre-packaged models may therefore be unsuitable: use PunktSentenceTokenizer(text) to learn parameters from the given text.
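
For example, a tokenizer can be trained directly on in-domain text (a sketch; domain_corpus.txt is a hypothetical plaintext file, and the resulting segmentation depends on what the model learns from it):

>>> from nltk.tokenize.punkt import PunktSentenceTokenizer
>>> with open('domain_corpus.txt') as f:  # hypothetical in-domain plaintext
...     domain_tokenizer = PunktSentenceTokenizer(f.read())
>>> sentences = domain_tokenizer.tokenize('Prof. Jones arrived. The talk began.')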

PunktTrainer learns parameters such as a list of abbreviations (without supervision) from portions of text. Using a PunktTrainer directly allows for incremental training and modification of the hyper-parameters used to decide what is considered an abbreviation, etc.
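
A sketch of incremental training (text_part1 and text_part2 stand for hypothetical chunks of training text); the learned parameters can then be handed to a PunktSentenceTokenizer, which accepts a ready-made PunktParameters object in place of training text:

>>> from nltk.tokenize.punkt import PunktTrainer, PunktSentenceTokenizer
>>> trainer = PunktTrainer()
>>> trainer.train(text_part1, finalize=False)  # text_part1, text_part2: hypothetical
>>> trainer.train(text_part2, finalize=False)  # chunks of plaintext training data
>>> trainer.finalize_training()
>>> tokenizer = PunktSentenceTokenizer(trainer.get_params())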

PunktWordTokenizer uses a regular expression to divide a text into tokens, leaving all periods attached to words, but separating off other punctuation:

>>> from nltk.tokenize.punkt import PunktWordTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> PunktWordTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.', 'Please',
'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> PunktWordTokenizer().span_tokenize(s)
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44), 
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]

The algorithm for this tokenizer is described in:

Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
  Boundary Detection.  Computational Linguistics 32: 485-525.
class nltk.tokenize.punkt.PunktBaseClass(lang_vars=<nltk.tokenize.punkt.PunktLanguageVars object>, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>, params=<nltk.tokenize.punkt.PunktParameters object>)[source]

Bases: builtins.object

Includes common components of PunktTrainer and PunktSentenceTokenizer.

class nltk.tokenize.punkt.PunktLanguageVars[source]

Bases: builtins.object

Stores variables, mostly regular expressions, which may be language-dependent for correct application of the algorithm. An extension of this class may modify its properties to suit a language other than English; an instance can then be passed as an argument to PunktSentenceTokenizer and PunktTrainer constructors.
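
For example, a hypothetical subclass that adds an extra sentence-ending character (here the Devanagari danda, U+0964) and passes an instance to the sentence tokenizer:

>>> from nltk.tokenize.punkt import PunktLanguageVars, PunktSentenceTokenizer
>>> class DandaLanguageVars(PunktLanguageVars):  # hypothetical subclass
...     sent_end_chars = ('.', '?', '!', '\u0964')  # add the Devanagari danda
>>> tokenizer = PunktSentenceTokenizer(lang_vars=DandaLanguageVars())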

internal_punctuation = ',:;'

sentence internal punctuation, which indicates an abbreviation if preceded by a period-final token.

period_context_re()[source]

Compiles and returns a regular expression to find contexts including possible sentence boundaries.

re_boundary_realignment = <compiled regular expression>

Used to realign punctuation that should be included in a sentence although it follows the period (or ?, !).

sent_end_chars = ('.', '?', '!')

Characters which are candidates for sentence boundaries

word_tokenize(s)[source]

Tokenize a string to split off punctuation other than periods

class nltk.tokenize.punkt.PunktParameters[source]

Bases: builtins.object

Stores data used to perform sentence boundary detection with Punkt.

abbrev_types = None

A set of word types for known abbreviations.

add_ortho_context(typ, flag)[source]
clear_abbrevs()[source]
clear_collocations()[source]
clear_ortho_context()[source]
clear_sent_starters()[source]
collocations = None

A set of word type tuples for known common collocations where the first word ends in a period. E.g., (‘S.’, ‘Bach’) is a common collocation in a text that discusses ‘Johann S. Bach’. These count as negative evidence for sentence boundaries.

ortho_context = None

A dictionary mapping word types to the set of orthographic contexts that word type appears in. Contexts are represented by adding orthographic context flags: ...

sent_starters = None

A set of word types for words that often appear at the beginning of sentences.
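
Parameters can also be built by hand, e.g. to seed a tokenizer with known abbreviations (a sketch; abbreviation types are stored lowercased and without the trailing period):

>>> from nltk.tokenize.punkt import PunktParameters, PunktSentenceTokenizer
>>> params = PunktParameters()
>>> params.abbrev_types = set(['dr', 'mr', 'mrs', 'prof', 'vs'])  # lowercased, no final period
>>> tokenizer = PunktSentenceTokenizer(params)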

class nltk.tokenize.punkt.PunktSentenceTokenizer(train_text=None, verbose=False, lang_vars=<nltk.tokenize.punkt.PunktLanguageVars object>, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]

Bases: nltk.tokenize.punkt.PunktBaseClass, nltk.tokenize.api.TokenizerI

A sentence tokenizer which uses an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences; and then uses that model to find sentence boundaries. This approach has been shown to work well for many European languages.

PUNCTUATION = (';', ':', ',', '.', '!', '?')
debug_decisions(text)[source]

Classifies candidate periods as sentence breaks, yielding a dict for each that may be used to understand why the decision was made.

See format_debug_decision() to help make this output readable.

dump(tokens)[source]
sentences_from_text(text, realign_boundaries=True)[source]

Given a text, generates the sentences in that text, testing only candidate sentence breaks. If realign_boundaries is True, closing punctuation following the period is included in the sentence.

sentences_from_text_legacy(text)[source]

Given a text, generates the sentences in that text. Annotates all tokens, rather than just those with possible sentence breaks. Should produce the same results as sentences_from_text.

sentences_from_tokens(tokens)[source]

Given a sequence of tokens, generates lists of tokens, each list corresponding to a sentence.

span_tokenize(text)[source]

Given a text, returns a list of the (start, end) spans of sentences in the text.

text_contains_sentbreak(text)[source]

Returns True if the given text includes a sentence break.

tokenize(text, realign_boundaries=True)[source]

Given a text, returns a list of the sentences in that text.

train(train_text, verbose=False)[source]

Derives parameters from a given training text, or uses the parameters given. Repeated calls to this method destroy previous parameters. For incremental training, instantiate a separate PunktTrainer instance.

class nltk.tokenize.punkt.PunktToken(tok, **params)[source]

Bases: builtins.object

Stores a token of text with annotations produced during sentence boundary detection.

abbr None
ellipsis None
first_case None[source]
first_lower None[source]

True if the token’s first character is lowercase.

first_upper None[source]

True if the token’s first character is uppercase.

is_alpha None[source]

True if the token text is all alphabetic.

is_ellipsis None[source]

True if the token text is that of an ellipsis.

is_initial None[source]

True if the token text is that of an initial.

is_non_punct None[source]

True if the token is either a number or is alphabetic.

is_number None[source]

True if the token text is that of a number.

linestart None
parastart None
period_final None
sentbreak None
tok None
type None
type_no_period None[source]

The type with its final period removed if it has one.

type_no_sentperiod None[source]

The type with its final period removed if it is marked as a sentence break.

unicode_repr()

A string representation of the token that can reproduce it with eval(), which lists all the token’s non-default annotations.
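
A brief sketch of the annotations available on a token:

>>> from nltk.tokenize.punkt import PunktToken
>>> t = PunktToken('Mr.')
>>> t.period_final, t.first_upper, t.type_no_period
(True, True, 'mr')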

class nltk.tokenize.punkt.PunktTrainer(train_text=None, verbose=False, lang_vars=<nltk.tokenize.punkt.PunktLanguageVars object>, token_cls=<class 'nltk.tokenize.punkt.PunktToken'>)[source]

Bases: nltk.tokenize.punkt.PunktBaseClass

Learns parameters used in Punkt sentence boundary detection.

ABBREV = 0.3

cut-off value used to decide whether a ‘token’ is an abbreviation

ABBREV_BACKOFF = 5

upper cut-off for Mikheev’s (2002) abbreviation detection algorithm

COLLOCATION = 7.88

minimal log-likelihood value that two tokens need to be considered as a collocation

IGNORE_ABBREV_PENALTY = False

allows the disabling of the abbreviation penalty heuristic, which exponentially disadvantages words that are found at times without a final period.

INCLUDE_ABBREV_COLLOCS = False

this includes as potential collocations all word pairs where the first word is an abbreviation. Such collocations override the orthographic heuristic, but not the sentence starter heuristic. This is overridden by INCLUDE_ALL_COLLOCS, and if both are false, only collocations with initials and ordinals are considered.

INCLUDE_ALL_COLLOCS = False

this includes as potential collocations all word pairs where the first word ends in a period. It may be useful in corpora where there is a lot of variation that makes abbreviations like Mr difficult to identify.

MIN_COLLOC_FREQ = 1

this sets a minimum bound on the number of times a bigram needs to appear before it can be considered a collocation, in addition to log likelihood statistics. This is useful when INCLUDE_ALL_COLLOCS is True.

SENT_STARTER = 30

minimal log-likelihood value that a token requires to be considered as a frequent sentence starter
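
These cut-offs and flags are ordinary attributes and can be adjusted on a trainer instance before training (a brief sketch):

>>> from nltk.tokenize.punkt import PunktTrainer
>>> trainer = PunktTrainer()
>>> trainer.INCLUDE_ALL_COLLOCS = True  # consider all period-final word pairs
>>> trainer.MIN_COLLOC_FREQ = 10        # but require at least 10 occurrences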

finalize_training(verbose=False)[source]

Uses data that has been gathered in training to determine likely collocations and sentence starters.

find_abbrev_types()[source]

Recalculates abbreviations given type frequencies, despite no prior determination of abbreviations. This fails to include abbreviations otherwise found as “rare”.

freq_threshold(ortho_thresh=2, type_thresh=2, colloc_thres=2, sentstart_thresh=2)[source]

Allows memory use to be reduced after much training by removing data about rare tokens that are unlikely to have a statistical effect with further training. Entries occurring above the given thresholds will be retained.

get_params()[source]

Calculates and returns parameters for sentence boundary detection as derived from training.

train(text, verbose=False, finalize=True)[source]

Collects training data from a given text. If finalize is True, it will determine all the parameters for sentence boundary detection. If not, this will be delayed until get_params() or finalize_training() is called. If verbose is True, abbreviations found will be listed.

train_tokens(tokens, verbose=False, finalize=True)[source]

Collects training data from a given list of tokens.

class nltk.tokenize.punkt.PunktWordTokenizer(lang_vars=<nltk.tokenize.punkt.PunktLanguageVars object>)[source]

Bases: nltk.tokenize.api.TokenizerI

span_tokenize(text)[source]

Given a text, returns a list of the (start, end) spans of words in the text.

tokenize(text)[source]
nltk.tokenize.punkt.demo(text, tok_cls=<class 'nltk.tokenize.punkt.PunktSentenceTokenizer'>, train_cls=<class 'nltk.tokenize.punkt.PunktTrainer'>)[source]

Builds a punkt model and applies it to the same text

nltk.tokenize.punkt.format_debug_decision(d)[source]

nltk.tokenize.regexp module

Regular-Expression Tokenizers

A RegexpTokenizer splits a string into substrings using a regular expression. For example, the following tokenizer forms tokens out of alphabetic sequences, money expressions, and any other non-whitespace sequences:

>>> from nltk.tokenize import RegexpTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

A RegexpTokenizer can use its regexp to match delimiters instead:

>>> tokenizer = RegexpTokenizer('\s+', gaps=True)
>>> tokenizer.tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']

Note that empty tokens are not returned when the delimiter appears at the start or end of the string.

The material between the tokens is discarded. For example, the following tokenizer selects just the capitalized words:

>>> capword_tokenizer = RegexpTokenizer('[A-Z]\w+')
>>> capword_tokenizer.tokenize(s)
['Good', 'New', 'York', 'Please', 'Thanks']

This module contains several subclasses of RegexpTokenizer that use pre-defined regular expressions.

>>> from nltk.tokenize import BlanklineTokenizer
>>> # Uses '\s*\n\s*\n\s*':
>>> BlanklineTokenizer().tokenize(s)
['Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.',
'Thanks.']

All of the regular expression tokenizers are also available as functions:

>>> from nltk.tokenize import regexp_tokenize, wordpunct_tokenize, blankline_tokenize
>>> regexp_tokenize(s, pattern='\w+|\$[\d\.]+|\S+')
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
 '.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> blankline_tokenize(s)
['Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.', 'Thanks.']

Caution: The function regexp_tokenize() takes the text as its first argument, and the regular expression pattern as its second argument. This differs from the conventions used by Python’s re functions, where the pattern is always the first argument. (This is for consistency with the other NLTK tokenizers.)
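
For instance, a brief illustration of the same pattern used with both conventions:

>>> import re
>>> from nltk.tokenize import regexp_tokenize
>>> regexp_tokenize('muffins cost $3.88', r'\$[\d\.]+')  # text first, then pattern
['$3.88']
>>> re.findall(r'\$[\d\.]+', 'muffins cost $3.88')       # pattern first, then text
['$3.88']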

class nltk.tokenize.regexp.BlanklineTokenizer[source]

Bases: nltk.tokenize.regexp.RegexpTokenizer

Tokenize a string, treating any sequence of blank lines as a delimiter. Blank lines are defined as lines containing no characters, except for space or tab characters.

class nltk.tokenize.regexp.RegexpTokenizer(pattern, gaps=False, discard_empty=True, flags=56)[source]

Bases: nltk.tokenize.api.TokenizerI

A tokenizer that splits a string using a regular expression, which matches either the tokens or the separators between tokens.

>>> tokenizer = RegexpTokenizer('\w+|\$[\d\.]+|\S+')
Parameters:
  • pattern (str) – The pattern used to build this tokenizer. (This pattern may safely contain grouping parentheses.)
  • gaps (bool) – True if this tokenizer’s pattern should be used to find separators between tokens; False if this tokenizer’s pattern should be used to find the tokens themselves.
  • discard_empty (bool) – True if any empty tokens ‘’ generated by the tokenizer should be discarded. Empty tokens can only be generated if _gaps == True.
  • flags (int) – The regexp flags used to compile this tokenizer’s pattern. By default, the following flags are used: re.UNICODE | re.MULTILINE | re.DOTALL.
span_tokenize(text)[source]
tokenize(text)[source]
unicode_repr()
class nltk.tokenize.regexp.WhitespaceTokenizer[source]

Bases: nltk.tokenize.regexp.RegexpTokenizer

Tokenize a string on whitespace (space, tab, newline). In general, users should use the string split() method instead.

>>> from nltk.tokenize import WhitespaceTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> WhitespaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
class nltk.tokenize.regexp.WordPunctTokenizer[source]

Bases: nltk.tokenize.regexp.RegexpTokenizer

Tokenize a text into a sequence of alphabetic and non-alphabetic characters, using the regexp \w+|[^\w\s]+.

>>> from nltk.tokenize import WordPunctTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> WordPunctTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
nltk.tokenize.regexp.regexp_tokenize(text, pattern, gaps=False, discard_empty=True, flags=56)[source]

Return a tokenized copy of text. See RegexpTokenizer for descriptions of the arguments.

nltk.tokenize.sexpr module

S-Expression Tokenizer

SExprTokenizer is used to find parenthesized expressions in a string. In particular, it divides a string into a sequence of substrings that are either parenthesized expressions (including any nested parenthesized expressions), or other whitespace-separated tokens.

>>> from nltk.tokenize import SExprTokenizer
>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

By default, SExprTokenizer will raise a ValueError exception if used to tokenize an expression with non-matching parentheses:

>>> SExprTokenizer().tokenize('c) d) e (f (g')
Traceback (most recent call last):
  ...
ValueError: Un-matched close paren at char 1

The strict argument can be set to False to allow for non-matching parentheses. Any unmatched close parentheses will be listed as their own s-expression; and the last partial sexpr with unmatched open parentheses will be listed as its own sexpr:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']

The characters used for open and close parentheses may be customized using the parens argument to the SExprTokenizer constructor:

>>> SExprTokenizer(parens='{}').tokenize('{a b {c d}} e f {g}')
['{a b {c d}}', 'e', 'f', '{g}']

The s-expression tokenizer is also available as a function:

>>> from nltk.tokenize import sexpr_tokenize
>>> sexpr_tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']
class nltk.tokenize.sexpr.SExprTokenizer(parens='()', strict=True)[source]

Bases: nltk.tokenize.api.TokenizerI

A tokenizer that divides strings into s-expressions. An s-expression can be either:

  • a parenthesized expression, including any nested parenthesized expressions, or
  • a sequence of non-whitespace non-parenthesis characters.

For example, the string (a (b c)) d e (f) consists of four s-expressions: (a (b c)), d, e, and (f).

By default, the characters ( and ) are treated as open and close parentheses, but alternative strings may be specified.

Parameters:
  • parens (str or list) – A two-element sequence specifying the open and close parentheses that should be used to find sexprs. This will typically be either a two-character string, or a list of two strings.
  • strict – If true, then raise an exception when tokenizing an ill-formed sexpr.
tokenize(text)[source]

Return a list of s-expressions extracted from text. For example:

>>> SExprTokenizer().tokenize('(a b (c d)) e f (g)')
['(a b (c d))', 'e', 'f', '(g)']

All parentheses are assumed to mark s-expressions. (No special processing is done to exclude parentheses that occur inside strings, or following backslash characters.)

If the given expression contains non-matching parentheses, then the behavior of the tokenizer depends on the strict parameter to the constructor. If strict is True, then raise a ValueError. If strict is False, then any unmatched close parentheses will be listed as their own s-expression; and the last partial s-expression with unmatched open parentheses will be listed as its own s-expression:

>>> SExprTokenizer(strict=False).tokenize('c) d) e (f (g')
['c', ')', 'd', ')', 'e', '(f (g']
Parameters: text (str or iter(str)) – the string to be tokenized
Return type: iter(str)

nltk.tokenize.simple module

Simple Tokenizers

These tokenizers divide strings into substrings using the string split() method. When tokenizing using a particular delimiter string, use the string split() method directly, as this is more efficient.

The simple tokenizers are not available as separate functions; instead, you should just use the string split() method directly:

>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> s.split()
['Good', 'muffins', 'cost', '$3.88', 'in', 'New', 'York.',
'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks.']
>>> s.split(' ')
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
>>> s.split('\n')
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']

The simple tokenizers are mainly useful because they follow the standard TokenizerI interface, and so can be used with any code that expects a tokenizer. For example, these tokenizers can be used to specify the tokenization conventions when building a CorpusReader.
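
For instance, a corpus reader could be configured to treat each line as a single token (a sketch; the corpus path and file pattern are hypothetical):

>>> from nltk.corpus.reader import PlaintextCorpusReader
>>> from nltk.tokenize import LineTokenizer
>>> reader = PlaintextCorpusReader('/path/to/corpus', r'.*\.txt',  # hypothetical location
...                                word_tokenizer=LineTokenizer())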

class nltk.tokenize.simple.CharTokenizer[source]

Bases: nltk.tokenize.api.StringTokenizer

Tokenize a string into individual characters. If this functionality is ever required directly, use for char in string.

span_tokenize(s)[source]
tokenize(s)[source]
class nltk.tokenize.simple.LineTokenizer(blanklines='discard')[source]

Bases: nltk.tokenize.api.TokenizerI

Tokenize a string into its lines, optionally discarding blank lines. This is similar to s.split('\n').

>>> from nltk.tokenize import LineTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> LineTokenizer(blanklines='keep').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', '', 'Thanks.']
>>> # same as [l for l in s.split('\n') if l.strip()]:
>>> LineTokenizer(blanklines='discard').tokenize(s)
['Good muffins cost $3.88', 'in New York.  Please buy me',
'two of them.', 'Thanks.']
Parameters: blanklines

Indicates how blank lines should be handled. Valid values are:

  • discard: strip blank lines out of the token list before returning it.
    A line is considered blank if it contains only whitespace characters.
  • keep: leave all blank lines in the token list.
  • discard-eof: if the string ends with a newline, then do not generate
    a corresponding token '' after that newline.
span_tokenize(s)[source]
tokenize(s)[source]
class nltk.tokenize.simple.SpaceTokenizer[source]

Bases: nltk.tokenize.api.StringTokenizer

Tokenize a string using the space character as a delimiter, which is the same as s.split(' ').

>>> from nltk.tokenize import SpaceTokenizer
>>> s = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
>>> SpaceTokenizer().tokenize(s)
['Good', 'muffins', 'cost', '$3.88\nin', 'New', 'York.', '',
'Please', 'buy', 'me\ntwo', 'of', 'them.\n\nThanks.']
class nltk.tokenize.simple.TabTokenizer[source]

Bases: nltk.tokenize.api.StringTokenizer

Tokenize a string using the tab character as a delimiter, the same as s.split('\t').

>>> from nltk.tokenize import TabTokenizer
>>> TabTokenizer().tokenize('a\tb c\n\t d')
['a', 'b c\n', ' d']
nltk.tokenize.simple.line_tokenize(text, blanklines='discard')[source]

nltk.tokenize.texttiling module

class nltk.tokenize.texttiling.TextTilingTokenizer(w=20, k=10, similarity_method=0, stopwords=None, smoothing_method=[0], smoothing_width=2, smoothing_rounds=1, cutoff_policy=1, demo_mode=False)[source]

Bases: nltk.tokenize.api.TokenizerI

Tokenize a document into topical sections using the TextTiling algorithm. This algorithm detects subtopic shifts based on the analysis of lexical co-occurrence patterns.

The process starts by tokenizing the text into pseudosentences of a fixed size w. Then, depending on the method used, similarity scores are assigned at sentence gaps. The algorithm proceeds by detecting the peak differences between these scores and marking them as boundaries. The boundaries are normalized to the closest paragraph break and the segmented text is returned.
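
A sketch of applying the tokenizer to a stretch of raw text containing paragraph breaks (here the Brown corpus from the NLTK data package, as in this module's demo; the number and size of the resulting segments depend on the input):

>>> from nltk.tokenize.texttiling import TextTilingTokenizer
>>> from nltk.corpus import brown
>>> tt = TextTilingTokenizer()
>>> segments = tt.tokenize(brown.raw()[:10000])  # list of topically coherent chunks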

Parameters:
  • w (int) – Pseudosentence size
  • k (int) – Size (in sentences) of the block used in the block comparison method
  • similarity_method (constant) – The method used for determining similarity scores: BLOCK_COMPARISON (default) or VOCABULARY_INTRODUCTION.
  • stopwords (list(str)) – A list of stopwords that are filtered out (defaults to NLTK’s stopwords corpus)
  • smoothing_method (constant) – The method used for smoothing the score plot: DEFAULT_SMOOTHING (default)
  • smoothing_width (int) – The width of the window used by the smoothing method
  • smoothing_rounds (int) – The number of smoothing passes
  • cutoff_policy (constant) – The policy used to determine the number of boundaries: HC (default) or LC
tokenize(text)[source]

Return a tokenized copy of text, where each “token” represents a separate topic.

class nltk.tokenize.texttiling.TokenSequence(index, wrdindex_list, original_length=None)[source]

Bases: builtins.object

A token list with its original length and its index

class nltk.tokenize.texttiling.TokenTableField(first_pos, ts_occurences, total_count=1, par_count=1, last_par=0, last_tok_seq=None)[source]

Bases: builtins.object

A field in the token table holding parameters for each token, used later in the process

nltk.tokenize.texttiling.demo(text=None)[source]
nltk.tokenize.texttiling.smooth(x, window_len=11, window='flat')[source]

Smooth the data using a window of the requested size.

This method is based on the convolution of a scaled window with the signal. The signal is prepared by introducing reflected copies of the signal (with the window size) in both ends so that transient parts are minimized in the beginning and end part of the output signal.

Parameters:
  • x – the input signal
  • window_len – the dimension of the smoothing window; should be an odd integer
  • window – the type of window: ‘flat’, ‘hanning’, ‘hamming’, ‘bartlett’, or ‘blackman’; a flat window will produce a moving average smoothing.
Returns:

the smoothed signal

example:

from numpy import linspace, sin
from numpy.random import randn
from nltk.tokenize.texttiling import smooth

t = linspace(-2, 2, 50)           # 50 sample points between -2 and 2
x = sin(t) + randn(len(t)) * 0.1  # a noisy sine wave
y = smooth(x)
See also: numpy.hanning, numpy.hamming, numpy.bartlett, numpy.blackman, numpy.convolve, scipy.signal.lfilter

TODO: the window parameter could be the window itself, given as an array, instead of a string

nltk.tokenize.treebank module

Penn Treebank Tokenizer

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This implementation is a port of the tokenizer sed script written by Robert McIntyre and available at http://www.cis.upenn.edu/~treebank/tokenizer.sed.

class nltk.tokenize.treebank.TreebankWordTokenizer[source]

Bases: nltk.tokenize.api.TokenizerI

The Treebank tokenizer uses regular expressions to tokenize text as in Penn Treebank. This is the method that is invoked by word_tokenize(). It assumes that the text has already been segmented into sentences, e.g. using sent_tokenize().

This tokenizer performs the following steps:

  • split standard contractions, e.g. don't -> do n't and they'll -> they 'll

  • treat most punctuation characters as separate tokens

  • split off commas and single quotes, when followed by whitespace

  • separate periods that appear at the end of a line

    >>> from nltk.tokenize import TreebankWordTokenizer
    >>> s = '''Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\nThanks.'''
    >>> TreebankWordTokenizer().tokenize(s)
    ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Please', 'buy', 'me', 'two', 'of', 'them.', 'Thanks', '.']
    >>> s = "They'll save and invest more."
    >>> TreebankWordTokenizer().tokenize(s)
    ['They', "'ll", 'save', 'and', 'invest', 'more', '.']
    
CONTRACTIONS2 = [<compiled contraction regexps>]
CONTRACTIONS3 = [<compiled contraction regexps>]
CONTRACTIONS4 = [<compiled contraction regexps>]
tokenize(text)[source]

nltk.tokenize.util module

nltk.tokenize.util.regexp_span_tokenize(s, regexp)[source]

Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each successive match of regexp.

>>> from nltk.tokenize import WhitespaceTokenizer
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36),
(38, 44), (45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]
Parameters:
  • s (str) – the string to be tokenized
  • regexp (str) – regular expression that matches token separators
Return type:

iter(tuple(int, int))

nltk.tokenize.util.spans_to_relative(spans)[source]

Return a sequence of relative spans, given a sequence of spans.

>>> from nltk.tokenize import WhitespaceTokenizer
>>> from nltk.tokenize.util import spans_to_relative
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> list(spans_to_relative(WhitespaceTokenizer().span_tokenize(s)))
[(0, 4), (1, 7), (1, 4), (1, 5), (1, 2), (1, 3), (1, 5), (2, 6),
(1, 3), (1, 2), (1, 3), (1, 2), (1, 5), (2, 7)]
Parameters: spans (iter(tuple(int, int))) – a sequence of (start, end) offsets of the tokens
Return type: iter(tuple(int, int))
nltk.tokenize.util.string_span_tokenize(s, sep)[source]

Return the offsets of the tokens in s, as a sequence of (start, end) tuples, by splitting the string at each occurrence of sep.

>>> from nltk.tokenize.util import string_span_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> list(string_span_tokenize(s, " "))
[(0, 4), (5, 12), (13, 17), (18, 26), (27, 30), (31, 36), (37, 37),
(38, 44), (45, 48), (49, 55), (56, 58), (59, 73)]
Parameters:
  • s (str) – the string to be tokenized
  • sep (str) – the token separator
Return type:

iter(tuple(int, int))

Module contents

NLTK Tokenizer Package

Tokenizers divide strings into lists of substrings. For example, tokenizers can be used to find the list of sentences or words in a string.

>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
>>> s = '''Good muffins cost $3.88\nin New York.  Please buy me
... two of them.\n\nThanks.'''
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> sent_tokenize(s)
['Good muffins cost $3.88\nin New York.', 'Please buy me\ntwo of them.', 'Thanks.']
>>> [word_tokenize(t) for t in sent_tokenize(s)]
[['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York', '.'],
['Please', 'buy', 'me', 'two', 'of', 'them', '.'], ['Thanks', '.']]

Caution: only use word_tokenize() on individual sentences.

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).

NLTK tokenizers can produce token-spans, represented as tuples of integers having the same semantics as string slices, to support efficient comparison of tokenizers. (These methods are implemented as generators.)

>>> from nltk.tokenize import WhitespaceTokenizer
>>> list(WhitespaceTokenizer().span_tokenize(s))
[(0, 4), (5, 12), (13, 17), (18, 23), (24, 26), (27, 30), (31, 36), (38, 44),
(45, 48), (49, 51), (52, 55), (56, 58), (59, 64), (66, 73)]

There are numerous ways to tokenize text. If you need more control over tokenization, see the other methods provided in this package.

For further information, please see Chapter 3 of the NLTK book.

nltk.tokenize.sent_tokenize(text)[source]

Return a sentence-tokenized copy of text, using NLTK’s recommended sentence tokenizer (currently PunktSentenceTokenizer).

nltk.tokenize.word_tokenize(text)[source]

Return a tokenized copy of text, using NLTK’s recommended word tokenizer (currently TreebankWordTokenizer). This tokenizer is designed to work on a sentence at a time.