Compat aliases for migration

tf.compat.v1.keras.preprocessing.text.Tokenizer, tf.compat.v2.keras.preprocessing.text.Tokenizer

tf.keras.preprocessing.text.Tokenizer(
    num_words=None, filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n', lower=True,
    split=' ', char_level=False, oov_token=None, document_count=0, **kwargs
)

This class vectorizes a text corpus by turning each text into either a sequence of integers (each integer being the index of a token in a dictionary) or a vector in which the coefficient for each token can be binary, based on word count, or based on tf-idf.

Arguments

num_words: the maximum number of words to keep, based
    on word frequency. Only the most common `num_words-1` words will
    be kept.
filters: a string where each element is a character that will be
    filtered from the texts. The default is all punctuation, plus
    tabs and line breaks, minus the `'` character.
lower: boolean. Whether to convert the texts to lowercase.
split: str. Separator for word splitting.
char_level: if True, every character will be treated as a token.
oov_token: if given, it will be added to word_index and used to
    replace out-of-vocabulary words during texts_to_sequences calls.

By default, all punctuation is removed, turning the texts into space-separated sequences of words (words may include the `'` character). These sequences are then split into lists of tokens, which are then indexed or vectorized.
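The default preprocessing described above can be approximated in plain Python. This is a minimal sketch, not the library's implementation: `simple_word_sequence` is a hypothetical helper, and it assumes filtered characters are replaced by the separator before splitting.

```python
# Illustrative sketch only; `simple_word_sequence` is a hypothetical helper.
DEFAULT_FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'


def simple_word_sequence(text, filters=DEFAULT_FILTERS, lower=True, split=" "):
    """Lowercase, filter, and split one text into a list of tokens."""
    if lower:
        text = text.lower()
    # Map every filtered character to the separator so word boundaries survive.
    table = str.maketrans({ch: split for ch in filters})
    text = text.translate(table)
    # Drop the empty strings produced by consecutive separators.
    return [token for token in text.split(split) if token]


tokens = simple_word_sequence("Don't panic, it's FINE!\n")
print(tokens)
```

Because `'` is not in the default filter string, contractions such as "don't" stay intact as single tokens.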

Methods

fit_on_sequences

fit_on_sequences(
    sequences
)

Updates internal vocabulary based on a list of sequences. Required before using sequences_to_matrix (if fit_on_texts was never called).

Arguments

sequences: A list of sequences.
    A "sequence" is a list of integer word indices.

fit_on_texts

fit_on_texts(
    texts
)

Updates internal vocabulary based on a list of texts. If texts contains lists, each entry of those lists is assumed to be a token.

Arguments

texts: can be a list of strings,
    a generator of strings (for memory-efficiency),
    or a list of lists of strings.
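To illustrate what fitting on texts produces, here is a rough sketch of how a frequency-ranked word index might be built from already-tokenized texts. `build_word_index` is a hypothetical helper, not part of the API. The sketch assumes indices start at 1 (leaving 0 unassigned) and that the `oov_token`, when given, takes the first index.

```python
from collections import Counter


def build_word_index(token_lists, oov_token=None):
    """Hypothetical helper: rank tokens by frequency and assign 1-based indices."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    # most_common() orders by descending count; ties keep first-seen order.
    vocab = [word for word, _ in counts.most_common()]
    if oov_token is not None:
        # Assumption: the out-of-vocabulary token is placed ahead of all words.
        vocab = [oov_token] + vocab
    return {word: index for index, word in enumerate(vocab, start=1)}


texts = [["the", "cat", "sat"], ["the", "dog"]]
word_index = build_word_index(texts)
word_index_oov = build_word_index(texts, oov_token="<OOV>")
print(word_index)
print(word_index_oov)
```

The most frequent token receives the lowest index, which is what allows a `num_words` cutoff to be expressed as a simple comparison on the index.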

get_config

get_config()

Returns the tokenizer configuration as a Python dictionary. The word count dictionaries used by the tokenizer are serialized into plain JSON so that the configuration can be read by other projects.

Returns

A Python dictionary with the tokenizer configuration.

sequences_to_matrix

sequences_to_matrix(
    sequences, mode='binary'
)

Converts a list of sequences into a Numpy matrix.

Arguments

sequences: list of sequences
    (a sequence is a list of integer word indices).
mode: one of "binary", "count", "tfidf", "freq"

Returns

A Numpy matrix.

Raises

ValueError: In case of invalid `mode` argument,
    or if the Tokenizer has not yet been fit on sample data.
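The following sketch shows how three of the modes could fill one row per sequence, using plain lists instead of a Numpy matrix. `sequences_to_rows` is a hypothetical helper; "tfidf" is left out because it depends on document statistics gathered during fitting. The sketch assumes indices at or above `num_words` are ignored and that "freq" divides by the full sequence length.

```python
from collections import Counter


def sequences_to_rows(sequences, num_words, mode="binary"):
    """Hypothetical helper: one row of length num_words per input sequence."""
    if mode not in ("binary", "count", "freq"):
        raise ValueError(f"Unsupported mode in this sketch: {mode!r}")
    rows = []
    for seq in sequences:
        row = [0.0] * num_words
        # Count only indices that fit inside the row.
        counts = Counter(i for i in seq if 0 <= i < num_words)
        for index, count in counts.items():
            if mode == "binary":
                row[index] = 1.0
            elif mode == "count":
                row[index] = float(count)
            else:  # "freq": share of the sequence taken up by this index
                row[index] = count / len(seq)
        rows.append(row)
    return rows


sequences = [[1, 2, 2], [3]]
binary_rows = sequences_to_rows(sequences, num_words=4, mode="binary")
count_rows = sequences_to_rows(sequences, num_words=4, mode="count")
freq_rows = sequences_to_rows(sequences, num_words=4, mode="freq")
```

Column 0 stays empty in every row here because no sequence contains index 0.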

sequences_to_texts

sequences_to_texts(
    sequences
)

Transforms each sequence into a text string. Only the top `num_words-1` most frequent words are taken into account, and only words known by the tokenizer are included.

Arguments

sequences: A list of sequences (list of integers).

Returns

A list of texts (strings).
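As a rough illustration of the reverse mapping, the sketch below turns index sequences back into space-joined strings. `sequences_to_texts_sketch` and the `index_word` dictionary are hypothetical stand-ins; out-of-vocabulary handling is omitted, and unknown or cut-off indices are simply dropped.

```python
def sequences_to_texts_sketch(sequences, index_word, num_words=None):
    """Hypothetical helper: map index sequences back to space-joined strings."""
    texts = []
    for seq in sequences:
        words = [
            index_word[i]
            for i in seq
            # Skip unknown indices and those beyond the num_words cutoff.
            if i in index_word and (num_words is None or i < num_words)
        ]
        texts.append(" ".join(words))
    return texts


index_word = {1: "the", 2: "cat", 3: "sat"}
all_words = sequences_to_texts_sketch([[1, 2, 3], [2, 99]], index_word)
top_words = sequences_to_texts_sketch([[1, 2, 3], [2, 99]], index_word, num_words=3)
```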

sequences_to_texts_generator

sequences_to_texts_generator(
    sequences
)

Transforms each sequence into a text string, yielding one text at a time. Each sequence has to be a list of integers; in other words, sequences should be a list of sequences.

Only the top `num_words-1` most frequent words are taken into account, and only words known by the tokenizer are included.

Arguments

sequences: A list of sequences.

Yields

Yields individual texts.

texts_to_matrix

texts_to_matrix(
    texts, mode='binary'
)

Converts a list of texts to a Numpy matrix.

Arguments

texts: list of strings.
mode: one of "binary", "count", "tfidf", "freq".

Returns

A Numpy matrix.

texts_to_sequences

texts_to_sequences(
    texts
)

Transforms each text in texts to a sequence of integers. Only the top `num_words-1` most frequent words are taken into account, and only words known by the tokenizer are included.

Arguments

texts: A list of texts (strings).

Returns

A list of sequences.
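The interaction between `num_words` and `oov_token` can be sketched as follows. `tokens_to_sequence` is a hypothetical helper working on one already-tokenized text. It assumes a token is kept only if its index is below `num_words`, and that dropped or unknown tokens are replaced by the out-of-vocabulary index when one is configured.

```python
def tokens_to_sequence(tokens, word_index, num_words=None, oov_token=None):
    """Hypothetical helper: look up indices, applying the num_words cutoff."""
    oov_index = word_index.get(oov_token) if oov_token is not None else None
    sequence = []
    for token in tokens:
        index = word_index.get(token)
        in_vocab = index is not None and (num_words is None or index < num_words)
        if in_vocab:
            sequence.append(index)
        elif oov_index is not None:
            # Unknown or cut-off tokens fall back to the OOV index.
            sequence.append(oov_index)
        # Without an OOV token, such tokens are silently skipped.
    return sequence


tokens = ["the", "dog", "sat", "bird"]
plain_index = {"the": 1, "cat": 2, "sat": 3, "dog": 4}
oov_index_map = {"<OOV>": 1, "the": 2, "cat": 3, "sat": 4, "dog": 5}

without_oov = tokens_to_sequence(tokens, plain_index, num_words=3)
with_oov = tokens_to_sequence(tokens, oov_index_map, num_words=3, oov_token="<OOV>")
```

With `num_words=3`, only indices 1 and 2 survive, which is why the description speaks of the top `num_words-1` words.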

texts_to_sequences_generator

texts_to_sequences_generator(
    texts
)

Transforms each text in texts to a sequence of integers, yielding one sequence at a time. Each item in texts can also be a list, in which case each item of that list is assumed to be a token.

Only the top `num_words-1` most frequent words are taken into account, and only words known by the tokenizer are included.

Arguments

texts: A list of texts (strings).

Yields

Yields individual sequences.
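A generator version is useful when the corpus is too large to hold every sequence in memory. The sketch below is a simplified stand-in: `sequences_generator` is hypothetical, it skips character filtering for brevity, and it treats list items as pre-tokenized input.

```python
def sequences_generator(texts, word_index, split=" "):
    """Hypothetical helper: lazily yield one index sequence per input item."""
    for text in texts:
        if isinstance(text, list):
            # A list is assumed to be tokenized already.
            tokens = text
        else:
            tokens = text.lower().split(split)
        yield [word_index[t] for t in tokens if t in word_index]


word_index = {"the": 1, "cat": 2, "sat": 3}
gen = sequences_generator(["The cat", ["sat", "the"]], word_index)
first = next(gen)   # Only the first item has been processed at this point.
rest = list(gen)    # Consumes the remaining items.
```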

to_json

to_json(
    **kwargs
)

Arguments

**kwargs: Additional keyword arguments
    to be passed to `json.dumps()`.

Returns

A JSON string containing the tokenizer configuration.
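To show how a configuration dictionary with JSON-encoded word counts might round-trip through `json.dumps()`, here is a small sketch. The wrapper keys `"class_name"` and `"config"`, the helper `config_to_json`, and the sample configuration values are assumptions made for illustration, not a statement of the library's exact output format.

```python
import json


def config_to_json(config, **kwargs):
    """Hypothetical helper: wrap a config dict and serialize it to JSON."""
    # Assumed layout: a class name alongside the configuration dictionary.
    payload = {"class_name": "Tokenizer", "config": config}
    # Extra keyword arguments are forwarded to json.dumps().
    return json.dumps(payload, **kwargs)


config = {
    "num_words": 100,
    "oov_token": None,
    # Nested dictionaries are stored as JSON strings inside the config.
    "word_counts": json.dumps({"the": 2, "cat": 1}),
}
json_string = config_to_json(config, sort_keys=True)

restored = json.loads(json_string)
restored_counts = json.loads(restored["config"]["word_counts"])
```

Storing the nested dictionaries as JSON strings means a reader needs a second `json.loads()` call to recover them, as the last line shows.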
